Faculty Data Query | Category: Journal Article | Faculty: 周清江 Chichang Jou

Title: Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules
Academic year: 107
Semester: 2
Publication date: 2019/02/01
Work title: Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules
Work title (other language):
Authors: Chichang Jou
Unit:
Publisher:
Journal name, volume (issue), pages: Information Systems Frontiers 21(1), p.163–174
Abstract: With the growing popularity of the World Wide Web, data volumes inside web databases have increased tremendously. This deep web content, hidden behind query interfaces, is of much better quality than content on the surface web. Internet users need to fill in query conditions in the HTML query interface and click the submit button to obtain deep web data. Many applications that rely on deep web content, such as named entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schemas of these query interfaces. A schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. Additionally, to extract this hidden data, the schema needs to include form-submission-related information, such as cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, text surrounding elements is collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse candidate labels. In HSE, elements, candidate labels, and new lines in the query interface are streamlined to produce its Interface Expression (IEXP). By combining the user’s view and the designer’s view, with the aid of semantic information, we build heuristic rules to extract schemas from the IEXPs of query interfaces in the ICQ dataset. These rules are constructed by utilizing (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships of labels and elements. Supplemented with form-submission-related information, the extracted schemas are then stored in XML format so that they can be used in further applications, such as schema matching and merging for federated query interface integration. Experimental results on the TEL-8 dataset show that HSE performs effectively.
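Keywords: Deep web; Query interface; Schema extraction; XML; Heuristic rules; String similarity
Language: English (United States)
ISSN: 1387-3326
Journal type: Foreign
Indexed in: SCI; SSCI
Industry-academia collaboration:
Corresponding author: Chichang Jou
Review system:
Country: Republic of China
Open call for papers:
Publication format: Electronic, Print
Related links: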