International Journal of Software Engineering and Knowledge Engineering
Vol. 21, No. 4 (2011) 523–542
© World Scientific Publishing Company
DOI: 10.1142/S0218194011005396
RETRIEVING DEEP WEB DATA THROUGH MULTI-ATTRIBUTES INTERFACES WITH STRUCTURED QUERIES

JIAN-WEI TIAN*, WEN-HUI QI† and XIAO-XIAO LIU‡

Hunan Electric Co. Corporation Research Unit
State Grid Corporation of China
*tianjw@hn.sgcc.com.cn
†qiwh@hn.sgcc.com.cn
‡liuxx@hn.sgcc.com.cn
Received 21 October 2009
Revised 1 April 2010
Accepted 9 April 2010
A great deal of data on the Web lies in hidden databases, the deep Web. Most deep Web data is not directly available and can only be accessed through query interfaces. Current research on deep Web search has focused on crawling deep Web data via Web interfaces with keywords-based queries. However, these keywords-based methods have inherent limitations because of the multi-attributes and top-k features of the deep Web. In this paper we propose a novel approach for siphoning structured data with structured queries. First, in order to retrieve all the data in hidden databases without repetition, we model the hidden database as a hierarchy tree. Under this theoretical framework, data retrieving is transformed into the problem of traversing a tree. We also propose techniques to narrow the query space by using a heuristic rule, based on mutual information, to guide the traversal process. We conduct extensive experiments over real deep Web sites and controlled databases to illustrate the coverage and efficiency of our techniques.
Keywords: Hidden databases; data retrieval; multi-attribute interfaces; top-k tuples.
1. Introduction
Background. According to the depth of the information, the Web can be divided into the surface Web and the deep Web [1, 2]. The surface Web consists of static pages that can be indexed by common search engines (e.g., Google). The deep Web (also called the invisible Web or hidden Web) refers to on-line databases that can be accessed through query forms. These on-line databases are called Web databases, or WDB for short [3]. The content of the deep Web is dynamically generated by the Web server and returned to visitors in the form of Web pages, which is the fundamental difference from the surface Web. It has been estimated that the scale of these hidden resources is about 500 times that of the surface Web [4], of which about 75% is high-quality structured data.
Due to the great quantity and quality of the deep Web, researchers have carried out a great deal of work in this field, including locating Web database resources [16], integrating query interfaces [17], and extracting data [18]. However, our paper is concerned with a new issue, namely, completely and non-repeatedly retrieving the structured data from Web databases.
Motivation. We believe that completely and non-repeatedly retrieving deep Web data can serve as a primitive in many applications. We mention some possible directions below.
Deep Web data integration. One main approach to managing hidden databases is the data warehouse [13, 14], which integrates data from multiple hidden databases on the same topic. In this approach, one significant metric is completeness, which refers to the ratio of the amount of data in the warehouse to the sum of the data in all the source Web databases. To improve completeness, one must try to siphon the data in the source hidden databases as completely as possible.
Data quality analysis. There are a large number of deep Web sites on the same topic, but not all of them are of high quality; some sites may be small and contain incorrect data. It is therefore necessary to recommend high-quality Web databases to visitors. Moreover, only with distinct, duplicate-free samples can we precisely analyze the data quality of Web databases. Therefore, our approach can play a significant role in data quality analysis applications.
Complementing deep Web surfacing. Google's Deep Web Crawler [5, 15] is a deep Web crawling approach used to surface the deep Web at scale. Its objective is to trade off coverage of individual sites in the interest of coverage of the entire deep Web. As a result, this crawler treats all deep Web sites equally and does not consider their quality. However, it is more reasonable to acquire more data from high-quality Web sites than from low-quality ones, so our approach can be used to siphon high-quality data during deep Web surfacing.
Problem statement. Some previous approaches [5–7] acquire these data by using keywords-based queries, i.e., queries that involve only one of the attributes on the query interface. For example, such a query specifies only the value of the attribute "Price" on a house-sale Web site, even though the interface contains other attributes such as "Room", "Location", and "Style".
Obviously, many data records in the database may satisfy a keywords query, so the retrieving efficiency is usually high in keywords-based approaches. Unfortunately, even though these keywords-based approaches seem to retrieve data efficiently, none of them is very satisfying because of the features of the deep Web.
Keywords-based methods will repeatedly retrieve the same data record. Most hidden databases contain multiple attributes, so each tuple in the database contains more than one attribute value, which means that each tuple may be retrieved many times by a keywords-based approach. For example, in Table 1, the database has three attributes A1, A2 and A3. If one query's condition is "where A1 = '0'", tuples t1 and t2 are hit. A second query with condition "where A2 = '1'" may extract t2 and t3, so t2 is repeatedly extracted. Considering a hidden database with T tuples and N attributes, in the worst case one tuple may be extracted N times and the result set may contain (N − 1) * T duplicated tuples. Besides polluting the result set with duplicates, repeated retrieving also reduces extraction efficiency and increases the server overhead.

Table 1. Example of tuples being repeatedly extracted.

        A1    A2    A3
  t1     0     0     0
  t2     0     1     1
  t3     1     1     1
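To make the duplication concrete, the following minimal Python sketch (our illustration, not code from the paper; the tuple ids and helper names are hypothetical) replays two single-attribute queries over the Table 1 data and shows t2 coming back twice:

```python
# Toy replay of keywords-based (single-attribute) retrieval over Table 1.
# Illustrative sketch only; the names here are our own, not the paper's.
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def keyword_query(attr, value):
    """Return the ids of all tuples whose single attribute `attr` equals `value`."""
    return [tid for tid, t in TUPLES.items() if t[attr] == value]

results = keyword_query("A1", "0") + keyword_query("A2", "1")
print(results)  # ['t1', 't2', 't2', 't3'] -- t2 is extracted twice
print(len(results) - len(set(results)), "duplicated tuple(s)")
```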
Keywords-based methods cannot completely retrieve the data records. Many deep Web sites return only the top-k tuples of the results according to a ranking function. Let R(Q) be the set of tuples that satisfy Q. In many deep Web sites, only the top-k tuples are returned when R(Q) is very large, where k is a fixed constant such as 100 or 200. In most deep Web sites, the value of k can be found at the top of the result pages. Examples of such deep Web sites include MSN CareerBuilder (http://msn.careerbuilder.com), whose k = 4000. In keywords-based approaches, one query may easily hit many tuples, which means that the size of R(Q) exceeds the fixed threshold k. As a result, only the top-k tuples can be extracted and the others are omitted. The direct consequence is that the coverage of the retrieved result set cannot be guaranteed.
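The top-k behaviour can be sketched in the same toy style; the constant k = 2 and the ranking function below are placeholders chosen for illustration, since each site fixes its own:

```python
# Toy model of a top-k interface: the server ranks R(Q) and returns at most k tuples.
K = 2  # assumed for illustration; real sites use values such as 100, 200, or 4000

def topk_query(matching_ids, rank=lambda tid: tid):
    """Of all tuples satisfying Q, only the k best-ranked ones come back."""
    return sorted(matching_ids, key=rank)[:K]

r_of_q = ["t1", "t2", "t3"]      # |R(Q)| = 3 exceeds k = 2
print(topk_query(r_of_q))        # ['t1', 't2'] -- t3 is silently omitted
```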
Due to the limitations of keywords-based approaches, distinctively and completely retrieving deep Web data is a significant problem that we must address. One appropriate approach is to divide the hidden database into a set of disjoint blocks whose sizes are no larger than the fixed constant k, and to generate a set of queries that locate all of these disjoint blocks; then we can distinctively and completely retrieve all of the data in the hidden database. In this paper we propose a method of constructing structured queries to retrieve deep Web records. A structured query can be regarded as a path that leads to a block of top-k tuples in the database. Structured queries are much stricter than keywords-based queries, for they involve multiple attribute values. For example, consider a hidden database D with a set of attributes {a1, ..., an}. A keywords-based query is generated on a single attribute, such as "SELECT * FROM D WHERE ai = vi", where vi is the corresponding attribute value filled into the query interface, whereas a structured query is a conjunctive query over the attributes, of the form "SELECT * FROM D WHERE a1 = v1 AND ... AND am = vm". A structured query can only hit a set of tuples whose number is less than k. By using multiple attribute values to construct the query, each block of top-k tuples is labelled by a unique path. In this way, we can guarantee retrieving all the tuples in the hidden databases without repetition.
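As a hedged illustration of this idea (again with our own toy data, not the paper's implementation), the sketch below emulates conjunctive structured queries and shows that fixing values for A1 and A2 partitions the Table 1 database into disjoint blocks, each reachable by exactly one query:

```python
from itertools import product

# Toy partition of Table 1 into disjoint blocks via conjunctive structured queries.
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def structured_query(assignment):
    """Emulate SELECT * FROM D WHERE a1 = v1 AND ... AND am = vm."""
    return [tid for tid, t in TUPLES.items()
            if all(t[a] == v for a, v in assignment.items())]

# Every tuple falls into exactly one block, so nothing is retrieved twice.
for v1, v2 in product("01", repeat=2):
    block = structured_query({"A1": v1, "A2": v2})
    if block:
        print(f"A1={v1} AND A2={v2} -> {block}")
# A1=0 AND A2=0 -> ['t1']
# A1=0 AND A2=1 -> ['t2']
# A1=1 AND A2=1 -> ['t3']
```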
However, retrieving hidden databases with structured queries faces two challenges. First, how should we model the database? A hidden database is composed of a set of tuples, just like a traditional database. How can we divide all of the tuples into disjoint top-k blocks, each of which can be labeled by a unique query path? The performance measures for retrieving data from a hidden database are efficiency, coverage and repetition. So the second challenge is how to construct the structured queries so that we can retrieve the blocks with good efficiency and high coverage.
Our contributions. In order to solve these problems, we first model the hidden database as a hierarchy tree. Each block of top-k tuples is represented as a leaf node in the hierarchy tree, and the edges from the root to a leaf node constitute the query path. Then, we discuss how to narrow the query space. Finally, we propose our retrieving algorithm, called WDB-RETRIEVER.
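As a rough sketch of this tree view (our illustration of the model, not the paper's WDB-RETRIEVER pseudocode), a depth-first walk can extend the query path one attribute at a time and stop refining as soon as a node's result fits within k:

```python
# Depth-first traversal of the hierarchy tree over the Table 1 toy data.
# Illustrative sketch of the model only, not the WDB-RETRIEVER algorithm itself.
K = 1
ATTRS = ["A1", "A2", "A3"]
DOMAIN = {"A1": "01", "A2": "01", "A3": "01"}
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def query(path):
    """Conjunctive query: all tuples matching every attribute value on the path."""
    return [tid for tid, t in TUPLES.items()
            if all(t[a] == v for a, v in path.items())]

def retrieve(path=None, depth=0):
    """Refine the query path until each block holds at most k tuples."""
    path = path or {}
    block = query(path)
    if len(block) <= K or depth == len(ATTRS):
        return block                           # leaf node: one top-k block, emitted once
    out = []
    for v in DOMAIN[ATTRS[depth]]:             # branch on the next attribute's values
        out += retrieve({**path, ATTRS[depth]: v}, depth + 1)
    return out

print(retrieve())                              # ['t1', 't2', 't3'] -- complete, no repeats
```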
In summary, our contributions are as follows:
First, we identify the novel problem of retrieving the data in a hidden database with structured queries. In contrast to traditional keywords-based retrieving methods, our structured-queries-based approach introduces no extra duplicated tuples into the result set, as demonstrated in later sections.
Second, we provide a theoretical framework that models the hidden database as a hierarchy tree: the blocks of top-k tuples in the database are represented as the leaf nodes, and the edges from the root to the leaf nodes constitute the query paths.
Third, we propose a heuristic rule based on the mutual information between attribute values to guide the retrieving process (see the sketch after this list).
Finally, we present extensive experiments to evaluate the efficiency and coverage of our retrieving method. We also conduct experiments to demonstrate that employing the mutual-information-based heuristic rule can significantly narrow the query space and increase the retrieving efficiency.
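As a hedged sketch of what such a rule can be built on (the estimator below is a standard plug-in estimate of mutual information from a sample; the paper's exact formulation appears in Sec. 4, and all names here are our own):

```python
import math
from collections import Counter

# Plug-in estimate of the mutual information between two attributes from a sample.
# A standard estimator used for illustration; not necessarily the paper's formula.
def mutual_information(sample, a, b):
    n = len(sample)
    pa = Counter(t[a] for t in sample)
    pb = Counter(t[b] for t in sample)
    pab = Counter((t[a], t[b]) for t in sample)
    mi = 0.0
    for (va, vb), c in pab.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((pa[va] / n) * (pb[vb] / n)))
    return mi

# Table 1 again as the sample; the resulting scores can be used to rank
# which attribute to branch on next during the tree traversal.
sample = [
    {"A1": "0", "A2": "0", "A3": "0"},
    {"A1": "0", "A2": "1", "A3": "1"},
    {"A1": "1", "A2": "1", "A3": "1"},
]
print(round(mutual_information(sample, "A1", "A2"), 3))  # ~0.252 bits
```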
The rest of this paper is organized as follows. In Sec. 2, we review related work. In Sec. 3, we formally specify the problem of retrieving data from hidden databases and describe how to model hidden databases as hierarchy trees. In Sec. 4, we discuss methods for narrowing the query space and propose the mutual-information-based heuristic rule. The experimental setting and evaluation are given in Sec. 6. We draw conclusions in Sec. 7.
2. Related Works
Current work on deep Web data extraction mainly involves keywords-based crawlers [5–8]. In [5], a deep Web data crawling approach based on an attribute value graph is proposed. This method first constructs the attribute value graph, then