International Journal of Software Engineering and Knowledge Engineering
Vol. 21, No. 4 (2011) 523–542
© World Scientific Publishing Company
DOI: 10.1142/S0218194011005396
RETRIEVING DEEP WEB DATA THROUGH MULTI-ATTRIBUTES INTERFACES WITH STRUCTURED QUERIES

JIAN-WEI TIAN*, WEN-HUI QI† and XIAO-XIAO LIU‡

Hunan Electric Co. Corporation Research Unit
State Grid Corporation of China
*tianjw@hn.sgcc.com.cn
†qiwh@hn.sgcc.com.cn
‡liuxx@hn.sgcc.com.cn
Received 21 October 2009
Revised 1 April 2010
Accepted 9 April 2010
A great deal of data on the Web lies in hidden databases, the deep Web. Most deep Web data is not directly available and can only be accessed through query interfaces. Current research on deep Web search has focused on crawling deep Web data via Web interfaces with keywords-based queries. However, these keywords-based methods have inherent limitations because of the multi-attributes and top-k features of the deep Web. In this paper we propose a novel approach for siphoning structured data with structured queries. First, in order to retrieve all the data in hidden databases without repetition, we model the hidden database as a hierarchy tree. Under this theoretical framework, data retrieving is transformed into the problem of traversing a tree. We also propose techniques to narrow the query space by using a heuristic rule, based on mutual information, to guide the traversal process. We conduct extensive experiments over real deep Web sites and controlled databases to illustrate the coverage and efficiency of our techniques.
Keywords: Hidden databases; data retrieval; multi-attribute interfaces; top-k tuples.
1. Introduction
Background. According to the depth of the information, the Web can be divided into the surface Web and the deep Web [1, 2]. The surface Web consists of static pages that can be indexed by common search engines (e.g., Google). The deep Web (also called the invisible Web or hidden Web) refers to on-line databases that can be accessed through query forms. These on-line databases are called Web databases, or WDB for short [3]. The content of the deep Web is dynamically generated by the Web server and returned to visitors in the form of Web pages, which is the fundamental difference from the surface Web. It has been estimated that the scale of these hidden resources is about 500 times that of the surface Web [4], of which about 75% is high-quality structured data.
Due to the great quantity and quality of the deep Web, researchers have carried out a great deal of work in this field, including locating Web database resources [16], integrating query interfaces [17], and extracting data [18]. However, our paper is concerned with a new issue, namely, completely and non-repeatedly retrieving the structured data from Web databases.
Motivation. We believe that completely and non-repeatedly retrieving deep Web data can serve as a primitive in many applications. We mention some possible directions below.
Deep Web data integration. One main approach to managing hidden databases is the data warehouse [13, 14], which integrates data from multiple hidden databases on the same topic. In this approach, one significant metric is completeness, which refers to the ratio of the amount of data in the warehouse to the sum of the data in all the source Web databases. To improve completeness, one must try to siphon the data in the source hidden databases as completely as possible.
Data quality analysis. There are a large number of deep Web sites on the same topic, but not all of them are of high quality; some sites may be small and contain incorrect data. It is therefore necessary to recommend high-quality Web databases to visitors. Moreover, only with distinct, duplicate-free samples can we precisely analyze the data quality of Web databases. Therefore, our approach can play a significant role in data quality analysis applications.
Complementing deep Web surfacing. Google's Deep Web Crawler [5, 15] is a deep Web crawling approach used to surface the deep Web at scale. Its objective is to trade off coverage of individual sites in the interest of coverage of the entire deep Web. As a result, this crawler treats all deep Web sites equally and does not consider their quality. However, it is more reasonable to acquire more data from high-quality Web sites than from low-quality ones, so our approach can be used to siphon high-quality data during deep Web surfacing.
Problem statement. Some previous approaches [5–7] acquire these data by using keywords-based queries, i.e., queries that involve only one of the attributes on the query interface. For example, such a query specifies only the value of the attribute "Price" on a house-sale Web site, even though the interface contains other attributes such as "Room", "Location", and "Style".
Obviously, many data records in the database may satisfy a keywords query, so the retrieving efficiency is usually high in keywords-based approaches. Unfortunately, even though these keywords-based approaches seem to retrieve data efficiently, none of them is very satisfying because of the features of the deep Web.
Keywords-based methods will repeatedly retrieve the same data record. Most hidden databases contain multiple attributes, so each tuple in the database contains more than one attribute value, which means that each tuple may be retrieved many times by a keywords-based approach. For example, in Table 1, the database has three attributes A1, A2 and A3. If one query's condition is "where A1 = '0'", tuples t1 and t2 are hit. A second query with condition "where A2 = '1'" may extract t2 and t3, so t2 is repeatedly extracted. Considering a hidden database with T tuples and N attributes, in the worst case one tuple may be extracted N times and the result set may contain (N − 1) * T duplicated tuples. Besides polluting the result set with duplicates, repeated retrieving also reduces extraction efficiency and increases the server overhead.

Table 1. Example of tuples being repeatedly extracted.

        A1    A2    A3
  t1     0     0     0
  t2     0     1     1
  t3     1     1     1
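To make the duplication concrete, the following minimal Python sketch (our illustration, not code from the paper; the tuple ids and helper names are hypothetical) replays two single-attribute queries over the Table 1 data and shows t2 coming back twice:

```python
# Toy replay of keywords-based (single-attribute) retrieval over Table 1.
# Illustrative sketch only; the names here are our own, not the paper's.
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def keyword_query(attr, value):
    """Return the ids of all tuples whose single attribute `attr` equals `value`."""
    return [tid for tid, t in TUPLES.items() if t[attr] == value]

results = keyword_query("A1", "0") + keyword_query("A2", "1")
print(results)  # ['t1', 't2', 't2', 't3'] -- t2 is extracted twice
print(len(results) - len(set(results)), "duplicated tuple(s)")
```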
Keywords-based methods cannot completely retrieve the data records. Many deep Web sites return only the top-k tuples of the results according to a ranking function. Let R(Q) be the set of tuples that satisfy Q. In many deep Web sites, only the top-k tuples are returned when R(Q) is very large, where k is a fixed constant such as 100 or 200. In most deep Web sites, the value of k can be found at the top of the result pages. Examples of such deep Web sites include MSN CareerBuilder (http://msn.careerbuilder.com), whose k = 4000. In keywords-based approaches, one query may easily hit many tuples, which means that the size of R(Q) exceeds the fixed threshold k. As a result, only the top-k tuples can be extracted and the others are omitted. The direct consequence is that the coverage of the retrieved result set cannot be guaranteed.
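The top-k behaviour can be sketched in the same toy style; the constant k = 2 and the ranking function below are placeholders chosen for illustration, since each site fixes its own:

```python
# Toy model of a top-k interface: the server ranks R(Q) and returns at most k tuples.
K = 2  # assumed for illustration; real sites use values such as 100, 200, or 4000

def topk_query(matching_ids, rank=lambda tid: tid):
    """Of all tuples satisfying Q, only the k best-ranked ones come back."""
    return sorted(matching_ids, key=rank)[:K]

r_of_q = ["t1", "t2", "t3"]      # |R(Q)| = 3 exceeds k = 2
print(topk_query(r_of_q))        # ['t1', 't2'] -- t3 is silently omitted
```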
Due to the limitations of keywords-based approaches, distinctively and completely retrieving deep Web data is a significant problem that we must address. One appropriate approach is to divide the hidden database into a set of disjoint blocks whose sizes are no larger than the fixed constant k, and to generate a set of queries that locate all of these disjoint blocks; then we can distinctively and completely retrieve all of the data in the hidden database. In this paper we propose a method of constructing structured queries to retrieve deep Web records. A structured query can be regarded as a path that leads to a block of top-k tuples in the database. Structured queries are much stricter than keywords-based queries, for they involve multiple attribute values. For example, consider a hidden database D with a set of attributes {a1, ..., an}. A keywords-based query is generated on a single attribute, such as "SELECT * FROM D WHERE ai = vi", where vi is the corresponding attribute value filled into the query interface, whereas a structured query is a conjunctive query over the attributes, of the form "SELECT * FROM D WHERE a1 = v1 AND ... AND am = vm". A structured query can only hit a set of tuples whose number is less than k. By using multiple attribute values to construct the query, each block of top-k tuples is labelled by a unique path. In this way, we can guarantee retrieving all the tuples in the hidden databases without repetition.
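As a hedged illustration of this idea (again with our own toy data, not the paper's implementation), the sketch below emulates conjunctive structured queries and shows that fixing values for A1 and A2 partitions the Table 1 database into disjoint blocks, each reachable by exactly one query:

```python
from itertools import product

# Toy partition of Table 1 into disjoint blocks via conjunctive structured queries.
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def structured_query(assignment):
    """Emulate SELECT * FROM D WHERE a1 = v1 AND ... AND am = vm."""
    return [tid for tid, t in TUPLES.items()
            if all(t[a] == v for a, v in assignment.items())]

# Every tuple falls into exactly one block, so nothing is retrieved twice.
for v1, v2 in product("01", repeat=2):
    block = structured_query({"A1": v1, "A2": v2})
    if block:
        print(f"A1={v1} AND A2={v2} -> {block}")
# A1=0 AND A2=0 -> ['t1']
# A1=0 AND A2=1 -> ['t2']
# A1=1 AND A2=1 -> ['t3']
```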
However, retrieving hidden databases with structured queries faces two challenges. First, how should we model the database? A hidden database is composed of a set of tuples, just like a traditional database. How can we divide all of the tuples into disjoint top-k blocks, each of which can be labeled by a unique query path? The performance measures for retrieving data from a hidden database are efficiency, coverage and repetition. So the second challenge is how to construct the structured queries so that we can retrieve the blocks with good efficiency and high coverage.
Our contributions. In order to solve these problems, we first model the hidden database as a hierarchy tree. Each block of top-k tuples is represented as a leaf node in the hierarchy tree, and the edges from the root to a leaf node constitute the query path. Then, we discuss how to narrow the query space. Finally, we propose our retrieving algorithm, called WDB-RETRIEVER.
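As a rough sketch of this tree view (our illustration of the model, not the paper's WDB-RETRIEVER pseudocode), a depth-first walk can extend the query path one attribute at a time and stop refining as soon as a node's result fits within k:

```python
# Depth-first traversal of the hierarchy tree over the Table 1 toy data.
# Illustrative sketch of the model only, not the WDB-RETRIEVER algorithm itself.
K = 1
ATTRS = ["A1", "A2", "A3"]
DOMAIN = {"A1": "01", "A2": "01", "A3": "01"}
TUPLES = {
    "t1": {"A1": "0", "A2": "0", "A3": "0"},
    "t2": {"A1": "0", "A2": "1", "A3": "1"},
    "t3": {"A1": "1", "A2": "1", "A3": "1"},
}

def query(path):
    """Conjunctive query: all tuples matching every attribute value on the path."""
    return [tid for tid, t in TUPLES.items()
            if all(t[a] == v for a, v in path.items())]

def retrieve(path=None, depth=0):
    """Refine the query path until each block holds at most k tuples."""
    path = path or {}
    block = query(path)
    if len(block) <= K or depth == len(ATTRS):
        return block                           # leaf node: one top-k block, emitted once
    out = []
    for v in DOMAIN[ATTRS[depth]]:             # branch on the next attribute's values
        out += retrieve({**path, ATTRS[depth]: v}, depth + 1)
    return out

print(retrieve())                              # ['t1', 't2', 't3'] -- complete, no repeats
```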
In summary, our contributions are as follows:
First, we identify the novel problem of retrieving the data in a hidden database with structured queries. In contrast to traditional keywords-based retrieving methods, our structured-queries-based approach introduces no extra duplicated tuples into the result set, as demonstrated in later sections.
Second, we provide a theoretical framework that models the hidden database as a hierarchy tree: the blocks of top-k tuples in the database are represented as the leaf nodes, and the edges from the root to the leaf nodes constitute the query paths.
Third, we propose a heuristic rule based on the mutual information between attribute values to guide the retrieving process (see the sketch after this list).
Finally, we present extensive experiments to evaluate the efficiency and coverage of our retrieving method. We also conduct experiments to demonstrate that employing the mutual-information-based heuristic rule can significantly narrow the query space and increase the retrieving efficiency.
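As a hedged sketch of what such a rule can be built on (the estimator below is a standard plug-in estimate of mutual information from a sample; the paper's exact formulation appears in Sec. 4, and all names here are our own):

```python
import math
from collections import Counter

# Plug-in estimate of the mutual information between two attributes from a sample.
# A standard estimator used for illustration; not necessarily the paper's formula.
def mutual_information(sample, a, b):
    n = len(sample)
    pa = Counter(t[a] for t in sample)
    pb = Counter(t[b] for t in sample)
    pab = Counter((t[a], t[b]) for t in sample)
    mi = 0.0
    for (va, vb), c in pab.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((pa[va] / n) * (pb[vb] / n)))
    return mi

# Table 1 again as the sample; the resulting scores can be used to rank
# which attribute to branch on next during the tree traversal.
sample = [
    {"A1": "0", "A2": "0", "A3": "0"},
    {"A1": "0", "A2": "1", "A3": "1"},
    {"A1": "1", "A2": "1", "A3": "1"},
]
print(round(mutual_information(sample, "A1", "A2"), 3))  # ~0.252 bits
```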
The rest of this paper is organized as follows. In Sec. 2, we review related work. In Sec. 3, we formally specify the problem of retrieving data from hidden databases and describe how to model hidden databases as hierarchy trees. In Sec. 4, we discuss methods for narrowing the query space and propose the mutual-information-based heuristic rule. The experimental setting and evaluation are given in Sec. 6. We draw conclusions in Sec. 7.
2. Related Works
Current work on deep Web data extraction mainly involves keywords-based crawlers [5–8]. In [5], a deep Web data crawling approach based on an attribute value graph is proposed. This method first constructs the attribute value graph, then