Crawling the Hidden Web
Sriram Raghavan, Hector Garcia-Molina
Computer Science Department
Stanford University
Stanford, CA 94305, USA
{rsram, hector}@cs.stanford.edu
Abstract
Current-day crawlers retrieve content only from
the publicly indexable Web, i.e., the set of Web
pages reachable purely by following hypertext
links, ignoring search forms and pages that require
authorization or prior registration. In particular,
they ignore the tremendous amount of high qual-
ity content “hidden” behind search forms, in large
searchable electronic databases. In this paper, we
address the problem of designing a crawler capa-
ble of extracting content from this hidden Web.
We introduce a generic operational model of a
hidden Web crawler and describe how this model
is realized in HiWE (Hidden Web Exposer), a
prototype crawler built at Stanford. We intro-
duce a new Layout-based Information Extraction
Technique (LITE) and demonstrate its use in au-
tomatically extracting semantic information from
search forms and response pages. We also present
results from experiments conducted to test and
validate our techniques.
1 Introduction
Crawlers are programs that automatically traverse the Web
graph, retrieving pages and building a local repository of
the portion of the Web that they visit. Depending on the ap-
plication at hand, the pages in the repository are either used
to build search indexes, or are subjected to various forms
of analysis (e.g., text mining). Traditionally, crawlers have
only targeted a portion of the Web called the publicly index-
able Web (PIW) [13]. This refers to the set of pages reach-
able purely by following hypertext links, ignoring search
forms and pages that require authorization or prior regis-
tration.
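To make this limitation concrete, the following Python sketch (our illustration, not code from any actual crawler) shows the basic link-following loop of a traditional PIW crawler. It extracts only <a href=...> anchors and never inspects <form> elements, which is precisely why form-based content remains out of its reach.

    # Sketch of a traditional PIW crawler (illustrative, not the
    # paper's code): it follows hypertext links only, so search
    # forms are never explored.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":                     # only <a href=...>;
                for name, value in attrs:      # <form> tags are ignored
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_piw(seed, max_pages=100):
        """Breadth-first crawl reaching only the publicly indexable Web."""
        seen, frontier, repository = {seed}, deque([seed]), {}
        while frontier and len(repository) < max_pages:
            url = frontier.popleft()
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                       # skip unreachable pages
            repository[url] = page             # local page repository
            extractor = LinkExtractor()
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return repository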
However, a number of recent studies [2, 13, 14] have ob-
served that a significant fraction of Web content in fact lies
outside the PIW. Specifically, large portions of the Web
are ‘hidden’ behind search forms, in searchable structured
and unstructured databases (called the hidden Web [8] or
deep Web [2]). Pages in the hidden Web are dynamically
generated in response to queries submitted via the search
forms. The hidden Web continues to grow, as organizations
with large amounts of high-quality information (e.g., the
Census Bureau, Patent and Trademark Office, news me-
dia companies) are placing their content online, providing
Web-accessible search facilities over existing databases.
For instance, the website InvisibleWeb.com lists over
10,000 such databases ranging from archives of job listings
to directories, news archives, and electronic catalogs. Re-
cent estimates [2] place the size of the hidden Web (in terms
of generated HTML pages) at around 500 times the size of
the PIW.
In this paper, we address the problem of building a hid-
den Web crawler, one that can crawl and extract content
from these hidden databases. Such a crawler will enable
indexing, analysis, and mining of hidden Web content, akin
to what is currently being achieved with the PIW. In addi-
tion, the content extracted by such crawlers can be used to
categorize and classify the hidden databases.
Challenges. There are significant technical challenges
in designing a hidden Web crawler. First, the crawler
must be designed to automatically parse, process, and in-
teract with form-based search interfaces that are designed
primarily for human consumption. Second, unlike PIW
crawlers which merely submit requests for URLs, hidden
Web crawlers must also provide input in the form of search
queries (i.e., “fill out forms”). This raises the issue of how
best to equip crawlers with the necessary input values for
use in constructing search queries.
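As a concrete illustration of the first challenge, the sketch below (our own simplified example, not HiWE's actual implementation) parses an HTML search form to recover its action URL and input fields. A hidden Web crawler must perform this kind of extraction before it can assign input values and submit a query; the form markup and field names here are purely hypothetical.

    # Simplified form analysis (illustrative, not HiWE's code):
    # recover a search form's submission URL and its input fields.
    from html.parser import HTMLParser

    class FormParser(HTMLParser):
        """Collects the action URL and the input elements of a form."""
        def __init__(self):
            super().__init__()
            self.action = None
            self.fields = []            # (field name, field type) pairs
            self._in_form = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "form" and not self._in_form:
                self._in_form = True
                self.action = attrs.get("action")
            elif self._in_form and tag in ("input", "select", "textarea"):
                self.fields.append((attrs.get("name"), attrs.get("type", "text")))

        def handle_endtag(self, tag):
            if tag == "form":
                self._in_form = False

    parser = FormParser()
    parser.feed('<form action="/search">'
                '<input type="text" name="query">'
                '<select name="category"></select>'
                '</form>')
    print(parser.action, parser.fields)
    # /search [('query', 'text'), ('category', 'text')]

Once the fields are known, the crawler must still decide what values to supply for each of them; this is where the task-specific, human-assisted input described next comes in.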
To address these challenges, we adopt a task-specific,
human-assisted approach to crawling the hidden Web.
Task-specificity: We aim to selectively crawl portions
of the hidden Web, extracting content based on the re-
quirements of a particular application or task. For exam-
ple, consider a market analyst who is interested in build-
ing an archive of news articles, reports, press releases, and
white papers pertaining to the semiconductor industry, and