Foreign-Literature Translation: Efficient URL Caching for Web Crawling (.doc)

Uploaded 2023-07-08 22:20:22 · 105 KB · DOC

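Efficient URL Caching for Web Crawling

Summary: This paper examines URL caching techniques for improving the efficiency of web crawlers. The authors study and compare several caching algorithms in detail, including random replacement, static cache, LRU, and CLOCK, and evaluate two theoretical limits: clairvoyant caching and infinite cache. From simulations on log data from an actual crawl, they conclude that caching greatly improves crawler efficiency: a cache of roughly 50,000 entries achieves a hit rate of almost 80%.

Key point 1: The crawling algorithm. Crawling is the process of collecting pages from the web. The basic algorithm is: (a) fetch a page; (b) parse it to extract the linked URLs; (c) for all URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) turn this plan from a simple programming exercise into a serious algorithmic and system design challenge.

Key point 2: Distributed architecture. Because of the scale of the web and the crawl rate it demands, a crawler needs a distributed architecture. The membership test (deciding whether a URL has already been seen) may then have to be answered across multiple nodes, which makes the algorithm more complex.

Key point 3: Caching. Caching keeps a subset of the seen URLs in main memory to speed up the membership test. The practical algorithms considered are random replacement, static cache, LRU, and CLOCK.

Key point 4: Comparison of caching algorithms. The authors compare these algorithms across cache sizes. Cache size has a strong effect on hit rate: a cache of roughly 50,000 entries reaches a hit rate of almost 80%, whereas a substantially smaller cache is much less effective.

Key point 5: Theoretical limits. Clairvoyant caching is an idealized algorithm that knows in advance which URLs will be requested in the future; an infinite cache is large enough to store every URL. Both serve as upper bounds against which the practical algorithms are evaluated.

Key point 6: Practical application. The simulations use log data from an actual crawl, so they reflect a realistic crawling workload. The results make the study a useful reference for crawler design.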
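To make the two theoretical limits concrete, here is a minimal Python sketch. It is not the authors' simulator: the function names and the toy trace are assumptions for illustration, and the clairvoyant policy is written in its usual textbook form (on a miss with a full cache, evict the entry whose next request lies furthest in the future).

```python
from typing import Hashable, Sequence


def infinite_cache_hit_rate(trace: Sequence[Hashable]) -> float:
    """Hit rate of a cache that never evicts: only first occurrences miss."""
    if not trace:
        return 0.0
    return 1.0 - len(set(trace)) / len(trace)


def clairvoyant_hit_rate(trace: Sequence[Hashable], capacity: int) -> float:
    """Hit rate when every eviction removes the entry needed furthest ahead."""
    if not trace or capacity <= 0:
        return 0.0
    never = len(trace)  # sentinel meaning "not requested again"
    # next_use[i] = index of the next request for trace[i] after position i.
    next_use = [never] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], never)
        last_seen[trace[i]] = i

    cache = {}  # url -> index of its next request
    hits = 0
    for i, url in enumerate(trace):
        if url in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Simple O(capacity) scan; a heap would be faster on large caches.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[url] = next_use[i]
    return hits / len(trace)
```

On any trace, the clairvoyant hit rate at a given capacity cannot exceed the infinite-cache hit rate, and the two coincide once the capacity covers every distinct URL in the trace.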


Original Foreign-Language Text

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder

IBM TJ Watson Research Center

19 Skyline Dr

Hawthorne, NY 10532

abroder@us.ibm.com

Marc Najork

Microsoft Research

1065 La Avenida

Mountain View, CA 94043

najork@microsoft.com

Janet L. Wiener

Hewlett Packard Labs

1501 Page Mill Road

Palo Alto, CA 94304

janet.wiener@hp.com

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)–(c)
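The fetch, parse, and repeat loop can be sketched in a few lines of Python. This is an illustration only: `TOY_WEB`, `fetch`, `extract_links`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access. The point it shows is that the membership test runs once per extracted link, not once per fetched page.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the URLs its page links to.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch(url):
    """Stand-in for step (a): return the 'page', here just its link list."""
    return TOY_WEB.get(url, [])


def extract_links(page):
    """Stand-in for step (b): a real crawler would parse HTML here."""
    return list(page)


def crawl(seeds):
    """Run steps (a)-(c) from the seed URLs; return fetch order and test count."""
    seen = set(seeds)          # the set that the membership test consults
    frontier = deque(seeds)    # URLs discovered but not yet fetched
    order = []
    membership_tests = 0
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in extract_links(fetch(url)):
            membership_tests += 1      # step (c): has this URL been seen?
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order, membership_tests
```

Calling `crawl(["http://a.example/"])` on the toy data fetches four pages but performs five membership tests, since every outgoing link is tested, including links to pages that were already seen.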

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and are taught in many introductory algorithms classes (see, for instance, [34]).
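As a small illustration of how the two strategies differ only in which unexplored node is taken next, the sketch below uses one frontier and switches between queue and stack behavior. The function name `traverse` and the example graph are assumptions made for this example.

```python
from collections import deque


def traverse(graph, start, strategy="bfs"):
    """Return the visit order for BFS (queue) or DFS (stack) from start."""
    if strategy not in ("bfs", "dfs"):
        raise ValueError("strategy must be 'bfs' or 'dfs'")
    visited = set()
    frontier = deque([start])
    order = []
    while frontier:
        # BFS takes the oldest frontier entry, DFS the newest.
        node = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if node in visited:
            continue  # the node was reached again before being explored
        visited.add(node)
        order.append(node)
        neighbors = graph.get(node, [])
        if strategy == "dfs":
            # Push in reverse so the first listed link is explored first.
            neighbors = reversed(neighbors)
        for neighbor in neighbors:
            if neighbor not in visited:
                frontier.append(neighbor)
    return order


GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
```

On `GRAPH`, `traverse(GRAPH, "A", "bfs")` visits A, B, C, D, while `traverse(GRAPH, "A", "dfs")` visits A, B, D, C.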

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
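The rates quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The links-per-page value is an illustrative assumption, not a figure from this excerpt: it only shows that about ten or more links per page is enough to push the membership test past ten thousand calls per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400
PAGES_PER_DAY = 100_000_000             # "at least 100 million pages per day"

# Step (a): page fetches per second needed to sustain that daily volume.
fetches_per_second = PAGES_PER_DAY / SECONDS_PER_DAY


def membership_tests_per_second(links_per_page):
    """Step (c) rate, assuming one membership test per extracted link."""
    return fetches_per_second * links_per_page
```

`fetches_per_second` comes to roughly 1,157, consistent with "about 1,000 times per second", and `membership_tests_per_second(10)` is roughly 11,574.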

A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits, clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.
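Two of the practical techniques named above, LRU and CLOCK, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the class names, the `access` method, and `hit_rate` are invented for this example, and setting the mark bit of a newly inserted entry is one common CLOCK variant chosen here for simplicity. Random replacement and static caching are not shown.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used URL when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest access first

    def access(self, url):
        """Return True on a hit; on a miss, insert url and return False."""
        if url in self.entries:
            self.entries.move_to_end(url)       # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)    # drop least recently used
        self.entries[url] = None
        return False


class ClockCache:
    """Approximates LRU with one mark bit per slot and a rotating hand."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots = [None] * capacity
        self.marked = [False] * capacity
        self.slot_of = {}   # url -> slot index
        self.hand = 0

    def access(self, url):
        """Return True on a hit; on a miss, replace an unmarked slot."""
        slot = self.slot_of.get(url)
        if slot is not None:
            self.marked[slot] = True
            return True
        # Advance the hand, clearing marks, until an unmarked slot is found.
        while self.marked[self.hand]:
            self.marked[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand]
        if victim is not None:
            del self.slot_of[victim]
        self.slots[self.hand] = url
        self.slot_of[url] = self.hand
        self.marked[self.hand] = True   # assumption: new entries start marked
        self.hand = (self.hand + 1) % self.capacity
        return False


def hit_rate(cache, trace):
    """Fraction of requests in a non-empty trace answered from the cache."""
    hits = sum(cache.access(url) for url in trace)
    return hits / len(trace)
```

For example, on the toy trace `["a", "b", "a", "c", "b", "a"]` with capacity 2, LRU scores one hit and this CLOCK variant scores two. Results on such a tiny trace say nothing about which policy is better in general; the sketch only shows the mechanics.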

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.
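The batch step described here, adding saved non-local URLs to per-host seed sets, might look roughly like the sketch below. It is a guess at the general shape only: the excerpt gives no details of the Internet Archive's actual job, and `merge_into_seed_sets` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def merge_into_seed_sets(seed_sets, saved_urls):
    """Add each saved URL to the seed set of the host it belongs to."""
    for url in saved_urls:
        host = urlsplit(url).hostname
        if host:                      # skip URLs without a usable host
            seed_sets[host].add(url)  # sets drop duplicate URLs
    return seed_sets


seed_sets = merge_into_seed_sets(
    defaultdict(set),
    [
        "http://a.example/x",
        "http://b.example/",
        "http://a.example/y",
        "http://a.example/x",   # duplicate, stored once
    ],
)
```

Grouping by host keeps each seed set aligned with the per-host crawling processes, so a process only needs the seed set for the hosts it is assigned.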

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch
