CDNs Content Outsourcing
via Generalized Communities
Dimitrios Katsaros, George Pallis, Konstantinos Stamos, Athena Vakali,
Antonis Sidiropoulos, and Yannis Manolopoulos
Abstract—Content distribution networks (CDNs) balance costs and quality in services related to content delivery. Devising an efficient content outsourcing policy is crucial since, based on such policies, CDN providers can deliver client-tailored content, improve performance, and achieve significant economic gains. Earlier content outsourcing approaches may often prove ineffective since they drive prefetching decisions by assuming knowledge of content popularity statistics, which are not always available and are extremely volatile. This work addresses this issue by proposing a novel self-adaptive technique under a CDN framework in which outsourced content is identified with no a priori knowledge of (earlier) request statistics. This is achieved by using a structure-based approach that identifies coherent clusters of “correlated” Web server content objects, the so-called Web page communities. These communities are the core outsourcing unit, and in this paper, detailed simulation experiments show that the proposed technique is robust and effective in reducing user-perceived latency as compared with competing approaches, i.e., two community-based approaches, Web caching, and non-CDN.
Index Terms—Caching, replication, Web communities, content distribution networks, social network analysis.
1 INTRODUCTION
Distributing information to Web users in an efficient and cost-effective manner is a challenging problem, especially under the increasing requirements emerging from a
variety of modern applications, e.g., voice-over-IP and
streaming media. Eager audiences embracing the “digital
lifestyle” are requesting greater and greater volumes of
content on a daily basis. For instance, the Internet video site YouTube serves more than 100 million videos per day.^1 Estimates of YouTube's bandwidth range from 25 TB/day to 200 TB/day. At the same time, more and more
applications (such as e-commerce and e-learning) are relying
on the Web but with high sensitivity to delays. Even a delay of a few milliseconds in delivering Web server content may be intolerable. At first, solutions such as Web caching and replication were considered the key to satisfying such
growing demands and expectations. However, such solu-
tions (e.g., Web caching) have become obsolete due to their
inability to keep up with the growing demands and the
unexpected Web-related phenomena, such as the flash-crowd events [17] that occur when numerous users access a Web server's content simultaneously (now common on the Web due to its globalization and wide adoption).
Content distribution networks (CDNs) have been pro-
posed to meet such challenges by providing a scalable and
cost-effective mechanism for accelerating the delivery of the
Web content [7], [27]. A CDN^2 is an overlay network across the Internet (Fig. 1), which consists of a set of surrogate servers
(distributed around the world), routers, and network
elements. Surrogate servers are the key elements in a
CDN, acting as proxy caches that directly serve cached content to clients. They store identical copies of content,
such that clients’ requests are satisfied by the most
appropriate site. Once a client requests content from an origin server (managed by a CDN), the request is directed to the appropriate CDN surrogate server. This results in an improvement of both the response time (the requested content is closer to the client) and the system throughput (the workload is distributed across several servers).

Fig. 1. A typical CDN.
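The following minimal sketch (an illustration under assumed inputs, not part of the CDN mechanisms described above) shows one naive surrogate-selection rule: redirect each client to the surrogate with the lowest measured latency. Real CDNs combine DNS-based redirection, anycast, server load, and topology information; the server names and latency figures below are invented for the example.

    from typing import Dict

    def pick_surrogate(latencies_ms: Dict[str, float]) -> str:
        """Return the surrogate server with the smallest measured client latency."""
        return min(latencies_ms, key=latencies_ms.get)

    # Hypothetical client-to-surrogate round-trip times (milliseconds).
    client_latencies = {"surrogate-eu": 28.0, "surrogate-us": 110.0, "surrogate-asia": 190.0}
    print(pick_surrogate(client_latencies))  # -> surrogate-eu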
As emphasized in [4] and [34], CDNs significantly
reduce the bandwidth requirements for Web service
providers, since the requested content is closer to the users and there is no need to traverse all of the congested pipes and peering points. So, reducing bandwidth reduces cost for the Web service providers. CDNs also provide scalable Web application hosting techniques (such as edge computing [10]) in order to accelerate the dynamic generation of Web pages; instead of replicating the dynamic pages
generated by a Web server, they replicate the means of
generating pages over multiple surrogate servers [34].
CDNs are expected to play a key role in the future of the Internet infrastructure, since their high user performance and cost savings have urged many Web entrepreneurs to make contracts with CDNs.^3
. D. Katsaros is with the Department of Computer and Communication
Engineering, University of Thessaly, Volos, Greece.
E-mail: dkatsar@inf.uth.gr.
. K. Stamos, A. Vakali, A. Sidiropoulos, and Y. Manolopoulos are with the
Department of Informatics, Aristotle University of Thessaloniki, 54124,
Thessaloniki, Greece.
E-mail: {kstamos, avakali, asidirop, manolopo}@csd.auth.gr.
. G. Pallis is with the Department of Computer Science, University of
Cyprus, 20537, Nicosia, Cyprus. E-mail: gpallis@cs.ucy.ac.cy.
Manuscript received 10 July 2007; revised 9 Feb. 2008; accepted 24 Apr. 2008;
published online 6 May 2008.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number
TKDE-2007-07-0347.
Digital Object Identifier no. 10.1109/TKDE.2008.92.
1. http://www.youtube.com/.
2. A survey on the status and trends of CDNs is given in [36]. Detailed information about the CDNs' mechanisms is presented in [30].
3. “Content Delivery Networks, Market Strategies and Forecasts (2001-2006),” AccuStream iMedia Research: http://www.marketresearch.com/.
1.1 Motivation and Paper's Contributions
Currently, CDNs invest in large-scale infrastructure (surro-
gate servers, network resources, etc.) to provide high data
quality for their clients. To recoup their investment, CDNs charge their customers (i.e., Web server owners) based on two criteria: the amount of content that has been outsourced and their traffic records (measured by the content delivery from surrogate servers to clients). According to a CDN market report,^3 the average cost per GB of streaming video transferred in 2004 through a CDN was $1.75, while the average price to deliver a GB of Internet radio was $1. Given that the bandwidth usage of Web server content may be huge (e.g., the bandwidth usage of YouTube is about 6 petabytes per month), it is evident that this cost may be extremely high.
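As a rough back-of-the-envelope illustration (assuming the quoted 2004 streaming rate applied uniformly to the whole volume), 6 petabytes per month correspond to roughly 6.3 million GB (6 × 1,048,576 GB), which at $1.75 per GB would amount to about $11 million in delivery charges per month.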
Therefore, the proposal of a content outsourcing policy that will reduce both the Internet traffic and the replica maintenance costs is a challenging research task due to the huge scale, the heterogeneity, the multilevel structure, the hyperlinks and interconnections, and the dynamic and evolving nature of Web content.
To the best of the authors' knowledge, earlier approaches for content outsourcing on CDNs assume knowledge of content popularity statistics to drive the prefetching decisions [9], giving an indication of the popularity of Web resources (a detailed review of relevant work is presented in Section 2). Such information, though, is not always available, or it is extremely volatile, making such methods problematic. The use of popularity statistics has
several drawbacks. First, it requires quite a long time to
collect reliable request statistics for each object. Such a long
interval, though, may not be available when a new site is
published to the Internet and should be protected from
“flash crowds” [17]. Moreover, the popularity of each object
varies considerably [4], [9]. In addition, the use of
administratively tuned parameters to select the hot objects
causes additional headaches, since there is no a priori
knowledge about how to set these parameters. Realizing the
limitations of such solutions, Rabinovich and Spatscheck
[30] implied the need for self-tuning content outsourcing
policies. In [32], we initiated the study of this problem by
outsourcing clusters of Web pages. The outsourced clusters
are identified by naively exploring the structure of the Web
site. Results showed that such an approach improves the
CDN’s performance in terms of user-perceived latency and
data redundancy.
The present work continues and improves upon the
authors’ preliminary efforts in [32], focusing on devising a
high-performance outsourcing policy under a CDN frame-
work. In this context, we point out that the following
challenges are involved:
. outsource objects that will remain popular for long time periods,
. refrain from using (locally estimated or server-
supplied) tunable parameters (e.g., number of
clusters) and keywords, which do not adapt well
to changing access distributions, and
. refrain from using popularity statistics, which do not effectively represent the users' dynamic navigation behavior. As observed in [9], only 40 percent of the “popular” objects of one day remain “popular” the next day.
In accordance with the above challenges, we propose a novel self-adaptive technique under a CDN framework in which outsourced content is identified by exploring the Web server content structure, with no a priori knowledge of (earlier) request statistics. This paper's contribution is summarized as follows:
. Identifying content clusters, called Web page commu-
nities, based on the adopted Web graph structure
(where Web pages are nodes and hyperlinks are
edges), such that these communities serve as the core
outsourcing unit for replication. Typically, a community can be considered a dense subgraph, where the number of edges within the community is larger than the number of edges between communities (a small illustrative sketch of these graph notions is given after this list). Such structures exist on the Web [1], [13], [20]—Web server content designers (humans or applications) tend to
organize sites into collections of Web pages related
to a common interest—and affect users’ navigation
behavior; a dense linkage implies a higher prob-
ability of selecting a link. Here, we exploit a
quantitative definition for Web page communities
introduced in [32], which is suitable for the CDN content outsourcing problem. Our definition is flexible, allowing overlaps among communities (a Web page may belong to more than one community), since a Web page usually covers a wide range of topics (e.g., in a news Web site) and cannot be assigned to a single community. The resulting communities are replicated in their entirety to the CDN's surrogate servers.
. Defining a parameter-free outsourcing policy, since our
structure-based approach (unlike k-median, dense
k-subgraphs, min-sum, or min-max clustering) does
not require the number of communities as a pre-
determined parameter, but instead, the optimal
number of communities is any value between 1 and
the number of nodes of the Web site graph, depend-
ing on the node connectivity (captured by the Web
server content structure). The proposed policy, called Communities identification with Betweenness Centrality (CiBC), identifies overlapping Web page communities using the concept of Betweenness Centrality (BC) [5]. Specifically, Newman and Girvan [22] have used the concept of edge betweenness to select edges to be removed from the graph so as to devise a divisive hierarchical clustering procedure, which, though, cannot by itself provide the final communities and requires administrator intervention. Contrary to [22], BC is used in this paper to measure how central each node of the Web site graph is within a community.
. Experimenting on a detailed simulation testbed, where numerous experiments evaluate the proposed scheme under regular traffic and under flash crowd events. Current usage of Web technologies and Web server content performance characteristics during a flash crowd event are highlighted, and from our experimentation, the proposed approach is shown to be robust and effective in minimizing both the average response time of users' requests and the costs of CDN providers.
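To make the graph notions above concrete, the following minimal sketch (an illustration under assumed inputs, not the CiBC algorithm presented in Section 4) builds a toy Web site graph, computes node betweenness centrality with the networkx library, and checks the dense-subgraph criterion (more edges inside a candidate community than edges crossing its boundary). The page names and candidate communities are invented for the example.

    # Illustrative sketch only; the toy graph, the candidate communities, and
    # the use of networkx are assumptions made for this example.
    import networkx as nx

    # Toy Web site graph: pages are nodes, hyperlinks are (undirected) edges.
    G = nx.Graph()
    G.add_edges_from([
        ("index", "news"), ("index", "sports"),
        ("news", "politics"), ("news", "economy"), ("politics", "economy"),   # dense "news" area
        ("sports", "football"), ("sports", "basket"), ("football", "basket"), # dense "sports" area
        ("economy", "sports"),                                                # sparse link between areas
    ])

    # Betweenness centrality (BC): how often a node lies on shortest paths
    # between other nodes; pages that bridge areas of the site score high.
    bc = nx.betweenness_centrality(G)
    for page, score in sorted(bc.items(), key=lambda kv: -kv[1]):
        print(f"{page:10s} BC = {score:.3f}")

    # Dense-subgraph check for a candidate community: more edges inside the
    # community than edges crossing its boundary.
    def is_community(graph, nodes):
        internal = graph.subgraph(nodes).number_of_edges()
        external = sum(1 for _ in nx.edge_boundary(graph, nodes))
        return internal > external

    print(is_community(G, {"news", "politics", "economy"}))  # True: tightly linked
    print(is_community(G, {"index", "football"}))            # False: no internal edge

In this toy graph, the bridging page "sports" receives the highest BC score, while only the tightly linked news-related pages satisfy the community criterion; the CiBC policy of Section 4 uses BC to identify such communities without predetermined parameters.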
1.2 Road Map
The rest of this paper is structured as follows: Section 2
discusses the related work. In Section 3, we formally define
the problem addressed in this paper. Section 4 presents the
proposed policy. Sections 5 and 6 present the simulation
testbed, examined policies, and performance measures.
Section 7 evaluates the proposed approach, and finally,
Section 8 concludes this paper.
2 RELEVANT WORK
2.1 Content Outsourcing Policies
As identified by earlier research efforts [9], [15], the choice of
the outsourced content has a crucial impact in terms of
CDN’s pricing [15] and CDN’s performance [9], and it is
quite complex and challenging if we consider the dynamic
nature of the Web. A naive solution to this problem is to
outsource all the objects of the Web server content (full
mirroring) to all the surrogate servers. The latter may seem feasible, given the great technological advances in storage media and networking support. However,
the respective demand from the market greatly surpasses
these advances. For instance, after the recent agreement between Limelight Networks^4 and YouTube, under which the former was adopted as YouTube's content delivery platform, we can only infer (since this is proprietary information) the huge storage requirements of the surrogate servers. Moreover, the evolution toward completely personalized TV (e.g., stage6^5) reveals that the full content of
the origin servers cannot be completely outsourced as a
whole. Finally, the problem of updating such a huge
collection of Web objects is unmanageable. Thus, we have
to resort to a more “selective” outsourcing policy.
A few such content outsourcing policies have been
proposed in order to identify which objects to outsource for
replicating to CDNs' surrogate servers. These can be
categorized as follows:
. Empirical-based outsourcing. The Web server con-
tent administrators decide empirically about which
content will be outsourced [3].
. Popularity-based outsourcing. The most popular
objects are replicated to surrogate servers [37].
. Object-based outsourcing. The content is replicated
to surrogate servers in units of objects. Each object is
replicated to the surrogate server (under the storage
constraints) which gives the most performance gain
(greedy approach) [9], [37].
. Cluster-based outsourcing. The content is replicated
to surrogate servers in units of clusters [9], [14]. A
cluster is defined as a group of Web pages which
have some common characteristics with respect to
their content, the time of references, the number of
references, etc.
From the above content outsourcing policies, the object-
based one achieves high performance [9], [37]. However, as
pointed out by the authors of these policies, the huge number of objects prevents it from being implemented in a real application. On the other hand, the popularity-based
outsourcing policies do not select the most suitable objects
for outsourcing, since the most popular objects remain
popular for a short time period [9]. Moreover, they require
quite a long time to collect reliable request statistics for each
object. Such a long interval, though, may not be available when new Web server content is published to the Internet
and should be protected from flash crowd events.
Thus, we resort to the exploitation of cluster-based outsourcing policies. The cluster-based approach has also attracted the most attention in the research community [9]. In such an
approach, the clusters may be identified by using conven-
tional data clustering algorithms. However, due to the lack
of a uniform schema for Web documents and the dynamics of
Web data, the efficiency of these approaches is unsatisfac-
tory. Furthermore, most of them require administratively
tuned parameters (maximum cluster diameter, maximum
number of clusters) to decide the number of clusters, which
causes additional problems, since there is no a priori
knowledge about how many clusters of objects exist and of
what shape these clusters are.
In contrast with the above approaches, we exploit the Web server content structure and consider each cluster as a Web page community, whose characteristics reflect the dynamic and heterogeneous nature of the Web. Specifically, this approach considers each page as a whole object, rather than breaking the Web page down into information pieces, and it reveals mutual relationships among the concerned Web data.
2.2 Identifying Web Page Communities
In the literature there are several proposals for identifying
Web page communities [13], [16]. One of the key distin-
guishing properties of the algorithms that is usually
considered has to do with the degree of locality which is
used for assessing whether or not a page should be assigned
in a community. Regarding this feature, the methods for
identifying the communities can be summarized as follows:
4. http://www.limelightnetworks.com.
5. http://stage6.divx.com.