PageRank as a Function of the Damping Factor∗

Paolo Boldi Massimo Santini Sebastiano Vigna

DSI, Università degli Studi di Milano

Abstract

PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads part of the rank uniformly. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in link-spam detection [21]. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking.

Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the

Power Method that approximates them with convergence $O(t^k \alpha^t)$ for the k-th derivative. Finally,

we show a tight connection between iterated computation and analytical behaviour by proving that the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.

1 Introduction

PageRank [17] is one of the most important ranking techniques used in today’s search engines. Not only is PageRank a simple, robust and reliable way to measure the importance of web pages [3], but it is also computationally advantageous with respect to other ranking techniques in that it is query independent and content independent. In other words, it can be computed offline using only the web graph structure and then used later, as users submit queries to the search engine, typically aggregated with other, query-dependent rankings [4, 12, 16].

One suggestive way to describe the idea behind PageRank is as follows: consider a random surfer that starts from a random page, and at each step chooses the next page by clicking on one of the links in the current page (selected uniformly at random among the links present in the page). As a first approximation, we could define the rank of a page as the fraction of time that the surfer spends on that page on average. Clearly, important pages (i.e., pages that happen to be linked by many other pages, or by few important ones) will be visited more often, which justifies the definition. However, we also allow the surfer to restart with probability 1 − α from another node chosen randomly and uniformly, instead of following a link.
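The random-surfer description can be turned into a small simulation. The sketch below is only an illustration: the function name `simulate_surfer`, the four-node graph `toy_links`, and the fixed seed are assumptions made for the example, and every node is assumed to have at least one outlink.

```python
import random


def simulate_surfer(out_links, alpha, steps, seed=0):
    """Estimate the fraction of time a random surfer spends on each node.

    out_links[i] lists the successors of node i (assumed non-empty here).
    With probability alpha the surfer follows a link chosen uniformly at
    random; with probability 1 - alpha it restarts from a uniform node.
    """
    rng = random.Random(seed)
    n = len(out_links)
    visits = [0] * n
    node = rng.randrange(n)  # start from a random page
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < alpha:
            node = rng.choice(out_links[node])
        else:
            node = rng.randrange(n)
    return [count / steps for count in visits]


# Hypothetical toy graph: node 0 is linked by all the other nodes.
toy_links = [[1], [0], [0, 1], [0]]
frequencies = simulate_surfer(toy_links, alpha=0.85, steps=100_000)
print(frequencies)
```

Nodes 2 and 3 have no incoming links, so the surfer reaches them only through restarts, and their estimated frequencies stay small compared with that of node 0.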

As remarked in [5], a significant part of the current knowledge about PageRank is scattered through the research laboratories of large search engines, and its analysis “has remained largely in the realm of trade secrets and economic competition”. Like the authors of the aforementioned paper, however, we believe that a scientific and detailed study of PageRank is essential to our understanding of the web, and we hope this paper can be a contribution to such a program.

∗ This work has been partially supported by a “Finanziamento per grandi e mega attrezzature scientifiche” of the Università degli Studi di Milano and by the MIUR COFIN Project “Linguaggi formali e automi”.

The web graph is the directed graph whose nodes are URLs and whose arcs correspond to hyperlinks.

PageRank is defined formally as the stationary distribution of a stochastic process whose states are the nodes of the web graph. The process itself is obtained by combining the normalised adjacency matrix of the web graph (with some patches for nodes without outlinks that will be discussed later) with a trivial uniform process that is needed to make the combination irreducible and aperiodic, so that the stationary distribution is well defined. The combination depends on a damping factor α ∈ [0, 1), which will play a major rôle in this paper. When α is 0, the web-graph part of the process is annihilated, resulting in the trivial uniform process. As α goes to 1, the web part becomes more and more important.

The problem of choosing α was curiously overlooked in the first papers about PageRank: yet, not only does PageRank change significantly when α is modified [19, 18], but also the relative ordering of nodes determined by PageRank can be radically different [14]. The original value suggested by Brin and Page (α = 0.85) is the most common choice.

Intuitively, 1 − α is an amount of ranking that we agree to give uniformly to each page. This amount will then be funneled through the outlinks of the node. A common form of link spamming carefully funnels this amount towards a single page, giving it a preposterously great importance.

It is natural to wonder what the best value of the damping factor is, if such a thing exists. In a way, when α gets close to 1 the Markov process is closer to the “ideal” one, which would somehow suggest that α should be chosen as close to 1 as possible. This observation is not new, but it is somewhat naive.

The first issue is of a computational nature: PageRank is traditionally computed using variants of the Power Method. The number of iterations required for this method to converge grows with α, and more and more numerical precision is required as α gets closer to 1.
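The growth in the number of iterations can be observed on a small example. The following is a minimal sketch, not the authors' implementation: it assumes a row-stochastic matrix `P` (a hypothetical four-node graph), a uniform vector `v`, the update x ← αxP + (1 − α)v on row vectors, and an L1 stopping threshold `tol` chosen for illustration.

```python
def power_method(P, v, alpha, tol=1e-10, max_iter=100_000):
    """Iterate x <- alpha * x P + (1 - alpha) * v on row vectors.

    Returns the last iterate and the number of iterations performed
    before the L1 distance between successive iterates fell below tol.
    """
    n = len(v)
    x = list(v)
    for iteration in range(1, max_iter + 1):
        nxt = [
            (1 - alpha) * v[j] + alpha * sum(x[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt, iteration
        x = nxt
    return x, max_iter


# Hypothetical graph: 0 <-> 1, 2 -> {0, 1}, 3 -> 0 (rows already normalised).
P = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
v = [0.25] * 4
iterations = {alpha: power_method(P, v, alpha)[1] for alpha in (0.5, 0.85, 0.99)}
print(iterations)
```

On this graph the distance between successive iterates shrinks by a factor of α per step, so the iteration count needed to reach a fixed threshold increases sharply as α approaches 1.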

But there is an even more fundamental reason not to choose a value of α too close to 1: we shall prove in Section 3 that when α goes to 1 PageRank gets concentrated in the recurrent states, which correspond essentially to the nodes whose strongly connected components have no passage toward other components. This phenomenon gives a null PageRank to all the pages in the core component, something that is difficult to explain and that is contrary to common sense. In other words, in real-world web graphs the rank of all important nodes (in particular, all nodes of the core component) goes to 0 as α goes to 1.
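The concentration effect is visible even on a tiny graph. The sketch below rests on illustrative assumptions: a hypothetical three-node row-stochastic matrix `P` in which nodes 0 and 1 link to each other, node 1 also links to node 2, and node 2 has only a loop; a uniform vector `v`; and a fixed number of power iterations rather than a convergence test.

```python
def pagerank(P, v, alpha, iterations=500):
    """Approximate PageRank by iterating x <- alpha * x P + (1 - alpha) * v."""
    n = len(v)
    x = list(v)
    for _ in range(iterations):
        x = [
            (1 - alpha) * v[j] + alpha * sum(x[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
    return x


# Nodes 0 and 1 form a component with an outgoing arc 1 -> 2;
# node 2 only has a loop, so its component has no passage elsewhere.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]
v = [1 / 3] * 3
ranks = {alpha: pagerank(P, v, alpha) for alpha in (0.5, 0.85, 0.99, 0.999)}
for alpha, r in ranks.items():
    print(alpha, [round(x, 4) for x in r])
```

As α grows, the combined rank of nodes 0 and 1 shrinks toward 0 while node 2 absorbs almost all of it, which is the behaviour described above in miniature.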

Thus, PageRank oscillates between a meaningless uniform distribution (α = 0) and a meaningless

distribution concentrated mostly in irrelevant nodes (α = 1). As a result, both for choosing the correct

damping factor and for detecting link spamming, being able to describe the behaviour of PageRank

when α changes is essential. Recently, indeed, a sophisticated form of link-spam detection has been

based on the study of the value of PageRank with respect to α [21].

To proceed further in this direction, it is essential that we have at our disposal analytical tools that

describe this behaviour. To this purpose, we shall provide closed-form formulae for the derivatives

of any order of PageRank with respect to α, and an iterative algorithm (an extension of the power

method) that approximates them.

The most surprising consequence, easily derived from our formulae, is that the vectors computed during the PageRank computation for any α ∈ (0, 1) can be used to approximate PageRank for every other α ∈ (0, 1). This happens because the k-th coefficient of the Maclaurin series for PageRank can be easily computed during the k-th iteration of the Power Method. This makes it easy to study the behaviour of PageRank for any node while storing a minimal amount of data.
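One way to see this connection is to expand the closed formula (1) of Section 2 as a geometric series, r(α) = (1 − α)v Σ_k α^k P^k, which gives the coefficients c_0 = v and c_k = vP^k − vP^(k−1). The sketch below is an illustration under assumptions: the helper names, the three-node matrix `P`, and the uniform `v` are hypothetical, and the Power Method is started from v.

```python
def vec_mat(x, P):
    """Row vector times matrix."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def maclaurin_coefficients(P, v, degree):
    """Return [c_0, ..., c_degree] with c_0 = v and c_k = v P^k - v P^(k-1)."""
    coeffs = [list(v)]
    prev = list(v)
    for _ in range(degree):
        cur = vec_mat(prev, P)
        coeffs.append([a - b for a, b in zip(cur, prev)])
        prev = cur
    return coeffs


def evaluate(coeffs, alpha):
    """Evaluate the vector-valued polynomial sum_k alpha^k c_k."""
    n = len(coeffs[0])
    return [sum(c[j] * alpha**k for k, c in enumerate(coeffs)) for j in range(n)]


def power_iterate(P, v, alpha, steps):
    """Run a fixed number of Power Method steps starting from v."""
    x = list(v)
    for _ in range(steps):
        x = [alpha * a + (1 - alpha) * b for a, b in zip(vec_mat(x, P), v)]
    return x


# Hypothetical row-stochastic matrix and uniform vector.
P = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]
v = [1 / 3] * 3
coeffs = maclaurin_coefficients(P, v, degree=60)  # computed once, independent of alpha
approx_085 = evaluate(coeffs, 0.85)
approx_050 = evaluate(coeffs, 0.50)
print(approx_085, power_iterate(P, v, 0.85, 60))
```

The coefficients do not depend on α, so once they are stored the same list can be re-evaluated for any damping factor; the degree-k polynomial agrees with the k-th Power Method iterate up to rounding.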

Free Java code implementing all the algorithms described in this paper will be available for download at

http://law.dsi.unimi.it/.

2 Basic definitions

Let G be the adjacency matrix of a directed graph of N nodes (identified hereafter with the numbers from 0 to N − 1). A node is terminal if it does not have outlinks, except possibly for loops (or, equivalently, if all arcs incident on the node are incoming). If we want to be specific about the presence of a loop, we shall use the terms looped and loopless.

We note that usually G is preprocessed before building the corresponding Markov chain. Common processing includes removal of all loops (as nodes should not give authoritativeness to themselves) and thresholding the number of links coming from pages of the same domain (to reduce the effect of link spamming).

If no loopless terminal nodes are present (note that after the preprocessing sketched above they will be the only kind of terminal nodes), we can just normalise uniformly to 1 the row-sums of G by multiplying it by $D^{-1}$, the inverse of the diagonal degree matrix. However, D is not invertible if loopless terminal nodes are present. The classical way to handle this situation consists in substituting them with nodes that have one outgoing arc toward every node (including the node itself). In other words, in G rows of zeroes are substituted with rows of ones.

Let $\bar{G}$ be the (adjacency matrix of the) resulting graph, and $\bar{D}$ be the diagonal matrix of the outdegrees of $\bar{G}$ (i.e., $\bar{d}_{ii}$ is the number of ones on the i-th row of $\bar{G}$). Let also $\mathbf{1}$ be the vector of all 1’s, and v be any personalisation vector (a vector whose elements are all non-negative and sum to 1, which is used to bias PageRank w.r.t. a selected set of trusted pages).
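The patching and normalisation steps can be sketched in a few lines. This is an illustration under assumptions: the function name `patch_and_normalise` and the three-node matrix are hypothetical, the adjacency matrix is a list of 0/1 rows, and the result is the row-normalised matrix called P in Figure 1.

```python
def patch_and_normalise(G):
    """Replace all-zero rows of G with rows of ones, then normalise each row.

    Rows of zeroes correspond to loopless terminal nodes; rows that contain
    only a loop are left untouched because their outdegree is already nonzero.
    """
    n = len(G)
    patched = [row[:] if any(row) else [1] * n for row in G]
    return [[entry / sum(row) for entry in row] for row in patched]


# Hypothetical graph: 0 -> {1, 2}, node 1 has no outlinks, 2 -> 0.
G = [
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
]
P = patch_and_normalise(G)
for row in P:
    print(row)
```

Every row of the result sums to 1, and the row of the loopless terminal node becomes the uniform distribution over all nodes.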

We provide a toy example in the Appendix that will guide the reader through the paper. In Table 5, the example graph G and its modified version $\bar{G}$ are presented.

In the rest of the paper, we shall use the matrices defined in Figure 1; some of them are functions of the damping factor α ∈ [0, 1), and we will use a notation reflecting this fact. Note that Q(α) is well defined for all α ∈ [0, 1), as (I − αP) is known to be invertible [20].

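$$
\begin{aligned}
P &= \bar{D}^{-1}\bar{G} \\
A(\alpha) &= \alpha P + (1 - \alpha)\mathbf{1}^T v \\
C(\alpha) &= I - \alpha P \\
Q(\alpha) &= P\,C(\alpha)^{-1}
\end{aligned}
$$

Figure 1: Basic PageRank definitions.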

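The PageRank vector r(α) is defined as the dominant eigenvector of A(α); more precisely, as the only vector summing to 1 such that r(α)A(α) = r(α). Noting that $r(\alpha)\mathbf{1}^T = 1$, we get

$$
\begin{aligned}
r(\alpha)\bigl(\alpha P + (1 - \alpha)\mathbf{1}^T v\bigr) &= r(\alpha) \\
\alpha\, r(\alpha) P + (1 - \alpha) v &= r(\alpha) \\
(1 - \alpha) v &= r(\alpha)(I - \alpha P),
\end{aligned}
$$

which yields the following closed formula for PageRank:

$$
r(\alpha) = (1 - \alpha)\, v\, C(\alpha)^{-1}. \tag{1}
$$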
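The closed formula can be checked numerically on a small matrix. The sketch below makes several assumptions for illustration: the helper names `solve` and `pagerank_closed_form`, the three-node matrix `P`, and the uniform `v` are hypothetical, and instead of inverting C(α) it solves the equivalent linear system r(α)(I − αP) = (1 − α)v by Gaussian elimination.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def pagerank_closed_form(P, v, alpha):
    """Return r with r (I - alpha P) = (1 - alpha) v, for row vectors r and v."""
    n = len(v)
    # Transpose the row-vector equation into the column form M r^T = b.
    M = [[(1.0 if i == j else 0.0) - alpha * P[j][i] for j in range(n)]
         for i in range(n)]
    return solve(M, [(1 - alpha) * x for x in v])


# Hypothetical row-stochastic matrix and uniform personalisation vector.
P = [[0.0, 0.5, 0.5], [1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]]
v = [1 / 3] * 3
alpha = 0.85
r = pagerank_closed_form(P, v, alpha)

# A(alpha) = alpha P + (1 - alpha) 1^T v; r should satisfy r A(alpha) = r.
A = [[alpha * P[i][j] + (1 - alpha) * v[j] for j in range(3)] for i in range(3)]
rA = [sum(r[i] * A[i][j] for i in range(3)) for j in range(3)]
print(r, rA)
```

Solving the system rather than forming the inverse is a design choice made only to keep the example short; both express the same formula.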

In PageRank-related literature, loopless terminal nodes are more commonly known as dangling nodes; the same kind of

node is often called a sink in graph-theoretic literature. Our choice avoids the usage of ambiguous terms that have been given

different meanings in different papers.

All vectors in this paper are row vectors.
