没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
PageRank as a Function of the Damping Factor∗Paolo Boldi Massimo Santini Sebastiano Vigna DSI, Università degli Studi di MilanoAbstractPageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturb- ing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. Recently, however, the behaviou
资源推荐
资源详情
资源评论
PageRank as a Function of the Damping Factor
∗
Paolo Boldi Massimo Santini Sebastiano Vigna
DSI, Università degli Studi di Milano
Abstract
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturb-
ing the transition matrix induced by a web graph with a damping factor α that spreads uniformly
part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion
α = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with
respect to changes in α was discovered to be useful in link-spam detection [21]. Moreover, an
analytical justification of the value chosen for α is still missing. In this paper, we give the first
mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to
popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking.
Then, we give closed-form formulae for PageRank derivatives of any order, and an extension of the
Power Method that approximates them with convergence O
t
k
α
t
for the k-th derivative. Finally,
we show a tight connection between iterated computation and analytical behaviour by proving that
the k-th iteration of the Power Method gives exactly the PageRank value obtained using a Maclau-
rin polynomial of degree k. The latter result paves the way towards the application of analytical
methods to the study of PageRank.
1 Introduction
PageRank [17] is one of the most important ranking techniques used in today’s search engines. Not
only is PageRank a simple, robust and reliable way to measure the importance of web pages [3], but
it is also computationally advantageous with respect to other ranking techniques in that it is query
independent, and content independent. Otherwise said, it can be computed offline using only the web
graph
1
structure and then used later, as users submit queries to the search engine, typically aggregated
with other, query-dependent rankings [4, 12, 16].
One suggestive way to describe the idea behind PageRank is as follows: consider a random surfer
that starts from a random page, and at every time chooses the next page by clicking on one of the
links in the current page (selected uniformly at random among the links present in the page). As a
first approximation, we could define the rank of a page as the fraction of time that the surfer spent on
that page on the average. Clearly, important pages (i.e., pages that happen to be linked by many other
pages, or by few important ones) will be visited more often, which justifies the definition. However,
we also allow the surfer to restart with probability 1 − α from another node chosen randomly and
uniformly, instead of following a link.
As remarked in [5], a significant part of the current knowledge about PageRank is scattered
through the research laboratories of large search engines, and its analysis “has remained largely in
the realm of trade secrets and economic competition”. As the authors of the aforementioned paper,
however, we believe that a scientific and detailed study of PageRank is essential to our understanding
of the web, and we hope this paper can be a contribution in such program.
∗
This work has been partially supported by a “Finanziamento per grandi e mega attrezzature scientifiche” of the Università
degli Studi di Milano and by the MIUR COFIN Project “Linguaggi formali e automi”.
1
The web graph is the directed graph whose nodes are URLs and whose arcs correspond to hyperlinks.
1
PageRank is defined formally as the stationary distribution of a stochastic process whose states
are the nodes of the web graph. The process itself is obtained by combining the normalised adjacency
matrix of the web graph (with some patches for nodes without outlinks that will be discussed later)
with a trivial uniform process that is needed to make the combination irreducible and aperiodic,
so that the stationary distribution is well defined. The combination depends on a damping factor
α ∈ 0, 1), which will play a major rôle in this paper. When α is 0, the web-graph part of the process
is annihilated, resulting in the trivial uniform process. As α goes to 1, the web part becomes more
and more important.
The problem of choosing α was curiously overlooked in the first papers about PageRank: yet, not
only PageRank changes significantly when α is modified [19, 18], but also the relative ordering of
nodes determined by PageRank can be radically different [14]. The original value suggested by Brin
and Page (α = 0.85) is the most common choice.
Intuitively, 1 − α is an amount of ranking that we agree to give uniformly at each page. This
amount will be then funneled through the outlinks of the node. A common form of link spamming
funnels carefully this amount towards a single page, giving it a preposterously great importance.
It is natural to wonder what is the best value of the damping factor, if such a thing exists. In a
way, when α gets close to 1 the Markov process is closer to the “ideal” one, which would somehow
suggest that α should be chosen as close to 1 as possible. This observation is not new, but it has some
naivety in it.
The first issue is of computational nature: PageRank is traditionally computed using variants of
the Power Method. The number of iterations required for this method to converge grows with α, and
moreover more and more numerical precision is required as α gets closer to 1.
But there is an even more fundamental reason not to choose a value of α too close to 1: we shall
prove in Section 3 that when α goes to 1 PageRank gets concentrated in the recurrent states, which
correspond essentially to the nodes whose strongly connected components have no passage toward
other components. This phenomenon gives a null PageRank to all the pages in the core component,
something that is difficult to explain and that is contrary to common sense. In other words, in real-
word web graphs the rank of all important nodes (in particular, all nodes of the core component) goes
to 0 as α goes to 1.
Thus, PageRank oscillates between a meaningless uniform distribution (α = 0) and a meaningless
distribution concentrated mostly in irrelevant nodes (α = 1). As a result, both for choosing the correct
damping factor and for detecting link spamming, being able to describe the behaviour of PageRank
when α changes is essential. Recently, indeed, a sophisticated form of link-spam detection has been
based on the study of the value of PageRank with respect to α [21].
To proceed further in this direction, it is essential that we have at our disposal analytical tools that
describe this behaviour. To this purpose, we shall provide closed-form formulae for the derivatives
of any order of PageRank with respect to α, and an iterative algorithm (an extension of the power
method) that approximates them.
The most surprising consequence, easily derived from our formulae, is that the vectors computed
during the PageRank computation for any α ∈ (0, 1) can be used to approximate PageRank for every
other α ∈ (0, 1). This happens because the k-th coefficient of the Maclaurin series for PageRank can
be easily computed during the k-th iteration of the Power Method. This allows to study easily the
behaviour of PageRank for any node storing a minimal amount of data.
2
2
Free Java code implementing all the algorithms described in this paper will be available for download at
http://law.dsi.unimi.it/.
2
2 Basic definitions
Let G be the adjacency matrix of a directed graph of N nodes (identified hereafter with the numbers
from 0 to N − 1). A node is terminal if it does not have outlinks, except possibly for loops (or,
equivalently, if all arcs incident on the node are incoming). If we want to be specific about the
presence of a loop, we shall use the terms looped and loopless
3
.
We note that usually G is preprocessed before building the corresponding Markov chain. Com-
mon processing includes removal of all loops (as nodes should not give authoritativeness to them-
selves) and thresholding the number of links coming from pages of the same domain (to reduce the
effect of link spamming).
If no loopless terminal nodes are present (note that after the preprocessing sketched above they
will be the only kind of terminal nodes), we can just normalise uniformly to 1 the row-sums of G
by multiplying it by D
−1
, the inverse of the diagonal degree matrix. However, D is not invertible if
loopless terminal nodes are present. The classical way to handle this situation consists in substituting
them with nodes that have one outgoing arc toward every node (including the node itself. In other
words, in G rows of zeroes are substituted with rows of ones.
Let
¯
G be the (adjacency matrix of the) resulting graph, and
¯
D be the diagonal matrix of the
outdegrees of
¯
G (i.e., d
ii
is the number of ones on the i -th row of
¯
G). Let also 1 be the vector
4
of all
1’s, and v be any personalisation vector (a vector whose elements are all non-negative and sum to 1,
which is used to bias PageRank w.r.t. a selected set of trusted pages).
We are providing a toy example in the Appendix that will guide the reader through the paper. In
Table 5, the example graph G and its modified version
¯
G are presented.
In the rest of the paper, we shall use the matrices defined in Figure 1; some of them are functions
of the damping factor α ∈ 0, 1), and we will use a notation reflecting this fact. Note that Q(α) is
well defined for all α ∈ 0, 1), as (I − α P) is known to be invertible [20].
P =
¯
D
−1
¯
G
A(α) = α P + (1 − α)1
T
v
C(α) = I − α P
Q(α) = PC(α)
−1
Figure 1: Basic PageRank definitions.
The PageRank vector r(α) is defined as the dominant eigenvector of A(α); more precisely, as the
only vector summing to 1 such that r(α) A(α) = r(α). Noting that r(α)1
T
= 1, we get
r(α)
α P + (1 − α)1
T
v
= r(α)
αr(α)P + (1 − α)v = r(α)
(1 − α)v = r(α)(I − α P),
which yields the following closed formula for PageRank:
r(α) = (1 − α)vC(α)
−1
. (1)
3
In PageRank-related literature, loopless terminal nodes are more commonly known as dangling nodes; the same kind of
node is often called a sink in graph-theoretic literature. Our choice avoids the usage of ambiguous terms that have been given
different meanings in different papers.
4
All vectors in this paper are row vectors.
3
剩余18页未读,继续阅读
资源评论
weixin_38745859
- 粉丝: 3
- 资源: 969
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- VB水费管理系统设计与实现(源代码+系统)(2024bf).7z
- vb通讯录管理信息系统(源代码+可执行程序+论文+开题报告+外文翻译)(2024f9).7z
- VB通讯录系统设计与实现(源代码+系统)(2024ri).7z
- VB通用C++试题库系统的设计与开发(论文+源代码)(2024af).7z
- VB图书管理系统(论文)(2024fv).7z
- vb图书馆管理系统(源代码+论文)(20245j).7z
- VB通用药品公司进销售存管理系统设计(源代码+系统)(2024uo).7z
- vb图书管理系统(论文+源代码+开题报告+外文翻译+答辩ppt)(20249q).7z
- vb图书管理系统(源代码+论文)(20241z).7z
- VB图书管理系统(完全可以运行)修改好的(2024ql).7z
- VB图书管理系统设计(论文+源代码)(2024dr).7z
- vb人事工资管理系统毕业设计(论文+源代码+答辩PPT)(2024x7).7z
- VB人口登记管理系统(源代码+系统+答辩PPT)(2024us).7z
- VB人事工资管理系统设计(论文+源代码+答辩PPT)(2024l6).7z
- VB人事管理系统(源代码+论文)(2024b7).7z
- VB人事管理系统(系统+论文)(2024qn).7z
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功