
arXiv:1010.5504v1 [cs.SI] 26 Oct 2010

On the Convexity of Latent Social Network Inference

Seth A. Myers

Institute for Computational and Mathematical Engineering

Stanford University

samyers@stanford.edu

Jure Leskovec

Department of Computer Science

Stanford University

jure@cs.stanford.edu

Abstract

In many real-world scenarios, it is nearly impossible to collect explicit social network data. In such cases, whole networks must be inferred from underlying observations. Here, we formulate the problem of inferring latent social networks based on network diffusion or disease propagation data. We consider contagions propagating over the edges of an unobserved social network, where we only observe the times when nodes became infected, but not who infected them. Given such node infection times, we then identify the optimal network that best explains the observed data. We present a maximum likelihood approach based on convex programming with an $l_1$-like penalty term that encourages sparsity. Experiments on real and synthetic data reveal that our method near-perfectly recovers the underlying network structure as well as the parameters of the contagion propagation model. Moreover, our approach scales well as it can infer optimal networks of thousands of nodes in a matter of minutes.

1 Introduction

Social network analysis has traditionally relied on self-reported data collected via interviews and questionnaires [27]. As collecting such data is tedious and expensive, traditional social network studies typically involved a very limited number of people (usually fewer than 100). The emergence of large-scale social computing applications has made massive social network data [16] available, but there are important settings where network data is hard to obtain and the whole network must thus be inferred from the data. For example, populations such as injection drug users or men who have sex with men are “hidden” or “hard-to-reach”. Collecting social networks of such populations is nearly impossible, and thus whole networks have to be inferred from the observational data.

Even though inferring social networks has been attempted in the past, such work usually assumes that the pairwise interaction data is already available [5]. In this case, the problem of network inference reduces to deciding whether to include the interaction between a pair of nodes as an edge in the underlying network. For example, inferring networks from pairwise interactions in cell-phone call [5] or email [4, 13] records simply reduces to selecting the right threshold τ such that an edge (u, v) is included in the network if u and v interacted more than τ times in the dataset. Similarly, inferring networks of interactions between proteins in a cell usually reduces to determining the right threshold [9, 20].
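The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration of the baseline, not code from the cited work: the function name `threshold_network`, the toy `calls` list, and the choice to treat interactions as undirected are all assumptions made for the example.

```python
from collections import Counter


def threshold_network(interactions, tau):
    """Return undirected edges (u, v) whose interaction count exceeds tau.

    `interactions` is an iterable of (u, v) pairs; direction is ignored
    (an assumption of this sketch) and self-interactions are dropped.
    """
    counts = Counter(
        frozenset(pair) for pair in interactions if pair[0] != pair[1]
    )
    return {tuple(sorted(edge)) for edge, n in counts.items() if n > tau}


# Hypothetical call log: a-b interact 3 times, b-c twice, a-c once.
calls = [("a", "b"), ("b", "a"), ("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")]
edges = threshold_network(calls, tau=1)
```

With `tau=1`, only the pairs that interacted more than once survive, so the a-c pair is dropped.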

We address the problem of inferring the structure of unobserved social networks in a much more ambitious setting. We consider a diffusion process where a contagion (e.g., disease, information, product adoption) spreads over the edges of the network, and all that we observe are the infection times of nodes, but not who infected whom, i.e., we do not observe the edges over which the contagion spread. The goal then is to reconstruct the underlying social network along the edges of which the contagion diffused.

We think of a diffusion on a network as a process where neighboring nodes switch states from inactive to active. The network over which activations propagate is usually unknown and unobserved. Commonly, we only observe the times when particular nodes get “infected” but we do not observe who infected them. In the case of information propagation, as bloggers discover new information, they write about it without explicitly citing the source [15]. Thus, we only observe the time when a blog gets “infected” but not where it got infected from. Similarly, in disease spreading, we observe people getting sick without usually knowing who infected them [26]. And, in a viral marketing setting, we observe people purchasing products or adopting particular behaviors without explicitly knowing who the influencer was that caused the adoption or the purchase [11]. Thus, the question is: if we assume that the network is static over time, is it possible to reconstruct the unobserved social network over which diffusions took place? What is the structure of such a network?

We develop a convex-programming-based approach for inferring latent social networks from diffusion data. We first formulate a generative probabilistic model of how, on a fixed hypothetical network, contagions spread through the network. We then write down the likelihood of observed diffusion data under a given network and diffusion model parameters. Through a series of steps we show how to obtain a convex program with an $l_1$-like penalty term that encourages sparsity. We evaluate our approach on synthetic as well as real-world email and viral marketing datasets. Experiments reveal that we can near-perfectly recover the underlying network structure as well as the parameters of the propagation model. Moreover, our approach scales well since we can infer optimal networks of a thousand nodes in a matter of minutes.

Further related work. There are several different lines of work connected to our research. First is network structure learning for estimating the dependency structure of directed graphical models [7] and probabilistic relational models [7]. However, these formulations are often intractable and one has to resort to heuristic solutions. Recently, graphical Lasso methods [25, 21, 6, 19] for static sparse graph estimation and extensions to time-evolving graphical models [1, 8, 22] have been proposed with considerable success. Our work here is similar in the sense that we “regress” the infection times of a target node on the infection times of other nodes. Additionally, our work is related to the link prediction problem [12, 23, 18, 24], but differs in that this line of work assumes that part of the network is already visible to us.

The work most closely related to ours, however, is [10], which also infers networks from cascade data. The algorithm proposed there (called NetInf) assumes that the weights of the edges in the latent network are homogeneous, i.e., all connected nodes in the network infect/influence their neighbors with the same probability. When this assumption holds, the algorithm is very accurate and computationally feasible, but here we remove this assumption in order to address a more general problem. Furthermore, whereas [10] is an approximation algorithm, our approach guarantees optimality while easily handling networks with thousands of nodes.

2 Problem Formulation and the Proposed Method

We now define the problem of inferring latent social networks based on network diffusion data, where we only observe identities of infected nodes. Thus, for each node we know the interval during which the node was infected, whereas the source of each node’s infection is unknown. We assume only that an infected node was infected by some other previously infected node to which it is connected in the latent social network (which we are trying to infer). Our methodology can handle a wide class of information diffusion and epidemic models, like the independent contagion model, the Susceptible–Infected (SI), Susceptible–Infected–Susceptible (SIS) or even the Susceptible–Infected–Recovered (SIR) model [2]. We show that calculating the maximum likelihood estimator (MLE) of the latent network (under any of the above diffusion models) is equivalent to a convex problem that can be efficiently solved.

Problem formulation: The cascade model. We start by introducing the model of the diffusion process. As the contagion spreads through the network, it leaves a trace that we call a cascade. Assume a population of N nodes, and let A be the N × N weighted adjacency matrix of the network that is unobserved and that we aim to infer. Each entry (i, j) of A models the conditional probability of infection transmission:

$$A_{ij} = P(\text{node } i \text{ infects node } j \mid \text{node } i \text{ is infected}).$$

The temporal properties of most types of cascades, especially disease spread, are governed by a transmission (or incubation) period. The transmission time model w(t) specifies how long it takes for the infection to transmit from one node to another, and the recovery model r(t) models how long a node is infected before it recovers. Thus, whenever some node i, which was infected at time $\tau_i$, infects another node j, the time separating the two infection times is sampled from w(t), i.e., the infection time of node j is $\tau_j = \tau_i + t$, where t is distributed according to w(t). Similarly, the duration of each node’s infection is sampled from r(t). Both w(t) and r(t) are general probability distributions with strictly nonnegative support.

A cascade c is initiated by randomly selecting a node to become infected at time t = 0. Let $\tau_i$ denote the time of infection of node i. When node i becomes infected, it infects each of its neighbors in the network independently, with probabilities governed by A. Specifically, if i becomes infected and j is susceptible, then j will become infected with probability $A_{ij}$. Once it has been determined which of i’s neighbors will be infected, the infection time of each newly infected neighbor will be the sum of $\tau_i$ and an interval of time sampled from w(t). The transmission time for each new infection is sampled independently from w(t).
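The generative process just described can be illustrated with a small simulation. The following is a minimal sketch under the SI model, not the authors' code: the name `simulate_si_cascade`, the list-of-lists representation of A, and the exponential transmission delay used in the example are assumptions. When several infected neighbors attempt to infect the same node, the sketch keeps the earliest arrival as that node's infection time.

```python
import heapq
import math
import random


def simulate_si_cascade(A, rng, sample_delay):
    """Simulate one SI cascade on the weighted adjacency matrix A.

    A[i][j] is the probability that i infects j once i is infected.
    `sample_delay(rng)` draws one transmission time from w(t).
    Returns a list of infection times; math.inf marks uninfected nodes.
    """
    n = len(A)
    tau = [math.inf] * n
    seed = rng.randrange(n)      # randomly chosen initial node
    queue = [(0.0, seed)]        # pending infections as (time, node)
    while queue:
        t, i = heapq.heappop(queue)
        if t >= tau[i]:
            continue             # node i was already infected earlier
        tau[i] = t
        for j in range(n):
            # Each susceptible neighbor is infected independently w.p. A[i][j].
            if j != i and tau[j] == math.inf and rng.random() < A[i][j]:
                heapq.heappush(queue, (t + sample_delay(rng), j))
    return tau


# Hypothetical 3-node network with an exponential w(t) (an assumption).
A_example = [[0.0, 0.8, 0.1],
             [0.2, 0.0, 0.7],
             [0.0, 0.3, 0.0]]
rng = random.Random(0)
times = simulate_si_cascade(A_example, rng, lambda r: r.expovariate(1.0))
```

Only `times` would be visible to the inference procedure; the identity of the node that caused each infection is discarded, which is exactly the information the method has to recover.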

Once a node becomes infected, what happens next depends on the model. In the SIS model, node i will become susceptible to infection again at time $\tau_i + r_i$. Under the SIR model, on the other hand, node i will recover and can never be infected again. Our work here mainly considers the SI model, where nodes remain infected forever, i.e., a node never recovers and $r_i = \infty$. It is important to note, however, that our approach can handle all of these models with almost no modification to the algorithm.

For each cascade c, we then observe the node infection times $\tau_i^c$ as well as the duration of infection, but the source of each node’s infection remains hidden. The goal then is to infer, based on the observed set of cascade infection times D, the weighted adjacency matrix A, where $A_{ij}$ models the edge transmission probability.

Maximum Likelihood Formulation. Let D be the set of observed cascades. For each cascade c, let $\tau_i^c$ be the time of infection for node i. Note that if node i did not get infected in cascade c, then $\tau_i^c = \infty$. Also, let $X_c(t)$ denote the set of all nodes that are in an infected state at time t in cascade c. We know the infection of each node was the result of an unknown, previously infected node to which it is connected, so the component of the likelihood function for each infection will be dependent on all previously infected nodes. Specifically, the likelihood function for a fixed given A is

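$$
L(A; D) = \prod_{c \in D} \left[ \left( \prod_{i;\, \tau_i^c < \infty} P\big(i \text{ infected at } \tau_i^c \mid X_c(\tau_i^c)\big) \right) \cdot \left( \prod_{i;\, \tau_i^c = \infty} P\big(i \text{ never infected} \mid X_c(t)\ \forall t\big) \right) \right]
$$

$$
= \prod_{c \in D} \left[ \left( \prod_{i;\, \tau_i^c < \infty} \left( 1 - \prod_{j;\, \tau_j^c \le \tau_i^c} \big( 1 - w(\tau_i^c - \tau_j^c)\, A_{ji} \big) \right) \right) \cdot \left( \prod_{i;\, \tau_i^c = \infty}\ \prod_{j;\, \tau_j^c < \infty} (1 - A_{ji}) \right) \right]
$$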

The likelihood function is composed of two terms. Consider some cascade c. First, for every node i that got infected at time $\tau_i^c$ we compute the probability that at least one other previously infected node could have infected it. Second, for every non-infected node, we compute the probability that no other node ever infected it. Note that we assume that both the cascades and the infections are conditionally independent. Moreover, in the case of the SIS model each node can be infected multiple times during a single cascade, so there will be multiple observed values for each $\tau_i^c$ and the likelihood function would have to include each infection time in the product. We omit this detail for the sake of clarity.
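To make the two terms concrete, here is a minimal Python sketch that evaluates the log of this likelihood for the SI model. It is an illustration rather than the authors' implementation, and it rests on several assumptions: the function name `log_likelihood` is hypothetical, `A[j][i]` is read as the probability that j infects i, `w` is a function returning a weight in [0, 1], and the seed node of each cascade (which has no earlier infected node) is skipped because its infection is exogenous.

```python
import math


def _safe_log(x):
    """log(x), returning -inf instead of raising when x is not positive."""
    return math.log(x) if x > 0.0 else -math.inf


def log_likelihood(A, cascades, w):
    """Log-likelihood of infection-time cascades under the SI model.

    A[j][i]  : probability that j infects i (list of lists).
    cascades : list of lists of infection times, math.inf if never infected.
    w        : function mapping a time gap to a weight in [0, 1] (assumed).
    """
    n = len(A)
    total = 0.0
    for tau in cascades:
        infected = [j for j in range(n) if tau[j] < math.inf]
        for i in range(n):
            if tau[i] < math.inf:
                earlier = [j for j in infected if j != i and tau[j] <= tau[i]]
                if not earlier:
                    continue  # assumption: the seed's infection is exogenous
                # Probability that every earlier-infected node failed to infect i.
                escape = math.prod(
                    1.0 - w(tau[i] - tau[j]) * A[j][i] for j in earlier
                )
                total += _safe_log(1.0 - escape)
            else:
                # i stayed healthy: every infected node failed to infect it.
                total += sum(_safe_log(1.0 - A[j][i]) for j in infected)
    return total


# Hypothetical 2-node example: node 0 infects node 1 with probability 0.5.
A_toy = [[0.0, 0.5], [0.0, 0.0]]
toy_cascades = [[0.0, 1.0], [0.0, math.inf]]
ll = log_likelihood(A_toy, toy_cascades, lambda gap: 1.0)
```

In the toy example the first cascade contributes log(0.5) because node 1 was infected, and the second contributes log(1 - 0.5) because it was not.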

Then the maximum likelihood estimate of A is a solution to $\min_A -\log(L(A; D))$ subject to the constraints $0 \le A_{ij} \le 1$ for each i, j.
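For reference, the objective can be written out by taking the negative logarithm of the likelihood, which turns the products over cascades and nodes into sums. This expansion is a worked restatement rather than an additional result; it assumes that $A_{ji}$ denotes the probability that j infects i, in line with the definition of A.

```latex
\begin{aligned}
\min_{A}\quad & -\sum_{c \in D} \Bigg[
    \sum_{i;\, \tau_i^c < \infty}
      \log\!\Big( 1 - \prod_{j;\, \tau_j^c \le \tau_i^c}
        \big( 1 - w(\tau_i^c - \tau_j^c)\, A_{ji} \big) \Big)
    \;+\; \sum_{i;\, \tau_i^c = \infty}\ \sum_{j;\, \tau_j^c < \infty}
      \log\!\big( 1 - A_{ji} \big) \Bigg] \\
\text{subject to}\quad & 0 \le A_{ij} \le 1 \quad \text{for each } i, j .
\end{aligned}
```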

Since a node cannot infect itself, the diagonal of A is strictly zero, leaving the optimization problem with N(N − 1) variables. This makes scaling to large networks problematic. We can, however, break this problem into N independent subproblems, each with only N − 1 variables, by observing that the incoming edges to a node can be inferred independently of the incoming edges of any other node. Note that there is no restriction on the structure of A (for example, it is not in general a stochastic matrix), so the columns of A can be inferred independently.
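The decomposition can be checked numerically: the part of the log-likelihood that involves node i only touches column i of A. The sketch below illustrates this under stated assumptions; `node_log_likelihood` is a hypothetical name, `incoming[j]` stands for $A_{ji}$, `w` is assumed to return a weight in [0, 1], and the seed node of each cascade is skipped as exogenous. It does not solve the subproblems, it only evaluates their objectives.

```python
import math


def node_log_likelihood(incoming, i, cascades, w):
    """Log-likelihood terms that depend only on node i's incoming edges.

    incoming[j] plays the role of A[j][i]; incoming[i] is ignored.
    """
    total = 0.0
    for tau in cascades:
        sources = [j for j, t in enumerate(tau) if j != i and t < math.inf]
        if tau[i] < math.inf:
            earlier = [j for j in sources if tau[j] <= tau[i]]
            if earlier:  # assumption: skip the seed of the cascade
                escape = math.prod(
                    1.0 - w(tau[i] - tau[j]) * incoming[j] for j in earlier
                )
                total += math.log(1.0 - escape) if escape < 1.0 else -math.inf
        else:
            total += sum(
                math.log(1.0 - incoming[j]) if incoming[j] < 1.0 else -math.inf
                for j in sources
            )
    return total


# Hypothetical 3-node network, two cascades, exponential decay for w (assumed).
A_demo = [[0.0, 0.5, 0.2],
          [0.3, 0.0, 0.4],
          [0.1, 0.6, 0.0]]
demo_cascades = [[0.0, 1.0, math.inf], [math.inf, 0.0, 2.0]]
decay = lambda gap: math.exp(-gap)

n = len(A_demo)
columns = [[A_demo[j][i] for j in range(n)] for i in range(n)]
per_node = [node_log_likelihood(columns[i], i, demo_cascades, decay) for i in range(n)]
total_ll = sum(per_node)  # full log-likelihood is the sum of the N pieces
```

Because each entry of `per_node` is a function of a single column, the N subproblems could be optimized separately, or in parallel, without changing the overall optimum.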
