k-core decomposition

所需积分/C币:49 2018-12-04 19:55:43 343KB PDF
收藏 收藏 2

k-core decomposition
The problem of core decomposition is given a graph G, Algorithm I BottomUp find the k-core of G for k=0, 1, .. kimaa, where kmar is the Input: G=(V, E) maximum core number of any vertex in G. Equivalently, the Output: The k-class, Wk, of G for all k problem is to find the h-class of g for k =0, 1,..., him.aa. In the k-classes of G when main memory is not large enough to 3. While(G is not empty) ascending this paper, we propose an external-memory algorithm that finds I. order the vertices in g in ascending order of their degree let d be the minimum vertex degree in g hold the whole graph G. From the k-classes, we can easily 4. wd t-0 obtain any k-core as the induced subgraph of G by the vertex 5. while( there exists a vertex v with degree of at most d) set Vk= 6 ya←业aU{v号} The following example illustrates the concept of core de 7 remove u and all edges incident lu v from G composition. ertices in ascending order of their degree, Example 1 Figure I shows a graph G that contains 14 9. output all yki vertices, a,..., n. The O-class, yo, of G is an empty set since there is no isolated vertex in G. The l-class yy Ya, b, c,jf, the 2-class y2=c d, e, f, g, m), and the 3-class degree of d or less, they cannot be in a k-class where k>d y 3=h, i, k, I, n In this example, we have kimar =3.The and thus must be in y d. Therefore, they are added to y 0-core is the induced subgraph of g by vertex set 0i<3 and removed from G. This process continues until all vertices Since yo=0 in this example, the 0-core is the same as the remaining in g have degree greater than d. Then, the algorithm 1-core, which is simply the entire graph. The 2-core is the moves on to the next iteration to compute the next k-class induced subgraph by(y2UY3)and the 3-core is the induced where k d. The algorithm terminates when all vertices, and subgraph by y 3. One can easily verify that in the k-core, their edges are removed from G every vertex is connected to at least k other vertices, where The following example further explains the bottom-up com- 0≤k<3. utation Example 2 St e that we compute the k-cla of th graph in Figure l by Algorithm 1. Line I sorts the vertices as follows:{a(1),c(1),j(1),f(2),m(2),b(3),d(3),g(3), e(4), k(4), n(4), h(5),l(5),i(6), where the number parentheses indicates the degree of the vertex in the current G. Then, in the first iteration(Lines 3-8)that computes the 1 class, the vertices a, C and j are first removed from g together with their incident edges since their degree is I. After that, the Fig. 1. A Graph and Its h-classcs vertices are re-ordered as (b(1), f(2), m(2), d(3), 9(3), c(4) k (4), n(4),h(5), 1(5), i(5)). Then vertex b is also removed IIL. IN-MEMORY CORE DECOMPOSITION because after removing a and c, the degree of b becomes 1. at this point, the ordered vertex list becomes If(2), m(2), d(2), 1. In this section, we describe the existing in-memory al- .9(3), e(4), k(4), n(4),(5), l(), i(5)1. The first iteration gorithm for core decomposition and explain why it is not completes and obtains V1=[a, c,j, b. In the second iteration suitable to be extended to an external-memory algorithm. We that computes the 2-class, the vertices removed are (in the also discuss the advantage of the top-down approach over the order): f, m, d, g, and c. In the third iteration that computes bottom-up approach for core decomposition in large graphs. the 3-class, the vertices removed are(in the order): n, k, h, I Algorithm 1 describes the skeleton of the existing in- and i memory algorithm for core decomposition [20]. It is a bottom up approach as the algorithm starts the computation of the k- When main memory is sufficient to hold the entire input class from smaller values of k and moves to larger values of graph, the most costly step of the in-memory algorithm is k sorting the vertices according to their degree at each iteration The algorithm first sorts the vertices in ascending order of (Line 8).Batagelj and Zaversnik [20] use bin-sort to order the their degree(among the vertices with the Same degree, the vertices to achieve O(m n)time complexity ordering can be arbitrary ) Then, it starts to compute the d- When the graph is too large to be placed n main memory class y d, where d is the minimum vertex degree in G. It the bottom-up approach fails because it requires memory space removes from G all vertices with degree of d, together with of Q(m t n). The bottom-up approach is not suitable for all the edges incident to them, and puts these vertices into y d. designing an efficient external-memory algorithm since it starts After removing these vertices and their edges, the degree of core decomposition over the entire graph to compute the k some vertices that are previously connected with the removed class with the smallest k. Although sorting can be done in vertices decreases. If any vertices remaining in G now have O(B log 4 B)1Os(assuming that we store separately the 53 vertices and their degree on consecutive disk blocks ), it is non- Algorithm 2 PartitionGraph trivial to address the problem of random accesses to vertices Input: G=(V, E), memory size m, disk block size B and edges on disk during the computation Output: A partition of disjoint vertex sets, =U1,.,U1 In this paper, we propose a top-down approach, which first computes the k-class for larger k and then moves down create an empty partition w to smaller values of k. The top-down approach is more roLET 2l; // the part of that are currently in memory 3. for each vertex, vEV. do suitable for designing an external-memory algorithm. An if(e; is not connected to a vertex in any vertex set in mem) other advantage of the top-down approach over the bottom-5 create a new set U= u and add U to mem; up approach is that most applications and algorithm designs 6 using k-cores only need the k-cores with larger values of 7 find the vertex set e mem lo which k, and sometimes, only the kimar-core(e. g, for defining the u has the most connections nucleus of a communication network [4], for predicting protein 8 add u to u f(∑nerd()=B)∥ memory used for U reaches B functions [8 for approximating the densest subgraph [161, 10 Write block(u) etc. ). The top-down approach can obtain the k-cores with 11. if(LI U∈arem larger k without computing those with Smaller k. This can 12 let Umam be the vertex set in mem s. result in huge savings of computational resources for handling 13 vr e mem,∑ d()≥∑eol(v); massive networks because the smaller the value of k, the Writeblock(umas) larger is the graph from which the k-core is computed IV OVERALL FRAMEWORK Procedure 3 WriteBlock(o) In this section, we first give the overall framework of our 1. let H be the subgraph of G that consists of all edges external-memory algorithm, called EMcore, for core decom connected to the vertices in U 2. estimate l(u) for each vertex v in H position, and then discuss the details of each part in the next 3. write h to the disk and remove U from eu section 1)Graph partitioning The first step of EMcore is to partition the vertex set Y. EXTERNAL-MEMORY CORE DECOMPOSITION V of G, so that core decomposition by EMcore in later In this section, we discuss in detail the three parts in the steps can be processed by loading only the correspond- overall framework, followed by a detailed analysis of the ing subgraphs of the relevant subsets of vertices (instead algorithm of the whole graph) in main memory. We describe an efficient partitioning strategy that uses only one scan of A. Graph Partitioning The purpose of graph partitioning is to decompose the orig inal large graph into a set of smaller subgraphs, so that core 2) Estimation of the upper bound on the core number: decomposition can be processed by loading only those relevant We develop heuristics to estimate the upper bound on subgraphs into main memory. There are many existing graph the core number of the vertices in each subset in the partitioning algorithms [22],[23], but they are in-memon partition. The upper bound is refined progressively at algorithms. For core decomposition in massive networks that each step of core decomposition cannot fit into main menory, we need an efficient algorithm that partitions a large graph with limited memory consumption Recursive top-down core decomposition based on the To this end, we devise a simple graph partitioning algorithm upper-bound core number that requires only one scan of the original graph and has linear The main step of EMcore is a recursive procedur CPU time complexity of core decomposition based on the upper bound on The outline of the partitioning algorithm is given in Algo the core number of the vertices, EMcore recursively rithm 2. The algorithm reads the input graph g from external computes the k-classes on a set of vertices whose core memory once and partitions v into a set of disjoint vertex sets number falls within a range such that the corresponding for all i+ 3 U}, where∪1<tUz= v and L∩U=0 relevant subgraph can be kept in main memory. Our agorithm is top-down, meaning that we compute the k For each vertex u, the algorithIm processes v as follows classes with ranges of larger core numbers before those There are two cases. The first case (Lines 4-5 of Algorithm of smaller core numbers 2)is that v is not connected to any vertex in any vertex set in Llmem (the part of that are currently in main memory). In At the end of each recursive step, we remove from G this case, we simply create a new vertex set u with a single the k-classes already computed, as well as their incident vertex v and add u to nneT edges, to reduce the search space and disk io cost for The second case (Lines 6-10 of Algorithm 2)is that v is the next recursion of core decomposition connected to some vertex(es)in some existing vertex sel(s)in men. The algorithm finds the vertex set u E mem with in X. That is, mar (X)=mar lu):wE Xl,where which o has the most connections; that is,wU′∈mem,X≠0. b(0)n> nb(u)nU. This vertex set U can be found The following lemma describes how to refine l(u) in O(d(u)) expected time using a hashtable. Note that we do not keep all vertices in G in the hashtable, but only those in LEMMA 2: Given a vertex v E V. where the current a vertex set in mem. After finding U, we add v to U. If the y(a)>0. let Z contain all neighbors of v with v values memory used for U is as large as the block size B, we write lower than w (u), that is, z=u: u E nb(o), Q(u)<v(u) u to disk by procedure 3 If (d(u)-zd< (u), then y (u) can be refined as After processing each vertex v, we also check (Lines 11-13 mar d(v)-, 2 ma (Z)1 of Algorithm 2) if the memory used by the current partition reaches the available memory space assigned. When this is Proof: Based on the definition of al,no4: (Z), we know the case, we reclaim the memory by writing the vertex set that all vertices in z can only exist in the k-cores for currently with the largest memory usage to disk k< vmaz (z), which means that the edges connecting v and The writing of a vertex set u to the disk as shown in its neighbors in Z cannot exist in the (a/mar(Z)+1)-core Procedure 3 proceeds as follows. When we read the vertices Therefore, if v exists in the (af mas(Z)+1)-core,the degree in U from the disk( during the scanning of G), we read their of v is at most (d(v)-1Zl) edges as well. Let h be the corresponding subgraph; that is, If(d(0)-IZD>1mo(z), we have v(u)s(d(u)-1Z1) H=(VH, EH), where VH=CUlv:ueU,(a, v)E Ej and because it is possible that in the(d(u)-1ZD)-core, v has a EH=i(a,U): uEU,u, v)EEJ. We write H to disk and, degree of (d(u)-1Z1) Otherwise, we have(d(v at the same time, release the memory occupied by h(also 0) rns Z), which means that v cannot exist in the (umar (2)+ We delay the discussion of the estimation of the upper-bound 1)-core, and thus we have y(u)< core number, y/(v), of each v in H to Section V-B a (a)s mac d(o)-z, imar (z) Before we move on to present our main algorithm for core Finally, for the current value of u(u), we have(d(u decomposition, we ask whether we can devise a divide-and-))<l(o)(given)and i,nar(Z)< l(o)(by the definition conquer solution that computes core decomposition on each of Z). Therefore, macd()-Zl, alma( z)) is a tighter subgraph obtained by the partition, and then merges the results value of vi(u), which completes the proof to find the k-cores in the original graph we note that the divide-and-conquer method does not work because at each Lemma 2 can be applied iteratively to replace the current conquer phase, this approach still needs to perform search in value of v(v) whenever a tighter value is possible for o() the merged problem space to determine the core number of However, simply applying Lemma 2 might not be good enough each vertex, and hence it demands as much memory as does since it is likely that mar(2)>(d(u)-1ZD especially if an in-memorv algorithm Z is not small. As a result, we often choose vmca (z)to refine y(a), which weakens the effectiveness of Lemma 2 in B. Upper bound on Core Number obtaining a good estimation of vb(u Our top-down core decomposition algorithm requires the refinement based on the following corollary of Lemma ner To get a better estimation of v we app selection of a relevant set of vertices based on the upper bound of their core number. In this subsection, we develop heuristics COrollary 1: Given a vertex vE v, where the current to estimate this upper bound v(a)>o, let Z=u: uE nb(o), yb(a <y(uh e use w(u to denote the upper bound on v (u)of a vertex 2. The following lemma gives a coarse estimation on vp(u1) (a)can be refined as maid(a)-1e7e For any Z'cZ,Z+0, if (d(u) ) y(u),then ,0ma(Z)} LEMMA I: y(u)=d(u) Proof: Similar to proof of Lemma 2 Proof: It is trivial to show that vl(u)=d(u)because (v)≤ad(v). Based on Corollary l, we can select the subset z of Z that attains the tightest estimation of v(u), rather than restrictin Lemma 1 serves as an initialization of o (o) for each 0. to the whole set Z for the estimation After the initialization, we can further refine b(v). The basic Procedure 4 describes our algorithm to estimate (u. Lines idea is to refine y(u)based on the y values of us neighbors. 1-2 first initialize v(u) to be d(u) by Lemma 1. Then,a Intuitively, v's neighbors that have v values lower than v (u) refinement is applied to obtain a tighter bound on the core definitely have the true core number lower than y(u). These number of each vertex in V(Lines 3-7), as explained in neighbors cannot contribute to the degree of v in the q(u)-Corollary 1 ore and thus if v has many of such neighbors, w(u) can be Note that Line 5 of Procedure 4 uses the condition " d(u) further tightened 2|<y(v)”, while Corollary I uses“d(v)-|2<v()” We first define a notation mor (X) for a non-empty vertex An incorrect value of yb()may be assigned if Lines 6-7 of set X to denote the naximum w value among all the vertices Procedure 4 chooses some Z with (d(u)-zd2v(v) 55 Procedure 4 Estimate-ab Example 3: Assume that the vertex set of the graph in for each 2∈vdo Figure 1 is partitioned into three sets: U1=fa,b,c,d,e, f, 2. initialize al(o) as d(v) U2=9, h,i,j), and U3=e, k, m, n. Figure 2 shows the 3. for each∈Vdo three corresponding lbgraphs'. The first number next to each 4. let Z=u: Enb(u), yl(u)<( 0); vertex v is the initial y (u)=d(u), while .->y indicates that 5.if(d(v)-|2<v(v) the al value is tightened from to y by Procedure 4 6 let f(X)-mard(o)-X, vma(X); 7. v(0)+minf(z):z′sE,Z≠0} 8. repeat steps 3-7 ③@Q人 However, this Z cannot be selected to refine v(u), since Line 7 selects the subset of z with the lowest f value and Z with(d(v)-2D)<v (u) is obviously a better refinement (a)U/={a,b,c,d,e,f}(b)U2={g,h,,j} (c)U/y={k,,m,n} Therefore, Lines 5-7 of Procedure 4 refines v(u) as Slated in Fig. 2. Vertex Sct Partition and Thcir Subgraphs Corollary 1 and the correctness of the refinement of /(v) is ensured Let us consider vertex b in Figure 2(a). We have Z=ta,ch In each refinement process as described in Lines 3-7, we since v(a)=v(c)=1<v(b)=3 at this moment. It is easy always refine the vertices with smaller w values before those to see minf(2):Z'cZ, Z't0= l in this case and with higher ones. This strategy allows the refinement to tighten thus we obtain (b)=1. Similarly for vertex d, we compute he bounds more quickly since only the neighbors with low Z=6 since now v(b)=1<u(d)=3.Thus, we obtain values are useful in tightening the bounds (by Corollary v(d)=2. Note that we cannot use g to compute (d )since we 1). The smaller y/; values refined are thus particularly effective do not have the degree or al information of g at the current in tightening the y values for other vertices in the remaining stage, which will become available in a later stage of core decomposition To tighten the value of v(o), we do not need to compute f(2) for all subsets of z in Line 7 of Procedure 4. The C. Recursive Top-Down Core decomposition definition of f(2) in Line 6 of Procedure 4 actually implies a dilemma: the larger the size of Z the smaller is the value We now discuss our algorithm eMcore, as outlined in algo of (d(u)- z'd but the larger may be the value of y, mar(Z) rithm 5. EMcore first invokes Partition Graph(i.e, Algorithm while a smaller Z implies the opposite. Moreover, for subsets 2)to partition G. Then, it invokes Procedure 6, EmCoreStep to compute the k-classes with k in the range [kl, kul,where of the same size, clearly those vertices in Z that have a smaller kl s hiu, recursively from higher value ranges to lower value a, help more in obtaining a smaller y,mar (z)than other ones vertices. Therefore, we can sort the vertices in Z in ascending Zl, where the size-i Z/simply contains the first i vertices in range [kil, hiu as follows. We first detine ar etermine each order of their y values. Then, we select Z of size from 1 to In order to bound the memory usage, we determine each the ordered Z, until we find that f(z)reaches its minimum V, kl < w(u)s hia: that is, yk is the set of vertices whose In this way, we process at most Z number of subsets of z y values fall within the range kl, k At each recursive step, instead of 2 Z subsets we construct a subgraph that is relevant for computing the We apply Procedure 4 to estimate l(u) for each v in H in real core number of the vertices in y k in main memory. Let Line 2 of Procedure 3. Since H is only a small subgraph of G, b=r. Then, b is the maximum number of blocks that can be the refinement process(lines 3-7 of Procedure 4)converges held in main memory. We use b and a given hu to determine quickly. To handle the worst-case scenario, in our algorithm, kl as follows we set a limit on the number of iterations for the refinement Let K be a set of k values such that the vertices in y k process. We bound the cost of the refinement by the cost of are scattered in at most b vertex sets in l. That is a disk i/o. That is, we terminate the refinement process once the accumulated time for refinement reaches the time for a disk I/O. Therefore. the total cost of our algorithm is always K={k:1<k≤k,|{U:U∈2,U∩亚k“≠0≤ bounded by the number of disk 1/Os We then define In Procedure 3, the subgraph h is associated with a vertex set U in partition l. However, there may be some vertices in min{:k∈K},ifK≠0, I that are not inu but are connected to some vertices in l k (1) otherwise Since we may not have the degree of these vertices to initialize their v values, we consider their a/ values to be oo when we We use an adjacency list to store nb(u)of a vertex v In each subgraph, we refine v(v)for v E U. The following example illustrates how Thus an edge between two vertices in the same U: is counted twice and the we refine y by Procedure 4. three subgraphs have 14, 15, and 15 edges, respectively 56 Algorithm 5 EMcore Procedure 7 Computecore(gw, kl, ku) Input: G=(V, E), memory size M, disk block size B 1. initialize (u) to be-1 for each vE w: Output The k-class. yk, of g for all k k←k 3. while (h hu) 1. invoke Partitiongraph 4. while (vE W,(d(u, Gw)+ deposit(u, hv))<k) compute h by setting hu oo and b emove v and all edges incident to v from gw 3. invoke Em Core Step(kl, kiu) 6. update w(u) to be k for each remaining vertex v in Gw; 7 k←-k+1; 8. for each k, where k<k< ku, do ∈W,()=k} Procedure 6 Em Core Step(hl, hin) l0. output业k 1.W<亚 2. let H be the set of subgraphs of the vertex sets in chat conlain any vertex in w 3. read H into main memory to construct shown in Procedure 7, to compute the ke-classes y k from Gw the induced subgraph Gw by n where ki <ksku. However, an immediate question raised 4. invoke Compute Core(g w, key, hu) here is: since Gw is the induced subgraph, is the core number 5. refine v(u) for any vertex v currently in main memory; 6. deposit any vertex connected to a vertex in业k(kt≤k≤ka) computed from Gw correct? We will prove the correctness of 7. remove all vertices in亚k(kt≤k≤hu) and their edges our algorithm in Theorem 1. which will also become clearer from the subgraphs in H, and merge any of as we continue to unwrap the details of our algorithm these subgrap if their combined size is less than B One key concept to ensure the correctness of our algorithm 8.if(k>2) is the concept of deposit assigned to each vertex ( line 6 of 9. compute kl in I k by setting ku =(k!-1) Procedure 6). We define the deposit of a vertex as follows. 10. write the subgraphs in H that consist of no vertex in to the disk, unless all the subgraphs can now be kept Definition 2(DEPOSIT ) Given a vertex v E V and an inte In main memorv ger hu, where ha >v(u), the deposit of v with respect to h 11. invoke EmCore Step(ki, ku is defined as deposit(u, kiu)=Ru: u E nb(u).(u)>ku 12. else 13. add all the remaining vertices to 1 and output 41: The concept of deposit is used to ensure that the number of dge connections to a vertex v in the k-core, where kl<k< ku, can be correctly counted even if the vertices and edges that belong to all k -core are removed. where k'>> k. After we Intuitively, ht= min(k h e k means that we load compute the k-classes, where k>kl, we can deposit a"coin as Inany parts the graph as possible into main memory to each neighbor of a vertex in the k-classes if the neighbor's (bounded by the memory size n for core decomposition at core number has not been determined (i. e, the neighbors y each recursive step. However, in the rare case when K-o is less than k and its core number will be computed in a later (i. e, the vertices in y ku are in more than b vertex sets in u ), recursive step). This"coin" accounts for an edge connection we simply assign kt= ku and load b vertex sets in U: Ue to this neighbor vertex in the computation of its core number u,Unykt0 each time into main memory to construct Using the deposit, we can safely remove all the vertices the subgraph for core decomposition and their edges after the core number of these vertices in the To obtain hl, we keep a heap in main memory to maintain current recursive step is computed(Line 7), without worrying the highest in each vertex setU E l. The highest v among about the correctness of the core decomposition in the later the vertices in each U E l is inserted into the heap before recursive steps. In this way, we can save considerable disk i/o U is written to disk during graph partitioning. Given b and costs for reading and writing these large-core-number vertices kiu, we can easily determine the corresponding value of h by and their edges again and again in computing the he -classes querying the heap with a smaller core num ber We now discuss how EmCoreStep computes the k-classes, Procedure 7 computes the correct core number of th where kl< k< ka, as outlined in Procedure 6 vertices in the induced subgraph gw as follows. The algorithm Let w=y k. In other words, W is the set of candidate iteratively deletes the lowest degree vertex and its edges as vertices whose exact core number is to be computed at this in an in-memory algorithm. However, in selecting the lowest recursive step. The first step is to construct the induced degree vertex, the algorithm considers the sum of the local subgraph Gw of g by W. We construct Gw from the set degree in Gw and the deposit of the vertex. We remark that of subgraphs of the vertex sets in u that contain a vertex in the initial value of the deposit of all vertices with respect to W, since all vertices in W and their inter-connection edges hu oo is set to 0. We prove the correctness of Procedure 7 are contained in these subgraphs. During the construction, we in Lemma 3 can discard those vertices that are not in w, together with After we compute yk, where hI <k< ku, we return to their edges Lines 5-6 of Procedure 6. We need to refine the upper bound After Gw is constructed, we invoke Compute Core, as on the core number of remaining vertices as well as updating 57 their deposit. We first refine v(u) for any vertex v that is currently in main memory but does not have its core number computed. The refinement of ()involves two steps: (1)We first set v(u to be(kt-1) if currently vl(u)> kl.This is because if v(u)>kl, then u must be processed in the current recursive step. However, since us core number is not determined, v is not in any y k, where hi h<hu, which a)W{g,h,讠k4,n} (b)w-d,e,fg, mi implies that l(u)<(kl-1).(2)We then refine all v Fig 3. Induced Subgraph Gn Procedure 4. since now we have the exact core number of more vertices computed which can be used to tighten the v value of the remaining vertices After computing the 3-class, we can refine v(g from 3 to After the refinement, we have l(o)<kl s ap (a) for all 2. We also update the deposit of any vertex that i lected v whose core number has not been determined and any u in to a vertex in the 3-class for k n=(bi-1)=2, as shown in Y k, where h> hil. Then, Line 6 updates the deposit of a the following table vertex v with respect to ha=(hl-1). The deposit (u, hl-1) can be calculated as (deposit(u, ku)+ Ha: u E nb(u),u E TABLEⅡ deposit(v,2) WHERE U∈ ,m} ∪k≤k<k亚k}), where the first part is the number of edge connections to u that were removed at previous recursive steps m while the second part is the number of edge connections to u) deposit(v, 2) that are to be removed at the current recursive step Therefore deposit(u, k-1) ensures the correct computation of v (u)at Now we can remove all the edges connected to any vertex later recursive steps in h, i, k, l,n. After removing the edges, we only have 1 After the update of the deposit, we can safely remove all edge(d g)and 3 isolated vertices e, f and in Figure 2(b), Ⅴ ertices in yy, where kil≤k≤ku, and their edges from the and only 2 isolated vertices g and m in Figure 2(c) Next we compute for k,M=2 and we load the subgraph in subgraphs in H (Line 7 of Procedure 6). Then, we merge any Figure 2(a)(and also Figure 2(c)which is still in main mem- of these subgraphs if their combined size is less than B,So as to further reduce the disk I/O cost Furthermore, we keep ory)to construct the induced subgraph by W=d,c, f, 9,m) those subgraphs that will be used in the next recursive step in as shown in Figure 3(b ). By using the deposit, it is easy to main memory and only write the rest to disk, or keep all the obtain the 2-class, d, e, f,g, m, by Procedure 7 subgraphs in main memory if they are now small enough to After computing the 2-class, all the remaining vertices be all kept in main memory(Line 10 of Procedure 6) fa, b,c,j1, belong to the 1-class The recursive step goes on to process the next range of k kf, k-1,until kl=2(Lines8-11 of Procedure 6). Note that D. Analysis of Algorithm the upper bound of the range, hl, for the next recursive step In this subsection, we provide the theoretical analysis re- now changes to(hl-1). This is because, after the refinement garding the correctness as well as the complexity of our in Line 5 in the current recursive step, we have v(u)<(k-1) algorithm EMcore for any vertex v whose core number has not been determined When kl= 2, the remaining vertices must be all in the l-class LEMMA 3: Given W= yku and Gw, for any vEw 业1( Lines12-13). al(o)> kl, then Compute Core (i. e, Procedure 7)correctly There is still a small problem: are there any vertices in the computes v(u) 0-class? The case of k=0.i.e. the 0-class means that no Proof: If w(u)> kl, to compute y,(u)we only need the vertex in this class is connected to any other vertices. Thus, kI-core of G. Ck, In Gw, the largest core number that a vertex the o-class contains only isolated vertices in the graph, which can have is ku. Thus, in computing v(u)using Gw instead can be obtained trivially by one scan of the graph. To do this, of Ck,i, we are missing the edge set (u, u): vEW, v(u)> we add a check in Algorithm 2: as we scan the input graph, ku,(u, U)E EJ, where kl < p(u)<ky if a vertex has a degree of 0, we output it as a O-class vertex. Since our algorithm follows a top-down approach, Vv E w The following example explains how the top-down core if 3u E nb(o), vb(u)>kia, then v(u) must have already been decomposition is processed computed in some previous recursive step and deposited to deposit(v, hu ) Thus, we have(d(v, Gw)+ deposit (v, ku)) cxample 4: Let us consider the three subgraphs given in d(u, Cki) for all vE W. Since Compute Core removes a vertex Figure 2 and suppose 6=2. Then, we obtain k!=3 and n a only if (d(, Gw)+ deposit (v. kia))<k, or equivalently, 1g, h, i, h, l, n. We load the subgraphs in Figures 2(b) d(u, Ck,)<k, it correctly computes v() for all v E w and 2( )into main memory to construct the induced subgraph v(c)>kl by w, Gw, which is shown in Figure 3(a), where the number next to each vertex v is the current v(o). It is then easy to Using Lemma 3, we prove the correctness of our main obtain the 3-class, h, i, ki, I, 7. by Procedure 7 algorithm emcore 58 THEOREM 1: Given a graph G, EMcore (i.e, Algorithm 5) The worst-case complexity is given in terms of hmar. In the correctly computes the k-class of G for all k, where < k< following, we give an estimation on the value of kmar kimar and kmar is the maximum core number of any vertex We focus our analysis on real-world networks whose degree distribution follows a power law. We note that similar analysis Proof: Procedure 6 recursively computes a range of k- can also be applied for graphs with other degree distributions be the ranges of k of two sule. 9 However, extensive studies [],[251,[,[271,[281, 1291, classes with k e kn, ku] each time, from ku =o to ki Let cursive [30 have shown that graphs with power-law degree distri recursive steps. Procedure 6 assigns ka= hl butions are prevalent in real-world applications, such as the since lemma 3 proves the correctness of each recursion, the hi-i, u, wwW, many social networks, neural networks, genome and recursive procedure correctly computes all the h-classes for protein networks. More importantly, most real-world networks k 22. After Procedure 6 computes the 2-class, the remaining that are too large to be kept in main memory follow a power vertices must belong to the l-class, since the 0-class is already law degree distribution [27 ,[3 outputted by a simple checking on the degree of each vertex Faloutsos et al. [26] give the following power law degree being read as we scan g in Algorithm 2, by the fact that a distribution for real-world networks vertex u is in the 0-class iff d()=0 d(u) R We now analyze the complexity of EMcore. For external- memory algorithms, the complexity analysis is mainly based where r(m) is the degree rank of a vertex v, i. e, 21 is the on the 1/O model, which focuses on the importance of disk (r(a))-th highest-degree vertex in G, and R<0 is a constant 1Os over CPU computation. We remark that our algorithm called the rank exponent is not CPU intensive, it takes O(kmaz(m +n)) CPU time, Given a graph g with the above vertex degree distributio since graph partitioning takes linear time and each recursion we now try to obtain the maximum core number kimar that G of ke-class computation takes at most O(m n)time. Note can have kmar is essentially the maximum value of k such that the refinement of the y value follows a similar strategy that the top-(R+1) highest-degree vertices in G are pairwise as the in-memory core decomposition algorithm and thus its connected (i.e of degree at least k). Therefore, by substituting time complexity is also bounded by that of the in-memory r(u) by(kaz +1)in Equation(2)and having d(a) of at least algorithm, i. e, O(m +n) time. The amortized CPu time kmar, we have should be close to o(m+n)since each recursion operates on an induced subgraph relevant for the range kl, ku only, which is verified by our experimental results as compared with mR(naz +1)≥kma (3) the in-memory algorithm. Thus, disk yo time dominates the overall complexity An approximation for the solution of the inequality stated in Equation(3)by putting(Fima +1)akmar gives the following THEOREM 2: Given a graph G, EMcore computes the k- upper bound on h classes of G using O(kman (mm tn))1/Os and ((+m)disk block space in the worst case hmac Proof: The graph partitioning process scans the graph only once and the total size of the subgraphs written to The value of r measured in [26] for three snapshots of the disk is the same as the size of the original graph. Thus. internet graph is between -0 8 and.7. For a graph of 1 AlgorithIn 2 uses O(t I/Os. Each subsequent recursive million vertices. we have kmgr 296 when R=-07 Note tep reads/writes only those subgraphs relevant for computing that this is the worst case. Some existing studies on large the k-classes that fall into the range k e ka. Thus, the scale networks [61[3], [5] show that kimas: is less than 50 number of disk I/Os at each recursive step is bounded by approximately. Our experimental results also show that hima O(Er).Thus, the total number of disk I/Os is O(kma (mtn)) is small and the maximum is 372 among the datasets used in the worst case, since there are at most kima.z recursive steps (see Table III). But suppose that imaa is close to its upper The algorithm uses O(r disk blocks since the total size bound given by Equation (4)in the worst case, we have the of the subgraphs written to disk is O(m +n) following analysis on our algorith Let s be the subgraph of G such that one or both end We further remark that the actual 1/O cost of the algorithm vertices of every edge in S are in the kmaz-class; that is, S can be significantly smaller for the following reasons: (1)each (Vs, Es), where Vs= Wkmar t o: uE Vkma,(u,U)EEI recursive step reads only the relevant subgraphs from the disk, and Es=(u, u):(uU)CE, u C Hkma. By Equation(2), which takes O(b)=0(F)instead of 0(23)disk I/Os; (2) we have the following lower bound for S we cache some subgraphs in main memory for reuse in the next recursive step; and (3)the size of each subgraph reduces kmas+1 after each recursive step In our experiments we show that the nar(nna z actual i/o is close to m+n instead of kimar(m+n) S2∑() The equality holds when there are only(kmar +1) vertices all experiments on a machine with an Intel Xeon 2.67GHz in the kimar-core. The first component in the right-hand side of CPU and 6GB RAM, running CentOS 5. 4(Linux) Equation(5)is the sum of degrees of the(kmax +1) vertices Since they are pairwise connected in this case, their inter-edges Datasets. We use the following four datasets in our exper are double counted and hence deducting the number of these iments: road-network-California(RNCA), LiveJournal(LD) edges from the degree sum gives the lower bound of S Billion Triple Challenge(btc), and Web. The RNCA dataset By Equation(2), we also obtain the size of the entire graph a road network of California, where vertices represent interse G, which is half of the degree sum of all vertices in V. Since tions and endpoints, and edges represent the roads connectin g all edges are counted twice in this case, the degree sum needs these intersections or road endpoints. The L) dataset is the free to be halved on-linecommunitycalledLivejournal(www.livejournal.com) where vertices are members and edges represent friend- ship between members. The RNCA and l datasets can be R (6)downloaded from the Stanford Network Analysis Platform (http:/snap.stanfordedu/datal/index.html).Thebtcda a semantic graph converted from the billion Triple challenge Inourtop-downapproachaftercomputingthekmaz-core,2009Rdfdataset(http://vmlion25.deri.ie/,whereeachnod we can remove S from G. According to Equation (5)and represents an object such as a person, a document, and an Equation(6), for a network with R=-0.7 and 1 million event, and each edge represents the relationship between two vertices,S is at least 12% of the entire network, which is a nodes such as "has-author""links-to'" and " has-title"The huge reduction on the search space and disk t/o cost Web graph is obtained from the YAHOO webspam dataset hThevalueofkmazisingeneralsmallforlargescalereal-(http://barcelona.researchyahoonet/webspam),wherevertices world networks, in which case emcore has fewer number of are pages and edges are hyperlinks. We give the number of recursive steps. On the other hand, if kaz is larger, we have vertices and edges, as well as the storage size in the disk, of greater reduction on the search space and 1O cost. Thus, in the datasets in Table IIl both cases, the analysis shows the effectiveness of our top down approach for core decomposition TABLE III DATASETS (M=10, G=10) E. Algorithm Optimization RICA ⅠJ BTC Web There is one optimization that may considerably improve 2.0M 4.8 1485M529M the performance of the algorithm. Before we write a subgraph m=E55M690M673.3M1.65G h to the disk in Procedure 3. we can further trim h to reduce Disk size(byte)67.4M 809 1M 8.7G 17.2G 372 both the search space and the disk io cost by pre-computing a partial l-class. Such l-class vertices in H can be found by iteratively deleting vertices with degree of I and the edges incident to them. Note that the l-class so computed may Experiment Settings. We test our algorithm for two scenar not be complete, since some inter-edges may exist between los: (1) when main memory can hold the entire input graph, different subgraphs in the partition and the computation here and (2) when main memory cannot hold the entire input graph only considers each local subgraph H We use Case(1)to demonstrate that our external-memory Since vertices with a smaller core number, together with algorithm does not process a large graph at the expense of their edges, cannot be in k-cores with a greater core number, significantly increased CPu time complexity. We use Case we can safely remove all these vertices and their edges from (2)to verify that our algorithm computes core decomposition H. If there are many such vertices and edges in H, we can efficiently when the existing in-memory algorithm becomes continue to keep H in main memory and put into H more infeasible vertices and edges from G. We write h to the disk until its size reaches b A. When Main memory is Sufficient We remark that if after removing the partial l-class vertices We first assess the performance of our algorithm and their edges, the graph becomes small enough to be main memory is sufficient to hold the entire graph, by mainly kept in main memory, then our algorithm simply performs comparing the running time with the in-memory algorithm core decomposition in main memory, as does an in-memory We use two medium size networks. RNCa and L/ algorithm Table iv reports the partitioning time, the recursive core decomposition time(and the time to compute the k.r-class VI EXPERIMENTAL EVALUATION y r), the total running time, and the memory consumption of In this section, we evaluate the performance of our al- EMcore and IMcore. All time taken is the elapsed wall-clock gorithm EMcore, comparing it with the state-of-the-art in- time. The total running time of EMcore is comparable to that memory core decomposition algorithm by Batagelj and Za- of IMcore For EMcore, the partitioning time occupies a large versnik [201, denoted as IMcore in our experiments. We portion of the total running time. Thus, the actual time taken implement our algorithm in C++(as does IMcore). We ran by the recursive core decomposition procedure(Procedure 6) 60

试读 12P k-core decomposition
立即下载 低至0.43元/次 身份认证VIP会员低至7折
    k-core decomposition 49积分/C币 立即下载
    k-core decomposition第1页
    k-core decomposition第2页
    k-core decomposition第3页
    k-core decomposition第4页


    49积分/C币 立即下载 >