下载  >  人工智能  >  机器学习  > Community detection in complex networks using extended compact genetic algorithm

Community detection in complex networks using extended compact genetic algorithm 评分

研究重叠社区发现具有重要的意义. 随着对社区发现研究不断深入, 现实生活中, 节点往往会同时隶属多个社团, 如语义网中, 单词会因其具有的多个词义或词性而同时属于多个社区; 在蛋白质网络中, 一个氨基酸往往也具有不同的生物功能[7]; 在科学研究合作网中, 科学家或研究员具有研究不同领域的能力.因此, 网络中的社区结构的划分可能有重叠节点的存在. 图1-3所示网络中节点6属于两个社区共同节点, 属于重叠节点. 以前, 社区发现工作者很少考虑到重叠节点, 简单地将网络所有的节点划分成不相交的社区结构, 必然导致划分社区结构的结果不够精确. 相反, 考虑网络社区结构的重叠性,不仅增加对社区结构稳定
Community detection in complex networks 927 Moreover genetic algorithms are also used to optimize Modularity @(Tasgin et al. 2007), where different chro mosomes represent different partitions, and modularity can be used as the fitness function. With appropria choices of the parameters of genetic algorithms, good results on the gn benchmark networks (Girvan and Newman 2002) can be obtained. As an interesting representative GA-Net(Pizzuti 2008) adopts a special representation of chromosome, and a special fitness function called com- munity score, to detect community structure in complex networks. Generally speaking, the encoding of chromo- somes of genetic algorithms for community detection mainly include two ways: string-of-group encoding(Tasgin Fig. 1 Graph description of a complex network with three commu et al. 2007; He et al. 2009 Li et al. 2010) and locus-based nities adjacency representation(Pizzuti 2008, 2009) Although genetic algorithms have achieved effective results in community detection, there still exist some when designed reasonably (Tan et aL. 2008: Singh and problems, such as how to solve more effectively Baghel 2009). The key factors of applying genetic algo complex nonlinear optimization problem, and how to deal rithms to community detection include the encoding way of with the serious epistasis in the representation of chromo- chromosomes the choice of fitness functions and the some. This paper attempts to apply extended compact design of genetic operators. In the next two subsections, we cnetic algorithm(ECGa)to find communities in complex choose two representatives but completely different genetic networks, because ECGA is announced to have ad vantages algorithms to show the modeling community to tackle these complex problems (Sastry and Goldberg detection 2000). This paper is organized as follows. Problem description 2.2 GATHB algorithm on community detection and two related genetic algorithms are reviewed in Sect. 2. ECGa and community detection Tasgin et aL. (2007) apply genetic algorithms to find based on ECGA are given in Sect. 3. Experiments are pre communities in complex networks. We express briefly their sented in Sect. 4, and the last section concludes this paper method as GatHB. Given an undirected graph G(v, e) with n vertices v1, V2.., Vn, where V=n, the chromo some &, of GaTHB can be represented by a vector with n 2 Problem description and related work elements, gr=[8, 802, ..,gml, and gti E 1, 2,.. ., n) and vi belong to the same community if and only if 2.1 Graph representation of community detection &ti=&ti. A population is expressed as P=81, 82,,gpI In complex networks where the population size is p GATHB chooses modularity @, proposed in Newman A complex network can be conveniently described as a and Girvan(2004), as its fitness function. Modularity 2 can graph, which includes vertices and edges. Detecting com be written as munities in a complex network can be viewed as finding some groups of vertices from the graph corresponding to Q (4、 2m (C,C) the network, such that there are more edges between any two vertices in the same group than two vertices from where the sum runs over all pairs of vertices. Aii is the different groups, respectively. The problem is shown in adjacency matrix of graph, m the total number of edges of Fig. 1, where a group of vertices in each dotted circle form a community. Thus, there exist three communities in graph, and hi the degree of vertex i. Vertex i belongs to community Ci, and vertex j belongs to community Ci. The Fig. 1, corresponding to three dotted circles. respectivel d-function outputs one if vertices i and j belong to the same Community detection in complex networks can b community (i.e, Ci= Ci), otherwise zero. Modularity Q modeled as an optimiz ation problem via building an can also be rewritten as objective function, such as maximizing modularity QDuch and Arenas 2005). However, the optimization problem is NP-hard (Brian Karrer et al. 2008). Genetic 2 algorithms are often effective to solve NP-hard problems pringer 928 J. Li, Y Song where nc is the total number of communities, li the total The volume vs of a sub-matrix s=(, is the total number of edges in community i, and d; the sum of the number of 1 among entries Ai where i E I and jE J. Thus, degrees of all vertices of community i. The problem of we have vs i∈lj∈/Aj community detection can be formulated as the problem of Given a sub-matrix S=(1, ) the power mean of s of maximizing nodularity Q order r, M(S, r), is defined as The steps of GATHB are described in Algorithm I M(S, r) ∑(au) Algorithm I GATHB algorithm Step 1. Initialize population. For each chromosome, each vertex belongs to a different cluster ommunity ) Choose randomly a vertex and assign its cluster label to all of its neighbors. The The score of s is defined as Q(S, r)=M(S, r)vs, which process is repeated an times for every chromosome can be used to detect maximal and dense sub-matrices The Step 2. Evaluate fitness. Compute the fitness values of every chromosome in current population sing modularity o community score of a partition S1, S2,..., Sk of A is also Step 3. Rank the chromosomes with a descending order of fitness values and take the top p defined as chromosomes Step 4. Save the top Bp chromosomes from the top p. Step 5. Perform crossover operation. Pair chromosomes sequentially two by two. For each pair CS(r) @(S mes, call one of them the source chromosome gsrc, and the other the destination chromosome gdcs. Create one random number i from (1, 2, n), and then assign the values of the genes which have the same value with the ith gene in the source chromosome to the corresponding Thus, the problem of community detection from a genes in the destination chromosome, i.e., g des=gse. kek lg r=g3n).Then,this complex network can be formulated as the problem of Step 6. Perform mutation operation. Select one chromosome randomly, and create two random maximizing CS(r). The steps of the GA-Net algorithm are numbers i and/ from ( 1, 2, .., Then the value of the ith gene is assigned to the / th gene, i.e,g given in Algorithm 2 Step 7. Combine newly obtained p chromosomes and previously saved Bp chromosom Step 8. Repeat from Step 2 to Step 7 until the stopping criterion is satisfie Algorithm 2 GA-Net algorithm to be checked whether o: not it is effective. The gene at site i with allele value j is saved if an edge In Algorithm 1, a, B, n, and s are parameters of GATHB exists betwen two genes at site i and site j; otherwise, j is replaced by any one of the neighbors of Igorithm, and all of them belong to the Step 4. Perform crossover operation, Given two parents, a random binary vector is tirstly 2. 3 GA-Net created. Then uniform crossever selects the genes from the firs. paren: ii their corresponding sites are I in the binary vector, otherwise from the second parent. Step 5. Perfollll nuta iun operatioN. The mutation operator randomly chooses a value froml uhe GA-Net(Pizzuti 2008) is also designed to detect commu neighbors of gene i as the valle of the ith gene ity structure in complex networks. Its chromosome adopts Step 6. Repeat from Step 2 to Step 5 until the terminal criterion is satistied a special representation, adjacency representation, and its fitness function is called community score, which is also different from that of gathB GA-Net algorithm uses the concept of the locus-based 3 Applying ECGa to community detection adjacency representation, which is proposed in Park and Song(1989), and is based on a graph. an individual of the 3. 1 ECGa population consists of n genes, 81:82,.,gn, and the value of each gene is from the set [1, 2,...,n. Thus, vertices of Extended compact genetic algorithm (ECGiA), as the the graph G(V, E)modeling a complex network are rep- extended version of compact genetic algorithms(CGa) resented as genes and alleles. value j assigned to the ith (Harik et al. 1998), is proposed in Harik (1999). ECGA s gene is explained as a link between two vertices i and j in also a kind of estimation of distribution algorithm (EDA, V, and the vertices i and j are also considered as belonging (Pelikan et al. 1998). Given an optimization problem, EDA to the same cluster (or community) first builds a population to express a group of candidate Let S=(I, n) be sub-matrix of the adjacency matrix A, solutions, and then constructs a probability model to J is a subset of the columns set y=(1, J2,., n)of d nd where I is a subset of the rows setX=11, 1,,. I,ofA, and describe the probability distribution of these candidate solutions by means of statistical learning. The new popu- Let aiy denotes the mean value of the ith row of S, and lation is created by sampling the individuals randomly ali the mean value of the jth column of S. air and ar can be from the current population according to their probabilit more formally expressed as distribution instead of using traditional crossover and mutation operations in genetic algorithms (GA). The sp i ∑A and a (3) cial design makes EDa be suitable to solve some special optimization problems Springer Community detection in complex networks 929 Moreover, ECGa is designed according to the idea that choosing a good probability distribution is equivalent to linkage learning a good probability distribution can be evaluated in terms of minimum description length(MDL) and Cp represents the compressed population complexity model ECGA adopts a class of probability models known which is written as as marginal product model(mPm)to model the probability distribution of individuals. MPM is formed as a product of marginal distributions on a partition of the genes. Unlike C? ∑∑Mk( the models used in CGA(Harik et al. 1998),MPM can represent probability distributions for more than one gene where k; is the length of subset i, and Nii is the number of each time. In other words, MPM models the probability chromosomes in the current population possessing the jth distribution of subsets of genes. Genes in each subset are sequence for subset i. It should be noted that a subset of thought of as being tightly linked, and genes from different genes with length k has xk possible alphabet sequences subsets are considered to be independent where the first sequence is 11. I and the last sequence is Thus, ECGA uses MPm to divide genes into inde xx·7 pendent clusters, each of which is viewed as an indi The modeling process of MP is described in vidual" variable. At the beginning of modeling, all Algorithm 4 clusters are assumed to be independent of each other Then, two clusters that make minimum description Algorithm 4 The modeling process of MPM length (MDL) to improve most are chosen to merge. The Step 1. Assume all genes(variables) be independent with each other, and every gene forms an individual subset (cluster) merging process is iterated until the MDl is not able to Step 2. Compute the combined complexity C be further improved. Thus, an MPM model is con- Step 3. Select and merge the two subsets with the minimum combined complexity Cc'from all of m(m-1)/2 combinations with any two subsets, where m represents the number of the current structed. Furthermore, the probability distribution of all clusters in the whole population is computed, and then a Step 4 Step the iteration, if Cc Cc; otherwise, go to Step 2. table of probability distribution is obtained. According to the table, a new population is created by sampling Once the model is built and the marginal probabilities subsets (clusters) are computed, P x Pc new individuals can be generated by The original ECGa adopts binary encoding. Further- this model as described in algorithm 5 more,z-ECGA(Ossa et al. 2006)extends the encoding way to the n-ary alphabets. We also adapt the x-ECGa to Algorithm 5 Generating new individuals via MPM model community detec Step 1. Choose n, ind: viduals randomly, where nh is the number of subsets (clusters of genes)in the current generation Let P represent population size, Ts tournament size and Step 2. Choose set of genes randomly without repetit: on from lb individuals Pc crossover probability. The steps of ECGa or z-ECGA nb subset Step 3. Combine the a, subsets to form a new individual (similar to the multiple -point are shown in AlgorithIn 3 Step 4, Perform the process from Step I to Step 3 for P*Pc times to create P*Pc individuals. Algorithm 3 ECGA or y-ECGA algorithm Step l Initialize population randomly. Step 2. Evaluate fitness of chromosomes Step 3, Perform tournament selection without replacement(Sastry and Goldberg 2001) 3.2 Community detection based on ecga Step 4. Build MPM model using MDL, and the process is given in detail in Algorithm 4. Step 5. Create P*P chromosomes based on the MPM model (the details are described in oe gorithm 5); meanwhile, transfe: P*(1-Po) best individuals from the current generation to the One main problem of applying genetic algorithms to xt generation. Thus, the new population with P individuals is formed Step 6. Terminate the algorithm if a stopping criterion is satisfied otherwise go to Step 2 community detection in complex network lies in how to design effectively the crossover and the mutation operators However, this problem can be avoided when 2-ECGA is The construction of MPM in every generation is for used to find communities nulated as a constrained optimization problem (Verma The z-ECGA of community detection adopts the same 010) representation way of chromosome and the same initial- Minimize Cc= cm+ zation way of population with the GAtHB. Modularity Q St,x≤n,vi∈{1,2,,m} (6) is utilized as its fitness function. Moreover, the process building the mPm in original ECGa is too complex and where x is the alphabet cardinality, z=2 for the binary inefficient, mainly because performing Step 3 in Algorithm strings, m is the number of subsets(clusters), Cc expresses 4 needs too much computation. Therefore, we build the combined complexity, Cm is the model complexity the MPM model based on cluster mutual information which is written as (ClusterMI)(Duque and Goldberg 2009) to improve the pringer 930 J. Li, Y Song efficiency of ECGA. The mutual information between two connecting external vertices in other communities. The subsets (clusters)is expressed as larger the value of Zout, the more complex the conmunity structure in the network and then the more difficult is MI_Cluster(I, J m+团∑∑ MI(i,j) (9) community detection for an algorithm. The gn benchmark network has two clear drawbacks: one is that each vertex has the same degree; the other is that each community has MI(i,j=pli,j log p(i, j √p(i)p() the same size The lfr benchmark network assumes that both the where p() denotes the probability distribution of the gene degrees of vertices and the sizes of communities obey in the whole population exponential distribution ( Lancichinetti et al. 2008). The The steps for building the MPM model based on the gn benchmark network can be created by the LFR ClusterMI algorithm is shown in Algorithm 6 benchmark network. In other words the gn benchmark The procedures of applying X-ECGA to community network is a special case of the lfr benchmark network.a detection refer to Algorithm 3. In the subsequent expe mixing coefficient is used to decide the number of shared nents, for the convenience of expression, x-ECGA edges of one vertex with the vertices in other communities simply written as ECGA The larger the mixing coefficient, the more complex the community structure in the lfr benchmark network or the Algorithm 6 Build: ng MPM model based or ClusterMi algorithm. Step 1. Assume all genes (variables)are independent of each other, and every gene forms an gn benchmark network individual subsct(cluster According to information theory, results from each Step 2. Compute the combined complexity Co Step 3. Choose the two subsets I and with the maximum mutual information (Mn), from all of algorithm can he evaluated in terms of normalized mutual (m-1)2 combinations with any two subsets, where m represents the number of the current nformation(NMI) Danon et al. 2005), which is defined by subsets Step 4. Compute the new combined complexity Cc after merging subset/ and/ a confusion matrix M, where the rows correspond to the real Step 5. If Cc <Cc, i.e., merging subset I and J can improve the combined complexity, then communities, and the columns correspond to the found merge subset I and and go to Step 3; otherwise, terminate the algorithm communities by an algorithm. The member of N, Nin,rep resents the number of vertices in the real community i that appears in the found community j. The number of real communities is denoted as Ca and the number of found 4 Experiments communities is expressed as cB, where A represents real partition and B means found partition, the sum over row i All experiments are done on a single Dell Server of matrix N is written as Ni, and the sum over column i is (Intel(R)Core(TM)i7-2600 CPU @3.40 GHz x8 pro- represented as N;. Thus, the NMI, which measures the cessor with 4G bytes of main memory on Ubuntu Linux similarity between the partitions, can be defined as OS). C + programming language is used to realize all algorithms ∑ NMI(A B) NiN ∑;Mlg()+>2,Mleg( 4.1 Experiments based on benchmark networks The larger value of NMI(A, B) shows the better It is feasible to evaluate and compare different algorithms performance of the algorithm to detect communities. If of community detection using benchmark networks pro- all found communities are identical to all real communities duced by computer simulations. Two benchmark networks, then NMi takes its maximum value. 1. If the communities namely, the gn benchmark network( Girvan and Newman found by one algorithm are totally independent of the real 2002)and the Lfr benchmark network lancichinetti et al. communities, NMI=0, for example, when the whole 2008)are adopted in the experiments. ECGA is compared complex network with multiple real communities is found with GATHB (Tasgin et al. 2007), GA-Net(Pizzuti 2008), to be only one community Gn (Girvan and Newman 2002; Newman and Girvan There are 128 vertices and 4 communities in the gn 2004), and CNM(Clauset et al. 2004) based on the two benchmark graph. The degree of each vertex is 16 benchmark networks Experiments are also done based on undirected and The gn benchmark network consists of a four-tuple unweighted LFR benchmark graphs without overlapping (a, S, k, Zoutlzin)(Girvan and Newman 2002), where a rep- communities. The parameters of the lfr graphs are set as resents the number of created communities, s expresses the follows. The number of vertices n= 1,000, the average number of vertices in cach community, k is the degree of degree of vertices k= 20, the maximum degree max 0 each vertex, and each vertex has Zin edges connecting the exponents of the power law distributions t2=2 for internal vertices in the same community and zout edges vertex degree and t,= 1 for community size Springer Community detection in complex networks 931 For both the gn benchmark networks and the lfr GN benchmark benchmark networks, the parameters of the GatHB, the GA-Net. and the ecga are set as follows. For each of the -" GA-Net GATHB, the GA-Net, and the eCga, the population size GATHB CN P=4,000, the crossover probability Pc=0.8, and the 0.8 maximum generation 8=250 For the GATHB algorithm and the GA-Net algorithm, the mutation probability P=0.2. For the GA-Net algorithm. r=1.5. for the 0, 5 ECGA, the tournament size t.=9. additionally. for each of the three gas. the iteration will be terminated in advance if the optimal fitness keeps unchanged during 10 5Nm successive generations. Moreover, for each optimization problem, each of ECGa, GatHb and GA-Net, is per 2 formed 30 times, and then the average value of 30 results and the standard deviation are computed For every fixed mixing coefficient of the gn benchmark networks and the 10.203040506070809 lfr benchmark networks, the wilcoxon rank -sum test Mixing coefficient with 95 %o confidence level( Conover 1998) is used to Fig. 2 GN benchmark test heck the statistical significance of the results from the ECGA. the gath. and the gA-Net for 30 individual lfr benchmark runs. The Wilcoxon rank-sum test is a non-parametric test O.ECGA for assessing whether the medians from two sets of 一:AMet observations respectively have equally large values without requiring any information on the datas distribution. Fur 03 至 thermore, the average convergence generation nunber and E。生 the average running time for each of the ecga, the GATHB. and the Ga-Net for 30 individual runs. as well as 05 the running time for the CNm, are also recorded to com pare the efficiency of all algorithms It should be noted that 王 the gn algorithm is too slow to perform in the lfr test, UONE 王 and its results are not given in this section 平 Figure 2 shows the average normalized mutual infor 三02 mation and the standard deviations of the ecga. the GATHB. and the ga-Net for the gn benchmark test with the mixing coefficient increasing. Figure 3 illustrates the 010203040506070809 corresponding results for the lfr benchmark test. The Mixing coefficient normalized mutual information from the cnm is also Fig 3 LFR benchmark test given in Figs. 2 and 3, but no standard deviation is shown because the result of the cnm is certain indicates a rejection of the null hypothesis at the 5 si According to Figs. 2 and 3, the ECGa is more effective nificance level, and h= 0 indicates a failure to reject the null than any of the other algorithms when the complexity hypothesis at the 5 significance level. When the p value extent of networks is not high. However, all algorithms obtained with rank-Sum test is less than 0.05 (5%o signifi tend to have nearly the same result when the mixing cance level), h= l, that is to say, the eCga performs sta coefficient reaches 0.9. The GA-Net shows the worst per- tistically more effectively than the gath or the GA-Net formance. but its standard deviation is also the smallest According to Table 1, when the ECGa is compared with one. By and large, the standard deviations from the ecga the GA-Net, 7 cases among the total 9 cases for the gn and the gatha are comparable network, except the first and the last, show that h=1;7 The results of the Wilcoxon rank-sum tests of the ECGa cases among the total 9 cases for the LFR network, except versus the gathA, and the ecga versus the ga-Net, are the last two, show that h= 1. when the ecga is compared given in Table 1. The null hypothesis means that the results with the gathb, 4 cases among the total 9 cases for the from the eCga and the gathb (or the ecga and the ga- gn network show that h= 1; 7 cases among the total 9 Net), respectively, are independent samples from identical cases for the LFR network show that h=l. Therefore continuous distributions with equal medians. h= 1 from the viewpoint of statistical significance, the eCga is pringer 932 J. Li, Y Song Table i wilcoxon rank-sum XIn GN networks LFR networks test of eCGa versuS GATHB and ecga versus ga-net coefficient ECGA versus ECGA versus ECGA versus ECGA Versus based on gn and lfr GATHB GA-Net GATHB GA-Net enchmark networks h h P h P 0.068332 0 0.144914 0.020860 0.046951 0.058005 0 0.002202 0.001085 0.033785 0.051783 0.003687 0.02176910.002198 0.4 0.0ll185 l 0.004077 0.0l1435 0.003676 0.5 0.020030 0.012729 0.013019 0.036772 0.6 0.029177 0.034262 0.022504 0.029189 0.7 0.034350 0.038028 0.071001 100 0.039998 1110 0.039132 0.031585 0.060150 0.060831 11100 0.9 0.068090 0.259889 0.076436 0 0.061342 Table 2 Average running time(s)or average maximum generation Table 3 Average running time(s) or average maximum generation number(gmax )comparison for gn benchmark networks number(gmax) comparison for lFR benchmark networks CNM GA-Net GATHB ECGA Mix CNM GA-Net GATH ECGA coefficient coefficient T Time 8ax Time 8ax g1 Time 8, Time max 0.1866.60428.303.6787.372.111843 0.15836.22491.3059634498.27156.78128.43 0.165949394.23 7.30164.303.7733.3002 0.26655.50416.1772825495.20166.55124.30 0.3 0.3812218249747776.10499.37153.93125.0 0.3 0.165904400.3713.24297.173.19282704 0.561395.16499.27779835084315749126.2 0.4 0.1838.442594716.97373.473.9534.67 0.72684.10413.3376806513.2315439125.17 0.5 0.1852.68333.1721.90469333.75322306 0.85716.44440.3070953520.6718812124.2 0.1820.49132.334474947.4339734.2707 0.75652714242066824510.3321688175.20 0.7 0.184600296.2026.78574.3039834470.8 0846836042123665.39509.171581513 0.9 076256.85331.17650.38508.30166.71138.33 0.8 0.1559.06384.1719.52418.273.7631.17 0.9 0.1631.72200.6719.12405.233.5930.20 2003), Football (Girvan and Newman 2002), Jazz(Gleiser obviously better than the ga-Net. Additionally, for the LFr and Danon 2003 )and, Karate(Zachary 1977)are also used network, the eCGa performs more effectively than the to compare different community detection algorithms GATHB. However. for the gn network the ecga is not Generally, in real networks, modularity Q is computed superior to the gathB statistically significantly to evaluate the performance of algorithms. Because Tables 2 and 3 describe the average convergence gener- GA-Net does not adopt modularity Q as its fitness function ation numbers and the average running time for the eCga, it is not considered in the experiments of the real-world the gatha, and the ga-Net, respectively, as well as the networks running time of the cnm. for the gn network and the lfr he parameters and experimental settings for the network, respectively. From Tables 2 and 3, the CNm per- GATHB and the eCGa are the same as described in forms faster than any of genetic algorithms including the Sect. 4.1 ECGA, the GATHB, and the Ga-Net. The ECGa converges Table 4 shows the six real-world networks. Table 5 with smaller numbers of generations than both the gath illustrates the experimental results and the gA-Net, and uses less average time to reach con- From Table 5. we can find that the ecga is more vergence than the other two genetic algorithms for both the effective than each of the n. the cnm. and the gathB GN network and the lfr network. Moreover, the GA-Net is for all of the six real-world networks. Especially, for the the slowest among the three genetic algorithms Cele network. the gn. the Cnm. and the gatha are not able to obtain good results, but the performance of the 4.2 Experiments based on real-world network ECGA is still outstanding Furthermore, the running time for the gn and the CNm Six widely used real-world complex networkS, Book and the average converging time for the eCGa and the (Krebs 2008), Cele (White et al. 1986), Dolphins (lusseau GatHB is also given in Table 6. Again, the CNM shows Springer Community detection in complex networks 933 the fastest running speed and the gn is the slowest among population size increasing for the six real-world networks, the four algorithms. Moreover, the ECGa needs less especially for the Cele network. For the Book network, the average convergence generation numbers and has a fast Cele network, the Dolphins network and the karate net average convergence speed than the gathB work, the standard deviations of Q values from the eCGa are smaller than those from the gath. however for 4.3 Comparing eCga with Gathb based on the six the Football network and the jazz network the standard real-world networks deviations of e values from the eCga are larger than those from the gathA Because the ECGa is a special genetic algorithm, we also The wilcoxon rank-sum test is also used to check the compare it with the gathb by the modularity e based on statistical significance of the ECGa versus the gathb the six real-world networks. The fitness function of the and the results are listed in table 7. for the cele network GA-Net is different from that of the ecga and thethe football network and the Karate network the ecga is GATHB, therefore the GA-Net is not used as a reference. uniformly superior to the gathb in a statistically signif- For each of the six real-world networks, the population icant way with a 5 o significance level. For the Jazz net- sizes of the ECGa or the GatHb are changed from 1,000 work, except two cases among all nine cases corresponding to 5,000, increasing 500 individuals each time. Except the to nine different population sizes, respectively, the eCga population size, the parameters and experimental settings outperforms significantly the gathB. In addition, five for the eCGa and the gathb, are the same with those in cases for the book network and four cases for the dolphins Sect. 4.1 network show that the ecga is better than the gath in Figures 4, 5,6, 7,8,9 show the average values of the a statistically significant sense modularity Q and the corresponding standard deviations for Furthermore, for each real-world network 30 runs of the ECGa and the Gathb for the six real- recorded the convergence generation numbers world networks, respectively. In each of the six figures, the corresponding running time of the ECGa and the row axis represents the population size, and the vertical GATHB respectively, and compute their average values axis denotes the average value of the modularity o and the to evaluate the convergence speed of the ECGa and the corresponding standard deviation GATHB. Table 6 only shows the results of the ecga From Figs. 4, 5, 6,7,8,9, the effectiveness of the and the gathb with the population size 4,000. We find ECGA is nearly always better than the gathb with the that the eCga needs less generation numbers and less running time to reach convergence than the Gath For other population sizes, the same conclusion is also Table 4 Six real-world complex networks found Networks Number Number of vertices of edges 4.4 Experimental conclusions Book 105 441 According to all experiments as described above, we can Cele 1179 conclude some arguments as follows Dolphins 160 ecga is most effective in a statistically significant Football 115 613 Jazz 2742 sense, for not only the benchmark networks produced by computer simulations, but also the real-world Karate networks Table 5 The o values(the Networks GATHB ECGA average o values for the eCga and the GatHA)and the A Std Std standard deviations(only for the ECGA and the gatha) of the Book 0.509998 0.501974 0.517627 0.010773 0.526938 0.005887 four algorithms for the six Cele 0.282924 0.376672 0.296686 0.015l62 0.472821 0.011754 real-world networks Dolphins 0.519382 0.514596 0.521943 0.018864 0.524168 0.004642 Football 0.599459 0.564630 0.550839 0.013156 0.601012 0.015323 azz 0.390511 0.438903 0.438607 0.007534 0.442678 0.016410 Karate 0.401298 0.380671 0.402445 0.015362 0.419790 0.003489 Average 0.450595 0.462908 0.454691 0.013475 0.497901 0.009584 pringer 934 J. Li, Y Song Table 6 Average running time(s)or average maximum generation Dolphins 056 number(gmax) comparison on the four algorithms for the six rcal orld networks 55 Network GN CNM GATHB ECGA 54 Time 81 max Time gmax Book 1.36 155.230.14 31.17 Cclc 2180.031199250 0.34 59.27 Dolphins 0.28 0.00 0.80 77.270.11 32.20 5.320.00 1.64211.330.14 38.03 05 149 0. 4.67217.470.2937.17 Karate 0.060.000.5472430.0824.30 05 0,49 048 5001000150020002500300035004004500500550 055 ECCA Population Size 054 Fig 6 Q values for Dolphins network 05 052 Football 051 064 中…EC的A 05 02 048 047 另05 1000150020002500300030040004500900550 Population Size Fig.4Q values for Book network Cele 500100015002000250030003004000450050005500 06 Population Size CGA 一ATHB 05 Fig. 7Q values for Football network 05 CNM has the fastest running speed, and gn is the slowest, among GN, CNM, ECGA, GA-Net, and 045 GATHB Compared to GaTHB and gA-Net, ECGA can use less 04 generation numbers and less running time to reach convergence. Thus, ECGA is more efficient than the 035 other two genetic algorithms. ECGa is more stable than gath and GA-Net 0:3 +++++: For each of ecga. gathb and gA-Net an interest ing finding is that, when the population size is larger 02 than 4,000, the obtained optimum solutions seem to 00100015002000250300035004000400500050 become stable. This is the reason why 4.000 is set as the Population Size population size of each genetic algorithm in Sects. 4 Fig. 5 0 valucs for Ccle nctwork and 4.2 Springer

...展开详情
所需积分/C币:1 上传时间:2018-08-08 资源大小:2.04MB
举报 举报 收藏 收藏
分享 分享
Community detection in graphs

这个领域近十年来的总结,很牛的综述!做网络分析的可以看看

立即下载
Community Detection and Mining in Social Media

It is a pleasure to acknowledge many colleagues who made substantial contributions in various ways to this time-consuming book project. The members of the Social Computing Group, Data Mining and Machine Learning Lab at Arizona State University made this project enjoyable. They include Ali Abbasi, Ge

立即下载
Fast Community Detection

Newman实现的 Fast GN方法,是密西西根大学Newman实现的快速社区发现算法,使用C语言实现

立即下载
community detection in graph.pdf

community detection in graph.pdf

立即下载
community detection by signaling on complex networks

community detection by signaling on complex networks

立即下载
2010-PhysicalReport-Community detection in graphs

2010年社会计算综述,Community detection in graphs

立即下载
Community detection 社区发现数据集 football

用于社区发现算法测试的公共数据集,为football联赛数据集。大家公认数据集

立即下载
Community detection in complex networks using extended compact genetic algorithm

研究重叠社区发现具有重要的意义. 随着对社区发现研究不断深入, 现实生活中, 节点往往会同时隶属多个社团, 如语义网中, 单词会因其具有的多个词义或词性而同时属于多个社区; 在蛋白质网络中, 一个氨基酸往往也具有不同的生物功能[7]; 在科学研究合作网中, 科学家或研究员具有研究不同领域的能力.因此, 网络中的社区结构的划分可能有重叠节点的存在. 图1-3所示网络中节点6属于两个社区共同节点, 属于重叠节点. 以前, 社区发现工作者很少考虑到重叠节点, 简单地将网络所有的节点划分成不相交的社区结构, 必然导致划分社区结构的结果不够精确. 相反, 考虑网络社区结构的重叠性,不仅增加对社区结构稳定

立即下载
论文研究-Mining Seeds for Community Detection.pdf

社团探测中的种子挖掘,陈凤娇,李侃,社团结构是复杂网络重要的组成部分,例如社会网络和蛋白质网络。社团是混合的、重叠的,社团成员可以同时属于多个社团。能够检测这�

立即下载
论文研究-Refining Graph Partitioning for Community Detection.pdf

一种改进的社区检测算法,钱铁云,杨洋,图划分是一个传统的问题,用于将节点区分成簇,使得预先定义的目标函数最优。近来,社会网络分析的广泛开展再次吸引了图形聚类方

立即下载
论文研究-Resolution limit in community detection for an information-theoretic method.pdf

一种基于信息理论的社团识别方法的分辨极限问题研究,孙鹏岗,,分辨极限对于社团识别非常重要,同时也对社团检测方法提出了新的挑战。众所周知,模块优化算法暴露出严重的分辨极限问题在于多个

立即下载
论文研究-Circle-based multi-label propagation algorithm for community detection.pdf

基于圈子的多标签传播社区检测算法,马千里,张俊浩,近年来,社区检测引起了复杂网络领域研究者的极大关注。过去十年以来,已经出现了许多种社区检测算法,这其中,标签传播算法(LPA)�

立即下载
论文研究-A Gamma-Poisson Block Model for Community Detection in Directed Network.pdf

用于发现有向网络中社区结构的伽马泊松分布随机块模型,高思远,刘瑞芳,本文提出了一种基于伽马-泊松分布的随机块模型,用于发现有向网络中的社区结构。传统的用于对网络建模和发现社区的图模型大多存��

立即下载
Detection of community structures in networks via global optimization

社会网络中的社区发现已应用到许多领域,已成为学术界研究的热点问题,本文是社区发现的好论文,有机会一定要看看,呵呵!

立即下载
社区发现 网络数据集 texas cornell washington wiconsin

压缩包中包含了small网络数据集:texas, cornell, washington, wisconsin, 可以用于测试社区发现(community detection),网络嵌入(network embedding

立即下载
斯坦福大学CS246课件

Introduction; MapReduce Frequent Itemsets Mining Locality-Sensitive Hashing I Locality-Sensitive Hashing II Clustering Dimensionality Reduction Recommender Systems I Recommender Systems II PageRank Link Spam and Introduction to Social Networks Social Networks: Community Detection and Trawling Algori

立即下载
Stanford 大学--Analysis of Networks课程13-19-PPT.rar

Stanford 大学--Analysis of Networks课程13-19章 Handouts Info Sheet Lecture 01 - 09/25 Course Introduction and Structure of Graphs Lecture 02 - 09/27 Measuring Networks, and Random Graph Model Lecture 03 - 10/02 Link Analysis: PageRank Lecture 04 - 10/04 Network Construction, Inference, a

立即下载
Stanford 大学--Analysis of Networks课程7-12-PPT.rar

Stanford 大学--Analysis of Networks课程的PPT资源第7-12章。 Handouts Info Sheet Lecture 01 - 09/25 Course Introduction and Structure of Graphs Lecture 02 - 09/27 Measuring Networks, and Random Graph Model Lecture 03 - 10/02 Link Analysis: PageRank Lecture 04 - 10/04 Network Construction, Infer

立即下载
Analyzing_Social_Networks

Designed to walk beginners through core aspects of collecting, visualizing, analyzing, and interpreting social network data, this book will get you up-to-speed on the theory and skills you need to conduct social network analysis. Using simple language and equations, the authors provide expert, clear

立即下载
Practical Graph Mining with R(CRC,2013)

Discover Novel and Insightful Knowledge from Data Represented as a Graph Practical Graph Mining with R presents a “do-it-yourself” approach to extracting interesting patterns from graph data. It covers many basic and advanced techniques for the identification of anomalous or frequently recurring pat

立即下载