Spectral Clustering Implementation in MATLAB (MATLAB spectral clustering code resource, CSDN Library)

25 files in total: 13 .m files, 10 .mat files, 1 .txt file

Uploaded 2018-05-22 19:04:25, 15.07 MB, RAR archive

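Spectral clustering is an unsupervised machine learning method that is widely used for data clustering, including on high-dimensional data. MATLAB is a convenient environment for implementing it because of its matrix operations and numerical libraries. This resource package provides a MATLAB implementation of spectral clustering that simplifies cluster analysis for the user.

The basic principle is as follows. Spectral clustering treats the data points as nodes of a graph and solves the partitioning problem through the graph's Laplacian matrix. It rests on one assumption: the edge weights between points in the same cluster should be larger than the edge weights between points in different clusters. An adjacency (similarity) matrix, or a normalized version of it, is built on the data set, and the corresponding Laplacian matrix is computed. The Laplacian is then eigendecomposed, and the eigenvectors belonging to its smallest eigenvalues are kept. These eigenvectors serve as a new low-dimensional representation of the data, and a clustering algorithm such as K-means is finally applied to that representation.

The file "read me.txt" is probably the documentation for the package. Such a file usually explains how to use the code, what environment it needs, and what the input and output formats are, so users should read it carefully before running anything.

"spectralclustering-1.1" is the directory that holds the implementation, as functions and scripts. It may cover the following parts:

1. Data preprocessing: may include loading the data, standardization, and handling of missing values, so that the input is suitable for spectral clustering.
2. Graph construction: builds the adjacency matrix from the pairwise similarities. This can be a binary adjacency matrix (obtained with a similarity threshold) or a weighted adjacency matrix (with entries set by the similarity values).
3. Laplacian matrix: computes the normalized Laplacian, a variant of the graph Laplacian that copes better with data sets of uneven density.
4. Eigendecomposition: decomposes the Laplacian to obtain the eigenvectors. In MATLAB this can be done with `eig` (or with `eigs` for large sparse matrices).
5. K-means clustering: runs K-means in the eigenvector space. MATLAB's `kmeans` function can be used for this step.
6. Result evaluation: may include visualizing the clusters or computing quality scores. Internal indices such as the Davies-Bouldin index or the silhouette coefficient are common choices; the archive listed below ships `nmi.m` and `accuracy.m`, which instead compare the result against ground-truth labels.

In practice the user has to tune parameters to get good results, for example the similarity measure, the adjacency threshold or number of neighbors, and the number of clusters. Complex data sets may also need additional preprocessing such as dimensionality reduction or outlier detection. Overall, the package offers a convenient MATLAB tool for spectral clustering that suits research, data analysis, and teaching: a few function calls and parameter settings are enough to run an unsupervised cluster analysis.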
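The pipeline described above (similarity graph, normalized Laplacian, eigendecomposition, K-means) can be sketched in a few lines of MATLAB. This is a minimal illustration on hypothetical toy data and is not the code shipped in the archive. The variable names, the value of `sigma`, and the use of the built-in `kmeans` (which requires the Statistics and Machine Learning Toolbox) are all assumptions made for the example.

```matlab
% Toy data: two well-separated 2-D blobs (hypothetical, for illustration only).
rng(0);                                        % fix the random seed
X = [0.3*randn(20,2); 0.3*randn(20,2) + 3];    % 40-by-2 data matrix
k = 2;                                         % number of clusters
sigma = 1;                                     % Gaussian kernel width (assumed value)

% Graph construction: weighted adjacency matrix from squared distances.
sq = sum(X.^2, 2);
D2 = max(bsxfun(@plus, sq, sq') - 2*(X*X'), 0); % N-by-N squared distances
W  = exp(-D2 / (2*sigma^2));                   % Gaussian similarity
W(1:size(W,1)+1:end) = 0;                      % zero the diagonal (no self-loops)

% Symmetric normalized Laplacian: L = I - D^(-1/2) * W * D^(-1/2).
d    = sum(W, 2);                              % node degrees
Dinv = diag(1 ./ sqrt(d));
L    = eye(size(W,1)) - Dinv * W * Dinv;
L    = (L + L') / 2;                           % remove round-off asymmetry

% Eigendecomposition: keep eigenvectors of the k smallest eigenvalues.
[V, E] = eig(L);
[~, order] = sort(diag(E), 'ascend');
U = V(:, order(1:k));
U = bsxfun(@rdivide, U, sqrt(sum(U.^2, 2)));   % scale each row to unit length

% K-means on the embedded points (one row of U per data point).
labels = kmeans(U, k, 'Replicates', 5);
```

For data this well separated, one would expect the first 20 points and the last 20 points to end up in two different clusters. For large data sets a dense `eig` call becomes impractical, which is why a sparse nearest-neighbor graph or a sampling-based approximation is normally used instead.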


spectralclustering-1.1.rar (25 files)

read me.txt 9KB

spectralclustering-1.1

hungarian.m 9KB

script_nystrom_no_orth.m 2KB

sc.m 3KB

nmi.m 2KB

nystrom.m 5KB

script_sc_selftune.m 2KB

data

corel_5_NN_sym_distance.mat 127KB

corel_200_NN_sym_distance.mat 5.07MB

corel_15_NN_sym_distance.mat 377KB

corel_100_NN_sym_distance.mat 2.51MB

corel_label.mat 236B

corel_20_NN_sym_distance.mat 504KB

corel_150_NN_sym_distance.mat 3.8MB

corel_50_NN_sym_distance.mat 1.24MB

corel_feature.mat 1.23MB

corel_10_NN_sym_distance.mat 253KB

script_kmeans.m 1KB

README 9KB

k_means.m 4KB

script_nystrom.m 2KB

script_sc.m 2KB

gen_nn_distance.m 4KB

accuracy.m 750B

nystrom_no_orth.m 4KB

Table of Contents
=================

- Introduction
- Usage
- Examples
- Hardware Requirement
- Additional Information

Introduction
============

This directory includes sources used in the following paper:

Parallel Spectral Clustering in Distributed Systems
Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Chang
IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol. 33, No. 3, pp. 568-586, March 2011

This code has been tested in a 64-bit Linux environment using MATLAB 7.4.0.287 (R2007a). You will be able to regenerate the experiment results in the paper. However, results may differ slightly because of randomness, the CPU speed, and the load on your computer.

In the data/ directory, we include the Corel data set and its nearest-neighbor files. For the RCV1 data set, due to its large size (~100MB), please check our project page for the download link:

http://alumni.cs.ucsb.edu/~wychen/sc.html#Download

To generate nearest neighbors for RCV1, please refer to the Examples section.

Please also note that we assume true/cluster labels are integers 1, 2, ..., num_labels.

Usage
=====

1. Generate a sparse symmetric distance matrix using the t-nearest-neighbor method:

   matlab> gen_nn_distance(data, num_neighbors, block_size, save_type)

   -data: N-by-D data matrix, where N is the number of data points and D is the number of dimensions.
   -num_neighbors: Number of nearest neighbors.
   -block_size: Block size for partitioning the data matrix. The data matrix is processed in a divide-and-conquer manner to reduce memory use. This is useful for very large data sets when physical memory is limited.
   -save_type: 0 for .mat file, 1 for .txt file, 2 for both.

   [Note] The file format of .txt is as follows:
   data_id #_of_neighbors data_id:distance_value data_id:distance_value ...

2. Run spectral clustering using a sparse similarity matrix:

   matlab> [cluster_labels evd_time kmeans_time total_time] = sc(A, sigma, num_clusters);

   -A: N-by-N sparse symmetric distance matrix, where N is the number of data points.
   -sigma: Sigma value used in the similarity function S, where S_ij = exp(-dist_ij^2 / (2*sigma*sigma)); if sigma is 0, the self-tuning technique is applied, where S_ij = exp(-dist_ij^2 / (2*avg_dist_i*avg_dist_j)).
   -num_clusters: Number of clusters.

3. Run spectral clustering using the Nystrom method with orthogonalization:

   matlab> [cluster_labels evd_time kmeans_time total_time] = nystrom(data, num_samples, sigma, num_clusters);

   -data: N-by-D data matrix, where N is the number of data points and D is the number of dimensions.
   -num_samples: Number of random samples.
   -sigma: Sigma value used in the similarity function S, where S = exp(-dist^2 / (2*sigma*sigma)).
   -num_clusters: Number of clusters.

4. Run spectral clustering using the Nystrom method without orthogonalization:

   matlab> [cluster_labels evd_time kmeans_time total_time] = nystrom_no_orth(data, num_samples, sigma, num_clusters);

   -data: N-by-D data matrix, where N is the number of data points and D is the number of dimensions.
   -num_samples: Number of random samples.
   -sigma: Sigma value used in the similarity function S, where S = exp(-dist^2 / (2*sigma*sigma)).
   -num_clusters: Number of clusters.

5. Run k-means clustering:

   matlab> cluster_labels = k_means(data, centers, num_clusters);

   -data: N-by-D data matrix, where N is the number of data points and D is the number of dimensions.
   -centers: K-by-D centers matrix, where K is num_clusters; or 'random' for random initialization; or [] (empty matrix) for orthogonal initialization.
   -num_clusters: Number of clusters.

6. Evaluate clustering quality using NMI (Normalized Mutual Information):

   matlab> score = nmi(true_labels, cluster_labels)

   -true_labels: N-by-1 vector containing true labels.
   -cluster_labels: N-by-1 vector containing cluster labels.

7. Evaluate clustering quality using accuracy (Hungarian algorithm):

   matlab> score = accuracy(true_labels, cluster_labels)

   -true_labels: N-by-1 vector containing true labels.
   -cluster_labels: N-by-1 vector containing cluster labels.

8. Run scripts (sparse/sparse selftune/nystrom/nystrom without orthogonalization/kmeans):

   matlab> script_sc(dataset)              % spectral clustering using sparse similarity matrix with given sigma
   matlab> script_sc_selftune(dataset)     % spectral clustering using sparse similarity matrix with selftune sigma
   matlab> script_nystrom(dataset)         % spectral clustering using nystrom with orthogonalization
   matlab> script_nystrom_no_orth(dataset) % spectral clustering using nystrom without orthogonalization
   matlab> script_kmeans(dataset)          % k-means clustering

   -dataset: data set number, 1 = Corel, 2 = RCV1 (same meaning for every script).

Examples
========

1. Generate a sparse symmetric distance matrix using the t-nearest-neighbor method:

   matlab> load data/corel_feature.mat;
   matlab> gen_nn_distance(feature, 50, 10, 2);

   This generates two sparse symmetric distance files:
   50_NN_sym_distance.mat and 50_NN_sym_distance.txt

2. Run spectral clustering using a sparse similarity matrix:

   matlab> load data/corel_50_NN_sym_distance.mat;
   matlab> [cluster_labels evd_time kmeans_time total_time] = sc(A, 20, 18);
   matlab> load data/corel_label.mat;
   matlab> nmi_score = nmi(label, cluster_labels)
   matlab> accuracy_score = accuracy(label, cluster_labels)

3. Run spectral clustering using the Nystrom method with orthogonalization:

   matlab> load data/corel_feature.mat;
   matlab> [cluster_labels evd_time kmeans_time total_time] = nystrom(feature, 200, 20, 18);
   matlab> load data/corel_label.mat;
   matlab> nmi_score = nmi(label, cluster_labels)
   matlab> accuracy_score = accuracy(label, cluster_labels)

4. Run spectral clustering using the Nystrom method without orthogonalization:

   matlab> load data/corel_feature.mat;
   matlab> [cluster_labels evd_time kmeans_time total_time] = nystrom_no_orth(feature, 200, 20, 18);
   matlab> load data/corel_label.mat;
   matlab> nmi_score = nmi(label, cluster_labels)
   matlab> accuracy_score = accuracy(label, cluster_labels)

5. Run k-means clustering:

   matlab> load data/corel_feature.mat;
   matlab> cluster_labels = k_means(feature, 'random', 18);
   matlab> load data/corel_label.mat;
   matlab> nmi_score = nmi(label, cluster_labels)
   matlab> accuracy_score = accuracy(label, cluster_labels)

6. Run scripts for Corel data:

   matlab> script_sc(1);
   matlab> script_sc_selftune(1);
   matlab> script_nystrom(1);
   matlab> script_nystrom_no_orth(1);
   matlab> script_kmeans(1);

Hardware Requirement
====================

Note that running Nystrom on the RCV1 data (193,844 instances) may consume more than 3GB of memory with 2000 random samples (193,844 * 2,000 * 8 bytes > 3GB). If you want to use more samples, please make sure you have enough memory on your machine.

Frequently Asked Questions (FAQ)
================================
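As a quick check of the estimate in the Hardware Requirement section, the size of one dense N-by-num_samples matrix of doubles can be computed directly. The sketch below only reproduces the arithmetic quoted in the README; the variable names are chosen for illustration, and it assumes the dominant cost is a single such matrix.

```matlab
num_instances    = 193844;   % RCV1 size quoted in the README
num_samples      = 2000;     % number of random samples
bytes_per_double = 8;        % one double-precision value

% Memory for one dense num_instances-by-num_samples matrix of doubles.
bytes = num_instances * num_samples * bytes_per_double;

fprintf('%.2f GB (10^9 bytes), %.2f GiB (2^30 bytes)\n', bytes/1e9, bytes/2^30);
% prints: 3.10 GB (10^9 bytes), 2.89 GiB (2^30 bytes)
```

The product exceeds 3GB only when a gigabyte is counted as 10^9 bytes. Either way, the requirement grows linearly with num_samples, so doubling the number of samples doubles the size of this matrix.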
