
Clustering

Data mining lecture notes by Professor Gedas Adomavicius, University of Minnesota (USA).
Using Clustering

- Helps to gain insights into your data
  - Instead of trying to look at the entire dataset (= a huge number of transactions), you can inspect the representative data groups/clusters (= a small number of groups into which your data can be arranged most naturally)
- The basic idea has been used throughout history
  - Periodic table of the elements
  - Classification of species
  - Grouping securities in portfolios
  - Grouping firms for structural analysis of the economy
- Many applications
  - Market segmentation, medical diagnostics, bioinformatics, text mining / information retrieval, etc.

Example: Public Utilities

- Goal: based on the information about different public utility companies, find clusters/groups of similar utilities (according to their descriptive attributes)
- Data: 22 firms, 8 variables
  - Fixed-charge covering ratio
  - Rate of return on capital
  - Cost per kilowatt capacity
  - Annual load factor
  - Growth in peak demand
  - Sales
  - Nuclear
  - Fuel costs per kWh

Company        Fixed_charge  RoR    Cost  Load   Demand  Sales  Nuclear  Fuel_Cost
Arizona        1.06          9.2    151   54.4   1.6     9077   0        0.628
Boston         0.89          10.3   202   57.9   2.2     5088   253?     1.555
Central        1.43          15.4   113   53     3.4     9212   0        1.058
Commonwealth   1.02          11.2   168   56     0.3     6423   34.3     07?
Con Ed NY      1.49          8.8    192   51.2   ?       3300   15.6     2.044
Florida        1.32          13.5   111   60     2.2     11127  22       1.241
Hawaiian       1.22          12.2   175   67.6   2.2     7642   000?     652?
Idaho          ?             9.2    245   57     3.3     13082  ?        0309?
Kentucky       1.34          13     168   60.4   7.2     8406   ?        0.862
Madison        1.12          ?      197   53     2       6455   ?        0623?
Nevada         0.75          ?      73?   51.5   6.5     17441  ?        0.768
New England    1.13          109?   178   62     3.7     6154   ?        897?
Northern       1.15          12.7   199   53.7   6.4     7179   50       0.527
Oklahoma       1.09          ?      96    49.8   1.4     9673   ?        0.588
Pacific        0.96          7.6    164   62.2   0.1?    6468   ?        ?
Puget          1.16          9.9    252   56     9.2     15991  ?        0.62
San Diego      0.76          64?    136   61.9   9       5714   8?       1.92
Southern       1.05          12.6   150   56.7   2.7     10140  ?        1.108
Texas          1.16          11.7   104   54     2.1     13507  ?        0.636
Wisconsin      1.2           11.8   148   599?   3.5     7287   41.1     0.702
United         1.04          8.6    204   61     3.5     6650   0        2.116
Virginia       1.07          9.3    174   54.3   5.9     10093  26.6     1.306

(A bare ? marks a value lost in text extraction; a value followed by ? is reproduced as extracted and is probably damaged, e.g. a missing decimal point or leading digit.)

Finding Clusters by Plotting Data

- Manually eyeballing data by plotting it in two dimensions: sales and fuel cost

[Figure: scatter plot of fuel cost against sales for the 22 utilities, each point labeled with the company name. Three regions are annotated: "High fuel cost, low sales", "Low fuel cost, high sales", and "Low fuel cost, low sales".]

- How about going beyond eyeballing the data in two dimensions?
- Need general-purpose techniques to deal with any-dimensional data

Things to Think About: General Issues

- Similarity / connectivity / distance metrics
  - Direct similarity (distance)
  - Contextual similarity (influence of other points)
  - Conceptual similarity (what exactly are clusters representing?)
- Stopping criteria
  - How many clusters should you have?
- Clustering algorithms and their parameters

Similarity: Example

Consider data points A, B, and C. Is B more similar to A or to C?

[Figure: three panels showing points A, B, and C, one panel per kind of similarity.]

- Direct similarity: based on direct distance, one might assume that points B and C are more likely to be in the same cluster than points A and B
- Contextual similarity: with additional information it is clear that points A and B intuitively belong to the same cluster and that B and C should be in different clusters
- Conceptual similarity: in the general/philosophical sense, the concept of similarity can become very complex; e.g., here A and B belong to the same cluster not because of direct distance or the context of other points, but because they represent a common underlying concept, i.e., a rectangle, as opposed to C, which represents a different concept, a triangle

[Slide annotation; its position relative to the three panels was lost in extraction:] Most clustering techniques deal with these types of similarity

Things to Think About: Specific Issues

- Scalability
  - Some techniques may scale to larger datasets than others
- Ability to deal with different types of attributes
  - Typically, good with numeric data
  - How about binary, categorical, ordinal?
- Discovery of clusters with arbitrary shape
  - Many methods are good at discovering spherical clusters (works well for most practical applications); discovery of clusters with arbitrary shape is a much more complex problem (occurs in highly specialized applications)
- Domain knowledge to determine input parameters
  - E.g., which distance metric is most appropriate for what kinds of data?
- Ability to deal with noisy data
  - Missing, incorrect, unknown data; outliers
- Insensitivity to the order of input records
  - If the data records are presented in a different order, are the results the same?
- High dimensionality
  - Does the technique perform well for highly multidimensional data?
- Constraint-based clustering
  - E.g., consider geographic clustering (rivers, highways)
- Interpretability and usability
  - Does the technique present output that is intuitive and easily interpretable?

Distance Metric

- A key component of most clustering techniques is the distance metric
- Given the distance metric, clustering techniques will cluster data according to that metric
- Therefore, the choice of a metric is important
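
The point that the choice of metric matters can be illustrated with a small sketch. It takes the Sales and Fuel_Cost values of three utilities from the data table (Arizona, Boston, Central) and asks which one is closest to Boston under Euclidean distance, first on the raw values and then after z-score standardization. The standardization step and the helper names (`euclidean`, `standardize`, `nearest`) are assumptions made for illustration; they are not taken from the lecture.

```python
import math

# (Sales, Fuel_Cost) for three utilities, copied from the data table.
utilities = {
    "Arizona": (9077.0, 0.628),
    "Boston": (5088.0, 1.555),
    "Central": (9212.0, 1.058),
}


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(points):
    """Rescale every attribute to zero mean and unit (population) std deviation.

    Illustrative assumption: z-scoring is one common way to keep a
    large-valued attribute (Sales) from dominating a small-valued one.
    """
    columns = list(zip(*points.values()))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return {
        name: tuple((x - m) / s for x, m, s in zip(values, means, stds))
        for name, values in points.items()
    }


def nearest(name, points, metric=euclidean):
    """Return the name of the point closest to `name` under `metric`."""
    others = (n for n in points if n != name)
    return min(others, key=lambda n: metric(points[name], points[n]))


print(nearest("Boston", utilities))               # Arizona
print(nearest("Boston", standardize(utilities)))  # Central
```

On the raw values the distance is driven almost entirely by Sales, so Boston's nearest neighbor is Arizona; once both attributes are put on a comparable scale, Fuel_Cost counts as well and the nearest neighbor becomes Central. A clustering technique fed these two distance definitions could therefore group the same records differently.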

Uploaded 2017-11-05. Size: 629 KB
Clustering Toolbox

A MATLAB clustering toolbox, including agglom (Basic Agglomerative Clustering), kmeans (k-means clustering), mixtureEM (cluster by estimating a mixture of Gaussians), mixtureSelect (estimate a mixture with unknown K using BIC), and EM (Expectation-Maximization), along with related demos and C programs.

Clustering Mass Spectra (MS-Clustering)

MS-Clustering is designed to rapidly cluster large MS/MS datasets. The program merges similar spectra (having similar m/z values within a given tolerance) and creates a single consensus spectrum as a representative. The accepted input formats are dta, mgf, and mzXML. The output format is mgf.

Markov clustering

Markov clustering algorithm.

HIERARCHICAL CLUSTERING SCHEMES

Techniques for partitioning objects into optimally homogeneous groups on the basis of empirical measures of similarity among those objects have received increasing attention in several different fields. This paper develops a useful correspondence between any hierarchical system of such clusters, and

Clustering_Coefficient

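Complex networks are closely tied to everyday life. They are commonly characterized by three kinds of parameters: degree distribution, clustering coefficient, and average path length. This document is a simple, handy program for computing the clustering coefficient.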

vertex clustering implementation

Implements an adaptive vertex clustering algorithm for mesh simplification.

subspace clustering

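Subspace clustering: dimensionality reduction and subspace clustering; described by the uploader as outperforming k-means and similar clustering algorithms.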

Hierarchical clustering

A description of the hierarchical clustering algorithm.

Sparse Subspace Clustering

A method based on sparse representation (SR) for clustering data drawn from multiple low-dimensional linear or affine subspaces embedded in a high-dimensional space.

Sparse subspace clustering algorithm code

Source code for the paper "Sparse subspace clustering: algorithm, theory, and applications".

Algorithms for Fuzzy Clustering

Algorithms for Fuzzy Clustering: Methods in c-Means Clustering with Applications

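Clustering analysis algorithm implementations: clustering-algorithms-master

This resource contains implementations of clustering analysis algorithms, including clustering algorithms based on time-series analysis, mainly applied to the analysis of data such as stock time series (clustering-algorithms-master).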

WSN clustering MATLAB code

A wireless sensor network clustering program using Voronoi; the most basic program for the LEACH protocol.

CVPR-2009-Sparse Subspace Clustering.pdf

CVPR-2009-Sparse Subspace Clustering.pdf

Graph-based Clustering Approach

A Graph-based Clustering Approach to Evaluate Interestingness Measures: A Tool and a Comparative Study

Clustering (Rui Xu, Donald C. Wunsch, II)

Clustering, a book written by Rui Xu and Donald C. Wunsch, II

K-means clustering algorithm

This document contains MATLAB code for the k-means clustering algorithm, mainly used for image segmentation.

Data clustering algorithm and application

Data clustering algorithm and application.

Data Clustering Theory, Algorithms, and Applications

Guojun Gan (York University, Toronto, Ontario, Canada); Chaoqun Ma (Hunan University, Changsha, Hunan, People's Republic of China); Jianhong Wu (York University, Toronto, Ontario, Canada)

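A dynamic New Year greeting card built with HTML + CSS + JS

This is the code from the blog post at http://blog.csdn.net/qq_29656961/article/details/78155792; it includes the image and music assets the code needs.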
