
Clustering
Data mining lecture notes by Gedas Adomavicius, professor at the University of Minnesota (USA).
Using Clustering

Helps to gain insights into your data
  - Instead of trying to look at the entire dataset (= a huge number of transactions), you can inspect the representative data groups/clusters (a small number of groups into which your data can be arranged most naturally)
The basic idea has been used throughout history
  - Periodic table of the elements
  - Classification of species
  - Grouping securities in portfolios
  - Grouping firms for structural analysis of the economy
Many applications
  - Market segmentation, medical diagnostics, bioinformatics, text mining/information retrieval, etc.

Example: Public Utilities

Goal: based on the information about different public utility companies, find clusters/groups of similar utilities (according to their descriptive attributes).
Data: 22 firms, 8 variables
  - Fixed-charge covering ratio
  - Rate of return on capital
  - Cost per kilowatt capacity
  - Annual load factor
  - Growth in peak demand
  - Sales
  - Nuclear
  - Fuel costs per kWh

(? = value not legible in this preview)

Company       Fixed_charge  RoR   Cost  Load  Demand  Sales  Nuclear  Fuel_Cost
Arizona       1.06          9.2   151   54.4  1.6     9077   0        0.628
Boston        0.89          10.3  202   57.9  2.2     5088   ?        1.555
Central       1.43          15.4  113   53    3.4     9212   0        1.058
Commonwealth  1.02          11.2  168   56    0.3     6423   34.3     ?
Con Ed NY     1.49          8.8   192   51.2  ?       3300   15.6     2.044
Florida       1.32          13.5  111   60    2.2     11127  22       1.241
Hawaiian      1.22          12.2  175   67.6  2.2     7642   ?        ?
Idaho         ?             9.2   245   57    3.3     13082  ?        ?
Kentucky      1.34          13    168   60.4  7.2     8406   ?        0.862
Madison       1.12          ?     197   53    2       6455   ?        ?
Nevada        0.75          ?     ?     51.5  6.5     17441  ?        0.768
New England   1.13          ?     178   62    3.7     6154   ?        ?
Northern      1.15          12.7  199   53.7  6.4     7179   50       0.527
Oklahoma      1.09          ?     ?     49.8  1.4     9673   ?        0.588
Pacific       0.96          7.6   164   62.2  0.1     6468   ?        ?
Puget         1.16          9.9   252   56    9.2     15991  ?        0.62
San Diego     0.76          ?     136   61.9  9       5714   ?        1.92
Southern      1.05          12.6  150   56.7  2.7     10140  ?        1.108
Texas         1.16          11.7  104   54    2.1     13507  ?        0.636
Wisconsin     1.2           11.8  148   ?     3.5     7287   41.1     0.702
United        1.04          8.6   204   61    3.5     6650   0        2.116
Virginia      1.07          9.3   174   54.3  5.9     10093  26.6     1.306

Finding Clusters by Plotting Data

Manually eyeballing data by plotting it in two dimensions: sales and fuel cost.

[Scatter plot: fuel cost against Sales, one labeled point per utility, with three annotated regions: "High fuel cost, low sales", "Low fuel cost, high sales", and "Low fuel cost, low sales".]

How about going beyond eyeballing the data in two dimensions? Need general-purpose techniques to deal with any-dimensional data.

Things to Think About: General Issues

  - Similarity/connectivity/distance metrics
      - Direct similarity (distance)
      - Contextual similarity (influence of other points)
      - Conceptual similarity (what exactly are clusters representing?)
  - Stopping criteria
      - How many clusters should you have?
  - Clustering algorithms and their parameters

Similarity: Example

Consider data points A, B, and C. Is B more similar to A or to C?
  - Direct similarity: based on direct distance, one might assume that points B and C are more likely to be in the same cluster than points A and B.
  - Contextual similarity: with additional information it is clear that points A and B intuitively belong to the same cluster and that B and C should be in different clusters.
  - Conceptual similarity: in the general/philosophical sense, the concept of similarity can become very complex; e.g., here A and B belong to the same cluster not because of direct distance or the context of other points, but because they represent a common underlying concept, i.e., a rectangle, as opposed to C, which represents a different concept, a triangle.
Most clustering techniques deal with these types of similarity.

Things to Think About: Specific Issues

  - Scalability
      - Some techniques scale to larger datasets better than others
  - Ability to deal with different types of attributes
      - Typically, good with numeric data
      - How about binary, categorical, ordinal?
  - Discovery of clusters with arbitrary shape
      - Many methods are good at discovering spherical clusters (works well for most practical applications); discovery of clusters with arbitrary shape is a much more complex problem (occurs in highly specialized applications)
  - Domain knowledge to determine input parameters
      - E.g., which distance metric is most appropriate for what kinds of data?
  - Ability to deal with noisy data
      - Missing, incorrect, unknown data; outliers
  - Insensitivity to the order of input records
      - If the data records are presented in a different order, are the results the same?
  - High dimensionality
      - Does the technique perform well for highly multidimensional data?
  - Constraint-based clustering
      - E.g., consider geographic clustering (rivers, highways)
  - Interpretability and usability
      - Does the technique present output that is intuitive and easily interpretable?

Distance Metric

A key component of most clustering techniques is the distance metric. Given the distance metric, clustering techniques will cluster data according to that metric. Therefore, the choice of a metric is important.

Uploaded: 2017-11-05, size: 629 KB
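The remark that the choice of distance metric matters can be illustrated with a small sketch. It is not taken from the lecture notes. It assumes Euclidean distance and z-score standardization as two candidate metrics, and uses the Sales and Fuel_Cost values of three utilities (United, Wisconsin, Boston) as they read in the table. The helper names `euclidean`, `standardize`, and `nearest` are hypothetical.

```python
import math
import statistics

# (Sales, Fuel_Cost) for three utilities, as read from the Public Utilities table.
utilities = {
    "United": (6650.0, 2.116),
    "Wisconsin": (7287.0, 0.702),
    "Boston": (5088.0, 1.555),
}


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(points):
    """Rescale every attribute to zero mean and unit (population) standard deviation."""
    columns = list(zip(*points.values()))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, means, stdevs))
        for name, values in points.items()
    }


def nearest(points, target):
    """Name of the point closest to `target` under Euclidean distance."""
    others = (name for name in points if name != target)
    return min(others, key=lambda name: euclidean(points[name], points[target]))


# Assumption: two alternative "metrics" are compared, raw vs. standardized attributes.
raw_neighbor = nearest(utilities, "United")
scaled_neighbor = nearest(standardize(utilities), "United")
print(raw_neighbor, scaled_neighbor)
```

On the raw values, Sales (in the thousands) dominates the distance, so the nearest neighbor of United is Wisconsin. After standardization, Fuel_Cost carries comparable weight and the nearest neighbor becomes Boston. Any clustering technique built on these distances would inherit that difference.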

Vertex clustering implementation
Implements an adaptive vertex clustering algorithm for mesh simplification.

Sparse subspace clustering algorithm code
Source code for the paper "Sparse Subspace Clustering: Algorithm, Theory, and Applications".

Algorithms for Fuzzy Clustering
Algorithms for Fuzzy Clustering: Methods in c-Means Clustering with Applications

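Clustering analysis algorithm implementations (clusteringalgorithmsmaster)
This resource contains implementations of clustering analysis algorithms, including clustering based on time-series analysis, mainly intended for analyzing data such as stock time series (clusteringalgorithmsmaster).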

Clustering (Rui Xu, Donald C. Wunsch II)
Clustering, a book written by Rui Xu and Donald C. Wunsch II.