# SparsifiedKMeans
KMeans for big data using preconditioning and sparsification, Matlab implementation. Uses the [KMeans clustering algorithm](https://en.wikipedia.org/wiki/K-means_clustering) (also known as [Lloyd's Algorithm](https://en.wikipedia.org/wiki/Lloyd%27s_algorithm) or "K Means" or "K-Means") but sparsifies the data in a special manner to achieve significant (and tunable) savings in computation time and memory.
The code provides `kmeans_sparsified` which is used much like the `kmeans` function from the Statistics toolbox in Matlab.
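
As a rough illustration of "used much like `kmeans`", the sketch below runs the Statistics toolbox routine on toy data. The commented-out `kmeans_sparsified(X, 2)` call is an assumption based only on the sentence above, not a confirmed signature, and the expected orientation of the data (samples as rows or as columns) may differ. Check `help kmeans_sparsified` for the actual arguments and options.

```matlab
% Toy data: two well-separated clusters, one observation per row
rng(1);                                   % reproducible random numbers
X = [randn(100, 2); randn(100, 2) + 5];

% Statistics toolbox call, shown for comparison
[idx, C] = kmeans(X, 2);                  % idx: cluster labels, C: 2-by-2 centers

% ASSUMED analogous call (signature and data orientation not verified here):
% [idx2, C2] = kmeans_sparsified(X, 2);
```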
There are three benefits:
1. The basic implementation is much faster than the Statistics toolbox version. We also have a few modern options that the toolbox version lacks; e.g., we implement [K-means++](https://en.wikipedia.org/wiki/K-means%2B%2B) for initialization. (Update: since 2015, Matlab has improved the speed of its routine and its initialization, so its version and ours are now comparable.)
2. We have a new variant, called sparsified KMeans, that preconditions and then samples the data. This version can be thousands of times faster and is designed for big data sets that are otherwise unmanageable.
3. The code also allows a big-data option. Instead of passing in a matrix of data, you give it the location of a .mat file, and the code will break the data into chunks. This is useful when the data is, say, 10 TB and your computer only has 6 GB of RAM. The data is loaded in smaller chunks (e.g., less than 6 GB); each chunk is preconditioned, sampled, and discarded from RAM before the next chunk is processed. The entire algorithm makes a single pass over the dataset.
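
The chunked precondition-and-sample idea in items 2 and 3 can be sketched with built-in Matlab functions. This is an illustration under stated assumptions, not the code path used by `kmeans_sparsified`: the toy sizes, the variable names, the choice of a sign-flipped Hadamard transform as the preconditioner, and the uniform choice of `m` entries per column are all picked for this example.

```matlab
% Illustrative sketch only; names and sizes are assumptions for this example.
rng(0);                                   % reproducible random numbers
p = 64;  n = 1000;                        % p features (a power of 2), n samples as columns
X = randn(p, n);                          % toy data standing in for a huge data set
fname = [tempname, '.mat'];               % hypothetical file name
save(fname, 'X', '-v7.3');                % v7.3 MAT-files support partial loading
clear X

% Preconditioner: random sign flips followed by an orthonormal Hadamard transform
d = sign(randn(p, 1));                    % random +/-1 entries
H = hadamard(p) / sqrt(p);                % orthonormal: H*H' is the identity

mf = matfile(fname);                      % handle to the file; no data loaded yet
chunkSize = 250;                          % columns per chunk (assumed to fit in RAM)
m = 16;                                   % entries kept per column
starts = 1:chunkSize:n;
parts = cell(1, numel(starts));           % sparse result of each chunk

for k = 1:numel(starts)
    first = starts(k);
    last  = min(first + chunkSize - 1, n);
    chunk = mf.X(:, first:last);          % load only this block of columns
    Y     = H * bsxfun(@times, d, chunk); % precondition the chunk
    nc    = size(Y, 2);
    rows  = zeros(m, nc);
    for j = 1:nc
        rows(:, j) = randperm(p, m)';     % m distinct row indices for column j
    end
    cols  = repmat(1:nc, m, 1);
    vals  = Y(sub2ind([p, nc], rows, cols));
    parts{k} = sparse(rows(:), cols(:), vals(:), p, nc);
end                                       % the dense chunk is overwritten each iteration

Ys = [parts{:}];                          % p-by-n sparse matrix, m nonzeros per column
delete(fname);                            % remove the temporary file
```

Only the sparse matrix `Ys` is kept, so memory use scales with `m` rather than `p`. The sketch stops at the sampling step; how the sampled data enter the k-means updates is described in the paper.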
*Note*: if you use our code in an academic paper, we appreciate it if you cite us:
"Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means", F. Pourkamali Anaraki and S. Becker, IEEE Trans. Info. Theory, 2017.
## Why use it?
For moderate to large data, we believe this is one of the fastest ways to run k-means. For extremely large data that cannot all fit into core memory of your computer, we believe there are almost no good alternatives (in theory and practice) to this code.
# Installation
Every time you start a new Matlab session, run `setup_kmeans` and it will correctly set the paths. The first time you run it, it may also compile some mex files; for this, you need a valid `C` compiler (see http://www.mathworks.com/support/compilers/R2015a/index.html).
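
A typical session start might look like the sketch below. The directory is a placeholder, and the `mex -setup C` line is an optional check that Matlab can find a C compiler before `setup_kmeans` tries to compile anything.

```matlab
% The path below is a placeholder; adjust it to where you unpacked the code.
cd('/path/to/SparsifiedKMeans');   % hypothetical location of this repository
mex -setup C                       % optional: confirm Matlab can find a C compiler
setup_kmeans                       % sets the paths; may compile mex files on first run
which kmeans_sparsified            % should print the location of kmeans_sparsified.m
```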
# Version
Current version is 2.1
# Authors
* [Prof. Stephen Becker](http://amath.colorado.edu/faculty/becker/), University of Colorado Boulder (Applied Mathematics)
* [Prof. Farhad Pourkamali Anaraki](http://www.pourkamali.com/), University of Massachusetts Lowell (Computer Science)
# Reference
[Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means](https://doi.org/10.1109/TIT.2017.2672725), F. Pourkamali Anaraki and S. Becker, IEEE Trans. Info. Theory, 2017. See also the [arXiv version](https://arxiv.org/abs/1511.00152)
BibTeX:

    @article{SparsifiedKmeans,
        title = {Preconditioned Data Sparsification for Big Data with Applications to {PCA} and {K}-means},
        author = {Pourkamali-Anaraki, F. and Becker, S.},
        year = 2017,
        doi = {10.1109/TIT.2017.2672725},
        journal = {IEEE Trans. Info. Theory},
        volume = 63,
        number = 5,
        pages = {2954--2974}
    }

# Related projects
* [sparsekmeans](https://github.com/EricKightley/sparsekmeans) by Eric Kightley is our joint project to implement the algorithm in Python, with support for out-of-memory operations.
* [sparseklearn](https://github.com/EricKightley/sparseklearn) generalizes this idea to other types of machine learning algorithms (also in Python).
# Further information
The images below are taken from the paper or from presentation slides; see the journal paper for full explanations.
## Example on synthetic data
![Example on synthetic data](figs/example.png?raw=true "Example on synthetic data")
## Main idea
![Main idea](figs/slides_mainIdea.jpg?raw=true "Explaining our concept")
## MNIST experiment
![MNIST experiment](figs/slides_experiment1.jpg?raw=true "Experiment 1")
![MNIST accuracy](figs/slides_experiment2.jpg?raw=true "Experiment 2")
## Infinite MNIST big data experiment
![MNIST2 accuracy](figs/slides_experiment3.jpg?raw=true "Experiment 3")
## Two-pass mode for increased accuracy
![Two pass](figs/slides_experiment4.jpg?raw=true "Experiment 4")
## Theory
![Theory](figs/slides_theory.jpg?raw=true "Theorems")