### Python implementation of the [Gap Statistic](http://www.web.stanford.edu/~hastie/Papers/gap.pdf)
[![PythonCI](https://github.com/milesgranger/gap_statistic/workflows/PythonCI/badge.svg?branch=master)](https://github.com/milesgranger/gap_statistic/actions?query=branch=master)
[![RustCI](https://github.com/milesgranger/gap_statistic/workflows/RustCI/badge.svg?branch=master)](https://github.com/milesgranger/gap_statistic/actions?query=branch=master)
[![Downloads](http://pepy.tech/badge/gap-stat)](http://pepy.tech/project/gap-stat)
[![Coverage Status](https://coveralls.io/repos/github/milesgranger/gap_statistic/badge.svg)](https://coveralls.io/github/milesgranger/gap_statistic)
[![Code Health](https://landscape.io/github/milesgranger/gap_statistic/master/landscape.svg?style=flat)](https://landscape.io/github/milesgranger/gap_statistic/master)
[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
---
#### Purpose
Dynamically identify the suggested number of clusters in a data-set
using the gap statistic.
---
### Full example available in a notebook [HERE](Example.ipynb)
---
#### Install:
Bleeding edge:
```commandline
pip install git+git://github.com/milesgranger/gap_statistic.git
```
PyPi:
```commandline
pip install --upgrade gap-stat
```
With Rust extension:
```commandline
pip install --upgrade gap-stat[rust]
```
---
#### Uninstall:
```commandline
pip uninstall gap-stat
```
---
### Methodology:
This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in ["Estimating the number of clusters in a data set via the gap statistic"](https://statweb.stanford.edu/~gwalther/gap) (Tibshirani et al.).
The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:
- Taking the `k` maximizing the Gap value, which is calculated for each `k`. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
- Taking the smallest `k` such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure `diff = Gap(k) - Gap(k+1) + s(k+1)` is calculated for each `k`; the parallel here, then, is to take the smallest `k` for which `diff` is positive. Note that in some cases this can be true for the entire range of `k`.
- Taking the `k` maximizing the Gap\* value, an alternative measure suggested in ["A comparison of Gap statistic definitions with and
with-out logarithm function"](https://core.ac.uk/download/pdf/12172514.pdf) by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistics suffers, and can also suggest an optimal value for k for cases in which Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap\* in the case of overlapped clusters, due to its tendency to overestimate the number of clusters.
Note that none of the above methods is guaranteed to find an optimal value for `k`, and that they often contradict one another. Rather, they can provide more information on which to base your choice of `k`, which should take numerous other factors into account.
---
### Use:
First, construct an `OptimalK` object. Optional intialization parameters are:
- `n_jobs` - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
- `parallel_backend` - Possible values are `joblib`, `rust` or `multiprocessing` for the built-in Python backend. If `parallel_backend == 'rust'` it will use all cores.
- `clusterer` - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
- `clusterer_kwargs` - Any keyword arguments to be forwarded to the custom clusterer function on each call.
An example intialization:
```python
optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')
```
After the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:
- `X` - A pandas dataframe or numpy array of data points of shape `(n_samples, n_features)`.
- `n_refs` - The number of random reference data sets to use as inertia reference to actual data. Optional.
- `cluster_array` - A 1-dimensional iterable of integers; each representing `n_clusters` to try on the data. Optional.
For example:
```python
import numpy as np
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
```
After performing the search procedure, a DataFrame of gap values and other usefull statistics for each passed cluster count is now available as the `gap_df` attributre of the `OptimalK` object:
```python
optimalK.gap_df.head()
```
The columns of the dataframe are:
- `n_clusters` - The number of clusters for which the statistics in this row were calculated.
- `gap_value` - The Gap value for this `n`.
- `gap*` - The Gap\* value for this `n`.
- `ref_dispersion_std` - The standard deviation of the reference distributions for this `n`.
- `sk` - The standard error of the Gap statistic for this `n`.
- `sk*` - The standard error of the Gap\* statistic for this `n`.
- `diff` - The diff value for this `n` (see the methodology section for details).
- `diff*` - The diff\* value for this `n` (corresponding to the diff value for Gap\*).
Additionally, the relation between the above measures and the number of clusters can be plotted by calling the `OptimalK.plot_results()` method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots:
- A plot of the Gap value versus n, the number of clusters.
- A plot of diff versus n.
- A plot of the Gap\* value versus n, the number of clusters.
- A plot of the diff\* value versus n.
---
没有合适的资源?快使用搜索试试~ 我知道了~
gap_statistic:动态获取数据中的建议聚类,以进行无监督学习
共28个文件
py:5个
rs:4个
yml:3个
2星 需积分: 32 8 下载量 174 浏览量
2021-02-03
04:11:07
上传
评论 1
收藏 64KB ZIP 举报
温馨提示
Python实现 目的 使用差距统计量动态识别数据集中建议的聚类数量。 在笔记本上使用完整的例子 安装: 出血边缘: pip install git+git://github.com/milesgranger/gap_statistic.git PyPi: pip install --upgrade gap-stat 使用Rust扩展名: pip install --upgrade gap-stat[rust] 卸载: pip uninstall gap-stat 方法: 该程序包提供了几种方法,可根据 (Tibshirani等人)中介绍的Gap方法,协助选择给定数据集的最佳簇。 所实现的方法可以使用一系列提供的k个值对给定的数据集进行聚类,并为您提供统计信息,以帮助您为数据集选择正确的聚类数。 三种可能的方法是: 取k最大化针对每个k计算的Gap值。 但是,这并非总是可能的,因为对于许多数据集,此值是单调增加或减少的。 取最小的k ,使得Gap(k)> = Gap(k + 1)-s(k + 1)。 这是Tibshirani等人建议的方法。 (有关详细信息,请咨询本文
资源详情
资源评论
资源推荐
收起资源包目录
gap_statistic-master.zip (28个子文件)
gap_statistic-master
MANIFEST.in 43B
LICENSE-MIT 1KB
pyproject.toml 77B
pytest.ini 36B
Cargo.lock 22KB
.github
workflows
python.yml 566B
release.yml 1KB
rust.yml 1KB
tests
test_formatting.py 682B
test_optimalK.py 5KB
src
tests
mod.rs 6KB
lib.rs 2KB
kmeans
mod.rs 10KB
gap_statistic
mod.rs 5KB
Cargo.toml 415B
setup.cfg 57B
UNLICENSE 1KB
setup.py 2KB
.gitignore 1KB
.cargo
config 111B
CHANGELOG.md 2KB
Example.ipynb 46KB
conda-recipe
conda_build_config.yaml 23B
meta.yaml 617B
README.md 6KB
.gitattributes 52B
gap_statistic
__init__.py 70B
optimalK.py 15KB
共 28 条
- 1
羊欲穷
- 粉丝: 88
- 资源: 4591
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功
评论1