# Chapter 10: Grouping unlabeled items using k-means clustering
## k-means clustering
```text
Pros: Easy to implement
Cons: Can converge at local minima; slow on very large datasets
Works with: Numeric values
```
k-means is an algorithm that finds k clusters in a given dataset. The number of
clusters k is chosen by the user. Each cluster is described by a single point known as the
*centroid*, so called because it sits at the center (the mean) of all the points in the cluster.
The k-means algorithm works like this. First, each of the k centroids is placed at a
random point. Next, each point in the dataset is assigned to the cluster whose centroid is closest to it. After this step, every centroid is updated to the mean of all the points in its cluster. The assignment and update steps repeat until no point changes cluster.
Pseudocode for the k-means algorithm:
```text
Create k points for starting centroids (often randomly)
While any point has changed cluster assignment
    for every point in our dataset:
        for every centroid
            calculate the distance between the centroid and point
        assign the point to the cluster with the lowest distance
    for every cluster calculate the mean of the points in that cluster
        assign the centroid to the mean
```
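
The loop above can be sketched in Python with NumPy. This is a minimal illustration, not the code from this chapter's `kMeans.py`: the function name `k_means`, its `seed` and `max_iter` parameters, the use of Euclidean distance, and the choice to start from k randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def k_means(data, k, seed=0, max_iter=100):
    """Cluster the rows of `data` into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Usage: two well-separated pairs of points end up in two clusters.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = k_means(points, k=2)
```

Which cluster receives index 0 depends on the random starting points, so compare labels with each other rather than against fixed values.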
## Improving cluster performance with postprocessing
k-means can converge and still produce a poor cluster assignment. This happens because k-means converges on a local minimum, not necessarily the global minimum. (A local minimum means that the result is good but not necessarily the best possible. A global minimum is the best possible.)
You can use SSE (sum of squared error) to measure the quality of your cluster assignments. A lower SSE means that points are closer to their centroids, and you've done a better job of clustering.
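
As a rough illustration of the measure, SSE can be computed as the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. The function name `sse` and its argument layout are assumptions made for this sketch.

```python
import numpy as np

def sse(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = data - centroids[labels]  # centroids[labels] picks one centroid per point
    return float((diffs ** 2).sum())

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

# Two clusters, each centroid at the mean of its two points.
good = sse(points, np.array([[0.0, 1.0], [10.0, 11.0]]), np.array([0, 0, 1, 1]))

# One cluster for everything, centroid at the overall mean.
bad = sse(points, np.array([[5.0, 6.0]]), np.array([0, 0, 0, 0]))

print(good, bad)  # 4.0 204.0
```

The two-cluster assignment has the lower SSE, which matches the intuition that its points sit closer to their centroids.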
## Bisecting k-means
Pseudocode for the bisecting k-means algorithm:
```text
Start with all the points in one cluster
While the number of clusters is less than k
    for every cluster
        measure total error
        perform k-means clustering with k=2 on the given cluster
        measure total error after k-means has split the cluster in two
    choose the cluster split that gives the lowest error and commit this split
```
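
The following is one possible Python sketch of the procedure above, using NumPy. It is not the implementation from this chapter's `kMeans.py`: the names `two_means`, `cluster_sse`, and `bisecting_k_means`, the `seed` parameter, and the use of SSE with Euclidean distance as the "total error" are assumptions made for the example.

```python
import numpy as np

def two_means(data, rng, max_iter=100):
    """Plain k-means with k=2 on `data`; returns (centroids, labels)."""
    centroids = data[rng.choice(len(data), size=2, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged
        labels = new_labels
        for j in range(2):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids, labels

def cluster_sse(points, centroid):
    """Sum of squared distances from `points` to a single centroid."""
    return float(((points - centroid) ** 2).sum())

def bisecting_k_means(data, k, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(data), dtype=int)  # start with everything in cluster 0
    centroids = [data.mean(axis=0)]
    while len(centroids) < k:
        best = None
        for c in range(len(centroids)):
            idx = np.flatnonzero(labels == c)
            if len(idx) < 2:
                continue  # a single point cannot be split
            sub_centroids, sub_labels = two_means(data[idx], rng)
            if np.unique(sub_labels).size < 2:
                continue  # the split left one half empty
            # Total error = error of the two halves + error of the untouched clusters.
            split_err = sum(cluster_sse(data[idx][sub_labels == j], sub_centroids[j])
                            for j in range(2))
            rest_err = sum(cluster_sse(data[labels == o], centroids[o])
                           for o in range(len(centroids)) if o != c)
            if best is None or split_err + rest_err < best[0]:
                best = (split_err + rest_err, c, idx, sub_centroids, sub_labels)
        if best is None:
            break  # no cluster can be split further
        _, c, idx, sub_centroids, sub_labels = best
        # Commit: cluster c keeps half 0, half 1 becomes a new cluster.
        labels[idx[sub_labels == 1]] = len(centroids)
        centroids[c] = sub_centroids[0]
        centroids.append(sub_centroids[1])
    return np.array(centroids), labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
                   [10.0, 11.0], [20.0, 0.0], [20.0, 1.0]])
centroids, labels = bisecting_k_means(points, k=3)
```

Each 2-way split still depends on random starting centroids, so different seeds can give different final clusters.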