Python kNN classification of the MNIST dataset for k = 1-120_Python implementation of the C4.5 algorithm on the MNIST handwritten digit dataset (resource) - CSDN文库

2 files in total

py: 2

Uploaded 2017-03-14 10:13:25. 3 KB RAR. 145 views. Points required: 34.

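The k-nearest neighbors (kNN) algorithm is a simple yet effective supervised learning method used for both classification and regression. MNIST is a classic benchmark for handwritten digit recognition, with 60,000 training samples and 10,000 test samples. This project walks through implementing kNN in Python to classify MNIST and observing how k values from 1 to 120 affect classification performance.

First, import the required libraries: `numpy` for numerical computation, `pandas` for data handling, `matplotlib` for visualization, and from `sklearn` the `datasets` module to load MNIST and the `metrics` module to evaluate the model:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```

Next, load the MNIST dataset:

```python
mnist = fetch_openml('mnist_784', version=1, cache=True)
X, y = mnist['data'], mnist['target']
```

`fetch_openml` returns the raw pixel intensities (0 to 255) and the labels as strings. Because every feature is a pixel on the same scale, kNN can be run without further preprocessing, although dividing by 255 is a common normalization step. To analyze the effect of different k values, split the data once and define a function that takes k and then builds, fits, and tests a kNN model. Note that this splits the combined 70,000 samples 80/20 rather than using the standard 60,000/10,000 partition:

```python
# Split once; the fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate_knn(k):
    # Build and fit the model
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # Predict and evaluate
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
```

Now iterate over the k range (1 to 120) and record the accuracy for each k. Each call predicts 14,000 test samples against 56,000 training samples, so the full sweep is slow:

```python
accuracies = []
for k in range(1, 121):
    accuracy = evaluate_knn(k)
    accuracies.append((k, accuracy))
```

Plot the results to see the relationship between k and classification performance:

```python
plt.figure(figsize=(12, 6))
plt.plot([k for k, _ in accuracies], [accuracy for _, accuracy in accuracies])
plt.xlabel('k')
plt.ylabel('Classification accuracy')
plt.title('kNN classification performance on MNIST (k = 1-120)')
plt.grid(True)
plt.show()
```

In general, a small k makes the model sensitive to noise in individual training samples (overfitting), while a large k over-smooths the decision boundary and loses local detail (underfitting). Choosing k is therefore a trade-off between the two, and the curve above helps locate the k with the best accuracy.

kNN performance also depends on the distance metric, dimensionality reduction, and outlier handling. Euclidean distance is the most common metric, but distances become less discriminative in high-dimensional spaces (the "curse of dimensionality"); other measures such as Manhattan distance or cosine similarity may help. Principal component analysis (PCA) can reduce the computational cost and sometimes improves accuracy. t-SNE is mainly a visualization technique, and scikit-learn's implementation cannot project unseen test samples, so it is less suitable as a preprocessing step for kNN.

In summary, this project shows how to use scikit-learn to run kNN on MNIST and how varying k affects model performance. It gives an intuitive picture of how to choose k and a baseline for further tuning.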
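The remarks on distance metrics and PCA can be tried out with a short sketch. To keep it quick and offline it uses scikit-learn's bundled 8x8 `load_digits` set as a stand-in for MNIST, and the helper name `cv_accuracy`, the value k = 3, and the 20 PCA components are illustrative assumptions rather than settings from the project. Scores come from 5-fold cross-validation, so no separate test split is consumed while comparing configurations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: the small bundled digits set (1,797 samples, 64 features)
# stands in for MNIST so the comparison finishes in seconds.
X, y = load_digits(return_X_y=True)


def cv_accuracy(metric, n_components=None, k=3):
    """Mean 5-fold cross-validated accuracy for one kNN configuration.

    Hypothetical helper: optionally applies PCA before the classifier.
    """
    steps = []
    if n_components is not None:
        steps.append(PCA(n_components=n_components, random_state=0))
    steps.append(KNeighborsClassifier(n_neighbors=k, metric=metric))
    model = make_pipeline(*steps)
    return cross_val_score(model, X, y, cv=5).mean()


results = {
    "euclidean, raw pixels": cv_accuracy("euclidean"),
    "manhattan, raw pixels": cv_accuracy("manhattan"),
    "euclidean, PCA(20)": cv_accuracy("euclidean", n_components=20),
}
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
```

Putting PCA inside the pipeline means it is refitted on each training fold, so the held-out fold never influences the projection. Whether a different metric or PCA actually helps depends on the data, which is why the sketch compares them instead of assuming a winner.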

kNN.rar (2 files)

kNN

data_util.py 3KB

kNN.py 3KB

# -*- coding: utf-8 -*-
"""
kNNClassify
Input:  newInput: vector to compare to existing dataset (1xN)
        dataSet:  size m data set of known vectors (MxN)
        labels:   data set labels (1xM vector)
        k:        number of neighbors to use for comparison
Output: the most popular class label
"""
import time

import matplotlib.pyplot as plt
import numpy as np

from data_util import DataUtils

trainfile_X = 'train-images.idx3-ubyte'
trainfile_y = 'train-labels.idx1-ubyte'
testfile_X = 't10k-images.idx3-ubyte'
testfile_y = 't10k-labels.idx1-ubyte'


# kNN classification function
def kNNClassify(newInput, dataSet, labels, k):
    ## step 1: compute the Euclidean distance from newInput to every sample
    # broadcasting subtracts newInput from each row of dataSet
    diff = newInput - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis=1)  # sum is performed by row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort an array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # maps each label to its number of votes
    for i in range(k):
        ## step 3: take the label of the i-th nearest sample
        voteLabel = labels[sortedDistIndices[i]]
        ## step 4: count how often each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the label with the most votes
    return max(classCount, key=classCount.get)


# Convert a 32x32 text image of digit characters into a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    with open(filename) as fileIn:
        for row in range(rows):
            lineStr = fileIn.readline()
            for col in range(cols):
                imgVector[0, row * cols + col] = int(lineStr[col])
    return imgVector


# Test the handwriting classifier
def testHandWritingClass():
    ## step 1: load data
    ISOTIMEFORMAT = '%Y-%m-%d %X'
    load_time = time.strftime(ISOTIMEFORMAT, time.localtime())
    print("step 1: loading data...", "load_time:%s" % load_time)
    train_x = DataUtils(filename=trainfile_X).getImage()
    train_y = DataUtils(filename=trainfile_y).getLabel()
    test_x = DataUtils(testfile_X).getImage()
    test_y = DataUtils(testfile_y).getLabel()

    ## step 2: training (kNN has no training phase, so nothing is done here)
    train_time = time.strftime(ISOTIMEFORMAT, time.localtime())
    print("step 2: training...", "train_time:%s" % train_time)

    ## step 3: testing
    test_time = time.strftime(ISOTIMEFORMAT, time.localtime())
    print("step 3: testing...", "test_time:%s" % test_time)
    numTestSamples = test_x.shape[0]
    result = []
    for m in range(1, 121):
        matchCount = 0
        for i in range(numTestSamples):
            predict = kNNClassify(test_x[i], train_x, train_y, m)
            if predict == test_y[i]:
                matchCount += 1
        error = 1 - float(matchCount) / numTestSamples
        result_time = time.strftime(ISOTIMEFORMAT, time.localtime())
        print(m, "result_time:%s" % result_time)
        print('The classify error is: %.4f' % error)
        result.append(error)

    ## step 4: plot the error rate against k
    x = range(1, 121)
    y = result
    plt.plot(x, y)
    plt.show()


if __name__ == '__main__':
    testHandWritingClass()
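The test loop in kNN.py recomputes and re-sorts every distance for each of the 120 values of k, although the neighbor ordering of a test sample does not depend on k. A minimal sketch of a single-pass alternative follows: it sorts once per test sample and adds one vote at a time, checking the running majority after each vote. The function name `error_rates_for_all_k` is hypothetical, labels are assumed to be non-negative integers, and ties go to the smallest label via `argmax`, which may differ from the tie-breaking in the script. It compares squared distances, which gives the same ordering as Euclidean distance.

```python
import numpy as np


def error_rates_for_all_k(train_x, train_y, test_x, test_y, max_k):
    """Return the error rate for k = 1..max_k, sorting distances once per test sample.

    Hypothetical helper; assumes labels are non-negative integers.
    """
    train_x = np.asarray(train_x, dtype=np.float64)
    test_x = np.asarray(test_x, dtype=np.float64)
    train_y = np.asarray(train_y, dtype=np.int64)
    test_y = np.asarray(test_y, dtype=np.int64)

    n_classes = int(train_y.max()) + 1
    errors = np.zeros(max_k, dtype=np.int64)

    for x, true_label in zip(test_x, test_y):
        # Squared Euclidean distance to every training sample
        sq_dist = ((train_x - x) ** 2).sum(axis=1)
        # Indices of the max_k nearest samples, closest first
        nearest = np.argsort(sq_dist, kind="stable")[:max_k]

        votes = np.zeros(n_classes, dtype=np.int64)
        for k, idx in enumerate(nearest, start=1):
            votes[train_y[idx]] += 1
            # After k votes, argmax is the kNN prediction for this k
            # (ties resolve to the smallest label: an assumption)
            if votes.argmax() != true_label:
                errors[k - 1] += 1

    return errors / len(test_x)


# Synthetic demo data (not MNIST): three Gaussian blobs in 2-D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])
train_y = np.repeat(np.arange(3), 30)
train_x = centers[train_y] + rng.normal(size=(90, 2))
test_y = np.repeat(np.arange(3), 10)
test_x = centers[test_y] + rng.normal(size=(30, 2))

rates = error_rates_for_all_k(train_x, train_y, test_x, test_y, max_k=15)
print(rates)
```

With real MNIST arrays in place of the synthetic blobs, `rates` could be plotted against `range(1, max_k + 1)` in the same way the script plots `result`.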
