
K-Nearest Neighbors (Part 1)


Simply put, the K-nearest-neighbors algorithm classifies samples by measuring the distances between their feature values.

1. Principle

In the k-nearest-neighbors algorithm, once the training set, the neighbor count k, the distance metric, and the decision rule are fixed, the algorithm in effect uses the training set to carve the feature space into subregions, with each training sample claiming a portion of the space. In the nearest-neighbor case (k = 1), a test sample that falls inside a training sample's region is assigned that sample's class.

2. Algorithm

Given a training set whose data and labels are known, the algorithm compares the features of an input test sample against the corresponding features of every sample in the training set, finds the K training samples most similar to it, and predicts the class that appears most often among those K samples. The procedure can be described as follows (a toy NumPy sketch of these steps appears right after the list):

1) Compute the distance between the test sample and every training sample;

2) Sort the distances in increasing order;

3) Select the K points with the smallest distances;

4) Count how often each class appears among those K points;

5) Return the most frequent class among those K points as the predicted class of the test sample.
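
As an illustration, the five steps map directly onto a few lines of NumPy. The points and labels below are made up for this sketch:

import numpy as np

# toy training set: four 2-D points from two classes
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
test = np.array([0.2, 0.1])
k = 3

# step 1: Euclidean distance from the test point to every training point
dists = np.sqrt(((train - test) ** 2).sum(axis=1))
# steps 2-3: sort ascending and keep the k nearest indices
nearest = dists.argsort()[:k]
# steps 4-5: majority vote among the k nearest labels
votes = {}
for i in nearest:
    votes[labels[i]] = votes.get(labels[i], 0) + 1
prediction = max(votes, key=votes.get)
print(prediction)   # -> 'B'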

3. Python Implementation

(1) Predicting matches for a dating site

About the data: each line of the file holds three features followed by a class label, all tab-separated. The features are the number of frequent flyer miles earned per year, the percentage of time spent playing video games, and the liters of ice cream consumed per week; the label is the liking level (1 = did not like, 2 = liked in small doses, 3 = liked in large doses). A few sample rows:
40920 8.326976 0.953952 3
14488 7.153469 1.673904 2
26052 1.441871 0.805124 1
75136 13.147394 0.428964 1
38344 1.669788 0.134296 1
72993 10.141740 1.032955 1
35948 6.830792 1.213192 3
42666 13.276369 0.543880 3
67497 8.631577 0.749278 1


Download datingTestSet2.txt: http://download.csdn.net/detail/jay_xio/8543027

from numpy import *
import operator

def file2matrix(filename):
    """Parse the tab-separated data file into a feature matrix and a label list."""
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))    # one row per sample, three feature columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]          # first three fields are features
        classLabelVector.append(int(listFromLine[-1]))   # last field is the class label
        index += 1
    return returnMat, classLabelVector
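
# Hypothetical quick check (assumes datingTestSet2.txt is in the working
# directory; the copy linked above holds 1000 samples):
#   datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
#   datingDataMat.shape    # (1000, 3)
#   datingLabels[0:5]      # [3, 2, 1, 1, 1], matching the sample rows above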

def classify0(inX, dataSet, labels, k):
    """Classify inX against dataSet using the K-nearest-neighbors vote."""
    dataSetSize = dataSet.shape[0]
    # step 1: Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # steps 2-3: indices of the training samples sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # step 4: tally the labels of the k nearest samples
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # step 5: return the label with the most votes
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
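
# Toy sanity check (made-up points, mirroring the sketch in section 2):
#   group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
#   classify0([0.2, 0.1], group, ['A', 'A', 'B', 'B'], 3)   # -> 'B'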

def autoNorm(dataSet):
    """Rescale each feature column to [0, 1]: newValue = (oldValue - min) / (max - min)."""
    minVals = dataSet.min(0)    # column-wise minima
    maxVals = dataSet.max(0)    # column-wise maxima
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
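
# Why normalize? The flyer-miles column (tens of thousands) would otherwise
# dominate the Euclidean distance over the other two features (values below
# about 20). ranges and minVals are returned so a new sample can be rescaled
# with the same statistics before being passed to classify0.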

def datingClassTest():
    """Hold out the first 10% of the samples as a test set and report the error rate."""
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix(r'E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)    # size of the held-out test set
    errorCount = 0.0
    for i in range(numTestVecs):
        # train on the last 90% of the rows, test on row i from the first 10%
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 5)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

datingClassTest()

Output:

the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
...
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.050000
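
With the error rate around 5%, the classifier can be applied to a brand-new profile. The helper below is only a sketch in the spirit of the book's classifyPerson (it is not part of the original post); it rescales the new sample with the same ranges and minVals that autoNorm computed from the training data:

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    ffMiles = float(input('frequent flyer miles earned per year? '))
    percentTats = float(input('percentage of time spent playing video games? '))
    iceCream = float(input('liters of ice cream consumed per week? '))
    datingDataMat, datingLabels = file2matrix(r'E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the new sample with the training statistics before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 5)
    print('You will probably like this person:', resultList[classifierResult - 1])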

Plotting the raw data:
Method 1:

import kNN
import matplotlib.pyplot as plt
from numpy import *

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
ax = fig.add_subplot(111)
# the third and fourth arguments scale marker size and color by class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()
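
In this call the third and fourth arguments to scatter are the marker sizes and colors: multiplying the label vector by 15.0 gives each class its own size and color value. The colors are drawn from a colormap, though, so nothing ties them back to named classes; method 2 below addresses that with a proper legend.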

Output:
[Figure: scatter plot of video-game time percentage (x) vs. liters of ice cream (y); marker size and color vary with the class label]

Method 2:

import kNN
import matplotlib.pyplot as plt
from numpy import *

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
# add_subplot(nmi) divides the figure into an n*m grid of subplots and
# places the following plot in position i
ax = fig.add_subplot(111)
l = datingDataMat.shape[0]
# coordinate lists for class 1, class 2, and class 3
X1 = []
Y1 = []
X2 = []
Y2 = []
X3 = []
Y3 = []
for i in range(l):
    if int(datingLabels[i]) == 1:
        X1.append(datingDataMat[i, 1])
        Y1.append(datingDataMat[i, 2])
    elif int(datingLabels[i]) == 2:
        X2.append(datingDataMat[i, 1])
        Y2.append(datingDataMat[i, 2])
    else:
        X3.append(datingDataMat[i, 1])
        Y3.append(datingDataMat[i, 2])
# draw one scatter per class, using columns 1 and 2 of datingDataMat as
# x and y; c='color' fixes the point color for that class
type1 = ax.scatter(X1, Y1, c='red')
type2 = ax.scatter(X2, Y2, c='green')
type3 = ax.scatter(X3, Y3, c='blue')
ax.axis([-2, 20, -0.2, 1.75])
ax.legend([type1, type2, type3], ["Did Not Like", "Liked in Small Doses", "Liked in Large Doses"], loc=2)
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()

Output:
[Figure: three-color scatter of video-game time percentage (x) vs. liters of ice cream (y), with a legend for the three liking levels]
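
Method 2 spends a few extra lines splitting the samples into per-class coordinate lists, but the payoff is one scatter handle per class, which is exactly what ax.legend needs to label the three liking levels explicitly.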