K-Nearest Neighbors Algorithm (Part 1)
程序员文章站
2022-07-14 11:54:41
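Put simply, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between feature values.
I. Principle
In the k-nearest neighbors algorithm, once the training set, the number of neighbors k, the distance metric, and the decision rule are fixed, the algorithm effectively uses the training set to partition the feature space into sub-regions, with each training sample occupying one region. In the nearest-neighbor case (k = 1), a test sample that falls inside the neighborhood of a training sample is assigned that sample's class.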
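To make the "distance metric" and "decision rule" concrete, here is one common pairing, written out as a sketch. It assumes Euclidean distance and majority voting, which is what the Python code later in this post uses; the paragraph above does not fix a particular choice. The symbols are introduced here for illustration: $x$ and $x'$ are feature vectors with $n$ components, $N_k(x)$ is the set of the $k$ training samples closest to $x$, and $y_i$ is the label of training sample $x_i$.

```latex
d(x, x') = \sqrt{\sum_{j=1}^{n} \left(x_j - x'_j\right)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{(x_i,\, y_i) \in N_k(x)} \mathbf{1}\left[\, y_i = c \,\right]
```

With $k = 1$ the sum has a single term, so the rule reduces to copying the label of the single closest training sample.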
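II. Algorithm
Given a training set whose data and labels are known, the algorithm takes a test sample, compares its features with the corresponding features of the training samples, and finds the K training samples most similar to it. The test sample is then assigned the class that occurs most often among those K samples. The procedure is:
1) Compute the distance between the test sample and every training sample;
2) Sort the training samples by increasing distance;
3) Select the K points with the smallest distances;
4) Count how often each class occurs among these K points;
5) Return the most frequent class among the K points as the predicted class of the test sample.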
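The five steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the function name `knn_predict`, the toy points, and the labels "A" and "B" are made up for this example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter


def knn_predict(query, train_points, train_labels, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1) Distance from the query to every training point (Euclidean, assumed).
    distances = [math.dist(query, point) for point in train_points]
    # 2) Training indices sorted by increasing distance.
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # 3) Keep the k closest points.
    nearest = order[:k]
    # 4) Count how often each class occurs among them.
    votes = Counter(train_labels[i] for i in nearest)
    # 5) Return the most frequent class.
    return votes.most_common(1)[0][0]


# Toy data, invented for this sketch.
train_points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
train_labels = ["A", "A", "B", "B"]

print(knn_predict((0.1, 0.2), train_points, train_labels, k=3))  # prints B
```

With k = 3, the query (0.1, 0.2) has two "B" neighbors and one "A" neighbor, so the vote returns "B".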
III. Python Implementation
(1) Predicting match quality on a dating website
About the experimental data
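Frequent flyer miles earned per year | Percentage of time spent playing video games | Liters of ice cream consumed per week | Degree of liking (1 = did not like, 2 = liked in small doses, 3 = liked in large doses) |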
---|---|---|---|
40920 | 8.326976 | 0.953952 | 3 |
14488 | 7.153469 | 1.673904 | 2 |
26052 | 1.441871 | 0.805124 | 1 |
75136 | 13.147394 | 0.428964 | 1 |
38344 | 1.669788 | 0.134296 | 1 |
72993 | 10.141740 | 1.032955 | 1 |
35948 | 6.830792 | 1.213192 | 3 |
… | … | … | … |
42666 | 13.276369 | 0.543880 | 3 |
67497 | 8.631577 | 0.749278 | 1 |
Download datingTestSet2.txt: http://download.csdn.net/detail/jay_xio/8543027
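The three feature columns sit on very different scales, which matters for a distance-based classifier. The sketch below uses only the nine sample rows shown in the table (not the full data file) to illustrate why the implementation normalizes the features before classifying. The helper name `min_max_normalize` is made up for this example, and the minimums and ranges are computed from these nine rows alone, so the scaled values will differ from those obtained on the full data set.

```python
import math

# Feature columns of the nine sample rows shown in the table above.
samples = [
    (40920, 8.326976, 0.953952),
    (14488, 7.153469, 1.673904),
    (26052, 1.441871, 0.805124),
    (75136, 13.147394, 0.428964),
    (38344, 1.669788, 0.134296),
    (72993, 10.141740, 1.032955),
    (35948, 6.830792, 1.213192),
    (42666, 13.276369, 0.543880),
    (67497, 8.631577, 0.749278),
]


def min_max_normalize(rows):
    """Rescale every column to [0, 1] via (value - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        tuple((v - lo) / r for v, lo, r in zip(row, mins, ranges))
        for row in rows
    ]


# Distance between the first two rows before scaling.
raw = math.dist(samples[0], samples[1])
miles_only = abs(samples[0][0] - samples[1][0])

# Distance between the same two rows after scaling.
normalized = min_max_normalize(samples)
scaled = math.dist(normalized[0], normalized[1])

print(f"raw distance: {raw:.4f} (difference in miles alone: {miles_only})")
print(f"distance after min-max scaling: {scaled:.4f}")
```

Before scaling, the distance between the first two rows is almost exactly the difference in flyer miles, so the other two features barely influence which neighbors are "nearest". After scaling, all three features contribute on a comparable footing.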
# kNN.py (the plotting scripts import this module as kNN)
import operator

import numpy as np


def file2matrix(filename):
    """Parse a tab-separated data file into a feature matrix and a label list."""
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    for index, line in enumerate(arrayOLines):
        listFromLine = line.strip().split('\t')
        # The first three fields are features; the last field is the class label.
        returnMat[index, :] = [float(value) for value in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector


def classify0(inX, dataSet, labels, k):
    """Return the majority label among the k rows of dataSet nearest to inX."""
    # Euclidean distance from inX to every training row (broadcast over rows).
    diffMat = inX - dataSet
    sqDistances = (diffMat ** 2).sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training rows, ordered by increasing distance.
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the classes by vote count, largest first.
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def autoNorm(dataSet):
    """Min-max scale each column to [0, 1]; also return the ranges and minimums."""
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals


def datingClassTest():
    hoRatio = 0.10  # fraction of the data held out for testing
    # Raw string, so the backslashes in the Windows path are not treated as escapes.
    datingDataMat, datingLabels = file2matrix(r'E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the remaining rows are the training set.
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 5)
        print("the classifier came back with : %d,the real answer is :%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))


# Run the test only when executed directly, not when imported as a module.
if __name__ == '__main__':
    datingClassTest()
Output
the classifier came back with : 3,the real answer is :3
the classifier came back with : 2,the real answer is :2
the classifier came back with : 1,the real answer is :1
the classifier came back with : 1,the real answer is :1
...
the classifier came back with : 3,the real answer is :3
the classifier came back with : 3,the real answer is :3
the classifier came back with : 2,the real answer is :2
the classifier came back with : 2,the real answer is :1
the classifier came back with : 1,the real answer is :1
the total error rate is: 0.050000
Plotting the raw data:
Method 1:
import matplotlib.pyplot as plt
from numpy import array

import kNN

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
ax = fig.add_subplot(111)
# x: video game time, y: ice cream consumed; marker size and color both scale with the class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()
Output: a scatter plot in which marker size and color vary with the class label.
Method 2:
import matplotlib.pyplot as plt

import kNN

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
# add_subplot(nmi) splits the figure into an n*m grid of subplots; i selects the cell that receives the next plot
ax = fig.add_subplot(111)
l = datingDataMat.shape[0]
# Coordinate lists for the first, second, and third classes
X1 = []
Y1 = []
X2 = []
Y2 = []
X3 = []
Y3 = []
for i in range(l):
    if int(datingLabels[i]) == 1:
        X1.append(datingDataMat[i, 1])
        Y1.append(datingDataMat[i, 2])
    elif int(datingLabels[i]) == 2:
        X2.append(datingDataMat[i, 1])
        Y2.append(datingDataMat[i, 2])
    else:
        X3.append(datingDataMat[i, 1])
        Y3.append(datingDataMat[i, 2])
# Draw the scatter plot; the coordinates are the second and third columns of datingDataMat, and c='color' sets the point color
type1 = ax.scatter(X1, Y1, c='red')
type2 = ax.scatter(X2, Y2, c='green')
type3 = ax.scatter(X3, Y3, c='blue')
ax.axis([-2, 20, -0.2, 1.75])
ax.legend([type1, type2, type3], ["Did Not Like", "Liked in Small Doses", "Liked in Large Doses"], loc=2)
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()
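Output: a scatter plot with the three classes drawn in red, green, and blue, and the legend in the upper-left corner.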
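Method 2 keeps six separate lists, one pair per class. As a possible alternative, the grouping can be done with a dictionary keyed by label, so the same code works for any number of classes. The sketch below is an assumption-laden illustration: `group_by_label` and `plot_groups` are hypothetical helper names, the four points are taken from the first rows of the data table, and matplotlib is imported inside `plot_groups` so that the grouping logic can be run without it.

```python
from collections import defaultdict

# Legend text and colors as used in method 2, keyed by class label.
NAMES = {1: "Did Not Like", 2: "Liked in Small Doses", 3: "Liked in Large Doses"}
COLORS = {1: "red", 2: "green", 3: "blue"}


def group_by_label(points, labels):
    """Return {label: (xs, ys)} so each class can be drawn with one scatter call."""
    groups = defaultdict(lambda: ([], []))
    for (x, y), label in zip(points, labels):
        xs, ys = groups[label]
        xs.append(x)
        ys.append(y)
    return dict(groups)


def plot_groups(groups, names, colors):
    """Draw one scatter series per class (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    for label in sorted(groups):
        xs, ys = groups[label]
        ax.scatter(xs, ys, c=colors[label], label=names[label])
    ax.set_xlabel("Percentage of Time Spent Playing Video Games")
    ax.set_ylabel("Liters of Ice Cream Consumed Per Week")
    ax.legend(loc="upper left")
    plt.show()


# (video game %, ice cream liters) and labels from the first four table rows.
points = [
    (8.326976, 0.953952),
    (7.153469, 1.673904),
    (1.441871, 0.805124),
    (13.147394, 0.428964),
]
labels = [3, 2, 1, 1]

groups = group_by_label(points, labels)
print(sorted(groups))  # prints [1, 2, 3]
# plot_groups(groups, NAMES, COLORS)  # uncomment to draw the figure
```

To use this with the full data set, the points would come from columns 1 and 2 of `datingDataMat` and the labels from `datingLabels`.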