Machine Learning: Recognizing Digits with kNN
程序员文章站
2023-12-29 13:09:52
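Recognizing Digits
How It Works
A test sample is classified by computing its distance to every training sample:
- Compute the distance between the test sample and every training sample (the distance formula from second-grade math: distance squared = (x1-x2)^2 + (y1-y2)^2; for the 32x32 images used here, the sum runs over all 1024 pixels instead of two coordinates)
- Sort the distances in ascending order and take the first k
- Whichever class appears most often among those k is the predicted class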
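The three steps above can be sketched on a toy problem before touching the digit files. This is only a minimal illustration: the function name `knn_classify`, the 2D points, and the labels "A" and "B" are all made up for the example and are not part of the original code. It compares squared distances, which gives the same ordering as the real distances because the square root is monotonic.

```python
from collections import Counter


def knn_classify(sample, training, k=3):
    """Classify `sample` by majority vote among its k nearest training points.

    `training` is a list of (label, point) pairs, where each point is a
    sequence of numbers with the same length as `sample`.
    All names here are illustrative, not taken from the article's code.
    """
    # Step 1: squared Euclidean distance to every training point.
    distances = []
    for label, point in training:
        d2 = sum((a - b) ** 2 for a, b in zip(sample, point))
        distances.append((d2, label))
    # Step 2: sort ascending by distance and keep the k nearest.
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    # Step 3: the most common label among the k nearest wins.
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
training = [
    ("A", (0, 0)), ("A", (1, 0)), ("A", (0, 1)),
    ("B", (5, 5)), ("B", (6, 5)), ("B", (5, 6)),
]
print(knn_classify((1, 1), training))
print(knn_classify((5, 4), training))
```

The digit version below does the same thing, only with 1024 coordinates per sample instead of two.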
Pure Python, No Libraries
The code follows easily from the principle above:
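import os  # I'm guessing nobody will nitpick me for using the os library. Nope, surely not


def fileToArray(dir, filename: str):
    '''
    Parameters:
        dir: name of the directory containing the file
        filename: name of the file
    Returns:
        a list [n, a]
        n: the digit the file actually represents (taken from the file name)
        a: 2D list built from the 0/1 data in the file
    '''
    a = []
    with open(dir + '/' + filename, 'r') as f:
        for i in f.readlines():
            a.append(list(map(int, i.strip('\n'))))
    n = int(filename.split('_')[0])
    return [n, a]


def dirFileYield(dir):
    '''
    Parameters:
        dir: directory name
    Returns:
        the result of fileToArray for every file in that directory
    '''
    for i in os.listdir(dir):
        yield fileToArray(dir, i)


def distance(test, train):
    '''
    Parameters:
        test: test data (i.e. the list returned by fileToArray, made of the
              actual class and the 2D list of 0/1 values) hehehe
        train: training data
    Returns:
        [class of the training sample, distance between the two samples]
    '''
    sum = 0
    for i in range(32):
        for j in range(32):
            sum += (train[1][i][j] - test[1][i][j]) ** 2
    return [train[0], sum ** 0.5]


# Passed to sort() to customize the ordering: sort by distance
def rule(a):
    return a[1]


def judge(test):
    '''
    Parameters:
        test: one test sample, as returned by fileToArray
    Returns:
        the most common class among the 3 nearest training samples
    '''
    disArray = []
    # Note: the training directory is re-read for every test sample
    for train in dirFileYield('trainingDigits'):
        disArray.append(distance(test, train))
    disArray.sort(key=rule)
    res = []
    for i in range(3):
        res.append(disArray[i][0])
    return max(res, key=res.count)


# Main function, nothing much to say
def main():
    sum = res = 0
    for i in dirFileYield('testDigits'):
        n = judge(i)
        print('actual: ' + str(i[0]) + ', predicted: ' + str(n))
        if n != i[0]:
            res += 1
        sum += 1
    print('error rate: ' + str(res / sum))


if __name__ == '__main__':
    main()

# No way, no way, nobody is really going to nitpick me for using the os library, right?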
Running the code does produce results, but it has a fatal flaw: the whole run takes 30 minutes (2,000 training samples, 1,000 test samples).
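One thing that stands out in the pure Python version is that judge calls dirFileYield('trainingDigits') for every test sample, so every training file is reopened and re-parsed once per test image. That is probably part of where the 30 minutes go, though this is a guess and not a measured result. The sketch below reads the training directory once and reuses the list. The helper names (`file_to_array`, `load_dir`) and the tiny 2x2 "images" written to a temporary directory are made up for the example; the real files are 32x32.

```python
import os
import tempfile


def file_to_array(dirname, filename):
    """Parse one digit file into [label, rows], mirroring fileToArray."""
    with open(os.path.join(dirname, filename), 'r') as f:
        rows = [list(map(int, line.strip())) for line in f if line.strip()]
    return [int(filename.split('_')[0]), rows]


def load_dir(dirname):
    """Read every file in dirname exactly once and keep the results in a list."""
    return [file_to_array(dirname, name) for name in sorted(os.listdir(dirname))]


def distance(test, train):
    """Return [training label, squared distance] for two same-sized 0/1 grids."""
    total = 0
    for test_row, train_row in zip(test[1], train[1]):
        for a, b in zip(test_row, train_row):
            total += (a - b) ** 2
    return [train[0], total]


def judge(test, training, k=3):
    """Same majority vote as before, but over an in-memory training list."""
    nearest = sorted((distance(test, t) for t in training), key=lambda d: d[1])[:k]
    labels = [label for label, _ in nearest]
    return max(labels, key=labels.count)


# Hypothetical stand-in data: six tiny 2x2 "digit" files.
samples = {
    '0_0.txt': '00\n00\n', '0_1.txt': '00\n01\n', '0_2.txt': '10\n00\n',
    '1_0.txt': '11\n11\n', '1_1.txt': '11\n10\n', '1_2.txt': '01\n11\n',
}
with tempfile.TemporaryDirectory() as tmp:
    for name, body in samples.items():
        with open(os.path.join(tmp, name), 'w') as f:
            f.write(body)
    training = load_dir(tmp)  # the disk is touched once, here

print(judge([1, [[1, 1], [1, 1]]], training))
print(judge([0, [[0, 0], [0, 0]]], training))
```

With the training data held in memory, each call to judge only does arithmetic, which is the part the following sections speed up.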
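Adding numpy
After some studying, I learned that the numpy library can greatly speed up matrix operations. With that in mind, adding numpy gives the following code: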
import os
import numpy  #1.268499

'''
If you have read this far, I trust you have come to accept my
ele(eye-searing)gant
code style.
Apart from using numpy for the matrix operations,
the rest is largely unchanged.
'''


def testToArray(filename):
    # Flatten one 32x32 test file into a 1x1024 row vector
    dirname = 'testDigits'
    matrix = numpy.zeros((1, 1024))
    with open(dirname + '/' + filename, 'r') as f:
        for i in range(32):
            fr = f.readline()
            for j in range(32):
                matrix[0, 32 * i + j] = int(fr[j])
    return matrix


def trainToArray():
    # Flatten every training file into one row of an (n, 1024) matrix
    dirname = 'trainingDigits'
    train = os.listdir(dirname)
    trainArray = numpy.zeros((len(train), 1024))
    for i in range(len(train)):
        with open(dirname + '/' + train[i], 'r') as f:
            for j in range(32):
                fr = f.readline()
                for k in range(32):
                    trainArray[i, 32 * j + k] = int(fr[k])
    return trainArray


def dirToArray():
    # Yield (actual digit as a string, 1x1024 vector) for every test file
    for i in os.listdir('testDigits'):
        yield (i.split('_')[0], testToArray(i))


def judge(test, trainArray):
    # Labels of the training samples, in the same order as the rows of trainArray
    id = os.listdir('trainingDigits')
    for i, j in enumerate(id):
        id[i] = int(j.split('_')[0])
    # Copy the test vector into every row so it can be subtracted row by row
    testArray = numpy.zeros(trainArray.shape)
    for i in range(trainArray.shape[0]):
        testArray[i] = test
    testArray = testArray - trainArray
    testArray = testArray ** 2
    testArray = numpy.sum(testArray, axis=1)  # squared distance to each training sample
    testArray = numpy.argsort(testArray)      # indices sorted from nearest to farthest
    res = []
    for i in range(3):
        res.append(id[testArray[i]])
    return max(res, key=res.count)


def main():
    res = sum = 0
    trainArray = trainToArray()
    for i in dirToArray():
        n = judge(i[1], trainArray)
        if n != int(i[0]):
            res += 1
        sum += 1
        print('actual: ' + i[0] + ', predicted: ' + str(n))
    print(res / sum * 100)  # error rate in percent


if __name__ == '__main__':
    main()
After this change, the run time drops to 20 seconds.
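The numpy version of judge still uses a Python loop to copy the test vector into every row before subtracting. numpy's broadcasting can do that copy implicitly: subtracting a (1, 1024) array from an (n, 1024) array applies the subtraction to every row. The sketch below shows the idea on made-up 4-pixel data. The function name `judge_broadcast` and the `labels` parameter are illustrative and not from the original code, and whether this is noticeably faster on the real data set has not been measured here.

```python
import numpy


def judge_broadcast(test, trainArray, labels, k=3):
    """Vote among the k nearest rows of trainArray.

    test: array of shape (1, d); trainArray: array of shape (n, d);
    labels: list of n class labels, one per row of trainArray.
    """
    # Broadcasting stretches `test` across all n rows, no explicit copy loop.
    diff = trainArray - test
    dist2 = numpy.sum(diff ** 2, axis=1)   # squared distance per training row
    nearest = numpy.argsort(dist2)[:k]     # indices of the k smallest distances
    votes = [labels[i] for i in nearest]
    return max(votes, key=votes.count)


# Hypothetical toy data: six 4-pixel "images" in two classes.
trainArray = numpy.array([
    [0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0],
    [1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1],
], dtype=float)
labels = [0, 0, 0, 1, 1, 1]
test = numpy.array([[1, 1, 1, 1]], dtype=float)
print(judge_broadcast(test, trainArray, labels))
```

Passing the labels in as an argument also avoids calling os.listdir on the training directory for every test sample.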
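Adding sklearn
After another round of studying, I decided to try a machine learning library (sklearn) to see how it performs. With sklearn added, the code becomes: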
import numpy
from sklearn import neighbors
import os

'''
hehehe
'''


def testToArray(filename):
    # Flatten one 32x32 test file into a 1x1024 row vector
    dirname = 'testDigits'
    matrix = numpy.zeros((1, 1024))
    with open(dirname + '/' + filename, 'r') as f:
        for i in range(32):
            fr = f.readline()
            for j in range(32):
                matrix[0, 32 * i + j] = int(fr[j])
    return matrix


def trainToArray():
    # Build the feature matrix (n, 1024) and the label column (n, 1)
    dirname = 'trainingDigits'
    train = os.listdir(dirname)
    trainArray_x = numpy.zeros((len(train), 1024))
    trainArray_y = numpy.zeros((len(train), 1))
    for i in range(len(train)):
        with open(dirname + '/' + train[i], 'r') as f:
            for j in range(32):
                fr = f.readline()
                for k in range(32):
                    trainArray_x[i, 32 * j + k] = int(fr[k])
        trainArray_y[i, 0] = int(train[i].split('_')[0])
    return trainArray_x, trainArray_y


def dirToArray():
    # Yield (actual digit as a string, 1x1024 vector) for every test file
    for i in os.listdir('testDigits'):
        yield (i.split('_')[0], testToArray(i))


train = trainToArray()
# ravel() turns the (n, 1) label column into the 1D array that fit() expects
knn = neighbors.KNeighborsClassifier(n_neighbors=3).fit(train[0], train[1].ravel())
sum = res = 0
for i in dirToArray():
    n = int(knn.predict(i[1])[0])
    print('actual: ' + str(i[0]) + ', predicted: ' + str(n))
    if int(n) != int(i[0]):
        res += 1
    sum += 1
print(res / sum * 100)  # error rate in percent
With this change the code gets faster again: the run time drops to 6 seconds. (Shout-out: mathematicians are awesome!!!!!)
Original article: https://blog.csdn.net/qq_38590766/article/details/108838881