Machine Learning Project 1: KNN
Header
- Name: Shusen Wu
- OS: Win10
- CPU: i7-7700
- Language: Python 3.6
- Environment: Jupyter Notebook
- Libraries: numpy, matplotlib.pyplot, collections, time, operator
Reference
Machine Learning in Action, Peter Harrington, ISBN 9781617290183
Datasets
The project explores two datasets: the famous MNIST dataset of small pictures of handwritten digits, and a dataset on the prevalence of diabetes among the Pima, a Native American people. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data
Part 1: Pima
Dataset details
Here, I use 80% of the data for training, 10% for validation, and 10% for testing.
Besides, since the scales of the features differ widely, I normalize each feature into [0, 1].
Algorithm Description
Following the book *Machine Learning in Action*, I implement the two core functions, sketched below:
- KNN
- Normalize
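For reference, here is a minimal sketch of these two functions, in the spirit of the book's `classify0` and `autoNorm` (the names and details below are my own, not the book's exact code):

```python
import numpy as np
import operator

def knn_classify(in_x, data_set, labels, k):
    """Classify one sample by majority vote among its k nearest neighbors."""
    diff = data_set - in_x                       # broadcast: (n, d) - (d,)
    distances = np.sqrt((diff ** 2).sum(axis=1)) # Euclidean distance to each row
    sorted_idx = distances.argsort()             # nearest first
    class_count = {}
    for i in range(k):
        vote = labels[sorted_idx[i]]
        class_count[vote] = class_count.get(vote, 0) + 1
    # label with the most votes among the k neighbors
    return max(class_count.items(), key=operator.itemgetter(1))[0]

def normalize(data_set):
    """Min-max scale every feature column into the range [0, 1]."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    return (data_set - min_vals) / ranges, ranges, min_vals
```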
Algorithm Results
| Reality \ Prediction | 1 | 0 | Total |
| --- | --- | --- | --- |
| 1 | True Positive (TP) | False Negative (FN) | Actual Positive (TP+FN) |
| 0 | False Positive (FP) | True Negative (TN) | Actual Negative (FP+TN) |
| Total | Predicted Positive (TP+FP) | Predicted Negative (FN+TN) | TP+FP+FN+TN |
First, we predict on the validation set with different values of K; here, I try K from 3 to 9.
Here are the results:
Comparing the accuracies for K from 3 to 9, we find that K = 7 works best.
So we choose K = 7 to run on the test set.
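A sketch of the K-selection loop (the variable names are placeholders for my pre-split arrays, and `knn_classify` is the sketch above):

```python
import numpy as np

best_k, best_acc = 0, 0.0
for k in range(3, 10):  # K from 3 to 9
    preds = np.array([knn_classify(x, train_x, train_y, k) for x in val_x])
    acc = np.sum(preds == val_y) / len(val_y)
    print('K = %d, validation accuracy = %.4f' % (k, acc))
    if acc > best_acc:
        best_k, best_acc = k, acc
# best_k (7 in my run) is the value used on the test set
```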
Runtime
The time cost on the test set is 0.006999969482421875 s (about 7 ms).
Other running times are shown in the pictures above.
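The timing uses the `time` library from the header; roughly like this (same placeholder names as above):

```python
import time

start = time.time()
test_preds = [knn_classify(x, train_x, train_y, 7) for x in test_x]
print('cost time on test set: %s s' % (time.time() - start))
```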
Part 2: Recognise Digits
Dataset details
Here, again, I use 90% of the training data to train and the remaining 10% as a validation set. Since Kaggle already provides a test set, I do not need to split one off from the training data. From the following picture, we can see the shapes (rows x cols) of these data sets.
The distribution of the digits 0~9:
Random sample and show:
Besides, we'd like to have a quick look at one image:
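A sketch of these inspection steps, assuming Kaggle's `train.csv` (a `label` column followed by 784 pixel columns):

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.loadtxt('train.csv', delimiter=',', skiprows=1)
labels, pixels = data[:, 0].astype(int), data[:, 1:]

# 90% train / 10% validation split
split = int(0.9 * len(labels))
train_x, train_y = pixels[:split], labels[:split]
val_x, val_y = pixels[split:], labels[split:]
print(train_x.shape, val_x.shape)  # shapes (rows x cols) of the sets

# distribution of the digits 0~9
print(Counter(labels))

# quick look at one image: 784 pixels reshaped to 28 x 28
plt.imshow(pixels[0].reshape(28, 28), cmap='gray')
plt.title('label: %d' % labels[0])
plt.show()
```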
Algorithm Description
- KNN: we know that (A-B)^2 = A^2 + B^2 - 2AB. So, when we compute the squared Euclidean distances, we can expand them this way, and the matrix computation saves a lot of time.
- Besides, we also need to compute the accuracy for every K. Here, we can compute it quickly with np.sum(y == y') / len(y).
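A sketch of the vectorized version: for a test matrix A (m x d) and training matrix B (n x d), the full m x n matrix of squared distances is ||A||^2 + ||B||^2 - 2AB^T, one matrix product instead of a Python loop (the function name is my own):

```python
import numpy as np

def knn_predict_batch(test, train, train_labels, k):
    """Predict labels for a whole batch of test rows at once."""
    # squared Euclidean distances via (A - B)^2 = A^2 + B^2 - 2AB
    d2 = (test ** 2).sum(axis=1)[:, None] \
         + (train ** 2).sum(axis=1)[None, :] \
         - 2.0 * test.dot(train.T)
    nearest = np.argsort(d2, axis=1)[:, :k]  # k nearest per test row
    votes = train_labels[nearest]            # (m, k) neighbor labels
    # majority vote per row (labels must be non-negative ints, e.g. digits 0~9)
    return np.array([np.bincount(row).argmax() for row in votes])

# accuracy, vectorized: np.sum(preds == val_y) / len(val_y)
```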
Algorithm Results & Runtime
KNN does cost a lot of time, and it uses a huge amount of memory. Running this algorithm in Jupyter Notebook, memory runs out unless the test set is split into small batches (a sketch of the batching follows).
Thus, I moved the code back to PyCharm and ran it there.
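The batching fix is just slicing the test set so the m x n distance matrix stays small; a sketch, reusing `knn_predict_batch` from above (the batch size here is an arbitrary choice):

```python
import numpy as np

def knn_predict_in_batches(test, train, train_labels, k, batch_size=500):
    """Run KNN over the test set in slices to bound the distance matrix size."""
    preds = []
    for start in range(0, len(test), batch_size):
        batch = test[start:start + batch_size]
        preds.append(knn_predict_batch(batch, train, train_labels, k))
    return np.concatenate(preds)
```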
From this picture, we can find that K = 5 does best on the validation set. So we set K = 5 to run on the test set. Here is the result:
Top 100 results: