欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

Machine Learning Project1, KNN

程序员文章站 2022-07-14 20:33:01
...

Header

  • Name: Shusen Wu
  • OS: Win10 
  • Cpu: I7-7700
  • Language: Python3.6
  • Environment: Jupyter Notebook
  • library: 
    numpy
    matplotlib.pyplot
    collections 
    time
    operator

Reference

Machine Learning in Action, Peter Harrington, ISBN 9781617290183

Datasets

The project will explore two datasets, the famous MNIST dataset of very small pictures of handwritten numbers, and a dataset that explores the prevelance of diabetes in a native american tribe named the Pima. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data

Part 1: Pima

Dataset details

Here, I use 80% data as trainning data, 10% as validation data and 10% as test data.

Besides, since the scales of features are quite different, so I normalize them into 0~1.

Machine Learning Project1, KNN

Algorithm Description

Learn from the book <<Machine Learning in Action>>,  I get the KNN and normalize functions.

  • KNN 

Machine Learning Project1, KNN

  • Normalize 

Machine Learning Project1, KNN

Algorithm Results

   

Predit

   
   

1

0

合计

Reality

1

True Positive(TP)

False Negative(FN)

Actual Positive(TP+FN)

 

0

False Positive(FP)

True Negative(TN)

Actual Negative(FP+TN)

Total

 

Predicted Positive(TP+FP)

Predicted Negative(FN+TN)

TP+FP+FN+TN

 

First, we need to predit the results by using the validation set with different K. Here, I set K from 3 to 9.

Here are the results:

 

 

Machine Learning Project1, KNN

Machine Learning Project1, KNN

Machine Learning Project1, KNN

 Machine Learning Project1, KNN

Machine Learning Project1, KNN

Machine Learning Project1, KNN

Machine Learning Project1, KNN 

 

 

 

Compare the accuracies when k is from 3 to 9, we find that when K is set to 7 it works well.

So we choose k as 7 to run on the test set

Machine Learning Project1, KNN

Runtime

The cost time on test set is:0.006999969482421875

Other running times are showing on the above pictures.

Part2:Recognise Digits

Dataset details

Here, again, I use 90% of traning data to train, 10% of traning set as validation set. Cause there is already a test set, so I do not need to split it from training set. From the following picture, we can see the shapes (row x col) of these data sets.

Machine Learning Project1, KNN

 

The distribution of 0~9 numbers: 

Machine Learning Project1, KNN

Random sample and show:

Machine Learning Project1, KNN

Besides, we'd like to have a quick look on one image:

Machine Learning Project1, KNN

Algorithm Description

  • KNN: we know that (A-B)^2 equals to A^2+B^2+2AB. So, when we compute the distance, we can use this way to calculate and the matric computation will save a lot of time.
  • Besides, we also need to computer the accuracy for every K. So, here, we can computer it quite fast by using np.sum(y-y') /len(y)

Machine Learning Project1, KNN

 

Algorithm Results && Runtime

KNN do cost a lot of time, and it costs huge memory. When putting this algorithm on the Jupyter Notebooks, the memory is not enough without splitting into small batch sizes.

Machine Learning Project1, KNN Thus, I put the code back to pycharm and run.

Machine Learning Project1, KNN

From this picture, we can find that when K is 5, it does best on the validation set. So we set K=5 to run on the test set. There is the result:

top 100 results:

Machine Learning Project1, KNN

相关标签: KNN