Machine Learning Project 1: KNN
Header
- Name: Shusen Wu
- OS: Win10
- CPU: i7-7700
- Language: Python 3.6
- Environment: Jupyter Notebook
- Libraries: numpy, matplotlib.pyplot, collections, time, operator
Reference
Machine Learning in Action, Peter Harrington, ISBN 9781617290183
Datasets
The project explores two datasets: the famous MNIST dataset of small pictures of handwritten digits, and a dataset on the prevalence of diabetes among the Pima, a Native American people. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data
Part 1: Pima
Dataset details
Here, I use 80% of the data for training, 10% for validation, and 10% for testing.
Besides, since the scales of the features differ widely, I normalize each feature into [0, 1].
Algorithm Description
Following the book *Machine Learning in Action*, I implement the two core functions, sketched below:
- KNN
- Normalize
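For reference, here is a minimal sketch of these two functions, in the spirit of the book's `classify0` and `autoNorm` (the names and details below are my own, not the book's exact code):

```python
import numpy as np
import operator

def knn_classify(in_x, data_set, labels, k):
    """Classify one sample by majority vote among its k nearest neighbors."""
    diff = data_set - in_x                       # broadcast: (n, d) - (d,)
    distances = np.sqrt((diff ** 2).sum(axis=1)) # Euclidean distance to each row
    sorted_idx = distances.argsort()             # nearest first
    class_count = {}
    for i in range(k):
        vote = labels[sorted_idx[i]]
        class_count[vote] = class_count.get(vote, 0) + 1
    # label with the most votes among the k neighbors
    return max(class_count.items(), key=operator.itemgetter(1))[0]

def normalize(data_set):
    """Min-max scale every feature column into the range [0, 1]."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    return (data_set - min_vals) / ranges, ranges, min_vals
```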
Algorithm Results
| Reality \ Prediction | 1 | 0 | Total |
| --- | --- | --- | --- |
| 1 | True Positive (TP) | False Negative (FN) | Actual Positive (TP+FN) |
| 0 | False Positive (FP) | True Negative (TN) | Actual Negative (FP+TN) |
| Total | Predicted Positive (TP+FP) | Predicted Negative (FN+TN) | TP+FP+FN+TN |
First, we predict on the validation set with different values of K; here, I try K from 3 to 9.
Here are the results:
Comparing the accuracies for K from 3 to 9, we find that K = 7 works best.
So we choose K = 7 to run on the test set.
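A sketch of the K-selection loop (the variable names are placeholders for my pre-split arrays, and `knn_classify` is the sketch above):

```python
import numpy as np

best_k, best_acc = 0, 0.0
for k in range(3, 10):  # K from 3 to 9
    preds = np.array([knn_classify(x, train_x, train_y, k) for x in val_x])
    acc = np.sum(preds == val_y) / len(val_y)
    print('K = %d, validation accuracy = %.4f' % (k, acc))
    if acc > best_acc:
        best_k, best_acc = k, acc
# best_k (7 in my run) is the value used on the test set
```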
Runtime
The time cost on the test set is 0.006999969482421875 s (about 7 ms).
Other running times are shown in the pictures above.
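The timing uses the `time` library from the header; roughly like this (same placeholder names as above):

```python
import time

start = time.time()
test_preds = [knn_classify(x, train_x, train_y, 7) for x in test_x]
print('cost time on test set: %s s' % (time.time() - start))
```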
Part 2: Recognise Digits
Dataset details
Here, again, I use 90% of the training data to train and the remaining 10% as a validation set. Since Kaggle already provides a test set, I do not need to split one off from the training data. From the following picture, we can see the shapes (rows x cols) of these data sets.
The distribution of the digits 0~9:
Random sample and show:
Besides, we'd like to have a quick look at one image:
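A sketch of these inspection steps, assuming Kaggle's `train.csv` (a `label` column followed by 784 pixel columns):

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.loadtxt('train.csv', delimiter=',', skiprows=1)
labels, pixels = data[:, 0].astype(int), data[:, 1:]

# 90% train / 10% validation split
split = int(0.9 * len(labels))
train_x, train_y = pixels[:split], labels[:split]
val_x, val_y = pixels[split:], labels[split:]
print(train_x.shape, val_x.shape)  # shapes (rows x cols) of the sets

# distribution of the digits 0~9
print(Counter(labels))

# quick look at one image: 784 pixels reshaped to 28 x 28
plt.imshow(pixels[0].reshape(28, 28), cmap='gray')
plt.title('label: %d' % labels[0])
plt.show()
```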
Algorithm Description
- KNN: we know that (A-B)^2 = A^2 + B^2 - 2AB. So, when we compute the squared Euclidean distances, we can expand them this way, and the matrix computation saves a lot of time.
- Besides, we also need to compute the accuracy for every K. Here, we can compute it quickly with np.sum(y == y') / len(y).
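A sketch of the vectorized version: for a test matrix A (m x d) and training matrix B (n x d), the full m x n matrix of squared distances is ||A||^2 + ||B||^2 - 2AB^T, one matrix product instead of a Python loop (the function name is my own):

```python
import numpy as np

def knn_predict_batch(test, train, train_labels, k):
    """Predict labels for a whole batch of test rows at once."""
    # squared Euclidean distances via (A - B)^2 = A^2 + B^2 - 2AB
    d2 = (test ** 2).sum(axis=1)[:, None] \
         + (train ** 2).sum(axis=1)[None, :] \
         - 2.0 * test.dot(train.T)
    nearest = np.argsort(d2, axis=1)[:, :k]  # k nearest per test row
    votes = train_labels[nearest]            # (m, k) neighbor labels
    # majority vote per row (labels must be non-negative ints, e.g. digits 0~9)
    return np.array([np.bincount(row).argmax() for row in votes])

# accuracy, vectorized: np.sum(preds == val_y) / len(val_y)
```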
Algorithm Results & Runtime
KNN does cost a lot of time, and it uses a huge amount of memory. Running this algorithm in Jupyter Notebook, memory runs out unless the test set is split into small batches (a sketch of the batching follows).
Thus, I moved the code back to PyCharm and ran it there.
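The batching fix is just slicing the test set so the m x n distance matrix stays small; a sketch, reusing `knn_predict_batch` from above (the batch size here is an arbitrary choice):

```python
import numpy as np

def knn_predict_in_batches(test, train, train_labels, k, batch_size=500):
    """Run KNN over the test set in slices to bound the distance matrix size."""
    preds = []
    for start in range(0, len(test), batch_size):
        batch = test[start:start + batch_size]
        preds.append(knn_predict_batch(batch, train, train_labels, k))
    return np.concatenate(preds)
```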
From this picture, we can find that K = 5 does best on the validation set. So we set K = 5 to run on the test set. Here is the result:
Top 100 results: