
CS231n Assignment 1: KNN


Code for cs231n/classifiers/k_nearest_neighbor.py:

import numpy as np
class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just 
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (num_train,) containing the training labels,
      where y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y
    
  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
         of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)

    return self.predict_labels(dists, k=k)

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        # L2 distance, vectorized over the feature dimension only.
        dists[i, j] = np.sqrt(np.sum((X[i, :] - self.X_train[j, :]) ** 2))
        # Equivalent form from the reference solution:
        # dists[i, j] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train[j, :])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      # Broadcast X[i, :] against every row of X_train, then reduce over features.
      dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i, :]), axis=1))

      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    # Expand (x - y)^2 = x.x - 2 x.y + y.y for all pairs at once: the cross
    # term is a single matrix product, and the two squared-norm terms
    # broadcast along rows and columns respectively.
    cross = -2 * np.dot(X, self.X_train.T)                 # (num_test, num_train)
    test_sq = np.sum(np.square(X), axis=1, keepdims=True)  # (num_test, 1)
    train_sq = np.sum(np.square(self.X_train), axis=1)     # (num_train,)
    dists = np.sqrt(cross + test_sq + train_sq)
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      # argsort orders training indices by distance; take the k nearest labels.
      closest_y = self.y_train[np.argsort(dists[i, :])[:k]]

      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # bincount tallies votes; argmax breaks ties toward the smaller label.
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred
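
A minimal usage sketch of the class above (the arrays here are random stand-in data; in the notebook the actual inputs are CIFAR-10 subsets):

import numpy as np

# Hypothetical toy data in place of the CIFAR-10 subsets used in the notebook.
X_train = np.random.rand(50, 8)
y_train = np.random.randint(0, 3, size=50)
X_test = np.random.rand(10, 8)

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)        # just memorizes the data
y_pred = classifier.predict(X_test, k=5)  # num_loops=0 -> fully vectorized
print(y_pred.shape)                       # (10,)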


Code from knn.ipynb, the cross-validation part:

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train,num_folds)
y_train_folds = np.array_split(y_train,num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = []
    
for k in k_choices:
    for i in range(num_folds):
        # Pitfall: the first i folds are selected with [:i], not [:, i];
        # the latter would index columns instead of list elements.
        X_train_cv = np.vstack(X_train_folds[:i] + X_train_folds[i+1:]) #(4000,3072)
        X_test_cv = X_train_folds[i]                                    #(1000,3072)
        y_train_cv = np.hstack(y_train_folds[:i] + y_train_folds[i+1:]) #(4000,)
        y_test_cv = y_train_folds[i]    #(1000,)
       
          
        # No second KNearestNeighbor() is needed here: train() just overwrites
        # the stored data, so reusing the earlier classifier instance is fine.
        classifier.train(X_train_cv,y_train_cv)
        dists_cv = classifier.compute_distances_no_loops(X_test_cv) #(1000,4000)
        y_test_pred_cv = classifier.predict_labels(dists_cv,k)

        num_correct_cv = np.sum(y_test_pred_cv == y_test_cv)
        accuracy_cv = float(num_correct_cv)/y_test_cv.shape[0]
        k_to_accuracies[k].append(accuracy_cv)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
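
One way to pick the final k from these results (a sketch; the notebook's own plotting and selection cells are not shown in this excerpt) is to average the accuracies over folds:

# Average per-fold accuracies and take the k with the highest mean.
mean_accuracies = {k: np.mean(accs) for k, accs in k_to_accuracies.items()}
best_k = max(mean_accuracies, key=mean_accuracies.get)
print('best k = %d, mean accuracy = %f' % (best_k, mean_accuracies[best_k]))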

Summary of issues encountered while doing the assignment:
1. The numpy.sum function:

numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>)

It is simply a summation function; see the documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
The parameters used in this assignment are axis and keepdims.
Example parameters and results:

a = np.arange(12).reshape(3,4)

print(a)   
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(np.sum(a))
#66

print(np.sum(a,axis=0))
#[12 15 18 21]

print(np.sum(a,axis=0,keepdims=True))
#[[12 15 18 21]]

print(np.sum(a,axis=1))
# [ 6 22 38]

print(np.sum(a,axis=1,keepdims=True))
# [[ 6]
#  [22]
#  [38]]

2. The numpy.argsort, numpy.argmax, and numpy.bincount functions:
numpy.argsort: sorts an array along the given axis and returns the indices that would sort it.
numpy.argmax: returns the index of the maximum value in an array.
numpy.bincount: counts how many times each non-negative integer occurs in an array; especially handy for tallying a label vector such as y_train. A short example combining the three follows.
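
The distances and labels below are made up for illustration, mirroring how predict_labels uses these three functions:

import numpy as np

distances = np.array([0.9, 0.1, 0.5, 0.3])  # made-up distances to 4 training points
labels = np.array([2, 0, 1, 0])             # their made-up labels

order = np.argsort(distances)   # [1 3 2 0], indices from nearest to farthest
closest = labels[order[:3]]     # labels of the 3 nearest: [0 0 1]
votes = np.bincount(closest)    # [2 1], label 0 appears twice, label 1 once
print(np.argmax(votes))         # 0, the most common label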

3. The numpy.vstack and numpy.hstack functions:
They stack arrays: vstack stacks arrays row-wise (vertically), hstack stacks them column-wise (horizontally).
Example output:

a = np.arange(12).reshape(3,4)
print(a)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
print(np.vstack(a))
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
print(np.hstack(a))
# [ 0  1  2  3  4  5  6  7  8  9 10 11]
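
Note that in the cross-validation code above they are applied to a list of fold arrays rather than to a single array; a small illustration:

import numpy as np

folds = [np.arange(4).reshape(2, 2), np.arange(4, 8).reshape(2, 2)]
print(np.vstack(folds))   # stacks the two folds into a (4, 2) array
# [[0 1]
#  [2 3]
#  [4 5]
#  [6 7]]
labels = [np.array([0, 1]), np.array([1, 0])]
print(np.hstack(labels))  # concatenates the label vectors
# [0 1 1 0]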

4. The no-loop algorithm for computing the KNN dists matrix, and why it works:
This is my own understanding and not fully rigorous, but it can be derived.
A formal proof can come later; briefly, the idea is this.
Expanding the squared L2 distance, $\|a - b\|^2 = a \cdot a^T - 2\,a \cdot b^T + b \cdot b^T$, gives
$dists[i,j]^2 = X_{train(j)} \cdot X_{train(j)}^T + X_{(i)} \cdot X_{(i)}^T - 2\,X_{(i)} \cdot X_{train(j)}^T$
We can first derive the formula for a whole column $dists[:,j]$ and then for the full $dists$ matrix, or apply numpy.sum and broadcasting directly to the matrices to get the result we want, as compute_distances_no_loops does above.
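
As a quick sanity check (a sketch with random stand-in data), the vectorized result can be compared against the two-loop version:

import numpy as np

X_train = np.random.rand(20, 5)
X_test = np.random.rand(7, 5)

clf = KNearestNeighbor()
clf.train(X_train, np.zeros(20, dtype=int))  # labels are irrelevant for distances

d_slow = clf.compute_distances_two_loops(X_test)
d_fast = clf.compute_distances_no_loops(X_test)
print(np.allclose(d_slow, d_fast))  # True, up to floating-point error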