K-means clustering implemented in Python, with test data

程序员文章站 2022-07-14 20:18:37


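The K-means algorithm

1: Randomly initialize the k cluster centers

2: Assign each data point to a cluster according to its distance from the centers

3: Compute the cost function (the sum of squared distances from each point to its cluster center)

4: Recompute the centroid of each cluster and use it as the new cluster center

5: Repeat steps 2-4 until the cost function stops changing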
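
In symbols, the steps above can be summarized as follows. The notation is an assumption added here, not taken from the original post: $x_i$ are the $n$ data points, $\mu_j$ the $k$ cluster centers, and $c_i$ the cluster label of point $i$.

```latex
\begin{align}
c_i   &= \arg\min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2
      && \text{(step 2: assignment)} \\
J     &= \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2
      && \text{(step 3: cost function)} \\
\mu_j &= \frac{1}{\lvert \{ i : c_i = j \} \rvert} \sum_{i \,:\, c_i = j} x_i
      && \text{(step 4: centroid update)}
\end{align}
```

Neither step 2 nor step 4 can increase $J$, and there are only finitely many ways to assign $n$ points to $k$ clusters, so the loop in step 5 eventually reaches a point where $J$ no longer changes.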

Test data:

X Y
-1.26 0.46
-1.15 0.49
-1.19 0.36
-1.33 0.28
-1.06 0.22
-1.27 0.03
-1.28 0.15
-1.06 0.08
-1.00 0.38
-0.44 0.29
-0.37 0.45
-0.22 0.36
-0.34 0.18
-0.42 0.06
-0.11 0.12
-0.17 0.32
-0.27 0.08
-0.49 -0.34
-0.39 -0.28
-0.40 -0.45
-0.15 -0.33
-0.15 -0.21
-0.33 -0.30
-0.23 -0.45
-0.27 -0.59
-0.61 -0.65
-0.61 -0.53
-0.52 -0.53
-0.42 -0.56
-1.39 -0.26
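
To try the data without creating a file first, it can be parsed straight from a string. The following is a minimal sketch, assuming the columns are separated by whitespace as shown above; the variable names `raw` and `data` are hypothetical, and only the first five rows are included for brevity.

```python
import io
import numpy as np

# First five rows of the test data above; paste the remaining rows the same way.
raw = """X Y
-1.26 0.46
-1.15 0.49
-1.19 0.36
-1.33 0.28
-1.06 0.22
"""

# skiprows=1 drops the "X Y" header line; any whitespace is the default delimiter
data = np.loadtxt(io.StringIO(raw), skiprows=1)
print(data.shape)  # (5, 2)
```

The resulting array has one row per point and one column per coordinate, which is the layout the clustering code expects for its data matrix.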

Python code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
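#--------------- Read the data -------------------
# sep=r'\s+' accepts tab- or space-separated columns; the first line (X Y) is the header
Df = pd.read_table('D:/22.txt', sep=r'\s+')
DataSet = Df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
Rows_i, Dim_j = DataSet.shape  # number of samples, number of features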

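#-------- Euclidean distance function ----------------
euclidean_Metric = np.linalg.norm
#----- Randomly initialize the cluster centers ------------------------
def init_Center(DataSet, k):
    # Pick k distinct rows as the initial centers. The original call,
    # np.random.random_integers(0, Rows_i), is deprecated and could return
    # the out-of-range index Rows_i or pick the same row twice.
    n_index = np.random.choice(DataSet.shape[0], size=k, replace=False)
    Center = DataSet[n_index, :].astype(float)
    return Center, k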
#--------------- Assign each point to its nearest center -------------
# cluster_Data: column 0 is the cluster label, column 1 is the squared distance to that cluster's center
def cluster_Set(Center, k):
    cluster_Data = np.zeros((Rows_i, 2))
    for i in range(Rows_i):
        # distance from point i to each of the k centers
        dist = [euclidean_Metric(DataSet[i, :] - Center[j, :]) for j in range(k)]
        class_index = int(np.argmin(dist))
        cluster_Data[i, 0] = class_index
        # The original stored temp2**2, which was stale whenever the last center was the nearest one
        cluster_Data[i, 1] = dist[class_index]**2
    return cluster_Data

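#--------------- Use each cluster's centroid as its new center -------------
def center_Update(cluster_data, Center, k):
    for j in range(k):
        get_Data = DataSet[cluster_data[:, 0] == j]
        if len(get_Data) > 0:  # keep the old center if the cluster is empty (the mean would be NaN)
            Center[j, :] = np.mean(get_Data, axis=0)
    return Center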

#--------------- Cost: sum of squared distances from every point to its cluster center -------------
def cost_f(cluster_Data):
    cost = np.sum(cluster_Data[:, 1])
    return cost

#----------------- Plot the result for 4 clusters ----------------------------------
def show(DataSet, cluster_Data):
    # Note: reads the module-level Center to plot the current cluster centers
    labels = cluster_Data[:, 0]
    colors = ['red', 'green', 'brown', 'black']
    markers = ['d', '*', 'p', 'o']
    plt.figure(figsize=(10, 8), dpi=80)
    axes = plt.subplot()
    for j in range(4):
        pts = DataSet[labels == j]  # points assigned to cluster j
        axes.scatter(pts[:, 0], pts[:, 1], s=50, c=colors[j], marker=markers[j], label=str(j))
    axes.scatter(Center[:, 0], Center[:, 1], s=40, c='blue', label='center')
    plt.xlabel('x', fontsize=16)
    plt.ylabel('y', fontsize=16)
    axes.legend(loc=1)
    plt.show()

#---------------------- Main program ------------------------------
Center, k = init_Center(DataSet, 4)
cost = 100000
cost_temp = 1
while cost_temp != cost:  # stop once the cost no longer changes
    cost_temp = cost
    cluster_Data = cluster_Set(Center, k)   # assign each point to its nearest center
    show(DataSet, cluster_Data)             # plot the current assignment
    cost = cost_f(cluster_Data)             # compute the cost function
    print(cost)
    Center = center_Update(cluster_Data, Center, k)  # recompute each cluster's centroid as its new center
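
The assignment step in cluster_Set can also be written without explicit Python loops. The following is a minimal sketch using NumPy broadcasting; `assign_clusters` and the two example centers are hypothetical names and values chosen for illustration, not part of the original code.

```python
import numpy as np

def assign_clusters(data, centers):
    """Return (labels, squared distance to the assigned center) for each row of data."""
    # diff[i, j, :] = data[i] - centers[j], via broadcasting: (n, 1, d) - (1, k, d)
    diff = data[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)      # shape (n, k)
    labels = np.argmin(sq_dist, axis=1)      # index of the nearest center per point
    return labels, sq_dist[np.arange(len(data)), labels]

# Four points from the test data and two hypothetical centers
data = np.array([[-1.26, 0.46], [-1.15, 0.49], [-0.15, -0.33], [-0.23, -0.45]])
centers = np.array([[-1.2, 0.5], [-0.2, -0.4]])

labels, sq = assign_clusters(data, centers)
print(labels)     # [0 0 1 1]
print(sq.sum())   # cost for this assignment
```

The returned pair carries the same information as the two columns of cluster_Data, so summing the second array gives the cost function directly. For 30 points the speed difference is negligible; the vectorized form mainly pays off on larger data sets.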

Output:

[Figure: plot of the clustering result; image not preserved]