K-Means: Choosing the Number of Cluster Centers K with the Elbow Method

程序员文章站 2022-07-14 21:03:55

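K-Means is one of the most common clustering algorithms and is used routinely in clustering tasks. It is a prototype-based clustering algorithm.

An important step in the algorithm is choosing the value of K. We usually pick candidate values of K with the elbow method, then combine the silhouette coefficient with the number of samples that falls into each cluster to judge which K is best.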

 

The Elbow Method

 


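1) For a dataset of n points, run the clustering for k from 1 to n. After each run, compute the sum of the squared distances from every point to the center of the cluster it belongs to.

2) This sum of squares generally shrinks as k grows, and it reaches 0 when k = n, because every point is then the center of its own cluster.

3) Somewhere along this curve there is an inflection point, the "elbow": the k at which the rate of decrease suddenly slows down is taken as the best value of K. The same criterion also tells us when to stop. Real data is usually noisy, so once adding another cluster no longer brings a meaningful reduction, stop adding clusters.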
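The three steps above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the nine two-dimensional points are made up, `ElbowSketch`, `kMeans` and `sse` are hypothetical names, and the farthest-point initialization is chosen to keep the run deterministic (it is not what Spark uses).

```scala
object ElbowSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(_.sum / ps.size).toVector

  // Plain Lloyd iterations. Initialization (an assumption of this sketch):
  // start from the first point, then repeatedly add the point that is
  // farthest from the centers chosen so far.
  def kMeans(points: Seq[Point], k: Int, maxIter: Int = 50): Seq[Point] = {
    var centers: Seq[Point] = Seq(points.head)
    while (centers.size < k)
      centers = centers :+ points.maxBy(p => centers.map(c => sqDist(p, c)).min)
    for (_ <- 1 to maxIter) {
      // Assign every point to its nearest center, then move each center
      // to the mean of its group (an empty group keeps its old center).
      val groups = points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }

  // Step 1: sum of squared distances from every point to its nearest center.
  def sse(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum

  // Made-up data: three tight groups of three points each.
  val points: Seq[Point] = Seq(
    Vector(1.0, 1.0), Vector(1.2, 0.8), Vector(0.8, 1.1),
    Vector(5.0, 5.0), Vector(5.2, 4.9), Vector(4.9, 5.3),
    Vector(9.0, 1.0), Vector(9.1, 1.2), Vector(8.8, 0.9))

  // Steps 2 and 3: the cost for k = 1 .. n; look for where the drop flattens.
  val costs: Seq[(Int, Double)] =
    (1 to points.size).map(k => k -> sse(points, kMeans(points, k)))

  def main(args: Array[String]): Unit =
    costs.foreach { case (k, c) => println(f"k=$k%d  SSE=$c%.3f") }
}
```

On this toy data the cost falls steeply up to k = 3 and barely changes afterwards, which is the elbow; at k = n every point is its own center and the cost is 0.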

 

The Silhouette Coefficient Method

Another evaluation metric for clustering is the silhouette coefficient. It combines the cohesion and the separation of the clustering to assess how good the result is:


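1) Compute a_i, the average distance from sample i to the other samples in its own cluster. The smaller a_i is, the lower the within-cluster dissimilarity of sample i, and the more sample i belongs in that cluster.

2) For every other cluster C_j, compute b_ij, the average distance from sample i to all samples in C_j; this is the dissimilarity between sample i and C_j. The between-cluster dissimilarity of sample i is the value for its nearest cluster: b_i = min{b_i1, b_i2, ..., b_ik}. The larger b_i is, the less sample i belongs to any other cluster.

3) The silhouette coefficient of sample i is s_i = (b_i - a_i) / max(a_i, b_i). Averaging the silhouette coefficients of all samples gives the mean silhouette coefficient, which lies in [-1, 1]. The larger it is, the better the clustering: samples within a cluster are close together and samples in different clusters are far apart.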
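The computation described above can be written out directly. The following is a minimal sketch in plain Scala, assuming Euclidean distance, at least two clusters, and the common convention that a sample alone in its cluster gets a coefficient of 0; `SilhouetteSketch` and the six sample points are made up for illustration and are not part of the Spark API.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  // Euclidean distance between two points of equal dimension.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Silhouette coefficient of every sample; labels(i) is the cluster id of points(i).
  def silhouettes(points: IndexedSeq[Point], labels: IndexedSeq[Int]): IndexedSeq[Double] = {
    // cluster id -> indices of the samples in that cluster
    val clusters: Map[Int, IndexedSeq[Int]] = points.indices.groupBy(i => labels(i))
    points.indices.map { i =>
      val own = clusters(labels(i)).filter(_ != i)
      if (own.isEmpty) 0.0 // assumed convention for a single-sample cluster
      else {
        // a_i: mean distance to the other samples of the same cluster
        val a = own.map(j => dist(points(i), points(j))).sum / own.size
        // b_i: smallest mean distance to the samples of any other cluster
        val b = (clusters - labels(i)).values
          .map(c => c.map(j => dist(points(i), points(j))).sum / c.size)
          .min
        (b - a) / math.max(a, b)
      }
    }
  }

  // Mean silhouette coefficient over all samples.
  def meanSilhouette(points: IndexedSeq[Point], labels: IndexedSeq[Int]): Double = {
    val s = silhouettes(points, labels)
    s.sum / s.size
  }

  // Made-up data: two tight, well-separated groups.
  val points: IndexedSeq[Point] = Vector(
    Vector(0.0, 0.0), Vector(0.0, 1.0), Vector(1.0, 0.0),
    Vector(10.0, 10.0), Vector(10.0, 11.0), Vector(11.0, 10.0))

  val goodLabels: IndexedSeq[Int] = Vector(0, 0, 0, 1, 1, 1) // matches the groups
  val badLabels: IndexedSeq[Int] = Vector(0, 1, 0, 1, 0, 1)  // mixes the groups

  def main(args: Array[String]): Unit = {
    println(s"good labeling: ${meanSilhouette(points, goodLabels)}")
    println(s"bad labeling:  ${meanSilhouette(points, badLabels)}")
  }
}
```

A labeling that matches the two groups scores close to 1, while a labeling that mixes them scores much lower, which is the behaviour the metric is meant to capture.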

 

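The example below clusters the iris dataset with the KMeans algorithm, tries a range of K values in the spirit of the elbow method, and evaluates each model with the silhouette coefficient.

Preparation:

A: Data: iris_kmeans.txt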

1 1:5.1 2:3.5 3:1.4 4:0.2
1 1:4.9 2:3.0 3:1.4 4:0.2
1 1:4.7 2:3.2 3:1.3 4:0.2
1 1:4.6 2:3.1 3:1.5 4:0.2
1 1:5.0 2:3.6 3:1.4 4:0.2
1 1:5.4 2:3.9 3:1.7 4:0.4
1 1:4.6 2:3.4 3:1.4 4:0.3
1 1:5.0 2:3.4 3:1.5 4:0.2
1 1:4.4 2:2.9 3:1.4 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:5.4 2:3.7 3:1.5 4:0.2
1 1:4.8 2:3.4 3:1.6 4:0.2
1 1:4.8 2:3.0 3:1.4 4:0.1
1 1:4.3 2:3.0 3:1.1 4:0.1
1 1:5.8 2:4.0 3:1.2 4:0.2
1 1:5.7 2:4.4 3:1.5 4:0.4
1 1:5.4 2:3.9 3:1.3 4:0.4
1 1:5.1 2:3.5 3:1.4 4:0.3
1 1:5.7 2:3.8 3:1.7 4:0.3
1 1:5.1 2:3.8 3:1.5 4:0.3
1 1:5.4 2:3.4 3:1.7 4:0.2
1 1:5.1 2:3.7 3:1.5 4:0.4
1 1:4.6 2:3.6 3:1.0 4:0.2
1 1:5.1 2:3.3 3:1.7 4:0.5
1 1:4.8 2:3.4 3:1.9 4:0.2
1 1:5.0 2:3.0 3:1.6 4:0.2
1 1:5.0 2:3.4 3:1.6 4:0.4
1 1:5.2 2:3.5 3:1.5 4:0.2
1 1:5.2 2:3.4 3:1.4 4:0.2
1 1:4.7 2:3.2 3:1.6 4:0.2
1 1:4.8 2:3.1 3:1.6 4:0.2
1 1:5.4 2:3.4 3:1.5 4:0.4
1 1:5.2 2:4.1 3:1.5 4:0.1
1 1:5.5 2:4.2 3:1.4 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:5.0 2:3.2 3:1.2 4:0.2
1 1:5.5 2:3.5 3:1.3 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:4.4 2:3.0 3:1.3 4:0.2
1 1:5.1 2:3.4 3:1.5 4:0.2
1 1:5.0 2:3.5 3:1.3 4:0.3
1 1:4.5 2:2.3 3:1.3 4:0.3
1 1:4.4 2:3.2 3:1.3 4:0.2
1 1:5.0 2:3.5 3:1.6 4:0.6
1 1:5.1 2:3.8 3:1.9 4:0.4
1 1:4.8 2:3.0 3:1.4 4:0.3
1 1:5.1 2:3.8 3:1.6 4:0.2
1 1:4.6 2:3.2 3:1.4 4:0.2
1 1:5.3 2:3.7 3:1.5 4:0.2
1 1:5.0 2:3.3 3:1.4 4:0.2
2 1:7.0 2:3.2 3:4.7 4:1.4
2 1:6.4 2:3.2 3:4.5 4:1.5
2 1:6.9 2:3.1 3:4.9 4:1.5
2 1:5.5 2:2.3 3:4.0 4:1.3
2 1:6.5 2:2.8 3:4.6 4:1.5
2 1:5.7 2:2.8 3:4.5 4:1.3
2 1:6.3 2:3.3 3:4.7 4:1.6
2 1:4.9 2:2.4 3:3.3 4:1.0
2 1:6.6 2:2.9 3:4.6 4:1.3
2 1:5.2 2:2.7 3:3.9 4:1.4
2 1:5.0 2:2.0 3:3.5 4:1.0
2 1:5.9 2:3.0 3:4.2 4:1.5
2 1:6.0 2:2.2 3:4.0 4:1.0
2 1:6.1 2:2.9 3:4.7 4:1.4
2 1:5.6 2:2.9 3:3.6 4:1.3
2 1:6.7 2:3.1 3:4.4 4:1.4
2 1:5.6 2:3.0 3:4.5 4:1.5
2 1:5.8 2:2.7 3:4.1 4:1.0
2 1:6.2 2:2.2 3:4.5 4:1.5
2 1:5.6 2:2.5 3:3.9 4:1.1
2 1:5.9 2:3.2 3:4.8 4:1.8
2 1:6.1 2:2.8 3:4.0 4:1.3
2 1:6.3 2:2.5 3:4.9 4:1.5
2 1:6.1 2:2.8 3:4.7 4:1.2
2 1:6.4 2:2.9 3:4.3 4:1.3
2 1:6.6 2:3.0 3:4.4 4:1.4
2 1:6.8 2:2.8 3:4.8 4:1.4
2 1:6.7 2:3.0 3:5.0 4:1.7
2 1:6.0 2:2.9 3:4.5 4:1.5
2 1:5.7 2:2.6 3:3.5 4:1.0
2 1:5.5 2:2.4 3:3.8 4:1.1
2 1:5.5 2:2.4 3:3.7 4:1.0
2 1:5.8 2:2.7 3:3.9 4:1.2
2 1:6.0 2:2.7 3:5.1 4:1.6
2 1:5.4 2:3.0 3:4.5 4:1.5
2 1:6.0 2:3.4 3:4.5 4:1.6
2 1:6.7 2:3.1 3:4.7 4:1.5
2 1:6.3 2:2.3 3:4.4 4:1.3
2 1:5.6 2:3.0 3:4.1 4:1.3
2 1:5.5 2:2.5 3:4.0 4:1.3
2 1:5.5 2:2.6 3:4.4 4:1.2
2 1:6.1 2:3.0 3:4.6 4:1.4
2 1:5.8 2:2.6 3:4.0 4:1.2
2 1:5.0 2:2.3 3:3.3 4:1.0
2 1:5.6 2:2.7 3:4.2 4:1.3
2 1:5.7 2:3.0 3:4.2 4:1.2
2 1:5.7 2:2.9 3:4.2 4:1.3
2 1:6.2 2:2.9 3:4.3 4:1.3
2 1:5.1 2:2.5 3:3.0 4:1.1
2 1:5.7 2:2.8 3:4.1 4:1.3
3 1:6.3 2:3.3 3:6.0 4:2.5
3 1:5.8 2:2.7 3:5.1 4:1.9
3 1:7.1 2:3.0 3:5.9 4:2.1
3 1:6.3 2:2.9 3:5.6 4:1.8
3 1:6.5 2:3.0 3:5.8 4:2.2
3 1:7.6 2:3.0 3:6.6 4:2.1
3 1:4.9 2:2.5 3:4.5 4:1.7
3 1:7.3 2:2.9 3:6.3 4:1.8
3 1:6.7 2:2.5 3:5.8 4:1.8
3 1:7.2 2:3.6 3:6.1 4:2.5
3 1:6.5 2:3.2 3:5.1 4:2.0
3 1:6.4 2:2.7 3:5.3 4:1.9
3 1:6.8 2:3.0 3:5.5 4:2.1
3 1:5.7 2:2.5 3:5.0 4:2.0
3 1:5.8 2:2.8 3:5.1 4:2.4
3 1:6.4 2:3.2 3:5.3 4:2.3
3 1:6.5 2:3.0 3:5.5 4:1.8
3 1:7.7 2:3.8 3:6.7 4:2.2
3 1:7.7 2:2.6 3:6.9 4:2.3
3 1:6.0 2:2.2 3:5.0 4:1.5
3 1:6.9 2:3.2 3:5.7 4:2.3
3 1:5.6 2:2.8 3:4.9 4:2.0
3 1:7.7 2:2.8 3:6.7 4:2.0
3 1:6.3 2:2.7 3:4.9 4:1.8
3 1:6.7 2:3.3 3:5.7 4:2.1
3 1:7.2 2:3.2 3:6.0 4:1.8
3 1:6.2 2:2.8 3:4.8 4:1.8
3 1:6.1 2:3.0 3:4.9 4:1.8
3 1:6.4 2:2.8 3:5.6 4:2.1
3 1:7.2 2:3.0 3:5.8 4:1.6
3 1:7.4 2:2.8 3:6.1 4:1.9
3 1:7.9 2:3.8 3:6.4 4:2.0
3 1:6.4 2:2.8 3:5.6 4:2.2
3 1:6.3 2:2.8 3:5.1 4:1.5
3 1:6.1 2:2.6 3:5.6 4:1.4
3 1:7.7 2:3.0 3:6.1 4:2.3
3 1:6.3 2:3.4 3:5.6 4:2.4
3 1:6.4 2:3.1 3:5.5 4:1.8
3 1:6.0 2:3.0 3:4.8 4:1.8
3 1:6.9 2:3.1 3:5.4 4:2.1
3 1:6.7 2:3.1 3:5.6 4:2.4
3 1:6.9 2:3.1 3:5.1 4:2.3
3 1:5.8 2:2.7 3:5.1 4:1.9
3 1:6.8 2:3.2 3:5.9 4:2.3
3 1:6.7 2:3.3 3:5.7 4:2.5
3 1:6.7 2:3.0 3:5.2 4:2.3
3 1:6.3 2:2.5 3:5.0 4:1.9
3 1:6.5 2:3.0 3:5.2 4:2.0
3 1:6.2 2:3.4 3:5.4 4:2.3
3 1:5.9 2:3.0 3:5.1 4:1.8

B: Maven dependencies (copied wholesale from the project POM rather than trimmed down; keep only what you need)

 <repositories>
        <repository>
            <id>ali-repo</id>
            <name>ali-repo</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <layout>default</layout>
        </repository>
        <repository>
            <id>mvn-repo</id>
            <name>mvn-repo</name>
            <url>https://mvnrepository.com</url>
        </repository>
        <repository>
            <id>cdh-repo</id>
            <name>cdh-repo</name>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>hdp-repo</id>
            <name>hdp-repo</name>
            <url>http://repo.hortonworks.com/content/repositories/releases/</url>
        </repository>
    </repositories>

    <properties>
        <java.version>1.8</java.version>
        <!-- project compiler -->
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>

        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
        <!-- ClusteringEvaluator needs Spark 2.3+; setDistanceMeasure needs Spark 2.4+ -->
        <spark.version>2.4.0</spark.version>
        <hive.version>1.1.0-cdh5.14.0</hive.version>
        <oozie.version>4.1.0-cdh5.14.0</oozie.version>
        <hbase.version>1.2.0-cdh5.14.0</hbase.version>
        <solr.version>4.10.3-cdh5.14.0</solr.version>
        <jsch.version>0.1.53</jsch.version>
        <jackson.spark.version>2.6.5</jackson.spark.version>
        <mysql.version>5.1.46</mysql.version>

        <!-- maven plugins -->
        <mybatis-generator-maven-plugin.version>1.3.5</mybatis-generator-maven-plugin.version>
        <maven-surefire-plugin.version>2.19.1</maven-surefire-plugin.version>
        <maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
        <wagon-ssh.version>3.1.0</wagon-ssh.version>
        <wagon-maven-plugin.version>2.0.0</wagon-maven-plugin.version>
        <maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
        <maven-war-plugin.version>3.2.1</maven-war-plugin.version>
        <jetty-maven-plugin.version>9.4.10.v20180503</jetty-maven-plugin.version>
    </properties>

    <dependencies>
        <!-- Scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- jackson -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <!-- spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.scalanlp</groupId>
                    <artifactId>breeze_2.11</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.scalanlp</groupId>
            <artifactId>breeze_2.11</artifactId>
            <version>0.13</version>
            <exclusions>
                <exclusion>
                    <groupId>org.scala-lang</groupId>
                    <artifactId>scala-library</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-core-asl</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-mapper-asl</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-sslengine</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-xc</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- hbase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>${hbase.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>servlet-api-2.5</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util-6.1.26.hwx</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-sslengine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- solr -->
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
            <version>${solr.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-solrj</artifactId>
            <version>${solr.version}</version>
        </dependency>
        <!-- mysql -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>

        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>

    <build>
        <outputDirectory>target/classes</outputDirectory>
        <testOutputDirectory>target/test-classes</testOutputDirectory>
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
            </resource>
        </resources>
        <!-- Maven compiler plugins -->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

C: The code:

package ml

import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

import scala.collection.immutable

/**
 * @Author: sou1yu
 * @Email:aaa@qq.com
 */

object IrisClusterDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[3]")
      .config("spark.sql.shuffle.partitions", "2")
      .getOrCreate()

    import spark.implicits._

    // 1. Load the iris dataset
    val irisDF: DataFrame = spark.read.format("libsvm")
      .option("numFeatures", 4)
      .load("datas/iris_kmeans.txt")

    /**
     *  irisDF.printSchema()
     *     irisDF.show(10,false)
     * root
     * |-- label: double (nullable = true)
     * |-- features: vector (nullable = true)
     *
     * +-----+-------------------------------+
     * |label|features                       |
     * +-----+-------------------------------+
     * |1.0  |(4,[0,1,2,3],[5.1,3.5,1.4,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.9,3.0,1.4,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.7,3.2,1.3,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.6,3.1,1.5,0.2])|
     * |1.0  |(4,[0,1,2,3],[5.0,3.6,1.4,0.2])|
     */
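    // 2. Try different K values, from 2 to 6, and compare the results to choose K
    val values: immutable.IndexedSeq[(Int, KMeansModel, String, Double)] = (2 to 6).map {
      k =>
        // a. Create the KMeans estimator and set its parameters
        val kMeans = new KMeans()
          // Names of the input feature column and the output prediction column
          .setFeaturesCol("features")
          .setPredictionCol("prediction")
          // K for this run
          .setK(k)
          // Maximum number of iterations
          .setMaxIter(50)
          // Initialization mode; optional. The default is "k-means||", a variant of k-means++
          // (it oversamples roughly k*log2(N) candidate points, N being the dataset size, then picks K of them as initial centers)
          .setInitMode("k-means||")
          // Distance measure between a data point and a cluster center: "euclidean" (default) or "cosine"
          // .setDistanceMeasure("euclidean")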
          .setDistanceMeasure("cosine")

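        // b. Fit the estimator on the dataset to obtain the model (a transformer)
        val kmeansModel: KMeansModel = kMeans.fit(irisDF)
        // c. Predict the cluster of every sample
        val predictionDF: DataFrame = kmeansModel.transform(irisDF)

        // Count the number of samples in each cluster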
        val clusterNumber: String = predictionDF.groupBy($"prediction").count()
          .select($"prediction", $"count")
          .as[(Int, Long)]
          .rdd
          .collectAsMap()
          .toMap
          .mkString(",")

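        // d. Evaluate the model
        val evaluator: ClusteringEvaluator = new ClusteringEvaluator()
          .setPredictionCol("prediction")
          // Use the silhouette coefficient as the metric
          .setMetricName("silhouette")
          // Evaluate with squared Euclidean distance (the API default) or with cosine distance
          //.setDistanceMeasure("squaredEuclidean")
          .setDistanceMeasure("cosine")

        /* The silhouette coefficient combines cohesion (how tightly the samples of a cluster
           gather around its center) and separation (how far apart the clusters are).
           Its mean lies in [-1, 1], and the closer to 1 the better.
           The cluster sizes should also be as even as possible.
        * */
        val scValue: Double = evaluator.evaluate(predictionDF)

        // e. Return a 4-tuple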
        (k, kmeansModel, clusterNumber, scValue)
    }


    // Print the metrics for every K
    values.foreach(println)

    // The application is done; release resources
    spark.stop()

  }

}

D: Results with Euclidean distance and with cosine distance:

 

Euclidean distance

(K, model, number of samples in each cluster, silhouette value)
(2,kmeans_33af8f322a80,1 -> 97,0 -> 53,0.8501515983265806)
(3,kmeans_dddad8bd3858,2 -> 39,1 -> 50,0 -> 61,0.7342113066202725)
(4,kmeans_251d99eaeae4,2 -> 28,1 -> 50,3 -> 43,0 -> 29,0.6748661728223084)
(5,kmeans_5a9a066aaa9a,0 -> 23,1 -> 33,2 -> 30,3 -> 47,4 -> 17,0.5593200358940349)
(6,kmeans_734c87051c61,0 -> 30,5 -> 18,1 -> 19,2 -> 47,3 -> 23,4 -> 13,0.5157126401818913)

Cosine distance

(K, model, number of samples in each cluster, silhouette value)
(2,kmeans_99c4cabaa950,1 -> 50,0 -> 100,0.9579554849242657)
(3,kmeans_73251a945156,2 -> 46,1 -> 50,0 -> 54,0.7484647230660575)
(4,kmeans_5f8bce0297d5,2 -> 46,1 -> 19,3 -> 31,0 -> 54,0.5754341193280768)
(5,kmeans_92f07728d30f,0 -> 27,1 -> 50,2 -> 23,3 -> 28,4 -> 22,0.6430770644178772)
(6,kmeans_acbd159f5a1e,0 -> 24,5 -> 21,1 -> 29,2 -> 43,3 -> 15,4 -> 18,0.4512255960897416)

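Takeaway: based on the results above, K = 3 is the more suitable number of clusters. K = 2 has the highest silhouette value in both runs, but its clusters are uneven (97/53 and 100/50), whereas K = 3 keeps a high silhouette value (about 0.73 and 0.75) with much more even cluster sizes.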
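One way to make "evenly sized clusters" concrete is to compare the largest and smallest cluster for each K. The sketch below does that with the cluster sizes copied from the two result listings above; the ratio itself (`imbalance`) is a hypothetical measure chosen for this illustration, not something the original code computes.

```scala
object ClusterBalanceSketch {
  // Cluster sizes per K, copied from the Euclidean-distance results above.
  val euclidean: Map[Int, Seq[Int]] = Map(
    2 -> Seq(97, 53),
    3 -> Seq(39, 50, 61),
    4 -> Seq(28, 50, 43, 29),
    5 -> Seq(23, 33, 30, 47, 17),
    6 -> Seq(30, 18, 19, 47, 23, 13))

  // Cluster sizes per K, copied from the cosine-distance results above.
  val cosine: Map[Int, Seq[Int]] = Map(
    2 -> Seq(50, 100),
    3 -> Seq(46, 50, 54),
    4 -> Seq(46, 19, 31, 54),
    5 -> Seq(27, 50, 23, 28, 22),
    6 -> Seq(24, 21, 29, 43, 15, 18))

  // Hypothetical balance measure: largest cluster divided by smallest.
  // 1.0 means perfectly even; larger values mean more uneven clusters.
  def imbalance(sizes: Seq[Int]): Double = sizes.max.toDouble / sizes.min

  // The K whose clusters are the most even under that measure.
  def mostBalancedK(run: Map[Int, Seq[Int]]): Int =
    run.minBy { case (_, sizes) => imbalance(sizes) }._1

  def main(args: Array[String]): Unit = {
    for ((name, run) <- Seq("euclidean" -> euclidean, "cosine" -> cosine)) {
      run.toSeq.sortBy(_._1).foreach { case (k, sizes) =>
        println(f"$name%s k=$k%d imbalance=${imbalance(sizes)}%.3f")
      }
      println(s"$name: most balanced K = ${mostBalancedK(run)}")
    }
  }
}
```

Under this measure K = 3 gives the most even clusters in both runs, which is consistent with preferring it over K = 2 despite the slightly lower silhouette value.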

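Cosine distance uses the cosine of the angle between two vectors to measure how different two individuals are. Compared with Euclidean distance, cosine distance puts more weight on the difference in direction between the two vectors. A three-dimensional coordinate system shows the difference between Euclidean distance and cosine distance: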

[Figure: Euclidean distance and cosine distance in a three-dimensional coordinate system]

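Summary: keep the two apart in everyday use. Cosine distance is not a distance metric in the strict sense, but it is still very useful for describing the relationship between two feature vectors, for example in face recognition and recommender systems.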
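The difference between the two measures is easy to see on a small example. The sketch below is plain Scala, with cosine distance taken as 1 minus the cosine similarity (an assumption of this sketch) and three made-up vectors.

```scala
object DistanceSketch {
  // Euclidean distance: depends on both direction and magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Length of a vector.
  def norm(a: Seq[Double]): Double = math.sqrt(a.map(x => x * x).sum)

  // Cosine distance, taken here as 1 - cosine similarity: depends on direction only.
  def cosineDistance(a: Seq[Double], b: Seq[Double]): Double =
    1.0 - a.zip(b).map { case (x, y) => x * y }.sum / (norm(a) * norm(b))

  val a: Seq[Double] = Seq(1.0, 2.0, 3.0)
  val sameDirection: Seq[Double] = Seq(10.0, 20.0, 30.0) // a scaled by 10
  val otherDirection: Seq[Double] = Seq(3.0, 2.0, 1.0)   // same length as a, different direction

  def main(args: Array[String]): Unit = {
    println(s"euclidean(a, sameDirection)       = ${euclidean(a, sameDirection)}")
    println(s"euclidean(a, otherDirection)      = ${euclidean(a, otherDirection)}")
    println(s"cosineDistance(a, sameDirection)  = ${cosineDistance(a, sameDirection)}")
    println(s"cosineDistance(a, otherDirection) = ${cosineDistance(a, otherDirection)}")
  }
}
```

By Euclidean distance `otherDirection` is the closer neighbour of `a`, while by cosine distance `sameDirection` is (its cosine distance to `a` is essentially 0), because scaling a vector does not change its direction.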
