KMeans clustering: choosing the number of cluster centers K with the elbow method.
K-Means is one of the most widely used clustering algorithms; it belongs to the family of prototype-based clustering methods.
A crucial step of the algorithm is choosing the value of K. A common approach is to use the elbow method to narrow down candidate K values, then judge which K is best by combining the silhouette coefficient with the number of samples that end up in each cluster.
Elbow method
1) For a dataset of n points, iterate k from 1 to n; after each clustering run, compute the sum of squared distances from every point to the center of its assigned cluster (the within-cluster sum of squared errors, WSSE).
2) This sum decreases monotonically as k grows, reaching 0 at k = n, where every point is its own cluster center.
3) Along the way the curve shows an inflection point, the "elbow": the k at which the rate of decrease suddenly flattens is taken as the best value. The same elbow criterion is useful for deciding when to stop in general: real data is noisy, so stop adding clusters once additional ones no longer bring a meaningful reduction.
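The three steps above can be sketched without Spark. Below is a minimal, self-contained toy, assuming made-up 2-D points, naive Lloyd iterations, and first-k initialization (all illustration-only choices, not part of the original example):

```scala
// Toy elbow-method demo: names and data are invented for illustration.
object ElbowSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1; val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Naive Lloyd's k-means: initialize with the first k points, then
  // alternate assignment and mean-update steps a fixed number of times.
  def kmeans(points: Seq[Point], k: Int, iters: Int = 20): Seq[Point] = {
    var centers = points.take(k)
    for (_ <- 0 until iters) {
      val groups = points.groupBy(p => centers.minBy(c => dist2(p, c)))
      centers = centers.map { c =>
        groups.get(c)
          .map(g => (g.map(_._1).sum / g.size, g.map(_._2).sum / g.size))
          .getOrElse(c) // keep a center that attracted no points
      }
    }
    centers
  }

  // WSSE: the quantity the elbow curve plots against k.
  def wsse(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => dist2(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    // Two visually obvious groups, so the curve should flatten after k = 2.
    val data: Seq[Point] =
      Seq((1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1))
    (1 to 4).foreach { k =>
      println(f"k=$k%d  WSSE=${wsse(data, kmeans(data, k))}%.3f")
    }
  }
}
```

Plotting WSSE against k, the large drop happens going from 1 to 2 clusters and further increases barely help; that flattening point is the elbow.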
Silhouette coefficient
Another evaluation metric for clustering is the silhouette coefficient, which combines cohesion and separation to assess clustering quality:
1) Compute a_i, the mean distance from sample i to the other samples in its own cluster. The smaller a_i, the lower the intra-cluster dissimilarity of sample i, and the more strongly it belongs to that cluster.
2) Compute b_ij, the mean distance from sample i to all samples of another cluster C_j, called the dissimilarity between sample i and C_j. The inter-cluster dissimilarity of sample i is defined as b_i = min{b_i1, b_i2, ..., b_ik}: the larger b_i, the less sample i belongs to any other cluster.
3) The silhouette of sample i is s_i = (b_i − a_i) / max(a_i, b_i). Averaging s_i over all samples gives the mean silhouette coefficient, which lies in [-1, 1]; the larger it is, the better the clustering: samples are close together within clusters and far apart between clusters.
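The definitions above fit in a few lines of plain Scala. A toy sketch with hand-picked 1-D clusters (names and numbers are invented for illustration; it assumes distinct values and clusters with more than one member):

```scala
// Toy silhouette computation for 1-D points with fixed cluster assignments.
object SilhouetteSketch {
  // Mean absolute distance from x to the members of `cluster`,
  // optionally excluding x itself (used for the own-cluster term a_i).
  def meanDist(x: Double, cluster: Seq[Double], excludeSelf: Boolean): Double = {
    val others = if (excludeSelf) cluster.filterNot(_ == x) else cluster
    others.map(y => math.abs(x - y)).sum / others.size
  }

  // s_i = (b_i - a_i) / max(a_i, b_i) for one sample, given its own
  // cluster and the list of all other clusters.
  def silhouette(x: Double, own: Seq[Double], others: Seq[Seq[Double]]): Double = {
    val a = meanDist(x, own, excludeSelf = true)                     // cohesion
    val b = others.map(c => meanDist(x, c, excludeSelf = false)).min // separation
    (b - a) / math.max(a, b)
  }

  def main(args: Array[String]): Unit = {
    val c1 = Seq(1.0, 1.1, 0.9) // tight cluster
    val c2 = Seq(9.0, 9.2, 8.8) // far-away cluster
    val s = c1.map(x => silhouette(x, c1, Seq(c2)))
    println(f"mean silhouette of c1: ${s.sum / s.size}%.3f") // close to 1
  }
}
```

Because the two clusters are tight and far apart, every a_i is small and every b_i is large, so the silhouettes come out close to 1.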
The following example clusters the iris dataset with KMeans, uses the elbow method to choose K, and evaluates the model with the silhouette coefficient.
Preparation:
A: Data — iris_kmeans.txt (the iris dataset in LIBSVM format: each record is `label index:value ...` with four features per sample)
1 1:5.1 2:3.5 3:1.4 4:0.2 1 1:4.9 2:3.0 3:1.4 4:0.2 1 1:4.7 2:3.2 3:1.3 4:0.2 1 1:4.6 2:3.1 3:1.5 4:0.2 1 1:5.0 2:3.6 3:1.4 4:0.2 1 1:5.4 2:3.9 3:1.7 4:0.4 1 1:4.6 2:3.4 3:1.4 4:0.3 1 1:5.0 2:3.4 3:1.5 4:0.2 1 1:4.4 2:2.9 3:1.4 4:0.2 1 1:4.9 2:3.1 3:1.5 4:0.1 1 1:5.4 2:3.7 3:1.5 4:0.2 1 1:4.8 2:3.4 3:1.6 4:0.2 1 1:4.8 2:3.0 3:1.4 4:0.1 1 1:4.3 2:3.0 3:1.1 4:0.1 1 1:5.8 2:4.0 3:1.2 4:0.2 1 1:5.7 2:4.4 3:1.5 4:0.4 1 1:5.4 2:3.9 3:1.3 4:0.4 1 1:5.1 2:3.5 3:1.4 4:0.3 1 1:5.7 2:3.8 3:1.7 4:0.3 1 1:5.1 2:3.8 3:1.5 4:0.3 1 1:5.4 2:3.4 3:1.7 4:0.2 1 1:5.1 2:3.7 3:1.5 4:0.4 1 1:4.6 2:3.6 3:1.0 4:0.2 1 1:5.1 2:3.3 3:1.7 4:0.5 1 1:4.8 2:3.4 3:1.9 4:0.2 1 1:5.0 2:3.0 3:1.6 4:0.2 1 1:5.0 2:3.4 3:1.6 4:0.4 1 1:5.2 2:3.5 3:1.5 4:0.2 1 1:5.2 2:3.4 3:1.4 4:0.2 1 1:4.7 2:3.2 3:1.6 4:0.2 1 1:4.8 2:3.1 3:1.6 4:0.2 1 1:5.4 2:3.4 3:1.5 4:0.4 1 1:5.2 2:4.1 3:1.5 4:0.1 1 1:5.5 2:4.2 3:1.4 4:0.2 1 1:4.9 2:3.1 3:1.5 4:0.1 1 1:5.0 2:3.2 3:1.2 4:0.2 1 1:5.5 2:3.5 3:1.3 4:0.2 1 1:4.9 2:3.1 3:1.5 4:0.1 1 1:4.4 2:3.0 3:1.3 4:0.2 1 1:5.1 2:3.4 3:1.5 4:0.2 1 1:5.0 2:3.5 3:1.3 4:0.3 1 1:4.5 2:2.3 3:1.3 4:0.3 1 1:4.4 2:3.2 3:1.3 4:0.2 1 1:5.0 2:3.5 3:1.6 4:0.6 1 1:5.1 2:3.8 3:1.9 4:0.4 1 1:4.8 2:3.0 3:1.4 4:0.3 1 1:5.1 2:3.8 3:1.6 4:0.2 1 1:4.6 2:3.2 3:1.4 4:0.2 1 1:5.3 2:3.7 3:1.5 4:0.2 1 1:5.0 2:3.3 3:1.4 4:0.2 2 1:7.0 2:3.2 3:4.7 4:1.4 2 1:6.4 2:3.2 3:4.5 4:1.5 2 1:6.9 2:3.1 3:4.9 4:1.5 2 1:5.5 2:2.3 3:4.0 4:1.3 2 1:6.5 2:2.8 3:4.6 4:1.5 2 1:5.7 2:2.8 3:4.5 4:1.3 2 1:6.3 2:3.3 3:4.7 4:1.6 2 1:4.9 2:2.4 3:3.3 4:1.0 2 1:6.6 2:2.9 3:4.6 4:1.3 2 1:5.2 2:2.7 3:3.9 4:1.4 2 1:5.0 2:2.0 3:3.5 4:1.0 2 1:5.9 2:3.0 3:4.2 4:1.5 2 1:6.0 2:2.2 3:4.0 4:1.0 2 1:6.1 2:2.9 3:4.7 4:1.4 2 1:5.6 2:2.9 3:3.6 4:1.3 2 1:6.7 2:3.1 3:4.4 4:1.4 2 1:5.6 2:3.0 3:4.5 4:1.5 2 1:5.8 2:2.7 3:4.1 4:1.0 2 1:6.2 2:2.2 3:4.5 4:1.5 2 1:5.6 2:2.5 3:3.9 4:1.1 2 1:5.9 2:3.2 3:4.8 4:1.8 2 1:6.1 2:2.8 3:4.0 4:1.3 2 1:6.3 2:2.5 3:4.9 4:1.5 2 1:6.1 2:2.8 3:4.7 4:1.2 2 1:6.4 2:2.9 3:4.3 4:1.3 2 1:6.6 2:3.0 3:4.4 4:1.4 2 1:6.8 2:2.8 3:4.8 
4:1.4 2 1:6.7 2:3.0 3:5.0 4:1.7 2 1:6.0 2:2.9 3:4.5 4:1.5 2 1:5.7 2:2.6 3:3.5 4:1.0 2 1:5.5 2:2.4 3:3.8 4:1.1 2 1:5.5 2:2.4 3:3.7 4:1.0 2 1:5.8 2:2.7 3:3.9 4:1.2 2 1:6.0 2:2.7 3:5.1 4:1.6 2 1:5.4 2:3.0 3:4.5 4:1.5 2 1:6.0 2:3.4 3:4.5 4:1.6 2 1:6.7 2:3.1 3:4.7 4:1.5 2 1:6.3 2:2.3 3:4.4 4:1.3 2 1:5.6 2:3.0 3:4.1 4:1.3 2 1:5.5 2:2.5 3:4.0 4:1.3 2 1:5.5 2:2.6 3:4.4 4:1.2 2 1:6.1 2:3.0 3:4.6 4:1.4 2 1:5.8 2:2.6 3:4.0 4:1.2 2 1:5.0 2:2.3 3:3.3 4:1.0 2 1:5.6 2:2.7 3:4.2 4:1.3 2 1:5.7 2:3.0 3:4.2 4:1.2 2 1:5.7 2:2.9 3:4.2 4:1.3 2 1:6.2 2:2.9 3:4.3 4:1.3 2 1:5.1 2:2.5 3:3.0 4:1.1 2 1:5.7 2:2.8 3:4.1 4:1.3 3 1:6.3 2:3.3 3:6.0 4:2.5 3 1:5.8 2:2.7 3:5.1 4:1.9 3 1:7.1 2:3.0 3:5.9 4:2.1 3 1:6.3 2:2.9 3:5.6 4:1.8 3 1:6.5 2:3.0 3:5.8 4:2.2 3 1:7.6 2:3.0 3:6.6 4:2.1 3 1:4.9 2:2.5 3:4.5 4:1.7 3 1:7.3 2:2.9 3:6.3 4:1.8 3 1:6.7 2:2.5 3:5.8 4:1.8 3 1:7.2 2:3.6 3:6.1 4:2.5 3 1:6.5 2:3.2 3:5.1 4:2.0 3 1:6.4 2:2.7 3:5.3 4:1.9 3 1:6.8 2:3.0 3:5.5 4:2.1 3 1:5.7 2:2.5 3:5.0 4:2.0 3 1:5.8 2:2.8 3:5.1 4:2.4 3 1:6.4 2:3.2 3:5.3 4:2.3 3 1:6.5 2:3.0 3:5.5 4:1.8 3 1:7.7 2:3.8 3:6.7 4:2.2 3 1:7.7 2:2.6 3:6.9 4:2.3 3 1:6.0 2:2.2 3:5.0 4:1.5 3 1:6.9 2:3.2 3:5.7 4:2.3 3 1:5.6 2:2.8 3:4.9 4:2.0 3 1:7.7 2:2.8 3:6.7 4:2.0 3 1:6.3 2:2.7 3:4.9 4:1.8 3 1:6.7 2:3.3 3:5.7 4:2.1 3 1:7.2 2:3.2 3:6.0 4:1.8 3 1:6.2 2:2.8 3:4.8 4:1.8 3 1:6.1 2:3.0 3:4.9 4:1.8 3 1:6.4 2:2.8 3:5.6 4:2.1 3 1:7.2 2:3.0 3:5.8 4:1.6 3 1:7.4 2:2.8 3:6.1 4:1.9 3 1:7.9 2:3.8 3:6.4 4:2.0 3 1:6.4 2:2.8 3:5.6 4:2.2 3 1:6.3 2:2.8 3:5.1 4:1.5 3 1:6.1 2:2.6 3:5.6 4:1.4 3 1:7.7 2:3.0 3:6.1 4:2.3 3 1:6.3 2:3.4 3:5.6 4:2.4 3 1:6.4 2:3.1 3:5.5 4:1.8 3 1:6.0 2:3.0 3:4.8 4:1.8 3 1:6.9 2:3.1 3:5.4 4:2.1 3 1:6.7 2:3.1 3:5.6 4:2.4 3 1:6.9 2:3.1 3:5.1 4:2.3 3 1:5.8 2:2.7 3:5.1 4:1.9 3 1:6.8 2:3.2 3:5.9 4:2.3 3 1:6.7 2:3.3 3:5.7 4:2.5 3 1:6.7 2:3.0 3:5.2 4:2.3 3 1:6.3 2:2.5 3:5.0 4:1.9 3 1:6.5 2:3.0 3:5.2 4:2.0 3 1:6.2 2:3.4 3:5.4 4:2.3 3 1:5.9 2:3.0 3:5.1 4:1.8
B: Maven dependencies (the full project's dependency list; trim it down to what you actually need)
<repositories>
<repository>
<id>ali-repo</id>
<name>ali-repo</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<layout>default</layout>
</repository>
<repository>
<id>mvn-repo</id>
<name>mvn-repo</name>
<url>https://mvnrepository.com</url>
</repository>
<repository>
<id>cdh-repo</id>
<name>cdh-repo</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>hdp-repo</id>
<name>hdp-repo</name>
<url>http://repo.hortonworks.com/content/repositories/releases/</url>
</repository>
</repositories>
<properties>
<java.version>1.8</java.version>
<!-- project compiler -->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.encoding>UTF-8</maven.compiler.encoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>
<scala.version>2.11.8</scala.version>
<hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
<spark.version>2.4.0</spark.version><!-- ClusteringEvaluator requires Spark 2.3+, cosine distance in KMeans requires 2.4+ -->
<hive.version>1.1.0-cdh5.14.0</hive.version>
<oozie.version>4.1.0-cdh5.14.0</oozie.version>
<hbase.version>1.2.0-cdh5.14.0</hbase.version>
<solr.version>4.10.3-cdh5.14.0</solr.version>
<jsch.version>0.1.53</jsch.version>
<jackson.spark.version>2.6.5</jackson.spark.version>
<mysql.version>5.1.46</mysql.version>
<!-- maven plugins -->
<mybatis-generator-maven-plugin.version>1.3.5</mybatis-generator-maven-plugin.version>
<maven-surefire-plugin.version>2.19.1</maven-surefire-plugin.version>
<maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
<wagon-ssh.version>3.1.0</wagon-ssh.version>
<wagon-maven-plugin.version>2.0.0</wagon-maven-plugin.version>
<maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
<maven-war-plugin.version>3.2.1</maven-war-plugin.version>
<jetty-maven-plugin.version>9.4.10.v20180503</jetty-maven-plugin.version>
</properties>
<dependencies>
<!-- Scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- jackson -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${jackson.spark.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>${jackson.spark.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>${jackson.spark.version}</version>
</dependency>
<!-- spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
<version>0.13</version>
<exclusions>
<exclusion>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- hadoop -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-util</artifactId>
</exclusion>
<exclusion>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
</exclusion>
<exclusion>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-sslengine</artifactId>
</exclusion>
<exclusion>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-xc</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-util</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
<exclusions>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>servlet-api-2.5</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-util-6.1.26.hwx</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-util</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty</artifactId>
</exclusion>
<exclusion>
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-sslengine</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive-thriftserver_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- solr -->
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
<version>${solr.version}</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>${solr.version}</version>
</dependency>
<!-- mysql -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.2.1</version>
</dependency>
</dependencies>
<build>
<outputDirectory>target/classes</outputDirectory>
<testOutputDirectory>target/test-classes</testOutputDirectory>
<resources>
<resource>
<directory>${project.basedir}/src/main/resources</directory>
</resource>
</resources>
<!-- Maven compiler plugins -->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
C: Code:
package ml
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import scala.collection.immutable
/**
* @Author: sou1yu
* @Email:aaa@qq.com
*/
object IrisClusterDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.appName(this.getClass.getSimpleName.stripSuffix("$"))
.master("local[3]")
.config("spark.sql.shuffle.partitions", "2")
.getOrCreate()
import spark.implicits._
//1. Load the iris dataset
val irisDF: DataFrame = spark.read.format("libsvm")
.option("numFeatures", 4)
.load("datas/iris_kmeans.txt")
/**
* irisDF.printSchema()
* irisDF.show(10,false)
* root
* |-- label: double (nullable = true)
* |-- features: vector (nullable = true)
*
* +-----+-------------------------------+
* |label|features |
* +-----+-------------------------------+
* |1.0 |(4,[0,1,2,3],[5.1,3.5,1.4,0.2])|
* |1.0 |(4,[0,1,2,3],[4.9,3.0,1.4,0.2])|
* |1.0 |(4,[0,1,2,3],[4.7,3.2,1.3,0.2])|
* |1.0 |(4,[0,1,2,3],[4.6,3.1,1.5,0.2])|
* |1.0 |(4,[0,1,2,3],[5.0,3.6,1.4,0.2])|
*/
//2. Try K values from 2 to 6 and choose K with the elbow method
val values: immutable.IndexedSeq[(Int, KMeansModel, String, Double)] = (2 to 6).map {
k =>
//a. Create a KMeans estimator and set its parameters
val kMeans = new KMeans()
//input features column and output prediction column
.setFeaturesCol("features")
.setPredictionCol("prediction")
//set K dynamically
.setK(k)
//maximum number of iterations
.setMaxIter(50)
//initialization mode (optional): the default is already "k-means||", a parallel variant of k-means++ that samples about k*log2(N) candidate centers (N = number of data points) and then selects K of them as the initial cluster centers
.setInitMode("k-means||")
//distance measure between points and cluster centers: Euclidean (the default) or cosine
// .setDistanceMeasure("euclidean")
.setDistanceMeasure("cosine")
//b. Fit the model on the dataset (estimator -> transformer)
val kmeansModel: KMeansModel = kMeans.fit(irisDF)
//c. Model prediction
val predictionDF: DataFrame = kmeansModel.transform(irisDF)
// count the number of samples in each cluster
val clusterNumber: String = predictionDF.groupBy($"prediction").count()
.select($"prediction", $"count")
.as[(Int, Long)]
.rdd
.collectAsMap()
.toMap
.mkString(",")
//d. Model evaluation
val evaluator: ClusteringEvaluator = new ClusteringEvaluator()
.setPredictionCol("prediction")
//use the silhouette metric
.setMetricName("silhouette")
// evaluation distance: squared Euclidean (the API default) or cosine
//.setDistanceMeasure("squaredEuclidean")
.setDistanceMeasure("cosine")
/* The silhouette coefficient combines cohesion (how tightly each cluster's samples gather around its center) and separation (how far apart the clusters are). The mean silhouette lies in [-1, 1]; the closer to 1 the better, but also check that cluster sizes are reasonably balanced. */
val scValue: Double = evaluator.evaluate(predictionDF)
//e. Return a 4-tuple (K, model, cluster sizes, silhouette)
(k, kmeansModel, clusterNumber, scValue)
}
//print the metrics for each K
values.foreach(println)
//done; release resources
spark.stop()
}
}
D: Results using Euclidean distance and cosine distance respectively
Euclidean distance
(K, model, samples per cluster, silhouette)
(2,kmeans_33af8f322a80,1 -> 97,0 -> 53,0.8501515983265806)
(3,kmeans_dddad8bd3858,2 -> 39,1 -> 50,0 -> 61,0.7342113066202725)
(4,kmeans_251d99eaeae4,2 -> 28,1 -> 50,3 -> 43,0 -> 29,0.6748661728223084)
(5,kmeans_5a9a066aaa9a,0 -> 23,1 -> 33,2 -> 30,3 -> 47,4 -> 17,0.5593200358940349)
(6,kmeans_734c87051c61,0 -> 30,5 -> 18,1 -> 19,2 -> 47,3 -> 23,4 -> 13,0.5157126401818913)
Cosine distance
(K, model, samples per cluster, silhouette)
(2,kmeans_99c4cabaa950,1 -> 50,0 -> 100,0.9579554849242657)
(3,kmeans_73251a945156,2 -> 46,1 -> 50,0 -> 54,0.7484647230660575)
(4,kmeans_5f8bce0297d5,2 -> 46,1 -> 19,3 -> 31,0 -> 54,0.5754341193280768)
(5,kmeans_92f07728d30f,0 -> 27,1 -> 50,2 -> 23,3 -> 28,4 -> 22,0.6430770644178772)
(6,kmeans_acbd159f5a1e,0 -> 24,5 -> 21,1 -> 29,2 -> 43,3 -> 15,4 -> 18,0.4512255960897416)
Takeaway: combining the silhouette values above with how evenly the samples are distributed across clusters, setting K to 3 is the more appropriate choice.
Cosine distance uses the cosine of the angle between two vectors to measure how different two individuals are. Compared with Euclidean distance, cosine distance focuses on the difference in direction between the two vectors rather than in magnitude (picture both in a 3-D coordinate system: Euclidean distance is the straight-line gap between the endpoints, while cosine distance depends only on the angle between the vectors).
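The direction-vs-magnitude difference shows up clearly with two proportional vectors; a minimal sketch with made-up numbers:

```scala
// Euclidean vs. cosine distance on two vectors pointing the same way.
object DistanceSketch {
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine distance = 1 - cosine similarity; depends only on direction.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    1.0 - dot / (na * nb)
  }

  def main(args: Array[String]): Unit = {
    val u = Array(1.0, 2.0)
    val v = Array(10.0, 20.0) // same direction, 10x the magnitude
    println(f"euclidean: ${euclidean(u, v)}%.3f") // large: the endpoints are far apart
    println(f"cosine:    ${cosine(u, v)}%.6f")    // ~0: the directions are identical
  }
}
```

The Euclidean distance is large because the magnitudes differ, while the cosine distance is essentially zero because the directions coincide, which is exactly why the two measures can rank clusterings differently.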
Summary: keep the distinction in mind in everyday use. Cosine distance is not a distance metric in the strict mathematical sense, but it is very useful for describing the relationship between two feature vectors, for example in face recognition and recommender systems.