Spark ML: Tokenizer

The Tokenizer in Spark ML

  • Tokenization is the process of breaking text, such as a sentence, into individual words. Spark ML provides the Tokenizer class for this purpose, while RegexTokenizer offers more advanced splitting based on regular-expression matching. By default, the pattern parameter (default regex: "\s+") is used as the delimiter to split the input text. Alternatively, you can set the gaps parameter to false, in which case the pattern describes the tokens themselves rather than the delimiters, and all matches found become the tokenization result. Which mode to use simply depends on how your data needs to be split; the plain-Scala sketch below illustrates the difference.
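A minimal plain-Scala sketch of the two modes (my own illustration, not from the official example; the object name GapsDemo is made up), with no Spark required:

object GapsDemo extends App {
  val s = "Logistic,regression,models,are,neat".toLowerCase

  // gaps = true (the default): the pattern describes the *separators*, so we split on it
  val byGaps: Array[String] = s.split("\\W")

  // gaps = false: the pattern describes the *tokens* themselves, so we collect the matches
  val byMatches: List[String] = "\\w+".r.findAllIn(s).toList

  println(byGaps.mkString(", "))    // logistic, regression, models, are, neat
  println(byMatches.mkString(", ")) // logistic, regression, models, are, neat
}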
    Example code, as given in the official documentation:
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
  *
  * @author wjc
  *
  *         Tokenizer example
  **/
object TokenizerExample extends App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("ml_learn")
    //  .enableHiveSupport()
    .getOrCreate()

  val sentenceDataFrame = spark.createDataFrame(Seq(
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
  )).toDF("id", "sentence")

  sentenceDataFrame.show(false)

  // Tokenizer instance: lowercases the input and splits on whitespace
  val tokenizer = new Tokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
  // RegexTokenizer: "\\W" (non-word characters) serves as the delimiter pattern
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")
  // Alternatively, set gaps to false so the pattern describes the tokens themselves rather than the delimiters
  val regexTokenizer2 = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\w+")
    .setGaps(false)

  // UDF that counts the tokens in each row
  val countTokens = udf { (words: Seq[String]) => words.length }

  // Tokenizer result
  val tokenized = tokenizer.transform(sentenceDataFrame)
  tokenized.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words")))
    .show(false)

  // RegexTokenizer result
  val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
  regexTokenized.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words")))
    .show(false)

  // RegexTokenizer (gaps = false) result
  val regexTokenized2 = regexTokenizer2.transform(sentenceDataFrame)
  regexTokenized2.select("sentence", "words")
    .withColumn("tokens", countTokens(col("words")))
    .show(false)

  spark.stop()
}
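In practice the tokenizer is usually just the first stage of an ML Pipeline. The following is a minimal sketch of my own (not part of the official example; HashingTF and the feature dimension of 1024 are illustrative choices), reusing sentenceDataFrame from above:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.HashingTF

// Tokenize, then hash each token list into a fixed-size term-frequency vector
val tok = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1024)

val pipelineModel = new Pipeline()
  .setStages(Array(tok, hashingTF))
  .fit(sentenceDataFrame)

pipelineModel.transform(sentenceDataFrame).select("sentence", "features").show(false)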

Run results of the official example:
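Both tokenizers lowercase their input by default, and the plain Tokenizer splits only on whitespace, so the comma-separated third sentence stays a single token. The Tokenizer output should therefore look roughly like:

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+

and the RegexTokenizer output (both the "\\W" delimiter variant and the gaps = false "\\w+" variant should produce the same tokens):

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+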
