The tokenize function (tokenizing with regular expressions)

程序员文章站 2024-03-31 22:44:04

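This article introduces the concept of topic modeling and the principles behind the LDA algorithm, shows how to build a basic LDA topic model in Python, and visualizes the topics with pyLDAvis.

Image source: Kamil Polak

Introduction

Topic modeling extracts features from the terms in documents and uses mathematical structures and frameworks, such as matrix factorization and singular value decomposition (SVD), to generate clusters or groups of terms that are distinguishable from one another. These word clusters in turn form topics or concepts.

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data.

These concepts can be used to interpret the themes of a corpus, and to establish semantic connections between words that frequently occur together across documents.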

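Topic modeling can be applied to:

Discovering hidden topics in a dataset;

Classifying documents into the discovered topics;

Using the classification to organize, summarize, and search documents.

Several frameworks and algorithms can be used to build topic models:

Latent Semantic Indexing (LSI)

Latent Dirichlet Allocation (LDA)

Non-negative Matrix Factorization (NMF)

This article focuses on LDA topic modeling in Python. Specifically, we will cover:

What Latent Dirichlet Allocation (LDA) is;

How the LDA algorithm works;

How to build an LDA topic model in Python.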

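What is Latent Dirichlet Allocation (LDA)?

Latent Dirichlet Allocation (LDA) is a generative probabilistic model. It assumes that each document is a mixture of topics, in a way similar to the probabilistic latent semantic indexing (pLSI) model.

In short, the idea behind LDA is that each document can be described by a distribution over topics, and each topic can be described by a distribution over words.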
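
To make the "distribution over topics" and "distribution over words" idea concrete, here is a minimal sketch that generates a toy document. The topic names, words, and probabilities are invented for illustration. Full LDA additionally draws these distributions from Dirichlet priors, which this sketch leaves out.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "engine": 0.1},
    "autos": {"engine": 0.6, "car": 0.3, "team": 0.1},
}

# Hypothetical document: a probability distribution over topics.
doc_topic_mix = {"sports": 0.8, "autos": 0.2}


def generate_document(topic_mix, topics, n_words, seed=0):
    """Draw n_words words: pick a topic for each slot, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


generated = generate_document(doc_topic_mix, topics, n_words=10)
print(generated)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.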

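How does the LDA algorithm work?

LDA works with two kinds of information:

The words that belong to each document, which we already know;

The words that belong to each topic, or the probability of a word belonging to a topic, which must be computed.

Note: LDA does not care about the order of words in a document. Documents are usually represented with bag-of-words features.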
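
As a quick illustration of the bag-of-words assumption, the following sketch (using only the standard library, with two made-up sentences) shows that documents with the same words in a different order get the same representation.

```python
from collections import Counter

# Two toy documents with the same words in a different order.
doc_a = "the cat chased the dog".split()
doc_b = "the dog chased the cat".split()

# A bag of words keeps only how often each word occurs.
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(bow_a)
print(bow_a == bow_b)  # True: word order is discarded
```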

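The following steps give a very simplified account of how the LDA algorithm works:

1. For each document, randomly assign each word to one of k topics (k is chosen in advance);

2. For each document d, go through each word w and compute:

p(t | d): the proportion of words in document d that are assigned to topic t;

p(w | t): the proportion of assignments to topic t, across all documents, that come from word w.

3. Taking all other words and their topic assignments into account, reassign word w to a topic t with probability p(t | d) × p(w | t).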
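
The three steps above can be sketched as a toy sampler in plain Python. This is only an illustration of the reassignment loop, not the algorithm gensim runs (gensim's LdaModel uses online variational Bayes). The function name toy_lda and the smoothing constants alpha and beta are assumptions added here so that no probability is ever exactly zero.

```python
import random
from collections import defaultdict


def toy_lda(docs, k, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler for the three steps; alpha and beta are assumed smoothing terms."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: give every word occurrence a random topic.
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    doc_topic = [[0] * k for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                              # all words assigned to topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out so only the *other* words are counted.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: p(t | d) (up to a constant factor) and p(w | t) for each topic.
                weights = []
                for t in range(k):
                    p_t_d = doc_topic[d][t] + alpha
                    p_w_t = (topic_word[t][w] + beta) / (topic_total[t] + beta * vocab_size)
                    weights.append(p_t_d * p_w_t)

                # Step 3: reassign the word with probability proportional to the product.
                t_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic


docs = [["game", "team", "game"], ["engine", "car", "engine"], ["team", "game", "car"]]
assignments, doc_topic = toy_lda(docs, k=2)
print(doc_topic)  # per-document counts of words assigned to each topic
```

After enough iterations the counts in doc_topic approximate each document's topic mixture. On a corpus this small the result mostly demonstrates the mechanics.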

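The figure below illustrates the LDA topic model.

Image source: wiki

The next figure shows intuitively how each parameter connects back to the text documents and terms. Suppose we have m documents containing n words, and we want to generate k topics in total.

The black box in the figure represents the core algorithm, which uses the parameters above to extract k topics from the documents.

Image source: Christine Doig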

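How to build an LDA topic model in Python

We will use the Latent Dirichlet Allocation (LDA) implementation from the gensim package.

First, we need to import the packages. The core packages are re, gensim, spacy, and pyLDAvis. We also use matplotlib, numpy, and pandas for data handling and visualization.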

import re
import numpy as np
import pandas as pd
from pprint import pprint

# gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spaCy for lemmatization
import spacy

# plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (recent pyLDAvis releases name this module pyLDAvis.gensim_models)
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline

# enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

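Words such as am/is/are/of/a/the/but/… carry no information about the "topic". As a preprocessing step, we can therefore remove them from the documents.

To do this, we import the stop words from NLTK. The original stop word list can also be extended with a few extra words.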

# NLTK stop words (run nltk.download('stopwords') once if the corpus is not installed)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In this tutorial we use the 20 Newsgroups dataset, which contains roughly 11k newsgroup posts from 20 different topics. It is available as newsgroups.json.

# import dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

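Remove emails and newline characters

Before we start topic modeling, the dataset needs to be cleaned. First, remove email addresses, extra whitespace, and newline characters.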

# convert to list
data = df.content.values.tolist()

# remove emails (any run of non-space characters containing '@')
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# collapse newlines and repeated whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data[:1])
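
To see what these three substitutions do, here is a small self-contained check on a made-up message (the sample string is an assumption, not taken from the dataset). Note that the patterns rely on the backslash classes \S and \s. Without the backslashes they would match the literal letter "s" instead.

```python
import re

# Hypothetical raw post with an email address, newlines, and a single quote.
sample = "From: alice@example.com\nSubject: it's a\n  test   message"

no_email = re.sub(r'\S*@\S*\s?', '', sample)  # drop the token containing '@'
one_line = re.sub(r'\s+', ' ', no_email)      # collapse whitespace and newlines
cleaned = re.sub("'", "", one_line)           # drop single quotes

print(cleaned)  # From: Subject: its a test message
```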

Tokenize words and clean up the text

Let's tokenize each sentence into a list of words, removing punctuation and unnecessary characters.

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes, and drops punctuation; deacc=True also strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

print(data_words[:1])
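
For readers who want to see roughly what this tokenization step involves, below is a standard-library approximation built on a regular expression. It is a sketch and not gensim's actual implementation. The name rough_preprocess is hypothetical, and the length limits of 2 and 15 are assumed to mirror the defaults of simple_preprocess.

```python
import re
import unicodedata

# Runs of letters only: word characters that are neither digits nor underscores.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def strip_accents(text):
    """Remove accent marks by dropping combining characters after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, split into letter-only tokens, and filter by length."""
    text = strip_accents(text.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if min_len <= len(tok) <= max_len]


tokens = rough_preprocess("Café prices rose 12% in 2020, didn't they?")
print(tokens)  # ['cafe', 'prices', 'rose', 'in', 'didn', 'they']
```

Numbers and punctuation disappear, and the stray "t" left over from "didn't" is dropped by the minimum length filter.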

Create bigram and trigram models

# build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# see trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
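
The min_count and threshold arguments control which word pairs are merged into phrases. The sketch below illustrates one common scoring rule, the one from Mikolov et al. (2013), which we understand to be close to gensim's default scorer. The helper name score_bigrams and the toy sentences are assumptions, and gensim's own vocabulary bookkeeping may differ in detail.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }


sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "busy"],
    ["the", "city", "is", "big"],
]
scores = score_bigrams(sentences)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('new', 'york') 2.0
```

Pairs whose score exceeds the threshold are joined into a single token such as new_york, which is why a higher threshold yields fewer phrases.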

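Remove stop words, form bigrams, and lemmatize

In this step we define separate functions for removing stop words, forming bigrams and trigrams, and lemmatizing, and then call them in sequence.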

# define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Keep only lemmas whose POS tag is in allowed_postags (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


# remove stop words
data_words_nostops = remove_stopwords(data_words)

# form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# initialize the spaCy English model, keeping only the tagger component (for efficiency)
# python3 -m spacy download en_core_web_sm
# (spaCy 2.x also accepted the shortcut name 'en'; spaCy 3 requires the full model name)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# do lemmatization keeping only noun, adj, verb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

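Create the dictionary and corpus needed for topic modeling

The LDA model takes a dictionary and a corpus as input, so we build both here. Gensim's dictionary assigns a unique id to each word, and the corpus stores each document as a list of (word id, count) pairs.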

# create dictionary
id2word = corpora.Dictionary(data_lemmatized)

# create corpus
texts = data_lemmatized

# term document frequency
corpus = [id2word.doc2bow(text) for text in texts]

# view
print(corpus[:1])
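
The structure printed above can be hard to read at first, so here is a standard-library sketch of the same idea on two toy documents. The helpers build_id_map and to_bow are hypothetical stand-ins for the dictionary and doc2bow, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_id_map(texts):
    """Assign each distinct token the next free integer id, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Convert one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


texts = [["car", "engine", "car"], ["team", "game", "car"]]
token2id = build_id_map(texts)
bow_corpus = [to_bow(text, token2id) for text in texts]

print(token2id)    # {'car': 0, 'engine': 1, 'team': 2, 'game': 3}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Each pair reads as "word id, number of occurrences in this document", so (0, 2) in the first document means that "car" appears twice.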

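Build the topic model

We are now ready for the core step: topic modeling with LDA. We will build an LDA model with 20 topics, where each topic is a combination of keywords and each keyword carries a certain weight within the topic.

Some of the parameters are explained below:

num_topics: the number of topics, which must be defined in advance;

chunksize: the number of documents used in each training chunk;

alpha: a hyperparameter that affects the sparsity of the topics;

passes: the total number of training passes over the corpus.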

# build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

View the topics in the LDA model

We can inspect the keywords of each topic and the weight (importance) of each keyword.

# print the top keywords of each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
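Compute model perplexity and coherence score

Model perplexity measures how well a probability distribution or probabilistic model predicts a sample. Topic coherence scores a single topic by measuring the degree of semantic similarity between its high-scoring words.

In short, they provide a convenient way to judge how good a given topic model is.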

# compute perplexity
# log_perplexity returns the per-word likelihood bound; perplexity is 2 ** (-bound), and lower perplexity is better
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
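
The value returned by log_perplexity is usually a negative number, which can be confusing next to the "lower is better" rule for perplexity. As we understand gensim's behavior, the returned number is a per-word likelihood bound and the perplexity estimate is 2 raised to the negative of that bound. The sketch below shows the conversion; the helper name and the bound value of -8.0 are made up for illustration.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


example_bound = -8.0  # hypothetical value, not a result from the model above
print(bound_to_perplexity(example_bound))  # 256.0
```

A bound closer to zero therefore corresponds to a lower perplexity and a better fit.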