
Using gensim

程序员文章站 2022-06-04 09:24:37
一、Basic Concepts
gensim is a Python natural language processing library that can vectorize documents and build models on top of them (TF-IDF, LDA, LSI).
corpora is used to build the corpus, models to build processing models, and similarities to compare document similarity.
Pipeline order: corpora --> models --> similarities
A simple example:
Suppose there are two documents, each consisting of a single sentence:
D1: I am a student.
D2: I am a teacher, I love my students.
1. Preprocess the data
   1.1 Tokenize. English is usually split on whitespace (mind stemming and lemmatization); for Chinese, jieba is a common choice.
   1.2 Remove stop words (e.g. a, in, of, I, am) or remove specific words, as the use case requires.
   1.3 Other processing, such as dropping words that appear only once, depending on the use case.
2. Build a dictionary
The dictionary formed from D1 and D2 (after tokenizing and removing stop words; words that appear only once are kept here): student, teacher, love.
D1 is then represented as (1, 0, 0) and D2 as (1, 1, 1): only word frequency matters, not word order. student appears once in D1; student, teacher, and love each appear once in D2.
3. Build a model
4. Query document similarity
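Steps 1 and 2 above can be sketched in plain Python without gensim (the mapping of "students" back to "student" stands in for real lemmatization):

```python
# Pure-Python sketch of tokenizing, removing stop words, building a
# dictionary, and counting word frequencies for the D1/D2 example.
docs = {"D1": "I am a student.", "D2": "I am a teacher, I love my students."}
stoplist = {"i", "am", "a", "my"}

def tokenize(text):
    words = text.lower().replace(",", " ").replace(".", " ").split()
    # crude stand-in for lemmatization: map the plural back to "student"
    words = ["student" if w == "students" else w for w in words]
    return [w for w in words if w not in stoplist]

texts = {name: tokenize(t) for name, t in docs.items()}
vocab = sorted({w for ws in texts.values() for w in ws})
# bag-of-words vector per document: word counts in vocab order
bows = {name: [ws.count(w) for w in vocab] for name, ws in texts.items()}
print(vocab)        # ['love', 'student', 'teacher']
print(bows["D1"])   # [0, 1, 0]
print(bows["D2"])   # [1, 1, 1]
```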

二、corpora and Text Vectorization

from gensim import corpora, models, similarities
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Tokenize and filter stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)
#[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]
# Build the dictionary; the input must be a list of token lists
dictionary = corpora.Dictionary(texts)
# dictionary.dfs.items() yields (tokenid, document frequency); collect ids with frequency 1
ids = [tokenid for tokenid, freq in dictionary.dfs.items() if freq == 1]
# Remove those token ids, producing a new dictionary
dictionary.filter_tokens(ids)
# Reassign token ids to fill the gaps
dictionary.compactify()
print(dictionary.token2id)
#{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
# dictionary.doc2bow(text) converts a tokenized document into a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
#[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

三、models

# Compute TF-IDF weights from the bag-of-words corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
# LDA model: corpus is the TF-IDF corpus, id2word the dictionary, num_topics the number of topics
LDA_model = models.LdaModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=2)
print(LDA_model)
# Log the topics (also returns a list of (topic_id, topic_string) pairs)
LDA_model.print_topics(1)
# LSI model
lsi = models.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(2)

四、similarities

# Similarity interface: query with a new document
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())  # returns a bag of words: [(tokenid, count), ...]
print(vec_bow)
# Project the query into LSI space
vec_lsi = lsi[vec_bow]
print(vec_lsi)
# Build the similarity index: transform the corpus to LSI space and index it
index = similarities.MatrixSimilarity(lsi[corpus])
# Compute similarities between the query and every indexed document
sims = index[vec_lsi]
# Sort by descending similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
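What MatrixSimilarity computes is cosine similarity between the query vector and each indexed document. The core computation can be sketched in plain Python (the 2-D "LSI" vectors here are made-up illustrative values, not output of a real model):

```python
# Plain-Python sketch of the cosine similarity behind MatrixSimilarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 2-D LSI vectors for three documents and one query.
docs = [(0.9, 0.1), (0.7, 0.6), (0.1, 0.95)]
query = (0.85, 0.2)

# (doc index, similarity) pairs, most similar document first
sims = sorted(enumerate(cosine(query, d) for d in docs),
              key=lambda item: -item[1])
print(sims)
```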