Chinese word segmentation in Python with jieba

程序员文章站 2022-05-22 16:39:57


# -*- coding: utf-8 -*-
# spyder (python 3.7)

import pandas as pd
import jieba
import jieba.analyse as anls

if __name__ == '__main__':  
    data = pd.read_excel(r'空气指数评论.xlsx')
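    # 'content' is the column name in the Excel sheet
    opinion_content = data['content'].dropna().values
    # Join all comments into one comma-separated string
    all_word = ','.join(str(i) for i in opinion_content)
    all_word = all_word.strip()  # strip leading and trailing whitespace
    all_word_upper = all_word.upper()  # convert to upper case

    # Optionally load a user dictionary:
    # jieba.load_userdict(r"d:\python_workspace\aaaa.txt")

    # To keep certain words from being split (e.g. 王者荣耀, 和平精英),
    # pass them to jieba.suggest_freq with tune=True.
    # jieba.analyse extracts keywords using the TF-IDF algorithm.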
    segment = ['王者荣耀', '和平精英']
    for ii in segment:
        jieba.suggest_freq(ii, tune=True)

    # Load a stop-word file; download one or create your own
    anls.set_stop_words("111.txt")
    # topK=None returns all keywords; withWeight=True also returns each weight
    tags = anls.extract_tags(all_word_upper, topK=None, withWeight=True)
    for x, w in tags:
        print('%s %s' % (x, w))
        
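    # 'a+' appends, so delete the file from the previous run before re-running.
    # The output goes to the current directory; change the path as needed.
    with open('cut_words_content.txt', 'a+', encoding='utf-8') as f:
        for v, n in tags:
            # The weight n is a fraction; multiplying by 100000 gives an
            # integer (choose a different factor as needed)
            out_words = v + '\t' + str(int(n * 100000))
            f.write(out_words + '\n')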
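
The script's comments note that jieba.analyse extracts keywords with TF-IDF. The idea can be sketched in plain Python as below. This is a simplified illustration, not jieba's actual code: the function name `tfidf_scores`, the `default_idf` parameter, the token list, and the IDF values are all made up for the example.

```python
from collections import Counter


def tfidf_scores(tokens, idf, default_idf):
    """Score each distinct token as (count / total tokens) * IDF."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest weight first, as (word, weight) pairs
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, hand-written tokens standing in for segmenter output
tokens = ["空气", "空气", "质量", "很好"]
# Hypothetical IDF values; rarer words get a larger IDF
idf = {"空气": 2.0, "质量": 6.0}

for word, weight in tfidf_scores(tokens, idf, default_idf=5.0):
    print(word, weight)
```

In this toy input, 空气 appears most often but has the lowest assumed IDF, so it ends up ranked below the two rarer words. That is why a TF-IDF ranking can differ from a plain frequency count.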

Addendum: another way to segment text with jieba:

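sentence_seged = [seg for seg in jieba.cut(all_word) if len(seg) >= char_len]
# all_word is the full string to segment. This approach does not use weights;
# it is plain segmentation.
# The result is a list of tokens.
# Each kept token has at least char_len characters (define char_len beforehand).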
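
Because this approach returns a plain token list without weights, one possible follow-up is to count how often each token occurs. The sketch below uses only the standard library. The token list is a hand-written stand-in for what `jieba.cut(all_word)` might yield, and `char_len = 2` is an assumed value, not one taken from the original.

```python
from collections import Counter

# Hand-written stand-in for the tokens jieba.cut(all_word) would yield
tokens = ["空气", "质量", "的", "空气", "很", "好", "质量", "空气"]
char_len = 2  # assumed minimum token length

# Same filter as above: keep tokens with at least char_len characters
sentence_seged = [seg for seg in tokens if len(seg) >= char_len]
word_counts = Counter(sentence_seged)

# Print tokens from most to least frequent, tab-separated
for word, count in word_counts.most_common():
    print(f"{word}\t{count}")
```

Filtering by length drops single-character tokens such as 的, which are usually function words and rarely useful as keywords.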

Reference: jieba Chinese word segmentation: https://github.com/fxsjy/jieba

## Discussion is welcome
