Building a "Love Words" Word Cloud with Python
The result
Pull the most frequent words out of your chat history with the person you like, and turn them into a "love words" word cloud.
1. Extracting the chat history
If you switch phones often, it's a good idea to keep your chat records backed up on your computer (SVIP users with paid cloud sync can ignore this). Once all chat records are synced to the desktop QQ client:
In the Message Manager, locate the contact and export the chat history:
Save it in .txt text format.
2. Installing the required packages
The libraries used are numpy, pandas, matplotlib, wordcloud, jieba, and scipy; install them with pip or conda, e.g. `pip install numpy pandas matplotlib wordcloud jieba scipy`.
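If you want to confirm everything installed cleanly before starting, a minimal import check works (nothing here is specific to this project):
# Verify that the packages used in this article import, and print versions.
import numpy, pandas, matplotlib, wordcloud, jieba, scipy
for m in (numpy, pandas, matplotlib, wordcloud, jieba, scipy):
    print(m.__name__, getattr(m, '__version__', 'unknown'))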
3. Converting txt to csv
First, the exported txt file has the following format:
Some lines are blank, and some hold metadata such as dates and IDs; only the remaining lines are actual messages. We extract those and write them to a .csv file.
import pandas as pd

def loadtxt(file):
    dataset_m = []
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            temp1 = line.strip('\n')
            # Skip blank lines and the date/id header lines, which
            # begin with the year (adjust '201' if your logs span
            # other years).
            if temp1 == '' or temp1.startswith('201'):
                continue
            dataset_m.append(temp1)
    return dataset_m

def write_csv(datalist):
    # One message per row; to_csv also writes an index column and a header.
    f = pd.DataFrame(data=datalist)
    f.to_csv('./record.csv', encoding='utf-8')
    return f

file_txt = 'record.txt'
file = loadtxt(file_txt)
write_csv(file)
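To make the filtering concrete, here is a hypothetical record.txt fragment (the exact header format varies by QQ version; what matters is that header lines start with the year and blank lines separate entries):
2018-05-20 13:14:00 某人(10001)
晚安,好梦

2018-05-21 08:00:12 我
早安呀
loadtxt('record.txt') returns ['晚安,好梦', '早安呀'], and write_csv then stores them in record.csv, one message per row.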
4. Segmenting the text and building the word cloud from word frequencies
import warnings
import jieba
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
warnings.filterwarnings('ignore')

# Read the data file (header=0 skips the header row that to_csv wrote).
df = pd.read_csv('record.csv', encoding='utf-8', header=0, names=['id', 'content'])
df = df['content']
df = df.dropna()
content = df.values.tolist()

# Cut every message into words with jieba, keeping only tokens
# longer than one character.
segment = []
for line in content:
    try:
        segs = jieba.lcut(line)
        for seg in segs:
            if len(seg) > 1 and seg != '\r\n':
                segment.append(seg)
    except Exception:
        print(line)
        continue

# Remove stopwords.
words_df = pd.DataFrame({'segment': segment})
stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep='\t',
                        names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# print(words_df.head())

# Count each word's occurrences and sort descending by frequency
# (groupby().size() replaces the dict form of .agg(), which newer
# pandas no longer accepts).
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)
# print(words_stat.head())

# Visualization. scipy.misc.imread was removed in SciPy 1.2;
# matplotlib's imread loads the mask image just as well.
bimg = plt.imread('love.jpeg')
wordcloud = WordCloud(font_path='simhei.ttf', mask=bimg, background_color='white',
                      max_font_size=120, width=500, height=300)
word_frequence = {x[0]: x[1] for x in words_stat.head(500).values}
wordcloud = wordcloud.fit_words(word_frequence)
wordcloud.to_file('Love.png')
plt.imshow(wordcloud)
plt.show()
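One prerequisite worth calling out: the script expects a stopwords.txt in the working directory with one stopword per line. Any of the publicly available Chinese stopword lists will do; the first few lines of such a file typically look like:
的
了
是
在
我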
The jieba and wordcloud APIs used here are fairly simple, and detailed parameter references are easy to find online. All I do is take the 500 most frequent words and feed them to the word cloud.
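As an aside, the dict comprehension over .values can be written more directly with pandas itself; an equivalent one-liner using the segment/count columns produced by the script above:
# Equivalent to the comprehension above: the 500 most frequent
# words as a {word: count} dict.
word_frequence = words_stat.head(500).set_index('segment')['count'].to_dict()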
GitHub repo: