25 Python Text Processing Examples Worth Bookmarking

程序员文章站 2023-01-05 10:13:22

1. Extract PDF content

2. Extract Word content
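
One way to do this with only the standard library, offered as a sketch rather than the article's own code: a .docx file is a ZIP archive whose main text lives in `word/document.xml`, so its paragraphs can be read with `zipfile` and `xml.etree.ElementTree`. The function name `docx_paragraphs` is hypothetical, and the sketch ignores tables, headers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each non-empty paragraph in a .docx file (hypothetical helper)."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back together.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs("report.docx")` (a hypothetical file name) would return a list of paragraph strings.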

3. Extract web page content
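
A minimal sketch using the standard library's `html.parser`, assuming the HTML has already been fetched (for example with `urllib.request.urlopen`). The class and function names are hypothetical, and the inline `sample` string stands in for a real page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # Title Hello world
```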

4. Read JSON data
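
A short sketch with the standard `json` module. The sample string is made up for illustration. For a file on disk, `json.load` takes an open file object in place of the string.

```python
import json

# Made-up sample data for illustration.
raw = '{"name": "example", "tags": ["nlp", "text"], "count": 3}'

# json.loads parses a string; json.load does the same for an open file object.
data = json.loads(raw)
print(data["name"])
print(data["tags"][0])

# Serialise back to text, keeping non-ASCII characters readable.
print(json.dumps(data, ensure_ascii=False, indent=2))
```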

5. Read CSV data
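
A short sketch with the standard `csv` module. The in-memory `io.StringIO` sample is an assumption that keeps the example self-contained. With a real file, pass the object returned by `open(path, newline="", encoding="utf-8")` instead.

```python
import csv
import io

# io.StringIO stands in for an opened CSV file; the rows are made up.
sample = io.StringIO("name,language\nAda,Python\nLinus,C\n")

# DictReader uses the first row as column names and yields one mapping per row.
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["name"], row["language"])
```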

6. Remove punctuation from a string
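
One common approach, sketched here with `str.translate` and `string.punctuation`. It removes ASCII punctuation only, so full-width or other Unicode punctuation would need extra handling. The helper name `strip_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("Hello, world! It's 10:30 (approx.)"))
# Hello world Its 1030 approx
```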

7. Remove stop words with nltk
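
The filtering step itself can be sketched without nltk. The `STOP_WORDS` set below is a small hand-written subset chosen for illustration, not nltk's list. After downloading the corpus, nltk's `stopwords.words("english")` could be used in its place.

```python
import re

# Illustrative subset only; NOT nltk's English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "on"}


def remove_stop_words(text):
    # Lowercase, split into alphabetic tokens, then drop the stop words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(remove_stop_words("The cat is on the roof of an old house"))
# ['cat', 'roof', 'old', 'house']
```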

8. Correct spelling with textblob

9. Word tokenization with nltk and textblob

output:

['natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
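
The two lists above differ only in whether punctuation tokens are kept. A regex-based stand-in (not nltk or textblob) produces both for this sentence. The input sentence is reconstructed from the output and is therefore an assumption. The pattern is a simplification that would mis-handle quoted words such as 'hello'.

```python
import re

# Input sentence reconstructed from the token lists above (an assumption).
text = (
    "natural language is a central part of our day to day life, "
    "and it's so interesting to work on any problem related to languages."
)

# Clitics such as 's first, then runs of word characters, then single punctuation marks.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")

tokens = TOKEN_RE.findall(text)
# Keep only tokens that contain at least one letter or digit.
words_only = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(tokens)
print(words_only)
```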

10. Extract a list of stems for the words in a sentence or phrase with nltk

output:

where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump

11. Lemmatize a sentence or phrase with nltk

output:

she gripped the armrest a he passed two car at a time.
her car wa in full view.
a number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy

12. Find the frequency of each word in a text file with nltk

output:

[nltk_data] downloading package webtext to
[nltk_data]     c:\users\amit\appdata\roaming\nltk_data...
[nltk_data]   unzipping corpora\webtext.zip.
1989: 1
accessing: 1
analysis: 1
anyone: 1
chapter: 1
coding: 1
data: 1
...
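
The log above suggests the article read its text from nltk's `webtext` corpus. The counting step can be sketched with `collections.Counter` on an inline sample string, which is made up here and does not reproduce the listing.

```python
import re
from collections import Counter

# Made-up sample text; replace with the contents of a file read via open(...).read().
text = "Data analysis with Python: data cleaning, data coding, and analysis."

# Lowercase, keep alphanumeric runs, and count each distinct word.
freq = Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Print alphabetically, in the same "word: count" shape as the listing above.
for word in sorted(freq):
    print(f"{word}: {freq[word]}")
```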

13. Create a word cloud from a corpus

14. nltk lexical dispersion plot

15. Convert text to numbers with CountVectorizer

output:

             go  java  python
and           2     2       2
application   0     1       0
are           1     0       1
bytecode      0     1       0
can           0     1       0
code          0     1       0
comes         1     0       1
compiled      0     1       0
derived       0     1       0
develops      0     1       0
for           0     2       0
from          0     1       0
functional    1     0       1
imperative    1     0       1
...
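
A term-count matrix of the same shape (terms as rows, documents as columns) can be sketched in plain Python. This is a stand-in for `CountVectorizer`, not a call to it. The two short documents are hypothetical because the article's go/java/python texts are not shown.

```python
import re
from collections import Counter

# Hypothetical mini-corpus for illustration.
docs = {
    "go": "go is compiled and go is fast",
    "python": "python is interpreted and python is readable",
}

# Tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

counts = {name: Counter(TOKEN_RE.findall(text.lower())) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# One row per term, one column per document.
print(f"{'':<12}" + "".join(f"{name:>8}" for name in docs))
for term in vocabulary:
    print(f"{term:<12}" + "".join(f"{counts[name][term]:>8}" for name in docs))
```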

16. Create a document-term matrix with TF-IDF

output:

                   go      java    python
and          0.323751  0.137553  0.323751
application  0.000000  0.116449  0.000000
are          0.208444  0.000000  0.208444
bytecode     0.000000  0.116449  0.000000
can          0.000000  0.116449  0.000000
code         0.000000  0.116449  0.000000
comes        0.208444  0.000000  0.208444
compiled     0.000000  0.116449  0.000000
derived      0.000000  0.116449  0.000000
develops     0.000000  0.116449  0.000000
for          0.000000  0.232898  0.000000
...
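
To show where such weights come from, here is a hand-rolled sketch using a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. This is intended to mirror scikit-learn's documented defaults, but treat the match as an assumption rather than a guarantee. The documents are hypothetical and the numbers will not equal the listing above.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus for illustration.
docs = {
    "go": "go is compiled and fast",
    "python": "python is interpreted and readable",
}

tokens = {name: re.findall(r"\b\w\w+\b", text.lower()) for name, text in docs.items()}
n_docs = len(tokens)

# Document frequency: the number of documents each term appears in.
df = Counter(term for toks in tokens.values() for term in set(toks))


def tfidf_vector(toks):
    tf = Counter(toks)
    # Raw term count times smoothed idf, then scale the vector to unit L2 length.
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


matrix = {name: tfidf_vector(toks) for name, toks in tokens.items()}
for term in sorted(df):
    print(f"{term:<12}" + "".join(f"{matrix[name].get(term, 0.0):>10.6f}" for name in docs))
```

Terms shared by both documents ("and", "is") receive a lower weight than terms unique to one document, which is the pattern visible in the listing.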

17. Generate n-grams for a given sentence

Natural language toolkit: nltk

Text processing tool: textblob

output:

1-gram:  ['a', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:  ['a class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:  ['a class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:  ['a class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
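
The same lists can be produced in plain Python with a sliding window over the words. This is a sketch that does not use nltk or textblob. The input sentence is inferred from the output above, and the helper name `ngrams` is hypothetical.

```python
def ngrams(sentence, n):
    """Return the n-grams of a whitespace-tokenised sentence as joined strings."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Sentence inferred from the listing above (an assumption).
sentence = "a class is a blueprint for the object"
for n in range(1, 5):
    print(f"{n}-gram: ", ngrams(sentence, n))
```

When `n` exceeds the number of words, the range is empty and the function returns an empty list.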

18. Vocabulary specification with bigrams using sklearn CountVectorizer

output:

                        assembly  machine
also either                    0        1
and or                         0        1
are also                       0        1
are readable                   1        0
are still                      1        0
assembly language              5        0
because each                   1        0
but difficult                  0        1
by computers                   0        1
by people                      0        1
can execute                    0        1
...

19. Extract noun phrases with textblob

output:

canada
northern part
america

20. How to compute a word-word co-occurrence matrix

output:

            best  use  what  where  ...    in   is  python  used
best         0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
use          0.0  0.0   0.0    0.0  ...   0.0  1.0     0.0   0.0
what         1.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
where        0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
pythonused   0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
why          0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   1.0
companies    0.0  1.0   0.0    1.0  ...   1.0  0.0     0.0   0.0
in           0.0  0.0   0.0    0.0  ...   0.0  0.0     1.0   0.0
is           0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0
python       0.0  0.0   0.0    0.0  ...   0.0  0.0     0.0   0.0
used         0.0  0.0   1.0    0.0  ...   0.0  0.0     0.0   0.0
 
[11 rows x 11 columns]
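
A minimal sketch of one way to build such a matrix: count, for each word, which words follow it within a fixed window. The counts are directional (row word first, column word after it), which is an assumption about the layout above. The sentences are hypothetical, since the article's input is not shown.

```python
from collections import defaultdict


def cooccurrence(sentences, window=1):
    """Count how often word b appears within `window` tokens after word a."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for other in words[i + 1:i + 1 + window]:
                counts[word][other] += 1
    return counts


# Hypothetical sentences for illustration.
sentences = ["where is python used", "what is python used in"]
counts = cooccurrence(sentences)

vocab = sorted({w for s in sentences for w in s.lower().split()})
print(" " * 8 + "".join(f"{w:>8}" for w in vocab))
for row in vocab:
    print(f"{row:<8}" + "".join(f"{counts[row][col]:>8}" for col in vocab))
```

Raising `window` counts pairs that are further apart. Adding `counts[other][word] += 1` in the inner loop would make the matrix symmetric.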

21. Sentiment analysis with textblob

output:

sentiment(polarity=1.0, subjectivity=1.0)
positive
sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
positive
sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
negative

22. Language translation with goslate

23. Language detection and translation with textblob

output:

fr
¿como estas tu?
how are you?
你好吗?

24. Get definitions and synonyms with textblob

output:

['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}

25. Get a list of antonyms with textblob

output:

{'dangerous', 'out'}
