[Web Page Main-Text Recognition and Extraction Algorithms] Extracting Web Page Body Text in Practice

程序员文章站 2022-05-08 16:44:05


Installing Goose

pip install goose-extractor

or, for Python 3 (the examples below import from goose3, so this is the package they need):

pip3 install goose3

GitHub: https://github.com/grangier/python-goose

A simple example

:python3
Python 3.7.6 (default, Feb 16 2020, 17:48:02) 
[Clang 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> url = 'https://blog.csdn.net/LU_ZHAO/article/details/104935957'
>>> article = g.extract(url=url)
>>> print(article.title)
The Serenity Prayer_LU_ZHAO的博客-CSDN博客
>>> print(article.cleaned_text)
上帝，请赐予我宁静，去接受我所不能改变的；

请赐予我勇气，去改变我所能改变的；

并请赐予我智慧，去辨别什么可以改变，什么不能。

用心生活每一天；用灵魂享受每个时刻；承受磨难，因为它是通向安宁的必经之路。

接受它原本的样子，而不是我所期盼的样子；

这样，这一生我就有理由得到快乐，并在天堂与您一起得到极乐。
>>>

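Only the Chinese text came back; the English is missing. Is that because selecting the Chinese stopword class means only Chinese gets extracted?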

Trying again with the default (English) configuration:

:python3
Python 3.7.6 (default, Feb 16 2020, 17:48:02) 
[Clang 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> g=Goose()
>>> url = 'https://blog.csdn.net/LU_ZHAO/article/details/104935957'
>>> article = g.extract(url=url)
>>> print(article.title)
The Serenity Prayer_LU_ZHAO的博客-CSDN博客
>>> print(article.cleaned_text)

>>>

This time cleaned_text is empty, even though the original page does contain English text.
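
One possible reading of the first session, offered as an assumption rather than a confirmed diagnosis: Goose-style extractors rank candidate text blocks by how many stopwords of the configured language they contain, so a block written in another language scores close to zero and gets dropped. The sketch below is a deliberately simplified illustration of that idea, not goose3's actual code. The stopword sets, the `score_block` and `keep_blocks` helpers, the threshold, and the sample English line are all hypothetical.

```python
# Simplified illustration of stopword-based block scoring.
# NOT goose3's implementation: the stopword sets, helper names and
# threshold below are invented for this example.

EN_STOPWORDS = {"the", "to", "and", "i", "me", "that", "be", "a", "of"}
ZH_STOPWORDS = {"的", "我", "去", "请", "所", "并", "不"}


def score_block(text, stopwords, by_char=False):
    """Count stopword occurrences in one block of text.

    by_char=True treats every character as a token, a crude stand-in
    for real Chinese word segmentation.
    """
    tokens = list(text) if by_char else text.lower().split()
    return sum(1 for tok in tokens if tok in stopwords)


def keep_blocks(blocks, stopwords, by_char=False, threshold=2):
    """Keep only the blocks whose stopword count exceeds the threshold."""
    return [b for b in blocks if score_block(b, stopwords, by_char) > threshold]


# Sample blocks: the English line is a stand-in, not copied from the test page.
blocks = [
    "God, grant me the serenity to accept the things I cannot change,",
    "上帝，请赐予我宁静，去接受我所不能改变的；",
]

# With Chinese stopwords only the Chinese block survives, and vice versa.
print(keep_blocks(blocks, ZH_STOPWORDS, by_char=True))
print(keep_blocks(blocks, EN_STOPWORDS))
```

Under this toy model the English block scores zero against the Chinese stopword set, which matches what the first session showed. It does not by itself explain why the default configuration returned nothing at all; that would need further debugging against the real page.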

Test page: https://blog.csdn.net/LU_ZHAO/article/details/104935957
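
Since the two sessions above gave such different results for the same page, one pragmatic workaround is to try several configurations and keep whichever returns the most text. The following is a minimal sketch under that assumption. `extract_with_fallback` and `goose_extractors` are hypothetical helper names, not part of goose3; the only goose3 calls used are the ones already shown in the sessions (`Goose(...)`, `extract(url=...)`, `cleaned_text`). Note that this only picks the better configuration; it does not merge the Chinese and English parts.

```python
# Sketch: try several extractor configurations and keep the longest result.
# extract_with_fallback and goose_extractors are hypothetical helpers,
# not part of goose3.


def extract_with_fallback(url, extractors):
    """Run each (name, extract_fn) pair; return (name, text) of the longest text.

    extract_fn takes a URL and returns the cleaned text (possibly empty or None).
    Returns (None, "") when every extractor yields nothing.
    """
    best_name, best_text = None, ""
    for name, extract_fn in extractors:
        text = (extract_fn(url) or "").strip()
        if len(text) > len(best_text):
            best_name, best_text = name, text
    return best_name, best_text


def goose_extractors():
    """Build extractors for the two configurations used in the sessions above."""
    # Imported lazily so extract_with_fallback works without goose3 installed.
    from goose3 import Goose
    from goose3.text import StopWordsChinese

    default = Goose()
    chinese = Goose({'stopwords_class': StopWordsChinese})
    return [
        ("default", lambda u: default.extract(url=u).cleaned_text),
        ("chinese", lambda u: chinese.extract(url=u).cleaned_text),
    ]


# Usage (needs goose3 installed and network access):
#   url = 'https://blog.csdn.net/LU_ZHAO/article/details/104935957'
#   name, text = extract_with_fallback(url, goose_extractors())
#   print(name)
#   print(text)
```

Passing the extractors in as plain callables keeps the selection logic independent of goose3, so other extraction libraries could be added to the list later.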
