[Web Page Main-Content Detection and Extraction Algorithms] Extracting Web Page Body Text in Practice
2022-05-08 16:44:05
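Installing Goose
For Python 2 (the original python-goose):
pip install goose-extractor
or, for Python 3 (the goose3 package that the examples below import):
pip3 install goose3
GitHub: https://github.com/grangier/python-goose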
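Because the two packages install under different module names, it can help to check which one is actually importable before running the examples. The following is a minimal sketch using only the standard library. The helper name `available_goose_modules` is made up for illustration, and it assumes the Python 2 package imports as `goose` and the Python 3 package as `goose3`.

```python
import importlib.util


def available_goose_modules(candidates=("goose3", "goose")):
    """Return the candidate module names that can be imported in this interpreter."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    # An empty list means neither package is installed for this interpreter.
    print(available_goose_modules())
```

If the list does not contain `goose3`, the `from goose3 import Goose` lines in the sessions below will fail with an ImportError.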
A simple example
:python3
Python 3.7.6 (default, Feb 16 2020, 17:48:02)
[Clang 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> url = 'https://blog.csdn.net/LU_ZHAO/article/details/104935957'
>>> article = g.extract(url=url)
>>> print(article.title)
The Serenity Prayer_LU_ZHAO的博客-CSDN博客
>>> print(article.cleaned_text)
上帝,请赐予我宁静,去接受我所不能改变的;
请赐予我勇气,去改变我所能改变的;
并请赐予我智慧,去辨别什么可以改变,什么不能。
用心生活每一天;用灵魂享受每个时刻;承受磨难,因为它是通向安宁的必经之路。
接受它原本的样子,而不是我所期盼的样子;
这样,这一生我就有理由得到快乐,并在天堂与您一起得到极乐。
>>>
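Only the Chinese text came back; the English is missing. Is that because choosing the Chinese stopword class restricts the output to Chinese?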
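One plausible explanation, offered as an assumption rather than a confirmed reading of the goose3 source: Goose-style extractors score each paragraph by how many stopwords of the configured language it contains, and drop paragraphs that score too low. Under that assumption, an English paragraph contains no Chinese stopwords and is filtered out. The toy sketch below illustrates the idea only. The stopword sets, the threshold `min_hits`, the sample sentences, and the function names are all made up for illustration and are not Goose's real data or API.

```python
# Tiny illustrative stopword sets (NOT the lists shipped with Goose).
EN_STOPWORDS = {"the", "to", "and", "i", "me", "a", "of"}
ZH_STOPWORDS = {"我", "的", "去", "请", "并"}

# Hypothetical sample paragraphs for a bilingual page.
EN_SAMPLE = "God, grant me the serenity to accept the things I cannot change"
ZH_SAMPLE = "请赐予我勇气,去改变我所能改变的;"


def stopword_count(paragraph, stopwords, split_chars=False):
    """Count stopword hits; split by character for Chinese, by whitespace otherwise."""
    tokens = list(paragraph) if split_chars else paragraph.lower().split()
    return sum(1 for token in tokens if token in stopwords)


def keep_paragraphs(paragraphs, stopwords, split_chars=False, min_hits=2):
    """Keep only paragraphs with at least `min_hits` stopwords of the chosen language."""
    return [p for p in paragraphs
            if stopword_count(p, stopwords, split_chars) >= min_hits]


if __name__ == "__main__":
    page = [EN_SAMPLE, ZH_SAMPLE]
    # Chinese stopwords: the English paragraph scores zero and is dropped.
    print(keep_paragraphs(page, ZH_STOPWORDS, split_chars=True))
    # English stopwords: the Chinese paragraph scores zero and is dropped.
    print(keep_paragraphs(page, EN_STOPWORDS))
```

In this toy model each configuration keeps only the paragraphs of its own language, which matches the Chinese-only result seen above. It does not explain every case; a real extractor also weighs DOM structure, link density, and other signals.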
An attempt with the default (English) configuration:
:python3
Python 3.7.6 (default, Feb 16 2020, 17:48:02)
[Clang 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> g=Goose()
>>> url = 'https://blog.csdn.net/LU_ZHAO/article/details/104935957'
>>> article = g.extract(url=url)
>>> print(article.title)
The Serenity Prayer_LU_ZHAO的博客-CSDN博客
>>> print(article.cleaned_text)
>>>
Nothing is printed at all this time, even though the original page does contain English text.
Test page: https://blog.csdn.net/LU_ZHAO/article/details/104935957
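To compare the two configurations without retyping the session, the two runs can be wrapped in a small script. This is a sketch under stated assumptions: it uses only the goose3 calls already shown above (`Goose(config)`, `extract(url=...)`, `cleaned_text`) plus `Goose.close()`, while the helper names `pick_longest` and `extract_all` are made up for illustration. The goose3 import is deferred so that the selection logic can be exercised without the package or network access.

```python
def pick_longest(results):
    """Return the (label, text) pair whose stripped text is longest, or (None, '')."""
    best = (None, "")
    for label, text in results:
        text = (text or "").strip()
        if len(text) > len(best[1]):
            best = (label, text)
    return best


def extract_all(url):
    """Run Goose once per configuration; requires goose3 and network access."""
    from goose3 import Goose
    from goose3.text import StopWordsChinese

    configs = {
        "default": {},                                   # English stopwords
        "chinese": {"stopwords_class": StopWordsChinese},
    }
    results = []
    for label, config in configs.items():
        g = Goose(config)
        try:
            results.append((label, g.extract(url=url).cleaned_text))
        finally:
            g.close()  # release the underlying network session
    return results


if __name__ == "__main__":
    url = "https://blog.csdn.net/LU_ZHAO/article/details/104935957"
    label, text = pick_longest(extract_all(url))
    print(label)
    print(text)
```

Picking the longest result is only a rough heuristic. For a bilingual page like this one, it returns whichever single-language extraction is larger and does not merge the two.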