Converting HTML to Plain Text in Python

程序员文章站 2023-11-23 17:39:46

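This article shows, by example, how to convert HTML to plain text in Python.

A project of mine needed HTML converted to plain text today. A quick search turned up a wide range of ways to do it in Python.

Here are the two I tried myself, written down for anyone who needs them later: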

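Method 1:

1. Install nltk, which is available from PyPI.

(Note: it depends on the following packages: numpy, pyyaml. The session below is Python 2 with NLTK 2.x; NLTK 3 no longer implements nltk.clean_html and recommends BeautifulSoup's get_text() instead.)

2. Test code: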

>>> import nltk
>>> aa = r'''
<html>
            <body>
                <b>project:</b> dehtml<br>
                <b>description</b>:<br>
                this small script is intended to allow conversion from html markup to
                plain text.
            </body>
        </html>
        '''
>>> aa
'\n<html>\n            <body>\n                <b>project:</b> dehtml<br>\n                <b>description</b>:<br>\n                this small script is intended to allow conversion from html markup to\n                plain text.\n            </body>\n        </html>\n        '
>>> print nltk.clean_html(aa)
project: dehtml
     description :
    this small script is intended to allow conversion from html markup to
    plain text.
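
For readers who do not have NLTK at hand, the same kind of tag stripping can be roughly approximated with the standard library. The sketch below is not NLTK's implementation; it is a regex-based approximation for Python 3, and the function name strip_tags_regex is made up for illustration. It assumes the input is reasonably well-formed, since regular expressions cannot parse arbitrary HTML reliably.

```python
import re
from html import unescape


def strip_tags_regex(html):
    """Rough regex-based HTML-to-text conversion (illustrative sketch only)."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", "", html)
    # Drop HTML comments before tags, since comments may contain '>'.
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Replace every remaining tag with a space.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Decode entities such as &amp; and &nbsp;.
    text = unescape(text)
    # Collapse all runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<b>project:</b> dehtml<br>\n<b>description</b>:<br>\nplain text."
    print(strip_tags_regex(sample))
```

Unlike the nltk output above, this sketch puts everything on one line, because it collapses newlines along with other whitespace.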

Method 2:

If nltk feels too heavyweight for such a small job, you can write the conversion yourself. The code is as follows:

# Python 3; in Python 2 the import was "from HTMLParser import HTMLParser"
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc


class _DeHTMLParser(HTMLParser):
    """Collect the text content of an HTML document and drop the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # Skip whitespace-only chunks and collapse runs of whitespace.
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # <p> starts a new paragraph, <br> starts a new line.
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # A self-closing <br/> is treated as a paragraph break.
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    """Return text without HTML markup; on failure, log and return the input."""
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>project:</b> dehtml<br>
                <b>description</b>:<br>
                this small script is intended to allow conversion from html markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

Output:

>>> ================================ restart ================================
>>>
project: dehtml
description :
this small script is intended to allow conversion from html markup to plain text.
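
One thing the handler-based approach makes easy is ignoring content that should never appear in the text, such as inline JavaScript or CSS. The following is a minimal sketch of that idea for Python 3, not part of the original script: the class name _SkippingTextParser and the function html_to_text are hypothetical, and the choice to skip only script and style elements is an assumption.

```python
import re
from html.parser import HTMLParser


class _SkippingTextParser(HTMLParser):
    """Collect text like the parser above, but ignore <script> and <style>."""

    _SKIPPED = {"script", "style"}  # assumption: only these two are skipped

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._chunks.append("\n\n")
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # text inside <script>/<style> is dropped
        text = data.strip()
        if text:
            self._chunks.append(re.sub(r"\s+", " ", text) + " ")

    def text(self):
        return "".join(self._chunks).strip()


def html_to_text(html):
    parser = _SkippingTextParser()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<p>Hello<br>world</p><script>var x = 1;</script>done"))
```

The depth counter is used instead of a boolean so that an unexpected extra closing tag cannot push the state negative.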

I hope this article is helpful for your Python programming.
