A Plain-Text Parser in Python

程序员文章站 2022-04-14 15:36:53

I. Project Overview

This tutorial walks through a small Python program that parses a plain-text file and generates an HTML page from it.

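II. Technologies Used

Python: an object-oriented, interpreted programming language that can be used for web development, graphics processing, text processing, numerical computing, and more.

HTML: HyperText Markup Language, the markup language used to build web pages.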

III. Project Screenshots

Plain-text file:

welcome to hello world

The HTML page generated from it is shown below:

[Screenshot: image.png]

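IV. Walkthrough

1. Text block generator

First we need a text block generator that splits the plain text into separate blocks, so that each block can then be parsed in turn. The code for util.py is as follows: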

#!/usr/bin/python
# encoding: utf-8

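def lines(file):
    """
    Generator: yield every line of the file, then one extra blank line
    so that the final block is always terminated.
    """
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    """
    Generator: yield each run of consecutive non-blank lines as one text block.
    """
    block = []
    for line in lines(file):
        if line.strip():
            # Non-blank line: it belongs to the current block.
            block.append(line)
        elif block:
            # Blank line after some content: the block is complete.
            yield ''.join(block).strip()
            block = []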
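
To see what the generator produces, here is a minimal sketch that feeds it an in-memory file. The functions mirror util.py above (written for Python 3); the sample text is made up for illustration and is not from the original project.

```python
import io

def lines(file):
    """Yield every line of the file, then one extra blank line."""
    yield from file
    yield '\n'

def blocks(file):
    """Yield each run of consecutive non-blank lines as one stripped string."""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Hypothetical sample input; note that it does not end with a blank line.
sample = io.StringIO("Welcome\n\nfirst line\nsecond line\n\n\n- an item")
result = list(blocks(sample))
print(result)
# ['Welcome', 'first line\nsecond line', '- an item']
```

The last block is still emitted even though the input has no trailing blank line, because lines() appends one. Consecutive blank lines do not produce empty blocks.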

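2. Handlers

The block generator gives us the text blocks one at a time. Next we need a handler that wraps each kind of block in the corresponding HTML markup. The code for handlers.py is as follows: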

#!/usr/bin/python
# encoding: utf-8

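class handler:
    """
    Handler base class
    """
    def callback(self, prefix, name, *args):
        # Look up a method such as start_paragraph or sub_url by name;
        # names without a matching method are silently ignored.
        method = getattr(self, prefix + name, None)
        if callable(method): return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        # Return a function that can be passed to re.sub as the replacement.
        def substitution(match):
            result = self.callback('sub_', name, match)
            # No sub_ method for this name: leave the matched text unchanged.
            if result is None: result = match.group(0)
            return result
        return substitution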

class htmlrenderer(handler):
    """
    HTML handler: wraps text blocks in the corresponding HTML markup
    """
    def start_document(self):
        print('<html><head><title>shiyanlou</title></head><body>')

    def end_document(self):
        print('</body></html>')

    def start_paragraph(self):
        print('<p style="color: #444;">')

    def end_paragraph(self):
        print('</p>')

    def start_heading(self):
        print('<h2 style="color: #68be5d;">')

    def end_heading(self):
        print('</h2>')

    def start_list(self):
        print('<ul style="color: #363736;">')

    def end_list(self):
        print('</ul>')

    def start_listitem(self):
        print('<li>')

    def end_listitem(self):
        print('</li>')

    def start_title(self):
        print('<h1 style="color: #1abc9c;">')

    def end_title(self):
        print('</h1>')

    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

    def sub_url(self, match):
        return '<a target="_blank" style="text-decoration: none;color: #bc1a4b;" href="%s">%s</a>' % (match.group(1), match.group(1))

    def sub_mail(self, match):
        return '<a style="text-decoration: none;color: #bc1a4b;" href="mailto:%s">%s</a>' % (match.group(1), match.group(1))

    def feed(self, data):
        # Output the text of the block itself.
        print(data)
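
The name-based dispatch in callback and sub can be tried in isolation. The sketch below is a minimal Python 3 stand-in: Handler repeats the logic of the base class above, while DemoRenderer and the sample sentence are hypothetical and exist only for this demonstration.

```python
import re

class Handler:
    """Dispatches to methods looked up by name, ignoring unknown names."""
    def callback(self, prefix, name, *args):
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def sub(self, name):
        # Build a replacement function usable as the second argument of re.sub.
        def substitution(match):
            result = self.callback('sub_', name, match)
            return match.group(0) if result is None else result
        return substitution

class DemoRenderer(Handler):  # hypothetical subclass, not part of the project
    def sub_emphasis(self, match):
        return '<em>%s</em>' % match.group(1)

renderer = DemoRenderer()
pattern = r'\*(.+?)\*'
emphasized = re.sub(pattern, renderer.sub('emphasis'), 'this is *very* easy')
untouched = re.sub(pattern, renderer.sub('missing'), 'this is *very* easy')
ignored = renderer.callback('start_', 'table')

print(emphasized)  # this is <em>very</em> easy
print(untouched)   # this is *very* easy
print(ignored)     # None
```

Because unknown names fall through quietly, a renderer only has to define the start_, end_, and sub_ methods it cares about.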

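3. Rules

With the handler and the block generator in place, we need a set of rules that decide which markup the handler should apply to each block. The code for rules.py is as follows: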

#!/usr/bin/python
# encoding: utf-8

class rule:
    """
    Rule base class
    """
    def action(self, block, handler):
        """
        Add the markup
        """
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        # True tells the parser that this block is fully handled.
        return True

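class headingrule(rule):
    """
    Heading rule (rendered as <h2>)
    """
    type = 'heading'
    def condition(self, block):
        """
        Check whether the block matches the rule: a single line of at most
        70 characters that does not end with a colon
        """
        return '\n' not in block and len(block) <= 70 and block[-1] != ':'

class titlerule(headingrule):
    """
    Title rule (rendered as <h1>): the first block of the document,
    provided that it also qualifies as a heading
    """
    type = 'title'
    first = True

    def condition(self, block):
        if not self.first: return False
        self.first = False
        return headingrule.condition(self, block)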

class listitemrule(rule):
    """
    List item rule: a block that starts with a hyphen
    """
    type = 'listitem'
    def condition(self, block):
        return block[0] == '-'

    def action(self, block, handler):
        handler.start(self.type)
        # Drop the leading hyphen before outputting the item text.
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class listrule(listitemrule):
    """
    List rule: opens the list before its first item and closes it at the
    first block that is not a list item
    """
    type = 'list'
    inside = False
    def condition(self, block):
        return True

    def action(self, block, handler):
        if not self.inside and listitemrule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not listitemrule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        # Always False, so the remaining rules still process the block.
        return False

class paragraphrule(rule):
    """
    Paragraph rule: the fallback that matches any block
    """
    type = 'paragraph'

    def condition(self, block):
        return True
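
The rules are meant to be tried in order, with the catch-all paragraph rule last. The following minimal Python 3 sketch illustrates that ordering. RecordingHandler and apply_rules are hypothetical helpers written for this example (the real loop lives in the parser in the next section), and only two of the rules are reproduced.

```python
class RecordingHandler:
    """Hypothetical handler that records events instead of printing HTML."""
    def __init__(self):
        self.events = []
    def start(self, name):
        self.events.append('start ' + name)
    def end(self, name):
        self.events.append('end ' + name)
    def feed(self, data):
        self.events.append('feed ' + data)

class Rule:
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        return True  # this block is fully handled

class HeadingRule(Rule):
    type = 'heading'
    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and block[-1] != ':'

class ParagraphRule(Rule):
    type = 'paragraph'
    def condition(self, block):
        return True  # catch-all, so it must come last

def apply_rules(rules, block, handler):
    """Run matching rules in order until one action returns True."""
    for rule in rules:
        if rule.condition(block) and rule.action(block, handler):
            break

handler = RecordingHandler()
rules = [HeadingRule(), ParagraphRule()]
apply_rules(rules, 'A short heading', handler)
apply_rules(rules, 'The following items are needed:', handler)
print(handler.events)
```

The first block is short and does not end with a colon, so it is treated as a heading. The second ends with a colon, fails the heading condition, and falls through to the paragraph rule.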

4. Parsing

Finally, we can do the actual parsing. The code for markup.py is as follows:

#!/usr/bin/python
# encoding: utf-8

import sys, re
from handlers import *
from util import *
from rules import *

class parser:
    """
    Parser base class
    """
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []

    def addrule(self, rule):
        """
        Add a rule
        """
        self.rules.append(rule)

    def addfilter(self, pattern, name):
        """
        Add a filter
        """
        def filter(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(filter)

    def parse(self, file):
        """
        Parse the file
        """
        self.handler.start('document')
        for block in blocks(file):
            # Apply the inline filters first, then the block-level rules.
            for filter in self.filters:
                block = filter(block, self.handler)
            for rule in self.rules:
                if rule.condition(block):
                    last = rule.action(block, self.handler)
                    # Stop at the first rule that fully handles the block.
                    if last: break
        self.handler.end('document')

class basictextparser(parser):
    """
    Plain-text parser
    """
    def __init__(self, handler):
        parser.__init__(self, handler)
        # Order matters: rules are tried from first to last.
        self.addrule(listrule())
        self.addrule(listitemrule())
        self.addrule(titlerule())
        self.addrule(headingrule())
        self.addrule(paragraphrule())

        self.addfilter(r'\*(.+?)\*', 'emphasis')
        self.addfilter(r'(http://[\.a-zA-Z/]+)', 'url')
        self.addfilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')

# Run the program
handler = htmlrenderer()
parser = basictextparser(handler)
parser.parse(sys.stdin)
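
To get a feel for what the three filter patterns match, here is a small Python 3 sketch that runs them with re.findall. The patterns are the ones registered in basictextparser (with the character classes written as a-zA-Z); the constant names and the sample sentences are assumptions made for this illustration.

```python
import re

EMPHASIS = r'\*(.+?)\*'
URL = r'(http://[\.a-zA-Z/]+)'
MAIL = r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)'

# Hypothetical sample text.
text = 'Visit http://www.example.com or mail *me* at someone@example.com.'

emphasis = re.findall(EMPHASIS, text)
urls = re.findall(URL, text)
mails = re.findall(MAIL, text)
print(emphasis)  # ['me']
print(urls)      # ['http://www.example.com']
print(mails)     # ['someone@example.com']

# The patterns are deliberately simple, which has some consequences:
no_https = re.findall(URL, 'https://example.com')        # scheme must be http://
truncated = re.findall(URL, 'http://example.com/page2')  # digits are not allowed
print(no_https)   # []
print(truncated)  # ['http://example.com/page']
```

The trailing [a-zA-Z]+ in the mail pattern keeps the sentence-ending period out of the match. URLs that use https or contain digits are not fully recognised, so a more complete parser would likely need broader patterns.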

Run the program (the plain-text input file is test.txt and the generated HTML file is test.html):

python markup.py < test.txt > test.html

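V. Summary

In this small program we used Python to parse a plain-text file and generate an HTML file. It is only a simple implementation. As a follow-up exercise, you can try extending it to parse Markdown files.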
