
Data Extraction (2): XPath - parsing HTML from strings and files with lxml: etree.HTML(), etree.tostring(), etree.parse(), etree.HTMLParser()

程序员文章站 2022-05-09 21:17:03

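I. Overview of the lxml library

lxml is a parser for HTML and XML; its main job is parsing such documents and extracting data from them. Like Python's regular expression engine, it is implemented in C (it is built on the libxml2 and libxslt C libraries), which makes it a high-performance HTML/XML parser for Python. It lets you use the XPath syntax covered earlier to quickly locate specific elements and node information.

Official lxml documentation: http://lxml.de/index.html

lxml relies on C libraries; it can be installed with pip: pip install lxml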

II. Basic usage of lxml

(1) Parsing HTML from a string: etree.HTML(str)

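Function signature:
HTML(text, parser=None, base_url=None)
	Parses an HTML document from a string constant.  Returns the root node (or the result returned by a parser target). 
What it does:
	If the HTML is malformed, the parser repairs it automatically while parsing (for example by adding missing closing tags).
	etree.HTML(string) parses a string into an HTML document.
	etree.tostring(html) serializes the HTML document to bytes; call .decode('utf-8') on the result to get a str.
		etree.tostring() is described in detail further down.
Return value: an object of type <class 'lxml.etree._Element'>

Parameters:
	parser: pass a custom parser to override the default parsing behavior; usually not needed.
	base_url: sets a URL for the resulting document, so that external entities can be looked up via relative paths.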
		The ``base_url`` keyword allows setting a URL for the document
	    when parsing from a file-like object.  This is needed when looking
	    up external entities (DTD, XInclude, ...) with relative paths.
	
from lxml import etree 
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a> <!--注意 这里少了一个<li>标签-->
     </ul>
 </div>
'''
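html = etree.HTML(text)   # parse the string; missing tags are repaired
print(type(html))
print(html)

bytes_res = etree.tostring(html)   # serializes to bytes (ASCII by default)
print(bytes_res)
str_res = etree.tostring(html).decode('utf-8')
print(str_res)
print(type(str_res))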

Output:
<class 'lxml.etree._Element'>
<Element html at 0x3197b98>

b'<html><body><div>\n    <ul>\n         <li class="item-0"><a href="link1.html">first item</a></li>\n         <li class="item-1"><a href="link2.html">second item</a></li>\n         <li class="item-inactive"><a href="link3.html">third item</a></li>\n         <li class="item-1"><a href="link4.html">fourth item</a></li>\n         <li class="item-0"><a href="link5.html">fifth item</a>  <!--&#27880;&#24847; &#36825;&#37324;&#23569;&#20102;&#19968;&#20010;<li>&#26631;&#31614;-->\n     </li></ul>\n </div>\n</body></html>'

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>  <!--&#27880;&#24847; &#36825;&#37324;&#23569;&#20102;&#19968;&#20010;<li>&#26631;&#31614;-->
     </li></ul>
 </div>
</body></html>
<class 'str'>

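lxml repairs the HTML automatically. In this example it not only closed the li tag but also added the body and html tags.
However, the Chinese comment is not readable in the output: each character is written as a numeric character reference such as &#27880;. To see why, look at etree.tostring().
Signature of etree.tostring() (for reference only):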
tostring(element_or_tree, encoding=None, method="xml",
                     xml_declaration=None, pretty_print=False, with_tail=True,
                     standalone=None, doctype=None,
                     exclusive=False, inclusive_ns_prefixes=None,
                     with_comments=True, strip_text=False,
                     )
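What it does: Serialize an element to an encoded string representation of its XML tree.
Parameters:
	encoding, xml_declaration: control the output encoding
		Note: with no arguments, tostring() serializes to ASCII without an XML declaration.
		ASCII cannot represent Chinese characters, so they are written as numeric character references (&#...;). To keep them readable, pass encoding='utf-8'.
		If encoding is set to something that is not compatible with UTF-8, an XML declaration is emitted by default.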
	
        Defaults to ASCII encoding without XML declaration.  This
        behaviour can be configured with the keyword arguments 'encoding'
        (string) and 'xml_declaration' (bool).  Note that changing the
        encoding to a non UTF-8 compatible encoding will enable a
        declaration by default.
    
        You can also serialise to a Unicode string without declaration by
        passing the name ``'unicode'`` as encoding (or the ``str`` function
        in Py3 or ``unicode`` in Py2).  This changes the return value from
        a byte string to an unencoded unicode string.
    
	pretty_print (bool): enables formatted XML.
    
	method: selects the output method: 'xml', 'html', plain 'text' (text content without tags), 'c14n' or 'c14n2'. Default is 'xml'.
 
For more details, see help(etree.tostring) and dir(etree).
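
To make the method and pretty_print options more concrete, here is a minimal sketch. The small HTML fragment is made up for illustration and is not from the example above; only etree.HTML() and etree.tostring() are used.

```python
from lxml import etree

# A tiny made-up fragment containing an empty element (<br>).
root = etree.HTML('<div><br><p>hi</p></div>')

xml_out = etree.tostring(root)                        # default method='xml'
html_out = etree.tostring(root, method='html')        # HTML serialization rules
text_out = etree.tostring(root, method='text')        # text content only, no tags
pretty_out = etree.tostring(root, pretty_print=True)  # formatted output

print(xml_out)    # the empty element is written as <br/>
print(html_out)   # the empty element is written as <br>
print(text_out)
print(pretty_out.decode('ascii'))
```

The difference between 'xml' and 'html' matters mostly for empty elements such as br, which the XML method writes in self-closing form.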


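So in the code above, change
str_res = etree.tostring(html).decode('utf-8')
to
str_res = etree.tostring(html, encoding='utf-8').decode('utf-8')
This fixes the unreadable Chinese: tostring() encodes the output as UTF-8, and the result is then decoded as UTF-8.

The reason is that tostring() encodes as ASCII by default, and ASCII cannot represent Chinese characters.
Passing encoding='utf-8' makes tostring() encode the Chinese text as UTF-8,
so decoding the bytes afterwards prints it correctly.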
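
The three encoding choices can be compared side by side. This is a small sketch with a made-up one-line document; the 'unicode' variant comes from the tostring() documentation quoted above.

```python
from lxml import etree

# Made-up document containing two Chinese characters.
root = etree.HTML('<p>中文</p>')

# Default: ASCII bytes, non-ASCII characters become numeric character references.
ascii_bytes = etree.tostring(root)

# UTF-8 bytes, decoded back to str afterwards.
utf8_text = etree.tostring(root, encoding='utf-8').decode('utf-8')

# encoding='unicode' returns a str directly, so no decode() is needed.
unicode_text = etree.tostring(root, encoding='unicode')

print(ascii_bytes)
print(utf8_text)
print(unicode_text)
```

If all you need is a str, encoding='unicode' saves the separate decode() step.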

(2) Reading HTML from a file: etree.parse(file)

Function signature:
parse(source, parser=None, base_url=None): 
Return value: an ElementTree object
Parameters:
	(1)The ``source`` can be any of the following:
    
        - a file name/path
        - a file object
        - a file-like object
        - a URL using the HTTP or FTP protocol

	(2)If no parser is provided as second argument, the default parser is used.
		To parse from a string, use the ``fromstring()`` function instead.
    
        Note that it is generally faster to parse from a file path or URL
        than from an open file object or file-like object.  Transparent
        decompression from gzip compressed sources is supported (unless
        explicitly disabled in libxml2).


from lxml import etree
html = etree.parse('html.html')
res = etree.tostring(html)
print(res.decode('utf-8')) 

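Output:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: li line 7 and ul, line 8, column 11

Note:
When reading directly from an HTML file with etree.parse(), syntax errors are not repaired automatically as they are when parsing from a string; an error is raised instead. This is because etree.parse() uses the default XML parser unless a parser is passed, whereas etree.HTML() uses an HTML parser.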
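
The difference can be reproduced without a file on disk. In the sketch below, io.BytesIO stands in for html.html (an assumption made to keep the example self-contained), and the broken fragment is made up.

```python
import io
from lxml import etree

# Made-up fragment with an unclosed <li>, standing in for the file contents.
broken = b'<div><ul><li>first item</ul></div>'

# Default parser (XML): the mismatched tags raise XMLSyntaxError.
try:
    etree.parse(io.BytesIO(broken))
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print('XML parser failed:', exc)

# HTML parser: the same input is repaired instead of rejected.
tree = etree.parse(io.BytesIO(broken), etree.HTMLParser())
root = tree.getroot()
print(etree.tostring(root).decode('ascii'))
```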

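After fixing the syntax error in the HTML file, parsing succeeds, but the Chinese text in the output is again unreadable, and the html and body tags are not added automatically.
Output (in the actual result the Chinese comment appears as numeric character references):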
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>  <!--注意,这里少了一个<li>标签-->
     </ul>
 </div>

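Fixing the unreadable Chinese
Method 1:
	Pass the encoding, as in the string case: res = etree.tostring(html, encoding='utf-8')
Method 2:
	print('=========== Method 2: parse with an HTMLParser ===========')
	parser = etree.HTMLParser(encoding='utf-8')
	htmlElement = etree.parse("html.html", parser=parser)
	print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

With this approach every line ends with an extra &#13;. That is the numeric character reference for the carriage return (\r), which suggests the file was saved with Windows (CRLF) line endings. The point here is mainly to show that a custom parser can be passed.
Output: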
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
    <ul>&#13;
         <li class="item-0"><a href="link1.html">first item</a></li>&#13;
         <li class="item-1"><a href="link2.html">second item</a></li>&#13;
         <li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
         <li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
         <li class="item-0"><a href="link5.html">fifth item</a></li>  <!--注意,这里少了一个<li>标签-->&#13;
     </ul>&#13;
 </div>&#13;
</body></html>
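
One possible way to avoid the &#13; entries is to normalize line endings before parsing. This is only a sketch: the byte string below is made up and stands in for the contents of html.html, and whether the parser keeps carriage returns at all may depend on the lxml/libxml2 version in use.

```python
from lxml import etree

# Made-up content with Windows (CRLF) line endings, standing in for html.html.
raw = b'<div>\r\n<p>text</p>\r\n</div>'

# Normalize CRLF to LF before handing the bytes to the parser.
cleaned = raw.replace(b'\r\n', b'\n')

root = etree.HTML(cleaned)
result = etree.tostring(root, encoding='utf-8').decode('utf-8')
print(result)
```

For a real file, the same replacement could be applied to the bytes read from disk before calling etree.HTML().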

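Summary

To parse HTML from a string, use etree.HTML().

To parse HTML from a file, use etree.parse().

The parser argument of etree.HTML() or etree.parse() can be built with etree.HTMLParser().

To read the parsed HTML back as text, use etree.tostring(...).decode().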
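
As a closing illustration, the pieces can be combined with the XPath syntax mentioned in the overview. This is a minimal sketch on a shortened, made-up version of the example list; the XPath expressions are illustrative choices, not part of the original walkthrough.

```python
from lxml import etree

# Shortened version of the example list; the second <li> is left unclosed on purpose.
text = ('<ul><li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></ul>')

root = etree.HTML(text)                 # parse and repair
hrefs = root.xpath('//li/a/@href')      # attribute values
labels = root.xpath('//li/a/text()')    # link texts

print(hrefs)
print(labels)
print(etree.tostring(root, encoding='utf-8').decode('utf-8'))
```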
