BeautifulSoup in Python Explained

程序员文章站 2022-04-26 10:46:15


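BeautifulSoup is a Python library for extracting data from HTML or XML. For HTML that is not well formed, lxml offers two useful tools: the lxml.html module and the BeautifulSoup parser.

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.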

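Command to install BeautifulSoup4 (the -i option points pip at the Tsinghua PyPI mirror; omit it to use the default index):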

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple BeautifulSoup4
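
To confirm the installation worked, one quick check is to import the package and print its version. This is only a sketch, and it assumes the command above was run in the same Python environment you are testing in:

```python
import bs4

# The package is installed as "beautifulsoup4" but imported as "bs4".
version = bs4.__version__
print("Beautiful Soup version:", version)
```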

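First, a sample file to learn with: it is called hezhi.html, and its contents are given at the bottom of this article. Copy them and save them as hezhi.html.

The most basic BeautifulSoup example: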

from bs4 import BeautifulSoup
soup = BeautifulSoup( "<p>这是一个html的P标签</p>", "html.parser" )
print(soup)
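
Beyond printing the whole object, the parsed result can be queried right away. The sketch below uses an English string in place of the one above (an assumption made only for readability) and reads the tag's name and text:

```python
from bs4 import BeautifulSoup

# Parse a one-tag snippet with the parser that ships with Python.
soup = BeautifulSoup("<p>This is an HTML p tag</p>", "html.parser")

tag_name = soup.p.name      # name of the first <p> tag
tag_text = soup.p.string    # the single string inside that tag

print(tag_name, tag_text)
```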

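Now for the demonstration with our hezhi.html example:

First put hezhi.html in a fixed location. If you are not yet comfortable with reading files, just put it on the D drive.

I use the D drive here: the file sits directly in the root of D.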
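
If you would rather not depend on a fixed drive letter, BeautifulSoup also accepts an open file object directly. The following is a minimal sketch under assumptions: it writes a tiny stand-in hezhi.html to a temporary directory (the one-line content is made up for this illustration) and parses it from the file handle:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for hezhi.html, created only for this sketch.
    path = Path(tmp) / "hezhi.html"
    path.write_text("<h1>I am an H1</h1>", encoding="utf-8")

    # The with block closes the file once parsing is done.
    with open(path, "r", encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

heading = soup.h1.string
print(heading)
```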


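Let's start experimenting.

First: the function we look at is prettify()

prettify() means "beautify": it formats the output, so the contents of hezhi.html are printed as an indented tree rather than as a hard-to-read lump of markup.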

from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Pretty-print the parsed document with BeautifulSoup
soup = BeautifulSoup( demo , "html.parser")
print( soup.prettify() )
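
To see what prettify() changes without needing the file, here is a small sketch on an inline string (the snippet is invented for this illustration). It compares the compact form with the pretty-printed form:

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p></div>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)        # the markup on a single line
pretty = soup.prettify()   # each tag and string on its own line, indented by depth

print(compact)
print(pretty)
```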

Second: now it starts to feel like data analysis, looking for information one tag at a time.

1. Find the first tag of a given type

from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Extract the first tag of each type
soup = BeautifulSoup( demo , "html.parser")
print(soup.a) # prints the first a tag
print(soup.div) # prints the first div tag
print(soup.input) # prints the first input tag
print(soup.h1) # prints the first h1 tag
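
One thing worth knowing about the soup.Tag shortcut: when the document has no tag of that type, it gives back None rather than raising an error. A short sketch, using a made-up two-div snippet:

```python
from bs4 import BeautifulSoup

html = '<div class="first">one</div><div class="second">two</div>'
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div   # only the first <div> in document order
missing = soup.table   # there is no <table> here, so this is None

print(first_div)
if missing is None:
    print("no <table> tag found")
```

Checking for None before touching attributes such as .string avoids an AttributeError on pages that lack the tag.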

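2. Find all tags of the same type with the function find_all(tag type)

The code below shows how to extract input tags and div tags; other tags work the same way.

The return value behaves like a list and supports indexing, so you can quickly print the first tag, the second tag, and so on.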

from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()
# Use find_all to extract every tag of the same type
soup = BeautifulSoup( demo , "html.parser")

print( soup.find_all("input") )

print( soup.find_all("div") )

# Print the first and second input tags
print( soup.find_all("input")[0] )

print( soup.find_all("input")[1] )
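
Because the result is list-like, the usual list operations apply. The sketch below (on a form similar to the one in hezhi.html, inlined here so it runs on its own) counts the matches, loops over them, and uses the limit argument to stop early:

```python
from bs4 import BeautifulSoup

html = """
<form>
  <input class="user" type="text">
  <input class="pwd" type="text">
  <input type="submit">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

inputs = soup.find_all("input")
count = len(inputs)                            # number of matches
types = [tag.get("type") for tag in inputs]    # loop over the matches
first_two = soup.find_all("input", limit=2)    # stop after two matches
no_tables = soup.find_all("table")             # no match gives an empty result

print(count, types, len(first_two), len(no_tables))
```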

3. Extract tags by their distinguishing information: this uses a dictionary such as attrs = { "class":"third","name":"three" }, in which each attribute maps to one value

from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()
# Use specific attribute values to narrow down tags of the same type
soup = BeautifulSoup( demo , "html.parser")

# Find div tags by attribute
print( soup.find_all("div" , attrs = { "class":"first" } ) )

print( soup.find_all("div" , attrs = { "class":"third","name":"three" } )  )

print( soup.find_all("div" , attrs = { "class":"third","name":"three" } )[0]  )

# Find input tags by attribute
print( soup.find_all("input" , attrs = { "class":"user" } )[0]  )

print( soup.find_all("input" , attrs = { "type":"text" } )[0]  )

print( soup.find_all("input" , attrs = { "type":"text" } )[1]  )
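
The attrs dictionary is not the only way to filter. As a sketch on a shortened, inlined version of the page: most attributes can also be passed as keyword arguments, with class written as class_ because class is a reserved word in Python. The HTML attribute called name still needs the dictionary, since find_all already uses name for the tag type.

```python
from bs4 import BeautifulSoup

html = """
<div class="first">one</div>
<div class="third" name="three">three</div>
<input class="user" type="text">
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form: required for an HTML attribute called "name".
by_dict = soup.find_all("div", attrs={"class": "third", "name": "three"})

# Keyword form: class_ stands in for the reserved word class.
by_keyword = soup.find_all("div", class_="third")

# Other attributes can be passed as plain keywords.
text_inputs = soup.find_all("input", type="text")

print(by_dict, by_keyword, text_inputs)
```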

4. A few commonly used properties of a tag; other tags work the same way as shown below

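from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()
# Print the properties of a tag
soup = BeautifulSoup( demo , "html.parser")

print( soup.div.attrs )  # the tag's attributes, as a dictionary
print( soup.div.attrs["class"] ) # the tag's class (a list, because class can hold several values)

print( soup.a.attrs)
print( soup.a.attrs["href"]) # the link target of the a tag

print( soup.div.name )  # prints div; soup.a.name would print a
print( soup.div.string ) # the tag's single child string; this can also be a comment
print( soup.div.text ) # all text inside the tag, with comments filtered out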
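
The difference between .string and .text is easiest to see on tags built for the purpose. The sketch below uses an invented snippet with three cases: a tag holding only a comment, a tag holding plain text, and a tag with several children. It uses get_text(), which is the method behind the .text property:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<p><!--only a comment--></p><div>plain text</div><ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")

comment_string = soup.p.string                   # the comment itself
is_comment = isinstance(comment_string, Comment)
comment_text = soup.p.get_text()                 # comments are skipped, so this is empty

div_string = soup.div.string                     # a single child string
list_string = soup.ul.string                     # several children, so None
list_text = soup.ul.get_text()                   # text of all descendants joined together

print(repr(comment_string), repr(comment_text), div_string, list_string, list_text)
```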

5. A small step further: find the content whose class is second

from bs4 import BeautifulSoup
# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Print the text and the name of the matching tag
soup = BeautifulSoup( demo , "html.parser")

print( soup.find_all("div" , attrs = {"class":"second" } )[0].string )
print( soup.find_all("div" , attrs = {"class":"second" } )[0].name )

# My page contains only one div with class second, so the code below is equivalent; find() returns the first match
print( soup.find("div" , attrs = {"class":"second" } ).string )
print( soup.find("div" , attrs = {"class":"second" } ).name )
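
find() and find_all()[0] only agree when there is a match. When nothing matches, find() returns None while find_all() returns an empty result, so indexing it with [0] raises IndexError. A sketch on a one-div snippet (the class "fourth" is a made-up value that matches nothing):

```python
from bs4 import BeautifulSoup

html = '<div class="second">two</div>'
soup = BeautifulSoup(html, "html.parser")

hit = soup.find("div", attrs={"class": "second"})
miss = soup.find("div", attrs={"class": "fourth"})        # hypothetical class, no match
misses = soup.find_all("div", attrs={"class": "fourth"})  # empty result

# Guard against None before reading .string.
hit_text = hit.string if hit is not None else "(not found)"
miss_text = miss.string if miss is not None else "(not found)"

print(hit_text, miss_text, len(misses))
```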

A summary of what we have learned about BeautifulSoup:

Tag in soup.Tag stands for any tag name, for example soup.a, soup.div, soup.h1, soup.input

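Expression	Meaning	Example
soup.Tag	The first tag of that type, with all of its content	soup.div soup.a
soup.Tag.attrs	The tag's attributes, as a dictionary (attrs is short for attributes)	soup.a.attrs
soup.Tag.string	The tag's text content	a string
soup.prettify()	Pretty-prints the scraped document	soup.prettify()
soup.Tag.name	The tag's name	div, a
soup.find_all("Tag.name")	All tags of that type, with index values	soup.find_all("input")
soup.find_all("input",attrs={"type":"text"})	All tags matching the name and the attribute dictionary; the result is list-like with indexes and a length; the arguments are (name, attrs={attribute dictionary})	soup.find_all("input",attrs={"type":"text"})
soup.a.children	The direct children, as an iterator: no index, no length, can be looped over
soup.a.parent	The tag's parent node	soup.input.parent
soup.a.parents	All ancestors, as an iterator that can only be looped over	soup.a.parents
soup.Tag.contents	The tag's direct children, as a list with indexes and a length	soup.body.contents
soup.Tag.next_sibling	The next node at the same level	soup.div.next_sibling
soup.Tag.next_siblings	All following nodes at the same level: no index, no length, an iterator that can be looped over
soup.Tag.previous_sibling	The previous node at the same level
soup.Tag.previous_siblings	All preceding nodes at the same level: no index, no length, an iterator that can be looped over
soup.p.string	The tag's only child string, or None if the tag has several children; a comment is returned too
soup.p.text	The text content, excluding comments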
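
The navigation properties in the table can be tried out on a small tree. The sketch below uses an invented snippet with no whitespace between tags, so every sibling is a tag. In a file such as hezhi.html, the line breaks between tags are text nodes too, so next_sibling there is often a whitespace string rather than the next tag.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

child_names = [child.name for child in div.children]   # iterator, so loop over it
contents_len = len(div.contents)                       # list, so it has a length
parent_name = soup.h1.parent.name                      # the enclosing tag
next_name = soup.h1.next_sibling.name                  # node right after <h1>
prev_name = soup.p.previous_sibling.name               # node right before the first <p>
later_names = [s.name for s in soup.h1.next_siblings]  # every following sibling
ancestor_names = [a.name for a in soup.h1.parents]     # from parent up to the document

print(child_names, contents_len, parent_name, next_name, prev_name)
print(later_names, ancestor_names)
```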

The contents of hezhi.html:

<html>
<head>
	<meta charset="utf-8">
	<title>BeautifulSoup</title>	
	<style>
	*{
		margin:0;
		padding:0;
		text-align:center;
	}
	.first{
		background-color:yellow;
	}
	.second{
		background-color:blue;
	}
	.third{
		background-color:green;
	}
	
	</style>
	
</head>

<body>

	<div class="first">I'am the First Div</div>
	
	<div class="second">I'am the Second Div</div>
	
	<div class="third" name="three">
		<H1>I'am a H1</H1>
		<p name="P1">I'am  a  P</p>
	</div>
	
	<form action="#">
		<input class="user" type="text" /> <br/>
		
		<input class="pwd" type="text"  /> <br/>
		
		<input type="submit">	
	</form>
	
	<a class="Tag_A" href="http://www.baidu.com">百度</a>
	<p name="P2" > <!--我是注释--> </p>
	
</body>

</html>
