A Quick Guide to bs4.BeautifulSoup(): Structure and Usage

程序员文章站 2022-05-02 17:37:10

bs4 official documentation

Beautiful Soup 4 has been ported into the bs4 package, so to use it you need from bs4 import BeautifulSoup.

Beautiful Soup 4.2.0 documentation (Chinese)

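HTML files and the tag tree

An HTML file is organized by tags written in angle brackets. Each pair of angle brackets forms a tag, and tags nest inside one another, so together they form a tag tree.

Beautiful Soup can therefore be described as a library for parsing, traversing, and maintaining this tag tree.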
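To make the tag tree concrete, here is a minimal sketch that parses a tiny document and lists the path of every tag in it. The markup and the helper name tag_paths are made up for illustration; only BeautifulSoup, Tag, and .children come from bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny, hypothetical document used only for illustration.
markup = ("<html><head><title>Demo</title></head>"
          "<body><p class='title'><b>Hello</b></p></body></html>")

soup = BeautifulSoup(markup, "html.parser")

def tag_paths(node, prefix=""):
    """Return the path of every tag below node, depth first (e.g. 'html/body/p')."""
    paths = []
    for child in node.children:
        # Children can also be text nodes; only Tag objects are part of the tag tree.
        if isinstance(child, Tag):
            path = prefix + child.name
            paths.append(path)
            paths.extend(tag_paths(child, path + "/"))
    return paths

paths = tag_paths(soup)
print(paths)
# ['html', 'html/head', 'html/head/title', 'html/body', 'html/body/p', 'html/body/p/b']
```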

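Tags

The p tag (<p></p>): a Tag. Tag names normally come in pairs, one in the opening tag and one in the closing tag, as with p. The opening tag may carry zero or more attributes after the tag name, which describe the tag.

<p class="title">...</p>
# class is an attribute whose value is "title" (an attribute is a key-value pair)

In other words, an HTML document, its tag tree, and a BeautifulSoup object can be treated as equivalent.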

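Beautiful Soup parsers

Parser	Notes	Usage	Requirement
bs4's HTML parser (Python's html.parser)	Ships with Python; the usual default choice	BeautifulSoup(mk, 'html.parser')	Install bs4
lxml's HTML parser	Needs the lxml C library	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	Needs the lxml C library; the only parser that supports XML	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	Pure Python; parses the way a browser does; slow; most tolerant of malformed markup	BeautifulSoup(mk, 'html5lib')	pip install html5lib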
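Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper make_soup and its preferred argument are hypothetical names, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. Hypothetical helper, for illustration only.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")

soup, used = make_soup("<p class='title'>Hello</p>")
print(used, soup.p.get_text())
```

Different parsers can build different trees from malformed markup, so it is worth fixing one parser per project rather than relying on whatever happens to be installed.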

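The core of a crawler is parsing web pages and extracting the data you want. That data may be URLs (url, href), images, text, audio (MP3), or video (mp4, avi, ...). It sits inside the page's HTML, within a clear hierarchy of elements, so there is usually a pattern to follow; if there were not, the page would not be worth crawling. There are three main extraction tools: XPath, BeautifulSoup, and regular expressions (re). In brief:

XPath is a language for finding information in XML documents.
BeautifulSoup is a library that searches the tree-shaped document produced by BeautifulSoup().
Regular expressions (re) can only be applied to string objects.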
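As a rough comparison of two of these approaches, the sketch below extracts the same links once with re on the raw string and once with BeautifulSoup on the parsed tree. The page string is invented for illustration, and XPath is left out because it needs the separate lxml package.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment, for illustration only.
page = ('<p>Links: <a href="http://example.com/elsie">Elsie</a> '
        '<a href="http://example.com/lacie">Lacie</a></p>')

# Regular expressions operate on the raw string and know nothing about tags.
re_links = re.findall(r'href="([^"]+)"', page)

# BeautifulSoup operates on the parsed tree, so attributes are looked up by name.
soup = BeautifulSoup(page, "html.parser")
bs_links = [a["href"] for a in soup.find_all("a")]

print(re_links)   # ['http://example.com/elsie', 'http://example.com/lacie']
print(bs_links)   # ['http://example.com/elsie', 'http://example.com/lacie']
```

The regular expression only works while the markup keeps exactly this quoting style; the tree-based lookup does not depend on how the attribute was written.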

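Basic elements of the BeautifulSoup class

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag: a tag, the basic unit of organization, opened with <> and closed with </>. An HTML tag together with its attributes makes up a Tag.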

print(soup.title)
#<title>The Dormouse's story</title>
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
# soup.<tag name> returns the tag; only the first matching tag is returned
print(type(soup.a))
#<class 'bs4.element.Tag'>
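The snippets in this section use a soup object that is never defined in the article. Judging by the printed output, it seems to be built from a variant of the "three sisters" sample in the Beautiful Soup documentation; assuming so, a shortened reconstruction that reproduces the outputs shown here could look like this (the exact markup is a guess):

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed from the outputs shown in this article.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)      # <title>The Dormouse's story</title>
print(soup.p)          # the first <p> only
print(type(soup.a))    # <class 'bs4.element.Tag'>
```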

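A Tag has two important attributes:

Name: the name of the tag; for example, the name of <p>…</p> is 'p'. Access it with .name
Attributes: the attributes of the tag, organized as a dictionary. Access them with .attrs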

print(soup.name)
print(soup.head.name)   # name of the tag
#[document]
#head
print(soup.p.attrs)   # all attributes of the tag
#{'class': ['title'], 'name': 'dromouse'}   # every attribute is returned, as a dict

print(soup.p['class'])   # look up one attribute
#['title']
print(soup.p.get('class'))   # look up one attribute with get()
#['title']


soup.p['class'] = "newClass"   # modify an attribute
del soup.p['class']   # delete an attribute
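A self-contained sketch of the same operations, on a one-line document invented for illustration. It also shows two details that are easy to miss: class is multi-valued and comes back as a list, and get() returns None for a missing attribute where indexing would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, for illustration only.
soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = soup.p

before = dict(p.attrs)     # copy, so later edits do not change it
print(p.name)              # p
print(before)              # {'class': ['title', 'big'], 'id': 'intro'}
print(p.get("href"))       # None (p["href"] would raise KeyError)

p["class"] = "newClass"    # modify an attribute
del p["id"]                # delete an attribute
after = str(p)
print(after)               # <p class="newClass">Hi</p>
```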

NavigableString: a string that can be navigated; it is a type of its own

print(soup.p.string)   # the text inside the tag
#The Dormouse's story
print(type(soup.p.string))   # check its type
#<class 'bs4.element.NavigableString'>

BeautifulSoup: can be thought of as a special Tag that represents the whole document

print(type(soup.name))
#<class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
#{} (an empty dict)

Comment: a special kind of NavigableString; its output does not include the comment markers

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
#Elsie 
#<class 'bs4.element.Comment'>

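The content of this a tag is actually a comment, but .string prints it with the comment markers removed, so it can be mistaken for ordinary text and cause unexpected trouble.

(An angle bracket followed by an exclamation mark, <!--, marks the start of a comment.)

So it is best to check the type before using the value: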

# Check whether the value is a Comment first, then do something with it, such as printing it.
from bs4.element import Comment
if isinstance(soup.a.string, Comment):
    print(soup.a.string)
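In practice the check is more often used the other way round, to skip comments when collecting text. A minimal sketch, with a made-up two-link document and a hypothetical helper visible_string:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link contains only a comment.
markup = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")

def visible_string(tag):
    """Return tag.string as plain text, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

results = [visible_string(a) for a in soup.find_all("a")]
print(results)   # [None, 'Lacie']
```

isinstance() is used instead of comparing type() directly so that subclasses are handled too.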

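BeautifulSoup usage example

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')   # build a BeautifulSoup object from an HTML string
soup = BeautifulSoup(open('index.html'), 'html.parser')   # build one from a local file
print(soup.prettify())   # pretty-printed output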

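Once soup = BeautifulSoup() has produced the document tree, the next step is to pull out the parts you need. There are two basic approaches: traversing the tree and searching the tree.

The document tree here can be understood as the HTML document or, equally, as the BeautifulSoup object.

Traversing and searching the tree are simply the attributes and methods that the BeautifulSoup class provides.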

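Traversing the document tree

Direct children

The tag.contents attribute

A tag's .contents attribute returns the tag's children as a list, so a particular child can be picked out by list index.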

print(soup.head.contents)
#[<title>The Dormouse's story</title>]
print(soup.head.contents[0])
#<title>The Dormouse's story</title>

The tag.children attribute

It returns an iterator over the children rather than a list; loop over it to visit each child.

print(soup.head.children)
#<list_iterator object at 0x7f71457f5710>   (the exact repr depends on the bs4 version)
for child in soup.body.children:
    print(child)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
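One thing the output above hides is that direct children include text nodes, such as the newline between two tags. The following sketch, on a small made-up document, compares .contents and .children and filters the children down to tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with a newline between the two paragraphs.
soup = BeautifulSoup("<body><p>one</p>\n<p>two</p></body>", "html.parser")
body = soup.body

# .contents is a real list, so len() and indexing work.
n_children = len(body.contents)
print(n_children)            # 3: <p>one</p>, the "\n" text node, <p>two</p>
print(body.contents[0])      # <p>one</p>

# .children is an iterator over the same nodes; keep only the Tag objects.
child_tags = [c for c in body.children if isinstance(c, Tag)]
texts = [t.get_text() for t in child_tags]
print(texts)                 # ['one', 'two']
```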

All descendants
Node content
Multiple contents
Parent node
All parents
Sibling nodes
All siblings
Next and previous nodes
All next and previous nodes
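The topics above are listed without examples. Assuming they correspond to the navigation attributes in the Beautiful Soup documentation (.descendants, .string, .strings and .stripped_strings, .parent, .parents, .next_sibling and .previous_sibling, .next_siblings, .next_element), a minimal sketch on an invented document could look like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, written without whitespace between tags to keep the tree small.
markup = ('<html><body><p id="first"><b>bold</b> text</p>'
          '<p id="second">more</p></body></html>')
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="first")

# All descendants: children, grandchildren, and so on (text nodes included).
descendant_tags = [d.name for d in first.descendants if isinstance(d, Tag)]
print(descendant_tags)                    # ['b']

# Node content: .string only works when a tag has exactly one child.
print(first.b.string)                     # bold
print(first.string)                       # None, because <p> has two children

# Multiple contents: iterate over every string below the tag.
print(list(first.strings))                # ['bold', ' text']
print(list(first.stripped_strings))       # ['bold', 'text']

# Parent and all parents.
print(first.parent.name)                  # body
parent_names = [p.name for p in first.parents]
print(parent_names)                       # ['body', 'html', '[document]']

# Siblings and all siblings.
second = first.next_sibling
print(second["id"])                       # second
print(second.previous_sibling is first)   # True
sibling_ids = [s["id"] for s in first.next_siblings]
print(sibling_ids)                        # ['second']

# Next element: the next node in parse order, not necessarily a sibling.
print(first.next_element.name)            # b
print(first.b.next_element)               # bold
```

In real pages the whitespace between tags is a text node too, so .next_sibling often returns a newline string rather than the next tag.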

Searching the document tree
- find_all()
- find()
- find_parents() & find_parent()
- find_next_siblings() & find_next_sibling()
- find_previous_siblings() & find_previous_sibling()
- find_all_next() & find_next()
- find_all_previous() & find_previous()
- CSS selectors

  In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements, through **soup.select()**, which returns a list.
  1. ```
     print(soup.select('title'))   # find by tag name
     ```
  2. ```
     print(soup.select('.sister'))   # find by class name
     ```
  3. ```
     print(soup.select('#link1'))   # find by id
     ```
  4. ```
     print(soup.select('p #link1'))   # combined lookup: the element with id link1 inside a p tag; the two parts are separated by a space
     ```
  5. ```
     print(soup.select('a[class="sister"]'))   # find by attribute
     ```

select() always returns its results as a list, so you can loop over them and call get_text() on each one to get its content.
```
soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
```
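The find_*() methods listed at the start of this section come without examples. Here is a minimal sketch of the most common ones on an invented fragment; find_all() returns every match as a list, find() returns the first match or None, and the other methods search relative to a starting tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only.
markup = """<div id="box"><p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p></div>"""
soup = BeautifulSoup(markup, "html.parser")

# find_all() and find()
all_ids = [a["id"] for a in soup.find_all("a")]
print(all_ids)                                        # ['link1', 'link2', 'link3']
print(soup.find("a")["id"])                           # link1
print(len(soup.find_all("a", class_="sister")))       # 3 (class_ avoids the Python keyword)
print(len(soup.find_all("a", limit=2)))               # 2
print(soup.find("table"))                             # None, nothing matches

# Searches relative to one tag
link2 = soup.find(id="link2")
print(link2.find_parent("p")["class"])                # ['story']
print([t["id"] for t in link2.find_parents("div")])   # ['box']
print(link2.find_next_sibling("a")["id"])             # link3
print(link2.find_previous_sibling("a")["id"])         # link1
print([a["id"] for a in link2.find_all_next("a")])    # ['link3']
print(link2.find_previous("a")["id"])                 # link1
```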
Reference

Note: this article was compiled from several good documents found online; for full details, see the reference links.
