A First Look at BeautifulSoup

程序员文章站 2022-05-02 17:39:34

Tag

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')
tag = soup.p    # the p in soup.p is the <p> tag from the markup above
type(tag)
#<class 'bs4.element.Tag'>

The name attribute of a Tag

tag.name
'p'
tag
#<p class="title"><b>The Dormouse's story</b></p>

tag.name='test'   # name can be modified
tag
#<test class="title"><b>The Dormouse's story</b></test>

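The attributes of a Tag
In the tag <p class='title'><b>The Dormouse's story</b></p>, class is an attribute of the tag, and a tag can have any number of attributes. Here the class attribute has the value title. Tag attributes are read and written the same way as dictionary entries.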

tag['class']
#['title']

A tag's attributes can be added, deleted, and modified, again just like a dictionary

tag['class'] = 'test1'      # modify
tag['id'] = 11              # add
tag
#<test class="test1" id="11"><b>The Dormouse's story</b></test>
del tag['class']            # delete
del tag['id']
tag
#<test><b>The Dormouse's story</b></test>
print(tag.get('class',404)) # get() returns the default when the attribute is missing
#404

You can also get all of a tag's attributes as a dictionary through .attrs

tag['class'] = 'test1'
tag['id'] = 11
tag.attrs
#{'class': 'test1', 'id': 11}

Multi-valued attributes of a Tag (not fully understood yet, to be continued)
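
On multi-valued attributes, here is a minimal sketch of the behaviour as I understand it from the Beautiful Soup documentation: an attribute such as class, which HTML defines as holding several values, is returned as a list, while an attribute such as id is returned as a single string even when it contains a space. The sample markup is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class holds two values, id holds one value with a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
tag = soup.p

class_value = tag["class"]   # multi-valued attribute: a list of strings
id_value = tag["id"]         # single-valued attribute: one plain string

# Assigning a list to a multi-valued attribute joins it with spaces on output.
tag["class"] = ["intro", "lead"]
rendered = str(tag)

print(class_value, id_value, rendered)
```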

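The BeautifulSoup object
The BeautifulSoup object (conventionally named soup) represents the whole document. Most of the time you can treat it like a Tag object, but because it does not correspond to a real HTML tag it has no attributes, and its .name is the special value '[document]'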

*Navigating the tree*
Demo document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Tag names
The simplest way to navigate the tree is to give the name of the tag you want. To get the <head> tag, just use soup.head

soup.head
#<head><title>The Dormouse's story</title></head>
soup.title
#<title>The Dormouse's story</title>

You can chain the dot access repeatedly to reach the tag you want

soup.head.title
#<title>The Dormouse's story</title>
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.b
#<b>The Dormouse's story</b>
soup.body.p.b
#<b>The Dormouse's story</b>

Note that attribute-style access only returns the first tag with that name:

soup.a
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get every tag with that name, use a method such as find_all()

soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children: these cover only a tag's direct children
A tag's .contents attribute returns the tag's children as a list

soup.head
#<head><title>The Dormouse's story</title></head>
soup.head.contents
#[<title>The Dormouse's story</title>]   (the direct child of <head> is <title>)
soup.head.contents[0].contents
#["The Dormouse's story"]

The .children iterator lets you loop over a tag's direct children

head_tag = soup.head
for child in head_tag.children:
    print(child)
#<title>The Dormouse's story</title>
# without a loop you only get the iterator object
head_tag.children
#<list_iterator object at 0x000000000349AF28>

.descendants: iterates recursively over all of a tag's descendants (children, grandchildren, and so on)

for child in head_tag.descendants:
    print(child)
#<title>The Dormouse's story</title>
#The Dormouse's story
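
To see the three attributes side by side, the sketch below parses just the <head> fragment from the demo document and counts what each one yields. The variable names are mine; the point is only that .contents and .children stop at direct children while .descendants keeps going.

```python
from bs4 import BeautifulSoup

# Only the <head> fragment of the demo document, to keep the counts obvious.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = head_tag.contents                # list of direct children
via_children = list(head_tag.children)    # the same nodes, produced by an iterator
everything = list(head_tag.descendants)   # children, then their children, and so on

print(len(direct), len(via_children), len(everything))
print(everything[1])   # the text node inside <title>
```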

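.string
If a tag has only one child and that child is a NavigableString, .string returns it. If the only child is another tag that itself has a .string, the outer tag reports the same .string, which is why soup.head.string works here: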

soup.head.string
#"The Dormouse's story"

If a tag contains more than one child, .string is None
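
A minimal sketch of the None case, using a made-up paragraph with one text node and one nested tag. It also shows get_text(), which is one way to collect all the text when .string gives up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has two children, a text node and a <b> tag.
soup = BeautifulSoup("<p>One <b>two</b></p>", "html.parser")

single = soup.b.string      # <b> has exactly one string child
multiple = soup.p.string    # <p> has two children, so .string is None
text = soup.p.get_text()    # concatenates every string under <p>

print(repr(single), repr(multiple), repr(text))
```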

.strings and .stripped_strings
If a tag contains several strings, loop over them with .strings:

for string in soup.strings:
    print(string)
#
#   
#
#
#The Dormouse's story
#
#
#
#
#The Dormouse's story
#
#
#Once upon a time there were three little sisters; and their names were
#
#Elsie
#,
#
#Lacie
# and
#
#Tillie
#;
#and they lived at the bottom of a well.
#
#
#...

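The strings you get this way may contain a lot of spaces and blank lines. Use .stripped_strings to remove the extra whitespace: strings that consist only of whitespace are skipped, and leading and trailing whitespace is stripped from the rest: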

for string in soup.stripped_strings:
    print(string)
#The Dormouse's story
#The Dormouse's story
#Once upon a time there were three little sisters; and their names were
#Elsie
#,
#Lacie
#and
#Tillie
#;
#and they lived at the bottom of a well.
#...
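
The difference is easier to see on a tiny input. The sketch below uses a made-up paragraph with deliberately messy whitespace and compares the two generators; joining the stripped strings with a space is just one convenient way to get clean text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding around every piece of text.
soup = BeautifulSoup("<p>  Hello  <b> world </b>\n</p>", "html.parser")

raw = list(soup.strings)             # every text node, whitespace included
clean = list(soup.stripped_strings)  # whitespace-only nodes dropped, the rest stripped
joined = " ".join(soup.stripped_strings)

print(raw)
print(clean)
print(joined)
```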

.parent
Use the .parent attribute to get an element's parent node. In the example above, the <head> tag is the parent of the <title> tag:

soup.title.parent
#<head><title>The Dormouse's story</title></head>

The string inside <title> also has a parent:

soup.title.string
#"The Dormouse's story"
soup.title.string.parent
#<title>The Dormouse's story</title>

.parents
The .parents attribute iterates over all of an element's ancestors:

for parent in soup.title.string.parents:
    print(parent.name)
#title
#head
#html
#[document]

*Searching the tree*
The example document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

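Filters: the find_all() method, which returns a list
① String: the simplest filter
Pass a string to a search method and BeautifulSoup looks for tags whose name matches that string exactly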

soup.find_all('p')
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <p class="story">...</p>]

soup.find_all('b')
#[<b>The Dormouse's story</b>]

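② Regular expression
Pass a compiled regular expression to find_all() and BeautifulSoup filters tag names with the pattern's search() method, so an unanchored pattern can match anywhere in the name.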

import re
for tag in soup.find_all(re.compile('^b')):  # all tags whose name starts with b
    print(tag.name)
#body
#b

for tag in soup.find_all(re.compile('t')):    # all tags whose name contains t
    print(tag.name)
#html
#title

③ List
Pass a list to find_all() and BeautifulSoup returns everything that matches any item in the list.

soup.find_all(['body','b'])
#[<body>
#<p class="title"><b>The Dormouse's story</b></p>
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#<p class="story">...</p>
#</body>, <b>The Dormouse's story</b>]

④ True
True matches any value, so it finds every tag in the document, but it does not return text string nodes

for tag in soup.find_all(True):
    print(tag.name)
#html
#head
#title
#body
#p
#b
#p
#a
#a
#a
#p

⑤ Function
find_all() also accepts a function that takes a single element as its argument. The function should return True if the element matches and False otherwise
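
As a sketch of a function filter, the example below keeps tags that have a class attribute but no id. The function name and the shortened document are my own choices for illustration; has_attr() is the Tag method that tests whether an attribute is present.

```python
from bs4 import BeautifulSoup

# A shortened version of the example document (two links instead of three).
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
names_and_classes = [(t.name, t["class"]) for t in matches]
print(names_and_classes)
```

The two <a> tags are skipped because they carry an id, and <b> is skipped because it has no class.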

*find_all()*
find_all(name, attrs, recursive, string, limit, **kwargs)
① The name parameter
The name parameter finds all tags whose name is name; text strings are ignored automatically.

soup.find_all('head')
#[<head><title>The Dormouse's story</title></head>]

soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

② Keyword arguments
If a keyword argument passed to find_all() is not one of its built-in parameter names, the search treats it as a filter on the tag attribute with that name.
The value of such an argument can be a string, a regular expression, a list, or True

soup.find_all(id='link1')   # string
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile("elsie"))  # regular expression
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id=True)  # True
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on several attributes at once by passing more than one keyword argument:

soup.find_all(id='link1',href=re.compile("elsie"))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
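
Some attribute names cannot be written as keyword arguments: data-foo is not a valid Python identifier, and name is already taken by find_all() itself. In those cases the attrs dictionary does the same job. The sketch below uses made-up markup to show both situations.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a data-* attribute and an HTML name attribute.
soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><a name="email" href="#">mail</a>',
    "html.parser",
)

# data-foo=... would be a syntax error as a keyword, so pass it through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

# name="email" as a keyword means "tags called email", which matches nothing here.
by_name_kw = soup.find_all(name="email")

# Through attrs it is treated as the HTML attribute called name.
by_name_attr = soup.find_all(attrs={"name": "email"})

print(by_data, by_name_kw, by_name_attr)
```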

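Searching by CSS class
Because class is a reserved word in Python, search for tags with a given CSS class through the class_ parameter.
class_ accepts the same kinds of filter: a string, a regular expression, True, or a function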

soup.find_all(class_='title') # string
#[<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all(class_=re.compile('itl'))  # regular expression
#[<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all(class_=True)  # True
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

The class attribute is multi-valued, so a search by CSS class matches against each of the tag's classes individually:

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
css_soup.find_all(class_='body')
#[<p class="body strikeout"></p>]

css_soup.find_all('p',class_='strikeout')
#[<p class="body strikeout"></p>]

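You can also search for the exact string value of the class attribute, as below, but the match then depends on order: if the class names are listed in a different order from the document, nothing is found: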

css_soup.find_all(class_='body strikeout')
#[<p class="body strikeout"></p>]
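
The order sensitivity can be checked directly. The sketch below repeats the same search with the classes reversed, then shows one possible order-independent alternative using a function filter (has_both is a helper name I made up for this example).

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string match on the attribute value: order matters.
same_order = css_soup.find_all("p", class_="body strikeout")
reversed_order = css_soup.find_all("p", class_="strikeout body")


def has_both(tag):
    """Return True if the tag carries both classes, in any order."""
    classes = tag.get("class", [])
    return "body" in classes and "strikeout" in classes


either_order = css_soup.find_all(has_both)

print(len(same_order), len(reversed_order), len(either_order))
```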

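The string parameter
The string parameter searches the text strings in the document instead of the tags. It accepts a string, a regular expression, True, or a list, and returns a list of strings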

soup.find_all(string='Tillie')   # string
#['Tillie']
soup.find_all(string=True)       # True
#['\n', "The Dormouse's story", '\n', '\n', "The Dormouse's story", '\n', 'Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
soup.find_all(string=re.compile('D'))  # regular expression
#["The Dormouse's story", "The Dormouse's story"]
soup.find_all(string=['Tillie','Elsie','Lacie'])  # list
#['Elsie', 'Lacie', 'Tillie']

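string can also be combined with other arguments, in which case the result is the matching tags rather than the strings

soup.find_all('a',string='Tillie')    # combined with a tag name; the positional name goes before string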
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(string='Tillie',id='link3')
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The limit parameter
find_all() returns every match. Use limit to cap the number of results; the search stops as soon as limit matches have been found

soup.find_all(class_='sister',limit=2)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(class_='sister',limit=1)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The recursive parameter: pass recursive=False to search only a tag's direct children

soup.find_all('title')
#[<title>The Dormouse's story</title>]
soup.find_all('title',recursive=False)
#[]
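
The empty result makes sense once you look at where <title> sits: it is a child of <head>, not of the document or of <html>. The sketch below, on a trimmed-down document of my own, runs the same non-recursive search from three different starting points.

```python
from bs4 import BeautifulSoup

# Trimmed-down document: html > head > title.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

deep = soup.find_all("title")                              # searches all descendants
from_soup = soup.find_all("title", recursive=False)        # soup's only direct child is <html>
from_html = soup.html.find_all("title", recursive=False)   # <html>'s only direct child is <head>
from_head = soup.head.find_all("title", recursive=False)   # <title> is a direct child of <head>

print(len(deep), len(from_soup), len(from_html), len(from_head))
```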

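Calling a tag like find_all()
Calling a BeautifulSoup object or a Tag as a function is the same as calling find_all() on it, so soup.find_all(args) is equivalent to soup(args)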

soup('title')
#[<title>The Dormouse's story</title>]
soup('a',id='link1')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

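The difference between find() and find_all()
find_all() returns a list of all matches, while find() returns the first match directly
When nothing matches, find_all() returns an empty list and find() returns None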
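
Because find() can return None, it is worth checking the result before using it. The sketch below, on a made-up one-link document, compares the two methods for a tag that exists and one that does not, and shows a simple guard.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document.
soup = BeautifulSoup(
    '<a id="link1" href="http://example.com/elsie">Elsie</a>', "html.parser"
)

first = soup.find("a")                  # the Tag itself
same = soup.find_all("a", limit=1)[0]   # the same Tag, taken out of a list

missing_one = soup.find("table")        # None
missing_all = soup.find_all("table")    # empty list

# Guard against None before touching attributes or text.
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

print(first is same, missing_one, list(missing_all), repr(table_text))
```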

-----------------------------------  This covers only part of BeautifulSoup
