A First Look at BeautifulSoup
Tag
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')
tag = soup.p  # the p in soup.p refers to the <p> tag in the statement above
type(tag)
#<class 'bs4.element.Tag'>
The name attribute of a Tag
tag.name
'p'
tag
#<p class="title"><b>The Dormouse's story</b></p>
tag.name = 'test'  # name can be modified
tag
#<test class="title"><b>The Dormouse's story</b></test>
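Tag attributes
In the tag <p class='title'><b>The Dormouse's story</b></p>, class is an attribute of the tag, and a tag can have any number of attributes. The value of the class attribute is title. Tag attributes are read and written the same way as dictionary entries.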
tag['class']
#['title']
Tag attributes can be added, deleted, and modified, exactly as with a dictionary:
tag['class'] = 'test1'  # modify
tag['id'] = 11  # add
tag
#<test class="test1" id="11"><b>The Dormouse's story</b></test>
del tag['class']  # delete
del tag['id']
tag
#<test><b>The Dormouse's story</b></test>
print(tag.get('class',404))
#404
You can also get a tag's attributes as a dictionary through .attrs:
tag['class'] = 'test1'
tag['id'] = 11
tag.attrs
#{'class': 'test1', 'id': 11}
Multi-valued attributes of a Tag (not fully understood yet; to be continued)
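As a starting point for this topic, here is a minimal sketch of how a multi-valued attribute behaves. The two tiny HTML snippets and the variable names are made up for illustration, and the example assumes the built-in html.parser. HTML defines class as an attribute that can hold several values, so BeautifulSoup returns it as a list, whereas id is single-valued and stays a plain string.

```python
from bs4 import BeautifulSoup

# class is multi-valued in HTML, so its value is split into a list of names
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
class_value = css_soup.p['class']
print(class_value)

# id is not multi-valued, so the value is kept as one string, space included
id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_value = id_soup.p['id']
print(id_value)
```

This also explains why tag['class'] returned ['title'] rather than 'title' earlier.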
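The BeautifulSoup object
A BeautifulSoup object (conventionally named soup) represents the entire content of a document. Most of the time you can treat it as a Tag object, but because it does not correspond to a real HTML tag it has no real name or attributes; its .name is the special value '[document]'.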
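A short sketch of that behaviour, assuming the built-in html.parser and reusing the one-line document from the Tag section. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')

# The soup object is not a real tag, so it carries a placeholder name
doc_name = soup.name
print(doc_name)

# It still supports Tag-style navigation and searching
bold_text = soup.b.string
first_p = soup.find('p')
print(bold_text, first_p.name)
```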
*Navigating the tree*
Example document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Tag names
The simplest way to navigate the tree is to give the name of the tag you want. To get the <head> tag, just write soup.head:
soup.head
#<head><title>The Dormouse's story</title></head>
soup.title
#<title>The Dormouse's story</title>
You can chain the dot accessor as many times as needed to reach the tag you want:
soup.head.title
#<title>The Dormouse's story</title>
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.b
#<b>The Dormouse's story</b>
soup.body.p.b
#<b>The Dormouse's story</b>
Note that dot access only returns the first tag with the given name:
soup.a
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
To get all the <a> tags, use a method such as find_all():
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
.contents and .children (a tag's direct children only)
A tag's .contents attribute returns the tag's children as a list:
soup.head
#<head><title>The Dormouse's story</title></head>
soup.head.contents
#[<title>The Dormouse's story</title>]  (<title> is the direct child of <head>)
soup.head.contents[0].contents
#["The Dormouse's story"]
A tag's .children generator lets you loop over its children:
head_tag = soup.head
for child in head_tag.children:
    print(child)
#<title>The Dormouse's story</title>
# without a loop, you get an iterator object
head_tag.children
#<list_iterator object at 0x000000000349AF28>
.descendants (loops recursively over all of a tag's descendants)
for child in head_tag.descendants:
    print(child)
#<title>The Dormouse's story</title>
#The Dormouse's story
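The difference between direct children and descendants is easy to see by counting them. The following is a small sketch on just the <head> fragment of the example, assuming the built-in html.parser; the variable names direct and everything are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'html.parser')
head_tag = soup.head

# .children yields only the direct children: the <title> tag
direct = list(head_tag.children)

# .descendants also walks into <title> and yields its text node
everything = list(head_tag.descendants)

print(len(direct), len(everything))
```

Here <head> has one child but two descendants, because the string inside <title> is a child of <title>, not of <head>.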
.string
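If a tag has only one child and that child is a NavigableString, .string returns it. If the only child is another tag that itself has a .string, the same string is returned: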
soup.head.string
#"The Dormouse's story"
If a tag contains more than one child, it is unclear which one .string should refer to, so the result is None.
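A minimal sketch of both cases, using a made-up fragment and the built-in html.parser. The call to get_text() is shown as one possible fallback when .string is None; it concatenates all the text inside the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", 'html.parser')

# <b> has exactly one string child, so .string returns it
single = soup.b.string

# <p> has two children (<b> and <i>), so .string is None
ambiguous = soup.p.string

# get_text() joins every string under the tag
all_text = soup.p.get_text()

print(single, ambiguous, all_text)
```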
.strings and .stripped_strings
If a tag contains more than one string, use .strings to loop over them:
for string in soup.strings:
    print(string)
#
#
#
#
#The Dormouse's story
#
#
#
#
#The Dormouse's story
#
#
#Once upon a time there were three little sisters; and their names were
#
#Elsie
#,
#
#Lacie
# and
#
#Tillie
#;
#and they lived at the bottom of a well.
#
#
#...
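The output may contain a lot of extra spaces and blank lines. Use .stripped_strings to remove the extra whitespace: strings consisting entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed:
for string in soup.stripped_strings:
    print(string)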
#The Dormouse's story
#The Dormouse's story
#Once upon a time there were three little sisters; and their names were
#Elsie
#,
#Lacie
#and
#Tillie
#;
#and they lived at the bottom of a well.
#...
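To compare the two generators side by side, it can help to collect them into lists. This is a small sketch on a made-up fragment, assuming the built-in html.parser; raw and clean are illustrative names.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello  <b> world </b>\n</p>", 'html.parser')

# .strings keeps every text node exactly as it appears in the document
raw = list(soup.strings)

# .stripped_strings trims each string and drops whitespace-only ones
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```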
.parent
The .parent attribute gives an element's parent. In the example above, the <head> tag is the parent of the <title> tag:
soup.title.parent
#<head><title>The Dormouse's story</title></head>
The title string itself also has a parent, the <title> tag that contains it:
soup.title.string
#"The Dormouse's story"
soup.title.string.parent
#<title>The Dormouse's story</title>
.parents
The .parents attribute iterates over all of an element's ancestors:
for parent in soup.title.string.parents:
    print(parent.name)
#title
#head
#html
#[document]
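The same walk can be collected into a list, which is convenient when you want an element's path in the document. A minimal sketch, assuming the built-in html.parser and a shortened version of the example document; ancestor_names is an illustrative name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser',
)

# Walk upward from the title string to the document root
ancestor_names = [parent.name for parent in soup.title.string.parents]
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is '[document]'.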
*Searching the tree*
The examples use the following document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
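<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
Filters: the find_all() method returns a list
① String: the simplest filter
Pass a string to a search method and BeautifulSoup looks for tags whose name matches that string exactly: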
soup.find_all('p')
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <p class="story">...</p>]
soup.find_all('b')
#[<b>The Dormouse's story</b>]
② Regular expression
When you pass a regular expression to find_all(), BeautifulSoup filters against it using the pattern's search() method.
import re
for tag in soup.find_all(re.compile('^b')):  # all tags whose names start with b
    print(tag.name)
#body
#b
for tag in soup.find_all(re.compile('t')):  # all tags whose names contain t
    print(tag.name)
#html
#title
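The two patterns differ only in the ^ anchor, and the results show that the expression is searched for anywhere in the tag name unless you anchor it yourself. Here is a self-contained sketch on a shortened document, assuming the built-in html.parser; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><b>x</b></body></html>",
    'html.parser',
)

# Anchored: only names that begin with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile('^b'))]

# Unanchored: any name containing "t", wherever it occurs
contains_t = [tag.name for tag in soup.find_all(re.compile('t'))]

print(starts_with_b)
print(contains_t)
```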
③ List
When you pass a list to find_all(), BeautifulSoup returns everything that matches any item in the list.
soup.find_all(['body','b'])
#[<body>
#<p class="title"><b>The Dormouse's story</b></p>
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#<p class="story">...</p>
#</body>, <b>The Dormouse's story</b>]
④ True
True matches every tag it can, so it finds all the tags in the document but none of the text strings:
for tag in soup.find_all(True):
    print(tag.name)
#html
#head
#title
#body
#p
#b
#p
#a
#a
#a
#p
⑤ Function
find_all() also accepts a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.
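As an illustration of a function filter, the sketch below keeps tags that have a class attribute but no id. The function name has_class_but_no_id and the three-tag fragment are made up for this example, and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(
    '<p class="title">a</p><a class="sister" id="link1">b</a><p>c</p>',
    'html.parser',
)

# Only the first <p> passes: <a> has an id, the last <p> has no class
matches = soup.find_all(has_class_but_no_id)
print(matches)
```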
*find_all()*
find_all(name, attrs, recursive, string, limit, **kwargs)
① The name argument
The name argument finds all tags whose name is name; text strings are ignored automatically.
soup.find_all('head')
#[<head><title>The Dormouse's story</title></head>]
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
② Keyword arguments
Any argument that find_all() does not recognize as one of its built-in argument names is treated as a filter on the tag attribute of that name.
An attribute can be filtered with a string, a regular expression, a list, or True:
soup.find_all(id='link1')  # string
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(href=re.compile("elsie"))  # regular expression
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(id=True)  # True
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
You can filter on several attributes at once by passing more than one keyword argument:
soup.find_all(id='link1',href=re.compile("elsie"))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
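Keyword arguments only work when the attribute name is a valid Python identifier. For names such as HTML5 data-* attributes, one option is to pass a dictionary through the attrs argument instead. The sketch below assumes the built-in html.parser; the data-foo attribute and the fragment are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

# soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute filter goes into the attrs dictionary
found = soup.find_all(attrs={'data-foo': 'value'})
print(found)
```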
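Searching by CSS class
To search by CSS class name, use the class_ argument (class is a reserved word in Python) to find tags with a given CSS class. class_ accepts the same kinds of filter: a string, a regular expression, True, or a function.
soup.find_all(class_='title')  # string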
#[<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all(class_=re.compile('itl'))  # regular expression
#[<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all(class_=True)  # True
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
When a tag's class is a multi-valued attribute, a search by CSS class name matches against each of the tag's class names individually:
css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
css_soup.find_all(class_='body')
#[<p class="body strikeout"></p>]
css_soup.find_all('p',class_='strikeout')
#[<p class="body strikeout"></p>]
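You can also search for the exact string value of the class attribute, but if the class names appear in a different order from the document, nothing is found:
css_soup.find_all(class_='body strikeout')
#[<p class="body strikeout"></p>]
css_soup.find_all(class_='strikeout body')
#[]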
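The string argument
The string argument searches the text strings in the document instead of the tags. It accepts a string, a regular expression, True, or a list, and returns a list of strings.
soup.find_all(string='Tillie')  # string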
#['Tillie']
soup.find_all(string=True)  # True
#['\n', "The Dormouse's story", '\n', '\n', "The Dormouse's story", '\n', 'Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
soup.find_all(string=re.compile('D'))  # regular expression
#["The Dormouse's story", "The Dormouse's story"]
soup.find_all(string=['Tillie','Elsie','Lacie'])  # list
#['Elsie', 'Lacie', 'Tillie']
string can be combined with other arguments to filter tags:
soup.find_all('a', string='Tillie')  # combined with a tag name; the positional name argument comes before string
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(string='Tillie',id='link3')
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
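The limit argument
find_all() returns every result of the search. Use the limit argument to cap the number of results returned; the search stops as soon as limit results have been found.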
soup.find_all(class_='sister',limit=2)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(class_='sister',limit=1)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
The recursive argument: pass recursive=False to search only a tag's direct children
soup.find_all('title')
#[<title>The Dormouse's story</title>]
soup.find_all('title',recursive=False)
#[]
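The empty result comes from where the search starts: <title> is a grandchild of <html>, not a direct child. A minimal sketch on a shortened document, assuming the built-in html.parser, shows the same call succeeding once it starts from <head>. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser',
)

# Default search looks at all descendants of <html>
deep = soup.html.find_all('title')

# <title> is not a direct child of <html>, so nothing is found
shallow_from_html = soup.html.find_all('title', recursive=False)

# <title> is a direct child of <head>, so it is found
shallow_from_head = soup.head.find_all('title', recursive=False)

print(len(deep), len(shallow_from_html), len(shallow_from_head))
```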
Calling a tag is like calling find_all()
That is, soup.find_all(args) is equivalent to soup(args):
soup('title')
#[<title>The Dormouse's story</title>]
soup('a',id='link1')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
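Differences between find() and find_all()
find_all() returns a list of results, while find() returns the first result directly.
When nothing matches, find_all() returns an empty list, while find() returns None.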
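In practice, the None return value means a find() result should be checked before it is used. The following is a small sketch on a made-up one-link fragment, assuming the built-in html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1" href="http://example.com/elsie">Elsie</a>',
    'html.parser',
)

# No <table> in the document: empty list versus None
missing_list = soup.find_all('table')
missing_one = soup.find('table')

# Guard against None before indexing into the result
link = soup.find('a')
href = link['href'] if link is not None else None

print(missing_list, missing_one, href)
```

Looping over a find_all() result is safe even when it is empty, whereas indexing or attribute access on a None from find() raises an error.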
----------
These notes cover only the BeautifulSoup part.