
A First Look at BeautifulSoup

程序员文章站 2022-05-02 17:39:34
...

Tag

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')
tag = soup.p    # soup.p is the <p> tag from the markup parsed on the previous line
type(tag)
#<class 'bs4.element.Tag'>

The Tag's name attribute

tag.name
#'p'
tag
#<p class="title"><b>The Dormouse's story</b></p>

tag.name = 'test'   # the name can be changed
tag
#<test class="title"><b>The Dormouse's story</b></test>

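The Tag's attributes
In the tag <p class='title'><b>The Dormouse's story</b></p>, class is an attribute of the tag, and a tag can have any number of attributes. Here the class attribute has the value title. Attributes are read and written the same way as dictionary entries.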

tag['class']
#['title']

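A tag's attributes can be added, removed, and modified, just like dictionary entries.

tag['class'] = 'test1'      # modify
tag['id'] = 11              # add
tag
#<test class="test1" id="11"><b>The Dormouse's story</b></test>
del tag['class']            # delete
del tag['id']
tag
#<test><b>The Dormouse's story</b></test>
print(tag.get('class', 404))   # .get() returns the default when the attribute is missing
#404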

The .attrs property also returns a tag's attributes as a dictionary:

tag['class'] = 'test1'
tag['id'] = 11
tag.attrs
#{'class': 'test1', 'id': 11}
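The dictionary-style access described above can be sketched as follows. This is a minimal illustration, not part of the original notes: the markup and the variable names (`lang`, `has_id`, `attr_names`) are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for this illustration.
soup = BeautifulSoup("<p class='title' id='intro'><b>The Dormouse's story</b></p>", "html.parser")
tag = soup.p

# tag['lang'] would raise KeyError because the attribute is missing;
# .get() returns a default instead, exactly like dict.get().
lang = tag.get("lang", "unknown")

# has_attr() checks for an attribute without reading its value.
has_id = tag.has_attr("id")

# .attrs is a dictionary, so its keys can be listed and sorted.
attr_names = sorted(tag.attrs)

print(lang, has_id, attr_names)
```

Using .get() is the safer choice when scraping pages where an attribute may or may not be present.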

Multi-valued attributes of a Tag (not fully understood yet; to be continued)
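As a starting point for this topic, here is a small sketch of what "multi-valued" means in practice. It assumes an HTML parser (html.parser here) and uses made-up markup: class is treated as a list of values, while an attribute such as id is kept as a single string even if it contains a space.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
css_soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = css_soup.p

# class can hold several values, so it is returned as a list of class names.
classes = p["class"]

# id is not a multi-valued attribute, so the space stays inside one string.
tag_id = p["id"]

print(classes, tag_id)
```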

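The BeautifulSoup object
The BeautifulSoup object (usually named soup) represents the whole document. Most of the time it can be treated as a Tag object, but because it does not correspond to a real HTML tag it has no attributes, and its .name is the special value '[document]'.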
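A quick check of that difference, as a sketch with made-up markup: the top-level object reports the placeholder name, while a real tag reports its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup("<p class='title'>hi</p>", "html.parser")

# The document object is not a real tag, so its name is a placeholder.
doc_name = soup.name

# A real tag reports the tag name from the markup.
tag_name = soup.p.name

print(doc_name, tag_name)
```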

*Navigating the tree*
Example document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Tag names
The simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just use soup.head:

soup.head
#<head><title>The Dormouse's story</title></head>
soup.title
#<title>The Dormouse's story</title>

The dot syntax can be chained to reach the tag you want:

soup.head.title
#<title>The Dormouse's story</title>
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.p.b
#<b>The Dormouse's story</b>
soup.body.p.b
#<b>The Dormouse's story</b>

Accessing a tag name as an attribute returns only the first Tag with that name:

soup.a
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all tags with a given name, use a method such as find_all():

soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
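A common follow-up is to pull one attribute out of every match. The sketch below uses a shortened, made-up version of the example document and contrasts soup.a (first match only) with find_all('a') (all matches).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example document.
html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access stops at the first <a> tag.
first = soup.a

# find_all() returns every <a> tag in document order.
all_links = soup.find_all("a")

# Each result is a Tag, so its attributes are read like dictionary entries.
hrefs = [a["href"] for a in all_links]

print(first["id"], hrefs)
```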

.contents and .children (direct children only)
A tag's .contents attribute returns its children as a list:

soup.head
#<head><title>The Dormouse's story</title></head>
soup.head.contents
#[<title>The Dormouse's story</title>]  (<title> is the direct child of <head>)
soup.head.contents[0].contents
#["The Dormouse's story"]

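A tag's .children generator lets you loop over its direct children:

head_tag = soup.head
for child in head_tag.children:
    print(child)
#<title>The Dormouse's story</title>
# without a loop, .children is just an iterator object
head_tag.children
#<list_iterator object at 0x000000000349AF28>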

.descendants iterates recursively over all of a tag's descendants (children, grandchildren, and so on):

for child in head_tag.descendants:
    print(child)
#<title>The Dormouse's story</title>
#The Dormouse's story
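To make the difference between direct children and all descendants concrete, here is a small sketch on a made-up one-line document: <head> has one direct child (the <title> tag) but two descendants (the <title> tag and the string inside it).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .children yields only direct children: the <title> tag.
direct = list(head_tag.children)

# .descendants also walks into <title> and yields its string.
all_nodes = list(head_tag.descendants)

print(len(direct), len(all_nodes))
```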

.string
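If a tag has exactly one child and that child is a NavigableString, .string returns it. If the only child is another tag that itself has a .string, the parent reports the same .string, which is why soup.head works here:

soup.head.string
#"The Dormouse's story"

If a tag contains more than one child, it is unclear what .string should refer to, so .string is None.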
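The three cases can be checked side by side. This is a sketch on made-up markup; get_text() is shown only as one possible fallback when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has a single child tag, <div> has two.
soup = BeautifulSoup("<p><b>one</b></p><div><i>a</i><i>b</i></div>", "html.parser")

# <b> has a single string child, so .string returns it.
b_string = soup.b.string

# <p> has a single child tag (<b>), so it reports that tag's .string.
p_string = soup.p.string

# <div> has two children, so .string is None.
div_string = soup.div.string

# get_text() concatenates all strings under the tag.
div_text = soup.div.get_text()

print(b_string, p_string, div_string, div_text)
```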

.strings and .stripped_strings
If a tag contains more than one string, use the .strings generator to loop over them:

for string in soup.strings:
    print(string)
#
#   
#
#
#The Dormouse's story
#
#
#
#
#The Dormouse's story
#
#
#Once upon a time there were three little sisters; and their names were
#
#Elsie
#,
#
#Lacie
# and
#
#Tillie
#;
#and they lived at the bottom of a well.
#
#
#...

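The strings may include a lot of extra whitespace and blank lines. Use .stripped_strings to remove it: strings consisting entirely of whitespace are skipped, and leading and trailing whitespace is stripped from the rest: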

for string in soup.stripped_strings:
    print(string)
#The Dormouse's story
#The Dormouse's story
#Once upon a time there were three little sisters; and their names were
#Elsie
#,
#Lacie
#and
#Tillie
#;
#and they lived at the bottom of a well.
#... 
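The effect of stripping is easier to see on a small input. The sketch below uses a made-up paragraph and collects both generators into lists so they can be compared.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with indentation and line breaks between nodes.
html = "<p>\n  Once upon a time\n  <a>Elsie</a>,\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings keeps every string exactly as parsed, whitespace included.
raw = list(soup.strings)

# .stripped_strings trims each string and drops whitespace-only ones.
clean = list(soup.stripped_strings)

print(len(raw), clean)
```

Joining the cleaned strings with a single space (" ".join(clean)) is one simple way to get readable text from a tag.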

.parent
The .parent attribute returns an element's parent. In the example document, the <head> tag is the parent of the <title> tag:

soup.title.parent
#<head><title>The Dormouse's story</title></head>

The title's string also has a parent, the <title> tag that contains it:

soup.title.string
#"The Dormouse's story"
soup.title.string.parent
#<title>The Dormouse's story</title>

.parents
The .parents attribute iterates over all of an element's ancestors, from its parent up to the top of the document:

for parent in soup.title.string.parents:
    print(parent.name)
#title
#head
#html
#[document]
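The same walk can be collected into a list, which is handy for building a path to a node. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)

# Walk from the title's string up to the document object.
names = [parent.name for parent in soup.title.string.parents]

# Reverse the list to read the path from the top of the document down.
path = " > ".join(reversed(names))

print(path)
```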

*Searching the tree*
The examples use the same document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

Filters for find_all(), which returns a list
① String: the simplest filter
Pass a string to a search method and BeautifulSoup matches tags whose name is exactly that string:

soup.find_all('p')
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <p class="story">...</p>]

soup.find_all('b')
#[<b>The Dormouse's story</b>]

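② Regular expression
Pass a compiled regular expression to find_all() and BeautifulSoup filters with its search() method, so an unanchored pattern matches anywhere in the tag name.

import re
for tag in soup.find_all(re.compile('^b')):  # all tags whose names start with b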
    print(tag.name)
#body
#b

for tag in soup.find_all(re.compile('t')):    # all tags whose names contain t
    print(tag.name)
#html
#title
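The difference between an anchored and an unanchored pattern can be checked on a small document. This is a sketch with made-up markup; the pattern '^t' is an extra case added for comparison.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document for illustration only.
soup = BeautifulSoup(
    "<html><head><title>t</title></head><body><b>x</b></body></html>", "html.parser"
)

# Unanchored: any tag name containing "t" (html, title).
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored with ^: only tag names that start with "t" (title).
starts_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

# Anchored with ^: tag names that start with "b" (body, b).
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_t, starts_with_t, starts_with_b)
```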

③ List
Pass a list to find_all() and BeautifulSoup returns everything that matches any item in the list.

soup.find_all(['body','b'])
#[<body>
#<p class="title"><b>The Dormouse's story</b></p>
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#<p class="story">...</p>
#</body>, <b>The Dormouse's story</b>]

④ True
True matches any value, so it can be used to find every tag, but it never returns text strings.

for tag in soup.find_all(True):
    print(tag.name)
#html
#head
#title
#body
#p
#b
#p
#a
#a
#a
#p

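⑤ Function
find_all() also accepts a function that takes a single element as its only argument. The function should return True if the element matches and False otherwise.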
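The notes give no example for this filter, so here is a minimal sketch. The function name has_class_but_no_id and the markup are illustrative choices: the filter keeps tags that define a class attribute but no id attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    '<p class="title">T</p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
    "<p>plain</p>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that have a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_class_but_no_id)
match_classes = [tag["class"] for tag in matches]

print(match_classes)
```

The <a> tag is rejected because it has an id, and the last <p> is rejected because it has no class.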

*find_all()*
find_all(name, attrs, recursive, string, limit, **kwargs)
① The name parameter
The name parameter finds every tag whose name matches it; text strings are ignored automatically.

soup.find_all('head')
#[<head><title>The Dormouse's story</title></head>]

soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

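② Keyword arguments
If find_all() receives a keyword argument whose name is not one of its built-in parameter names, it treats that argument as a filter on the tag attribute of the same name.
The value of such an attribute filter can be a string, a regular expression, a list, or True.

soup.find_all(id='link1')   # string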
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile("elsie"))  # regular expression
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id=True)  # True: any tag that has an id attribute
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Several keyword arguments can be combined to filter on multiple attributes at once:

soup.find_all(id='link1',href=re.compile("elsie"))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

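Searching by CSS class
Because class is a reserved word in Python, use the class_ parameter to find tags with a given CSS class.
class_ accepts the same kinds of filter: a string, a regular expression, True, or a function.

soup.find_all(class_='title') # string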
#[<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all(class_=re.compile('itl'))  # regular expression
#[<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all(class_=True)  # True
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

When a tag's class attribute has multiple values, searching by CSS class matches against each class name individually:

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
css_soup.find_all(class_='body')
#[<p class="body strikeout"></p>]

css_soup.find_all('p',class_='strikeout')
#[<p class="body strikeout"></p>]

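You can also search for the exact string value of the class attribute, but the class names must appear in the same order as in the document; a different order finds nothing:

css_soup.find_all(class_='body strikeout')
#[<p class="body strikeout"></p>]
css_soup.find_all(class_='strikeout body')
#[]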
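When the order of the class names should not matter, one option is a function filter that compares sets. This is a sketch under that assumption, not something from the original notes: the helper name has_both_classes is made up, and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string match on the class attribute is order-sensitive.
reordered = css_soup.find_all(class_="strikeout body")


def has_both_classes(tag):
    """Return True if the tag carries both classes, in any order."""
    wanted = {"body", "strikeout"}
    # tag.get('class', []) is a list of class names, or [] if there is no class.
    return wanted <= set(tag.get("class", []))


# The function filter ignores the order in which the classes were written.
both = css_soup.find_all(has_both_classes)

print(len(reordered), len(both))
```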

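The string parameter
The string parameter searches the text strings in the document instead of tags. It accepts a string, a regular expression, a list, a function, or True, and returns a list of strings:

soup.find_all(string='Tillie')   # string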
#['Tillie']
soup.find_all(string=True)       # True: every string in the document
#['\n', "The Dormouse's story", '\n', '\n', "The Dormouse's story", '\n', 'Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']
soup.find_all(string=re.compile('D'))  # regular expression
#["The Dormouse's story", "The Dormouse's story"]
soup.find_all(string=['Tillie','Elsie','Lacie'])  # list
#['Elsie', 'Lacie', 'Tillie']

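string can also be combined with other arguments, in which case it filters tags by their text:

soup.find_all('a',string='Tillie')    # combined with a tag name: the positional name argument comes before string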
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(string='Tillie',id='link3')
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The limit parameter
find_all() returns every match. Use the limit parameter to cap the number of results; the search stops as soon as limit matches have been found.

soup.find_all(class_='sister',limit=2)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(class_='sister',limit=1)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

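The recursive parameter: by default find_all() searches all of a tag's descendants. Pass recursive=False to search only its direct children.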

soup.find_all('title')
#[<title>The Dormouse's story</title>]
soup.find_all('title',recursive=False)
#[]
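Whether recursive=False finds anything depends on which tag the search starts from. The sketch below, on a made-up fragment, runs the same search from <html> and from <head>.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)

# Default search: <title> is a descendant of <html>, so it is found.
deep = soup.html.find_all("title")

# Direct children only: <title> is a grandchild of <html>, so nothing matches.
shallow_from_html = soup.html.find_all("title", recursive=False)

# <title> is a direct child of <head>, so it is found from there.
shallow_from_head = soup.head.find_all("title", recursive=False)

print(len(deep), len(shallow_from_html), len(shallow_from_head))
```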

Calling a tag is the same as calling find_all()
soup.find_all(arguments) is equivalent to soup(arguments):

soup('title')
#[<title>The Dormouse's story</title>]
soup('a',id='link1')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

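Differences between find() and find_all()
find_all() returns a list of results, whereas find() returns the first result directly.
When nothing matches, find_all() returns an empty list and find() returns None.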
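The practical consequence is that a find() result should be checked for None before use. A minimal sketch with made-up markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
soup = BeautifulSoup('<p><a id="link1">Elsie</a></p>', "html.parser")

# No <table> in the document: empty list versus None.
no_tables = soup.find_all("table")
no_table = soup.find("table")

# find() behaves like find_all(..., limit=1) with the single result unwrapped.
first_link = soup.find("a")
same_link = soup.find_all("a", limit=1)[0]

# Guard against None before touching attributes or text.
table_text = no_table.get_text() if no_table is not None else ""

print(no_tables, no_table, first_link["id"], repr(table_text))
```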

This covers only part of BeautifulSoup.
