BeautifulSoup0929

程序员文章站 (Programmer Article Site) 2022-06-08 10:43:13


Supplement:
Creating a Beautiful Soup object:
soup = BeautifulSoup(html, 'html.parser')

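1. Basic elements of the Beautiful Soup classes (5):

Example: <p class="title">...</p>
<p>...</p>: the tag (Tag)
'p': the tag's name (name)
class="title": the attributes (Attributes), usually given as key-value pairs
The non-attribute string inside the tag: NavigableString
The comment part of the string inside the tag: Comment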
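
The five elements above can be inspected on a small document. This is a minimal sketch; the sample HTML and the variable names are assumptions made for illustration, not taken from the original notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup: one tag with an attribute, a string and a comment.
html = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                    # Tag: the <p>...</p> element
tag_name = p.name             # name: 'p'
tag_attrs = p.attrs           # Attributes: a dict; class is multi-valued, so its value is a list
text_node = p.b.string        # NavigableString: the text inside <b>
comment_node = p.contents[1]  # Comment: the second child of <p>

print(type(p).__name__, tag_name, tag_attrs)
print(type(text_node).__name__, type(comment_node).__name__)
```

Comment is a subclass of NavigableString, so an isinstance check against Comment is the reliable way to tell a comment from ordinary text.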

The BeautifulSoup library:
Import: from bs4 import BeautifulSoup
Create a Beautiful Soup object: soup = BeautifulSoup(html, 'html.parser')
Tag attributes:
[.name] print(soup.a.parent.name)  # name of the tag one level above the a tag
[.attrs] print(soup.a.attrs)  # attributes of the a tag

print(soup.p['class'])  # read
# ['title']

soup.p['class'] = "newClass"  # modify
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # delete
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
print(type(soup.a.attrs))  # .attrs is a dict
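
The read, modify and delete steps above can be run end to end as follows. This is a sketch under the assumption that the document contains the single p tag implied by the printed output; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, reconstructed from the output shown in the notes.
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

original_class = soup.p["class"]  # read: class is multi-valued, so a list comes back
soup.p["class"] = "newClass"      # modify
after_modify = str(soup.p)
del soup.p["class"]               # delete
after_delete = str(soup.p)
missing = soup.p.get("class")     # .get() returns None instead of raising KeyError

print(original_class)
print(after_modify)
print(after_delete)
print(missing, type(soup.p.attrs).__name__)
```

Using .get() is the safer choice when an attribute may be absent, because soup.p['class'] raises KeyError once the attribute has been deleted.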

[NavigableString] print(soup.a.string)  # get the non-attribute string inside the tag
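
One detail worth knowing about .string: it only yields a value when the tag has a single child to take the string from. A small sketch, with sample HTML assumed for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <a> has one string child, <p> has two children.
soup = BeautifulSoup('<p><a href="x">link</a> and text</p>', "html.parser")

a_string = soup.a.string     # the single string inside <a>
p_string = soup.p.string     # None, because <p> contains more than one child
p_text = soup.p.get_text()   # concatenates every string below <p>

print(repr(a_string), repr(p_string), repr(p_text))
```

When a tag mixes text and nested tags, get_text() is usually the more convenient call.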

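2. Traversal:
.contents: a list of the tag's direct children (children include non-tag text nodes, i.e. NavigableString)
.children: an iterator over the direct children
.descendants: an iterator over all descendants (children, grandchildren, and so on)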

Usage:
for child in soup.body.children:
    print(child)  # iterate over the direct children
for child in soup.body.descendants:
    print(child)  # iterate over all descendants
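
To make the difference between the three downward accessors concrete, here is a minimal sketch on a small document. The sample HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup: body has two <p> children, one of which nests a <b>.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

contents = body.contents                               # a list of direct children
child_names = [child.name for child in body.children]  # same nodes, as an iterator
descendants = list(body.descendants)                   # every node below body, depth first

descendant_tags = [n.name for n in descendants if isinstance(n, Tag)]
descendant_strings = [str(n) for n in descendants if isinstance(n, NavigableString)]

print(len(contents), child_names)
print(descendant_tags, descendant_strings)
```

Here .contents and .children both see only the two p tags, while .descendants also reaches the nested b tag and every text node.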

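.parent: the parent node
.next_sibling: the next sibling node (the next node under the same parent)
.previous_sibling: the previous sibling node
.next_siblings: an iterator over all following sibling nodes
.previous_siblings: an iterator over all preceding sibling nodes
Note: a sibling may be a NavigableString (text between tags), not necessarily a tag.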

for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over all following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over all preceding siblings
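
The sibling accessors can be tried on three links inside one paragraph. This is a sketch with assumed sample HTML; the ids link1 to link3 and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three <a> tags separated by plain text.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
parent_name = first.parent.name   # the enclosing tag
next_node = first.next_sibling    # the text ", " and not the next <a>

# Keep only tags when the text nodes between them are not wanted.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

last = soup.find(id="link3")
preceding_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]

print(parent_name, repr(next_node))
print(following_ids, preceding_ids)
```

Note that .previous_siblings walks backwards, so the nearest sibling comes first.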

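soup.prettify() returns the document as an indented string, with each tag and each string on its own line, which makes the markup easier to read.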
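
A short sketch of prettify() on a one-line document follows. The sample HTML is an assumption, and the exact indentation width may depend on the installed Beautiful Soup version, so no literal output is shown.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup to be reformatted.
soup = BeautifulSoup("<p><b>text</b></p>", "html.parser")

pretty = soup.prettify()  # a str; nested nodes are placed on separate, indented lines
pretty_lines = [line.strip() for line in pretty.splitlines() if line.strip()]

print(pretty)
```

Because prettify() adds whitespace around strings, it is best used for inspection and not for producing markup whose whitespace matters.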

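Example: extract all URL links from an HTML document
Approach: 1) find the tags that hold the links
2) parse each tag and extract the link from its href attribute

from bs4 import BeautifulSoup

# demo is assumed to hold the HTML text to parse
soup = BeautifulSoup(demo, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))  # None if the tag has no href attribute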
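
For a version of the link extraction that runs on its own, the sketch below supplies a small demo string. The HTML, the example.com URLs and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two links with an href and one anchor without.
demo = """<html><body>
<p>Courses: <a href="http://www.example.com/course1" id="link1">Basic Python</a> and
<a href="http://www.example.com/course2" id="link2">Advanced Python</a>,
<a id="nolink">no target</a></p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Step 1: find every <a> tag. Step 2: read its href attribute.
all_hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True keeps only the tags that actually carry an href attribute.
real_hrefs = [link["href"] for link in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```

Filtering with href=True avoids None entries in the result when a page contains anchors that are not links.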

3. Searching HTML content:
soup.find_all(['a', 'b'])  # find all a and b tags
soup.find_all('p', 'course')  # return p tags whose class attribute includes 'course'

import re
soup.find_all(id=re.compile('link'))  # return tags whose id contains 'link'
soup.find_all(string=re.compile("python"))  # return all strings that contain "python"
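
The four find_all calls above can be checked against a small document. This is a minimal sketch; the sample HTML, the example.com URLs and the variable names are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page with a title paragraph and a course paragraph holding two links.
html = """<html><body>
<p class="title"><b>Demo page</b></p>
<p class="course">Learn python here:
<a href="http://www.example.com/1" id="link1">Basic python</a> and
<a href="http://www.example.com/2" id="link2">Advanced Python</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

a_and_b = soup.find_all(["a", "b"])                   # a list of names matches any of them
course_ps = soup.find_all("p", "course")              # second positional argument matches the CSS class
by_id = soup.find_all(id=re.compile("link"))          # regex is searched within the id value
strings = soup.find_all(string=re.compile("python"))  # returns strings, not tags; case-sensitive

print([tag.name for tag in a_and_b])
print(len(course_ps), [tag["id"] for tag in by_id])
print([str(s).strip() for s in strings])
```

The string search is case-sensitive, so "Advanced Python" is not returned here; passing re.compile("python", re.I) would include it.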

*References:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
https://www.cnblogs.com/wuwenyan/p/4773427.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#