BeautifulSoup0929
Supplement:
Creating a Beautiful Soup object:
soup = BeautifulSoup(html, 'html.parser')
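1. Basic elements of the Beautiful Soup classes (5):
Example: <p class="title">...</p>
<p>...</p>: the tag (Tag)
'p': the tag's name (name)
class="title": the attributes (Attributes), usually given as key-value pairs
The non-attribute string inside the tag (NavigableString)
The comment part of a string inside the tag (Comment)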
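A minimal sketch of the five element types listed above. The sample markup is made up for illustration and is not from any particular page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, used only for illustration.
html = '<p class="title"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                # the Tag object for <p>...</p>
name = tag.name             # the tag's name
attrs = tag.attrs           # the attributes, as a dict
text = tag.b.string         # the NavigableString inside <b>
comment = tag.contents[1]   # the Comment node that follows <b>

print(type(tag).__name__, name, attrs)
print(type(text).__name__, repr(str(text)))
print(type(comment).__name__, repr(str(comment)))
```

Comment is a subclass of NavigableString, so check for Comment first when you need to tell the two apart.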
The BeautifulSoup library:
Import: from bs4 import BeautifulSoup
Creating a Beautiful Soup object: soup = BeautifulSoup(html, 'html.parser')
Tag properties:
[.name] print(soup.a.parent.name)  # name of the tag one level above the a tag
[.attrs] print(soup.a.attrs)  # attributes of the a tag
print(soup.p['class'])  # read
# ['title']

soup.p['class'] = "newClass"  # modify
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # delete
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
type(soup.a.attrs)  # a dictionary (dict)
[NavigableString] print(soup.a.string)  # get the non-attribute string inside the tag
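The attribute operations above can be tried on a one-line document. This is a sketch that assumes the same <p> markup shown in the output comments. The .get() call and the mixed-content case are additions for illustration.

```python
from bs4 import BeautifulSoup

# Markup assumed to match the <p> tag in the output shown above.
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

original_class = p["class"]   # read: class is multi-valued, so this is a list
inner_text = p.string         # <p> has a single child, so .string reaches into <b>

p["class"] = "newClass"       # modify
after_modify = str(p)

del p["class"]                # delete
after_delete = str(p)
missing = p.get("class")      # .get() returns None instead of raising KeyError

# With more than one child, .string is None.
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser").p.string

print(original_class, after_modify, after_delete, missing, mixed)
```

Use tag.get('attr') rather than tag['attr'] when the attribute may be absent, since indexing a missing attribute raises KeyError.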
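2. Traversal:
.contents: a list of the child nodes (children include non-tag text nodes, i.e. NavigableString)
.children: an iterator over the child nodes
.descendants: an iterator over all descendant nodes
Usage:
for child in soup.body.children:
    print(child)  # iterate over the direct children
for child in soup.body.descendants:
    print(child)  # iterate over all descendants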
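To show how .contents, .children and .descendants differ, here is a small sketch on made-up markup. It assumes nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

contents = body.contents                               # a list of direct children
child_names = [child.name for child in body.children]  # both children are <p> tags
descendant_types = [type(node).__name__ for node in body.descendants]

print(len(contents))
print(child_names)
print(descendant_types)
```

.descendants walks the tree depth-first, so text nodes and nested tags appear in document order, while .children stops at the first level.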
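.parent: the parent node
.next_sibling: the next sibling node
.previous_sibling: the previous sibling node
.next_siblings: an iterator over all following sibling nodes
.previous_siblings: an iterator over all preceding sibling nodes
Note: a sibling may be a NavigableString (the text between two tags), not only a tag.
for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over all following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over all preceding siblings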
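A short sketch of upward and sideways traversal. The markup and the id values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<p><a id="link1">A</a>, <a id="link2">B</a> and <a id="link3">C</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                      # the first <a> tag
parent_name = first.parent.name     # name of the enclosing tag
next_node = first.next_sibling      # the text ", " and not the next <a>
later = [str(s) for s in first.next_siblings]

last = soup.find("a", id="link3")
earlier = [str(s) for s in last.previous_siblings]  # nearest sibling first

print(parent_name, repr(str(next_node)))
print(later)
print(earlier)
```

Because text nodes count as siblings, filter with isinstance(node, Tag) (from bs4.element) when only tags are wanted.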
soup.prettify() returns the document as an indented string with one tag per line, which makes the markup easier to read
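A quick way to see what prettify() does is to compare it with the plain string form. The markup here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><body><p><b>hi</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # the markup on a single line
pretty = soup.prettify()  # the same markup, indented and split across lines

print(compact)
print(pretty)
```

prettify() adds whitespace, so use it for viewing and not for producing markup where whitespace matters.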
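Example: extract all URL links from an HTML document
Approach: 1) find the tags that hold the links
2) parse each tag and extract the value of its href attribute
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')  # demo holds the HTML text
for link in soup.find_all('a'):
    print(link.get('href'))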
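A self-contained version of the link extraction, as a sketch. The demo string and the example.com URLs are placeholders, and href=True is added here to skip <a> tags that have no href at all.

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the page text; the URLs are made up.
demo = """
<p><a href="http://example.com/one">one</a>
<a href="http://example.com/two">two</a>
<a name="anchor-only">no link</a></p>
"""
soup = BeautifulSoup(demo, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```

Without href=True, link.get('href') would return None for the third tag.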
3. Searching the HTML content:
soup.find_all(['a', 'b'])  # find all a and b tags
soup.find_all('p', 'course')  # p tags whose CSS class is "course"
import re
soup.find_all(id=re.compile('link'))  # tags whose id contains "link"
soup.find_all(string=re.compile("python"))  # all strings whose text contains "python"
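The four find_all() calls above can be checked against a small document. This is a sketch on invented markup. The class, id and text values are assumptions chosen so that each query matches something.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<p class="course">Learn python basics</p>
<p class="title"><b>Intro</b></p>
<a id="link1" href="#">first</a>
<a id="link2" href="#">advanced python</a>
"""
soup = BeautifulSoup(html, "html.parser")

a_and_b = soup.find_all(["a", "b"])                   # every <a> and <b> tag
course_p = soup.find_all("p", "course")               # <p> tags with class "course"
by_id = soup.find_all(id=re.compile("link"))          # tags whose id contains "link"
strings = soup.find_all(string=re.compile("python"))  # matching text nodes

print([t.name for t in a_and_b])
print(course_p)
print([t["id"] for t in by_id])
print([str(s) for s in strings])
```

A search with string= returns NavigableString objects, not tags. Use .parent on a result to reach the enclosing tag.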
*References:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
https://www.cnblogs.com/wuwenyan/p/4773427.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#