Introduction to Python Web Scraping (Part 3: Parsing Complex HTML)

程序员文章站 2024-01-12 19:54:04


I. Using Tag Names and Attributes

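import requests
from bs4 import BeautifulSoup

url = "http://www.runoob.com/html/html-intro.html"
r = requests.get(url)
html = r.content.decode("utf-8")   # decode the raw bytes as UTF-8 instead of relying on the guessed r.encoding
soup = BeautifulSoup(html, "lxml")

# 1. Search by tag name
soup.find_all(["h1", "h2", "h3"])                   # find all h1, h2 and h3 tags
len(soup.body.find_all("div", recursive=False))     # recursive=False: only div elements that are direct children of body
                                                    # recursive=True (the default): every div anywhere inside body
# 2. Search by attribute

# 2.1 Find the div elements whose class is "article" or "container navigation"
divs = soup.find_all("div", attrs={"class": ["article", "container navigation"]})

# 2.2 Find all h2 tags inside the article div
divs[1].find_all("h2")  # assumes the article div is the second match on this page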
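
To try these calls without fetching the page, the sketch below runs them against a small made-up HTML fragment. The markup and every variable name are assumptions for illustration, not taken from the runoob page, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the tutorial page.
SAMPLE_HTML = """
<body>
  <div class="container navigation"><h2>Menu</h2></div>
  <div class="article">
    <h1>Title</h1>
    <h2>First</h2>
    <div class="note"><h2>Nested</h2></div>
  </div>
</body>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of names matches any of the listed tags, in document order.
headings = soup.find_all(["h1", "h2"])

# recursive=False looks only at direct children of body.
top_level_divs = soup.body.find_all("div", recursive=False)
all_divs = soup.body.find_all("div")

# A class filter matches a single class name or the full attribute string.
matched = soup.find_all("div", attrs={"class": ["article", "container navigation"]})

# Selecting the article div by class avoids depending on its position.
article = soup.find("div", class_="article")
article_h2 = [h.get_text() for h in article.find_all("h2")]

print(len(headings), len(top_level_divs), len(all_divs), len(matched), article_h2)
```

Looking the article div up by its class, rather than by an index such as divs[1], keeps the code working if the page gains or loses a matching div.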

II. Using Text and Keyword Arguments

# 1. Search by text
import re

import requests
from bs4 import BeautifulSoup

url = "http://www.runoob.com/html/html-intro.html"
r = requests.get(url)
html = r.content.decode("utf-8")   # decode the raw bytes as UTF-8 instead of relying on the guessed r.encoding
soup = BeautifulSoup(html, "lxml")

# Find every element (Tag object) whose text is exactly "HTML 标签"
soup.find_all(True, string="HTML 标签")   # True matches any tag name

# Find every h1, h2, h3 or h4 element (Tag object) whose text starts with "HTML"
soup.find_all(["h1", "h2", "h3", "h4"], string=re.compile("^HTML"))


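# 2. Search by keyword argument
soup.find_all(class_=["article", "container navigation"])
# class is a reserved word in Python, so the attribute filter is spelled class_ (with a trailing underscore)
# Any other keyword argument (**kwargs) is treated as an attribute filter as well, for example id:
soup.find_all("div", id="footer")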
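
The text and keyword filters can be checked offline in the same way. This is a minimal sketch on an invented fragment; the markup, ids and variable names are assumptions, and html.parser is used in place of lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div id="content" class="article">
  <h1>HTML Intro</h1>
  <h2>HTML Tags</h2>
  <h2>About html</h2>
  <p>HTML Tags</p>
</div>
<div id="footer"><p>Contact</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# string="..." keeps only tags whose .string equals the text exactly.
exact = soup.find_all(True, string="HTML Tags")
exact_names = [tag.name for tag in exact]

# A compiled pattern is applied to the tag's text; ^ anchors it at the start.
prefixed = soup.find_all(["h1", "h2"], string=re.compile(r"^HTML"))
prefixed_text = [tag.get_text() for tag in prefixed]

# Keyword arguments become attribute filters; class needs the trailing underscore.
by_class = soup.find_all(class_="article")
by_id = soup.find_all("div", id="footer")

print(exact_names, prefixed_text, len(by_class), by_id[0].get_text(strip=True))
```

A tag that contains several children, such as the outer div here, has no single .string, so string filters never match it.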

III. Using Lambda Expressions

import requests
from bs4 import BeautifulSoup

url = "http://www.runoob.com/html/html-intro.html"
r = requests.get(url)
html = r.content.decode("utf-8")   # decode the raw bytes as UTF-8 instead of relying on the guessed r.encoding
soup = BeautifulSoup(html, "lxml")

# Method 1: pass a lambda as the filter
soup.find_all(lambda tag: tag.name == "h2" and len(tag.attrs) == 0)   # find all h2 tags that have no attributes

# Method 2: filter with a list comprehension
[x for x in soup.find_all("h2") if len(x.attrs) == 0]

# Method 3: filter with filter()
list(filter(lambda tag: len(tag.attrs) == 0, soup.find_all("h2")))
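
The three methods should return the same tags. Here is a small sketch that checks this on an invented fragment; the markup and names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two bare h2 tags and two that carry attributes.
SAMPLE_HTML = """
<h2>Plain</h2>
<h2 class="title">Styled</h2>
<h2 id="x">Anchored</h2>
<h2>Also plain</h2>
<p>Paragraph</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Method 1: the function is called once per tag and must return True or False.
by_lambda = soup.find_all(lambda tag: tag.name == "h2" and not tag.attrs)

# Method 2: select by name first, then filter in a list comprehension.
by_comprehension = [tag for tag in soup.find_all("h2") if not tag.attrs]

# Method 3: the same filter expressed with the built-in filter().
by_filter = list(filter(lambda tag: not tag.attrs, soup.find_all("h2")))

texts = [tag.get_text() for tag in by_lambda]
print(texts)
```

Method 1 does the filtering in a single pass over the tree, while methods 2 and 3 first collect every h2 and then discard the ones with attributes. For a page of this size the difference is negligible, so the choice is mostly a matter of readability.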

IV. Using Regular Expressions

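import re

import requests
from bs4 import BeautifulSoup

url = "http://www.runoob.com/html/html-intro.html"
r = requests.get(url)
html = r.content.decode("utf-8")   # decode the raw bytes as UTF-8 instead of relying on the guessed r.encoding
soup = BeautifulSoup(html, "lxml")

# 1. Find all heading tags h1 to h6 (HTML defines no h7 to h9)
soup.find_all(re.compile(r"^h[1-6]$"))
# 2. Find all heading tags h1 to h6 whose text contains "HTML" or "html"
soup.find_all(re.compile(r"^h[1-6]$"), string=re.compile(r"HTML|html"))
# 3. Find all links whose href starts with //www or https://www
#    (the alternation is grouped so that ^ applies to both alternatives)
soup.find_all("a", attrs={"href": re.compile(r"^(//www|https://www)")})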
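
Beautiful Soup applies a compiled pattern with search(), so anchoring and grouping matter. The sketch below uses only the re module to show how an ungrouped alternation lets the second alternative match anywhere in the string. The sample URLs and variable names are made up for illustration.

```python
import re

# Ungrouped alternation: this reads as "^//(www)" OR "(https://www).*",
# so the second alternative is not anchored to the start of the string.
loose = re.compile(r"^//(www)|(https://www).*")

# Grouped alternation: ^ now applies to both alternatives.
strict = re.compile(r"^(?://www|https://www)")

# Hypothetical href values.
hrefs = [
    "//www.example.com/a",
    "https://www.example.com/b",
    "/redirect?to=https://www.example.com",
    "http://example.com",
]

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

The loose pattern also accepts the redirect URL, because "https://www" appears in the middle of it; the strict pattern accepts only the first two.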

V. Navigating the Tree

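import requests
from bs4 import BeautifulSoup

url = "http://www.runoob.com/html/html-intro.html"
r = requests.get(url)
html = r.content.decode("utf-8")   # decode the raw bytes as UTF-8 instead of relying on the guessed r.encoding
soup = BeautifulSoup(html, "lxml")

len(list(soup.body.children))        # number of direct child nodes of body (tags and text strings)
len(list(soup.body.descendants))     # number of all descendant nodes of body (tags and text strings)
len(list(soup.body.find("div").next_siblings))  # number of sibling nodes that follow the first div
soup.body.find("div").parent.name            # name of the first div's parent tag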
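
Note that children, descendants and next_siblings yield text nodes as well as tags, so the counts are usually larger than the number of tags. The following sketch, on an invented fragment with assumed names, shows the difference and one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; the newlines between tags become text nodes.
SAMPLE_HTML = "<body>\n<div><p>one</p></div>\n<div>two</div>\n</body>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.body

# Direct children: "\n", div, "\n", div, "\n".
children = list(body.children)
tag_children = [node for node in children if isinstance(node, Tag)]

# Descendants walk the whole subtree, including the text inside each tag.
descendants = list(body.descendants)

# Siblings after the first div: "\n", div, "\n".
first_div = body.find("div")
following = list(first_div.next_siblings)
following_tags = [node.name for node in following if isinstance(node, Tag)]

parent_name = first_div.parent.name

print(len(children), len(tag_children), len(descendants),
      len(following), following_tags, parent_name)
```

Filtering with isinstance(node, Tag) is one simple way to count elements only; calling find_all with recursive=False is another option when only direct child tags are needed.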
