Learning Python Web Scraping, Chapter 3: Data Parsing

程序员文章站 2022-05-07 23:09:22


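1. Using data parsing

Focused crawler: crawls only the specified content of a page rather than the whole page.
Focused crawler coding workflow:
- Specify the URL
- Send the request
- Get the response data
- Parse the data
- Store the data persistently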
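
The five steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the author's code: the function names (`fetch`, `parse_titles`, `save`, `crawl`), the sample HTML, and the `example.com` URL are all hypothetical, and the network request is replaced by a stub so the flow can be run offline.

```python
import re
import tempfile
from pathlib import Path


def fetch(url):
    """Steps 2-3: send the request and return the response text.

    Stubbed with fixed HTML (an assumption for illustration); a real crawler
    would return requests.get(url, headers=...).text here.
    """
    return '<ul><li class="t">first</li><li class="t">second</li></ul>'


def parse_titles(page_text):
    """Step 4: data parsing, here with a simple regular expression."""
    return re.findall(r'<li class="t">(.*?)</li>', page_text)


def save(items, path):
    """Step 5: persistent storage, one item per line."""
    Path(path).write_text("\n".join(items) + "\n", encoding="utf-8")


def crawl(url, path):
    page_text = fetch(url)            # steps 2-3
    items = parse_titles(page_text)   # step 4
    save(items, path)                 # step 5
    return items


# Step 1: specify the URL (hypothetical placeholder).
out_file = Path(tempfile.gettempdir()) / "focused_crawler_demo.txt"
result = crawl("https://example.com/list", out_file)
print(result)
```

Keeping the parsing step in its own function means the regular expression can later be swapped for bs4 or xpath without touching the request or storage code.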

2. Types of data parsing

Regular expressions

Task: download every image in the 糗图 (funny pictures) section of 糗事百科 (Qiushibaike).

Code:

import os
import re

import requests

if __name__ == '__main__':
    url = 'https://www.qiushibaike.com/imgrank/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    # Create the output directory if it does not exist yet
    os.makedirs('./qiutu_img', exist_ok=True)
    # General crawl: fetch the whole page behind the URL
    page_text = requests.get(url=url, headers=headers).text
    # Focused crawl: extract every image address from the page.
    # re.S lets "." match newlines, because each <div> spans several lines.
    pattern = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_list = re.findall(pattern, page_text, re.S)
    for src in img_list:
        # The captured src has no scheme, so prepend "https:"
        img_url = 'https:' + src
        img_data = requests.get(url=img_url, headers=headers).content  # binary image data
        img_name = src.split('/')[-1]
        filename = './qiutu_img/' + img_name
        with open(filename, 'wb') as fp:
            fp.write(img_data)
        print(filename + '-' * 10 + 'download complete')
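
To see why the pattern needs both the non-greedy `.*?` and the `re.S` flag, it can be tried on a small string offline. The markup below is a hypothetical sample that only imitates the structure the pattern expects (the `pic.example.com` addresses are made up); it is not copied from the real site.

```python
import re

# Hypothetical sample markup, assumed for illustration only.
sample_html = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/a/one.jpg" alt="first">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/b/two.jpg" alt="second">
</a>
</div>
'''

pattern = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

# Without re.S, "." does not match newlines, so nothing is found:
# every <div> block spans several lines.
without_flag = re.findall(pattern, sample_html)

# With re.S, "." also matches newlines; the non-greedy ".*?" keeps each
# match inside a single <div> block instead of swallowing both.
with_flag = re.findall(pattern, sample_html, re.S)

# Same file-name logic as the crawler: keep the part after the last "/".
file_names = [src.split("/")[-1] for src in with_flag]
print(without_flag, with_flag, file_names)
```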

xpath (the key method, general-purpose)

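3. Using bs4

How it works:
- Instantiate a BeautifulSoup object and load the page source into it.
- Call the object's attributes and methods to locate tags and extract data.
Preparation:
- Install the environment:
  - pip install bs4
  - pip install lxml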

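How do you instantiate a BeautifulSoup object?

First import it: from bs4 import BeautifulSoup

Instantiating the object:

Method 1. Load the data of a local HTML document into the object

	# Read the HTML document
	fp = open('./text.html','r',encoding='utf-8')
	# Instantiate the BeautifulSoup object
	soup = BeautifulSoup(fp, 'lxml')

  > BeautifulSoup(source, parser name)

Method 2. Load page source fetched from the internet into the object (the common case)

page_text = response.text   # page source; response is what requests.get() returned
soup = BeautifulSoup(page_text, 'lxml')
print(soup)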

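Methods and attributes that the BeautifulSoup object provides for data parsing
<1> Getting tags

soup.tagName: returns the first tag with that name in the source.
soup.find():
- Locate by tag name: soup.find('tagName')
- Locate by attribute, e.g. soup.find('div', class_='song')
Returns the first tag that matches.
soup.find_all():
- Locate by tag name: soup.find_all('tagName')
- Locate by attribute, e.g. soup.find_all('div', class_='song')
Returns all matching tags as a list.
soup.select():
- Selector: soup.select('CSS selector')
- Hierarchical selectors, e.g. select('.classname > li > a'), select('.classname a')
  - ">" means exactly one level (direct child)
  - a space means any number of levels (any descendant)
Returns all matching tags as a list.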
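
The locating methods above can be compared side by side on a small document. This is a minimal sketch under stated assumptions: the markup is hypothetical, and the standard-library `html.parser` is used so that the example does not depend on lxml; passing `'lxml'` as in the rest of the chapter should behave the same for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<html><body>
  <div class="song"><p>first</p><p>second</p></div>
  <div class="tang">
    <ul>
      <li><a href="/a1">one</a></li>
      <li><a href="/a2">two</a></li>
    </ul>
    <a href="/outside">outside</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                                  # first <p> in the document
song_div = soup.find("div", class_="song")        # first <div class="song">
all_p = soup.find_all("p")                        # list of every <p>
child_links = soup.select(".tang > ul > li > a")  # ">" follows direct children only
all_links = soup.select(".tang a")                # a space matches any descendant

print(first_p.text, len(all_p), len(child_links), len(all_links))
```

The last two selectors differ only in how deep they look: the descendant form also picks up the `<a>` that sits directly under the `div`, which the child-only chain skips.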

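<2> Getting the text inside a tag or an attribute value

soup.tagName.text: all the text inside the tag, including text in nested tags. It is a property, so it takes no parentheses.
soup.tagName.get_text(): a method that returns the same text as .text.
soup.tagName.string: the tag's text when the tag contains a single text node; None when the tag has more than one child node.
soup.tagName['attrName']: the value of one of the tag's attributes.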
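
The difference between the text accessors is easiest to see on a tag that mixes plain text with a nested tag. A minimal sketch, assuming hypothetical markup and the standard-library `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div class="song"><p>direct</p><a href="/x" title="t">link <b>bold</b></a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="song")

all_text = div.text            # property: text of the tag and everything nested in it
same_text = div.get_text()     # method: same result as .text
p_string = soup.p.string       # <p> holds a single text node, so .string returns it
a_string = soup.a.string       # <a> holds text plus <b>, so .string is None
href = soup.a["href"]          # attribute value

print(repr(all_text), repr(p_string), a_string, href)
```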

bs4 exercise: crawl every chapter title and chapter text of the novel 三国演义 (Romance of the Three Kingdoms)

Code:

	import requests
	from bs4 import BeautifulSoup
	
	if __name__ == '__main__':
	    # 1. General crawl: fetch the table-of-contents page
	    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
	    headers = {
	        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
	    }
	    text = requests.get(url=url,headers=headers).text
	    # 2. Instantiate the BeautifulSoup object
	    soup = BeautifulSoup(text, 'lxml')
	    li_list = soup.select('.book-mulu li')
	    # The with block closes the file even if a request fails midway
	    with open("./book.txt",'w',encoding='utf8') as fp:
	        for li in li_list:
	            # 3. Get the chapter title
	            title = li.a.string
	            # 4. Build the URL of the chapter's detail page
	            href = li.a['href']
	            book_url = 'https://www.shicimingju.com'+href
	            # 5. Fetch the detail page and instantiate a new BeautifulSoup object
	            book_data = requests.get(url=book_url,headers=headers).text
	            book_soup = BeautifulSoup(book_data,'lxml')
	            # 6. Get the text of the chapter
	            chapter_text = book_soup.find('div',class_= 'chapter_content').text
	            # 7. Persistent storage
	            fp.write(title+"\n"+chapter_text+"\n")
	            print(f'{title} downloaded')

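4. Using xpath

How it works:
- Instantiate an etree object and load the page source into it.
- Call the etree object's xpath method with an xpath expression to locate tags and capture content.
Preparation:
- Install the environment:
  - pip install lxml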

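How do you instantiate an etree object?

Method 1. Load the data of a local HTML document into an etree object:

from lxml import etree
# Instantiate an etree object.
# etree.parse() uses an XML parser by default, so the file must be well-formed;
# for ordinary HTML, pass etree.HTMLParser() as the second argument.
tree = etree.parse('test.html')

Method 2. Load page source fetched from the internet into the object (the common case):

   from lxml import etree
   home_text = response.text   # page source
   tree = etree.HTML(home_text)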

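Using xpath:
- Syntax: tree.xpath('xpath expression'), where tree is the etree object
- /: one level. r = tree.xpath('/html/head/title')  # a leading / starts from the root node
- //: any number of levels. r = tree.xpath('/html//title')  # title at any depth under html
- Locating by attribute: tag[@attrName="attrValue"], e.g. //div[@class="song"]
- Locating by index, e.g. //div[@class="song"]/p[3]
  - Note: indexes start at 1
- Getting the text inside a tag or an attribute value:
  - Text:
    - /text(): the tag's direct text, e.g. r = tree.xpath('//a/text()')
    - //text(): all text, including text in nested tags, e.g. r = tree.xpath('//a//text()')
  - Attribute value, e.g. r = tree.xpath('//a/@href')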
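
The expressions listed above can be tried on a small in-memory document. This is a minimal sketch with hypothetical markup; note that `xpath()` always returns a list, even when only one node matches.

```python
from lxml import etree

# Hypothetical markup for illustration only.
html = """
<html><head><title>demo</title></head>
<body>
  <div class="song">
    <p>one</p><p>two</p><p>three</p>
    <a href="/x">link <b>bold</b></a>
  </div>
</body></html>
"""
tree = etree.HTML(html)

title = tree.xpath("/html/head/title/text()")             # one level at a time from the root
title_any = tree.xpath("/html//title/text()")             # title at any depth under html
third_p = tree.xpath('//div[@class="song"]/p[3]/text()')  # indexes start at 1
direct = tree.xpath("//a/text()")                         # direct text of <a> only
all_text = tree.xpath("//a//text()")                      # also text inside nested tags
hrefs = tree.xpath("//a/@href")                           # attribute values

print(title, third_p, direct, all_text, hrefs)
```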

xpath exercise: crawl second-hand housing listings from 58同城 (58.com)

Code:

	import requests
	from lxml import etree
	
	if __name__ == '__main__':
	    url = 'https://nc.58.com/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d100000-0029-d48d-51b0-62840783cec3&ClickID=2'
	    headers = {
	        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
	    }
	    home_text = requests.get(url=url,headers=headers).text
	
	    # Instantiate the etree object
	    html = etree.HTML(home_text)
	    # All listing entries
	    home_list = html.xpath('//ul[@class="house-list-wrap"]/li')
	    # Open the output file once instead of once per listing
	    with open('二手房源.txt','a',encoding='utf8') as es:
	        for home in home_list:
	            # The leading "." makes the search relative to this <li>
	            title = home.xpath(".//h2[@class='title']//text()")[1]
	            es.write(title+"\n")
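
Indexing an xpath result with a fixed position such as `[1]` raises IndexError when a listing lacks the expected nodes. One possible alternative, sketched here under assumptions, is to take the first non-blank text node instead. The helper `first_text` and the sample markup are hypothetical, and the sketch assumes that the fixed index was there to skip a whitespace-only text node; it has not been checked against the live page.

```python
from lxml import etree


def first_text(node, expr, default=""):
    """Return the first non-blank text matched by expr, or default if none."""
    for text in node.xpath(expr):
        text = text.strip()
        if text:
            return text
    return default


# Hypothetical markup that imitates the list structure used in the exercise.
sample = """
<ul class="house-list-wrap">
  <li><h2 class="title">
    <a href="#">Listing A</a>
  </h2></li>
  <li><div>no title here</div></li>
</ul>
"""
tree = etree.HTML(sample)

titles = [
    first_text(li, ".//h2[@class='title']//text()", default="(no title)")
    for li in tree.xpath('//ul[@class="house-list-wrap"]/li')
]
print(titles)
```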
