A Worked Example of Scraping a Website with Python and requests

程序员文章站 (Programmer Article Site) 2022-03-11 16:18:49

I have been teaching myself Python for a little over three months now, and I keep feeling that whatever I learn next pushes out what I learned before. So I am starting this blog to keep notes and consolidate what I know along the way. It is my first day here and I was not sure what to post.

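For a worked crawler example, let's start with one that scrapes pictures of pretty girls (I hear that kind of thing attracts clicks...)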

A crawler needs an HTTP client, and requests is the obvious choice, so the first step is to install it. That is easy, just use pip:

pip install requests

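Oh, I forgot to mention: I am using Python 3.7.9. I suggest using Python 3.6 or later, because the full script at the end of this post uses f-strings, which were added in 3.6.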

Today's target is 站长之家 (Chinaz):
http://aspx.sc.chinaz.com/query.aspx?keyword=%E6%80%A7%E6%84%9F%E7%BE%8E%E5%A5%B3
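The keyword in the URL above is simply the percent-encoded UTF-8 form of the Chinese search term that appears in the code later in this post. A small sketch with the standard library's urllib.parse shows the round trip; the variable names are my own, chosen for illustration.

```python
from urllib.parse import quote, unquote

keyword = '性感美女'  # the search term used in the code later in this post

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
encoded = quote(keyword)
print(encoded)

# Rebuild the search URL from its parts
search_url = 'http://aspx.sc.chinaz.com/query.aspx?keyword=' + encoded
print(search_url)

# unquote() reverses the encoding
print(unquote(encoded) == keyword)
```

requests is generally able to do this encoding for you, which would explain why the raw Chinese URL in the code works as written.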

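Let's analyze the site first. Right-click the page, choose Inspect, then view the page source: the image links are already in the HTML, so this is a static page. No difficulty there, so let's get straight to it.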
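One way to double-check the "static page" conclusion in code is to see whether the markup you care about already appears in the raw HTML, before any JavaScript runs. This is only a minimal sketch: the helper name looks_static and the two sample HTML strings are hypothetical, and imgload is the class name used by the XPath later in this post.

```python
def looks_static(html: str, marker: str) -> bool:
    """Return True if the marker already appears in the raw HTML source."""
    return marker in html


# Hypothetical sample responses standing in for requests.get(url).text
static_page = '<div class="imgload"><div><div><a href="/a.htm">a</a></div></div></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(looks_static(static_page, 'class="imgload"'))   # True
print(looks_static(dynamic_page, 'class="imgload"'))  # False
```

If the marker is missing from the raw response, the content is probably loaded by JavaScript and a plain requests.get will not see it.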
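We will need the parsel library. It is the selector library that Scrapy uses under the hood, and it offers several ways to locate data: regular expressions, XPath and CSS selectors are all handy. It is also a useful preview if you plan to learn Scrapy later. Installing parsel is just as easy: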

pip install parsel

import parsel
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400'
}
url = 'http://aspx.sc.chinaz.com/query.aspx?keyword=性感美女'
html = requests.get(url, headers=headers).text

res = parsel.Selector(html)
url_list = res.xpath('//div[@class="imgload"]/div')
for urls in url_list:
    link = urls.xpath('./div/a/@href').get()
    print(link)

Output:
http://sc.chinaz.com/tupian/150921398600.htm
http://sc.chinaz.com/tupian/190529039593.htm
http://sc.chinaz.com/tupian/150807092720.htm
http://sc.chinaz.com/tupian/150116595010.htm
http://sc.chinaz.com/tupian/190205506532.htm
http://sc.chinaz.com/tupian/170516393121.htm
http://sc.chinaz.com/tupian/190403392630.htm
http://sc.chinaz.com/tupian/180301072722.htm
http://sc.chinaz.com/tupian/140703076000.htm ....

That gets us the links to the detail pages. The analysis is the same as before, so I won't repeat it. Here is the code:

# This part runs for each link found by the listing-page loop
content = requests.get(link, headers=headers)
content.encoding = "utf-8"
image_url = parsel.Selector(content.text)
for img in image_url.xpath('//div[@class="imga"]/a'):
    # Extract the image URL
    images = img.xpath('./img/@src').get()
    # Extract the title to use as the image file name
    title = img.xpath('./@title').get()
    print(images, title)

Output:
http://pic.sc.chinaz.com/files/pic/pic9/201509/apic14867.jpg 漂亮性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201905/zzpic18256.jpg 欧美性感美女写真图片
http://pic.sc.chinaz.com/files/pic/pic9/201508/apic13697.jpg 美乳性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201501/apic8825.jpg 包厢性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201901/zzpic16191.jpg 个性性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201701/fpic10114.jpg 风尘性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201904/zzpic17359.jpg 抽烟性感美女图片
http://pic.sc.chinaz.com/files/pic/pic9/201802/zzpic10593.jpg 超性感美女写真图片
http://pic.sc.chinaz.com/files/pic/pic9/201406/apic4601.jpg 妖娆性感美女图片 .....

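Now we have both the image URLs and the names. The next step is to download everything to disk so we can look through it later when there is time. Here is the full code: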

import os

import parsel
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3760.400 QQBrowser/10.5.4083.400'
}
# Create the output directory up front; open() fails if it does not exist
os.makedirs("./images", exist_ok=True)

for page in range(1, 21):  # scrape 20 pages
    url = f'http://aspx.sc.chinaz.com/query.aspx?keyword=性感美女&issale=&classID=0&navindex=0&page={page}'
    html = requests.get(url, headers=headers).text

    res = parsel.Selector(html)
    url_list = res.xpath('//div[@class="imgload"]/div')
    for urls in url_list:
        link = urls.xpath('./div/a/@href').get()
        if link is None:  # skip entries without a detail-page link
            continue
        content = requests.get(link, headers=headers)
        content.encoding = "utf-8"
        image_url = parsel.Selector(content.text)
        for img in image_url.xpath('//div[@class="imga"]/a'):
            # Extract the image URL
            images = img.xpath('./img/@src').get()
            # Extract the title to use as the image file name
            title = img.xpath('./@title').get()
            if not images or not title:  # nothing to download or no name to save under
                continue
            print("Downloading", images)
            # Where to save the file
            filename = title + ".jpg"
            image = requests.get(images, headers=headers)
            with open("./images/" + filename, "wb") as f:
                f.write(image.content)
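The script uses the page title directly as a file name and always appends .jpg. A possible refinement, sketched here with only the standard library, replaces characters that are invalid in file names on common systems and takes the extension from the image URL. The helper name build_filename and the "untitled" fallback are assumptions of mine, not part of the original script.

```python
import os
import re
from urllib.parse import urlparse


def build_filename(title, image_url, default_ext=".jpg"):
    """Build a file name from a page title and an image URL (hypothetical helper)."""
    # Replace characters that Windows and most file systems reject in names
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title or "").strip("_")
    if not safe_title:
        safe_title = "untitled"  # fallback when the title is missing or empty
    # Take the extension from the URL path, ignoring any query string
    ext = os.path.splitext(urlparse(image_url).path)[1] or default_ext
    return safe_title + ext


print(build_filename("漂亮性感美女图片", "http://pic.sc.chinaz.com/files/pic/pic9/201509/apic14867.jpg"))
print(build_filename('a/b: "c"?', "http://example.com/img/photo.png?size=large"))
print(build_filename(None, "http://example.com/img/noext"))
```

Two pages with the same title would still overwrite each other, so a counter or the numeric id from the URL could be appended if that matters.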

There is still plenty that could be optimized and made more flexible, but I'll leave it here for now.
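As one example of such an optimization, the repeated requests.get calls could share a single requests.Session, so that headers are set once and connections are reused, with a timeout and a status check on every request. This is a sketch under my own assumptions: the helper names make_session and fetch, the 10-second timeout and the shortened User-Agent string are illustrative choices, not values from the original script.

```python
import requests


def make_session(user_agent="Mozilla/5.0"):
    """Create a session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session, url, timeout=10.0):
    """GET a URL, failing fast on timeouts and on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on an error status
    return response


session = make_session()
print(session.headers["User-Agent"])
# Example call (performs a real network request, so it is left commented out):
# html = fetch(session, "http://aspx.sc.chinaz.com/query.aspx?keyword=性感美女").text
```

Without a timeout, a single stalled connection can hang the whole 20-page run, so this is probably the cheapest improvement to make first.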

Original article: https://blog.csdn.net/weixin_51211600/article/details/108862562
