Image scraping [scrapy, splash]
程序员文章站
2024-03-22 16:14:16
Environment:
This walkthrough uses the scrapy crawler framework, with splash for rendering (without it, many sites that load content asynchronously cannot be scraped).
Installing scrapy (an offline download-and-install works fine) [https://blog.csdn.net/pp_lan/article/details/90642614]
Installing splash (a somewhat involved process) [https://blog.csdn.net/pp_lan/article/details/90692510]
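Once both are installed, scrapy-splash also has to be wired into the project settings. Per the scrapy-splash README, a typical settings.py fragment looks like the sketch below; the Splash URL is an assumption for a local Docker instance, so adjust it to wherever your splash service actually runs.

```python
# settings.py -- wiring scrapy-splash into a scrapy project (per the scrapy-splash README)

# Address of the running Splash instance (assumed: a local Docker container)
SPLASH_URL = 'http://localhost:8050'

# Splash-aware downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate identical splash arguments across requests
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dupe filter that understands splash request fingerprints
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```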
Image scraping
Core code:
# Scrape images from the rendered page
import os
import re
import urllib.request

import scrapy
from scrapy_splash import SplashRequest


class AutoHomeImgSpider(scrapy.Spider):
    name = 'autohome_img_spider'
    allowed_domains = ['autohome.com']

    def start_requests(self):
        img_url = "https://club.autohome.com.cn/bbs/thread/b25da065245156d4/83662885-1.html#pvareaid=2592101"
        # Let splash render the page for 2 seconds before returning the HTML
        yield SplashRequest(img_url,
                            callback=self.parse_config,
                            args={'wait': 2, 'timeout': 10})

    def parse_config(self, response):
        imgSum = 0
        badImg = 0
        hrefCmp = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')
        hreflist = hrefCmp.findall(response.text)
        drive = "F:\\pyworkspace\\mySpider\\img"
        if not os.path.exists(drive):
            os.makedirs(drive)
        for href in hreflist:
            # Protocol-relative URLs ("//host/path") need a scheme prepended
            if not href.startswith(("http://", "https://")):
                href = 'http:' + href
            imageName = href[href.rindex("/") + 1:]
            try:
                urllib.request.urlretrieve(href, os.path.join(drive, imageName))
                imgSum += 1
                print(imageName + " OK")
            except OSError:
                badImg += 1
                print("cannot download this image: " + imageName)
                print(href)
        print("Success:", imgSum, " Failed:", badImg)
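The extraction logic in parse_config, a regex over the rendered HTML plus scheme normalization for protocol-relative src values, can be exercised on its own without running the spider. The sample HTML and URLs below are made up for illustration:

```python
import re

# Same pattern as the spider: capture src of <img> tags whose name is "F06"
href_cmp = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')

# Hypothetical snippet of rendered page HTML
sample_html = (
    '<img class="x" name="F06" src="//club2.autoimg.cn/pic/a.jpg" width="500">'
    '<img name="F06" src="https://club2.autoimg.cn/pic/b.jpg">'
    '<img name="other" src="//club2.autoimg.cn/pic/c.jpg">'
)

hrefs = href_cmp.findall(sample_html)
# Prepend a scheme to protocol-relative URLs, as the spider does
normalized = [h if h.startswith(("http://", "https://")) else "http:" + h
              for h in hrefs]
print(normalized)
```

Note that the third tag is skipped because its name attribute is not "F06", and the first URL gains an "http:" prefix while the second already has a scheme.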
Run:
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "autohome_img_spider"])
Sample output: