
Image scraping [scrapy, splash]

程序员文章站 2024-03-22 16:14:16

Environment:

The spider uses the scrapy framework, with pages rendered through splash (without a rendering step, sites that load their content asynchronously cannot be scraped).

Installing scrapy (an offline download and install is enough): https://blog.csdn.net/pp_lan/article/details/90642614

Installing splash (the process is somewhat involved): https://blog.csdn.net/pp_lan/article/details/90692510
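Besides installing the two packages, scrapy-splash also has to be enabled in the project's settings.py. A minimal configuration, assuming Splash is running locally on its default port 8050, looks like this (middleware names and priorities follow the scrapy-splash documentation):

```python
# settings.py -- minimal scrapy-splash wiring (assumes Splash at localhost:8050)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request fingerprinting aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```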

Scraping the images

Core code:

# Image-scraping spider
import os
import re
import urllib.request

import scrapy
from scrapy_splash import SplashRequest


class AutoHomeImgSpider(scrapy.Spider):
    name = 'autohome_img_spider'
    allowed_domains = ['autohome.com.cn']

    def start_requests(self):
        img_url = "https://club.autohome.com.cn/bbs/thread/b25da065245156d4/83662885-1.html#pvareaid=2592101"
        # Wait 2 s so the page's asynchronously loaded content is rendered
        # before Splash hands the HTML back
        yield SplashRequest(img_url,
                            callback=self.parse_config,
                            args={'wait': '2', 'timeout': '10'})

    def parse_config(self, response):
        img_sum = 0
        bad_img = 0
        # Match the src attribute of every <img> tag whose name is "F06"
        href_cmp = re.compile(r"""<img.*?name="F06".*?src="(.*?)".*?>""")
        href_list = href_cmp.findall(response.text)
        drive = "F:\\pyworkspace\\mySpider\\img"
        if not os.path.exists(drive):
            os.mkdir(drive)
        for href in href_list:
            # Protocol-relative links (//host/path) need a scheme prepended
            if not (href.startswith("http://") or href.startswith("https://")):
                href = 'http:' + href
            image_name = href[href.rindex("/") + 1:]
            try:
                urllib.request.urlretrieve(href, os.path.join(drive, image_name))
                img_sum += 1
                print(image_name + "    OK")
            except OSError:
                bad_img += 1
                print("cannot download this image: " + href)
        print("Success:", img_sum, "    Failed:", bad_img)
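The extraction hinges on the regular expression in parse_config; its behavior can be checked in isolation against a small HTML fragment that mimics the forum's markup (the snippet below is an illustrative fabrication, not taken from the live page):

```python
import re

# Same pattern as in the spider: capture the src of every <img> whose name is "F06"
href_cmp = re.compile(r"""<img.*?name="F06".*?src="(.*?)".*?>""")

# Hypothetical fragment imitating the forum page's image tags
html = ('<img class="photo" name="F06" src="//club2.autoimg.cn/album/a.jpg">'
        '<img class="icon" name="other" src="//x.autoimg.cn/icon.png">')

# Only the name="F06" image is matched; the icon is skipped
print(href_cmp.findall(html))
# -> ['//club2.autoimg.cn/album/a.jpg']
```

Note that the captured links are protocol-relative (they start with //), which is why the spider prepends "http:" before downloading.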

Launch:

from scrapy.cmdline import execute

execute(["scrapy", "crawl", "autohome_img_spider"])

Sample output:

[result screenshot]