Image scraping [scrapy, splash]
程序员文章站
2024-03-22 16:14:16
Environment:
This walkthrough uses the scrapy crawler framework, with splash for rendering (without it, many sites that load content asynchronously cannot be scraped).
Installing scrapy (an offline download-and-install works fine) [https://blog.csdn.net/pp_lan/article/details/90642614]
Installing splash (a somewhat involved process) [https://blog.csdn.net/pp_lan/article/details/90692510]
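Once both are installed, scrapy-splash also has to be wired into the project settings. Per the scrapy-splash README, a typical settings.py fragment looks like the sketch below; the Splash URL is an assumption for a local Docker instance, so adjust it to wherever your splash service actually runs.

```python
# settings.py -- wiring scrapy-splash into a scrapy project (per the scrapy-splash README)

# Address of the running Splash instance (assumed: a local Docker container)
SPLASH_URL = 'http://localhost:8050'

# Splash-aware downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate identical splash arguments across requests
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dupe filter that understands splash request fingerprints
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```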
Image scraping
Core code:
# Scrape images from the rendered page
import os
import re
import urllib.request

import scrapy
from scrapy_splash import SplashRequest


class AutoHomeImgSpider(scrapy.Spider):
    name = 'autohome_img_spider'
    allowed_domains = ['autohome.com']

    def start_requests(self):
        img_url = "https://club.autohome.com.cn/bbs/thread/b25da065245156d4/83662885-1.html#pvareaid=2592101"
        # Let splash render the page for 2 seconds before returning the HTML
        yield SplashRequest(img_url,
                            callback=self.parse_config,
                            args={'wait': 2, 'timeout': 10})

    def parse_config(self, response):
        imgSum = 0
        badImg = 0
        hrefCmp = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')
        hreflist = hrefCmp.findall(response.text)
        drive = "F:\\pyworkspace\\mySpider\\img"
        if not os.path.exists(drive):
            os.makedirs(drive)
        for href in hreflist:
            # Protocol-relative URLs ("//host/path") need a scheme prepended
            if not href.startswith(("http://", "https://")):
                href = 'http:' + href
            imageName = href[href.rindex("/") + 1:]
            try:
                urllib.request.urlretrieve(href, os.path.join(drive, imageName))
                imgSum += 1
                print(imageName + " OK")
            except OSError:
                badImg += 1
                print("cannot download this image: " + imageName)
                print(href)
        print("Success:", imgSum, " Failed:", badImg)
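The extraction logic in parse_config, a regex over the rendered HTML plus scheme normalization for protocol-relative src values, can be exercised on its own without running the spider. The sample HTML and URLs below are made up for illustration:

```python
import re

# Same pattern as the spider: capture src of <img> tags whose name is "F06"
href_cmp = re.compile(r'<img.*?name="F06".*?src="(.*?)".*?>')

# Hypothetical snippet of rendered page HTML
sample_html = (
    '<img class="x" name="F06" src="//club2.autoimg.cn/pic/a.jpg" width="500">'
    '<img name="F06" src="https://club2.autoimg.cn/pic/b.jpg">'
    '<img name="other" src="//club2.autoimg.cn/pic/c.jpg">'
)

hrefs = href_cmp.findall(sample_html)
# Prepend a scheme to protocol-relative URLs, as the spider does
normalized = [h if h.startswith(("http://", "https://")) else "http:" + h
              for h in hrefs]
print(normalized)
```

Note that the third tag is skipped because its name attribute is not "F06", and the first URL gains an "http:" prefix while the second already has a scheme.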
Run:
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "autohome_img_spider"])
Sample output: