Scrapy: The Images Pipeline

程序员文章站 (Programmer Article Station), 2022-03-02 21:16:25


items.py

import scrapy

class MyItem(scrapy.Item):

    # ... other item fields ...
    img_urls = scrapy.Field()
    img_paths = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class ZhihuImagesPipeline(ImagesPipeline):

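    def get_media_requests(self, item, info):
        # Schedule one download request per image URL stored on the item.
        for img_url in item.get('img_urls', []):
            yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        # Keep the storage path of every image that downloaded successfully.
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            # Discard items for which no image could be downloaded.
            raise DropItem("Item contains no images")
        item['img_paths'] = img_paths
        return item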

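Notes

results is a list of 2-tuples (success, detail). When success is True, detail is a dict holding the image's url, path, and checksum; when it is False, detail is a Failure describing the error. A typical value looks like this: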

 [(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
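
To make the list comprehension in item_completed concrete, here is a minimal sketch that runs without Scrapy. It reuses the sample value above, but stands in a plain Exception for the Failure object (an assumption made only so the snippet is self-contained), and the name failures is introduced here for illustration.

```python
# Stand-in for the `results` argument passed to item_completed().
# In Scrapy the second element of a failed entry is a Failure object;
# a plain Exception is used here so the sketch runs without Scrapy.
results = [
    (True, {
        'checksum': '2b00042f7481c7b056c4b410d28f33cf',
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
        'url': 'http://www.example.com/files/product1.pdf',
    }),
    (False, Exception('download failed')),
]

# Same filter as in item_completed: keep paths of successful downloads only.
img_paths = [x['path'] for ok, x in results if ok]

# The failed entries can be collected the same way, e.g. for logging.
failures = [x for ok, x in results if not ok]

print(img_paths)
print(len(failures))
```

If every entry fails, img_paths is empty, which is the case where the pipeline above raises DropItem.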

settings.py

ITEM_PIPELINES = {'myProject.pipelines.ZhihuImagesPipeline': 1}  # lower numbers mean higher priority (run earlier)
IMAGES_STORE = 'D:\\path\\...'
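
The path values in results are relative to IMAGES_STORE. As far as I know, the default naming scheme is full/ followed by the SHA-1 hex digest of the image URL and a .jpg extension, which matches the sample path shown earlier. The sketch below mimics that scheme with the standard library only; default_image_path, the /tmp/images store directory, and the example URL are all hypothetical names chosen for illustration, not part of Scrapy's API.

```python
import hashlib
from pathlib import PurePosixPath


def default_image_path(url: str) -> str:
    """Mimic the assumed default naming: 'full/' + SHA-1 hex of the URL + '.jpg'."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return f'full/{image_guid}.jpg'


# Hypothetical store directory standing in for the IMAGES_STORE setting.
store = PurePosixPath('/tmp/images')

relative = default_image_path('http://www.example.com/image.jpg')
full_path = store / relative

print(relative)
print(full_path)
```

Because the name is derived from the URL, requesting the same URL again maps to the same file under the store directory.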
