Scrapy's Images Pipeline
items.py
import scrapy


class MyItem(scrapy.Item):
    # ... other item fields ...
    img_urls = scrapy.Field()
    img_paths = scrapy.Field()
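For context, a spider has to fill in img_urls before the pipeline can do anything. The following is a minimal sketch of such a spider; the spider name, start URL, CSS selector, and the myProject.items import path are hypothetical placeholders, not part of the original article.

import scrapy
from myProject.items import MyItem  # assumes the standard project layout of a project named "myProject"


class ImageSpider(scrapy.Spider):
    # hypothetical spider; it only illustrates how img_urls gets populated
    name = "images"
    start_urls = ["https://www.example.com/gallery"]

    def parse(self, response):
        item = MyItem()
        # collect absolute image URLs; the CSS selector is a placeholder
        item["img_urls"] = [
            response.urljoin(src)
            for src in response.css("img::attr(src)").getall()
        ]
        yield item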
pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ZhihuImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # issue one download request per image URL collected by the spider
        for img_url in item['img_urls']:
            yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        # keep only the storage paths of images that downloaded successfully
        img_paths = [x['path'] for ok, x in results if ok]
        if not img_paths:
            raise DropItem("Item contains no images")
        item['img_paths'] = img_paths
        return item
Notes
results
A list of two-element (success, result) tuples; a typical value looks like this:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
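By default the path value is derived from a SHA-1 hash of the image URL, as in the example above. If you want a different naming scheme, ImagesPipeline also lets you override file_path(). The sketch below keeps the original filename from the URL; the class name and naming scheme are only an illustration, and the method signature shown matches recent Scrapy releases.

from os.path import basename
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class NamedImagesPipeline(ImagesPipeline):
    # hypothetical subclass: store each image under its original filename
    # instead of the default SHA-1 hash seen in the 'path' values above
    def file_path(self, request, response=None, info=None, *, item=None):
        return "full/" + basename(urlparse(request.url).path)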
settings.py
ITEM_PIPELINES = {'myProject.pipelines.ZhihuImagesPipeline': 1}  # lower number = higher priority
IMAGES_STORE = 'D:\\path\\...'
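Beyond ITEM_PIPELINES and IMAGES_STORE, the images pipeline honours a few optional settings such as expiration, thumbnail generation, and size filtering. The values below are illustrative, not taken from the original article.

# settings.py (optional extras; values are illustrative)
IMAGES_EXPIRES = 90            # skip re-downloading images fetched within the last 90 days
IMAGES_THUMBS = {              # generate thumbnails alongside the full-size image
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110        # ignore images smaller than these dimensions
IMAGES_MIN_WIDTH = 110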