
Getting Started with Scrapy: Scraping Taopiaopiao (3), Saving Images

程序员文章站 2022-06-27 07:53:02

Goal: scrape the poster image URLs and save the images locally.

1. Using ImagesPipeline

(1) Add an entry 'scrapy.pipelines.images.ImagesPipeline': 1 to the ITEM_PIPELINES dict in settings.py.

(2) Add two fields to the Item:

    img_urls = scrapy.Field()

    images = scrapy.Field()

(3) In settings.py, configure the storage path IMAGES_STORE, the item field that holds the image URLs (IMAGES_URLS_FIELD), and the item field that receives the download results (IMAGES_RESULT_FIELD):

    IMAGES_STORE = 'F:\\py_pic'

    IMAGES_URLS_FIELD = 'img_urls'

    IMAGES_RESULT_FIELD = 'images'

    You can also set IMAGES_THUMBS in settings.py to generate thumbnails of the given sizes, and IMAGES_EXPIRES to control how long downloaded files remain valid:

    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (270, 270),
    }

    IMAGES_EXPIRES = 30  # expire after 30 days
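With these settings, downloaded files end up under IMAGES_STORE with names derived from a SHA1 hash of the image URL: full-size images under full/, thumbnails under thumbs/small/ and thumbs/big/. A minimal sketch of that naming scheme (the URL is hypothetical):

```python
import hashlib

def image_paths(url):
    """Mimic ImagesPipeline's default naming: SHA1 hash of the image URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return {
        "full": "full/%s.jpg" % digest,
        "small": "thumbs/small/%s.jpg" % digest,  # from IMAGES_THUMBS['small']
        "big": "thumbs/big/%s.jpg" % digest,      # from IMAGES_THUMBS['big']
    }

paths = image_paths("https://example.com/poster.jpg")  # hypothetical URL
print(paths["full"])
```

All three variants share the same hash, so a full image and its thumbnails are easy to correlate on disk.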

2. Results

Command: scrapy crawl taopiaopiao

[Screenshot: console output of scrapy crawl taopiaopiao]

Saved image files:

[Screenshot: images saved under IMAGES_STORE]


3. Code

items.py

import scrapy

class TaopiaopiaoItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    actor = scrapy.Field()
    country = scrapy.Field()
    img_urls = scrapy.Field()  # image URLs to download
    images = scrapy.Field()    # download results filled in by ImagesPipeline

taopiaopiao.py

# coding:utf-8
import scrapy
from taopiaopiao.items import TaopiaopiaoItem

class TaopiaopiaoSpider(scrapy.Spider):
    # spider name used by "scrapy crawl"
    name = "taopiaopiao"
    start_urls = [
        "https://www.taopiaopiao.com/showList.htm?n_s=new"
    ]

    def parse(self, response):
        # parse the movie listing page
        movies = response.xpath("//div[@class='movie-card-wrap']")
        for movie in movies:
            # create a fresh item per movie; yielded items are processed
            # asynchronously, so a single shared instance would be overwritten
            item = TaopiaopiaoItem()
            item["url"] = \
                movie.xpath("a/@href").extract()[0]
            item["name"] = \
                movie.xpath("a/div[@class='movie-card-name']/span[@class='bt-l']/text()").extract()[0]
            item["img_urls"] = \
                movie.xpath("a/div[@class='movie-card-poster']/img/@src").extract()
            item["actor"] = \
                movie.xpath("a/div[@class='movie-card-info']/div[@class='movie-card-list']/span[2]/text()").extract()[0]
            item["country"] = \
                movie.xpath("a/div[@class='movie-card-info']/div[@class='movie-card-list']/span[4]/text()").extract()[0]
            yield item
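One pitfall when yielding items from a parse loop: Scrapy hands yielded items to pipelines asynchronously, so reusing a single item instance across iterations can overwrite fields before a pipeline sees them. The plain-Python sketch below shows why each iteration needs its own object:

```python
def reuse_one_dict(rows):
    item = {}
    for row in rows:
        item["name"] = row
        yield item  # every yielded reference points at the same dict


def fresh_dict_each_time(rows):
    for row in rows:
        yield {"name": row}  # each consumer gets its own dict


shared = list(reuse_one_dict(["movie-a", "movie-b"]))
fresh = list(fresh_dict_each_time(["movie-a", "movie-b"]))
print([d["name"] for d in shared])  # ['movie-b', 'movie-b'] -- both refs see the last value
print([d["name"] for d in fresh])   # ['movie-a', 'movie-b']
```

This is why the spider creates a new TaopiaopiaoItem inside the loop rather than once before it.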

settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'taopiaopiao'

SPIDER_MODULES = ['taopiaopiao.spiders']
NEWSPIDER_MODULE = 'taopiaopiao.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'taopiaopiao.pipelines.TaopiaopiaoPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline' : 1,
}

IMAGES_STORE = 'F:\\py_pic'
IMAGES_URLS_FIELD = 'img_urls'
IMAGES_RESULT_FIELD = 'images'
IMAGES_THUMBS = {
    'small' : (50,50),
    'big' : (270,270),
}
IMAGES_EXPIRES = 30
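IMAGES_EXPIRES tells the pipeline to skip re-downloading an image whose stored copy is younger than the given number of days. Conceptually the freshness check looks like this (a simplified stdlib sketch, not Scrapy's actual code):

```python
import os
import time

IMAGES_EXPIRES = 30  # days, mirroring the setting above

def is_fresh(path, expires_days=IMAGES_EXPIRES):
    """Return True if the cached image is newer than the expiry window."""
    if not os.path.exists(path):
        return False  # never downloaded: must fetch
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < expires_days * 24 * 3600
```

A fresh file is served from IMAGES_STORE; a missing or stale one triggers a new download.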