Scraping Images from a Website with Scrapy
程序员文章站
2024-01-12 09:08:52
1. Create a Scrapy project by running the following in the command line or PyCharm's Terminal:
scrapy startproject imagepix
The following files are generated automatically:
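The original article showed the generated files as a screenshot. For reference, `scrapy startproject imagepix` produces Scrapy's standard project layout, which looks like this (reconstructed from Scrapy's defaults):

```text
imagepix/
├── scrapy.cfg            # deploy configuration
└── imagepix/
    ├── __init__.py
    ├── items.py          # item definitions (step 4)
    ├── middlewares.py
    ├── pipelines.py      # item pipelines (step 5)
    ├── settings.py       # project settings (step 6)
    └── spiders/
        └── __init__.py
```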
2. Create a new file named imagepixiv.py under the imagepix/spiders folder.
3. Code in imagepixiv.py:
import json
from urllib.parse import urlencode

import scrapy

from ..items import ImagepixItem


class ImagepixivSpider(scrapy.Spider):
    name = 'imagepixiv'

    def start_requests(self):
        data = {'keyword': '风景'}  # '风景' means "landscape"
        base_url_1 = 'https://api.pixivic.com/illustrations?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['page'] = page
            params = urlencode(data)
            url_1 = base_url_1 + params
            yield scrapy.Request(url_1, callback=self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('data'):
            item = ImagepixItem()
            item['title'] = image.get('title')
            item['id'] = image.get('id')
            url = image.get('imageUrls')[0].get('large')
            # Route the raw image URL through the pixivic proxy.
            url_rel = 'https://img.pixivic.com:23334/get/' + str(url)
            item['url'] = url_rel
            yield item
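To make the request and parse logic concrete, here is a standalone sketch of the URL construction and response handling, runnable without Scrapy. The JSON body below is a made-up sample whose shape (`data`, `imageUrls`, `large`) is assumed to match what the spider's accessors expect:

```python
import json
from urllib.parse import urlencode

# Build the paginated API URL the same way start_requests does.
data = {'keyword': '风景', 'page': 1}  # '风景' means "landscape"
url_1 = 'https://api.pixivic.com/illustrations?' + urlencode(data)
print(url_1)  # the keyword is percent-encoded as UTF-8

# A minimal fake response body with the shape parse() expects (assumed).
body = json.dumps({'data': [{
    'title': 'sample', 'id': 12345,
    'imageUrls': [{'large': 'https://i.pximg.net/img/12345_p0.jpg'}],
}]})

result = json.loads(body)
for image in result.get('data'):
    # Route the raw pixiv URL through the pixivic proxy, as the spider does.
    url_rel = 'https://img.pixivic.com:23334/get/' + str(image['imageUrls'][0]['large'])
    print(url_rel)
```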
4. Code in items.py:
import scrapy
from scrapy import Field


class ImagepixItem(scrapy.Item):
    title = Field()
    id = Field()
    url = Field()
5. Code in pipelines.py:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImagepixPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        # Use the last segment of the URL as the saved filename.
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image download failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])
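The filename logic in file_path is just a string split on the proxied URL, so it can be sanity-checked in isolation without Scrapy (the URL below is a hypothetical example of the kind get_media_requests yields):

```python
# Hypothetical proxied image URL passing through the pipeline.
url = 'https://img.pixivic.com:23334/get/https://i.pximg.net/img/12345_p0.jpg'

# file_path keeps only the last path segment as the saved filename.
file_name = url.split('/')[-1]
print(file_name)  # 12345_p0.jpg
```

Since pixiv embeds the illustration id in the basename (e.g. 12345_p0.jpg), the derived filenames stay unique per image.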
6. Code in settings.py:
BOT_NAME = 'imagepix'
SPIDER_MODULES = ['imagepix.spiders']
NEWSPIDER_MODULE = 'imagepix.spiders'
MAX_PAGE = 50
FEED_EXPORT_ENCODING = 'utf-8'
IMAGES_STORE = './images'
ITEM_PIPELINES = {
    'imagepix.pipelines.ImagepixPipeline': 300,
}
ROBOTSTXT_OBEY = False
7. Run from the command line:
scrapy crawl imagepixiv
8. Result: the downloaded images are saved to the ./images directory (per the IMAGES_STORE setting).