Python Scrapy: yielding multiple items from one spider and handling them with multiple pipelines

程序员文章站 2022-05-08 17:06:11


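This post is only a record of the pitfalls I ran into. If you know a better solution, please tell me.
Anyone who writes Python crawlers probably knows Scrapy. It is an asynchronous crawling framework written in Python and built on top of the Twisted asynchronous networking library. This post does not cover the architecture (the engine, the downloader and so on), since plenty of material on that is available online. It only shows how to write the code.
I did not plan to write this at first, but I hit quite a few pitfalls over the last few days. The post mainly shows how to work with multiple items and with multiple pipelines. The concrete steps follow.
First, write the spider itself: create a new .py file with any name in the spiders folder. My code is below. It crawls my own blog, even though there is nothing on it.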

import scrapy
from ..items import XieItem, ComItem


class DmozSpider(scrapy.Spider):
    name = "ce"
    # allowed_domains takes bare domain names, without a scheme or trailing slash
    allowed_domains = ["wensong.xyz"]
    # Scrapy does not send this attribute automatically; it only takes effect
    # if passed to a Request or configured via DEFAULT_REQUEST_HEADERS
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Cookie': 'Hm_lvt_bfe6b19ea4d61a4b723235442982102c=1596154422,1596699752,1597652311,1598425017; Hm_lpvt_bfe6b19ea4d61a4b723235442982102c=1598425038'
    }
    start_urls = [
        'http://wensong.xyz/',
    ]

    def parse(self, response, **kwargs):
        # Instantiate XieItem
        item = XieItem()
        # Grab the page title
        title = response.xpath('//title//text()').getall()[0]
        item['name'] = title
        yield item
        # Instantiate ComItem
        item1 = ComItem()
        # Grab the link to the next page
        url1 = 'http://wensong.xyz' + response.xpath('//p//a//@href').getall()[0]
        item1['url'] = url1
        yield item1

Careful readers will have noticed that parse contains two yield statements, each returning a different item.
The items file looks like this:

import scrapy


class XieItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
class ComItem(scrapy.Item):
    url=scrapy.Field()

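Note that the field names declared in the items file must match the keys used in the spider (item['name'] and item1['url']). Each Field is one piece of data you want to scrape.
Next comes the pipelines file, pipelines.py: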

from itemadapter import ItemAdapter
from .items import XieItem,ComItem
# from .items import ComItem

class XiechengNewPipeline:
    def process_item(self, item, spider):
        print(item['name'])
class ComchengNewPipeline:
    def process_item1(self, item, spider):
        print(item['url'],'11111')

Next, the pipeline configuration:

ITEM_PIPELINES = {
   'xiecheng_new.pipelines.XiechengNewPipeline': 200,
   'xiecheng_new.pipelines.ComchengNewPipeline': 300,
}

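The numbers are priorities. Their exact values do not matter much (by convention they range from 0 to 1000), but give each pipeline a distinct value; lower numbers run first. Run this code and you will find that every item, whatever its type, goes to the first pipeline and none reaches the second. That is because Scrapy passes every item through every enabled pipeline in priority order and only calls a method named process_item. It does not route items by class, and a method named process_item1 is never called.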
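To make the chaining behaviour concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not Scrapy's real implementation: run_pipelines, PrintNamePipeline and PrintUrlPipeline are hypothetical names, and plain dicts stand in for Scrapy items. It only illustrates that each item visits every pipeline in ascending priority order, and that each process_item has to return the item for the next pipeline to receive it.

```python
# Simplified model of a pipeline chain (NOT Scrapy's actual code).
# All names here are hypothetical and exist only for illustration.


class PrintNamePipeline:
    def process_item(self, item, spider):
        if "name" in item:
            print("name:", item["name"])
        return item  # hand the item on to the next pipeline


class PrintUrlPipeline:
    def process_item(self, item, spider):
        if "url" in item:
            print("url:", item["url"])
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass one item through every pipeline, lowest priority number first."""
    for pipeline, _priority in sorted(pipelines, key=lambda pair: pair[1]):
        item = pipeline.process_item(item, spider)
    return item


# Listed out of order on purpose: sorting by priority decides the real order.
PIPELINES = [(PrintUrlPipeline(), 300), (PrintNamePipeline(), 200)]

result_a = run_pipelines({"name": "My blog"}, PIPELINES)
result_b = run_pipelines({"url": "http://wensong.xyz/page/2"}, PIPELINES)
```

Both items pass through both pipelines; each pipeline simply ignores the items it does not care about and returns them unchanged.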
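So how do we fix this? If you only have a single spider and a single pipeline configuration, check the item's type inside process_item and store the data accordingly.
The code looks like this: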

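from .items import XieItem, ComItem


class XiechengNewPipeline:
    def process_item(self, item, spider):
        # Handle only XieItem; every other item type passes through untouched
        if isinstance(item, XieItem):
            print(item['name'])
        # Always return the item so the next pipeline receives it
        return item


class ComchengNewPipeline:
    # The method must be named process_item, otherwise Scrapy never calls it
    def process_item(self, item, spider):
        if isinstance(item, ComItem):
            print(item['url'], '11111')
        return item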
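The same type check can also live in one pipeline that dispatches on the item class. The following is a rough sketch under stated assumptions, runnable without Scrapy: TitleItem and LinkItem are hypothetical dataclasses standing in for XieItem and ComItem, and the two lists stand in for whatever storage you actually use.

```python
from dataclasses import dataclass


@dataclass
class TitleItem:  # hypothetical stand-in for XieItem
    name: str


@dataclass
class LinkItem:  # hypothetical stand-in for ComItem
    url: str


class TypeDispatchPipeline:
    """One pipeline that handles each item type in its own branch."""

    def __init__(self):
        self.titles = []  # stand-in for real storage (file, database, ...)
        self.links = []

    def process_item(self, item, spider):
        if isinstance(item, TitleItem):
            self.titles.append(item.name)
        elif isinstance(item, LinkItem):
            self.links.append(item.url)
        # Unknown item types fall through unchanged
        return item


pipeline = TypeDispatchPipeline()
for scraped in (TitleItem("My blog"), LinkItem("http://wensong.xyz/page/2")):
    pipeline.process_item(scraped, spider=None)
```

Using isinstance rather than comparing type(item) also accepts subclasses of an item class, which is usually what you want.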

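When different spiders should not share the same pipelines, another approach is needed. It works well when the project has several spiders; for a single spider, simply checking the item type is still the better choice.
The multi-pipeline approach is as follows.
Add the configuration directly at the top of the spider class: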

import scrapy
from ..items import XieItem, ComItem


class DmozSpider(scrapy.Spider):
    name = "ca"
    # allowed_domains takes bare domain names, without a scheme or trailing slash
    allowed_domains = ["wensong.xyz"]
    start_urls = [
        'http://wensong.xyz/',
    ]
    # Per-spider settings: this spider only uses ComchengNewPipeline
    custom_settings = {
        'ITEM_PIPELINES': {
            'xiecheng_new.pipelines.ComchengNewPipeline': 300,
        }
    }

    def parse(self, response, **kwargs):
        # Page title
        item = XieItem()
        title = response.xpath('//title//text()').getall()[0]
        item['name'] = title
        yield item
        # Link to the next page
        item1 = ComItem()
        url1 = 'http://wensong.xyz' + response.xpath('//p//a//@href').getall()[0]
        item1['url'] = url1
        yield item1
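The effect of custom_settings can be pictured with a small model in plain Python. This is only a sketch, not Scrapy's settings machinery: effective_pipelines, SpiderA and SpiderB are hypothetical names, and the model assumes that a spider-level ITEM_PIPELINES value replaces the project-level one as a whole rather than being merged with it.

```python
# Toy model of per-spider pipeline selection (NOT Scrapy's real settings code).
# Assumption: a spider's ITEM_PIPELINES replaces the project-wide dict entirely.

PROJECT_SETTINGS = {
    "ITEM_PIPELINES": {
        "xiecheng_new.pipelines.XiechengNewPipeline": 200,
        "xiecheng_new.pipelines.ComchengNewPipeline": 300,
    }
}


class SpiderA:  # stand-in for a spider that overrides the pipelines
    name = "ca"
    custom_settings = {
        "ITEM_PIPELINES": {"xiecheng_new.pipelines.ComchengNewPipeline": 300}
    }


class SpiderB:  # hypothetical second spider without any override
    name = "cb"
    custom_settings = {}


def effective_pipelines(spider_cls, project_settings):
    """Return pipeline paths for a spider, ordered by ascending priority."""
    settings = dict(project_settings)
    settings.update(getattr(spider_cls, "custom_settings", None) or {})
    ordered = sorted(settings["ITEM_PIPELINES"].items(), key=lambda kv: kv[1])
    return [path for path, _priority in ordered]


pipelines_a = effective_pipelines(SpiderA, PROJECT_SETTINGS)
pipelines_b = effective_pipelines(SpiderB, PROJECT_SETTINGS)
```

Under that assumption, SpiderA runs only ComchengNewPipeline, while SpiderB falls back to both project-level pipelines in priority order.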

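With this configuration, different spiders can use different pipelines. In short: when a single spider yields several item types, check the item type in the pipeline; when several spiders each need their own pipelines, set ITEM_PIPELINES in custom_settings at the top of each spider.
The next post will be a hands-on Scrapy project.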