[爬虫技巧] Scrapy中定制写入CSV文件的Pipeline

程序员文章站 2022-05-05 15:26:07

...

前言：

在使用Scrapy写项目时，难免有时会需要将数据写入csv文件中，自带的FEED写法如下：

settings.py （系统：Ubuntu 14）

FEED_URI = 'file:///home/eli/Desktop/qtw.csv'
FEED_FORMAT = 'CSV'

无需另写pipeline类，这种写法是最简单的。

但鱼和熊掌不可兼得，它的写法决定了它功能局限的特性，当我们遇到以下场景时，它无法满足：

1、过滤某些item（如包含空字段或其他非法字段值的item）

2、只将某些item字段写入csv文件

3、item去重

所以，当有更多需求时，我们仍需要定制自己的项目管道（Pipeline）,下面给出具体代码片段。

代码片段：

pipelines.py

class Pipeline_ToCSV(object):

    def __init__(self):
        #csv文件的位置,无需事先创建
        store_file = os.path.dirname(__file__) + '/spiders/qtw.csv'
        #打开(创建)文件
        self.file = open(store_file,'wb')
        #csv写法
        self.writer = csv.writer(self.file)
        
    def process_item(self,item,spider):
        #判断字段值不为空再写入文件
        if item['image_name']:
            self.writer.writerow((item['image_name'].encode('utf8','ignore'),item['image_urls']))
        return item
    
    def close_spider(self,spider):
        #关闭爬虫时顺便将文件保存退出
        self.file.close()

settings.py

ITEM_PIPELINES = {
    'yourproject.pipelines.Pipeline_ToCSV':100,
}

如有疑问，欢迎留言。