Scrapy: deduplicating items with a set
2024-03-01 18:15:22
1. In the pipeline (pipelines.py)
    from scrapy.exceptions import DropItem


    class CheckPipeline(object):
        """Check each item and drop duplicates."""

        def __init__(self):
            # IDs already seen during this crawl (kept in memory only)
            self.names_seen = set()

        def process_item(self, item, spider):
            # .get() avoids a KeyError when the field was never set
            lesson_id = item.get('single_lesson_id')
            if not lesson_id:
                raise DropItem("Missing single_lesson_id in %s" % item)
            if lesson_id in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.names_seen.add(lesson_id)
            return item
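To see the set-based check in isolation, here is a minimal sketch that runs without Scrapy installed. The `DropItem` class below is a local stand-in for `scrapy.exceptions.DropItem`, and `run_pipeline` and `sample_items` are hypothetical names made up for illustration; only the dedup logic itself comes from the pipeline above.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption: Scrapy not installed)."""


class CheckPipeline:
    """Same dedup idea as above: remember every 'single_lesson_id' in a set."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider=None):
        lesson_id = item.get("single_lesson_id")
        if not lesson_id:
            raise DropItem("Missing single_lesson_id in %s" % item)
        if lesson_id in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(lesson_id)
        return item


def run_pipeline(items):
    """Hypothetical helper: feed items through one pipeline instance,
    collecting the kept items and the reasons for each drop."""
    pipeline = CheckPipeline()
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropItem as exc:
            dropped.append(str(exc))
    return kept, dropped


# Made-up sample data: one repeated ID and one item without an ID.
sample_items = [
    {"single_lesson_id": "a1", "title": "Intro"},
    {"single_lesson_id": "a2", "title": "Loops"},
    {"single_lesson_id": "a1", "title": "Intro (repeat)"},
    {"title": "No id"},
]
kept, dropped = run_pipeline(sample_items)
```

Because the set lives on the pipeline instance, it only remembers IDs for the lifetime of one crawl; duplicates across separate runs would need a persistent check, for example against the database.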
2. In settings.py
    ITEM_PIPELINES = {
        'moocScrapy.pipelines.MoocscrapyPipeline': 400,
        'moocScrapy.pipelines.MongoPipeline': 300,
        'moocScrapy.pipelines.CheckPipeline': 100,
    }
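Note: CheckPipeline must run before the other pipelines, so give it the lowest number. Scrapy passes each item through the pipelines in ascending order of these values (customarily chosen in the 0-1000 range), so with a value of 100 the duplicate check runs before MongoPipeline (300) stores anything.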
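The ordering rule can be illustrated with a few lines of plain Python. This is only a sketch of the documented "lower value runs first" behaviour, not Scrapy's actual implementation, and `execution_order` is a hypothetical helper name.

```python
# Same mapping as in settings.py above.
ITEM_PIPELINES = {
    'moocScrapy.pipelines.MoocscrapyPipeline': 400,
    'moocScrapy.pipelines.MongoPipeline': 300,
    'moocScrapy.pipelines.CheckPipeline': 100,
}


def execution_order(pipelines):
    """Hypothetical helper: return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With these values, CheckPipeline comes first, so an item it drops never reaches MongoPipeline or MoocscrapyPipeline.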