
scrapy: deduplicating data with a set

程序员文章站 2024-03-01 18:15:22

1. In the pipeline


from scrapy.exceptions import DropItem


class CheckPipeline(object):
    """Check each item and drop it if its single_lesson_id was already seen."""

    def __init__(self):
        # IDs seen so far; kept in memory for the lifetime of this crawl only
        self.ids_seen = set()

    def process_item(self, item, spider):
        # .get() avoids a KeyError when the field was never populated
        lesson_id = item.get('single_lesson_id')
        if not lesson_id:
            raise DropItem("Missing single_lesson_id in %s" % item)
        if lesson_id in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(lesson_id)
        return item
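
To see the set-based check in isolation, here is a minimal sketch that runs without Scrapy installed. It assumes a local stand-in for `DropItem`, and the `run_pipeline` helper and sample items are hypothetical, made up for illustration; only the dedup logic mirrors the pipeline above.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption: lets the sketch run without Scrapy)."""


class CheckPipeline:
    """Drop items whose single_lesson_id is missing or was already seen."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider=None):
        lesson_id = item.get("single_lesson_id")
        if not lesson_id:
            raise DropItem(f"Missing single_lesson_id in {item}")
        if lesson_id in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item}")
        self.ids_seen.add(lesson_id)
        return item


def run_pipeline(items):
    """Hypothetical helper: feed items through one pipeline instance.

    Returns (kept_items, drop_reasons).
    """
    pipeline = CheckPipeline()
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropItem as exc:
            dropped.append(str(exc))
    return kept, dropped


# Made-up sample data: one duplicate ID and one empty ID
sample_items = [
    {"single_lesson_id": "a1", "title": "Intro"},
    {"single_lesson_id": "a1", "title": "Intro (again)"},
    {"single_lesson_id": "", "title": "No id"},
    {"single_lesson_id": "b2", "title": "Next"},
]
kept, dropped = run_pipeline(sample_items)
print([item["single_lesson_id"] for item in kept])  # ['a1', 'b2']
print(len(dropped))  # 2
```

Because the set lives only in the pipeline instance, it catches duplicates within a single crawl; it does not remember IDs between runs.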

2. In settings.py
ITEM_PIPELINES = {
   'moocScrapy.pipelines.MoocscrapyPipeline': 400,
   'moocScrapy.pipelines.MongoPipeline': 300,
   'moocScrapy.pipelines.CheckPipeline': 100,
}


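Note: CheckPipeline needs a higher priority than the other pipelines, which means a lower number. Scrapy passes each item through the pipelines in ascending order of these values (customarily in the 0-1000 range), so with 100 CheckPipeline runs first and duplicates are dropped before they reach MongoPipeline.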
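
As a rough illustration of that ordering, the sketch below sorts the same `ITEM_PIPELINES` dict by value. This only imitates the ordering rule in plain Python; it is not Scrapy's internal code, and `run_order` is a name made up for the example.

```python
ITEM_PIPELINES = {
    "moocScrapy.pipelines.MoocscrapyPipeline": 400,
    "moocScrapy.pipelines.MongoPipeline": 300,
    "moocScrapy.pipelines.CheckPipeline": 100,
}

# Lower value = earlier in the chain; keep only the class name for readability
run_order = [
    path.rsplit(".", 1)[-1]
    for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)  # ['CheckPipeline', 'MongoPipeline', 'MoocscrapyPipeline']
```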