Scrapy Framework Architecture (Part 2)

程序员文章站 2024-03-17 15:00:34

1. Saving data with a pipeline (using Python's built-in JSON format)

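(1) response is a "scrapy.http.response.html.HtmlResponse" object. You can run "xpath" and "css" queries on it to extract data.

(2) The extracted data is a "Selector" or a "SelectorList" object. To get the strings out of it, call "getall" or "get".

(3) getall: returns all of the matched text. The return value is a list of strings.

(4) get: returns the first matched text. The return value is a str (or None if nothing matched).

(5) To hand parsed data over to a pipeline, return it with "yield". Alternatively, collect all the items in a list and return them together at the end.

(6) item: define the data model in "items.py" and use it in place of plain dicts from then on.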

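(7) pipeline: this component is dedicated to saving data. Three of its methods are used most often:

"open_spider(self, spider)": called when the spider is opened.
"process_item(self, item, spider)": called whenever the spider passes an item over.
"close_spider(self, spider)": called when the spider is closed.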

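(8) To activate a pipeline, set "ITEM_PIPELINES" in "settings.py". For example (where scrapy_fdemo is the project folder name and QsbkPipeline is the class name in pipelines.py; the number sets the order in which pipelines run, lower values first):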

ITEM_PIPELINES = {
   'scrapy_fdemo.pipelines.QsbkPipeline': 300,
}
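
The points above can be put together in a minimal sketch of a pipeline that saves items with Python's built-in json module. The class name JsonWriterPipeline, the file name items.jsonl, the run_demo helper and the sample fields are assumptions made for illustration and do not come from the original project. The class only implements the three hook methods, so the sketch runs without importing Scrapy:

```python
import json


class JsonWriterPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="items.jsonl"):
        # The output path is an assumed name for illustration.
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        # dict(item) works for plain dicts and for scrapy.Item objects.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        # Return the item so that later pipelines can process it as well.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.fp.close()


def run_demo(path="items.jsonl"):
    """Drive the pipeline by hand, in the order Scrapy calls the hooks."""
    pipeline = JsonWriterPipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"author": "张三", "content": "hello"}, spider=None)
    pipeline.process_item({"author": "李四", "content": "world"}, spider=None)
    pipeline.close_spider(spider=None)
    # Read the file back to show what was stored.
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp]
```

In a real project the class would live in pipelines.py and be registered in ITEM_PIPELINES in the same way as QsbkPipeline. Passing ensure_ascii=False keeps Chinese text readable in the file instead of escaping it as \uXXXX sequences.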

2. Optimizing how the data is stored

When saving JSON data, you can use the JsonItemExporter and JsonLinesItemExporter classes, which make the job simpler.

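(1) "JsonItemExporter": writes all items into a single JSON array. start_exporting opens the array and finish_exporting closes it. The advantage is that the finished file is valid JSON. The disadvantage is that the file only becomes valid once the export has finished, and with a large data set it is memory-hungry to consume, because most JSON parsers load the whole array at once. Sample code: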

from scrapy.exporters import JsonItemExporter


# Storage approach 1: JsonItemExporter
class QsbkPipeline:
    def __init__(self):
        # The exporter writes bytes, so open the file in binary mode ("wb")
        # rather than text mode ("w" with encoding="utf-8").
        self.fp = open("qsbk_spider.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")
        # Writes the opening "[" of the JSON array.
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("Successfully")

    def process_item(self, item, spider):
        # The exporter serializes the item itself, so dict(item) is not needed.
        self.exporter.export_item(item)
        print("=" * 50)
        return item

    def close_spider(self, spider):
        # Writes the closing "]" of the JSON array.
        self.exporter.finish_exporting()
        self.fp.close()
        print("Close")

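(2) "JsonLinesItemExporter": writes each item to disk as soon as "export_item" is called. The disadvantage is that every item is a separate line, so the file as a whole is not a valid JSON document. The advantage is that each item is stored the moment it is processed, which saves memory and is safer: less is lost if the crawl is interrupted. Sample code: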

from scrapy.exporters import JsonLinesItemExporter


# Storage approach 2: JsonLinesItemExporter
class QsbkPipeline:
    def __init__(self):
        # The exporter writes bytes, so open the file in binary mode ("wb")
        # rather than text mode ("w" with encoding="utf-8").
        self.fp = open("qsbk_spider.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def open_spider(self, spider):
        print("Successfully")

    def process_item(self, item, spider):
        # The exporter serializes the item itself, so dict(item) is not needed.
        # Each call writes one JSON object followed by a newline.
        self.exporter.export_item(item)
        print("=" * 50)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("Close")
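
To make the trade-off between the two exporters concrete, the sketch below reads back both file layouts using only the standard json module. It does not call Scrapy; it only mimics the two file shapes. The sample items and the helper names read_json_array and read_json_lines are hypothetical:

```python
import io
import json

# Hypothetical sample items standing in for scraped data.
ITEMS = [
    {"author": "a", "content": "first"},
    {"author": "b", "content": "second"},
]

# Layout 1: one JSON array for the whole file (the JsonItemExporter shape).
array_text = json.dumps(ITEMS, ensure_ascii=False)

# Layout 2: one JSON object per line (the JsonLinesItemExporter shape).
lines_text = "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in ITEMS)


def read_json_array(fp):
    # The whole array has to be parsed in a single call.
    return json.load(fp)


def read_json_lines(fp):
    # Each line is parsed on its own, so items can be streamed one by one.
    for line in fp:
        if line.strip():
            yield json.loads(line)


array_items = read_json_array(io.StringIO(array_text))
line_items = list(read_json_lines(io.StringIO(lines_text)))

# Taken as a whole, the lines file is not a valid JSON document.
try:
    json.loads(lines_text)
    lines_file_is_valid_json = True
except json.JSONDecodeError:
    lines_file_is_valid_json = False
```

Both readers recover the same items. The difference is that the array must be parsed as a whole, while the line-based file can be processed one item at a time and remains usable up to the last complete line.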
