[Scrapy Framework Translation] Items in Detail

程序员文章站 2022-05-08 16:58:37


Version: Scrapy 2.4

Contents

Introduction
Using Items

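Introduction

The main goal of scraping is to extract structured data from unstructured sources, typically web pages.

This chapter walks through the examples from the Scrapy source. Personally, I find the data-handling steps rather cumbersome. The crawler examples in this column reduce the data-processing flow to its simplest form, so if the data handling in this article feels tedious, you can jump straight to those examples.

Items provide a dict-like container whose data can be read, written, and modified. The following item types are supported:

dictionaries: plain Python dicts.

Item objects: provide the same API as a dict.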

from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

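dataclass objects: let you declare the type and default value of each field; custom field metadata can be used to customize serialization.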

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int
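
As a hedged illustration of what the dataclass form offers, the sketch below declares a default value and attaches custom metadata through `dataclasses.field()`. The class name `BookItem`, its fields, and the choice of a `serializer` metadata key are assumptions made for this example. Note that the type annotations are declarations only and are not enforced at runtime.

```python
from dataclasses import asdict, dataclass, field, fields


@dataclass
class BookItem:  # hypothetical item class, for illustration only
    title: str
    price: float = 0.0  # default used when no price was scraped
    # Custom metadata travels with the field definition; the "serializer"
    # key is an assumption chosen for this sketch.
    tags: list = field(default_factory=list, metadata={"serializer": str})


# A str is accepted for price: annotations are not checked at runtime.
book = BookItem(title="Scrapy Basics", price="12.5")
book.tags.append("python")

# Metadata is read back from the field definitions, not from the instance.
tags_meta = {f.name: dict(f.metadata) for f in fields(BookItem)}["tags"]

print(asdict(book))
print(tags_meta)
```

Because nothing validates the values, any type conversion still has to happen in your own spider or pipeline code.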

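attrs objects: let you declare the type and default value of each field, along with converters and custom metadata that can be used to customize serialization.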

import attr

@attr.s
class CustomItem:
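    one_field = attr.ib(type=str)  # declared type (metadata only, not enforced)
    another_field = attr.ib(converter=float)  # converter coerces the assigned value to float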

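Using Items

Declaring fields

Item subclasses are declared with a simple class definition syntax and Field attributes. In other words, you define the field names for the content you want to scrape up front, and the scraped data is then filled into those fields, much like filling in the columns of a table.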

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Field data

Creating items

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
{'name': 'Desktop PC', 'price': 1000}

Getting field values

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product['price']
1000

# An unknown key raises KeyError, just as with a dict
>>> product['lala'] # getting the value of an undeclared field
Traceback (most recent call last):
    ...
KeyError: 'lala'

Setting field values

>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'

Accessing items through the dict API

>>> list(product.keys())
['name', 'price', 'last_updated']

>>> list(product.items())
[('name', 'Desktop PC'), ('price', 1000), ('last_updated', 'today')]

Copying items

product2 = product.copy()
# or
product2 = Product(product)

Creating items from dicts

>>> Product({'name': 'Laptop PC', 'price': 1500})
{'name': 'Laptop PC', 'price': 1500}

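Extending Items

# Add more fields by subclassing the original Item
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

# my_serializer is a hypothetical serializer, defined here only so the example runs
def my_serializer(value):
    return str(value)

# Extend existing field metadata: reuse the metadata of `name` and add a serializer
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)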

Using Items in a Spider

from myproject.items import Product

# parse() is a callback method defined on your Spider subclass
def parse(self, response):
    item = Product()
    # "xxx" is a placeholder class name; extract() returns a list of all matching text nodes
    item["name"] = response.xpath('//div[@class="xxx"]/text()').extract()
    item["price"] = response.xpath('//div[@class="xxx"]/text()').extract()
    item["stock"] = response.xpath('//div[@class="xxx"]/text()').extract()
    item["tags"] = response.xpath('//div[@class="xxx"]/text()').extract()
    item["last_updated"] = response.xpath('//div[@class="xxx"]/text()').extract()
    yield item
