A complete Scrapy spider code example
程序员文章站
2022-05-06 18:50:39
Create a new project:
scrapy startproject tutorial
Enter the tutorial directory and create quotes_spider.py under the spiders/ directory:
# coding: utf-8
import scrapy

from ..items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]

    def start_requests(self):
        # Crawl only the first page; widen the range to fetch more pages
        for i in range(1, 2):
            url = "http://quotes.toscrape.com/page/" + str(i) + "/"
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item per quote so yielded items don't share state
            item = QuotesItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item
Open items.py; the code is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Open pipelines.py and configure it to clean the scraped data:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item


class QuotesPipeline(object):
    def process_item(self, item, spider):
        # Strip the curly quotation marks that surround each quote's text;
        # the author field needs no cleaning
        if item.get('text'):
            item['text'] = item['text'].strip('\u201c\u201d')
        return item
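Quotes on the site are wrapped in curly quotation marks (“…”), which is what the pipeline strips off. The cleaning step can be sketched in plain Python; the sample string below is assumed for illustration:

```python
# Sample raw value as scraped from the site (assumed for illustration)
raw_text = "\u201cThe world as we have created it is a process of our thinking.\u201d"

# Strip the surrounding curly quotation marks, as the pipeline does
clean_text = raw_text.strip("\u201c\u201d")

print(clean_text)  # The world as we have created it is a process of our thinking.
```

Using str.strip with both quote characters removes them only from the ends of the string, so quotation marks inside the quote text are preserved.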
Open settings.py and update the relevant configuration:
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.QuotesPipeline': 500,
}
FEED_EXPORT_ENCODING = 'utf-8'
From the command line, run the spider:
scrapy crawl quotes -o quotes.jl
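The -o quotes.jl option exports the items in JSON Lines format, one JSON object per line. A minimal sketch of loading the output back in Python, assuming a sample line like the exporter would write:

```python
import json

# One line of quotes.jl, as the feed exporter would write it (sample data, assumed)
line = '{"text": "Sample quote", "author": "Sample Author", "tags": ["life", "books"]}'

# Each line parses independently to a dict
record = json.loads(line)
print(record["author"])
```

To load a whole file, iterate over its lines and call json.loads on each non-empty one; unlike a single JSON array, the JSON Lines format lets Scrapy append records as they are scraped.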