Scrapy-Based Web Scraping in Python (Writing to a Database)

程序员文章站 2022-04-19 14:07:54


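In the previous section we learned how to filter data out of a page's source inside the spider.
This section moves on to another Scrapy component, the pipeline, which post-processes the scraped data
(storing it in a MySQL database serves as the example here).

Although Scrapy's architecture offers many customizable modules, a complete Scrapy crawler only requires us to write
the spider and the pipeline: one collects the data, the other processes it.
Everything else, such as the downloader middleware and the engine core, runs automatically.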

Environment setup:

Since we are writing to MySQL, Python first needs to be able to talk to MySQL, which means installing the MySQL driver pymysql:

pip install pymysql
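The pipeline later in this article inserts into a table named quotes, which has to exist before the spider runs. The following is a minimal sketch for creating it, assuming the database itself already exists. The column types and the id column are illustrative guesses rather than part of the original setup, and create_quotes_table is a hypothetical helper name.

```python
# Hypothetical schema for the `quotes` table. The column types and the
# `id` column are assumptions chosen for illustration.
CREATE_QUOTES_TABLE = """
CREATE TABLE IF NOT EXISTS quotes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT NOT NULL,
    author VARCHAR(255) NOT NULL,
    tags VARCHAR(255)
) CHARACTER SET utf8mb4
"""


def create_quotes_table(host, user, password, database):
    """Connect with PyMySQL and create the table if it is missing."""
    import pymysql  # imported here so this module loads without the driver

    conn = pymysql.connect(host=host, user=user, password=password,
                           database=database, charset="utf8mb4")
    try:
        with conn.cursor() as cursor:
            cursor.execute(CREATE_QUOTES_TABLE)
        conn.commit()
    finally:
        conn.close()
```

It would be called once before the first crawl, for example create_quotes_table('localhost', 'root', 'xxxxxxxxx', 'mydatabase'), with your own credentials in place of the placeholders.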

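In Scrapy, the item is the bridge between the spider and the pipeline.

The spider scrapes the data, filters it, and writes it into an item,

then hands the item to the engine with yield, and the engine delivers it to the pipeline.

The pipeline opens the database connection and writes the data.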

Item:

Declare the data to be passed along in items.py

import scrapy

class MyItem(scrapy.Item):
    content = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Yes, these few lines are all it takes: just declare the field names.

Spider:

Make some changes to the parse() function in Spider01.py from the previous section

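import scrapy
from MyScrapyProject.items import MyItem  # the module and class names are your own; they must match items.py


class Spider01(scrapy.Spider):
    name = 'MyMainSpider'

    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        quote_list = response.css('div.quote')
        for quote in quote_list:
            print('Now loading a quote...')
            item = MyItem()  # create a fresh item for every quote
            item['content'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()

            # getall() returns a list of strings; join them into one comma-separated string.
            # The earlier file-writing version looped over the tags instead; use whichever fits.
            tags = quote.css('a.tag::text').getall()
            item['tags'] = ",".join(tags)
            yield item

# Previous section's version, which wrote to a text file:
#        with open('text.txt','w') as f:
#            for oneSentence in quote_list:
#                f.write(oneSentence.css('span.text::text').get()+'\n')
#                f.write(oneSentence.css('small.author::text').get()+'\n')
#                tag_list=oneSentence.css('a.tag');
#                for tag in tag_list:
#                    f.write(tag.css('::text').get()+' ')
#                f.write('\n')

A small problem worth noting:
commenting out a large chunk of code in the middle of a code block can cause odd indentation errors,
so the large commented-out block was moved to the end.

A few notes on yield (my understanding, which may not be fully accurate):
By the time yield runs, the item has been fully populated; yield hands it to the engine, which passes it to the pipeline.
parse() is a generator, so it pauses at yield and, when the engine asks for the next result, resumes from the statement after yield,
that is, it enters the next iteration of the for loop.
The benefit is that items are produced one at a time instead of being collected into a list first. Each quote gets its own item object, because Scrapy does not guarantee that the pipeline has finished with one item before parse() produces the next; reusing a single instance could let a later iteration overwrite data that is still being processed.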

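A Scrapy spider that uses a pipeline must yield its items from parse()

The main reason is that the Scrapy engine checks the type of each value the spider yields: a Request is placed in a queue of pending requests (maintained by Scrapy itself, so we do not need to manage it), and only an item is passed on to the pipeline.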
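To make this dispatch concrete, here is a toy model in plain Python. It is not Scrapy's actual engine: FakeRequest, toy_parse, and run_engine are hypothetical names, and the page URL is only an example. The sketch shows how values pulled from a generator can be routed by type, and that the generator advances only when the consumer asks for the next value.

```python
from collections import deque


class FakeRequest:
    """Stand-in for scrapy.Request (hypothetical, for illustration only)."""

    def __init__(self, url):
        self.url = url


def toy_parse():
    """Mimics a parse() generator: two items, then one follow-up request."""
    yield {"author": "A", "content": "first quote"}
    yield {"author": "B", "content": "second quote"}
    yield FakeRequest("http://quotes.toscrape.com/page/2/")


def run_engine(results):
    """Route each yielded value: requests are queued, items are stored."""
    request_queue = deque()   # plays the role of Scrapy's request queue
    stored_items = []         # plays the role of the pipeline's output
    for value in results:     # each step resumes toy_parse() after its yield
        if isinstance(value, FakeRequest):
            request_queue.append(value.url)
        else:
            stored_items.append(value)
    return request_queue, stored_items


queue, items = run_engine(toy_parse())
print(len(items), list(queue))
```

If toy_parse() built a list and returned it at the end, nothing would reach the consumer until the whole page had been processed; yielding lets each result be handled as soon as it is ready.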

Pipeline:

The pipeline below can basically be used as a template.
Do not forget to enable the pipeline in settings.py!!!
The ITEM_PIPELINES setting there is commented out by default.
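Enabling the pipeline comes down to uncommenting and editing one dictionary. The fragment below is a sketch that assumes the pipeline class is named MysqlPipeline and lives in pipelines.py of the MyScrapyProject package; adjust the dotted path to your own project layout.

```python
# settings.py (fragment)
# Maps each pipeline's import path to an order value. By convention the
# values fall in the 0-1000 range, and lower values run first.
ITEM_PIPELINES = {
    "MyScrapyProject.pipelines.MysqlPipeline": 300,
}
```

With several pipelines registered, each item passes through them in ascending order of these values.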


# from itemadapter import ItemAdapter
import pymysql.cursors


class MysqlPipeline:
    def __init__(self):
        self.mysql_url = 'localhost'
        self.mysql_db = 'mydatabase'

    def open_spider(self, spider):
        # Called once when the spider starts: open a single shared connection.
        self.mysql_conn = pymysql.connect(
            host=self.mysql_url,
            user='root',
            password='xxxxxxxxx',  # fill in your MySQL password
            database=self.mysql_db,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.mysql_conn.close()

    def process_item(self, item, spider):
        print('process the item')
        sql_write = "INSERT INTO quotes (content, author, tags) VALUES (%s, %s, %s);"
        try:
            with self.mysql_conn.cursor() as cursor:
                # Parameterized query: the driver escapes the values for us.
                cursor.execute(sql_write, (item.get("content", ""), item.get("author", ""), item.get("tags", "")))
                cursor.connection.commit()
        except Exception as e:
            print('Something wrong with MYSQL')
            print(e)

        # Return the item so that any later pipeline still receives it.
        return item

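The core is cursor.execute() and cursor.connection.commit().
Always, always, always call cursor.connection.commit(): PyMySQL does not autocommit by default, so without it the INSERT is never persisted.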
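The effect of a missing commit can be observed without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module purely as a stand-in: it is not what the pipeline uses, and its placeholder syntax is ? rather than %s. A second connection plays the role of another client looking at the same database; count_rows is a hypothetical helper and the row values are sample data.

```python
import os
import sqlite3
import tempfile


def count_rows(conn):
    """Return the number of rows currently visible to this connection."""
    return conn.execute("SELECT COUNT(*) FROM quotes").fetchall()[0][0]


with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "demo.db")

    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE quotes (content TEXT, author TEXT, tags TEXT)")
    writer.commit()

    reader = sqlite3.connect(db_path)  # a second, independent client

    # The INSERT opens a transaction that only the writer can see so far.
    writer.execute(
        "INSERT INTO quotes (content, author, tags) VALUES (?, ?, ?)",
        ("sample text", "sample author", "tag1,tag2"),
    )
    rows_before_commit = count_rows(reader)

    writer.commit()  # make the row durable and visible to other connections
    rows_after_commit = count_rows(reader)

    writer.close()
    reader.close()

print(rows_before_commit, rows_after_commit)
```

The same principle applies to the MySQL pipeline: until commit() is called, other clients querying the table do not see the inserted rows, and closing the connection discards them.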

Original article: https://blog.csdn.net/Cake_C/article/details/107135741
