
--Scrapy Crawler--


Create a Scrapy project from the command line:

scrapy startproject [project_name]
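For example, since the pipeline path used later in this article is boke.pipelines.BokePipeline, the project is presumably named boke; creating it and the layout Scrapy generates would look roughly like this:

scrapy startproject boke

boke/
    scrapy.cfg
    boke/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py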

Spider file:
    Create a new Python file in the spiders directory and add the following content.

import scrapy


class BokeSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['https://blog.csdn.net/zimoxuan_']

    def parse(self, response):
        # extract the article titles and the read-count labels from the blog list page
        titles = response.xpath(".//div[@class='article-item-box csdn-tracking-statistics']/h4/a/text()").extract()
        reads = response.xpath(".//div[@class='info-box d-flex align-content-center']/p[2]/span/text()").extract()
        print('***********************************************************************', len(reads))
        for i, j in zip(titles, reads):
            print(i, j)

    name: the spider's name (the one passed to scrapy crawl)

    start_urls: the list of URLs the spider requests automatically after it starts; as the sketch below shows, this is essentially shorthand for overriding start_requests
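For reference, a minimal sketch of what Scrapy does by default with start_urls:

    def start_requests(self):
        # one request per URL in start_urls; the responses are passed to parse()
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)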

PyCharm terminal (command line):

    list the project's spiders: scrapy list

    start a spider: scrapy crawl [name]
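Both commands are run from the project root (the directory containing scrapy.cfg). For the project built in this article, the usage would look roughly like:

scrapy list          # prints: blog
scrapy crawl blog    # starts the spider defined above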

 

Saving to the SQLite3 database (1):

Create the database in IPython:

ipython
import sqlite3
Blog = sqlite3.connect('blog.sqlite')
cre_tab = 'create table blog (title varchar(512), reads varchar(128))'
Blog.execute(cre_tab)
exit
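Before exiting, you can optionally confirm the table exists by querying sqlite_master (a quick sanity check, not a required step):

Blog.execute("select name from sqlite_master where type='table'").fetchall()
# expected: [('blog',)]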

Item (items.py):

class BokeItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    reads = scrapy.Field()
    # pass
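A scrapy.Item behaves like a dictionary that only accepts its declared fields, which is why the spider below assigns item['title'] and item['reads']. A small illustration (assuming the project package is named boke):

from boke.items import BokeItem

item = BokeItem()
item['title'] = 'test title'
item['reads'] = '阅读数:1'
print(dict(item))        # {'title': 'test title', 'reads': '阅读数:1'}
# item['author'] = 'x'   # KeyError: BokeItem does not support field: author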

pipelines.py:

class BokePipeline(object):
    def process_item(self, item, spider):
        # print('******************************************************************')
        print(spider.name)
        return item

In settings.py, find ITEM_PIPELINES and enable it:

ITEM_PIPELINES = {
    'boke.pipelines.BokePipeline': 300,
}
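The value 300 is the pipeline's order: values range from 0 to 1000 and pipelines run in ascending order. If you later add another pipeline (a hypothetical CleanPipeline, for example), a lower number would make it run before the database write:

ITEM_PIPELINES = {
    # 'boke.pipelines.CleanPipeline': 200,  # hypothetical: would run first
    'boke.pipelines.BokePipeline': 300,
}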

blog.py

import scrapy
from ..items import BokeItem


class BokeSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['https://blog.csdn.net/zimoxuan_']

    def parse(self, response):
        titles = response.xpath(".//div[@class='article-item-box csdn-tracking-statistics']/h4/a/text()").extract()
        reads = response.xpath(".//div[@class='info-box d-flex align-content-center']/p[2]/span/text()").extract()
        for i, j in zip(titles, reads):
            # create a fresh item for every article so that items already
            # yielded are not overwritten on the next iteration
            bl = BokeItem()
            bl['title'] = i
            bl['reads'] = j
            yield bl
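In the test output below, roughly half of the titles come out as pure whitespace ('\n        '), most likely because the <a> element contains more than one text node. A minimal cleanup, assuming each article is left with exactly one non-empty title text node, would be to strip and filter before zipping:

        # inside parse(), before the for loop
        titles = [t.strip() for t in titles if t.strip()]
        reads = [r.strip() for r in reads]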

Run (test):

******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:4', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:7', 'title': '\n        --django--      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:29', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:20', 'title': '\n        --scrapy爬虫--      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:24', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:22',
 'title': '\n        CF-Round #493 (Div. 1)-A. Convert to Ones      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:39', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:21',
 'title': '\n        CF-Round #494 (Div. 3)-C.Intense Heat      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:14', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:14', 'title': '\n        Html5--网页      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:14', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:17', 'title': '\n        VC++ 简单计算器      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:41', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:44', 'title': '\n        划分中文字符串,宽字符的插入      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:32', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:36', 'title': '\n        bzoj-4562:[Haoi2016]食物链      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:15', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:22', 'title': '\n        hdu-2795:Billboard      '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:10', 'title': '\n        '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:30', 'title': '\n        hdu:Minimum Inversion Number      '}

This output shows the items are reaching the pipeline, so we can move on to the next step.

Saving to the SQLite3 database (2):

This step is done in the pipeline file (pipelines.py):

# -*- coding: utf-8 -*-
import sqlite3

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BokePipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the database connection
        self.con = sqlite3.connect('blog.sqlite')
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        # parameterized query, so quotes in a title cannot break the SQL
        in_sql = "insert into blog (title, reads) values (?, ?)"
        self.cu.execute(in_sql, (item['title'], item['reads']))
        self.con.commit()
        print(spider.name)
        return item

    def close_spider(self, spider):
        # note: the hook Scrapy calls is close_spider, not spider_close
        self.con.close()
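After running scrapy crawl blog again, you can check that the rows actually reached the database; a quick check in IPython against the same blog.sqlite file:

import sqlite3

con = sqlite3.connect('blog.sqlite')
for title, reads in con.execute('select title, reads from blog'):
    print(title, reads)
con.close()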

Note: if the database file cannot be opened in the IDE (clicking it does nothing and its name shows a wavy underline), open the Database tool window, click the + sign, and choose SQLite under Data Source; a yellow exclamation mark at the bottom left will prompt you to download the driver files. Download them and click Apply, and the problem is solved.
