--Scrapy Crawler--
程序员文章站
2022-05-06 19:08:27
Create a Scrapy project from the command line:
scrapy startproject [project name]
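For reference, startproject generates a layout roughly like the following (a sketch; the exact set of files varies slightly across Scrapy versions, and newer ones also add middlewares.py):

```
boke/
    scrapy.cfg          # deploy configuration
    boke/               # the project's Python package
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/        # spider files go here
            __init__.py
```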
Spider file:
Create a new Python file in the spiders directory and add the following:

import scrapy

class BokeSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['https://blog.csdn.net/zimoxuan_']

    def parse(self, response):
        # print(response)
        titles = response.xpath(".//div[@class='article-item-box csdn-tracking-statistics']/h4/a/text()").extract()
        reads = response.xpath(".//div[@class='info-box d-flex align-content-center']/p[2]/span/text()").extract()
        print('***********************************************************************', len(reads))
        for i, j in zip(titles, reads):
            print(i, j)
name: the spider's name
start_urls: the list of URLs the spider crawls automatically on startup
In the PyCharm terminal:
List the project's spiders: scrapy list
Run a spider: scrapy crawl [name]
Saving to a sqlite3 database (1):
Create the database in IPython:

ipython
import sqlite3
Blog = sqlite3.connect('blog.sqlite')
cre_tab = 'create table blog (title varchar(512), reads varchar(128))'
Blog.execute(cre_tab)
exit
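The same table can be created from a short script rather than an interactive session. A minimal sketch (using an in-memory database here so it leaves no file behind; the article itself uses 'blog.sqlite'):

```python
import sqlite3

# Create the table the pipeline will write into.
con = sqlite3.connect(':memory:')
con.execute('create table blog (title varchar(512), reads varchar(128))')
con.commit()

# Verify the table exists by querying sqlite's schema catalog.
tables = [row[0] for row in
          con.execute("select name from sqlite_master where type='table'")]
print(tables)  # → ['blog']
con.close()
```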
Item:

class BokeItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    reads = scrapy.Field()
pipelines:

class BokePipeline(object):
    def process_item(self, item, spider):
        # print('******************************************************************')
        print(spider.name)
        return item
In settings.py, find ITEM_PIPELINES and enable the pipeline:
ITEM_PIPELINES = {
    'boke.pipelines.BokePipeline': 300,
}
blog.py:

import scrapy
from ..items import BokeItem

class BokeSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['https://blog.csdn.net/zimoxuan_']

    def parse(self, response):
        titles = response.xpath(".//div[@class='article-item-box csdn-tracking-statistics']/h4/a/text()").extract()
        reads = response.xpath(".//div[@class='info-box d-flex align-content-center']/p[2]/span/text()").extract()
        for i, j in zip(titles, reads):
            # Create a fresh item per entry; reusing a single instance
            # would yield the same object with overwritten fields.
            bl = BokeItem()
            bl['title'] = i
            bl['reads'] = j
            yield bl
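The zip() pairing used above can be checked in isolation. A toy sketch, with small stand-in lists in place of the real response.xpath(...).extract() results (the sample values are taken from the log output below):

```python
# Stand-ins for the extracted title and read-count lists.
titles = ['\n      --django--    ', '\n      --scrapy爬虫--    ']
reads = ['阅读数:7', '阅读数:20']

# zip() pairs the lists positionally and stops at the shorter one,
# which is why printing len(reads) is a useful sanity check.
pairs = [(t.strip(), r) for t, r in zip(titles, reads)]
print(pairs)
# → [('--django--', '阅读数:7'), ('--scrapy爬虫--', '阅读数:20')]
```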
Run (test):
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:4', 'title': '\n '}
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:7', 'title': '\n --django-- '}
... (the remaining posts produce similar entries; omitted here) ...
******************************************************************
blog
2018-08-06 14:24:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.csdn.net/zimoxuan_>
{'reads': '阅读数:30', 'title': '\n hdu:Minimum Inversion Number '}
This output shows the pipeline is receiving items correctly, so we can move on to the next step.
Saving to a sqlite3 database (2):
This step is done in the pipeline file (pipelines.py):
# -*- coding: utf-8 -*-
import sqlite3

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class BokePipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect('blog.sqlite')
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        # Use a parameterized query rather than str.format(): a title
        # containing a quote would break (or inject into) formatted SQL.
        in_sql = 'insert into blog (title, reads) values (?, ?)'
        self.cu.execute(in_sql, (item['title'], item['reads']))
        self.con.commit()
        print(spider.name)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not spider_close) when the spider ends.
        self.con.close()
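The insert that process_item performs can be exercised outside Scrapy. A minimal sketch against an in-memory database, with a made-up item showing why the parameterized form matters:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table blog (title varchar(512), reads varchar(128))')

# A hypothetical item whose title contains a single quote — this would
# break SQL assembled with str.format(), but placeholders handle it.
item = {'title': "it's a title", 'reads': '阅读数:4'}
con.execute('insert into blog (title, reads) values (?, ?)',
            (item['title'], item['reads']))
con.commit()

rows = con.execute('select title, reads from blog').fetchall()
print(rows)  # → [("it's a title", '阅读数:4')]
con.close()
```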
# Note: if the database file won't open in PyCharm (clicking does nothing, the name shows a squiggly underline): in the Database panel, click the + button and choose Data Source > SQLite. A yellow warning icon at the bottom left will prompt you to download the missing driver files; download them, click Apply, and the problem is solved.