Scrapy-Redis: RedisSpider and RedisCrawlSpider Explained
In the previous article, "scrapy-redis入门实战" (a hands-on introduction to scrapy-redis), we used scrapy-redis to deploy a distributed JD.com book crawler and scrape its data. One problem remained:
Every spider instance must start from start_urls when it launches, i.e. every instance requests the URLs in start_urls. These are duplicate requests and waste system resources.
To solve this, scrapy-redis provides two spider classes, RedisSpider and RedisCrawlSpider. A spider that inherits from either of them fetches its start URLs from a designated Redis list at startup. When any spider instance pops a URL from that list, the URL is removed from the list, so no other instance can read it again; instances that did not obtain an initial URL simply block and wait until new start URLs are pushed to the list or pending requests appear in the Redis requests queue.
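Under the hood this hand-off works because a Redis list pop is atomic: whichever instance pops a URL removes it for everyone else. Below is a minimal redis-py sketch of that behavior; the demo:start_urls key and the local Redis connection are assumptions made only for this illustration.

# Minimal sketch of why each start URL is consumed by exactly one consumer.
# Assumes a local Redis server and the redis-py package; the key name is hypothetical.
import redis

r = redis.Redis(host='localhost', port=6379)
r.lpush('demo:start_urls', 'http://example.com/')

# LPOP is atomic, so only one consumer ever receives a given URL;
# a second pop on the now-empty list returns None, which is why idle
# spider instances simply keep waiting for new entries.
first = r.lpop('demo:start_urls')   # b'http://example.com/'
second = r.lpop('demo:start_urls')  # None
print(first, second)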
Here we take crawling book information from dangdang.com as an example to briefly demonstrate how to use these two spiders.
settings.py is configured as follows:
# -*- coding: utf-8 -*-

BOT_NAME = 'dang_dang'

SPIDER_MODULES = ['dang_dang.spiders']
NEWSPIDER_MODULE = 'dang_dang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

######################################################
############ scrapy-redis related settings ###########
######################################################

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Let the scheduler store the requests queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make all spider instances share duplicate filtering through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Persist the requests queue in Redis so the crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Scheduling strategy for requests; priority queue by default
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Store scraped items in Redis for later processing
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
RedisSpider code example
# -*- coding: utf-8 -*-
import scrapy
import re
import urllib.parse
from copy import deepcopy
from scrapy_redis.spiders import RedisSpider


class DangdangSpider(RedisSpider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    redis_key = 'dangdang:book'
    pattern = re.compile(r"(http|https)://category.dangdang.com/cp(.*?).html", re.I)

    # def __init__(self, *args, **kwargs):
    #     # Dynamically define the allowed domains
    #     domain = kwargs.pop('domain', '')
    #     self.allowed_domains = filter(None, domain.split(','))
    #     super(DangdangSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # Extract book category information from the home page
        # First-level category elements
        div_list = response.xpath("//div[@class='con flq_body']/div")
        for div in div_list:
            item = {}
            item["b_cate"] = div.xpath("./dl/dt//text()").extract()
            item["b_cate"] = [i.strip() for i in item["b_cate"] if len(i.strip()) > 0]
            # Second-level category elements
            dl_list = div.xpath("./div//dl[@class='inner_dl']")
            for dl in dl_list:
                item["m_cate"] = dl.xpath(".//dt/a/@title").extract_first()
                # Third-level category elements
                a_list = dl.xpath("./dd/a")
                for a in a_list:
                    item["s_cate"] = a.xpath("./text()").extract_first()
                    item["s_href"] = a.xpath("./@href").extract_first()
                    if item["s_href"] is not None and self.pattern.match(item["s_href"]) is not None:
                        yield scrapy.Request(item["s_href"], callback=self.parse_book_list,
                                             meta={"item": deepcopy(item)})

    def parse_book_list(self, response):
        # Extract data from the book list page
        item = response.meta['item']
        li_list = response.xpath("//ul[@class='bigimg']/li")
        for li in li_list:
            item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
            if item["book_img"] == "images/model/guan/url_none.png":
                item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
            item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
            item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
            item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
            item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
            item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
            if item["book_publish_date"] is not None:
                item["book_publish_date"] = item["book_publish_date"].replace('/', '')
            item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
            yield deepcopy(item)

        # Extract the next page URL
        next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_url is not None:
            next_url = urllib.parse.urljoin(response.url, next_url)
            yield scrapy.Request(next_url, callback=self.parse_book_list, meta={"item": item})
When the start_urls list under the Redis key dangdang:book is empty, a freshly started DangdangSpider blocks and waits for data to be pushed into the list; the console output looks like this:
2019-05-08 14:02:53 [scrapy.core.engine] INFO: Spider opened
2019-05-08 14:02:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-08 14:02:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
At this point you need to push the spider's initial crawl URL into the start_urls list, which can be done with the following Redis command:
lpush dangdang:book http://book.dangdang.com/
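If you would rather push the start URL from Python than from redis-cli, a minimal sketch with redis-py (assuming Redis runs locally on the default port) would be:

import redis

# Push the initial URL for DangdangSpider into its redis_key list;
# equivalent to the lpush command above. Assumes a local Redis server.
r = redis.Redis(host='localhost', port=6379)
r.lpush('dangdang:book', 'http://book.dangdang.com/')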
Shortly after the command executes, DangdangSpider starts crawling; each scraped item has the structure of the dict built in parse_book_list above and is written to Redis by RedisPipeline.
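Because RedisPipeline is enabled in ITEM_PIPELINES, each yielded item is serialized to JSON and pushed to a Redis list, by default under a '<spider name>:items' key. A rough sketch for peeking at the scraped data, assuming that default key name:

import json
import redis

# Inspect the items stored by scrapy_redis.pipelines.RedisPipeline.
# 'dangdang:items' follows the pipeline's default '<spider name>:items'
# key pattern and is an assumption made for this sketch.
r = redis.Redis(host='localhost', port=6379)
print(r.llen('dangdang:items'))       # number of items scraped so far
raw = r.lindex('dangdang:items', 0)   # peek at the first stored item
if raw is not None:
    print(json.loads(raw))            # e.g. {"b_cate": [...], "book_name": "...", ...}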
RedisCrawlSpider code example
# -*- coding: utf-8 -*-
import scrapy
import re
import urllib.parse
from copy import deepcopy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class DangdangCrawler(RedisCrawlSpider):
    name = 'dangdang2'
    allowed_domains = ['dangdang.com']
    redis_key = 'dangdang:book'
    pattern = re.compile(r"(http|https)://category.dangdang.com/cp(.*?).html", re.I)

    rules = (
        Rule(LinkExtractor(allow=r'(http|https)://category.dangdang.com/cp(.*?).html'),
             callback='parse_book_list', follow=False),
    )

    def parse_book_list(self, response):
        # Extract data from the book list page
        item = {}
        item['book_list_page'] = response.url
        li_list = response.xpath("//ul[@class='bigimg']/li")
        for li in li_list:
            item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
            if item["book_img"] == "images/model/guan/url_none.png":
                item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
            item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
            item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
            item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
            item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
            item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
            if item["book_publish_date"] is not None:
                item["book_publish_date"] = item["book_publish_date"].replace('/', '')
            item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
            yield deepcopy(item)

        # Extract the next page URL
        next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_url is not None:
            next_url = urllib.parse.urljoin(response.url, next_url)
            yield scrapy.Request(next_url, callback=self.parse_book_list)
Like DangdangSpider, DangdangCrawler also blocks and waits when no initial URL is available, and it begins crawling as soon as an address appears in the start_urls list. The scraped items have the same structure as above, plus the book_list_page field.
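Because SCHEDULER_PERSIST is enabled, the shared requests queue and the duplicate filter outlive a restart. Both live under per-spider Redis keys, by default '<spider name>:requests' (a sorted set when PriorityQueue is used) and '<spider name>:dupefilter' (a set). A small sketch for checking the crawl state of DangdangCrawler, assuming those default key names:

import redis

# Inspect the shared scheduler state kept by scrapy-redis for the 'dangdang2' spider.
# The key names follow scrapy-redis defaults and are assumptions for this sketch.
r = redis.Redis(host='localhost', port=6379)
print(r.zcard('dangdang2:requests'))     # pending requests in the priority queue
print(r.scard('dangdang2:dupefilter'))   # fingerprints of requests already seen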
This concludes this look at RedisSpider and RedisCrawlSpider in scrapy-redis.