Python Crawlers -- Distributed Crawlers
scrapy-redis Distributed Crawler
Introduction
scrapy-redis cleverly uses Redis to implement the request queue and the items queue, and uses a Redis set to deduplicate requests, extending Scrapy from a single machine to many machines and making fairly large crawler clusters possible.
scrapy-redis is a set of Redis-based Scrapy components:
• Distributed crawling: multiple spider instances share one Redis request queue, which suits large, multi-domain crawler clusters well.
• Distributed post-processing: scraped items are pushed onto a Redis items queue, so multiple item-processing workers can be started to handle the scraped data, e.g. storing it in MongoDB or MySQL.
• Plug-and-play Scrapy components: scheduler + duplication filter, item pipeline, base spiders.
scrapy-redis Architecture
• Scheduler
The scrapy-redis scheduler relies on the uniqueness of Redis sets to implement the duplication filter (the dupefilter set stores the fingerprints of requests that have already been crawled). For each request newly generated by a spider, its fingerprint is checked against the dupefilter set in Redis, and only non-duplicate requests are pushed onto the Redis request queue. The scheduler then pops requests from the Redis request queue by priority and hands each one to a spider for processing.
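To make the dedup step concrete, here is a minimal sketch (not scrapy-redis's actual implementation) that mimics the scheduler's check-then-enqueue behaviour with redis-py; the key names example:dupefilter and example:requests and the local Redis address are assumptions:

import redis
from scrapy import Request
from scrapy.utils.request import request_fingerprint

r = redis.Redis(host='127.0.0.1', port=6379)

def enqueue_if_new(request):
    # SADD returns 1 only when the fingerprint was not already in the set
    fp = request_fingerprint(request)
    if r.sadd('example:dupefilter', fp):
        # The real scheduler serializes the whole Request and keeps a priority queue;
        # pushing only the URL keeps this sketch short
        r.lpush('example:requests', request.url)
        return True
    return False

enqueue_if_new(Request('http://www.example.com/?page=1'))  # True: enqueued
enqueue_if_new(Request('http://www.example.com/?page=1'))  # False: duplicate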
• Item Pipeline
Items scraped by the spider are handed to scrapy-redis's item pipeline, which stores them in the Redis items queue. Items can then be pulled from that queue at any time, making it easy to run a cluster of item-processing workers.
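As a sketch of such a worker, the snippet below pops serialized items from the Redis items queue and decodes them. By default scrapy-redis's RedisPipeline pushes JSON-serialized items to a list named <spider_name>:items; the queue name dmoz:items and the local Redis address here are assumptions:

import json
import redis

r = redis.Redis(host='127.0.0.1', port=6379)

while True:
    # BLPOP blocks until an item is available
    _, data = r.blpop('dmoz:items')
    item = json.loads(data)
    # At this point the item could be written to MongoDB, MySQL, etc.
    print(item)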
scrapy-redis Installation and Usage
Install scrapy-redis
Scrapy is already installed, so here we install scrapy-redis directly:
pip install scrapy-redis
Adapt the scrapy-redis example project
First get the scrapy-redis example from GitHub, then copy its example-project directory to the desired location:
git clone https://github.com/rolando/scrapy-redis.git
cp -r scrapy-redis/example-project ./scrapy-youyuan
Alternatively, download the whole project as scrapy-redis-master.zip.
After unpacking:
cp -r scrapy-redis-master/example-project/ ./redis-youyuan
cd redis-youyuan/
Run tree to view the project directory.
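For reference, the copied example-project has roughly the following layout (recalled from the scrapy-redis repository; the exact files may differ between versions):

example-project/
├── scrapy.cfg
└── example/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── dmoz.py
        ├── mycrawler_redis.py
        └── myspider_redis.py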
Modify settings.py
Note: Chinese comments in settings.py can raise encoding errors, so replace them with English ones.
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the scrapy-redis queues in Redis, so crawls can be paused and resumed
SCHEDULER_PERSIST = True
# Queue class used to order requests; the default sorts by priority
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Optional FIFO ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
# Optional LIFO ordering
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'

# Only effective with SpiderQueue or SpiderStack; maximum idle time before the spider closes
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Use RedisPipeline to store items in Redis
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Redis connection parameters
# REDIS_PASS is a password setting I added myself; supporting password-protected Redis
# requires a small change to the scrapy-redis source
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Custom redis client parameters (i.e.: socket timeout, etc.)
REDIS_PARAMS = {}
#REDIS_URL = 'redis://user:pass@hostname:9001'
#REDIS_PARAMS['password'] = 'itcast.cn'

LOG_LEVEL = 'DEBUG'

# The class used to detect and filter duplicate requests.
# The default (RFPDupeFilter) filters based on the request fingerprint using the
# scrapy.utils.request.request_fingerprint function. In order to change the way duplicates
# are checked you could subclass RFPDupeFilter and override its request_fingerprint method.
# This method should accept a Scrapy Request object and return its fingerprint (a string).
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
# By default, RFPDupeFilter only logs the first duplicate request.
# Setting DUPEFILTER_DEBUG to True will make it log all duplicate requests.
DUPEFILTER_DEBUG = True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, sdch',
}
Inspect pipelines.py
from datetime import datetime


class ExamplePipeline(object):
    def process_item(self, item, spider):
        # Stamp each item with the crawl time and the name of the spider that produced it
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item
Workflow
- Concept: several machines can be combined into a distributed cluster that runs the same program and jointly crawls the same set of network resources.
- Native Scrapy cannot do distributed crawling:
  - the scheduler cannot be shared
  - the pipelines cannot be shared
- Distributed crawling is implemented with Scrapy + Redis (Scrapy plus the scrapy-redis component).
- What the scrapy-redis component provides:
  - a scheduler and pipeline that can be shared
- Environment setup:
  - pip install scrapy-redis
- Coding steps:
  1. Create the project.
  2. cd proName
  3. Create a CrawlSpider-based spider file.
  4. Modify the spider class:
     - import: from scrapy_redis.spiders import RedisCrawlSpider
     - change the spider's parent class to RedisCrawlSpider
     - delete allowed_domains and start_urls
     - add a new attribute: redis_key = 'xxxx', the name of the shared scheduler queue
  5. Modify settings.py:
     - specify the pipeline:
       ITEM_PIPELINES = {
           'scrapy_redis.pipelines.RedisPipeline': 400
       }
     - specify the scheduler:
       # Dedup container class: uses a Redis set to store request fingerprints, persisting request deduplication
       DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
       # Use the scheduler shipped with scrapy-redis
       SCHEDULER = "scrapy_redis.scheduler.Scheduler"
       # Whether the scheduler persists, i.e. whether the Redis request queue and fingerprint set are kept when the crawl ends.
       # True means persist (do not clear the data); otherwise the data is cleared
       SCHEDULER_PERSIST = True
     - specify the Redis database:
       REDIS_HOST = '<IP of the Redis server>'
       REDIS_PORT = 6379
  6. Configure the Redis database (redis.windows.conf):
     - disable the default binding: line 56: #bind 127.0.0.1
     - disable protected mode: line 75: protected-mode no
  7. Start the Redis server (with the config file) and a client:
     - redis-server.exe redis.windows.conf
     - redis-cli
  8. Run the project:
     - scrapy runspider spider.py
  9. Push the start URL into the shared scheduler queue (sun); see the sketch after this list:
     - in redis-cli: lpush sun www.xxx.com
  10. In Redis:
     - xxx:items stores the scraped data
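A minimal sketch of steps 9 and 10 using redis-py instead of redis-cli, assuming Redis runs at 127.0.0.1:6379, the shared queue is named sun, and the spider is named fbs_obj as in the example that follows (RedisPipeline stores items under <spider_name>:items by default):

import json
import redis

r = redis.Redis(host='127.0.0.1', port=6379)

# Step 9: seed the shared scheduler queue with the start URL (placeholder URL as above)
r.lpush('sun', 'http://www.xxx.com')

# Step 10: once the crawl has produced data, inspect the items stored by RedisPipeline
for raw in r.lrange('fbs_obj:items', 0, 9):
    print(json.loads(raw))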
Distributed Crawling Example
Spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbs.items import FbsproItem


class FbsSpider(RedisCrawlSpider):
    name = 'fbs_obj'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sun'  # name of the shared scheduler queue

    link = LinkExtractor(allow=r'type=4&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/@title').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = FbsproItem()
            item['title'] = title
            item['status'] = status
            print(title)
            yield item
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for fbsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbs_obj'

SPIDER_MODULES = ['fbs_obj.spiders']
NEWSPIDER_MODULE = 'fbs_obj.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fbsPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 2

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'fbsPro.pipelines.FbsproPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Specify the pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Specify the scheduler
# Dedup container class: uses a Redis set to store request fingerprints, persisting request deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler shipped with scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether the scheduler persists, i.e. whether the Redis request queue and fingerprint set
# are kept when the crawl ends. True means persist (do not clear the data); otherwise the data is cleared
SCHEDULER_PERSIST = True

# Specify the Redis server
REDIS_HOST = '192.168.16.119'
REDIS_PORT = 6379
items.py
import scrapy


class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    status = scrapy.Field()