Example: Fetching Data with POST Requests in Scrapy-Redis
Preface
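When we crawl a site, Scrapy on its own is usually enough if the site is small.
On larger sites, though, a single Scrapy process starts to struggle.
It would be great if several Scrapy instances could crawl together; many hands make light work.
Unfortunately, Scrapy does not officially support several instances crawling the same site at once, although the official documentation does suggest one approach:
**Split the site into several parts and hand each part to a different Scrapy instance.**
That looks like a solution, but it is a hassle, because partitioning a site is tedious.
This is where our protagonist, scrapy-redis, comes in!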
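If you are reading this article you already know what Scrapy and scrapy-redis are, so the basic concepts are not covered here. By default scrapy-redis sends GET requests to fetch data. When a site needs POST requests instead, all you have to do is override the make_request_from_data method. Oddly, I could not find a short, clear answer to this online; perhaps it is considered too simple.
I use httpbin.org as the example site. First add the required configuration to settings.py, adjusting it to your own environment: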
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use Redis to schedule and store the request queue
    SCHEDULER_PERSIST = True  # do not clear the Redis queue, so the crawl can be paused and resumed
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # make all spiders deduplicate through Redis
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
    REDIS_URL = "redis://127.0.0.1:6379"
The spider code is as follows:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_redis.spiders import RedisSpider


    class HpbSpider(RedisSpider):
        name = 'hpb'
        redis_key = 'test_post_data'

        def make_request_from_data(self, data):
            """Returns a Request instance from data coming from Redis.

            By default, ``data`` is an encoded URL. You can override this method to
            provide your own message decoding.

            Parameters
            ----------
            data : bytes
                Message from redis.
            """
            # Send each Redis message as the "data" form field of a POST request
            return scrapy.FormRequest("https://www.httpbin.org/post",
                                      formdata={"data": data},
                                      callback=self.parse)

        def parse(self, response):
            print(response.body)
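To keep things simple the response is just printed here; in real use you would combine this with a pipeline that writes to a database or similar.
Then start the spider with scrapy crawl hpb. Because nothing has been written to test_post_data yet, the program waits after starting. Next, simulate writing data to the queue: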
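The spider above posts a single form field. If a request needs several fields, one option is to store a JSON object in each Redis message and decode it before building the request. The following is a minimal sketch under that assumption; the helper name decode_form_data and the JSON message format are hypothetical and not part of scrapy-redis.

```python
import json


def decode_form_data(data, encoding="utf-8"):
    """Turn a raw Redis message into a dict of string form fields.

    Assumption: producers push either a JSON object or a plain value.
    """
    text = data.decode(encoding) if isinstance(data, bytes) else str(data)
    try:
        payload = json.loads(text)
    except ValueError:
        # Not JSON at all: treat the whole message as one field named "data"
        return {"data": text}
    if not isinstance(payload, dict):
        # JSON scalar or list (for example b"42"): keep the text as one field
        return {"data": text}
    # Form fields are sent as text, so convert every key and value to str
    return {str(key): str(value) for key, value in payload.items()}
```

Inside make_request_from_data the result could then be passed as formdata=decode_form_data(data), with the producer pushing json.dumps(...) strings instead of bare numbers.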
    import redis

    rd = redis.Redis('127.0.0.1', port=6379, db=0)
    for i in range(1000):
        rd.lpush('test_post_data', i)
At this point you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST > (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "accept-encoding": "gzip,deflate", \n "accept-language": "en", \n "content-length": "6", \n "content-type": "application/x-www-form-urlencoded", \n "host": "", \n "user-agent": "scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
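As for duplicate data: if the POSTed data is identical to that of an earlier request, the request is not sent, because the request body is part of the fingerprint used for deduplication. In the special case where POSTing the same data returns a different result each time, adding dont_filter=True did not help here. The RFPDupeFilter class itself does not look at this parameter (the flag is normally checked by the scheduler before the filter is called), so the filter has to be overridden if such requests are still being dropped.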
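To see why a repeated POST body is dropped, it may help to look at a simplified picture of fingerprint-based deduplication. The sketch below is an illustration only: the names simple_fingerprint and SeenFilter are hypothetical, the set stands in for the Redis set that scrapy-redis uses, and Scrapy's real fingerprint also canonicalizes the URL, which is skipped here.

```python
import hashlib


def simple_fingerprint(method, url, body=b""):
    """Hash the method, URL and body of a request into one hex digest."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class SeenFilter:
    """In-memory stand-in for a dupefilter backed by a set of fingerprints."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url, body=b""):
        """Return True if an identical request was recorded before."""
        fp = simple_fingerprint(method, url, body)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

With this model, two POST requests to the same URL count as duplicates only when their bodies are byte-for-byte identical, which matches the behaviour described above: the values 0 to 999 all go through, while pushing the same value twice would produce only one request.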
Summary
That is all for this article. I hope it is a useful reference for your study or work. Thank you for your support.