How to assign a different db to each scrapy-redis crawler project on Windows, instead of only using db0
程序员文章站
2022-06-10 16:13:12
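1. Background
Redis is configured with 16 databases by default: db0 to db15. When you write a distributed crawler with scrapy-redis, it uses db0 by default to store the dupefilter fingerprints, the request (seed) queue, and the item data. In practice we rarely have just one crawler project, and if they all share one database their keys are easy to mix up. Assigning a different db to each crawler project is therefore well worth doing.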
2. Environment
- OS: Windows 7
- scrapy-redis
- redis 3.0.5
- python 3.6.1
3. Analysis
- First, let's read the scrapy-redis source to find where the db is set.
- Step 1: .\Lib\site-packages\scrapy_redis\scheduler.py
    @classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }
        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val
        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])
        # Initialize the redis server connection.
        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()
        return cls(server=server, **kwargs)
This calls the from_settings function in connection.py to initialize the Redis server connection.
- Step 2: .\Lib\site-packages\scrapy_redis\connection.py
def get_redis_from_settings(settings):
    """Returns a redis client instance from given Scrapy settings object.

    This function uses ``get_client`` to instantiate the client and uses
    ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
    can override them using the ``REDIS_PARAMS`` setting.

    Parameters
    ----------
    settings : Settings
        A scrapy settings object. See the supported settings below.

    Returns
    -------
    server
        Redis client instance.

    Other Parameters
    ----------------
    REDIS_URL : str, optional
        Server connection URL.
    REDIS_HOST : str, optional
        Server host.
    REDIS_PORT : str, optional
        Server port.
    REDIS_ENCODING : str, optional
        Data encoding.
    REDIS_PARAMS : dict, optional
        Additional client parameters.

    """
    params = defaults.REDIS_PARAMS.copy()
    # This is the key point: custom redis parameters (such as db) can be
    # supplied here through the REDIS_PARAMS setting.
    params.update(settings.getdict('REDIS_PARAMS'))
    # XXX: Deprecate REDIS_* settings.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val

    # Allow ``redis_cls`` to be a path to a class.
    if isinstance(params.get('redis_cls'), six.string_types):
        params['redis_cls'] = load_object(params['redis_cls'])

    return get_redis(**params)


# Backwards compatible alias.
from_settings = get_redis_from_settings
As the code above shows, the parameters are read from the REDIS_PARAMS entry in settings and then passed to get_redis(**params) to initialize the redis server connection, as shown below:
def get_redis(**kwargs):
    """Returns a redis client instance.

    Parameters
    ----------
    redis_cls : class, optional
        Defaults to ``redis.StrictRedis``.
    url : str, optional
        If given, ``redis_cls.from_url`` is used to instantiate the class.
    **kwargs
        Extra parameters to be passed to the ``redis_cls`` class.

    Returns
    -------
    server
        Redis client instance.

    """
    redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
    url = kwargs.pop('url', None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    else:
        return redis_cls(**kwargs)
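The merge order in get_redis_from_settings can be reproduced without Scrapy or a Redis server. The following is a minimal sketch, not the library's code: DEFAULT_REDIS_PARAMS and its values, the plain dict used in place of a Scrapy Settings object, and the name build_redis_params are all assumptions made for illustration.

```python
# Stand-ins for scrapy_redis.defaults.REDIS_PARAMS and
# scrapy_redis.connection.SETTINGS_PARAMS_MAP (values assumed for illustration).
DEFAULT_REDIS_PARAMS = {"socket_timeout": 30, "retry_on_timeout": True}
SETTINGS_PARAMS_MAP = {
    "REDIS_URL": "url",
    "REDIS_HOST": "host",
    "REDIS_PORT": "port",
    "REDIS_ENCODING": "encoding",
}


def build_redis_params(settings):
    """Mimic the merge order used by get_redis_from_settings."""
    params = DEFAULT_REDIS_PARAMS.copy()
    # 1. REDIS_PARAMS overrides the library defaults (this is where db goes).
    params.update(settings.get("REDIS_PARAMS", {}))
    # 2. Individual REDIS_* settings override any matching keys.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val
    return params


# A plain dict stands in for the Scrapy settings object here.
example_settings = {
    "REDIS_HOST": "192.168.1.99",
    "REDIS_PORT": 6379,
    "REDIS_PARAMS": {"password": "redisPasswordTest123456", "db": 2},
}
merged = build_redis_params(example_settings)
print(merged)
```

The merged dict is what would be handed to the redis client constructor, so a db key placed in REDIS_PARAMS reaches the client unchanged.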
4. Method
- After the analysis above, the fix is simple: just configure the REDIS_PARAMS settings entry for the spider.
- The password is set the same way.
from scrapy_redis.spiders import RedisSpider


# Use db2 for this spider
class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'xxxx'
    redis_key = 'xxxx:start_urls'
    # ……
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOAD_DELAY': 0,
        # Connection parameters for the redis server
        'REDIS_HOST': '192.168.1.99',
        'REDIS_PORT': 6379,
        # The redis password and which database to use
        'REDIS_PARAMS': {
            'password': 'redisPasswordTest123456',
            'db': 2,
        },
    }
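With the custom_settings above, the spider's dupefilter, request queue, and item keys are stored in db2 instead of db0.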
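When several projects share one Redis instance, it may help to keep the project-to-db assignment in a single place so that two projects do not end up on the same index by accident. The sketch below is only one way to do this; PROJECT_DB, its project names, and redis_params_for are hypothetical and are not part of scrapy-redis. It assumes the default Redis limit of 16 databases.

```python
# Hypothetical project-name -> db-index table; the names are illustrative.
PROJECT_DB = {"news": 1, "shop": 2, "forum": 3}

MAX_DB_INDEX = 15  # assumes the default Redis configuration (db0 to db15)


def redis_params_for(project, password=None):
    """Build a REDIS_PARAMS dict for the given project."""
    if project not in PROJECT_DB:
        raise KeyError("no db assigned to project %r" % project)
    db = PROJECT_DB[project]
    if not 0 <= db <= MAX_DB_INDEX:
        raise ValueError("db index %d is outside 0..%d" % (db, MAX_DB_INDEX))
    params = {"db": db}
    if password:
        params["password"] = password
    return params


shop_params = redis_params_for("shop", password="redisPasswordTest123456")
print(shop_params)
```

A spider could then set 'REDIS_PARAMS': redis_params_for('shop', password=...) inside its custom_settings instead of hard-coding the index.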
Notes:
After changing the database, you must also select the same database when adding start_urls and when dumping data from redis to mongodb, as follows:
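import redis
redis_Host = '192.168.1.99'  # same host as REDIS_HOST in the spider settings
# Create the redis connection, selecting the same db and password the spider uses
rediscli = redis.Redis(host=redis_Host, port=6379, db=2, password='redisPasswordTest123456')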
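Seeding start_urls into the right database could look like the sketch below. It assumes the spider reads its start URLs from a Redis list (the scrapy-redis default unless start URLs are configured as a set); the helper names seed_start_urls and connect are hypothetical, and the example URL is a placeholder.

```python
def seed_start_urls(client, redis_key, urls):
    """Push URLs onto the list the spider reads; return how many were pushed."""
    urls = list(urls)
    if urls:
        client.lpush(redis_key, *urls)
    return len(urls)


def connect(host, db, password=None, port=6379):
    """Create a redis-py client bound to the given db (needs `pip install redis`)."""
    import redis  # imported lazily so seed_start_urls has no hard dependency
    return redis.Redis(host=host, port=port, db=db, password=password)


# Usage against a live server (not executed here):
# client = connect("192.168.1.99", db=2, password="redisPasswordTest123456")
# seed_start_urls(client, "xxxx:start_urls", ["https://example.com/"])
```

Because the db index is fixed when the client is created, pushing through a client opened on db0 would leave the spider on db2 waiting on an empty queue.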