Scrapy: CrawlerRunner does not enable pipelines, so data never reaches the database
Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
The same example using CrawlerRunner, which leaves control of the Twisted reactor to your own code:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
The same pattern, applied to the spiders in this project. Note that as written the four crawls still run concurrently; runner.join() simply returns a deferred that fires once all of them have finished. (A truly sequential version, chaining the deferreds, is sketched after this code.)
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from ThreatIntellgence.spiders.antiy import AntiySpider
from ThreatIntellgence.spiders.kaspersky import KasperskySpider
from ThreatIntellgence.spiders.a360safety import A360safetySpider
from ThreatIntellgence.spiders.AccentureSecurity import AccenturesecuritySpider
configure_logging()
runner = CrawlerRunner()
runner.crawl(AntiySpider)
runner.crawl(KasperskySpider)
runner.crawl(A360safetySpider)
runner.crawl(AccenturesecuritySpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
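For reference, here is what genuinely sequential execution looks like, chaining the deferreds with inlineCallbacks as in the Scrapy documentation. A minimal sketch reusing the spiders imported above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
# Note: like the examples above, this runner is created without project settings.
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(AntiySpider)
    yield runner.crawl(KasperskySpider)
    yield runner.crawl(A360safetySpider)
    yield runner.crawl(AccenturesecuritySpider)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl finishes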
Note
This approach does not enable Scrapy's pipelines; it only runs the spiders. The methods defined in your pipeline classes are never invoked, so items cannot be stored to the database through the pipelines. The reason is that CrawlerRunner() is constructed here without any settings, so the ITEM_PIPELINES configured in settings.py is never loaded.
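If you want to keep everything in one process, passing the project settings when constructing the runner usually restores the pipelines. A minimal sketch, assuming the script runs inside the Scrapy project directory so that scrapy.cfg can be found:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
# get_project_settings() reads settings.py, including ITEM_PIPELINES,
# so the pipelines run and items can reach the database.
runner = CrawlerRunner(get_project_settings())
runner.crawl(AntiySpider)
runner.crawl(KasperskySpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()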
If you would rather not touch the internal API, you can launch each spider through the scrapy crawl command instead. Each call spawns a fresh Scrapy process that loads settings.py in full, including ITEM_PIPELINES, so the pipelines run and the data is stored:
import os
# CLOSESPIDER_TIMEOUT=30 force-closes each spider once it has been running
# for 30 seconds; os.system blocks, so the crawls run one after another.
os.system("scrapy crawl antiy -s CLOSESPIDER_TIMEOUT=30")
os.system("scrapy crawl kaspersky -s CLOSESPIDER_TIMEOUT=30")
os.system("scrapy crawl a360safety -s CLOSESPIDER_TIMEOUT=30")
os.system("scrapy crawl Akamai -s CLOSESPIDER_TIMEOUT=30")