Web Crawlers (18): The Scrapy Framework (5) Scrapy Generic Spiders
1. Scrapy Generic Spiders
1.1 CrawlSpider
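A Rule accepts the following parameters:
- link_extractor: a Link Extractor object. It tells the spider which links to extract from each crawled page, and a Request is generated automatically for every extracted link. It is itself a data structure; an LxmlLinkExtractor object is the usual choice for this argument.
- callback: the callback function, with the same meaning as the callback of a Request. It is called for every link obtained from link_extractor. It receives a response as its first argument and returns a list containing Item or Request objects. Note: avoid using parse() as the callback. CrawlSpider uses the parse() method to implement its own logic, so if parse() is overridden, CrawlSpider will fail to run.
- cb_kwargs: a dict containing the keyword arguments passed to the callback function.
- follow: a Boolean, True or False, that specifies whether links extracted from the response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
- process_links: a handler function that is called with the list of links obtained from link_extractor. It is mainly used for filtering.
- process_request: also a handler function. It is called for every Request extracted by this Rule so that the Request can be processed. It must return a Request or None.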
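As a rough illustration of the process_links hook, the sketch below filters a list of extracted links before any requests would be built. The helper name drop_login_links, the example URLs, and the use of SimpleNamespace as a stand-in for Scrapy's Link objects (which expose a url attribute) are all assumptions, chosen so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def drop_login_links(links):
    """Keep only links whose URL does not contain 'login'.

    Hypothetical helper, not part of Scrapy. In a spider it would be
    wired in as Rule(..., process_links=drop_login_links).
    """
    return [link for link in links if "login" not in link.url]


# Stand-ins for the Link objects a link extractor would hand over.
links = [
    SimpleNamespace(url="https://example.com/book/1"),
    SimpleNamespace(url="https://example.com/login?next=/book/1"),
]
kept = drop_login_links(links)
print([link.url for link in kept])
```

Because the function only reads the url attribute and returns a list, the same callable should work unchanged on real Link objects.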
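1.2 Item Loader
We have seen how the Rules of a CrawlSpider define the crawling logic for pages, which is one part of making a spider configurable. However, a Rule does not define how Items are extracted. For Item extraction we need another module, Item Loader.
Item Loader provides a convenient mechanism for extracting Items. It offers a set of APIs that analyze raw data and assign values to an Item. An Item is the container that holds scraped data, while an Item Loader is the mechanism that fills the container. With it, data extraction becomes more rule-based.
Item Loader API parameters: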
from scrapy.loader import ItemLoader
from scrapydemo.items import Product


def parse(self, response):
    loader = ItemLoader(item=Product(), response=response)
    # The add_xpath(), add_css() and add_value() calls described below go here.
    return loader.load_item()
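Here we first declare a Product Item and instantiate an Item Loader with that Item and the response object. We call add_xpath() to extract data from two different locations and assign it to the name attribute, then use add_xpath(), add_css(), add_value() and similar methods to assign the other attributes in turn, and finally call load_item() to build the Item. This approach is fairly rule-based: the parameters and rules can be pulled out into a configuration file or stored in a database, which makes the spider configurable.
In addition, every field of an Item Loader has an Input Processor and an Output Processor. The Input Processor processes data as soon as it is received; its results are collected and kept inside the ItemLoader but are not yet assigned to the Item. Once all the data has been collected, load_item() is called to populate and produce the Item object. During that call the Output Processor first processes the collected data, and the result is then stored in the Item. That is how the Item is generated.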
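The input/output processor flow described above can be modelled in a few lines of plain Python. This is a toy sketch, not Scrapy's ItemLoader: the class name MiniLoader, the two lambda processors, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


class MiniLoader:
    """Toy model of the input/output processor flow (not Scrapy's ItemLoader)."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self._values = defaultdict(list)  # collected values, per field

    def add_value(self, field, values):
        # The input processor runs as soon as data arrives; its result is
        # stored inside the loader, not in the item.
        self._values[field].extend(self.input_processor(values))

    def load_item(self):
        # The output processor runs once per field when the item is built.
        return {f: self.output_processor(v) for f, v in self._values.items()}


loader = MiniLoader(
    input_processor=lambda values: [v.strip() for v in values],
    output_processor=lambda values: " ".join(values),
)
loader.add_value("name", ["  Plasma TV "])
loader.add_value("name", [" 42 inch"])
item = loader.load_item()
print(item)
```

The point of the split is that cleaning happens per incoming value, while the decision about how to combine several values for one field is deferred until the item is produced.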
(1) Identity
Identity is the simplest processor: it returns the data unchanged.
(2) TakeFirst
TakeFirst returns the first non-empty value of a list, similar to extract_first(). It is commonly used as an Output Processor.
# Newer Scrapy releases expose these processors in itemloaders.processors.
from scrapy.loader.processors import TakeFirst

processor = TakeFirst()
print(processor(['', 1, 2, 3]))
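To make "first non-empty value" concrete, here is a plain-Python approximation of that selection rule. The function name take_first is an assumption and the code is a sketch of the behaviour, not Scrapy's source.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


print(take_first(["", 1, 2, 3]))
print(take_first([None, "", 0, 5]))
```

In this sketch only None and the empty string count as empty, so a value such as 0 is returned rather than skipped.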
(3) Join
from scrapy.loader.processors import Join

processor = Join()
print(processor(['one', 'two', 'three']))
The output is one two three.
from scrapy.loader.processors import Join

processor = Join(',')
print(processor(['one', 'two', 'three']))
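(4) Compose
from scrapy.loader.processors import Compose

processor = Compose(str.upper, lambda s: s.strip())
print(processor(' hello world'))
The output is HELLO WORLD.
Here we construct a Compose processor and pass it a string that starts with a space. The Compose processor takes two arguments: the first is str.upper, which converts all letters to upper case; the second is an anonymous function that calls strip() to remove leading and trailing whitespace. Compose calls the two arguments in order, so the returned string is entirely upper case and has the leading space removed.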
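The chaining just described can be sketched without Scrapy. The helper name compose below is an assumption; it only reproduces the "call each function in order on the previous result" idea and leaves out any extra options the real processor may offer.

```python
from functools import reduce


def compose(*functions):
    """Return a callable that feeds its input through each function in order."""
    def run(value):
        return reduce(lambda acc, fn: fn(acc), functions, value)
    return run


processor = compose(str.upper, lambda s: s.strip())
result = processor(" hello world")
print(repr(result))
```

The order matters in general: each function sees the output of the one before it, not the original input.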
(5) MapCompose
from scrapy.loader.processors import MapCompose

processor = MapCompose(str.upper, lambda s: s.strip())
print(processor(['hello', 'world', 'python']))
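Where Compose passes the whole input through the chain, MapCompose works element by element over a list. A simplified plain-Python sketch of that idea follows; the name map_compose and the sample strings are assumptions, and details of the real processor such as flattening nested results are deliberately left out.

```python
def map_compose(*functions):
    """Apply each function, in order, to every element of the input list."""
    def run(values):
        for fn in functions:
            values = [fn(v) for v in values]
        return values
    return run


processor = map_compose(str.upper, lambda s: s.strip())
result = processor([" hello", "world ", "python"])
print(result)
```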
(6) SelectJmes
pip install jmespath
from scrapy.loader.processors import SelectJmes

processor = SelectJmes('foo')
print(processor({'foo': 'bar'}))
These are some of the commonly used Processors, and we will use Processors to handle data in the example in this section. Next, we walk through an example to see how Item Loader is used.
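1.3 Generic Spider Example
Fetching the numbers on their own is feasible: send a request with requests, use a regular expression to match the character elements, match again for the URL of the font file that maps them, parse the retrieved data into a dictionary with a font toolkit, and then match the encodings. The encodings Qidian returns map to English number words, and the English number words map to Arabic digits; concatenating them gives the actual numeric string. However, this sends several requests per page and greatly reduces crawl efficiency. For this bulk crawl we therefore skipped those numbers and chose the rating, which is easier to obtain. The rating defaults to 0 and is updated from JS data pushed by the backend.
1.3.1 Creating the Project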
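The two-step lookup described above (encoding to English number word, then word to Arabic digit) might look like the sketch below. The code points in CODE_TO_WORD are invented for illustration; in practice that table would come from parsing the downloaded font file, which this sketch does not do.

```python
# Hypothetical table of the kind a font parser might yield:
# obfuscated code point -> glyph name (an English number word).
CODE_TO_WORD = {100233: "one", 100235: "seven", 100236: "period", 100238: "four"}

# Fixed table from English number words to the characters they stand for.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "period": ".",
}


def decode_number(code_points):
    """Translate obfuscated code points into the digit string they render as."""
    return "".join(WORD_TO_DIGIT[CODE_TO_WORD[cp]] for cp in code_points)


decoded = decode_number([100233, 100235, 100236, 100238])
print(decoded)
```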
scrapy startproject qd
scrapy genspider -l
scrapy genspider -t crawl read qidian.com
1.3.2 Defining the Rules
start_urls = ['https://www.qidian.com/all?orderid=&style=1&pagesize=20&siteid=1&pubflag=0&hiddenfield=0&page=1']
rules = (
    # Match every listing-page URL and keep following links found on those pages
    Rule(LinkExtractor(allow=(r'https://www.qidian.com/all\?orderid=\&style=1\&pagesize=20\&siteid=1\&pubflag=0\&hiddenfield=0\&page=(\d+)')), follow=True),
    # Match detail pages; do not follow links any deeper
    Rule(LinkExtractor(allow=r'https://book.qidian.com/info/(\d+)'), callback='parse_item', follow=False),
)
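Since the allow arguments are regular expressions matched against absolute URLs, they can be checked with the standard re module before running a crawl. The sketch below reuses the two patterns from the rules; the page number 7 and the book id 1234567 are made-up sample values.

```python
import re

# The two allow patterns from the rules above.
LIST_PAGE = (r'https://www.qidian.com/all\?orderid=\&style=1\&pagesize=20'
             r'\&siteid=1\&pubflag=0\&hiddenfield=0\&page=(\d+)')
DETAIL_PAGE = r'https://book.qidian.com/info/(\d+)'

# Sample URLs (the page number and book id are invented).
list_url = ('https://www.qidian.com/all?orderid=&style=1&pagesize=20'
            '&siteid=1&pubflag=0&hiddenfield=0&page=7')
detail_url = 'https://book.qidian.com/info/1234567'

list_match = re.search(LIST_PAGE, list_url)
detail_match = re.search(DETAIL_PAGE, detail_url)
print(list_match.group(1), detail_match.group(1))

# A listing URL must not be mistaken for a detail page.
cross_match = re.search(DETAIL_PAGE, list_url)
print(cross_match)
```

A quick check like this helps catch a pattern that silently matches nothing, in which case the corresponding rule would never fire.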
1.3.3 Parsing the Pages
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class QdItem(Item):
    # define the fields for your item here like:
    book_name = Field()  # book title
    author = Field()     # author
    state = Field()      # status
    type = Field()       # genre
    about = Field()      # short description
    score = Field()      # rating
    story = Field()      # synopsis
    news = Field()       # latest chapter
    # Helper methods of the spider class. extract_first(default='') is used
    # instead of extract()[0], which raises IndexError when nothing matches.

    def get_book_name(self, response):
        book_name = response.xpath('//h1/em/text()').extract_first(default='')
        if len(book_name) > 0:
            book_name = book_name.strip()
        else:
            book_name = 'null'
        return book_name

    def get_author(self, response):
        author = response.xpath('//h1/span/a/text()').extract_first(default='')
        if len(author) > 0:
            author = author.strip()
        else:
            author = 'null'
        return author

    def get_state(self, response):
        state = response.xpath('//p[@class="tag"]/span/text()').extract_first(default='')
        if len(state) > 0:
            state = state.strip()
        else:
            state = 'null'
        return state

    def get_type(self, response):
        types = response.xpath('//p[@class="tag"]/a/text()').extract()
        if len(types) > 0:
            book_type = ' '.join(types)
        else:
            book_type = 'null'
        return book_type

    def get_about(self, response):
        about = response.xpath('//p[@class="intro"]/text()').extract_first(default='')
        if len(about) > 0:
            about = about.strip()
        else:
            about = 'null'
        return about

    def get_score(self, response):
        def get_sc(book_id):
            # Needs "import requests" at the top of the spider module.
            url = 'https://book.qidian.com/ajax/comment/index?_csrftoken=zikrbzt4nggzbkfyumdwzvgh0x0wtro5rdegbi9w&bookid=' + book_id + '&pagesize=15'
            rr = requests.get(url)
            # The score is read from a fixed offset in the response text.
            return rr.text[16:19]

        bid = response.xpath('//a[@id="bookimg"]/@data-bid').extract_first(default='')  # book id
        if len(bid) > 0:
            score = get_sc(bid)  # an integer score may come back as 9,"
            if len(score) > 1 and score[1] == ',':
                score = score.replace(',"', '.0')
        else:
            score = 'null'
        return score

    def get_story(self, response):
        story = response.xpath('//div[@class="book-intro"]/p/text()').extract_first(default='')
        if len(story) > 0:
            story = story.strip()
        else:
            story = 'null'
        return story

    def get_news(self, response):
        news = response.xpath('//div[@class="detail"]/p[@class="cf"]/a/text()').extract_first(default='')
        if len(news) > 0:
            news = news.strip()
        else:
            news = 'null'
        return news
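The score handling in get_score relies on a three-character slice that looks like 9.5 for a decimal rating but can come back as 9," for an integer one. That normalization step can be isolated and tried on its own, as in the sketch below; the function name normalize_score and the sample fragments are assumptions for illustration.

```python
def normalize_score(fragment):
    """Turn a 3-character slice such as '9.5' or '9,"' into a score string."""
    if len(fragment) > 1 and fragment[1] == ",":
        # Integer score: the slice ran into the next field, so patch in ".0".
        return fragment.replace(',"', ".0")
    return fragment


print(normalize_score("9.5"))
print(normalize_score('9,"'))
```

Slicing at a fixed offset is fragile if the response layout ever changes, so keeping the cleanup in one small function makes it easier to replace later.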
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko',
}
1.3.4 Running the Program
scrapy crawl read
1.3.5 Complete Code
# -*- coding: utf-8 -*-
import requests
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from qd.items import QdItem


class ReadSpider(CrawlSpider):
    name = 'read'
    # allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/all?orderid=&style=1&pagesize=20&siteid=1&pubflag=0&hiddenfield=0&page=1']

    rules = (
        # Match every listing-page URL and keep following links found on those pages
        Rule(LinkExtractor(allow=(r'https://www.qidian.com/all\?orderid=\&style=1\&pagesize=20\&siteid=1\&pubflag=0\&hiddenfield=0\&page=(\d+)')), follow=True),
        # Match detail pages; do not follow links any deeper
        Rule(LinkExtractor(allow=r'https://book.qidian.com/info/(\d+)'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = QdItem()
        item['book_name'] = self.get_book_name(response)
        item['author'] = self.get_author(response)
        item['state'] = self.get_state(response)
        item['type'] = self.get_type(response)
        item['about'] = self.get_about(response)
        item['score'] = self.get_score(response)
        item['story'] = self.get_story(response)
        item['news'] = self.get_news(response)
        yield item

    # extract_first(default='') is used instead of extract()[0], which
    # raises IndexError when nothing matches.

    def get_book_name(self, response):
        book_name = response.xpath('//h1/em/text()').extract_first(default='')
        if len(book_name) > 0:
            book_name = book_name.strip()
        else:
            book_name = 'null'
        return book_name

    def get_author(self, response):
        author = response.xpath('//h1/span/a/text()').extract_first(default='')
        if len(author) > 0:
            author = author.strip()
        else:
            author = 'null'
        return author

    def get_state(self, response):
        state = response.xpath('//p[@class="tag"]/span/text()').extract_first(default='')
        if len(state) > 0:
            state = state.strip()
        else:
            state = 'null'
        return state

    def get_type(self, response):
        types = response.xpath('//p[@class="tag"]/a/text()').extract()
        if len(types) > 0:
            book_type = ' '.join(types)
        else:
            book_type = 'null'
        return book_type

    def get_about(self, response):
        about = response.xpath('//p[@class="intro"]/text()').extract_first(default='')
        if len(about) > 0:
            about = about.strip()
        else:
            about = 'null'
        return about

    def get_score(self, response):
        def get_sc(book_id):
            url = 'https://book.qidian.com/ajax/comment/index?_csrftoken=zikrbzt4nggzbkfyumdwzvgh0x0wtro5rdegbi9w&bookid=' + book_id + '&pagesize=15'
            rr = requests.get(url)
            # The score is read from a fixed offset in the response text.
            return rr.text[16:19]

        bid = response.xpath('//a[@id="bookimg"]/@data-bid').extract_first(default='')  # book id
        if len(bid) > 0:
            score = get_sc(bid)  # an integer score may come back as 9,"
            if len(score) > 1 and score[1] == ',':
                score = score.replace(',"', '.0')
        else:
            score = 'null'
        return score

    def get_story(self, response):
        story = response.xpath('//div[@class="book-intro"]/p/text()').extract_first(default='')
        if len(story) > 0:
            story = story.strip()
        else:
            story = 'null'
        return story

    def get_news(self, response):
        news = response.xpath('//div[@class="detail"]/p[@class="cf"]/a/text()').extract_first(default='')
        if len(news) > 0:
            news = news.strip()
        else:
            news = 'null'
        return news
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Field, Item


class QdItem(Item):
    # define the fields for your item here like:
    book_name = Field()  # book title
    author = Field()     # author
    state = Field()      # status
    type = Field()       # genre
    about = Field()      # short description
    score = Field()      # rating
    story = Field()      # synopsis
    news = Field()       # latest chapter
# -*- coding: utf-8 -*-

# Scrapy settings for qd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qd'

SPIDER_MODULES = ['qd.spiders']
NEWSPIDER_MODULE = 'qd.spiders'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko',
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qd (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qd.middlewares.QdSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qd.middlewares.QdDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'qd.pipelines.QdPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class QdSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class QdDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class QdPipeline(object):
    def process_item(self, item, spider):
        return item
