Scrapy installation and basic usage

  Scrapy is a comprehensive crawling framework. It depends on Twisted and implements concurrent crawling internally on top of an event-loop mechanism.

  Download and install:

 - Win:
    Download: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

    pip3 install wheel
    pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl  # if the 64-bit wheel will not install, try the 32-bit one

    pip3 install pywin32

    pip3 install scrapy

 - Linux:
   pip3 install scrapy

  

    What is Twisted, and how does it differ from requests?
    requests is a Python module that sends HTTP requests while impersonating a browser.
        - wraps sockets to send requests

    Twisted is an asynchronous, non-blocking network framework built around an event loop.
        - wraps sockets to send requests
        - handles concurrent requests in a single thread
        PS: three related terms
            - non-blocking: do not wait
            - asynchronous: callbacks
            - event loop: keep looping and checking state.
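
  To make those three terms concrete, here is a minimal Twisted sketch (not from the original article; the URLs are placeholders): a single thread fires several HTTP requests through the reactor (the event loop) and handles each result in a callback.

from twisted.internet import defer, reactor
from twisted.web.client import Agent

agent = Agent(reactor)

def on_response(response, url):
    # callback: runs once the response headers for this URL have arrived
    print(url, response.code)

@defer.inlineCallbacks
def crawl(urls):
    # fire every request without waiting (non-blocking); the reactor keeps
    # looping and drives all of them concurrently in one thread
    tasks = [agent.request(b'GET', url.encode()).addCallback(on_response, url)
             for url in urls]
    yield defer.DeferredList(tasks)
    reactor.stop()

crawl(['http://www.baidu.com', 'http://cn.bing.com'])
reactor.run()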

  

    Components and execution flow?
    - The engine finds the spider to run, calls the spider's start_requests method, and gets back an iterator.
    - Iterating over it produces Request objects; each Request wraps the URL to visit and a callback function.
    - All of the Request objects (tasks) are placed into the scheduler, to be picked up later by the downloader.
    - The downloader pulls tasks (Request objects) from the scheduler and, once a download finishes, runs the callback.
    - Back in the spider's callback you can:
        yield Request()
        yield Item()
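
  A minimal spider sketch of this loop (the class name and URLs are placeholders, not from the article):

import scrapy
from scrapy.http import Request

class FlowSpider(scrapy.Spider):
    name = 'flow'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # a new Request goes back through the engine into the scheduler
        yield Request(url='http://example.com/page2', callback=self.parse)
        # an item (a plain dict also works) is handed to the item pipelines
        yield {'url': response.url}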


  Basic commands

    # create a project
    scrapy startproject xdb

    cd xdb

    # create spiders
    scrapy genspider chouti chouti.com
    scrapy genspider cnblogs cnblogs.com

    # run a spider
    scrapy crawl chouti
    scrapy crawl chouti --nolog
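
  For reference, scrapy genspider chouti chouti.com produces a spider file that looks roughly like this (default template):

import scrapy

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        pass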

  HTML parsing: XPath

	- response.text 
	- response.encoding
	- response.body 
	- response.request
	# response.xpath('//div[@href="x1"]/a').extract_first()
	# response.xpath('//div[@href="x1"]/a').extract()
	# response.xpath('//div[@href="x1"]/a/text()').extract()
	# response.xpath('//div[@href="x1"]/a/@href').extract()

   Making follow-up requests: yield a Request object

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response, type(response))  # an HtmlResponse object
        # print(response.text)
        """
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.text,'html.parser')
        content_list = soup.find('div',attrs={'id':'content-list'})
        """
        # search the descendants for the div with id="content-list"
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href, text.strip())
            f.write(href + '\n')
        f.close()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # e.g. https://dig.chouti.com/all/hot/recent/2

  Note: if you hit encoding errors during the crawl, try adding the lines below

# import sys,os,io
# sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')

  If the spider never runs parse, change this item in the settings file

ROBOTSTXT_OBEY = False

  

  The flow above has two drawbacks: 1. every request opens and then closes the file/connection again; 2. responsibilities are blurred, since parsing and storage are mixed in the same callback.

  To solve these two problems, Scrapy provides persistence.

Persistence: pipelines and items

  1. Define a pipeline class; this is where the storage logic lives

class XXXPipeline(object):
    def process_item(self, item, spider):
        return item

  2. Define an Item class; this declares the fields the pipeline will receive

class XdbItem(scrapy.Item):
     href = scrapy.Field()
     title = scrapy.Field()

  3. Register the pipeline in settings

ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}

  Each time the spider yields an Item object, process_item is called once.
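
  For example, a sketch of the spider side (assuming the standard project layout, so XdbItem lives in xdb/items.py):

import scrapy
from xdb.items import XdbItem

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # build one Item per entry and yield it; Scrapy then calls
        # process_item on every pipeline enabled in ITEM_PIPELINES
        for row in response.xpath('//div[@id="content-list"]/div[@class="item"]'):
            yield XdbItem(
                title=row.xpath('.//a/text()').extract_first(),
                href=row.xpath('.//a/@href').extract_first(),
            )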

        Writing a pipeline:

	'''
	What the framework does, roughly:
	1. Check whether the XdbPipeline class defines from_crawler
		yes:
			obj = XdbPipeline.from_crawler(....)
		no:
			obj = XdbPipeline()
	2. obj.open_spider()
	
	3. obj.process_item() / obj.process_item() / obj.process_item() / ...  (once per item)
	
	4. obj.close_spider()
	'''
from scrapy.exceptions import DropItem

class FilePipeline(object):

	def __init__(self,path):
		self.f = None
		self.path = path

	@classmethod
	def from_crawler(cls, crawler):
		"""
		初始化时候,用于创建pipeline对象
		:param crawler:
		:return:
		"""
		print('File.from_crawler')
		path = crawler.settings.get('HREF_FILE_PATH')
		return cls(path)

	def open_spider(self,spider):
		"""
		爬虫开始执行时,调用
		:param spider:
		:return:
		"""
		print('File.open_spider')
		self.f = open(self.path,'a+')

	def process_item(self, item, spider):
		# f = open('xx.log','a+')
		# f.write(item['href']+'\n')
		# f.close()
		print('File',item['href'])
		self.f.write(item['href']+'\n')
		
		# return item  	# pass the item on to the next pipeline's process_item
		raise DropItem()  # later pipelines' process_item will no longer run

	def close_spider(self,spider):
		"""
		爬虫关闭时,被调用
		:param spider:
		:return:
		"""
		print('File.close_spider')
		self.f.close()

   Note: pipelines are shared by all spiders in the project; if you need spider-specific behaviour, branch on the spider argument yourself.
        For persistence in a pipeline: from_crawler reads the output path from settings, open_spider opens the file/connection,
        close_spider closes it, and process_item performs the actual write. Returning the item hands it to the next pipeline's
        process_item; raising DropItem() stops later pipelines' process_item from running.
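
  A minimal sketch of per-spider branching inside process_item (the class name and file path are illustrative):

class SpiderAwarePipeline(object):
    def process_item(self, item, spider):
        # "spider" is the running spider instance, so you can branch on its name
        if spider.name == 'chouti':
            # only persist items produced by the chouti spider
            with open('chouti.log', 'a+') as f:
                f.write(item['href'] + '\n')
        return item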

 

Deduplication rules

        Write the filter class

from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class XdbDupeFilter(BaseDupeFilter):

	def __init__(self):
		self.visited_fd = set()

	@classmethod
	def from_settings(cls, settings):
		return cls()

	def request_seen(self, request):
		fd = request_fingerprint(request=request)
		if fd in self.visited_fd:
			return True
		self.visited_fd.add(fd)

	def open(self):  # can return deferred
		print('dupefilter opened')

	def close(self, reason):  # can return a deferred
		print('dupefilter closed')

	# def log(self, request, spider):  # log that a request has been filtered
	#     print('request filtered')

  Configuration

        # override the default dedup rule
        # DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
        DUPEFILTER_CLASS = 'xdb.dupefilters.XdbDupeFilter'

   Using it in the spider

import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']

	def parse(self, response):
		print(response.request.url)
		# item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
		# for item in item_list:
		#     text = item.xpath('.//a/text()').extract_first()
		#     href = item.xpath('.//a/@href').extract_first()

		page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
		for page in page_list:
			page = "https://dig.chouti.com" + page
			# yield Request(url=page,callback=self.parse,dont_filter=False) # https://dig.chouti.com/all/hot/recent/2
			yield Request(url=page,callback=self.parse,dont_filter=True) # https://dig.chouti.com/all/hot/recent/2

         Notes:
            - put the correct logic in request_seen
            - dont_filter=False

            To deduplicate, define a custom dupefilter class and do the check inside its request_seen method.
            Also make sure dont_filter stays False when yielding the Request (False is already the default).

Depth and priority

        - depth
            - starts at 0
            - each time you yield, the new request's depth is the originating request's depth + 1
            setting: DEPTH_LIMIT caps the depth
        - priority
            - the request's download priority -= depth * DEPTH_PRIORITY
            setting: DEPTH_PRIORITY

   Reading the depth: response.meta.get("depth", 0)
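
  A sketch of the related settings in settings.py (the values are illustrative):

# settings.py
DEPTH_LIMIT = 3       # stop scheduling requests deeper than 3
DEPTH_PRIORITY = 1    # positive value: priority drops as depth grows, giving breadth-first behaviour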

 

Setting cookies

  Method 1: carry and parse the cookies yourself

    cookie_dict = {}
    def parse(self, response):

        # the carry-and-parse approach
        # pull the cookies out of the response headers; they are stored on the cookie_jar object
        from scrapy.http.cookies import CookieJar
        from urllib.parse import urlencode
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        # walk the jar and flatten the cookies into a dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        yield Request(
            url="https://dig.chouti.com/login",
            method="POST",
            # the body can be built by hand or with urlencode
            body="phone=8613121758648&password=woshiniba&oneMonth=1",
            cookies=self.cookie_dict,
            headers={
                "Content-Type":'application/x-www-form-urlencoded; charset=UTF-8'
            },
            callback=self.check_login
        )

    def check_login(self, response):
        print(response.text)
        yield Request(
            url="https://dig.chouti.com/all/hot/recent/1",
            cookies=self.cookie_dict,
            callback=self.index
        )

    def index(self, response):
        news_list = response.xpath("//div[@id='content-list']/div[@class='item']")
        for new in news_list:
            link_id = new.xpath(".//div[@class='part2']/@share-linkid").extract_first()
            yield Request(
                url="http://dig.chouti.com/link/vote?linksId=%s"%(link_id, ),
                method="POST",
                cookies=self.cookie_dict,
                callback=self.check_result
            )

        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.index)

    def check_result(self, response):
        print(response.text)

  Method 2: meta

meta={'cookiejar': True}
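
  A minimal sketch of the meta approach (the built-in CookiesMiddleware keeps the cookie jar for you; the spider name and URLs are illustrative):

import scrapy
from scrapy.http import Request

class CookieSpider(scrapy.Spider):
    name = 'cookie_demo'
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        # ask the cookies middleware to keep a cookie jar for this session
        for url in self.start_urls:
            yield Request(url=url, meta={'cookiejar': True}, callback=self.parse)

    def parse(self, response):
        # pass the same cookiejar key along so follow-up requests reuse the cookies
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/2',
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parse,
        )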

  

start_urls

        The Scrapy engine takes the result returned by start_requests (a list of Requests), wraps it in an iterator, and places it in the scheduler;
        the downloader then pulls Request objects out of the scheduler via __next__.

 

         - Customizing: you can fetch the start URLs from redis, or set a proxy here (os.environ)

        - Under the hood:
        """
        The engine fetches the starting URLs from the spider:
            1. call start_requests and take its return value
            2. v = iter(return value)
            3.
                req1 = v.__next__()
                req2 = v.__next__()
                req3 = v.__next__()
                ...
            4. all of the requests are placed in the scheduler

        """

        - Example:

class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}
	
	def start_requests(self):
		# method 1:
		for url in self.start_urls:
			yield Request(url=url)
		# method 2:
		# req_list = []
		# for url in self.start_urls:
		#     req_list.append(Request(url=url))
		# return req_list

        

Proxies

        Question: how do you add a proxy in scrapy?
            - environment variables: set the proxy in os.environ inside start_requests, when the spider starts
            - meta: set the proxy on the meta attribute when yielding a Request
            - a custom downloader middleware: add the proxy in process_request; this approach supports random proxy rotation

  Built-in proxy support: simply set the proxy in os.environ when the spider starts.

class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}

	def start_requests(self):
		import os
		os.environ['HTTPS_PROXY'] = "http://root:[email protected]:9999/"
		os.environ['HTTP_PROXY'] = '19.11.2.32'
		for url in self.start_urls:
			yield Request(url=url,callback=self.parse)

  Proxy via meta: set the meta attribute when yielding the Request

class ChoutiSpider(scrapy.Spider):
	name = 'chouti'
	allowed_domains = ['chouti.com']
	start_urls = ['https://dig.chouti.com/']
	cookie_dict = {}

	def start_requests(self):
		for url in self.start_urls:
			yield Request(url=url,callback=self.parse,meta={'proxy':'http://root:[email protected]:9999/'})

  Custom downloader middleware: add the proxy in process_request; this is where you can implement random proxy selection

import base64
import random
from six.moves.urllib.parse import unquote

try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class XdbProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

  

Selectors and parsing

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')


# hxs = Selector(response)
# hxs.xpath(...)
print(response.xpath('//a[@id="i1"]/text()').extract_first())  # 'first item'

 

Downloader middleware

  Inside process_request you can:

  • return an HtmlResponse object: the download is skipped, but process_response still runs
  • return a Request object: issue a different request instead
  • raise IgnoreRequest: discard the current request; process_exception will run
  • modify the request, for example set the User-Agent header

  Writing the middleware:

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
	@classmethod
	def from_crawler(cls, crawler):
		# This method is used by Scrapy to create your spiders.
		s = cls()
		return s

	def process_request(self, request, spider):
		# Called for each request that goes through the downloader
		# middleware.

		# Must either:
		# - return None: continue processing this request
		# - or return a Response object
		# - or return a Request object
		# - or raise IgnoreRequest: process_exception() methods of
		#   installed downloader middleware will be called
		print('md1.process_request',request)
		# 1. return a Response
		# import requests
		# result = requests.get(request.url)
		# return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
		# 2. return a Request
		# return Request('https://dig.chouti.com/r/tec/hot/1')

		# 3. raise an exception
		# from scrapy.exceptions import IgnoreRequest
		# raise IgnoreRequest

		# 4. modify the request (*)
		# request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

		pass

	def process_response(self, request, response, spider):
		# Called with the response returned from the downloader.

		# Must either;
		# - return a Response object
		# - return a Request object
		# - or raise IgnoreRequest
		print('m1.process_response',request,response)
		return response

	def process_exception(self, request, exception, spider):
		# Called when a download handler or a process_request()
		# (from other downloader middleware) raises an exception.

		# Must either:
		# - return None: continue processing this exception
		# - return a Response object: stops process_exception() chain
		# - return a Request object: stops process_exception() chain
		pass

   Configuration

DOWNLOADER_MIDDLEWARES = {
   #'xdb.middlewares.XdbDownloaderMiddleware': 543,
	# 'xdb.proxy.XdbProxyMiddleware':751,
	'xdb.md.Md1':666,
	'xdb.md.Md2':667,
}

  Typical uses (see the sketch below):
     - user-agent
     - proxies
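
  A sketch of the user-agent case (the class name and agent list are illustrative; register it in DOWNLOADER_MIDDLEWARES just like Md1/Md2 above):

import random

class RandomUserAgentMiddleware(object):
    # a small illustrative pool; in practice load it from settings or a file
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36',
    ]

    def process_request(self, request, spider):
        # overwrite the header before the request reaches the downloader
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)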

 

 

Spider middleware

  • process_start_requests runs only once, when the spider starts, before the downloader middleware
  • process_spider_input runs after the downloader middleware has finished, just before the spider callback is invoked
  • process_spider_output runs after the callback has finished

  Writing it:

class Sd1(object):
	# Not all methods need to be defined. If a method is not defined,
	# scrapy acts as if the spider middleware does not modify the
	# passed objects.

	@classmethod
	def from_crawler(cls, crawler):
		# This method is used by Scrapy to create your spiders.
		s = cls()
		return s

	def process_spider_input(self, response, spider):
		# Called for each response that goes through the spider
		# middleware and into the spider.

		# Should return None or raise an exception.
		return None

	def process_spider_output(self, response, result, spider):
		# Called with the results returned from the Spider, after
		# it has processed the response.

		# Must return an iterable of Request, dict or Item objects.
		for i in result:
			yield i

	def process_spider_exception(self, response, exception, spider):
		# Called when a spider or process_spider_input() method
		# (from other spider middleware) raises an exception.

		# Should return either None or an iterable of Response, dict
		# or Item objects.
		pass

	# runs only once, when the spider starts.
	def process_start_requests(self, start_requests, spider):
		# Called with the start requests of the spider, and works
		# similarly to the process_spider_output() method, except
		# that it doesn’t have a response associated.

		# Must return only requests (not items).
		for r in start_requests:
			yield r

  Configuration

SPIDER_MIDDLEWARES = {
   # 'xdb.middlewares.XdbSpiderMiddleware': 543,
	'xdb.sd.Sd1': 666,
	'xdb.sd.Sd2': 667,
}

  Typical uses (see the sketch below):
    - depth
    - priority
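
  A sketch of the depth case (this roughly mirrors what Scrapy's built-in DepthMiddleware does; it assumes DepthMiddleware has populated response.meta['depth'], and the limit value is illustrative):

from scrapy.http import Request

class DepthLimitSpiderMiddleware(object):
    MAX_DEPTH = 3  # illustrative limit

    def process_spider_output(self, response, result, spider):
        depth = response.meta.get('depth', 0)
        for obj in result:
            # drop follow-up requests once the parent response is too deep;
            # items (and shallow requests) pass through untouched
            if isinstance(obj, Request) and depth >= self.MAX_DEPTH:
                continue
            yield obj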

 

Custom commands

  Running a single spider from a script:

import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy","crawl","chouti","--nolog"])

        - Running all spiders with one command:
            - create a directory (any name, e.g. commands) next to spiders
            - inside it create crawlall.py (the file name becomes the command name); see the sketch below
            - add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py
            - run the command from the project directory: scrapy crawlall
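
  A sketch of crawlall.py (based on the ScrapyCommand interface; adjust to your own project):

from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Run all spiders in the project'

    def run(self, args, opts):
        # list every registered spider and schedule all of them
        for name in self.crawler_process.spider_loader.list():
            self.crawler_process.crawl(name)
        self.crawler_process.start()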

Signals

  Use the hook points the framework exposes to add custom behaviour of your own

from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()

        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)

        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')

    Configuration

EXTENSIONS = {
    'xdb.ext.MyExtend':666,
}

  

 
