scrapy: Middleware

Scrapy middleware comes in two kinds, spider middleware and downloader middleware; this article focuses on downloader middleware.
Downloader middleware sits between the Downloader and the Engine and is mainly used to intercept requests and intercept responses.
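Concretely, a downloader middleware is just a class exposing hooks that Scrapy calls for every request on its way out and every response on its way back. The hook names below are Scrapy's standard downloader-middleware API; the class name is a placeholder:

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for each request heading to the Downloader.
        # Return None to continue normally, or a Response/Request to short-circuit.
        return None

    def process_response(self, request, response, spider):
        # Called for each response heading back to the Engine.
        # Must return a Response (possibly a new one) or a Request.
        return response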

1. Intercepting requests

  • Purpose: User-Agent (UA) spoofing and setting proxy IPs
  • Steps: in middlewares.py, override the process_request method of the DownloaderMiddleware class
    Example: for every outgoing request, pick a random entry from the UA pool to spoof the User-Agent, as in the middleware below:
import random


class MiddleDownloaderMiddleware:
    # User-Agent pool for UA spoofing
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    # The two pools below are used for setting proxy IPs
    # (these free proxy addresses are examples and may no longer be alive)
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    # Intercepts every request: spoof the UA and attach a proxy IP
    def process_request(self, request, spider):
        # UA spoofing: pick a random User-Agent from the pool
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        # Proxy IP: pick from the pool matching the request's scheme
        if request.url.startswith('https:'):
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        else:
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        print('Current User-Agent:', request.headers['User-Agent'])
        return None  # None tells Scrapy to keep processing this request normally
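
Note that the middleware only takes effect after it is enabled in settings.py. A minimal sketch; the dotted path assumes a Scrapy project named middle (the project name is not given in the article, so adjust it to your own):

# settings.py (project name 'middle' is an assumption)
DOWNLOADER_MIDDLEWARES = {
    'middle.middlewares.MiddleDownloaderMiddleware': 543,
}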

2. Intercepting responses

  • Purpose: tamper with response data. For example, some page data is loaded dynamically and cannot be fetched with a plain request; it has to be rendered with selenium, and the response originally returned for the request must then be replaced.
  • Steps: override the process_response method of DownloaderMiddleware in middlewares.py
    1. Instantiate a webdriver in the spider file (inside the __init__ method).
    Why in the spider file rather than in the middleware file?
    Because only one webdriver needs to be instantiated; after that, each new url only requires a webdriver.get(url) call.
    If it were instantiated inside the middleware's process_response method, a new browser would be spawned for every url, which is unnecessary.
    2. Reference that webdriver inside process_response (it is reachable through the spider argument).
    3. (In process_response) use some condition to pick out the urls whose data must be fetched with selenium.
    4. (In process_response) for each such url, call webdriver.get(url) and then read webdriver.page_source to obtain the rendered page data,
    then build a new response via HtmlResponse: new_response (in effect, HtmlResponse swaps the request's original response body for page_source and repackages it as new_response).
    5. Parse new_response in the spider file; everything after that works the same as usual.
    Example: the start page contains the urls of 5 news sections (modules), and each module_url must be requested to obtain the news items inside that module. Those news items, however, are loaded dynamically: sending a request to module_url directly and parsing that request's response will not yield them. The response therefore has to be tampered with.
    Code in the spider file:
import scrapy
from selenium import webdriver
from wynews.items import WynewsItem
class NewsSpiderSpider(scrapy.Spider):
    name = 'news_spider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    module_urls = []

    # Instantiate a single browser object, shared by all requests
    def __init__(self):
        self.driver = webdriver.Chrome(
            executable_path=r'C:\Users\Legion\AppData\Local\Google\Chrome\Application\chromedriver.exe')

    def parse(self, response):
        # Only 5 of the <li> nodes on the index page are news modules
        info_nodes = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        node_idx = [3, 4, 6, 7, 8]
        for idx in node_idx:
            node = info_nodes[idx]
            self.module_urls.append(node.xpath('./a/@href').extract_first())

        # Request each module page (only the first one here, for demonstration)
        for url in self.module_urls[:1]:
            yield scrapy.Request(url, callback=self.parse_module)

    def parse_module(self, response):  # parses each news item and its detail-page link from a module page
        info_nodes = response.xpath('//div[@class="ndi_main"]/div')
        for node in info_nodes[:5]:
            item = WynewsItem()
            news_title = node.xpath('.//div[@class="news_title"]//h3/a/text()').extract_first()
            news_url = node.xpath('.//div[@class="news_title"]//h3/a/@href').extract_first()
            item['news_title'] = news_title
            item['news_url'] = news_url
            yield item
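
One addition worth making here (not in the original code): Scrapy calls a spider's closed method when the crawl ends, which is the natural place to shut down the shared browser so no Chrome process is left behind:

    # Called by Scrapy when the spider closes
    def closed(self, reason):
        self.driver.quit()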

Code in middlewares.py:

from scrapy.http import HtmlResponse
import time

class WynewsDownloaderMiddleware:

    def process_request(self, request, spider):
        return None  # no request interception needed in this example

    def process_response(self, request, response, spider):
        driver = spider.driver  # grab the browser object defined in the spider file
        # Pick out the urls whose response objects need to be replaced
        if request.url in spider.module_urls:
            driver.get(request.url)
            time.sleep(2)  # crude wait for the dynamically loaded content to render
            page_text = driver.page_source
            new_response = HtmlResponse(request.url,body=page_text,encoding='utf-8',request=request)
            return new_response

        # Responses that don't need tampering are returned unchanged
        else:
            return response
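
For completeness, two pieces this example depends on but the article does not show; both are sketches inferred from the code above rather than taken from the original project. The item class needs the two fields the spider fills in, and the middleware must be enabled in settings.py for process_response to run at all:

# items.py: minimal WynewsItem with the fields used by the spider
import scrapy

class WynewsItem(scrapy.Item):
    news_title = scrapy.Field()
    news_url = scrapy.Field()

# settings.py: enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'wynews.middlewares.WynewsDownloaderMiddleware': 543,
}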