Python: scraping Douban's now-playing movie information via XPath attributes
Preface
A quick note up front: this article is for study and research purposes only, nothing else.
GitHub repository: see the GitHub project repository.
Page analysis
The main page to scrape is: https://movie.douban.com/cinema/nowplaying/nanjing/
The city at the end of the URL can be changed to whatever region you need, so I won't dwell on it. The page only shows 15 films until you click to expand the full list, so when using Selenium we need to add a click step after the page is opened.
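Before wiring this into Scrapy, it can be worth confirming the behaviour with a quick standalone Selenium check. This is only a sketch: it assumes the same local chromedriver path and the Selenium 3-style API used by the project code later in this article, and it reuses the expand-button XPath from the downloader middleware below.

import time
from selenium import webdriver

# open the now-playing page, click "expand all films", then count the list items
browser = webdriver.Chrome(executable_path="E:\\chromedriver_win32\\chromedriver.exe")
browser.get("https://movie.douban.com/cinema/nowplaying/nanjing/")
time.sleep(2)
# the "expand all" button sits under the #nowplaying block
browser.find_element_by_xpath("//*[@id='nowplaying']/div[@class='more']").click()
time.sleep(3)
items = browser.find_elements_by_xpath("//*[@id='nowplaying']/div[@class='mod-bd']//*[@class='list-item']")
print(len(items))  # expect more than the 15 films shown before expanding
browser.quit()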
Open the page source with F12 and use the XPath Helper extension to verify the XPath copied via right-click.
To avoid breakage when the page layout changes, I switched the XPath to select nodes by class name instead.
Then take a look at the information on each film.
A quick analysis: we can use the nowplaying div as the root node and grab the nodes with class list-item underneath it; the attributes on those nodes are exactly the content we want, as the sketch below illustrates.
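To make the attribute idea concrete, here is a minimal sketch built on a made-up HTML fragment: each list-item node carries the film's fields as data-* attributes, so an XPath ending in /@data-title (and friends) returns the values directly. The attribute names are the ones the spider below actually reads; the sample values are invented.

from scrapy import Selector

# made-up fragment mimicking one list-item node under #nowplaying > .mod-bd
sample = """
<div id="nowplaying">
  <div class="mod-bd">
    <ul class="lists">
      <li class="list-item" data-title="Sample Movie" data-score="8.5"
          data-release="2021" data-duration="120min" data-region="Region"
          data-director="Director" data-actors="Actor A / Actor B"></li>
    </ul>
  </div>
</div>
"""

sel = Selector(text=sample)
tpl = "//*[@id='nowplaying']/div[@class='mod-bd']//*[@class='list-item']/@{}"
print(sel.xpath(tpl.format("data-title")).extract())  # ['Sample Movie']
print(sel.xpath(tpl.format("data-score")).extract())  # ['8.5']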
No problems there, so let's create the project and start coding along those lines.
Implementation
Creating the project
Create a project named douban_playing using the scrapy command:
scrapy startproject douban_playing
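For orientation, the command generates the standard Scrapy project skeleton; the files edited in the sections below (plus the spider file we add by hand) live here:

douban_playing/
    scrapy.cfg
    douban_playing/
        __init__.py
        items.py          # item definition
        middlewares.py    # Selenium click logic
        pipelines.py      # text-file output
        settings.py       # configuration
        spiders/
            __init__.py
            douban_playing.py   # the spider itself (added manually)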
Item definition
Define the movie information entity.
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanPlayingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie title
    title = scrapy.Field()
    # movie score
    score = scrapy.Field()
    # release year
    release = scrapy.Field()
    # duration
    duration = scrapy.Field()
    # region
    region = scrapy.Field()
    # director
    director = scrapy.Field()
    # leading actors
    actors = scrapy.Field()
Middleware definition
The main addition is the piece of code that clicks to expand the full film list.
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import time

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException


class DoubanPlayingSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DoubanPlayingDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        #
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            # render the page with Selenium, click "expand all films",
            # then hand the rendered HTML back to Scrapy as the response
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            spider.browser.find_element_by_xpath("//*[@id='nowplaying']/div[@class='more']").click()
            # ActionChains(spider.browser).click(searchButtonElement)
            time.sleep(5)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('超时异常:{}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            # only one page is crawled, so the browser can be closed right away
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        #
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        #
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Spider definition
Using the attribute names, we pull out all of the film information. Note the syntax used to extract the attributes.
#!/usr/bin/env python
# coding=utf-8
"""
@project : douban_playing
@author  : huyi
@file    : douban_playing.py
@ide     : PyCharm
@time    : 2021-11-10 16:31:23
"""
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from douban_playing.items import DoubanPlayingItem


class DoubanPlayingSpider(scrapy.Spider):
    name = 'dbp'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://movie.douban.com/cinema/nowplaying/nanjing/']
    nowplaying = "//*[@id='nowplaying']/div[@class='mod-bd']//*[@class='list-item']/@{}"
    properties = ['data-title', 'data-score', 'data-release', 'data-duration',
                  'data-region', 'data-director', 'data-actors']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # headless Chrome mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        # each data-* attribute is extracted as a parallel list, in page order
        titles = response.xpath(self.nowplaying.format(self.properties[0])).extract()
        scores = response.xpath(self.nowplaying.format(self.properties[1])).extract()
        releases = response.xpath(self.nowplaying.format(self.properties[2])).extract()
        durations = response.xpath(self.nowplaying.format(self.properties[3])).extract()
        regions = response.xpath(self.nowplaying.format(self.properties[4])).extract()
        directors = response.xpath(self.nowplaying.format(self.properties[5])).extract()
        actors = response.xpath(self.nowplaying.format(self.properties[6])).extract()
        for x in range(len(titles)):
            item = DoubanPlayingItem()
            item['title'] = titles[x]
            item['score'] = scores[x]
            item['release'] = releases[x]
            item['duration'] = durations[x]
            item['region'] = regions[x]
            item['director'] = directors[x]
            item['actors'] = actors[x]
            yield item
Pipeline definition
As usual, the extracted movie data is written to a text file in a fixed format.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DoubanPlayingPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(
            "电影:{}\t分数:{}\t发行年份:{}\t电影时长:{}\t地区:{}\t电影导演:{}\t电影主演:{}\n".format(
                item['title'],
                item['score'],
                item['release'],
                item['duration'],
                item['region'],
                item['director'],
                item['actors']))
        return item

    def close_spider(self, spider):
        self.file.close()
Settings
These are all routine; just enable a few of the default settings.
# Scrapy settings for douban_playing project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_playing'

SPIDER_MODULES = ['douban_playing.spiders']
NEWSPIDER_MODULE = 'douban_playing.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_playing (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_playing.pipelines.DoubanPlayingPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run and verify
As usual, instead of invoking the scrapy command directly, we build a small .py file that runs the command for us, as sketched below. Pay attention to where that file is placed.
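The runner file itself isn't reproduced above, so here is a minimal sketch of what such a file typically looks like: a script (assumed name main.py) placed at the project root, next to scrapy.cfg, that hands the crawl command to Scrapy's cmdline helper.

#!/usr/bin/env python
# coding=utf-8
# Assumed runner (main.py at the project root, beside scrapy.cfg):
# lets the spider be started from an IDE instead of a terminal.
from scrapy import cmdline

# equivalent to running `scrapy crawl dbp` on the command line
cmdline.execute("scrapy crawl dbp".split())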
Take a look at the result after running it: result.txt contains one line per film, in the format defined by the pipeline.
Perfect!
Summary
I've been writing crawler examples lately, learning and exploring as I go, and recording the implementation process here both to share it and so I can look back on it later.
A quote to share:
Love: one knows not where it begins, where it rests, where it binds, where it unravels, where it wanders, or where it ends. — Sword Snow Stride (《雪中悍刀行》)
If this article was useful to you, please don't hold back your likes. Thank you!