
Scraping Douban Movies with Scrapy


The four steps of scraping with Scrapy

  • Create the project
  • Define the targets
  • Write the spider
  • Store the content

Creating the project

scrapy startproject douban
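
For reference, this command generates a project layout roughly like the following (the spiders directory stays empty until genspider is run):

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py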

Creating the spider file

scrapy genspider douban_spider movie.douban.com
This creates douban/spiders/douban_spider.py with the following skeleton:

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # entry URL handed to the scheduler; the /top250 path was appended by hand
    start_urls = ['https://movie.douban.com/top250']

    # default parse callback
    def parse(self, response):
        pass

Writing the items file

An Item is a container for the scraped data and is used much like a dictionary.
To create one, subclass scrapy.Item and define each field as scrapy.Field.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = 'douban_movie'
    # ranking number
    serial_number = Field()
    # movie title
    movie_name = Field()
    # movie introduction
    introduce = Field()
    # star rating
    star = Field()
    # number of reviews
    evaluate = Field()
    # one-line description (quote)
    describe = Field()

Parsing the response

Next we parse with XPath; the response object has a built-in XPath selector.
Each field of the page is parsed separately.
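
Before hard-coding the XPath rules into the spider, it can help to test them interactively in scrapy shell. Douban rejects Scrapy's default User-Agent, so pass a browser-like one with -s; the session below is a sketch:

scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://movie.douban.com/top250'

>>> movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
>>> len(movie_list)   # should be 25 entries per page
>>> movie_list[0].xpath(".//span[@class='rating_num']/text()").extract_first()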

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # entry URL handed to the scheduler
    start_urls = ['https://movie.douban.com/top250']

    # default parse callback
    def parse(self, response):
        # loop over the movie entries on this page
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")  # XPath rule
        for i_item in movie_list:
            # build an item for each entry
            douban_item = DoubanItem()
            # detailed XPath expressions for each field
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(
                ".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            douban_item['introduce'] = [" ".join(i.split()) for i in content]
            douban_item['star'] = i_item.xpath(".//span[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            # yield the item to the pipelines for cleaning and storage
            yield douban_item
        # parse the next-page link
        next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
        if next_link:
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)

Note that the movie introduction spans multiple lines, so it needs a little cleanup first (see the sketch below).
After parsing each page, extract the next-page link, check whether it exists, and if it does, yield a scrapy.Request with the new URL and the parse callback so the next page is crawled the same way.
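
As a quick illustration of the cleanup step, here is what the list comprehension does to the raw text nodes (the sample strings are made up for demonstration):

# raw text nodes as extract() might return them (sample data, for illustration only)
content = ["\n            导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯\n            ",
           "\n            1994 / 美国 / 犯罪 剧情\n            "]
# split() drops every run of whitespace; " ".join() rebuilds a single-spaced line
cleaned = [" ".join(i.split()) for i in content]
# cleaned == ['导演: 弗兰克·德拉邦特 主演: 蒂姆·罗宾斯', '1994 / 美国 / 犯罪 剧情']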

The spider is now ready to crawl the pages.

Saving the results

Running scrapy crawl douban_spider -o result.csv writes the scraped results to a file in the directory where the command is executed. Note that the name after crawl is the spider's name attribute (douban_spider), not the project name.
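
The -o flag infers the output format from the file extension; a few other formats Scrapy supports out of the box:

scrapy crawl douban_spider -o result.json   # JSON array
scrapy crawl douban_spider -o result.jl     # JSON lines, one item per line
scrapy crawl douban_spider -o result.xml    # XML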

Saving the data to MongoDB

First install pymongo with pip install pymongo.
Then add the connection settings to settings.py:

MONGO_URL = '127.0.0.1'
MONGO_DB = 'douban'

Next, write the pipelines file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class DoubanPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # open the MongoDB connection when the spider starts
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one replaces the insert() method deprecated in pymongo 3
        self.db[item.collection].insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()

Next, enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
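
To spot-check that the items landed in MongoDB, a short pymongo session can be run after the crawl; this is a minimal sketch assuming pymongo 3.7+ and the database/collection names used above:

import pymongo

client = pymongo.MongoClient('127.0.0.1')
db = client['douban']
# count the stored movies and show one document
print(db['douban_movie'].count_documents({}))
print(db['douban_movie'].find_one())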


Rotating a random User-Agent

Edit the middleware file middlewares.py and add a class my_useragent (note the import random at the top).

import random


class my_useragent(object):
    # assign a random User-Agent to every outgoing request
    def process_request(self, request, spider):
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        user_agent = random.choice(user_agent_list)
        request.headers['User-Agent'] = user_agent

Then enable the middleware in settings.py.

DOWNLOADER_MIDDLEWARES = {
    # 'douban.middlewares.DoubanDownloaderMiddleware': 543,
    'douban.middlewares.my_useragent': 543,
}
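
To confirm the middleware is taking effect, one option is to log the header that was actually sent; for example, a temporary debugging line at the top of the spider's parse method (purely illustrative):

def parse(self, response):
    # response.request carries the headers the downloader actually sent
    self.logger.info('User-Agent used: %s', response.request.headers.get('User-Agent'))
    # ... rest of the parsing logic unchanged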

Summary

  1. Whenever you add code to pipelines.py or middlewares.py, remember to enable it in settings.py; the smaller the number, the higher the priority.
  2. The spider name must not clash with the project name, and the spiders directory must not contain two files with the same spider name.