A Simple Scrapy Spider: Douban Drama Movie Ranking

程序员文章站 (Programmer Article Station) 2022-04-28 08:37:44


Goal: a simple Scrapy exercise that scrapes the top 20% of Douban's drama movie ranking and saves the results to a file

URL:

https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85%E7%89%87&type=11&interval_id=100:90&action=

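About the page:

1. The 100:90 part of the URL restricts the ranking to the highest-scoring 20%

2. The page analysis is skipped here; the spider below requests a JSON endpoint (/j/chart/top_list) instead of parsing the HTML page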
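
To make the role of the URL parameters concrete, here is a minimal sketch that rebuilds the query string with the standard library. The endpoint path and parameter names are copied from the spider's start URL in this post; the helper name build_rank_url is made up for illustration, and the meaning attached to each parameter is only what the post itself states.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the spider's start URL in this post.
BASE_URL = 'https://movie.douban.com/j/chart/top_list'


def build_rank_url(interval_id='100:90', start=0, limit=20):
    """Build the URL of one result page (hypothetical helper)."""
    params = {
        'type': 11,                  # genre id used for drama in this post
        'interval_id': interval_id,  # ':' gets percent-encoded as %3A
        'action': '',
        'start': start,              # offset of the first result on the page
        'limit': limit,              # number of results per page
    }
    return BASE_URL + '?' + urlencode(params)


first_page_url = build_rank_url()
print(first_page_url)
```

Calling build_rank_url(start=20) would give the second page under the same assumptions.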

System and software: Windows 7, PyCharm, Python 3.6

Code:

1. Write the item

# -*- coding: utf-8 -*-
import scrapy

class DoubanMovieItem(scrapy.Item):
    name = scrapy.Field()
    score = scrapy.Field()
    url = scrapy.Field()

2. Write the spider

# -*- coding:utf-8 -*-
import scrapy
import json
from douban_movie.items import DoubanMovieItem

class CatchMovieSpider(scrapy.Spider):
    name = 'catch_movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20']
    offset = 0

    def parse(self, response):
        movie_list = json.loads(response.body.decode())
        # An empty list means there are no more pages, so stop paginating
        if not movie_list:
            return
        for movie in movie_list:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanMovieItem()
            item['name'] = movie['title']
            item['score'] = movie['score']
            item['url'] = movie['url']
            yield item
        # Request the next page of 20 results
        self.offset += 20
        new_url = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start={}&limit=20'.format(self.offset)
        yield scrapy.Request(url=new_url, callback=self.parse)
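
The pagination logic in parse can be tried out without Scrapy or network access. The following is only a sketch: parse_page is a hypothetical helper, and the sample responses are made up to match the three fields the spider reads (title, score, url), not real Douban data.

```python
import json

PAGE_SIZE = 20  # same page size as the limit=20 parameter in the spider


def parse_page(body, offset):
    """Return (items, next_offset); next_offset is None once a page is empty."""
    movies = json.loads(body)
    if not movies:
        return [], None
    items = [
        {'name': m['title'], 'score': m['score'], 'url': m['url']}
        for m in movies
    ]
    return items, offset + PAGE_SIZE


# Made-up sample responses, shaped like the fields the spider reads.
sample_page = json.dumps([
    {'title': 'Example Movie', 'score': '9.0', 'url': 'https://example.com/movie/1'},
])
empty_page = '[]'

items, next_offset = parse_page(sample_page, 0)
last_items, last_offset = parse_page(empty_page, next_offset)
```

The first call returns one item and the next offset; the second call sees an empty list and signals that crawling should stop, which mirrors the early return in the spider.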

3. Write the pipeline

# -*- coding: utf-8 -*-

import json

class DoubanMoviePipeline(object):
    def open_spider(self,spider):
        self.file = open('douban_movie.txt','w',encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.file.write(content)
        return item

    def close_spider(self,spider):
        self.file.close()
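
The pipeline writes one JSON object per line. As a small sketch of why ensure_ascii=False is used and how such a file can be read back, the example below writes to an in-memory buffer instead of douban_movie.txt; the items are made-up placeholders.

```python
import io
import json

# Made-up items standing in for what the spider yields.
items = [
    {'name': '示例电影', 'score': '9.0', 'url': 'https://example.com/movie/1'},
    {'name': 'Example Movie', 'score': '8.9', 'url': 'https://example.com/movie/2'},
]

buffer = io.StringIO()  # stands in for the file opened in open_spider
for item in items:
    buffer.write(json.dumps(item, ensure_ascii=False) + '\n')

# With the default ensure_ascii=True, non-ASCII text is written as \uXXXX escapes.
escaped = json.dumps({'name': '示例电影'})
# With ensure_ascii=False, the text stays readable in the output file.
readable = json.dumps({'name': '示例电影'}, ensure_ascii=False)

# Reading back: parse each line as an independent JSON document.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Both forms load back to the same data; ensure_ascii=False only affects how readable the saved file is when opened in an editor.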

4. Write the settings

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Obey robots.txt rules (disabled here; the project template sets this to True)
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban_movie.pipelines.DoubanMoviePipeline': 300,
}

5. Write main

from scrapy.cmdline import execute
execute('scrapy crawl catch_movie'.split())

Screenshot of the saved file's contents:

[Screenshot: contents of douban_movie.txt after the crawl]

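Notes:

1. main is written to make debugging easier (the spider can be started from the IDE instead of the command line)

2. This ranking restricts the score interval through the URL (parameters such as 100:90)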
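
As a rough illustration of note 2, the sketch below generates interval strings in the same high:low shape as 100:90. That other intervals follow the pattern 90:80, 80:70 and so on is an assumption, not something this post verifies, and interval_ids is a hypothetical helper.

```python
def interval_ids(step=10):
    """Yield 'high:low' strings from '100:90' down to '10:0' (assumed pattern)."""
    for high in range(100, 0, -step):
        yield '{}:{}'.format(high, high - step)


ids = list(interval_ids())
print(ids)
```

Each string could then be passed as the interval_id query parameter, percent-encoding the colon as %3A as in the spider's start URL.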
