A simple Scrapy crawler: Douban drama film rankings

程序员文章站 2022-04-28 08:37:44

Goal: a simple Scrapy exercise that scrapes the top 20% of Douban's drama film ranking and saves the results to a file.

URL:

https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85%E7%89%87&type=11&interval_id=100:90&action=

Notes on the page:

   1. The 100:90 part of the URL selects the highest-scoring 20% of the ranking.

   2. The page-analysis process is skipped here.
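The spider further down does not request the ranking page itself; its start_urls point at a JSON endpoint that takes the same type and interval_id values plus start and limit. As a minimal sketch of how such a URL can be assembled, the helper below (build_api_url is a hypothetical name, not part of the original project) uses the standard library to encode the parameters, which is where the %3A in place of the colon comes from.

```python
from urllib.parse import urlencode

# Endpoint taken from the spider's start_urls in this article.
API_BASE = 'https://movie.douban.com/j/chart/top_list'


def build_api_url(start, limit=20, type_id=11, interval='100:90'):
    """Build the URL for one page of the JSON ranking (hypothetical helper)."""
    params = [
        ('type', type_id),
        ('interval_id', interval),  # ':' is percent-encoded as %3A
        ('action', ''),
        ('start', start),
        ('limit', limit),
    ]
    return API_BASE + '?' + urlencode(params)


print(build_api_url(0))
```

Calling build_api_url(20), build_api_url(40) and so on gives the follow-up pages that the spider requests by incrementing its offset.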

System and software: Windows 7 with PyCharm, Python 3.6

Code:

   1. Write the item

# -*- coding: utf-8 -*-
import scrapy

class DoubanMovieItem(scrapy.Item):
    name = scrapy.Field()
    score = scrapy.Field()
    url = scrapy.Field()

   2. Write the spider

# -*- coding:utf-8 -*-
import scrapy
import json
from douban_movie.items import DoubanMovieItem

class CatchMovieSpider(scrapy.Spider):
    name = 'catch_movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20']
    offset = 0

    def parse(self, response):
        # The endpoint returns a JSON array; an empty array means no more pages.
        movie_list = json.loads(response.text)
        if not movie_list:
            return
        for movie in movie_list:
            # Create a fresh item per movie so yielded items do not share state.
            item = DoubanMovieItem()
            item['name'] = movie['title']
            item['score'] = movie['score']
            item['url'] = movie['url']
            yield item
        # Request the next page of 20 results.
        self.offset += 20
        new_url = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start={}&limit=20'.format(self.offset)
        yield scrapy.Request(url=new_url, callback=self.parse)
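The paging logic in parse can be tried without Scrapy or network access. The following is only a sketch under assumptions: parse_page is a hypothetical helper, and the sample payload is made up to mirror the three fields the spider reads (title, score, url). It shows the stop condition (an empty JSON array) and the offset advancing by the page size.

```python
import json


def parse_page(body, start, limit=20):
    """Return (records, next_start); next_start is None on an empty page."""
    movies = json.loads(body)
    if not movies:
        return [], None
    records = [
        {'name': m['title'], 'score': m['score'], 'url': m['url']}
        for m in movies
    ]
    return records, start + limit


# Made-up sample payload shaped like the fields the spider reads.
sample = json.dumps([
    {'title': 'Example A', 'score': '9.5', 'url': 'https://example.com/a'},
    {'title': 'Example B', 'score': '9.1', 'url': 'https://example.com/b'},
])

records, next_start = parse_page(sample, start=0)
print(len(records), next_start)      # 2 20
print(parse_page('[]', start=20))    # ([], None)
```

Returning the next offset instead of storing it on the spider instance is a design choice of this sketch; the spider in the article keeps it in self.offset.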

   3. Write the pipeline

# -*- coding: utf-8 -*-

import json

class DoubanMoviePipeline(object):
    def open_spider(self,spider):
        self.file = open('douban_movie.txt','w',encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.file.write(content)
        return item

    def close_spider(self,spider):
        self.file.close()
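The pipeline writes one JSON object per line and passes ensure_ascii=False to json.dumps. A small sketch of what that flag changes, using a made-up item and an in-memory buffer in place of the real file:

```python
import io
import json

# Made-up item with a non-ASCII name, standing in for a scraped movie.
item = {'name': '示例电影', 'score': '9.0', 'url': 'https://example.com/m'}

buffer = io.StringIO()  # stands in for the opened text file
for ensure_ascii in (True, False):
    buffer.write(json.dumps(item, ensure_ascii=ensure_ascii) + '\n')

escaped, readable = buffer.getvalue().splitlines()
print(escaped)   # name appears as \uXXXX escapes
print(readable)  # name appears as the original characters

# Both lines decode to the same data; only readability of the file differs.
assert json.loads(escaped) == json.loads(readable) == item
```

With ensure_ascii=False the file must be opened with an encoding that can represent the characters, which is why the pipeline opens it with encoding='utf-8'.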

   4. Edit the settings

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# robots.txt rules are not obeyed in this exercise
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban_movie.pipelines.DoubanMoviePipeline': 300,
}

   5. Write main

from scrapy.cmdline import execute
execute('scrapy crawl catch_movie'.split())

Screenshot of the saved file contents: [screenshot omitted]
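Given the pipeline above, each line of douban_movie.txt should be one JSON object. As a rough sketch of reading such a file back, the code below writes two made-up rows to a temporary file and loads them again. The load_movies helper is hypothetical, and converting score with float() assumes the score is stored as a numeric string.

```python
import json
import os
import tempfile


def load_movies(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up rows written to a temporary file so the sketch runs on its own.
rows = [
    {'name': 'Example A', 'score': '9.1', 'url': 'https://example.com/a'},
    {'name': 'Example B', 'score': '9.5', 'url': 'https://example.com/b'},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'douban_movie.txt')
    with open(path, 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')
    movies = load_movies(path)

# Sort by score, highest first.
best_first = sorted(movies, key=lambda m: float(m['score']), reverse=True)
print([m['name'] for m in best_first])
```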

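Notes:

   1. The main script exists to make debugging easier: the crawl can be started and stepped through from PyCharm instead of the command line.

   2. The ranking is restricted to a score interval through the URL (a parameter such as 100:90).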
