A simple Scrapy spider: the Douban drama film chart
2022-04-28 08:37:44
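Goal: a simple Scrapy exercise that scrapes the highest-rated band of Douban's drama film chart and saves the results to a file.
URL: see start_urls in the spider (step 2)
Notes on the page:
1. The 100:90 part of the URL (encoded there as 100%3A90) selects the rating band: the films rated higher than 90% to 100% of the genre, i.e. the top 10% of the chart. The limit=20 parameter only sets how many films each request returns.
2. The analysis of the page itself is skipped here.
Environment: Windows 7, PyCharm, Python 3.6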
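The URL structure described in the page notes can be sketched with the standard library. This is only an illustration: build_chart_url and its parameter names are hypothetical and are not part of the original project; the parameter values are taken from the URL in the spider.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/j/chart/top_list'


def build_chart_url(start, limit=20, interval='100:90', genre_type=11):
    """Build the chart URL for one page of results.

    Hypothetical helper for illustration; the original spider
    formats the URL string directly.
    """
    params = {
        'type': genre_type,       # 11 is the value used in the article's URL
        'interval_id': interval,  # urlencode turns ':' into '%3A'
        'action': '',
        'start': start,           # index of the first film on this page
        'limit': limit,           # number of films per request
    }
    return BASE_URL + '?' + urlencode(params)


first_page = build_chart_url(0)
second_page = build_chart_url(20)
print(first_page)
print(second_page)
```

Changing only start moves through the chart page by page, while interval stays fixed for the whole crawl.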
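Code:
1. Write the item (items.py)
# -*- coding: utf-8 -*-
import scrapy


class DoubanMovieItem(scrapy.Item):
    """Fields stored for each film on the chart."""
    name = scrapy.Field()   # film title
    score = scrapy.Field()  # Douban rating
    url = scrapy.Field()    # link to the film's page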
2. Write the spider
# -*- coding: utf-8 -*-
import json

import scrapy

from douban_movie.items import DoubanMovieItem


class CatchMovieSpider(scrapy.Spider):
    name = 'catch_movie'
    allowed_domains = ['douban.com']
    # JSON endpoint behind the chart page; start and limit control paging.
    url_template = ('https://movie.douban.com/j/chart/top_list'
                    '?type=11&interval_id=100%3A90&action=&start={}&limit=20')
    start_urls = [url_template.format(0)]
    offset = 0

    def parse(self, response):
        movie_list = json.loads(response.text)
        if not movie_list:
            # An empty list means the previous page was the last one.
            return
        for movie in movie_list:
            # Create a new item per film; reusing one item object would let
            # later films overwrite items that were already yielded.
            item = DoubanMovieItem()
            item['name'] = movie['title']
            item['score'] = movie['score']
            item['url'] = movie['url']
            yield item
        # Request the next page of 20 films.
        self.offset += 20
        yield scrapy.Request(url=self.url_template.format(self.offset), callback=self.parse)
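The paging logic of the spider (request a page, emit its films, advance the offset by 20, stop on an empty list) can be tried without Scrapy or network access. The following is a minimal sketch under stated assumptions: iter_movies, fake_fetch_page and the FAKE_MOVIES data are all hypothetical stand-ins, not part of the original project or of Douban's API.

```python
import json


def iter_movies(fetch_page, page_size=20):
    """Yield one dict per film, requesting pages until an empty one arrives.

    fetch_page is a hypothetical callable that takes a start offset and
    returns the JSON text of that page.
    """
    offset = 0
    while True:
        movie_list = json.loads(fetch_page(offset))
        if not movie_list:
            return  # empty list: no more pages
        for movie in movie_list:
            # A fresh dict per film, so earlier results are never overwritten.
            yield {
                'name': movie['title'],
                'score': movie['score'],
                'url': movie['url'],
            }
        offset += page_size


# Made-up data standing in for the real endpoint: 45 films.
FAKE_MOVIES = [
    {'title': 'movie-%d' % i, 'score': '9.0', 'url': 'https://example.com/%d' % i}
    for i in range(45)
]


def fake_fetch_page(start, limit=20):
    """Return one page of the made-up data as JSON text."""
    return json.dumps(FAKE_MOVIES[start:start + limit])


movies = list(iter_movies(fake_fetch_page))
print(len(movies))
```

With 45 fake films the loop reads three non-empty pages and stops on the fourth, which is the same termination condition the spider relies on.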
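3. Write the pipeline (pipelines.py)
# -*- coding: utf-8 -*-
import json


class DoubanMoviePipeline(object):
    def open_spider(self, spider):
        # Opened once when the spider starts; 'w' overwrites any earlier run.
        self.file = open('douban_movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese titles readable.
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()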
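Because the pipeline writes one JSON object per line, the output file can be read back line by line. The sketch below assumes that format; the sample records are made up, and a temporary directory is used so that no real douban_movie.txt is touched.

```python
import json
import os
import tempfile

# Made-up records in the shape the pipeline writes (name, score, url).
sample_items = [
    {'name': '示例电影一', 'score': '9.5', 'url': 'https://example.com/1'},
    {'name': '示例电影二', 'score': '9.1', 'url': 'https://example.com/2'},
]

path = os.path.join(tempfile.mkdtemp(), 'douban_movie.txt')

# Write one JSON object per line, as process_item does.
with open(path, 'w', encoding='utf-8') as f:
    for item in sample_items:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Read the file back: each non-empty line is an independent JSON document.
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == sample_items)
```

Opening the file with encoding='utf-8' on both sides matters on Windows, where the default encoding is usually not UTF-8.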
4. Edit the settings (settings.py)
# Send a desktop browser User-Agent instead of Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
# robots.txt rules are not obeyed in this exercise
ROBOTSTXT_OBEY = False
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_movie.pipelines.DoubanMoviePipeline': 300,
}
5. Write main
# Run the spider from a script so it can be started and debugged in the IDE.
from scrapy.cmdline import execute

execute('scrapy crawl catch_movie'.split())
Screenshot of the saved file's contents:
Notes:
1. main exists to make debugging easier.
2. The chart's range is fixed in the URL (by a parameter such as 100:90).