Scraping Douban Movies with Scrapy
The four steps of a Scrapy crawl
- Create the project
- Define the items (the data you want)
- Write the spider
- Store the scraped data
Create the project
scrapy startproject douban
Create the spider file
scrapy genspider douban_spider movie.douban.com
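After running these two commands, the project layout should look roughly like this (the exact scaffold can vary slightly with the Scrapy version):

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban_spider.py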
The spider skeleton generated after creating the file:
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # entry URL handed to the scheduler; append the top250 path yourself
    start_urls = ['https://movie.douban.com/top250']

    # default parse callback
    def parse(self, response):
        pass
Write the items file
An Item is a container for the scraped data and is used much like a dictionary.
To create one, subclass scrapy.Item and declare each field as a scrapy.Field.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # MongoDB collection name used later by the pipeline
    collection = 'douban_movie'
    # ranking number
    serial_number = Field()
    # movie title
    movie_name = Field()
    # movie introduction
    introduce = Field()
    # star rating
    star = Field()
    # number of reviews
    evaluate = Field()
    # one-line quote describing the movie
    describe = Field()
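As a quick illustration of the dictionary-like usage mentioned above, here is a minimal sketch with a made-up value:

from douban.items import DoubanItem

item = DoubanItem()
item['movie_name'] = '肖申克的救赎'  # hypothetical sample value
print(item['movie_name'])            # 肖申克的救赎
print(dict(item))                    # {'movie_name': '肖申克的救赎'}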
Parse the response
Next we parse with XPath; the response object has a built-in XPath selector.
Extract each field of the page in turn.
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # entry URL handed to the scheduler
    start_urls = ['https://movie.douban.com/top250']

    # default parse callback
    def parse(self, response):
        # loop over the movie entries on the page (XPath rule for the list items)
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            # instantiate the item
            douban_item = DoubanItem()
            # detailed XPath for each field
            douban_item['serial_number'] = i_item.xpath(".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(
                ".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(".//div[@class='info']/div[@class='bd']/p[1]/text()").extract()
            # the introduction spans several lines, so collapse the whitespace
            douban_item['introduce'] = [" ".join(i.split()) for i in content]
            douban_item['star'] = i_item.xpath(".//span[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(".//p[@class='quote']/span/text()").extract_first()
            # yield the item to the pipelines for cleaning and storage
            yield douban_item
        # next-page rule: take the href of the next-page link
        next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
        if next_link:
            yield scrapy.Request("https://movie.douban.com/top250" + next_link, callback=self.parse)
Note that the movie introduction is scraped as multiple lines of text, so it needs a little cleanup, as shown in the sketch below.
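A minimal sketch of what that list comprehension does, using made-up sample strings in place of the real page text:

# hypothetical raw results of the p[1]/text() XPath
content = [
    "\n    导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯\n    ",
    "\n    1994 / 美国 / 犯罪 剧情\n    ",
]
# split() discards every run of whitespace; join() glues the words back with single spaces
cleaned = [" ".join(part.split()) for part in content]
# -> ['导演: 弗兰克·德拉邦特 主演: 蒂姆·罗宾斯', '1994 / 美国 / 犯罪 剧情']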
After each page is parsed, look for the next page: extract the next-page link, check whether it exists, and if so yield a scrapy.Request with the new URL and the same parse callback.
The spider can now crawl the pages.
Save the results
Run scrapy crawl douban_spider -o result.csv (the argument is the spider name, not the project name).
The result file is written to the directory where the command is run.
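Scrapy's feed exports infer the format from the file extension, so other formats work the same way; for example (the file names here are just placeholders):

scrapy crawl douban_spider -o result.json
scrapy crawl douban_spider -o result.xml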
Save the data to MongoDB
First install pymongo: pip install pymongo
Then add the connection settings to settings.py:
MONGO_URL = '127.0.0.1'
MONGO_DB = 'douban'
Next, write the pipelines file.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

class DoubanPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection info from settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one replaces the deprecated insert() in current pymongo
        self.db[item.collection].insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()
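After a crawl finishes, a quick sanity check (a standalone sketch, assuming the local MongoDB from the settings above) confirms the documents landed in the douban_movie collection:

import pymongo

client = pymongo.MongoClient('127.0.0.1')
db = client['douban']
# how many movies were stored, and a peek at one document
print(db['douban_movie'].count_documents({}))
print(db['douban_movie'].find_one())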
Then enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
Rotate a random user-agent
Edit the middleware file middlewares.py and add a class my_useragent.
import random

class my_useragent(object):
    def process_request(self, request, spider):
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        # pick a random user-agent for every outgoing request
        user_agent = random.choice(user_agent_list)
        request.headers['User-Agent'] = user_agent
Finally, enable the middleware in settings.py.
DOWNLOADER_MIDDLEWARES = {
    # 'douban.middlewares.DoubanDownloaderMiddleware': 543,
    'douban.middlewares.my_useragent': 543,
}
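To confirm the rotation actually takes effect, one option (a small sketch, not part of the original spider) is to log the header that was sent from inside the parse callback:

    def parse(self, response):
        # the request that produced this response carries the header set by the middleware
        self.logger.info("User-Agent used: %s", response.request.headers.get('User-Agent'))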
Summary
- Whenever you write pipelines.py or middlewares.py, remember to enable them in settings.py; the smaller the number, the higher the priority.
- The spider name must not be the same as the project name, and the spiders directory must not contain two spiders with the same name.