Scraping Douban Movie Information
Yesterday I wrote a small crawler that scraped information on 2017 movies from mainland China on Douban (the "豆瓣选影视" page), collecting each film's name, director, scriptwriter, starring cast, genre, release date, running time, score and URL, and saving everything to MongoDB.
At first I used my own IP address without any proxy. After requesting a dozen or so pages I stopped getting data back and every request returned an HTTP 302 error; I tried opening the page in a browser, and the browser got a 302 too...
But I'm not scared, I have proxy IPs, hahaha! See my previous post for details: scraping proxy IPs.
Once I switched to proxy IPs, the data did indeed keep coming, though 302 errors still cropped up in between. No big deal: just re-send the request through another proxy IP. If once doesn't do it, try again; if twice doesn't do it, try a third time; and if that still fails, well...
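A side note: instead of hand-rolling the retry inside the spider the way I did below, Scrapy's built-in RetryMiddleware can be told to treat 302 as a retryable failure. This is only a sketch of that alternative, not what I actually ran; it assumes the 302s are anti-bot redirects you would rather retry than follow, and since the proxy middleware picks a random proxy per request, each retry automatically goes out through a different IP.

# settings.py -- sketch of an alternative, not used in the original run
REDIRECT_ENABLED = False   # don't follow 3xx redirects; let them fail
RETRY_ENABLED = True
RETRY_TIMES = 10           # up to 10 attempts per request
RETRY_HTTP_CODES = [302, 500, 502, 503, 504, 408]  # add 302 to the usual suspects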
Part of the code is attached below.
1. The spider file
import scrapy
import json
from douban.items import DoubanItem

parse_url = "https://movie.douban.com/j/new_search_subjects?sort=u&range=0,10&tags=%e7%94%b5%e5%bd%b1&start={}&countries=%e4%b8%ad%e5%9b%bd%e5%a4%a7%e9%99%86&year_range=2017,2017"


class Cn2017Spider(scrapy.Spider):
    name = 'cn2017'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/j/new_search_subjects?sort=u&range=0,10&tags=%e7%94%b5%e5%bd%b1&start=0&countries=%e4%b8%ad%e5%9b%bd%e5%a4%a7%e9%99%86&year_range=2017,2017']

    def parse(self, response):
        # the endpoint returns JSON, not HTML
        data = json.loads(response.body.decode())
        if data is not None:
            for film in data["data"]:
                print(film["url"])
                item = DoubanItem()
                item["url"] = film["url"]
                # follow each film's detail page
                yield scrapy.Request(
                    film["url"],
                    callback=self.get_detail_content,
                    meta={"item": item}
                )
            # schedule the remaining result pages, 20 films per page
            for page in range(20, 3200, 20):
                yield scrapy.Request(
                    parse_url.format(page),
                    callback=self.parse
                )

    def get_detail_content(self, response):
        item = response.meta["item"]
        item["film_name"] = response.xpath("//div[@id='content']//span[@property='v:itemreviewed']/text()").extract_first()
        item["director"] = response.xpath("//div[@id='info']/span[1]/span[2]/a/text()").extract_first()
        item["scriptwriter"] = response.xpath("//div[@id='info']/span[2]/span[2]/a/text()").extract()
        item["starring"] = response.xpath("//div[@id='info']/span[3]/span[2]/a[position()<6]/text()").extract()
        item["type"] = response.xpath("//div[@id='info']/span[@property='v:genre']/text()").extract()
        item["release_date"] = response.xpath("//div[@id='info']/span[@property='v:initialReleaseDate']/text()").extract()
        item["running_time"] = response.xpath("//div[@id='info']/span[@property='v:runtime']/@content").extract_first()
        item["score"] = response.xpath("//div[@class='rating_self clearfix']/strong/text()").extract_first()
        if item["film_name"] is None:
            # the page came back without real content (302 / anti-bot);
            # re-request it, which goes out through another random proxy
            yield scrapy.Request(
                item["url"],
                callback=self.get_detail_content,
                meta={"item": item},
                dont_filter=True
            )
        else:
            yield item
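For reference, this endpoint returns JSON rather than HTML, which is why parse() decodes response.body instead of using XPath. Here is a quick standalone probe of the first page; the spider only relies on the "url" key, and the other field names mentioned in the comment are from memory and worth verifying yourself:

import requests

url = ("https://movie.douban.com/j/new_search_subjects?sort=u&range=0,10"
       "&tags=%e7%94%b5%e5%bd%b1&start=0"
       "&countries=%e4%b8%ad%e5%9b%bd%e5%a4%a7%e9%99%86&year_range=2017,2017")
data = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).json()
for film in data["data"][:3]:
    # each entry is a dict; besides "url" it should also carry
    # fields such as "title" and "rate"
    print(film["url"])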
2. The items.py file
import scrapy


class DoubanItem(scrapy.Item):
    # film name
    film_name = scrapy.Field()
    # director
    director = scrapy.Field()
    # scriptwriter
    scriptwriter = scrapy.Field()
    # starring cast
    starring = scrapy.Field()
    # genre
    type = scrapy.Field()
    # release date
    release_date = scrapy.Field()
    # running time
    running_time = scrapy.Field()
    # score
    score = scrapy.Field()
    # link
    url = scrapy.Field()
3. The middlewares.py file
from douban.settings import USER_AGENT_LIST
import random
import pandas as pd


class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        user_agent = random.choice(USER_AGENT_LIST)
        request.headers["User-Agent"] = user_agent
        return None


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # called for each request that goes through the downloader
        # middleware; route it through a random proxy from the scraped list
        ip_df = pd.read_csv(r"C:\Users\Administrator\Desktop\douban\douban\ip.csv")
        ip = random.choice(ip_df.loc[:, "ip"])
        request.meta["proxy"] = "http://" + ip
        return None
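The proxy middleware assumes the ip.csv written by the proxy-scraping post has an "ip" column of host:port strings, something like this (the addresses here are made-up placeholders):

ip
122.114.31.177:808
61.135.217.7:80

Re-reading the file on every request costs a little disk I/O, but it means a freshly updated proxy list is picked up without restarting the crawl.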
4. The pipelines.py file
from pymongo import MongoClient

client = MongoClient()
collection = client["test"]["douban"]


class DoubanPipeline(object):
    def process_item(self, item, spider):
        # save each film into MongoDB (database "test", collection "douban")
        collection.insert_one(dict(item))
        return item
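While the crawl is running you can spot-check the collection from a Python shell; here is a minimal sketch using the same database and collection names as the pipeline:

from pymongo import MongoClient

client = MongoClient()                 # local MongoDB on the default port
collection = client["test"]["douban"]
print(collection.count_documents({}))  # how many films stored so far
print(collection.find_one())           # peek at one stored item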
5. The settings.py file
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.UserAgentMiddleware': 543,
    'douban.middlewares.ProxyMiddleware': 544,
}

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 10
RETRY_ENABLED = True
RETRY_TIMES = 10
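One thing not shown above: UserAgentMiddleware imports USER_AGENT_LIST from this same settings.py, so the file also needs a pool of User-Agent strings along these lines (the two entries here are placeholders; any real browser UAs will do):

# settings.py (continued) -- placeholder values
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
]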
The program ran for a total of 1 hour, 20 minutes and 21.473772 seconds and scraped 2,986 records.
And finally: stay happy every day!