Scrapy example: scraping Anjuke rental listings
程序员文章站
2022-07-02 21:34:50
This post scrapes the Anjuke website for rental listings in Changning District, Shanghai (adapted from a WeChat public account article).
As before, the crawler is built with the Scrapy framework. The steps:
1. Analyze the page
2. items.py
3. spiders.py
4. pipelines.py
5. settings.py
- Inspecting the page
Rental listings for Changning District, Shanghai: https://sh.zu.anjuke.com/fangyuan/changning/
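A side note (my own sketch, not from the original post): the two `allow` patterns used by the crawl rules later on can be checked offline with the stdlib `re` module. The URLs below are made-up examples shaped like the pagination and 10-digit detail-page patterns the rules expect, not live listings.

```python
import re

# The two URL patterns the CrawlSpider rules key off:
PAGE_RE = re.compile(r'fangyuan/p\d+/')                                 # pagination -> follow
DETAIL_RE = re.compile(r'https://sh.zu.anjuke.com/fangyuan/\d{10}')     # detail page -> parse

page_url = 'https://sh.zu.anjuke.com/fangyuan/p2/'          # hypothetical page 2 URL
detail_url = 'https://sh.zu.anjuke.com/fangyuan/1234567890' # hypothetical listing id

print(bool(PAGE_RE.search(page_url)))      # True
print(bool(DETAIL_RE.search(detail_url)))  # True
print(bool(DETAIL_RE.search(page_url)))    # False: pagination URLs lack the 10-digit id
```

This makes it easy to sanity-check the regexes before running the spider against the live site.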
- items.py
Define the fields that will hold the scraped data here.
```python
import scrapy

class AnjukeSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    price = scrapy.Field()
    rent_type = scrapy.Field()
    house_type = scrapy.Field()
    area = scrapy.Field()
    towards = scrapy.Field()
    floor = scrapy.Field()
    decoration = scrapy.Field()
    building_type = scrapy.Field()
    community = scrapy.Field()
```
- spiders.py
The spider file: tells the crawler what to scrape and how to scrape it.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from anjukespider.items import AnjukeSpiderItem

# Define the spider class
class Anjuke(CrawlSpider):
    # spider name
    name = 'anjuke'
    # starting page
    start_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']
    # crawl rules
    rules = (
        # the page has a "next page" button, so follow=True to crawl every page
        Rule(LinkExtractor(allow=r'fangyuan/p\d+/'), follow=True),
        # the page also links "recommended" listings that may not be in Changning,
        # so follow=False to stop at the detail page itself
        Rule(LinkExtractor(allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'),
             follow=False, callback='parse_item'),
    )

    # callback: mostly XPath extraction, covered in the previous post
    def parse_item(self, response):
        item = AnjukeSpiderItem()
        # rent
        item['price'] = int(response.xpath(
            "//ul[@class='house-info-zufang cf']/li[1]/span[1]/em/text()").extract_first())
        # rental type (whole flat / shared)
        item['rent_type'] = response.xpath(
            "//ul[@class='title-label cf']/li[1]/text()").extract_first()
        # layout
        item['house_type'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[2]/span[2]/text()").extract_first()
        # area
        item['area'] = int(response.xpath(
            "//ul[@class='house-info-zufang cf']/li[3]/span[2]/text()")
            .extract_first().replace('平方米', ''))
        # orientation
        item['towards'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[4]/span[2]/text()").extract_first()
        # floor
        item['floor'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[5]/span[2]/text()").extract_first()
        # decoration
        item['decoration'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[6]/span[2]/text()").extract_first()
        # building type
        item['building_type'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[7]/span[2]/text()").extract_first()
        # residential community
        item['community'] = response.xpath(
            "//ul[@class='house-info-zufang cf']/li[8]/a[1]/text()").extract_first()
        yield item
```
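One caveat worth knowing: `extract_first()` returns `None` when the XPath matches nothing, so the `int(...)` conversions above will raise `TypeError` on a page whose layout differs. A defensive sketch of the area cleanup (the helper name is my own, not from the post):

```python
def parse_area(raw):
    """Convert a raw area string like '75平方米' to an int, or None if missing."""
    if raw is None:
        return None
    return int(raw.replace('平方米', '').strip())

print(parse_area('75平方米'))  # 75
print(parse_area(None))        # None
```

The same guard could wrap the price conversion; for a one-off crawl of a known district the unguarded version in the spider works, but it is brittle against layout changes.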
- pipelines.py
Persist the scraped data; here it is saved only as JSON.
This part is actually optional: instead of writing a pipeline, you can pass arguments at run time:
scrapy crawl anjuke -o anjuke.json -t json
i.e. scrapy crawl <spider name> -o <output file> -t <format>
```python
from scrapy.exporters import JsonItemExporter

class AnjukeSpiderPipeline(object):
    def __init__(self):
        self.file = open('zufang_shanghai.json', 'wb')  # output file path
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        print('write')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('close')
        self.exporter.finish_exporting()
        self.file.close()
```
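`JsonItemExporter` buffers all items and writes them out as a single JSON array. Without pulling in Scrapy, the shape of the resulting file can be approximated with the stdlib `json` module (the field values below are made up for illustration):

```python
import io
import json

# Hypothetical scraped items, shaped like the spider's output
items = [
    {'price': 5000, 'rent_type': '整租', 'area': 75, 'community': '某小区'},
    {'price': 3200, 'rent_type': '合租', 'area': 20, 'community': '某小区'},
]

buf = io.StringIO()
# ensure_ascii=False keeps the Chinese field values human-readable,
# matching the ensure_ascii=False passed to JsonItemExporter above
json.dump(buf2 := items, buf, ensure_ascii=False) if False else json.dump(items, buf, ensure_ascii=False)
print(buf.getvalue()[0])  # '[' -- the whole file is one JSON array
```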
- settings.py
Edit the settings file to enable the pipeline, and set a download delay so the site doesn't block us for requesting too fast.
```python
ITEM_PIPELINES = {
    'anjukespider.pipelines.AnjukeSpiderPipeline': 300,
}
DOWNLOAD_DELAY = 2
```
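Scrapy's `RANDOMIZE_DOWNLOAD_DELAY` setting defaults to `True`, so the actual wait between requests is a uniform random value between 0.5x and 1.5x of `DOWNLOAD_DELAY`, which makes the traffic look less mechanical. A quick sketch of that jitter:

```python
import random

DOWNLOAD_DELAY = 2  # seconds, as in settings.py

def next_delay():
    # Scrapy's randomized delay: uniform in [0.5 * delay, 1.5 * delay]
    return random.uniform(0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)

d = next_delay()
print(1.0 <= d <= 3.0)  # True: every wait falls between 1 and 3 seconds
```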
- Run: open a terminal in the project root and enter
scrapy crawl <spider name>
PS F:\scrapyproject\anjukespider\anjukespider> scrapy crawl anjuke
Execution complete: 61 responses were fetched, and the JSON file was generated at the specified path.
2018-10-22 09:02:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 40861,
 'downloader/request_count': 61,
 'downloader/request_method_count/GET': 61,
 'downloader/response_bytes': 1925879,
 'downloader/response_count': 61,
 'downloader/response_status_count/200': 61,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 22, 1, 2, 55, 245128),
 'item_scraped_count': 60,
 'log_count/DEBUG': 122,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 61,
 'scheduler/dequeued': 61,
 'scheduler/dequeued/memory': 61,
 'scheduler/enqueued': 61,
 'scheduler/enqueued/memory': 61,
 'start_time': datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)}
2018-10-22 09:02:55 [scrapy.core.engine] INFO: Spider closed (finished)
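The stats dump is enough to compute the crawl's throughput: `item_scraped_count` is 60, and `start_time`/`finish_time` bound the run. A small sketch of the arithmetic:

```python
import datetime

# start_time and finish_time copied from the stats dump above
start = datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)
finish = datetime.datetime(2018, 10, 22, 1, 2, 55, 245128)
items = 60  # item_scraped_count

elapsed = (finish - start).total_seconds()  # wall-clock duration of the crawl
rate = items / elapsed * 60                 # items per minute
print(round(elapsed, 1), round(rate, 1))    # -> 145.7 24.7
```

Roughly 25 items per minute, which is consistent with a 2-second (randomized) download delay per request.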
The crawler is now complete, but the scraped data isn't very readable on its own; it still needs visualization (with the pyecharts module), which I'll cover in a separate pyecharts post.
- pyecharts official docs: http://pyecharts.org/#/zh-cn/