Scrapy in practice: using CrawlSpider to crawl all Tencent recruitment job postings (a crawl with some depth)
After a basic introduction to Scrapy, let's build the following crawler: it scrapes every job posting from Tencent's recruitment site, saves the rough summary information to tencent.json, and saves the more detailed information for each position (duties and requirements) to positiondescribe.json.
In other words, we need two items to store the page data, and we need to subclass CrawlSpider to extract the relevant links from the pages.
The project directory looks like this (the project is named TencentSpider):
TencentSpider
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├─spiders
│  │  tencent.py
│  │  __init__.py
│  │
│  └─__pycache__
│          tencent.cpython-36.pyc
│          __init__.cpython-36.pyc
│
└─__pycache__
        items.cpython-36.pyc
        pipelines.cpython-36.pyc
        settings.cpython-36.pyc
        __init__.cpython-36.pyc
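For reference, a skeleton like this can be generated with Scrapy's own scaffolding commands (the -t crawl template pre-fills the CrawlSpider boilerplate); a sketch, assuming Scrapy is already installed:

scrapy startproject TencentSpider
cd TencentSpider
scrapy genspider -t crawl tencent hr.tencent.com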
The main difficulties are:
1. Handling multiple item types: in the pipelines file, check which item has been passed in; the item's __class__.__name__ attribute gives the class name to compare against (see the sketch after this list).
2. Writing the rules in the spider: be clear about which links or pages need to be matched and filtered.
3. Writing the parse methods in the spider file: remember that once you inherit from CrawlSpider you can no longer override the built-in parse method, so we write our own callback methods instead. After extracting the list links we also have to follow them into the detail pages, which is why we end up with two parse-style callbacks.
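For the first point, the incoming item's class can be identified either by its name string or with isinstance; a minimal sketch of both checks (is_summary_item is just an illustrative helper, not part of the project; the pipeline below uses the name-string variant):

from TencentSpider.items import TencentspiderItem

def is_summary_item(item):
    # compare the class name as a string ...
    by_name = item.__class__.__name__ == 'TencentspiderItem'
    # ... or check the type directly, which also covers subclasses
    by_type = isinstance(item, TencentspiderItem)
    return by_name or by_type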
The items.py code is as follows:
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TencentspiderItem(scrapy.Item):
    # job title
    positionName = scrapy.Field()
    # link to the detail page
    positionLink = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()


class PositionDescribe(scrapy.Item):
    # job title
    positionName = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # duties
    duty = scrapy.Field()
    # requirements
    requirement = scrapy.Field()
The tencent.py code is as follows:
# tencent.py
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from TencentSpider.items import TencentspiderItem
from TencentSpider.items import PositionDescribe


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']

    rules = (
        # pagination links of the job list -> tencentParse
        Rule(LinkExtractor(allow=r'&start=\d+'), callback='tencentParse', follow=True),
        # links to the detail page of each position -> positionParse
        Rule(LinkExtractor(allow=r'/position_detail.php?'), callback='positionParse', follow=True)
    )

    def tencentParse(self, response):
        # every job row is a <tr> with class "even" or "odd"
        jobs_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in jobs_list:
            item = TencentspiderItem()

            name = node.xpath('./td[1]/a/text()').extract()[0]
            link = node.xpath('./td[1]/a/@href').extract()[0]
            type = ''.join(node.xpath('./td[2]/text()').extract())
            num = node.xpath('./td[3]/text()').extract()[0]
            location = node.xpath('./td[4]/text()').extract()[0]
            date = node.xpath('./td[5]/text()').extract()[0]

            item['positionName'] = name
            item['positionLink'] = 'https://hr.tencent.com/' + str(link)
            item['positionType'] = type
            item['peopleNum'] = num
            item['workLocation'] = location
            item['publishTime'] = date

            yield item

    def positionParse(self, response):
        item = PositionDescribe()

        name = response.xpath('//td[@id="sharetitle"]/text()').extract()
        location = response.xpath('//tr[@class="c bottomline"]/td[1]/text()').extract()
        type = response.xpath('//tr[@class="c bottomline"]/td[2]/text()').extract()
        num = response.xpath('//tr[@class="c bottomline"]/td[3]/text()').extract()

        # join the duty bullet points into one string
        s = ''
        duties = response.xpath('//table//tr[3]//ul/li/text()').extract()
        for duty in duties:
            s += duty

        # join the requirement bullet points into one string
        requirements = response.xpath('//table//tr[4]//ul/li/text()').extract()
        q = ''
        for require in requirements:
            q += require

        # job title
        item['positionName'] = name
        # job category
        item['positionType'] = type
        # number of openings
        item['peopleNum'] = num
        # work location
        item['workLocation'] = location
        # duties
        item['duty'] = s
        # requirements
        item['requirement'] = q

        yield item
The pipelines.py code is as follows:
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentspiderPipeline(object):
    def __init__(self):
        # one output file per item type (an isinstance() check in process_item would work just as well)
        self.file = open('tencent.json', 'a', encoding='utf-8')
        self.file2 = open('positiondescribe.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dispatch on the class name: summary items go to tencent.json,
        # everything else (the detail items) goes to positiondescribe.json
        if item.__class__.__name__ == 'TencentspiderItem':
            jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.file.write(jsontext)
        else:
            jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.file2.write(jsontext)
        return item

    def close_spider(self, spider):
        self.file.close()
        self.file2.close()
These three files are the core of the project; once the crawl has run, the two JSON files are generated.
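Note that the pipeline only runs if it is enabled in settings.py, which the post does not show; a minimal sketch of the relevant settings, assuming the project layout above (disabling ROBOTSTXT_OBEY is an assumption, added because the crawl may otherwise be blocked by robots.txt):

# settings.py (relevant lines only)
ITEM_PIPELINES = {
    'TencentSpider.pipelines.TencentspiderPipeline': 300,
}
ROBOTSTXT_OBEY = False  # assumption: allow crawling even if robots.txt disallows it

The spider is then run from the project root with:

scrapy crawl tencent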