Crawling company and job information across the whole Zhaopin (智联招聘) site with Python Scrapy (Part 3)
程序员文章站
2022-03-02 23:23:29
The items class explained
-
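Using items
In Scrapy, items are where structured data is stored: the spider returns its parsed results as dict-like Item objects.
Below is the items.py file that Scrapy creates for the project by default:
import scrapy


class ZhaopinItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
Usage is simple: you only need to subclass scrapy.Item.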
-
Looking at the Field() source
Let's jump to the Field() class and see what it contains:
class Field(dict):
    """Container of field metadata"""
Field() inherits from dict:
class dict(object):
    """
    dict() -> new empty dictionary
    dict(mapping) -> new dictionary initialized from a mapping object's
        (key, value) pairs
    dict(iterable) -> new dictionary initialized as if via:
        d = {}
        for k, v in iterable:
            d[k] = v
    dict(**kwargs) -> new dictionary initialized with the name=value pairs
        in the keyword argument list.  For example:  dict(one=1, two=2)
    """
As the dict docstring shows, Field() is ultimately just a plain dictionary that holds metadata about a field; it adds no behavior of its own.
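To make the point concrete, here is a minimal sketch that uses a stand-in Field class copied from the definition shown above, so it runs without Scrapy installed. The `required` key is a made-up metadata name used only for illustration; `serializer` is a key that Scrapy's exporters do look at.

```python
class Field(dict):
    """Stand-in mirroring the Field definition shown above (not imported from Scrapy)."""


# Keyword arguments simply become dictionary entries, i.e. field metadata.
# "required" is a hypothetical key chosen for this illustration.
salary = Field(serializer=str, required=True)

is_dict = isinstance(salary, dict)
metadata_keys = sorted(salary.keys())
empty_field = Field()  # the usual scrapy.Field() call carries no metadata at all

print(is_dict)
print(metadata_keys)
print(len(empty_field))
```

In other words, declaring `name = scrapy.Field()` on an item only registers the field name; the Field object itself stores no scraped value.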
Adding Fields to items
Every value we extract in zhilian.py (worked out earlier with scrapy shell) needs a corresponding Field() declared on an item.
-
Add the job-related item
class ZhaopinJobItem(scrapy.Item):
    jobs_url = scrapy.Field()
    update_time = scrapy.Field()
    job_title = scrapy.Field()
    salary = scrapy.Field()
    job_area = scrapy.Field()
    experience = scrapy.Field()
    education = scrapy.Field()
    recruit_nums = scrapy.Field()
-
Add the company-related item
class ZhaopinCompanyItem(scrapy.Item):
    company_url = scrapy.Field()
    company_title = scrapy.Field()
    company_scale = scrapy.Field()
    company_industry = scrapy.Field()
    recruit_info = scrapy.Field()
    invite_nums = scrapy.Field()
Connecting items to the spider
-
Add the following code to parse_jobs in zhilian.py to connect the spider to the item:
# ZhaopinJobItem must be imported from the project's items module at the top of zhilian.py
def parse_jobs(self, response):
    job_item = ZhaopinJobItem()

    # Extract each value from the job detail page
    jobs_url = response.url
    update_time = response.css('.summary-plane__time::text').get()
    job_title = response.css('h3.summary-plane__title::text').get()
    salary = response.css('.summary-plane__salary::text').get()
    job_area = response.xpath('//*[@class="summary-plane__info"]/li[1]/a/text()').get()
    experience = response.xpath('//*[@class="summary-plane__info"]/li[2]/text()').get()
    education = response.xpath('//*[@class="summary-plane__info"]/li[3]/text()').get()
    recruit_nums = response.xpath('//*[@class="summary-plane__info"]/li[4]/text()').get()

    # Store each value under the Field of the same name
    job_item['jobs_url'] = jobs_url
    job_item['update_time'] = update_time
    job_item['job_title'] = job_title
    job_item['salary'] = salary
    job_item['job_area'] = job_area
    job_item['experience'] = experience
    job_item['education'] = education
    job_item['recruit_nums'] = recruit_nums

    return job_item
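Each `.get()` call in parse_jobs returns `None` when its selector matches nothing, and matched text can carry surrounding whitespace. Below is a minimal sketch of a normalizing step that could run before the values are stored. The helper name `clean` and the sample values are assumptions made for illustration; neither comes from the original spider.

```python
def clean(value):
    """Return stripped text, or None when nothing usable was extracted."""
    if value is None:
        return None
    text = value.strip()
    return text or None


# Hypothetical raw values, shaped like what .get() could return.
raw = {
    "job_title": "  Python Developer\n",
    "salary": None,      # selector matched nothing
    "education": "   ",  # whitespace only
}

cleaned = {name: clean(value) for name, value in raw.items()}
print(cleaned)
```

With a helper like this, an assignment such as `job_item['salary'] = clean(salary)` would keep blank strings out of the stored data.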
-
The company parser follows the same pattern:
# At the top of zhilian.py
import json
import re

import requests

# Inside the spider class
def parse_company(self, response):
    company_item = ZhaopinCompanyItem()

    company_url = response.url
    company_title = response.css('.overview__title h1::text').get()
    company_scale = response.css('.overview__detail-size span::text').get()
    company_industry = response.css('.overview__detail-industry span::text').get()
    recruit_info = response.css('.com-interview__item span::text').get()

    # re.search scans the whole URL (re.match would only try position 0),
    # and .group() returns the matched text rather than the Match object
    id_match = re.search(r'[1-9]\d+[1-9]', company_url)
    company_id = id_match.group() if id_match else None

    # The interview count comes from a separate API endpoint
    invite_api = 'https://fe-api.zhaopin.com/c/i/company/interview?rootCompanyId={0}&companyId={1}'.format(
        company_id, company_id)
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    # The keyword argument is "headers", not "header"
    rs = requests.get(url=invite_api, headers=headers).text
    invite_nums = json.loads(rs)['data']['data']

    company_item['company_url'] = company_url
    company_item['company_title'] = company_title
    company_item['company_scale'] = company_scale
    company_item['company_industry'] = company_industry
    company_item['recruit_info'] = recruit_info
    company_item['invite_nums'] = invite_nums

    return company_item
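The company ID is pulled out of the page URL with the regular expression `[1-9]\d+[1-9]`. Here is a minimal sketch of how `re.match` and `re.search` behave on such a URL. The URL is a made-up example, not a real company page.

```python
import re

# Hypothetical company page URL, used only for illustration.
company_url = "https://company.zhaopin.com/CZ123456789.htm"
pattern = r"[1-9]\d+[1-9]"

# re.match only tries position 0, where the URL starts with "h", so it finds nothing.
anchored = re.match(pattern, company_url)

# re.search scans the whole string; .group() turns the Match object into text.
found = re.search(pattern, company_url)
company_id = found.group() if found else None

print(anchored)    # None
print(company_id)  # 123456789
```

Because the pattern requires the last digit to be 1-9, an ID that ends in 0 would be matched without that final digit, so it is worth checking the pattern against a few real URLs.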