Python Scrapy: Crawling Company and Job Information Across All of Zhilian Zhaopin (Part 3)

程序员文章站 2022-03-02 23:23:29

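The Item class

  • Using items

    In Scrapy, items are where structured data is kept. The spider returns its parsed results as item objects, which are read and written like dictionaries.
    Below is the items.py file that Scrapy generates for us by default: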

    class ZhaopinItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass
    

    Usage is simple: you only need to subclass scrapy.Item.

  • Looking at the Field() source

    Let's open the Field() class and see what it contains:

    class Field(dict):
        """Container of field metadata"""
    

    Field() inherits from dict:

    class dict(object):
        """
        dict() -> new empty dictionary
        dict(mapping) -> new dictionary initialized from a mapping object's
            (key, value) pairs
        dict(iterable) -> new dictionary initialized as if via:
            d = {}
            for k, v in iterable:
                d[k] = v
        dict(**kwargs) -> new dictionary initialized with the name=value pairs
            in the keyword argument list.  For example:  dict(one=1, two=2)
        """
    

    As the dict docstring shows, Field() adds no behavior of its own: it is just a dictionary that holds metadata about a field.
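
To make this concrete, here is a minimal sketch that mirrors the `class Field(dict)` definition shown above with a local stand-in, so it runs without Scrapy installed. The metadata keys `serializer` and `note` and their values are assumptions chosen only for illustration.

```python
# Local stand-in that mirrors the definition shown above; not imported from Scrapy.
class Field(dict):
    """Container of field metadata"""


# Keyword arguments go straight into the underlying dict.
salary = Field(serializer=str, note="monthly salary range")

print(isinstance(salary, dict))   # True
print(salary["serializer"])       # <class 'str'>
print(dict(Field()))              # {}
```

An empty `Field()`, as used in the item classes in this article, is therefore just an empty dictionary that marks the field name as declared.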

Adding Fields to items

In zhilian.py, every value we extracted through scrapy shell needs a corresponding Field() in items.py.

  • Add the job-related item

    class ZhaopinJobItem(scrapy.Item):
        jobs_url = scrapy.Field()
        update_time = scrapy.Field()
        job_title = scrapy.Field()
        salary = scrapy.Field()
        job_area = scrapy.Field()
        experience = scrapy.Field()
        education = scrapy.Field()
        recruit_nums = scrapy.Field()
    
  • Add the company-related item

    class ZhaopinCompanyItem(scrapy.Item):
        company_url = scrapy.Field()
        company_title = scrapy.Field()
        company_scale = scrapy.Field()
        company_industry = scrapy.Field()
        recruit_info = scrapy.Field()
        invite_nums = scrapy.Field()
    

Connecting items to the spider

  • After importing ZhaopinJobItem from items.py, add the following code to parse_jobs in zhilian.py to make the connection

      def parse_jobs(self, response):
            job_item = ZhaopinJobItem()
    
            jobs_url = response.url
            update_time = response.css('.summary-plane__time::text').get()
            job_title = response.css('h3.summary-plane__title::text').get()
            salary = response.css('.summary-plane__salary::text').get()
            job_area = response.xpath('//*[@class="summary-plane__info"]/li[1]/a/text()').get()
            experience = response.xpath('//*[@class="summary-plane__info"]/li[2]/text()').get()
            education = response.xpath('//*[@class="summary-plane__info"]/li[3]/text()').get()
            recruit_nums = response.xpath('//*[@class="summary-plane__info"]/li[4]/text()').get()
    
            job_item['jobs_url'] = jobs_url
            job_item['update_time'] = update_time
            job_item['job_title'] = job_title
            job_item['salary'] = salary
            job_item['job_area'] = job_area
            job_item['experience'] = experience
            job_item['education'] = education
            job_item['recruit_nums'] = recruit_nums
        
            return job_item
    
  • The company parser follows the same pattern

      # zhilian.py also needs these at module level: import json, re, requests
      def parse_company(self, response):
            company_item = ZhaopinCompanyItem()
            company_url = response.url
            company_title = response.css('.overview__title h1::text').get()
            company_scale = response.css('.overview__detail-size span::text').get()
            company_industry = response.css('.overview__detail-industry span::text').get()
            recruit_info = response.css('.com-interview__item span::text').get()
            # re.match only matches at the start of the URL, so search the whole
            # string and take the matched text rather than the Match object.
            id_match = re.search(r'[1-9]\d+[1-9]', company_url)
            company_id = id_match.group() if id_match else ''
            invite_api = 'https://fe-api.zhaopin.com/c/i/company/interview?rootCompanyId={0}&companyId={1}'.format(
                company_id, company_id)
            headers = {
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
            # requests.get takes the keyword `headers`, not `header`.
            rs = requests.get(url=invite_api, headers=headers).text
            invite_nums = json.loads(rs)['data']['data']
            company_item['company_url'] = company_url
            company_item['company_title'] = company_title
            company_item['company_scale'] = company_scale
            company_item['company_industry'] = company_industry
            company_item['recruit_info'] = recruit_info
            company_item['invite_nums'] = invite_nums
            return company_item
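
The company parser depends on two fragile steps: pulling a numeric ID out of the URL and reading `data.data` from the JSON reply. The sketch below isolates both so they can be tried without network access. The helper names `extract_company_id` and `parse_invite_nums`, the sample URL, and the sample response body are all assumptions for illustration; only the regular expression and the `['data']['data']` path come from the code above.

```python
import json
import re
from typing import Optional

# Pattern taken from the spider code above.
COMPANY_ID_PATTERN = re.compile(r'[1-9]\d+[1-9]')


def extract_company_id(company_url: str) -> Optional[str]:
    """Return the first text matched by the pattern anywhere in the URL, or None."""
    match = COMPANY_ID_PATTERN.search(company_url)
    return match.group() if match else None


def parse_invite_nums(body: str):
    """Read data.data from a JSON body, returning None if the keys are missing."""
    payload = json.loads(body)
    data = payload.get('data') if isinstance(payload, dict) else None
    return data.get('data') if isinstance(data, dict) else None


# Hypothetical URL and response body, for illustration only.
sample_url = 'https://company.zhaopin.com/CZ12345678.htm'
sample_body = '{"data": {"data": 42}}'

print(re.match(r'[1-9]\d+[1-9]', sample_url))  # None: the URL starts with "https"
print(extract_company_id(sample_url))           # 12345678
print(parse_invite_nums(sample_body))           # 42
```

Note that the pattern requires the match to end in a non-zero digit, so an ID ending in 0 would be cut short; it is worth checking against real company URLs before relying on it.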