Scrapy from Beginner to Advanced (2): Scraping Job and Salary Information from 51job

程序员文章站 (Programmer Article Site), 2022-05-09 22:15:21


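The previous crawler only scraped question data from the CSDN forum, which was relatively simple. This article shows how to crawl the 51job website and collect the job title, salary, and job requirements for each posting.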

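Approach: 1. Fetch the seed page and extract the URL of every job posting on it, plus the URL of the next page. 2. Request each extracted job URL in turn to get that posting's details, including salary and job requirements. 3. Define a parse function that pulls the job and salary information out of the page with XPath or CSS selectors. 4. Request the next-page URL found in step 1, extract the job URLs from that page, and process them the same way. 5. Save the data to a database.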

1. Create the project

scrapy startproject wuyou

This creates a project named wuyou for crawling job postings from 51job (前程无忧).

Manually create a job_spider.py file in the spiders folder. It holds the main crawler logic.

2. Configure the project

(1) Write the items file

import scrapy


class WuyouItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    salary = scrapy.Field()
    profile = scrapy.Field()

Three fields are defined here: the job title, the salary, and the job requirements.

(2) Write the pipelines file

import sqlite3
db = sqlite3.connect("./../jobs.db")
cursor = db.cursor()

class WuyouPipeline(object):
    def process_item(self, item, spider):
        cursor.execute("insert into job(title, salary, profile) values (?,?,?)",(item["title"],item["salary"],item["profile"]))
        db.commit()
        # Return the item so that any later pipeline still receives it.
        return item

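The data is saved to a SQLite database here. It could also be saved to MySQL, and the code would look much the same. Note that the insert statement expects a job table to exist already in jobs.db.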
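The pipeline above does not show how the job table is created. As a minimal sketch, assuming three text columns that mirror the item fields, a variant could open the connection and create the table when the spider starts. The class name SqliteJobPipeline, the db_path parameter, and the column types are assumptions made for illustration; open_spider and close_spider are the hooks Scrapy calls on a pipeline at the start and end of a crawl.

```python
import sqlite3


class SqliteJobPipeline:
    """Hypothetical variant of WuyouPipeline that owns its own connection."""

    def __init__(self, db_path="jobs.db"):
        # db_path is an assumption; the article connects to "./../jobs.db".
        self.db_path = db_path
        self.db = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.db = sqlite3.connect(self.db_path)
        # Assumed schema: one text column per item field.
        self.db.execute(
            "create table if not exists job(title text, salary text, profile text)"
        )

    def process_item(self, item, spider):
        self.db.execute(
            "insert into job(title, salary, profile) values (?,?,?)",
            (item["title"], item["salary"], item["profile"]),
        )
        self.db.commit()
        # Hand the item on to any later pipeline.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.db.close()


# Quick check outside Scrapy, using an in-memory database and a plain dict.
pipeline = SqliteJobPipeline(":memory:")
pipeline.open_spider(None)
item = {"title": "Python developer", "salary": "10-15k", "profile": "Django"}
returned = pipeline.process_item(item, None)
rows = pipeline.db.execute("select title, salary, profile from job").fetchall()
pipeline.close_spider(None)
```

To use a class like this in the project, it would be registered in ITEM_PIPELINES in place of WuyouPipeline.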

(3) Configure the settings file

Enable the pipeline

ITEM_PIPELINES = {
   'wuyou.pipelines.WuyouPipeline': 300,
}

Enable the download delay

DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
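As a rough illustration of what these two settings do together: according to the Scrapy documentation, when RANDOMIZE_DOWNLOAD_DELAY is enabled the wait between requests to the same site is drawn at random from 0.5 to 1.5 times DOWNLOAD_DELAY, so roughly 1.5 to 4.5 seconds here. The helper randomized_delay below is a made-up name used only to show the arithmetic; it is not a Scrapy function.

```python
import random

DOWNLOAD_DELAY = 3  # same value as in the settings above


def randomized_delay(base_delay):
    """Illustrative helper (not part of Scrapy): pick a wait in [0.5x, 1.5x]."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# Draw many samples to see the range the delay falls into.
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(1000)]
print(round(min(samples), 2), round(max(samples), 2))
```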

(4) Write the spider file

The main crawler logic

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from wuyou.items import WuyouItem


class JobSpiderSpider(scrapy.Spider):
    name = 'job_spider'
    # allowed_domains = ['jobs.51job.com']

    def start_requests(self):
        url_str = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
        yield Request(url=url_str,callback=self.parse,meta={"page":"0"})

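    def parse(self, response):
        """
        Extract the URL of every job posting and the URL of the next page.
        :param response: response for a search results page
        :return: requests for each job detail page and, if present, the next page
        """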
        all_href = response.xpath('//div[@class="el"]/p/span[not(@class)]/a/@href').extract()
        next_page = response.xpath('//a[text()="下一页"]/@href').extract_first()

        for href in all_href:
            yield Request(url=href,callback=self.parse_one_job,meta={"page":"0"})
        if next_page:
            yield Request(url=next_page,callback=self.parse,meta={"page":"0"},dont_filter=True)


    def parse_one_job(self,response):
        """
        Extract the job title, salary, and job requirements with XPath.
        :param response: response for a job detail page
        :return: a WuyouItem
        """
        title = response.xpath('//div[@class="cn"]/h1/@title').extract()[0]
        salary = response.xpath('//div[@class="cn"]/strong/text()').extract()[0]
        job_profile = response.xpath('//div[@class="bmsg job_msg inbox"]/p/text()').extract()
        # 51job detail pages do not all share the same markup, so a second XPath
        # collects requirement text that sits in <div> children instead of <p>.
        job_profile_div = response.xpath('//div[@class="bmsg job_msg inbox"]/div/text()').extract()
        # extract() always returns a list (possibly empty), so no None check is needed.
        job_profile = job_profile_div + job_profile
        # Join the fragments, then collapse runs of whitespace into single spaces.
        content = ' '.join(''.join(job_profile).split())
        item = WuyouItem()
        item["title"] = title
        item["salary"] = salary
        item["profile"] = content
        yield item

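The spider first requests the seed URL and extracts from it the URL of each job posting and the URL of the next page. The parse_one_job function parses the details of each posting. If a next-page URL was found, the spider requests that page and parses it in the same way.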
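The text clean-up that parse_one_job applies to the job requirements can be tried on its own, without Scrapy. The sketch below assumes the two XPath queries have already returned lists of text fragments; the function name merge_profile_text and the sample strings are invented for illustration.

```python
def merge_profile_text(div_fragments, p_fragments):
    """Join text fragments from both selectors and collapse whitespace."""
    combined = "".join(div_fragments + p_fragments)
    # split() with no arguments drops leading/trailing whitespace and
    # splits on any run of spaces, tabs, or newlines.
    return " ".join(combined.split())


# Made-up fragments that imitate what extract() might return.
div_text = ["\r\n   Job duties:  ", "\t1. Write crawlers\r\n"]
p_text = ["  2. Maintain   the database "]
profile = merge_profile_text(div_text, p_text)
print(profile)
```

Because either list may be empty, the same call also covers pages where only one of the two layouts is present.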

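Note: the HTML of 51job's job detail pages is not uniform. Even with two XPath expressions for the job requirements, the details of a few postings still cannot be scraped. Next time I plan to crawl Zhaopin (智联招聘). Although it is also a recruiting site, its biggest difference from 51job is that its pages are loaded dynamically with AJAX and cannot be fetched directly, so the next article will show how to use selenium to drive a browser and crawl AJAX pages.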

Optimization (preventing data loss)

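During later crawls I found that this program loses data. The reason is that when the spider requests the next page as soon as it sees one, those requests run far ahead of the downloads of the job postings listed on each page. By the time the spider has requested page ten or so, the postings from page one have only just finished downloading. At that point Scrapy has already accepted several hundred requests for job detail pages, and with that many requests pending, data was lost.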

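Solution: query the number of records in the database. A 51job results page lists 50 postings, so once about 45 of them have been saved, the spider can request the next page. In effect the next page is requested only after the current page's data has been downloaded, which greatly reduces the chance of losing data.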
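The gating rule described above can be sketched in isolation with an in-memory SQLite database. The names count_jobs and should_request_next_page, the threshold parameter, and the table schema are assumptions made for this illustration; the loop only simulates detail pages being saved one by one.

```python
import sqlite3


def count_jobs(db):
    """Return the number of rows currently stored in the job table."""
    return db.execute("select count(*) from job").fetchone()[0]


def should_request_next_page(count, last_count, threshold=45):
    """True once more than `threshold` rows were saved since the last page turn."""
    return count - last_count > threshold


db = sqlite3.connect(":memory:")
# Assumed schema: one text column per item field.
db.execute("create table job(title text, salary text, profile text)")

last_count = 0
page_turns = []
for n in range(1, 51):  # simulate the 50 postings of one results page
    db.execute("insert into job values (?,?,?)", ("job %d" % n, "", ""))
    current = count_jobs(db)
    if should_request_next_page(current, last_count):
        page_turns.append(n)   # the next page would be requested here
        last_count = current   # remember the count at the moment of the page turn
```

In this simulation the condition becomes true exactly once per page of 50 postings, which is the behavior the optimization aims for.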

Optimized code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from wuyou.items import WuyouItem
import sqlite3
import time

db = sqlite3.connect("./../jobs.db")
cursor = db.cursor()

i = 0

def select_from_sql():
    """
    :return: the total number of rows currently in the database
    """
    # count(*) lets SQLite do the counting instead of loading every row.
    return cursor.execute("select count(*) from job").fetchone()[0]

class JobSpiderSpider(scrapy.Spider):
    name = 'job_spider'
    # allowed_domains = ['jobs.51job.com']

    def start_requests(self):
        url_str = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
        yield Request(url=url_str,callback=self.parse)

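    def parse(self, response):
        """
        Extract the URL of every job posting and the URL of the next page.
        :param response: response for a search results page
        :return: requests for each job detail page, each carrying the next-page URL in meta
        """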
        all_href = response.xpath('//div[@class="el"]/p/span[not(@class)]/a/@href').extract()
        next_page = response.xpath('//a[text()="下一页"]/@href').extract_first()

        for href in all_href:
            yield Request(url=href,callback=self.parse_one_job,meta={"next_page":next_page})


    def parse_one_job(self,response):
        """
        Extract the job title, salary, and job requirements with XPath.
        :param response: response for a job detail page
        :return: a WuyouItem and, once enough rows are saved, a request for the next page
        """
        global i
        next_page = response.meta["next_page"]
        count = select_from_sql()

        title = response.xpath('//div[@class="cn"]/h1/@title').extract()[0]
        salary = response.xpath('//div[@class="cn"]/strong/text()').extract()[0]
        job_profile = response.xpath('//div[@class="bmsg job_msg inbox"]/p/text()').extract()
        # 51job detail pages do not all share the same markup, so a second XPath
        # collects requirement text that sits in <div> children instead of <p>.
        job_profile_div = response.xpath('//div[@class="bmsg job_msg inbox"]/div/text()').extract()
        # extract() always returns a list (possibly empty), so no None check is needed.
        job_profile = job_profile_div + job_profile
        # Join the fragments, then collapse runs of whitespace into single spaces.
        content = ' '.join(''.join(job_profile).split())


        item = WuyouItem()
        item["title"] = title
        item["salary"] = salary
        item["profile"] = content

        if count - i > 45:
            if next_page:
                i = count
                yield Request(url=next_page, callback=self.parse, meta={"page": "2"}, dont_filter=True)
        yield item
