
Building a Job-Listing Crawler System with Django!


Building a Crawler System with Django

Django is an open-source web application framework written in Python, and many successful projects are built on it. It follows the MVT software design pattern: Model, View and Template. With this framework, the system can be developed quickly and efficiently.

Preface

Below we'll walk through how to set up a Django project, write crawlers for several job-listing sites, and add crawl-data analysis and data download features on top.


I. How to install Django?

Option 1: install via pip

$ python -m pip install Django

Option 2: install via easy_install (legacy):

yum install python-setuptools
easy_install django

Verify the installation:

[root@solar django]# python
Python 3.7.4 (default, May 15 2014, 14:49:08)
[GCC 4.8.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import django
>>> django.VERSION
(3, 0, 6, 'final', 0)

Installation complete!

II. Setting up the project

1. Create the project

$ django-admin startproject mysite

If django-admin doesn't work, see: https://docs.djangoproject.com/en/3.1/faq/troubleshooting/#troubleshooting-django-admin

The project layout looks like this (example):

mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py


2. Create the crawler application

Now that your environment (a "project") is set up, you can get to work.

Every application you write in Django is a Python package that follows certain conventions. Django ships with a utility that generates an app's basic directory structure automatically, so you can focus on writing code instead of creating directories.

$ python manage.py startapp polls

The generated structure (example):

polls/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py

At this point the directory structure is in place.

Change into the mysite directory and try running the development server:
python manage.py runserver 0.0.0.0:8000

The result:

(screenshot: the Django welcome page)

If you see this page, the installation was a complete success.

A small extra: switching the database to MySQL.
Just edit the settings file:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        # 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'NAME': 'movie',
        'HOST': '127.0.0.1',
        'PORT': 3306,
        'USER': 'root',
        'PASSWORD': 'root',
    }
}

Run the migrations:

python manage.py migrate

Django will then create the MySQL tables and fields the project needs automatically.
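Two things worth noting before migrating: Django's MySQL backend needs a driver installed (mysqlclient is the one the Django docs recommend), and the database named in NAME must already exist on the server. For the settings above that would be something like:

$ pip install mysqlclient
mysql> CREATE DATABASE movie CHARACTER SET utf8mb4;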

3. Set up the built-in admin site

Change the urls file inside mysite as follows.

(screenshot: the urls.py change)
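The screenshot can't be reproduced here; a mysite/urls.py consistent with the routes used in part 4 would look roughly like this. The include of the app's URLconf is my assumption based on part 4; at this step only the admin route matters:

from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('reptile.urls')),  # assumption: the URLconf shown in part 4
]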

Create a superuser:

$ python manage.py createsuperuser

You'll be prompted for a username first:

Username: admin

The final step is entering the password. You'll be asked to type it twice; the second entry confirms the first.

Password: **********
Password (again): *********
Superuser created successfully.

Start it up: change into the mysite directory and run

$ python manage.py runserver

4. Open a browser and go to "/admin/" on your local domain, e.g. "http://127.0.0.1:8000/admin/". You should see the admin login page:

(screenshot: the admin login page)


Next, let's install django-simpleui to give the admin a fresher look:

pip install django-simpleui

(screenshot: installing django-simpleui)
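Per django-simpleui's documentation, activation is just a settings change: list 'simpleui' in INSTALLED_APPS ahead of django.contrib.admin:

# mysite/settings.py
INSTALLED_APPS = [
    'simpleui',
    'django.contrib.admin',
    # ... the remaining default apps ...
]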


With that in place, refresh the page and a brand-new admin UI appears, as shown below:

(screenshot: the new simpleui admin interface)

At this point the project skeleton is complete.

4. Build the custom application modules

1. Run the command:

python manage.py startapp polls

You should now see the polls module generated in the project directory.
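Before defining models, register the new app so migrations can see it; this is a standard Django step the post skips:

# mysite/settings.py
INSTALLED_APPS = [
    # ... the default django.contrib apps ...
    'polls',
]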

2. Define the table models in models.py:

from django.db import models


class Test(models.Model):
    name = models.CharField(max_length=20)


class Category(models.Model):
    name = models.CharField(u'岗位', max_length=50)
    add_time = models.CharField(u'添加时间', max_length=50)

    class Meta:
        verbose_name = '职位分类'
        verbose_name_plural = verbose_name

    def __str__(self):
        return self.name


class Record(models.Model):
    record_name = models.CharField(u'记录名称', max_length=50)
    date = models.CharField(u'记录日期', max_length=50)
    recruit_type = models.CharField(u'记录类型', max_length=50)

    class Meta:
        verbose_name = '爬取记录'
        verbose_name_plural = verbose_name

    def __str__(self):
        return self.date


class Liepin(models.Model):
    work = models.CharField(u'岗位', max_length=50)
    edu = models.CharField(u'教育背景', max_length=50)
    district = models.CharField(u'地区', max_length=50)
    compensation = models.CharField(u'薪酬', max_length=50)
    company = models.CharField(u'公司', max_length=50)
    year = models.CharField(u'工作年限', max_length=50)
    create_time = models.CharField(u'创建时间', max_length=50)
    work_type = models.ForeignKey(Category, on_delete=models.CASCADE, verbose_name='分类')
    record = models.ForeignKey(Record, on_delete=models.CASCADE, verbose_name='记录')
    salary = models.CharField(u'收入后一位', max_length=10)

    class Meta:
        verbose_name = '猎聘数据'
        verbose_name_plural = verbose_name


class Qiancheng(models.Model):
    work = models.CharField(u'岗位', max_length=50)
    edu = models.CharField(u'教育背景', max_length=50)
    district = models.CharField(u'地区', max_length=50)
    compensation = models.CharField(u'薪酬', max_length=50)
    company = models.CharField(u'公司', max_length=50)
    year = models.CharField(u'工作年限', max_length=50)
    create_time = models.CharField(u'创建时间', max_length=50)
    work_type = models.ForeignKey(Category, on_delete=models.CASCADE)
    record = models.ForeignKey(Record, on_delete=models.CASCADE, verbose_name='记录')
    salary = models.CharField(u'收入后一位', max_length=10)

    class Meta:
        verbose_name = '前程数据'
        verbose_name_plural = verbose_name


class Lagou(models.Model):
    work = models.CharField(u'岗位', max_length=50)
    edu = models.CharField(u'教育背景', max_length=50)
    district = models.CharField(u'地区', max_length=50)
    compensation = models.CharField(u'薪酬', max_length=50)
    company = models.CharField(u'公司', max_length=50)
    year = models.CharField(u'工作年限', max_length=50)
    create_time = models.CharField(u'创建时间', max_length=50)
    work_type = models.ForeignKey(Category, on_delete=models.CASCADE)
    record = models.ForeignKey(Record, on_delete=models.CASCADE, verbose_name='记录')
    salary = models.CharField(u'收入后一位', max_length=10)

    class Meta:
        verbose_name = '拉勾数据'
        verbose_name_plural = verbose_name


class Data(models.Model):
    # Crawl-state bookkeeping: one row per job-site category
    id = models.CharField(u'实例ID', max_length=32, blank=False, primary_key=True)
    count = models.CharField(u'次数', max_length=50)
    work_name = models.CharField(u'工作名称', max_length=50)
    category_id = models.CharField(u'分类id', max_length=50)
    status = models.CharField(u'状态', max_length=20)

    class Meta:
        verbose_name = '临时存储数据'
        verbose_name_plural = verbose_name

admin.py

from django.contrib import admin
from django.core.paginator import Paginator

from polls.models import Category, Lagou, Liepin, Qiancheng, Record


class LiepinAdmin(admin.ModelAdmin):
    list_display = ('work', 'edu', 'district', 'company', 'compensation', 'year', 'work_type', 'record')
    search_fields = ('work',)
    # pagination: rows per page
    list_per_page = 10
    paginator = Paginator


class LagouAdmin(admin.ModelAdmin):
    list_display = ('work', 'edu', 'district', 'company', 'compensation', 'year', 'work_type', 'record')
    search_fields = ('work',)
    list_per_page = 10
    paginator = Paginator


class QianchengAdmin(admin.ModelAdmin):
    list_display = ('work', 'edu', 'district', 'company', 'compensation', 'year', 'work_type', 'record')
    search_fields = ('work',)
    list_per_page = 10
    paginator = Paginator


class CategoryAdmin(admin.ModelAdmin):
    list_display = ('name', 'add_time')
    list_per_page = 10
    paginator = Paginator


class RecordAdmin(admin.ModelAdmin):
    list_display = ('date', 'recruit_type')
    list_per_page = 10
    paginator = Paginator


admin.site.register(Liepin, LiepinAdmin)
admin.site.register(Lagou, LagouAdmin)
admin.site.register(Qiancheng, QianchengAdmin)
admin.site.register(Record, RecordAdmin)
admin.site.register(Category, CategoryAdmin)

Run the migrations (note: makemigrations has to come first so Django generates migration files for the new models):

python manage.py makemigrations polls
python manage.py migrate

The database tables are then created automatically.

Open the admin and you'll see a management module for each of these tables.
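One thing the crawler views below rely on but the post never mentions: they look up a Data row per site category (category_id 1 = Liepin, 2 = Lagou, 3 = 51job) and index straight into it, so those rows must exist before the first crawl. A quick way to seed them from the Django shell (the initial values here are my assumption; status='1' marks a category as idle, matching how the views reset it):

$ python manage.py shell
>>> from polls.models import Data
>>> for cid in ('1', '2', '3'):
...     Data.objects.create(id=cid, count='0', work_name='', category_id=cid, status='1')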

3. Create one more app to hold the business logic.

Run the command:

python manage.py startapp reptile

The module is now in place.

5. Write the crawl scripts for the job sites

1. Liepin (猎聘网)

Create a .py file inside reptile:

from django.http import HttpResponse
from bs4 import BeautifulSoup
import operator
import random
import requests
import sys
import time
from urllib.parse import quote

from polls import models
from polls.models import Record


def grad_action(request):
    work_name = request.GET.get('work_name')
    type = request.GET.get('type')
    record_name = request.GET.get('record_name')

    # Refuse to start if a task for this category is already running
    # (status is stored in a CharField, so compare the int value)
    status = models.Data.objects.filter(category_id=type)
    if int(status[0].status) == 0:
        return HttpResponse(-1)
    models.Data.objects.filter(category_id=type).update(status=0)

    # Record this crawl run
    record = Record(record_name=record_name, date=str(int(time.time())), recruit_type=type)
    record.save()
    record_id = record.id

    # Look the job title up in Category; create it if it doesn't exist yet
    cate_id = models.Category.objects.filter(name=work_name)
    if not cate_id:
        cate = models.Category(name=work_name, add_time=int(time.time()))
        cate.save()
        cate_id = cate.id
    else:
        cate_id = cate_id[0].id

    if int(type) == 1:
        result = liepin_action(0, 0, work_name, cate_id, record_id)
        return HttpResponse(result)


# Crawl liepin.com, one result page per call
def liepin_action(i, sleep_count, work_name, cate_id, record_id):
    link = "https://www.liepin.com/zhaopin/?industries=040&subIndustry=&dqs=050020&salary=&jobKind=&pubTime=&compkind=&compscale=&searchType=1&isAnalysis=&sortFlag=15&d_headId=aaa42964a7680110daf82f6e378267d9&d_ckId=ff5c36a41d1d524cff2692be11bbe61f&d_sfrom=search_prime&d_pageSize=40&siTag=_1WzlG2kKhjWAm3Yf9qrog%7EqdZCMSZU_dxu38HB-h7GFA&key=" + quote(
        work_name) + "&curPage=" + str(i)
    # Rotate User-Agents so the requests look less uniform
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
    ]
    headers = {"User-Agent": random.choice(user_agent_list)}

    try:
        response = requests.get(link, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')
        sojob_result = soup.find("div", class_='sojob-result')
        list_r = sojob_result.find_all("li")
    except BaseException:
        # Back off for 5 minutes at a time, up to 10 attempts (~45 minutes)
        if sleep_count > 9:
            print("Still unable to reach the site after 45 minutes of retries; please try again later or get help")
            print("Sorry, the program is stopping")
            models.Data.objects.filter(category_id=1).update(status=1)
            return 0
        print("Crawl failed; the site may be asking for verification or the network is poor. Sleeping 5 minutes before retrying")
        sleep_count = sleep_count + 1
        sys.stdout.flush()
        time.sleep(300)
        return liepin_action(i, sleep_count, work_name, cate_id, record_id)

    if len(list_r) == 0:
        print("Congratulations, this crawl task is complete")
        models.Data.objects.filter(category_id=1).update(status=1)
        return 1

    sleep_count = 0
    in_data = []

    for x in range(0, len(list_r)):
        try:
            address = list_r[x].find("a", class_='area').get_text().strip()
        except BaseException:
            address = ''
        work = list_r[x].find("a").get_text().strip()
        edu = list_r[x].find("span", class_='edu').get_text().strip()
        year = list_r[x].find("span", class_='edu').find_next_sibling("span").get_text().strip()
        money = list_r[x].find("span", class_='text-warning').get_text().strip()
        company = list_r[x].find("p", class_='company-name').get_text().strip()
        data = {'work': work, 'edu': edu, 'compensation': money, 'company': company, 'year': year, 'district': address}

        # If the site keeps serving the same row, stop after 12 repeats
        work_data = models.Data.objects.filter(category_id=1)
        in_data = str(data)
        out_data = work_data[0].work_name
        if operator.eq(in_data, out_data):
            count = int(work_data[0].count)
            if count > 12:
                print("Congratulations, this crawl task is complete")
                models.Data.objects.filter(category_id=1).update(status=1)
                return 1

        # Keep the number from the upper bound of the salary range; '面议' means negotiable
        if money != '面议':
            try:
                salary = money.split('-')[1][-5:]
                salary_money = money.split('-')[1].replace(salary, '')
            except BaseException:
                salary_money = 0
        else:
            salary_money = 0

        # Write the row to the database
        liepin_data = models.Liepin(work=work, create_time=int(time.time()), edu=edu, compensation=money,
                                    record_id=record_id, work_type_id=cate_id, company=company, year=year,
                                    district=address, salary=salary_money)
        liepin_data.save()
        print(data)

    models.Data.objects.filter(category_id=1).update(work_name=str(in_data))
    models.Data.objects.filter(category_id=1).update(count=str(i))

    sys.stdout.flush()
    time.sleep(random.randint(7, 16))
    return liepin_action(i + 1, sleep_count, work_name, cate_id, record_id)
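With the route from part 4 in place (the path("grad_action/", ...) entry shown later), a Liepin crawl can be triggered with a plain GET request; the parameter values below are only an example:

$ curl "http://127.0.0.1:8000/grad_action/?work_name=python&type=1&record_name=first-run"

type=1 selects the Liepin branch of grad_action; the view returns 1 on completion, 0 on failure, and -1 if a crawl for that category is already running.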

2. Lagou (拉勾网)

from django.http import HttpResponse
import json
import random
import ssl
import sys
import time
from urllib import parse
# alias so it doesn't shadow the Django view's `request` argument
from urllib import request as urlrequest

from polls import models
from polls.models import Record


def grad_action(request):
    work_name = request.GET.get('work_name')
    type = request.GET.get('type')
    record_name = request.GET.get('record_name')

    # Refuse to start if a task for this category is already running
    status = models.Data.objects.filter(category_id=type)
    if int(status[0].status) == 0:
        return HttpResponse(-1)
    models.Data.objects.filter(category_id=type).update(status=0)

    # Record this crawl run
    record = Record(record_name=record_name, date=str(int(time.time())), recruit_type=type)
    record.save()
    record_id = record.id

    # Look the job title up in Category; create it if it doesn't exist yet
    cate_id = models.Category.objects.filter(name=work_name)
    if not cate_id:
        cate = models.Category(name=work_name, add_time=int(time.time()))
        cate.save()
        cate_id = cate.id
    else:
        cate_id = cate_id[0].id

    if int(type) == 2:
        result = lagou_action(0, work_name, cate_id, record_id)
        return HttpResponse(result)


# Crawl lagou.com, one JSON page per call
def lagou_action(i, work_name, cate_id, record_id):
    try:
        # Skip HTTPS certificate verification globally
        ssl._create_default_https_context = ssl._create_unverified_context
        # First fetch a list page to harvest the cookies Lagou requires
        url = 'https://www.lagou.com/jobs/list_%E6%9E%B6%E6%9E%84%E5%B8%88?city=%E5%B9%BF%E5%B7%9E&labelWords=&fromSearch=true&suginput='
        req = urlrequest.Request(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        })
        response = urlrequest.urlopen(req)
        # Collect every Set-Cookie header into one Cookie string
        cookie = ''
        for header in response.getheaders():
            if header[0] == 'Set-Cookie':
                cookie = cookie + header[1].split(';')[0] + '; '
        cookie = cookie[:-1]  # drop the trailing space

        # Now hit the JSON endpoint that actually serves the positions,
        # replaying the cookies we just collected
        url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
            'Cookie': cookie,
            'Referer': 'https://www.lagou.com/jobs/list_%E6%9E%B6%E6%9E%84%E5%B8%88?city=%E5%B9%BF%E5%B7%9E&labelWords=&fromSearch=true&suginput='
        }
        data = {
            'first': 'true',
            'pn': i,
            'kd': work_name
        }
        req = urlrequest.Request(url, data=parse.urlencode(data).encode('utf-8'), headers=headers, method='POST')
        response = urlrequest.urlopen(req)
        result = json.loads(response.read().decode('utf-8'))
    except IOError:
        models.Data.objects.filter(category_id=2).update(status=1)
        return 0

    if result['content']['positionResult']['resultSize'] == 0:
        models.Data.objects.filter(category_id=2).update(status=1)
        return 1

    try:
        for x in range(0, result['content']['positionResult']['resultSize']):
            position = result['content']['positionResult']['result'][x]
            district = position['city']
            work = position['positionName']
            edu = position['education']
            year = position['workYear']
            money = position['salary']
            company = position['companyFullName']
            data = [work, edu, money, company]

            # Only keep positions in Guangzhou or Shenzhen
            if district == "广州" or district == "深圳":
                # Keep the upper bound of the salary range, e.g. "15k-25k" -> "25"
                try:
                    salary_money = money.split('-')[1].replace('k', '')
                except BaseException:
                    salary_money = 0
                # Write the row to the database
                lagou_data = models.Lagou(work=work, create_time=int(time.time()), edu=edu, compensation=money,
                                          record_id=record_id, work_type_id=cate_id, company=company, year=year,
                                          district=district, salary=salary_money)
                lagou_data.save()
            print(data)
    except IOError:
        models.Data.objects.filter(category_id=2).update(status=1)
        return 0

    sys.stdout.flush()
    time.sleep(random.randint(15, 40))
    return lagou_action(i + 1, work_name, cate_id, record_id)
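The cookie bootstrap above (load a list page, collect every Set-Cookie header, replay them against the JSON endpoint) is exactly what requests.Session automates. A minimal equivalent sketch, reusing the same Lagou URLs as the code above:

import requests

def fetch_lagou_page(keyword, page):
    # The session stores whatever cookies the list page sets and
    # sends them back on the positionAjax request for us.
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    list_url = 'https://www.lagou.com/jobs/list_%E6%9E%B6%E6%9E%84%E5%B8%88?city=%E5%B9%BF%E5%B7%9E&labelWords=&fromSearch=true&suginput='
    session.get(list_url)  # harvest the cookies
    ajax_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    resp = session.post(ajax_url,
                        data={'first': 'true', 'pn': page, 'kd': keyword},
                        headers={'Referer': list_url})
    return resp.json()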

3. 51job (前程无忧网)

from django.http import HttpResponse
from bs4 import BeautifulSoup
import json
import operator
import random
import requests
import sys
import time
from urllib.parse import quote

from polls import models
from polls.models import Record


def grad_action(request):
    work_name = request.GET.get('work_name')
    type = request.GET.get('type')
    record_name = request.GET.get('record_name')

    # Refuse to start if a task for this category is already running
    status = models.Data.objects.filter(category_id=type)
    if int(status[0].status) == 0:
        return HttpResponse(-1)
    models.Data.objects.filter(category_id=type).update(status=0)

    # Record this crawl run
    record = Record(record_name=record_name, date=str(int(time.time())), recruit_type=type)
    record.save()
    record_id = record.id

    # Look the job title up in Category; create it if it doesn't exist yet
    cate_id = models.Category.objects.filter(name=work_name)
    if not cate_id:
        cate = models.Category(name=work_name, add_time=int(time.time()))
        cate.save()
        cate_id = cate.id
    else:
        cate_id = cate_id[0].id

    if int(type) == 3:
        result = qiancheng_action(1, 0, work_name, cate_id, record_id)
        return HttpResponse(result)


# Crawl 51job, one result page per call
def qiancheng_action(i, sleep_count, work_name, cate_id, record_id):
    try:
        link = "https://search.51job.com/list/030200,000000,0000,00,9,99," + quote(work_name) + ",2," + str(i) + ".html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
        response = requests.get(link, headers=headers)
        response.encoding = response.apparent_encoding
        soup = BeautifulSoup(response.text, 'html.parser')
        in_data = []
        count = 0
        # 51job embeds the result list as JSON inside a <script> tag
        sojob_result = soup.find_all("script", type='text/javascript')
    except BaseException:
        # Back off for 5 minutes at a time, up to 10 attempts (~45 minutes)
        if sleep_count > 9:
            print("Still unable to reach the site after 45 minutes of retries; please try again later or get help")
            print("Sorry, the program is stopping")
            models.Data.objects.filter(category_id=3).update(status=1)
            return 0
        print("Crawl failed; the site may be asking for verification or the network is poor. Sleeping 5 minutes before retrying")
        sleep_count = sleep_count + 1
        sys.stdout.flush()
        time.sleep(300)
        return qiancheng_action(i, sleep_count, work_name, cate_id, record_id)

    try:
        # Strip the JS wrapper and parse the embedded JSON
        a = str(sojob_result[2])
        json_str = json.loads(a[60:-9], strict=False)
        job_list = json_str['engine_search_result']
    except BaseException:
        sys.stdout.flush()
        time.sleep(3)
        return qiancheng_action(i + 1, sleep_count, work_name, cate_id, record_id)

    if len(job_list) == 0:
        print("Congratulations, this crawl task is complete")
        models.Data.objects.filter(category_id=3).update(status=1)
        return 1

    try:
        for x in range(1, len(job_list)):
            work = job_list[x]['job_name']
            company = job_list[x]['company_name']
            address = job_list[x]['workarea_text']
            money = job_list[x]['providesalary_text']
            attribute_text = job_list[x]['attribute_text']
            public_time = job_list[x]['issuedate']
            data = [work, company, address, money, attribute_text, public_time]
            print(data)

            # Work experience, e.g. "3-4年经验"; fall back to '不限' (no requirement)
            year = attribute_text[1] if "经验" in attribute_text[1] else '不限'

            # Education: pick it out of the attribute list, default '未知' (unknown)
            edu = '未知'
            for a in attribute_text:
                if a in ('大专', '本科', '中专', '高中', '硕士'):
                    edu = a

            # Salary upper bound; the unit can be 万/月, 万/年 or 千/月
            if money != '':
                try:
                    salary = money.split('-')[1][-3:]
                    if salary == '万/月':
                        salary_money = money.split('-')[1].replace('万/月', '')
                    elif salary == '万/年':
                        salary_money = money.split('-')[1].replace('万/年', '')
                    else:
                        salary_money = money.split('-')[1].replace('千/月', '')
                except BaseException:
                    salary_money = 0
            else:
                salary_money = 0

            # Write the row to the database
            qiancheng = models.Qiancheng(work=work, create_time=int(time.time()), edu=edu, compensation=money,
                                         record_id=record_id, work_type_id=cate_id, company=company, year=year,
                                         district=address, salary=salary_money)
            qiancheng.save()

            # Repeat detection, same idea as in the Liepin crawler
            in_data = str(data)
            work_data = models.Data.objects.filter(category_id=3)
            out_data = work_data[0].work_name
            if operator.eq(in_data, out_data):
                count = int(work_data[0].count)
    except BaseException:
        sys.stdout.flush()
        time.sleep(random.randint(3, 7))
        return qiancheng_action(i + 1, sleep_count, work_name, cate_id, record_id)

    sys.stdout.flush()
    time.sleep(random.randint(3, 7))
    if count > 12:
        print("Congratulations, this crawl task is complete")
        models.Data.objects.filter(category_id=3).update(status=1)
        return 1
    sleep_count = 0

    models.Data.objects.filter(category_id=3).update(work_name=str(in_data))
    models.Data.objects.filter(category_id=3).update(count=str(i))

    return qiancheng_action(i + 1, sleep_count, work_name, cate_id, record_id)
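One caveat: the salary column keeps whatever unit the range was quoted in (万/月, 万/年 or 千/月), so values from different rows aren't directly comparable. If you need comparable numbers, a small normalizer along these lines helps. This is a sketch I'm adding, not part of the original post; it converts everything to thousand RMB per month:

def normalize_salary(money):
    """Upper bound of a 51job salary string in thousand RMB/month, or 0."""
    try:
        upper = money.split('-')[1]
        if upper.endswith('万/月'):          # 10k-RMB per month
            return float(upper.replace('万/月', '')) * 10
        if upper.endswith('万/年'):          # 10k-RMB per year
            return float(upper.replace('万/年', '')) * 10 / 12
        if upper.endswith('千/月'):          # 1k-RMB per month
            return float(upper.replace('千/月', ''))
    except (IndexError, ValueError):
        pass
    return 0

print(normalize_salary('1-1.5万/月'))  # 15.0
print(normalize_salary('10-20万/年'))  # ~16.67
print(normalize_salary('6-8千/月'))    # 8.0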

4. Wire up the corresponding routes and everything becomes reachable.

(screenshot: the URL configuration file)

"""mysite URL Configuration

The `urlpatterns` list routes URLs to views. For more information please see:
    https://docs.djangoproject.com/en/3.0/topics/http/urls/
Examples:
Function views
    1. Add an import:  from my_app import views
    2. Add a URL to urlpatterns:  path('', views.home, name='home')
Class-based views
    1. Add an import:  from other_app.views import Home
    2. Add a URL to urlpatterns:  path('', Home.as_view(), name='home')
Including another URLconf
    1. Import the include() function: from django.urls import include, path
    2. Add a URL to urlpatterns:  path('blog/', include('blog.urls'))
"""

from django.urls import path
from . import views, recruit_view, grad_view, grad_action, lagou_action, qiancheng_action, download_action, mate_action, grad_all

# Every route originally shared name='index'; unique names keep reverse() usable.
urlpatterns = [
    path("index/", views.index, name='index'),
    path("recruit_view/<int:type_id>", recruit_view.recruit_record, name='recruit_record'),
    path("recruit_view/recruit_index/<int:type_id>/<int:id>", recruit_view.recruit_index, name='recruit_index'),
    path("download_action/<int:type_id>/<int:record_id>", download_action.download_action, name='download_action'),
    path("grad_view/<int:type_id>", grad_view.grad_index, name='grad_view'),
    path("grad_action/", grad_action.grad_action, name='grad_action'),
    path("lagou_action/", lagou_action.grad_action, name='lagou_action'),
    path("qiancheng_action/", qiancheng_action.grad_action, name='qiancheng_action'),
    path("mate_action/<int:type_id>/<int:record_id>", mate_action.mate_action, name='mate_action'),
    path("grad_all/", grad_all.grad_all_view, name='grad_all'),
]
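The post doesn't show download_action, mate_action or the view modules. As an illustration only, a CSV export view compatible with the download_action route above could look roughly like this; the field list matches the Liepin model, and the whole function is a hypothetical stand-in for the author's code:

import csv

from django.http import HttpResponse

from polls.models import Liepin


def download_action(request, type_id, record_id):
    # Stream the rows crawled in one run as a CSV attachment
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="record_%s.csv"' % record_id
    writer = csv.writer(response)
    writer.writerow(['work', 'edu', 'district', 'company', 'compensation', 'year'])
    for row in Liepin.objects.filter(record_id=record_id):
        writer.writerow([row.work, row.edu, row.district, row.company, row.compensation, row.year])
    return response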


Done.
There are also a few front-end pages, which I won't go into here.

III. GitHub

https://github.com/fengyuan1/django_manage.git


IV. Summary

This system makes it easy to crawl and analyze job data from the major recruitment sites.


Original post: https://blog.csdn.net/pythonlaodi/article/details/109619746