
Scraping Shanghai Python Job Listings from Lagou and Storing Them in MongoDB

程序员文章站 2022-05-09 17:37:34

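Yesterday I set out to scrape Python job data from Lagou. My usual bs4 + requests approach came back with no data, which was pretty discouraging! After some searching online I understood why: Lagou loads its listings with Ajax, so searching the HTML with bs4 finds nothing. Today I'm writing up what I learned, which also helps me consolidate it.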

Analyzing the page

Log in to Lagou and open the browser's developer tools.

First, let's send a request with requests and save the returned HTML so we can inspect it.

import requests
import random 

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2995.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2986.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.0 Safari/537.36'
]

headers = {
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agents)
}

url = 'https://www.lagou.com/jobs/list_Python?px=default&city=%E4%B8%8A%E6%B5%B7#filterBox'
r = requests.get(url, headers=headers)
result = r.text
# print(r.text)
# Write the response to lagou.html
with open('lagou.html', 'w', encoding='utf-8') as f:
    f.write(result)

Run the code and open lagou.html: the job listing data is not in it.


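Next, look at the Network tab in Chrome's developer tools and filter by type XHR. Find the positionAjax.json request (its name contains the telltale keywords Ajax and json), then click Preview.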


Expand the nodes in order (content, then positionResult, then result) and there is the data we want.
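The nesting seen in the Preview pane can be walked in code as well. Below is a minimal sketch, assuming the response body has already been decoded into a dict; the helper names `extract_positions` and `summarize`, and the field names `positionName`, `salary`, and `companyFullName`, are illustrative assumptions rather than anything confirmed by the original post.

```python
def extract_positions(payload):
    """Return the list of job dicts from a decoded response body."""
    # Walk content -> positionResult -> result, tolerating missing keys
    content = payload.get('content') or {}
    position_result = content.get('positionResult') or {}
    return position_result.get('result') or []


def summarize(position, fields=('positionName', 'salary', 'companyFullName')):
    """Keep only the listed fields; the default field names are assumptions."""
    return {name: position.get(name) for name in fields}


if __name__ == '__main__':
    # Hand-written sample shaped like the response, not real data
    sample = {
        'content': {
            'positionResult': {
                'result': [
                    {'positionName': 'Python Developer', 'salary': '15k-25k'},
                ]
            }
        }
    }
    for job in extract_positions(sample):
        print(summarize(job))
```

Using `.get` instead of direct indexing means an unexpected response (for example an error message in place of the job list) yields an empty list rather than a `KeyError`.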

Now let's try writing the request. Note that it is a POST: send the form data along with it and adjust the request headers.

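# The Ajax endpoint seen in the Network tab; the list page URL used earlier returns HTML, not JSON
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'

# Form data sent with the POST request
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}

r = requests.post(url, headers=headers, data=data).json()
# The job list sits under content -> positionResult -> result
positions = r['content']['positionResult']['result']
print(positions)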

Run it, and the response contains exactly the data we wanted!


Paging

The form contains a pn parameter, which is the page number. Try jumping between pages on the site and watch how the data changes.


import time

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'

for i in range(1, 17):
    # pn carries the page number, 1 through 16
    data = {
        'first': 'true',
        'pn': i,
        'kd': 'Python'
    }

    r = requests.post(url, headers=headers, data=data)
    # Pause between requests so we do not hit the site too quickly
    time.sleep(3)
    print(r.url)

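This requests all 16 pages and prints the URL of each request (the URL itself stays the same, since the page number travels in the form data).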
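To go from printing URLs to collecting the jobs themselves, the loop can gather each page's result list. The following is a rough sketch, not the author's code: `crawl_pages` and the `fetch_page` callable are hypothetical names, and stopping at the first empty page is an assumption about how the endpoint behaves past the last page.

```python
import time


def crawl_pages(fetch_page, first_page=1, last_page=16, delay=3.0):
    """Call fetch_page(pn) for each page and return all jobs in one list.

    fetch_page is a hypothetical callable that returns the decoded JSON
    for page number pn, for example:
        lambda pn: requests.post(url, headers=headers,
                                 data={'first': 'true', 'pn': pn, 'kd': 'Python'}).json()
    """
    jobs = []
    for pn in range(first_page, last_page + 1):
        payload = fetch_page(pn)
        result = payload['content']['positionResult']['result']
        if not result:
            # Assumption: an empty page means there is nothing further to read
            break
        jobs.extend(result)
        if pn < last_page:
            time.sleep(delay)  # pause between requests
    return jobs
```

Passing the fetching step in as a function keeps the paging logic separate from the network call, so it can be tried out with canned data before pointing it at the real site.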


That's the overall approach to scraping Lagou. The complete code is on GitHub; feel free to take a look, and give it a star if you find it useful!
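The title mentions storing the results in MongoDB, a step the walkthrough above does not show. Here is a minimal sketch of how it might look with pymongo's `update_one` and `upsert=True`. The function name `save_positions`, the database and collection names in the usage comment, and the `positionId` key are all assumptions made for illustration, not taken from the author's repository.

```python
def save_positions(collection, positions, key='positionId'):
    """Upsert each job into a MongoDB collection; return how many were written.

    `key` is assumed to identify a job uniquely; 'positionId' is a guess
    at the field name and may need adjusting to the real response.
    """
    written = 0
    for position in positions:
        if key not in position:
            continue  # skip records without the assumed identifier
        collection.update_one(
            {key: position[key]},  # match an existing document for this job
            {'$set': position},    # overwrite its fields with the fresh data
            upsert=True,           # insert when no document matches
        )
        written += 1
    return written


# Possible usage, assuming a local MongoDB server and hypothetical names:
#   from pymongo import MongoClient
#   client = MongoClient('localhost', 27017)
#   save_positions(client['lagou']['python_shanghai'], positions)
```

Upserting on an identifier, instead of calling `insert_many`, means re-running the scraper refreshes existing records rather than piling up duplicates.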

Finally, here is a screenshot of the scraped data.
[Image unavailable: screenshot of the scraped data]

Feel free to visit Treehl's blog
GitHub
Jianshu
Scraper collection