Scraping Shanghai Python Job Listings from Lagou and Storing Them in MongoDB
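Yesterday I set out to scrape Python job data from Lagou. My usual approach, bs4 plus requests, came back with empty data, which was a real downer! After some searching online I understood why: Lagou loads its listings with Ajax, so searching the HTML elements with bs4 finds nothing. Today I'm writing up what I learned, which also helps me consolidate it!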
Analyzing the page
Log in to the Lagou site and open the browser's developer tools.
First, send a request with requests and save the HTML so we can inspect the data.
import requests
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2995.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2986.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.0 Safari/537.36'
]
headers = {
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agents)
}
url = 'https://www.lagou.com/jobs/list_Python?px=default&city=%E4%B8%8A%E6%B5%B7#filterBox'
r = requests.get(url, headers=headers)
result = r.text
# print(r.text)
# Write the response to lagou.html
with open('lagou.html', 'w', encoding='utf-8') as f:
    f.write(result)
Run the code and open lagou.html: the job listing data is not there.
Next, look at the Network tab in Chrome DevTools and filter by type XHR. Find the positionAjax.json request (its URL contains the keywords Ajax and json) and click Preview.
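Expand the nodes marked in red, in order (content, then positionResult, then result, the same keys the code below indexes), and there is the data we want.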
Now let's try writing the request. Note that this one is a POST: send the form data along with it and adjust the request headers.
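# Post to the Ajax endpoint seen in the XHR panel, not to the HTML page URL used earlier
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
# Decode the JSON body and walk down to the list of job entries
r = requests.post(url, headers=headers, data=data).json()
positions = r['content']['positionResult']['result']
print(positions)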
Run it, and the returned data is exactly what we want!
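The nested lookup above raises a KeyError if the response does not have the expected shape, for example if the request is rejected. A minimal sketch of a more defensive extraction follows. The helper name extract_positions and the field names positionName, companyFullName and salary are assumptions for illustration, not taken from the article; check them against the Preview panel before relying on them.

```python
def extract_positions(payload, fields=("positionName", "companyFullName", "salary")):
    """Return a list of dicts holding only `fields` from each job entry.

    `payload` is the decoded JSON body. The field names are assumed,
    not taken from the article; adjust them to what the response contains.
    """
    # Walk content -> positionResult -> result, tolerating missing keys.
    node = payload
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    if not isinstance(node, list):
        return []
    return [{f: job.get(f) for f in fields} for job in node if isinstance(job, dict)]


# Hand-written sample in the same shape as the response, not real data.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {
                    "positionName": "Python Developer",
                    "companyFullName": "Example Co.",
                    "salary": "15k-25k",
                    "extra": 1,
                },
            ]
        }
    }
}
rows = extract_positions(sample)
print(rows)
# An error-shaped payload yields an empty list instead of raising.
print(extract_positions({"success": False}))
```

In the script above, this would replace the direct indexing: positions = extract_positions(r).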
Pagination
Notice that the form has a pn parameter: this is the page number. Jump between pages on the site and watch how the data changes.
import time

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'
for i in range(1, 17):
    # pn carries the page number
    data = {
        'first': 'true',
        'pn': i,
        'kd': 'Python'
    }
    r = requests.post(url, headers=headers, data=data)
    time.sleep(3)  # pause between requests
    print(r.url)
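This prints the request URL for each of the 16 pages. The page number travels in the POST form data, so the URL itself does not change from page to page.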
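The loop above only prints URLs. One possible way to actually gather the jobs from every page is sketched below. The names build_form and crawl_pages are hypothetical, and the fetch callable is injected so the paging logic can be tried without touching the network; in real use it would wrap the requests.post call shown above.

```python
import time


def build_form(page, keyword="Python"):
    """Form data for one page; mirrors the dict used in the article."""
    return {"first": "true", "pn": page, "kd": keyword}


def crawl_pages(fetch, first_page=1, last_page=16, delay=3.0):
    """Call fetch(form) for each page and collect the job entries.

    `fetch` is any callable that takes the form dict and returns the decoded
    JSON body. Stops early when a page comes back without results.
    """
    jobs = []
    for page in range(first_page, last_page + 1):
        body = fetch(build_form(page))
        try:
            result = body["content"]["positionResult"]["result"]
        except (KeyError, TypeError):
            break  # unexpected payload: stop rather than guess
        if not result:
            break  # empty page: nothing more to read
        jobs.extend(result)
        if page < last_page:
            time.sleep(delay)  # be polite between requests
    return jobs


# Stand-in for the network call: two pages of fake data, then an empty page.
def fake_fetch(form):
    page = form["pn"]
    result = [{"positionId": page * 10 + n} for n in range(2)] if page <= 2 else []
    return {"content": {"positionResult": {"result": result}}}


all_jobs = crawl_pages(fake_fetch, delay=0)
print(len(all_jobs))
```

With requests, fetch could be something like lambda form: requests.post(url, headers=headers, data=form).json(), reusing the url and headers defined earlier.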
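That's the idea behind scraping Lagou. The complete code is on GitHub, so feel free to take a look, and if you find it useful, give it a star! Let's keep each other motivated!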
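The title mentions MongoDB, so here is a hedged sketch of what the storage step might look like with pymongo. The database name lagou, the collection name python_jobs, the local connection string and the use of positionId as a unique key are all assumptions, not taken from the article; the author's actual code is in the GitHub repository.

```python
def save_positions(collection, positions, key="positionId"):
    """Upsert each job into `collection`, keyed on `key`; return the count written.

    `collection` is expected to behave like a pymongo Collection. Using
    positionId as the unique key is an assumption about the response data.
    """
    written = 0
    for job in positions:
        if key not in job:
            continue  # skip entries we cannot identify
        # upsert=True inserts the document if no match exists, otherwise updates
        # it, so re-running the crawler does not create duplicates.
        collection.update_one({key: job[key]}, {"$set": job}, upsert=True)
        written += 1
    return written


def open_collection(uri="mongodb://localhost:27017", db="lagou", name="python_jobs"):
    """Connect to MongoDB; needs pymongo installed and a running server."""
    from pymongo import MongoClient  # lazy import: save_positions has no hard dependency
    return MongoClient(uri)[db][name]


class InMemoryCollection:
    """Tiny stand-in that mimics update_one(..., upsert=True) for a dry run."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filt, update, upsert=False):
        doc_key = tuple(sorted(filt.items()))
        if doc_key in self.docs or upsert:
            self.docs.setdefault(doc_key, {}).update(update["$set"])


demo = InMemoryCollection()
count = save_positions(
    demo,
    [
        {"positionId": 1, "salary": "10k"},
        {"positionId": 1, "salary": "12k"},  # same id: updates, does not duplicate
        {"salary": "?"},                     # no id: skipped
    ],
)
print(count, len(demo.docs))
```

Against a real server the call would be save_positions(open_collection(), positions), where positions is the list extracted from the Ajax response.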
Finally, here is a screenshot of the scraped data.
[Image: screenshot of the scraped data (upload failed)]