Scraping Lagou job listings with Python's requests library

程序员文章站 2022-09-05 15:28:49

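Press F12 to open the browser's developer tools and capture the network traffic; this is how you locate the endpoint that serves the job listings.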

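The captured request shows the endpoint's URL and its form data. In the form, pn is the page number being requested and kd is the keyword for the job being searched.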

Build the POST request in Python:

import requests

# Form data: pn is the page number, kd is the search keyword
data = {
  'first': 'true',
  'pn': '1',
  'kd': 'python'
}

headers = {
  'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput=',
  'user-agent': 'mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.198 safari/537.36'
}

res = requests.post("https://www.lagou.com/jobs/positionajax.json?needaddtionalresult=false", data=data, headers=headers)
print(res.text)

This request does not return any job data from the endpoint.
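One way to tell a real result from an error reply is to check whether the decoded JSON contains the nested result list. The sketch below assumes the key path used by the full script later in this article (content, positionresult, result). The helper name `has_results` and both sample bodies are hypothetical, and the real error payload may look different.

```python
import json


def has_results(text):
    """Return True if the body decodes to JSON that holds a job list."""
    try:
        payload = json.loads(text)
    except ValueError:
        # Not JSON at all (for example an HTML error page).
        return False
    if not isinstance(payload, dict):
        return False
    content = payload.get("content")
    if not isinstance(content, dict):
        return False
    position_result = content.get("positionresult")
    if not isinstance(position_result, dict):
        return False
    return isinstance(position_result.get("result"), list)


# Hypothetical bodies for illustration only.
ok_body = '{"content": {"positionresult": {"result": [{"city": "Beijing"}]}}}'
error_body = '{"status": false, "msg": "too many requests"}'
print(has_results(ok_body), has_results(error_body))
```

Calling a check like this before parsing lets the crawler stop or retry instead of failing with a KeyError on an error reply.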

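Even after switching to a different network, the endpoint still returned a "too many requests" error message. A closer look showed that the endpoint requires dynamic cookies; without them it keeps returning the same error.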

data = {
  'first': 'true',
  'pn': '1',
  'kd': 'python'
}

# The headers must include user-agent and referer, otherwise no cookies are returned
headers = {
  'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput=',
  'user-agent': 'mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.198 safari/537.36'
}

# Get the cookies by visiting the job list page
r1 = requests.get("https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput=", headers=headers)

# Pass the cookies from the first response into the POST request
r2 = requests.post("https://www.lagou.com/jobs/positionajax.json?needaddtionalresult=false", data=data, headers=headers, cookies=r1.cookies)
print(r2.text)
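As an alternative to passing cookies by hand, `requests.Session` stores the cookies set by earlier responses and sends them on later requests. The sketch below swaps that in for the manual approach above. `fetch_page`, `main`, and the timeout value are illustrative choices, and whether the site accepts requests made this way has not been verified here.

```python
LIST_URL = "https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput="
API_URL = "https://www.lagou.com/jobs/positionajax.json?needaddtionalresult=false"


def fetch_page(session, headers, page, keyword):
    """Visit the list page to pick up cookies, then POST the search form."""
    # A requests.Session keeps any cookies set by this response.
    session.get(LIST_URL, headers=headers, timeout=10)
    form = {"first": "true", "pn": str(page), "kd": keyword}
    # Cookies stored on the session are sent automatically with this request.
    response = session.post(API_URL, data=form, headers=headers, timeout=10)
    return response.text


def main():
    # Imported here so that fetch_page can be exercised without the library.
    import requests

    headers = {
        "referer": LIST_URL,
        "user-agent": "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.198 safari/537.36",
    }
    with requests.Session() as session:
        print(fetch_page(session, headers, 1, "python"))
```

Call `main()` to run it against the live site. Because `fetch_page` only needs an object with `get` and `post` methods, it can also be tried offline with a stand-in session.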

Note that the cookies are also refreshed once every ten requests to the endpoint. The complete crawler code follows.

import json
import logging

import requests

# Get the cookies
def getcookie():
  res = requests.get("https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput=",
        headers=headers)
  return res.cookies

# Fetch one page of JSON data
def getpage(i, cookies, kw):
  data = {
    'first': 'true',
    'pn': i,
    'kd': kw
  }
  res = requests.post("https://www.lagou.com/jobs/positionajax.json?needaddtionalresult=false", data=data,
             headers=headers, cookies=cookies)
  return json.loads(res.text)

# Join a list of strings into one space-separated string
def reducelist(l):
  text = ""
  for i in l:
    text += i + " "
  return text.strip()

# Extract the fields and write them to the file
def saveincsv(f, data):
  js = data["content"]["positionresult"]["result"]
  for node in js:

    # Handle a missing district
    district = node["district"]
    if district is not None:
      district = "-" + district
    else:
      district = ""

    f.write(
      node["positionname"] + "·" + node["city"] + district + "·" + node[
        "salary"] + "·" +
      node["workyear"] + "·" + node["education"] + "·" + reducelist(node["skilllables"]) + "·" +
      node["companyshortname"] + "·" + node["companysize"] + "·" + node["positionadvantage"] + "\n")

if __name__ == '__main__':
  # Define the request headers
  headers = {
    'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromsearch=true&labelwords=&suginput=',
    'user-agent': 'mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.198 safari/537.36'
  }

  # Get the initial cookies
  cookies = getcookie()

  with open("file.csv", "w", encoding="utf-8") as f:
    for i in range(1, 31):
      # Fetch fresh cookies every ten requests
      if i % 10 == 0:
        cookies = getcookie()

      # Fetch the page, then extract the fields and save them
      data = getpage(i, cookies, "python")
      saveincsv(f, data)
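The script above joins the fields by hand with "·". As one possible alternative, the standard `csv` module takes care of delimiters and quoting. The sketch below assumes the same lowercase field names used above; `to_row` and the sample record are hypothetical and only illustrate the idea.

```python
import csv
import io


def to_row(node):
    """Flatten one job record into a list of strings for csv.writer."""
    district = node.get("district")
    # Append the district only when the record has one.
    location = node["city"] + ("-" + district if district else "")
    # Join the skill labels with spaces; a missing list becomes an empty string.
    skills = " ".join(node.get("skilllables") or [])
    return [node["positionname"], location, node["salary"], node["workyear"],
            node["education"], skills, node["companyshortname"],
            node["companysize"], node["positionadvantage"]]


# Made-up record for illustration only.
sample = {
    "positionname": "Python Developer", "city": "Beijing", "district": None,
    "salary": "15k-25k", "workyear": "3-5 years", "education": "Bachelor",
    "skilllables": ["Python", "Django"], "companyshortname": "ExampleCo",
    "companysize": "50-150", "positionadvantage": "Flexible hours, bonus",
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(to_row(sample))
print(buffer.getvalue())
```

When writing to a real file instead of an in-memory buffer, the `csv` documentation recommends opening it with `newline=""`, for example `open("file.csv", "w", encoding="utf-8", newline="")`.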
