Scraping Boss直聘 with Python + Selenium

程序员文章站 2022-04-26 09:59:48

...

Import the required modules

from selenium import webdriver
import time
import random
import re  # used by re.sub in the scraping loop
import pandas as pd

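Launch Chrome
driver=webdriver.Chrome()

Implicit wait (when an element is looked up, wait up to 10 seconds for it to appear)
driver.implicitly_wait(10)

To reduce the chance of being detected, set a random wait time in seconds (1 or 3 plus a random fraction; the value is drawn once here and then reused by every sleep)
rand_seconds = random.choice([1,3])+random.random()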
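
Because rand_seconds is computed a single time, every time.sleep(rand_seconds) pauses for the same duration. A minimal sketch of drawing a fresh delay on each call is shown here; random_delay is a hypothetical helper name and is not part of the original code.

```python
import random

def random_delay():
    """Return a fresh delay: 1 or 3 seconds plus a random fraction in [0, 1)."""
    return random.choice([1, 3]) + random.random()

# Each call draws a new value, so consecutive pauses differ
delays = [random_delay() for _ in range(5)]
print(delays)
```

With such a helper, a pause could be written as time.sleep(random_delay()) so that the interval changes between requests.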

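Analyze the URL

The results URL used in the code is https://www.zhipin.com/c101010100/?query=python&page=N, where query carries the search keyword and page carries the page number.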

There are 10 pages of results in total
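
As a rough illustration of how the 10 page URLs follow from the query and page parameters, the sketch below builds them with the standard library. BASE_URL is copied from the URL in the article's code; build_page_urls is a hypothetical helper name introduced only for this example.

```python
from urllib.parse import urlencode

# Base address copied from the URL used in the article's code
BASE_URL = 'https://www.zhipin.com/c101010100/'

def build_page_urls(keyword, pages):
    """Return the results-page URLs for pages 1..pages."""
    return [BASE_URL + '?' + urlencode({'query': keyword, 'page': page})
            for page in range(1, pages + 1)]

page_urls = build_page_urls('python', 10)
print(page_urls[0])
print(page_urls[-1])
```

Using urlencode rather than string formatting also escapes keywords that contain spaces or non-ASCII characters.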


Full code

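# Loop 10 times, once for each of the 10 results pages
for i in range(1, 11):
    time.sleep(rand_seconds)
    driver.get(url='https://www.zhipin.com/c101010100/?query=python&page=%d' % i)
    driver.implicitly_wait(10)  # implicit wait

    # Selecting every title on the page at once returned duplicated titles,
    # so each list item is targeted explicitly by its index instead
    for a in range(1, 29):
        # Holds all rows scraped for this list item
        listall = []

        # 'xpath' is the value of By.XPATH (the find_elements_by_xpath helpers no longer exist in Selenium 4.3+)
        # Job title
        listName = driver.find_elements('xpath', '//*[@id="main"]/div/div/ul/li[%d]/div/div[1]/h3/a/div[1]' % a)
        # Link that wraps the title
        urls = driver.find_elements('xpath', '//*[@id="main"]/div/div/ul/li[%d]/div/div[1]/h3/a' % a)

        # Read the title text and the href up front: the elements go stale once the browser navigates away
        jobs = [(name.text, link.get_attribute('href')) for name, link in zip(listName, urls)]

        for name, href in jobs:
            # Open the detail page and scrape the job description
            time.sleep(rand_seconds)
            driver.get(url=href)

            content = driver.find_elements('xpath', '//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div')

            # Collect the text of each description block
            for block in content:
                # For re.sub usage, see https://blog.csdn.net/su_zhen_hua/article/details/90779053
                row = {'name': name, 'url': href,
                       'content': re.sub(r'\n|\s|,,,,', ',', block.text)}
                print(row)
                listall.append(row)

            # After the description was scraped the browser did not move on to the next item by itself,
            # so driver.back() returns to the results page
            driver.back()

        if listall:
            # Convert the list of dicts to a pandas DataFrame
            data = pd.DataFrame(listall)
            # Append the rows to the CSV file
            data.to_csv('Boss直聘.csv', encoding='utf8', mode='a+', header=False, index=False)

# Close the browser when finished (quit() also ends the WebDriver session)
driver.quit()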
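
When rows are collected as dictionaries in a list, as listall does here, each appended row needs to be its own dict object. The sketch below, using made-up titles rather than scraped data, shows what happens when a single dict is reused across iterations and how a fresh dict per row avoids it.

```python
# Reusing one dict: every list entry refers to the same object
shared = {}
reused = []
for title in ['job A', 'job B']:
    shared['name'] = title
    reused.append(shared)

# A fresh dict per row keeps each row independent
fresh = []
for title in ['job A', 'job B']:
    fresh.append({'name': title})

print(reused)  # both entries carry the last title written
print(fresh)   # each entry keeps its own title
```

This is one plausible way for every stored row to end up identical even though different values were scraped along the way.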

Output after the scrape finishes


CSV screenshot

For re.sub usage, see https://blog.csdn.net/su_zhen_hua/article/details/90779053
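
To make the effect of the article's re.sub pattern concrete, here is a small sketch on a made-up string. clean_as_in_article applies the same pattern as the scraping code; clean_collapsed is a hypothetical variant, not something the article uses, that collapses runs of whitespace and commas into one comma.

```python
import re

def clean_as_in_article(text):
    # Same pattern as the article: every whitespace character
    # (or a literal run of four commas) becomes one comma
    return re.sub(r'\n|\s|,,,,', ',', text)

def clean_collapsed(text):
    # Hypothetical variant: any run of whitespace and commas becomes a single comma
    return re.sub(r'[\s,]+', ',', text).strip(',')

sample = 'a\n\nb c'
print(clean_as_in_article(sample))
print(clean_collapsed(sample))
```

The first form leaves one comma per whitespace character, so blank lines in a job description show up as repeated commas; the second trades that fidelity for a tidier single separator.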
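About mode="a+": the file is opened for appending, so each to_csv call adds rows to the end of the CSV instead of overwriting it.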
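
The append behaviour can be sketched without a browser. The example below uses the standard csv module rather than pandas, and unlike the article's call (which writes no header row) it writes a header once when the file is new. append_rows, the demo path, and the example.com URLs are all hypothetical names chosen for illustration.

```python
import csv
import os
import tempfile

def append_rows(path, rows, fieldnames):
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode='a+', encoding='utf8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

# Write to a temporary directory so the demo leaves no file behind in the project
demo_path = os.path.join(tempfile.mkdtemp(), 'jobs_demo.csv')
fields = ['name', 'url']
append_rows(demo_path, [{'name': 'job A', 'url': 'https://example.com/a'}], fields)
append_rows(demo_path, [{'name': 'job B', 'url': 'https://example.com/b'}], fields)

with open(demo_path, encoding='utf8', newline='') as f:
    lines = f.read().splitlines()
print(lines)
```

Because the file is opened in append mode, rerunning a scraper against the same path keeps adding rows, so deleting or renaming the old file first avoids duplicates.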

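Earlier posts

Reading and writing CSV

Differences between Python file read/write modes

Pagination in the Django framework

Plotting for machine learning

Learning web scraping: selenium (1)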

