Python Web Crawling: Scraping Dynamic Pages with requests and selenium

程序员文章站 2022-04-25 23:10:20



Scraping dynamic page data with the requests module
Scraping Toutiao news comments with selenium
Comprehensive example

Scraping dynamic page data with the requests module

"""
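Scrape dynamic page data with the requests module.
Toutiao: the comments on one news article.
"""
import json

import requests

# URL of the comment API that the page calls
url = "https://www.toutiao.com/api/comment/list/?group_id=6749065854995939854&item_id=6749065854995939854&offset=0&count=15"

# Request headers (a browser-like User-Agent)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0",
}

# Silence the many warnings that verify=False would otherwise write to the log
requests.packages.urllib3.disable_warnings()

# Send the request and read the response
response = requests.get(url, headers=headers, verify=False)
if response.status_code == 200:
    # print(response.text)
    # print(response.json())
    # "今日评论" means "today's comments"
    with open("今日评论.txt", "w", encoding="utf-8") as f:
        # Write real JSON; str() on the dict would store a Python repr instead
        json.dump(response.json(), f, ensure_ascii=False)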
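
The URL above carries `offset` and `count` query parameters, which suggests that further comments can be fetched by raising `offset`. The sketch below builds one URL per page under that assumption. The helper name `comment_page_urls` is hypothetical, and both the paging behaviour and the use of the same value for `group_id` and `item_id` are inferred from the single example URL, not from any documented API.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL; how the API pages is an assumption.
COMMENT_API = "https://www.toutiao.com/api/comment/list/"


def comment_page_urls(group_id, pages, count=15):
    """Yield one comment-API URL per page, advancing offset by count each time."""
    for page in range(pages):
        query = urlencode({
            "group_id": group_id,
            "item_id": group_id,  # assumption: same value as group_id, as in the example
            "offset": page * count,
            "count": count,
        })
        yield f"{COMMENT_API}?{query}"


if __name__ == "__main__":
    for page_url in comment_page_urls("6749065854995939854", pages=3):
        print(page_url)
```

Each generated URL can be passed to `requests.get` with the same headers as in the script above.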

Scraping Toutiao news comments with selenium

The driver for Google Chrome

"""
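Scrape Toutiao news comments with selenium.
"""
# Install selenium first (pip install selenium)

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Chrome needs a separately downloaded driver whose version is close to the
# browser's version; the driver has to be placed in the installation directory.
# Run headless so that no browser window is opened
options = webdriver.ChromeOptions()
options.add_argument("--headless")


driver = webdriver.Chrome(options=options)

driver.get("https://www.toutiao.com/a6749540925430563339/")

# Click first, then read
# Locate the "load more" link
loadMore = driver.find_element(By.CSS_SELECTOR, "a.c-load-more")
# Simulate a browser click on it
loadMore.click()

time.sleep(5)

contentDivs = driver.find_elements(By.CSS_SELECTOR, "div.c-content")
# Print the text of every comment loaded so far
for contentDiv in contentDivs:
    content = contentDiv.find_element(By.TAG_NAME, "p").text
    print(content)

# Locate the "load more" link again
loadMore = driver.find_element(By.CSS_SELECTOR, "a.c-load-more")
# Click it to load the next batch of comments
loadMore.click()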
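
The fixed `time.sleep(5)` above either waits longer than necessary or not long enough, depending on how quickly the comments load. A polling wait is a common alternative: check repeatedly and continue as soon as the content is there. Below is a minimal standard-library sketch of that idea. The helper `wait_for` is hypothetical and not part of selenium; selenium ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value; raises TimeoutError if it never arrives.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Given a WebDriver instance named `driver` and `By` imported from `selenium.webdriver.common.by`, a call such as `wait_for(lambda: driver.find_elements(By.CSS_SELECTOR, "div.c-content"))` would return the list of comment elements as soon as at least one is present, because an empty list is falsy.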


Comprehensive example

Scraping Airbnb listing information with selenium

"""
Comprehensive example
Scrape Airbnb listing information with selenium.


All the information for one listing: _14csrlku  _1df8dftk
The whole card: _b1aaqf
Name: div _1d4aktw5  _1d4aktw5
Size:  _4efw5a
Price: the span inside  _n4om66
"""
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)


for page in range(18):
    print(f"Page {page+1}")
    driver.get(f"https://www.airbnb.cn/s/%E4%B8%AD%E5%9B%BD%E6%B9%96%E5%8D%97%E7%9C%81%E9%95%BF%E6%B2%99%E5%B8%82/homes?refinement_paths%5B%5D=%2Fhomes&current_tab_id=home_tab&selected_tab_id=home_tab&screen_size=medium&hide_dates_and_guests_filters=false&place_id=ChIJxWQcnvM1JzQRgKbxoZy75bE&s_tag=ydS0THpO&section_offset=4&items_offset={page}&last_search_session_id=6d039664-ed19-4f74-8549-74986a420d9d")


    # Every listing card on the current page
    houseAll = driver.find_elements(By.CSS_SELECTOR, "div._1df8dftk")


    for i, house in enumerate(houseAll, start=1):
        # Name
        name = house.find_element(By.CSS_SELECTOR, "div._1d4aktw5").text


        # Type and size (one string, split on the separator)
        TypeSize = house.find_element(By.CSS_SELECTOR, "div._4efw5a").text
        Type = TypeSize.split(" . ")[0]
        Size = TypeSize.split(" . ")[1]

        # Price: strip the "价格" ("price") label and any line breaks
        priceText = house.find_element(By.CSS_SELECTOR, "span._n4om66").text
        price = priceText.replace("价格", "").replace("\n", "")
        print(f"{i} {name} {price} {Type} {Size}")
        time.sleep(20)
    time.sleep(30)
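
In the listing loop above, `TypeSize.split(" . ")[1]` raises an `IndexError` whenever a listing's text lacks the separator. The text cleanup can be moved into small functions that tolerate such input and can be tested without a browser. This is only a sketch: the names `split_type_size` and `clean_price` are hypothetical, and the sample strings in the usage part are made up for illustration, not scraped data.

```python
def split_type_size(text, sep=" . "):
    """Split 'type<sep>size' into (type, size); size is '' when sep is absent."""
    kind, _, size = text.partition(sep)
    return kind.strip(), size.strip()


def clean_price(text, label="价格"):
    """Remove the price label and any line breaks from the scraped price text."""
    return text.replace(label, "").replace("\n", "").strip()


if __name__ == "__main__":
    # Made-up sample strings in the shape the scraper expects
    print(split_type_size("整套公寓 . 1室1卫1床"))
    print(split_type_size("整套公寓"))
    print(clean_price("价格\n¥268"))
```

Inside the loop, `Type, Size = split_type_size(TypeSize)` would replace the two `split` lines.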
