Python: Scraping Bilibili Video Information (Follow-up)

程序员文章站 2022-05-28 23:25:40

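The previous article used Selenium to search Bilibili automatically and scrape information about the resulting videos. This article picks up where it left off and saves the videos' images to disk.
First, locate where the image data lives in the page (the original post showed a screenshot here).
Next, right-click the image URL and choose to open it in a new tab. The new tab should show nothing but the image.
If it does, you have found the raw image data. If the new tab shows anything else, keep digging until you reach the final image URL.
The only change from the previous article's code is a new function that saves the images. Here is the code.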

def img_save(soup):
    img_url_list = soup.find(class_='video-list clearfix').find_all_next(class_='img-anchor')

This finds the tags that contain the image links (the original post showed the matched HTML fragment in a screenshot here).

    index = 0
    for url in img_url_list:
        img_url = url.find('img').get('src')
        print(img_url)

//i2.hdslb.com/bfs/archive/aaa@qq.com_200h.webp

This reads the image URL from each tag and prints text like the line above.

        if img_url:  # skip tags whose src is empty or missing
            img_request = ('https:' + img_url).replace('webp', 'jpg')
            print(img_request)
            img_resp = requests.get(img_request)

https://i2.hdslb.com/bfs/archive/aaa@qq.com_200h.jpg

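The scraped URL has no scheme, so the code prepends 'https:' and replaces the trailing 'webp' with 'jpg'. If the image is saved locally without this change, it does not display properly.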
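
The URL handling above can be sketched as a small standalone function. This is a minimal sketch, not the original author's code: the name normalize_img_url and the example URL are made up for illustration. Unlike a plain str.replace, it swaps only a trailing ".webp", so a "webp" elsewhere in the URL is left alone.

```python
def normalize_img_url(src, scheme="https"):
    """Turn a protocol-relative src into an absolute URL ending in .jpg."""
    if not src:
        return None  # empty or missing src: nothing to download
    # "//host/path" has no scheme, so prepend one.
    url = scheme + ":" + src if src.startswith("//") else src
    # Swap only a trailing ".webp", leaving "webp" elsewhere untouched.
    if url.endswith(".webp"):
        url = url[: -len(".webp")] + ".jpg"
    return url


# Hypothetical URL, used only for illustration.
print(normalize_img_url("//example.com/webp-covers/1.jpg@320w_200h.webp"))
# prints https://example.com/webp-covers/1.jpg@320w_200h.jpg
```

Returning None for an empty src lets the caller skip the download with a simple truth test.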

            if not os.path.exists("cxk_video_img"):
                os.mkdir('cxk_video_img')
            with open('cxk_video_img/%d.jpg' % index, 'wb') as f:
                f.write(img_resp.content)
        index += 1
        time.sleep(1)

The last step creates the folder if it does not exist yet and saves the image into it.
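
The save step can also be written as two small helpers. This is a sketch under stated assumptions: the names save_image and next_free_index are hypothetical and not part of the original code. os.makedirs with exist_ok=True replaces the explicit existence check. If a counter restarts at 0 for every results page, later pages overwrite files saved from earlier pages; picking the next unused index is one way around that.

```python
import os


def save_image(content, directory, index):
    """Write image bytes to <directory>/<index>.jpg and return the path."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    path = os.path.join(directory, "%d.jpg" % index)
    with open(path, "wb") as f:
        f.write(content)
    return path


def next_free_index(directory):
    """Return the smallest index whose .jpg file does not exist yet."""
    index = 0
    while os.path.exists(os.path.join(directory, "%d.jpg" % index)):
        index += 1
    return index
```

A caller could use save_image(img_resp.content, 'cxk_video_img', next_free_index('cxk_video_img')) so that each new page continues the numbering instead of starting over.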
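One problem remains: some of the image URLs come back empty. I am not sure whether this is related to network speed or to a flaw in the code that I have not spotted. Corrections are welcome.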
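
The cause of the empty URLs is left open here. One possible explanation, which is an assumption and not something the original confirms, is that the page loads images lazily, so an img tag only receives its src once it has scrolled into view. Under that assumption, scrolling through the page before reading browser.page_source may help. The helper name scroll_page and its parameters are hypothetical; execute_script is the standard Selenium WebDriver method for running JavaScript in the page.

```python
import time


def scroll_page(driver, steps=5, pause=0.5):
    """Scroll from top to bottom in `steps` equal jumps, pausing after each."""
    script = (
        "window.scrollTo(0, "
        "document.body.scrollHeight * arguments[0] / arguments[1]);"
    )
    for step in range(1, steps + 1):
        # arguments[0] and arguments[1] receive step and steps.
        driver.execute_script(script, step, steps)
        time.sleep(pause)  # give the page time to load newly visible images
```

In get_source(), this could be called as scroll_page(browser) just before html = browser.page_source. Whether it removes the empty URLs would need to be checked against the live page.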
The complete code follows:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import xlwt
import time
import re
import requests
import os

browser = webdriver.Chrome()
#browser = webdriver.PhantomJS()
WAIT = WebDriverWait(browser, 20)
browser.set_window_size(1400, 900)

book = xlwt.Workbook(encoding='utf-8', style_compression=0)

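# The sheet name and header cells are written in Chinese, as in the original output.
sheet = book.add_sheet('蔡徐坤篮球', cell_overwrite_ok=True)
sheet.write(0, 0, '名称')      # title
sheet.write(0, 1, '地址')      # URL
sheet.write(0, 2, '描述')      # description
sheet.write(0, 3, '观看次数')  # view count
sheet.write(0, 4, '弹幕数')    # danmaku (bullet comment) count
sheet.write(0, 5, '发布时间')  # publish date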

n = 1


def search():
    try:
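        print('Opening bilibili...')
        browser.get("https://www.bilibili.com/")

        # find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...).
        # The element is named search_box so it does not shadow search(), which the
        # except branch below calls again.
        search_box = browser.find_element(By.XPATH, '//div[@class="nav-search"]/form/input')
        search_box.send_keys("蔡徐坤 篮球")
        search_box.send_keys(Keys.ENTER)

        # Switch to the newly opened window
        print('Switching to the new window')
        all_h = browser.window_handles
        browser.switch_to.window(all_h[1])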

        # html = browser.page_source
        # # print(html)
        # soup = BeautifulSoup(html, 'html.parser')
        # save_to_excel(soup)
        # total_index = soup.find(class_='page-item last').find(class_='pagination-btn')

        get_source()
        total_index = WAIT.until(EC.presence_of_element_located((By.CSS_SELECTOR,
                                                                 "li.page-item.last > button")))

        # pattern = re.compile('<div class="page-wrap">.*?<li class="page-item last">.*?(\d+).*?</div>', re.S)
        # total_index = int(re.findall(pattern, html)[0])
        return int(total_index.text)
    except TimeoutException:
        return search()


def next_page(page_num):
    try:
        print('Fetching page %d' % page_num)
        next_btn = WAIT.until(EC.element_to_be_clickable((By.CSS_SELECTOR,
                                                          'li.page-item.next > button')))
        next_btn.click()
        get_source()
    except TimeoutException:
        browser.refresh()
        return next_page(page_num)


def save_to_excel(soup):
    # Named items rather than list, to avoid shadowing the built-in.
    items = soup.find(class_='video-list clearfix').find_all_next(class_='info')

    for item in items:
        item_title = item.find('a').get('title')
        item_link = item.find('a').get('href')
        item_dec = item.find(class_='des hide').text
        item_view = item.find(class_='so-icon watch-num').text
        item_biubiu = item.find(class_='so-icon hide').text
        item_date = item.find(class_='so-icon time').text

        print('Scraped: ' + item_title)

        global n

        sheet.write(n, 0, item_title)
        sheet.write(n, 1, item_link)
        sheet.write(n, 2, item_dec)
        sheet.write(n, 3, item_view)
        sheet.write(n, 4, item_biubiu)
        sheet.write(n, 5, item_date)

        n = n + 1

# Running image number. It is kept across pages because img_save() runs once
# per results page; a counter reset to 0 each time would overwrite earlier files.
img_index = 0


def img_save(soup):
    global img_index
    img_url_list = soup.find(class_='video-list clearfix').find_all_next(class_='img-anchor')
    for url in img_url_list:
        img_url = url.find('img').get('src')
        print(img_url)
        if img_url:  # skip tags whose src is empty or missing
            img_request = ('https:' + img_url).replace('webp', 'jpg')
            print(img_request)
            img_resp = requests.get(img_request)

            if not os.path.exists("cxk_video_img"):
                os.mkdir('cxk_video_img')
            with open('cxk_video_img/%d.jpg' % img_index, 'wb') as f:
                f.write(img_resp.content)
        img_index += 1
        time.sleep(1)

def get_source():
    WAIT.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'ul.video-list.clearfix')))
    # browser.refresh()
    html = browser.page_source
    # print(html)
    soup = BeautifulSoup(html, 'html.parser')
    save_to_excel(soup)
    img_save(soup)

def main():
    try:
        total = search()
        # print(total)

        for i in range(2, int(total)+1):
            next_page(i)

    finally:
        browser.close()
        browser.quit()


if __name__ == '__main__':
    main()
    book.save(u'蔡徐坤篮球.xls')

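I had also wanted to download the videos themselves, but I could not find the final video URL, so that is shelved for now. Saving a video locally could be done in much the same way as saving an image.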
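
One difference worth noting is size: a video is usually far larger than a thumbnail, so writing it in chunks avoids holding the whole file in memory. The sketch below is an illustration only; the helper name save_stream is hypothetical, and it does not solve the problem of finding the video URL.

```python
def save_stream(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total
```

With requests, the chunks could come from resp = requests.get(video_url, stream=True) followed by save_stream(resp.iter_content(chunk_size=1024 * 1024), 'video.mp4'), where video_url stands in for the address that still has to be found.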
