Crawler Example 3: Scraping the Sina Hot Search List in Real Time with Python

程序员文章站 2022-06-30 11:47:18


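The Sina hot search list updates every minute, so the script runs an infinite loop that scrapes it once per minute. It creates output folders named after the current date and writes each scrape to an Excel file in chronological order.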

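Steps:

1. In the browser, press F12 to inspect the HTML tag structure of the hot search page, and check whether the list is paginated and, if so, what pattern the pages follow.

2. Write the code as three modules: URL, HTML parsing, and output:

URL: in this example the URL is a fixed static page with no pagination, so it can be used directly.

HTML parsing: send a GET request with the requests library, then parse the response content by tag with BeautifulSoup.

Output: use the DataFrame object from pandas to write the parsed list to Excel.

3. Test whether the parsed content has problems. Problems found this time: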

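① The text contains spaces. If the only goal is to remove whitespace, ''.join(s.split()) does it: split() with no arguments cuts the string on whitespace into a list, and join() then concatenates the elements. Before using it, check how the whitespace actually appears in the string. In this example some cells had the form text + space + number + text, and splitting on every space distorted the result. Besides the separator, split() also accepts a maximum number of splits (maxsplit). With maxsplit set to 1 the string is cut once, and a list of at most two elements is returned.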
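The difference between the two approaches can be sketched as follows. The helper names and the sample string are made up for illustration and are not taken from the real page.

```python
def strip_all_whitespace(text):
    """Remove every whitespace character, including those inside the text."""
    return "".join(text.split())


def split_once(text, sep="\n"):
    """Split at the first sep only and always return a (head, tail) pair."""
    parts = text.split(sep, 1)  # maxsplit=1: at most two elements
    head = parts[0].strip()
    tail = parts[1].strip() if len(parts) > 1 else ""
    return head, tail


sample = "Topic A\n 12345 views"  # made-up cell text, not real data
print(strip_all_whitespace(sample))  # TopicA12345views
print(split_once(sample))            # ('Topic A', '12345 views')
```

Removing all whitespace glues the title, the number, and the trailing text together, while a single split keeps the title intact and leaves the rest in one piece for the second column.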

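② When parsing HTML with BeautifulSoup, analyze the state of the page elements first. If the code looks correct but keeps raising errors, the tag probably has a variant you have not seen yet. Here it was the "荐" (promoted) marker: in a row that carries it, href does not hold the link as it does in other rows, and the value has to be read from href_to instead. Because the list changes every minute, such a case may not appear while you are writing the code and only shows up later as wrong parsed content.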
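One way to isolate this fallback is a small helper that works on the attribute dictionary of the <a> tag (BeautifulSoup exposes it as Tag.attrs). This is only a sketch: the function name, the constants, and the sample paths in the usage lines are hypothetical, and it returns None instead of raising when neither attribute is usable.

```python
from urllib.parse import urljoin

BASE_URL = "https://s.weibo.com/"
PLACEHOLDER = "javascript:void(0);"  # href value seen on promoted rows


def resolve_href(attrs, base_url=BASE_URL):
    """Pick the usable link from an <a> tag's attribute dict.

    attrs is a plain dict such as BeautifulSoup's Tag.attrs.
    Returns None when neither href nor href_to holds a link.
    """
    href = attrs.get("href")
    if not href or href == PLACEHOLDER:
        # Promoted rows keep the real target in href_to
        href = attrs.get("href_to")
    if not href:
        return None
    # urljoin handles a leading slash without producing "//"
    return urljoin(base_url, href)


# Made-up attribute dicts for illustration
print(resolve_href({"href": "/weibo?q=abc"}))
print(resolve_href({"href": PLACEHOLDER, "href_to": "/weibo?q=xyz"}))
```

Inside the parsing loop it could be called as resolve_href(tr.find('td', class_="td-02").a.attrs), so a row with an unexpected shape yields None rather than stopping the whole crawl with a KeyError.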

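Modules used:

os # locates and manipulates files; its os.path submodule works with paths and file properties. Note that some os functions take a file descriptor (fd) argument; look up its meaning if you need it

time # gets the system time

requests # sends the GET request

BeautifulSoup # parses the HTML page

pandas # mainly DataFrame's to_excel / to_csv

Source code: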

import os
import time
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame

Sina_url = 'https://s.weibo.com/top/summary'
path = r'G:\Python\First\Sina_Hot'  # raw string so the backslashes are not read as escapes

def down_url(url):
    header = {
        "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
    }
    # The timeout keeps a stalled request from blocking the loop forever
    res = requests.get(url, headers=header, timeout=10).content
    table = BeautifulSoup(res, "html.parser").table
    head = table.thead.tr.find_all('th')
    body = table.tbody.find_all('tr')
    Sina_cache = []
    Sina_bodyList = []
    # Collect the column names
    for title in head:
        Sina_cache.append(title.text)
    Sina_bodyList.append(Sina_cache)
    # Collect the cell contents of every row
    for tr in body:
        Sina_cache = []
        for td in tr.find_all('td'):
            Sina_cache.append(td.text.strip())
        link = tr.find('td', class_="td-02").a
        href_after = link['href']
        # When a topic carries the "荐" (promoted) marker, the real URL suffix is in href_to
        if href_after == 'javascript:void(0);':
            href_after = link['href_to']
        # Drop any leading slash so the joined URL has no double slash
        href = 'https://s.weibo.com/{}'.format(href_after.lstrip('/'))
        Sina_cache.append(href)
        Sina_bodyList.append(Sina_cache)
    # Split the title and the heat value into two separate columns
    for content in Sina_bodyList:
        Sina_cache = content[1].split('\n', 1)
        content[1] = Sina_cache[0]
        if len(Sina_cache) > 1:
            content.insert(2, Sina_cache[1])
        else:
            content.insert(2, '')
    return Sina_bodyList


def to_Excel(rows, path):
    # rows is the list of row lists returned by down_url
    frame = DataFrame(rows)
    # print(frame)
    path, nowTime = path_generation(path)
    frame.to_excel(path, sheet_name=nowTime, index=False)
    print('Hot topics for {} scraped successfully; please check the file.'.format(nowTime))


def path_generation(path):
    # Format one timestamp in several ways for the folder, file and sheet names
    now = time.localtime()
    nowTime = time.strftime("%m-%d %H-%M", now)
    nowTime_day = time.strftime("%Y-%m-%d", now)
    nowTime_hour = time.strftime("%H", now)
    nowTime_time = time.strftime("%H%M", now)

    # Folder layout: <path>\<date>\<hour>时 ("时" means "hour")
    path = os.path.join(path, nowTime_day, '{}时'.format(nowTime_hour))
    # Create the folder if it does not exist yet
    os.makedirs(path, exist_ok=True)
    path = os.path.join(path, 'Sina_{}.xlsx'.format(nowTime_time))
    return path, nowTime

if __name__ == "__main__":
    # Infinite loop: scrape, then wait one minute before the next run
    while True:
        Sina_List = down_url(Sina_url)
        to_Excel(Sina_List, path)
        time.sleep(60)
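The loop above sleeps a fixed 60 seconds after each scrape, so every cycle lasts 60 seconds plus the scraping time, and any exception (for example a failed request) ends the program. A possible variation, sketched here with hypothetical names (seconds_until_next_minute, run_every_minute, max_runs), aligns each run to the next minute boundary and keeps going after a failed run.

```python
import time


def seconds_until_next_minute(now):
    """Return how long to sleep so the next wake-up lands on a minute boundary.

    now is a Unix timestamp in seconds; now % 60 is the time already
    elapsed in the current minute.
    """
    return 60.0 - (now % 60.0)


def run_every_minute(job, max_runs=None):
    """Call job() once per minute; max_runs=None means run forever."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:  # one failed scrape should not end the loop
            print("scrape failed: {}".format(exc))
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(seconds_until_next_minute(time.time()))


if __name__ == "__main__":
    # Stand-in job for illustration; a real job would call
    # down_url() and to_Excel() from the script above.
    run_every_minute(lambda: print("scraping..."), max_runs=1)
```

Catching Exception broadly is a deliberate simplification for a long-running script; narrowing it to the request and parsing errors you actually expect would make real bugs easier to notice.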

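The output files and directory layout are shown in the figure below:

[Figure: screenshot of the output folders and files, laid out as <path>\<date>\<hour>时\Sina_<HHMM>.xlsx]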
