Scraping Ximalaya audio data with Python

程序员文章站 2022-05-04 16:40:13


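'''
Approach:
Understand the request and response flow
Parse several layers of data
Save a large number of audio files
Example audio file URL:
https://aod.cos.tx.xmcdn.com/storages/1c5f-audiofreehighqps/DB/A3/CKwRINsEdDahACUM2wCrMC1H.m4a
JSON endpoint that returns the audio file URL:
https://www.ximalaya.com/revision/play/v1/audio?id=415166844&ptype=1
Where does the id value come from? It appears in the album page:
https://www.ximalaya.com/youshengshu/4256765/

Case analysis:
1. Find the ID of each audio track in the static page
2. Substitute each ID into the request URL to fetch that track's JSON data
3. Parse the audio URL out of the JSON data, then request the audio data
'''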
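
The three steps in the case analysis can be sketched as small, network-free helpers. This is only an illustrative sketch: the function names are hypothetical, the endpoint is the one quoted in the notes above, and the shape of the track link (`/youshengshu/<album>/<id>`) and of the JSON payload (`data` then `src`) are assumptions based on those notes rather than a documented API.

```python
from urllib.parse import urlencode

# Endpoint quoted in the notes above; everything else here is a hypothetical helper.
AUDIO_API = 'https://www.ximalaya.com/revision/play/v1/audio'


def extract_audio_id(href):
    """Return the last non-empty path segment of a track link (assumed to be the audio ID)."""
    return href.rstrip('/').split('/')[-1]


def build_audio_api_url(audio_id):
    """Build the JSON request URL for one audio ID."""
    return '{}?{}'.format(AUDIO_API, urlencode({'id': audio_id, 'ptype': 1}))


def extract_audio_src(payload):
    """Return the audio URL from the decoded JSON payload, or None if it is missing."""
    return (payload.get('data') or {}).get('src')
```

Keeping these steps separate from the HTTP calls makes it possible to check the ID parsing and URL building without sending any requests.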

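import os  # filesystem helpers for the output directory

import requests  # HTTP client
import parsel  # HTML parsing with XPath/CSS selectors


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'
}
os.makedirs('video', exist_ok=True)  # make sure the output directory exists

for page in range(1, 6):
    print('========== Downloading page {} =========='.format(page))
    url = 'https://www.ximalaya.com/youshengshu/3210/p{}/'.format(page)
    print(url)
    response = requests.get(url=url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)  # wrap the HTML text in a Selector
    lis = selector.xpath('//div[@class="sound-list _is"]/ul/li')
    for li in lis:
        title = li.xpath('.//a/@title').get()
        href = li.xpath('.//a/@href').get()
        print(title, href)
        if not title or not href:
            continue  # skip list items that carry no track link

        # 1. Parse the audio ID from the last segment of the link
        m4a_id = href.rstrip('/').split('/')[-1]

        # 2. Substitute the ID into the JSON request URL
        json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id)
        data_json = requests.get(url=json_url, headers=headers).json()

        # 3. Parse the audio URL from the JSON data
        m4a_url = data_json['data']['src']
        if not m4a_url:
            continue  # skip tracks whose JSON carries no audio URL

        # 4. Request the audio data and save it as video/<page><title>.m4a
        m4a_data = requests.get(url=m4a_url, headers=headers).content
        with open(os.path.join('video', str(page) + title + '.m4a'), mode='wb') as f:
            f.write(m4a_data)
        print('Saved: ', str(page) + title)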
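
The script above uses the track title directly as a file name and holds each whole file in memory before writing it. A possible refinement, sketched here under assumptions, is to clean the title first and write the audio in chunks. Both helper names (`safe_filename`, `save_chunks`) are hypothetical and not part of the original script.

```python
import os
import re


def safe_filename(title, replacement='_'):
    """Replace characters that are not allowed in Windows or POSIX file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
    return cleaned or 'untitled'


def save_chunks(chunks, directory, name):
    """Write an iterable of byte chunks to directory/name and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the output directory if needed
    path = os.path.join(directory, name)
    with open(path, mode='wb') as f:
        for chunk in chunks:
            if chunk:  # ignore empty chunks
                f.write(chunk)
    return path
```

With `requests`, the chunks could come from `requests.get(m4a_url, headers=headers, stream=True).iter_content(chunk_size=8192)`, so that a large file is never loaded into memory all at once.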
