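Java Crawler Series, Hands-On: Crawling the TOP500 Songs on Kugou Music (Source Code Included)
In the two previous posts I introduced httpclient and jsoup, each with a simple code example.
Today we put them to work on a real task: scraping the TOP500 chart on Kugou Music. Besides httpclient and jsoup, the code below also uses log4j and ehcache, for logging and caching respectively. If you are not familiar with those two, please look them up yourself; there are plenty of beginner examples online, so I won't write separate notes on them.
So why Kugou? Honestly it wasn't my idea: not long ago I read a blog post by some expert who crawled Kugou (I can't remember who, sorry~~~), and I wanted to try it with my own code. The post I read didn't seem to use a cache, nor any counter-anti-crawler tooling such as proxy IPs, so I add both in my crawler. I have tested that it can automatically work through the songs on all 23 pages (paid songs require logging in and purchasing, so they could not be downloaded; only the other 400+ free songs download normally), so the folks at Kugou needn't worry~~~
Then again, after that blog post came out Kugou doesn't seem to have done anything special about it and still left me the chance to write this code, which suggests they don't mind much; after all, paid songs cannot be crawled, and the site already has some anti-crawler measures.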
***************************************************************************
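Disclaimer:
This crawler and the content it fetches are for personal study and exchange only.
Do not use them for commercial purposes; otherwise you bear the consequences yourself.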
***************************************************************************
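OK, enough chit-chat, time for the real content~~
================ a very fancy divider ================
I. Design approach
First the approach. The blog post I read did not describe the process in detail, so let me fill it in: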
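1. Open the TOP500 chart and look at the URL in the address bar: the 1 in it is actually the page number, so to visit page n you just change the 1 to n. This is the basis of my crawl.
2. Click a specific song, for example 《你的酒馆对我打了烊》, and a new page opens: https://www.kugou.com/song/#hash=be1e1d3c2a46b4cbd259aca7ff050cd3&album_id=14913769
3. Press F12 and analyze the network requests (what, nothing shows up after opening F12? Come on, just refresh the page).
You will notice one request that takes a long time and whose type is media; it is very likely the request that actually fetches the mp3.
Look closely and indeed it is. The real address of the mp3 is: http://fs.w.kugou.com/201905272134/9d4d81230e6f5c759df51618b03961a7/g126/m00/05/09/hocbafxlaoeat3bzad1nwyw7v5m814.mp3
Close the page and open it again, and the real address of the mp3 becomes: http://fs.w.kugou.com/201905272139/2897cc9816b82f4cda304d927187b282/g126/m00/05/09/hocbafxlaoeat3bzad1nwyw7v5m814.mp3
Only the timestamp segment (201905272134 vs. 201905272139) and the token after it differ, so these two URLs alone do not tell us how the address is produced.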
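Keep digging: how does the page find this real address? Presumably one of the earlier requests returned it. Looking through the earlier requests, there is one whose response contains the real address of the mp3.
Its request URL is:
https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jquery19106506492572547629_1558964792005&hash=be1e1d3c2a46b4cbd259aca7ff050cd3&album_id=14913769&dfid=3lwatj1pqwvn09grkh3fbfaf&mid=31adc5218ff6a510b05aacad71bc7090&platid=4&_=1558964792007
Leave and capture it again, then leave, switch to another song and capture this request once more, and a pattern emerges:
The parts highlighted in pink (the hash and album_id values) come from the address bar of the song's play page, the parts in bold (1558964792005 and 1558964792007) are the date as a long value, and everything else can stay unchanged ("jquery19106506492572547629_1558964792005" does change on every request, but I tried it and it makes no difference).
So we can request this link to get the JSON that carries the mp3's real address, and then request that real address to get the music file.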
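The URL pattern described in step 3 can be sketched as a small helper. This is only an illustration under stated assumptions: the class and method names are hypothetical, the dfid, mid and callback prefix are simply copied from the one captured request above, and the same millisecond value is used for both timestamp positions (in the capture they differ by 2 ms, which the experiment above suggests does not matter).

```java
final class PlayDataUrlSketch {
    // Endpoint and fixed parameters copied verbatim from the captured request.
    private static final String BASE = "https://wwwapi.kugou.com/yy/index.php?r=play/getdata";
    private static final String CALLBACK_PREFIX = "jquery19106506492572547629_";
    private static final String DFID = "3lwatj1pqwvn09grkh3fbfaf";
    private static final String MID = "31adc5218ff6a510b05aacad71bc7090";

    /** Builds the getdata URL for one song; only hash, album_id and the timestamp vary. */
    static String build(String hash, int albumId, long nowMillis) {
        return BASE
                + "&callback=" + CALLBACK_PREFIX + nowMillis
                + "&hash=" + hash
                + "&album_id=" + albumId
                + "&dfid=" + DFID
                + "&mid=" + MID
                + "&platid=4"
                + "&_=" + nowMillis;
    }

    public static void main(String[] args) {
        // hash and album_id taken from the play-page URL shown in step 2
        System.out.println(build("be1e1d3c2a46b4cbd259aca7ff050cd3", 14913769, System.currentTimeMillis()));
    }
}
```

Whether the site still accepts the fixed dfid and mid values is something to verify yourself; they are just what this one capture contained.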
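4. So how do we get the pink values (hash and album_id)? View the source of the TOP500 list page and you will find the following snippet, which records the hash, song name, id and other basic information for every song on page n: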
// list data
global.features = [
{"hash":"be1e1d3c2a46b4cbd259aca7ff050cd3","filename":"\u9648\u96ea\u51dd - \u4f60\u7684\u9152\u9986\u5bf9\u6211\u6253\u4e86\u70ca","timelen":251.048,"privilege":10,"size":4024155,"album_id":14913769,"encrypt_id":"tlk6517"},
{"hash":"9198b18815ee8ce42ae368ae29276f78","filename":"\u9648\u96ea\u51dd - \u7eff\u8272","timelen":269.064,"privilege":10,"size":4314636,"album_id":15270740,"encrypt_id":"txskm8f"},
{"hash":"458e9b9f362277ac37e9eef1cb80b535","filename":"\u738b\u742a - \u4e07\u7231\u5343\u6069","timelen":322.011,"privilege":10,"size":5152644,"album_id":18712576,"encrypt_id":"vsdz726"},
{"hash":"7e91fde7e8d33e8ed11c6db4620917e2","filename":"\u5b64\u72ec\u8bd7\u4eba - \u6e21\u6211\u4e0d\u6e21\u5979","timelen":182.23,"privilege":10,"size":2916145,"album_id":14624971,"encrypt_id":"th6cka5"},
{"hash":"9681f4ccd830b8436db5f8218c7df0c7","filename":"\u864e\u4e8c - \u4f60\u4e00\u5b9a\u8981\u5e78\u798f","timelen":259.066,"privilege":10,"size":4155201,"album_id":12249679,"encrypt_id":"rniv71f"},
{"hash":"44abeaa9cce29afb5c947d4fbd2c567f","filename":"\u5927\u58ee - \u4f2a\u88c5","timelen":301.004,"privilege":10,"size":4817151,"album_id":15999493,"encrypt_id":"u6n6i28"},
{"hash":"5fce4cbcb96d6025033bce2025fc3943","filename":"\u5468\u6770\u4f26 - \u544a\u767d\u6c14\u7403","timelen":215,"privilege":10,"size":3443771,"album_id":1645030,"encrypt_id":"d5c5m23"},
{"hash":"0a62227caab66f54d43ec084b4bdd81f","filename":"\u5468\u6770\u4f26 - \u7a3b\u9999","timelen":223.582,"privilege":10,"size":3577344,"album_id":960399,"encrypt_id":"74itc7"},
{"hash":"a11f7a8bd2ea5bbdb32f58a9081f27b4","filename":"\u82b1\u59d0 - \u72c2\u6d6a","timelen":181.037,"privilege":10,"size":2902317,"album_id":13476703,"encrypt_id":"sfzob9f"},
{"hash":"33eb8fe0dc9f70d9f7fe4cb77305d5a8","filename":"\u6d77\u6765\u963f\u6728\u3001\u963f\u5477\u62c9\u53e4\u3001\u66f2\u6bd4\u963f\u4e14 - \u522b\u77e5\u5df1","timelen":280.111,"privilege":10,"size":4482365,"album_id":16324799,"encrypt_id":"uajki71"},
{"hash":"76d04f195c1f081cc0cd027a310a7d9a","filename":"\u738b\u742a - \u7ad9\u7740\u7b49\u4f60\u4e09\u5343\u5e74","timelen":381.083,"privilege":10,"size":6109771,"album_id":13886090,"encrypt_id":"sunkg88"},
{"hash":"9c00a468d2658487db2de4ed16a12b5a","filename":"\u738b\u8d30\u6d6a - \u50cf\u9c7c","timelen":285.031,"privilege":10,"size":4565459,"album_id":13621986,"encrypt_id":"smhia84"},
{"hash":"4f76587a5b0b93eef15883e54dd3e2db","filename":"\u6bdb\u4e0d\u6613 - \u6d88\u6101 (live)","timelen":179,"privilege":10,"size":2870658,"album_id":2900867,"encrypt_id":"gf96d56"},
{"hash":"8b7df540f77042fb76da1ee3a79eae0a","filename":"ncf-\u827e\u529b - \u9ece\u660e\u524d\u7684\u9ed1\u6697 (\u5973\u58f0\u7248)","timelen":145.058,"privilege":10,"size":2329748,"album_id":17997426,"encrypt_id":"twhgf05"},
{"hash":"7a3269c36d07e88a24fb35d246856fa4","filename":"yusee\u897f - \u5fc3\u5982\u6b62\u6c34","timelen":182.883,"privilege":10,"size":2926594,"album_id":19692772,"encrypt_id":"wd07h77"},
{"hash":"7995a2173ed0914868bb860f93c3d642","filename":"\u9b4f\u65b0\u96e8 - \u4f59\u60c5\u672a\u4e86","timelen":216.189,"privilege":10,"size":3459539,"album_id":20709823,"encrypt_id":"wnru4c8"},
{"hash":"d8e40da7f51c0486224e008a3b6abd45","filename":"\u5154\u5b50\u7259 - \u5c0f\u767d\u5154\u9047\u4e0a\u5361\u5e03\u5947\u8bfa","timelen":163.087,"privilege":10,"size":2622454,"album_id":12492325,"encrypt_id":"rrrbccf"},
{"hash":"d2462b148305ff7d990f3b6eb3f90d66","filename":"\u5f20\u656c\u8f69 - \u53ea\u662f\u592a\u7231\u4f60","timelen":254.302,"privilege":10,"size":4080941,"album_id":558311,"encrypt_id":"3f65bd"},
{"hash":"03fe01457005ceef8627be5e5313d230","filename":"\u84dd\u4e03\u4e03 - \u9ece\u660e\u524d\u7684\u9ed1\u6697 (\u5973\u58f0\u7248)","timelen":111.986,"privilege":10,"size":1792253,"album_id":19842582,"encrypt_id":"w8lwi96"},
{"hash":"96e064a41ab84ebe4c03c6aae3cb9334","filename":"\u5f20\u7d2b\u8c6a - \u53ef\u4e0d\u53ef\u4ee5","timelen":240.093,"privilege":10,"size":3855453,"album_id":9618875,"encrypt_id":"mkt6v7f"},
{"hash":"5d6cce061bd65404bf5669fdd26c40b1","filename":"\u4e01\u8299\u59ae - \u53ea\u662f\u592a\u7231\u4f60","timelen":247.797,"privilege":10,"size":3965342,"album_id":18231730,"encrypt_id":"vhrxi30"},
{"hash":"95b48a0894fc2198b6e2b93c034aac72","filename":"\u5468\u6770\u4f26 - \u9752\u82b1\u74f7","timelen":239.046,"privilege":10,"size":3825206,"album_id":979856,"encrypt_id":"7a6sd6"}
];
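Once fetched, put this information into the ehcache cache with hash as the key and album_id as the value. When looping over the individual songs, the hash can also be obtained from the play page, and the album_id is then looked up in the cache by that hash.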
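The lookup in step 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the project: it assumes the play-page link has the "#hash=...&album_id=..." form shown in step 2, it uses a plain HashMap as a stand-in for the ehcache cache, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class PlayPageHashSketch {
    /** Parses the part after '#' of a play-page URL into key/value pairs. */
    static Map<String, String> parseFragment(String href) {
        Map<String, String> params = new HashMap<>();
        int idx = href.indexOf('#');
        if (idx < 0) {
            return params; // no fragment, nothing to parse
        }
        for (String pair : href.substring(idx + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Stand-in for the ehcache cache: hash -> album_id
        Map<String, Integer> cache = new HashMap<>();
        cache.put("be1e1d3c2a46b4cbd259aca7ff050cd3", 14913769);

        String href = "https://www.kugou.com/song/#hash=be1e1d3c2a46b4cbd259aca7ff050cd3&album_id=14913769";
        String hash = parseFragment(href).get("hash");
        System.out.println(hash + " -> " + cache.get(hash));
    }
}
```

If the links on the list page do not carry the hash in the fragment, the hash has to be taken from the play page some other way; the cache lookup itself stays the same.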
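5. With the information above the files can be crawled normally, but after crawling for a while the downloads start to fail. The log shows that the real mp3 address can no longer be fetched and that error_code in the returned JSON is non-zero, which means the site has recognized the crawler. This is where proxy IPs come in: once recognized, switch to another proxy IP, and keep cycling like this until all songs have been processed or the proxy IPs run out.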
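The retry loop from step 5 might look something like the sketch below. It is deliberately detached from httpclient: the fetch function is a hypothetical stand-in that returns an empty Optional whenever the site rejects the request (error_code not 0 in the post), and the proxy addresses in main are made-up placeholders.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Optional;
import java.util.function.Function;

final class ProxyRotationSketch {
    /**
     * Tries the fetch through each proxy in turn until one returns a result.
     * A proxy that gets rejected is dropped; the working one stays at the head
     * of the queue so the next song reuses it.
     */
    static <T> Optional<T> fetchWithRotation(Deque<String> proxies,
                                             Function<String, Optional<T>> fetch) {
        while (!proxies.isEmpty()) {
            Optional<T> result = fetch.apply(proxies.peekFirst());
            if (result.isPresent()) {
                return result;
            }
            proxies.removeFirst(); // recognized by the site, switch to the next proxy
        }
        return Optional.empty(); // proxy IPs used up
    }

    public static void main(String[] args) {
        Deque<String> proxies = new ArrayDeque<>(Arrays.asList("10.0.0.1:8080", "10.0.0.2:8080"));
        // Pretend only the second proxy is still accepted by the site.
        Optional<String> url = fetchWithRotation(proxies,
                p -> p.startsWith("10.0.0.2") ? Optional.of("mp3 url via " + p) : Optional.<String>empty());
        System.out.println(url.orElse("all proxies used up"));
    }
}
```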
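II. Core code
With the approach settled we can write the code. For reasons of space only part of the core code is pasted here; the complete code is available from the gitee repository mentioned further down.
Code structure:
- Required dependencies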
<!-- httpclient: fetches the HTML -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.8</version>
</dependency>
<!-- jsoup: parses the HTML -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<!-- used to download the songs, so we don't have to write the stream handling ourselves -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
<!-- fastjson: handles JSON -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.58</version>
</dependency>
<!-- ehcache: used as the cache -->
<dependency>
    <groupId>net.sf.ehcache</groupId>
    <artifactId>ehcache</artifactId>
    <version>2.10.6</version>
</dependency>
<!-- slf4j-nop is included purely to stop ehcache from reporting an error at runtime -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-nop</artifactId>
    <version>1.7.2</version>
</dependency>
<!-- log4j: the logging system -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
- Main class
package com.sam.kugou.main;

import java.util.List;

import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.alibaba.fastjson.JSONObject;
import com.sam.kugou.utils.DownloadMusic;
import com.sam.kugou.utils.EhcacheUtil;
import com.sam.kugou.utils.HttpClientUtil;

public class KugouSpiderMain {

    static final Logger logger = Logger.getLogger(KugouSpiderMain.class);

    // "page_num" is a placeholder that is replaced with the page number
    static String URL_TEMP = "https://www.kugou.com/yy/rank/home/page_num-8888.html?from=homepage";
    public static final int SLEEP_TIME_WHEN_DENY = 1000 * 60 * 60; // sleep time after being recognized by the site, in ms
    public static final int SPIDER_DURING = 1; // pause before crawling the next song, in ms
    public static final String DIR_NAME = "e:\\personal\\音乐\\酷狗\\"; // download directory

    public static void main(String[] args) {
        // Kugou TOP500 pages
        try {
            for (int i = 1; i <= 23; i++) {
                String url = URL_TEMP.replace("page_num", String.valueOf(i));

                // 1. Request the song list
                logger.info(url);
                String html = HttpClientUtil.getHtml(url);
                logger.debug(html);

                // 2. Extract the hash and album_id of every song on this page and cache them
                int beginIdx = html.indexOf("global.features = ");
                int endIdx = html.indexOf("];", beginIdx);
                String features = html.substring(beginIdx, endIdx + 1).replace("global.features = ", "");
                logger.info("containingOwnText >>>>>> " + features);
                List<JSONObject> list = JSONObject.parseArray(features, JSONObject.class);
                for (JSONObject jsonObject : list) {
                    String hash = jsonObject.getString("hash");
                    Integer albumId = jsonObject.getInteger("album_id");
                    EhcacheUtil.setCache(hash, albumId);
                }

                // 3. Parse the list and request each song
                Document doc = Jsoup.parse(html);
                Elements songList = doc.select(".pc_temp_songlist ul li a");
                for (Element element : songList) {
                    String title = element.attr("title");
                    String href = element.attr("href");
                    if (href.contains("https")) {
                        try {
                            Thread.sleep(SPIDER_DURING);
                        } catch (InterruptedException e) {
                            logger.error(e.getMessage());
                        }
                        logger.info("title " + title + " >>> href " + href);
                        DownloadMusic.requestMusic(title, href);
                    }
                }
            }
        } catch (Exception ex) {
            logger.error(ex.getMessage(), ex);
        } finally {
            // 4. Shut down the cache manager
            EhcacheUtil.shutdownManager();
        }
    }
}
- Getting the real address
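The code for this step is not reproduced in this post. As a rough, hypothetical sketch of one part of it: since the request carries a callback parameter, the response body presumably comes back wrapped as callbackName({...}) (an assumption based on how JSONP normally works, not something shown above), so the wrapper has to be removed before the JSON can be parsed. The class and method names below are made up for illustration.

```java
final class JsonpSketch {
    /**
     * Returns the JSON text between the first '(' and the last ')' of a JSONP body.
     * A body that already starts with '{' or '[' is returned unchanged.
     */
    static String stripCallback(String body) {
        String trimmed = body.trim();
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed;
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close <= open) {
            throw new IllegalArgumentException("not a JSONP body: " + body);
        }
        return trimmed.substring(open + 1, close);
    }

    public static void main(String[] args) {
        // Made-up body; the real response has more fields than this.
        String body = "jquery19106506492572547629_1558964792005({\"err_code\":0});";
        System.out.println(stripCallback(body));
    }
}
```

The unwrapped string can then be handed to fastjson (for example JSONObject.parseObject) to read the error code and the mp3 address; the exact field names are not given in this post, so check them against the response you capture in F12.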
- Performing the download
public static void download(String title, String url) {
    if (url == null || url.equals("")) {
        return;
    }
    // Songs that have already been downloaded are not downloaded again
    // (Element here is presumably net.sf.ehcache.Element, not the jsoup class)
    Element finishedCache = EhcacheUtil.getFinishedCache(title);
    logger.debug("finishedCache >>>>> " + finishedCache);
    if (finishedCache != null) {
        logger.info("Song already exists!!!");
        return;
    }
    String suffix = url.substring(url.lastIndexOf("."));
    try {
        HttpEntity httpEntity = HttpClientUtil.getHttpEntity(url);
        String filePath = KugouSpiderMain.DIR_NAME + title + suffix;
        // FileUtils.copyToFile leaves the source stream open, so close it here
        try (InputStream inputStream = httpEntity.getContent()) {
            FileUtils.copyToFile(inputStream, new File(filePath));
        }
        logger.info("*** download finished: ***" + title + suffix);
        logger.info("*** total number of songs: ***" + (new File(KugouSpiderMain.DIR_NAME)).list().length);
        EhcacheUtil.setFinishedCache(url, title);
    } catch (IOException e) {
        logger.error(e.getMessage());
    }
}
- Setting the proxy IP
public static boolean setProxy() {
    // 1. Create an HttpClient
    CloseableHttpClient httpClient = HttpClients.createDefault();
    CloseableHttpResponse response = null;
    String url = "https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list";
    boolean switched = false; // becomes true once an unused, reachable proxy has been selected
    try {
        response = doRequest(httpClient, url);
        logger.debug("getHtml " + url + " ** result: ** " + response.getStatusLine());
        // 5. Check the result; 200 means success
        if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
            HttpEntity httpEntity = response.getEntity();
            String html = EntityUtils.toString(httpEntity, "utf-8");
            // Wrap the response in brackets so that it is parsed as a JSON array
            html = "[" + html + "]";
            List<JSONObject> list = JSONArray.parseArray(html, JSONObject.class);
            for (JSONObject jsonObject : list) {
                int port = Integer.parseInt(jsonObject.get("port").toString());
                String host = jsonObject.get("host").toString();
                logger.info(host + ":" + port);
                if (isHostConnectable(host, port)) { // the proxy IP is reachable
                    Element ipsCache = EhcacheUtil.getProxyIpsCache(host, port);
                    if (ipsCache == null) { // the proxy IP has not been used yet
                        proxyIp = host;
                        proxyPort = port;
                        EhcacheUtil.setProxyIpsCache(host, port);
                        switched = true;
                        break;
                    } else {
                        logger.info("This proxy IP has already been used, trying the next one");
                    }
                }
            }
        }
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
        return false;
    } finally {
        // Close the response and the client
        HttpClientUtils.closeQuietly(response);
        HttpClientUtils.closeQuietly(httpClient);
    }
    // Report success only when a new proxy was actually selected
    if (switched) {
        logger.info("Switched proxy IP successfully: >>>" + proxyIp + ":" + proxyPort);
    }
    return switched;
}
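setProxy calls isHostConnectable(host, port), which is not shown in this post. One plausible way to write such a check, assuming all it needs to do is test whether a TCP connection can be opened, is sketched below; the timeout parameter and the class name are my own additions for illustration, and the real helper in the project may differ.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class HostCheckSketch {
    /** Returns true if a TCP connection to host:port can be opened within the timeout. */
    static boolean isHostConnectable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // refused, timed out, or the host name could not be resolved
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder address; substitute a host and port from the proxy list.
        System.out.println(isHostConnectable("127.0.0.1", 8080, 2000));
    }
}
```

A successful TCP connect only shows that the port is open; it does not prove that the proxy will actually relay requests to Kugou, so a failed download should still trigger another switch.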
III. Source code download
The source code has been uploaded to my gitee.
Feel free to download it~~
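IV. Open issues
1. Only free songs can be fetched; paid songs cannot, and really we shouldn't fetch them anyway.
2. For convenience the code uses a lot of static members, so it cannot support multithreaded or concurrent crawling.
3. The proxy IP part could still be optimized.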
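Disclaimer:
This crawler and the content it fetches are for personal study and exchange only. Do not use them for commercial purposes; otherwise you bear the consequences yourself!!! Thanks.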