Scraping funny animated GIFs from Duowan
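Approach: build two URLs, one after the other, to reach the final image URLs.
Initial URL: start_url = 'http://tu.duowan.com/m/bxgif'
First URL (the listing page, start_url):
Why it is needed: the page loads its content via AJAX, so the links to new images only appear after you scroll down in the browser.
Where to look: the XHR tab of the browser's Network panel.
Full URL of the first request: start_url = 'http://tu.duowan.com/m/bxgif?offset=0'
URLs requested as you scroll down: start_url = 'http://tu.duowan.com/m/bxgif?offset=30', then offset=60, and so on.
First URL format: start_url = 'http://tu.duowan.com/m/bxgif?offset={}'.format(i*30)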
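As a small illustration of the offset pattern above, here is a minimal sketch that only builds the listing URLs and sends no requests. The names BASE_URL, PAGE_SIZE and listing_urls are made up for this example; the step of 30 is the one observed in the XHR requests.

```python
BASE_URL = 'http://tu.duowan.com/m/bxgif'
PAGE_SIZE = 30  # offset step observed between consecutive XHR requests


def listing_urls(pages):
    """Return the listing URLs for the first `pages` pages (hypothetical helper)."""
    return ['{}?offset={}'.format(BASE_URL, i * PAGE_SIZE) for i in range(pages)]


if __name__ == '__main__':
    for url in listing_urls(3):
        print(url)
```

Calling listing_urls(3) gives the URLs for offset=0, offset=30 and offset=60, the same three pages the script requests.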
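Second URL (after clicking into a GIF gallery):
Why it is needed: the response to the gallery page request itself contains no image URLs.
Where to look: the XHR tab of the Network panel.
URL that returns every image in the gallery: 'http://tu.duowan.com/index.php?r=show/getByGallery/&gid=137294&_=1532357281131'
The trailing "&_=1532357281131" parameter can be dropped.
Second URL format: 'http://tu.duowan.com/index.php?r=show/getByGallery/&gid={}'.format(id)
id is the id of the gallery page; it can be found in the page returned by the first URL.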
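To make the second step concrete, the sketch below turns a gallery link into the JSON URL. It assumes gallery links look like 'http://tu.duowan.com/gallery/137294.html', which is the pattern the script matches. GALLERY_RE, API_URL and gallery_api_url are names invented for this example.

```python
import re

# Assumed shape of a gallery link: .../gallery/<digits>.html
GALLERY_RE = re.compile(r'/gallery/(\d+)\.html')
API_URL = 'http://tu.duowan.com/index.php?r=show/getByGallery/&gid={}'


def gallery_api_url(href):
    """Return the JSON URL for a gallery link, or None if the link has no gallery id."""
    match = GALLERY_RE.search(href)
    if match is None:
        return None
    return API_URL.format(match.group(1))


if __name__ == '__main__':
    print(gallery_api_url('http://tu.duowan.com/gallery/137294.html'))
```

Returning None for links that do not match lets the caller skip them instead of failing on an empty match.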
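Shortcoming: the downloaded images all end up mixed together, so there is no way to tell which image belongs to which gallery. How should this be handled? Help wanted.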
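One possible way to keep the galleries apart, offered as a sketch rather than a tested fix: store the image URLs in a dict keyed by gallery id instead of one flat list, then save each gallery into its own subdirectory. The helper plan_downloads and the example URLs are hypothetical, and the '.gif' fallback extension is an assumption. The function only computes target paths and downloads nothing.

```python
import os


def plan_downloads(images_by_gallery, root='duowan'):
    """Map every image URL to a path inside a per-gallery subdirectory.

    images_by_gallery: dict of gallery id -> list of image URLs.
    Returns a list of (url, path) pairs.
    """
    plan = []
    for gallery_id, urls in images_by_gallery.items():
        for index, url in enumerate(urls):
            # Keep the URL's extension; fall back to '.gif' (an assumption) if it has none.
            ext = os.path.splitext(url)[1] or '.gif'
            filename = '{:03d}{}'.format(index, ext)
            plan.append((url, os.path.join(root, str(gallery_id), filename)))
    return plan


if __name__ == '__main__':
    # Placeholder URLs for illustration only.
    example = {'137294': ['http://example.com/a.gif', 'http://example.com/b.gif']}
    for url, path in plan_downloads(example):
        print(url, '->', path)
```

Numbering the files within each gallery also preserves their order and avoids name clashes between images whose URLs end in the same characters.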
# -*- coding: utf-8 -*-
import os
import re

import requests
from lxml import etree


class DuoWan(object):
    def __init__(self, headers, id_list, single_img_url_list):
        self.headers = headers
        self.id_list = id_list  # gallery links collected from the listing pages
        self.single_img_url_list = single_img_url_list  # URLs of the individual images

    def handle_first_index(self):
        # Request the listing pages and collect the link of every gallery on them.
        for i in range(3):  # only the first 3 pages
            start_url = 'http://tu.duowan.com/m/bxgif?offset={}'.format(i * 30)
            response = requests.get(start_url, headers=self.headers)
            html = etree.HTML(response.text)
            hrefs = html.xpath('//ul[@id="pic-list"]/li/a/@href')
            self.id_list.extend(hrefs)  # extend keeps id_list a flat list

    def handle_two_index(self):
        # Extract the numeric id from each gallery link, then query the JSON URL.
        pattern = re.compile(r'http://tu\.duowan\.com/gallery/(\d+)\.html')
        for href in self.id_list:
            match = pattern.search(href)
            if match is None:  # skip links that are not gallery pages
                continue
            gallery_id = match.group(1)
            tuji_json_url = 'http://tu.duowan.com/index.php?r=show/getByGallery/&gid={}'.format(gallery_id)
            response = requests.get(tuji_json_url, headers=self.headers)
            tuji_python_dict = response.json()  # the response body is JSON
            for pic_info in tuji_python_dict['picInfo']:
                self.single_img_url_list.append(pic_info['url'])

    def load_img(self):
        # Create the output directory once, before the download loop.
        if not os.path.isdir('duowan'):
            os.makedirs('duowan')
        for single_img_url in self.single_img_url_list:
            response = requests.get(single_img_url, headers=self.headers)
            # Name the file after the last path segment of the image URL.
            filename = os.path.basename(single_img_url)
            with open(os.path.join('duowan', filename), 'wb') as f:
                f.write(response.content)

    def main(self):
        self.handle_first_index()
        self.handle_two_index()
        self.load_img()


if __name__ == '__main__':
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/67.0.3396.99 Chrome/67.0.3396.99 Safari/537.36"}
    id_list = []
    single_img_url_list = []
    duowan = DuoWan(headers, id_list, single_img_url_list)
    duowan.main()