Scraping Toutiao image galleries

程序员文章站 2022-04-26 10:01:12

Index page analysis

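Let's go straight to the screenshots.
[Screenshot: the request URL and its parameters]
Here you can see that each parameter in the request URL corresponds to one of the listed query fields.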

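[Screenshot: the request parameters after scrolling down]
This is how the parameters change after scrolling down to load more results.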

In short, the parameter that matters is offset: it is the one that changes from one batch of results to the next.
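
To make the role of offset concrete, here is a minimal sketch that builds the index URLs for the first three batches. The endpoint and parameter names are copied from the full code later in this article. The helper name build_index_url is made up for illustration, and stepping offset by the value of count (20) is an assumption based on how that code calls the API.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/?'


def build_index_url(offset, keyword, count=20):
    """Build the search URL for one batch of results (hypothetical helper)."""
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,      # the only value that changes between batches
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': '1',
        'from': 'search_tab',
    }
    # urlencode percent-encodes the keyword as UTF-8
    return BASE_URL + urlencode(params)


# Offsets 0, 20 and 40: three consecutive batches of 20 results
urls = [build_index_url(offset, '街拍') for offset in range(0, 60, 20)]
for url in urls:
    print(url)
```

Only the offset value differs between the three printed URLs; everything else stays fixed.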

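[Screenshot: the JSON response]
Looking at the response, the URLs we want are in the data field of the returned JSON.
Each item there has an article_url, which is the link to the next page we need (the gallery page). While we are at it, we can also grab the title.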
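
As a rough illustration of that step, the sketch below pulls title and article_url out of an index response. The sample response is made up and trimmed to the two fields mentioned above (a real response carries many more), and extract_articles is a hypothetical helper name.

```python
import json

# Made-up response, reduced to the fields discussed above
sample_response = json.dumps({
    'data': [
        {'title': 'Street snap A', 'article_url': 'http://toutiao.com/group/1/'},
        {'title': 'An entry without a link'},
    ]
})


def extract_articles(text):
    """Yield (title, article_url) pairs from an index response."""
    try:
        payload = json.loads(text)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers text being None after a failed request
        return
    if not isinstance(payload, dict):
        return
    # 'data' may be missing or null, so fall back to an empty list
    for item in payload.get('data') or []:
        url = item.get('article_url')
        if url:
            yield item.get('title'), url


articles = list(extract_articles(sample_response))
print(articles)
```

Entries without an article_url are skipped, since they do not lead to a gallery page.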

Gallery page analysis

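The next step is to find the address of each individual image in a gallery.
[Screenshot: the source of a gallery page]
The individual image addresses sit in the object passed to gallery: JSON.parse, in the list stored under the key sub_images.
A regular expression can extract the argument of gallery: JSON.parse. The captured string still contains escape sequences such as \u002F (the Unicode escape for "/"), which have to be cleaned up before the image URLs are usable.
Toutiao changes this markup from time to time as an anti-scraping measure, so what you see may differ. Adjust the pattern yourself when that happens.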
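
The following sketch shows one way to do the extraction and clean-up. It assumes the argument of JSON.parse is a JavaScript string literal holding JSON, so it is decoded twice: once to undo the string escaping and once to parse the JSON itself. The sample markup is made up, and the real page may use escapes that this approach does not handle.

```python
import json
import re

# Made-up fragment imitating the page source; the real markup changes over time
sample_html = r'''
<script>
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\u002F\\u002Fexample.com\\u002Fa.jpg\"},{\"url\":\"http:\\u002F\\u002Fexample.com\\u002Fb.jpg\"}]}"),
</script>
'''

GALLERY_RE = re.compile(r'gallery: JSON\.parse\("(.*?)"\)', re.S)


def extract_image_urls(html):
    """Return the image URLs listed under sub_images, or [] if none are found."""
    match = GALLERY_RE.search(html)
    if not match:
        return []
    try:
        # First pass: treat the capture as a quoted string and undo \" and \\
        inner = json.loads('"' + match.group(1) + '"')
        # Second pass: parse the JSON text; \u002F is decoded to "/"
        gallery = json.loads(inner)
    except json.JSONDecodeError:
        return []
    return [img['url'] for img in gallery.get('sub_images', []) if img.get('url')]


print(extract_image_urls(sample_html))
```

Decoding the escapes, rather than deleting them, keeps the slashes in the URLs intact.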

Full code

# Toutiao image-scraping code
# Now outdated: update the expression that extracts the image URLs to make it work again

import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException
import json
import time
from bs4 import BeautifulSoup
import re
import os
import pymongo
from hashlib import md5
from multiprocessing import Pool  # process pool
from json.decoder import JSONDecodeError

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

client = pymongo.MongoClient(MONGO_URL,connect=False)
db = client[MONGO_DB]

cookie = dict(tt_webid='6712269183072503307', WEATHER_CITY='%E5%8C%97%E4%BA%AC',
                  UM_distinctid='16bdf74e0e523d-08b226090614e4-3f385804-1fa400-16bdf74e0e65a0'
                  , csrftoken='a3fef2b005675a408ee2c0853e6d2f81',
                  CNZZDATA1259612802='638490078-1562819143-https%253A%252F%252Fwww.baidu.com%252F%7C1562910943',
                  __tasessionId='7ww9jotlq1562911586934', s_v_web_id='74dc0e90c0d35c669a123f4781409bf4'
                  )

def get_page_index(offset,keyword,timestamp):
    DATA = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': timestamp
    }
    url = 'https://www.toutiao.com/api/search/content/?'+urlencode(DATA)
    try:
        response = requests.get(url,cookies=cookie)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request for the index page failed")
        return None


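def parse_page_index(html):
    # get_page_index returns None on failure; json.loads(None) would raise TypeError
    if not html:
        return
    try:
        data = json.loads(html)
    except JSONDecodeError:
        return
    if data and 'data' in data.keys():
        # 'data' can be null in the response, so fall back to an empty list
        for item in data.get('data') or []:
            yield item.get('article_url')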

def get_page_detail(url):
    headers={
        'User-Agent' :'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    try:
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            return r.text
        return None
    except RequestException:
        print("Failed to fetch the detail page")
        return None

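def parse_page_detail(html,url):
    soup = BeautifulSoup(html,'lxml')
    title_tag = soup.select_one('title')
    title = title_tag.get_text() if title_tag else ''
    # Raw string so that \( and \) reach the regex engine unchanged
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\)',re.S)
    result = re.search(image_pattern,html)
    if result:
        try:
            # Assumption: the captured text is a JavaScript string literal holding JSON.
            # Decode the literal first, then parse the JSON, so that \u002F becomes '/'
            # instead of being deleted (deleting it breaks the image URLs).
            data = json.loads(json.loads('"' + result.group(1) + '"'))
        except JSONDecodeError:
            return None
        if data and 'sub_images' in data.keys():
            sub = data.get('sub_images')
            images = [image.get('url') for image in sub]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None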

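def save_to_mongodb(result):
    # Collection.insert() was removed in PyMongo 4; insert_one() is the replacement
    if db[MONGO_TABLE].insert_one(result).inserted_id:
        print('Saved to MongoDB:',result)
        return True
    return False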

def download_image(url):
    print('Downloading:',url)
    try:
        response = requests.get(url,cookies=cookie)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to download image:',url)
        return None

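def save_image(content):
    # Name the file after the MD5 of its content so duplicate images are skipped
    file_path = '{0}/{1}.{2}'.format(os.getcwd(),md5(content).hexdigest(),'jpg')
    if not os.path.exists(file_path):
        with open(file_path,'wb') as f:
            f.write(content)  # the with block closes the file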

def main(offset):
    timestamp=int(time.time()*1000)
    html = get_page_index(offset=offset,keyword='街拍',timestamp=timestamp)
    for url in parse_page_index(html):
        if url:
            html = get_page_detail(url)
            if html :
                result = parse_page_detail(html,url)
                if result:
                    save_to_mongodb(result)

START = 1
STOP = 20
if __name__ == '__main__':
    # Offsets 20, 40, ..., 380: one index request per batch of 20 results
    group = [x * 20 for x in range(START,STOP)]
    with Pool() as pool:
        pool.map(main,group)

Change the cookie and header values above to match your own session.

The results are stored in MongoDB, so that database has to be installed first.
