Toutiao (今日头条) Crawler in Practice

程序员文章站 2022-07-08 15:58:04

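Preface

This post documents how to use a Python crawler to collect news links from Toutiao (今日头条), and then follow those links to scrape each article's text and its popularity metrics, namely the number of comments, shares, and likes.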

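1. How to get the request URL

First open the Toutiao site at https://www.toutiao.com/ch/news_hot/. Be sure to select the Hot (热点) channel on the left rather than the Recommended (推荐) channel, so that the address ends in news_hot.
Then press Ctrl+Shift+I on that page to open the browser's developer tools, and select the Network tab in the upper-right area.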

Find the XHR request whose URL contains category=news_hot and starts with https://www.toutiao.com/api/pc/feed/.
To verify it is the right one, open its Preview tab: all the news items are stored as JSON under the data field.
Save this request URL. Mine, for example, is https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=news_hot&utm_source=toutiao&widen=1&tadrequire=true&_signature=_02B4Z6wo00f01YPutCwAAIBBOyjKcKvvfemD67CAAD85fGlN2Gr-czIOR0Y55rfl1sffwW7B0sik3wqiwUHxk9NhE4cZpv4vEA57j37xvkoZ1s64BK7g5sHmjnc8Xj-r2-OSje67l6B.c5Dkc4
Of the parameters in this URL, max_behot_time is not fixed: its value is taken from the returned JSON data, in the next field (next.max_behot_time).

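At this point we only have the crawler's start URL. For each later request, the link has to be rebuilt from the parameters above in order to obtain the news links, which are then used to scrape the articles.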
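The parameter handling described above can be sketched as follows. This is only a minimal illustration, assuming the query parameters seen in the captured URL; `build_feed_url` and `next_max_behot_time` are hypothetical helper names, and the sample payload is made up to imitate the next.max_behot_time field rather than copied from a real response. The as and cp parameters discussed later are left out here.

```python
from urllib.parse import urlencode

FEED_ENDPOINT = 'https://www.toutiao.com/api/pc/feed/'  # prefix seen in the captured XHR request


def build_feed_url(max_behot_time='0'):
    """Assemble a feed URL from the parameters in the captured request (hypothetical helper)."""
    params = {
        'min_behot_time': '0',
        'category': 'news_hot',
        'utm_source': 'toutiao',
        'widen': '1',
        'max_behot_time': max_behot_time,
        'tadrequire': 'true',
    }
    return FEED_ENDPOINT + '?' + urlencode(params)


def next_max_behot_time(payload):
    """Read the paging cursor for the next request from a decoded JSON payload."""
    return str(payload['next']['max_behot_time'])


# Made-up payload that only imitates the shape of the response
sample = {'data': [], 'next': {'max_behot_time': 1607393426}}
first_url = build_feed_url()
second_url = build_feed_url(next_max_behot_time(sample))
print(first_url)
print(second_url)
```

Building the query with urlencode avoids hand-concatenating '&' and '=' and keeps the cursor as the only value that changes between requests.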

Getting the User-Agent and cookies

Both values can be found under Headers in the XHR request opened earlier.
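The Cookie header copied from the browser is a single string of name=value pairs separated by semicolons, while the cookies argument of requests.get accepts a dict. A minimal sketch of the conversion, assuming that header layout; `parse_cookie_header` is a hypothetical helper and the values below are placeholders, not real credentials:

```python
def parse_cookie_header(cookie_header):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    cookies = {}
    for pair in cookie_header.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split once: values may themselves contain '='
        cookies[name] = value
    return cookies


# Placeholder values, not real credentials
raw_cookie = 'csrftoken=abc123; tt_webid=42; token=a=b'
headers = {'user-agent': 'Mozilla/5.0'}  # replace with the User-Agent copied from the browser
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The resulting headers and cookies dicts can then be passed as requests.get(url, headers=headers, cookies=cookies).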
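Continuing with the parameters above, the as and cp values are computed in Python. Both parameters are generated in the JS file home_4abea46.js; the algorithm is reproduced in the code below: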

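import hashlib
import time


def get_as_cp():  # compute the as and cp parameters; based on Toutiao's obfuscated JS file home_4abea46.js
    now = round(time.time())                 # current Unix time in seconds
    e = hex(int(now)).upper()[2:]            # timestamp as an uppercase hex string, without the '0x' prefix
    md5 = hashlib.md5()                      # MD5 hash object
    md5.update(str(int(now)).encode('utf-8'))
    digest = md5.hexdigest().upper()         # MD5 of the decimal timestamp, as uppercase hex
    if len(e) != 8:                          # fall back to fixed values if the hex string is not 8 digits long
        return {'as': '479BB4B7254C150',
                'cp': '7E0AC8874BB0985'}
    n = digest[:5]                           # first 5 characters of the digest
    a = digest[-5:]                          # last 5 characters of the digest
    r = ''
    s = ''
    for k in range(5):
        s = s + n[k] + e[k]                  # interleave the digest head with e[0:5]
    for j in range(5):
        r = r + e[j+3] + a[j]                # interleave e[3:8] with the digest tail
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz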
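Because get_as_cp reads the clock, its output changes every second and is awkward to check. The sketch below is a hypothetical variant, `as_cp_from_timestamp`, that applies the same interleaving to a timestamp passed in as an argument, so the structure of the result can be inspected. It assumes an integer Unix timestamp and is not taken from the original JS.

```python
import hashlib


def as_cp_from_timestamp(ts):
    """Same interleaving as get_as_cp, for a given integer Unix timestamp (hypothetical variant)."""
    e = format(ts, 'X')  # uppercase hex, no '0x' prefix
    digest = hashlib.md5(str(ts).encode('utf-8')).hexdigest().upper()
    if len(e) != 8:
        # Same fixed fallback values as in get_as_cp
        return {'as': '479BB4B7254C150', 'cp': '7E0AC8874BB0985'}
    head, tail = digest[:5], digest[-5:]
    s = ''.join(head[k] + e[k] for k in range(5))      # digest head interleaved with e[0:5]
    r = ''.join(e[k + 3] + tail[k] for k in range(5))  # e[3:8] interleaved with digest tail
    return {'as': 'A1' + s + e[-3:], 'cp': e[:3] + r + 'E1'}


params = as_cp_from_timestamp(1607393426)
print(params)
```

For the timestamp 1607393426 the hex form is 5FCEE092, so as ends in 092 and cp starts with 5FC; both values are 15 characters long.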

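With that, the complete link is assembled. One more note: the JSON data can still be retrieved with the _signature parameter removed, so the request link is now finished. The complete code follows: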

import requests
import json
from openpyxl import Workbook
import time
import hashlib
import os
import datetime
 
start_url = 'https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=news_hot&utm_source=toutiao&widen=1&max_behot_time='
url = 'https://www.toutiao.com'
 
headers={
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Cookie string copied from the browser's request headers; sending it helps avoid being blocked by Toutiao
cookie_str = 'csrftoken=a1cec75edb840d9c30e91b908b6df006; tt_webid=6903701895266567693; ttcid=de9284562a3d43158bbcf20f427c76bf38; s_v_web_id=verify_kifci8q5_95283jqK_hVnJ_46JY_BWm0_qh2CKMpqyzNp; tt_webid=6903701895266567693; passport_csrf_token=1360da074931584e4d0f4b71d11ecafe; toutiao_sso_user=876396d13a4fdcf0be33558bd1a69659; toutiao_sso_user_ss=876396d13a4fdcf0be33558bd1a69659; sid_guard=888954813cb61b04ed109f48e8113b56%7C1607393426%7C5184000%7CSat%2C+06-Feb-2021+02%3A10%3A26+GMT; uid_tt=132acd819771d24afff611bdd48dc359; uid_tt_ss=132acd819771d24afff611bdd48dc359; sid_tt=888954813cb61b04ed109f48e8113b56; sessionid=888954813cb61b04ed109f48e8113b56; sessionid_ss=888954813cb61b04ed109f48e8113b56; MONITOR_WEB_ID=a2ef240d-ae33-4e56-a975-966605f43948; __ac_nonce=05fcee3ab003d13bf7bb9; __ac_signature=_02B4Z6wo00f01ptuFTQAAIBCI6hrasKhKPabahGAAPkkXKbMByxFhYf4CiPJulDq4RaDiFzhxhk8qXY90KRvd90rngWGGKzzx8ziu3ARnSqI6w3Grk66cYbmO74ecRdqs0gDbZzUf.ktI.f-b6; tt_anti_token=LjrRoQn9d-fe7a8dbda23884fb006dfd76a297fa069a5ef454622aeb4f2994ae7dfe98ac87; sso_uid_tt=270bb6e7a69bffa46b17d2ddd68bba6e; sso_uid_tt_ss=270bb6e7a69bffa46b17d2ddd68bba6e; tt_scid=KCRCw6a2rFugOQWL3Nhcva29Rl82cLTGiiXyJIUYOR0ApdW1LGMJ6jtzrKn.O41.f3ef'
cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))  # requests expects a dict of name/value pairs, not a set holding one string
 
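max_behot_time = '0'   # URL parameter, updated after each request
title = []       # news titles
source_url = []  # news links as returned by the feed (may be relative)
s_url = []       # full news links
source = []      # names of the accounts (头条号) that published the news
media_url = {}   # full link of each account, keyed by account name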
 
 
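def get_as_cp():  # compute the as and cp parameters; based on Toutiao's obfuscated JS file home_4abea46.js
    now = round(time.time())                 # current Unix time in seconds
    e = hex(int(now)).upper()[2:]            # timestamp as an uppercase hex string, without the '0x' prefix
    md5 = hashlib.md5()                      # MD5 hash object
    md5.update(str(int(now)).encode('utf-8'))
    digest = md5.hexdigest().upper()         # MD5 of the decimal timestamp, as uppercase hex
    if len(e) != 8:                          # fall back to fixed values if the hex string is not 8 digits long
        return {'as': '479BB4B7254C150',
                'cp': '7E0AC8874BB0985'}
    n = digest[:5]                           # first 5 characters of the digest
    a = digest[-5:]                          # last 5 characters of the digest
    r = ''
    s = ''
    for k in range(5):
        s = s + n[k] + e[k]                  # interleave the digest head with e[0:5]
    for j in range(5):
        r = r + e[j+3] + a[j]                # interleave e[3:8] with the digest tail
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz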
 
 
def getdata(url, headers, cookies):  # request the URL and decode the JSON response
    r = requests.get(url, headers=headers, cookies=cookies)
    print(url)
    data = json.loads(r.text)
    return data
 
 
def savedata(title, s_url, source, media_url):  # write the collected data to a file
    # Save the data to an .xlsx file
    wb = Workbook()
    if not os.path.isdir(os.getcwd()+'/result'):   # check whether the output folder exists
        os.makedirs(os.getcwd()+'/result') # create the output folder
    # Name the workbook by date, hour and minute (%M is minutes; %m would repeat the month)
    filename = os.getcwd()+'/result/result-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M')+'.xlsx'
    ws = wb.active
    ws.title = 'data'   # rename the worksheet
    ws['A1'] = 'Title'   # header row
    ws['B1'] = 'News link'
    ws['C1'] = 'Account'
    ws['D1'] = 'Account link'
    for row in range(2, len(title)+2):   # write the data rows
        _= ws.cell(column=1, row=row, value=title[row-2])
        _= ws.cell(column=2, row=row, value=s_url[row-2])
        _= ws.cell(column=3, row=row, value=source[row-2])
        _= ws.cell(column=4, row=row, value=media_url[source[row-2]])
 
    wb.save(filename=filename)  # save the workbook
 
 
 
def main(max_behot_time, title, source_url, s_url, source, media_url):   # main routine
    for i in range(3):   # number of feed refreshes; one refresh normally returns 10 items but sometimes fewer, so the total is not always a multiple of 10
        ascp = get_as_cp()    # compute the as and cp parameters
        demo = getdata(start_url+max_behot_time+'&max_behot_time_tmp='+max_behot_time+'&tadrequire=true&as='+ascp['as']+'&cp='+ascp['cp'], headers, cookies)
        print(demo)
        # time.sleep(1)
        for item in demo['data']:
            if item['title'] not in title:
                title.append(item['title'])            # news title
                source_url.append(item['source_url'])  # news link
                source.append(item['source'])          # account that published the news
            if item['source'] not in media_url:
                media_url[item['source']] = url+item['media_url']  # account link
        print(max_behot_time)
        max_behot_time = str(demo['next']['max_behot_time'])  # max_behot_time value for the next request
    # Build the full links once, after all refreshes, so that s_url stays aligned with title
    for index in range(len(title)):
        print('Title:', title[index])
        if 'https' not in source_url[index]:
            s_url.append(url+source_url[index])   # relative link: prepend the site root
            print('News link:', url+source_url[index])
        else:
            print('News link:', source_url[index])
            s_url.append(source_url[index])
        print('Account:', source[index])
    print(len(title))   # number of news items collected
 
if __name__ == '__main__':
    main(max_behot_time, title, source_url, s_url, source, media_url)
    savedata(title, s_url, source, media_url)
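As noted before the listing, the feed reportedly still returns JSON when _signature is removed. If you start from a URL copied out of the browser, the parameter can be stripped programmatically rather than by hand. A minimal sketch using only the standard library; `drop_query_param` is a hypothetical helper, and the captured URL below is a shortened stand-in with a placeholder signature:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened stand-in for a captured URL; the signature value is a placeholder
captured = ('https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=news_hot'
            '&utm_source=toutiao&widen=1&tadrequire=true&_signature=PLACEHOLDER')
clean = drop_query_param(captured, '_signature')
print(clean)
```

Whether the server keeps accepting unsigned requests is outside the script's control, so treat this as a convenience for the situation the post describes.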

Original article: https://blog.csdn.net/fs1341825137/article/details/110854025
