
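Recursively crawling every article, video, and micro-headline (微头条) that specified Toutiao (今日头条) users published in the past month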

程序员文章站 2022-06-23 23:43:38

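I have been job hunting recently, and this was the take-home question for a web-scraping interview. It covers a fairly complete range of anti-scraping measures. In the end the company's bar was high: they also wanted the expiring video links handled, so I did not get the job.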

Straight to the code:

import requests
import time
from datetime import datetime
import json
import execjs
import hashlib
import re
import csv
from zlib import crc32
from base64 import b64decode
import random
import urllib3
import os
import threading
from queue import Queue
from lxml import etree

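# Check which JS runtime execjs is using
# print(execjs.get().name)
# Suppress SSL verification warnings
urllib3.disable_warnings()

"""
Requires a Node.js environment. You also need to modify __init__ of class Popen(object) in subprocess.py (..encoding='utf-8'), otherwise calling the JS file raises an error.
When requesting list pages, the UA header in this .py file must match the one in the JS file, otherwise it is very hard to get data. When requesting detail pages, use the UA pool, otherwise the browser/IP gets blocked.
Some output sheets will be empty because the account published nothing within seven days, or because the account has been banned.
Results are written to /toutiao/ under the root directory of the drive this file is on.
Run this .py file directly (right-click, Run); newsign.js and toutiao.csv must be in the same folder.
The crawled video links expire after a while.
"""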


# Define the UA pool
def headers():
    # Assorted desktop browsers
    user_agent_list = [
        # Opera
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        # Firefox
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        # Safari
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        # Chrome
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        # 360 Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        # Taobao Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        # Liebao (Cheetah) Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        # QQ Browser
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        # Sogou Browser
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        # Maxthon Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        # UC Browser
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]
    useragent = random.choice(user_agent_list)
    headers = {'User-Agent': useragent}
    return headers


headers_a = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
}
# Proxy IP
proxy = {
    'http': '183.57.44.62:808'
}
# Cookie values
cookies = {'s_v_web_id': 'b68312370162a4754efb0510a0f6d394'}


# Get _signature
def get_signature(user_id, max_behot_time):
    with open('newsign.js', 'r', encoding='utf-8') as f:
        jsdata = f.read()
    execjs.get()
    ctx = execjs.compile(jsdata).call('tac', str(user_id) + str(
        max_behot_time))  # reproduces tac.sign(userinfo.id + "" + i.param.max_behot_time)
    return ctx


# Get as and cp
def get_as_cp():  # Computes the as and cp parameters; based on Toutiao's encryption JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    # print(now)  # current machine time
    e = hex(int(now)).upper()[2:]  # hex() converts an integer to its hexadecimal string
    # print('e:', e)
    a = hashlib.md5()  # create the hash object; hexdigest() below returns the hex result
    # print('a:', a)
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    # print('i:', i)
    if len(e) != 8:
        # Fixed fallback pair when the hex timestamp is not 8 characters long
        zz = {'as': '479BB4B7254C150',
              'cp': '7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    # Interleave the first 5 digest characters with the first 5 hex characters
    for i in range(5):
        s = s + n[i] + e[i]
    # Interleave hex characters 3..7 with the last 5 digest characters
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    # print('zz:', zz)
    return zz


# Get as, cp, _signature (deprecated)
def get_js():
    with open(r"juejin.js", 'r', encoding='utf-8') as f:  # open the JS file
        htmlstr = f.read()
    ctx = execjs.compile(htmlstr)
    return ctx.call('get_as_cp_signature')


# print(json.loads(get_js())['as'])


# Article data
break_flag = []


def wenzhang(url=None, max_behot_time=0, n=0, csv_name=0):
    max_qingqiu = 50
    headers1 = ['发表时间', '标题', '来源', '所有图片', '文章内容']
    first_url = 'https://www.toutiao.com/c/user/article/?page_type=1&user_id=%s&max_behot_time=%s&count=20&as=%s&cp=%s&_signature=%s' % (
        url.split('/')[-2], max_behot_time, get_as_cp()['as'], get_as_cp()['cp'],
        get_signature(url.split('/')[-2], max_behot_time))
    while n < max_qingqiu and not break_flag:
        try:
            # print(url)
            r = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(r.text)
            # print(data)
            max_behot_time = data['next']['max_behot_time']
            if max_behot_time:
                article_list = data['data']
                for i in article_list:
                    try:
                        if i['article_genre'] == 'article':
                            res = requests.get('https://www.toutiao.com/i' + i['group_id'], headers=headers(),
                                               cookies=cookies)
                            # time.sleep(1)
                            article_title = re.findall("title: '(.*?)'", res.text)
                            article_content = re.findall("content: '(.*?)'", res.text, re.S)[0]
                            # pattern = re.compile(r"[(a-zA-Z~\-_!@#$%\^\+\*&\\\/\?\|:\.<>{}()';=)*|\d]")
                            # article_content = re.sub(pattern, '', article_content[0])
                            article_content = article_content.replace('&quot;', '').replace('u003c', '<').replace(
                                'u003e',
                                '>').replace(
                                '&#x3d;',
                                '=').replace(
                                'u002f', '/').replace('\\', '')
                            article_images = etree.HTML(article_content)
                            article_image = article_images.xpath('//img/@src')
                            article_time = re.findall("time: '(.*?)'", res.text)
                            article_source = re.findall("source: '(.*?)'", res.text, re.S)
                            result_time = []
                            [result_time.append(i) for i in
                             str(article_time[0]).split(' ')[0].replace('-', ',').split(',')]
                            # print(result_time)
                            cha = (datetime.now() - datetime(int(result_time[0]), int(result_time[1]),
                                                             int(result_time[2]))).days
                            # print(cha)
                            # Skip items 31 to 32 days old; stop at the first item older than 32 days
                            if 30 < cha <= 32:
                                # print('Done')
                                # break_flag.append(1)
                                # break
                                continue
                            if cha > 32:
                                print('Done')
                                break_flag.append(1)
                                break
                            row = {'发表时间': article_time[0], '标题': article_title[0].strip('&quot;'),
                                   '来源': article_source[0], '所有图片': article_image,
                                   '文章内容': article_content.strip()}
                            with open('/toutiao/' + str(csv_name) + '文章.csv', 'a', newline='', encoding='gb18030') as f:
                                f_csv = csv.DictWriter(f, headers1)
                                # f_csv.writeheader()
                                f_csv.writerow(row)
                            print('Crawling article:', article_title[0].strip('&quot;'), article_time[0],
                                  'https://www.toutiao.com/i' + i['group_id'])
                            time.sleep(1)
                        else:
                            pass
                    except Exception as e:
                        print(e, 'https://www.toutiao.com/i' + i['group_id'])
                wenzhang(url=url, max_behot_time=max_behot_time, csv_name=csv_name, n=n)
            else:
                pass
        except KeyError:
            n += 1
            print('Request attempt ' + str(n), first_url)
            time.sleep(1)
            if n == max_qingqiu:
                print('Exceeded the maximum number of requests')
                break_flag.append(1)
            else:
                pass
        except Exception as e:
            print(e)
    else:
        pass

        # print(max_behot_time)
        # print(data)


# Article detail page data (already merged into the article function above)
def get_wenzhang_detail(url, csv_name=0):
    headers1 = ['发表时间', '标题', '来源', '文章内容']
    res = requests.get(url, headers=headers_a, cookies=cookies)
    # time.sleep(1)
    article_title = re.findall("title: '(.*?)'", res.text)
    article_content = re.findall("content: '(.*?)'", res.text, re.S)
    # Strip Latin letters, digits and markup symbols, leaving the Chinese text
    pattern = re.compile(r"[(a-zA-Z~\-_!@#$%\^\+\*&\\\/\?\|:\.<>{}()';=)*|\d]")
    article_content = re.sub(pattern, '', article_content[0])
    article_time = re.findall("time: '(.*?)'", res.text)
    article_source = re.findall("source: '(.*?)'", res.text, re.S)
    result_time = []
    [result_time.append(i) for i in str(article_time[0]).split(' ')[0].replace('-', ',').split(',')]
    # print(result_time)
    cha = (datetime.now() - datetime(int(result_time[0]), int(result_time[1]), int(result_time[2]))).days
    # print(cha)
    if cha > 8:
        return None

    row = {'发表时间': article_time[0], '标题': article_title[0].strip('&quot;'), '来源': article_source[0],
           '文章内容': article_content.strip()}
    with open('/toutiao/' + str(csv_name) + '文章.csv', 'a', newline='') as f:
        f_csv = csv.DictWriter(f, headers1)
        # f_csv.writeheader()
        f_csv.writerow(row)
    print('Crawling article:', article_title[0].strip('&quot;'), article_time[0], url)
    time.sleep(0.5)
    return 'ok'


# Video data
break_flag_video = []


def shipin(url, max_behot_time=0, csv_name=0, n=0):
    max_qingqiu = 20
    headers2 = ['视频发表时间', '标题', '来源', '视频链接']
    first_url = 'https://www.toutiao.com/c/user/article/?page_type=0&user_id=%s&max_behot_time=%s&count=20&as=%s&cp=%s&_signature=%s' % (
        url.split('/')[-2], max_behot_time, get_as_cp()['as'], get_as_cp()['cp'],
        get_signature(url.split('/')[-2], max_behot_time))
    while n < max_qingqiu and not break_flag_video:
        try:
            res = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(res.text)
            # print(data)
            max_behot_time = data['next']['max_behot_time']
            if max_behot_time:
                video_list = data['data']
                for i in video_list:
                    try:
                        start_time = i['behot_time']
                        video_title = i['title']
                        video_source = i['source']
                        detail_url = 'https://www.ixigua.com/i' + i['item_id']

                        resp = requests.get(detail_url, headers=headers())
                        r = str(random.random())[2:]
                        url_part = "/video/urls/v/1/toutiao/mp4/{}?r={}".format(
                            re.findall('"video_id":"(.*?)"', resp.text)[0], r)
                        s = crc32(url_part.encode())
                        api_url = "https://ib.365yg.com{}&s={}".format(url_part, s)
                        resp = requests.get(api_url, headers=headers())
                        j_resp = resp.json()
                        video_url = j_resp['data']['video_list']['video_1']['main_url']
                        video_url = b64decode(video_url.encode()).decode()
                        # print((int(str(time.time()).split('.')[0])-start_time)/86400)
                        # Skip items 31 to 32 days old; stop at the first item older than 32 days
                        if 30 < (int(str(time.time()).split('.')[0]) - start_time) / 86400 <= 32:
                            # print('Done')
                            # break_flag_video.append(1)
                            continue
                        if (int(str(time.time()).split('.')[0]) - start_time) / 86400 > 32:
                            print('Done')
                            break_flag_video.append(1)
                            break
                        row = {'视频发表时间': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)),
                               '标题': video_title, '来源': video_source,
                               '视频链接': video_url}
                        with open('/toutiao/' + str(csv_name) + '视频.csv', 'a', newline='', encoding='gb18030') as f:
                            f_csv = csv.DictWriter(f, headers2)
                            # f_csv.writeheader()
                            f_csv.writerow(row)
                        print('Crawling video:', video_title, detail_url, video_url)
                        time.sleep(3)
                    except Exception as e:
                        print(e, 'https://www.ixigua.com/i' + i['item_id'])
                shipin(url=url, max_behot_time=max_behot_time, csv_name=csv_name, n=n)
        except KeyError:
            n += 1
            print('Request attempt ' + str(n), first_url)
            time.sleep(3)
            if n == max_qingqiu:
                print('Exceeded the maximum number of requests')
                break_flag_video.append(1)
        except Exception as e:
            print(e)
    else:
        pass


# Micro-headlines (微头条)
break_flag_weitoutiao = []


def weitoutiao(url, max_behot_time=0, n=0, csv_name=0):
    max_qingqiu = 20
    headers3 = ['微头条发表时间', '来源', '标题', '文章内图片', '微头条内容']
    while n < max_qingqiu and not break_flag_weitoutiao:
        try:

            first_url = 'https://www.toutiao.com/api/pc/feed/?category=pc_profile_ugc&utm_source=toutiao&visit_user_id=%s&max_behot_time=%s' % (
                url.split('/')[-2], max_behot_time)
            # print(first_url)
            res = requests.get(first_url, headers=headers_a, cookies=cookies)
            data = json.loads(res.text)
            # print(data)
            max_behot_time = data['next']['max_behot_time']
            weitoutiao_list = data['data']
            for i in weitoutiao_list:
                try:
                    detail_url = 'https://www.toutiao.com/a' + str(i['concern_talk_cell']['id'])
                    # print(detail_url)
                    resp = requests.get(detail_url, headers=headers(), cookies=cookies)
                    start_time = re.findall("time: '(.*?)'", resp.text, re.S)
                    weitoutiao_name = re.findall("name: '(.*?)'", resp.text, re.S)
                    weitoutiao_title = re.findall("title: '(.*?)'", resp.text, re.S)
                    weitoutiao_images = re.findall(r'images: \["(.*?)"\]', resp.text, re.S)
                    # print(weitoutiao_images)
                    if weitoutiao_images:
                        weitoutiao_image = 'http:' + weitoutiao_images[0].replace('u002f', '/').replace('\\', '')
                        # print(weitoutiao_image)
                    else:
                        weitoutiao_image = 'No image attached to this post'
                    weitoutiao_content = re.findall("content: '(.*?)'", resp.text, re.S)
                    result_time = []
                    [result_time.append(i) for i in str(start_time[0]).split(' ')[0].replace('-', ',').split(',')]
                    # print(result_time)
                    cha = (
                        datetime.now() - datetime(int(result_time[0]), int(result_time[1]), int(result_time[2]))).days
                    # print(cha)
                    if cha > 30:
                        break_flag_weitoutiao.append(1)
                        print('Done')
                        break
                    row = {'微头条发表时间': start_time[0], '来源': weitoutiao_name[0],
                           '标题': weitoutiao_title[0].strip('&quot;'), '文章内图片': weitoutiao_image,
                           '微头条内容': weitoutiao_content[0].strip('&quot;')}
                    with open('/toutiao/' + str(csv_name) + '微头条.csv', 'a', newline='', encoding='gb18030') as f:
                        f_csv = csv.DictWriter(f, headers3)
                        # f_csv.writeheader()
                        f_csv.writerow(row)
                    time.sleep(1)
                    print('Crawling micro-headline', weitoutiao_name[0], start_time[0], detail_url)
                except Exception as e:
                    print(e, 'https://www.toutiao.com/a' + str(i['concern_talk_cell']['id']))
            weitoutiao(url=url, max_behot_time=max_behot_time, csv_name=csv_name, n=n)
        except KeyError:
            n += 1
            print('Request attempt ' + str(n))
            time.sleep(2)
            if n == max_qingqiu:
                print('Exceeded the maximum number of requests')
                break_flag_weitoutiao.append(1)
            else:
                pass
        except Exception as e:
            print(e)
    else:
        pass


# Load the list of accounts to crawl
def csv_read(path):
    data = []
    with open(path, 'r', encoding='gb18030') as f:
        reader = csv.reader(f, dialect='excel')
        for row in reader:
            data.append(row)
    return data


# Entry point (single-threaded)
def main():
    for j, i in enumerate(csv_read('toutiao-suoyou.csv')):
        # data_url = data.get_nowait()
        if '文章' in i[3]:
            # Start the article crawler
            print('Crawling articles, row', j, i[2])
            headers1 = ['发表时间', '标题', '来源', '所有图片', '文章内容']
            # Write the header with the same encoding the rows are appended with
            with open('/toutiao/' + i[0] + '文章.csv', 'a', newline='', encoding='gb18030') as f:
                f_csv = csv.DictWriter(f, headers1)
                f_csv.writeheader()
            break_flag.clear()
            wenzhang(url=i[2], csv_name=i[0])

        if '视频' in i[3]:
            # Start the video crawler
            print('Crawling videos, row', j, i[2])
            headers2 = ['视频发表时间', '标题', '来源', '视频链接']
            with open('/toutiao/' + i[0] + '视频.csv', 'a', newline='', encoding='gb18030') as f:
                f_csv = csv.DictWriter(f, headers2)
                f_csv.writeheader()
            break_flag_video.clear()
            shipin(url=i[2], csv_name=i[0])

        if '微头条' in i[3]:
            # Start the micro-headline crawler
            headers3 = ['微头条发表时间', '来源', '标题', '文章内图片', '微头条内容']
            print('Crawling micro-headlines, row', j, i[2])
            with open('/toutiao/' + i[0] + '微头条.csv', 'a', newline='', encoding='gb18030') as f:
                f_csv = csv.DictWriter(f, headers3)
                f_csv.writeheader()
            break_flag_weitoutiao.clear()
            weitoutiao(url=i[2], csv_name=i[0])


# Worker for the multithreaded run
def get_all(urlqueue):
    while True:
        try:
            # Non-blocking read from the queue
            data_url = urlqueue.get_nowait()
            # i = urlqueue.qsize()
        except Exception as e:
            break
        # print(data_url)
        # if '文章' in data_url[3]:
        #     # Start the article crawler
        #     print('Crawling articles', data_url[2])
        #     headers1 = ['发表时间', '标题', '来源', '所有图片', '文章内容']
        #     with open('/toutiao/' + data_url[0] + '文章.csv', 'a', newline='', encoding='gb18030') as f:
        #         f_csv = csv.DictWriter(f, headers1)
        #         f_csv.writeheader()
        #     break_flag.clear()
        #     wenzhang(url=data_url[2], csv_name=data_url[0])

        if '视频' in data_url[3]:
            # Start the video crawler
            print('Crawling videos', data_url[2])
            headers2 = ['视频发表时间', '标题', '来源', '视频链接']
            with open('/toutiao/' + data_url[0] + '视频.csv', 'a', newline='', encoding='gb18030') as f:
                f_csv = csv.DictWriter(f, headers2)
                f_csv.writeheader()
            # Note: break_flag_video is a module-level list shared by all worker threads
            break_flag_video.clear()
            shipin(url=data_url[2], csv_name=data_url[0])
            #
        # if '微头条' in data_url[3]:
        #     # Start the micro-headline crawler
        #     headers3 = ['微头条发表时间', '来源', '标题', '文章内图片', '微头条内容']
        #     print('Crawling micro-headlines', data_url[2])
        #     with open('/toutiao/' + data_url[0] + '微头条.csv', 'a', newline='', encoding='gb18030') as f:
        #         f_csv = csv.DictWriter(f, headers3)
        #         f_csv.writeheader()
        #     break_flag_weitoutiao.clear()
        #     weitoutiao(url=data_url[2], csv_name=data_url[0])


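if __name__ == '__main__':
    # Create the output directory
    path = '/toutiao/'
    if not os.path.exists(path):
        os.mkdir(path)

    """For a single-threaded run, call main(). For a multithreaded run, control the thread count as shown below. Multithreading sends requests too frequently, which triggers Toutiao's anti-scraping measures (IP bans and so on), so a proxy IP is needed."""
    # main()


    urlqueue = Queue()
    for j, i in enumerate(csv_read('toutiao-suoyou.csv')):
        urlqueue.put(i)
    # print(urlqueue.get_nowait())
    # print(urlqueue.qsize())
    threads = []
    # Adjust the thread count to control the crawl speed
    threadnum = 4
    for i in range(0, threadnum):
        t = threading.Thread(target=get_all, args=(urlqueue,))
        threads.append(t)

    for t in threads:
        # Optionally mark the thread as a daemon so it is terminated when the main thread exits
        # t.setDaemon(True)
        t.start()
    for t in threads:
        # Join every thread in turn so the main thread exits last; the workers do not block one another
        t.join()

        # pass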
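
The `as`/`cp` derivation in `get_as_cp()` reads the clock itself, which makes it awkward to check by hand. Below is a minimal sketch of the same interleaving with the timestamp passed in as an argument. The function name `as_cp_for` is made up for this illustration, and the uppercase `A1`/`E1` markers and fallback constants are an assumption about what the site's JS produces.

```python
import hashlib


def as_cp_for(timestamp: int) -> dict:
    """Derive the 'as' and 'cp' values for a Unix timestamp given in seconds."""
    e = format(timestamp, "X")  # uppercase hex, no 0x prefix
    digest = hashlib.md5(str(timestamp).encode("utf-8")).hexdigest().upper()
    if len(e) != 8:
        # Fixed fallback pair, as in get_as_cp() (uppercase form assumed)
        return {"as": "479BB4B7254C150", "cp": "7E0AC8874BB0985"}
    head, tail = digest[:5], digest[-5:]
    # 'as' interleaves the first 5 digest chars with the first 5 hex chars
    s = "".join(head[k] + e[k] for k in range(5))
    # 'cp' interleaves hex chars 3..7 with the last 5 digest chars
    r = "".join(e[k + 3] + tail[k] for k in range(5))
    return {"as": "A1" + s + e[-3:], "cp": e[:3] + r + "E1"}


pair = as_cp_for(1577836800)  # 2020-01-01 00:00:00 UTC
print(pair)
```

Passing the timestamp in keeps the function pure, so the same input always gives the same pair and the positions of the hex characters inside `as` and `cp` can be inspected directly.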

Reading the user information from the CSV file
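
Judging from how `main()` indexes each row, the input CSV appears to carry the output file prefix in column 0, the profile URL in column 2, and the content types to crawl in column 3; what column 1 holds is not visible in the code. A rough sketch of that layout follows. The `Account` class, the `parse_accounts` helper, and the sample row are hypothetical, and the trailing-slash URL shape is inferred from `url.split('/')[-2]`.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Account:
    name: str   # column 0, used as the output file prefix
    url: str    # column 2, the profile URL
    kinds: str  # column 3, which content types to crawl

    @property
    def user_id(self) -> str:
        # Same extraction the crawler uses: second-to-last path segment
        return self.url.split('/')[-2]


def parse_accounts(text: str) -> list:
    """Parse CSV text into Account objects, ignoring rows with fewer than 4 columns."""
    rows = csv.reader(io.StringIO(text), dialect='excel')
    return [Account(r[0], r[2], r[3]) for r in rows if len(r) >= 4]


# Hypothetical sample row; column 1 is left empty because its meaning is unknown
sample = 'demo,,https://www.toutiao.com/c/user/123456/,文章 视频\n'
accounts = parse_accounts(sample)
print(accounts[0].user_id, '视频' in accounts[0].kinds)
```

Note that a URL without the trailing slash would make `split('/')[-2]` return `user` instead of the ID, so the input file needs a consistent URL format.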


Crawl results
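
The result files only contain items inside the one-month window. The cutoff used for articles and videos (keep up to 30 days, skip days 31 to 32, stop beyond 32) can be isolated roughly as follows. This is a sketch, not the author's code: `age_in_days` and `decide` are names invented here, `now` is passed in so the result is reproducible, and the `YYYY-MM-DD HH:MM:SS` timestamp format is inferred from how the crawler splits the string. The micro-headline function uses a simpler rule and stops at anything older than 30 days.

```python
from datetime import datetime


def age_in_days(published: str, now: datetime) -> int:
    """Whole days between the publication date (time of day ignored) and now."""
    day = datetime.strptime(published.split(' ')[0], '%Y-%m-%d')
    return (now - day).days


def decide(published: str, now: datetime) -> str:
    """Return 'keep', 'skip' or 'stop' for one item."""
    age = age_in_days(published, now)
    if age > 32:
        return 'stop'   # older than the window: end the crawl
    if age > 30:
        return 'skip'   # 31 or 32 days old: ignore but keep paging
    return 'keep'


now = datetime(2020, 3, 1, 12, 0)
for stamp in ('2020-02-20 08:00:00', '2020-01-30 08:00:00', '2020-01-20 08:00:00'):
    print(stamp, decide(stamp, now))
```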


This content is for reference and learning only. If you have any objection, contact the author to have it removed.

Looking for a web-scraping job.