Crawling WeChat Official Account Articles and Saving Them as PDF Files (Python)

程序员文章站 (Programmer Article Station), 2022-03-12 15:07:02

Preface

This is my first blog post. It shows how to crawl the articles of a WeChat official account and save them locally as PDF files.

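Crawling WeChat official account articles (with wechatsogou)

1. Installation

pip install wechatsogou --upgrade

wechatsogou is a crawler interface for WeChat official accounts, built on Sogou's WeChat search.

2. Usage

Basic usage looks like this: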

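import wechatsogou
# captcha_break_time is the number of retries after a wrong captcha entry; the default is 1
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)
# Name of the official account
gzh_name = ''
# Returns information on the account's 10 most recent articles as a dict
data = ws_api.get_gzh_article_by_history(gzh_name)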

Structure of data:

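{
    'gzh': {
        'wechat_name': '',  # name
        'wechat_id': '',  # WeChat ID
        'introduction': '',  # introduction
        'authentication': '',  # verification
        'headimage': ''  # avatar
    },
    'article': [
        {
            'send_id': int,  # mass-send ID; not unique, because all messages in one mass send share the same ID
            'datetime': int,  # mass-send time as a 10-digit timestamp
            'type': '',  # message type; always 49, meaning an image-and-text article (the mobile history page has other types, but the web page listing the 10 most recent messages only has 49)
            'main': int,  # whether this is the first message of a mass send: 1 or 0
            'title': '',  # article title
            'abstract': '',  # abstract
            'fileid': int,  #
            'content_url': '',  # article link
            'source_url': '',  # link behind "Read the original"
            'cover': '',  # cover image
            'author': '',  # author
            'copyright_stat': int,  # article type, e.g. original
        },
        ...
    ]
}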
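
The datetime field in the structure above is a 10-digit Unix timestamp in seconds. Here is a minimal sketch of turning it into a readable date with the standard library only. The helper name format_send_time and the sample entry are made up for illustration, and UTC is used only so that the output is the same on every machine.

```python
from datetime import datetime, timezone


def format_send_time(timestamp: int) -> str:
    """Convert a 10-digit (seconds) Unix timestamp to a UTC date string."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Hypothetical article entry shaped like the structure above.
sample_article = {"title": "Example", "datetime": 1600000000, "main": 1}
print(format_send_time(sample_article["datetime"]))  # 2020-09-13 12:26:40
```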

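Two pieces of information are needed here: the article title and the article URL.

Once you have the article URL, you can convert the HTML page at that URL into a PDF file.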
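
Pulling out those two fields can be sketched as follows. This assumes the dictionary has the shape shown earlier. The function name extract_title_url_pairs and the sample data are illustrative, not part of wechatsogou, and skipping entries with an empty content_url is a defensive choice rather than documented library behavior.

```python
from typing import Dict, List, Tuple


def extract_title_url_pairs(data: Dict) -> List[Tuple[str, str]]:
    """Return (title, content_url) for every article that has a URL."""
    pairs = []
    for article in data.get("article", []):
        url = article.get("content_url", "")
        if not url:
            continue  # skip entries without a link
        pairs.append((article.get("title", ""), url))
    return pairs


# Hypothetical sample shaped like the structure shown earlier.
sample_data = {
    "gzh": {"wechat_name": "demo"},
    "article": [
        {"title": "First post", "content_url": "https://example.com/a"},
        {"title": "No link", "content_url": ""},
    ],
}
print(extract_title_url_pairs(sample_data))
```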

Generating the PDF file

1. Install wkhtmltopdf

Download: https://wkhtmltopdf.org/downloads.html

2. Install pdfkit

pip install pdfkit

3. Usage

import pdfkit
# Generate a PDF from a URL
pdfkit.from_url('http://baidu.com','out.pdf')
# Generate a PDF from an HTML file
pdfkit.from_file('test.html','out.pdf')
# Generate a PDF from an HTML string
pdfkit.from_string('Hello!','out.pdf')
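
pdfkit drives the wkhtmltopdf binary installed in step 1, so the calls above fail if that binary cannot be found. When it is not on PATH, which often happens on Windows, its location can be passed through pdfkit.configuration. The sketch below is one way to do that. The helper names locate_wkhtmltopdf and html_to_pdf are illustrative and not part of pdfkit, and returning False when the binary is missing is an assumption made for this example.

```python
import shutil
from typing import Optional


def locate_wkhtmltopdf(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a usable path to the wkhtmltopdf binary, or None if not found."""
    # shutil.which accepts either a bare command name or a full path.
    return shutil.which(explicit_path or "wkhtmltopdf")


def html_to_pdf(html: str, output_path: str,
                wkhtmltopdf_path: Optional[str] = None) -> bool:
    """Render an HTML string to a PDF; return False if the binary is missing."""
    binary = locate_wkhtmltopdf(wkhtmltopdf_path)
    if binary is None:
        return False
    import pdfkit  # imported lazily so the lookup helper works without pdfkit
    config = pdfkit.configuration(wkhtmltopdf=binary)
    pdfkit.from_string(html, output_path, configuration=config)
    return True
```

For example, html_to_pdf('Hello!', 'out.pdf', r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe') would use an explicit install location; that path is only a typical default and may differ on your machine.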

If you generate the PDF directly from the article URL obtained above, the images in the article do not appear in the PDF.

Solution:

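# This method processes the HTML for the given article URL so that the images display
content_info = ws_api.get_article_content(url)
# Get the HTML code (it is incomplete; tags such as head and body still need to be added)
html_code = content_info['content_html']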

Then build a complete HTML document around html_code and call pdfkit.from_string() to generate the PDF file. The images in the article now appear in the PDF.
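
Wrapping the fragment in a full document can be sketched as below. The function name build_full_html is hypothetical. Escaping the title with html.escape is an addition made here as a precaution for titles that contain characters such as < or &. The meta charset tag matters because the article text is usually Chinese.

```python
import html


def build_full_html(title: str, body_html: str) -> str:
    """Wrap an HTML fragment in a complete document with a UTF-8 charset."""
    safe_title = html.escape(title)  # titles may contain <, > or &
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="UTF-8">\n'
        f"<title>{safe_title}</title>\n</head>\n<body>\n"
        f'<h2 style="text-align: center;">{safe_title}</h2>\n'
        f"{body_html}\n</body>\n</html>\n"
    )


page = build_full_html("A & B", "<p>hi</p>")
print(page)
```

The returned string can be passed straight to pdfkit.from_string().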

Complete code

import os
import pdfkit
import datetime
import wechatsogou

# Initialize the API
ws_api = wechatsogou.WechatSogouAPI(captcha_break_time=3)


def url2pdf(url, title, targetPath):
    '''
    Generate a PDF file with pdfkit
    :param url: article URL
    :param title: article title
    :param targetPath: directory in which to store the PDF file
    :return: True if a PDF was written, False if the article could not be fetched
    '''
    try:
        content_info = ws_api.get_article_content(url)
    except Exception:
        return False
    # Processed HTML
    html = f'''
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>{title}</title>
    </head>
    <body>
    <h2 style="text-align: center;font-weight: 400;">{title}</h2>
    {content_info['content_html']}
    </body>
    </html>
    '''
    try:
        pdfkit.from_string(html, os.path.join(targetPath, f'{title}.pdf'))
    except Exception:
        # Some article titles contain special characters and cannot be used as file names,
        # so fall back to a timestamp-based name
        filename = datetime.datetime.now().strftime('%Y%m%d%H%M%S') + '.pdf'
        pdfkit.from_string(html, os.path.join(targetPath, filename))
    return True


if __name__ == '__main__':
    # Name of the official account to crawl
    gzh_name = ''
    targetPath = os.path.join(os.getcwd(), gzh_name)
    # Create the target folder if it does not exist
    os.makedirs(targetPath, exist_ok=True)
    # Returns information on the account's 10 most recent articles as a dict
    data = ws_api.get_gzh_article_by_history(gzh_name)
    article_list = data['article']
    for article in article_list:
        url = article['content_url']
        title = article['title']
        url2pdf(url, title, targetPath)
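
The complete code falls back to a timestamp when a title cannot be used as a file name, which loses the title. An alternative is to clean the title first, sketched below. The helper safe_filename is hypothetical, and its character set covers the characters Windows forbids in file names plus control characters; other platforms may have different rules.

```python
import re

# Characters that Windows forbids in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title: str, fallback: str = "untitled") -> str:
    """Replace characters that are invalid in file names with underscores."""
    # Trailing spaces and dots are also rejected by Windows, so strip them.
    cleaned = _INVALID_CHARS.sub("_", title).strip(" .")
    return cleaned or fallback


print(safe_filename('What is "PDF"? A/B test'))
```

With this helper, f'{title}.pdf' in url2pdf could become safe_filename(title) + '.pdf', keeping the timestamp fallback for any remaining failures.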
