Python: Merging PDFs and Converting Markdown, HTML, and Word to PDF

程序员文章站 2022-05-28 14:46:34


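I've been learning a lot lately, so I'm writing a summary here before I forget it. The palest ink beats the best memory: a year from now I would otherwise have forgotten every bit of this.

First, a few PDF-related modules: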

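1. pdfkit: a module that converts HTML, the content of a URL, or a string to PDF. It is fast, and conversion is nearly instant unless the input is very large.

2. pandoc: the Swiss Army knife of the document world. It is extremely powerful, and in my opinion it is on the same level as ffmpeg in the video world.
Converting other formats to PDF, however, requires downloading an additional support library. I tried it, but conversion was slow, so I stopped using it.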

3. html2text: mainly used to convert HTML to Markdown.

This library parses quickly and is simple to use. The usual call is md_data = html2text.html2text(html_code). Official link →

4. python-docx:

https://stackoom.com/question/3f76h/%E6%97%A0%E6%B3%95%E5%AE%89%E8%A3%85python-docx-MacOS

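1. pdfkit in detail

1.1 Installing on Windows:

pip install pdfkit
pdfkit version: 0.6.1

On Windows you also need to install wkhtmltopdf, the PDF conversion program that pdfkit depends on (website →). wkhtmltopdf is a program dedicated to converting HTML to PDF, and overall it currently gives the best conversion results.

pdfkit is a Python wrapper around the wkhtmltopdf command (official documentation →).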
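
If wkhtmltopdf is not on PATH (which often happens on Windows), pdfkit can be pointed at the executable through pdfkit.configuration. The following is a minimal sketch of that idea. The helper names find_executable and make_pdfkit_config, and the Windows install path in the comment, are assumptions made for illustration, so adjust them to your machine.

```python
import os
import shutil


def find_executable(name, extra_paths=()):
    """Return the path of an executable, or None if it cannot be found.

    Looks on PATH first, then in extra_paths (full paths to candidate files).
    """
    found = shutil.which(name)
    if found:
        return found
    for candidate in extra_paths:
        if os.path.isfile(candidate):
            return candidate
    return None


def make_pdfkit_config(extra_paths=()):
    """Build a pdfkit configuration that points at wkhtmltopdf."""
    import pdfkit  # imported here so the lookup helper works without pdfkit

    path = find_executable("wkhtmltopdf", extra_paths)
    if path is None:
        raise FileNotFoundError("wkhtmltopdf not found; install it first")
    return pdfkit.configuration(wkhtmltopdf=path)


# Example (the install path is an assumed default, not verified):
# config = make_pdfkit_config([r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"])
# pdfkit.from_url("https://www.baidu.com/", "out.pdf", configuration=config)
```

The resulting object is passed to from_url, from_file, or from_string through their configuration argument.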

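1.2 Installing on Ubuntu:

You can install it directly with the command below, but the version in the Debian/Ubuntu repositories has reduced functionality (it is compiled without the wkhtmltopdf Qt patches), so features such as outlines, headers, footers, and TOC are missing. To use those options, install the static binary from the wkhtmltopdf website.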

 sudo apt-get install wkhtmltopdf

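That means downloading the .deb package from the wkhtmltopdf website and installing it with dpkg (no compilation is needed). This build supports outlines, headers, footers, and TOC, and avoids the related errors.

wget https://github.com/wkhtmltopdf/wkhtmltopdf/releases/download/0.12.5/wkhtmltox_0.12.5-1.xenial_amd64.deb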

# sudo dpkg -i <package.deb>
sudo dpkg -i wkhtmltox_0.12.5-1.xenial_amd64.deb

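1.3 Usage:

There are three basic functions:
from_url: fetch the HTML from a URL and convert it to PDF,
from_file: read a local HTML file and convert it to PDF,
from_string: convert a string to PDF.
Basic usage: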

import pdfkit

pdfkit.from_url('https://www.baidu.com/', 'out.pdf')

html_path = './test.html'
pdfkit.from_file(html_path, 'out.pdf')

st = 'An example used to generate a PDF'
pdfkit.from_string(st, 'out.pdf')

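Advanced usage:
1. pdfkit.from_url(['url1.url', 'url2.url', 'url3.url'], 'out.pdf') merges the HTML from several URLs into a single PDF, and each page's title becomes a bookmark in the PDF, which is handy.

2. pdfkit.from_file(['h1.html', 'h2.html'], 'merge_html.pdf') does the same for several local HTML files.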

pdfkit.from_url(['google.com', 'yandex.ru', 'engadget.com'], 'out.pdf')

pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')
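
Since from_file accepts a list, merging a whole folder of HTML files comes down to building that list in the right order. Here is a rough sketch under that assumption. The function names and the ./chapters folder are hypothetical, and the pdfkit call needs both pdfkit and wkhtmltopdf installed.

```python
import glob
import os


def collect_html_files(directory):
    """Return the .html files in `directory`, sorted by file name."""
    pattern = os.path.join(directory, "*.html")
    return sorted(glob.glob(pattern))


def merge_html_to_pdf(directory, output_pdf):
    """Merge every HTML file in `directory` into one PDF via pdfkit."""
    import pdfkit  # requires pdfkit and wkhtmltopdf to be installed

    files = collect_html_files(directory)
    if not files:
        raise ValueError("no .html files found in %s" % directory)
    pdfkit.from_file(files, output_pdf)
    return files


# Example (hypothetical paths):
# merge_html_to_pdf("./chapters", "merge_html.pdf")
```

The sort is plain string order, so 10.html comes before 2.html. Zero-pad the file names (02.html, 10.html) if the page order matters.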

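3. PDF options: set headers, footers, page margins, and more.

The available options are the wkhtmltopdf command-line options. Run wkhtmltopdf -h on the command line to list them (there are a great many, so learn whichever you need), or read the wkhtmltopdf options documentation →. Whatever option you find, put it into the dictionary below as a key/value pair.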

my_ip_word = 'footer text'  # placeholder: the original did not show how my_ip_word was defined

options = {
    'page-size': 'A4',          # A4 (default), Letter, A0, A1, B1, etc.
    'margin-top': '0.75in',     # top page margin
    'margin-right': '0.25in',
    'margin-bottom': '0.75in',
    'margin-left': '0.25in',
    'minimum-font-size': '30',  # minimum font size on the page
    'footer-left': my_ip_word,  # footer text on the left; footer-center and footer-right also exist
    # 'header-center': my_ip_word,  # header text in the center
    'footer-font-size': '8',
    # 'header-font-size': '8',
    'encoding': 'UTF-8',        # treat the input as UTF-8 to avoid garbled Chinese in the PDF
    'custom-header': [          # HTTP request headers, e.g. a User-Agent
        ('Accept-Encoding', 'gzip'),
    ],
    'cookie': [                 # request cookies; presumably only relevant to from_url
        # For local data you don't need cookies or headers: fetch the page with requests first
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
}

pdfkit.from_file(html_path, '06.pdf', options=options)
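
To make the link between the dictionary and the wkhtmltopdf command line concrete, the sketch below flattens an options dict into command-line style arguments. It is an illustration of the idea only, not pdfkit's own code, and the function name options_to_args is made up for this example.

```python
def options_to_args(options):
    """Flatten an options dict into wkhtmltopdf-style command-line arguments.

    Illustration only: this mimics the idea, it is not pdfkit's own code.
    """
    args = []
    for key, value in options.items():
        flag = key if key.startswith("--") else "--" + key
        if isinstance(value, (list, tuple)):
            # Repeatable options such as cookie: one flag per (name, value) pair.
            for item in value:
                args.append(flag)
                if isinstance(item, (list, tuple)):
                    args.extend(str(part) for part in item)
                else:
                    args.append(str(item))
        elif value is None or value is True or value == "":
            # Options without a value are plain switches.
            args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


example = {
    "page-size": "A4",
    "margin-top": "0.75in",
    "cookie": [("cookie-name1", "cookie-value1")],
}
print(options_to_args(example))
# ['--page-size', 'A4', '--margin-top', '0.75in', '--cookie', 'cookie-name1', 'cookie-value1']
```

Reading the dict this way makes it easier to look up a key in the wkhtmltopdf help output: the key is the long option name without the leading dashes.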

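Below are the standard page sizes (ISO and others) as listed in Qt's QPrinter enumeration: constant, value, dimensions.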

QPrinter::A0	5	841 x 1189 mm
QPrinter::A1	6	594 x 841 mm
QPrinter::A2	7	420 x 594 mm
QPrinter::A3	8	297 x 420 mm
QPrinter::A4	0	210 x 297 mm, 8.26 x 11.69 inches
QPrinter::A5	9	148 x 210 mm
QPrinter::A6	10	105 x 148 mm
QPrinter::A7	11	74 x 105 mm
QPrinter::A8	12	52 x 74 mm
QPrinter::A9	13	37 x 52 mm
QPrinter::B0	14	1000 x 1414 mm
QPrinter::B1	15	707 x 1000 mm
QPrinter::B2	17	500 x 707 mm
QPrinter::B3	18	353 x 500 mm
QPrinter::B4	19	250 x 353 mm
QPrinter::B5	1	176 x 250 mm, 6.93 x 9.84 inches
QPrinter::B6	20	125 x 176 mm
QPrinter::B7	21	88 x 125 mm
QPrinter::B8	22	62 x 88 mm
QPrinter::B9	23	33 x 62 mm
QPrinter::B10	16	31 x 44 mm
QPrinter::C5E	24	163 x 229 mm
QPrinter::Comm10E	25	105 x 241 mm, U.S. Common 10 Envelope
QPrinter::DLE	26	110 x 220 mm
QPrinter::Executive	4	7.5 x 10 inches, 190.5 x 254 mm
QPrinter::Folio	27	210 x 330 mm
QPrinter::Ledger	28	431.8 x 279.4 mm
QPrinter::Legal	3	8.5 x 14 inches, 215.9 x 355.6 mm
QPrinter::Letter	2	8.5 x 11 inches, 215.9 x 279.4 mm
QPrinter::Tabloid	29	279.4 x 431.8 mm
QPrinter::Custom	30	Unknown, or a user defined size.

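Notes
1. The file passed to pdfkit.from_file must be UTF-8 encoded. Otherwise any Chinese content comes out garbled in the generated PDF, because pdfkit does not handle Chinese encodings by default, so you have to specify one yourself. When writing file.html, encode the HTML as UTF-8:

html_data = ''
with open('file.html', 'a', encoding='utf8') as f:   # text mode; binary mode ('ab') does not accept encoding
    f.write(html_data)
Or open the file in binary mode ('ab') and call f.write(html_data.encode('utf8')).

2. Alternatively, declare the encoding in the head of the HTML file: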

<head><meta charset="UTF-8"></head>
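
The two notes above can be combined in one small helper: make sure the HTML declares a charset, then write it out as UTF-8. This is only a sketch. The function names are invented here, and the regex-based check is a simplification that assumes reasonably ordinary HTML.

```python
import re


def ensure_utf8_meta(html):
    """Return html with a <meta charset="UTF-8"> tag if none is declared."""
    if re.search(r"<meta[^>]+charset", html, flags=re.IGNORECASE):
        return html  # a charset is already declared; leave it alone
    meta = '<meta charset="UTF-8">'
    head = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head:
        # Insert right after the opening <head> tag.
        return html[:head.end()] + meta + html[head.end():]
    # No <head> at all: prepend a minimal one.
    return "<head>" + meta + "</head>" + html


def save_html_utf8(html, path):
    """Write html to path as UTF-8, declaring the charset in the file."""
    with open(path, "w", encoding="utf8") as f:
        f.write(ensure_utf8_meta(html))


# Example (hypothetical file name):
# save_html_utf8("<p>中文内容</p>", "file.html")
```

An existing charset declaration is left untouched, so a page that declares a different encoding still needs to be converted by hand.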

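2. html2text

Installation: pip install html2text. Upgrade: pip install html2text --upgrade

Overview: the conversion is robust and I ran into no real errors. There is one shortcoming, although it is really a problem with the site's data rather than the library: some image links are relative to the site's home page, for example /2020/test.jpg. If you don't complete the image link yourself, the output keeps it as-is, like ![test](/2020/test.jpg), so you need to detect and fix these links yourself.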


import html2text


# 1. Using the class
h = html2text.HTML2Text()         # create an HTML2Text object
h.ignore_links = True             # whether to drop links from the output
html = '<p>Is that you, master? <a href="http://www.baidu.com">Really?</a></p>'
md_data = h.handle(html)
print(md_data)
# ignore_links = False: Is that you, master? [Really?](http://www.baidu.com)
# ignore_links = True:  Is that you, master? Really?

# 2. Basic usage:

md_data_base = html2text.html2text(html)
print(md_data_base)
# Is that you, master? [Really?](http://www.baidu.com)
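
For the relative image links mentioned in the overview, one option is to post-process the generated Markdown and join each image path with the site's base URL. The sketch below does that with the standard library. The function name and example.com URL are placeholders, and html2text's own baseurl argument may achieve the same thing during conversion, so check its documentation first.

```python
import re
from urllib.parse import urljoin

# Matches ![alt](url); images written with a title, like ![a](url "t"), are left unchanged.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def absolutize_images(markdown, base_url):
    """Rewrite relative Markdown image links as absolute URLs."""
    def fix(match):
        alt, url = match.group(1), match.group(2)
        return "![%s](%s)" % (alt, urljoin(base_url, url))

    return IMAGE_LINK.sub(fix, markdown)


md = "![test](/2020/test.jpg)"
print(absolutize_images(md, "https://example.com/blog/post.html"))
# ![test](https://example.com/2020/test.jpg)
```

Links that are already absolute pass through urljoin unchanged, so the function is safe to run over the whole document.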

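3. pandoc

The most powerful format-conversion software in the document world, with the same standing as ffmpeg in audio and video. Both are superb tools. Documentation →
Learn whichever feature you need. I don't think it is necessary to learn all of it, nor is there time, unless you are building a product on top of pandoc, at which point you would need to master both its features and its source code.

pandoc download: link →

3.2 Markdown to Word: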

pandoc md_file.md -o word.docx
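In Python, fill in the file names with the %s format specifier and run the command asynchronously with subprocess.Popen:

from subprocess import Popen

md_path = 'md_file'     # placeholder: path without the .md extension
word_path = 'word'      # placeholder: path without the .docx extension
# An argument list works on every platform; a plain command string needs shell=True outside Windows
cmd = ['pandoc', '%s.md' % md_path, '-o', '%s.docx' % word_path]
Popen(cmd)   # Popen runs the command asynchronously and does not block the process. The drawback is that you don't know when it finishes, which matters if you want to trigger an event based on whether the converted file exists.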
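
If you do need to know when the conversion has finished, one simple approach is to wait on the process and check its exit code before looking for the output file. The following is a minimal sketch. build_pandoc_cmd and run_and_wait are names invented for this example, and the demonstration uses the Python interpreter as a stand-in command so that it runs without pandoc installed.

```python
import subprocess
import sys


def build_pandoc_cmd(md_path, docx_path):
    """Return the pandoc command as an argument list."""
    return ["pandoc", md_path, "-o", docx_path]


def run_and_wait(cmd, timeout=None):
    """Start cmd with Popen and block until it exits; return the exit code."""
    process = subprocess.Popen(cmd)
    return process.wait(timeout=timeout)


# Demonstration with a stand-in command so the sketch runs without pandoc.
exit_code = run_and_wait([sys.executable, "-c", "pass"])
print(exit_code)  # 0 means the command succeeded

# With pandoc installed (hypothetical file names):
# if run_and_wait(build_pandoc_cmd("md_file.md", "word.docx")) == 0:
#     print("word.docx is ready")
```

Waiting gives up the asynchronous behaviour, so for many files it may be better to start several Popen objects first and call wait on each one afterwards.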
