How to read PDF text content in Python

程序员文章站 2024-01-25 17:16:34


To read the text content of a PDF in Python: first open the relevant Python script file, then use PDFMiner to extract the text from the PDF, and finally print the extracted content.


Reading PDF text content in Python

Working with PDF files is a common task in Python. For Python 3, pdfminer3k, a Python 3 port of PDFMiner, is a very good tool.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.

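PDFMiner lets you obtain the exact location of text on a page, along with other information such as fonts and lines. It includes a PDF converter that can transform PDF files into other formats such as HTML. It also has an extensible PDF parser that can be used for purposes other than text analysis.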

pip install pdfminer3k

First, to cover what most people need, here is a fairly general script that reads the text from a PDF:

from io import StringIO
from io import open
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf

def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the extracted text into lines
    lines = str(content).split("\n")
    return lines
 
 
 
if __name__ == '__main__':
    with open('t1.pdf', "rb") as my_pdf:
        print(read_pdf(my_pdf))
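
Printing the whole list at once is hard to read for a long PDF. As a minimal sketch (the helper name `preview_lines` and the sample data are my own assumptions, not part of pdfminer3k), the following shows only the first few non-empty lines, each with its index and its `repr`, so that control characters such as the form feed `'\x0c'` stay visible:

```python
def preview_lines(lines, limit=20):
    """Return up to `limit` non-empty lines as 'index: repr(line)' strings."""
    preview = []
    for index, line in enumerate(lines):
        # Skip empty lines and lines made only of spaces, tabs or '\r'.
        # A lone form feed ('\x0c') is kept on purpose so it stays visible.
        if not line.strip(" \t\r"):
            continue
        preview.append(f"{index:>5}: {line!r}")
        if len(preview) >= limit:
            break
    return preview


if __name__ == "__main__":
    # Hypothetical sample standing in for the list returned by read_pdf().
    sample = ["\x0cUNIT 1", "", "1. apple // a fruit", "   "]
    for row in preview_lines(sample):
        print(row)
```

With a real file, the list returned by `read_pdf(my_pdf)` would be passed in place of `sample`.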

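What I mainly wanted was to pull certain key information out of the PDF, so I needed to find what those pieces of information had in common. Luckily, every line carrying that information contains '//', so I only had to find the lines containing '//', and I wrote the following script.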

It can be used as is. Let's look at the script first:

from io import StringIO
from io import open
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
 
 
def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
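    # split the extracted text into lines
    lines = str(content).split("\n")

    # ordinal numbers (in order of appearance) of the UNIT sections to keep
    units = [1, 2, 3, 5, 7, 8, 9, 11, 12, 13]
    # a unit heading: a line that begins with a form feed ('\x0c') and 'UNIT '
    header = '\x0cUNIT '
    # print(lines[0:100])
    count = 0
    flag = False
    # the with block closes words.txt even if an error occurs
    with open('words.txt', 'w', encoding='utf-8') as text:
        for line in lines:
            if line.startswith(header):
                # a new unit starts: keep it only if its number is in units
                flag = False
                count += 1
                if count in units:
                    flag = True
                    print(line)
                    text.write(line + '\n')
            if '//' in line and flag:
                # keep the text before '//' and after the last '. '
                text_line = line.split('//')[0].split('. ')[-1]
                print(text_line)
                text.write(text_line + '\n')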
 
 
def _main():
    # open the PDF in binary mode; the with block closes it afterwards
    with open('t1.pdf', "rb") as my_pdf:
        read_pdf(my_pdf)
 
 
if __name__ == '__main__':
    _main()

In fact, the line lines = str(content).split("\n") is all you really need to look at: print lines and you can see the content of the PDF.
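
When the lines are printed, form feed characters (`'\x0c'`) show up in the text, as the `'\x0cUNIT '` header in the script suggests. Assuming the converter emits one form feed at the end of each page, a rough sketch like the following groups the extracted text by page. The function name `split_pages` is hypothetical, and the input here is the full text (for example `"\n".join(lines)`), not a pdfminer object:

```python
def split_pages(content):
    """Split extracted text into pages, then each page into non-empty lines."""
    pages = []
    # Assumption: each page of the extracted text ends with a form feed.
    for page_text in content.split("\x0c"):
        lines = [line.strip() for line in page_text.split("\n") if line.strip()]
        if lines:  # ignore chunks without text, e.g. after the last form feed
            pages.append(lines)
    return pages


if __name__ == "__main__":
    # Hypothetical two-page text standing in for real extracted content.
    sample = "UNIT 1\n1. apple // a fruit\n\x0cUNIT 2\n1. bread // a food\n\x0c"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, page)
```

Pages without any text are dropped, so the position in the returned list is not necessarily the page number in the PDF.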

This lets us treat PDF processing as plain string processing, so the rest of the script needs little further explanation.
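
Because everything after extraction is string handling, the filtering step can be tried out on a plain list of strings without any PDF. The sketch below restates the same rules as the script (count the `'\x0cUNIT '` headings, keep only the selected units, take the text before `'//'` and after the last `'. '`) as a function that returns a list instead of writing a file. The name `extract_entries`, its parameters and the sample lines are assumptions made for illustration:

```python
def extract_entries(lines, units, header="\x0cUNIT ", marker="//"):
    """Return headings of the selected units and the text before `marker`."""
    selected = set(units)
    results = []
    count = 0        # how many unit headings have been seen so far
    active = False   # True while inside one of the selected units
    for line in lines:
        if line.startswith(header):
            count += 1
            active = count in selected
            if active:
                results.append(line)
        if marker in line and active:
            # Keep the part before the marker and after the last '. '.
            results.append(line.split(marker)[0].split(". ")[-1])
    return results


if __name__ == "__main__":
    # Hypothetical lines standing in for the output of a real PDF.
    sample = [
        "\x0cUNIT 1", "1. apple // a fruit", "plain text",
        "\x0cUNIT 2", "1. bread // a food",
        "\x0cUNIT 3", "2. carrot // a vegetable",
    ]
    print(extract_entries(sample, units=[1, 3]))
```

As in the script, the extracted text keeps any space that preceded `'//'`; calling `.strip()` on each result would remove it.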
