Python tool: pdfToTxt

程序员文章站 2022-04-11 10:30:56

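Preface: Recently my younger sister needed to convert PDF files to Word/txt so that she could copy the text out of them. PDFs with selectable text can be converted directly with software such as Adobe's PDF tools, but PDFs whose text cannot be copied (for example, screenshots pasted into Word and then exported as PDF) cannot be converted to Word by ordinary software. Some online tools also claim to handle the conversion, but I did not find one that worked well, although I may simply not have tried enough of them. Here is what I tried: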

1. https://smallpdf.com/cn: the results are decent. It is free for 7 days; long-term use requires payment.

2. WPS: it can convert a PDF to Word, but only up to 3 pages. Upgrading to a membership (90 yuan per year) removes the limit. The results are decent.

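Since I write code for a living and knew there are tools that can process images, I looked up some references and wrote code to do the PDF conversion the programmer's way.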

I. Environment and tools

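Tools: Mac, Python 2.7
Package installation:
- pdf2image: pip install pdf2image. This Python package converts a PDF into images.
- poppler: brew install poppler. pdf2image needs this software in order to run.
- pytesseract: pip install pytesseract. It recognizes the text in images, using OCR.
- PIL: conda install PIL. PIL is used to process images (installing it with pip kept failing for me).
- Language data: download it from https://github.com/tesseract-ocr/tessdata and put it under /usr/local/Cellar/tesseract/3.05.01/share/tessdata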
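
Before running the conversion it can help to confirm that the pieces listed above are actually visible to Python. The following is a minimal sketch, not part of the original post: the helper names `find_binaries` and `find_modules` are hypothetical, and it is written for Python 3 (`shutil.which` does not exist in Python 2.7). It assumes `pdftoppm` is the poppler executable that pdf2image relies on and `tesseract` is the engine that pytesseract calls.

```python
import importlib.util
import shutil


def find_binaries(names):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def find_modules(names):
    """Map each top-level module name to True/False without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # pdftoppm ships with poppler; tesseract is the OCR engine itself.
    for name, path in find_binaries(["pdftoppm", "tesseract"]).items():
        print(name, "->", path or "NOT FOUND")
    # PIL is the import name used by the code in this post.
    for name, ok in find_modules(["pdf2image", "pytesseract", "PIL"]).items():
        print(name, "->", "ok" if ok else "NOT INSTALLED")
```

If a binary is reported as missing, the corresponding brew install step above is the first thing to recheck.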

II. Development

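Notes:
- Part 1:
  - Use the convert_from_path function to convert the PDF file into image objects
  - Save the image objects to disk
- Part 2:
  - Recognize the text in each image: this mainly calls pytesseract.image_to_string; for Chinese text, the parameter lang='chi_sim' must be added
  - Save the text to a file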
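
The two parts described above can also be sketched as a pair of small functions. This is an illustrative Python 3 sketch under stated assumptions, not the author's script: `ocr_pages` and `pdf_to_text` are hypothetical names, the page images are passed to pytesseract directly instead of being saved as JPEG files first, and the third-party imports sit inside `pdf_to_text` so that `ocr_pages` can be tried without the OCR stack installed.

```python
def ocr_pages(pages, recognize, separator="\n\n--------{}---------\n\n"):
    """Run `recognize` on each page and join the results with page markers."""
    parts = []
    for number, page in enumerate(pages, start=1):
        parts.append(recognize(page))
        parts.append(separator.format(number))
    return "".join(parts)


def pdf_to_text(pdf_path, lang="chi_sim", dpi=500):
    """Render a PDF to images (part 1) and OCR each image (part 2)."""
    # Imported lazily; both packages must be installed for this function.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(
        pages,
        lambda image: pytesseract.image_to_string(image, lang=lang),
    )
```

Calling `pdf_to_text('xxx.pdf')` would return the recognized text as one string. Keeping the OCR call behind the `recognize` argument means the page-joining logic can be checked with a stand-in function.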

from PIL import Image
import pytesseract
import sys
# Python 2 only: make UTF-8 the default encoding so Chinese text can be written to files
reload(sys)
sys.setdefaultencoding('utf-8')
from pdf2image import convert_from_path
import os
''' 
Part #1 : Converting PDF to images 
'''
# Store all the pages of the PDF in a variable (the second argument is the DPI)
PDF_file = 'xxx.pdf'
pages = convert_from_path(PDF_file, 500)
# Counter to store images of each page of PDF to image 
image_counter = 1
for page in pages: 
  
    # Declaring filename for each page of PDF as JPG 
    # For each page, filename will be: 
    # PDF page 1 -> page_1.jpg 
    # PDF page 2 -> page_2.jpg 
    # PDF page 3 -> page_3.jpg 
    # .... 
    # PDF page n -> page_n.jpg 
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG') 
    image_counter = image_counter + 1
  
''' 
Part #2 - Recognizing text from the images using OCR 
'''
# Variable to get count of total number of pages 
filelimit = image_counter-1
  
# Creating a text file to write the output
# Note: this code from the source article does not work well for Chinese
outfile = "out_text.txt"
f = open(outfile, "a")
for i in range(1, filelimit + 1):
    filename = "page_"+str(i)+".jpg"
    text = str(pytesseract.image_to_string(Image.open(filename)))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    f.write(text)
f.close()

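# Self-tested and working for Chinese; reads image/page_0.jpg through image/page_43.jpg
path = '/Users/shifengmac/Desktop/xiaomei/'
outfile1 = path + 'outImage/outFile1.txt'
f1 = open(outfile1, 'w')
for i in range(0, 44):
    filename = path + 'image/page_{}.jpg'.format(i)
    print(filename)
    image = Image.open(filename)
    # lang='chi_sim' selects the Simplified Chinese language data
    code = pytesseract.image_to_string(image, lang='chi_sim')
    f1.write(code)
    # Write a separator line carrying the page index after each page
    f1.write('\n'*2 + '--------{}---------'.format(i) + '\n'*2)
f1.close()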
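
The loop above hard-codes the page range. One possible alternative, shown here only as a Python 3 sketch with hypothetical helper names (`page_number`, `sorted_page_files`, `write_pages`), is to discover the saved page images and order them by page number. Sorting by the extracted number matters because a plain alphabetical sort would place page_10.jpg before page_2.jpg. The `recognize` argument is an assumption standing in for whatever OCR call is used, for example a wrapper around pytesseract.image_to_string.

```python
import glob
import os
import re


def page_number(path):
    """Extract N from a name like page_N.jpg; return -1 if it does not match."""
    match = re.search(r"page_(\d+)\.jpg$", os.path.basename(path))
    return int(match.group(1)) if match else -1


def sorted_page_files(paths):
    """Keep only page_N.jpg files and order them by N, not alphabetically."""
    pages = [p for p in paths if page_number(p) >= 0]
    return sorted(pages, key=page_number)


def write_pages(image_dir, out_path, recognize):
    """OCR every page image in image_dir and write the text to out_path."""
    files = sorted_page_files(glob.glob(os.path.join(image_dir, "page_*.jpg")))
    # An explicit encoding avoids relying on the process default encoding.
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            out.write(recognize(path))
            out.write("\n\n--------{}---------\n\n".format(page_number(path)))
    return files
```

With this approach the script no longer needs to know in advance how many pages the PDF produced.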

pdf2image reference: https://github.com/Belval/pdf2image

Python | Reading contents of PDF using OCR (Optical Character Recognition): https://www.geeksforgeeks.org/python-reading-contents-of-pdf-using-ocr-optical-character-recognition/

OCR recognition reference: https://www.jianshu.com/p/649497187175