Example: PDF text extraction and merging with PDFBox

程序员文章站 2024-02-27 20:27:51

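Sometimes we need to process PDF files: extract text, merge documents, and so on. We used to rely on the free A-PDF Text Extractor tool, but why not write our own?
We can now use the open-source library PDFBox 0.7.3. After downloading and unpacking it, add references to: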

PDFBox-0.7.3.dll
IKVM.GNU.Classpath.dll

Create a new project. The code is simple:

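// Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
public static string ParseToTxtStringUsingPdfBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    // PDFTextStripper returns the text of all pages as one string
    PDFTextStripper stripper = new PDFTextStripper();
    string text = stripper.getText(doc);
    doc.close(); // release the file handle
    return text;
}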

Once you have the text string, write it to a file on disk with a method like this:

// Requires: using System; using System.IO;
public static void WriteToTextFile(string str, string txtPath)
{
    if (string.IsNullOrEmpty(txtPath))
        throw new ArgumentNullException("txtPath", "output file path should not be null");
    // The using block flushes and closes the writer, so no explicit Close() is needed
    using (var txtWriter = new StreamWriter(txtPath))
    {
        txtWriter.Write(str);
    }
}
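
To tie extraction and writing together, here is a minimal sketch that converts every PDF in a folder to a text file. The names PdfBatch and ConvertFolder, and the extract parameter, are hypothetical and not part of the article or of PDFBox. The extractor is passed in as a delegate so that the PDFBox call stays in one place and the helper can be tried out without the library.

```csharp
using System;
using System.IO;

public static class PdfBatch
{
    // Hypothetical helper: runs `extract` on each *.pdf in inputDir and
    // writes the result to a .txt file of the same name in outputDir.
    // Returns the number of files converted.
    public static int ConvertFolder(string inputDir, string outputDir,
                                    Func<string, string> extract)
    {
        if (extract == null)
            throw new ArgumentNullException("extract");

        Directory.CreateDirectory(outputDir); // no-op if it already exists
        int count = 0;
        foreach (string pdfPath in Directory.GetFiles(inputDir, "*.pdf"))
        {
            string txtName = Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
            File.WriteAllText(Path.Combine(outputDir, txtName), extract(pdfPath));
            count++;
        }
        return count;
    }
}
```

To use it, pass the PDFBox-based extraction method shown earlier as the third argument.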

The rest is up to you. The library currently supports:

PDF to text extraction
Merge PDF documents
PDF document encryption/decryption
Lucene search engine integration
Fill in form data (FDF and XFDF)
Create a PDF from a text file
Create images from PDF pages
Print a PDF
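
Merging is listed above but not demonstrated, so here is a hedged sketch of how it might look. It assumes the referenced PDFBox build exposes org.pdfbox.util.PDFMergerUtility with addSource, setDestinationFileName and mergeDocuments; check this against the assembly you actually reference. PdfMergeSketch and Merge are hypothetical names used only for illustration.

```csharp
using org.pdfbox.util;

public static class PdfMergeSketch
{
    // Hypothetical helper: appends the given PDFs, in order, into outputPath.
    // Assumption: PDFMergerUtility is available in the referenced PDFBox build.
    public static void Merge(string outputPath, params string[] inputPaths)
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        foreach (string path in inputPaths)
        {
            merger.addSource(path);               // queue each source file
        }
        merger.setDestinationFileName(outputPath);
        merger.mergeDocuments();                  // write the combined document
    }
}
```

A call such as PdfMergeSketch.Merge("all.pdf", "part1.pdf", "part2.pdf") would then be expected to produce one file containing the pages of both inputs in that order.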
