Best Practices for Reading Word and Excel Files with POI

程序员文章站 2024-03-31 13:47:10

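Preface

POI is Apache's well-known library for reading and writing Microsoft Office documents. Many people have probably used POI to export reports or to create and read Word documents, and it does make these tasks much more convenient. A tool I built recently uses it to read the Word and Excel files on a computer.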

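POI structure overview

Package: description

HSSF: reads and writes Microsoft Excel XLS files.
XSSF: reads and writes Microsoft Excel OOXML XLSX files.
HWPF: reads and writes Microsoft Word DOC files.
HSLF: reads and writes Microsoft PowerPoint files.
HDGF: reads Microsoft Visio files.
HPBF: reads Microsoft Publisher files.
HSMF: reads Microsoft Outlook files.

The rest of this article covers the pitfalls I ran into, first with Word and then with Excel: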

Word

For Word files, all I need is the body text of the document. So I wrote a method that reads a doc or docx file:

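 import java.io.IOException;
 import java.io.InputStream;
 import org.apache.poi.hwpf.extractor.WordExtractor;
 import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
 import org.apache.poi.xwpf.usermodel.XWPFDocument;

 // First attempt: choose the parser from the file extension.
 // "logger" is assumed to be a logger field defined elsewhere in the class.
 private static String readDoc(String filePath, InputStream is) throws IOException {
  String text = "";
  try {
   if (filePath.endsWith(".doc")) {
    // OLE2-based .doc files
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
   } else if (filePath.endsWith(".docx")) {
    // OOXML-based .docx files
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
   }
  } catch (Exception e) {
   logger.error(filePath, e);
  } finally {
   // Close the stream exactly once, whether or not extraction succeeded.
   if (is != null) {
    is.close();
   }
  }
  return text;
 }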

In theory, this code should work for most doc and docx files. But I ran into a strange problem: when reading certain doc files, my code kept throwing this exception:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents.

What does this exception mean? Put simply, the file you opened is not a doc file, and you should read it with the docx code path instead. But the file we opened clearly has the .doc extension!

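In fact, doc and docx are fundamentally different: doc is an OLE2 format, while docx is an OOXML format. If you open a docx file with an archive tool, you will see a set of folders:

[Image: the folders inside a docx file opened as an archive]

A docx file is essentially a ZIP file that contains a number of XML files. So although some docx files are small on disk, the XML inside them can be quite large, which is why reading a docx file that does not look very big can still consume a lot of memory.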
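To see the ZIP structure from code rather than from an archive tool, the entries can be listed with the JDK's own java.util.zip classes. This is only a sketch: the class and method names are made up for illustration, and fakeDocx() builds a tiny stand-in archive with part names that a real docx typically contains, so that the example runs without a real document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxEntryLister {

    /** Returns the names of all entries in a ZIP-based container such as a docx file. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a small stand-in archive (hypothetical content, not a valid Word document). */
    public static byte[] fakeDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] parts = {"[Content_Types].xml", "_rels/.rels", "word/document.xml"};
            for (String name : parts) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Prints one entry name per line, in the order they were written.
        for (String name : listEntries(new ByteArrayInputStream(fakeDocx()))) {
            System.out.println(name);
        }
    }
}
```

Pointing listEntries at a FileInputStream for a real docx should show the same kind of layout, with the body text living in word/document.xml.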

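I then opened the problematic doc file with an archive tool and, sure enough, its contents looked just like the figure above, so it is really a docx file. Perhaps it was saved in some compatibility mode, which led to this nasty problem. Either way, the file extension is no longer a reliable way to tell doc from docx.

Honestly, I don't think this is a rare problem, yet I could not find much about it on Google. One example, "How to know whether a file is .docx or .doc format from Apache POI", uses a ZipInputStream to decide whether the file is a docx:

boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;

But I don't think this is a good approach, because it means constructing a ZipInputStream just to probe the file. The probe also appears to consume bytes from the InputStream, so reading a genuine doc file afterwards runs into problems. Alternatively, you could check whether a File object is a ZIP file, but that does not work for me either: I also need to read doc and docx files that sit inside archives, so my input has to be an InputStream. After a long back-and-forth in an online discussion, where at times I felt we were talking past each other, someone finally gave me a solution I was delighted with: FileMagic, a feature added in POI 3.17: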
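The side effect of the ZipInputStream probe is easy to reproduce with plain JDK classes. The sketch below (class and method names are hypothetical) runs the probe on 100 bytes that are not a ZIP file and shows that part of the stream has been consumed afterwards. It also shows one possible remedy, assuming the stream supports mark/reset and that the probe reads less than the chosen read limit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

public class ZipProbeDemo {

    /** The probe from the quoted example: it reads from the stream and never rewinds it. */
    public static boolean probe(InputStream in) throws IOException {
        // The ZipInputStream is deliberately not closed, since closing it would close "in" too.
        return new ZipInputStream(in).getNextEntry() != null;
    }

    /** Same probe, but rewinds afterwards. Requires a stream that supports mark/reset. */
    public static boolean probeAndRewind(InputStream in) throws IOException {
        in.mark(64 * 1024); // assumed upper bound on how many bytes the probe reads
        try {
            return probe(in);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] notZip = new byte[100]; // stand-in for the start of a genuine OLE2 doc file
        notZip[0] = (byte) 0xD0;

        InputStream consumed = new ByteArrayInputStream(notZip);
        System.out.println(probe(consumed));          // false: not a ZIP file
        System.out.println(consumed.available());     // fewer than 100 bytes are left

        InputStream rewound = new ByteArrayInputStream(notZip);
        System.out.println(probeAndRewind(rewound));  // false
        System.out.println(rewound.available());      // 100: the stream is back at the start
    }
}
```

A parser handed the first stream would start in the middle of the header, which matches the failures described above for genuine doc files.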

public enum FileMagic {
 /** OLE2 / BIFF8+ stream used for Office 97 and higher documents */
 OLE2(HeaderBlockConstants._signature),
 /** OOXML / ZIP stream */
 OOXML(OOXML_FILE_HEADER),
 /** XML file */
 XML(RAW_XML_FILE_HEADER),
 /** BIFF2 raw stream - for Excel 2 */
 BIFF2(new byte[]{
   0x09, 0x00, // sid=0x0009
   0x04, 0x00, // size=0x0004
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 }),
 /** BIFF3 raw stream - for Excel 3 */
 BIFF3(new byte[]{
   0x09, 0x02, // sid=0x0209
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 }),
 /** BIFF4 raw stream - for Excel 4 */
 BIFF4(new byte[]{
   0x09, 0x04, // sid=0x0409
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x70, 0x00 // 0x70 = multiple values
 },new byte[]{
   0x09, 0x04, // sid=0x0409
   0x06, 0x00, // size=0x0006
   0x00, 0x00, // unused
   0x00, 0x01
 }),
 /** Old MS Write raw stream */
 MSWRITE(
   new byte[]{0x31, (byte)0xbe, 0x00, 0x00 },
   new byte[]{0x32, (byte)0xbe, 0x00, 0x00 }),
 /** RTF document */
 RTF("{\\rtf"),
 /** PDF document */
 PDF("%PDF"),
 // keep UNKNOWN always as last enum!
 /** UNKNOWN magic */
 UNKNOWN(new byte[0]);

 final byte[][] magic;

 FileMagic(long magic) {
  this.magic = new byte[1][8];
  LittleEndian.putLong(this.magic[0], 0, magic);
 }

 FileMagic(byte[]... magic) {
  this.magic = magic;
 }

 FileMagic(String magic) {
  this(magic.getBytes(LocaleUtil.CHARSET_1252));
 }

 public static FileMagic valueOf(byte[] magic) {
  for (FileMagic fm : values()) {
   int i=0;
   boolean found = true;
   for (byte[] ma : fm.magic) {
    for (byte m : ma) {
     byte d = magic[i++];
     if (!(d == m || (m == 0x70 && (d == 0x10 || d == 0x20 || d == 0x40)))) {
      found = false;
      break;
     }
    }
    if (found) {
     return fm;
    }
   }
  }
  return UNKNOWN;
 }

 /**
  * Get the file magic of the supplied InputStream (which must
  * support mark and reset).<p>
  *
  * If unsure if your InputStream does support mark / reset,
  * use {@link #prepareToCheckMagic(InputStream)} to wrap it and make
  * sure to always use that, and not the original!<p>
  *
  * Even if this method returns {@link FileMagic#UNKNOWN} it could potentially mean,
  * that the ZIP stream has leading junk bytes
  *
  * @param inp An InputStream which supports either mark/reset
  */
 public static FileMagic valueOf(InputStream inp) throws IOException {
  if (!inp.markSupported()) {
   throw new IOException("getFileMagic() only operates on streams which support mark(int)");
  }

  // Grab the first 8 bytes
  byte[] data = IOUtils.peekFirst8Bytes(inp);

  return FileMagic.valueOf(data);
 }


 /**
  * Checks if an {@link InputStream} can be reseted (i.e. used for checking the header magic) and wraps it if not
  *
  * @param stream stream to be checked for wrapping
  * @return a mark enabled stream
  */
 public static InputStream prepareToCheckMagic(InputStream stream) {
  if (stream.markSupported()) {
   return stream;
  }
  // we used to process the data via a PushbackInputStream, but user code could provide a too small one
  // so we use a BufferedInputStream instead now
  return new BufferedInputStream(stream);
 }
}

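Above is the main code. It decides the file type from the first 8 bytes of the InputStream, which is without doubt the most elegant solution. I had in fact suspected from the start that the first few bytes of an archive would be defined differently, in other words that there would be a magic number. Because FileMagic's dependencies are compatible with POI 3.16, I only had to add this one class to my project. So the correct way to read a Word file is now: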

 private static String readDoc(String filePath, InputStream is) throws IOException {
  String text = "";
  // Wrap the stream if necessary so that its header can be peeked and then reset.
  is = FileMagic.prepareToCheckMagic(is);
  try {
   // Decide by content, not by extension; the peek does not consume the stream.
   FileMagic magic = FileMagic.valueOf(is);
   if (magic == FileMagic.OLE2) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
   } else if (magic == FileMagic.OOXML) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
   }
  } catch (Exception e) {
   logger.error("for file " + filePath, e);
  } finally {
   if (is != null) {
    is.close();
   }
  }
  return text;
 }
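The header check itself can be illustrated without POI on the classpath. The sketch below is not POI's implementation; HeaderSniffer, Kind and sniff are hypothetical names. It compares the first bytes of a stream against two well-known signatures: the 8-byte OLE2 compound file header and the 4-byte ZIP local file header that docx and xlsx files start with. The stream is marked before the read and reset afterwards, so the parser still sees it from the beginning.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class HeaderSniffer {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // First 8 bytes of an OLE2 compound file (doc, xls, ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // First 4 bytes of a ZIP local file header ("PK\003\004").
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at the header without consuming it; the stream must support mark/reset. */
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {           // read() may return fewer bytes than requested
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;                      // stream is shorter than 8 bytes
            }
            n += r;
        }
        in.reset();
        if (startsWith(head, n, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(head, n, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Imagine this is "report.doc" on disk, but its content starts like a ZIP file.
        byte[] mislabeled = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(mislabeled));
        System.out.println(sniff(in));      // ZIP
        System.out.println(in.read());      // 80 (0x50): the first byte is still unread
    }
}
```

Note that a ZIP signature only says the file is some ZIP container; telling docx from xlsx or from an ordinary archive still requires looking at the entries inside.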

Excel

For Excel, I will skip the comparison between my earlier approaches and the current one, and just show what I consider my best practice now:

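 import java.io.File;
 import java.io.InputStream;
 import com.monitorjbl.xlsx.StreamingReader;
 import org.apache.poi.hssf.usermodel.HSSFWorkbook;
 import org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException;
 import org.apache.poi.ss.usermodel.Cell;
 import org.apache.poi.ss.usermodel.DataFormatter;
 import org.apache.poi.ss.usermodel.Row;
 import org.apache.poi.ss.usermodel.Sheet;
 import org.apache.poi.ss.usermodel.Workbook;
 import org.apache.poi.ss.usermodel.WorkbookFactory;

 // Reads a workbook from a stream, e.g. a file nested inside an archive.
 @SuppressWarnings("deprecation")
 private static String readExcel(String filePath, InputStream inp) throws Exception {
  Workbook wb;
  StringBuilder sb = new StringBuilder();
  try {
   if (filePath.endsWith(".xls")) {
    wb = new HSSFWorkbook(inp);
   } else {
    wb = StreamingReader.builder()
      .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
      .bufferSize(4096)  // buffer size to use when reading InputStream to file (defaults to 1024)
      .open(inp);   // InputStream or File for XLSX file (required)
   }
   sb = readSheet(wb, sb, filePath.endsWith(".xls"));
   wb.close();
  } catch (OLE2NotOfficeXmlFileException e) {
   logger.error(filePath, e);
  } finally {
   if (inp != null) {
    inp.close();
   }
  }
  return sb.toString();
 }

 // Reads a workbook straight from disk, which avoids buffering the whole stream.
 private static String readExcelByFile(String filePath, File file) {
  Workbook wb;
  StringBuilder sb = new StringBuilder();
  try {
   if (filePath.endsWith(".xls")) {
    wb = WorkbookFactory.create(file);
   } else {
    wb = StreamingReader.builder()
      .rowCacheSize(1000) // number of rows to keep in memory (defaults to 10)
      .bufferSize(4096)  // buffer size to use when reading InputStream to file (defaults to 1024)
      .open(file);   // InputStream or File for XLSX file (required)
   }
   sb = readSheet(wb, sb, filePath.endsWith(".xls"));
   wb.close();
  } catch (Exception e) {
   logger.error(filePath, e);
  }
  return sb.toString();
 }

 // Appends the text of every string and numeric cell; all other cell types are skipped.
 // Uses the int cell-type constants of POI 3.x (deprecated there, hence the annotation above).
 private static StringBuilder readSheet(Workbook wb, StringBuilder sb, boolean isXls) throws Exception {
  DataFormatter formatter = new DataFormatter(); // created once and reused for every cell
  for (Sheet sheet : wb) {
   for (Row r : sheet) {
    for (Cell cell : r) {
     if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
      sb.append(cell.getStringCellValue());
      sb.append(" ");
     } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
      if (isXls) {
       // HSSF numeric cells have no string value, so format them explicitly.
       sb.append(formatter.formatCellValue(cell));
      } else {
       sb.append(cell.getStringCellValue());
      }
      sb.append(" ");
     }
    }
   }
  }
  return sb;
 }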

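For Excel, the biggest problem my tool faced was running out of memory: reading certain very large Excel files regularly ended in an out-of-memory error. Eventually I found an excellent library that reads xlsx files as a stream, working through a very large file in small pieces instead of loading it all at once.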
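To give a feel for why streaming keeps memory flat, here is a sketch of the general idea using the JDK's StAX pull parser. It is not the streaming library's actual implementation. It assumes a heavily simplified worksheet XML in which each cell value sits in a <v> element; a real xlsx sheet has more structure, and string cells usually point into a separate shared strings table. SheetXmlStreamer and streamValues are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SheetXmlStreamer {

    /** Collects the text of every <v> (cell value) element, one pull event at a time. */
    public static String streamValues(InputStream sheetXml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(sheetXml);
        StringBuilder sb = new StringBuilder();
        try {
            while (reader.hasNext()) {
                // Only the current event is held by the parser; no document tree is built.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "v".equals(reader.getLocalName())) {
                    sb.append(reader.getElementText()).append(' ');
                }
            }
        } finally {
            reader.close();
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2.5</v></c></row>"
                + "<row r=\"2\"><c r=\"A2\"><v>42</v></c></row>"
                + "</sheetData></worksheet>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(streamValues(in)); // prints "1 2.5 42"
    }
}
```

A tree-based reader has to materialize every row before the first cell can be inspected, whereas a pull parser like this one only ever holds the current element, which is the property that matters for very large sheets.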

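Another optimization: wherever a File object is available, I read from the File rather than from an InputStream, because reading from an InputStream requires loading all of it into memory, which is very memory-hungry.

Finally, one small trick: I use cell.getCellType() to cut down the amount of data, since I only need the string content of text and numeric cells.

That covers what I explored and discovered while reading files with POI; I hope it helps. These examples come from one of my own tools (it lets you run full-text search over the contents of the files on your computer). Take a look if you are interested; stars and PRs are welcome.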

