Splitting Files and Simple Content Filtering in Java

程序员文章站 2024-03-08 12:04:58


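I. Background

Last year a project required me to wrap an arbitrary file in an XML file without altering the file's content, and to be able to restore the original file from that XML later. Small files are easy to handle, but a large XML file, say several GB, basically ends in an OutOfMemoryError, so extracting the data directly from a node is impractical. I therefore took a stream-based approach, and that is how this file-cutting tool came about.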

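II. Use cases

Possible uses for this tool:

1. Splitting/cutting any file. Given a start position and an end position, it cuts out the part between the two using a byte stream.

2. Prepending/appending a given string to any file. This makes it easy to embed a file inside an XML node.

3. Simple text extraction. You can pull out the text you want according to rules you define, and post-process the extracted text (this is plain text extraction only, nothing intelligent or sophisticated T_T).

4. Text filtering. Strips out specified text according to rules you define.

The whole tool is only a thin wrapper over Java's file I/O API and does not use NIO. Where high-throughput file processing is required, think twice before using it. The aim of this article is to present one possible solution; if you have a better one, suggestions are welcome.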

III. How to use it

Before anything else, let's see how it is used.

1. Reading a specific part of a file

Read the content between bytes 0 and 1048.

  public void readAsBytes(){
    FileExtractor cuter = new FileExtractor();
    byte[] bytes = cuter.from("d:\\11.txt").start(0).end(1048).readAsBytes();
  }

2. Splitting a file

Cut the part between bytes 0 and 1048 into a new file.

  public File splitAsFile(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("d:\\11.txt").to("d:\\22.txt").start(0).end(1048).extractAsFile();
  }

3. Embedding a file in an XML node

Write the entire content of the file into an XML file as the body node. The newly generated XML file object is returned.

  public File appendText(){

    FileExtractor cuter = new FileExtractor();
    return cuter.from("d:\\11.txt").to("d:\\44.xml").appendAsFile("<document><body>", "</body></document>");

  }

4. Reading and processing specific content in a file

Suppose the requirement is: read the first three lines of 11.txt. Lines 1 and 2 must not contain the character "帅" ("handsome"), and the string "我好帅!" ("I'm so handsome!") must be appended to line 3.

  public String extractText(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("d:\\11.txt").extractAsString(new EasyProcesser() {
      @Override
      public String finalStep(String line, int lineNumber, Status status) {

        if(lineNumber==3){
          status.shouldContinue = false;// stop reading the file after this line
          return line+"我好帅!";
        }
        return line.replaceAll("帅","");
      }
    });

  }

5. Simple text filtering

Remove every "bug" from a file and return the processed content as a new file.

  public File killBugs(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("d:\\bugs.txt").to("d:\\nobug.txt").extractAsFile(new EasyProcesser() {
      @Override
      public String finalStep(String line, int lineNumber, Status status) {
        return line.replaceAll("bug", "");
      }
    });
  }

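IV. Basic flow

Interface callbacks separate reading a file from processing its content. An IteratorFile class is responsible for traversing a file and reading its content, and it can traverse either by bytes or by lines. The rest of this article likewise covers reading and processing separately.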

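V. Reading files

Defining the callback interface

Define an interface Process that exposes two content-handling methods: one for reading by bytes and one for reading by lines.

public interface Process{

  /**
   * @param b the data read in this pass
   * @param length the number of valid bytes read in this pass
   * @param currentIndex the position reached so far
   * @param available the total length of the file being read
   * @return true to continue reading the file, false to stop
   * @time 2017-01-22 16:56:41
   */
  public boolean doWhat(byte[] b,int length,int currentIndex,int available);

  /**
   *
   * @param line the line read in this pass
   * @param currentIndex the line number
   * @return true to continue reading the file, false to stop
   * @time 2017-01-22 16:59:03
   */
  public boolean doWhat(String line,int currentIndex);
}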
IteratorFile itself implements this interface, but by default both methods simply return true and do nothing, as shown below:

public class IteratorFile implements Process
{
......
  /**
   * Traverses the file content by bytes; override as needed
   */
  @Override
  public boolean doWhat(byte[] b, int length,int currentIndex,int available) {
    return true;
  }

  /**
   * Traverses the file content by lines; override as needed
   */
  @Override
  public boolean doWhat(String line,int currentIndex) {
    return true;
  }
......
}

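Traversing file content by bytes

This implements byte-wise traversal (reading) of a file. skip() controls the position from which reading starts. Then, while reading from the file stream, each chunk that is read is passed to the callback method. Note that each chunk is stored in a byte array, bytes, and the number of bytes actually read must also be passed to the callback. It is easy to see that as soon as doWhat() returns false, reading stops immediately.

  public void iterator2Bytes(){
    init();
    int length = -1;
    FileInputStream fis = null;
    try {
      file = new File(in);
      fis = new FileInputStream(file);
      // available() returns an int estimate, so it is not reliable for very large files
      available = fis.available();
      // note: skip() may skip fewer bytes than requested; the return value is ignored here
      fis.skip(getStart());
      readedIndex = getStart();
      if (!beforeItrator()) return;
      while ((length=fis.read(bytes))!=-1) {
        readedIndex+=length;
        if(!doWhat(bytes, length,readedIndex,available)){
          break;
        }
      }
      if(!afterItrator()) return;
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }finally{
      try {
        // fis is still null if the file could not be opened
        if (fis != null) fis.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }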
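
The read loop can also be tried in isolation. The following is a minimal sketch, not the author's IteratorFile: ByteIterationSketch, ChunkHandler and iterate() are hypothetical names, and it reads from any InputStream so that it can run against an in-memory array. One difference is deliberate: because InputStream.skip() may skip fewer bytes than requested, the sketch keeps skipping until the start offset is reached.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteIterationSketch {

    /** Hypothetical callback, analogous to the byte-oriented doWhat(). */
    interface ChunkHandler {
        /** @return true to keep reading, false to stop */
        boolean onChunk(byte[] buf, int length, long readSoFar);
    }

    /** Skips start bytes, then feeds chunks to the handler until it returns false or EOF. */
    static void iterate(InputStream in, long start, int chunkSize, ChunkHandler handler) {
        try (InputStream stream = in) {
            long remaining = start;
            while (remaining > 0) {
                long skipped = stream.skip(remaining);
                if (skipped <= 0) {
                    // skip() made no progress: read one byte to detect EOF
                    if (stream.read() == -1) return;
                    skipped = 1;
                }
                remaining -= skipped;
            }
            byte[] buf = new byte[chunkSize];
            long readSoFar = start; // counts skipped bytes too, like readedIndex
            int length;
            while ((length = stream.read(buf)) != -1) {
                readSoFar += length;
                if (!handler.onChunk(buf, length, readSoFar)) break;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        StringBuilder seen = new StringBuilder();
        iterate(new ByteArrayInputStream(data), 2, 4, (buf, len, total) -> {
            seen.append(new String(buf, 0, len, StandardCharsets.US_ASCII));
            return total < 8; // stop once 8 bytes of the source have been consumed
        });
        System.out.println(seen);
    }
}
```

For a real file, a FileInputStream can be passed as the first argument; the try-with-resources block closes it even when the handler stops early.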

Traversing file content by lines

This is the usual way of reading a file: inside the while loop, the callback method is invoked and passed the relevant data.

  public void iterator2Line(){
    init();
    BufferedReader reader = null;
    FileReader read = null;
    String line = null;
    try {
      file = new File(in);
      read = new FileReader(file);
      reader = new BufferedReader(read);
      if (!beforeItrator()) return;
      while ( null != (line=reader.readLine())) {
        readedIndex++;
        if(!doWhat(line,readedIndex)){
          break;
        }
      }
      if(!afterItrator()) return ;
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }finally{
      try {
        // closing the BufferedReader also closes the wrapped FileReader;
        // either may still be null if the file could not be opened
        if (reader != null) reader.close();
        else if (read != null) read.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
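
The same pattern in a self-contained form, as a sketch only: LineIterationSketch, LineHandler and iterate() are hypothetical names rather than part of the author's tool, and the input is any Reader so that the example runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineIterationSketch {

    /** Hypothetical callback, analogous to the line-oriented doWhat(). */
    interface LineHandler {
        /** @return true to keep reading, false to stop */
        boolean onLine(String line, int lineNumber);
    }

    /**
     * Reads lines (numbered from 1) until the handler returns false or the
     * input ends, and returns the number of lines handed to the handler.
     */
    static int iterate(Reader source, LineHandler handler) {
        int lineNumber = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (!handler.onLine(line, lineNumber)) break;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lineNumber;
    }

    public static void main(String[] args) {
        int visited = iterate(new StringReader("a\nb\nc\nd\n"), (line, n) -> {
            System.out.println(n + ": " + line);
            return n < 3; // stop after the third line
        });
        System.out.println("visited " + visited);
    }
}
```

When reading a real text file, java.nio.file.Files.newBufferedReader(path, StandardCharsets.UTF_8) is one way to pin the encoding; a FileReader created without a charset argument uses the platform default.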

Finally, a method is needed to set the path of the source file to read.

  public IteratorFile from(String in){
    this.in = in;
    return this;
  }

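VI. Processing file content

About FileExtractor

The FileExtractor class encapsulates the operations that process file content. It relies on IteratorFile, the class that traverses the file.

Basic methods of FileExtractor

  /**
   * Inserts a string at the head and/or the tail of a file
   * @tips Do not run this method repeatedly against the same output path, or the text will be corrupted, because RandomAccessFile is used; if necessary, delete the existing file of the same name before calling
   * @param startStr the string to insert at the start of the file
   * @param endStr the string to insert at the end of the file
   * @return the newly generated file
   * @time 2017-01-22 17:05:35
   */
  public File appendAsFile(final String startStr,String endStr){}


  /**
   * Cuts the file at the specified positions
   * @tips suitable for all file types
   * @return
   * @time 2017-01-22 17:06:36
   */
  public File splitAsFile(){}


  /**
   * Special handling for text files (scenarios: text extraction, text replacement, etc.)
   * @tips Only suitable for text files; for binary files, line-break handling may leave the file unusable (for example, no longer executable).
   * @time 2017-01-22 17:09:14
   */
  public File extractAsFile(FlowLineProcesser method) {}


  /**
   * Special handling for text files (scenarios: text extraction, text replacement, etc.)
   * @tips Only suitable for text files; for binary files, line-break handling may leave the file unusable (for example, no longer executable).
   * @time 2017-01-22 17:09:14
   */
  public String extractAsString(FlowLineProcesser method) {}

  /**
   * Reads the file content at the specified positions into a byte array
   * @return
   * @time 2017-01-23 11:06:18
   */
  public byte[] readAsBytes(){}

Every method whose return type is File returns a new file containing the processed content.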
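
The body of appendAsFile() is not shown in this article, and its tip says it relies on RandomAccessFile. As a rough illustration of the same idea, here is a sketch that swaps in plain streams instead: write the head string, copy the source bytes unchanged, then write the tail string. WrapSketch and wrap() are hypothetical names, and UTF-8 for the head and tail strings is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WrapSketch {

    /** Writes head, then every byte of in unchanged, then tail, to out. */
    static void wrap(InputStream in, OutputStream out, String head, String tail) {
        try {
            out.write(head.getBytes(StandardCharsets.UTF_8));
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // only the first n bytes of buf are valid
            }
            out.write(tail.getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        wrap(new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)), out,
                "<document><body>", "</body></document>");
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

With real files, a FileInputStream and a FileOutputStream would take the place of the in-memory streams. A FileOutputStream opened on an existing path truncates it, so re-running against the same output path starts from an empty file. Note that this sketch copies bytes verbatim; content containing characters such as < or & would not yield well-formed XML, and the article does not say how the original tool deals with that.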

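Other methods

Likewise, there are methods for setting the source file location and the cut positions.

  /**
   * Sets the source file
   */
  public FileExtractor from(String in){
    this.in = in;
    return this;
  }

  /**
   * Sets where the generated temporary file is written (required by every method that returns a File)
   */
  public FileExtractor to(String out) {
    this.out = out;
    return this;
  }

  /**
   * Position where cutting starts (inclusive); required by all byte-based methods
   */
  public FileExtractor start(int start){
    this.startPos = start;
    return this;
  }

  /**
   * Position where cutting ends (inclusive); required by all byte-based methods
   */
  public FileExtractor end(int end) {
    this.endPos = end;
    return this;
  }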

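Processing content when reading by bytes

A few key points:

1. Since the file is cut by byte position, it must be traversed by bytes, so the byte-oriented doWhat() method has to be overridden. An OutputStream is created outside the callback to write out the new file.

2. Each chunk read during traversal is stored in a byte array b, but not all of b is necessarily valid data, so the valid length, length, must be passed along too.

3. readedIndex records how many bytes have been read in total, up to and including the current read.

4. When traversing by bytes, how do we tell that the end position has been reached?

When readedIndex (the total number of bytes read so far) > endPos (the end position), the current read has reached or gone past the position where it should stop, so part of the data in b may be surplus and must not be saved. The number of surplus bytes is readedIndex-endPos-1, so only the first length-(readedIndex-endPos-1) bytes of b need to be saved.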
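
A small worked example makes the arithmetic concrete. This is a sketch with a hypothetical class and method name (SliceMath, bytesToKeep); the parameter names follow the article, and positions are assumed to be 0-based with endPos inclusive.

```java
class SliceMath {

    /**
     * Number of bytes at the front of the current chunk that belong to the slice.
     *
     * @param length      bytes read in this chunk
     * @param readedIndex total bytes consumed so far, including this chunk
     * @param endPos      last byte position to keep (0-based, inclusive)
     */
    static int bytesToKeep(int length, int readedIndex, int endPos) {
        if (readedIndex > endPos) {
            int overshoot = readedIndex - endPos - 1; // bytes read past endPos
            return length - overshoot;
        }
        return length; // the whole chunk lies before or at endPos
    }

    public static void main(String[] args) {
        // A 1024-byte chunk covering positions 1024..2047, with endPos = 1048:
        // overshoot = 2048 - 1048 - 1 = 999, so 1024 - 999 = 25 bytes are kept
        // (positions 1024..1048).
        System.out.println(bytesToKeep(1024, 2048, 1048)); // prints 25
    }
}
```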

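Reading a specific part of a file:

  // Not recommended when a lot of data has to be read: a byte[] has a fixed length,
  // so accumulating several reads means copying the byte[] repeatedly, which is painfully slow.
  public byte[] readAsBytes(){

    try {
      checkIn();
    } catch (Exception e) {
      e.printStackTrace();
      return null;
    }

    // temporary container for the bytes
    final BytesBuffer buffer = new BytesBuffer();

    IteratorFile c = new IteratorFile(){
      @Override
      public boolean doWhat(byte[] b, int length, int currentIndex,
          int available) {
        // BytesBuffer is not shown in this article; the extra -1 suggests that
        // the third argument of addBytes() is an inclusive end index
        if(readedIndex>endPos){
          // endPos has been reached and the read went past it
          buffer.addBytes(b, 0, length-(readedIndex-endPos-1)-1);
          return false;
        }else{
          buffer.addBytes(b, 0, length-1);
        }
        return true;
      }
    };
    // traverse by bytes
    c.from(in).start(startPos).iterator2Bytes();

    return buffer.toBytes();

  }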

When the file is large, generating a new file is the more dependable approach. The flow is similar to returning a byte[] directly: an OutputStream is set up before reading starts, and the content read in each pass of the loop is written to the new file.

  public File splitAsFile(){
    ......
    final OutputStream os = FileUtils.openOut(file);
    try {
      IteratorFile itFile = new IteratorFile(){
        @Override
        public boolean doWhat(byte[] b, int length,int readedIndex,int available) {
          try {
            if(readedIndex>endPos){
              // endPos has been reached, overshooting by readedIndex-endPos-1 bytes
              os.write(b, 0, length-(readedIndex-endPos-1));
              return false;// stop reading
            }else{
              os.write(b, 0, length);
            }
            return true;
          } catch (IOException e) {
            e.printStackTrace();
            return false;
          }
        }
      }.from(in).start(startPos);

      itFile.iterator2Bytes();

    } catch (Exception e) {
      e.printStackTrace();
      this.tempFile = null;
    }finally{
      try {
        os.flush();
        os.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    return getTempFile();
  }
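
For comparison, here is a self-contained sketch of cutting an inclusive byte range out of a stream. It is not the author's code and uses a slightly different technique: instead of trimming the surplus after over-reading, it never asks read() for more bytes than remain in the range. SliceCopySketch and copySlice() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SliceCopySketch {

    /** Copies bytes start..end (0-based, inclusive) from in to out; returns bytes written. */
    static long copySlice(InputStream in, OutputStream out, long start, long end) {
        try {
            long skippedTotal = 0;
            while (skippedTotal < start) {
                long s = in.skip(start - skippedTotal);
                if (s <= 0) {
                    // skip() made no progress: read one byte to detect EOF
                    if (in.read() == -1) return 0;
                    s = 1;
                }
                skippedTotal += s;
            }
            long remaining = end - start + 1;
            long written = 0;
            byte[] buf = new byte[8192];
            while (remaining > 0) {
                // cap the request so the read can never go past end
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n == -1) break; // source ended before end was reached
                out.write(buf, 0, n);
                remaining -= n;
                written += n;
            }
            out.flush();
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = copySlice(new ByteArrayInputStream(data), out, 2, 5);
        System.out.println(n + " bytes: " + new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Using long for the positions is a design choice in the sketch: the int positions used elsewhere in this article top out at about 2 GB.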

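Processing content when reading by lines

First, to repeat: line-by-line traversal is only suitable for text files, unless you do not care whether each line ends with \r or \n. Take an exe file: if you traverse it by lines and write it out as a new file, a single wrong line terminator can turn the exe into an "unexe"!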
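
The effect is easy to reproduce. The sketch below (NewlineRoundTrip and rewrite() are hypothetical names) reads a string line by line and re-joins it with a single separator, the way a readLine() loop that appends a fixed line break does. BufferedReader.readLine() treats \n, \r and \r\n all as line terminators and strips them, so the original terminators cannot be recovered from the lines alone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class NewlineRoundTrip {

    /** Re-joins the lines of original, appending separator after every line. */
    static String rewrite(String original, String separator) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(original))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append(separator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "MZ\r\nabc\rdef"; // mixed terminators, no trailing line break
        String copy = rewrite(original, "\n");
        System.out.println(original.equals(copy)); // prints false
    }
}
```

Here the copy is "MZ\nabc\ndef\n": both the \r\n and the lone \r became \n, and a trailing line break was added. For a text file that is usually harmless; for binary data it corrupts the content.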

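Along the way I used:

A helper class, Status, which helps control the flow of the traversal.

An interface, FlowLineProcesser, which works like an assembly line for processing text.

Status and FlowLineProcesser complement each other: FlowLineProcesser defines the concrete steps of the pipeline, while Status controls how those steps are applied during processing.

I have asked myself many times whether this really needs to be so complicated, but I am keeping it for now...

First, the helper class Status:

public class Status{
  /**
   * Whether the start has been found; defaults to false. If true, firstStep() is no longer called for subsequent lines
   */
  public boolean overFirstStep = false;

  /**
   * Whether the end has been found; defaults to false. If true, finalStep() is no longer called for subsequent lines
   */
  public boolean overFinalStep = false;

  /**
   * Whether to keep reading the source file; the default, true, means keep reading. false means traversal stops after the current operation
   */
  public boolean shouldContinue = true;
}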

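Next, the FlowLineProcesser interface:

FlowLineProcesser is an interface that works like an assembly line. It defines two steps, corresponding to the methods firstStep() and finalStep(). Both return a String. The line that firstStep() receives is the line actually read from the file; it processes that line and hands the result to finalStep(). So the line seen by finalStep() is really the output of firstStep(), and what is ultimately returned to the main processing flow is the return value of finalStep().

public interface FlowLineProcesser{
  /**
   *
   * @param line the line that was read
   * @param lineNumber the line number, starting from 1
   * @param status the controller
   * @return
   * @time 2017-01-22 17:02:02
   */
  String firstStep(String line,int lineNumber,Status status);

  /**
   * @tips
   * @param line the line that was read (already processed by firstStep())
   * @param lineNumber the line number, starting from 1
   * @param status the controller
   * @return
   * @time 2017-01-22 17:02:09
   */
  String finalStep(String line,int lineNumber,Status status);
}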
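
The usage examples in section III subclass EasyProcesser and override only finalStep(), but that class is not shown in this article. One plausible reading is that it is a pass-through adapter for FlowLineProcesser. The sketch below is written under that assumption; the body of EasyProcesser is a guess, and Status and FlowLineProcesser are repeated only to keep the example self-contained.

```java
/** Control flags, mirroring the Status class above. */
class Status {
    public boolean overFirstStep = false;
    public boolean overFinalStep = false;
    public boolean shouldContinue = true;
}

/** Two-step line pipeline, mirroring the FlowLineProcesser interface above. */
interface FlowLineProcesser {
    String firstStep(String line, int lineNumber, Status status);
    String finalStep(String line, int lineNumber, Status status);
}

/** ASSUMPTION: a pass-through adapter; the real EasyProcesser may differ. */
class EasyProcesser implements FlowLineProcesser {
    @Override
    public String firstStep(String line, int lineNumber, Status status) {
        return line; // hand the raw line to finalStep() unchanged
    }

    @Override
    public String finalStep(String line, int lineNumber, Status status) {
        return line; // keep the line as it is
    }
}

class EasyProcesserDemo {
    public static void main(String[] args) {
        // Override only the step that matters, as the usage examples do.
        FlowLineProcesser noBugs = new EasyProcesser() {
            @Override
            public String finalStep(String line, int lineNumber, Status status) {
                return line.replace("bug", "");
            }
        };
        Status status = new Status();
        String afterFirst = noBugs.firstStep("one bug here", 1, status);
        System.out.println(noBugs.finalStep(afterFirst, 1, status));
    }
}
```

With an adapter like this, callers who need a single processing step do not have to implement both interface methods.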

Now let's see how text extraction is implemented:

Every line that is kept is accumulated in a StringBuilder. firstStep() processes the line first; its return value is passed to finalStep() for a second pass, and the result is saved. If the final result is null, nothing is saved.

  public String extractAsString(FlowLineProcesser method) {

    try {
      checkIn();
    } catch (Exception e) {
      e.printStackTrace();
      return null;
    }

    final StringBuilder builder = new StringBuilder();

    this.mMethod = method;

    new IteratorFile(){
      Status status = new Status();
      @Override
      public boolean doWhat(String line, int currentIndex) {
        // start from the raw line (the original started from ""), so that a
        // skipped firstStep() passes the line through instead of an empty string
        String lineAfterProcess = line;

        if(!status.overFirstStep){
          lineAfterProcess = mMethod.firstStep(line, currentIndex,status);
        }

        if(!status.shouldContinue){
          return false;
        }

        if(!status.overFinalStep){
          lineAfterProcess = mMethod.finalStep(lineAfterProcess,currentIndex,status);
        }

        if(lineAfterProcess!=null){
          builder.append(lineAfterProcess);
          builder.append(getLineStr());// the line separator is hard-coded here
        }

        if(!status.shouldContinue){
          return false;
        }
        return true;
      }

    }.from(in).iterator2Line();

    return builder.toString();

  }

When the text to extract is too large, a new file can be generated instead. The flow is essentially the same as when returning a String.

  public File extractAsFile(FlowLineProcesser method) {

    try {
      checkIn();
      checkOut();
    } catch (Exception e) {
      e.printStackTrace();
      return null;
    }

    this.mMethod = method;
    File file = initOutFile();
    if(file==null){
      return null;
    }

    FileWriter fileWriter = null;
    try {
      fileWriter = new FileWriter(file);
    } catch (Exception e) {
      e.printStackTrace();
      return null;
    }

    final BufferedWriter writer = new BufferedWriter(fileWriter);

    IteratorFile itFile = new IteratorFile(){
      Status status = new Status();
      @Override
      public boolean doWhat(String line, int currentIndex) {
        // start from the raw line (the original started from ""), so that a
        // skipped firstStep() passes the line through instead of an empty string
        String lineAfterProcess = line;

        if(!status.overFirstStep){
          lineAfterProcess = mMethod.firstStep(line, currentIndex,status);
        }

        if(!status.shouldContinue){
          return false;
        }

        if(!status.overFinalStep){
          lineAfterProcess = mMethod.finalStep(lineAfterProcess,currentIndex,status);
        }

        if(lineAfterProcess!=null){
          try {
            writer.write(lineAfterProcess);
            writer.newLine();// TODO the line separator is hard-coded here
          } catch (IOException e) {
            e.printStackTrace();
            return false;
          }
        }

        if(!status.shouldContinue){
          return false;
        }
        return true;

      }
    };

    itFile.from(in).iterator2Line();

    try {
      // closing the BufferedWriter flushes it and closes the wrapped FileWriter
      writer.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return getTempFile();

  }

That's it for this walkthrough. Until next time~

The code package is available for download.
