Java: All Methods for Reading and Writing Files, Including Large-File Reading
2022-07-01 16:56:09
I/O Performance
By Glen McCluskey
March 1999

This article discusses and illustrates a variety of techniques for improving Java I/O performance. Most of the techniques center around tuning disk file I/O, but some are applicable to network I/O and window output as well. The first set of techniques presented below cover low-level I/O issues, and then higher-level issues such as compression, formatting, and serialization are discussed. Note, however, the discussion does not cover application design issues, such as choice of search algorithms and data structures, nor does it discuss system-level issues such as file caching.

When discussing Java I/O, it's worth noting that the Java programming language assumes two distinct types of disk file organization. One is based on streams of bytes, the other on character sequences. In the Java language a character is represented using two bytes, not one byte as in other common languages such as C. Because of this, some translation is required to read characters from a file. This distinction is important in some contexts, as several of the examples will illustrate.

Low-Level I/O Issues
    Basic Rules for Speeding Up I/O
    Buffering
    Reading/Writing Text Files
    Formatting Costs
    Random Access

High-Level I/O Issues
    Compression
    Caching
    Tokenization
    Serialization
    Obtaining Information About Files
    Further Information

Basic Rules for Speeding Up I/O

As a means of starting the discussion, here are some basic rules on how to speed up I/O:

1. Avoid accessing the disk.
2. Avoid accessing the underlying operating system.
3. Avoid method calls.
4. Avoid processing bytes and characters individually.

These rules obviously cannot be applied in a "blanket" way, because if that were the case, no I/O would ever get done! But to see how they can be applied, consider the following three-part example that counts the number of newline bytes ('\n') in a file.
Approach 1: Read Method

The first approach simply uses the read method on a FileInputStream:

    import java.io.*;

    public class intro1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                int cnt = 0;
                int b;
                while ((b = fis.read()) != -1) {
                    if (b == '\n')
                        cnt++;
                }
                fis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

However, this approach triggers a lot of calls to the underlying runtime system, that is, FileInputStream.read, a native method that returns the next byte of the file.

Approach 2: Using a Large Buffer

The second approach avoids the above problem, by using a large buffer:

    import java.io.*;

    public class intro2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                BufferedInputStream bis = new BufferedInputStream(fis);
                int cnt = 0;
                int b;
                while ((b = bis.read()) != -1) {
                    if (b == '\n')
                        cnt++;
                }
                bis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

BufferedInputStream.read takes the next byte from the input buffer, and only rarely accesses the underlying system.
Approach 3: Direct Buffering

The third approach avoids BufferedInputStream and does buffering directly, thereby eliminating the read method calls:

    import java.io.*;

    public class intro3 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                byte buf[] = new byte[2048];
                int cnt = 0;
                int n;
                while ((n = fis.read(buf)) != -1) {
                    for (int i = 0; i < n; i++) {
                        if (buf[i] == '\n')
                            cnt++;
                    }
                }
                fis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

For a 1 MB input file, the execution times in seconds of the programs are:

    intro1    6.9
    intro2    0.9
    intro3    0.4

or about a 17 to 1 difference between the slowest and fastest.

This huge speedup doesn't necessarily prove that you should always emulate the third approach, in which you do your own buffering. Such an approach may be error-prone, especially in handling end-of-file events, if it is not carefully implemented. It may also be less readable than the alternatives. But it's useful to keep in mind where the time goes, and how it can be reclaimed when necessary.

Approach 2 is probably "right" for most applications.

Buffering

Approaches 2 and 3 use the technique of buffering, where large chunks of a file are read from disk, and then accessed a byte or character at a time. Buffering is a basic and important technique for speeding I/O, and several Java classes support buffering (BufferedInputStream for bytes, BufferedReader for characters).

An obvious question is: Will making the buffer bigger make I/O go faster? Java buffers typically are by default 1024 or 2048 bytes long. A buffer larger than this may help speed I/O, but often by only a few percent, say 5 to 10%.
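The buffer-size question above can be tested directly, because BufferedInputStream has a constructor that takes an explicit buffer size. The following is a minimal sketch, not part of the original article: the class name BufSizeDemo, the method countNewlines, and the three sizes tried are illustrative assumptions, and it uses try-with-resources, which requires Java 7 or later. Timings from a single run like this are only a rough indication, since file caching and JIT warm-up affect the later iterations.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufSizeDemo {
    // Counts newline bytes, reading through a buffer of the given size.
    public static int countNewlines(String path, int bufSize) throws IOException {
        try (InputStream in =
                 new BufferedInputStream(new FileInputStream(path), bufSize)) {
            int cnt = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n')
                    cnt++;
            }
            return cnt;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        // Hypothetical sizes chosen for illustration only.
        int[] sizes = {1024, 8192, 65536};
        for (int size : sizes) {
            long t0 = System.nanoTime();
            int cnt = countNewlines(args[0], size);
            long ms = (System.nanoTime() - t0) / 1000000L;
            System.out.println(size + " bytes: " + cnt + " newlines, " + ms + " ms");
        }
    }
}
```

Running it against the same large file several times gives a feel for how little the buffer size matters once it is past a few kilobytes.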
Approach 4: Whole File

The extreme case of buffering would be to determine the length of a file in advance, and then read in the whole file:

    import java.io.*;

    public class readfile {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                int len = (int)(new File(args[0]).length());
                FileInputStream fis = new FileInputStream(args[0]);
                byte buf[] = new byte[len];
                fis.read(buf);
                fis.close();
                int cnt = 0;
                for (int i = 0; i < len; i++) {
                    if (buf[i] == '\n')
                        cnt++;
                }
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This approach is convenient, in that a file can be treated as an array of bytes. But there's an obvious problem of possibly not having enough memory to read in a very large file.

Another aspect of buffering concerns text output to a terminal window. By default, System.out (a PrintStream) is line buffered, meaning that the output buffer is flushed when a newline character is encountered. This is important for interactivity, where you'd like to have an input prompt displayed before actually entering any input.

Approach 5: Disabling Line Buffering

But line buffering can be disabled, as in this example:

    import java.io.*;

    public class bufout {
        public static void main(String args[]) {
            FileOutputStream fdout = new FileOutputStream(FileDescriptor.out);
            BufferedOutputStream bos = new BufferedOutputStream(fdout, 1024);
            PrintStream ps = new PrintStream(bos, false);
            System.setOut(ps);
            final int N = 100000;
            for (int i = 1; i <= N; i++)
                System.out.println(i);
            ps.close();
        }
    }

This program writes the integers 1..100000 to the output, and runs about three times faster than the default equivalent that has line buffering enabled.

Buffering is also an important part of one of the examples presented below, where a buffer is used to speed up random file access.
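Returning to the whole-file approach above: InputStream.read(byte[]) is permitted to return after filling only part of the array, and a file longer than Integer.MAX_VALUE bytes cannot fit in one Java array. The sketch below is one hedged way to handle both points, not the article's own code. The class name ReadWhole and the method readAll are hypothetical; DataInputStream.readFully is a standard method that keeps reading until the array is full or throws EOFException.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadWhole {
    // Reads an entire file into memory, or fails loudly if it cannot.
    public static byte[] readAll(String path) throws IOException {
        File f = new File(path);
        long len = f.length();
        // A single Java array cannot hold more than about 2 GB.
        if (len > Integer.MAX_VALUE - 8)
            throw new IOException("file too large for one array: " + len);
        byte[] buf = new byte[(int) len];
        try (DataInputStream dis = new DataInputStream(new FileInputStream(f))) {
            // readFully loops internally until buf is completely filled.
            dis.readFully(buf);
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        byte[] data = readAll(args[0]);
        int cnt = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n')
                cnt++;
        }
        System.out.println(cnt);
    }
}
```

For files that are too large for this approach, the fixed-size buffer loop of Approach 3 remains the safer choice, since its memory use does not grow with the file.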
Reading/Writing Text Files

Earlier the idea was mentioned that method call overhead can be significant when reading characters from a file. Another example of this can be found in a program that counts the number of lines in a text file:

    import java.io.*;

    public class line1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                BufferedInputStream bis = new BufferedInputStream(fis);
                DataInputStream dis = new DataInputStream(bis);
                int cnt = 0;
                while (dis.readLine() != null)
                    cnt++;
                dis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This program uses the old DataInputStream.readLine method, which is implemented using read method calls to obtain each character. A newer approach is to say:

    import java.io.*;

    public class line2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                int cnt = 0;
                while (br.readLine() != null)
                    cnt++;
                br.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This approach can be faster. For example, on a 6 MB text file with 200,000 lines, the second program is around 20% faster than the first.

But even if the second program isn't faster, there's an important issue to note. The first program evokes a deprecation warning from the Java 2 compiler, because DataInputStream.readLine is obsolete. It does not properly convert bytes to characters, and would not be an appropriate choice for manipulating text files containing anything other than ASCII text byte streams (recall that the Java language uses the Unicode character set, not ASCII).

This is where the distinction between byte streams and character streams noted earlier comes into play.
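The byte-to-character conversion discussed above can be made explicit on the reading side by wrapping the byte stream in an InputStreamReader with a named encoding. FileReader, as used in line2, picks up the platform default encoding instead. This is a minimal sketch under stated assumptions: the class name LineCountCharset and the method countLines are hypothetical, and the default of "UTF-8" in main is only an example choice.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class LineCountCharset {
    // Counts lines, decoding bytes to characters with the named encoding.
    public static int countLines(String path, String charsetName) throws IOException {
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(path), charsetName))) {
            int cnt = 0;
            while (br.readLine() != null)
                cnt++;
            return cnt;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: LineCountCharset file [charset]");
            System.exit(1);
        }
        // Assumed default for illustration; pass the file's real encoding.
        String cs = args.length > 1 ? args[1] : "UTF-8";
        System.out.println(countLines(args[0], cs));
    }
}
```

Naming the encoding makes the program behave the same on machines with different default locales, at no extra I/O cost.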
A program such as:

    import java.io.*;

    public class conv1 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("out1");
                PrintStream ps = new PrintStream(fos);
                ps.println("\uffff\u4321\u1234");
                ps.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

writes an output file, but without preserving the Unicode characters that are actually output. The Reader/Writer I/O classes are character-based, and are designed to resolve this issue. OutputStreamWriter is where the encoding of characters to bytes is applied.

A program that uses PrintWriter to write out Unicode characters looks like this:

    import java.io.*;

    public class conv2 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("out2");
                OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
                PrintWriter pw = new PrintWriter(osw);
                pw.println("\uffff\u4321\u1234");
                pw.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This program uses the UTF8 encoding, which has the property of encoding ASCII text as itself, and other characters as two or three bytes.

Formatting Costs

Actually writing data to a file is only part of the cost of output. Another significant cost is data formatting.
Consider a three-part example, one that writes out lines like:

    The square of 5 is 25

Approach 1

The first approach is simply to write out a fixed string, to get an idea of the intrinsic I/O cost:

    public class format1 {
        public static void main(String args[]) {
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = "The square of 5 is 25\n";
                System.out.print(s);
            }
        }
    }

Approach 2

The second approach employs simple formatting using "+":

    public class format2 {
        public static void main(String args[]) {
            int n = 5;
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = "The square of " + n + " is " + n * n + "\n";
                System.out.print(s);
            }
        }
    }

Approach 3

The third approach uses the MessageFormat class from the java.text package:

    import java.text.*;

    public class format3 {
        public static void main(String args[]) {
            MessageFormat fmt =
                new MessageFormat("The square of {0} is {1}\n");
            Object values[] = new Object[2];
            int n = 5;
            values[0] = new Integer(n);
            values[1] = new Integer(n * n);
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = fmt.format(values);
                System.out.print(s);
            }
        }
    }

These programs produce identical output. The running times are:

    format1    1.3
    format2    1.8
    format3    7.8

or about a 6 to 1 difference between the slowest and fastest. The third program would be even slower if the format had not been precompiled and the static convenience method had been used instead:

Approach 4

    MessageFormat.format(String, Object[])

as in:

    import java.text.*;

    public class format4 {
        public static void main(String args[]) {
            String fmt = "The square of {0} is {1}\n";
            Object values[] = new Object[2];
            int n = 5;
            values[0] = new Integer(n);
            values[1] = new Integer(n * n);
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = MessageFormat.format(fmt, values);
                System.out.print(s);
            }
        }
    }

which takes 1/3 longer than the previous example.
The fact that approach 3 is quite a bit slower than approaches 1 and 2 doesn't mean that you shouldn't use it. But you need to be aware of the cost in time.

Message formats are quite important in internationalization contexts, and an application concerned about this issue might typically read the format from a resource bundle, and then use it.

Random Access

RandomAccessFile is a Java class for doing random access I/O (at the byte level) on files. The class provides a seek method, similar to that found in C/C++, to move the file pointer to an arbitrary location, from which point bytes can then be read or written.

The seek method accesses the underlying runtime system, and as such, tends to be expensive. One cheaper alternative is to set up your own buffering on top of a RandomAccessFile, and implement a read method for bytes directly. The parameter to read is the byte offset >= 0 of the desired byte. An example of how this is done is:

    import java.io.*;

    public class ReadRandom {
        private static final int DEFAULT_BUFSIZE = 4096;

        private RandomAccessFile raf;
        private byte inbuf[];
        private long startpos = -1;
        private long endpos = -1;
        private int bufsize;

        public ReadRandom(String name) throws FileNotFoundException {
            this(name, DEFAULT_BUFSIZE);
        }

        public ReadRandom(String name, int b) throws FileNotFoundException {
            raf = new RandomAccessFile(name, "r");
            bufsize = b;
            inbuf = new byte[bufsize];
        }

        // Returns the byte at offset pos as a value 0..255, or -1 at
        // end of file or on error.
        public int read(long pos) {
            if (pos < startpos || pos > endpos) {
                // Buffer miss: load the block that contains pos.
                long blockstart = (pos / bufsize) * bufsize;
                int n;
                try {
                    raf.seek(blockstart);
                    n = raf.read(inbuf);
                } catch (IOException e) {
                    return -1;
                }
                startpos = blockstart;
                endpos = blockstart + n - 1;
                if (pos < startpos || pos > endpos)
                    return -1;
            }
            // Mask with 0xff so that bytes >= 0x80 are not sign-extended.
            return inbuf[(int)(pos - startpos)] & 0xff;
        }

        public void close() throws IOException {
            raf.close();
        }

        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                ReadRandom rr = new ReadRandom(args[0]);
                long pos = 0;
                int c;
                byte buf[] = new byte[1];
                while ((c = rr.read(pos)) != -1) {
                    pos++;
                    buf[0] = (byte)c;
                    System.out.write(buf, 0, 1);
                }
                rr.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

The driver program simply reads the bytes in sequence and writes them out.

This technique is helpful if you have locality of access, where nearby bytes in the file are read at about the same time. For example, if you are implementing a binary search scheme on a sorted file, this approach might be useful. It's of less value if you are truly doing random access at arbitrary points in a large file.

Compression

Java provides classes for compressing and uncompressing byte streams. These are found in the java.util.zip package, and also serve as the basis for Jar files (a Jar file is a Zip file with an added manifest).

The following program takes a single input file, and writes a compressed output Zip file, with a single entry representing the input file:

    import java.io.*;
    import java.util.zip.*;

    public class compress {
        public static void doit(String filein, String fileout) {
            FileInputStream fis = null;
            FileOutputStream fos = null;
            try {
                fis = new FileInputStream(filein);
                fos = new FileOutputStream(fileout);
                ZipOutputStream zos = new ZipOutputStream(fos);
                ZipEntry ze = new ZipEntry(filein);
                zos.putNextEntry(ze);
                final int BUFSIZ = 4096;
                byte inbuf[] = new byte[BUFSIZ];
                int n;
                while ((n = fis.read(inbuf)) != -1)
                    zos.write(inbuf, 0, n);
                fis.close();
                fis = null;
                zos.close();
                fos = null;
            } catch (IOException e) {
                System.err.println(e);
            } finally {
                try {
                    if (fis != null)
                        fis.close();
                    if (fos != null)
                        fos.close();
                } catch (IOException e) {
                }
            }
        }

        public static void main(String args[]) {
            if (args.length != 2) {
                System.err.println("missing filenames");
                System.exit(1);
            }
            if (args[0].equals(args[1])) {
                System.err.println("filenames are identical");
                System.exit(1);
            }
            doit(args[0], args[1]);
        }
    }

The next program reverses the process, taking an input Zip file that is assumed to have a
single entry in it, and uncompresses that entry to the output file:

    import java.io.*;
    import java.util.zip.*;

    public class uncompress {
        public static void doit(String filein, String fileout) {
            FileInputStream fis = null;
            FileOutputStream fos = null;
            try {
                fis = new FileInputStream(filein);
                fos = new FileOutputStream(fileout);
                ZipInputStream zis = new ZipInputStream(fis);
                ZipEntry ze = zis.getNextEntry();
                final int BUFSIZ = 4096;
                byte inbuf[] = new byte[BUFSIZ];
                int n;
                while ((n = zis.read(inbuf, 0, BUFSIZ)) != -1)
                    fos.write(inbuf, 0, n);
                zis.close();
                fis = null;
                fos.close();
                fos = null;
            } catch (IOException e) {
                System.err.println(e);
            } finally {
                try {
                    if (fis != null)
                        fis.close();
                    if (fos != null)
                        fos.close();
                } catch (IOException e) {
                }
            }
        }

        public static void main(String args[]) {
            if (args.length != 2) {
                System.err.println("missing filenames");
                System.exit(1);
            }
            if (args[0].equals(args[1])) {
                System.err.println("filenames are identical");
                System.exit(1);
            }
            doit(args[0], args[1]);
        }
    }

Whether compression helps or hurts I/O performance depends a lot on your local hardware setup; specifically the relative speeds of the processor and disk drives. Compression using Zip technology implies typically a 50% reduction in data size, but at the cost of some time to compress and decompress. An experiment with large (5 to 10 MB) compressed text files, using a 300-MHz Pentium PC with IDE disk drives, showed an elapsed time speedup of around 1/3 in reading compressed files from disk, over reading uncompressed ones.

An example of where compression is useful is in writing to very slow media such as floppy disks. A test using a fast processor (300 MHz Pentium) and a slow floppy (the conventional floppy drive found on PCs), showed that compressing a large text file and then writing to the floppy drive results in a speedup of around 50% over simply copying the file directly to the floppy drive.

Caching

A detailed discussion of hardware caching is beyond the scope of this paper.
But sometimes software caching can be used to speed up I/O. Consider a case where you want to read lines of a text file in random order. One way to do this is to read in all the lines, and store them in an ArrayList (a collection class similar to Vector):

    import java.io.*;
    import java.util.ArrayList;

    public class LineCache {
        private ArrayList list = new ArrayList();

        public LineCache(String fn) throws IOException {
            FileReader fr = new FileReader(fn);
            BufferedReader br = new BufferedReader(fr);
            String ln;
            while ((ln = br.readLine()) != null)
                list.add(ln);
            br.close();
        }

        public String getLine(int n) {
            if (n < 0)
                throw new IllegalArgumentException();
            return (n < list.size() ? (String)list.get(n) : null);
        }

        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                LineCache lc = new LineCache(args[0]);
                int i = 0;
                String ln;
                while ((ln = lc.getLine(i++)) != null)
                    System.out.println(ln);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

The getLine method is then used to retrieve an arbitrary line. This technique is quite useful, but obviously uses a lot of memory for large files, and so has limitations. An alternative might be to simply remember the last 100 lines that were requested, and read from the disk for any other requests. This scheme works well if there is locality of access of the lines, but not so well if line requests are truly random.

Tokenization

Tokenization refers to the process of breaking byte or character sequences into logical chunks, for example words.
Java offers a StreamTokenizer class, that operates like this:

    import java.io.*;

    public class token1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                StreamTokenizer st = new StreamTokenizer(br);
                st.resetSyntax();
                st.wordChars('a', 'z');
                int tok;
                while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
                    if (tok == StreamTokenizer.TT_WORD)
                        ; // st.sval has token
                }
                br.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This example tokenizes in terms of lower-case words (letters a-z). If you implement the equivalent yourself, it might look like:

    import java.io.*;

    public class token2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                int maxlen = 256;
                int currlen = 0;
                char wordbuf[] = new char[maxlen];
                int c;
                do {
                    c = br.read();
                    if (c >= 'a' && c <= 'z') {
                        if (currlen == maxlen) {
                            maxlen *= 1.5;
                            char xbuf[] = new char[maxlen];
                            System.arraycopy(wordbuf, 0, xbuf, 0, currlen);
                            wordbuf = xbuf;
                        }
                        wordbuf[currlen++] = (char)c;
                    } else if (currlen > 0) {
                        String s = new String(wordbuf, 0, currlen);
                        // do something with s
                        currlen = 0;
                    }
                } while (c != -1);
                br.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

The second program runs about 20% faster than the first, at the expense of having to write some tricky low-level code.

StreamTokenizer is sort of a hybrid class, in that it will read from character-based streams (like BufferedReader), but at the same time operates in terms of bytes, treating all characters with two-byte values (greater than 0xff) as though they are alphabetic characters.

Serialization

Serialization is used to convert arbitrary Java data structures into byte streams, using a standardized format.
For example, the following program writes out an array of random integers:

    import java.io.*;
    import java.util.*;

    public class serial1 {
        public static void main(String args[]) {
            ArrayList al = new ArrayList();
            Random rn = new Random();
            final int N = 100000;
            for (int i = 1; i <= N; i++)
                al.add(new Integer(rn.nextInt()));
            try {
                FileOutputStream fos = new FileOutputStream("test.ser");
                BufferedOutputStream bos = new BufferedOutputStream(fos);
                ObjectOutputStream oos = new ObjectOutputStream(bos);
                oos.writeObject(al);
                oos.close();
            } catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

and this program reads the array back in:

    import java.io.*;
    import java.util.*;

    public class serial2 {
        public static void main(String args[]) {
            ArrayList al = null;
            try {
                FileInputStream fis = new FileInputStream("test.ser");
                BufferedInputStream bis = new BufferedInputStream(fis);
                ObjectInputStream ois = new ObjectInputStream(bis);
                al = (ArrayList)ois.readObject();
                ois.close();
            } catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

Note that we used buffering to speed the I/O operations.

Is there a faster way than serialization to write out large volumes of data, and then read it back? Probably not, except in special cases. For example, suppose that you decide to write out a 64-bit long integer as text instead of as a set of 8 bytes. The maximum length of a long integer as text is around 20 characters, or 2.5 times as long as the binary representation. So it seems likely that this format wouldn't be any faster. In some cases, however, such as bitmaps, a special format might be an improvement. However, using your own scheme does work against the standard offered by serialization, so doing so involves some tradeoffs.

Beyond the actual I/O and formatting costs of serialization (using DataInputStream and DataOutputStream), there are other costs, for example, the need to create new objects when deserializing.
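The size comparison above between the text and binary forms of a long can be checked with a few lines of code. This is a small sketch for illustration, with a hypothetical class name LongSize and hypothetical helper methods; it measures only encoded size, not speed, and it ignores any separator a text format would need between values.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LongSize {
    // Number of bytes DataOutputStream.writeLong emits for a value.
    public static int binarySize(long v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeLong(v);
        dos.flush();
        return bos.size();
    }

    // Number of characters in the decimal text form of a value.
    public static int textSize(long v) {
        return Long.toString(v).length();
    }

    public static void main(String[] args) throws IOException {
        long[] samples = {5L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long v : samples) {
            System.out.println(v + ": binary " + binarySize(v)
                + " bytes, text " + textSize(v) + " chars");
        }
    }
}
```

The binary form is always 8 bytes, while the text form ranges from 1 character for small values up to 20 for Long.MIN_VALUE, so the 2.5 times figure is the worst case rather than the typical one.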
Note also that the methods of DataOutputStream can be used to develop semi-custom data formats, for example:

    import java.io.*;
    import java.util.*;

    public class binary1 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("outdata");
                BufferedOutputStream bos = new BufferedOutputStream(fos);
                DataOutputStream dos = new DataOutputStream(bos);
                Random rn = new Random();
                final int N = 10;
                dos.writeInt(N);
                for (int i = 1; i <= N; i++) {
                    int r = rn.nextInt();
                    System.out.println(r);
                    dos.writeInt(r);
                }
                dos.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

and:

    import java.io.*;

    public class binary2 {
        public static void main(String args[]) {
            try {
                FileInputStream fis = new FileInputStream("outdata");
                BufferedInputStream bis = new BufferedInputStream(fis);
                DataInputStream dis = new DataInputStream(bis);
                int N = dis.readInt();
                for (int i = 1; i <= N; i++) {
                    int r = dis.readInt();
                    System.out.println(r);
                }
                dis.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

These programs write 10 integers to a file and then read them back.

Obtaining Information About Files

Our discussion so far has centered on input and output for individual files. But there's another aspect of speeding I/O performance, that relates to finding out properties of files. For example, consider a small program that prints the length of a file:

    import java.io.*;

    public class length1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            File f = new File(args[0]);
            long len = f.length();
            System.out.println(len);
        }
    }

The Java runtime system itself cannot know the length of a file, and so must query the underlying operating system to obtain this information. This holds true for other file information, such as whether a file is a directory, the time it was last modified, and so on. The File class in the java.io package provides a set of methods to query this information.
Such querying is in general expensive in terms of time, and should be used as little as possible.

A longer example of querying file information, one that recursively walks the file system roots to dump out a set of all the file pathnames on a system, looks like this:

    import java.io.*;

    public class roots {
        public static void visit(File f) {
            System.out.println(f);
        }

        public static void walk(File f) {
            visit(f);
            if (f.isDirectory()) {
                String list[] = f.list();
                // list() returns null if the directory cannot be read.
                if (list == null)
                    return;
                for (int i = 0; i < list.length; i++)
                    walk(new File(f, list[i]));
            }
        }

        public static void main(String args[]) {
            File list[] = File.listRoots();
            for (int i = 0; i < list.length; i++) {
                if (list[i].exists())
                    walk(list[i]);
                else
                    System.err.println("not accessible: " + list[i]);
            }
        }
    }
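Since each File query goes to the operating system, one way to follow the advice above is to query a file's properties once and keep the answers in an ordinary object. The sketch below is an assumption-laden illustration rather than anything from the original article: the class name FileInfoSnapshot and its fields are hypothetical, and the stored values become stale if the file changes after the snapshot is taken.

```java
import java.io.File;

public class FileInfoSnapshot {
    public final String path;
    public final boolean directory;
    public final long length;
    public final long lastModified;

    // Performs each OS query exactly once and stores the result.
    public FileInfoSnapshot(File f) {
        path = f.getPath();
        directory = f.isDirectory();
        // length() is unspecified for directories, so skip that query.
        length = directory ? 0L : f.length();
        lastModified = f.lastModified();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        FileInfoSnapshot info = new FileInfoSnapshot(new File(args[0]));
        // Later uses read the cached fields instead of asking the OS again.
        System.out.println(info.path + " dir=" + info.directory
            + " len=" + info.length + " mtime=" + info.lastModified);
    }
}
```

Code that sorts or filters a large list of files by size or date can then compare the cached fields, rather than calling length() or lastModified() inside every comparison.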
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/