详解Huffman编码算法之Java实现
huffman编码介绍
huffman编码处理的是字符以及字符对应的二进制的编码配对问题,分为编码和解码,目的是压缩字符对应的二进制数据长度。我们知道字符存贮和传输的时候都是二进制的(计算机只认识0/1),那么就有字符与二进制之间的mapping关系。字符属于字符集(charset), 字符需要通过编码(encode)为二进制进行存贮和传输,显示的时候需要解码(decode)回字符,字符集与编码方法是一对多关系(unicode可以用utf-8,utf-16等编码)。理解了字符集,编码以及解码,满天飞的乱码问题也就游刃而解了。以英文字母小写a为例, ascii编码中,十进制为97,二进制为01100001。ascii的每一个字符都用8个bit(1byte)编码,假如有1000个字符要传输,那么就要传输8000个bit。问题来了,英文中字母e的使用频率为12.702%,而z为0.074%,前者是后者的100多倍,但是确使用相同位数的二进制。可以做得更好,方法就是可变长度编码,指导原则就是频率高的用较短的位数编码,频率低的用较长位数编码。huffman编码算法就是处理这样的问题。
huffman编码java实现
huffman编码算法主要用到的数据结构是完全二叉树(full binary tree)和优先级队列。后者用的是java.util.priorityqueue,前者自己实现(都为内部类),代码如下:
static class tree { private node root; public node getroot() { return root; } public void setroot(node root) { this.root = root; } } static class node implements comparable<node> { private string chars = ""; private int frequence = 0; private node parent; private node leftnode; private node rightnode; @override public int compareto(node n) { return frequence - n.frequence; } public boolean isleaf() { return chars.length() == 1; } public boolean isroot() { return parent == null; } public boolean isleftchild() { return parent != null && this == parent.leftnode; } public int getfrequence() { return frequence; } public void setfrequence(int frequence) { this.frequence = frequence; } public string getchars() { return chars; } public void setchars(string chars) { this.chars = chars; } public node getparent() { return parent; } public void setparent(node parent) { this.parent = parent; } public node getleftnode() { return leftnode; } public void setleftnode(node leftnode) { this.leftnode = leftnode; } public node getrightnode() { return rightnode; } public void setrightnode(node rightnode) { this.rightnode = rightnode; } }
统计数据
既然要按频率来安排编码表,那么首先当然得获得频率的统计信息。我实现了一个方法处理这样的问题。如果已经有统计信息,那么转为map<character,integer>即可。如果你得到的信息是百分比,乘以100或1000,或10000。总是可以转为整数。比如12.702%乘以1000为12702,huffman编码只关心大小问题。统计方法实现如下:
public static map<character, integer> statistics(char[] chararray) { map<character, integer> map = new hashmap<character, integer>(); for (char c : chararray) { character character = new character(c); if (map.containskey(character)) { map.put(character, map.get(character) + 1); } else { map.put(character, 1); } } return map; }
构建树
构建树是huffman编码算法的核心步骤。思想是把所有的字符挂到一颗完全二叉树的叶子节点,任何一个非页子节点的左节点出现频率不大于右节点。算法为把统计信息转为node存放到一个优先级队列里面,每一次从队列里面弹出两个最小频率的节点,构建一个新的父node(非叶子节点), 字符内容刚弹出来的两个节点字符内容之和,频率也是它们的和,最开始的弹出来的作为左子节点,后面一个作为右子节点,并且把刚构建的父节点放到队列里面。重复以上的动作n-1次,n为不同字符的个数(每一次队列里面个数减1)。结束以上步骤,队列里面剩一个节点,弹出作为树的根节点。代码如下:
private static tree buildtree(map<character, integer> statistics, list<node> leafs) { character[] keys = statistics.keyset().toarray(new character[0]); priorityqueue<node> priorityqueue = new priorityqueue<node>(); for (character character : keys) { node node = new node(); node.chars = character.tostring(); node.frequence = statistics.get(character); priorityqueue.add(node); leafs.add(node); } int size = priorityqueue.size(); for (int i = 1; i <= size - 1; i++) { node node1 = priorityqueue.poll(); node node2 = priorityqueue.poll(); node sumnode = new node(); sumnode.chars = node1.chars + node2.chars; sumnode.frequence = node1.frequence + node2.frequence; sumnode.leftnode = node1; sumnode.rightnode = node2; node1.parent = sumnode; node2.parent = sumnode; priorityqueue.add(sumnode); } tree tree = new tree(); tree.root = priorityqueue.poll(); return tree; }
编码
某个字符对应的编码为,从该字符所在的叶子节点向上搜索,如果该字符节点是父节点的左节点,编码字符之前加0,反之如果是右节点,加1,直到根节点。只要获取了字符和二进制码之间的mapping关系,编码就非常简单。代码如下:
public static string encode(string originalstr, map<character, integer> statistics) { if (originalstr == null || originalstr.equals("")) { return ""; } char[] chararray = originalstr.tochararray(); list<node> leafnodes = new arraylist<node>(); buildtree(statistics, leafnodes); map<character, string> encodinfo = buildencodinginfo(leafnodes); stringbuffer buffer = new stringbuffer(); for (char c : chararray) { character character = new character(c); buffer.append(encodinfo.get(character)); } return buffer.tostring(); } private static map<character, string> buildencodinginfo(list<node> leafnodes) { map<character, string> codewords = new hashmap<character, string>(); for (node leafnode : leafnodes) { character character = new character(leafnode.getchars().charat(0)); string codeword = ""; node currentnode = leafnode; do { if (currentnode.isleftchild()) { codeword = "0" + codeword; } else { codeword = "1" + codeword; } currentnode = currentnode.parent; } while (currentnode.parent != null); codewords.put(character, codeword); } return codewords; }
解码
因为huffman编码算法能够保证任何的二进制码都不会是另外一个码的前缀,解码非常简单,依次取出二进制的每一位,从树根向下搜索,1向右,0向左,到了叶子节点(命中),退回根节点继续重复以上动作。代码如下:
public static string decode(string binarystr, map<character, integer> statistics) { if (binarystr == null || binarystr.equals("")) { return ""; } char[] binarychararray = binarystr.tochararray(); linkedlist<character> binarylist = new linkedlist<character>(); int size = binarychararray.length; for (int i = 0; i < size; i++) { binarylist.addlast(new character(binarychararray[i])); } list<node> leafnodes = new arraylist<node>(); tree tree = buildtree(statistics, leafnodes); stringbuffer buffer = new stringbuffer(); while (binarylist.size() > 0) { node node = tree.root; do { character c = binarylist.removefirst(); if (c.charvalue() == '0') { node = node.leftnode; } else { node = node.rightnode; } } while (!node.isleaf()); buffer.append(node.chars); } return buffer.tostring(); }
测试以及比较
以下测试huffman编码的正确性(先编码,后解码,包括中文),以及huffman编码与常见的字符编码的二进制字符串比较。代码如下:
public static void main(string[] args) { string oristr = "huffman codes compress data very effectively: savings of 20% to 90% are typical, " + "depending on the characteristics of the data being compressed. 中华崛起"; map<character, integer> statistics = statistics(oristr.tochararray()); string encodedbinaristr = encode(oristr, statistics); string decodedstr = decode(encodedbinaristr, statistics); system.out.println("original sstring: " + oristr); system.out.println("huffman encoed binary string: " + encodedbinaristr); system.out.println("decoded string from binariy string: " + decodedstr); system.out.println("binary string of utf-8: " + getstringofbyte(oristr, charset.forname("utf-8"))); system.out.println("binary string of utf-16: " + getstringofbyte(oristr, charset.forname("utf-16"))); system.out.println("binary string of us-ascii: " + getstringofbyte(oristr, charset.forname("us-ascii"))); system.out.println("binary string of gb2312: " + getstringofbyte(oristr, charset.forname("gb2312"))); } public static string getstringofbyte(string str, charset charset) { if (str == null || str.equals("")) { return ""; } byte[] bytearray = str.getbytes(charset); int size = bytearray.length; stringbuffer buffer = new stringbuffer(); for (int i = 0; i < size; i++) { byte temp = bytearray[i]; buffer.append(getstringofbyte(temp)); } return buffer.tostring(); } public static string getstringofbyte(byte b) { stringbuffer buffer = new stringbuffer(); for (int i = 7; i >= 0; i--) { byte temp = (byte) ((b >> i) & 0x1); buffer.append(string.valueof(temp)); } return buffer.tostring(); }
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持。