数据压缩_实验三_LZW编解码器

程序员文章站 2022-03-23 11:11:06

...

文章目录

LZW编解码算法实验

一、实验原理

1、基本思想
2、LZW编码原理
3、LZW解码原理
4、数据结构分析

二、主要模块分析

1、词典初始化
2、编码中查找字符串是否存在词典内
3、添加新字符串至词典
4、解码时存储字符串
5、LZW编码
6、LZW解码

三、实验结果

1、编码简单字符串
2、编码其他文件类型及压缩率分析

LZW编解码算法实验

一、实验原理

1、基本思想

LZW的编码思想是不断地从字符流中提取新的字符串，通俗地理解为新“词条”，然后用“代号”也就是码字表示这个“词条”。这样一来，对字符流的编码就变成了用码字去替换字符流，生成码字流，从而达到压缩数据的目的。LZW编码是围绕称为词典的转换表来完成的。

LZW编码器和解码器会在接受数据时动态地创建字典，编码器和解码器也会产生相同的字典。

2、LZW编码原理

数据压缩_实验三_LZW编解码器

注意：如果待压缩的数据没有任何可重复的结构，那么使用字典条目中的新编码的可能性很小，这会导致数据膨胀而不是压缩。

3、LZW解码原理

数据压缩_实验三_LZW编解码器

注意：若出现形如“abba”的特殊情况，解码时CW可能不存在于当前词典中，但观察此类情况可以发现，索引对应字符串具有最后字符为前缀字符串首字符特点。对此，可依据流程图中第二分支进行操作。x

4、数据结构分析

树是动态建立的，且树种每个节点可能存在多个子节点。因此数据结构应设计成一个节点可拥有任意个子节点，但无需为其预留空间。

树用数组dict[]表示，数组下标用pointer表示，所以dict[pointer]表示一个节点。树的结构体定义如下：

struct {
	int suffix;
	int parent, firstchild, nextsibling;
} dictionary[MAX_CODE + 1];

二、主要模块分析

1、词典初始化

void InitDictionary(void) {
	int i;
	// 将单个字符以Ascii形式存入词典
	for (i = 0; i < 256; i++) {
		dictionary[i].suffix = i;
		dictionary[i].parent = -1;	// 无母节点
		dictionary[i].firstchild = -1;	// 无孩子节点
		dictionary[i].nextsibling = i + 1;
	}
	dictionary[255].nextsibling = -1;
	next_code = 256;	// 下一个新增结点索引
}

2、编码中查找字符串是否存在词典内

三种情况：

无前缀字符，即字符串内仅单个字符，直接返回尾缀字符；
有前缀字符，找出前缀字符对应的第一个孩子节点，逐个匹配兄弟节点直至找到为止，返回索引值；
有前缀字符，上一种情况遍历所有兄弟节点后都没有可以匹配的兄弟节点。返回-1，即字符串不存在词典内。

int InDictionary(int character, int string_code) {
	int sibling;
	if (0 > string_code) return character; //没有前缀，单个字符
	sibling = dictionary[string_code].firstchild;
	while (-1 < sibling) {
		if (character == dictionary[sibling].suffix) return sibling;
		sibling = dictionary[sibling].nextsibling;
	}
	return -1;
}

3、添加新字符串至词典

void AddToDictionary(int character, int string_code) {
	int firstsibling, nextsibling;
	if (0 > string_code) return;
	//设置新结点，从256开始
	dictionary[next_code].suffix = character;// 尾缀字符W
	dictionary[next_code].parent = string_code;// 前缀字符p
	dictionary[next_code].nextsibling = -1;
	dictionary[next_code].firstchild = -1;
	// 前缀字符p的第一个孩子结点
	firstsibling = dictionary[string_code].firstchild;

	if (-1 < firstsibling) {	// the parent has child
		nextsibling = firstsibling;
		// 找到最后一个兄弟结点
		while (-1 < dictionary[nextsibling].nextsibling)
			nextsibling = dictionary[nextsibling].nextsibling;
    // 将创建的新节点连入树中
		dictionary[nextsibling].nextsibling = next_code;
	}
	else {// no child before, modify it to be the first
		dictionary[string_code].firstchild = next_code;
	}
	next_code++;
}

4、解码时存储字符串

由于编码后文件存储字符串的索引，故输出对应字符串时应先自叶节点至根结点倒序将逐个字符存入d_stack[]（输出时由数组最后一个成员开始打印）。

int DecodeString(int start, int code) {
	int count;
	count = start;
	while (0 <= code) {	// 由叶节点回溯至第一个字符
		d_stack[count] = dictionary[code].suffix;
		code = dictionary[code].parent;	
		count++;
	}
	return count;
}

5、LZW编码

void LZWEncode(FILE* fp, BITFILE* bf) {
	int character;
	int string_code;
	int index;
	unsigned long file_length;
	//用fseek函数把位置指针移到文件尾
	//再用ftell函数获得这时位置指针距文件头的字节数
	//这个字节数就是文件的长度
	fseek(fp, 0, SEEK_END);
	file_length = ftell(fp); //ftell用来获取文件的当前读写位置
	fseek(fp, 0, SEEK_SET);
	BitsOutput(bf, file_length, 4 * 8);
	InitDictionary(); // 词典初始化
	string_code = -1;
  // 循环至数据流结束
	while (EOF != (character = fgetc(fp))) {
		index = InDictionary(character, string_code);	// 检查字符串是否存在词典内
		if (0 <= index) {	// string+character in dictionary
			string_code = index;	// P <- P + C
		}
		else {	// string+character not in dictionary
			output(bf, string_code);	// 输出前缀字符对应索引
			if (MAX_CODE > next_code) {	// free space in dictionary
				// add string+character to dictionary
				AddToDictionary(character, string_code);
			}
			string_code = character;	// P <- C
		}
	}
  // 输出最后一个字符串
	output(bf, string_code);
}

6、LZW解码

void LZWDecode(BITFILE* bf, FILE* fp) {
	int character;
	int new_code, last_code;
	int phrase_length;
	unsigned long file_length;
  
	file_length = BitsInput(bf, 4 * 8);	// 获取文件长度
	if (-1 == file_length) file_length = 0;
	InitDictionary(); // 字典初始化
	last_code = -1;
	while (file_length > 0) {
		new_code = input(bf);	// 读入一个新字符
		if (new_code<next_code) {	// 存在字典中
			phrase_length = DecodeString(0, new_code);	// 将字符串压入堆栈
		}
		else {	// 不存在字典中 abba情况
			d_stack[0] = character;	// 将上一个字符串末尾字符作为作为新字符串末尾字符
			phrase_length = DecodeString(1, last_code);
		}
		character = d_stack[phrase_length - 1];	// 每次循环均更新character为上一个字符串末尾字符
		while (phrase_length > 0) {	//	从堆栈逐个压出字符，打印至输出文件
			phrase_length--;
			fputc(d_stack[phrase_length], fp);
			file_length--;
		}
		if (next_code < MAX_CODE) {	// add the new phrase to dictionary
			AddToDictionary(character, last_code);
		}
		last_code = new_code;	// 	PW <- CW
	}
}

三、实验结果

1、编码简单字符串

编码前
编码后
解码后

2、编码其他文件类型及压缩率分析

数据压缩_实验三_LZW编解码器

文件类型	编码前/KB	编码后/KB	压缩率
png（图像）	930	1151	123.76%
jpg（图像）	67	98	146.27%
avi（视频）	1162	1388	119.45%
cif（视频）	3564	2111	59.23%
pdf（文档）	18184	18330	100.80%
ppt（文档）	1139	1228	107.81%
txt（文档）	2288	1434	62.67%
doc（文档）	636	139	21.86%
xls（文档）	87	39	44.83%
caj（文档）	5438	6329	116.38%

由表可知，对于不同文件类型，LZW编码的压缩率各不相同。其中，对于doc、txt、xls等文本文件，LZW编码压缩效果最好。对于图像、视频文件，LZW编码压缩效果较差。

分析原因，LZW编码基于重复字符，对于数据中重复率较低的文件，需要不断创建大量新词条存入词典，实际输出压缩文件即便存储的是词条索引，也与源文件数据量差别不大。

与此同时，程序中的LZW编解码器，读入8bit输出16bit索引，在压缩效率不高的情况下，压缩后文件大小反而可能增加。

数据压缩_实验三_LZW编解码器

文章目录

LZW编解码算法实验

一、实验原理

1、基本思想

2、LZW编码原理

3、LZW解码原理

4、数据结构分析

二、主要模块分析

1、词典初始化

2、编码中查找字符串是否存在词典内

3、添加新字符串至词典

4、解码时存储字符串

5、LZW编码

6、LZW解码

三、实验结果

1、编码简单字符串

2、编码其他文件类型及压缩率分析

数据压缩实验三：Huffman编解码

数据压缩实验三 Huffman编解码算法实现与压缩效率分析

数据压缩实验报告3 LZW编解码

数据压缩学习实验（三）PNG文件读取及转换至YUV色彩空间的C++实现