信息安全的一个实验——构建信源模型

程序员文章站 2022-03-31 08:08:20

...

信息安全的一个实验——构建信源模型

这篇博客主要讲述了编者在校用java做的一个实验，是用java来构建信源模。主要实验内容就是用java读取一个文本中的字符串，并统计字符串中的26个字母出现的频数和频率。

实现思路

最外层用for循环该字符串的每个字符，然后里面内嵌一层for循环26个字母的，总的执行次数为26n（n表示字符串的字符个数），总的时间复杂度为O(n)。

这里用了一个普通的思路，后期应该会改进算法。

实验内容

1、随机产生一个一行五列数组，使其恰好符合信源概率的要求；
2、基于给定英文材料Types of Speech.txt，以26个英文字母为信源消息符号，构建该信源的数学模型。
1）统计26个英文字母出现的频数；
2）计算26个英文字母出现的频率，并以频率近似概率；
3）构建信源概率模型；
4）计算信源熵。

代码实现

import java.io.*;
import static java.lang.System.out;

/**
 * *构建信源模型
 *  @author Canlong
*/
public class Test2 {

    public static void main(String[] args){
        //第一小题
        array5Col();

        //第二小题
        try {
            countNum();
        } catch (IOException e){
            throw new RuntimeException("读取不到文件");
        }
    }


    /**
     * 1.随机产生一个一行五列数组，使其恰好符合信源概率的要求
     */
    public static void array5Col(){
        //1的概率为0.2，2的概率为0.3，3的概率为0.5
        int[] array = {1,1,2,2,2,3,3,3,3,3};
        //一行五列的数组
        int[][] array5 = new int[1][5];
        out.println("其信源概率为：1的概率为0.2，2的概率为0.3，3的概率为0.5。产生一个一行五列数组：");
        for(int i=0;i<array5[0].length;i++) {
            int randomNum = (int) Math.floor(Math.random() * 10);
            array5[0][i]=array[randomNum];
            out.print(array5[0][i]+"，");
        }
        //换行
        out.println();
    }

    /**
     * 2.统计文件中26个字母的频率并计算信息熵
     * @throws IOException 抛出找不到文件异常
     */
    public static void countNum() throws IOException {
        //文件路径
        String strPath  = "C:/Users/hasee/Desktop/Types of Speech.txt";
        //26个字母出现的总次数
        double sumAllNum = 0;
        //存储频率
        double[] frequency = new double[26];
        //模型的信息熵 entropy
        double infoEntr = 0.0;

        //读取文件
        BufferedReader bw = new BufferedReader(new InputStreamReader(new FileInputStream(strPath),"UTF-8"));
        //存储文件的字符串
        StringBuilder textStrBuilder = new StringBuilder();
        String line;
        while((line=bw.readLine())!= null){
            textStrBuilder.append(line);
        }
        String textStr = textStrBuilder.toString();
        out.println("要统计的字符串为：\r\n"+textStr);
        textStr = textStr.toLowerCase();
        //统计字符串各个字母的个数
        char[] textChar = textStr.toCharArray();
        //存放26个字母和对应的次数
        char[][] char26AndNum = new char[2][26];
        //将26个字母放入到字符数组
        //表示字符a的编码数
        int intA = 97;
        //表示字符z的编码数
        int intZ = 123;
        for(int i=intA;i<intZ;i++){
            char26AndNum[0][i-intA]=(char)(i);
        }
        //比较字符串和26个字母的是否相等，并且计算次数
        for(int i=0;i<textChar.length;i++){
            //法一：循环26个字母，判断是否相等
//            for(int j=0;j<char26AndNum[0].length;j++){
//                //如果字符相等，则对应的二维数组+1
//                if(Character.toString(textChar[i]).equals(Character.toString(char26AndNum[0][j]))){
//                    char26AndNum[1][j]++;
//                }
//            }
            //法二，将26个字母ASCII码-'a'作为数组下标,当字母等于那个数组下标时，直接将该元素++
            if(textChar[i] >= 'a' && textChar[i]<='z'){
                char26AndNum[1][textChar[i]-'a']++;
            }
        }
        //输出26个字母及其所对应次数，即计算频数
        for(int i=0;i<char26AndNum[1].length;i++){
            sumAllNum += (double)char26AndNum[1][i];
        }
        out.println("总次数为："+sumAllNum);
        //计算频率
        for(int i=0;i<char26AndNum[1].length;i++) {
            frequency[i] = char26AndNum[1][i] / sumAllNum;
            out.println("字母为：" + char26AndNum[0][i] + "，对应出现的次数为：" + (int) char26AndNum[1][i] + ",其频率为：" + frequency[i]);
            if (frequency[i] != 0) {
                //计算信息熵，信息熵=频率1*log2（1/频率）
                infoEntr -= frequency[i] * (Math.log(frequency[i]) / Math.log(2));
            }
        }
        out.println("信息熵为："+infoEntr);
    }

}

代码所涉及到Types of Speech.txt文件内容

Standard usage includes those words and expressions understood, used, and accepted by a majority of the speakers of a language in any situation regardless of the level of formality. As such, these words and expressions are well defined and listed in standard dictionaries. Colloquialisms, on the other hand, are familiar words and idioms that are understood by almost all speakers of a language and used in informal speech or writing, but not considered appropriate for more formal situations. Almost all idiomatic expressions are colloquial language. Slang, however, refers to words and expressions understood by a large number of speakers but not accepted as good, formal usage by the majority. Colloquial expressions and even slang may be found in standard dictionaries but will be so identified. Both colloquial usage and slang are more common in speech than in writing.Colloquial speech often passes into standard speech. Some slang also passes into standard speech, but other slang expressions enjoy momentary popularity followed by obscurity. In some cases, the majority never accepts certain slang phrases but nevertheless retains them in their collective memories. Every generation seems to require its own set of words to describe familiar objects and events. It has been pointed out by a number of linguists that three cultural conditions are necessary for the creation of a large body of slang expressions. First, the introduction and acceptance of new objects and situations in the society; second, a diverse population with a large number of subgroups; third, association among the subgroups and the majority population.Finally, it is worth noting that the terms 'standard' 'colloquial' and 'slang' exist only as abstract labels for scholars who study language. Only a tiny number of the speakers of any language will be aware that they are using colloquial or slang expressions. Most speakers of English will, during appropriate situations, select and use all three types of expressions.

运行结果

信息安全的一个实验——构建信源模型

总结

本次实验内容并没有太大的难点。但是当时还是遇到了两点内容被卡住了，导致做实验的时候浪费了一点时间。

（1）太过于纠结高效的算法，当时接触到题目的时候就想着怎么把算法的时间复杂度降下去。
当时有一种思路就是通过HashMap，存储26个字母以及它们对应的次数的，以26个字母作为key，它们对应的次数作为value，这样在for循环字符串的时候就不用循环26个字母了，只要通过26个字母的HashMap中contain()方法就可以找到该该字母，从而降低了时间复杂度。
想着以为利用HashMap可以，但是实践的时候才发现自己又忘记了HashMap里面的值不能用++自增增加的。从而放弃了这一思路。后来，又在百度上搜索有关这类问题的解决方法，大多数方法都是普通的二重for循环思路。搜索挺长时间后，发现了一个很好的思路：如果直接将26个字母变为一维数组的下标，通过数组的值进行增加，作为字母对应的频数，就可以减少存储内存和循环时间。然而在实践中发现该思路不能完美地与本次实验要求融合，从而最终也放弃了该思路，使用了普通思路。后期应该会使用那种突出的思路去改进自己实验。
（2）只是由于太久没有看信息论有关知识了，当时做实验的时间做到信息熵的时候忘记了公式，后来经过百度之后才回想起来。提醒了自己应该要回去复习信息论的知识了，不然会很容易忘光。