
Inverted Index with Hadoop


Requirements

Given N input files, build an inverted index with detailed information. The input is the Shakespeare corpus Shakespeare.tar.gz.
For example, suppose there are 4 input files:

  • d1.txt: cat dog cat fox
  • d2.txt: cat bear cat cat fox
  • d3.txt: fox wolf dog
  • d4.txt: wolf hen rabbit cat sheep
    Build an inverted index in the following format:
  • cat->3: {(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)}
  • word->number of files containing the word: {(name (or id) of a file containing the word, occurrences of the word in that file, total number of words in that file), ...}

Environment

  • Hardware: Intel® Core™ i5-8250U @ 1.8GHz / 8GB RAM
  • Software: Ubuntu 18.04 64-bit, Java 13.0.1, Hadoop 2.7.4

Inverted Index

The inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the locations where it appears in a document or a set of documents; in other words, it provides a way to find documents by their content. Because the lookup does not go from a document to the content it contains, but in the opposite direction (from a keyword to the documents that contain it), it is called an inverted index. An inverted index normally consists of a word (or phrase) together with a list of related documents (identified by document IDs, or by URIs giving the documents' locations), as illustrated below:
[Figure: an inverted index mapping each word to a list of documents]
As the figure shows, word 1 appears in {doc 1, doc 4, doc 13, ...}, word 2 appears in {doc 3, doc 5, doc 15, ...}, and word 3 appears in {doc 1, doc 8, doc 20, ...}. In addition, each document is usually given a weight that indicates how relevant it is to the search term, as illustrated below:
[Figure: document lists with a relevance weight attached to each document]
The most common weight is the term frequency, i.e., the number of times the word occurs in the document. Taking English as an example (see the figure below), the "MapReduce" row of the index file means that the word "MapReduce" occurs once in document T0, once in T1, and twice in T2. For the search terms "MapReduce", "is", and "simple", the matching sets are {T0,T1,T2} ∩ {T0,T1} ∩ {T0,T1} = {T0,T1}; that is, documents T0 and T1 contain all of the searched words, and only in T0 do they occur consecutively.
[Figure: a sample index with term-frequency weights over documents T0, T1, T2]
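
Before turning to MapReduce, the structure itself is easy to prototype. Below is a minimal, in-memory sketch (the class name SimpleInvertedIndex and its hard-coded documents are illustrative only, not part of this article's code) that builds a word-to-postings map with term-frequency weights for the four toy files from the requirements and reproduces the cat->3:{...} entry:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory sketch of an inverted index with term-frequency weights.
public class SimpleInvertedIndex {
    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1.txt", "cat dog cat fox");
        docs.put("d2.txt", "cat bear cat cat fox");
        docs.put("d3.txt", "fox wolf dog");
        docs.put("d4.txt", "wolf hen rabbit cat sheep");

        // word -> (document -> occurrences of the word in that document)
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // prints: cat -> 3 files: {d1.txt=2, d2.txt=3, d4.txt=1}
        Map<String, Integer> postings = index.get("cat");
        System.out.println("cat -> " + postings.size() + " files: " + postings);
    }
}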

Overall Approach

Two MapReduce jobs are enough to meet the requirement. The first job counts how many times each word appears in each file and the total number of words in each file; the second job takes the statistics produced by the first and assembles them into the inverted index.
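
For the four toy files above, the output of the first job would look roughly like the records below (\001 stands for the \u0001 separator used in the code; key and count are tab-separated in the actual output files):

d1.txt              4
d1.txt\001cat       2
d1.txt\001dog       1
d1.txt\001fox       1
d2.txt              5
d2.txt\001bear      1
d2.txt\001cat       3
d2.txt\001fox       1
...

After splitting on \001, a <filename, total word count> record has two fields and a <filename\001word, count> record has three; the second job's reducer relies on this length difference to tell the two record types apart.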

  • Detailed design of the first MapReduce job
    In the map phase, override the map method and use StringTokenizer to split the
    text held in value into individual words; also obtain the file name and emit
    records in two formats: <filename+word, 1> and <filename, 1>.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException 
        {
            // get the name of the file this split belongs to
            FileSplit fileSplit= (FileSplit)context.getInputSplit();
            String fileName = fileSplit.getPath().getName();
    
            // count each word's occurrences in this file and the file's total word count
            StringTokenizer itr= new StringTokenizer(value.toString());
            for(; itr.hasMoreTokens(); ) 
            {
                String word =removeNonLetters( itr.nextToken().toLowerCase());
                String fileWord = fileName+"\001"+word;
                if(!word.equals(""))
                {
                    context.write(new Text(fileWord), new IntWritable(1));
                    context.write(new Text(fileName), new IntWritable(1));
                }
            }
        }

In the reduce phase, sum the values to obtain the number of occurrences of each word in each file and the total number of words in each file, and emit <key, count>.

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
    int sum = 0;
    for (IntWritable val : values)
    {
        sum += val.get();
    }
    context.write(key,new IntWritable(sum));
}
  • Detailed design of the second MapReduce job
    The map phase reads the output of the first job, splits each value, recombines the fields, and emits records with a fixed Text key "index" and a value of filename+word+count or filename+count.
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
    String valStr = value.toString();
    String[] records = valStr.split("\t");
    context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
}

The reduce phase uses four HashMaps: Map<String,Integer> wordinfilescount, keyed by word + file name, whose value is the number of occurrences of the word in that file; Map<String,Integer> filescount, keyed by file name, whose value is the file's total word count; Map<String,Integer> wordinfiles, keyed by word, whose value is the number of files the word appears in; and Map<String,String> indexes, keyed by word, whose value is the inverted index entry. Each incoming value is split on the chosen separator. If the split yields 2 fields, the record is a file name plus that file's total word count, and it is put into filescount. If it yields 3 fields, the record is a file name plus a word plus that word's count in the file; it is put into wordinfilescount, and wordinfiles is updated to count how many files the word appears in. The reducer then iterates over wordinfilescount and, for each word, builds the value "word->number of files containing the word:{(file name, occurrences of the word in that file, total words in that file), ..." in indexes. Finally, it iterates over indexes and writes out every inverted index entry, closing each with "}".

public static class  InverseReducer extends Reducer<Text, Text, Text, NullWritable> {
    private Map<String, Integer> wordinfilescount = new HashMap<String, Integer>(); // key: word + file name, value: occurrences of the word in that file
    private Map<String, Integer> filescount = new HashMap<String, Integer>(); // key: file name, value: total number of words in the file
    private Map<String, Integer> wordinfiles = new HashMap<String, Integer>(); // key: word, value: number of files the word appears in
    private Map<String, String> indexes = new HashMap<String, String>(); // key: word, value: the inverted index entry

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // split each input record; collect per-file word counts, per-file totals, and how many files each word appears in
        for (Text val : values) {
            String valStr = val.toString();
            String[] records = valStr.split("\001");

            switch (records.length) {
                case 2:
                    filescount.put(records[0], Integer.parseInt(records[1]));
                    break;
                case 3: {
                    wordinfilescount.put(valStr, Integer.parseInt(records[2]));
                    if (!wordinfiles.containsKey(records[1])) {
                        wordinfiles.put(records[1], 1);
                    } else {
                        wordinfiles.put(records[1], wordinfiles.get(records[1]) + 1);
                    }
                }
                break;

            }

        }

        // assemble the inverted index entry for each word
        for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
            String valStr = entry.getKey();
            String[] records = valStr.split("\001");
            String word = records[1];
            if (!indexes.containsKey(word)) {
                StringBuilder sb = new StringBuilder();
                sb.append(word)
                        .append("->")
                        .append(wordinfiles.get(word))
                        .append(":")
                        .append("{(")
                        .append(records[0])
                        .append(",")
                        .append(entry.getValue())
                        .append(",")
                        .append(filescount.get(records[0]))
                        .append(")");
                indexes.put(word, sb.toString());
            } else {
                StringBuilder sb = new StringBuilder();
                sb.append(",(")
                        .append(records[0])
                        .append(",")
                        .append(entry.getValue())
                        .append(",")
                        .append(filescount.get(records[0]))
                        .append(")");
                indexes.put(word, indexes.get(word) + sb.toString());
            }
        }
        for (Entry<String, String> entry : indexes.entrySet()) {
            context.write(new Text(entry.getValue() + "}"), NullWritable.get());
        }
    }
}

Complete Source Code

import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.HashMap;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InverseIndex {

    /*
     * Mapper of the first MR job: counts each word's occurrences in a single file.
     * The input key is the line offset; the output is <filename+word, 1> or
     * <filename, 1>.
     */
    public static class statisticsMap extends
            Mapper<Object, Text, Text, IntWritable> {
        private Text mapKey = new Text("key");
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // get the name of the file this split belongs to
            FileSplit fileSplit= (FileSplit)context.getInputSplit();
            String fileName = fileSplit.getPath().getName();

            // count each word's occurrences in this file and the file's total word count
            StringTokenizer itr= new StringTokenizer(value.toString());
            for(; itr.hasMoreTokens(); ) {
                String word =removeNonLetters( itr.nextToken().toLowerCase());
                String fileWord = fileName+"\001"+word;
                if(!word.equals("")){
                    context.write(new Text(fileWord), new IntWritable(1));
                    context.write(new Text(fileName), new IntWritable(1));
                }
            }
        }
        // strip non-letter characters from a string
        public static String removeNonLetters(String original){

            StringBuffer aBuffer=new StringBuffer(original.length());
            char aCharacter;
            for(int i=0;i<original.length();i++){
                aCharacter=original.charAt(i);
                if(Character.isLetter(aCharacter)){
                    aBuffer.append(aCharacter);
                }
            }
            return new String(aBuffer);
        }
    }

    // Reducer of the first MR job: sums the counts, yielding each word's occurrences per file and each file's total word count
    public static class statisticsReduce extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key,new IntWritable(sum));
        }
    }

    public static class InverseMapper extends
            Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String valStr = value.toString();
            String[] records = valStr.split("\t");
            context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
        }
    }

    public static class  InverseReducer extends
            Reducer<Text, Text, Text, NullWritable> {
        private Map<String,Integer> wordinfilescount = new HashMap<String,Integer>(); // key: word + file name, value: occurrences of the word in that file
        private Map<String,Integer> filescount = new HashMap<String,Integer>(); // key: file name, value: total number of words in the file
        private Map<String,Integer> wordinfiles = new HashMap<String,Integer>(); // key: word, value: number of files the word appears in
        private Map<String,String> indexes = new HashMap<String,String>(); // key: word, value: the inverted index entry

        public void reduce( Text  key, Iterable<Text>  values, Context context)
                throws IOException, InterruptedException {
            // split each input record; collect per-file word counts, per-file totals, and how many files each word appears in
            for (Text val : values) {
                String valStr = val.toString();
                String[] records = valStr.split("\001");

                switch(records.length){
                    case 2:filescount.put(records[0], Integer.parseInt(records[1]));
                        break;
                    case 3:{
                        wordinfilescount.put(valStr,  Integer.parseInt(records[2]));
                        if(!wordinfiles.containsKey(records[1])){
                            wordinfiles.put(records[1], 1);
                        }else{
                            wordinfiles.put(records[1], wordinfiles.get(records[1])+1);
                        }
                    }
                    break;

                }

            }

            // assemble the inverted index entry for each word
            for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
                String valStr = entry.getKey();
                String[] records = valStr.split("\001");
                String word = records[1];
                if(!indexes.containsKey(word)){
                    StringBuilder sb = new StringBuilder();
                    sb.append(word)
                            .append("->")
                            .append(wordinfiles.get(word))
                            .append(":")
                            .append("{(")
                            .append( records[0])
                            .append(",")
                            .append(entry.getValue())
                            .append(",")
                            .append(filescount.get( records[0]))
                            .append(")");
                    indexes.put(word,sb.toString() );
                }else{
                    StringBuilder sb = new StringBuilder();
                    sb.append(",(")
                            .append( records[0])
                            .append(",")
                            .append(entry.getValue())
                            .append(",")
                            .append(filescount.get( records[0]))
                            .append(")");
                    indexes.put(word,indexes.get(word)+sb.toString() );
                }
            }
            for (Entry<String, String> entry : indexes.entrySet()) {
                context.write(new Text(entry.getValue()+"}"), NullWritable.get());
            }
        }
    }
    // Job 1: count each word's occurrences per file and each file's total word count
    public static void StatisticsTask(String[] args) throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        if (otherArgs.length < 3) {
            System.err.println("Usage: InverseIndex <in> <tmp> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "ipstatistics1");
        job.setMapperClass(statisticsMap.class);
        job.setReducerClass(statisticsReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(
                otherArgs[1]));
        job.waitForCompletion(true) ;
    }
    // Job 2: build the inverted index from the statistics produced by job 1
    public static void  InverseTask(String[] args) throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length < 3) {
            System.err.println("Usage: InverseIndex <in> <tmp> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "ipstatistics2");
        job.setMapperClass( InverseMapper.class);
        job.setReducerClass( InverseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(
                otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        StatisticsTask(args);
        InverseTask(args);
    }
}

Deployment and Execution

  • Start the Hadoop cluster
/usr/local/big_data/homework/MapReduce/InvertedIndex# start-all.sh
  • Compile the source
    Change to the working directory and compile:
/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java

An error is reported:

src/InverseIndex.java:8: error: package org.apache.hadoop.conf does not exist
import org.apache.hadoop.conf.Configuration;
...

This error occurs because the CLASSPATH environment variable is not set correctly. Add the following line at the end of ~/.bashrc:

vim ~/.bashrc
# add the following line at the end
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH

Save and exit, then apply the change with:

source ~/.bashrc

Recompile:

/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java

The compiled class files now appear under the bin directory:

/usr/local/big_data/homework/MapReduce/InvertedIndex# ls bin/ -lh
total 24K
-rw-r--r-- 1 root root 2.3K 12月 25 20:02  InverseIndex.class
-rw-r--r-- 1 root root 1.9K 12月 25 20:02 'InverseIndex$InverseMapper.class'
-rw-r--r-- 1 root root 4.1K 12月 25 20:02 'InverseIndex$InverseReducer.class'
-rw-r--r-- 1 root root 2.9K 12月 25 20:02 'InverseIndex$statisticsMap.class'
-rw-r--r-- 1 root root 1.7K 12月 25 20:02 'InverseIndex$statisticsReduce.class'
  • Package the jar file
jar -cvf InverseIndex.jar -C bin/ .
# output
added manifest
adding: InverseIndex$InverseReducer.class(in = 4186) (out= 1820)(deflated 56%)
adding: InverseIndex.class(in = 2282) (out= 1160)(deflated 49%)
adding: InverseIndex$InverseMapper.class(in = 1909) (out= 771)(deflated 59%)
adding: InverseIndex$statisticsMap.class(in = 2960) (out= 1371)(deflated 53%)
adding: InverseIndex$statisticsReduce.class(in = 1672) (out= 714)(deflated 57%)
  • Upload the input files
    Create an input directory in HDFS to hold the input files:
hadoop fs -mkdir input

Copy the input files from the local file system to the HDFS input directory:

hadoop fs -put /usr/local/big_data/homework/data/0ws*.txt input

Check that the copy succeeded:

/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls input/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 35 items
-rw-r--r--   1 root supergroup     146884 2019-12-25 13:45 input/0ws0110.txt
-rw-r--r--   1 root supergroup     166807 2019-12-25 13:45 input/0ws0210.txt
-rw-r--r--   1 root supergroup     161786 2019-12-25 13:45 input/0ws0310.txt
-rw-r--r--   1 root supergroup     189377 2019-12-25 13:45 input/0ws0410.txt
-rw-r--r--   1 root supergroup     101367 2019-12-25 13:45 input/0ws0610.txt
-rw-r--r--   1 root supergroup     136652 2019-12-25 13:45 input/0ws0910.txt
-rw-r--r--   1 root supergroup     138374 2019-12-25 13:45 input/0ws1010.txt
-rw-r--r--   1 root supergroup     117959 2019-12-25 13:45 input/0ws1110.txt
-rw-r--r--   1 root supergroup     143801 2019-12-25 13:45 input/0ws1210.txt
-rw-r--r--   1 root supergroup     137832 2019-12-25 13:45 input/0ws1410.txt
-rw-r--r--   1 root supergroup     145078 2019-12-25 13:45 input/0ws1510.txt
-rw-r--r--   1 root supergroup     156279 2019-12-25 13:45 input/0ws1610.txt
-rw-r--r--   1 root supergroup     112280 2019-12-25 13:45 input/0ws1710.txt
-rw-r--r--   1 root supergroup     137320 2019-12-25 13:45 input/0ws1810.txt
-rw-r--r--   1 root supergroup     159175 2019-12-25 13:45 input/0ws1910.txt
-rw-r--r--   1 root supergroup     142894 2019-12-25 13:45 input/0ws2010.txt
-rw-r--r--   1 root supergroup     169826 2019-12-25 13:45 input/0ws2110.txt
-rw-r--r--   1 root supergroup     139007 2019-12-25 13:45 input/0ws2210.txt
-rw-r--r--   1 root supergroup     170479 2019-12-25 13:45 input/0ws2310.txt
-rw-r--r--   1 root supergroup     132199 2019-12-25 13:45 input/0ws2410.txt
-rw-r--r--   1 root supergroup     140595 2019-12-25 13:45 input/0ws2510.txt
-rw-r--r--   1 root supergroup     184147 2019-12-25 13:45 input/0ws2610.txt
-rw-r--r--   1 root supergroup     130617 2019-12-25 13:45 input/0ws2810.txt
-rw-r--r--   1 root supergroup     150544 2019-12-25 13:45 input/0ws3010.txt
-rw-r--r--   1 root supergroup     143937 2019-12-25 13:45 input/0ws3110.txt
-rw-r--r--   1 root supergroup     172030 2019-12-25 13:45 input/0ws3210.txt
-rw-r--r--   1 root supergroup     158008 2019-12-25 13:45 input/0ws3310.txt
-rw-r--r--   1 root supergroup     119985 2019-12-25 13:45 input/0ws3410.txt
-rw-r--r--   1 root supergroup     165919 2019-12-25 13:45 input/0ws3510.txt
-rw-r--r--   1 root supergroup     181370 2019-12-25 13:45 input/0ws3610.txt
-rw-r--r--   1 root supergroup     126492 2019-12-25 13:45 input/0ws3710.txt
-rw-r--r--   1 root supergroup     177241 2019-12-25 13:45 input/0ws3910.txt
-rw-r--r--   1 root supergroup     161652 2019-12-25 13:45 input/0ws4010.txt
-rw-r--r--   1 root supergroup     115675 2019-12-25 13:45 input/0ws4110.txt
-rw-r--r--   1 root supergroup     160868 2019-12-25 13:45 input/0ws4210.txt

The input files have been copied successfully. The warnings are caused by the relatively new Java version; they do not affect Hadoop and can be ignored.

  • Run the jar file
hadoop jar InverseIndex.jar InverseIndex input tmp output

Here, input is the directory holding the input files; tmp holds the output of the first MapReduce job, which is also the input of the second one; output holds the output of the second job, i.e., the final result.
Note: the tmp and output directories must not exist beforehand; if they are created in advance, an error like the following is reported:

/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop jar InvertedIndex.jar InvertedIndex input invertedindex/output
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/12/25 14:21:37 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/12/25 14:21:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/root/invertedindex/output already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:266)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:139)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:691)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
	at InvertedIndex.main(InvertedIndex.java:120)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:567)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

The output directories are created automatically by the framework.
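
If the jobs need to be re-run, the previous intermediate and final output directories can simply be removed first, for example:

hadoop fs -rm -r tmp output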

  • Check the output
    List the contents of the output directory on HDFS:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls output/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 2 items
-rw-r--r--   1 root supergroup          0 2019-12-25 16:38 output/_SUCCESS
-rw-r--r--   1 root supergroup    3821132 2019-12-25 16:38 output/part-r-00000

The part-r-00000 file under the output directory holds the final result.

  • View the contents of the output file
hadoop fs -cat output/part-r-00000
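
To look up the postings of a single word, the output can also be piped through grep; the word carried below is just an example:

hadoop fs -cat output/part-r-00000 | grep '^carried->'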

The output looks like the following (excerpt):

broach->4:{(0ws0110.txt,1,24700),(0ws3710.txt,1,21558),(0ws4210.txt,1,27895),(0ws0910.txt,1,23692)}
throgh->1:{(0ws1010.txt,1,24377)}
lustier->3:{(0ws1910.txt,1,27807),(0ws1610.txt,1,27734),(0ws3010.txt,1,26502)}
instances->6:{(0ws2510.txt,1,25060),(0ws1110.txt,1,20592),(0ws2210.txt,1,24615),(0ws2410.txt,1,22949),(0ws2110.txt,1,29220),(0ws2610.txt,1,32135)}
stout->12:{(0ws0310.txt,2,27573),(0ws0410.txt,1,32588),(0ws3610.txt,1,31061),(0ws1610.txt,2,27734),(0ws4210.txt,1,27895),(0ws3710.txt,1,21558),(0ws3410.txt,1,20206),(0ws0210.txt,2,28689),(0ws2810.txt,1,23284),(0ws0110.txt,3,24700),(0ws1410.txt,1,23904),(0ws1910.txt,1,27807)}
iuliet->2:{(0ws3110.txt,8,25123),(0ws1610.txt,54,27734)}
flee->1:{(0ws1210.txt,1,24901)}
fled->26:{(0ws2010.txt,1,24815),(0ws2210.txt,5,24615),(0ws0410.txt,4,32588),(0ws1410.txt,1,23904),(0ws3710.txt,1,21558),(0ws0610.txt,2,17919),(0ws3910.txt,5,30687),(0ws3010.txt,1,26502),(0ws0910.txt,1,23692),(0ws3510.txt,3,28227),(0ws1910.txt,1,27807),(0ws1710.txt,3,19538),(0ws0210.txt,6,28689),(0ws3310.txt,3,27164),(0ws2110.txt,3,29220),(0ws1510.txt,7,24965),(0ws3610.txt,1,31061),(0ws3210.txt,2,29959),(0ws1110.txt,3,20592),(0ws4010.txt,3,28122),(0ws3410.txt,7,20206),(0ws0110.txt,8,24700),(0ws1610.txt,2,27734),(0ws0310.txt,8,27573),(0ws1810.txt,1,24517),(0ws2410.txt,3,22949)}
hedgd->2:{(0ws1810.txt,1,24517),(0ws1410.txt,1,23904)}
hedge->10:{(0ws1910.txt,1,27807),(0ws3010.txt,1,26502),(0ws2410.txt,1,22949),(0ws1010.txt,1,24377),(0ws0210.txt,1,28689),(0ws2210.txt,1,24615),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2610.txt,1,32135),(0ws4010.txt,2,28122)}
flea->5:{(0ws1210.txt,1,24901),(0ws1010.txt,1,24377),(0ws2310.txt,2,29377),(0ws3310.txt,1,27164),(0ws2810.txt,1,23284)}
taxing->1:{(0ws2510.txt,1,25060)}
responsiue->1:{(0ws2610.txt,1,32135)}
flew->4:{(0ws0210.txt,1,28689),(0ws3310.txt,1,27164),(0ws4210.txt,1,27895),(0ws1810.txt,1,24517)}
treuant->1:{(0ws1210.txt,1,24901)}
commandment->5:{(0ws2610.txt,4,32135),(0ws4010.txt,1,28122),(0ws3610.txt,1,31061),(0ws2110.txt,1,29220),(0ws2510.txt,1,25060)}
sailemaker->1:{(0ws1010.txt,1,24377)}
carried->16:{(0ws3910.txt,1,30687),(0ws2610.txt,1,32135),(0ws1710.txt,1,19538),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2210.txt,1,24615),(0ws0610.txt,2,17919),(0ws3310.txt,1,27164),(0ws1110.txt,1,20592),(0ws3110.txt,4,25123),(0ws4210.txt,2,27895),(0ws3610.txt,2,31061),(0ws1210.txt,1,24901),(0ws3410.txt,1,20206),(0ws3010.txt,2,26502),(0ws2310.txt,1,29377)}
auoiding->1:{(0ws0310.txt,1,27573)}
inthrald->1:{(0ws0110.txt,1,24700)}
clearenesse->1:{(0ws3410.txt,1,20206)}
blithe->2:{(0ws2210.txt,1,24615),(0ws0910.txt,1,23692)}
clothes->8:{(0ws2310.txt,1,29377),(0ws1010.txt,1,24377),(0ws0110.txt,1,24700),(0ws1610.txt,1,27734),(0ws1910.txt,1,27807),(0ws2610.txt,1,32135),(0ws3110.txt,1,25123),(0ws4010.txt,1,28122)}
raunge->2:{(0ws1510.txt,1,24965),(0ws2310.txt,1,29377)}
gonerill->1:{(0ws3310.txt,18,27164)}
stowd->1:{(0ws3210.txt,1,29959)}
carrier->2:{(0ws1910.txt,4,27807),(0ws0910.txt,1,23692)}
carries->15:{(0ws1910.txt,1,27807),(0ws3710.txt,1,21558),(0ws3510.txt,1,28227),(0ws1710.txt,2,19538),(0ws0610.txt,1,17919),(0ws2410.txt,1,22949),(0ws2610.txt,1,32135),(0ws4210.txt,2,27895),(0ws1810.txt,1,24517),(0ws4110.txt,1,19796),(0ws2510.txt,1,25060),(0ws1210.txt,1,24901),(0ws3610.txt,1,31061),(0ws3010.txt,3,26502),(0ws1610.txt,1,27734)}
obiections->3:{(0ws0210.txt,1,28689),(0ws4210.txt,1,27895),(0ws0110.txt,1,24700)}
vnsuspected->2:{(0ws0410.txt,1,32588),(0ws1010.txt,1,24377)}
arguments->7:{(0ws1210.txt,1,24901),(0ws2810.txt,3,23284),(0ws0310.txt,2,27573),(0ws4210.txt,1,27895),(0ws1410.txt,1,23904),(0ws3310.txt,1,27164),(0ws0410.txt,1,32588)}
shreds->2:{(0ws3610.txt,1,31061),(0ws2610.txt,1,32135)}
raunges->1:{(0ws3610.txt,1,31061)}