Inverted Index with Hadoop
Requirements
Given N input files, build an inverted index that carries detailed information. The input is the collected works of Shakespeare, Shakespeare.tar.gz.
For example, suppose there are 4 input files:
- d1.txt: cat dog cat fox
- d2.txt: cat bear cat cat fox
- d3.txt: fox wolf dog
- d4.txt: wolf hen rabbit cat sheep
The inverted index must have the following format:
- cat -> 3: {(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)}
- word -> number of files containing the word: {(name or id of a file containing the word, number of occurrences of the word in that file, total number of words in that file), ...}
Environment
- Hardware: Intel® Core™ i5-8250U @ 1.8 GHz / 8 GB RAM
- Software: Ubuntu 18.04 64-bit, Java 13.0.1, Hadoop 2.7.4
Inverted index
The inverted index is one of the most common data structures in document retrieval systems and is widely used by full-text search engines. It stores, for each word (or phrase), a mapping to the locations of the documents that contain it, i.e. it provides a way to look up documents by their content. Because it works in the opposite direction of a forward index (finding documents from keywords, rather than listing the contents of each document), it is called an inverted index. Typically an inverted index consists of a word (or phrase) together with a list of related documents (document IDs, or URIs pointing to where the documents are stored). For example, word 1 may occur in {document 1, document 4, document 13, ...}, word 2 in {document 3, document 5, document 15, ...}, and word 3 in {document 1, document 8, document 20, ...}. In addition, each document entry is usually given a weight that indicates how relevant the document is to the search term.
The most common weight is the term frequency, i.e. the number of times the word occurs in the document. Taking English text as an example, an index entry for "MapReduce" records that the word occurs once in text T0, once in T1 and twice in T2. When the search terms are "MapReduce", "is" and "simple", the matching documents are {T0,T1,T2} ∩ {T0,T1} ∩ {T0,T1} = {T0,T1}, i.e. texts T0 and T1 contain all of the searched words, and only in T0 do they appear consecutively.
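A minimal sketch of the index entries described in the paragraph above (the per-document counts are only stated for "MapReduce"; the other two entries just list the documents):
MapReduce -> (T0,1), (T1,1), (T2,2)
is        -> {T0, T1}
simple    -> {T0, T1}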
Overall approach
Two MapReduce jobs are enough to meet the requirement. The first job counts, for each file, how many times every word occurs and how many words the file contains in total; the second job takes the first job's output and assembles the inverted index, as illustrated by the sketch below.
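As a rough sketch (not actual program output), the records flowing between the two jobs for the example files d1.txt and d2.txt would look roughly as follows, writing ^A for the '\001' separator used in the code, with a tab between key and value:
First job output (per-file word counts and per-file totals):
d1.txt^Acat    2
d1.txt^Adog    1
d1.txt^Afox    1
d1.txt         4
d2.txt^Acat    3
d2.txt^Abear   1
d2.txt^Afox    1
d2.txt         5
...
Second job output (one line per word):
cat->3:{(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)}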
- Design of the first MapReduce job
In the map phase, the map method splits the text held in value into individual words with a StringTokenizer, obtains the name of the file being processed, and emits records in two forms: <filename+word, 1> and <filename, 1>.
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
//get the name of the input file
FileSplit fileSplit= (FileSplit)context.getInputSplit();
String fileName = fileSplit.getPath().getName();
//count each word's occurrences in this file and the file's total word count
StringTokenizer itr= new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
String word =removeNonLetters( itr.nextToken().toLowerCase());
String fileWord = fileName+"\001"+word;
if(!word.equals(""))
{
context.write(new Text(fileWord), new IntWritable(1));
context.write(new Text(fileName), new IntWritable(1));
}
}
}
In the reduce phase, the occurrences of each word in each file and the total word count of each file are summed up, and <key, count> is emitted.
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key,new IntWritable(sum));
}
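Because this reducer only sums integer counts, it could also be registered as a combiner in the job setup to cut down the shuffle volume; this is a possible optimization, not part of the original design:
job.setCombinerClass(statisticsReduce.class);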
- Design of the second MapReduce job
The map phase reads the output of the first job, splits each value, and re-emits it with a fixed Text key "index" and a value of the form filename+word+count or filename+count.
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String valStr = value.toString();
String[] records = valStr.split("\t");
context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
}
The reduce phase keeps four HashMaps: Map<String,Integer> wordinfilescount, whose key combines the filename and the word and whose value is the number of times the word occurs in that file; Map<String,Integer> filescount, whose key is a filename and whose value is the file's total word count; Map<String,Integer> wordinfiles, whose key is a word and whose value is the number of files the word occurs in; and Map<String,String> indexes, whose key is a word and whose value is its inverted-index entry. Each value is split on the separator. If the split yields 2 fields, the record is a filename plus its total word count and is put into filescount. If it yields 3 fields, the record is filename + word + count; it is put into wordinfilescount, and wordinfiles is updated to count how many files the word occurs in. The reducer then iterates over wordinfilescount and, for each word, accumulates the string "word->number of files containing the word:{(filename, occurrences of the word in that file, total words in that file),...}" in indexes. Finally it iterates over indexes and writes out every index entry.
public static class InverseReducer extends Reducer<Text, Text, Text, NullWritable> {
private Map<String, Integer> wordinfilescount = new HashMap<String, Integer>();//key: filename+word, value: occurrences of the word in that file
private Map<String, Integer> filescount = new HashMap<String, Integer>();//key: filename, value: total number of words in the file
private Map<String, Integer> wordinfiles = new HashMap<String, Integer>();//key: word, value: number of files the word occurs in
private Map<String, String> indexes = new HashMap<String, String>();//key: word, value: inverted-index entry
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//split the input; collect each word's per-file counts, how many files it occurs in, and each file's total word count
for (Text val : values) {
String valStr = val.toString();
String[] records = valStr.split("\001");
switch (records.length) {
case 2:
filescount.put(records[0], Integer.parseInt(records[1]));
break;
case 3: {
wordinfilescount.put(valStr, Integer.parseInt(records[2]));
if (!wordinfiles.containsKey(records[1])) {
wordinfiles.put(records[1], 1);
} else {
wordinfiles.put(records[1], wordinfiles.get(records[1]) + 1);
}
}
;
break;
}
}
//build the inverted-index entries
for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
String valStr = entry.getKey();
String[] records = valStr.split("\001");
String word = records[1];
if (!indexes.containsKey(word)) {
StringBuilder sb = new StringBuilder();
sb.append(word)
.append("->")
.append(wordinfiles.get(word))
.append(":")
.append("{(")
.append(records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get(records[0]))
.append(")");
indexes.put(word, sb.toString());
} else {
StringBuilder sb = new StringBuilder();
sb.append(",(")
.append(records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get(records[0]))
.append(")");
indexes.put(word, indexes.get(word) + sb.toString());
}
}
for (Entry<String, String> entry : indexes.entrySet()) {
context.write(new Text(entry.getValue() + "}"), NullWritable.get());
}
}
}
Complete source code
import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.HashMap;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class InverseIndex {
/*
 * Mapper of the first job: counts word occurrences per file. The input key is the line
 * offset within a file; the output is <filename+word, 1> or <filename, 1>.
 */
public static class statisticsMap extends
Mapper<Object, Text, Text, IntWritable> {
private Text mapKey = new Text("key");
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
//get the name of the input file
FileSplit fileSplit= (FileSplit)context.getInputSplit();
String fileName = fileSplit.getPath().getName();
//count each word's occurrences in this file and the file's total word count
StringTokenizer itr= new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
String word =removeNonLetters( itr.nextToken().toLowerCase());
String fileWord = fileName+"\001"+word;
if(!word.equals("")){
context.write(new Text(fileWord), new IntWritable(1));
context.write(new Text(fileName), new IntWritable(1));
}
}
}
//strip non-letter characters from a string
public static String removeNonLetters(String original){
StringBuffer aBuffer=new StringBuffer(original.length());
char aCharacter;
for(int i=0;i<original.length();i++){
aCharacter=original.charAt(i);
if(Character.isLetter(aCharacter)){
aBuffer.append(aCharacter);
}
}
return new String(aBuffer);
}
}
//Reducer of the first job: sums the per-file occurrences of each word and the total word count of each file
public static class statisticsReduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key,new IntWritable(sum));
}
}
public static class InverseMapper extends
Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String valStr = value.toString();
String[] records = valStr.split("\t");
context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
}
}
public static class InverseReducer extends
Reducer<Text, Text, Text, NullWritable> {
private Map<String,Integer> wordinfilescount = new HashMap<String,Integer>();//key: filename+word, value: occurrences of the word in that file
private Map<String,Integer> filescount = new HashMap<String,Integer>();//key: filename, value: total number of words in the file
private Map<String,Integer> wordinfiles = new HashMap<String,Integer>();//key: word, value: number of files the word occurs in
private Map<String,String> indexes = new HashMap<String,String>();//key: word, value: inverted-index entry
public void reduce( Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//split the input; collect each word's per-file counts, how many files it occurs in, and each file's total word count
for (Text val : values) {
String valStr = val.toString();
String[] records = valStr.split("\001");
switch(records.length){
case 2:filescount.put(records[0], Integer.parseInt(records[1]));
break;
case 3:{
wordinfilescount.put(valStr, Integer.parseInt(records[2]));
if(!wordinfiles.containsKey(records[1])){
wordinfiles.put(records[1], 1);
}else{
wordinfiles.put(records[1], wordinfiles.get(records[1])+1);
}
};
break;
}
}
//build the inverted-index entries
for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
String valStr = entry.getKey();
String[] records = valStr.split("\001");
String word = records[1];
if(!indexes.containsKey(word)){
StringBuilder sb = new StringBuilder();
sb.append(word)
.append("->")
.append(wordinfiles.get(word))
.append(":")
.append("{(")
.append( records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get( records[0]))
.append(")");
indexes.put(word,sb.toString() );
}else{
StringBuilder sb = new StringBuilder();
sb.append(",(")
.append( records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get( records[0]))
.append(")");
indexes.put(word,indexes.get(word)+sb.toString() );
}
}
for (Entry<String, String> entry : indexes.entrySet()) {
context.write(new Text(entry.getValue()+"}"), NullWritable.get());
}
}
}
//Job 1: count each word's occurrences per file and each file's total word count
public static void StatisticsTask(String[] args) throws Exception{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length < 3) {
System.err.println("Usage: ipstatistics <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "ipstatistics1");
job.setJarByClass(InverseIndex.class); //ship the job jar when running on a real cluster
job.setMapperClass(statisticsMap.class);
job.setReducerClass(statisticsReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[1]));
if (!job.waitForCompletion(true)) { //stop if the first job fails
System.exit(1);
}
}
//Job 2: build the inverted index from the statistics produced by job 1
public static void InverseTask(String[] args) throws Exception{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length < 3) {
System.err.println("Usage: ipstatistics <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "ipstatistics2");
job.setJarByClass(InverseIndex.class); //ship the job jar when running on a real cluster
job.setMapperClass( InverseMapper.class);
job.setReducerClass( InverseReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
StatisticsTask(args);
InverseTask(args);
}
}
Deployment and execution
- Start the Hadoop cluster
/usr/local/big_data/homework/MapReduce/InvertedIndex# start-all.sh
- Compile the source
Change to the working directory and compile:
/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java
The compiler reported an error:
src/InverseIndex.java:8: error: package org.apache.hadoop.conf does not exist
import org.apache.hadoop.conf.Configuration;
...
This error occurs because the CLASSPATH environment variable is not set correctly. Append the following line to the end of ~/.bashrc:
vim ~/.bashrc
#add the following line at the end
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
Save and exit, then make the change take effect with:
source ~/.bashrc
Recompile:
/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java
The compiled class files now appear under the bin directory:
/usr/local/big_data/homework/MapReduce/InvertedIndex# ls bin/ -lh
total 24K
-rw-r--r-- 1 root root 2.3K 12月 25 20:02 InverseIndex.class
-rw-r--r-- 1 root root 1.9K 12月 25 20:02 'InverseIndex$InverseMapper.class'
-rw-r--r-- 1 root root 4.1K 12月 25 20:02 'InverseIndex$InverseReducer.class'
-rw-r--r-- 1 root root 2.9K 12月 25 20:02 'InverseIndex$statisticsMap.class'
-rw-r--r-- 1 root root 1.7K 12月 25 20:02 'InverseIndex$statisticsReduce.class'
- Package the jar file
jar -cvf InverseIndex.jar -C bin/ .
#output
added manifest
adding: InverseIndex$InverseReducer.class(in = 4186) (out= 1820)(deflated 56%)
adding: InverseIndex.class(in = 2282) (out= 1160)(deflated 49%)
adding: InverseIndex$InverseMapper.class(in = 1909) (out= 771)(deflated 59%)
adding: InverseIndex$statisticsMap.class(in = 2960) (out= 1371)(deflated 53%)
adding: InverseIndex$statisticsReduce.class(in = 1672) (out= 714)(deflated 57%)
- Upload the input files
Create an input directory on HDFS to hold the input files:
hadoop fs -mkdir input
Copy the input files from the local filesystem to the HDFS input directory:
hadoop fs -put /usr/local/big_data/homework/data/0ws*.txt input
Check that the copy succeeded:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls input/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 35 items
-rw-r--r-- 1 root supergroup 146884 2019-12-25 13:45 input/0ws0110.txt
-rw-r--r-- 1 root supergroup 166807 2019-12-25 13:45 input/0ws0210.txt
-rw-r--r-- 1 root supergroup 161786 2019-12-25 13:45 input/0ws0310.txt
-rw-r--r-- 1 root supergroup 189377 2019-12-25 13:45 input/0ws0410.txt
-rw-r--r-- 1 root supergroup 101367 2019-12-25 13:45 input/0ws0610.txt
-rw-r--r-- 1 root supergroup 136652 2019-12-25 13:45 input/0ws0910.txt
-rw-r--r-- 1 root supergroup 138374 2019-12-25 13:45 input/0ws1010.txt
-rw-r--r-- 1 root supergroup 117959 2019-12-25 13:45 input/0ws1110.txt
-rw-r--r-- 1 root supergroup 143801 2019-12-25 13:45 input/0ws1210.txt
-rw-r--r-- 1 root supergroup 137832 2019-12-25 13:45 input/0ws1410.txt
-rw-r--r-- 1 root supergroup 145078 2019-12-25 13:45 input/0ws1510.txt
-rw-r--r-- 1 root supergroup 156279 2019-12-25 13:45 input/0ws1610.txt
-rw-r--r-- 1 root supergroup 112280 2019-12-25 13:45 input/0ws1710.txt
-rw-r--r-- 1 root supergroup 137320 2019-12-25 13:45 input/0ws1810.txt
-rw-r--r-- 1 root supergroup 159175 2019-12-25 13:45 input/0ws1910.txt
-rw-r--r-- 1 root supergroup 142894 2019-12-25 13:45 input/0ws2010.txt
-rw-r--r-- 1 root supergroup 169826 2019-12-25 13:45 input/0ws2110.txt
-rw-r--r-- 1 root supergroup 139007 2019-12-25 13:45 input/0ws2210.txt
-rw-r--r-- 1 root supergroup 170479 2019-12-25 13:45 input/0ws2310.txt
-rw-r--r-- 1 root supergroup 132199 2019-12-25 13:45 input/0ws2410.txt
-rw-r--r-- 1 root supergroup 140595 2019-12-25 13:45 input/0ws2510.txt
-rw-r--r-- 1 root supergroup 184147 2019-12-25 13:45 input/0ws2610.txt
-rw-r--r-- 1 root supergroup 130617 2019-12-25 13:45 input/0ws2810.txt
-rw-r--r-- 1 root supergroup 150544 2019-12-25 13:45 input/0ws3010.txt
-rw-r--r-- 1 root supergroup 143937 2019-12-25 13:45 input/0ws3110.txt
-rw-r--r-- 1 root supergroup 172030 2019-12-25 13:45 input/0ws3210.txt
-rw-r--r-- 1 root supergroup 158008 2019-12-25 13:45 input/0ws3310.txt
-rw-r--r-- 1 root supergroup 119985 2019-12-25 13:45 input/0ws3410.txt
-rw-r--r-- 1 root supergroup 165919 2019-12-25 13:45 input/0ws3510.txt
-rw-r--r-- 1 root supergroup 181370 2019-12-25 13:45 input/0ws3610.txt
-rw-r--r-- 1 root supergroup 126492 2019-12-25 13:45 input/0ws3710.txt
-rw-r--r-- 1 root supergroup 177241 2019-12-25 13:45 input/0ws3910.txt
-rw-r--r-- 1 root supergroup 161652 2019-12-25 13:45 input/0ws4010.txt
-rw-r--r-- 1 root supergroup 115675 2019-12-25 13:45 input/0ws4110.txt
-rw-r--r-- 1 root supergroup 160868 2019-12-25 13:45 input/0ws4210.txt
The input files have been copied successfully. The warnings are caused by the relatively new Java version; they do not affect Hadoop and can be ignored.
- Run the jar file
hadoop jar InverseIndex.jar InverseIndex input tmp output
Here, input is the directory holding the input files; tmp holds the output of the first MapReduce job, which is also the input of the second; and output holds the output of the second job, i.e. the final result.
Note: the tmp and output directories must not already exist. If one of them does, an error like the following is raised:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop jar InvertedIndex.jar InvertedIndex input invertedindex/output
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/12/25 14:21:37 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/12/25 14:21:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/root/invertedindex/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:266)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:139)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:691)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at InvertedIndex.main(InvertedIndex.java:120)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The output directories are created by the system itself.
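If a previous run has already created them, one simple option (a suggestion, not part of the original walkthrough) is to delete the old directories on HDFS before running the job again:
hadoop fs -rm -r tmp output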
- Inspect the output
List the contents of the output directory on HDFS:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls output/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 2 items
-rw-r--r-- 1 root supergroup 0 2019-12-25 16:38 output/_SUCCESS
-rw-r--r-- 1 root supergroup 3821132 2019-12-25 16:38 output/part-r-00000
The part-r-00000 file under the output directory holds the final result.
- View the contents of the result file
hadoop fs -cat output/part-r-00000
The output looks like this (excerpt):
broach->4:{(0ws0110.txt,1,24700),(0ws3710.txt,1,21558),(0ws4210.txt,1,27895),(0ws0910.txt,1,23692)}
throgh->1:{(0ws1010.txt,1,24377)}
lustier->3:{(0ws1910.txt,1,27807),(0ws1610.txt,1,27734),(0ws3010.txt,1,26502)}
instances->6:{(0ws2510.txt,1,25060),(0ws1110.txt,1,20592),(0ws2210.txt,1,24615),(0ws2410.txt,1,22949),(0ws2110.txt,1,29220),(0ws2610.txt,1,32135)}
stout->12:{(0ws0310.txt,2,27573),(0ws0410.txt,1,32588),(0ws3610.txt,1,31061),(0ws1610.txt,2,27734),(0ws4210.txt,1,27895),(0ws3710.txt,1,21558),(0ws3410.txt,1,20206),(0ws0210.txt,2,28689),(0ws2810.txt,1,23284),(0ws0110.txt,3,24700),(0ws1410.txt,1,23904),(0ws1910.txt,1,27807)}
iuliet->2:{(0ws3110.txt,8,25123),(0ws1610.txt,54,27734)}
flee->1:{(0ws1210.txt,1,24901)}
fled->26:{(0ws2010.txt,1,24815),(0ws2210.txt,5,24615),(0ws0410.txt,4,32588),(0ws1410.txt,1,23904),(0ws3710.txt,1,21558),(0ws0610.txt,2,17919),(0ws3910.txt,5,30687),(0ws3010.txt,1,26502),(0ws0910.txt,1,23692),(0ws3510.txt,3,28227),(0ws1910.txt,1,27807),(0ws1710.txt,3,19538),(0ws0210.txt,6,28689),(0ws3310.txt,3,27164),(0ws2110.txt,3,29220),(0ws1510.txt,7,24965),(0ws3610.txt,1,31061),(0ws3210.txt,2,29959),(0ws1110.txt,3,20592),(0ws4010.txt,3,28122),(0ws3410.txt,7,20206),(0ws0110.txt,8,24700),(0ws1610.txt,2,27734),(0ws0310.txt,8,27573),(0ws1810.txt,1,24517),(0ws2410.txt,3,22949)}
hedgd->2:{(0ws1810.txt,1,24517),(0ws1410.txt,1,23904)}
hedge->10:{(0ws1910.txt,1,27807),(0ws3010.txt,1,26502),(0ws2410.txt,1,22949),(0ws1010.txt,1,24377),(0ws0210.txt,1,28689),(0ws2210.txt,1,24615),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2610.txt,1,32135),(0ws4010.txt,2,28122)}
flea->5:{(0ws1210.txt,1,24901),(0ws1010.txt,1,24377),(0ws2310.txt,2,29377),(0ws3310.txt,1,27164),(0ws2810.txt,1,23284)}
taxing->1:{(0ws2510.txt,1,25060)}
responsiue->1:{(0ws2610.txt,1,32135)}
flew->4:{(0ws0210.txt,1,28689),(0ws3310.txt,1,27164),(0ws4210.txt,1,27895),(0ws1810.txt,1,24517)}
treuant->1:{(0ws1210.txt,1,24901)}
commandment->5:{(0ws2610.txt,4,32135),(0ws4010.txt,1,28122),(0ws3610.txt,1,31061),(0ws2110.txt,1,29220),(0ws2510.txt,1,25060)}
sailemaker->1:{(0ws1010.txt,1,24377)}
carried->16:{(0ws3910.txt,1,30687),(0ws2610.txt,1,32135),(0ws1710.txt,1,19538),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2210.txt,1,24615),(0ws0610.txt,2,17919),(0ws3310.txt,1,27164),(0ws1110.txt,1,20592),(0ws3110.txt,4,25123),(0ws4210.txt,2,27895),(0ws3610.txt,2,31061),(0ws1210.txt,1,24901),(0ws3410.txt,1,20206),(0ws3010.txt,2,26502),(0ws2310.txt,1,29377)}
auoiding->1:{(0ws0310.txt,1,27573)}
inthrald->1:{(0ws0110.txt,1,24700)}
clearenesse->1:{(0ws3410.txt,1,20206)}
blithe->2:{(0ws2210.txt,1,24615),(0ws0910.txt,1,23692)}
clothes->8:{(0ws2310.txt,1,29377),(0ws1010.txt,1,24377),(0ws0110.txt,1,24700),(0ws1610.txt,1,27734),(0ws1910.txt,1,27807),(0ws2610.txt,1,32135),(0ws3110.txt,1,25123),(0ws4010.txt,1,28122)}
raunge->2:{(0ws1510.txt,1,24965),(0ws2310.txt,1,29377)}
gonerill->1:{(0ws3310.txt,18,27164)}
stowd->1:{(0ws3210.txt,1,29959)}
carrier->2:{(0ws1910.txt,4,27807),(0ws0910.txt,1,23692)}
carries->15:{(0ws1910.txt,1,27807),(0ws3710.txt,1,21558),(0ws3510.txt,1,28227),(0ws1710.txt,2,19538),(0ws0610.txt,1,17919),(0ws2410.txt,1,22949),(0ws2610.txt,1,32135),(0ws4210.txt,2,27895),(0ws1810.txt,1,24517),(0ws4110.txt,1,19796),(0ws2510.txt,1,25060),(0ws1210.txt,1,24901),(0ws3610.txt,1,31061),(0ws3010.txt,3,26502),(0ws1610.txt,1,27734)}
obiections->3:{(0ws0210.txt,1,28689),(0ws4210.txt,1,27895),(0ws0110.txt,1,24700)}
vnsuspected->2:{(0ws0410.txt,1,32588),(0ws1010.txt,1,24377)}
arguments->7:{(0ws1210.txt,1,24901),(0ws2810.txt,3,23284),(0ws0310.txt,2,27573),(0ws4210.txt,1,27895),(0ws1410.txt,1,23904),(0ws3310.txt,1,27164),(0ws0410.txt,1,32588)}
shreds->2:{(0ws3610.txt,1,31061),(0ws2610.txt,1,32135)}
raunges->1:{(0ws3610.txt,1,31061)}