Inverted Index with Hadoop
Requirements
Given N input files, build an inverted index that carries detailed information. The input is the collected works of Shakespeare, Shakespeare.tar.gz.
For example, suppose there are 4 input files:
- d1.txt: cat dog cat fox
- d2.txt: cat bear cat cat fox
- d3.txt: fox wolf dog
- d4.txt: wolf hen rabbit cat sheep
The inverted index must have the following format:
- cat -> 3: {(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)}
- word -> number of files containing the word: {(name or id of a file containing the word, number of occurrences of the word in that file, total number of words in that file), ...}
Environment
- Hardware: Intel® Core™ i5-8250U @ 1.8 GHz / 8 GB RAM
- Software: Ubuntu 18.04 64-bit, Java 13.0.1, Hadoop 2.7.4
Inverted index
The inverted index is one of the most common data structures in document retrieval systems and is widely used by full-text search engines. It stores, for each word (or phrase), a mapping to the locations of the documents that contain it, i.e. it provides a way to look up documents by their content. Because it works in the opposite direction of a forward index (finding documents from keywords, rather than listing the contents of each document), it is called an inverted index. Typically an inverted index consists of a word (or phrase) together with a list of related documents (document IDs, or URIs pointing to where the documents are stored). For example, word 1 may occur in {document 1, document 4, document 13, ...}, word 2 in {document 3, document 5, document 15, ...}, and word 3 in {document 1, document 8, document 20, ...}. In addition, each document entry is usually given a weight that indicates how relevant the document is to the search term.
The most common weight is the term frequency, i.e. the number of times the word occurs in the document. Taking English text as an example, an index entry for "MapReduce" records that the word occurs once in text T0, once in T1 and twice in T2. When the search terms are "MapReduce", "is" and "simple", the matching documents are {T0,T1,T2} ∩ {T0,T1} ∩ {T0,T1} = {T0,T1}, i.e. texts T0 and T1 contain all of the searched words, and only in T0 do they appear consecutively.
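A minimal sketch of the index entries described in the paragraph above (the per-document counts are only stated for "MapReduce"; the other two entries just list the documents):
MapReduce -> (T0,1), (T1,1), (T2,2)
is        -> {T0, T1}
simple    -> {T0, T1}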
Overall approach
Two MapReduce jobs are enough to meet the requirement. The first job counts, for each file, how many times every word occurs and how many words the file contains in total; the second job takes the first job's output and assembles the inverted index, as illustrated by the sketch below.
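As a rough sketch (not actual program output), the records flowing between the two jobs for the example files d1.txt and d2.txt would look roughly as follows, writing ^A for the '\001' separator used in the code, with a tab between key and value:
First job output (per-file word counts and per-file totals):
d1.txt^Acat    2
d1.txt^Adog    1
d1.txt^Afox    1
d1.txt         4
d2.txt^Acat    3
d2.txt^Abear   1
d2.txt^Afox    1
d2.txt         5
...
Second job output (one line per word):
cat->3:{(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)}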
- Design of the first MapReduce job
In the map phase, the map method splits the text held in value into individual words with a StringTokenizer, obtains the name of the file being processed, and emits records in two forms: <filename+word, 1> and <filename, 1>.
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
//get the name of the input file
FileSplit fileSplit= (FileSplit)context.getInputSplit();
String fileName = fileSplit.getPath().getName();
//count each word's occurrences in this file and the file's total word count
StringTokenizer itr= new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
String word =removeNonLetters( itr.nextToken().toLowerCase());
String fileWord = fileName+"\001"+word;
if(!word.equals(""))
{
context.write(new Text(fileWord), new IntWritable(1));
context.write(new Text(fileName), new IntWritable(1));
}
}
}
In the reduce phase, the occurrences of each word in each file and the total word count of each file are summed up, and <key, count> is emitted.
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key,new IntWritable(sum));
}
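Because this reducer only sums integer counts, it could also be registered as a combiner in the job setup to cut down the shuffle volume; this is a possible optimization, not part of the original design:
job.setCombinerClass(statisticsReduce.class);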
- Design of the second MapReduce job
The map phase reads the output of the first job, splits each value, and re-emits it with a fixed Text key "index" and a value of the form filename+word+count or filename+count.
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String valStr = value.toString();
String[] records = valStr.split("\t");
context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
}
The reduce phase keeps four HashMaps: Map<String,Integer> wordinfilescount, whose key combines the filename and the word and whose value is the number of times the word occurs in that file; Map<String,Integer> filescount, whose key is a filename and whose value is the file's total word count; Map<String,Integer> wordinfiles, whose key is a word and whose value is the number of files the word occurs in; and Map<String,String> indexes, whose key is a word and whose value is its inverted-index entry. Each value is split on the separator. If the split yields 2 fields, the record is a filename plus its total word count and is put into filescount. If it yields 3 fields, the record is filename + word + count; it is put into wordinfilescount, and wordinfiles is updated to count how many files the word occurs in. The reducer then iterates over wordinfilescount and, for each word, accumulates the string "word->number of files containing the word:{(filename, occurrences of the word in that file, total words in that file),...}" in indexes. Finally it iterates over indexes and writes out every index entry.
public static class InverseReducer extends Reducer<Text, Text, Text, NullWritable> {
private Map<String, Integer> wordinfilescount = new HashMap<String, Integer>();//key: filename+word, value: occurrences of the word in that file
private Map<String, Integer> filescount = new HashMap<String, Integer>();//key: filename, value: total number of words in the file
private Map<String, Integer> wordinfiles = new HashMap<String, Integer>();//key: word, value: number of files the word occurs in
private Map<String, String> indexes = new HashMap<String, String>();//key: word, value: inverted-index entry
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//split the input; collect each word's per-file counts, how many files it occurs in, and each file's total word count
for (Text val : values) {
String valStr = val.toString();
String[] records = valStr.split("\001");
switch (records.length) {
case 2:
filescount.put(records[0], Integer.parseInt(records[1]));
break;
case 3: {
wordinfilescount.put(valStr, Integer.parseInt(records[2]));
if (!wordinfiles.containsKey(records[1])) {
wordinfiles.put(records[1], 1);
} else {
wordinfiles.put(records[1], wordinfiles.get(records[1]) + 1);
}
}
;
break;
}
}
//build the inverted-index entries
for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
String valStr = entry.getKey();
String[] records = valStr.split("\001");
String word = records[1];
if (!indexes.containsKey(word)) {
StringBuilder sb = new StringBuilder();
sb.append(word)
.append("->")
.append(wordinfiles.get(word))
.append(":")
.append("{(")
.append(records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get(records[0]))
.append(")");
indexes.put(word, sb.toString());
} else {
StringBuilder sb = new StringBuilder();
sb.append(",(")
.append(records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get(records[0]))
.append(")");
indexes.put(word, indexes.get(word) + sb.toString());
}
}
for (Entry<String, String> entry : indexes.entrySet()) {
context.write(new Text(entry.getValue() + "}"), NullWritable.get());
}
}
}
Complete source code
import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.HashMap;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class InverseIndex {
/*
 * Mapper of the first job: counts word occurrences per file. The input key is the line
 * offset within a file; the output is <filename+word, 1> or <filename, 1>.
 */
public static class statisticsMap extends
Mapper<Object, Text, Text, IntWritable> {
private Text mapKey = new Text("key");
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
//get the name of the input file
FileSplit fileSplit= (FileSplit)context.getInputSplit();
String fileName = fileSplit.getPath().getName();
//count each word's occurrences in this file and the file's total word count
StringTokenizer itr= new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
String word =removeNonLetters( itr.nextToken().toLowerCase());
String fileWord = fileName+"\001"+word;
if(!word.equals("")){
context.write(new Text(fileWord), new IntWritable(1));
context.write(new Text(fileName), new IntWritable(1));
}
}
}
//strip non-letter characters from a string
public static String removeNonLetters(String original){
StringBuffer aBuffer=new StringBuffer(original.length());
char aCharacter;
for(int i=0;i<original.length();i++){
aCharacter=original.charAt(i);
if(Character.isLetter(aCharacter)){
aBuffer.append(aCharacter);
}
}
return new String(aBuffer);
}
}
//Reducer of the first job: sums the per-file occurrences of each word and the total word count of each file
public static class statisticsReduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key,new IntWritable(sum));
}
}
public static class InverseMapper extends
Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String valStr = value.toString();
String[] records = valStr.split("\t");
context.write(new Text("index"),new Text(records[0]+"\001"+records[1]));
}
}
public static class InverseReducer extends
Reducer<Text, Text, Text, NullWritable> {
private Map<String,Integer> wordinfilescount = new HashMap<String,Integer>();//key: filename+word, value: occurrences of the word in that file
private Map<String,Integer> filescount = new HashMap<String,Integer>();//key: filename, value: total number of words in the file
private Map<String,Integer> wordinfiles = new HashMap<String,Integer>();//key: word, value: number of files the word occurs in
private Map<String,String> indexes = new HashMap<String,String>();//key: word, value: inverted-index entry
public void reduce( Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//split the input; collect each word's per-file counts, how many files it occurs in, and each file's total word count
for (Text val : values) {
String valStr = val.toString();
String[] records = valStr.split("\001");
switch(records.length){
case 2:filescount.put(records[0], Integer.parseInt(records[1]));
break;
case 3:{
wordinfilescount.put(valStr, Integer.parseInt(records[2]));
if(!wordinfiles.containsKey(records[1])){
wordinfiles.put(records[1], 1);
}else{
wordinfiles.put(records[1], wordinfiles.get(records[1])+1);
}
};
break;
}
}
//build the inverted-index entries
for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
String valStr = entry.getKey();
String[] records = valStr.split("\001");
String word = records[1];
if(!indexes.containsKey(word)){
StringBuilder sb = new StringBuilder();
sb.append(word)
.append("->")
.append(wordinfiles.get(word))
.append(":")
.append("{(")
.append( records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get( records[0]))
.append(")");
indexes.put(word,sb.toString() );
}else{
StringBuilder sb = new StringBuilder();
sb.append(",(")
.append( records[0])
.append(",")
.append(entry.getValue())
.append(",")
.append(filescount.get( records[0]))
.append(")");
indexes.put(word,indexes.get(word)+sb.toString() );
}
}
for (Entry<String, String> entry : indexes.entrySet()) {
context.write(new Text(entry.getValue()+"}"), NullWritable.get());
}
}
}
//Job 1: count each word's occurrences per file and each file's total word count
public static void StatisticsTask(String[] args) throws Exception{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length < 3) {
System.err.println("Usage: ipstatistics <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "ipstatistics1");
job.setJarByClass(InverseIndex.class); //ship the job jar when running on a real cluster
job.setMapperClass(statisticsMap.class);
job.setReducerClass(statisticsReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[1]));
if (!job.waitForCompletion(true)) { //stop if the first job fails
System.exit(1);
}
}
//Job 2: build the inverted index from the statistics produced by job 1
public static void InverseTask(String[] args) throws Exception{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length < 3) {
System.err.println("Usage: ipstatistics <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "ipstatistics2");
job.setJarByClass(InverseIndex.class); //ship the job jar when running on a real cluster
job.setMapperClass( InverseMapper.class);
job.setReducerClass( InverseReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
StatisticsTask(args);
InverseTask(args);
}
}
Deployment and execution
- Start the Hadoop cluster
/usr/local/big_data/homework/MapReduce/InvertedIndex# start-all.sh
- Compile the source
Change to the working directory and compile:
/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java
The compiler reported an error:
src/InverseIndex.java:8: error: package org.apache.hadoop.conf does not exist
import org.apache.hadoop.conf.Configuration;
...
This error occurs because the CLASSPATH environment variable is not set correctly. Append the following line to the end of ~/.bashrc:
vim ~/.bashrc
#add the following line at the end
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
Save and exit, then make the change take effect with:
source ~/.bashrc
Recompile:
/usr/local/big_data/homework/MapReduce/InvertedIndex# javac -d bin/ src/InverseIndex.java
The compiled class files now appear under the bin directory:
/usr/local/big_data/homework/MapReduce/InvertedIndex# ls bin/ -lh
total 24K
-rw-r--r-- 1 root root 2.3K 12月 25 20:02 InverseIndex.class
-rw-r--r-- 1 root root 1.9K 12月 25 20:02 'InverseIndex$InverseMapper.class'
-rw-r--r-- 1 root root 4.1K 12月 25 20:02 'InverseIndex$InverseReducer.class'
-rw-r--r-- 1 root root 2.9K 12月 25 20:02 'InverseIndex$statisticsMap.class'
-rw-r--r-- 1 root root 1.7K 12月 25 20:02 'InverseIndex$statisticsReduce.class'
- Package the jar file
jar -cvf InverseIndex.jar -C bin/ .
#output
added manifest
adding: InverseIndex$InverseReducer.class(in = 4186) (out= 1820)(deflated 56%)
adding: InverseIndex.class(in = 2282) (out= 1160)(deflated 49%)
adding: InverseIndex$InverseMapper.class(in = 1909) (out= 771)(deflated 59%)
adding: InverseIndex$statisticsMap.class(in = 2960) (out= 1371)(deflated 53%)
adding: InverseIndex$statisticsReduce.class(in = 1672) (out= 714)(deflated 57%)
- Upload the input files
Create an input directory on HDFS to hold the input files:
hadoop fs -mkdir input
Copy the input files from the local filesystem to the HDFS input directory:
hadoop fs -put /usr/local/big_data/homework/data/0ws*.txt input
Check that the copy succeeded:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls input/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 35 items
-rw-r--r-- 1 root supergroup 146884 2019-12-25 13:45 input/0ws0110.txt
-rw-r--r-- 1 root supergroup 166807 2019-12-25 13:45 input/0ws0210.txt
-rw-r--r-- 1 root supergroup 161786 2019-12-25 13:45 input/0ws0310.txt
-rw-r--r-- 1 root supergroup 189377 2019-12-25 13:45 input/0ws0410.txt
-rw-r--r-- 1 root supergroup 101367 2019-12-25 13:45 input/0ws0610.txt
-rw-r--r-- 1 root supergroup 136652 2019-12-25 13:45 input/0ws0910.txt
-rw-r--r-- 1 root supergroup 138374 2019-12-25 13:45 input/0ws1010.txt
-rw-r--r-- 1 root supergroup 117959 2019-12-25 13:45 input/0ws1110.txt
-rw-r--r-- 1 root supergroup 143801 2019-12-25 13:45 input/0ws1210.txt
-rw-r--r-- 1 root supergroup 137832 2019-12-25 13:45 input/0ws1410.txt
-rw-r--r-- 1 root supergroup 145078 2019-12-25 13:45 input/0ws1510.txt
-rw-r--r-- 1 root supergroup 156279 2019-12-25 13:45 input/0ws1610.txt
-rw-r--r-- 1 root supergroup 112280 2019-12-25 13:45 input/0ws1710.txt
-rw-r--r-- 1 root supergroup 137320 2019-12-25 13:45 input/0ws1810.txt
-rw-r--r-- 1 root supergroup 159175 2019-12-25 13:45 input/0ws1910.txt
-rw-r--r-- 1 root supergroup 142894 2019-12-25 13:45 input/0ws2010.txt
-rw-r--r-- 1 root supergroup 169826 2019-12-25 13:45 input/0ws2110.txt
-rw-r--r-- 1 root supergroup 139007 2019-12-25 13:45 input/0ws2210.txt
-rw-r--r-- 1 root supergroup 170479 2019-12-25 13:45 input/0ws2310.txt
-rw-r--r-- 1 root supergroup 132199 2019-12-25 13:45 input/0ws2410.txt
-rw-r--r-- 1 root supergroup 140595 2019-12-25 13:45 input/0ws2510.txt
-rw-r--r-- 1 root supergroup 184147 2019-12-25 13:45 input/0ws2610.txt
-rw-r--r-- 1 root supergroup 130617 2019-12-25 13:45 input/0ws2810.txt
-rw-r--r-- 1 root supergroup 150544 2019-12-25 13:45 input/0ws3010.txt
-rw-r--r-- 1 root supergroup 143937 2019-12-25 13:45 input/0ws3110.txt
-rw-r--r-- 1 root supergroup 172030 2019-12-25 13:45 input/0ws3210.txt
-rw-r--r-- 1 root supergroup 158008 2019-12-25 13:45 input/0ws3310.txt
-rw-r--r-- 1 root supergroup 119985 2019-12-25 13:45 input/0ws3410.txt
-rw-r--r-- 1 root supergroup 165919 2019-12-25 13:45 input/0ws3510.txt
-rw-r--r-- 1 root supergroup 181370 2019-12-25 13:45 input/0ws3610.txt
-rw-r--r-- 1 root supergroup 126492 2019-12-25 13:45 input/0ws3710.txt
-rw-r--r-- 1 root supergroup 177241 2019-12-25 13:45 input/0ws3910.txt
-rw-r--r-- 1 root supergroup 161652 2019-12-25 13:45 input/0ws4010.txt
-rw-r--r-- 1 root supergroup 115675 2019-12-25 13:45 input/0ws4110.txt
-rw-r--r-- 1 root supergroup 160868 2019-12-25 13:45 input/0ws4210.txt
The input files have been copied successfully. The warnings are caused by the relatively new Java version; they do not affect Hadoop and can be ignored.
- Run the jar file
hadoop jar InverseIndex.jar InverseIndex input tmp output
Here, input is the directory holding the input files; tmp holds the output of the first MapReduce job, which is also the input of the second; and output holds the output of the second job, i.e. the final result.
Note: the tmp and output directories must not already exist. If one of them does, an error like the following is raised:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop jar InvertedIndex.jar InvertedIndex input invertedindex/output
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/12/25 14:21:37 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/12/25 14:21:37 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/root/invertedindex/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:266)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:139)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:691)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at InvertedIndex.main(InvertedIndex.java:120)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The output directories are created by the system itself.
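If a previous run has already created them, one simple option (a suggestion, not part of the original walkthrough) is to delete the old directories on HDFS before running the job again:
hadoop fs -rm -r tmp output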
- Inspect the output
List the contents of the output directory on HDFS:
/usr/local/big_data/homework/MapReduce/InvertedIndex# hadoop fs -ls output/
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Found 2 items
-rw-r--r-- 1 root supergroup 0 2019-12-25 16:38 output/_SUCCESS
-rw-r--r-- 1 root supergroup 3821132 2019-12-25 16:38 output/part-r-00000
The part-r-00000 file under the output directory holds the final result.
- View the contents of the result file
hadoop fs -cat output/part-r-00000
The output looks like this (excerpt):
broach->4:{(0ws0110.txt,1,24700),(0ws3710.txt,1,21558),(0ws4210.txt,1,27895),(0ws0910.txt,1,23692)}
throgh->1:{(0ws1010.txt,1,24377)}
lustier->3:{(0ws1910.txt,1,27807),(0ws1610.txt,1,27734),(0ws3010.txt,1,26502)}
instances->6:{(0ws2510.txt,1,25060),(0ws1110.txt,1,20592),(0ws2210.txt,1,24615),(0ws2410.txt,1,22949),(0ws2110.txt,1,29220),(0ws2610.txt,1,32135)}
stout->12:{(0ws0310.txt,2,27573),(0ws0410.txt,1,32588),(0ws3610.txt,1,31061),(0ws1610.txt,2,27734),(0ws4210.txt,1,27895),(0ws3710.txt,1,21558),(0ws3410.txt,1,20206),(0ws0210.txt,2,28689),(0ws2810.txt,1,23284),(0ws0110.txt,3,24700),(0ws1410.txt,1,23904),(0ws1910.txt,1,27807)}
iuliet->2:{(0ws3110.txt,8,25123),(0ws1610.txt,54,27734)}
flee->1:{(0ws1210.txt,1,24901)}
fled->26:{(0ws2010.txt,1,24815),(0ws2210.txt,5,24615),(0ws0410.txt,4,32588),(0ws1410.txt,1,23904),(0ws3710.txt,1,21558),(0ws0610.txt,2,17919),(0ws3910.txt,5,30687),(0ws3010.txt,1,26502),(0ws0910.txt,1,23692),(0ws3510.txt,3,28227),(0ws1910.txt,1,27807),(0ws1710.txt,3,19538),(0ws0210.txt,6,28689),(0ws3310.txt,3,27164),(0ws2110.txt,3,29220),(0ws1510.txt,7,24965),(0ws3610.txt,1,31061),(0ws3210.txt,2,29959),(0ws1110.txt,3,20592),(0ws4010.txt,3,28122),(0ws3410.txt,7,20206),(0ws0110.txt,8,24700),(0ws1610.txt,2,27734),(0ws0310.txt,8,27573),(0ws1810.txt,1,24517),(0ws2410.txt,3,22949)}
hedgd->2:{(0ws1810.txt,1,24517),(0ws1410.txt,1,23904)}
hedge->10:{(0ws1910.txt,1,27807),(0ws3010.txt,1,26502),(0ws2410.txt,1,22949),(0ws1010.txt,1,24377),(0ws0210.txt,1,28689),(0ws2210.txt,1,24615),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2610.txt,1,32135),(0ws4010.txt,2,28122)}
flea->5:{(0ws1210.txt,1,24901),(0ws1010.txt,1,24377),(0ws2310.txt,2,29377),(0ws3310.txt,1,27164),(0ws2810.txt,1,23284)}
taxing->1:{(0ws2510.txt,1,25060)}
responsiue->1:{(0ws2610.txt,1,32135)}
flew->4:{(0ws0210.txt,1,28689),(0ws3310.txt,1,27164),(0ws4210.txt,1,27895),(0ws1810.txt,1,24517)}
treuant->1:{(0ws1210.txt,1,24901)}
commandment->5:{(0ws2610.txt,4,32135),(0ws4010.txt,1,28122),(0ws3610.txt,1,31061),(0ws2110.txt,1,29220),(0ws2510.txt,1,25060)}
sailemaker->1:{(0ws1010.txt,1,24377)}
carried->16:{(0ws3910.txt,1,30687),(0ws2610.txt,1,32135),(0ws1710.txt,1,19538),(0ws3510.txt,1,28227),(0ws2010.txt,1,24815),(0ws2210.txt,1,24615),(0ws0610.txt,2,17919),(0ws3310.txt,1,27164),(0ws1110.txt,1,20592),(0ws3110.txt,4,25123),(0ws4210.txt,2,27895),(0ws3610.txt,2,31061),(0ws1210.txt,1,24901),(0ws3410.txt,1,20206),(0ws3010.txt,2,26502),(0ws2310.txt,1,29377)}
auoiding->1:{(0ws0310.txt,1,27573)}
inthrald->1:{(0ws0110.txt,1,24700)}
clearenesse->1:{(0ws3410.txt,1,20206)}
blithe->2:{(0ws2210.txt,1,24615),(0ws0910.txt,1,23692)}
clothes->8:{(0ws2310.txt,1,29377),(0ws1010.txt,1,24377),(0ws0110.txt,1,24700),(0ws1610.txt,1,27734),(0ws1910.txt,1,27807),(0ws2610.txt,1,32135),(0ws3110.txt,1,25123),(0ws4010.txt,1,28122)}
raunge->2:{(0ws1510.txt,1,24965),(0ws2310.txt,1,29377)}
gonerill->1:{(0ws3310.txt,18,27164)}
stowd->1:{(0ws3210.txt,1,29959)}
carrier->2:{(0ws1910.txt,4,27807),(0ws0910.txt,1,23692)}
carries->15:{(0ws1910.txt,1,27807),(0ws3710.txt,1,21558),(0ws3510.txt,1,28227),(0ws1710.txt,2,19538),(0ws0610.txt,1,17919),(0ws2410.txt,1,22949),(0ws2610.txt,1,32135),(0ws4210.txt,2,27895),(0ws1810.txt,1,24517),(0ws4110.txt,1,19796),(0ws2510.txt,1,25060),(0ws1210.txt,1,24901),(0ws3610.txt,1,31061),(0ws3010.txt,3,26502),(0ws1610.txt,1,27734)}
obiections->3:{(0ws0210.txt,1,28689),(0ws4210.txt,1,27895),(0ws0110.txt,1,24700)}
vnsuspected->2:{(0ws0410.txt,1,32588),(0ws1010.txt,1,24377)}
arguments->7:{(0ws1210.txt,1,24901),(0ws2810.txt,3,23284),(0ws0310.txt,2,27573),(0ws4210.txt,1,27895),(0ws1410.txt,1,23904),(0ws3310.txt,1,27164),(0ws0410.txt,1,32588)}
shreds->2:{(0ws3610.txt,1,31061),(0ws2610.txt,1,32135)}
raunges->1:{(0ws3610.txt,1,31061)}