MapReduce Programming Example: WordCount
Creating the MapReduce Project
The project is created in Eclipse; its structure is shown below.
(Screenshot of the Eclipse project structure.)
Creating the Wcjob class
package cn.edu.gznc.wc;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Wcjob {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job wordCountJob = Job.getInstance(conf);

        // Important: specify the jar that contains this job
        wordCountJob.setJarByClass(Wcjob.class);

        // Set the mapper class used by wordCountJob
        wordCountJob.setMapperClass(WcMapper.class);
        // Set the reducer class used by wordCountJob
        wordCountJob.setReducerClass(WcReducer.class);

        // Set the key/value types of the map-phase output
        wordCountJob.setMapOutputKeyClass(Text.class);
        wordCountJob.setMapOutputValueClass(IntWritable.class);

        // Set the key/value types of the final output
        wordCountJob.setOutputKeyClass(Text.class);
        wordCountJob.setOutputValueClass(IntWritable.class);

        // Set the path of the input text data
        FileInputFormat.setInputPaths(wordCountJob, args[0]);
        FileOutputFormat.setOutputPath(wordCountJob, new Path(args[1]));

        // Submit the job to the Hadoop cluster and wait for it to finish
        wordCountJob.waitForCompletion(true);
    }
}
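As a side note: the submission log further below warns "Implement the Tool interface and execute your application with ToolRunner". A minimal sketch of that variant (a hypothetical rewrite for illustration, with the made-up class name WcjobTool; it is not the driver used in this run):

package cn.edu.gznc.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WcjobTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any generic options (-D, -files, ...) parsed by ToolRunner
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WcjobTool.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic options before passing args to run()
        System.exit(ToolRunner.run(new Configuration(), new WcjobTool(), args));
    }
}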
Creating the WcMapper class
package cn.edu.gznc.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * KEYIN:    key type of the input key/value pairs
 * VALUEIN:  value type of the input key/value pairs
 * KEYOUT:   key type of the output key/value pairs
 * VALUEOUT: value type of the output key/value pairs
 */
public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /*
     * The map method is called by the map task, which invokes our custom map
     * method once for each line of text it reads.
     * The arguments passed on each call are:
     *   - the byte offset of the line's start (LongWritable) as the key
     *   - the line's text content (Text) as the value
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the line of text and convert it to a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Emit <word, 1> for each word
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
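A common refinement worth knowing (my sketch, not part of the original code; the class name WcMapperReuse is made up for illustration): allocating a new Text and IntWritable per word creates avoidable garbage. Because the framework serializes the bytes at each context.write, the same Writable objects can be reused across calls:

package cn.edu.gznc.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WcMapperReuse extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls: write() copies the serialized bytes,
    // so mutating these objects afterwards is safe.
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String w : value.toString().split(" ")) {
            word.set(w);
            context.write(word, ONE);
        }
    }
}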
Creating the WcReducer class
package cn.edu.gznc.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * KEYIN:    key type of the mapper-stage output
 * VALUEIN:  value type of the mapper-stage output
 * KEYOUT:   key type of the result key/value pairs after the reduce
 * VALUEOUT: value type of the result key/value pairs after the reduce
 */
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    /*
     * The reduce method is called by the reduce task.
     *
     * The reduce task aggregates the many key/value pairs delivered by the
     * shuffle stage, grouping pairs that share the same key, and then calls
     * our custom reduce method once per group.
     * For example, given <hello,1><hello,1><hello,1><tom,1><tom,1><tom,1>:
     *   the hello group triggers one call to reduce, and the tom group another.
     * The arguments passed on each call are:
     *   key:    the key shared by the group
     *   values: an iterator over all values in the group
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Define a counter
        int count = 0;
        // Iterate over all values in this group and accumulate them
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit the count for this word
        context.write(key, new IntWritable(count));
    }
}
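Because this reduce logic is a plain associative, commutative sum with identical input and output types, the same class can also act as a combiner, pre-aggregating map output locally before the shuffle (an optional tweak not used in the run below, whose counters show Combine input records=0). One extra line in the driver would enable it:

wordCountJob.setCombinerClass(WcReducer.class);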
Exporting the Jar
I initially exported the jar to the Hadoop installation directory, but that felt hard to manage, so I created a myjar folder inside the Hadoop directory and moved wc.jar there.
Preparing the Data
Creating the data
I created a data folder under the Hadoop installation directory to hold the input data.
My data file is named mapreduce.data.
Its content is:
Diamond is well-known as the strongest of all natural materials, and with that strength comes another tightly linked property: brittleness. But now, an international team of researchers from MIT, *, Singapore, and Korea has found that when grown in extremely tiny, needle-like shapes, diamond can bend and stretch, much like rubber, and snap back to its original shape.
The surprising finding is being reported this week in the journal Science, in a paper by senior author Ming Dao, a principal research scientist in MIT’s Department of Materials Science and Engineering; MIT postdoc Daniel Bernoulli; senior author Subra Suresh, former MIT dean of engineering and now president of Singapore’s Nanyang Technological University; graduate students Amit Banerjee and Hongti Zhang at City University of *; and seven others from CUHK and institutions in Ulsan, South Korea.
The results, the researchers say, could open the door to a variety of diamond-based devices for applications such as sensing, data storage, actuation, biocompatible in vivo imaging, optoelectronics, and drug delivery. For example, diamond has been explored as a possible biocompatible carrier for delivering drugs into cancer cells. The team showed that the narrow diamond needles, similar in shape to the rubber tips on the end of some toothbrushes but just a few hundred nanometers (billionths of a meter) across, could flex and stretch by as much as 9 percent without breaking, then return to their original configuration, Dao says.
Ordinary diamond in bulk form, Bernoulli says, has a limit of well below 1 percent stretch. “It was very surprising to see the amount of elastic deformation the nanoscale diamond could sustain,” he says.
“We developed a unique nanomechanical approach to precisely control and quantify the ultralarge elastic strain distributed in the nanodiamond samples,” says Yang Lu, senior co-author and associate professor of mechanical and biomedical engineering at CUHK. Putting crystalline materials such as diamond under ultralarge elastic strains, as happens when these pieces flex, can change their mechanical properties as well as thermal, optical, magnetic, electrical, electronic, and chemical reaction properties in significant ways, and could be used to design materials for specific applications through “elastic strain engineering,” the team says.
The team measured the bending of the diamond needles, which were grown through a chemical vapor deposition process and then etched to their final shape, by observing them in a scanning electron microscope while pressing down on the needles with a standard nanoindenter diamond tip (essentially the corner of a cube). Following the experimental tests using this system, the team did many detailed simulations to interpret the results and was able to determine precisely how much stress and strain the diamond needles could accommodate without breaking. The researchers also developed a computer model of the nonlinear elastic deformation for the actual geometry of the diamond needle, and found that the maximum tensile strain of the nanoscale diamond was as high as 9 percent. The computer model also predicted that the corresponding maximum local stress was close to the known ideal tensile strength of diamond — i.e. the theoretical limit achievable by defect-free diamond.
When the entire diamond needle was made of one crystal, failure occurred at a tensile strain as high as 9 percent. Until this critical level was reached, the deformation could be completely reversed if the probe was retracted from the needle and the specimen was unloaded. If the tiny needle was made of many grains of diamond, the team showed that they could still achieve unusually large strains. However, the maximum strain achieved by the polycrystalline diamond needle was less than one-half that of the single crystalline diamond needle.
Yonggang Huang, a professor of civil and environmental engineering and mechanical engineering at Northwestern University, who was not involved in this research, agrees with the researchers’ assessment of the potential impact of this work. “The surprise finding of ultralarge elastic deformation in a hard and brittle material — diamond — opens up unprecedented possibilities for tuning its optical, optomechanical, magnetic, phononic, and catalytic properties through elastic strain engineering,” he says.
Huang adds “When elastic strains exceed 1 percent, significant material property changes are expected through quantum mechanical calculations. With controlled elastic strains between 0 to 9 percent in diamond, we expect to see some surprising property changes.”
The team also included Muk-Fung Yuen, Jiabin Liu, Jian Lu, Wenjun Zhang, and Yang Lu at the City University of *; and Jichen Dong and Feng Ding at the Institute for Basic Science, in South Korea. The work was funded by the Research Grants Council of the * Special Administrative Region, Singapore-MIT Alliance for Research and Technology (SMART), Nanyang Technological University Singapore, and the National Natural Science Foundation of China.
Uploading the data to the Hadoop file system
hadoop fs -put ./data/mapreduce.data /user/hadoop/input/
If the hadoop and input directories do not yet exist under /user, create them yourself.
Commands to create them:
hadoop fs -mkdir /user/hadoop
hadoop fs -mkdir /user/hadoop/input
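Alternatively, the standard -p flag creates both levels in one command:
hadoop fs -mkdir -p /user/hadoop/input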
Check the result:
hadoop fs -ls /user/hadoop/input/
Running the Job
Run it using the jar:
hadoop jar ./myjar/wc.jar cn.edu.gznc.wc.Wcjob /user/hadoop/input/mapreduce.data /user/hadoop/outputwc/
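Note that FileOutputFormat refuses to start if the output directory already exists; when re-running the job, delete it first:
hadoop fs -rm -r /user/hadoop/outputwc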
A series of job status and progress messages will then appear:
18/04/24 15:02:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/24 15:02:54 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.164.146:18040
18/04/24 15:02:57 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/04/24 15:02:58 INFO input.FileInputFormat: Total input paths to process : 1
18/04/24 15:02:58 INFO mapreduce.JobSubmitter: number of splits:1
18/04/24 15:02:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1524510537951_0005
18/04/24 15:03:01 INFO impl.YarnClientImpl: Submitted application application_1524510537951_0005
18/04/24 15:03:01 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1524510537951_0005/
18/04/24 15:03:01 INFO mapreduce.Job: Running job: job_1524510537951_0005
18/04/24 15:03:14 INFO mapreduce.Job: Job job_1524510537951_0005 running in uber mode : false
18/04/24 15:03:15 INFO mapreduce.Job: map 0% reduce 0%
18/04/24 15:03:37 INFO mapreduce.Job: map 100% reduce 0%
18/04/24 15:03:46 INFO mapreduce.Job: map 100% reduce 100%
18/04/24 15:03:47 INFO mapreduce.Job: Job job_1524510537951_0005 completed successfully
18/04/24 15:03:48 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=9895
        FILE: Number of bytes written=213221
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=5280
        HDFS: Number of bytes written=4156
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=20392
        Total time spent by all reduces in occupied slots (ms)=6464
        Total time spent by all map tasks (ms)=20392
        Total time spent by all reduce tasks (ms)=6464
        Total vcore-seconds taken by all map tasks=20392
        Total vcore-seconds taken by all reduce tasks=6464
        Total megabyte-seconds taken by all map tasks=20881408
        Total megabyte-seconds taken by all reduce tasks=6619136
    Map-Reduce Framework
        Map input records=21
        Map output records=788
        Map output bytes=8313
        Map output materialized bytes=9895
        Input split bytes=116
        Combine input records=0
        Combine output records=0
        Reduce input groups=426
        Reduce shuffle bytes=9895
        Reduce input records=788
        Reduce output records=426
        Spilled Records=1576
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=161
        CPU time spent (ms)=17670
        Physical memory (bytes) snapshot=310317056
        Virtual memory (bytes) snapshot=1679785984
        Total committed heap usage (bytes)=136122368
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=5164
    File Output Format Counters
        Bytes Written=4156
Output like the above indicates success. Check the output files:
hadoop fs -ls /user/hadoop/outputwc/
The output is stored in part-r-00000.
View the result:
hadoop fs -cat /user/hadoop/outputwc/part-r-00000
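Each line of part-r-00000 holds one word and its count, tab-separated and sorted by key, along the lines of (illustrative values only; the actual counts depend on the data above, and punctuation stays attached to words because the mapper splits on spaces):
Diamond	2
diamond	9
the	38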