MapReduce: Data Deduplication
程序员文章站
2022-07-05 22:00:52
Input data:

1997-01-02 phone
1998-10-01 window
1997-01-02 phone
2001-11-23 xbox
2013-08-16 vr
1997-01-02 phone
2001-11-23 xbox
2013-08-16 vr
Requirement: remove the duplicate records, keeping only one copy of each date/product pair.

Output:

1997-01-02 phone
1998-10-01 window
2001-11-23 xbox
2013-08-16 vr

Approach: by design, before the reduce function runs, MapReduce groups records that share the same key, collecting their values into a single group (in effect, one collection per key). After grouping, every key is unique. The shuffle phase therefore deduplicates keys for free. So we emit each input line unchanged as the key, with NullWritable as the value type, and the framework removes the duplicates for us.
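The grouping idea above can be sketched locally in plain Java, without Hadoop: a TreeMap stands in for the sorted, grouped shuffle output, so inserting each record as a key collapses duplicates exactly the way the shuffle does. The class name `ShuffleDedupSketch` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ShuffleDedupSketch {

    // "Map" emits (line, null); the TreeMap plays the role of the shuffle,
    // grouping values under each distinct key; "reduce" emits each key once.
    public static List<String> dedup(String[] records) {
        TreeMap<String, List<Object>> grouped = new TreeMap<>();
        for (String r : records) {                        // map + shuffle
            grouped.computeIfAbsent(r, k -> new ArrayList<>()).add(null);
        }
        return new ArrayList<>(grouped.keySet());         // reduce: key only
    }

    public static void main(String[] args) {
        String[] input = {
            "1997-01-02 phone", "1998-10-01 window", "1997-01-02 phone",
            "2001-11-23 xbox",  "2013-08-16 vr",     "1997-01-02 phone",
            "2001-11-23 xbox",  "2013-08-16 vr"
        };
        for (String line : dedup(input)) {
            System.out.println(line);
        }
    }
}
```

Each distinct line survives exactly once, and (as in a real MapReduce job) the output comes back sorted by key.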
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctDataDemo {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the raw input line as the key; duplicates collapse in the shuffle
            context.write(value, NullWritable.get());
        }
    }

    public static class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each key is now unique; write it out unchanged
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration object
        Configuration conf = new Configuration();
        // Get a FileSystem handle (FSUtil is the author's own helper class)
        FileSystem fs = FSUtil.getFS();
        // Create the job object
        Job job = Job.getInstance(conf, "distinctdataDemo");
        // Set the main class to run
        job.setJarByClass(DistinctDataDemo.class);
        // Mapper settings
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setMapperClass(MyMapper.class);
        // Input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Reducer settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Output path: delete it first if it already exists
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);
        // Submit the job and wait for completion
        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : -1);
    }
}