欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

MapReduce序列化及分区的java代码示例

程序员文章站 2022-03-18 21:52:10
概述 序列化(Serialization)是指把结构化对象转化为字节流。 反序列化(Deserialization)是序列化的逆过程。把字节流转为结构化对象。 当要在进程间传递对象或持久化对象的时候,就需要序列化对象成字节流,反之当要将接收到或从磁盘读取的字节流转换为对象,就要进行反序列化。 Jav ......

概述

  序列化(Serialization)是指把结构化对象转化为字节流

  反序列化(Deserialization)是序列化的逆过程。把字节流转为结构化对象。

  当要在进程间传递对象或持久化对象的时候,就需要序列化对象成字节流,反之当要将接收到或从磁盘读取的字节流转换为对象,就要进行反序列化。

  Java 的序列化(Serializable)是一个重量级序列化框架,一个对象被序列化后,会附带很多额外的信息(各种校验信息,header,继承体系…),不便于在网络中高效传输;所以,hadoop 自己开发了一套序列化机制( Writable),精简,高效。不用像 java 对象类一样传输多层的父子关系,需要哪个属性就传输哪个属性值,大大的减少网络传输的开销。

  Writable是Hadoop的序列化格式,hadoop定义了这样一个Writable接口。一个类要支持可序列化只需实现这个接口即可。

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

 

  如需要将自定义的 bean 放在 key 中传输,则还需要实现 comparable 接口,因为 mapreduce 框中的 shuffle 过程一定会对 key 进行排序,此时,自定义的bean 实现的接口应该是:WritableComparable

 

代码示例

  1 . 需求

    统计每一个用户(手机号)所耗费的总上行流量、下行流量,总流量结果的基础之上再加一个需求:将统计结果按照总流量倒序排序。

  准备数据

1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157985066     13726230503    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157995052     13826544101    5C-0E-8B-C7-F1-E0:CMCC    120.197.40.4            4    0    264    0    200
1363157991076     13926435656    20-10-7A-28-CC-0A:CMCC    120.196.100.99            2    4    132    1512    200
1363154400022     13926251106    5C-0E-8B-8B-B1-50:CMCC    120.197.40.4            4    0    240    0    200
1363157993044     18211575961    94-71-AC-CD-E6-18:CMCC-EASY    120.196.100.99    iface.qiyi.com    视频网站    15    12    1527    2106    200
1363157995074     84138413    5C-0E-8B-8C-E8-20:7DaysInn    120.197.40.4    122.72.52.12        20    16    4116    1432    200
1363157993055     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            18    15    1116    954    200
1363157995033     15920133257    5C-0E-8B-C7-BA-20:CMCC    120.197.40.4    sug.so.360.cn    信息安全    20    20    3156    2936    200
1363157983019     13719199419    68-A1-B7-03-07-B1:CMCC-EASY    120.196.100.82            4    0    240    0    200
1363157984041     13660577991    5C-0E-8B-92-5C-20:CMCC-EASY    120.197.40.4    s19.cnzz.com    站点统计    24    9    6960    690    200
1363157973098     15013685858    5C-0E-8B-C7-F7-90:CMCC    120.197.40.4    rank.ie.sogou.com    搜索引擎    28    27    3659    3538    200
1363157986029     15989002119    E8-99-C4-4E-93-E0:CMCC-EASY    120.196.100.99    www.umeng.com    站点统计    3    3    1938    180    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157992093     13560439658    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            15    9    918    4938    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157986041     13480253104    5C-0E-8B-C7-FC-80:CMCC-EASY    120.197.40.4            3    3    180    180    200
1363157984040     13602846565    5C-0E-8B-8B-B6-00:CMCC    120.197.40.4    2052.flash2-http.qq.com    综合门户    15    12    1938    2910    200
1363157995093     13922314466    00-FD-07-A2-EC-BA:CMCC    120.196.100.82    img.qfc.cn        12    12    3008    3720    200
1363157982040     13502468823    5C-0A-5B-6A-0B-D4:CMCC-EASY    120.196.100.99    y0.ifengimg.com    综合门户    57    102    7335    110349    200
1363157986072     18320173382    84-25-DB-4F-10-1A:CMCC-EASY    120.196.100.99    input.shouji.sogou.com    搜索引擎    21    18    9531    2412    200
1363157990043     13925057413    00-1F-64-E1-E6-9A:CMCC    120.196.100.55    t3.baidu.com    搜索引擎    69    63    11058    48243    200
1363157988072     13760778710    00-FD-07-A4-7B-08:CMCC    120.196.100.82            2    2    120    120    200
1363157985066     13726238888    00-FD-07-A4-72-B8:CMCC    120.196.100.82    i02.c.aliimg.com        24    27    2481    24681    200
1363157993055     13560436666    C4-17-FE-BA-DE-D9:CMCC    120.196.100.99            18    15    1116    954    200

 

  2 . 分析

    实现自定义的bean 来封装流量信息,并将bean 作为 map 输出的 key 来传输

    MR 程序在处理数据的过程中会对数据排序(map 输出的 kv 对传输到 reduce之前,会排序),排序的依据是 map 输出的 key。所以,我们如果要实现自己需要的排序规则,则可以考虑将排序因素放到 key 中,让 key 实现接口:WritableComparable,然后重写 key 的 compareTo 方法。

  3 . 未排序的实现

    自定义JavaBean

public class FlowBean implements WritableComparable<FlowBean>{
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() {
    }
    public FlowBean(long upFlow, long downFlow, long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }
    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }
    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }
    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }
    public long getDownFlow() {
        return downFlow;
    }
    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }
    public long getSumFlow() {
        return sumFlow;
    }
    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    /**
     * 序列化方法
     * @param out
     * @throws IOException
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    /**
     * 反序列化方法
     * 先序列化的先反序列化
     * @param in
     * @throws IOException
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    /**
     * 指定对象排序的方法
     *  如果指定的数与参数相等返回 0。
     *  如果指定的数小于参数返回 -1。
     *  如果指定的数大于参数返回 1。
     */
    @Override
    public int compareTo(FlowBean o) {
        return this.getSumFlow() > o.getSumFlow() ? -1 : 1 ;//按照指定的总流量的倒序排序
//        return this.getSumFlow() > o.getSumFlow() ? 1 : -1 ;//按照指定的总流量的正序排序
    }
}

 

    Mapper方法

public class FlowSumMapper extends Mapper<LongWritable,Text,Text,FlowBean> {
    Text k = new Text();
    FlowBean v = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\t");

        String phoNum = fields[1];//提前目标文件中的手机号
        long upFlow = Long.parseLong(fields[fields.length-3]);//提取目标文件中的上行流量
        long downFlow = Long.parseLong(fields[fields.length-2]);//提取目标文件中的下行流量

        k.set(phoNum);
        v.set(upFlow,downFlow);
        context.write(k,v);
    }
}

 

    Reducer方法

public class FlowSumReducer extends Reducer<Text,FlowBean,Text,FlowBean> {
    FlowBean v = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        long sumUpFlow = 0;
        long sumDownFlowd = 0;
        for (FlowBean value : values) {
            sumUpFlow += value.getUpFlow();//获取每条记录的上行流量并计算总和
            sumDownFlowd += value.getDownFlow();//获取每条记录的下行流量并计算总和
        }
        v.set(sumUpFlow ,sumDownFlowd);
        context.write(key,v);
    }
}

 

    主方法

public class FlowSumRunner {
    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        //指定mr程序使用本地模式模拟一套环境执行mr程序,一般用于本地代码测试
        conf.set("mapreduce.framework.name","local");
        //通过job方法获得mr程序运行的实例
        Job job = Job.getInstance(conf);

        //指定本次mr程序的运行主类
        job.setJarByClass(FlowSumRunner.class);
        //指定本次mr程序使用的mapper reduce
        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReducer.class);
        //指定本次mr程序map输出的数据类型
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        //指定本次mr程序reduce输出的数据类型,也就是说最终的输出类型
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        //指定本次mr程序待处理数据目录   输出结果存放目录
        FileInputFormat.addInputPath(job,new Path("D:\\flowsum\\input"));
        FileOutputFormat.setOutputPath(job,new Path("D:\\flowsum\\output"));

        //提交本次mr程序
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);//程序执行成功,退出状态码为0,退出程序,否则为1
    }
}

 

    

   3 . 排序的实现

      使用上面的输出作为该需求的输入

    Mapper方法

public class FlowSumSortMapper extends Mapper<LongWritable,Text,FlowBean,Text> {
    FlowBean k = new FlowBean();
    Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fileds = line.split("\t");

        String phoNum = fileds[0];
        long sumUpFlow = Long.parseLong(fileds[1]);
        long sumDownFlow = Long.parseLong(fileds[2]);

        v.set(phoNum);
        k.set(sumUpFlow,sumDownFlow);
        context.write(k,v);
    }
}

    Reducer方法

public class FlowSumSortReducer extends Reducer<FlowBean,Text,Text,FlowBean> {
    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        Text phoNum = values.iterator().next();//iterator中只有一个值
        context.write(phoNum,key);
    }
}

    主方法

 1 //得出上题结果的基础之上再加一个需求:将统计结果按照总流量倒序排序
 2 public class FlowSumSortDriver {
 3     public static void main(String[] args) throws Exception{
 4         Configuration conf = new Configuration();
 5         //指定mr程序使用本地模式模拟一套环境执行mr程序,一般用于本地代码测试
 6         conf.set("mapreduce.framework.name","local");
 7 
 8         //通过job方法获得mr程序运行的实例
 9         Job job = Job.getInstance(conf);
10 
11         //指定本次mr程序的运行主类
12         job.setJarByClass(FlowSumSortDriver.class);
13         //指定本次mr程序使用的mapper reduce
14         job.setMapperClass(FlowSumSortMapper.class);
15         job.setReducerClass(FlowSumSortReducer.class);
16         //指定本次mr程序map输出的数据类型
17         job.setMapOutputKeyClass(FlowBean.class);
18         job.setMapOutputValueClass(Text.class);
19         //指定本次mr程序reduce输出的数据类型,也就是说最终的输出类型
20         job.setOutputKeyClass(Text.class);
21         job.setOutputValueClass(FlowBean.class);
22         //指定本次mr程序待处理数据目录   输出结果存放目录
23         FileInputFormat.addInputPath(job,new Path("D:\\flowsum\\output"));
24         FileOutputFormat.setOutputPath(job,new Path("D:\\flowsum\\outsortput"));
25 
26         //提交本次mr程序
27         boolean b = job.waitForCompletion(true);
28         System.exit(b ? 0 : 1);//程序执行成功,退出状态码为0,退出程序,否则为1
29     }
30 }

 

Mapreduce的分区—Partitioner

MapReduce序列化及分区的java代码示例

1 .  需求

    将流量汇总统计结果按照手机归属地不同省份输出到不同文件中。

2 .  分析

    Mapreduce 中会将 map 输出的 kv 对,按照相同 key 分组,然后分发给不同的 reducetask。

    默认的分发规则为:根据 key 的 hashcode%reducetask 数来分发

    所以:如果要按照我们自己的需求进行分组,则需要改写数据分发(分组)组件 Partitioner,自定义一个 CustomPartitioner 继承抽象类:Partitioner,然后在job 对象中,设置自定义 partitioner: job.setPartitionerClass(CustomPartitioner.class)

3 .  实现

    自定义partitioner类

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {   
    public static HashMap<String, Integer>  provinceMap = new HashMap<String, Integer>();   
    static{
        provinceMap.put("134", 0);
        provinceMap.put("135", 1);
        provinceMap.put("136", 2);
        provinceMap.put("137", 3);
        provinceMap.put("138", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        Integer code = provinceMap.get(key.toString().substring(0, 3));       
        if (code != null) {
            return code;
        }      
        return 5;
    }
}

 

    Mapper、Reducer及主方法

 1 public class FlowSumProvince {   
 2  public static class FlowSumProvinceMapper extends Mapper<LongWritable, Text, Text, FlowBean>{        
 3      Text k = new Text();
 4      FlowBean  v = new FlowBean();
 5      
 6      @Override
 7      protected void map(LongWritable key, Text value,Context context)
 8              throws IOException, InterruptedException {
 9              //拿取一行文本转为String
10              String line = value.toString();
11              //按照分隔符\t进行分割
12              String[] fileds = line.split("\t");
13              //获取用户手机号
14              String phoneNum = fileds[1];
15              
16              long upFlow = Long.parseLong(fileds[fileds.length-3]);
17              long downFlow = Long.parseLong(fileds[fileds.length-2]);
18              
19              k.set(phoneNum);
20              v.set(upFlow, downFlow);            
21              context.write(k,v);                
22         }        
23     }
24        
25     public static class FlowSumProvinceReducer extends Reducer<Text, FlowBean, Text, FlowBean>{        
26         FlowBean  v  = new FlowBean();         
27         @Override
28         protected void reduce(Text key, Iterable<FlowBean> flowBeans,Context context) throws IOException, InterruptedException {            
29             long upFlowCount = 0;
30             long downFlowCount = 0;
31             
32             for (FlowBean flowBean : flowBeans) {                
33                 upFlowCount += flowBean.getUpFlow();                
34                 downFlowCount += flowBean.getDownFlow();                
35             }
36             v.set(upFlowCount, downFlowCount);            
37             context.write(key, v);
38     }
39        
40     public static void main(String[] args) throws Exception{       
41         Configuration conf = new Configuration();
42         Job job = Job.getInstance(conf);
43 
44         //指定我这个 job 所在的 jar包位置
45         job.setJarByClass(FlowSumProvince.class);       
46         //指定我们使用的Mapper是那个类  reducer是哪个类
47         job.setMapperClass(FlowSumProvinceMapper.class);
48         job.setReducerClass(FlowSumProvinceReducer.class);        
49         // 设置我们的业务逻辑 Mapper 类的输出 key 和 value 的数据类型
50         job.setMapOutputKeyClass(Text.class);
51         job.setMapOutputValueClass(FlowBean.class);        
52         // 设置我们的业务逻辑 Reducer 类的输出 key 和 value 的数据类型
53         job.setOutputKeyClass(Text.class);
54         job.setOutputValueClass(FlowBean.class);
55                 
56         //这里设置运行reduceTask的个数
57         job.setNumReduceTasks(6);
58                
59         //这里指定使用我们自定义的分区组件
60         job.setPartitionerClass(ProvincePartitioner.class);
61                
62         FileInputFormat.setInputPaths(job, new Path("D:\\flowsum\\input"));
63         // 指定处理完成之后的结果所保存的位置
64         FileOutputFormat.setOutputPath(job, new Path("D:\\flowsum\\outputProvince"));        
65         boolean res = job.waitForCompletion(true);
66         System.exit(res ? 0 : 1);       
67     }
68  }
69 }