Big Data Hadoop: A Hands-On Hadoop Serialization Case Study

1. Requirement:

Count the total upstream traffic, downstream traffic, and combined total traffic consumed by each phone number.

(1) Input data:

1	13736230513	192.196.100.1	www.atguigu.com	2481	24681	200
2	13846544121	192.196.100.2			264	0	200
3 	13956435636	192.196.100.3			132	1512	200
4 	13966251146	192.168.100.1			240	0	404
5 	18271575951	192.168.100.2	www.atguigu.com	1527	2106	200
6 	84188413	192.168.100.3	www.atguigu.com	4116	1432	200
7 	13590439668	192.168.100.4			1116	954	200
8 	15910133277	192.168.100.5	www.hao123.com	3156	2936	200
9 	13729199489	192.168.100.6			240	0	200
10 	13630577991	192.168.100.7	www.shouhu.com	6960	690	200
11 	15043685818	192.168.100.8	www.baidu.com	3659	3538	200
12 	15959002129	192.168.100.9	www.atguigu.com	1938	180	500
13 	13560439638	192.168.100.10			918	4938	200
14 	13470253144	192.168.100.11			180	180	200
15 	13682846555	192.168.100.12	www.qq.com	1938	2910	200
16 	13992314666	192.168.100.13	www.gaga.com	3008	3720	200
17 	13509468723	192.168.100.14	www.qinghua.com	7335	110349	404
18 	18390173782	192.168.100.15	www.sogou.com	9531	2412	200
19 	13975057813	192.168.100.16	www.baidu.com	11058	48243	200
20 	13768778790	192.168.100.17			120	120	200
21 	13568436656	192.168.100.18	www.alibaba.com	2481	24681	200
22 	13568436656	192.168.100.19			1116	954	200

(2) Input data format:

7 	13560436666	120.196.100.99		1116		 954			200
id	phone number	network IP	upstream traffic	downstream traffic	HTTP status code

(3) Expected output data format:

13560436666 		1116		   954 		2070
phone number	upstream traffic	downstream traffic	total traffic

2. Requirement analysis:

Map stage: read one line, split it on tabs, and emit the phone number (field 1) as the key, with the upstream and downstream traffic wrapped in a custom bean as the value. The traffic fields are taken as the third- and second-from-last fields so that records with an empty domain column still parse correctly. Because the value carries several long fields, it must be a custom class implementing Hadoop's Writable interface: a no-arg constructor for reflection, plus write()/readFields() that serialize the fields in the same order.

Reduce stage: for each phone number, accumulate the upstream and downstream traffic across all beans and emit a bean whose total traffic is their sum. For the sample record in (2) above, the map stage emits (13560436666, {1116, 954}) and the reduce stage outputs 13560436666 1116 954 2070.

3. Writing the MapReduce program:

(1) Write the bean object for traffic statistics:

package com.mapreduce.flowcount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowCountBean implements Writable {

	private long upFlow;   // upstream traffic
	private long downFlow; // downstream traffic
	private long sumFlow;  // total traffic
	
	// No-arg constructor, needed so Hadoop can instantiate the bean via reflection during deserialization
	public FlowCountBean() {
		
	}
	
	public FlowCountBean(long upFlow, long downFlow) {
		this.upFlow = upFlow;
		this.downFlow = downFlow;
		sumFlow = upFlow + downFlow;
	}

	// Deserialization method
	@Override
	public void readFields(DataInput in) throws IOException {
		// Note: fields must be read in exactly the same order they were written
		upFlow = in.readLong();
		downFlow = in.readLong();
		sumFlow = in.readLong();
	}

	// Serialization method
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeLong(upFlow);
		out.writeLong(downFlow);
		out.writeLong(sumFlow);
	}
	
	@Override
	public String toString() {
		// Tab-separated so downstream jobs can consume the output easily
		return upFlow + "\t" + downFlow + "\t" + sumFlow;
	}

	public long getUpFlow() {
		return upFlow;
	}

	public void setUpFlow(long upFlow) {
		this.upFlow = upFlow;
	}

	public long getDownFlow() {
		return downFlow;
	}

	public void setDownFlow(long downFlow) {
		this.downFlow = downFlow;
	}

}
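
A note on the interface choice: the bean above only ever travels as a value, so implementing Writable is enough. If it were used as a key instead (say, to sort phone numbers by total traffic), it would have to implement WritableComparable, because keys are sorted during the shuffle. A minimal sketch of that variant follows; the descending-by-total ordering is an assumption, not part of this case.

package com.mapreduce.flowcount;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch only: same fields and serialization as FlowCountBean,
// plus compareTo() so the bean can serve as a sortable key.
public class FlowSortBean implements WritableComparable<FlowSortBean> {

	private long upFlow;
	private long downFlow;
	private long sumFlow;

	// No-arg constructor for reflection during deserialization
	public FlowSortBean() {
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeLong(upFlow);
		out.writeLong(downFlow);
		out.writeLong(sumFlow);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		upFlow = in.readLong();
		downFlow = in.readLong();
		sumFlow = in.readLong();
	}

	@Override
	public int compareTo(FlowSortBean o) {
		// Assumed ordering: descending by total traffic
		return Long.compare(o.sumFlow, this.sumFlow);
	}
}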

(2) Write the Mapper class:

package com.mapreduce.flowcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowCountBean> {
	Text k = new Text();
	FlowCountBean v = new FlowCountBean();
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// 1. Read one line of input
		String line = value.toString();
		
		// 2. Split the line on tabs
		String[] fields = line.split("\t");
		
		// 3. Populate the key and value
		// Extract the phone number
		String phone = fields[1];
		// Extract upstream and downstream traffic, counting from the end of the
		// array because the domain column may be empty
		long upFlow = Long.parseLong(fields[fields.length - 3]);
		long downFlow = Long.parseLong(fields[fields.length - 2]);
		k.set(phone);
		v.setUpFlow(upFlow);
		v.setDownFlow(downFlow);
		// 4. Emit the key-value pair
		context.write(k, v);
	}
}
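
The mapper above trusts its input: a short line throws ArrayIndexOutOfBoundsException and a non-numeric field throws NumberFormatException, either of which fails the task. A defensive variant of the map() body is sketched below; the minimum field count and the counter names are assumptions, not part of the original tutorial.

	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] fields = value.toString().split("\t");
		// Assumed minimum: id, phone, ip, upFlow, downFlow, status (domain is optional)
		if (fields.length < 6) {
			// Count malformed lines instead of failing the task
			context.getCounter("FlowCount", "MalformedLines").increment(1);
			return;
		}
		try {
			k.set(fields[1]);
			v.setUpFlow(Long.parseLong(fields[fields.length - 3]));
			v.setDownFlow(Long.parseLong(fields[fields.length - 2]));
			context.write(k, v);
		} catch (NumberFormatException e) {
			context.getCounter("FlowCount", "BadNumbers").increment(1);
		}
	}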

(3) Write the Reducer class:

package com.mapreduce.flowcount;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReduce extends Reducer<Text, FlowCountBean, Text, FlowCountBean> {
	
	@Override
	protected void reduce(Text k, Iterable<FlowCountBean> values, Context context)
			throws IOException, InterruptedException {
		long sumUpFlow = 0;
		long sumDownFlow = 0;
		
		// 1. Iterate over all beans, accumulating upstream and downstream traffic separately
		for (FlowCountBean flowCountBean : values) {
			sumUpFlow += flowCountBean.getUpFlow();
			sumDownFlow += flowCountBean.getDownFlow();
		}
		// 2. Wrap the totals in a bean (the constructor computes sumFlow)
		FlowCountBean v = new FlowCountBean(sumUpFlow, sumDownFlow);
		
		// 3. Emit the result
		context.write(k, v);
	}
}
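
One detail worth knowing: Hadoop reuses a single FlowCountBean instance while iterating over values, overwriting its fields on each step, so caching references to the beans across iterations would be a bug; accumulating primitives, as above, is safe. By the same token, the output bean can be reused across reduce() calls rather than allocated each time, because context.write() serializes the value immediately. A sketch, assuming a hypothetical set(upFlow, downFlow) helper were added to FlowCountBean to recompute sumFlow:

	// Reused across reduce() calls; safe because context.write() serializes immediately
	private FlowCountBean v = new FlowCountBean();

	@Override
	protected void reduce(Text k, Iterable<FlowCountBean> values, Context context)
			throws IOException, InterruptedException {
		long sumUpFlow = 0;
		long sumDownFlow = 0;
		for (FlowCountBean bean : values) {
			sumUpFlow += bean.getUpFlow();
			sumDownFlow += bean.getDownFlow();
		}
		v.set(sumUpFlow, sumDownFlow); // hypothetical helper: sets both flows and sumFlow
		context.write(k, v);
	}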

(4) Write the Driver class:

package com.mapreduce.flowcount;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowCountDriver {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Local paths for a Windows test run; remove this line when submitting to a cluster
		args = new String[] {"D:\\hadoop-2.7.1\\winMR\\FlowCount\\input", "D:\\hadoop-2.7.1\\winMR\\FlowCount\\output1"};
		// 1. Get the Job instance
		Job job = Job.getInstance();
		
		// 2. Set the jar path
		job.setJarByClass(FlowCountDriver.class);
		
		// 3. Wire up the Mapper and Reducer
		job.setMapperClass(FlowCountMapper.class);
		job.setReducerClass(FlowCountReduce.class);
		
		// 4. Set the map output key/value types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(FlowCountBean.class);
		
		// 5. Set the final output key/value types
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(FlowCountBean.class);
		
		// 6. Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// 7. Submit the job and exit with its status
		boolean completed = job.waitForCompletion(true);
		System.exit(completed ? 0 : 1);
	}
}
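
To submit the job to a real cluster instead of running it locally, the hardcoded args line would be removed, the project packaged as a jar, and the job launched with the standard hadoop jar command. The jar name and HDFS paths below are placeholders:

hadoop jar FlowCount.jar com.mapreduce.flowcount.FlowCountDriver /flowcount/input /flowcount/output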

4. Output results:

(1) File location: with the default single reducer, the results are written to the output1 directory configured in the driver as a part-r-00000 file, alongside an _SUCCESS marker.

(2) File contents: one line per phone number, tab-separated as defined by FlowCountBean.toString(): phone number, upstream traffic, downstream traffic, total traffic, matching the expected format in 1.(3), e.g. 13560436666	1116	954	2070.
