分别使用Hadoop MapReduce、hive统计手机流量博客分类： HiveHadoop HiveHadoop

程序员文章站 2024-03-18 08:02:39

...

问题导读
1.hive实现统计的查询语句是什么？
2.生产环境中为什么建议使用外部表？
3.hadoop mapreduce创建类DataWritable的作用是什么？
4.为什么创建类DataWritable？
5.如何实现统计手机流量？
6.对比hive与mapreduce统计手机流量的区别？

分别使用Hadoop MapReduce、hive统计手机流量

博客分类： HiveHadoop HiveHadoop

1.使用Hive进行手机流量统计

很多公司在使用hive对数据进行处理。
hive是hadoop家族成员，是一种解析like sql语句的框架。它封装了常用MapReduce任务，让你像执行sql一样操作存储在HDFS的表。
hive的表分为两种，内表和外表。
Hive 创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。
在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。这样外部表相对来说更加安全些，数据组织也更加灵活，方便共享源数据。
Hive的内外表，还有一个Partition的分区的知识点，用于避免全表扫描，快速检索。后期的文章会提到。
接下来开始正式开始《Hive统计手机流量》

原始数据：

1363157985066 13726230503    00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com             24    27    2481 24681 200
1363157995052 13826544101    5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4                   4    0    264    0    200
1363157991076 13926435656    20-10-7A-28-CC-0A:CMCC  120.196.100.99                2    4    132    1512 200
1363154400022 13926251106    5C-0E-8B-8B-B1-50:CMCC  120.197.40.4                   4    0    240    0    200
1363157993044 18211575961    94-71-AC-CD-E6-18:CMCC-EASY    120.196.100.99  iface.qiyi.com  瑙.?缃..       15 2    1527 2106 200
1363157995074 84138413       5C-0E-8B-8C-E8-20:7DaysInn    120.197.40.4 122.72.52.12          20    16    4116 1432 200
1363157993055 13560439658    C4-17-FE-BA-DE-D9:CMCC  120.196.100.99                18    15    1116 954    200
1363157995033 15920133257    5C-0E-8B-C7-BA-20:CMCC  120.197.40.4 sug.so.360.cn 淇℃.瀹..       20    20 156 2936 200

操作步骤：

#配置好Hive之后，使用hive命令启动hive框架。hive启动属于懒加载模式，会比较慢
hive;
#使用show databases命令查看当前数据库信息
hive> show databases;
OK
default
hive
Time taken: 3.389 seconds
#使用 use hive命令，使用指定的数据库 hive数据库是我之前创建的
use hive；
#创建表，这里是创建内表。内表加载hdfs上的数据，会将被加载文件中的内容剪切走。
#外表没有这个问题，所以在实际的生产环境中，建议使用外表。
create table ll(reportTime string,msisdn string,apmac string,acmac string,host string,siteType string,upPackNum bigint,downPackNum bigint,upPayLoad bigint,downPayLoad bigint,httpStatus string)row format delimited fields terminated by '\t';
#加载数据，这里是从hdfs加载数据，也可用linux下加载数据需要local关键字
load data inpath'/HTTP_20130313143750.dat' into table ll;
#数据加载完毕之后，hdfs的
#执行hive 的like sql语句,对数据进行统计
select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;

复制代码

执行结果如下：

hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-07-17 19:51:42,599 Stage-1 map = 0%, reduce = 0%
2013-07-17 19:52:40,474 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:41,690 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:42,693 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:43,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:44,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:45,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:46,712 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:47,715 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:48,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:49,758 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:50,763 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
2013-07-17 19:52:51,772 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:52,775 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
2013-07-17 19:52:53,779 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
MapReduce Total cumulative CPU time: 50 seconds 0 msec
Ended Job = job_201307160252_0006
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 50.0 sec HDFS Read: 2787075 HDFS Write: 16518 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 0 msec
OK
13402169727 171 108 11286 130230
13415807477 2067 1683 169668 1994181
13416127574 1501 1094 161963 802756
13416171820 113 99 10630 32120
13417106524 160 128 18688 13088
13418002498 240 256 22136 86896
13418090588 456 351 98934 67470
13418117364 264 152 29436 49966
13418173218 37680 48348 2261286 73159722
13418666750 22432 26482 1395648 39735552
13420637670 20 20 1480 1480
......
Time taken: 75.24 seconds

复制代码

分别使用Hadoop MapReduce、hive统计手机流量

博客分类： HiveHadoop HiveHadoop

2.Hadoop MapReduce手机流量统计

自定义一个writable

package cn.maoxiangyi.hadoop.wordcount;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class DataWritable implements Writable {
private int upPackNum;
private int downPackNum;
private int upPayLoad;
private int downPayLoad;
public DataWritable() {
super();
}
public DataWritable(int upPackNum, int downPackNum, int upPayLoad,
int downPayLoad) {
super();
this.upPackNum = upPackNum;
this.downPackNum = downPackNum;
this.upPayLoad = upPayLoad;
this.downPayLoad = downPayLoad;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(upPackNum);
out.writeInt(downPackNum);
out.writeInt(upPayLoad);
out.writeInt(downPayLoad);
}
@Override
public void readFields(DataInput in) throws IOException {
upPackNum = in.readInt();
downPackNum = in.readInt();
upPayLoad = in.readInt();
downPayLoad =in.readInt();
}
public int getUpPackNum() {
return upPackNum;
}
public void setUpPackNum(int upPackNum) {
this.upPackNum = upPackNum;
}
public int getDownPackNum() {
return downPackNum;
}
public void setDownPackNum(int downPackNum) {
this.downPackNum = downPackNum;
}
public int getUpPayLoad() {
return upPayLoad;
}
public void setUpPayLoad(int upPayLoad) {
this.upPayLoad = upPayLoad;
}
public int getDownPayLoad() {
return downPayLoad;
}
public void setDownPayLoad(int downPayLoad) {
this.downPayLoad = downPayLoad;
}
@Override
public String toString() {
return " " + upPackNum + " "
+ downPackNum + " " + upPayLoad + " "
+ downPayLoad;
}
}

复制代码

MapReduc函数

package cn.maoxiangyi.hadoop.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DataTotalMapReduce {
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
Job job = new Job(configuration);
job.setJarByClass(DataTotalMapReduce.class);
job.setMapperClass(DataTotalMapper.class);
job.setReducerClass(DataTotalReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DataWritable.class);
job.setCombinerClass(DataTotalReducer.class);
Path inputDir = new Path("hdfs://hadoop0:9000/HTTP_20130313143750.dat");
FileInputFormat.addInputPath(job, inputDir);
Path outputDir = new Path("hdfs://hadoop0:9000/dataTotal");
FileOutputFormat.setOutputPath(job, outputDir);
job.waitForCompletion(true);
}
}
/**
*
*
*
* 1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82
* i02.c.aliimg.com 24 27 2481 24681 200 1363157995052 13826544101
* 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200 1363157991076 13926435656
* 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
*
*
*/
class DataTotalMapper extends Mapper<LongWritable, Text, Text, DataWritable> {
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String lineStr = value.toString();
String[] strArr = lineStr.split("\t");
String phpone = strArr[1];
String upPackNum = strArr[6];
String downPackNum = strArr[7];
String upPayLoad = strArr[8];
String downPayLoad = strArr[9];
context.write(
new Text(phpone),
new DataWritable(Integer.parseInt(upPackNum), Integer
.parseInt(downPackNum), Integer.parseInt(upPayLoad),
Integer.parseInt(downPayLoad)));
}
}
class DataTotalReducer extends Reducer<Text, DataWritable, Text, DataWritable> {
@Override
protected void reduce(Text k2, Iterable<DataWritable> v2, Context context)
throws IOException, InterruptedException {
int upPackNumSum = 0;
int downPackNumSum = 0;
int upPayLoadSum = 0;
int downPayLoadSum = 0;
for (DataWritable dataWritable : v2) {
upPackNumSum += dataWritable.getUpPackNum();
downPackNumSum += dataWritable.getDownPackNum();
upPayLoadSum += dataWritable.getUpPayLoad();
downPayLoadSum += dataWritable.getDownPayLoad();
}
context.write(k2, new DataWritable(upPackNumSum, downPackNumSum, upPayLoadSum, downPayLoadSum));
}
}

复制代码

结果节选

13402169727 171 108 11286 130230
13415807477 2067 1683 169668 1994181
13416127574 1501 1094 161963 802756
13416171820 113 99 10630 32120
13417106524 160 128 18688 13088
13418002498 240 256 22136 86896
13418090588 456 351 98934 67470
13418117364 264 152 29436 49966
13418173218 37680 48348 2261286 73159722
13418666750 22432 26482 1395648 39735552
13420637670 20 20 1480 1480
13422149173 40 32 4000 3704
13422311151 465 535 33050 661790
13424077835 84 72 15612 9948
13424084200 765 690 60930 765675
13428887537 43892 44830 2925330 65047620
13430219372 454 352 33792 192876
13430234524 27852 39056 1767220 52076614
13430237899 1293 1165 166346 808613
13430258776 4681 4783 350511 6609423
13430266620 10544 9377 11600817 5728002
13432023893 40 0 2400 0

复制代码

http://www.aboutyun.com/forum.php?highlight=hive&mod=viewthread&tid=7455

分别使用Hadoop MapReduce、hive统计手机流量 博客分类： HiveHadoop HiveHadoop