Hive分析搜索引擎的数据(搜狗实验室数据) 博客分类: Hive HiveHadoop
程序员文章站
2024-03-22 15:30:46
...
搜狗实验室:http://www.sogou.com/labs/resources.html
用户查询日志:http://www.sogou.com/labs/dl/q.html
将上面三种数据全部下载下来,解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中;
注意:如果数据乱码的情况下,请先转成utf-8的格式,方便后面的练习;
数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列:时间;第二列:id;第四列:搜索关键词;第四列:关键词在搜索结果中的排名;第五列:查询结果列表中用户点击了第几个;第六列:url;
创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;
创建对应分区目录,并将数据上传到对应的分区中
加载数据到对应的分区中
查看分区
查看每个分区的数量
从上面看执行情况,发现hive执行时使用了一个 job,执行时间比较长,原因有二:
1.因为自己搭建的是四台虚拟机进行测试,和生产环境的大集群没有可比性;
2.hive执行MapReduce过程本身就很慢;
执行的过程中可通过浏览器查看job运行情况:
练习:(使用分区为sogouq3的数据,因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个,点击了第一个的有多少;
2.查询分区d='sogouq3'中搜索的关键字排名前5的数据;
3.url搜索访问此时排名前5的
4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中;
5.统计用户每个时段搜索所在比例;
用户查询日志:http://www.sogou.com/labs/dl/q.html
将上面三种数据全部下载下来,解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中;
注意:如果数据乱码的情况下,请先转成utf-8的格式,方便后面的练习;
数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列:时间;第二列:id;第四列:搜索关键词;第四列:关键词在搜索结果中的排名;第五列:查询结果列表中用户点击了第几个;第六列:url;
创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;
hive> create external table tbsogou(id string,websession string,keyword string,search_seq int,click_seq int,website string) partitioned by (d string) row format delimited fields terminated by '\t' lines terminated by '\n'; OK Time taken: 0.208 seconds [img]http://dl2.iteye.com/upload/attachment/0115/7441/03188d9f-132d-3178-9e26-0668e800f756.png[/img]
创建对应分区目录,并将数据上传到对应的分区中
root@master1:/usr/local/hive/hivedata# ll 总用量 1379652 drwxr-xr-x 2 root root 4096 3月 6 19:05 ./ drwxr-xr-x 6 root root 4096 3月 6 19:08 ../ -rw-rw-r-- 1 jylu jylu 108750574 6月 14 2014 SogouQ1.txt -rw-rw-r-- 1 jylu jylu 217441417 6月 14 2014 SogouQ2.txt -rw-rw-r-- 1 jylu jylu 1086552775 7月 13 2014 SogouQ3.txt root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 16/03/06 22:26:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 16/03/06 22:26:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 16/03/06 22:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ2.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 16/03/06 22:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 16/03/06 22:27:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ3.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 16/03/06 22:27:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -lsr /user/hive/warehouse/testdb.db/tbsogou/ lsr: DEPRECATED: Please use 'ls -R' instead. 16/03/06 22:28:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable drwxr-xr-x - root supergroup 0 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 -rw-r--r-- 2 root supergroup 108750574 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1/SogouQ1.txt drwxr-xr-x - root supergroup 0 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 -rw-r--r-- 2 root supergroup 217441417 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2/SogouQ2.txt drwxr-xr-x - root supergroup 0 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 -rw-r--r-- 2 root supergroup 1086552775 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3/SogouQ3.txt root@master1:/usr/local/hive/hivedata#
加载数据到对应的分区中
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq1')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq1'; OK Time taken: 0.266 seconds hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq2')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq2'; OK Time taken: 0.177 seconds hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq3')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq3'; OK Time taken: 0.245 seconds
查看分区
hive> show partitions tbsogou; OK d=sogouq1 d=sogouq2 d=sogouq3 Time taken: 0.152 seconds, Fetched: 3 row(s)
查看每个分区的数量
hive> select d,count(1) from tbsogou group by d; Query ID = root_20160307213457_df786923-5b7e-41e3-be18-bf6c2c06a926 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 6 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457352585589_0001, Tracking URL = http://master1:8088/proxy/application_1457352585589_0001/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457352585589_0001 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 6 2016-03-07 21:35:21,593 Stage-1 map = 0%, reduce = 0% 2016-03-07 21:36:22,322 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 3.86 sec 2016-03-07 21:36:39,202 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 13.62 sec 2016-03-07 21:36:42,474 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 14.91 sec 2016-03-07 21:36:46,047 Stage-1 map = 31%, reduce = 0%, Cumulative CPU 15.84 sec 2016-03-07 21:36:48,170 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 16.12 sec 2016-03-07 21:36:49,239 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 16.56 sec 2016-03-07 21:36:58,007 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 17.53 sec 2016-03-07 21:37:07,956 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 21.51 sec 2016-03-07 21:37:12,077 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 24.44 sec 2016-03-07 21:37:15,700 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 25.86 sec 2016-03-07 21:37:16,828 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 26.18 sec 2016-03-07 21:37:53,346 Stage-1 map = 100%, reduce = 4%, Cumulative CPU 27.15 sec 2016-03-07 21:37:56,467 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 28.87 sec 2016-03-07 21:37:58,541 Stage-1 map = 100%, reduce = 28%, Cumulative CPU 29.83 sec 2016-03-07 21:37:59,589 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 31.31 sec 2016-03-07 21:38:01,668 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 33.3 sec 2016-03-07 21:38:02,712 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 35.63 sec 2016-03-07 21:38:05,801 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 37.31 sec 2016-03-07 21:38:06,831 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 39.19 sec MapReduce Total cumulative CPU time: 39 seconds 190 msec Ended Job = job_1457352585589_0001 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 6 Cumulative CPU: 39.19 sec HDFS Read: 1491774463 HDFS Write: 49 SUCCESS Total MapReduce CPU Time Spent: 39 seconds 190 msec OK sogouq1 1000000 sogouq2 2000000 sogouq3 10000000 Time taken: 194.981 seconds, Fetched: 3 row(s)
从上面看执行情况,发现hive执行时使用了一个 job,执行时间比较长,原因有二:
1.因为自己搭建的是四台虚拟机进行测试,和生产环境的大集群没有可比性;
2.hive执行MapReduce过程本身就很慢;
执行的过程中可通过浏览器查看job运行情况:
练习:(使用分区为sogouq3的数据,因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个,点击了第一个的有多少;
hive> select count(*)cnt from tbsogou where d='sogouq3' and search_seq=1 and click_seq=1; Query ID = root_20160313111650_9c4bfb1e-c330-4ef1-b6f1-ef0d74812678 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0005, Tracking URL = http://master1:8088/proxy/application_1457829749863_0005/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0005 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1 2016-03-13 11:17:00,345 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:17:12,813 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 4.75 sec 2016-03-13 11:17:13,845 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 5.04 sec 2016-03-13 11:17:16,936 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 8.86 sec 2016-03-13 11:17:17,975 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 9.81 sec 2016-03-13 11:17:26,298 Stage-1 map = 40%, reduce = 13%, Cumulative CPU 15.56 sec 2016-03-13 11:17:28,357 Stage-1 map = 53%, reduce = 13%, Cumulative CPU 19.81 sec 2016-03-13 11:17:32,491 Stage-1 map = 60%, reduce = 13%, Cumulative CPU 21.32 sec 2016-03-13 11:17:38,644 Stage-1 map = 67%, reduce = 20%, Cumulative CPU 22.65 sec 2016-03-13 11:17:40,708 Stage-1 map = 73%, reduce = 20%, Cumulative CPU 23.21 sec 2016-03-13 11:17:44,834 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 24.91 sec 2016-03-13 11:17:47,903 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.27 sec MapReduce Total cumulative CPU time: 26 seconds 270 msec Ended Job = job_1457829749863_0005 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 1 Cumulative CPU: 26.27 sec HDFS Read: 1147287940 HDFS Write: 8 SUCCESS Total MapReduce CPU Time Spent: 26 seconds 270 msec OK 3891022 Time taken: 58.658 seconds, Fetched: 1 row(s)
2.查询分区d='sogouq3'中搜索的关键字排名前5的数据;
hive> select keyword,count(*) as cnt from tbsogou where d='sogouq3' group by keyword order by cnt desc limit 5; Query ID = root_20160313111820_181be93c-07f4-4aa3-ab42-778b9cc2f312 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0006, Tracking URL = http://master1:8088/proxy/application_1457829749863_0006/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0006 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:18:31,471 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:18:46,946 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 6.6 sec 2016-03-13 11:18:57,652 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 30.33 sec 2016-03-13 11:18:59,234 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 32.57 sec 2016-03-13 11:19:02,356 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 34.44 sec 2016-03-13 11:19:03,644 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 36.89 sec 2016-03-13 11:19:05,416 Stage-1 map = 53%, reduce = 0%, Cumulative CPU 36.89 sec 2016-03-13 11:19:09,267 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 40.57 sec 2016-03-13 11:19:10,466 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 42.18 sec 2016-03-13 11:19:22,900 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 45.82 sec 2016-03-13 11:19:27,047 Stage-1 map = 87%, reduce = 0%, Cumulative CPU 49.62 sec 2016-03-13 11:19:28,201 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 50.0 sec 2016-03-13 11:19:29,228 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 51.6 sec 2016-03-13 11:19:38,556 Stage-1 map = 100%, reduce = 8%, Cumulative CPU 52.38 sec 2016-03-13 11:19:44,830 Stage-1 map = 100%, reduce = 26%, Cumulative CPU 53.72 sec 2016-03-13 11:19:47,907 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 56.14 sec 2016-03-13 11:19:51,016 Stage-1 map = 100%, reduce = 29%, Cumulative CPU 59.15 sec 2016-03-13 11:19:54,105 Stage-1 map = 100%, reduce = 34%, Cumulative CPU 61.99 sec 2016-03-13 11:19:56,161 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 63.76 sec 2016-03-13 11:20:13,727 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 67.11 sec 2016-03-13 11:20:15,901 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 71.64 sec 2016-03-13 11:20:16,927 Stage-1 map = 100%, reduce = 88%, Cumulative CPU 74.51 sec 2016-03-13 11:20:20,012 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 76.43 sec 2016-03-13 11:20:23,082 Stage-1 map = 100%, reduce = 96%, Cumulative CPU 78.55 sec 2016-03-13 11:20:25,143 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 79.45 sec MapReduce Total cumulative CPU time: 1 minutes 19 seconds 450 msec Ended Job = job_1457829749863_0006 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0007, Tracking URL = http://master1:8088/proxy/application_1457829749863_0007/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0007 Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1 2016-03-13 11:20:37,281 Stage-2 map = 0%, reduce = 0% 2016-03-13 11:20:55,833 Stage-2 map = 17%, reduce = 0%, Cumulative CPU 2.7 sec 2016-03-13 11:20:58,917 Stage-2 map = 33%, reduce = 0%, Cumulative CPU 7.23 sec 2016-03-13 11:20:59,942 Stage-2 map = 44%, reduce = 0%, Cumulative CPU 8.87 sec 2016-03-13 11:21:00,966 Stage-2 map = 61%, reduce = 0%, Cumulative CPU 9.46 sec 2016-03-13 11:21:03,019 Stage-2 map = 83%, reduce = 0%, Cumulative CPU 11.85 sec 2016-03-13 11:21:06,098 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 13.3 sec 2016-03-13 11:21:11,232 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 17.45 sec MapReduce Total cumulative CPU time: 17 seconds 450 msec Ended Job = job_1457829749863_0007 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 79.45 sec HDFS Read: 1147297321 HDFS Write: 115331402 SUCCESS Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 17.45 sec HDFS Read: 115338803 HDFS Write: 110 SUCCESS Total MapReduce CPU Time Spent: 1 minutes 36 seconds 900 msec OK 百度 77627 baidu 36564 人体艺术 29598 4399小游戏 23306 优酷 20847 Time taken: 172.861 seconds, Fetched: 5 row(s) 从分析的结果看,用过搜狗查询的关键词是“百度”的挺高的。
3.url搜索访问此时排名前5的
hive> select website ,count(website) as cnt from tbsogou where d='sogouq3' group by website order by cnt desc limit 5; Query ID = root_20160313112231_24d147cb-8dce-412c-b255-67dc65660cd4 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0008, Tracking URL = http://master1:8088/proxy/application_1457829749863_0008/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0008 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:22:39,114 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:22:54,627 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 16.32 sec 2016-03-13 11:23:00,819 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 28.21 sec 2016-03-13 11:23:01,845 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 29.87 sec 2016-03-13 11:23:03,426 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 29.87 sec 2016-03-13 11:23:09,805 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 39.44 sec 2016-03-13 11:23:11,009 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 42.0 sec 2016-03-13 11:23:14,083 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 46.6 sec 2016-03-13 11:23:26,957 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 53.89 sec 2016-03-13 11:23:30,204 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 55.56 sec 2016-03-13 11:23:32,269 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 56.56 sec 2016-03-13 11:23:35,356 Stage-1 map = 87%, reduce = 0%, Cumulative CPU 57.3 sec 2016-03-13 11:23:50,951 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 62.5 sec 2016-03-13 11:23:51,977 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 63.37 sec 2016-03-13 11:23:54,032 Stage-1 map = 92%, reduce = 0%, Cumulative CPU 63.52 sec 2016-03-13 11:23:55,053 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 63.69 sec 2016-03-13 11:23:58,131 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 64.0 sec 2016-03-13 11:24:00,172 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 64.32 sec 2016-03-13 11:24:03,255 Stage-1 map = 96%, reduce = 0%, Cumulative CPU 64.59 sec 2016-03-13 11:24:06,351 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 65.05 sec 2016-03-13 11:24:08,590 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 65.28 sec 2016-03-13 11:24:11,687 Stage-1 map = 99%, reduce = 0%, Cumulative CPU 65.67 sec 2016-03-13 11:24:18,963 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 66.17 sec 2016-03-13 11:24:49,446 Stage-1 map = 100%, reduce = 24%, Cumulative CPU 69.83 sec 2016-03-13 11:24:50,461 Stage-1 map = 100%, reduce = 31%, Cumulative CPU 71.2 sec 2016-03-13 11:24:52,685 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 72.16 sec 2016-03-13 11:24:53,716 Stage-1 map = 100%, reduce = 47%, Cumulative CPU 73.29 sec 2016-03-13 11:24:55,755 Stage-1 map = 100%, reduce = 53%, Cumulative CPU 73.96 sec 2016-03-13 11:25:09,145 Stage-1 map = 100%, reduce = 59%, Cumulative CPU 74.82 sec 2016-03-13 11:25:13,234 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 79.95 sec 2016-03-13 11:25:14,267 Stage-1 map = 100%, reduce = 61%, Cumulative CPU 84.89 sec 2016-03-13 11:25:18,691 Stage-1 map = 100%, reduce = 62%, Cumulative CPU 86.7 sec 2016-03-13 11:25:19,725 Stage-1 map = 100%, reduce = 63%, Cumulative CPU 87.4 sec 2016-03-13 11:25:21,777 Stage-1 map = 100%, reduce = 65%, Cumulative CPU 88.1 sec 2016-03-13 11:25:22,801 Stage-1 map = 100%, reduce = 66%, Cumulative CPU 88.66 sec 2016-03-13 11:25:24,370 Stage-1 map = 100%, reduce = 69%, Cumulative CPU 90.07 sec 2016-03-13 11:25:25,806 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 91.48 sec 2016-03-13 11:25:26,846 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 92.05 sec 2016-03-13 11:25:27,899 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 93.2 sec 2016-03-13 11:25:28,922 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 93.57 sec 2016-03-13 11:25:32,045 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 94.4 sec 2016-03-13 11:25:34,164 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 95.36 sec 2016-03-13 11:25:37,297 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 96.01 sec 2016-03-13 11:25:44,816 Stage-1 map = 100%, reduce = 85%, Cumulative CPU 97.26 sec 2016-03-13 11:25:55,122 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 98.23 sec 2016-03-13 11:25:58,213 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 99.1 sec 2016-03-13 11:26:01,270 Stage-1 map = 100%, reduce = 94%, Cumulative CPU 101.44 sec 2016-03-13 11:26:04,348 Stage-1 map = 100%, reduce = 97%, Cumulative CPU 103.83 sec 2016-03-13 11:26:06,427 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 104.88 sec MapReduce Total cumulative CPU time: 1 minutes 44 seconds 880 msec Ended Job = job_1457829749863_0008 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0009, Tracking URL = http://master1:8088/proxy/application_1457829749863_0009/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0009 Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1 2016-03-13 11:26:23,710 Stage-2 map = 0%, reduce = 0% 2016-03-13 11:26:44,247 Stage-2 map = 28%, reduce = 0%, Cumulative CPU 8.53 sec 2016-03-13 11:26:50,529 Stage-2 map = 57%, reduce = 0%, Cumulative CPU 13.83 sec 2016-03-13 11:26:53,648 Stage-2 map = 66%, reduce = 0%, Cumulative CPU 15.83 sec 2016-03-13 11:26:54,668 Stage-2 map = 72%, reduce = 0%, Cumulative CPU 16.36 sec 2016-03-13 11:26:56,719 Stage-2 map = 83%, reduce = 0%, Cumulative CPU 17.23 sec 2016-03-13 11:26:59,790 Stage-2 map = 93%, reduce = 0%, Cumulative CPU 18.65 sec 2016-03-13 11:27:02,862 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 19.38 sec 2016-03-13 11:27:10,374 Stage-2 map = 100%, reduce = 17%, Cumulative CPU 20.61 sec 2016-03-13 11:27:13,444 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 22.25 sec 2016-03-13 11:27:15,542 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 24.15 sec MapReduce Total cumulative CPU time: 24 seconds 150 msec Ended Job = job_1457829749863_0009 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 104.88 sec HDFS Read: 1147298011 HDFS Write: 297356296 SUCCESS Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 24.15 sec HDFS Read: 297363766 HDFS Write: 140 SUCCESS Total MapReduce CPU Time Spent: 2 minutes 9 seconds 30 msec OK http://www.baidu.com/ 148348 http://www.4399.com/ 38611 http://www.youku.com/ 28959 http://www.hao123.com/ 28715 http://qzone.qq.com/ 26020 Time taken: 285.674 seconds, Fetched: 5 row(s)
4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中;
drop table if exists tmp_pre_hour_seach_info; create table tmp_pre_hour_seach_info as select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2); hive> drop table if exists tmp_pre_hour_seach_info; OK Time taken: 0.019 seconds hive> create table tmp_pre_hour_seach_info > as > select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2); Query ID = root_20160313115133_bd005a7b-edf2-4c4c-a4c1-3b83c3d34335 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0015, Tracking URL = http://master1:8088/proxy/application_1457829749863_0015/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0015 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:51:41,275 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:52:01,969 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 13.69 sec 2016-03-13 11:52:03,000 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 14.02 sec 2016-03-13 11:52:11,604 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 28.14 sec 2016-03-13 11:52:13,801 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 29.28 sec 2016-03-13 11:52:17,954 Stage-1 map = 60%, reduce = 1%, Cumulative CPU 33.8 sec 2016-03-13 11:52:19,522 Stage-1 map = 60%, reduce = 3%, Cumulative CPU 34.15 sec 2016-03-13 11:52:20,803 Stage-1 map = 67%, reduce = 3%, Cumulative CPU 35.55 sec 2016-03-13 11:52:22,921 Stage-1 map = 73%, reduce = 4%, Cumulative CPU 35.76 sec 2016-03-13 11:52:23,947 Stage-1 map = 73%, reduce = 5%, Cumulative CPU 37.26 sec 2016-03-13 11:52:25,437 Stage-1 map = 100%, reduce = 5%, Cumulative CPU 38.09 sec 2016-03-13 11:52:26,481 Stage-1 map = 100%, reduce = 7%, Cumulative CPU 38.15 sec 2016-03-13 11:52:27,553 Stage-1 map = 100%, reduce = 8%, Cumulative CPU 38.2 sec 2016-03-13 11:52:51,246 Stage-1 map = 100%, reduce = 11%, Cumulative CPU 38.85 sec 2016-03-13 11:52:53,288 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 39.85 sec 2016-03-13 11:52:54,307 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 41.77 sec 2016-03-13 11:52:59,424 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 43.97 sec 2016-03-13 11:53:09,679 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 47.44 sec MapReduce Total cumulative CPU time: 47 seconds 440 msec Ended Job = job_1457829749863_0015 Moving data to: hdfs://master1:9000/user/hive/warehouse/testdb.db/tmp_pre_hour_seach_info Table testdb.tmp_pre_hour_seach_info stats: [numFiles=5, numRows=24, totalSize=234, rawDataSize=210] MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 47.44 sec HDFS Read: 1147301096 HDFS Write: 664 SUCCESS Total MapReduce CPU Time Spent: 47 seconds 440 msec OK Time taken: 98.72 seconds hive> select * from tmp_pre_hour_seach_info order by hour ; Query ID = root_20160313122830_a6be9cad-fc18-4c9e-9950-f6f97be0f996 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0016, Tracking URL = http://master1:8088/proxy/application_1457829749863_0016/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0016 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 2016-03-13 12:28:38,814 Stage-1 map = 0%, reduce = 0% 2016-03-13 12:28:50,158 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.48 sec 2016-03-13 12:28:57,412 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.87 sec MapReduce Total cumulative CPU time: 3 seconds 870 msec Ended Job = job_1457829749863_0016 MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.87 sec HDFS Read: 9717 HDFS Write: 234 SUCCESS Total MapReduce CPU Time Spent: 3 seconds 870 msec OK 00 167799 01 121494 02 85056 03 63484 04 52604 05 51630 06 61241 07 97949 08 307308 09 516763 10 612179 11 597820 12 595252 13 623990 14 614563 15 640062 16 636304 17 581630 18 591573 19 682096 20 707056 21 659366 22 542951 23 389667 Time taken: 27.597 seconds, Fetched: 24 row(s)
5.统计用户每个时段搜索所在比例;
hive> select hour,cnt/total from tmp_pre_hour_seach_info a join (select sum(cnt)total from tmp_pre_hour_seach_info) b where 1=1 order by hour; Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Stage-3:MAPRED' is a cross product Query ID = root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0022, Tracking URL = http://master1:8088/proxy/application_1457829749863_0022/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0022 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 2016-03-13 12:39:58,777 Stage-1 map = 0%, reduce = 0% 2016-03-13 12:40:10,073 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.3 sec 2016-03-13 12:40:17,294 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec MapReduce Total cumulative CPU time: 3 seconds 480 msec Ended Job = job_1457829749863_0022 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/03/13 12:40:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Execution log at: /tmp/root/root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89.log 2016-03-13 12:40:22 Starting to launch local task to process map join; maximum memory = 518979584 2016-03-13 12:40:23 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable 2016-03-13 12:40:24 Uploaded 1 File to: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (559 bytes) 2016-03-13 12:40:24 End of local task; Time Taken: 1.176 sec. Execution completed successfully MapredLocal task succeeded Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0023, Tracking URL = http://master1:8088/proxy/application_1457829749863_0023/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0023 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1 2016-03-13 12:40:32,128 Stage-3 map = 0%, reduce = 0% 2016-03-13 12:40:39,379 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.54 sec 2016-03-13 12:40:46,562 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 2.98 sec MapReduce Total cumulative CPU time: 2 seconds 980 msec Ended Job = job_1457829749863_0023 MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.48 sec HDFS Read: 10788 HDFS Write: 117 SUCCESS Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 2.98 sec HDFS Read: 10776 HDFS Write: 556 SUCCESS Total MapReduce CPU Time Spent: 6 seconds 460 msec OK 00 0.016780173516828326 01 0.012149598038448027 02 0.00850573864353989 03 0.006348503480606734 04 0.005260485745917659 05 0.0051630841582717794 06 0.006124199824457138 07 0.009795059659472449 08 0.030731300920205 09 0.0516771423374201 10 0.06121889786803525 11 0.05978297446248374 12 0.05952617027657551 13 0.06240001712027906 14 0.06145730175401859 15 0.06400724331806608 16 0.06363143719242624 17 0.05816394807235358 18 0.05915826427970776 19 0.06821071183460291 20 0.07070675252006607 21 0.06593767478409898 22 0.0542959850245559 23 0.03896733516756323 Time taken: 55.825 seconds, Fetched: 24 row(s)