Hive：简单查询不启用Mapreduce job而启用Fetch task

程序员文章站 2022-07-14 14:31:39

...

一、背景：

如果在hive中仅仅查询某个表的一列，Hive也会默认启用MapReduce Job来完成这个任务。我们都知道，启用MapReduce Job是会消耗系统开销的。对于这个问题，从Hive0.10.0版本开始，对于简单的查询语句（没有函数、排序、不需要聚合的查询语句），类似SELECT <col> from <table> LIMIT n语句，当开启Fetch Task功能，就执行一个简单的查询语句不会生成MapRreduce作业，而是直接使用Fetch Task，从hdfs文件系统中进行查询输出数据，从而提高效率。

二、配置Fetch Task的方法

1、在hive提示符

hive> set hive.fetch.task.conversion=more;

2、启动hive时，加入参数

bin/hive --hiveconf hive.fetch.task.conversion=more

3、修改 hive-site.xml文件，加入属性，保存退出。
上面的两种方法都可以开启了Fetch任务，但是都是临时起作用的；如果你想一直启用这个功能，可以在${HIVE_HOME}/conf/hive-site.xml里面加入以下配置：

<property>

<name>hive.fetch.task.conversion</name>

<value>more</value>

<description>

Some select queries can be converted to single FETCH task

minimizing latency.Currently the query should be single

sourced not having any subquery and should not have

any aggregations or distincts (which incurrs RS),

lateral views and joins.

1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only

2. more : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)

</description>

</property>

三、举例说明：

1、没有配置Fetch Task，默认启用MapReduce job完成任务。

hive> select id from t ;                 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there is no reduce operator
Starting Job = job_1402248601715_0004, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0004/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1402248601715_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 11:12:54,817 Stage-1 map = 0%,  reduce = 0%
2014-06-09 11:13:15,790 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
2014-06-09 11:13:16,982 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
MapReduce Total cumulative CPU time: 2 seconds 960 msec
Ended Job = job_1402248601715_0004
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.96 sec   HDFS Read: 257 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 960 msec
OK
Time taken: 51.496 seconds

查看上面运行日志，可以看到该次查询启动了mapreduce任务，mapper数为1，没有reducer任务。

2、配置fetch task，用到 hive.fetch.task.conversion 参数：

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task 
    minimizing latency.Currently the query should be single 
    sourced not having any subquery and should not have
    any aggregations or distincts (which incurrs RS), 
    lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>

该参数默认值为minimal，表示运行“select * ”并带有limit查询时候，会将其转换为FetchTask；如果参数值为more，则select某一些列并带有limit条件时，也会将其转换为FetchTask任务。当然，还有前提条件：单一数据源，即输入来源一个表或者分区；没有子查询；没有聚合运算和distinct；不能用于视图和join。

测试一下，先讲其参数值设为more，再运行：

hive> set hive.fetch.task.conversion=more;
hive> select id from t limit 1;           
OK
Time taken: 0.242 seconds
hive> select id from t ;                  
OK
Time taken: 0.496 seconds

t表是一个没有数据空表

上一篇： hive查询

下一篇： Hive 查询