Enabling LZO Compression in Hadoop
1. Prerequisites
- hadoop
- maven
- install the required build dependencies:
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
2. Install LZO
2.1 Download
# download
wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
# extract
[aaa@qq.com app]$ tar -zxvf lzo-2.06.tar.gz -C ../app
2.2 Build
[aaa@qq.com app]$ cd lzo-2.06/
[aaa@qq.com lzo-2.06]$ export CFLAGS=-m64
# create a directory to hold the compiled LZO output
[aaa@qq.com lzo-2.06]$ mkdir complie
# point the build at the install location
[aaa@qq.com lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/complie/
# compile and install
[aaa@qq.com lzo-2.06]$ make && make install
# verify the build succeeded; seeing the following directories is enough
[aaa@qq.com lzo-2.06]$ cd complie/
[aaa@qq.com complie]$ ll
total 12
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 include
drwxrwxr-x 2 hadoop hadoop 4096 Dec 6 17:08 lib
drwxrwxr-x 3 hadoop hadoop 4096 Dec 6 17:08 share
[aaa@qq.com complie]$
3. Install hadoop-lzo
3.1 Download
[aaa@qq.com soft]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip
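After the download, unpack the archive into the app directory (a small sketch; it assumes unzip is installed and follows the /home/hadoop/app layout used above):
# unpack the source; this creates a hadoop-lzo-master/ directory
[aaa@qq.com soft]$ unzip master.zip -d /home/hadoop/app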
3.2 Edit pom.xml under hadoop-lzo-master
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.6.0</hadoop.current.version> <!-- change this to match your Hadoop version -->
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
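If you are unsure which value to use, the cluster can tell you (a quick check):
# print the running Hadoop version; use it for <hadoop.current.version>
[aaa@qq.com app]$ hadoop version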
3.3 Add build environment configuration
[aaa@qq.com app]$ cd hadoop-lzo-master/
[aaa@qq.com hadoop-lzo-master]$ export CFLAGS=-m64
[aaa@qq.com hadoop-lzo-master]$ export CXXFLAGS=-m64
[aaa@qq.com hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/complie/include/ # include dir of the LZO build from step 2
[aaa@qq.com hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib/ # lib dir of the LZO build from step 2
3.4 Build
[aaa@qq.com hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
3.5 Deploy the artifacts after a successful build
# inspect the build output
[aaa@qq.com hadoop-lzo-master]$ ll
total 80
-rw-rw-r-- 1 hadoop hadoop 35147 Oct 13 2017 COPYING
-rw-rw-r-- 1 hadoop hadoop 19753 Dec 6 17:18 pom.xml
-rw-rw-r-- 1 hadoop hadoop 10170 Oct 13 2017 README.md
drwxrwxr-x 2 hadoop hadoop 4096 Oct 13 2017 scripts
drwxrwxr-x 4 hadoop hadoop 4096 Oct 13 2017 src
drwxrwxr-x 10 hadoop hadoop 4096 Dec 6 17:21 target
# enter target/native/Linux-amd64-64 and run the following
[aaa@qq.com hadoop-lzo-master]$ cd target/native/Linux-amd64-64
[aaa@qq.com Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so
./libgplcompression.so.0
./libgplcompression.la
./libgplcompression.a
./libgplcompression.so.0.0.
[aaa@qq.com Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
# important: copy hadoop-lzo-0.4.21-SNAPSHOT.jar into the Hadoop installation
[aaa@qq.com hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[aaa@qq.com hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
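To double-check the deployment (paths assume the copies above), list both target locations:
# confirm the native library and the jar are where Hadoop will look
[aaa@qq.com hadoop-lzo-master]$ ls $HADOOP_HOME/lib/native/ | grep gplcompression
[aaa@qq.com hadoop-lzo-master]$ ls $HADOOP_HOME/share/hadoop/common/ | grep lzo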
4. Configure Hadoop
4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# add the lib dir of the compiled LZO build
export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib
4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
4.4 Restart Hadoop
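A minimal restart sequence, assuming the standard sbin scripts are on the PATH (adjust to however you manage the cluster):
# restart HDFS and YARN so the new codecs and native library are loaded
[aaa@qq.com hadoop-lzo-master]$ stop-yarn.sh && stop-dfs.sh
[aaa@qq.com hadoop-lzo-master]$ start-dfs.sh && start-yarn.sh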
Using LZO in Hadoop
Data preparation:
For easier testing, I changed the block size to 10M.
lzop page_views.dat
-rw-r--r-- 1 root root 76059972 Apr 17 14:19 page_views.dat
-rw-r--r-- 1 root root 34039337 Apr 17 14:19 page_views.dat.lzo
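Upload the compressed file to HDFS (a sketch: the target path matches the wordcount runs below, and -D dfs.blocksize applies the 10M block size mentioned above to this upload):
# upload with a 10M block size so the 34M .lzo file spans several blocks
hdfs dfs -mkdir -p /user/hive/warehouse/g6_hive_416.db/page_views_lzo
hdfs dfs -D dfs.blocksize=10485760 -put page_views.dat.lzo /user/hive/warehouse/g6_hive_416.db/page_views_lzo/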
With the .lzo file on HDFS, run a wordcount test:
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.14.0.jar wordcount \
/user/hive/warehouse/g6_hive_416.db/page_views_lzo /out3
The job shows only one split, meaning the file was not split.
Build an index so MapReduce can split the file:
hadoop jar hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/g6_hive_416.db/page_views_lzo
An index file is generated in the same directory.
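You can verify it with a directory listing:
# the indexer writes a .index file next to each .lzo file
hdfs dfs -ls /user/hive/warehouse/g6_hive_416.db/page_views_lzo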
Run wordcount again:
This time there are 2 splits. That is clearly still wrong: with a 10M block size and a roughly 32M file, we would expect about 4 splits.
The fix:
Some digging revealed that the input format must be set to LzoTextInputFormat; otherwise the index file is treated as a regular input file.
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.14.0.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/user/hive/warehouse/g6_hive_416.db/page_views_lzo \
/out4
Now the split count is as expected.
Using LZO in Hive
Hive's native libraries do not include LZO compression,
so creating tables requires some special handling.
First, enable output compression and set the codec:
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
Also, the CREATE TABLE statement needs an explicit file format, for example:
create table page_views_lzo(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) row format delimited fields terminated by '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
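With the table in place, a quick end-to-end check (a sketch: the local path to page_views.dat.lzo is an assumption, adjust it to wherever the file from the data-preparation step lives):
# load the .lzo file produced earlier, then run a sanity query
hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/data/page_views.dat.lzo' OVERWRITE INTO TABLE page_views_lzo;"
hive -e "SELECT count(1) FROM page_views_lzo;"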
Summary
The key difference between com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec:
LzopCodec writes files with the '.lzo' suffix, which can be indexed;
LzoCodec writes files with the '.lzo_deflate' suffix, which cannot be indexed.