
Tutorial: Installing and Configuring Sqoop for MySQL in a Hadoop Cluster


Sqoop is a tool for moving data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, or Postgres) into HDFS, and export data from HDFS back into a relational database.

One of Sqoop's highlights is that it uses Hadoop MapReduce to move data from a relational database into HDFS.
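
As a rough sketch of what this looks like in practice (the host, database, and table names below are placeholders, not taken from this tutorial), the -m flag controls how many parallel map tasks Sqoop launches and --split-by names the column used to partition the table among them:

# Hypothetical import: 4 map tasks, each pulling one slice of the "orders" table
sqoop import --connect jdbc:mysql://db-host:3306/shop --username sqoop -P --table orders --split-by id -m 4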


I. Installing Sqoop
1. Download and extract the Sqoop tarballs

The packages involved are sqoop-1.2.0-cdh3b4.tar.gz, hadoop-0.20.2-cdh3b4.tar.gz, and the MySQL JDBC driver mysql-connector-java-5.1.10-bin.jar.

[root@node1 ~]# ll
drwxr-xr-x 15 root root  4096 Feb 22 2011 hadoop-0.20.2-cdh3b4
-rw-r--r-- 1 root root 724225 Sep 15 06:46 mysql-connector-java-5.1.10-bin.jar
drwxr-xr-x 11 root root  4096 Feb 22 2011 sqoop-1.2.0-cdh3b4
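
The extraction step itself is not shown above; assuming the two tarballs sit in root's home directory, it would look something like this:

# Hypothetical extraction of the two tarballs downloaded above
[root@node1 ~]# tar -zxvf sqoop-1.2.0-cdh3b4.tar.gz
[root@node1 ~]# tar -zxvf hadoop-0.20.2-cdh3b4.tar.gz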

2. Move sqoop-1.2.0-cdh3b4 into /home/hadoop, copy the MySQL JDBC driver and hadoop-core-0.20.2-cdh3b4.jar (from hadoop-0.20.2-cdh3b4) into sqoop-1.2.0-cdh3b4/lib, and finally change the ownership.

[root@node1 ~]# cp mysql-connector-java-5.1.10-bin.jar sqoop-1.2.0-cdh3b4/lib
[root@node1 ~]# cp hadoop-0.20.2-cdh3b4/hadoop-core-0.20.2-cdh3b4.jar sqoop-1.2.0-cdh3b4/lib
[root@node1 ~]# chown -R hadoop:hadoop sqoop-1.2.0-cdh3b4
[root@node1 ~]# mv sqoop-1.2.0-cdh3b4 /home/hadoop
[root@node1 ~]# ll /home/hadoop
total 35748
-rw-rw-r-- 1 hadoop hadoop  343 Sep 15 05:13 derby.log
drwxr-xr-x 13 hadoop hadoop  4096 Sep 14 16:16 hadoop-0.20.2
drwxr-xr-x 9 hadoop hadoop  4096 Sep 14 20:21 hive-0.10.0
-rw-r--r-- 1 hadoop hadoop 36524032 Sep 14 20:20 hive-0.10.0.tar.gz
drwxr-xr-x 8 hadoop hadoop  4096 Sep 25 2012 jdk1.7
drwxr-xr-x 12 hadoop hadoop  4096 Sep 15 00:25 mahout-distribution-0.7
drwxrwxr-x 5 hadoop hadoop  4096 Sep 15 05:13 metastore_db
-rw-rw-r-- 1 hadoop hadoop  406 Sep 14 16:02 scp.sh
drwxr-xr-x 11 hadoop hadoop  4096 Feb 22 2011 sqoop-1.2.0-cdh3b4
drwxrwxr-x 3 hadoop hadoop  4096 Sep 14 16:17 temp
drwxrwxr-x 3 hadoop hadoop  4096 Sep 14 15:59 user

3. Edit configure-sqoop and comment out the checks for HBase and ZooKeeper

[root@node1 bin]# pwd
/home/hadoop/sqoop-1.2.0-cdh3b4/bin
[root@node1 bin]# vi configure-sqoop 

#!/bin/bash
#
# Licensed to Cloudera, Inc. under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
.
.
.
# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_HOME}" ]; then
 echo "Error: $HADOOP_HOME does not exist!"
 echo 'Please set $HADOOP_HOME to the root of your Hadoop installation.'
 exit 1
fi
#if [ ! -d "${HBASE_HOME}" ]; then
# echo "Error: $HBASE_HOME does not exist!"
# echo 'Please set $HBASE_HOME to the root of your HBase installation.'
# exit 1
#fi
#if [ ! -d "${ZOOKEEPER_HOME}" ]; then
# echo "Error: $ZOOKEEPER_HOME does not exist!"
# echo 'Please set $ZOOKEEPER_HOME to the root of your ZooKeeper installation.'
# exit 1
#fi

4. Edit /etc/profile and .bash_profile to add HADOOP_HOME and adjust PATH

[hadoop@node1 ~]$ vi .bash_profile 
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
  . ~/.bashrc
fi

# User specific environment and startup programs

HADOOP_HOME=/home/hadoop/hadoop-0.20.2
PATH=$HADOOP_HOME/bin:$PATH:$HOME/bin
export HIVE_HOME=/home/hadoop/hive-0.10.0
export MAHOUT_HOME=/home/hadoop/mahout-distribution-0.7
export PATH HADOOP_HOME
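
After saving the file, reload it so the new variables take effect in the current shell; echoing HADOOP_HOME is a quick, optional sanity check:

[hadoop@node1 ~]$ source ~/.bash_profile
[hadoop@node1 ~]$ echo $HADOOP_HOME
/home/hadoop/hadoop-0.20.2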

II. Testing Sqoop

1. List the databases in MySQL:

[hadoop@node1 bin]$ ./sqoop list-databases --connect jdbc:mysql://192.168.1.152:3306/ --username sqoop --password sqoop
13/09/15 07:17:16 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/09/15 07:17:17 INFO manager.MySQLManager: Executing SQL statement: SHOW DATABASES
information_schema
mysql
performance_schema
sqoop
test
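
The listing above assumes that a MySQL account named sqoop (password sqoop) already exists on 192.168.1.152 and is allowed to connect from the Sqoop host. A hypothetical way to set that up on the MySQL server (the account name and the broad '%' host here are illustrative only):

# Run on the MySQL server: create the sqoop account and grant it access to the sqoop database
mysql -u root -p -e "CREATE USER 'sqoop'@'%' IDENTIFIED BY 'sqoop'; GRANT ALL PRIVILEGES ON sqoop.* TO 'sqoop'@'%'; FLUSH PRIVILEGES;"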

2. Import a MySQL table into Hive:

[hadoop@node1 bin]$ ./sqoop import --connect jdbc:mysql://192.168.1.152:3306/sqoop --username sqoop --password sqoop --table test --hive-import -m 1
13/09/15 08:15:01 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/09/15 08:15:01 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
13/09/15 08:15:01 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
13/09/15 08:15:01 INFO tool.CodeGenTool: Beginning code generation
13/09/15 08:15:01 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
13/09/15 08:15:02 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
13/09/15 08:15:02 INFO orm.CompilationManager: HADOOP_HOME is /home/hadoop/hadoop-0.20.2/bin/..
13/09/15 08:15:02 INFO orm.CompilationManager: Found hadoop core jar at: /home/hadoop/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
13/09/15 08:15:03 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/a71936fd2bb45ea6757df22751a320e3/test.jar
13/09/15 08:15:03 WARN manager.MySQLManager: It looks like you are importing from mysql.
13/09/15 08:15:03 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
13/09/15 08:15:03 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
13/09/15 08:15:03 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
13/09/15 08:15:03 INFO mapreduce.ImportJobBase: Beginning import of test
13/09/15 08:15:04 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
13/09/15 08:15:05 INFO mapred.JobClient: Running job: job_201309150505_0009
13/09/15 08:15:06 INFO mapred.JobClient: map 0% reduce 0%
13/09/15 08:15:34 INFO mapred.JobClient: map 100% reduce 0%
13/09/15 08:15:36 INFO mapred.JobClient: Job complete: job_201309150505_0009
13/09/15 08:15:36 INFO mapred.JobClient: Counters: 5
13/09/15 08:15:36 INFO mapred.JobClient: Job Counters 
13/09/15 08:15:36 INFO mapred.JobClient:  Launched map tasks=1
13/09/15 08:15:36 INFO mapred.JobClient: FileSystemCounters
13/09/15 08:15:36 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=583323
13/09/15 08:15:36 INFO mapred.JobClient: Map-Reduce Framework
13/09/15 08:15:36 INFO mapred.JobClient:  Map input records=65536
13/09/15 08:15:36 INFO mapred.JobClient:  Spilled Records=0
13/09/15 08:15:36 INFO mapred.JobClient:  Map output records=65536
13/09/15 08:15:36 INFO mapreduce.ImportJobBase: Transferred 569.6514 KB in 32.0312 seconds (17.7842 KB/sec)
13/09/15 08:15:36 INFO mapreduce.ImportJobBase: Retrieved 65536 records.
13/09/15 08:15:36 INFO hive.HiveImport: Removing temporary files from import process: test/_logs
13/09/15 08:15:36 INFO hive.HiveImport: Loading uploaded data into Hive
13/09/15 08:15:36 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
13/09/15 08:15:36 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
13/09/15 08:15:41 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/hadoop/hive-0.10.0/lib/hive-common-0.10.0.jar!/hive-log4j.properties
13/09/15 08:15:41 INFO hive.HiveImport: Hive history file=/tmp/hadoop/hive_job_log_hadoop_201309150815_1877092059.txt
13/09/15 08:16:10 INFO hive.HiveImport: OK
13/09/15 08:16:10 INFO hive.HiveImport: Time taken: 28.791 seconds
13/09/15 08:16:11 INFO hive.HiveImport: Loading data to table default.test
13/09/15 08:16:12 INFO hive.HiveImport: Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 583323, raw_data_size: 0]
13/09/15 08:16:12 INFO hive.HiveImport: OK
13/09/15 08:16:12 INFO hive.HiveImport: Time taken: 1.704 seconds
13/09/15 08:16:12 INFO hive.HiveImport: Hive import complete.
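
To double-check the result, counting the rows in the new Hive table should match the 65536 records retrieved above (an optional check, assuming hive is on the PATH):

[hadoop@node1 bin]$ hive -e "SELECT COUNT(*) FROM test;"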

III. Sqoop Commands

Sqoop has about 13 commands, plus several groups of generic arguments that all 13 commands support. The generic arguments break down into common arguments, incremental import arguments, output line formatting arguments, input parsing arguments, Hive arguments, HBase arguments, and generic Hadoop command-line arguments. A few commonly used invocations are explained below:
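
To see the full list of commands on your own installation, Sqoop's built-in help prints them (output omitted here since it varies by version):

[hadoop@node1 bin]$ ./sqoop help
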
1. Common arguments
These are the generic options, mainly for connecting to the relational database.
1) List all databases on the MySQL server

sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password 123456


2) Connect to MySQL and list the tables in the test database

sqoop list-tables --connect jdbc:mysql://localhost:3306/test --username root --password 123456

In this command, test is the name of the test database in MySQL, and username/password are the MySQL user name and password.


3) Copy a relational table's definition into Hive; only the table structure is copied, not its contents.

sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --table sqoop_test --username root --password 123456 --hive-table test

Here --table sqoop_test is the table in the MySQL database test, and --hive-table test is the name of the table that is created in Hive.


4) Import data from a relational database into Hive

sqoop import --connect jdbc:mysql://localhost:3306/zxtest --username root --password 123456 --table sqoop_test --hive-import --hive-table s_test -m 1


5) Export data from a Hive table into MySQL. Before running the export, the table hive_test must already exist in MySQL.

sqoop export --connect jdbc:mysql://localhost:3306/zxtest --username root --password root --table hive_test --export-dir /user/hive/warehouse/new_test_partition/dt=2012-03-05
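
As noted above, hive_test has to be created in MySQL beforehand, with columns matching the fields of the exported files. A hypothetical creation statement (the id/name columns are made up purely for illustration):

# Run on the MySQL server; column names and types must match the exported data
mysql -u root -p zxtest -e "CREATE TABLE hive_test (id INT, name VARCHAR(100));"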


6) Import a table from the database into files on HDFS

./sqoop import --connect jdbc:mysql://10.28.168.109:3306/compression --username=hadoop --password=123456 --table hadoop_user_info -m 1 --target-dir /user/test


7) Incrementally import table data from the database into HDFS

./sqoop import --connect jdbc:mysql://10.28.168.109:3306/compression --username=hadoop --password=123456 --table hadoop_user_info -m 1 --target-dir /user/test --check-column id --incremental append --last-value 3
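
Only rows whose id column is greater than --last-value (3 here) are appended under the target directory. On the next run you would pass the largest id already imported; for example, if the previous run stopped at id 1208 (an illustrative value):

# Hypothetical follow-up incremental run, appending only rows with id > 1208
./sqoop import --connect jdbc:mysql://10.28.168.109:3306/compression --username=hadoop --password=123456 --table hadoop_user_info -m 1 --target-dir /user/test --check-column id --incremental append --last-value 1208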