
Sqoop Deployment and Operations


Sqoop Overview

Sqoop imports and exports data between an RDBMS and Hadoop. Under the hood it runs MapReduce jobs that consist of map tasks only, with no reduce phase.

Deployment

  1. Download the CDH build: wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.16.2.tar.gz
  2. Unpack it: tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ~/app/
  3. Configure the environment variables:
vi .bashrc
export SQOOP_HOME=/home/ruoze/app/sqoop-1.4.6-cdh5.16.2
export PATH=$SQOOP_HOME/bin:$PATH
source .bashrc
  4. Edit the Sqoop configuration file:
cd $SQOOP_HOME/conf
cp sqoop-env-template.sh sqoop-env.sh
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.16.2
  5. Copy the MySQL JDBC driver into Sqoop's lib directory: cp mysql-connector-java-5.1.27-bin.jar $SQOOP_HOME/lib/
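
To confirm the deployment, print the version banner (warnings about unset HBASE_HOME or similar optional variables are harmless here):

sqoop version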

Commands

  1. List databases and tables:
sqoop list-databases \
--connect jdbc:mysql://ruozedata001:3306 \
--password ruozedata \
--username root

sqoop list-tables \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password ruozedata \
--username root

Importing from MySQL to HDFS

  1. Import a table's data into HDFS with sqoop import; --split-by names the column Sqoop uses to divide the rows among the map tasks:
sqoop import \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password ruozedata \
--username root \
--delete-target-dir \
--table salgrade \
--split-by 'GRADE'
[ruoze@ruozedata001 ~]$ hdfs dfs -ls salgrade
19/12/27 14:49:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 ruoze supergroup          0 2019-12-27 14:49 salgrade/_SUCCESS
-rw-r--r--   1 ruoze supergroup         11 2019-12-27 14:49 salgrade/part-m-00000
-rw-r--r--   1 ruoze supergroup         12 2019-12-27 14:49 salgrade/part-m-00001
-rw-r--r--   1 ruoze supergroup         12 2019-12-27 14:49 salgrade/part-m-00002
-rw-r--r--   1 ruoze supergroup         24 2019-12-27 14:49 salgrade/part-m-00003
[ruoze@ruozedata001 ~]$ hdfs dfs -cat salgrade/part*
19/12/27 14:50:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,700,1200
2,1201,1400
3,1401,2000
4,2001,3000
5,3001,9999
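
The default of four map tasks (split on GRADE) produced the four part-m files above. A minimal sketch, with the same connection details, that writes a single output file instead:

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password ruozedata \
--username root \
--delete-target-dir \
--table salgrade \
-m 1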

Note: if the job fails with java.lang.NoClassDefFoundError: org/json/JSONObject, Sqoop is missing java-json.jar. Download the java-json jar and drop it into $SQOOP_HOME/lib.

  2. Import selected columns from MySQL into HDFS with a custom field delimiter; NULL string columns are written as '' and NULL non-string columns as '0':
sqoop import \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password ruozedata --username root \
--table emp  \
--delete-target-dir \
--mapreduce-job-name FromMySQL2HDFS \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--target-dir EMP_COLUMN_QUERY \
--fields-terminated-by '\t' \
--null-string '' \
--null-non-string '0' \
-m 1

[ruoze@ruozedata001 ~]$ hdfs dfs -cat EMP_COLUMN_QUERY/part*
19/12/27 15:08:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369	SMITH	CLERK	800.00	0
7499	ALLEN	SALESMAN	1600.00	300.00
7521	WARD	SALESMAN	1250.00	500.00
7566	JONES	MANAGER	2975.00	0
7654	MARTIN	SALESMAN	1250.00	1400.00
7698	BLAKE	MANAGER	2850.00	0
7782	CLARK	MANAGER	2450.00	0
7788	SCOTT	ANALYST	3000.00	0
7839	KING	PRESIDENT	5000.00	0
7844	TURNER	SALESMAN	1500.00	0.00
7876	ADAMS	CLERK	1100.00	0
7900	JAMES	CLERK	950.00	0
7902	FORD	ANALYST	3000.00	0
7934	MILLER	CLERK	1300.00	0

Note: Sqoop prints the hint "This transfer can be faster! Use the --direct option to exercise a MySQL-specific fast path." A sketch with --direct follows.
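
--direct swaps the JDBC path for mysqldump, so the mysqldump binary must be available on every node that runs map tasks. A minimal sketch (EMP_DIRECT is a hypothetical target directory; the connection details are as above):

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password ruozedata --username root \
--table emp \
--delete-target-dir \
--target-dir EMP_DIRECT \
--direct \
-m 1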

  3. Import the result of a free-form SQL query into HDFS. With --query, --target-dir is required and the query must contain the $CONDITIONS placeholder, which Sqoop replaces with its split predicates (a parallel variant follows the output below):
sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata --username root \
--delete-target-dir \
--mapreduce-job-name FromMySQL2HDFS \
--target-dir JOIN \
--fields-terminated-by '\t' \
--null-string '' \
--null-non-string '0' \
--query 'select e.empno,e.ename,e.deptno,d.dname from emp e join dept d on e.deptno=d.deptno and $CONDITIONS' \
-m 1
[ruoze@ruozedata001 ~]$ hdfs dfs -cat JOIN/part*
19/12/27 15:19:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369	SMITH	20	RESEARCH
7499	ALLEN	30	SALES
7521	WARD	30	SALES
7566	JONES	20	RESEARCH
7654	MARTIN	30	SALES
7698	BLAKE	30	SALES
7782	CLARK	10	ACCOUNTING
7788	SCOTT	20	RESEARCH
7839	KING	10	ACCOUNTING
7844	TURNER	30	SALES
7876	ADAMS	20	RESEARCH
7900	JAMES	30	SALES
7902	FORD	20	RESEARCH
7934	MILLER	10	ACCOUNTING
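
With -m 1, $CONDITIONS is simply replaced by 1=1; to run a free-form query in parallel, --split-by is mandatory. A sketch (JOIN_PARALLEL is a hypothetical target directory; note that inside double quotes the placeholder must be escaped as \$CONDITIONS):

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata --username root \
--delete-target-dir \
--target-dir JOIN_PARALLEL \
--query "select e.empno,e.ename,e.deptno,d.dname from emp e join dept d on e.deptno=d.deptno where \$CONDITIONS" \
--split-by 'e.empno' \
-m 2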
  4. Incremental append: import only the rows whose check column exceeds --last-value and append them to the existing target directory:
sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata --username root \
--table emp  \
--mapreduce-job-name FromMySQL2HDFS \
--target-dir EMP_APPEND \
--fields-terminated-by '\t' \
--null-string '' \
--incremental append \
--check-column EMPNO \
--last-value 7788 \
--null-non-string '0' \
-m 1
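
At the end of the run Sqoop logs the value to supply as --last-value next time. A sketch of the follow-up run, assuming 7934 (the highest EMPNO in the sample output above) was the logged value:

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata --username root \
--table emp \
--mapreduce-job-name FromMySQL2HDFS \
--target-dir EMP_APPEND \
--fields-terminated-by '\t' \
--null-string '' \
--null-non-string '0' \
--incremental append \
--check-column EMPNO \
--last-value 7934 \
-m 1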

Running SQL statements with sqoop eval

Execute a statement directly against MySQL:
sqoop eval \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot --username root \
--query "insert into dept values (60,'RD', 'BEIJING')"

Exporting from HDFS to MySQL

The target table (emp_demo here) must already exist in MySQL; sqoop export will not create it.

sqoop export \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot \
--username root \
--table emp_demo \
--export-dir /user/ruoze/emp \
-m 1
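
A quick row count to confirm the export, using sqoop eval:

sqoop eval \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot --username root \
--query "select count(*) from emp_demo"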

Export selected columns; the -D property batches 10 rows into each INSERT statement:
sqoop export \
-Dsqoop.export.records.per.statement=10 \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot \
--username root \
--table emp_demo \
--export-dir /user/ruoze/EMP_COLUMN_QUERY \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--fields-terminated-by '\t' \
-m 1

Importing from MySQL to Hive

Create a partitioned table in Hive:

CREATE TABLE emp_import_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
partitioned by (pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot \
--username root \
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import --hive-database ruozedata_hive \
--hive-table emp_import_partition \
--hive-partition-key 'pt' \
--hive-partition-value '2019-12-30' \
--fields-terminated-by '\t' \
-m 1
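
A quick check that the partition was created (assuming the hive CLI is available on the same machine):

hive -e "show partitions ruozedata_hive.emp_import_partition"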

Note: if the import fails with ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly., copy hive-exec-*.jar from Hive's lib directory into Sqoop's lib directory to resolve it.

Exporting Hive data to MySQL

sqoop export \
--connect jdbc:mysql://ruozedata001:3306/ruozedata \
--password mysqladminroot \
--username root \
--table dept_demo \
--export-dir /user/hive/warehouse/ruozedata_hive.db/dept \
--fields-terminated-by '\t' \
-m 1

Creating a Sqoop job

sqoop job --create ruozedata-sqoop-job -- \
import --connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata \
--username root \
--table emp \
--delete-target-dir
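
Saved jobs are managed with the other sqoop job subcommands:

sqoop job --list                        # list saved jobs
sqoop job --show ruozedata-sqoop-job    # print a job's definition
sqoop job --exec ruozedata-sqoop-job    # run it
sqoop job --delete ruozedata-sqoop-job  # remove it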

Note: the password above is stored in plain text in the job definition. Sqoop can read it from a file instead; see the sketch below.
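
A sketch using --password-file, which by default points at an HDFS path (the path /user/ruoze/mysql.pwd is an assumption; echo -n matters because a trailing newline would become part of the password):

echo -n 'ruozedata' > mysql.pwd
hdfs dfs -put mysql.pwd /user/ruoze/mysql.pwd
hdfs dfs -chmod 400 /user/ruoze/mysql.pwd

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--username root \
--password-file /user/ruoze/mysql.pwd \
--table emp \
--delete-target-dir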

An ETL example

Requirement: the emp and dept tables live in MySQL. Pull their data into Hive, run the aggregation there, then write the result back to MySQL.

  1. Create the emp_etl and dept_etl tables in Hive:
CREATE TABLE emp_etl(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE dept_etl(
deptno int,
dname string,
loc string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  2. Import the MySQL data into Hive:
sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata \
--username root \
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import --hive-database ruozedata_hive \
--hive-table emp_etl \
--fields-terminated-by '\t' \
-m 1

sqoop import \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata \
--username root \
--table dept \
--hive-overwrite \
--delete-target-dir \
--hive-import --hive-database ruozedata_hive \
--hive-table dept_etl \
--fields-terminated-by '\t' \
-m 1
  3. Create the result table in Hive and load it:
CREATE TABLE result_etl(
empno int,
ename string,
deptno int,
dname string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

insert overwrite table result_etl select e.empno, e.ename, e.deptno, d.dname from emp_etl e join dept_etl d on e.deptno=d.deptno;
  4. Export the result data to MySQL. First create the result table result_etl in MySQL:
create table result_etl(
empno int,
ename varchar(10),
deptno int,
dname varchar(20)
);

Then export the Hive result into it:
sqoop export \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata \
--username root \
--table result_etl \
--export-dir /user/hive/warehouse/ruozedata_hive.db/result_etl \
--fields-terminated-by '\t' \
-m 1
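
A final check that the rows arrived, again with sqoop eval:

sqoop eval \
--connect jdbc:mysql://ruozedata001:3306/sqoop \
--password ruozedata --username root \
--query "select count(*) from result_etl"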