Getting Started with Sqoop
Introduction to Sqoop
An open-source tool
RDBMS---------------------------sqoop---------------------------->HDFS
Before Sqoop:
RDBMS----->Hadoop
MR: DBInputFormat--------->TextOutputFormat
Hadoop------>RDBMS
MR: TextInputFormat--------->DBOutputFormat
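Problems with the MR approach
- Writing MapReduce jobs by hand is tedious
- Low efficiency (each MR job serves only one business line)
To address these problems, the common logic can be extracted into a framework in which the following are configurable: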
- Driver
- username
- password
- url
- DB/table/sql
- hdfs path
- mappers
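Once a job is plugged into the framework
Onboarding a new business line only requires passing its parameters to the MR job:
- submit the job with hadoop jar
- pass the parameters dynamically for each business line
Later on, Spring Boot microservices can be used to build a big data platform on top of this.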
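The parameter-driven submission described above can be sketched roughly as follows. This is only an illustration: the jar name `etl-framework.jar`, the driver class `com.example.etl.Driver`, and the option names are hypothetical and do not come from any real framework. The function prints the command instead of running it.

```shell
# Hypothetical wrapper: build the `hadoop jar` command for one business line.
# Jar name, driver class, and option names are illustrative assumptions.
# The password is left out of this sketch on purpose; it should not be
# passed as a plain command-line argument.
build_submit_cmd() {
  # $1 = JDBC url, $2 = username, $3 = table, $4 = HDFS path, $5 = mappers
  echo "hadoop jar etl-framework.jar com.example.etl.Driver" \
    "--url $1 --username $2 --table $3 --hdfs-path $4 --mappers $5"
}

# Dry run for one business line: print the command for review.
build_submit_cmd "jdbc:mysql://db-host:3306/sales" "etl_user" "orders" "/data/sales/orders" 4
```

Onboarding another business line then only means calling the same function with different arguments, which is essentially what Sqoop offers out of the box.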
Official introduction to Sqoop
Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache project.
Latest stable release is 1.4.7 (download, documentation). Latest cut of Sqoop2 is 1.99.7 (download, documentation). Note that 1.99.7 is not compatible with 1.4.7 and not feature complete, it is not intended for production deployment.
Sqoop: SQL-to-Hadoop
RDBMS <---------sqoop-----------> Hadoop(HDFS/Hive)
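Under the hood: a transfer is just a read/write operation, so map tasks alone are enough and no reduce phase is needed.
Sqoop has two versions: 1.x and 2.x (1.99.x)
Sqoop 1 architecture diagram
Only map tasks are used; there is no reduce phase.
Sqoop 2 (1.99.x) architecture diagram
Reduce tasks are used as well.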
/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist
/opt/cloudera/parcels/CDH-5.12.1-1.cdh5.12.1.p0.3/lib/sqoop/
After extracting it, place it in the lib folder under the Sqoop home directory.
Sqoop 1 tutorial
Basic operations
List databases
sqoop list-databases --connect jdbc:mysql://10.103.66.88:3306 --username name --password password
List tables
sqoop list-tables --connect jdbc:mysql://10.103.66.88:3306/information_schema --username
Import a table into HDFS
sqoop import \
--connect jdbc:mysql://10.103.66.88:3306/lenovosbom \
--username xingwj1 \
--password xingwj1 \
--table ec
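The import failed because the MySQL table has no primary key.
Either specify a split column with --split-by,
or use -m 1 to import sequentially with a single map task: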
sqoop import \
--connect jdbc:mysql://10.103.66.88:3306/lenovosbom \
--username xingwj1 \
--password xingwj1 \
--table ec \
-m 1
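The other fix mentioned above, --split-by, could look roughly like the sketch below. It assumes the ec table has a reasonably evenly distributed numeric column named `id`; that column name and the mapper count of 4 are assumptions for illustration and are not taken from the real table. The function only prints the command so it can be reviewed before being run.

```shell
# Assumption: the ec table has a numeric column named "id" to split on.
split_column="id"   # hypothetical column name
num_mappers=4       # illustrative degree of parallelism

build_import_cmd() {
  # $1 = split column, $2 = number of map tasks
  echo "sqoop import" \
    "--connect jdbc:mysql://10.103.66.88:3306/lenovosbom" \
    "--username xingwj1 --password xingwj1" \
    "--table ec" \
    "--split-by $1 -m $2"
}

# Dry run: print the command instead of executing it.
build_import_cmd "$split_column" "$num_mappers"
```

Sqoop uses the split column to divide the table's rows among the map tasks, so a column with skewed values will leave some tasks with far more work than others.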