Kettle and Hadoop (Part 7): Executing Sqoop Jobs
I. Sqoop export
1. Create a job that exports an HDFS file to a MySQL database.
(1) Open PDI and create a new job, as shown in Figure 1.
(2) Edit the 'Sqoop export' job entry, as shown in Figure 2.
Notes:
- CDH631 is a previously created Hadoop cluster connection; see “https://wxy0327.blog.csdn.net/article/details/106406702#%E4%BA%8C%E3%80%81%E8%BF%9E%E6%8E%A5Hadoop%E9%9B%86%E7%BE%A4”.
- mysql_node3 is a previously created MySQL database connection; see “https://wxy0327.blog.csdn.net/article/details/106406702#%E4%BA%94%E3%80%81%E5%90%8E%E7%BB%AD%EF%BC%88%E5%BB%BA%E7%AB%8BMySQL%E6%95%B0%E6%8D%AE%E5%BA%93%E8%BF%9E%E6%8E%A5%EF%BC%89”.
- /user/root/sales_order_ext/part-m-00000 is a CSV-format file on HDFS containing 102 records, as shown in Figure 3.
- sales_order is the target table in MySQL; its structure matches the part-m-00000 file, as shown in Figure 4. (An equivalent sqoop command line is sketched after these notes.)
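Under the hood, this job entry assembles the same options the sqoop export command line takes; the "Setting your password on the command-line is insecure" warning in the log below comes from exactly that. A minimal sketch of an equivalent command, where the host, database name, and credentials are assumptions (only the table, the HDFS path, and the comma delimiter come from the setup above):

```shell
# Hypothetical host/database/credentials; -P prompts for the password,
# which the Sqoop log itself recommends over --password.
sqoop export \
  --connect jdbc:mysql://node3:3306/dbname \
  --username root \
  -P \
  --table sales_order \
  --export-dir /user/root/sales_order_ext \
  --input-fields-terminated-by ','
```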
2. Save and run the job. The log output is shown below.
2020/06/08 09:28:40 - Spoon - Starting job...
2020/06/08 09:28:40 - sqoop_export - Start of job execution
2020/06/08 09:28:40 - sqoop_export - Starting entry [Sqoop export]
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - fs.default.name is deprecated. Instead, use fs.defaultFS
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Running Sqoop version: 1.4.7-cdh6.1.0
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Setting your password on the command-line is insecure. Consider using -P instead.
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Preparing to use a MySQL streaming resultset.
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Beginning code generation
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Executing SQL statement: SELECT t.* FROM `sales_order` AS t LIMIT 1
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - Executing SQL statement: SELECT t.* FROM `sales_order` AS t LIMIT 1
2020/06/08 09:28:40 - Sqoop export - 2020/06/08 09:28:40 - $HADOOP_MAPRED_HOME is not set
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Writing jar file: /tmp/sqoop-root/compile/7c09fc949771ba78a584cc07c1cd6b3f/sales_order.jar
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Beginning export of sales_order
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - SQOOP_HOME is unset. May not be able to find all job dependencies.
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Disabling Erasure Coding for path: /user/root/.staging/job_1591323999364_0046
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Total input files to process : 1
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Total input files to process : 1
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - number of splits:4
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - fs.default.name is deprecated. Instead, use fs.defaultFS
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Submitting tokens for job: job_1591323999364_0046
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Executing with tokens: []
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Submitted application application_1591323999364_0046
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - The url to track the job: http://manager:8088/proxy/application_1591323999364_0046/
2020/06/08 09:28:41 - Sqoop export - 2020/06/08 09:28:41 - Running job: job_1591323999364_0046
2020/06/08 09:28:51 - Sqoop export - 2020/06/08 09:28:51 - Job job_1591323999364_0046 running in uber mode : false
2020/06/08 09:28:51 - Sqoop export - 2020/06/08 09:28:51 - map 0% reduce 0%
2020/06/08 09:28:58 - Sqoop export - 2020/06/08 09:28:58 - map 75% reduce 0%
2020/06/08 09:28:59 - Sqoop export - 2020/06/08 09:28:59 - map 100% reduce 0%
2020/06/08 09:28:59 - Sqoop export - 2020/06/08 09:28:59 - Job job_1591323999364_0046 completed successfully
2020/06/08 09:28:59 - Sqoop export - 2020/06/08 09:28:59 - Counters: 34
2020/06/08 09:28:59 - Sqoop export - File System Counters
2020/06/08 09:28:59 - Sqoop export - FILE: Number of bytes read=0
2020/06/08 09:28:59 - Sqoop export - FILE: Number of bytes written=968088
2020/06/08 09:28:59 - Sqoop export - FILE: Number of read operations=0
2020/06/08 09:28:59 - Sqoop export - FILE: Number of large read operations=0
2020/06/08 09:28:59 - Sqoop export - FILE: Number of write operations=0
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of bytes read=15868
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of bytes written=0
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of read operations=16
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of large read operations=0
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of write operations=0
2020/06/08 09:28:59 - Sqoop export - HDFS: Number of bytes read erasure-coded=0
2020/06/08 09:28:59 - Sqoop export - Job Counters
2020/06/08 09:28:59 - Sqoop export - Launched map tasks=4
2020/06/08 09:28:59 - Sqoop export - Data-local map tasks=2
2020/06/08 09:28:59 - Sqoop export - Rack-local map tasks=2
2020/06/08 09:28:59 - Sqoop export - Total time spent by all maps in occupied slots (ms)=18912
2020/06/08 09:28:59 - Sqoop export - Total time spent by all reduces in occupied slots (ms)=0
2020/06/08 09:28:59 - Sqoop export - Total time spent by all map tasks (ms)=18912
2020/06/08 09:28:59 - Sqoop export - Total vcore-milliseconds taken by all map tasks=18912
2020/06/08 09:28:59 - Sqoop export - Total megabyte-milliseconds taken by all map tasks=19365888
2020/06/08 09:28:59 - Sqoop export - Map-Reduce Framework
2020/06/08 09:28:59 - Sqoop export - Map input records=102
2020/06/08 09:28:59 - Sqoop export - Map output records=102
2020/06/08 09:28:59 - Sqoop export - Input split bytes=556
2020/06/08 09:28:59 - Sqoop export - Spilled Records=0
2020/06/08 09:28:59 - Sqoop export - Failed Shuffles=0
2020/06/08 09:28:59 - Sqoop export - Merged Map outputs=0
2020/06/08 09:28:59 - Sqoop export - GC time elapsed (ms)=351
2020/06/08 09:28:59 - Sqoop export - CPU time spent (ms)=5660
2020/06/08 09:28:59 - Sqoop export - Physical memory (bytes) snapshot=857292800
2020/06/08 09:28:59 - Sqoop export - Virtual memory (bytes) snapshot=10344894464
2020/06/08 09:28:59 - Sqoop export - Total committed heap usage (bytes)=739770368
2020/06/08 09:28:59 - Sqoop export - Peak Map Physical memory (bytes)=228241408
2020/06/08 09:28:59 - Sqoop export - Peak Map Virtual memory (bytes)=2586439680
2020/06/08 09:28:59 - Sqoop export - File Input Format Counters
2020/06/08 09:28:59 - Sqoop export - Bytes Read=0
2020/06/08 09:28:59 - Sqoop export - File Output Format Counters
2020/06/08 09:28:59 - Sqoop export - Bytes Written=0
2020/06/08 09:28:59 - Sqoop export - 2020/06/08 09:28:59 - Transferred 15.4961 KB in 18.7657 seconds (845.5859 bytes/sec)
2020/06/08 09:28:59 - Sqoop export - 2020/06/08 09:28:59 - Exported 102 records.
2020/06/08 09:28:59 - sqoop_export - Finished job entry [Sqoop export] (result=[true])
2020/06/08 09:28:59 - sqoop_export - Job execution finished
2020/06/08 09:28:59 - Spoon - Job has ended.
As the log shows, the job executed successfully. A file named sales_order.java is also generated in the directory containing the job file; it is the record class Sqoop produced for the sales_order table during the code-generation phase ("Beginning code generation" in the log).
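The other code-generation artifacts can be inspected too: the "Writing jar file" line above gives the compile directory. A quick look (the hash-named subdirectory differs on every run):

```shell
# Path taken from the log above; the hash subdirectory is run-specific.
ls /tmp/sqoop-root/compile/7c09fc949771ba78a584cc07c1cd6b3f/
```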
3. Check the MySQL table; the result is shown in Figure 5.
As Figure 5 shows, the Sqoop export job entry exported the contents of the HDFS file into the MySQL table.
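The same check can be scripted rather than done in a GUI client; a sketch, assuming the MySQL host, database name, and credentials:

```shell
# Expect a count of 102, matching "Exported 102 records." in the log.
mysql -h node3 -u root -p -e "SELECT COUNT(*) FROM dbname.sales_order;"
```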
II. Sqoop import
1. Create a job that imports MySQL table data into HDFS.
(1) Open PDI and create a new job, as shown in Figure 6.
(2) Edit the 'Sqoop import' job entry, as shown in Figure 7.
Notes:
- /user/root/sales_order is the target directory on HDFS; it must not already exist, or the import will fail. (An equivalent sqoop command line is sketched below.)
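A minimal sketch of the command this job entry corresponds to; the host, database name, and credentials are assumptions, while the table, target directory, and split column (order_number, visible in the BoundingValsQuery in the log below) come from this example:

```shell
# Hypothetical host/database/credentials. Sqoop would pick the table's
# primary key as the split column by default; --split-by makes it
# explicit. The default 4 mappers match "Num splits: 4" in the log.
sqoop import \
  --connect jdbc:mysql://node3:3306/dbname \
  --username root \
  -P \
  --table sales_order \
  --target-dir /user/root/sales_order \
  --split-by order_number
```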
2. Save and run the job. The log output is shown below.
2020/06/08 10:00:52 - Spoon - Starting job...
2020/06/08 10:00:52 - sqoop_import - Start of job execution
2020/06/08 10:00:52 - sqoop_import - Starting entry [Sqoop import]
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - fs.default.name is deprecated. Instead, use fs.defaultFS
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Running Sqoop version: 1.4.7-cdh6.1.0
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Setting your password on the command-line is insecure. Consider using -P instead.
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Preparing to use a MySQL streaming resultset.
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Beginning code generation
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Executing SQL statement: SELECT t.* FROM `sales_order` AS t LIMIT 1
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Executing SQL statement: SELECT t.* FROM `sales_order` AS t LIMIT 1
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - $HADOOP_MAPRED_HOME is not set
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Writing jar file: /tmp/sqoop-root/compile/7c09fc949771ba78a584cc07c1cd6b3f/sales_order.jar
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Setting zero DATETIME behavior to convertToNull (mysql)
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - Beginning import of sales_order
2020/06/08 10:00:52 - Sqoop import - 2020/06/08 10:00:52 - SQOOP_HOME is unset. May not be able to find all job dependencies.
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Disabling Erasure Coding for path: /user/root/.staging/job_1591323999364_0048
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Using read commited transaction isolation
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - BoundingValsQuery: SELECT MIN(`order_number`), MAX(`order_number`) FROM `sales_order`
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Split size: 25; Num splits: 4 from: 101 to: 202
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - number of splits:4
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - fs.default.name is deprecated. Instead, use fs.defaultFS
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Submitting tokens for job: job_1591323999364_0048
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Executing with tokens: []
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Submitted application application_1591323999364_0048
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - The url to track the job: http://manager:8088/proxy/application_1591323999364_0048/
2020/06/08 10:00:53 - Sqoop import - 2020/06/08 10:00:53 - Running job: job_1591323999364_0048
2020/06/08 10:01:02 - Sqoop import - 2020/06/08 10:01:02 - Job job_1591323999364_0048 running in uber mode : false
2020/06/08 10:01:02 - Sqoop import - 2020/06/08 10:01:02 - map 0% reduce 0%
2020/06/08 10:01:09 - Sqoop import - 2020/06/08 10:01:09 - map 75% reduce 0%
2020/06/08 10:01:11 - Sqoop import - 2020/06/08 10:01:11 - map 100% reduce 0%
2020/06/08 10:01:11 - Sqoop import - 2020/06/08 10:01:11 - Job job_1591323999364_0048 completed successfully
2020/06/08 10:01:11 - Sqoop import - 2020/06/08 10:01:11 - Counters: 33
2020/06/08 10:01:11 - Sqoop import - File System Counters
2020/06/08 10:01:11 - Sqoop import - FILE: Number of bytes read=0
2020/06/08 10:01:11 - Sqoop import - FILE: Number of bytes written=969236
2020/06/08 10:01:11 - Sqoop import - FILE: Number of read operations=0
2020/06/08 10:01:11 - Sqoop import - FILE: Number of large read operations=0
2020/06/08 10:01:11 - Sqoop import - FILE: Number of write operations=0
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of bytes read=481
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of bytes written=6120
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of read operations=24
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of large read operations=0
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of write operations=8
2020/06/08 10:01:11 - Sqoop import - HDFS: Number of bytes read erasure-coded=0
2020/06/08 10:01:11 - Sqoop import - Job Counters
2020/06/08 10:01:11 - Sqoop import - Launched map tasks=4
2020/06/08 10:01:11 - Sqoop import - Other local map tasks=4
2020/06/08 10:01:11 - Sqoop import - Total time spent by all maps in occupied slots (ms)=19048
2020/06/08 10:01:11 - Sqoop import - Total time spent by all reduces in occupied slots (ms)=0
2020/06/08 10:01:11 - Sqoop import - Total time spent by all map tasks (ms)=19048
2020/06/08 10:01:11 - Sqoop import - Total vcore-milliseconds taken by all map tasks=19048
2020/06/08 10:01:11 - Sqoop import - Total megabyte-milliseconds taken by all map tasks=19505152
2020/06/08 10:01:11 - Sqoop import - Map-Reduce Framework
2020/06/08 10:01:11 - Sqoop import - Map input records=102
2020/06/08 10:01:11 - Sqoop import - Map output records=102
2020/06/08 10:01:11 - Sqoop import - Input split bytes=481
2020/06/08 10:01:11 - Sqoop import - Spilled Records=0
2020/06/08 10:01:11 - Sqoop import - Failed Shuffles=0
2020/06/08 10:01:11 - Sqoop import - Merged Map outputs=0
2020/06/08 10:01:11 - Sqoop import - GC time elapsed (ms)=500
2020/06/08 10:01:11 - Sqoop import - CPU time spent (ms)=5680
2020/06/08 10:01:11 - Sqoop import - Physical memory (bytes) snapshot=783245312
2020/06/08 10:01:11 - Sqoop import - Virtual memory (bytes) snapshot=10357563392
2020/06/08 10:01:11 - Sqoop import - Total committed heap usage (bytes)=659554304
2020/06/08 10:01:11 - Sqoop import - Peak Map Physical memory (bytes)=202842112
2020/06/08 10:01:11 - Sqoop import - Peak Map Virtual memory (bytes)=2591666176
2020/06/08 10:01:11 - Sqoop import - File Input Format Counters
2020/06/08 10:01:11 - Sqoop import - Bytes Read=0
2020/06/08 10:01:11 - Sqoop import - File Output Format Counters
2020/06/08 10:01:11 - Sqoop import - Bytes Written=6120
2020/06/08 10:01:11 - Sqoop import - 2020/06/08 10:01:11 - Transferred 5.9766 KB in 18.733 seconds (326.697 bytes/sec)
2020/06/08 10:01:11 - Sqoop import - 2020/06/08 10:01:11 - Retrieved 102 records.
2020/06/08 10:01:11 - sqoop_import - Finished job entry [Sqoop import] (result=[true])
2020/06/08 10:01:11 - sqoop_import - Job execution finished
2020/06/08 10:01:11 - Spoon - Job has ended.
As the log shows, the job executed successfully, and a sales_order.java file is again generated in the job's directory.
3. Check the HDFS target directory; the result is shown in Figure 8.
As Figure 8 shows, the Sqoop import job entry imported the MySQL table's data into HDFS.
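Verification can also be done from the shell with standard HDFS commands:

```shell
# Expect four part files, one per map task; peek at the first rows of one.
hdfs dfs -ls /user/root/sales_order
hdfs dfs -cat /user/root/sales_order/part-m-00000 | head
```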