Kudu Data Import, Attempt 1: TextFile Data into Hive, then Hive Data into Kudu
Background
- A single table in the SQL Server database holds billions of rows, and even the partitioning scheme can no longer return query results. So the export feature was used to dump the data into a text file (over 40 GB).
- For that reason, this experiment uses a smaller sample: 610,000 rows in a 74 MB text file.
Why DataX was chosen
- To try to keep one unified synchronization scheme across heterogeneous data sources. (It turned out not to work.)
- To have the data land in Hive already compressed as ORC, reducing storage size and improving columnar query efficiency, so that the later step of querying Hive and importing into Kudu would be faster. (That did not work either.)
1. Create the Hive table
In Hive, the column types must stay consistent with the fields in the TextFile:
```sql
create table event_hive_3(
  `#auto_id` string,
  `#product_id` int,
  `#event_name` string,
  `#part_date` int,
  `#server_id` int,
  `#account_id` bigint,
  `#user_id` bigint,
  part_time string,
  getitemid bigint,
  consumemoneynum bigint,
  price bigint,
  getitemcnt bigint,
  taskstate bigint,
  tasktype bigint,
  battlelev bigint,
  level bigint,
  itemid bigint,
  itemcnt bigint,
  moneynum bigint,
  moneytype bigint,
  vip bigint,
  logid bigint
)
row format delimited fields terminated by '\t'
stored as orc;
```
2. Create the Kudu table
This part is up to you~ Here it is done by running a unit test in IDEA (an Impala-SQL alternative is sketched after the snippet):
```java
// In IDEA, running the unit test eventanalysisrepositorytest.createTable() is enough
public void createTable() throws Exception {
    repository.getClient();
    repository.createTable(event_sjmy.class, true);
}
```
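If you would rather not go through the Java repository, the Kudu table can also be created directly from impala-shell. The sketch below rests on assumptions the original does not state (primary key on `#auto_id`, four hash partitions, column names mirroring the Hive table); note that a table created this way is already visible to Impala, so the external mapping in step 3 would not be needed for it:

```sql
-- Sketch only: primary key, partitioning, and column names below are assumptions.
create table event_sjmy_datax (
  `#auto_id` string,
  `#product_id` int,
  `#event_name` string,
  `#part_date` int,
  `#server_id` int,
  `#account_id` bigint,
  `#user_id` bigint,
  part_time string,
  getitemid bigint,
  consumemoneynum bigint,
  price bigint,
  getitemcnt bigint,
  taskstate bigint,
  tasktype bigint,
  battlelev bigint,
  level bigint,
  itemid bigint,
  itemcnt bigint,
  moneynum bigint,
  moneytype bigint,
  vip bigint,
  logid bigint,
  primary key (`#auto_id`)
)
partition by hash (`#auto_id`) partitions 4
stored as kudu
tblproperties ('kudu.master_addresses' = 'sdmain:7051');
```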
3. Create the Impala mapping table
Open impala-shell or Hue:
```sql
use sd_dev_sdk_mobile;

create external table `event_sjmy_datax`
stored as kudu
tblproperties(
  'kudu.table_name' = 'event_sjmy_datax',
  'kudu.master_addresses' = 'sdmain:7051'
);
```
4. Write the DataX job
The reason for not simply LOADing the file straight into Hive is to add a compression step along the way: shrink the footprint and convert the data to columnar (ORC) storage.
```shell
# Create the job file
vi /home/jobs/texttohdfs.json
```

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 2 }
    },
    "content": [
      {
        "reader": {
          "name": "txtfilereader",
          "parameter": {
            "path": ["/home/data"],
            "encoding": "gb2312",
            "fieldDelimiter": "\t",
            "column": [
              { "index": 0,  "type": "string" },
              { "index": 1,  "type": "int" },
              { "index": 2,  "type": "string" },
              { "index": 3,  "type": "int" },
              { "index": 4,  "type": "int" },
              { "index": 5,  "type": "long" },
              { "index": 6,  "type": "long" },
              { "index": 7,  "type": "string" },
              { "index": 8,  "type": "long" },
              { "index": 9,  "type": "long" },
              { "index": 10, "type": "long" },
              { "index": 11, "type": "long" },
              { "index": 12, "type": "long" },
              { "index": 13, "type": "long" },
              { "index": 14, "type": "long" },
              { "index": 15, "type": "long" },
              { "index": 16, "type": "long" },
              { "index": 17, "type": "long" },
              { "index": 18, "type": "long" },
              { "index": 19, "type": "long" },
              { "index": 20, "type": "long" },
              { "index": 21, "type": "long" }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://sdmain:8020",
            "fileType": "orc",
            "path": "/user/hive/warehouse/dataxtest.db/event_hive_3",
            "fileName": "event_hive_3",
            "writeMode": "append",
            "fieldDelimiter": "\t",
            "compress": "NONE",
            "column": [
              { "name": "#auto_id",        "type": "string" },
              { "name": "#product_id",     "type": "int" },
              { "name": "#event_name",     "type": "string" },
              { "name": "#part_date",      "type": "int" },
              { "name": "#server_id",      "type": "int" },
              { "name": "#account_id",     "type": "bigint" },
              { "name": "#user_id",        "type": "bigint" },
              { "name": "part_time",       "type": "string" },
              { "name": "getitemid",       "type": "bigint" },
              { "name": "consumemoneynum", "type": "bigint" },
              { "name": "price",           "type": "bigint" },
              { "name": "getitemcnt",      "type": "bigint" },
              { "name": "taskstate",       "type": "bigint" },
              { "name": "tasktype",        "type": "bigint" },
              { "name": "battlelev",       "type": "bigint" },
              { "name": "level",           "type": "bigint" },
              { "name": "itemid",          "type": "bigint" },
              { "name": "itemcnt",         "type": "bigint" },
              { "name": "moneynum",        "type": "bigint" },
              { "name": "moneytype",       "type": "bigint" },
              { "name": "vip",             "type": "bigint" },
              { "name": "logid",           "type": "bigint" }
            ]
          }
        }
      }
    ]
  }
}
```
4.1 Run the DataX job
Note: put the source data file under /home/data first. The source file must be a flat, two-dimensional table.
```text
# Sample rows from the source TextFile:
{432297b4-ca5f-4116-901e-e19df3170880}  701  获得筹码  201906  2  4974481  1344825  00:01:06  0  0  0  0  0  0  0  0  0  0  100  2  3  31640
{caaf09c6-037d-43b9-901f-4cb5918fb774}  701  获得筹码  201906  2  5605253  1392330  00:02:25  0  0  0  0  0  0  0  0  0  0  390  2  10  33865
```

```shell
cd $DATAX_HOME/bin
python datax.py /home/jobs/texttohdfs.json
```
(Result screenshot omitted.)
5. Write the Hive data into the Kudu table via Impala
Open the shell:
```shell
# Enter impala-shell:
impala-shell
```

```sql
-- Select the database (can be omitted when the table name is fully qualified):
use sd_dev_sdk_mobile;

insert into sd_dev_sdk_mobile.event_sjmy_datax
select `#auto_id`, `#event_name`, `#part_date`, `#product_id`, `#server_id`, `#account_id`, `#user_id`,
       part_time, getitemid, consumemoneynum, price, getitemcnt, taskstate, tasktype, battlelev,
       level, itemid, itemcnt, moneynum, moneytype, vip, logid
from event_hive_3;
```
(Result screenshot omitted.)
Look at this pitiful result: the speed is unacceptable, so I gave up on this approach.
Time to eat my words: cause analysis
- DataX reads the TextFile into Hive slowly: DataX reads a text file with a single thread (multi-threaded text-file reading may arrive after version 2.0), which simply wastes the cluster and its 12-core CPUs. Worse, the file cannot be manually split into tasks that run on separate nodes.
- Hive to Kudu is slow: with `insert into xxx select *`, watch that `*`. If every column is read, most of the columnar advantage is gone, so converting to ORC was a wasted step (see the example after this list).
- Impala consumes a lot of memory while reading the Hive data!
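To make the `select *` point concrete: ORC only pays off when a query touches a few columns. The query below is a hypothetical example of the kind of column-pruned scan that benefits from ORC, unlike the full-row copy used above:

```sql
-- Hypothetical column-pruned query: ORC lets the scan skip every column not listed here.
select `#event_name`, count(*) as cnt
from event_hive_3
where `#part_date` = 201906
group by `#event_name`;
```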
The only benefit: lower disk usage (the 74 MB file shrinks to only 15 MB once compressed on HDFS). But what use is that?! What I need is import speed. If compression were the only goal, the file should simply be LOADed into Hive and then written into a new ORC table with a Hive INSERT, which makes full use of the cluster.
The code is as follows:
```sql
-- 1. Load the raw data into the TextFile staging table
load data inpath '/home/data/event-19-201906.txt' into table event_hive_3normal;
-- 2. Query it back out and write it into the ORC table
insert into event_hive_3orc select * from event_hive_3normal;
```
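For completeness, the two tables referenced above are not defined anywhere in the original; a plausible DDL, assuming they mirror the columns of event_hive_3, would be:

```sql
-- Assumed DDL (not shown in the original): a plain-text staging table matching the tab-delimited export...
create table event_hive_3normal(
  `#auto_id` string, `#product_id` int, `#event_name` string, `#part_date` int,
  `#server_id` int, `#account_id` bigint, `#user_id` bigint, part_time string,
  getitemid bigint, consumemoneynum bigint, price bigint, getitemcnt bigint,
  taskstate bigint, tasktype bigint, battlelev bigint, level bigint,
  itemid bigint, itemcnt bigint, moneynum bigint, moneytype bigint,
  vip bigint, logid bigint
)
row format delimited fields terminated by '\t'
stored as textfile;

-- ...and the ORC target table with the same column list.
create table event_hive_3orc(
  `#auto_id` string, `#product_id` int, `#event_name` string, `#part_date` int,
  `#server_id` int, `#account_id` bigint, `#user_id` bigint, part_time string,
  getitemid bigint, consumemoneynum bigint, price bigint, getitemcnt bigint,
  taskstate bigint, tasktype bigint, battlelev bigint, level bigint,
  itemid bigint, itemcnt bigint, moneynum bigint, moneytype bigint,
  vip bigint, logid bigint
)
stored as orc;
```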
The experiment is a failure~
Ideas for optimization:
1. Make full use of the cluster's CPU resources.
2. Avoid moving the data through huge batch query-and-write statements.
Optimization plan: bring out the old toolbox. A single Flume agent reads the local data file and sinks it into Kafka, then multiple Flume agents across the cluster consume the Kafka topic and sink the data into Kudu. See you this afternoon!