[4] Hive DDL: create/drop/truncate table
In the syntax notation below, clauses wrapped in square brackets [ ] are optional; everything else is required.
Create
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
Data types and storage formats supported in CREATE TABLE
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| JSONFILE -- (Note: Available in Hive 4.0.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
Example: create a table partitioned by dt and country, stored as SEQUENCEFILE. By default, fields within a row are separated by Ctrl-A and each line is one record.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Example: the same table partitioned by dt and country, stored as SEQUENCEFILE, but with the field delimiter (and, if needed, the line delimiter) specified explicitly instead of relying on the defaults.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
The page_view table above stores its data under the directory configured by hive.metastore.warehouse.dir in $HIVE_HOME/conf/hive-site.xml, in the subpath /page_view.
External tables: provide a LOCATION when creating an external table.
This overrides the default location, which is convenient when the data has already been generated. When an external table is dropped, the data in the file system is not deleted. (For example, if data already exists on HDFS, you can create a Hive external table over it with LOCATION pointing at that HDFS path, but you must make sure the row and column delimiters of the existing data match the delimiters declared in the CREATE TABLE statement.)
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
Create Table As Select (CTAS)
You can create a new table from the result of a SELECT over another table. CTAS is atomic: the new table is not visible to other users until all of the query results have been populated, so other users either see the complete result or see nothing at all.
CTAS has the following restrictions:
The target table cannot be a partitioned table
The target table cannot be an external table
The target table cannot be a list bucketing table
Example below. ROW FORMAT is optional; in my past work this form was usually combined with STORED AS PARQUET, and earlier with STORED AS ORC (a Parquet variant is sketched after the example).
The storage format of the table created this way can differ from the storage format of the source table in the SELECT, which is one of Hive's more attractive features.
If the SELECT does not assign column aliases, the columns of the new table are named _col0, _col1, _col2, and so on.
CREATE TABLE new_key_value_store
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
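As mentioned above, the only change for a Parquet-backed CTAS is the file format. A minimal sketch, assuming hypothetical table and column names (events_parquet, events_raw, user_id, event_time, event_type):
CREATE TABLE events_parquet
STORED AS PARQUET       -- Parquet output; the source table's format can be different
AS
SELECT user_id, event_time, event_type   -- plain columns keep their names, so no _colN aliases appear
FROM events_raw;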
Create Table Like
The LIKE form copies the definition (schema) of an existing table or view under a new name; it does not copy any data.
CREATE TABLE empty_key_value_store
LIKE key_value_store [TBLPROPERTIES (property_name=property_value, ...)];
Bucketed Sorted Tables
Example: partition by dt and country, then bucket by userid into 32 buckets, with rows in each bucket sorted by viewTime.
Buckets are a finer-grained division than partitions; bucketing is done within each partition.
MAP KEYS TERMINATED BY and COLLECTION ITEMS TERMINATED BY are used when the data contains map or list (array) columns.
CLUSTERED BY and SORTED BY only describe how the data is read, not how it is written: when inserting into a bucketed table, the number of reducers must equal the number of buckets and the insert query must use CLUSTER BY and SORT BY (see the sketch after the CREATE statement below).
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
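A minimal sketch of one way to load the bucketed table above: hive.enforce.bucketing lets Hive pick the reducer count and clustering instead of setting them by hand (pre-Hive-2.x; always on from 2.x). The staging table raw_page_view and the partition values are assumptions for illustration:
set hive.enforce.bucketing = true;   -- makes Hive use as many reducers as buckets and cluster by the bucket column
FROM raw_page_view pv                -- raw_page_view is an assumed staging table
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US')
SELECT pv.viewTime, pv.userid, pv.page_url, pv.referrer_url, pv.ip
WHERE pv.dt = '2008-06-08' AND pv.country = 'US';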
Skewed Tables
When one or a few columns have heavily skewed values, this feature can improve performance: Hive splits the rows carrying the skewed values out of the table's regular storage into separate files.
Example: one skewed column with three skewed values.
CREATE TABLE list_bucket_single (key STRING, value STRING)
SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];
Example: two skewed columns.
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];
Temporary Tables
A temporary table is visible only to the current session; its data is stored in the user's scratch directory and is deleted when the session ends.
If a temporary table is created with the same name as an existing permanent table, every operation on that table name within the session refers to the temporary table.
Temporary tables do not support partition columns or indexes.
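A minimal sketch, with a hypothetical table name:
CREATE TEMPORARY TABLE tmp_page_view (viewTime INT, userid BIGINT, page_url STRING)  -- tmp_page_view is illustrative
STORED AS ORC;
-- The table and its data are dropped automatically when the session ends.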
Constraints
Hive supports unvalidated (NOVALIDATE) primary and foreign key constraints. Because the constraints are not validated, the upstream system must guarantee the integrity of the data before loading it into Hive.
create table pk(id1 integer, id2 integer,
primary key(id1, id2) disable novalidate);
create table fk(id1 integer, id2 integer,
constraint c1 foreign key(id1, id2) references pk(id2, id1) disable novalidate);
Show tables
show tables
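The full form also accepts a database name and a wildcard pattern; my_db and the pattern below are illustrative:
SHOW TABLES;                 -- tables in the current database
SHOW TABLES IN my_db;        -- tables in another database
SHOW TABLES 'page_view*';    -- tables whose names match a wildcard pattern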
Drop
drop table
DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later)
This removes both the metadata and the data of the table.
If Trash is configured, the data is moved to the .Trash/Current directory, but the metadata is lost for good.
Dropping an EXTERNAL table does not delete the data in the file system.
When a table referenced by views is dropped, the views become invalid and must be dropped or recreated.
If PURGE is specified, the data is deleted immediately instead of being moved to the Trash directory, so a table dropped by mistake cannot be recovered.
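Hedged examples with an assumed table name:
DROP TABLE IF EXISTS page_view_staging;        -- data goes to Trash if Trash is enabled
DROP TABLE IF EXISTS page_view_staging PURGE;  -- data is removed immediately and cannot be recovered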
truncate table
TRUNCATE TABLE table_name [PARTITION partition_spec];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Removes all rows from a table or from the specified partition(s).
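Hedged examples against the page_view table defined earlier (the partition values are illustrative):
TRUNCATE TABLE page_view;                                             -- remove all rows from all partitions
TRUNCATE TABLE page_view PARTITION (dt='2008-06-08', country='US');   -- remove rows from one partition only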