0. Data Lake Analytics(DLA)简介
关于Data Lake的概念,更多阅读可以参考:
en.wikipedia.org/wiki/Data_l…
以及AWS和Azure关于Data Lake的解读:
amazonaws-china.com/big-data/da…
azure.microsoft.com/en-us/solut…
终于,阿里云现在也有了自己的数据湖分析产品:www.aliyun.com/product/dat…
可以点击申请使用(目前公测阶段还属于邀测模式,我们会尽快审批申请),体验本教程的TPC-H CSV数据格式的数据分析之旅。
产品文档:help.aliyun.com/product/701…
1. 开通Data Lake Analytics与OSS服务
如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:help.aliyun.com/document_de…
进行产品开通服务申请。
2. 下载TPC-H测试数据集
可以从这下载TPC-H 100MB的数据集:
public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_d…
3. 上传数据文件到OSS
登录阿里云官网的OSS控制台:oss.console.aliyun.com/overview
规划您要使用的OSS bucket,创建或选择好后,点击“文件管理”,因为有8个数据文件,为每个数据文件创建对应的文件目录:
创建好8个目录如下:
点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。
上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。
至此,8个数据文件都上传到了您的OSS bucket中:
oss://xxx/tpch_100m/customer/customer.tbloss://xxx/tpch_100m/lineitem/lineitem.tbloss://xxx/tpch_100m/nation/nation.tbloss://xxx/tpch_100m/orders/orders.tbloss://xxx/tpch_100m/part/part.tbloss://xxx/tpch_100m/partsupp/partsupp.tbloss://xxx/tpch_100m/region/region.tbloss://xxx/tpch_100m/supplier/supplier.tbl复制代码
4. 登录Data Lake Analytics控制台
openanalytics.console.aliyun.com/
点击“登录数据库”,输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。
5. 创建Schema和Table
输入创建SCHEMA的语句,点击“同步执行”。
CREATE SCHEMA tpch_100m with DBPROPERTIES(
LOCATION = 'oss://test-bucket-julian-1/tpch_100m/', catalog='oss');复制代码
(注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)
Schema创建好后,在“数据库”的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。
建表语句语法参考:help.aliyun.com/document_de…
TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)
CREATE EXTERNAL TABLE nation (
N_NATIONKEY INT,
N_NAME STRING,
N_ID STRING,
N_REGIONKEY INT,
N_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';CREATE EXTERNAL TABLE lineitem (
L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DOUBLE,
L_EXTENDEDPRICE DOUBLE,
L_DISCOUNT DOUBLE,
L_TAX DOUBLE,
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';CREATE EXTERNAL TABLE orders (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';CREATE EXTERNAL TABLE supplier (
S_SUPPKEY INT,
S_NAME STRING,
S_ADDRESS STRING,
S_NATIONKEY INT,
S_PHONE STRING,
S_ACCTBAL DOUBLE,
S_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';CREATE EXTERNAL TABLE partsupp (
PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';CREATE EXTERNAL TABLE customer (
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';CREATE EXTERNAL TABLE part (
P_PARTKEY INT,
P_NAME STRING,
P_MFGR STRING,
P_BRAND STRING,
P_TYPE STRING,
P_SIZE INT,
P_CONTAINER STRING,
P_RETAILPRICE DOUBLE,
P_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';CREATE EXTERNAL TABLE region (
R_REGIONKEY INT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/region'; 查看更多复制代码
建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。
6. 执行TPC-H查询
TPC-H总共22条查询,如下:
Q1:
SELECT l_returnflag,
l_linestatus, Sum(l_quantity) AS sum_qty, Sum(l_extendedprice) AS sum_base_price, Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, Avg(l_quantity) AS avg_qty, Avg(l_extendedprice) AS avg_price, Avg(l_discount) AS avg_disc, Count(*) AS count_orderFROM lineitemWHERE l_shipdate <= date '1998-12-01' - INTERVAL '93' dayGROUP BY l_returnflag,
l_linestatusORDER BY l_returnflag,
l_linestatusLIMIT 1; 查看更多复制代码
Q2:
SELECT s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_commentFROM part,
supplier,
partsupp,
nation,
regionWHERE p_partkey = ps_partkeyAND s_suppkey = ps_suppkeyAND p_size = 35AND p_type LIKE '%NICKEL'AND s_nationkey = n_nationkeyAND n_regionkey = r_regionkeyAND r_name = 'MIDDLE EAST' 查看更多复制代码
Q3:
SELECT l_orderkey, Sum(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriorityFROM customer,
orders,
lineitemWHERE c_mktsegment = 'AUTOMOBILE'AND c_custkey = o_custkeyAND l_orderkey = o_orderkeyAND o_orderdate < date '1995-03-31'AND l_shipdate > date '1995-03-31'GROUP BY l_orderkey,
o_orderdate,
o_shippriorityORDER BY revenue DESC,
o_orderdateLIMIT 10;复制代码
Q4:
SELECT o_orderpriority, Count(*) AS order_countFROM orders,
lineitemWHERE o_orderdate >= date '1997-10-01'AND o_orderdate < date '1997-10-01' + INTERVAL '3' monthAND l_orderkey = o_orderkeyAND l_commitdate < l_receiptdateGROUP BY o_orderpriorityORDER BY o_orderpriorityLIMIT 1;复制代码
Q5:
SELECT n_name, Sum(l_extendedprice * (1 - l_discount)) AS revenueFROM customer,
orders,
lineitem,
supplier,
nation,
regionWHERE c_custkey = o_custkeyAND l_orderkey = o_orderkeyAND l_suppkey = s_suppkeyAND c_nationkey = s_nationkeyAND s_nationkey = n_nationkeyAND n_regionkey = r_regionkeyAND r_name = 'ASIA'AND o_orderdate >= date '1995-01-01'AND o_orderdate < date '1995-01-01' + INTERVAL '1' yearGROUP BY n_nameORDER BY revenue DESCLIMIT 1;复制代码
Q6:
SELECT sum(l_extendedprice * l_discount) AS revenueFROM lineitemWHERE l_shipdate >= date '1995-01-01'AND l_shipdate < date '1995-01-01' + interval '1' yearAND l_discount between 0.04 - 0.01 AND 0.04 + 0.01AND l_quantity < 24LIMIT 1;复制代码
Q7:
SELECT supp_nation,
cust_nation,
l_year, Sum(volume) AS revenueFROM ( SELECT n1.n_name AS supp_nation,
n2.n_name AS cust_nation, Extract(year FROM l_shipdate) AS l_year,
l_extendedprice * (1 - l_discount) AS volume FROM supplier,
lineitem,
orders,
customer,
nation n1,
nation n2 WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_nationkey AND c_nationkey = n2.n_nationkey AND ( (
n1.n_name = 'GERMANY'
AND n2.n_name = 'INDIA') OR (
n1.n_name = 'INDIA'
AND n2.n_name = 'GERMANY') ) AND l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31' ) AS shippingGROUP BY supp_nation,
cust_nation,
l_yearORDER BY supp_nation,
cust_nation,
l_yearLIMIT 1; 查看更多复制代码
Q8:
SELECT o_year, Sum( CASE
WHEN nation = 'INDIA' THEN volume ELSE 0
end) / Sum(volume) AS mkt_shareFROM ( SELECT Extract(year FROM o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) AS volume,
n2.n_name AS nation FROM part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey AND l_orderkey = o_orderkey AND o_custkey = c_custkey AND c_nationkey = n1.n_nationkey AND n1.n_regionkey = r_regionkey AND r_name = 'ASIA'
AND s_nationkey = n2.n_nationkey AND o_orderdate BETWEEN date '1995-01-01' AND date '1996-12-31'
AND p_type = 'STANDARD ANODIZED STEEL' ) AS all_nationsGROUP BY o_yearORDER BY o_yearLIMIT 1; 查看更多复制代码
Q9:
SELECT nation,
o_year, Sum(amount) AS sum_profitFROM ( SELECT n_name AS nation, Extract(year FROM o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount FROM part,
supplier,
lineitem,
partsupp,
orders,
nation WHERE s_suppkey = l_suppkey AND ps_suppkey = l_suppkey AND ps_partkey = l_partkey AND p_partkey = l_partkey AND o_orderkey = l_orderkey AND s_nationkey = n_nationkey AND p_name LIKE '%aquamarine%' ) AS profitGROUP BY nation,
o_yearORDER BY nation,
o_year DESCLIMIT 1; 查看更多复制代码
Q10:
SELECT c_custkey,
c_name, Sum(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_commentFROM customer,
orders,
lineitem,
nationWHERE c_custkey = o_custkeyAND l_orderkey = o_orderkeyAND o_orderdate >= date '1994-08-01'AND o_orderdate < date '1994-08-01' + INTERVAL '3' monthAND l_returnflag = 'R'AND c_nationkey = n_nationkeyGROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_commentORDER BY revenue DESCLIMIT 20; 查看更多复制代码
Q11:
SELECT ps_partkey, Sum(ps_supplycost * ps_availqty) AS valueFROM partsupp,
supplier,
nationWHERE ps_suppkey = s_suppkeyAND s_nationkey = n_nationkeyAND n_name = 'PERU'GROUP BY ps_partkeyHAVING Sum(ps_supplycost * ps_availqty) >
(SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value FROM partsupp,
supplier,
nation WHERE ps_suppkey = s_suppkey AND s_nationkey = n_nationkey AND n_name = 'PERU')ORDER BY value DESCLIMIT 1;复制代码
Q12:
SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1
else 0end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1else 0end) AS low_line_countFROM orders,
lineitemWHERE o_orderkey = l_orderkeyAND l_shipmode in ('MAIL', 'TRUCK')AND l_commitdate < l_receiptdateAND l_shipdate < l_commitdateAND l_receiptdate >= date '1996-01-01'AND l_receiptdate < date '1996-01-01' + interval '1' yearGROUP BY l_shipmodeORDER BY l_shipmodeLIMIT 1;复制代码
Q13:
SELECT c_count, count(*) AS custdistFROM ( SELECT c_custkey, count(o_orderkey) AS c_count FROM customer,
orders WHERE c_custkey = o_custkey AND o_comment NOT LIKE '%pending%accounts%'
GROUP BY c_custkey ) AS c_ordersGROUP BY c_countORDER BY custdist DESC, c_count DESCLIMIT 1;复制代码
Q14:
SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenueFROM lineitem,
partWHERE l_partkey = p_partkeyAND l_shipdate >= date '1996-01-01'AND l_shipdate < date '1996-01-01' + interval '1' monthLIMIT 1;复制代码
Q15:
WITH revenue0 AS
(SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenueFROM lineitemWHERE l_shipdate >= date '1993-01-01'AND l_shipdate < date '1993-01-01' + interval '3' monthGROUP BY l_suppkey
)SELECT s_suppkey, s_name, s_address, s_phone, total_revenueFROM supplier, revenue0WHERE s_suppkey = supplier_noAND total_revenue IN ( SELECT max(total_revenue) FROM revenue0 )ORDER BY s_suppkey;复制代码
Q16:
SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cntFROM partsupp,
partWHERE p_partkey = ps_partkeyAND p_brand <> 'Brand#23'AND p_type NOT LIKE 'PROMO BURNISHED%'AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)AND ps_suppkey NOT IN ( SELECT s_suppkey FROM supplier WHERE s_comment LIKE '%Customer%Complaints%' )GROUP BY p_brand, p_type, p_sizeORDER BY supplier_cnt DESC, p_brand, p_type, p_sizeLIMIT 1;复制代码
Q17:
SELECT
sum(l_extendedprice) / 7.0 AS avg_yearlyFROM
lineitem,
partWHERE p_partkey = l_partkey AND p_brand = 'Brand#44'
AND p_container = 'WRAP PKG'
AND l_quantity < ( SELECT
0.2 * avg(l_quantity) FROM
lineitem, part WHERE
l_partkey = p_partkey
);复制代码
Q18:
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)FROM customer,
orders,
lineitemWHERE o_orderkey IN ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING sum(l_quantity) > 315 )AND c_custkey = o_custkeyAND o_orderkey = l_orderkeyGROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalpriceORDER BY o_totalprice DESC, o_orderdateLIMIT 100;复制代码
Q19:
SELECT sum(l_extendedprice* (1 - l_discount)) AS revenueFROM lineitem,
partWHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 6 and l_quantity <= 6 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#13'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 21 and l_quantity <= 21 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' )LIMIT 1; 查看更多复制代码
Q20:
with temp_table as
( select 0.5 * sum(l_quantity) as col1 from lineitem,
partsupp where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1993-01-01'
and l_shipdate < date '1993-01-01' + interval '1' year)select s_name, s_addressfrom supplier,
nationwhere s_suppkey in ( select ps_suppkey from partsupp,
temp_table where ps_partkey in ( select p_partkey from part where p_name like 'dark%' ) and ps_availqty > temp_table.col1 ) and s_nationkey = n_nationkey and n_name = 'JORDAN'order by s_namelimit 1;复制代码
Q21:
select
s_name, count(*) as numwaitfrom
supplier,
lineitem l1,
orders,
nationwhere
s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F'
and l1.l_receiptdate > l1.l_commitdate and exists ( select
* from
lineitem l2 where
l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey
) and not exists ( select
* from
lineitem l3 where
l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate
) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA'group by
s_nameorder by
numwait desc,
s_namelimit 100; 查看更多复制代码
Q22:
with temp_table_1 as
( select avg(c_acctbal) as avg_value from customer where c_acctbal > 0.00 and substring(c_phone from 1 for 2) in ('33', '29', '37', '35', '25', '27', '43')
),
temp_table_2 as( select count(*) as count1 from orders, customer where o_custkey = c_custkey
)select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbalfrom ( select substring(c_phone from 1 for 2) as cntrycode, c_acctbal from customer, temp_table_1, temp_table_2 where substring(c_phone from 1
for 2) in ('33', '29', '37', '35', '25', '27', '43') and c_acctbal > temp_table_1.avg_value and temp_table_2.count1 = 0) as custsalegroup by cntrycodeorder by cntrycodelimit 1;复制代码
7. 异步执行查询
Data Lake Analytics支持“同步执行”模式和“异步执行”模式。“同步执行”模式下,控制台界面等待执行结果返回;“异步执行”模式下,立刻返回查询任务的ID。
点击“执行状态”,可以看到该异步查询任务的执行状态,主要分为:“RUNNING”,“SUCCESS”,“FAILURE”。
点击“刷新”,当STATUS变为“SUCCESS”时,表示查询成功,同时可查看查询耗时“ELAPSE_TIME”和查询扫描的数据字节数“SCANNED_DATA_BYTES”。
8. 查看查询历史
点击“执行历史”,可以看到您执行的查询的历史详细信息,包括:
1)查询语句;
2)查询耗时与执行具体时间;
3)查询结果返回行数;
4)查询状态;
5)查询扫描的字节数;
6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。
查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。
{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv
{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata复制代码
其中QueryLocation为:
aliyun-oa-query-results-<your_account_id>-<oss_region>复制代码
9. 后续
至此,本教程一步一步教您如何利用Data
Lake Analytics云产品分析您OSS上的CSV格式的数据文件。除了CSV文件外,Data Lake
Analytics还支持Parquet、ORC、json、RCFile、AVRO等多种格式文件的数据分析能力。特别是Parquet、ORC,相比CSV文件,有极大的性能和成本优势(同样内容的数据集,拥有更小的存储空间、更快的查询性能,这也意味着更低的分析成本)。
后续陆续会有更多教程和文章,手把手教您轻松使用Data Lake Analytics进行数据湖上数据分析和探索,开启您的云上低成本、即存即用的数据分析和探索之旅。
原文链接:yq.aliyun.com/articles/62…