欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

核对Spark生成的数据流程(一)

程序员文章站 2022-03-03 20:42:25
...

日常开发中经常会验证spark生成的数据是否和源oracle库中数据join之后的结果相同?
也就是判断spark sql --> hive sql --> oracle sql 执行结果一致即可,比如我们需要核对201907月的数据:
(1)在测试oracle中执行sql

select
st.sst_code,
sum(case when o.order_type ='10721023' and pdet.part_type='10741001' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_total_p,
sum(case when o.order_type ='10721023' and pdet.part_type='10741003' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_accessory_total_p
from sbpopt.tt_einvoice_sap d left join sbpopt.tt_einvoice_item_sap t on d.vbeln = t.vbeln
left join (select ts.sst_code,tm.sst_name,ts.to_sender from sbpopt.tm_sst_sender ts, sbpopt.tm_sst tm
where ts.sst_code = tm.sst_code group by ts.sst_code,ts.to_sender,tm.sst_name ) st on substr(d.kunag, 6, 10) = st.to_sender
left join sbpopt.tt_part_order_detail det on t.order_code = det.order_code and t.aupos = det.no
left join sbpopt.tt_part_order o on det.order_id = o.order_id
left join sbpopt.tm_promotion_package pac on det.part_code = pac.pp_code and det.sn = pac.pp_version
left join sbpopt.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
left join sbpopt.tm_part_maindata pm on t.matnr = pm.part_code
where d.IS_CANCELED != 'Y' 
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and o.order_type = 10721023
and st.sst_code in ('74308100', '74308310') 
and to_char(d.fkdat,'yyyyMM') = '201907'
group by st.sst_code;

核对Spark生成的数据流程(一)
(2)其中sparksql执行的结果为dws表,直接统计即可

select 
aa.sst_code ,
sum(aa.einvoice_amount_tyre_p) ,
sum(aa.einvoice_accessory_oil_p)
from dws_tt_einvoice_shipping aa where aa.sst_code in ('74308100', '74308310') 
and aa.billing_date>='20190701' and aa.billing_date<'20190801'
group by aa.rssc_name,aa.rssc_code,aa.sst_name,aa.sst_code;

核对Spark生成的数据流程(一)
一个经销商指标只对上一个指标,另一个经销商两个指标完全一致??
(3)查找wd层汇总数据

select
sum(case when pdet.part_type='10741001' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_total_p,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308310' and substr(wi.te_fkdat,1,10)>='20190701' and substr(wi.te_fkdat,1,10)<'20190801'

核对Spark生成的数据流程(一)
wd层数据竟然和oracle中一致,看来是dws层代码逻辑有问题?
难道是left join有问题。先验证下数据再说

--1、查询wd明细数据A
select
substr(wi.te_fkdat,1,10) date,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join  (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)
--2、查询wd明细数据B
select
substr(wi.te_fkdat,1,10) date,
sum(case when wi.tpm_parttype='10741003' then nvl(wi.tei_zsp_value,0)+nvl(wi.tei_netwr,0) else 0 end) einvoice_accessory_total
from asmp.wd_tt_einvoice_item_sap wi
left join asmp.wd_tt_part_order wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (wi.tpm_parttype = '10741001' or wi.tpm_parttype = '10741003' ) and wo.order_type != 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)

通过对比发现 A_SQL中有的天数,B_SQL并没有6和27号,因此判断出确实是join的问题,最后把left join改成full join即可。