核对Spark生成的数据流程(一)
程序员文章站
2022-03-03 20:42:25
...
日常开发中经常会验证spark生成的数据是否和源oracle库中数据join之后的结果相同?
也就是判断spark sql --> hive sql --> oracle sql 执行结果一致即可,比如我们需要核对201907月的数据:
(1)在测试oracle中执行sql
select
st.sst_code,
sum(case when o.order_type ='10721023' and pdet.part_type='10741001' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_total_p,
sum(case when o.order_type ='10721023' and pdet.part_type='10741003' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_accessory_total_p
from sbpopt.tt_einvoice_sap d left join sbpopt.tt_einvoice_item_sap t on d.vbeln = t.vbeln
left join (select ts.sst_code,tm.sst_name,ts.to_sender from sbpopt.tm_sst_sender ts, sbpopt.tm_sst tm
where ts.sst_code = tm.sst_code group by ts.sst_code,ts.to_sender,tm.sst_name ) st on substr(d.kunag, 6, 10) = st.to_sender
left join sbpopt.tt_part_order_detail det on t.order_code = det.order_code and t.aupos = det.no
left join sbpopt.tt_part_order o on det.order_id = o.order_id
left join sbpopt.tm_promotion_package pac on det.part_code = pac.pp_code and det.sn = pac.pp_version
left join sbpopt.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
left join sbpopt.tm_part_maindata pm on t.matnr = pm.part_code
where d.IS_CANCELED != 'Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and o.order_type = 10721023
and st.sst_code in ('74308100', '74308310')
and to_char(d.fkdat,'yyyyMM') = '201907'
group by st.sst_code;
(2)其中sparksql执行的结果为dws表,直接统计即可
select
aa.sst_code ,
sum(aa.einvoice_amount_tyre_p) ,
sum(aa.einvoice_accessory_oil_p)
from dws_tt_einvoice_shipping aa where aa.sst_code in ('74308100', '74308310')
and aa.billing_date>='20190701' and aa.billing_date<'20190801'
group by aa.rssc_name,aa.rssc_code,aa.sst_name,aa.sst_code;
一个经销商指标只对上一个指标,另一个经销商两个指标完全一致??
(3)查找wd层汇总数据
select
sum(case when pdet.part_type='10741001' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_total_p,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308310' and substr(wi.te_fkdat,1,10)>='20190701' and substr(wi.te_fkdat,1,10)<'20190801'
wd层数据竟然和oracle中一致,看来是dws层代码逻辑有问题?
难道是left join有问题。先验证下数据再说
--1、查询wd明细数据A
select
substr(wi.te_fkdat,1,10) date,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)
--2、查询wd明细数据B
select
substr(wi.te_fkdat,1,10) date,
sum(case when wi.tpm_parttype='10741003' then nvl(wi.tei_zsp_value,0)+nvl(wi.tei_netwr,0) else 0 end) einvoice_accessory_total
from asmp.wd_tt_einvoice_item_sap wi
left join asmp.wd_tt_part_order wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (wi.tpm_parttype = '10741001' or wi.tpm_parttype = '10741003' ) and wo.order_type != 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)
通过对比发现 A_SQL中有的天数,B_SQL并没有6和27号,因此判断出确实是join的问题,最后把left join改成full join即可。
上一篇: 内核对象杂谈
下一篇: 《Windows》内核对象
推荐阅读
-
FreemarkerJavaDemo【Android将表单数据生成Word文档的方案之一(基于freemarker2.3.28,只能java生成)】
-
一键生成微信个人专属数据报告,了解你的微信社交历史
-
pandas使用apply多列生成一列数据的实例
-
深度学习 从零开始 —— 神经网络数学基础(一),学习Keras库的使用,神经网络简单流程,MNIST数据集使用
-
关于数据库主键生成策略的一些想法
-
关于数据库主键生成策略的一些想法
-
解析从 Vue 初始化到首次渲染生成 DOM 以及从 Vue 数据修改到页面更新 DOM 的流程
-
数组根据某个条件筛选出符合的数据,生成一个新的数组
-
给你的个人微信朋友圈数据生成一本电子书吧!
-
用一个完整的案例讲解Python数据分析的整个流程和基础知识