核对Spark生成的数据流程（一）

程序员文章站 2022-03-03 20:42:25

...

日常开发中经常会验证spark生成的数据是否和源oracle库中数据join之后的结果相同？
也就是判断spark sql --> hive sql --> oracle sql 执行结果一致即可，比如我们需要核对201907月的数据：
(1)在测试oracle中执行sql

select
st.sst_code,
sum(case when o.order_type ='10721023' and pdet.part_type='10741001' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_total_p,
sum(case when o.order_type ='10721023' and pdet.part_type='10741003' then nvl(pdet.promotion_amount*t.fkimg,0) else 0 end) einvoice_accessory_total_p
from sbpopt.tt_einvoice_sap d left join sbpopt.tt_einvoice_item_sap t on d.vbeln = t.vbeln
left join (select ts.sst_code,tm.sst_name,ts.to_sender from sbpopt.tm_sst_sender ts, sbpopt.tm_sst tm
where ts.sst_code = tm.sst_code group by ts.sst_code,ts.to_sender,tm.sst_name ) st on substr(d.kunag, 6, 10) = st.to_sender
left join sbpopt.tt_part_order_detail det on t.order_code = det.order_code and t.aupos = det.no
left join sbpopt.tt_part_order o on det.order_id = o.order_id
left join sbpopt.tm_promotion_package pac on det.part_code = pac.pp_code and det.sn = pac.pp_version
left join sbpopt.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
left join sbpopt.tm_part_maindata pm on t.matnr = pm.part_code
where d.IS_CANCELED != 'Y' 
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and o.order_type = 10721023
and st.sst_code in ('74308100', '74308310') 
and to_char(d.fkdat,'yyyyMM') = '201907'
group by st.sst_code;

核对Spark生成的数据流程（一）
(2)其中sparksql执行的结果为dws表，直接统计即可

select 
aa.sst_code ,
sum(aa.einvoice_amount_tyre_p) ,
sum(aa.einvoice_accessory_oil_p)
from dws_tt_einvoice_shipping aa where aa.sst_code in ('74308100', '74308310') 
and aa.billing_date>='20190701' and aa.billing_date<'20190801'
group by aa.rssc_name,aa.rssc_code,aa.sst_name,aa.sst_code;

核对Spark生成的数据流程（一）
一个经销商指标只对上一个指标，另一个经销商两个指标完全一致？？
(3)查找wd层汇总数据

select
sum(case when pdet.part_type='10741001' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_total_p,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) einvoice_accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308310' and substr(wi.te_fkdat,1,10)>='20190701' and substr(wi.te_fkdat,1,10)<'20190801'

核对Spark生成的数据流程（一）
wd层数据竟然和oracle中一致，看来是dws层代码逻辑有问题？
难道是left join有问题。先验证下数据再说

--1、查询wd明细数据A
select
substr(wi.te_fkdat,1,10) date,
sum(case when pdet.part_type='10741003' then nvl(pdet.promotion_amount*wi.tei_fkimg,0) else 0 end) accessory_total_p
from (select * from asmp.wd_tt_einvoice_item_sap e) wi
left join  (select * from asmp.wd_tt_part_order p ) wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
left join asmp.tm_promotion_package pac on wo.d_part_code = pac.pp_code and wo.d_sn = pac.pp_version
left join asmp.tm_promotion_package_detail pdet on pac.pp_id = pdet.pp_id
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (pdet.part_type = '10741001' or pdet.part_type = '10741003' ) and wo.order_type = 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)
--2、查询wd明细数据B
select
substr(wi.te_fkdat,1,10) date,
sum(case when wi.tpm_parttype='10741003' then nvl(wi.tei_zsp_value,0)+nvl(wi.tei_netwr,0) else 0 end) einvoice_accessory_total
from asmp.wd_tt_einvoice_item_sap wi
left join asmp.wd_tt_part_order wo on wi.tei_order_code = wo.d_order_code and wi.tei_aupos = wo.d_no
where wi.partition_brand='vw' and wi.te_is_canceled !='Y'
and (wi.tpm_parttype = '10741001' or wi.tpm_parttype = '10741003' ) and wo.order_type != 10721023
and wi.sst_code='74308100' and substr(wi.te_fkdat,1,7)='2019-07'
group by substr(wi.te_fkdat,1,10)

通过对比发现 A_SQL中有的天数，B_SQL并没有6和27号，因此判断出确实是join的问题，最后把left join改成full join即可。

上一篇：内核对象杂谈

下一篇：《Windows》内核对象

核对Spark生成的数据流程（一）

FreemarkerJavaDemo【Android将表单数据生成Word文档的方案之一（基于freemarker2.3.28，只能java生成）】

一键生成微信个人专属数据报告,了解你的微信社交历史

pandas使用apply多列生成一列数据的实例

深度学习从零开始 —— 神经网络数学基础（一），学习Keras库的使用，神经网络简单流程，MNIST数据集使用

关于数据库主键生成策略的一些想法

关于数据库主键生成策略的一些想法

解析从 Vue 初始化到首次渲染生成 DOM 以及从 Vue 数据修改到页面更新 DOM 的流程

数组根据某个条件筛选出符合的数据，生成一个新的数组

给你的个人微信朋友圈数据生成一本电子书吧！

用一个完整的案例讲解Python数据分析的整个流程和基础知识

核对Spark生成的数据流程（一）

FreemarkerJavaDemo【Android将表单数据生成Word文档的方案之一（基于freemarker2.3.28，只能java生成）】

一键生成微信个人专属数据报告,了解你的微信社交历史

pandas使用apply多列生成一列数据的实例

深度学习 从零开始 —— 神经网络数学基础（一），学习Keras库的使用，神经网络简单流程，MNIST数据集使用

关于数据库主键生成策略的一些想法

关于数据库主键生成策略的一些想法

解析从 Vue 初始化到首次渲染生成 DOM 以及从 Vue 数据修改到页面更新 DOM 的流程

数组根据某个条件筛选出符合的数据，生成一个新的数组

给你的个人微信朋友圈数据生成一本电子书吧！

用一个完整的案例讲解Python数据分析的整个流程和基础知识

深度学习从零开始 —— 神经网络数学基础（一），学习Keras库的使用，神经网络简单流程，MNIST数据集使用