当前位置 博文首页 > i_mycode的博客:hiveSql完成类递归计算
和同事一起探讨了个问题挺有意思,记录一下!
问题
每个月有数据如下:
需要计算截止到每个月,前月距离当月月数乘前月阅读数累计之和。
类似递归操作:本月到第一个月的累计和,前一个月到第一个月的累计和…一直到第二个月到第一个月的和 再加上第一个月 最终所有累积和的合计。
听起来比较绕,举个栗子:
图中每个月的阅读量是a,b,c,d,e。计算结果result如图公式。
到这里可以先思考下怎么做。。。
最终同事用excel完成了计算。这里想用hiveSql看看能不能统计,想了下思路。
原本想的是横向拉平数据,在每个月的后面都计算好当月所需要的前月乘后的数据,再横向累加,如下图:
这个思路用sql实现有两个关键点:a.拉平数据 和 b.对应拉平数据后的乘数(前月距离本月的月数)
select
createtime
,cast(split(str,',')[0] as bigint) as month1
,cast(split(str,',')[1] as bigint) as month2
,cast(split(str,',')[2] as bigint) as month3
,cast(split(str,',')[3] as bigint) as month4
,cast(split(str,',')[4] as bigint) as month5
from
(select
createtime,wm_concat(avg_view,',') as str
from tablename
group by createtime
) tmp;
select
createtime,
month1,lag(irank,1,0) over(order by irank) as lag1,
month2,lag(irank,2,0) over(order by irank) as lag2,
month2,lag(irank,3,0) over(order by irank) as lag3,
month2,lag(irank,4,0) over(order by irank) as lag4,
month2,lag(irank,5,0) over(order by irank) as lag5
from
(select
createtime
,cast(split(viewstr,',')[0] as bigint) as month1
,cast(split(viewstr,',')[1] as bigint) as month2
,cast(split(viewstr,',')[2] as bigint) as month3
,cast(split(viewstr,',')[3] as bigint) as month4
,cast(split(viewstr,',')[4] as bigint) as month5
,ROW_NUMBER() OVER(ORDER BY createtime) AS irank -- 多出一行自增序列
from
(select
createtime
,wm_concat(cast(avg_view as string),',') as viewstr
from tablename
group by createtime
) tmp
) tmp1
上述sql结果如图:
再将sql内的monthN*legN再横向累加即可
归结上述思路,发现繁琐的步骤就再横向拉平那块,88个月split要写88个,上面的leg又是88个。hive中对于这种繁琐的操作应该是有内置函数的。之前有写过内置函数的总结篇。HIVE窗口函数合集,发现果然有。
利用开窗函数的window子句其中的累加到当前行。只需要使用两次即可:
SELECT
createtime,avg_view,sample1,
sum(sample1) over(order by createtime rows between UNBOUNDED PRECEDING and CURRENT ROW ) as sample2
FROM
(select
createtime,avg_view,
sum(avg_view) over(order by createtime rows between UNBOUNDED PRECEDING and CURRENT ROW ) as sample1
from tablename
group by createtime,avg_view
) t
group by createtime,avg_view,sample1;
第一次使用得到第一行累加到当前行的统计值sample1。再次将sample1累加一次第一行到当前行即得到最终结果。如下图所示:
总结:同事使用excel完成的思路也是这样,累加两次。excel真香!