Hive窗口函數
@
官方文檔地址
Hive官網,點我就進
oracle,sqlserver都提供了窗口函數,但是在mysql5.5和5.6都沒有提供窗口函數!
窗口函數: 窗口+函數
- 窗口: 函數運行時計算的數據集的範圍
- 函數: 運行的函數!
僅僅支援以下函數:
Windowing functions
- LEAD (scalar_expression [,offset] [,default]): 返回當前行以下N行的指定列的列值!如果找不到,就採用默認值
- LAG (scalar_expression [,offset] [,default]): 返回當前行以上N行的指定列的列值!如果找不到,就採用默認值
- FIRST_VALUE(列名,[false(默認)]):返回當前窗口指定列的第一個值,第二個參數如果為true,代表加入第一個值為null,跳過空值,繼續尋找!
- LAST_VALUE(列名,[false(默認)]):返回當前窗口指定列的最後一個值,第二個參數如果為true,代表加入第一個值為null,跳過空值,繼續尋找!
統計類的函數(一般都需要結合over使用):min,max,avg,sum,count
排名分析:
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
注意:不是所有的函數在運行都是可以通過改變窗口的大小,來控制計算的數據集的範圍!所有的排名函數和LAG,LEAD,支援使用over(),但是在over()中不能定義 window_clause
格式: 函數 over( partition by 欄位 ,order by 欄位 window_clause )
窗口的大小可以通過windows_clause來指定:
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following
特殊情況:
- ①在over()中既沒有出現windows_clause,也沒有出現order by,窗口默認為rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
- ②在over()中(沒有出現windows_clause),指定了order by,窗口默認為rows between UNBOUNDED PRECEDING and CURRENT ROW
窗口函數和分組有什麼區別?
- ①如果是分組操作,select後只能寫分組後的欄位
- ②如果是窗口函數,窗口函數是在指定的窗口內,對每條記錄都執行一次函數
- ③如果是分組操作,有去重效果,而partition不去重!
練習
(9) 查詢前20%時間的訂單資訊
精確演算法:
select *
from
(select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
from business) tmp
where cdnum<=0.2
不精確計算:
select *
from
(select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum
from business) tmp
where cdnum=1
(8)查詢顧客的購買明細及顧客最近三次cost花費
最近三次: 當前和之前兩次 或 當前+前一次+後一次
當前和之前兩次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row)
from business
當前+前一次+後一次:
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business
或
select name,orderdate,cost,cost+
lag(cost,1,0) over(partition by name order by orderdate )+
lead(cost,1,0) over(partition by name order by orderdate )
from business
(7) 查詢顧客的購買明細及顧客本月最後一次購買的時間
select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING)
from business
(6) 查詢顧客的購買明細及顧客本月第一次購買的時間
select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate )
from business
(5) 查詢顧客的購買明細及顧客下次的購買時間
select name,orderdate,cost,lead(orderdate,1,'無數據') over(partition by name order by orderdate )
from business
(4)查詢顧客的購買明細及顧客上次的購買時間
select name,orderdate,cost,lag(orderdate,1,'無數據') over(partition by name order by orderdate )
from business
(3)查詢顧客的購買明細要將cost按照日期進行累加
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate )
from business
(2)查詢顧客的購買明細及月購買總額
select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) )
from business
(1)查詢在2017年4月份購買過的顧客及總人數
select name,count(*) over(rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
from business
where substring(orderdate,1,7)='2017-04'
group by name
等價於
select name,count(*) over()
from business
where substring(orderdate,1,7)='2017-04'
group by name