Hive SQL开窗函数实战

  • 2019 年 10 月 6 日
  • 筆記

Hive SQL开窗函数实战

开窗函数是数据的一种查询统计语法糖,多是用于离线统计,这同时也是大数据技术栈的应用场景。今天学习Hive SQL的开窗(窗口)函数,对比与MySQL,在MySQL8.0以前是不支持窗口函数的,而Oracle支持。

下面进入实战练习模式:

数据字段描述:

class    student score  math       a      20  

第一列为课程类型,比如math,第二列为学生的学号,假设是a、b、c、d这种,第三列为学生得分。下面对这个数据进行表构建与数据导入。

创建表及导入数据:

hive (cookie)> create database if not exists math;  OK  Time taken: 0.31 seconds  hive (cookie)> use math;  OK  Time taken: 0.01 seconds  hive (math)> drop table if exists math;  OK  Time taken: 0.007 seconds  hive (math)> create table math(class string,student string,score int) row format delimited fields terminated by ',';  OK  Time taken: 0.233 seconds  hive (math)> load data local inpath '/home/light/mysql/hive/math.txt' into table math;  Loading data to table math.math  OK  Time taken: 0.405 seconds  hive (math)> select * from math;  OK  math.class    math.student    math.score  math    a   20  math    b   70  math    c   100  math    d   30  math    e   60  math    f   50  math    g   50  Time taken: 0.17 seconds, Fetched: 7 row(s)  

执行下面查询与语句:

select     class,     student,     score,     sum(score) over (partition by class order by student rows between unbounded preceding and current row) as score1,     sum(score) over (partition by class order by student) as score2,     sum(score) over (partition by class) as score3,     sum(score) over (partition by class order by student rows between 3 preceding and current row) as score4,     sum(score) over (partition by class order by student rows between 3 preceding and 1 following) as score5,     sum(score) over (partition by class order by student rows between current row and unbounded following) as score6  from math;  

上述的代码解释:

如果不指定ROWS BETWEEN,默认为从起点到当前行;如果不指定ORDER BY,则将分组内所有值累加;

ROWS BETWEEN含义,也叫做WINDOW子句(这就是开窗!)

PRECEDING:往前;FOLLOWING:往后;CURRENT ROW:当前行

UNBOUNDED:起点,

  UNBOUNDED PRECEDING 表示从前面的起点,

  UNBOUNDED FOLLOWING:表示到后面的终点

结果:

结果描述:

score1:分组内从起点到当前行的score累积,例如c.score=a.score+b.score+c.score

score2:同score1。

score3:分组内(class)所有的score累加。

score4:分组内当前行+往前3行,如:e.score=b.score+c.score+d.score+e.score。(对应下表的加粗数字)。

class

student

score

score4

math

g

50

190

math

f

50

240

math

e

60

260

math

d

30

220

math

c

100

190

math

b

70

90

math

a

20

20

score5:分组内当前行+往前3行+往后1行,如:还是求上述e.score,那么就是在前面基础上加上后一行分数。(对应下表的加粗数字)。

class

student

score

score5

math

g

50

190

math

f

50

290

math

e

60

310

math

d

30

280

math

c

100

220

math

b

70

190

math

a

20

90

score6:分组内当前行+往后所有行,如:还是求上述e.score(对应下表的加粗数字)。

class

student

score

score6

math

g

50

50

math

f

50

100

math

e

60

160

math

d

30

190

math

c

100

290

math

b

70

360

math

a

20

380

其余的开窗函数,比如:min、avg、max与此类似。就不在赘述。