PostgreSQL 的窗口函數 OVER, WINDOW, PARTITION BY, RANGE

最近在數據處理中用到了窗函數, 把使用方法記錄一下, 暫時只有分組排序和滑動時間窗口的例子, 以後再逐步添加

場景

在SQL查詢時, 會遇到有兩類需要分組統計的場景, 在之前的SQL語法中是不方便實現的

  1. 場景1: 顧客維修設備的記錄表, 每次維修產生一條記錄, 每個記錄包含時間, 顧客ID和維修金額, 要取出每個顧客的維修次數和最後一次維修時的金額
  2. 場景2: 還是上面的維修記錄表, 要取出每個顧客的每次維修之間的時間間隔
  3. 場景3: 一個用戶賬戶的交易流水表, 要求每個小時的交易筆數和平均收支金額, 這個平均數的統計範圍是兩個小時(整點時間的前後一個小時)

使用窗函數直接SQL中使用窗函數就能解決這些問題, 否則需要使用臨時表, 函數或存儲過程進行處理.

窗函數

PostgreSQL 從2010年的版本8開始就支持窗函數了.

文檔

詳細說明建議查看官方文檔 //www.postgresql.org/docs/current/tutorial-window.html

函數說明

窗函數(window function)的計算方式與傳統的單行和聚合不同

  1. 窗函數是在當前表中, 基於當前行的相關行的計算, 注意是基於多行的計算
  2. 屬於一種聚合計算, 可以使用聚合類型的函數(aggregate function)
  3. 使用窗函數並不會導致結果的聚合, 也就是結果依然是當前的行結構

所以綜合的說, 窗口函數就是在行的基礎上, 允許對多行數據進行計算. 下面是一個簡單的窗函數例子, 將每個員工的薪資與其所在的部門的平均薪資進行比較

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;

關鍵詞

使用窗函數時會用到的一些關鍵詞

  • OVER 前面的查詢基於後面的窗口
  • PARTITION BY 類似於 GROUP BY 的語義, 專用於窗口的分組
  • ORDER BY 窗內的排序依據, 依據的字段決定了 RANGE 的類型
  • RANGE … PRECEDING 在當前值之前的範圍, 基準是當前記錄這個 ORDER BY 字段的值
  • RANGE … FOLLOWING 在當前值之後的範圍, 基準是當前記錄這個 ORDER BY 字段的值
  • RANGE BETWEEN … PRECEDING AND … FOLLOWING 前後範圍的組合
  • WINDOW 將窗口命名為變量, 可以在 SELECT 中重複使用

示例

按窗口打序號

功能: 將數據按指定的字段分組, 再按另一個字段排列, 給每個分組裡的數據打上序號.

這是一個常用技巧, 例如要計算各組內記錄之間的時間間隔, 但是用時間不方便join, 打完序號後就可以用序號join了

SELECT
    ROW_NUMBER() OVER w1 AS rn,
    sample_01.*
FROM
    sample_01
WINDOW 
w1 AS (PARTITION BY field_name ORDER BY created_at ASC);

簡單時間窗口統計

功能: 將數據表按指定字段(日期類型)進行排序, 然後基於每個記錄的這個字段創建一個固定寬度的時間窗口, 對窗口內的多個記錄進行統計

統計單個字段, 可以直接寫在select中

SELECT
    MAX(amount) OVER (ORDER BY traded_at RANGE '30 minutes' PRECEDING) AS amount_max,
    *
FROM sample_01
WHERE card_num = '6210812500006111111'

基於時間窗口變量進行多字段統計

功能: 和前一個功能一樣, 但是要進行多個不同的統計, 要重複用到這個窗口函數

如果要統計多個字段, 可以抽出單獨的WINDOW

SELECT
    MAX(rn) OVER w1 AS rn_max,
    MAX(amount) OVER w1 AS amount_max,
    AVG(amount) OVER w1 AS amount_avg,
    *
FROM sample_01_diff
WINDOW
    -- w1 AS (ORDER BY traded_at RANGE '30 minutes' PRECEDING)
    w1 AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING)
ORDER BY
    rn ASC

在這個例子中

  1. 先依據 card_num 這個字段進行分區,
  2. 然後按 traded_at 這個字段進行排序,
  3. 對每個記錄的 traded_at 值, 開啟一個 RANGE, 包含前面的30分鐘和後面的30分鐘, RANGE 中能用的類型和 ORDER BY 的字段類型是相關的
  4. SELECT中的 MAX, MIN 等聚合函數, 是基於上面的 RANGE 進行的

In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column. For numeric ordering columns it is typically of the same type as the ordering column, but for datetime ordering columns it is an interval. For example, if the ordering column is of type date or timestamp, one could write RANGE BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING. The offset is still required to be non-null and non-negative, though the meaning of 「non-negative」 depends on its data type.

多個窗口多個字段同時統計

功能: 在前面的功能基礎上, 同時存在多個時間窗口

SELECT
    -- 1 hour
    SUM(amount_in) OVER w1h AS h1_amount_in_sum,
    SUM(
        CASE
            WHEN amount_in = 0 THEN 0
            ELSE 1
        END
    ) OVER w1h AS h1_amount_in_count,
    SUM(amount_out) OVER w1h AS h1_amount_out_sum,
    SUM(
        CASE
            WHEN amount_out = 0 THEN 0
            ELSE 1
        END
    ) OVER w1h AS h1_amount_out_count,
    SUM(amount) OVER w1h AS h1_amount_sum,
    COUNT(amount) OVER w1h AS h1_amount_count,
    ROUND(AVG(amount) OVER w1h, 2) AS h1_amount_avg,
    FIRST_VALUE(amount) OVER w1h AS h1_amount_first,
    LAST_VALUE(amount) OVER w1h AS h1_amount_last,
    MAX(amount) OVER w1h AS h1_amount_max,
    MIN(amount) OVER w1h AS h1_amount_min,
    -- 3 hour
    SUM(amount_in) OVER w3h AS h3_amount_in_sum,
    SUM(
        CASE
            WHEN amount_in = 0 THEN 0
            ELSE 1
        END
    ) OVER w3h AS h3_amount_in_count,
    SUM(amount_out) OVER w3h AS h3_amount_out_sum,
    SUM(
        CASE
            WHEN amount_out = 0 THEN 0
            ELSE 1
        END
    ) OVER w3h AS h3_amount_out_count,
    SUM(amount) OVER w3h AS h3_amount_sum,
    COUNT(amount) OVER w3h AS h3_amount_count,
    ROUND(AVG(amount) OVER w3h, 2) AS h3_amount_avg,
    FIRST_VALUE(amount) OVER w3h AS h3_amount_first,
    LAST_VALUE(amount) OVER w3h AS h3_amount_last,
    MAX(amount) OVER w3h AS h3_amount_max,
    MIN(amount) OVER w3h AS h3_amount_min,
    *
FROM sample_01
WINDOW
    w1h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING),
    w3h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '90 minutes' PRECEDING AND '90 minutes' FOLLOWING)
;

參考