常用Hive函數的學習和總結

2019 年 12 月 17 日
筆記

今天來小結一下工作中經常會使用到的一些Hive函數。關於Hive函數的總結，網上早有十分全面的版本。參考：https://blog.csdn.net/doveyoung8/article/details/80014442。本文主要從最常用和實用的角度出發，說明幾個使用頻率較高的函數，更注重使用函數組合來解決實際問題而不局限於單個函數的使用。所有數據都是虛構，代碼均在本地的Hive環境上都通過測試。本文代碼較多，需要各位看官耐心學習，可以收藏備查，歡迎補充和討論。由於公眾號對代碼的支持不太友好，您可以在後台回復「hive函數」獲取本文的PDF版本，方便閱讀。

1.json字符串處理：get_json_object，lateral view，explode，substr，json_tuple

先簡要說明下幾個函數的用法：

語法: get_json_object(string json_string, string path) 返回值: string 說明：解析json的字符串json_string，返回path指定的內容。如果輸入的json字符串無效，那麼返回NULL。語法: explode(ARRAY)，經常和lateral view一起使用 返回值: 多行說明: 將數組中的元素拆分成多行顯示語法: substr(string A, int start, int len),substring(string A, int start, int len) 返回值: string 說明：返回字符串 A 從 start 位置開始，長度為 len 的字符串語法: json_tuple(string json_string, col1, col2, …) ，經常和lateral view一起使用 返回值: string 說明：同時解析多個json字符串中的多個字段

然後我們看實例：

--我們虛構的數據，jsondata.txt  1    {"store":{"fruit":[{"weight":8,"type":"apple"}, {"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"}}, "email":"amy@only_for_json_udf_test.net", "owner":"amy"}  2    {"store":{"fruit":[{"weight":8.1,"type":"apple"}, {"weight":9.2,"type":"pear"}], "bicycle":{"price":20.01,"color":"blue"}}, "email":"[email protected]", "owner":"bob"}

hive> create table json_data(id int, data string) row format delimited fields terminated by 't';  hive> load data local inpath 'jsondata.txt' into table json_data;

查詢單層值

hive> select id, get_json_object(data, '$.owner') from json_data;  1    amy  2    bob

查詢多層值1

#注意bicycle子串的格式同樣是json格式  hive> select id, get_json_object(data, '$.store.bicycle.price') from json_data;  1    19.95  2    20.01

查詢多層值2

#注意fruit子串的的格式是數組(帶有方括號)，不是標準的json格式，下面語句取出fruit的值  hive> select id, get_json_object(data, '$.store.fruit') from json_data;  1    [{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}]  2    [{"weight":8.1,"type":"apple"},{"weight":9.2,"type":"pear"}]      #可以使用索引訪問數據里的值，如得到第一個fruit的數據，結果是一個json格式的  hive> select id, get_json_object(data, '$.store.fruit[0]') from json_data;  1    {"weight":8,"type":"apple"}  2    {"weight":8.1,"type":"apple"}    #在上面語句的基礎上，可以獲得weight和type值。  hive> select id, get_json_object(data, '$.store.fruit[0].weight') from json_data;  1    8  2    8.1  hive> select id, get_json_object(data, '$.store.fruit[1].weight') from json_data;  1    9  2    9.2

查詢多層值3 如何同時獲得weight和type值，有下面兩種方式，效果一致。
方法1：和上一節一樣，用數組方式

hive> select id, get_json_object(data, '$.store.fruit[0].weight'), get_json_object(data, '$.store.fruit[0].type')      > from json_data;    1    8   apple    2    8.1 apple

方法2：用substr將數組的方括號截掉，轉換為json

hive> select id,      > get_json_object(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '$.weight'),      > get_json_object(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '$.type')      > from json_data;  1    8   apple  2    8.1 apple

查詢多層值4 按照上面的兩種方式，我們取到了fruit數組中第一個數據。但第二個數據，只能得到下面的效果（你可以試試看）： 1 8 apple 9 pear 2 8.1 apple 9.2 pear 能不能將相同的數排在一列，做出下面這樣的效果，答案是肯定的。 1 8 apple 1 9 pear 2 8.1 apple 2 9.2 pear 思路是：取到fruit之後，substr截掉前後的方括號，使用split按照'},'對其分割，之後用explode行轉列，再補全成完整的json，然後按照處理json的方式取數。步驟比較複雜，我們分三步看。

  #步驟1：截掉方括號，並用'},'分割，注意此時一三行不是完整的json，缺了右括號    hive> select id, fruit        > from json_data        > lateral view explode(split(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '},')) t as fruit;    1    {"weight":8,"type":"apple"    1    {"weight":9,"type":"pear"}    2    {"weight":8.1,"type":"apple"    2    {"weight":9.2,"type":"pear"}      #步驟2：case when 補全json，可以看到一三行結果是json格式了    hive> select a.id, case when substr(a.fruit, length(fruit), 1) = "}" then a.fruit else concat(a.fruit, '}') end as fruit_info        > from        > (        >     select id, fruit        >     from json_data        >     lateral view explode(split(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '},')) t as fruit        > ) a;    1    {"weight":8,"type":"apple"}    1    {"weight":9,"type":"pear"}    2    {"weight":8.1,"type":"apple"}    2    {"weight":9.2,"type":"pear"}      #步驟3：提取weight，type數據    hive> select b.id, get_json_object(b.fruit_info, '$.weight'), get_json_object(b.fruit_info, '$.type')        > from        > (        >     select a.id, case when substr(a.fruit, length(fruit), 1) = "}" then a.fruit else concat(a.fruit, '}') end as fruit_info        >     from        >     (        >         select id, fruit        >         from json_data        >         lateral view explode(split(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '},')) t as fruit        >     ) a        > ) b;    1    8   apple    1    9   pear    2    8.1 apple    2    9.2 pear

上面的步驟3，由於要獲取兩個字段，可以使用json_tuple函數代替，與later view連用，寫法如下：

  hive> select b.id, c.weight, c.type        > from        > (        >     select a.id, case when substr(a.fruit, length(fruit), 1) = "}" then a.fruit else concat(a.fruit, '}') end as fruit_info        >     from        >     (        >         select id, fruit        >         from json_data        >         lateral view explode(split(substr(get_json_object(data, '$.store.fruit'), 2, length(get_json_object(data, '$.store.fruit')) - 2), '},')) t as fruit        >     ) a        > ) b        > lateral view json_tuple(b.fruit_info,'weight', 'type') c as weight, type;    1    8   apple    1    9   pear    2    8.1 apple    2    9.2 pear

2.parse_url，regexp_replace，regexp_extract

語法: parse_url(string urlString, string partToExtract , string keyToExtract) 返回值: string 說明：返回 URL 中指定的部分。 partToExtract 的有效值為： HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. 語法: regexp_replace(string A, string B, string C) 返回值: string 說明：將字符串 A 中的符合正則表達式 B 的部分替換為 C。語法: regexp_extract(string subject, string pattern, int index) 返回值: string 說明：將字符串 subject 按照 pattern 正則表達式的規則拆分，返回 index 指定的字符。

下面看實例：

--我們虛構的數據，urldata.txt  1    https://ty.facebook.com/dwd/social?type=1&query=abc&id=1234  2    http://qq.tencent.com/dwd/category?type=2&query=def&id=5678#title1

hive> create table url_data(id int, data string) row format delimited fields terminated by 't';  hive> load data local inpath 'urldata.txt' into table url_data;

#獲取url協議  hive> select id, parse_url(data, 'PROTOCOL') from url_data;  1    https  2    http    #獲取主機名  hive> select id, parse_url(data, 'HOST') from url_data;  1    ty.facebook.com  2    qq.tencent.com    #獲取path  hive> select id, parse_url(data, 'PATH') from url_data;  1    /dwd/social  2    /dwd/category    #獲取所有參數的序列  hive> select id, parse_url(data, 'QUERY') from url_data;  1    type=1&query=abc&id=1234  2    type=2&query=def&id=5678    #獲取完整文件路徑  hive> select id, parse_url(data, 'FILE') from url_data;  1    /dwd/social?type=1&query=abc&id=1234  2    /dwd/category?type=2&query=def&id=5678    #獲取REF，沒有的返回NULL值  hive> select id, parse_url(data, 'REF') from url_data;  1    NULL  2    title1

插曲：獲取的參數序列是鍵值對的形式，能否將其拆分開呢？可以使用str_to_map函數.

語法: str_to_map(text, delimiter1, delimiter2) 返回值: map 說明：將字符串按照給定的分隔符轉換成 map 結構。第一個分隔符在K-V之間分割，第二個分隔符分割K-V本身

hive> select id, parse_url(data, 'PROTOCOL'), parse_url(data, 'HOST'), parse_url(data, 'PATH'), str_to_map(parse_url(data, 'QUERY'), '&', '=')['type'],      > str_to_map(parse_url(data, 'QUERY'), '&', '=')['query'], str_to_map(parse_url(data, 'QUERY'), '&', '=')['id']      >  from url_data;  1    https   ty.facebook.com /dwd/social     1   abc 1234  2    http    qq.tencent.com  /dwd/category   2   def 5678

如果不使用parse_url，能否對相應的url子串進行截取，可以藉助於regexp_extract，regexp_replace，不過可能調正則表達式需要用點功夫。

hive> select id, regexp_replace(data, 'dwd.+', "")      > from url_data;  1    https://ty.facebook.com/  2    http://qq.tencent.com/    hive> select id, regexp_extract(data, 'query=(.*)', 1)      > from url_data;  1    abc&id=1234  2    def&id=5678#title1    hive> select id, regexp_extract(data, 'query=(.*)&', 1)      > from url_data;  1    abc  2    def

3.collect_set，collect_list，concat，concat_ws

語法: collect_set (col) 返回值: array 說明: 將 col 字段進行去重，合併成一個數組。語法: collect_list (col) 返回值: array 說明: 將 col 字段合併成一個數組,不去重語法: concat(string A, string B…) 返回值: string 說明：返回輸入字符串連接後的結果，支持任意個輸入字符串語法: concat_ws(string SEP, string A, string B…) 返回值: string 說明：返回輸入字符串連接後的結果， SEP 表示各個字符串間的分隔符

--虛構的數據，fruitdata.txt  1001    apple  1001    pear  1001    banana  1001    pear  1002    blueberry  1002    bayberry

hive> create table fruit_data(id int, data string) row format delimited fields terminated by 't';  hive> load data local inpath 'fruitdata.txt' into table fruit_data;    hive> select id, collect_set(data)      > from fruit_data      > group by id;  1001    ["apple","pear","banana"]  1002    ["blueberry","bayberry"]    hive> select id, collect_list(data)      > from fruit_data      > group by id;  1001    ["apple","pear","banana","pear"]  1002    ["blueberry","bayberry"]

--虛構的數據，userdata.txt。想想一個用戶的粉絲在各個地域的分佈情況  1001    area1   5%  1001    area2   20%  1001    area3   25%  1001    area4   50%  2001    area1   20%  2001    area2   50%  2001    area3   30%

hive> create table user_data(id int, area string, data string) row format delimited fields terminated by 't';  hive> load data local inpath 'userdata.txt' into table user_data;    #按照每個用戶一行進行排列  hive> select id, collect_set(concat_ws(':', area, data))      > from user_data      > group by id;  1001    ["area1:5%","area2:20%","area3:25%","area4:50%"]  2001    ["area1:20%","area2:50%","area3:30%"]    #下面使用concat能得到同樣效果  hive> select id, collect_set(concat(area, ':', data))      > from user_data      > group by id;  1001    ["area1:5%","area2:20%","area3:25%","area4:50%"]  2001    ["area1:20%","area2:50%","area3:30%"]    #我們可以看到結果中，collect_set函數為我們加上了中括號和雙引號，能不能去掉它們，我們來看下面的效果：  hive> select id, concat_ws(',', collect_set(concat( area, ':', data)))      > from user_data      > group by id;  1001    area1:5%,area2:20%,area3:25%,area4:50%  2001    area1:20%,area2:50%,area3:30%    #如果想變成map的格式，在此基礎上可以再調用一下str_to_map即可  hive> select id, str_to_map(concat_ws(',', collect_set(concat( area, ':', data))), ",", ":")      > from user_data      > group by id;  1001    {"area1":"5%","area2":"20%","area3":"25%","area4":"50%"}  2001    {"area1":"20%","area2":"50%","area3":"30%"}

4.datediff，from_unixtime，unix_timestamp，to_date

語法: datediff(string enddate, string startdate) 返回值: int 說明: 返回結束日期減去開始日期的天數。日期的格式需要是yyyy-MM-dd，或者yyyy-MM-dd HH:mm:ss 語法: from_unixtime(bigint unixtime[, string format]) 返回值: string 說明: 轉化 UNIX 時間戳（從 1970-01-01 00:00:00 UTC 到指定時間的秒數）到當前時區的時間格式，默認的format是yyyy-MM-dd HH:mm:ss，可以指定別的語法: unix_timestamp(string date[, string format]) 返回值: bigint 說明: 轉換 pattern 格式的日期到 UNIX 時間戳。如果轉化失敗，則返回 0。默認的format是yyyy-MM-dd HH:mm:ss，可以指定別的。語法: to_date(string timestamp) 返回值: string 說明: 返回日期時間字段中的日期部分。

下面看實例：

#虛構的數據datedata.txt，一共有10列，後9列是各種日期。  1    2019-02-03  20190203    2019-02-05  20190205    2019-03-03 10:38:24 2019-03-23 10:36:54 20190323 10:36:54   1551763940  1551763940267  2    2019-02-08  20190208    2019-02-18  20190218    2019-03-19 10:32:04 2019-03-31 10:39:15 20190331 10:39:15   1552632321  1551763940654

hive> create table date_data(id int, d1 string, d2 string, d3 string, d4 string, d5 string, d6 string, d7 string, d8 string, d9 string) row format delimited fields terminated by 't';  hive> load data local inpath 'datedata.txt' into table date_data;

先看datediff的用法：

#yyyy-MM-dd的日期差  hive> select id, datediff(d3, d1)from date_data;  1    2  2    10  hive> select id, datediff(to_date(d3), to_date(d1)) from date_data;  1    2  2    10    #yyyyMMdd的日期差  hive> select id, datediff(d4, d2) from date_data;  1    NULL  2    NULL  #上面的寫法不行，我們需要將日期轉換為yyyy-MM-dd格式，使用截取拼接的套路進行  hive> select datediff(concat_ws('-', substr(d4, 1, 4), substr(d4, 5, 2), substr(d4, 7, 2)), concat_ws('-', substr(d2, 1, 4), substr(d2, 5, 2), substr(d2, 7, 2)))      > from date_data;  2  10    #yyyy-MM-dd HH:mm:ss與yyyy-MM-dd的日期差  hive> select datediff(d5, d1) from date_data;  28  39  hive> select datediff(to_date(d5), d1) from date_data;  28  39    #yyyy-MM-dd HH:mm:ss與yyyy-MM-dd HH:mm:ss的日期差  hive> select datediff(d6, d5) from date_data;  20  12  hive> select datediff(to_date(d6), d5) from date_data;  20  12

再來看unix_timestamp的用法：

#yyyy-MM-dd HH:mm:ss轉換為時間戳  hive> select unix_timestamp(d5) from date_data;  1551580704  1552962724  hive> select unix_timestamp(d5, 'yyyy-MM-dd HH:mm:ss') from date_data;  1551580704  1552962724    #yyyyMMdd HH:mm:ss轉換為時間戳  hive> select unix_timestamp(d7, 'yyyyMMdd HH:mm:ss') from date_data;  1553308614  1553999955

最後看from_unixtime的用法：

#由於我們的表是string格式的，在轉換之前需要轉為bigint型  hive> select from_unixtime(cast(d8 as bigint)) from date_data;  2019-03-05 13:32:20  2019-03-15 14:45:21    hive> select from_unixtime(cast(d8 as bigint), 'yyyy-MM-dd HH:mm:ss') from date_data;  2019-03-05 13:32:20  2019-03-15 14:45:21    hive> select from_unixtime(cast(d8 as bigint), 'yyyyMMdd HH:mm:ss') from date_data;  20190305 13:32:20  20190315 14:45:21    hive> select from_unixtime(cast(d8 as bigint), 'yyyy-MM-dd') from date_data;  2019-03-05  2019-03-15    hive> select from_unixtime(cast(d8 as bigint), 'yyyyMMdd') from date_data;  20190305  20190315

我們經常會在業務中遇到13位的時間戳，10位的時間戳是精確到秒的，13位則是精確到毫秒的。這時只需除以1000並轉化為整數即可。

hive> select from_unixtime(cast(d9/1000 as bigint)) from date_data;  2019-03-05 13:32:20  2019-03-05 13:32:20    hive> select from_unixtime(cast(d9/1000 as bigint), 'yyyy-MM-dd HH:mm:ss') from date_data;  2019-03-05 13:32:20  2019-03-05 13:32:20    hive> select from_unixtime(cast(d9/1000 as bigint), 'yyyyMMdd HH:mm:ss') from date_data;  20190305 13:32:20  20190305 13:32:20    hive> select from_unixtime(cast(d9/1000 as bigint), 'yyyy-MM-dd') from date_data;  2019-03-05  2019-03-05    hive> select from_unixtime(cast(d9/1000 as bigint), 'yyyyMMdd') from date_data;  20190305  20190305

5.coalesce

語法: COALESCE(T v1, T v2, …) 返回值: T 說明: 返回參數中的第一個非空值；如果所有值都為 NULL，那麼返回 NULL

1    https://ty.facebook.com/dwd/social?type=1&query=abc&id=1234&task_id=1111  2    https://ty.facebook.com/dwd/social?type=1&query=abc&id=1234&taskid=2222

hive> create table exp_data(id int, data string) row format delimited fields terminated by 't';  hive> load data local inpath 'expdata.txt' into table exp_data;

如果我們想提取出1111和2222這兩個值，但一個是task_id，一個是taskid。如果直接用str_to_map，直接寫的話，結果總會有一個空值：

hive> select str_to_map(data, '&', '=')['taskid'], str_to_map(data, '&', '=')['task_id']      > from exp_data;  NULL    1111  2222    NULL

這個時候就可以用到coalesce

hive> select coalesce(str_to_map(data, '&', '=')['taskid'], str_to_map(data, '&', '=')['task_id'], "")      > from exp_data;  1111  2222

總結

以上我們總結了一些比較常用的hive函數，重點在於學習這些函數組合使用的情況，經常出現在工作中需要進行數據清洗或者格式轉化的時候。例子都不是很複雜，有條件大家可以嘗試實踐一下，希望對你有用，如果感覺公眾號代碼閱讀起來費勁，可以在後台回復「hive函數」獲取本文的pdf版本。當然也有很多函數都沒有涉及到，比如工作中也經常會用到窗口函數，後續如果有機會我們可以專門進行學習~

常用Hive函數的學習和總結

總結

VirMach 便宜 VPS

QNews

常用Hive函數的學習和總結

總結

分享此文：

Related Posts

巧用 MyBatis 構建樹形結構

結合自己經歷聊聊注重實效的程序員應該掌握的幾個原則

你的FPGA設計有這些缺陷嗎

手把手教你畫雙基因生存曲線

VirMach 便宜 VPS

QNews

熱門搜尋