你需要懂一點ClickHouse的基礎知識

2019 年 11 月 14 日
筆記

應用場景：

1.絕大多數請求都是用於讀訪問的 2.數據需要以大批次（大於1000行）進行更新，而不是單行更新；或者根本沒有更新操作

3.數據只是添加到資料庫，沒有必要修改 4.讀取數據時，會從資料庫中提取出大量的行，但只用到一小部分列

5.表很「寬」，即表中包含大量的列 6.查詢頻率相對較低（通常每台伺服器每秒查詢數百次或更少）

7.對於簡單查詢，允許大約50毫秒的延遲 8.列的值是比較小的數值和短字元串（例如，每個URL只有60個位元組） 9.在處理單個查詢時需要高吞吐量（每台伺服器每秒高達數十億行） 10.不需要事務

11.數據一致性要求較低 12.每次查詢中只會查詢一個大表。除了一個大表，其餘都是小表 13.查詢結果顯著小於數據源。即數據有過濾或聚合。返回結果不超過單個伺服器記憶體大小

相應地，使用ClickHouse也有其本身的限制：

1.不支援真正的刪除/更新支援不支援事務（期待後續版本支援） 2.不支援二級索引 3.有限的SQL支援，join實現與眾不同 4.不支援窗口功能 5.元數據管理需要人工干預維護

常用SQL語法

-- 列出資料庫列表  show databases;    -- 列出資料庫中表列表  show tables;    -- 創建資料庫  create database test;    -- 刪除一個表  drop table if exists test.t1;    -- 創建第一個表  create /*temporary*/ table /*if not exists*/ test.m1 (   id UInt16  ,name String  ) ENGINE = Memory  ;  -- 插入測試數據  insert into test.m1 (id, name) values (1, 'abc'), (2, 'bbbb');    -- 查詢  select * from test.m1;

默認值

默認值的處理方面， ClickHouse 中，默認值總是有的，如果沒有顯示式指定的話，會按欄位類型處理：

數字類型， 0 字元串，空字元串數組，空數組日期， 0000-00-00 時間， 0000-00-00 00:00:00 註：NULLs 是不支援的

數據類型

1.整型：UInt8,UInt16,UInt32,UInt64,Int8,Int16,Int32,Int64 範圍U開頭-2N/2~2N-1;非U開頭0～2^N-1 2.枚舉類型：Enum8,Enum16 Enum('hello'=1,'test'=-1),Enum是有符號的整型映射的，因此負數也是可以的 3.字元串型：FixedString(N),String N是最大位元組數，不是字元長度，如果是UTF8字元串，那麼就會佔3個位元組，GBK會佔2位元組;String可以用來替換VARCHAR,BLOB,CLOB等數據類型 4.時間類型：Date 5.數組類型：Array(T) T是一個基本類型，包括arry在內，官方不建議使用多維數組 6.元組：Tuple 7.結構：Nested(name1 Type1,name2 Type2,…) 類似一種map的結

物化列

指定 MATERIALIZED 表達式，即將一個列作為物化列處理了，這意味著這個列的值不能從insert 語句獲取，只能是自己計算出來的。同時，物化列也不會出現在 select * 的結果中：

drop table if exists test.m2;  create table test.m2 (   a MATERIALIZED (b+1)  ,b UInt16  ) ENGINE = Memory;  insert into test.m2 (b) values (1);  select * from test.m2;  select a, b from test.m2;

表達式列

ALIAS 表達式列某方面跟物化列相同，就是它的值不能從 insert 語句獲取。不同的是，物化列是會真正保存數據（這樣查詢時不需要再計算），而表達式列不會保存數據（這樣查詢時總是需要計算），只是在查詢時返回表達式的結果。

create table test.m3 (a ALIAS (b+1), b UInt16) ENGINE = Memory;  insert into test.m3(b) values (1);  select * from test.m3;  select a, b from test.m3;

引擎/engine

引擎是clickhouse設計的精華部分

TinyLog

最簡單的一種引擎，每一列保存為一個文件，裡面的內容是壓縮過的，不支援索引這種引擎沒有並發控制，所以，當你需要在讀，又在寫時，讀會出錯。並發寫，內容都會壞掉。

應用場景: a. 基本上就是那種只寫一次 b. 然後就是只讀的場景。 c. 不適用於處理量大的數據，官方推薦，使用這種引擎的表最多 100 萬行的數據

drop table if exists test.tinylog;  create table test.tinylog (a UInt16, b UInt16) ENGINE = TinyLog;  insert into test.tinylog(a,b) values (7,13);

此時/var/lib/clickhouse/data/test/tinylog保存數據的目錄結構：

├── a.bin  ├── b.bin  └── sizes.json

a.bin 和 b.bin 是壓縮過的對應的列的數據， sizes.json 中記錄了每個 *.bin 文件的大小

Log

這種引擎跟 TinyLog 基本一致它的改進點，是加了一個 __marks.mrk 文件，裡面記錄了每個數據塊的偏移這樣做的一個用處，就是可以準確地切分讀的範圍，從而使用並發讀取成為可能但是，它是不能支援並發寫的，一個寫操作會阻塞其它讀寫操作 Log 不支援索引，同時因為有一個 __marks.mrk 的冗餘數據，所以在寫入數據時，一旦出現問題，這個表就廢了

應用場景: 同 TinyLog 差不多，它適用的場景也是那種寫一次之後，後面就是只讀的場景，臨時數據用它保存也可以。

drop table if exists test.log;  create table test.log (a UInt16, b UInt16) ENGINE = Log;  insert into test.log(a,b) values (7,13);

此時/var/lib/clickhouse/data/test/log保存數據的目錄結構：

├── __marks.mrk  ├── a.bin  ├── b.bin  └── sizes.json

Memory

記憶體引擎，數據以未壓縮的原始形式直接保存在記憶體當中，伺服器重啟數據就會消失可以並行讀，讀寫互斥鎖的時間也非常短不支援索引，簡單查詢下有非常非常高的性能表現

應用場景: a. 進行測試 b. 在需要非常高的性能，同時數據量又不太大（上限大概 1 億行）的場景

Merge

一個工具引擎，本身不保存數據，只用於把指定庫中的指定多個錶鏈在一起。這樣，讀取操作可以並發執行，同時也可以利用原表的索引，但是，此引擎不支援寫操作指定引擎的同時，需要指定要鏈接的庫及表，庫名可以使用一個表達式，表名可以使用正則表達式指定

create table test.tinylog1 (id UInt16, name String) ENGINE=TinyLog;  create table test.tinylog2 (id UInt16, name String) ENGINE=TinyLog;  create table test.tinylog3 (id UInt16, name String) ENGINE=TinyLog;    insert into test.tinylog1(id, name) values (1, 'tinylog1');  insert into test.tinylog2(id, name) values (2, 'tinylog2');  insert into test.tinylog3(id, name) values (3, 'tinylog3');    use test;  create table test.merge (id UInt16, name String) ENGINE=Merge(currentDatabase(), '^tinylog[0-9]+');  select _table,* from test.merge order by id desc

┌─_table───┬─id─┬─name─────┐ │ tinylog3 │ 3 │ tinylog3 │ │ tinylog2 │ 2 │ tinylog2 │ │ tinylog1 │ 1 │ tinylog1 │ └──────────┴────┴──────────┘

註：_table 這個列，是因為使用了 Merge 多出來的一個的一個虛擬列

a. 它表示原始數據的來源表，它不會出現在 show table 的結果當中 b. select * 不會包含它

Distributed

與 Merge 類似， Distributed 也是通過一個邏輯表，去訪問各個物理表，設置引擎時的樣子是：

Distributed(remote_group, database, table [, sharding_key])

其中：

remote_group /etc/clickhouse-server/config.xml中remote_servers參數 database 是各伺服器中的庫名 table 是表名 sharding_key 是一個定址表達式，可以是一個列名，也可以是像 rand() 之類的函數調用，它與 remote_servers 中的 weight 共同作用，決定在寫時往哪個 shard 寫

配置文件中的 remote_servers

<remote_servers>     <log>         <shard>             <weight>1</weight>             <internal_replication>false</internal_replication>             <replica>                 <host>172.17.0.3</host>                 <port>9000</port>             </replica>         </shard>         <shard>             <weight>2</weight>             <internal_replication>false</internal_replication>             <replica>                 <host>172.17.0.4</host>                 <port>9000</port>             </replica>         </shard>     </log>  </remote_servers>

log 是某個 shard 組的名字，就是上面的 remote_group 的值 shard 是固定標籤 weight 是權重，前面說的 sharding_key 與這個有關。簡單來說，上面的配置，理論上來看: 第一個 shard 「被選中」的概率是 1 / (1 + 2) ，第二個是 2 / (1 + 2) ，這很容易理解。但是， sharding_key 的工作情況，是按實際數字的「命中區間」算的，即第一個的區間是 [0, 1) 的周期，第二個區間是 [1, 1+2) 的周期。比如把 sharding_key 設置成 id ，當 id=0 或 id=3 時，一定是寫入到第一個 shard 中，如果把 sharding_key 設置成 rand() ，那系統會對應地自己作一般化轉換吧，這種時候就是一種概率場景了。 internal_replication 是定義針對多個 replica 時的寫入行為的。如果為 false ，則會往所有的 replica 中寫入數據，但是並不保證數據寫入的一致性，所以這種情況時間一長，各 replica 的數據很可能出現差異。如果為 true ，則只會往第一個可寫的 replica 中寫入數據（剩下的事「物理表」自己處理）。 replica 就是定義各個冗餘副本的，選項有 host ， port ， user ， password 等

看一個實際的例子，我們先在兩台機器上創建好物理表並插入一些測試數據：

create table test.tinylog_d1(id UInt16, name String) ENGINE=TinyLog;  insert into test.tinylog_d1(id, name) values (1, 'Distributed record 1');  insert into test.tinylog_d1(id, name) values (2, 'Distributed record 2');

在其中一台創建邏輯表：

create table test.tinylog_d (id UInt16, name String) ENGINE=Distributed(log, test,tinylog_d1 , id);    -- 插入數據到邏輯表，觀察數據分發情況  insert into test.tinylog_d(id, name) values (0, 'main');  insert into test.tinylog_d(id, name) values (1, 'main');  insert into test.tinylog_d(id, name) values (2, 'main');    select name,sum(id),count(id) from test.tinylog_d group by name;

註：邏輯表中的寫入操作是非同步的，會先快取在本機的文件系統上，並且，對於物理表的不可訪問狀態，並沒有嚴格控制，所以寫入失敗丟數據的情況是可能發生的

Null

空引擎，寫入的任何數據都會被忽略，讀取的結果一定是空。

但是注意，雖然數據本身不會被存儲，但是結構上的和數據格式上的約束還是跟普通表一樣是存在的，同時，你也可以在這個引擎上創建視圖

Buffer

Buffer 引擎，像是Memory 存儲的一個上層應用似的（磁碟上也是沒有相應目錄的）
它的行為是一個緩衝區，寫入的數據先被放在緩衝區，達到一個閾值後，這些數據會自動被寫到指定的另一個表中
和Memory 一樣，有很多的限制，比如沒有索引
Buffer 是接在其它表前面的一層，對它的讀操作，也會自動應用到後面表，但是因為前面說到的限制的原因，一般我們讀數據，就直接從源表讀就好了，緩衝區的這點數據延遲，只要配置得當，影響不大的
Buffer 後面也可以不接任何錶，這樣的話，當數據達到閾值，就會被丟棄掉

一些特點：

如果一次寫入的數據太大或太多，超過了 max 條件，則會直接寫入源表。
刪源表或改源表的時候，建議 Buffer 表刪了重建。
「友好重啟」時， Buffer 數據會先落到源表，「暴力重啟」， Buffer 表中的數據會丟失。
即使使用了 Buffer ，多次的小數據寫入，對比一次大數據寫入，也慢得多（幾千行與百萬行的差距）

-- 創建源表  create table test.mergetree (sdt  Date, id UInt16, name String, point UInt16) ENGINE=MergeTree(sdt, (id, name), 10);  -- 創建 Buffer表  -- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)  create table test.mergetree_buffer as test.mergetree ENGINE=Buffer(test, mergetree, 16, 3, 20, 2, 10, 1, 10000);    insert into test.mergetree (sdt, id, name, point) values ('2017-07-10', 1, 'a', 20);  insert into test.mergetree_buffer (sdt, id, name, point) values ('2017-07-10', 1, 'b', 10);  select * from test.mergetree;  select '------';  select * from test.mergetree_buffer;

database資料庫 table 源表，這裡除了字元串常量，也可以使用變數的。 num_layers 是類似「分區」的概念，每個分區的後面的 min / max 是獨立計算的，官方推薦的值是 16 。 min / max 這組配置薦，就是設置閾值的，分別是時間（秒），行數，空間（位元組）。

閾值的規則: 是「所有的 min 條件都滿足，或至少一個 max 條件滿足」。

如果按上面我們的建表來說，所有的 min 條件就是：過了 3秒，2條數據，1 Byte。一個 max 條件是：20秒，或 10 條數據，或有 10K

Set

Set 這個引擎有點特殊，因為它只用在 IN 操作符右側，你不能對它 select

create table test.set(id UInt16, name String) ENGINE=Set;  insert into test.set(id, name) values (1, 'hello');  -- select 1 where (1, 'hello') in test.set; -- 默認UInt8 需要手動進行類型轉換  select 1 where (toUInt16(1), 'hello') in test.set;

注: Set 引擎表，是全記憶體運行的，但是相關數據會落到磁碟上保存，啟動時會載入到記憶體中。所以，意外中斷或暴力重啟，是可能產生數據丟失問題的。

MergeTree

這個引擎是 ClickHouse 的重頭戲，它支援一個日期和一組主鍵的兩層式索引，還可以實時更新數據。同時，索引的粒度可以自定義，外加直接支援取樣功能

MergeTree(EventDate, (CounterID, EventDate), 8192)  MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)

EventDate 一個日期的列名 intHash32(UserID) 取樣表達式 (CounterID, EventDate) 主鍵組（裡面除了列名，也支援表達式），也可以是一個表達式 8192 主鍵索引的粒度

drop table if exists test.mergetree1;  create table test.mergetree1 (sdt  Date, id UInt16, name String, cnt UInt16) ENGINE=MergeTree(sdt, (id, name), 10);    -- 日期的格式，好像必須是 yyyy-mm-dd  insert into test.mergetree1(sdt, id, name, cnt) values ('2018-06-01', 1, 'aaa', 10);  insert into test.mergetree1(sdt, id, name, cnt) values ('2018-06-02', 4, 'bbb', 10);  insert into test.mergetree1(sdt, id, name, cnt) values ('2018-06-03', 5, 'ccc', 11);

此時/var/lib/clickhouse/data/test/mergetree1的目錄結構：

├── 20180601_20180601_1_1_0  │   ├── checksums.txt  │   ├── columns.txt  │   ├── id.bin  │   ├── id.mrk  │   ├── name.bin  │   ├── name.mrk  │   ├── cnt.bin  │   ├── cnt.mrk  │   ├── cnt.idx  │   ├── primary.idx  │   ├── sdt.bin  │   └── sdt.mrk -- 保存一下塊偏移量  ├── 20180602_20180602_2_2_0  │   └── ...  ├── 20180603_20180603_3_3_0  │   └── ...  ├── format_version.txt  └── detached

ReplacingMergeTree

1.在 MergeTree 的基礎上，添加了「處理重複數據」的功能=>實時數據場景 2.相比 MergeTree ,ReplacingMergeTree 在最後加一個"版本列",它跟時間列配合一起，用以區分哪條數據是"新的"，並把舊的丟掉(這個過程是在 merge 時處理，不是數據寫入時就處理了的，平時重複的數據還是保存著的，並且查也是跟平常一樣會查出來的) 3.主鍵列組用於區分重複的行

-- 版本列 允許的類型是， UInt 一族的整數，或 Date 或 DateTime  create table test.replacingmergetree (sdt  Date, id UInt16, name String, cnt UInt16) ENGINE=ReplacingMergeTree(sdt, (name), 10, cnt);    insert into test.replacingmergetree (sdt, id, name, cnt) values ('2018-06-10', 1, 'a', 20);  insert into test.replacingmergetree (sdt, id, name, cnt) values ('2018-06-10', 1, 'a', 30);  insert into test.replacingmergetree (sdt, id, name, cnt) values ('2018-06-11', 1, 'a', 20);  insert into test.replacingmergetree (sdt, id, name, cnt) values ('2018-06-11', 1, 'a', 30);  insert into test.replacingmergetree (sdt, id, name, cnt) values ('2018-06-11', 1, 'a', 10);    select * from test.replacingmergetree;    -- 如果記錄未執行merge，可以手動觸發一下 merge 行為  optimize table test.replacingmergetree;

┌────────sdt─┬─id─┬─name─┬─cnt─┐ │ 2018-06-11 │ 1 │ a │ 30 │ └────────────┴────┴──────┴─────┘

SummingMergeTree

1.SummingMergeTree 就是在 merge 階段把數據sum求和 2.sum求和的列可以指定，不可加的未指定列，會取一個最先出現的值

create table test.summingmergetree (sdt Date, name String, a UInt16, b UInt16) ENGINE=SummingMergeTree(sdt, (sdt, name), 8192, (a));    insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-10', 'a', 1, 20);  insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-10', 'b', 2, 11);  insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-11', 'b', 3, 18);  insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-11', 'b', 3, 82);  insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-11', 'a', 3, 11);  insert into test.summingmergetree (sdt, name, a, b) values ('2018-06-12', 'c', 1, 35);    -- 手動觸發一下 merge 行為  optimize table test.summingmergetree;    select * from test.summingmergetree;

┌────────sdt─┬─name─┬─a─┬──b─┐ │ 2018-06-10 │ a │ 1 │ 20 │ │ 2018-06-10 │ b │ 2 │ 11 │ │ 2018-06-11 │ a │ 3 │ 11 │ │ 2018-06-11 │ b │ 6 │ 18 │ │ 2018-06-12 │ c │ 1 │ 35 │ └────────────┴──────┴───┴────┘ 注: 可加列不能是主鍵中的列，並且如果某行數據可加列都是 null ，則這行會被刪除。

AggregatingMergeTree

AggregatingMergeTree 是在 MergeTree 基礎之上，針對聚合函數結果，作增量計算優化的一個設計，它會在 merge 時，針對主鍵預處理聚合的數據應用於AggregatingMergeTree 上的聚合函數除了普通的 sum, uniq等，還有 sumState , uniqState ，及 sumMerge ， uniqMerge 這兩組

1.聚合數據的預計算是一種「空間換時間」的權衡，並且是以減少維度為代價的

dim1	dim2	dim3	measure1
aaaa	a	1	1
aaaa	b	2	1
bbbb	b	3	1
cccc	b	2	1
cccc	c	1	1
dddd	c	2	1
dddd	a	1	1

假設原始有三個維度，一個需要 count 的指標

dim1	dim2	dim3	measure1
aaaa	a	1	1
aaaa	b	2	1
bbbb	b	3	1
cccc	b	2	1
cccc	c	1	1
dddd	c	2	1
dddd	a	1	1

通過減少一個維度的方式，來以 count 函數聚合一次 M

dim2	dim3	count(measure1)
a	1	3
b	2	2
b	3	1
c	1	1
c	2	1

2.聚合數據的增量計算

對於 AggregatingMergeTree 引擎的表，不能使用普通的 INSERT 去添加數據，可以用： a. INSERT SELECT 來插入數據 b. 更常用的，是可以創建一個物化視圖

drop table if exists test.aggregatingmergetree;  create table test.aggregatingmergetree(  sdt Date  , dim1 String  , dim2 String  , dim3 String  , measure1 UInt64  ) ENGINE=MergeTree(sdt, (sdt, dim1, dim2, dim3), 8192);    -- 創建一個物化視圖，使用 AggregatingMergeTree  drop table if exists test.aggregatingmergetree_view;  create materialized view test.aggregatingmergetree_view  ENGINE = AggregatingMergeTree(sdt,(dim2, dim3), 8192)  as  select sdt,dim2, dim3, uniqState(dim1) as uv  from test.aggregatingmergetree  group by sdt,dim2, dim3;    insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'aaaa', 'a', '10', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'aaaa', 'a', '10', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'aaaa', 'b', '20', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'bbbb', 'b', '30', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'cccc', 'b', '20', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'cccc', 'c', '10', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'dddd', 'c', '20', 1);  insert into test.aggregatingmergetree (sdt, dim1, dim2, dim3, measure1) values ('2018-06-10', 'dddd', 'a', '10', 1);    -- 按 dim2 和 dim3 聚合 count(measure1)  select dim2, dim3, count(measure1) from test.aggregatingmergetree group by dim2, dim3;    -- 按 dim2 聚合 UV  select dim2, uniq(dim1) from test.aggregatingmergetree group by dim2;    -- 手動觸發merge  OPTIMIZE TABLE test.aggregatingmergetree_view;  select * from test.aggregatingmergetree_view;    -- 查 dim2 的 uv  select dim2, uniqMerge(uv) from test.aggregatingmergetree_view group by dim2 order by dim2;

CollapsingMergeTree

是專門為 OLAP 場景下，一種「變通」存數做法而設計的，在數據是不能改，更不能刪的前提下，通過「運算」的方式，去抹掉舊數據的影響，把舊數據「減」去即可，從而解決"最終狀態"類的問題，比如 當前有多少人在線？

「以加代刪」的增量存儲方式，帶來了聚合計算方便的好處，代價卻是存儲空間的翻倍，並且，對於只關心最新狀態的場景，中間數據都是無用的

CollapsingMergeTree 在創建時與 MergeTree 基本一樣，除了最後多了一個參數，需要指定 Sign 位（必須是 Int8 類型）

create table test.collapsingmergetree(sign Int8, sdt Date, name String, cnt UInt16) ENGINE=CollapsingMergeTree(sdt, (sdt, name), 8192, sign);

作者：darebeat

來源：https://www.jianshu.com/p/a5bf490247ea