Clickhouse基準測試實踐

2022 年 8 月 28 日
筆記
ClickHouse, JAVA

1.概述

本篇部落格將對MySQL、InfluxDB、Clickhouse在寫入時間、聚合查詢時間、磁碟使用等方面的性能指標來進行比較。

2.內容

比較的數據集，是使用的Clickhouse官網提供的6600萬的數據集來進行測試比較的，當MySQL、InfluxDB、Clickhouse也分配4CPU和16GB記憶體的資源時，Clickhouse完全是在導入速度、磁碟使用和查詢性能等方面體現非常好的效果。結論如下所示：

	MySQL	InfluxDB	Clickhouse
導入時間	70分鐘	35分鐘	70秒
磁碟佔用空間	12.35GB	5.9GB	2.66GB
全表count	24366ms	11674ms	100ms
全表max/min	27023ms	26829ms	186ms
全表求平均值	24841ms	12043ms	123ms
全表求方差	24600ms	OOM	113ms
複雜查詢1	30260ms	OOM	385ms
複雜查詢2	470ms	200ms	8ms

為了保證測試結果比較準確，上面的每條SQL至少執行10次，然後取中間值。其中InfluxDB的性能比預期差，甚至比MySQL差，這可能是由於數據樣本和測試用例不合適InfluxDB場景造成的。

2.1 MySQL

MySQL是Oracle的產品，是目前最流行的關係型資料庫管理系統之一。它使用的SQL語言是用於訪問資料庫的最常見的標準化語言。採用雙授權策略，分為社區版和商業版。由於體積小、速度快、總體擁有成本低，尤其是其開源特性，一般選擇MySQL作為中小型網站開發的網站資料庫。

MySQL並不完美，但是它足夠靈活，是架構中的最佳選擇之一，並且在複雜的非單一項目中總能擁有一席之地。

2.2 InfluxDB

InfluxDB是InfluxData開發的開源時序資料庫，專註于海量時序數據的高性能讀、高性能寫、高效存儲和實時分析。在資料庫引擎排名時序資料庫中，它位居第一，廣泛應用與開發運維監控、物聯網監控、實時分析等場景。

傳統資料庫通常記錄數據的當前值，而時序資料庫記錄所有歷史數據。在處理當前時序數據時，必須不斷接收新的時序數據。同時，時序數據的查詢始終是基於時間的。它重點解決以下海量數據場景：

時序數據的寫入：如何支援每條千萬條數據的寫入；
時序數據的讀取：如何支援每條千萬條數據的聚合查詢；
成本問題：海量數據存儲帶來的成本問題，如何以更低的成本存儲這些數據。

2.3 Clickhouse

Clickhouse是由Yandex開源的基於列存儲的資料庫，用於實時數據分析，其處理數據的速度比傳統方法快100~1000倍。Clickhouse優於當前市場上類似的面向列的DBMS，每台伺服器每秒處理數億到超過10億行和超過10GB的數據。

它是一個用於在線分析（OLAP）的列式資料庫管理系統（DBMS），對OLTP和OLAP的做如下區別介紹：

OLTP：它是一個傳統的關係型資料庫，主要操作增刪改查，強調交易一致性，比如銀行系統、電子商務系統等；
OLAP：是一個倉庫型的資料庫，主要用於讀取數據，做複雜的數據分析，專註於技術決策支援，提供直觀簡單的結果。

Clickhouse用於OLAP的適用場景如下：

讀取多於寫入；
一個大的寬表讀取大量的行和很少的列，同時導入較小的結果集；
數據時分批寫入的，數據不更新或者更新頻率很低；
無需事務，數據一致性要求較低；
靈活多變，不適合模型預構建（類似Kylin的Cube）

3.測試數據準備

直接使用Clickhouse提供的測試數據地址：//clickhouse.com/docs/en/getting-started/example-datasets/opensky/，這個數據集中的數據是從完整的OpenSky數據集中導出和清洗過的。

該數據集涵蓋了自2019年1月1日以來，該網路2500多名成員看到的所有航班資訊。

3.1 下載數據

執行如下命令：

wget -O- https://zenodo.org/record/5092942 | grep -oP '//zenodo.org/record/5092942/files/flightlist_\d+_\d+\.csv\.gz' | xargs wget

在互聯網連接良好的情況下，下載大約需要 2 分鐘。共有 30 個文件，總大小為 4.3 GB。

3.2 創建表Clickhouse

執行如下SQL命令：

CREATE TABLE opensky
(
    callsign String,
    number String,
    icao24 String,
    registration String,
    typecode String,
    origin String,
    destination String,
    firstseen DateTime,
    lastseen DateTime,
    day DateTime,
    latitude_1 Float64,
    longitude_1 Float64,
    altitude_1 Float64,
    latitude_2 Float64,
    longitude_2 Float64,
    altitude_2 Float64
) ENGINE = MergeTree ORDER BY (origin, destination, callsign);

3.2.1 導入數據

將數據導入到Clickhouse中，執行如下所示命令：

ls -1 flightlist_*.csv.gz | xargs -P100 -I{} bash -c 'gzip -c -d "{}" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"'

在這裡，我們將文件列表 ( ls -1 flightlist_*.csv.gz)傳遞xargs給以進行並行處理。 xargs -P100指定最多使用 100 個並行工作器，但由於我們只有 30 個文件，工作器的數量將只有 30 個；
對於每個文件，xargs將運行一個帶有bash -c. 該腳本以 of 的形式進行替換，{}並且該xargs命令將文件名替換為它（我們已經要求它xargs使用-I{}）；
該腳本會將文件 ( gzip -c -d “{}”) 解壓縮到標準輸出 (-c參數) 並將輸出重定向到clickhouse-client
我們還要求使用擴展解析器 ( –date_time_input_format best_effort ) 解析DateTime欄位，以識別具有時區偏移的 ISO-8601 格式

最後，clickhouse-client會做插入。它將以CSVWithNames格式讀取輸入數據。並行上傳需要 24 秒。如果不想使用並行上傳，還可以使用順序上傳，可能需要的時間長一點，大概 75 秒，具體執行命令如下：

for file in flightlist_*.csv.gz; do gzip -c -d "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"; done

完整程式碼如下：

$ clickhouse-client
$ create database test;
$ use test;
$ CREATE TABLE opensky(callsign String,number String,icao24 String,registration String,typecode String,origin String,destination String,firstseen DateTime,lastseen DateTime,day DateTime,latitude_1 Float64,longitude_1 Float64,altitude_1 Float64,latitude_2 Float64,longitude_2 Float64,altitude_2 Float64) ENGINE = MergeTree ORDER BY (origin, destination, callsign);
$ exit

# Import data (about 75 seconds)
$ cd /tmp/flightlist
$ for file in flightlist_*.csv; do cat "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO test.opensky FORMAT CSVWithNames"; done

# Check if the data was imported successfully
$ clickhouse-client
$ SELECT count() FROM test.opensky;

3.3 創建MySQL

完整程式碼如下：

# Link MySQL to build database and table
$ mysql -uroot -p123456
$ use test;
$ CREATE TABLE `opensky` (`callsign` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`number` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`icao24` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`registration` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`typecode` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`origin` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`destination` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`firstseen` datetime DEFAULT NULL,`lastseen` datetime DEFAULT NULL,`day` datetime DEFAULT NULL,`latitude_1` double DEFAULT NULL,`longitude_1` double DEFAULT NULL,`altitude_1` double DEFAULT NULL,`latitude_2` double DEFAULT NULL,`longitude_2` double DEFAULT NULL,`altitude_2` double DEFAULT NULL,KEY `idx_callsign` (`callsign`),KEY `idx_origin` (`origin`),KEY `idx_destination` (`destination`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

# Import data (about 70 minutes)
$ load data local infile 'flightlist_20190101_20190131.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;
# Omit the other 29 import commands：load data local infile 'flightlist_*_*.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;

# Check if the data was imported successfully
$ select count(*) from test.opensky;

3.4 創建InfluxDB

完整程式碼如下：

# Import data (about 30 minutes)
$ influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_20190101_20190131.txt -precision=ns;
# Omit the other 29 import commands：influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_*_*.txt -precision=ns;

# Check if the data was imported successfully
$ influx -username 'admin' -password 'admin123456'
$ select count(latitude_1) from test.autogen.opensky;

4.測試場景

4.1 MySQL測試維度

從不同的維度，對MySQL來進行測試，具體實現程式碼如下所示：

$ mysql -uroot -p123456
$ use test;
-- Enable performance analysis
set profiling = 1;
-- query disk space
select table_rows as `total_lines`, (data_length + index_length)/1024/1024/1024 as `disk_usage(G)` from information_schema.`TABLES` where table_name = 'opensky';
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select avg(latitude_2) from opensky;
-- full table variance
select var_pop(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT origin, count(1) AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;
-- output analysis results
show profiles;

4.2 InfluxDB測試維度

從不同的維度，對InfluxDB來進行測試，具體實現程式碼如下所示：

$ influx -username 'admin' -password 'admin123456'
$ use test;
-- Time-consuming statistics，queryReqDurationNs is the cumulative query time, and the subtraction of the time of the two tasks is the time-consuming
select queryReq,queryReqDurationNs/1000/1000,queryRespBytes from _internal."monitor".httpd order by time desc limit 10;
-- query disk space
select sum(diskBytes) / 1024 / 1024 /1024 from _internal."monitor"."shard" where time > now() - 10s group by "database";
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select mean(latitude_2) from opensky;
-- full table variance
select stddev(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),mean(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT count(latitude_1) AS c FROM opensky WHERE origin =~/^UUEE|UUDD|UUWW$/ GROUP BY origin;

4.3 Clickhouse測試維度

從不同的維度，對Clickhouse來進行測試，具體實現程式碼如下所示：

$ clickhouse-client
$ use test;
-- Time-consuming statistics
select event_time_microseconds,query_duration_ms,read_rows,result_rows,memory_usage,query from system.query_log where query like '%opensky%' and query_duration_ms <> 0 and query not like '%event_time_microseconds%' order by event_time_microseconds desc limit 5;
-- query disk space
SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'opensky';
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select avg(latitude_2) from opensky;
-- full table variance
select var_pop(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT origin, count() AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;

5.為什麼Clickhouse這麼快

5.1 列式存儲

數據存儲在列中，數據就是索引；
查詢只訪問涉及的列，減少了系統 I/O；
每列由一個執行緒處理，有效利用 CPU 資源；
它還為矢量化執行奠定了基礎。

5.2 數據壓縮

數據壓縮的本質是按照一定的步長對數據進行匹配掃描，發現重複數據時進行編碼轉換。

因為是列式存儲，數據特性非常相似，所以數據中存在很多重複，壓縮率越高，數據量越小，磁碟I/O壓力越低，網路傳輸越快。

5.3 矢量化執行引擎

單指令多數據是指一條指令操作多條數據。是通過數據並行來提高性能的一種方式，可以簡單理解為程式中數據在暫存器級別的並行處理。

Clickhouse 廣泛使用 SIMD 來提高計算效率。通過使用SIMD，基本上可以帶來數倍的性能提升。

5.4 多執行緒和分散式

分散式領域有個規律，計算移動比數據移動更划算，這就是它的核心。

數據的計算直接發送到數據所在的伺服器，進行多機並行處理，然後將最終結果匯總在一起。

此外，ClickHouse 還通過執行緒級並行進一步提高效率，充分利用伺服器資源。

5.5 各種表引擎

MergeTree 存儲結構對寫入的數據進行排序，然後有序存儲。有序存儲有兩個主要優點：

對列存文件進行分塊壓縮時，排序鍵中的列值是連續的或重複的，這樣列存分塊中的數據才能得到最終的壓縮比。
存儲順序本身可以加快查詢的索引結構。根據排序鍵中列的等價條件或範圍條件，我們可以快速找到目標的大致位置範圍，而且這種索引結構不會產生額外的存儲開銷。

MergeTree 是 ClickHouse 表引擎中的核心引擎。其他引擎基於 MergeTree 引擎，在數據合併過程中實現不同的特性，從而形成 MergeTree 表引擎家族。

6.總結

Clickhouse的優缺點如下：

優勢：極致的查詢分析性能、低存儲成本、高吞吐數據寫入、多樣化的表引擎、完備的DBMS功能。
缺點：不支援事務，不支援真正的刪除/更新，分發能力弱；不支援高並發，官方推薦100 QPS。

對於非標準的SQL，join的實現比較特殊，性能不好；頻繁的小批量數據操作會影響查詢性能。目前還沒有可以滿足各種場景需求的OLAP引擎。本質原因是沒有一個系統可以同時在查詢效率、及時性和可維護性方面做到完美。只能說ClickHouse是為了極致的查詢性能。做了一些取捨。ClickHouse 的優缺點是顯而易見的。是否採用取決於與實際業務場景的契合度。適合你的架構是最好的架構。

7.結束語

這篇部落格就和大家分享到這裡，如果大家在研究學習的過程當中有什麼問題，可以加群進行討論或發送郵件給我，我會盡我所能為您解答，與君共勉！

另外，部落客出書了《Kafka並不難學》和《Hadoop大數據挖掘從入門到進階實戰》，喜歡的朋友或同學，可以在公告欄那裡點擊購買鏈接購買部落客的書進行學習，在此感謝大家的支援。關注下面公眾號，根據提示，可免費獲取書籍的教學影片。

Tags: ClickHouse JAVA

Clickhouse基準測試實踐

1.概述

2.內容

2.1 MySQL

2.2 InfluxDB

2.3 Clickhouse

3.測試數據準備

3.1 下載數據

3.2 創建表Clickhouse

3.2.1 導入數據

3.3 創建MySQL

3.4 創建InfluxDB

4.測試場景

4.1 MySQL測試維度

4.2 InfluxDB測試維度

4.3 Clickhouse測試維度

5.為什麼Clickhouse這麼快

5.1 列式存儲

5.2 數據壓縮

5.3 矢量化執行引擎

5.4 多執行緒和分散式

5.5 各種表引擎

6.總結

7.結束語

VirMach 便宜 VPS

QNews

Clickhouse基準測試實踐

1.概述

2.內容

2.1 MySQL

2.2 InfluxDB

2.3 Clickhouse

3.測試數據準備

3.1 下載數據

3.2 創建表Clickhouse

3.2.1 導入數據

3.3 創建MySQL

3.4 創建InfluxDB

4.測試場景

4.1 MySQL測試維度

4.2 InfluxDB測試維度

4.3 Clickhouse測試維度

5.為什麼Clickhouse這麼快

5.1 列式存儲

5.2 數據壓縮

5.3 矢量化執行引擎

5.4 多執行緒和分散式

5.5 各種表引擎

6.總結

7.結束語

分享此文：

Related Posts

如何定時備份Mysql資料庫

面試題隨記一

Postfix別名郵件與SASL驗證

JSP中的EL 表達式

VirMach 便宜 VPS

QNews

熱門搜尋