A Case Study of the MySQL Group Replication Failure Detection Process

November 7, 2022
Notes
MySQL, MySQL High Availability

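Failure detection is a core module of Group Replication. It identifies failed nodes in the cluster promptly and removes them from the cluster. If a failed node is not removed in time, it degrades cluster performance and also blocks changes to the cluster topology.

This article walks through the failure detection process of Group Replication using a concrete case.

It also answers the following questions.

When a network partition occurs, what happens to the nodes in the minority?
What is the XCom Cache, and how do you estimate its size?
In production, why should group_replication_member_expel_timeout not be set too high?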

Case Study

The test cluster has the following topology and runs in multi-primary mode.

Hostname	IP	Role
node1	192.168.244.10	PRIMARY
node2	192.168.244.20	PRIMARY
node3	192.168.244.30	PRIMARY

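The test consists of two steps:

Simulate a network partition and observe its effect on each node in the cluster.
Restore the network connection and observe how each node reacts.

Simulating a Network Partition

First, simulate the network partition by running the following on node3.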

# iptables -A INPUT  -p tcp -s 192.168.244.10 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.10 -j DROP

# iptables -A INPUT  -p tcp -s 192.168.244.20 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.20 -j DROP

# date "+%Y-%m-%d %H:%M:%S"
2022-07-31 13:03:01

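The iptables commands cut the network connection between node3 and node1 and node2. The date command records when they were run.

5 s after the commands finish (this interval is fixed and is set by DETECTOR_LIVE_TIMEOUT in the source code), each node starts to react, as the logs on each node show.

First, look at the log and cluster state on node1.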

2022-07-31T13:03:07.582519-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.30:3306 has become unreachable.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | ONLINE       | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | ONLINE       | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | UNREACHABLE  | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

From the point of view of node1 and node2, node3 is now UNREACHABLE.

Next, look at node3.

2022-07-31T13:03:07.690416-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 has become unreachable.'
2022-07-31T13:03:07.690492-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 has become unreachable.'
2022-07-31T13:03:07.690504-00:00 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | UNREACHABLE  | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | UNREACHABLE  | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

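From the point of view of node3, node1 and node2 are now UNREACHABLE.

Of the three nodes, node3 sees only one as ONLINE, which does not satisfy the majority rule of Group Replication. node3 can therefore serve queries only; write operations are blocked.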

mysql> select * from slowtech.t1 where id=1;
+----+------+
| id | c1   |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.00 sec)

mysql> delete from slowtech.t1 where id=1;
(blocked...)

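Another 16 s later (this 16 s is actually related to the group_replication_member_expel_timeout parameter), node1 and node2 expel node3 from the cluster. The cluster now consists of only two nodes.

Look at the log and cluster state on node1.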

2022-07-31T13:03:23.576960-00:00 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.244.30:3306'
2022-07-31T13:03:23.577091-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306 on view 16592724636525403:3.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | ONLINE       | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
2 rows in set (0.00 sec)

Now look at node3: the log has no new output, and the node state has not changed.

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | UNREACHABLE  | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | UNREACHABLE  | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)
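
As a quick cross-check of the two intervals quoted above, the timestamps from the transcript can be subtracted directly. The snippet below is an illustrative Python sketch, not part of the original test; the variable names are made up here, and the first interval is only approximate because date prints whole seconds and was run after the iptables commands.

```python
from datetime import datetime

# Timestamps copied from the transcript above (all on 2022-07-31).
partition_applied = datetime.fromisoformat("2022-07-31T13:03:01")
marked_unreachable = datetime.fromisoformat("2022-07-31T13:03:07.582519")  # node1, MY-011493
expelled = datetime.fromisoformat("2022-07-31T13:03:23.576960")            # node1, MY-011499

# Time from cutting the network to node3 being flagged UNREACHABLE (approximate).
detect_delay = (marked_unreachable - partition_applied).total_seconds()
# Time from UNREACHABLE to node3 being removed from the group.
expel_delay = (expelled - marked_unreachable).total_seconds()

print(f"partition -> UNREACHABLE: {detect_delay:.1f} s")
print(f"UNREACHABLE -> expelled:  {expel_delay:.1f} s")
```

The second interval comes out at roughly 16 s, matching the figure quoted in the text.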

Restoring the Network Connection

Next, restore the network connection between node3 and node1 and node2.

# iptables -F

# date "+%Y-%m-%d %H:%M:%S"
2022-07-31 13:07:30

First, look at the log on node3.

2022-07-31T13:07:30.464179-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 is reachable again.'
2022-07-31T13:07:30.464226-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 is reachable again.'
2022-07-31T13:07:30.464239-00:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2022-07-31T13:07:37.458761-00:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2022-07-31T13:07:37.459011-00:00 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
2022-07-31T13:07:37.459037-00:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2022-07-31T13:07:37.459431-00:00 31 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 31'
2022-07-31T13:07:37.459478-00:00 31 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2022-07-31T13:07:37.459811-00:00 33 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'

2022-07-31T13:07:37.465738-00:00 34 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2022-07-31T13:07:37.496466-00:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2022-07-31T13:07:37.498813-00:00 36 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 351, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:39.653028-00:00 34 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2022-07-31T13:07:40.653484-00:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2022-07-31T13:07:40.653822-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.670530-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.682990-00:00 47 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2022-07-31T13:07:40.687566-00:00 47 [System] [MY-010562] [Repl] Slave I/O thread for channel 'group_replication_recovery': connected to master '[email protected]:3306',replication started in log 'FIRST' at position 4
2022-07-31T13:07:40.717851-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.732297-00:00 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2022-07-31T13:07:40.732511-00:00 53 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'

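The log output has two parts, separated by the blank line.

1. Once the network is restored, node3 reconnects to node1 and node2 and discovers that it has been expelled from the cluster, so it enters the ERROR state.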

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ERROR        |             |
+--------------------------------------+----------------+-------------+--------------+-------------+
1 row in set (0.00 sec)

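A node that enters the ERROR state is automatically set to read-only, which is the super_read_only=ON seen in the log. Note that setting an ERROR node to read-only is the default behavior and is independent of the group_replication_exit_state_action parameter discussed later.

2. If group_replication_autorejoin_tries is not 0, a node in the ERROR state automatically retries joining the cluster (auto-rejoin). The number of attempts is set by group_replication_autorejoin_tries, which defaults to 3 as of MySQL 8.0.21. The interval between attempts is 5 minutes. After a successful attempt, the node enters the distributed recovery phase.

Next, look at the log on node1.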

2022-07-31T13:07:39.555613-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.732568-00:00 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.244.30:3306 was declared online within the replication group.'

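node3 has rejoined the cluster.

Failure Detection Process

With the case above in mind, here is the failure detection process of Group Replication.

Each node in the cluster periodically (once per second) sends heartbeat messages to the other nodes. If no heartbeat is received from a node within 5 s (a fixed value with no tunable parameter), that node is marked as a suspect and its state is set to UNREACHABLE. If half or more of the nodes in the cluster are shown as UNREACHABLE, the cluster cannot accept writes.
If the suspect node recovers within group_replication_member_expel_timeout (as of MySQL 8.0.21 the default is 5, the unit is seconds, and the maximum is 3600, that is, 1 hour), it directly applies the messages in the XCom Cache. The size of the XCom Cache is set by group_replication_message_cache_size, which defaults to 1 GB.
If the suspect node does not recover within group_replication_member_expel_timeout, it is expelled from the cluster.
A node in the minority, on the other hand, does not leave the cluster automatically. It stays in its current state until one of the following happens:

The network recovers.
group_replication_unreachable_majority_timeout is reached. Note that this timer starts 5 s after the connection is lost, not when the suspect node is expelled from the cluster. The parameter defaults to 0, which means no timeout.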

In either case, the following is triggered:

The node state switches from ONLINE to ERROR.

The blocked write operations are rolled back.

mysql> delete from slowtech.t1 where id=1;
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

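A node in the ERROR state is automatically set to read-only.
If group_replication_autorejoin_tries is not 0, a node in the ERROR state automatically retries joining the cluster (auto-rejoin).
If group_replication_autorejoin_tries is 0 or the attempts fail, the action specified by group_replication_exit_state_action is executed. The available actions are:

READ_ONLY: read-only mode. super_read_only is set to ON. This is the default.
OFFLINE_MODE: offline mode. offline_mode and super_read_only are both set to ON. Only users with the CONNECTION_ADMIN (or SUPER) privilege can then log in; ordinary users cannot.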
```
# mysql -h 192.168.244.30 -P 3306 -ut1 -p123456
ERROR 3032 (HY000): The server is currently in offline mode
```
ABORT_SERVER: shuts down the instance.
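
The majority rule described in the flow above (writes stop once half or more of the members are UNREACHABLE) can be written as a one-line check. This is a minimal Python sketch for illustration only; writes_blocked is a hypothetical helper, not a Group Replication API.

```python
def writes_blocked(total_members: int, unreachable_members: int) -> bool:
    """Return True when a member sees half or more of the group as UNREACHABLE.

    Hypothetical helper that mirrors the rule stated in the text; it does not
    query MySQL and only models the counting logic.
    """
    if not 0 <= unreachable_members <= total_members:
        raise ValueError("unreachable_members must be between 0 and total_members")
    return 2 * unreachable_members >= total_members


# View from node1/node2 in the case study: 1 of 3 members is unreachable.
print(writes_blocked(3, 1))
# View from node3: 2 of 3 members are unreachable.
print(writes_blocked(3, 2))
```

In the case study, node1 and node2 each see one unreachable member out of three and keep accepting writes, while node3 sees two out of three and blocks.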

XCom Cache

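The XCom Cache is the message cache used by XCom. It caches the messages exchanged between cluster nodes, which are part of the consensus protocol. When the network is unstable, a node may lose contact with the others.

If the node recovers within a certain time (determined by group_replication_member_expel_timeout), it first applies the messages in the XCom Cache. If the XCom Cache does not hold all the messages it needs, the node is expelled from the cluster. After being expelled, it rejoins the cluster (auto-rejoin) if group_replication_autorejoin_tries is not 0.

Rejoining the cluster uses Distributed Recovery to catch up on the missing data. Compared with applying messages directly from the XCom Cache, joining through Distributed Recovery takes longer, is more complex, and also affects cluster performance.

So when sizing the XCom Cache, estimate the memory used during a window of group_replication_member_expel_timeout + 5 s. The system table that helps with this estimate is introduced later.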
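
A rough way to turn this sizing rule into a number is to multiply the cache growth rate by the window length. The Python sketch below assumes a measured rate in bytes per second (a hypothetical input you would have to obtain from your own workload) and uses the fixed 5 s detection delay described earlier; estimate_xcom_cache_bytes and the constant names are illustrative, not MySQL identifiers.

```python
DETECTION_DELAY_S = 5                 # fixed delay before a node is marked UNREACHABLE
MIN_CACHE_BYTES = 128 * 1024 * 1024   # 134217728, the minimum cache size cited in this article


def estimate_xcom_cache_bytes(message_bytes_per_second: int, expel_timeout_s: int) -> int:
    """Estimate the cache needed to cover expel_timeout + 5 s of messages.

    message_bytes_per_second is an assumed, workload-specific measurement.
    The result is never smaller than the minimum configurable cache size.
    """
    window_s = expel_timeout_s + DETECTION_DELAY_S
    return max(message_bytes_per_second * window_s, MIN_CACHE_BYTES)


# Hypothetical workload: 10 MiB/s of cached messages.
rate = 10 * 1024 * 1024
print(estimate_xcom_cache_bytes(rate, 5))     # default expel timeout
print(estimate_xcom_cache_bytes(rate, 3600))  # maximum expel timeout
```

The linear growth with the timeout is the point of the sketch: at the same assumed rate, a 3600 s timeout needs a cache several hundred times larger than the default 5 s timeout.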

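Next, simulate a scenario where the XCom Cache is too small.

1. Set group_replication_message_cache_size to its minimum value (128 MB) and restart Group Replication for the change to take effect.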

mysql> set global group_replication_message_cache_size=134217728;
Query OK, 0 rows affected (0.00 sec)

mysql> stop group_replication;
Query OK, 0 rows affected (4.15 sec)

mysql> start group_replication;
Query OK, 0 rows affected (3.71 sec)

2. Set group_replication_member_expel_timeout to 3600 so that there is enough time for the test.

mysql> set global group_replication_member_expel_timeout=3600;
Query OK, 0 rows affected (0.01 sec)

3. Cut the network connection between node3 and node1 and node2.

# iptables -A INPUT  -p tcp -s 192.168.244.10 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.10 -j DROP

# iptables -A INPUT  -p tcp -s 192.168.244.20 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.20 -j DROP

4. Run a large transaction repeatedly.

mysql> insert into slowtech.t1(c1) select c1 from slowtech.t1 limit 1000000;
Query OK, 1000000 rows affected (10.03 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

5. Check the error log.

If the error log on node1 or node2 contains the following message, the messages node3 needs have already been evicted from the XCom Cache.

[Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Messages that are needed to recover node 192.168.244.30:33061 have been evicted from the message  cache. Consider resizing the maximum size of the cache by  setting group_replication_message_cache_size.'

6. Check the system table.

Besides the error log, the system table also shows how the XCom Cache is being used.

mysql> select * from performance_schema.memory_summary_global_by_event_name where event_name like "%GCS_XCom::xcom_cache%"\G
*************************** 1. row ***************************
                  EVENT_NAME: memory/group_rpl/GCS_XCom::xcom_cache
                 COUNT_ALLOC: 23678
                  COUNT_FREE: 22754
   SUM_NUMBER_OF_BYTES_ALLOC: 154713397
    SUM_NUMBER_OF_BYTES_FREE: 28441492
              LOW_COUNT_USED: 0
          CURRENT_COUNT_USED: 924
             HIGH_COUNT_USED: 20992
    LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 126271905
   HIGH_NUMBER_OF_BYTES_USED: 146137294
1 row in set (0.00 sec)

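Where:

COUNT_ALLOC: the number of messages that have been cached.
COUNT_FREE: the number of messages removed from the cache.
CURRENT_COUNT_USED: the number of messages currently cached, equal to COUNT_ALLOC - COUNT_FREE.
SUM_NUMBER_OF_BYTES_ALLOC: the amount of memory allocated.
SUM_NUMBER_OF_BYTES_FREE: the amount of memory freed.
CURRENT_NUMBER_OF_BYTES_USED: the amount of memory currently in use, equal to SUM_NUMBER_OF_BYTES_ALLOC - SUM_NUMBER_OF_BYTES_FREE.
LOW_COUNT_USED, HIGH_COUNT_USED: the historical minimum and maximum of CURRENT_COUNT_USED.
LOW_NUMBER_OF_BYTES_USED, HIGH_NUMBER_OF_BYTES_USED: the historical minimum and maximum of CURRENT_NUMBER_OF_BYTES_USED.

If COUNT_FREE changes while the large transactions are running after the connection is cut, that also means the messages node3 needs have been evicted from the XCom Cache.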
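
The two identities above can be checked against the sample row shown earlier. The following Python sketch hard-codes the values from that output and compares the bytes in use with the 128 MB cache size configured in step 1; the variable names are illustrative.

```python
# Values copied from the memory_summary_global_by_event_name row above.
row = {
    "COUNT_ALLOC": 23678,
    "COUNT_FREE": 22754,
    "SUM_NUMBER_OF_BYTES_ALLOC": 154713397,
    "SUM_NUMBER_OF_BYTES_FREE": 28441492,
}
cache_size_bytes = 134217728  # group_replication_message_cache_size set in step 1

# Messages and bytes currently held in the cache.
current_count = row["COUNT_ALLOC"] - row["COUNT_FREE"]
current_bytes = row["SUM_NUMBER_OF_BYTES_ALLOC"] - row["SUM_NUMBER_OF_BYTES_FREE"]

# Fraction of the configured cache that is in use.
usage = current_bytes / cache_size_bytes

print(f"messages cached: {current_count}")
print(f"bytes cached:    {current_bytes}")
print(f"cache usage:     {usage:.1%}")
```

Both differences reproduce the CURRENT_COUNT_USED and CURRENT_NUMBER_OF_BYTES_USED columns of the sample row, and the cache is roughly 94% full, which is consistent with messages already being evicted.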

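7. Restore the network connection between node3 and node1 and node2.

If the network recovers within group_replication_member_expel_timeout but the messages node3 needs are no longer in the XCom Cache, node3 is still expelled from the cluster. The following is the error log on node3 in this scenario.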

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Node 0 is unable to get message {4aec99ca 7562 0}, since the group is too far ahead. Node will now exit.'
[ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
[System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
[System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'

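Caveats

When the cluster contains an UNREACHABLE node, the following limitations apply:

The cluster topology cannot be changed, which includes adding and removing nodes.
In single-primary mode, a new primary cannot be elected if the Primary node fails.
If the Group Replication consistency level is AFTER or BEFORE_AND_AFTER, write operations wait until the UNREACHABLE node is ONLINE again and has applied them.
Cluster throughput drops. In single-primary mode, setting group_replication_paxos_single_leader (introduced in MySQL 8.0.27) to ON resolves this.

For these reasons, group_replication_member_expel_timeout should not be set too high in production.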

