Deploying a MySQL Cluster with Keepalived 1.3.5 + MHA
MHA:
MHA's working principle can be summarized as follows:
- Save the binary log events (binlog events) from the crashed master;
- Identify the slave that holds the most recent updates;
- Apply the differential relay log to the other slaves;
- Apply the binlog events saved from the master;
- Promote one slave to be the new master;
- Point the remaining slaves at the new master and resume replication.
MHA features:
- Failover completes in roughly 10-30 seconds
- Automatic master monitoring and failover
- Online switchover of the master to another host
1. Environment Overview
| Server role | IP | Node | Port | Responsibility |
| --- | --- | --- | --- | --- |
| manager, master | 192.168.163.3 | mha1 | 3310 | monitoring, primary master |
| slave1 | 192.168.163.16 | mha2 | 3310 | slave |
| slave2 | 192.168.163.20 | mha3 | 3310 | slave |
OS: CentOS 7.8
MySQL: MySQL 5.7.23
MHA: mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
     mha4mysql-node-0.58-0.el7.centos.noarch.rpm
Keepalived: keepalived-1.3.5-19.el7.x86_64
Disable the firewall:
systemctl stop firewalld
systemctl disable firewalld
1.1 Configure the yum Repository
MHA pulls in many dependency packages, and many yum mirrors reachable from China are missing some of them. Using the repository below saves most of that time. (Quoting the heredoc delimiter as 'EOF' keeps the shell from expanding $basearch before the file is written.)
cat > /etc/yum.repos.d/epel.repo << 'EOF'
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
failovermethod=priority
enabled=1
gpgcheck=0
mirrorlist=https://mirrors.fedoraproject.org/mirrorlist?repo=epel-7&arch=$basearch
EOF
1.2 Install Dependencies
All three hosts need the dependency packages.
yum install perl-DBI perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes perl-Params-Validate perl-DateTime -y
yum install cpan perl-ExtUtils-Embed perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker -y
1.3 Install the MHA Packages
Install the node package on all three hosts; install the manager package only on mha1, the monitoring host.
mha1:
rpm -ivh mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
mha2:
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
mha3:
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
1.4 Set the Hostnames and hosts File
Apply on all three hosts:
[root@mha1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.163.3 mha1
192.168.163.16 mha2
192.168.163.20 mha3
1.5 Passwordless SSH Between Hosts
mha1:
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
Repeat on each of the three hosts, then test logins until every host can reach the others directly:
ssh 192.168.163.3
ssh 192.168.163.16
ssh 192.168.163.20
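A quick way to confirm every hop works without a password is a small loop; BatchMode makes ssh fail instead of prompting. A minimal sketch, run from each host:
# verify passwordless ssh to all three nodes in one pass
for h in 192.168.163.3 192.168.163.16 192.168.163.20; do
    ssh -o BatchMode=yes root@$h hostname || echo "passwordless ssh to $h failed"
done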
2. MySQL Installation and Deployment
2.1 MySQL Configuration File
Note the server_id: every instance in the same cluster must use a different value.
[root@mha1 ~]# vim /data/3310/my.cnf
[mysqld]
basedir=/application/mysql
datadir=/data/3310/data
socket=/data/3310/mysql.sock
log_error=/data/3310/mysql.log
port=3310
server_id=331003
log_bin=/data/3310/mysql-bin
gtid-mode=on
enforce-gtid-consistency=true
slow_query_log=ON
skip-name-resolve      # recommended, not required
relay_log_purge = 0    # do not auto-purge relay logs
log_slave_updates = 1  # write replicated changes to the slave's own binlog; required, or data may be lost after a switchover
relay-log = relay-log
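For example, the other two hosts could use values like the following (hypothetical IDs following the port-plus-IP-suffix convention above; any unique integers work):
# mha1 (192.168.163.3):  server_id=331003
# mha2 (192.168.163.16): server_id=331016   # illustrative value
# mha3 (192.168.163.20): server_id=331020   # illustrative value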
2.2 Register the systemd Service
[root@mha1 ~]# vim /etc/systemd/system/mysqld3310.service
[Unit]
Description=MySQL Server
Documentation=man:mysqld(8)
Documentation=https://dev.mysql.com/doc/refman/en/using-systemd.html
After=network.target
After=syslog.target

[Install]
WantedBy=multi-user.target

[Service]
User=mysql
Group=mysql
ExecStart=/application/mysql/bin/mysqld --defaults-file=/data/3310/my.cnf
LimitNOFILE = 5000
2.3 Initialize MySQL
/application/mysql/bin/mysqld --initialize-insecure --user=mysql --basedir=/application/mysql --datadir=/data/3310/data
2.4 Start MySQL
systemctl daemon-reload
systemctl start mysqld3310
systemctl enable mysqld3310
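A quick sanity check that the instance is up, assuming the socket path from my.cnf above:
/application/mysql/bin/mysql -S /data/3310/mysql.sock -e "select @@server_id, @@port;"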
3. One Master, Two Slaves
3.1 mha1
mysql> create user 'slave'@'192.168.163.%' identified by '123';
mysql> create user mha@'192.168.163.%' identified by '123';
mysql> grant replication slave on *.* to slave@'192.168.163.%' identified by '123';
mysql> grant all on *.* to mha@'192.168.163.%';
mysql> flush privileges;
Alternatively, the slaves can be seeded with mysqldump. Here the instances are freshly initialized and hold no data, so this step is optional.
mysqldump -uroot -p123 -S /data/3310/mysql.sock -A --master-data=2 --single-transaction -R -E --triggers > /tmp/full.sql
On each slave: source /tmp/full.sql
Check the master's status on the primary:
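The standard way to read the current binlog coordinates (used in the change master statements below) is:
mysql> show master status;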
3.2 mha2
mysql> change master to master_host='192.168.163.3',master_user='slave',master_port=3310,master_password='123',master_log_file='master-log.000003',master_log_pos=234;
If the slave then reports:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
this error appeared after the log-bin-index parameter was added on the master. It means the requested binlog file is not in the master's index (for instance, when the file name passed to change master does not match the master's actual log_bin base name, which is mysql-bin in the configuration from section 2.1). Resetting the slave clears it:
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
3.3 mha3
mysql> change master to master_host='192.168.163.3',master_user='slave',master_port=3310,master_password='123',master_log_file='master-log.000003',master_log_pos=234;
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
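On each slave, verify that both replication threads are running; a healthy show slave status output contains lines like the following (illustrative excerpt):
mysql> show slave status\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes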
4. MHA Setup
Manager tools:
masterha_check_ssh        check MHA's SSH configuration
masterha_check_repl       check MySQL replication health
masterha_manager          start MHA
masterha_check_status     check the current MHA run state
masterha_master_monitor   detect whether the master is down
masterha_master_switch    control failover (automatic or manual)
masterha_conf_host        add or remove a configured server entry
Node tools:
save_binary_logs          save and copy the master's binary logs
apply_diff_relay_logs     identify differential relay log events and apply them to other slaves
filter_mysqlbinlog        strip unnecessary ROLLBACK events (no longer used by MHA)
purge_relay_logs          purge relay logs (without blocking the SQL thread)
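Because relay_log_purge=0 was set in section 2.1, relay logs are never removed automatically; a common companion practice is to schedule purge_relay_logs on each slave. A sketch of a cron entry, assuming the mha account created earlier and the default rpm install path:
# crontab -e (on each slave); illustrative schedule and log path
0 4 * * * /usr/bin/purge_relay_logs --user=mha --password=123 --port=3310 --disable_relay_log_purge >> /tmp/purge_relay_logs.log 2>&1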
Start MHA:
nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
Stop MHA:
masterha_stop --conf=/etc/mha_master/mha.cnf
Cluster status check:
masterha_check_status --conf=/etc/mha_master/mha.cnf
Cluster SSH connectivity check:
masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Cluster replication check:
masterha_check_repl --conf=/etc/mha_master/mha.cnf
Cluster switchover:
masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.16 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000
4.1 Install the Software
Already installed in section 1.3.
4.2 Configuration File
mha1:
mkdir -p /etc/mha_master/app1
touch /etc/mha_master/manager.log
cat > /etc/mha_master/mha.cnf <<EOF
[server default]
manager_log=/etc/mha_master/manager.log
manager_workdir=/etc/mha_master/app1
password=123
ping_interval=1
remote_workdir=/tmp
repl_password=123
repl_user=slave
ssh_user=root
user=mha

[server1]
candidate_master=1
hostname=192.168.163.3
port=3310
ssh_port=22

[server2]
candidate_master=1
hostname=192.168.163.16
port=3310
ssh_port=22

[server3]
candidate_master=1
hostname=192.168.163.20
port=3310
ssh_port=22
EOF
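Note that no master_ip_failover_script or master_ip_online_change_script is defined here: in this walkthrough the virtual IP is handled by keepalived (section 6) rather than by MHA itself. If you preferred MHA-managed VIP switching instead, you would add something like the following to [server default] (the script paths are hypothetical):
# master_ip_failover_script=/usr/local/bin/master_ip_failover
# master_ip_online_change_script=/usr/local/bin/master_ip_online_change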
Parameter details can be found in the MHA documentation online.
4.3 Cluster Health Checks
SSH connectivity check:
[root@mha1 ~]# masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Replication check:
[root@mha1 ~]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
4.4 Start the Cluster
[root@mha1 ~]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
Cluster status:
[root@mha1 ~]# masterha_check_status --conf=/etc/mha_master/mha.cnf
5. Manual Switchover
5.1 Stop MHA Monitoring
Stop the manager:
[root@mha1 ~]# masterha_stop --conf=/etc/mha_master/mha.cnf
5.2 Switch the Master to mha2
[root@mha1 ~]# masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.16 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000
Sat Feb 20 09:44:43 2021 - [info] MHA::MasterRotate version 0.58.
Sat Feb 20 09:44:43 2021 - [info] Starting online master switch..
Sat Feb 20 09:44:43 2021 - [info]
Sat Feb 20 09:44:43 2021 - [info] * Phase 1: Configuration Check Phase..
Sat Feb 20 09:44:43 2021 - [info]
Sat Feb 20 09:44:43 2021 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Feb 20 09:44:43 2021 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Sat Feb 20 09:44:43 2021 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Sat Feb 20 09:44:44 2021 - [info] GTID failover mode = 1
Sat Feb 20 09:44:44 2021 - [info] Current Alive Master: 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info] Alive Slaves:
Sat Feb 20 09:44:44 2021 - [info]   192.168.163.16(192.168.163.16:3310)  Version=5.7.32-log (oldest major version between slaves) log-bin:enabled
Sat Feb 20 09:44:44 2021 - [info]     GTID ON
Sat Feb 20 09:44:44 2021 - [info]     Replicating from 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info]     Primary candidate for the new Master (candidate_master is set)
Sat Feb 20 09:44:44 2021 - [info]   192.168.163.20(192.168.163.20:3310)  Version=5.7.32-log (oldest major version between slaves) log-bin:enabled
Sat Feb 20 09:44:44 2021 - [info]     GTID ON
Sat Feb 20 09:44:44 2021 - [info]     Replicating from 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info]     Primary candidate for the new Master (candidate_master is set)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.163.3(192.168.163.3:3310)? (YES/no): yes
Sat Feb 20 09:44:47 2021 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] Checking MHA is not monitoring or doing failover..
Sat Feb 20 09:44:47 2021 - [info] Checking replication health on 192.168.163.16..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] Checking replication health on 192.168.163.20..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] 192.168.163.16 can be new master.
Sat Feb 20 09:44:47 2021 - [info]
From:
192.168.163.3(192.168.163.3:3310) (current master)
 +--192.168.163.16(192.168.163.16:3310)
 +--192.168.163.20(192.168.163.20:3310)

To:
192.168.163.16(192.168.163.16:3310) (new master)
 +--192.168.163.20(192.168.163.20:3310)
 +--192.168.163.3(192.168.163.3:3310)

Starting master switch from 192.168.163.3(192.168.163.3:3310) to 192.168.163.16(192.168.163.16:3310)? (yes/NO): yes
Sat Feb 20 09:44:49 2021 - [info] Checking whether 192.168.163.16(192.168.163.16:3310) is ok for the new master..
Sat Feb 20 09:44:49 2021 - [info]  ok.
Sat Feb 20 09:44:49 2021 - [info] 192.168.163.3(192.168.163.3:3310): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Feb 20 09:44:49 2021 - [info] 192.168.163.3(192.168.163.3:3310): Resetting slave pointing to the dummy host.
Sat Feb 20 09:44:49 2021 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Feb 20 09:44:49 2021 - [info]
Sat Feb 20 09:44:49 2021 - [info] * Phase 2: Rejecting updates Phase..
Sat Feb 20 09:44:49 2021 - [info] master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
Sat Feb 20 09:44:50 2021 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Feb 20 09:44:50 2021 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Feb 20 09:44:50 2021 - [info]  ok.
Sat Feb 20 09:44:50 2021 - [info] Orig master binlog:pos is mysql-bin.000002:194.
Sat Feb 20 09:44:50 2021 - [info] Waiting to execute all relay logs on 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:50 2021 - [info] master_pos_wait(mysql-bin.000002:194) completed on 192.168.163.16(192.168.163.16:3310). Executed 0 events.
Sat Feb 20 09:44:50 2021 - [info] done.
Sat Feb 20 09:44:50 2021 - [info] Getting new master's binlog name and position..
Sat Feb 20 09:44:50 2021 - [info] mysql-bin.000001:2509
Sat Feb 20 09:44:50 2021 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.163.16', MASTER_PORT=3310, MASTER_AUTO_POSITION=1, MASTER_USER='slave', MASTER_PASSWORD='xxx';
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info] * Switching slaves in parallel..
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info] -- Slave switch on host 192.168.163.20(192.168.163.20:3310) started, pid: 1716
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:52 2021 - [info] Log messages from 192.168.163.20 ...
Sat Feb 20 09:44:52 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info]  Waiting to execute all relay logs on 192.168.163.20(192.168.163.20:3310)..
Sat Feb 20 09:44:50 2021 - [info]  master_pos_wait(mysql-bin.000002:194) completed on 192.168.163.20(192.168.163.20:3310). Executed 0 events.
Sat Feb 20 09:44:50 2021 - [info]  done.
Sat Feb 20 09:44:50 2021 - [info]  Resetting slave 192.168.163.20(192.168.163.20:3310) and starting replication from the new master 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:50 2021 - [info]  Executed CHANGE MASTER.
Sat Feb 20 09:44:51 2021 - [info]  Slave started.
Sat Feb 20 09:44:52 2021 - [info] End of log messages from 192.168.163.20 ...
Sat Feb 20 09:44:52 2021 - [info]
Sat Feb 20 09:44:52 2021 - [info] -- Slave switch on host 192.168.163.20(192.168.163.20:3310) succeeded.
Sat Feb 20 09:44:52 2021 - [info] Unlocking all tables on the orig master:
Sat Feb 20 09:44:52 2021 - [info] Executing UNLOCK TABLES..
Sat Feb 20 09:44:52 2021 - [info]  ok.
Sat Feb 20 09:44:52 2021 - [info] Starting orig master as a new slave..
Sat Feb 20 09:44:52 2021 - [info] Resetting slave 192.168.163.3(192.168.163.3:3310) and starting replication from the new master 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:52 2021 - [info] Executed CHANGE MASTER.
Sat Feb 20 09:44:53 2021 - [info] Slave started.
Sat Feb 20 09:44:53 2021 - [info] All new slave servers switched successfully.
Sat Feb 20 09:44:53 2021 - [info]
Sat Feb 20 09:44:53 2021 - [info] * Phase 5: New master cleanup phase..
Sat Feb 20 09:44:53 2021 - [info]
Sat Feb 20 09:44:53 2021 - [info] 192.168.163.16: Resetting slave info succeeded.
Sat Feb 20 09:44:53 2021 - [info] Switching master to 192.168.163.16(192.168.163.16:3310) completed successfully.
If the switchover fails, check the manager log (/etc/mha_master/manager.log) for the cause before retrying.
5.3 Restart MHA
[root@mha1 ~]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
[1] 1733
5.4 Check the Current Cluster State
The status check now shows that the master role has moved to mha2:
[root@mha1 ~]# masterha_check_status --conf=/etc/mha_master/mha.cnf
mha (pid:1733) is running(0:PING_OK), master:192.168.163.16
mha1: the original master
mha2: the current master, 192.168.163.16
The cluster manager itself is still deployed on 192.168.163.3; only the master role has been switched to mha2.
mha3: 192.168.163.20
6. Automatic Failover
Using the manual switchover procedure above, the master can be switched back from mha2 to mha1, as sketched below.
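A sketch of the reverse switchover, mirroring the commands from section 5 with the hosts swapped (stop the manager first):
masterha_stop --conf=/etc/mha_master/mha.cnf
masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.3 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000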
6.1 Install Keepalived
Install it on all three hosts:
[root@mha3 ~ 10:21:01]# yum install keepalived -y
6.2 Configuration File
[root@mha1 ~]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
   notification_email {
     [email protected]
   }
   notification_email_from [email protected]
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MYSQL_HA
}

vrrp_script chk_mysql {
    script "killall -0 mysqld"   # MySQL health check: signal 0 only tests that a mysqld process exists
    interval 1
    weight -10
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33              # change to match the actual network interface
    virtual_router_id 51
    priority 100                 # priority: 100 on mha1, 99 on mha2, 98 on mha3
    advert_int 1
    nopreempt                    # do not preempt; set only on the master
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.163.23
    }
    track_script {
        chk_mysql
    }
}
6.3 Copy the Configuration File
[root@mha1 ~]# scp /etc/keepalived/keepalived.conf [email protected]:/etc/keepalived/keepalived.conf
keepalived.conf                        100% 1290   836.2KB/s   00:00
[root@mha1 ~]# scp /etc/keepalived/keepalived.conf [email protected]:/etc/keepalived/keepalived.conf
keepalived.conf
6.4 Adjust keepalived.conf
mha2:
priority 99    # priority: 100 on mha1, 99 on mha2, 98 on mha3
mha3:
priority 98    # priority: 100 on mha1, 99 on mha2, 98 on mha3
6.5 Start Keepalived
[root@mha1 ~]# systemctl start keepalived
[root@mha1 ~]# systemctl status keepalived
[root@mha1 ~]# systemctl enable keepalived
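To confirm which node currently holds the virtual IP, assuming the ens33 interface and the 192.168.163.23 VIP from the configuration above:
ip addr show ens33 | grep 192.168.163.23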
6.6 Simulate a Failure
Stop the master instance on mha1.
mha1:
[root@mha1 ~]# systemctl stop mysqld3310
[1]+  Done    nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1
Check the cluster state:
[root@mha1 /etc/keepalived]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
mha2:
mysql> show slave status\G
Empty set (0.00 sec)

mysql> show master status;
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                  |
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
| mysql-bin.000002 |      234 |              |                  | b883ad36-731c-11eb-8f2b-000c297b5e9b:1-3, d85d6dce-7287-11eb-9701-000c2902c557:1-8 |
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show processlist;
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
| Id | User  | Host                 | db   | Command          | Time | State                                                         | Info             |
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
| 11 | slave | 192.168.163.20:58982 | NULL | Binlog Dump GTID |  598 | Master has sent all binlog to slave; waiting for more updates | NULL             |
| 15 | root  | localhost            | NULL | Query            |    0 | starting                                                      | show processlist |
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
2 rows in set (0.00 sec)

mha3:
mysql> show slave status\G
6.7 Recover the Failed Node
First re-establish replication between mha1 and mha2: mha2 is now the master, and mha1 rejoins as a slave. The master's current coordinates were shown on mha2 above (mysql-bin.000002, position 234).
mha1:
mysql> change master to master_host='192.168.163.16',master_user='slave',master_port=3310,master_password='123',master_log_file='mysql-bin.000002',master_log_pos=234;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> show slave status\G
Check the cluster state; mha1 is no longer among the live nodes:
[root@mha1 /etc/keepalived]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
mha1 now has to be added back to the configuration file: the failover removed the dead master's section from mha.cnf, and the manager reads the configuration at startup, so the section must be restored by hand:
[server1]
candidate_master=1
hostname=192.168.163.3
port=3310
ssh_port=22
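Alternatively, the masterha_conf_host tool listed in section 4 can restore the entry; a sketch, assuming its documented add/delete interface:
masterha_conf_host --command=add --conf=/etc/mha_master/mha.cnf --hostname=192.168.163.3 --block=server1 --params="candidate_master=1;port=3310;ssh_port=22"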
Restart the manager:
[root@mha1 /etc/keepalived]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
[1] 5023
6.8 Re-check the Cluster State
[root@mha1 /etc/keepalived]# masterha_check_status --conf=/etc/mha_master/mha.cnf