Deploying a MySQL Cluster with Keepalived 1.3.5 + MHA
MHA:
MHA's working principle can be summarized as follows:
- Save the binary log events (binlog events) from the crashed master;
- Identify the slave that holds the most recent updates;
- Apply the differential relay log to the other slaves;
- Apply the binlog events saved from the master;
- Promote one slave to be the new master;
- Point the remaining slaves at the new master and resume replication.
MHA features:
- Failover completes in roughly 10-30 seconds
- Automatic master monitoring and failover
- Online switchover of the master to another host
1. Environment Overview
| Server role | IP | Node | Port | Responsibility |
| --- | --- | --- | --- | --- |
| manager, master | 192.168.163.3 | mha1 | 3310 | monitoring, primary master |
| slave1 | 192.168.163.16 | mha2 | 3310 | slave |
| slave2 | 192.168.163.20 | mha3 | 3310 | slave |
OS: CentOS 7.8
MySQL: MySQL 5.7.23
MHA: mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
     mha4mysql-node-0.58-0.el7.centos.noarch.rpm
Keepalived: keepalived-1.3.5-19.el7.x86_64
Disable the firewall:
systemctl stop firewalld
systemctl disable firewalld
1.1 Configure the yum Repository
MHA pulls in many dependency packages, and many yum mirrors reachable from China are missing some of them. Using the repository below saves most of that time. (Quoting the heredoc delimiter as 'EOF' keeps the shell from expanding $basearch before the file is written.)
cat > /etc/yum.repos.d/epel.repo << 'EOF'
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
failovermethod=priority
enabled=1
gpgcheck=0
mirrorlist=https://mirrors.fedoraproject.org/mirrorlist?repo=epel-7&arch=$basearch
EOF
1.2 Install Dependencies
All three hosts need the dependency packages.
yum install perl-DBI perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes perl-Params-Validate perl-DateTime -y
yum install cpan perl-ExtUtils-Embed perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker -y
1.3 Install the MHA Packages
Install the node package on all three hosts; install the manager package only on mha1, the monitoring host.
mha1:
rpm -ivh mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
mha2:
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
mha3:
rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm
1.4 Set the Hostnames and hosts File
Apply on all three hosts:
[root@mha1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.163.3 mha1
192.168.163.16 mha2
192.168.163.20 mha3
1.5 Passwordless SSH Between Hosts
mha1:
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
Repeat on each of the three hosts, then test logins until every host can reach the others directly:
ssh 192.168.163.3
ssh 192.168.163.16
ssh 192.168.163.20
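A quick way to confirm every hop works without a password is a small loop; BatchMode makes ssh fail instead of prompting. A minimal sketch, run from each host:
# verify passwordless ssh to all three nodes in one pass
for h in 192.168.163.3 192.168.163.16 192.168.163.20; do
    ssh -o BatchMode=yes root@$h hostname || echo "passwordless ssh to $h failed"
done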
2. MySQL Installation and Deployment
2.1 MySQL Configuration File
Note the server_id: every instance in the same cluster must use a different value.
[root@mha1 ~]# vim /data/3310/my.cnf
[mysqld]
basedir=/application/mysql
datadir=/data/3310/data
socket=/data/3310/mysql.sock
log_error=/data/3310/mysql.log
port=3310
server_id=331003
log_bin=/data/3310/mysql-bin
gtid-mode=on
enforce-gtid-consistency=true
slow_query_log=ON
skip-name-resolve      # recommended, not required
relay_log_purge = 0    # do not auto-purge relay logs
log_slave_updates = 1  # write replicated changes to the slave's own binlog; required, or data may be lost after a switchover
relay-log = relay-log
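For example, the other two hosts could use values like the following (hypothetical IDs following the port-plus-IP-suffix convention above; any unique integers work):
# mha1 (192.168.163.3):  server_id=331003
# mha2 (192.168.163.16): server_id=331016   # illustrative value
# mha3 (192.168.163.20): server_id=331020   # illustrative value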
2.2 Register the systemd Service
[root@mha1 ~]# vim /etc/systemd/system/mysqld3310.service
[Unit]
Description=MySQL Server
Documentation=man:mysqld(8)
Documentation=https://dev.mysql.com/doc/refman/en/using-systemd.html
After=network.target
After=syslog.target

[Install]
WantedBy=multi-user.target

[Service]
User=mysql
Group=mysql
ExecStart=/application/mysql/bin/mysqld --defaults-file=/data/3310/my.cnf
LimitNOFILE = 5000
2.3 Initialize MySQL
/application/mysql/bin/mysqld --initialize-insecure --user=mysql --basedir=/application/mysql --datadir=/data/3310/data
2.4 Start MySQL
systemctl daemon-reload
systemctl start mysqld3310
systemctl enable mysqld3310
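A quick sanity check that the instance is up, assuming the socket path from my.cnf above:
/application/mysql/bin/mysql -S /data/3310/mysql.sock -e "select @@server_id, @@port;"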
3. One Master, Two Slaves
3.1 mha1
mysql> create user 'slave'@'192.168.163.%' identified by '123';
mysql> create user mha@'192.168.163.%' identified by '123';
mysql> grant replication slave on *.* to slave@'192.168.163.%' identified by '123';
mysql> grant all on *.* to mha@'192.168.163.%';
mysql> flush privileges;
Alternatively, the slaves can be seeded with mysqldump. Here the instances are freshly initialized and hold no data, so this step is optional.
mysqldump -uroot -p123 -S /data/3310/mysql.sock -A --master-data=2 --single-transaction -R -E --triggers > /tmp/full.sql
On each slave: source /tmp/full.sql
Check the master's status on the primary:
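The standard way to read the current binlog coordinates (used in the change master statements below) is:
mysql> show master status;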
3.2 mha2
mysql> change master to master_host='192.168.163.3',master_user='slave',master_port=3310,master_password='123',master_log_file='master-log.000003',master_log_pos=234;
If the slave then reports:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
this error appeared after the log-bin-index parameter was added on the master. It means the requested binlog file is not in the master's index (for instance, when the file name passed to change master does not match the master's actual log_bin base name, which is mysql-bin in the configuration from section 2.1). Resetting the slave clears it:
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
3.3 mha3
mysql> change master to master_host='192.168.163.3',master_user='slave',master_port=3310,master_password='123',master_log_file='master-log.000003',master_log_pos=234;
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
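On each slave, verify that both replication threads are running; a healthy show slave status output contains lines like the following (illustrative excerpt):
mysql> show slave status\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes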
4. MHA Setup
Manager tools:
masterha_check_ssh        check MHA's SSH configuration
masterha_check_repl       check MySQL replication health
masterha_manager          start MHA
masterha_check_status     check the current MHA run state
masterha_master_monitor   detect whether the master is down
masterha_master_switch    control failover (automatic or manual)
masterha_conf_host        add or remove a configured server entry
Node tools:
save_binary_logs          save and copy the master's binary logs
apply_diff_relay_logs     identify differential relay log events and apply them to other slaves
filter_mysqlbinlog        strip unnecessary ROLLBACK events (no longer used by MHA)
purge_relay_logs          purge relay logs (without blocking the SQL thread)
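Because relay_log_purge=0 was set in section 2.1, relay logs are never removed automatically; a common companion practice is to schedule purge_relay_logs on each slave. A sketch of a cron entry, assuming the mha account created earlier and the default rpm install path:
# crontab -e (on each slave); illustrative schedule and log path
0 4 * * * /usr/bin/purge_relay_logs --user=mha --password=123 --port=3310 --disable_relay_log_purge >> /tmp/purge_relay_logs.log 2>&1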
Start MHA:
nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
Stop MHA:
masterha_stop --conf=/etc/mha_master/mha.cnf
Cluster status check:
masterha_check_status --conf=/etc/mha_master/mha.cnf
Cluster SSH connectivity check:
masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Cluster replication check:
masterha_check_repl --conf=/etc/mha_master/mha.cnf
Cluster switchover:
masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.16 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000
4.1 Install the Software
Already installed in section 1.3.
4.2 Configuration File
mha1:
mkdir -p /etc/mha_master/app1
touch /etc/mha_master/manager.log
cat > /etc/mha_master/mha.cnf <<EOF
[server default]
manager_log=/etc/mha_master/manager.log
manager_workdir=/etc/mha_master/app1
password=123
ping_interval=1
remote_workdir=/tmp
repl_password=123
repl_user=slave
ssh_user=root
user=mha

[server1]
candidate_master=1
hostname=192.168.163.3
port=3310
ssh_port=22

[server2]
candidate_master=1
hostname=192.168.163.16
port=3310
ssh_port=22

[server3]
candidate_master=1
hostname=192.168.163.20
port=3310
ssh_port=22
EOF
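Note that no master_ip_failover_script or master_ip_online_change_script is defined here: in this walkthrough the virtual IP is handled by keepalived (section 6) rather than by MHA itself. If you preferred MHA-managed VIP switching instead, you would add something like the following to [server default] (the script paths are hypothetical):
# master_ip_failover_script=/usr/local/bin/master_ip_failover
# master_ip_online_change_script=/usr/local/bin/master_ip_online_change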
Parameter details can be found in the MHA documentation online.
4.3 Cluster Health Checks
SSH connectivity check:
[root@mha1 ~]# masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Replication check:
[root@mha1 ~]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
4.4 Start the Cluster
[root@mha1 ~]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
Cluster status:
[root@mha1 ~]# masterha_check_status --conf=/etc/mha_master/mha.cnf
5. Manual Switchover
5.1 Stop MHA Monitoring
Stop the manager:
[root@mha1 ~]# masterha_stop --conf=/etc/mha_master/mha.cnf
5.2 Switch the Master to mha2
[root@mha1 ~]# masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.16 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000
Sat Feb 20 09:44:43 2021 - [info] MHA::MasterRotate version 0.58.
Sat Feb 20 09:44:43 2021 - [info] Starting online master switch..
Sat Feb 20 09:44:43 2021 - [info]
Sat Feb 20 09:44:43 2021 - [info] * Phase 1: Configuration Check Phase..
Sat Feb 20 09:44:43 2021 - [info]
Sat Feb 20 09:44:43 2021 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Feb 20 09:44:43 2021 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Sat Feb 20 09:44:43 2021 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Sat Feb 20 09:44:44 2021 - [info] GTID failover mode = 1
Sat Feb 20 09:44:44 2021 - [info] Current Alive Master: 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info] Alive Slaves:
Sat Feb 20 09:44:44 2021 - [info]   192.168.163.16(192.168.163.16:3310)  Version=5.7.32-log (oldest major version between slaves) log-bin:enabled
Sat Feb 20 09:44:44 2021 - [info]     GTID ON
Sat Feb 20 09:44:44 2021 - [info]     Replicating from 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info]     Primary candidate for the new Master (candidate_master is set)
Sat Feb 20 09:44:44 2021 - [info]   192.168.163.20(192.168.163.20:3310)  Version=5.7.32-log (oldest major version between slaves) log-bin:enabled
Sat Feb 20 09:44:44 2021 - [info]     GTID ON
Sat Feb 20 09:44:44 2021 - [info]     Replicating from 192.168.163.3(192.168.163.3:3310)
Sat Feb 20 09:44:44 2021 - [info]     Primary candidate for the new Master (candidate_master is set)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.163.3(192.168.163.3:3310)? (YES/no): yes
Sat Feb 20 09:44:47 2021 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] Checking MHA is not monitoring or doing failover..
Sat Feb 20 09:44:47 2021 - [info] Checking replication health on 192.168.163.16..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] Checking replication health on 192.168.163.20..
Sat Feb 20 09:44:47 2021 - [info]  ok.
Sat Feb 20 09:44:47 2021 - [info] 192.168.163.16 can be new master.
Sat Feb 20 09:44:47 2021 - [info]
From:
192.168.163.3(192.168.163.3:3310) (current master)
 +--192.168.163.16(192.168.163.16:3310)
 +--192.168.163.20(192.168.163.20:3310)

To:
192.168.163.16(192.168.163.16:3310) (new master)
 +--192.168.163.20(192.168.163.20:3310)
 +--192.168.163.3(192.168.163.3:3310)

Starting master switch from 192.168.163.3(192.168.163.3:3310) to 192.168.163.16(192.168.163.16:3310)? (yes/NO): yes
Sat Feb 20 09:44:49 2021 - [info] Checking whether 192.168.163.16(192.168.163.16:3310) is ok for the new master..
Sat Feb 20 09:44:49 2021 - [info]  ok.
Sat Feb 20 09:44:49 2021 - [info] 192.168.163.3(192.168.163.3:3310): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Feb 20 09:44:49 2021 - [info] 192.168.163.3(192.168.163.3:3310): Resetting slave pointing to the dummy host.
Sat Feb 20 09:44:49 2021 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Feb 20 09:44:49 2021 - [info]
Sat Feb 20 09:44:49 2021 - [info] * Phase 2: Rejecting updates Phase..
Sat Feb 20 09:44:49 2021 - [info] master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
Sat Feb 20 09:44:50 2021 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Feb 20 09:44:50 2021 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Feb 20 09:44:50 2021 - [info]  ok.
Sat Feb 20 09:44:50 2021 - [info] Orig master binlog:pos is mysql-bin.000002:194.
Sat Feb 20 09:44:50 2021 - [info] Waiting to execute all relay logs on 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:50 2021 - [info] master_pos_wait(mysql-bin.000002:194) completed on 192.168.163.16(192.168.163.16:3310). Executed 0 events.
Sat Feb 20 09:44:50 2021 - [info] done.
Sat Feb 20 09:44:50 2021 - [info] Getting new master's binlog name and position..
Sat Feb 20 09:44:50 2021 - [info] mysql-bin.000001:2509
Sat Feb 20 09:44:50 2021 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.163.16', MASTER_PORT=3310, MASTER_AUTO_POSITION=1, MASTER_USER='slave', MASTER_PASSWORD='xxx';
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info] * Switching slaves in parallel..
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info] -- Slave switch on host 192.168.163.20(192.168.163.20:3310) started, pid: 1716
Sat Feb 20 09:44:50 2021 - [info]
Sat Feb 20 09:44:52 2021 - [info] Log messages from 192.168.163.20 ...
Sat Feb 20 09:44:52 2021 - [info]
Sat Feb 20 09:44:50 2021 - [info]  Waiting to execute all relay logs on 192.168.163.20(192.168.163.20:3310)..
Sat Feb 20 09:44:50 2021 - [info]  master_pos_wait(mysql-bin.000002:194) completed on 192.168.163.20(192.168.163.20:3310). Executed 0 events.
Sat Feb 20 09:44:50 2021 - [info]  done.
Sat Feb 20 09:44:50 2021 - [info]  Resetting slave 192.168.163.20(192.168.163.20:3310) and starting replication from the new master 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:50 2021 - [info]  Executed CHANGE MASTER.
Sat Feb 20 09:44:51 2021 - [info]  Slave started.
Sat Feb 20 09:44:52 2021 - [info] End of log messages from 192.168.163.20 ...
Sat Feb 20 09:44:52 2021 - [info]
Sat Feb 20 09:44:52 2021 - [info] -- Slave switch on host 192.168.163.20(192.168.163.20:3310) succeeded.
Sat Feb 20 09:44:52 2021 - [info] Unlocking all tables on the orig master:
Sat Feb 20 09:44:52 2021 - [info] Executing UNLOCK TABLES..
Sat Feb 20 09:44:52 2021 - [info]  ok.
Sat Feb 20 09:44:52 2021 - [info] Starting orig master as a new slave..
Sat Feb 20 09:44:52 2021 - [info] Resetting slave 192.168.163.3(192.168.163.3:3310) and starting replication from the new master 192.168.163.16(192.168.163.16:3310)..
Sat Feb 20 09:44:52 2021 - [info] Executed CHANGE MASTER.
Sat Feb 20 09:44:53 2021 - [info] Slave started.
Sat Feb 20 09:44:53 2021 - [info] All new slave servers switched successfully.
Sat Feb 20 09:44:53 2021 - [info]
Sat Feb 20 09:44:53 2021 - [info] * Phase 5: New master cleanup phase..
Sat Feb 20 09:44:53 2021 - [info]
Sat Feb 20 09:44:53 2021 - [info] 192.168.163.16: Resetting slave info succeeded.
Sat Feb 20 09:44:53 2021 - [info] Switching master to 192.168.163.16(192.168.163.16:3310) completed successfully.
If the switchover fails, check the manager log (/etc/mha_master/manager.log) for the cause before retrying.
5.3 Restart MHA
[root@mha1 ~]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
[1] 1733
5.4 Check the Current Cluster State
The status check now shows that the master role has moved to mha2:
[root@mha1 ~]# masterha_check_status --conf=/etc/mha_master/mha.cnf
mha (pid:1733) is running(0:PING_OK), master:192.168.163.16
mha1: the original master
mha2: the current master, 192.168.163.16
The cluster manager itself is still deployed on 192.168.163.3; only the master role has been switched to mha2.
mha3: 192.168.163.20
6. Automatic Failover
Using the manual switchover procedure above, the master can be switched back from mha2 to mha1, as sketched below.
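A sketch of the reverse switchover, mirroring the commands from section 5 with the hosts swapped (stop the manager first):
masterha_stop --conf=/etc/mha_master/mha.cnf
masterha_master_switch --conf=/etc/mha_master/mha.cnf --master_state=alive --new_master_host=192.168.163.3 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000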
6.1 Install Keepalived
Install it on all three hosts:
[root@mha3 ~ 10:21:01]# yum install keepalived -y
6.2 Configuration File
[root@mha1 ~]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
   notification_email {
     [email protected]
   }
   notification_email_from [email protected]
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MYSQL_HA
}

vrrp_script chk_mysql {
    script "killall -0 mysqld"   # MySQL health check: signal 0 only tests that a mysqld process exists
    interval 1
    weight -10
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33              # change to match the actual network interface
    virtual_router_id 51
    priority 100                 # priority: 100 on mha1, 99 on mha2, 98 on mha3
    advert_int 1
    nopreempt                    # do not preempt; set only on the master
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.163.23
    }
    track_script {
        chk_mysql
    }
}
6.3 Copy the Configuration File
[root@mha1 ~]# scp /etc/keepalived/keepalived.conf [email protected]:/etc/keepalived/keepalived.conf
keepalived.conf                        100% 1290   836.2KB/s   00:00
[root@mha1 ~]# scp /etc/keepalived/keepalived.conf [email protected]:/etc/keepalived/keepalived.conf
keepalived.conf
6.4 Adjust keepalived.conf
mha2:
priority 99    # priority: 100 on mha1, 99 on mha2, 98 on mha3
mha3:
priority 98    # priority: 100 on mha1, 99 on mha2, 98 on mha3
6.5 Start Keepalived
[root@mha1 ~]# systemctl start keepalived
[root@mha1 ~]# systemctl status keepalived
[root@mha1 ~]# systemctl enable keepalived
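To confirm which node currently holds the virtual IP, assuming the ens33 interface and the 192.168.163.23 VIP from the configuration above:
ip addr show ens33 | grep 192.168.163.23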
6.6 Simulate a Failure
Stop the master instance on mha1.
mha1:
[root@mha1 ~]# systemctl stop mysqld3310
[1]+  Done    nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1
Check the cluster state:
[root@mha1 /etc/keepalived]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
mha2:
mysql> show slave status\G
Empty set (0.00 sec)

mysql> show master status;
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                  |
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
| mysql-bin.000002 |      234 |              |                  | b883ad36-731c-11eb-8f2b-000c297b5e9b:1-3, d85d6dce-7287-11eb-9701-000c2902c557:1-8 |
+------------------+----------+--------------+------------------+------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show processlist;
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
| Id | User  | Host                 | db   | Command          | Time | State                                                         | Info             |
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
| 11 | slave | 192.168.163.20:58982 | NULL | Binlog Dump GTID |  598 | Master has sent all binlog to slave; waiting for more updates | NULL             |
| 15 | root  | localhost            | NULL | Query            |    0 | starting                                                      | show processlist |
+----+-------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+
2 rows in set (0.00 sec)

mha3:
mysql> show slave status\G
6.7 Recover the Failed Node
First re-establish replication between mha1 and mha2: mha2 is now the master, and mha1 rejoins as a slave. The master's current coordinates were shown on mha2 above (mysql-bin.000002, position 234).
mha1:
mysql> change master to master_host='192.168.163.16',master_user='slave',master_port=3310,master_password='123',master_log_file='mysql-bin.000002',master_log_pos=234;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> show slave status\G
Check the cluster state; mha1 is no longer among the live nodes:
[root@mha1 /etc/keepalived]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
mha1 now has to be added back to the configuration file: the failover removed the dead master's section from mha.cnf, and the manager reads the configuration at startup, so the section must be restored by hand:
[server1]
candidate_master=1
hostname=192.168.163.3
port=3310
ssh_port=22
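Alternatively, the masterha_conf_host tool listed in section 4 can restore the entry; a sketch, assuming its documented add/delete interface:
masterha_conf_host --command=add --conf=/etc/mha_master/mha.cnf --hostname=192.168.163.3 --block=server1 --params="candidate_master=1;port=3310;ssh_port=22"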
Restart the manager:
[root@mha1 /etc/keepalived]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /etc/mha_master/manager.log 2>&1 &
[1] 5023
6.8 Re-check the Cluster State
[root@mha1 /etc/keepalived]# masterha_check_status --conf=/etc/mha_master/mha.cnf