使用repmgrd实现postgresql failover和auto failover

  • 2019 年 12 月 18 日
  • 笔记

前面的文章介绍了postgresql基于repmgr的高可用及切换方案,这篇文章主要聊聊通过repmgrd实现failover及auto failover。

前提是部署好postgresql主从,同时部署好repmgr。

[postgres@node1 ~]$ repmgr cluster show   ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | primary | * running |          | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

failover

停止主库,模拟主库故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/  waiting for server to shut down..... done  server stopped

备库查看是unreachable状态

[postgres@node2 .ssh]$ repmgr cluster show   ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | primary | ? unreachable |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

备库提升为主库

[postgres@node2 ~]$ repmgr standby promote  NOTICE: promoting standby to primary  DETAIL: promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"  waiting for server to promote.... done  server promoted  NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete  NOTICE: STANDBY PROMOTE successful  DETAIL: server "node2" (ID: 2) was successfully promoted to primary

新主库查看集群状态

[postgres@node2 ~]$ repmgr cluster show   ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2    WARNING: following issues were detected    - unable to connect to node "node1" (ID: 1)

原主库执行rejoin操作重新加入集群

[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run  [postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose  INFO: looking for configuration file in /etc  INFO: configuration file found at: "/etc/repmgr.conf"  INFO: prerequisites for using pg_rewind are met  INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"  NOTICE: executing pg_rewind  DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"  NOTICE: 2 files copied to /pgdata  INFO: directory "/tmp/repmgr-config-archive-node1" deleted  INFO: deleting "recovery.done"  NOTICE: setting node 1's upstream to node 2  WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"  DETAIL: PQping() returned "PQPING_NO_RESPONSE"  NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"  INFO: demoted primary is pingable  INFO: node 1 has attached to its upstream node  NOTICE: NODE REJOIN successful  DETAIL: node 1 is now attached to node 2

查看集群状态

[postgres@node1 pgdata]$ repmgr cluster show   ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | standby |   running | node2    | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

auto failover

可以利用repmgrd进程实现自动的failover,首先要在repmgr.conf文件中将location参数设置为一致,不设置的话默认也是一致的。同时启动repmgrd必须在postgres.conf配置文件中设置shared_preload_libraries='repmgr'

修改主备库repmgr.conf文件

failover=automatic  promote_command='/pgsql/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'  follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'  log_file=/home/postgres/repmgrd.log  monitoring_history=true (启用监控参数)  monitor_interval_secs=5(定义监视数据间隔写入时间参数)  reconnect_attempts=10(故障转移之前,尝试重新连接主库次数(默认为6)参数)  reconnect_interval=5(每间隔5s尝试重新连接一次参数)

重启主备库使修改生效

[postgres@node1 ~]$ repmgr node service --action=restart  DETAIL: executing server command "pg_ctl  -w -D '/pgdata' restart"

主备库启动repmgrd

[postgres@node1 ~]$ repmgrd –f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid  [2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"

模拟主库故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/  waiting for server to shut down..... done  server stopped

查看备库日志,发现已经升为主库

[2019-09-20 12:02:52] [NOTICE] promoting standby to primary  [2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"  [2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete  [2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful  [2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary  [2019-09-20 12:02:52] [INFO] 0 followers to notify  [2019-09-20 12:02:52] [INFO] switching to primary monitoring mode  [2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)

查看cluster状态,备库已经升主

[postgres@node2 ~]$ repmgr cluster show   ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2    WARNING: following issues were detected    - unable to connect to node "node1" (ID: 1)

原主库执行rejoin加入集群

[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run  [postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose  INFO: looking for configuration file in /etc  INFO: configuration file found at: "/etc/repmgr.conf"  INFO: prerequisites for using pg_rewind are met  INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"  NOTICE: executing pg_rewind  DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"  NOTICE: 2 files copied to /pgdata  INFO: directory "/tmp/repmgr-config-archive-node1" deleted  INFO: deleting "recovery.done"  NOTICE: setting node 1's upstream to node 2  WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"  DETAIL: PQping() returned "PQPING_NO_RESPONSE"  NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"  INFO: demoted primary is pingable  INFO: node 1 has attached to its upstream node  NOTICE: NODE REJOIN successful  DETAIL: node 1 is now attached to node 2

查看集群状态

[postgres@node1 ~]$ repmgr cluster show   ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string  ----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------   1  | node1 | standby |   running | node2    | default  | 100      | 5        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2   2  | node2 | primary | * running |          | default  | 100      | 6        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2