Problems encountered while operating a MongoDB replica set

  • October 4, 2019
  • Notes

Root cause: insufficient understanding of MongoDB replica sets

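Environment:

192.168.12.6  PRIMARY state

192.168.12.4 SECONDARY state

192.168.12.5  SECONDARY state

192.168.2.1    dump node. Its mongod process had already crashed earlier because the disk ran out of space, and this instance was also configured with a vote!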

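Timeline:

1. The DBA ran the shutdown command on the secondary node 192.168.12.5.

2. The two remaining hosts, 192.168.12.4 (secondary) and 192.168.12.6 (primary), both ended up in the SECONDARY state.

3. The application teams reported a large number of errors.

4. The DBA restarted the mongod process on 192.168.12.5, and the replica set returned to normal.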

Review:

The log below was taken from the primary node, 192.168.12.6:

2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.12.5:27017 - HostUnreachable: Connection reset by peer
2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 192.168.12.5:27017 due to failed operation on a connection
2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
2019-04-16T15:47:14.196+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.12.5:27017; HostUnreachable: Connection reset by peer
2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 192.168.12.5:27017
2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 192.168.12.5:27017 - HostUnreachable: Connection refused
2019-04-16T15:47:14.196+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 192.168.12.5:27017 due to failed operation on a connection
2019-04-16T15:47:14.196+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.12.5:27017; HostUnreachable: Connection refused
2019-04-16T15:47:14.197+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2019-04-16T15:47:14.197+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2019-04-16T15:47:14.198+0800 I REPL     [replExecDBWorker-0] transition to SECONDARY
2019-04-16T15:47:14.274+0800 I NETWORK  [conn476944080] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [192.168.3.11:38712]

The replica set configuration is as follows:

set01:SECONDARY> rs.conf()
{
    "_id" : "set01",
    "version" : 130099,
    "members" : [
        {
            "_id" : 6,
            "host" : "192.168.2.1:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : true,
            "priority" : 0,
            "tags" : {
                "dc" : "IDC1",
                "role" : "dump"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 7,
            "host" : "192.168.12.4:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 8,
            "host" : "192.168.12.5:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 9,
            "host" : "192.168.12.6:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
                "dc" : "IDC1"
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {
        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}

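From the above we can conclude the following. The set has four voting members, so electing or keeping a primary requires a strict majority of 3 votes. With 192.168.2.1 already down, shutting down 192.168.12.5 left only 2 reachable votes, which is exactly half and therefore not a majority. The primary 192.168.12.6 stepped down (the "can't see a majority of the set, relinquishing primary" line in the log), no new primary could be elected, and the set degraded to read-only [in rs.status() every reachable member shows the SECONDARY role]. This is why it is usually recommended to give a replica set an odd number of voting members.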
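The vote arithmetic behind this conclusion can be sketched in a few lines of JavaScript. This is only an illustration, not MongoDB's election code: the helper name voteSummary is made up for this note, and the members array is a hand-copied subset (host and votes) of the rs.conf() output above.

```javascript
// Voting members copied from the rs.conf() output (host and votes only).
const members = [
  { host: "192.168.2.1:27017", votes: 1 },  // hidden dump node
  { host: "192.168.12.4:27017", votes: 1 },
  { host: "192.168.12.5:27017", votes: 1 },
  { host: "192.168.12.6:27017", votes: 1 },
];

// Hypothetical helper: summarises the votes for a given list of unreachable hosts.
function voteSummary(members, downHosts) {
  const total = members.reduce((sum, m) => sum + m.votes, 0);
  const reachable = members
    .filter((m) => !downHosts.includes(m.host))
    .reduce((sum, m) => sum + m.votes, 0);
  const needed = Math.floor(total / 2) + 1; // strict majority of all configured votes
  return { total, reachable, needed, canElectPrimary: reachable >= needed };
}

// Only the dump node is down: 3 of 4 votes reachable, a primary can be kept.
console.log(voteSummary(members, ["192.168.2.1:27017"]));

// Dump node down and 192.168.12.5 shut down: 2 of 4 votes, no majority.
console.log(voteSummary(members, ["192.168.2.1:27017", "192.168.12.5:27017"]));
```

Running the same check before stopping a member shows whether the remaining nodes would still hold a majority.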

Fix:

    Remove the vote from the dump node (set its votes to 0).
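One possible way to make this change is sketched below. The helper withoutVote is a hypothetical name introduced for this note; it only builds a modified copy of the configuration document. Applying it still goes through rs.reconfig(), which has to be run on the current primary. The small sample object at the end exists only so the function can be tried outside the shell, for example with Node.js.

```javascript
// Hypothetical helper: returns a copy of a replica set config
// in which the given host no longer has a vote.
function withoutVote(cfg, host) {
  if (!cfg.members.some(function (m) { return m.host === host; })) {
    throw new Error("host not found in config: " + host);
  }
  const members = cfg.members.map(function (m) {
    // A non-voting member must also have priority 0
    // (the dump node already has priority 0 in this set).
    return m.host === host ? Object.assign({}, m, { votes: 0, priority: 0 }) : m;
  });
  return Object.assign({}, cfg, { members: members });
}

// In the mongo shell, connected to the PRIMARY:
//   rs.reconfig(withoutVote(rs.conf(), "192.168.2.1:27017"));

// Stand-alone demonstration with a trimmed-down sample config.
const sample = {
  _id: "set01",
  members: [
    { _id: 6, host: "192.168.2.1:27017", priority: 0, votes: 1 },
    { _id: 7, host: "192.168.12.4:27017", priority: 1, votes: 1 },
  ],
};
const updated = withoutVote(sample, "192.168.2.1:27017");
console.log(updated.members.map(function (m) { return m.host + " votes=" + m.votes; }));
```

With the dump node no longer voting, the set has three votes and needs two for a majority, so losing the dump node plus one data-bearing member no longer forces the primary to step down.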

References:

http://www.ttlsa.com/mongodb/mongodb-replicaset-internal/

https://blog.csdn.net/qq_24598601/article/details/81150614