Kafka服務不可用(宕機)問題踩坑記

  • 2019 年 10 月 3 日
  • 筆記

背景

某線上日誌收集服務報警,打開域名報502錯誤碼。 收集服務由2台netty HA服務器組成,netty服務器將客戶端投遞來的protobuf日誌解析並發送到kafka,打開其中一個應用的日誌,發現如下報錯:

org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s)  

在排除了netty服務的錯誤之後,去查看kafka的日誌。 發現報錯,排查過程如下;

配置信息

系統 kafka版本 broker數量
CentOS7.4 2.1.0 3

線上有三台Kafka Broker,id分別為0、1、2,服務器只部署了Kafka服務。

問題

線程是否存活

首先jps查看Kafka線程是否存活,三台機器都沒問題,kafka依然在運行。

GC問題

查看kafkaServer-gc.log.1.current的日誌,gc日誌沒發現異常。

Broker 0/server.log
[2019-08-02 15:17:03,699] WARN Attempting to send response via channel for which there is no open connection, connection id 172.21.3.14:9092-172.21.3.11:54311-107706 (kafka.network.Processor)  [2019-08-02 15:19:12,490] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)  [2019-08-02 15:26:54,405] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1112819217, epoch=1897450) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)  [2019-08-02 15:26:54,411] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={galaxy_client-7=(offset=15680912, logStartOffset=14755985, maxBytes=1048576, currentLeaderEpoch=Optional[9])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1112819217, epoch=1897450)) (kafka.server.ReplicaFetcherThread)  java.io.IOException: Connection to 1 was disconnected before the response was read          at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)          at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)          at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)          at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)          at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)          at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)          at scala.Option.foreach(Option.scala:257)          at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)          at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)          at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)  [2019-08-02 15:27:26,433] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1112819217, epoch=INITIAL) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)  
Broker 1/server.log
[2019-08-02 15:26:53,751] WARN [GroupCoordinator 1]: Failed to write empty metadata for group StoreToHiveV2: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)  [2019-08-02 15:26:55,515] WARN [GroupCoordinator 1]: Failed to write empty metadata for group LPmkt: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)  [2019-08-02 15:26:56,124] WARN [GroupCoordinator 1]: Failed to write empty metadata for group Store2CarbonClientServer: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)  [2019-08-02 15:26:56,575] WARN [GroupCoordinator 1]: Failed to write empty metadata for group StoreToHive: The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)  
Broker 2/server.log
[2019-08-02 15:26:54,514] WARN [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=2, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={galaxy_client-7=(offset=15680912, logStartOffset=14755985, maxBytes=1048576, currentLeaderEpoch=Optional[9])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1550207796, epoch=1890003)) (kafka.server.ReplicaFetcherThread)  java.io.IOException: Connection to 1 was disconnected before the response was read          at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)          at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)          at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)          at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)          at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)          at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)          at scala.Option.foreach(Option.scala:257)          at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)          at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)          at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)  [2019-08-02 15:27:26,569] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=1550207796, epoch=INITIAL) to node 1: java.io.IOException: Connection to 1 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)  

問題排查

由於是線上應用,基本沒時間排查問題,所以立即重啟了三台Kafka,重啟後得到緩解。

在服務正常之後,在網上搜了很多答案,有的人說是網絡問題,close wait過多導致broker網絡不通,從而Kafka產生腦裂問題。 在各個服務器上運行命令:

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'  

結果

ESTABLISHED 172  TIME_WAIT 32  

也沒發現異常,等下次再出現問題時,要運行這個命令看看服務器的狀態。

社區bug?

https://issues.apache.org/jira/browse/KAFKA-6582

下方有人評論:We are running 2.1.1 in production (and four other environments) since March without this issue showing again.

所以找時間升級Kafka版本吧~