Failure Analysis: A ServiceStack.Redis Deadlock Triggered by a Transient Network Outage

  • February 22, 2022
  • Notes

Background

One day in production, a process triggered an alert for an abnormally high number of active threads, and two nodes raised the same alert on the same day. This program had not produced this kind of alert in nearly a year, and nothing happens without a reason, so this post records the root-cause analysis.

The thread count over time, as seen in monitoring:

Initial Analysis and Findings

A dump was preserved from each faulty process before it was restarted, which is the prerequisite for the windbg analysis below.

Inspect the thread list and exception status with !threads:

0:000> !threads
PDB symbol for clr.dll not loaded
ThreadCount:      297
UnstartedThread:  0
BackgroundThread: 228
PendingThread:    0
DeadThread:       60
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 11744 000002bb7332f750  2026020 Preemptive  000002BC775885C0:000002BC7758A518 000002bb73302f40 0     STA 
  10    2 e988 000002bb733dd610    2b220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Finalizer) 
.....omitted.....
  97  428 2cf40 000002bb05df9240  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
  98  410 1c3d8 000002bb05ee44d0  602b221 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA ServiceStack.Redis.RedisException 000002bbf5127940
  99  420 29718 000002bb05ee0e20  202b220 Preemptive  000002BE75FE6398:000002BE75FE76F8 000002bb73302f40 0     MTA 
 100  442 2e76c 000002bb05edd770    2b020 Preemptive  000002BBF5967E88:000002BBF5969E20 000002bb73302f40 1     MTA 
 101  421 108ec 000002bb05edeee0    2b020 Preemptive  000002BB75A3A208:000002BB75A3C1A0 000002bb73302f40 1     MTA 
 102  403 2d260 000002bb05eeb230    2b020 Preemptive  000002BDF5580228:000002BDF55821C0 000002bb73302f40 1     MTA 
 103  472 f008 000002bb06c28950    2b220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA 
 104  395 4f64 000002bb06b45b80  202b220 Preemptive  000002BBF6213018:000002BBF62144D0 000002bb73302f40 0     MTA 
 105  398 21be8 000002bb06b44be0    2b020 Preemptive  000002BE758DB828:000002BE758DD7C0 000002bb73302f40 1     MTA 
 117  355 113d4 000002bb06302a70  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 115  392 2852c 000002bb06a167d0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 111  399 8b4c 000002bb0633e6c0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 106  318 3de0 000002bb06b43c40  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 110  333 20530 000002bb06300b30  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 116  329 2458 000002bb06a16fa0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 109  343 1b5f8 000002bb05cd5910  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker)  
 112  344 26bdc 000002bb06a17770  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 3     MTA (Threadpool Worker) ServiceStack.Redis.RedisException 000002bc751983b8
 107  320 1a508 000002bb0633ee90  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
  .....omitted.....

The last column shows that threads 98 and 112 have exceptions; inspect each in turn:

0:000> ~98s

0:098> !pe
Exception object: 000002bbf5127940
Exception type:   ServiceStack.Redis.RedisException
Message:          Host:10.201.107.18,Port:26379 Exceeded timeout of 00:00:03
InnerException:   System.Net.Sockets.SocketException, Use !PrintException 000002bef5278f08 to see more.
StackTrace (generated):
    SP               IP               Function
    000000D8D77FC260 00007FFB92F54063 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)+0x2b3
    000000D8D77FE510 00007FFB92F596C4 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])+0xd4
    000000D8D77FE580 00007FFB92F595B5 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)+0x85
    000000D8D77FE5C0 00007FFB92F594F3 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)+0x13
    000000D8D77FE5F0 00007FFB92F5940D UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetSentinelInfo()+0x1d
    000000D8D77FE640 00007FFB92F5937B UNKNOWN!ServiceStack.Redis.RedisSentinel.GetSentinelInfo()+0x4b
    000000D8D77FE6A0 00007FFB9394441A UNKNOWN!ServiceStack.Redis.RedisSentinel.ResetClients()+0x2a
    000000D8D77FE740 00007FFB93944395 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)+0x1d5
    000000D8D77FE7A0 00007FFB93944127 UNKNOWN!ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.<RunLoop>b__0(System.String, System.String)+0x267
    000000D8D77FE810 00007FFB92F5E455 UNKNOWN!ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])+0x2c5
    000000D8D77FE8A0 00007FFB92F5D681 UNKNOWN!ServiceStack.Redis.RedisSubscription.SubscribeToChannelsMatching(System.String[])+0x51
    000000D8D77FE8D0 00007FFB92F5CFC5 UNKNOWN!ServiceStack.Redis.RedisPubSubServer.RunLoop()+0x1d5
0:098> !PrintException 000002bef5278f08
Exception object: 000002bef5278f08
Exception type:   System.Net.Sockets.SocketException
Message:          A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    000000D8D77FE3C0 00007FFBEE35C842 System_ni!System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)+0xe2

StackTraceString: <none>
HResult: 80004005

 

0:000> ~112s
0:112> !pe
Exception object: 000002bc751983b8
Exception type:   ServiceStack.Redis.RedisException
Message:          Host:10.201.107.18,Port:26379 Exceeded timeout of 00:00:03
InnerException:   ServiceStack.Redis.RedisRetryableException, Use !PrintException 000002bcf51e92e0 to see more.
StackTrace (generated):
    SP               IP               Function
    000000F42BCFBBD0 00007FFB92F54063 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)+0x2b3
    000000F42BCFDE80 00007FFB92F596C4 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])+0xd4
    000000F42BCFDEF0 00007FFB92F595B5 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)+0x85
    000000F42BCFDF30 00007FFB92F594F3 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)+0x13
    000000F42BCFDF60 00007FFB92F5F17B UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)+0x1b

StackTraceString: <none>
HResult: 80131500
0:112> !pe 000002bcf51e92e0 
Exception object: 000002bcf51e92e0
Exception type:   ServiceStack.Redis.RedisRetryableException
Message:          Socket is not connected
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    000000F42BCFDE20 00007FFB92F53F13 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)+0x163

StackTraceString: <none>
HResult: 80131500

The exceptions point to socket connection failures, so I checked the network during the incident window and got the first valuable finding: there had been a brief network outage!

Further Analysis and Findings

In theory the service should have recovered automatically once the brief outage ended, rather than keeping the thread count at a sustained high. With that many active threads, the obvious next step was to see what they were all doing:

0:000> ~*e !clrstack

The full stacks cannot be shown here because they contain company source information, but they support one conclusion: nearly every thread is stuck at the step of obtaining a Redis client, ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient(), waiting to acquire a lock.

Since the threads are waiting on locks, the next step is to examine the lock state in the dump:

0:036> !syncblk
Index SyncBlock       MonitorHeld Recursion       Owning Thread Info                  SyncBlock Owner
  510 000002bb0513d4a8           79   1           000002bb073568c0 f9e8 177          000002be7520adb0 System.Object
  530 000002bb0513a1f8          219   1           000002bb06a17770 26bdc 112         000002bd74f07d30 System.Object
  567 000002bb0513e148            5   1           000002bb06ce6f40 24ce0 113         000002be74f320b8 System.Object
 1343 000002bb0513c2b8            7   1           000002bb06a17770 26bdc 112         000002bd7515b158 ServiceStack.Redis.RedisSentinelWorker
-----------------------------
Total           1369
CCW             3
RCW             2
ComClassFactory 0
Free            507

From this output we can conclude:

  • Most threads are waiting for the lock objects 000002be7520adb0 and 000002bd74f07d30 to be released;
  • Those two lock objects are currently held by threads 177 and 112 respectively;
  • MonitorHeld counts 1 for the owner plus 2 for each waiting thread, so 79 corresponds to 39 waiters on the first object and 219 to 109 waiters on the second.

So use !clrstack to look at the stacks of these two threads:

OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
000000f42bcf9870 00007ffb92f57dcb ServiceStack.DisposableExtensions.Dispose(System.Collections.Generic.IEnumerable`1<System.IDisposable>, ServiceStack.Logging.ILog)
000000f42bcf98d0 00007ffb92f59223 ServiceStack.Redis.RedisSentinel.GetNextSentinel()
000000f42bcf9950 00007ffb93945d77 ServiceStack.Redis.RedisSentinel.OnSentinelError(System.Exception)
000000f42bcf9990 00007ffb92f5f1b8 ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfbae8 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfbae8] 
000000f42bcfbbd0 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000f42bcfdd38 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfdd38] 
000000f42bcfde20 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000f42bcfde80 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000f42bcfdef0 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000f42bcfdf30 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000f42bcfdf60 00007ffb92f5f17b ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfdfa0 00007ffb92f5f067 ServiceStack.Redis.RedisSentinel.GetMaster()
000000f42bcfe020 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
000000f42bcfe060 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
000000f42bcfe090 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
000000f42bcfe100 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
    ....omitted....
    
    
OS Thread Id: 0xf9e8 (177)
        Child SP               IP Call Site
0000014977dfccf8 00007ffc025967c4 [GCFrame: 0000014977dfccf8] 
0000014977dfcef0 00007ffc025967c4 [GCFrame: 0000014977dfcef0] 
0000014977dfcf48 00007ffc025967c4 [HelperMethodFrame: 0000014977dfcf48] System.Threading.Monitor.Enter(System.Object)
0000014977dfd040 00007ffb92f5f059 ServiceStack.Redis.RedisSentinel.GetMaster()
0000014977dfd0c0 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
0000014977dfd100 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
0000014977dfd130 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
0000014977dfd1a0 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
        ....omitted....

Which leads to two further conclusions:

  • Thread 112 is performing a thread abort.
  • Thread 177, like most of the threads seen earlier, is waiting for a lock to be released.

This is where things start to smell odd: remember that thread 177 itself holds the lock object 000002be7520adb0, with 39 threads waiting for it. A deadlock looks very likely, so the natural next step is to find out which lock object thread 177 is waiting on. Here is how:

0:098> ~177s
ntdll!NtWaitForMultipleObjects+0x14:
00007ffc`025967c4 c3              ret
0:177> !clrstack -l
OS Thread Id: 0xf9e8 (177)
        Child SP               IP Call Site
0000014977dfccf8 00007ffc025967c4 [GCFrame: 0000014977dfccf8] 
0000014977dfcef0 00007ffc025967c4 [GCFrame: 0000014977dfcef0] 
0000014977dfcf48 00007ffc025967c4 [HelperMethodFrame: 0000014977dfcf48] System.Threading.Monitor.Enter(System.Object)
0000014977dfd040 00007ffb92f5f059 ServiceStack.Redis.RedisSentinel.GetMaster()
    LOCALS:
        <CLR reg> = 0x000002bd7515b158
        0x0000014977dfd080 = 0x000002bd7515b158
        <no data>
        <no data>
        <no data>
        0x0000014977dfd078 = 0x0000000000000000
            
....omitted....

The output above shows the stack together with local variable information. To take a lock on a target object, that object has to be stored in a local variable, so 000002bd7515b158 can be identified as the address of the lock object, and the earlier !syncblk output confirms that this lock object is currently held by thread 112. Thread 112's own stack, however, shows that it is not waiting on any lock, so from the information so far no deadlock has formed yet.
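The reason LOCALS exposes the lock target is the way the C# compiler lowers a lock statement: the target reference is copied into a local before Monitor.Enter is called. A minimal sketch of that lowering (class and field names are hypothetical, for illustration only):

using System.Threading;

class LockLoweringSketch
{
    // Hypothetical field standing in for the RedisSentinelWorker instance seen in the dump.
    private readonly object sentinelWorker = new object();

    public void GetMasterLikeMethod()
    {
        // "lock (sentinelWorker) { ... }" compiles roughly into this pattern,
        // which is why the object's address shows up under LOCALS in "!clrstack -l".
        object target = sentinelWorker;
        bool lockTaken = false;
        try
        {
            Monitor.Enter(target, ref lockTaken);
            // ... critical section ...
        }
        finally
        {
            if (lockTaken) Monitor.Exit(target);
        }
    }
}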

Confirming the Deadlock

At this point, everything known so far adds up to this: most threads are blocked, directly or indirectly, because of the two lock objects held by thread 112, so thread 112 deserves a closer look. The earlier analysis showed that thread 112 is performing a thread abort, i.e. Abort(). So let's see whether the thread being aborted holds any clues:

0:112> !clrstack -a
OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
    PARAMETERS:
        this = <no data>

000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
    PARAMETERS:
        this (<CLR reg>) = 0x000002be7520a900

000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
    PARAMETERS:
        this (0x000000f42bcf9870) = 0x000002be7520a900
    LOCALS:
        <no data>
        <no data>
....omitted....
            
            
            
0:177> !do2 000002be7520a900
0x000002be7520a900 ServiceStack.Redis.RedisPubSubServer
[statics]
  0000  startedAt                               : 000002be7511e4e8 (System.Diagnostics.Stopwatch)
  0008  heartbeatTimer                          : NULL
  0010  <OnInit>k__BackingField                 : NULL
  0018  <OnStart>k__BackingField                : NULL
  0020  <OnHeartbeatSent>k__BackingField        : NULL
  0028  <OnHeartbeatReceived>k__BackingField    : NULL
  0030  <OnStop>k__BackingField                 : NULL
  0038  <OnDispose>k__BackingField              : NULL
  0040  <OnMessage>k__BackingField              : 000002be7520abe0 (System.Action<System.String,System.String>)
  0048  <OnControlCommand>k__BackingField       : NULL
  0050  <OnUnSubscribe>k__BackingField          : NULL
  0058  <OnError>k__BackingField                : NULL
  0060  <OnFailover>k__BackingField             : NULL
  0068  rand                                    : 000002be7520aa30 (System.Random)
  0070  lastExMsg                               : NULL
  0078  bgThread                                : 000002be7511e550 (System.Threading.Thread)
  0080  <ClientsManager>k__BackingField         : 000002be7520a7d0 (ServiceStack.Redis.BasicRedisClientManager)
  0088  <Channels>k__BackingField               : 000002be7520aa18 (System.String[]) [Length: 0]
  0090  <ChannelsMatching>k__BackingField       : 000002be7520abc0 (System.String[]) [Length: 1]
  0098  masterClient                            : 000002bef51d8288 (ServiceStack.Redis.RedisClient)
  00a0  lastHeartbeatTicks                      : 637793659924009545 (System.Int64)
  00a8  timesStarted                            : 1 (System.Int64)
  00b0  noOfErrors                              : 0 (System.Int64)
  00b8  bgThreadCount                           : 1 (System.Int64)
  00c0  doOperation                             : 1 (System.Int32)
  00c4  noOfContinuousErrors                    : 0 (System.Int32)
  00c8  status                                  : -1 (System.Int32)
  00cc  autoRestart                             : 0 (System.Int32)
  00d0  <IsSentinelSubscription>k__BackingField : True (System.Boolean)
  00d8  serverTimeAtStart                       : 000002be7520a9e0 2022/2/2 2:33:12 (System.DateTime)
  00e0  HeartbeatInterval                       : 000002be7520a9e8 (System.Nullable<System.TimeSpan>)
  00f0  HeartbeatTimeout                        : 000002be7520a9f8 00:00:30 (System.TimeSpan)
  00f8  <WaitBeforeNextRestart>k__BackingField  : 000002be7520aa00 (System.Nullable<System.TimeSpan>)
      
      
0:177> !do2 000002be7511e550
0x000002be7511e550 System.Threading.Thread
[statics]
  0000  m_Context                             : 000002bdf4f025d8 (System.Runtime.Remoting.Contexts.Context)
  0008  m_ExecutionContext                    : 000002be7511e7b0 (System.Threading.ExecutionContext)
  0010  m_Name                                : 000002be7511e618  "RedisPubSubServer 1" [19] (System.String)
  0018  m_Delegate                            : NULL
  0020  m_CurrentCulture                      : NULL
  0028  m_CurrentUICulture                    : NULL
  0030  m_ThreadStartArg                      : NULL
  0038  DONT_USE_InternalThread               : 000002bb05ee44d0 (System.IntPtr)
  0040  m_Priority                            : 2 (System.Int32)
  0044  m_ManagedThreadId                     : 410 (System.Int32)
  0048  m_ExecutionContextBelongsToOuterScope : False (System.Boolean)

Combining the dump data above with the decompiled source, the connection between them is: thread 112 executed RedisPubSubServer.KillBgThreadIfExists(), which internally calls this.bgThread.Abort().
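Paraphrased into code, the call chain on thread 112 has roughly the following shape (a sketch reconstructed from the stack and the statement above, not the verbatim ServiceStack.Redis source):

using System.Threading;

class PubSubServerShape
{
    // Stands in for the bgThread field seen in the dump ("RedisPubSubServer 1", managed id 410).
    private Thread bgThread;

    public void Dispose()
    {
        // Reached from RedisSentinel.GetNextSentinel() in thread 112's stack.
        KillBgThreadIfExists();
    }

    private void KillBgThreadIfExists()
    {
        if (bgThread != null && bgThread.IsAlive)
        {
            // The Thread.Abort() frame at the top of thread 112's stack.
            bgThread.Abort();
            bgThread = null;
        }
    }
}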

From the dump, the RedisPubSubServer object is 000002be7520a900 and its bgThread object is 000002be7511e550; bgThread's managed thread id is 410, and matching that against the !threads output above maps it to thread 98.

So the conclusion is: the thread that 112 is trying to abort is thread 98. And here another odd smell appears: the earlier analysis concluded that most threads are blocked on lock objects held by thread 112. What about thread 98, then? Let's check its stack right away:

0:098> !clrstack -l
OS Thread Id: 0x1c3d8 (98)
        Child SP               IP Call Site
000000d8d77f87a8 00007ffc025967c4 [GCFrame: 000000d8d77f87a8] 
000000d8d77f89a0 00007ffc025967c4 [GCFrame: 000000d8d77f89a0] 
000000d8d77f89f8 00007ffc025967c4 [HelperMethodFrame: 000000d8d77f89f8] System.Threading.Monitor.Enter(System.Object)
000000d8d77f8af0 00007ffb92f59371 ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
    LOCALS:
        <CLR reg> = 0x000002bd7515b158
        0x000000d8d77f8b28 = 0x000002bd7515b158
        <no data>
        0x000000d8d77f8b20 = 0x0000000000000000

000000d8d77f8b50 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
    LOCALS:
        <no data>
        <no data>
        <no data>
....omitted....

Sure enough, thread 98 is also waiting for lock object 000002bd7515b158 to be released, and that object is held by thread 112!

To restate what the analysis has established:

  • Thread 112 is aborting thread 98.
  • Thread 98 is waiting for a lock object held by thread 112.

This certainly looks like a deadlock, but it was a situation I had never actually verified before. So, naturally, a quick Google search, and the truth came out:
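The key fact is the documented caveat on Thread.Abort: the thread calling Abort can block if the thread being aborted is inside a protected region (a catch block, finally block, or constrained execution region), and if the caller holds a lock that the aborted thread needs, a deadlock can occur. A minimal, illustrative sketch of that interplay, assuming .NET Framework (Thread.Abort is not supported on .NET Core / .NET 5+):

using System;
using System.Threading;

class AbortDeadlockSketch
{
    static readonly object Gate = new object();   // plays the role of the lock held by thread 112

    static void Main()
    {
        var worker = new Thread(() =>              // plays the role of thread 98
        {
            try { }
            finally
            {
                // Aborts are deferred while a thread is inside a finally block.
                // This Monitor.Enter blocks, because Main still holds Gate.
                lock (Gate) { }
            }
        });

        lock (Gate)
        {
            worker.Start();
            Thread.Sleep(500);                     // let the worker reach the blocked lock
            Console.WriteLine("Aborting worker...");
            worker.Abort();                        // like thread 112: can block here indefinitely,
                                                   // since the worker never leaves its finally block
        }
    }
}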

I also pinned down when the deadlocked thread 112 began executing, by using !clrstack -a to inspect the stack's parameters and locals and cross-referencing the business system's logs; this confirmed that the deadlock occurred exactly at the time of the network blip.

Why the Deadlock Happened

With the deadlock confirmed, the next step is to work out the logic that produced it.

Combining the stacks with the ServiceStack.Redis source, thread 98's execution path is:

  • The process started and ran RedisPubSubServer.RunLoop();
  • The network blip meant the client connections had to be reset: RedisSentinel.ResetClients();
  • RedisSentinel.GetSentinelInfo() in turn needs to acquire a lock and waits for it (full stack below);
OS Thread Id: 0x1c3d8 (98)
        Child SP               IP Call Site
000000d8d77f87a8 00007ffc025967c4 [GCFrame: 000000d8d77f87a8] 
000000d8d77f89a0 00007ffc025967c4 [GCFrame: 000000d8d77f89a0] 
000000d8d77f89f8 00007ffc025967c4 [HelperMethodFrame: 000000d8d77f89f8] System.Threading.Monitor.Enter(System.Object)
000000d8d77f8af0 00007ffb92f59371 ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
000000d8d77f8b50 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
000000d8d77f8bf0 00007ffb93944395 ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)
000000d8d77f8c50 00007ffb93944127 ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.<RunLoop>b__0(System.String, System.String)
000000d8d77f8cc0 00007ffb92f5e455 ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])
000000d8d77f8d50 00007ffb938dcd72 ServiceStack.Redis.RedisSubscription.UnSubscribeFromAllChannelsMatchingAnyPatterns()
000000d8d77f8d90 00007ffb92f5d0f9 ServiceStack.Redis.RedisPubSubServer.RunLoop()
000000d8d77fc178 00007ffbf225a626 [HelperMethodFrame: 000000d8d77fc178] 
000000d8d77fc260 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000d8d77fe3c8 00007ffbf225a626 [HelperMethodFrame: 000000d8d77fe3c8] 
000000d8d77fe4b0 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000d8d77fe510 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000d8d77fe580 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000d8d77fe5c0 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000d8d77fe5f0 00007ffb92f5940d ServiceStack.Redis.RedisSentinelWorker.GetSentinelInfo()
000000d8d77fe640 00007ffb92f5937b ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
000000d8d77fe6a0 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
000000d8d77fe740 00007ffb93944395 ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)
000000d8d77fe7a0 00007ffb93944127 ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.<RunLoop>b__0(System.String, System.String)
000000d8d77fe810 00007ffb92f5e455 ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])
000000d8d77fe8a0 00007ffb92f5d681 ServiceStack.Redis.RedisSubscription.SubscribeToChannelsMatching(System.String[])
000000d8d77fe8d0 00007ffb92f5cfc5 ServiceStack.Redis.RedisPubSubServer.RunLoop()
000000d8d77fe940 00007ffbeff6df12 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d8d77fea10 00007ffbeff6dd95 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d8d77fea40 00007ffbeff6dd65 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000d8d77fea90 00007ffbf0013e85 System.Threading.ThreadHelper.ThreadStart()
000000d8d77fece0 00007ffbf2216913 [GCFrame: 000000d8d77fece0] 
000000d8d77ff040 00007ffbf2216913 [DebuggerU2MCatchHandlerFrame: 000000d8d77ff040]

Thread 112's execution path is:

  • A business call requests a Redis client instance, reaching PooledRedisClientManager.GetClient();
  • The network blip raises an exception, which triggers the OnSentinelError event;
  • The OnSentinelError handler performs a Dispose();
  • That Dispose() includes aborting the existing background thread bgThread (full stack below).
OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
000000f42bcf9870 00007ffb92f57dcb ServiceStack.DisposableExtensions.Dispose(System.Collections.Generic.IEnumerable`1<System.IDisposable>, ServiceStack.Logging.ILog)
000000f42bcf98d0 00007ffb92f59223 ServiceStack.Redis.RedisSentinel.GetNextSentinel()
000000f42bcf9950 00007ffb93945d77 ServiceStack.Redis.RedisSentinel.OnSentinelError(System.Exception)
000000f42bcf9990 00007ffb92f5f1b8 ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfbae8 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfbae8] 
000000f42bcfbbd0 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000f42bcfdd38 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfdd38] 
000000f42bcfde20 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1<System.__Canon>, System.Action`1<System.Func`1<System.__Canon>>, Boolean)
000000f42bcfde80 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000f42bcfdef0 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000f42bcfdf30 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000f42bcfdf60 00007ffb92f5f17b ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfdfa0 00007ffb92f5f067 ServiceStack.Redis.RedisSentinel.GetMaster()
000000f42bcfe020 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
000000f42bcfe060 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
000000f42bcfe090 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
000000f42bcfe100 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
....omitted....

A simplified diagram of the flow above (in short: thread 112 holds the locks and is aborting thread 98, while thread 98 is blocked waiting on one of those very locks, so neither can make progress):

Once the cause of the deadlock was identified, a quick search turned up an identical report on the official forum: //forums.servicestack.net/t/application-appears-to-be-deadlocked-after-sentinel-failover/5782, and the maintainers have since made a corresponding fix.

Summary

The analysis in this post may read as though it flowed smoothly, but that is only because I reorganized and consolidated the steps by their logical relationships; the actual process took a fair amount of time.

Still, thanks to a solid grasp of windbg and a working knowledge of the ServiceStack.Redis code, the root cause of this incident was finally pinned down.