The Birth of a Bug: A Deadlock Between Code Paths That Never Call Each Other

October 4, 2019
Notes

Link to this article: https://blog.csdn.net/breaksoftware/article/details/100567271

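This bug came out of a strange symptom in one of our projects: the code had no obvious locking problem, yet at run time the program behaved as if it had deadlocked. I have reduced the business logic to this: a parent process keeps one child process alive at all times. (If you repost this article, please credit breaksoftware's CSDN blog.)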

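First we define a struct, ProcessGuard, which holds the child's process ID and the mutex that protects it, so that the struct can be accessed safely from multiple threads.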

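    #include <stdio.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <pthread.h>

    struct ProcessGuard {
        pthread_mutex_t pids_mutex;  /* protects pid */
        pid_t pid;                   /* child PID; 0 means no child exists */
    };

    /* Shared by the monitoring thread and the signal handler. */
    struct ProcessGuard *g_guard;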

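The parent's main thread starts a second thread that keeps checking whether pid in ProcessGuard is 0 (that is, no child process exists). If it is, the thread creates a child process and records its process ID in pid.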

    void child_process(void) {
        while (1) {
            printf("This is the child process. My PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());
            sleep(1);
        }
    }

    void *create_process_routine(void *arg) {
        (void)arg;
        printf("This is the child thread of parent process. My PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());
        while (1) {
            int child = 0;
            if (child == 0) {
                pthread_mutex_lock(&g_guard->pids_mutex);
            }

            if (g_guard->pid != 0) {
                /* A child already exists. Note that this path loops
                 * again without unlocking pids_mutex. */
                continue;
            }

            pid_t pid = fork();
            sleep(1);
            printf("Create child process %d.\n", pid);

            if (pid < 0) {
                perror("fork failed");
            }
            else if (pid == 0) {
                child_process();  /* loops forever, never returns */
                child = 1;
                break;
            }
            else {
                // parent process
                g_guard->pid = pid;
                printf("dispatch task to process. pid is %d.\n", pid);
            }

            if (child == 0) {
                pthread_mutex_unlock(&g_guard->pids_mutex);
            }
            else {
                break;
            }
        }
        return NULL;
    }

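In the parent's main thread we register a signal handler. When the child process is killed, the handler sets pid in ProcessGuard to 0 so that the parent's monitoring thread starts a new child.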

    void sighandler(int signum) {
        printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), (unsigned long)pthread_self());
        /* The child is gone: clear pid so the monitoring thread creates a new one. */
        pthread_mutex_lock(&g_guard->pids_mutex);
        g_guard->pid = 0;
        pthread_mutex_unlock(&g_guard->pids_mutex);
    }

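Finally, the parent's main function: after initializing a few structures, it registers the signal handler and starts the thread that creates the child process.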

    int main(void) {
        pthread_t creat_process_tid;

        g_guard = malloc(sizeof(struct ProcessGuard));
        if (g_guard == NULL) {
            perror("malloc failed");
            exit(1);
        }
        if (pthread_mutex_init(&g_guard->pids_mutex, NULL) != 0) {
            perror("init pids_mutex error.");
            exit(1);
        }
        g_guard->pid = 0;

        printf("This is the Main thread of parent process.PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());

        signal(SIGCHLD, sighandler);

        pthread_create(&creat_process_tid, NULL, (void*)create_process_routine, NULL);

        while(1)  {
            printf("Get task from network.\n");
            sleep(1);
        }

        /* Not reached: the loop above never exits. */
        pthread_mutex_destroy(&g_guard->pids_mutex);

        return 0;
    }

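In the code above, the mutex is used in only two places: the thread function create_process_routine and the signal handler sighandler. Neither calls the other anywhere in the source, so a deadlock between them should be impossible. In practice it is not.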

When we run the program and kill the child process, the parent never starts a new child.

    $ ./test
    This is the Main thread of parent process.PID is 17641.My thread_id is 140014057678656.
    Get task from network.
    This is the child thread of parent process. My PID is 17641.My thread_id is 140014049122048.
    Create child process 17643.
    dispatch task to process. pid is 17643.
    Create child process 0.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    This is the child process. My PID is 17643.My thread_id is 140014049122048.
    Get task from network.
    This is the parent process.Catch signal 17.My PID is 17641.My thread_id is 140014049122048.
    Get task from network.
    Get task from network.
    Get task from network.
    Get task from network.
    Get task from network.

This does not match the design, and it does not seem logical either. So we attach gdb to the parent process.

    Attaching to process 17641
    [New LWP 17642]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
    28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
    (gdb) info threads
      Id   Target Id         Frame
    * 1    Thread 0x7f57902be740 (LWP 17641) "test" 0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190)
        at ../sysdeps/unix/sysv/linux/nanosleep.c:28
      2    Thread 0x7f578fa95700 (LWP 17642) "test" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
    (gdb) t 2
    [Switching to thread 2 (Thread 0x7f578fa95700 (LWP 17642))]
    #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
    135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
    (gdb) bt
    #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
    #1  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
    #2  0x000055c512c29a9d in sighandler ()
    #3  <signal handler called>
    #4  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
    #5  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
    #6  0x000055c512c29b42 in create_process_routine ()
    #7  0x00007f578fe8e6db in start_thread (arg=0x7f578fa95700) at pthread_create.c:463
    #8  0x00007f578fbb788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

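Looking at the call stack of thread 2, we find that frame 5 and frame 1 are both locking the same mutex (0x55c51383e260). The thread code was meant to lock and unlock in pairs, so where does the second lock call come from? (Strictly speaking, the continue path in create_process_routine skips the unlock, which is why frame 5 is itself waiting on the mutex. Either way, thread 2 already holds the mutex when frame 1 asks for it.)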

The lock call in frame 1 comes from the function in frame 2, sighandler, shown again here:

    void sighandler(int signum) {
        printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), (unsigned long)pthread_self());
        /* The child is gone: clear pid so the monitoring thread creates a new one. */
        pthread_mutex_lock(&g_guard->pids_mutex);
        g_guard->pid = 0;
        pthread_mutex_unlock(&g_guard->pids_mutex);
    }

So here is the question: create_process_routine never calls sighandler, so where does this call come from?

The Linux manual page signal(7), http://man7.org/linux/man-pages/man7/signal.7.html, contains this passage about signals:

A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.

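In other words, a process-directed signal is delivered to any one thread that does not currently block it, and the kernel decides which one. This means our sighandler may run in the main thread, or it may run in the child thread. In the run above it ran in the child thread (the handler printed the same thread_id as the monitoring thread), and that thread already held the mutex, hence the deadlock we observed.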

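So how do we fix it? The standard approach is to use sigprocmask (pthread_sigmask in a multithreaded program) so that threads with a potential deadlock against the handler do not receive these signals. In a complex system, however, this approach has a weakness: our projects usually pull in open-source or third-party libraries, and we cannot control the threads they start. So my recommendation is to keep signal handlers lock-free as far as possible, and to design intermediate data that isolates complex business code from the signal handler.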
