Monitoring System Reboots with Grafana

February 24, 2020
Notes

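1. Overview

The Linux kernel (hereafter, the kernel) is a collection of functionality that is not tied to any particular process, and its code is hard to run and trace in a debugger. Kernel developers take the view that once the kernel hits an error it should not keep running, so its behavior on error is usually configured to crash the system and reboot the machine. Because of the electrical characteristics of dynamic memory (DRAM), the state at the moment of the error is destroyed by the reboot, which makes kernel bugs exceptionally hard to track down.

Our production k8s cluster occasionally reboots, and we have no way of knowing what causes it.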

Kdump

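Kdump is a kexec-based memory dump tool. It has been merged into the mainline kernel and is now part of the kernel, so the vast majority of Linux distributions support it. Unlike traditional memory dump mechanisms, a Kdump-based system works with two kernels: the system kernel, which runs during normal operation, and the capture kernel, which performs the memory dump when the system kernel crashes.

For how to set up Kdump, see: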

https://blog.csdn.net/bytxl/article/details/45025183

Kdump has therefore been deployed in production to capture crashes.
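Since everything that follows depends on Kdump actually being armed, a quick check can be useful. The sketch below is only an illustration: it assumes the kernel exposes /sys/kernel/kexec_crash_loaded (which reads 1 when a capture kernel is loaded), and the function name kdump_loaded is made up for this example.

```shell
#!/bin/bash
# Report whether a kdump capture kernel is currently loaded.
# kdump_loaded is a hypothetical helper; the flag file path can be
# overridden through the first argument.
kdump_loaded() {
    local flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if kdump_loaded; then
    echo "capture kernel loaded"
else
    echo "capture kernel NOT loaded"
fi
```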

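2. Monitoring Script

When did the system reboot? We don't know. So we need a script that watches for reboots; once one happens, the memory dump file can be analyzed with crash.

How to tell that the system rebooted

On Ubuntu, the last reboot command shows the history of system reboots.

Running it gives: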

root@localhost:~# last reboot
reboot   system boot  4.4.0-119-generi Mon Jan  7 13:50   still running
reboot   system boot  4.4.0-119-generi Sat Jan  5 11:48 - 13:49 (2+02:01)
reboot   system boot  4.4.0-62-generic Sat Jan  5 10:37 - 11:47  (01:10)

wtmp begins Sat Jan  5 10:37:40 2019

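The first line is the most recent reboot record.

Decision condition

How do we know whether the system rebooted yesterday?

Simple: use last reboot to get the date of the most recent reboot, then get yesterday's date and compare the two (weekday, month, and day of month). If they match, the system rebooted yesterday; otherwise it did not.

Getting the most recent reboot time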

# Time of the most recent reboot
lately=`last reboot | head -1 | awk '{print $5,$6,$7}'`

Yesterday's date

# Yesterday's date
yesterday=`date -d  "-1 days" | awk '{print $1,$2,$3}'`
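The comparison depends on both commands printing the date in the same "weekday month day" shape, and the default output of date varies with the locale. The following is a minimal sketch of the same check with the format pinned down, assuming GNU date (for -d and the %-d specifier). The function names boot_day and yesterday_day are made up for this example.

```shell
#!/bin/bash
# Pull "weekday month day" out of one line of `last reboot` output.
boot_day() {
    awk '{print $5, $6, $7}' <<< "$1"
}

# Yesterday in the same shape, e.g. "Sun Jan 6".
# Optional argument: a reference date; without it, today is used.
yesterday_day() {
    local ref="${1:-}"
    LC_ALL=C date -d "${ref:+$ref }1 day ago" '+%a %b %-d'
}

latest=$(last reboot 2>/dev/null | head -1)
if [ "$(boot_day "$latest")" = "$(yesterday_day)" ]; then
    echo "rebooted yesterday"
else
    echo "no reboot yesterday"
fi
```

Fixing LC_ALL=C and an explicit format string keeps the month and weekday names in the abbreviated English form that last prints, and %-d drops the padding so the day matches the awk field.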

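Prometheus data

We need to build the data in Prometheus format and send it to Pushgateway; Grafana then displays the chart and handles alerting.

Here we use a shell script to build the data, in the following format:

metric_name{destinationName="description",instance="instance, empty by default"} value

I put this data in a temporary file, /tmp/check_system_restart: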

echo "system_restart{destinationName=\"system_restart\",instance=\"$HOSTNAME\"} 1" > /tmp/check_system_restart

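Note: because the argument to echo is wrapped in double quotes, any double quotes inside it must be escaped with a backslash.

Remember that in the shell, variables are not expanded inside single quotes; double quotes are required for that.

$HOSTNAME is a shell variable set by bash that holds the host name.

Sending the data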
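One way to sidestep the escaping altogether is printf with a single-quoted format string, where the variable is passed in as an argument instead of being expanded inside the quotes. This is only a sketch of that alternative; the function name metric_line is made up for this example.

```shell
#!/bin/bash
# Build one sample line for the reboot metric.
# $1 = instance label (host name), $2 = value (1 = rebooted, 0 = not).
# The format string is single-quoted, so its double quotes need no escaping.
metric_line() {
    printf 'system_restart{destinationName="system_restart",instance="%s"} %d\n' "$1" "$2"
}

metric_line "$HOSTNAME" 1
```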

cat /tmp/check_system_restart|curl --data-binary @- http://$localIP:9091/metrics/job/system_restart_`echo $localIP | awk -F '.' '{print $NF}'`

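Explanation:

--data-binary makes curl send the data in the HTTP POST request exactly as given, with no extra processing

$localIP is the IP address of Pushgateway

echo $localIP | awk -F '.' '{print $NF}' extracts the last octet of the IP address

Note: the string appended after job makes the URL different for every server, so that one host's monitoring data is not overwritten by another's.

For how to set up Pushgateway, see: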
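To make the URL construction concrete, here is a small sketch that builds the same per-host URL using shell parameter expansion instead of echo and awk. The function name push_url and the sample address are assumptions for illustration only.

```shell
#!/bin/bash
# Build the per-host Pushgateway URL from an IPv4 address.
# ${ip##*.} strips everything up to the last dot, leaving the final octet.
push_url() {
    local ip="$1"
    echo "http://${ip}:9091/metrics/job/system_restart_${ip##*.}"
}

# Example address, not taken from the original setup.
push_url "192.168.1.23"
```

With the sample address this prints http://192.168.1.23:9091/metrics/job/system_restart_23, so two hosts only collide if their addresses share the same last octet.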

https://www.cnblogs.com/xiao987334176/p/9933963.html

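Adding the scheduled task

Normally we add a scheduled task with crontab -e.

That does not work inside a shell script, though, because crontab -e opens an interactive editor.

Editing /etc/crontab directly also adds a scheduled task.

The following code checks whether the scheduled task has already been added, and adds it if it is missing: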

if [ `cat /etc/crontab|grep 'check_reboot.sh'|wc -l` -eq 0 ];then
        cp -f /opt/check_reboot.sh /etc/ && chmod 755 /etc/check_reboot.sh
        echo "0 * * * * root bash /etc/check_reboot.sh" >>/etc/crontab
fi
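The same "append only if missing" idea can be written as a small reusable function with grep -q. The sketch below runs against a scratch file rather than the real /etc/crontab; the function name ensure_cron_line is made up for this example.

```shell
#!/bin/bash
# Append a line to a crontab-style file only if a marker is not present yet.
# $1 = file to edit, $2 = marker to search for, $3 = full line to append.
ensure_cron_line() {
    local file="$1" marker="$2" line="$3"
    if ! grep -qF -- "$marker" "$file" 2>/dev/null; then
        echo "$line" >> "$file"
    fi
}

# Demonstrated on a scratch file; calling it twice still leaves one entry.
demo=$(mktemp)
ensure_cron_line "$demo" 'check_reboot.sh' '0 * * * * root bash /etc/check_reboot.sh'
ensure_cron_line "$demo" 'check_reboot.sh' '0 * * * * root bash /etc/check_reboot.sh'
cat "$demo"
rm -f "$demo"
```

Because the check and the append live in one place, the script can be run any number of times without duplicating the entry.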

Full code

Be sure to put the script in the /opt directory, because that path is hard-coded!

check_reboot.sh

#!/bin/bash

# Time of the most recent reboot
lately=`last reboot | head -1 | awk '{print $5,$6,$7}'`

# Yesterday's date
yesterday=`date -d  "-1 days" | awk '{print $1,$2,$3}'`

# Check whether the two dates match
if [ "$lately" == "$yesterday" ];then
    # Write to a log
    #echo "$HOSTNAME restarted at $lately" >> /opt/restart.log
    echo "system_restart{destinationName=\"system_restart\",instance=\"$HOSTNAME\"} 1" > /tmp/check_system_restart
else
    echo "system_restart{destinationName=\"system_restart\",instance=\"$HOSTNAME\"} 0" > /tmp/check_system_restart
fi

# Get the Pushgateway server IP (taken from this host's 192.168.x.x address)
localIP=`ip addr | grep '192.168' | awk '{print $2}' | cut -d '/' -f 1`

# Send the data to Pushgateway
if [ `cat /tmp/check_system_restart|wc -l` -ge 1 ];then
        cat /tmp/check_system_restart|curl --data-binary @- http://$localIP:9091/metrics/job/system_restart_`echo $localIP | awk -F '.' '{print $NF}'`
else
        curl -X DELETE http://$localIP:9091/metrics/job/system_restart_`echo $localIP | awk -F '.' '{print $NF}'`
fi

# Add the scheduled task
if [ `cat /etc/crontab|grep 'check_reboot.sh'|wc -l` -eq 0 ];then
        cp -f /opt/check_reboot.sh /etc/ && chmod 755 /etc/check_reboot.sh
        echo "0 * * * * root bash /etc/check_reboot.sh" >>/etc/crontab
fi

Running the script automatically creates the /tmp/check_system_restart file.

Check the file contents:

root@localhost:~# cat /tmp/check_system_restart
system_restart{destinationName="system_restart",instance="xx-node01"} 0

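The script also copies itself to /etc/check_reboot.sh, which keeps the path uniform and makes the scheduled task easy to add.

Finally, it adds the scheduled task automatically.

The scheduled task runs once an hour. To avoid waiting that long, run /etc/check_reboot.sh by hand once first.

Viewing the Pushgateway data

A job entry will appear.

3. Adding the monitor in Grafana

Add a graph titled "System rebooted yesterday".

Set the value to display.

Set the alert policy.

Trigger an alert when the last value equals 1.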
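The pushed value can also be checked from the command line, since Pushgateway serves what it holds on its /metrics endpoint. The sketch below filters that text for the reboot metric. The function name restart_samples is made up, and the sample input is invented for illustration rather than captured from a real server.

```shell
#!/bin/bash
# Keep only the system_restart samples from metrics text read on stdin.
restart_samples() {
    grep '^system_restart{'
}

# Against a live Pushgateway this would be fed by curl, for example:
#   curl -s "http://$localIP:9091/metrics" | restart_samples
# Here it runs on made-up sample text instead.
printf '%s\n' \
    '# TYPE system_restart untyped' \
    'system_restart{destinationName="system_restart",instance="xx-node01",job="system_restart_23"} 0' \
    'push_time_seconds{instance="xx-node01",job="system_restart_23"} 1.5e+09' \
    | restart_samples
```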

