Kubernetes實戰總結 – 自定義Prometheus
- 2020 年 8 月 17 日
- 筆記
- Kubernetes
一、概述
首先Prometheus整體監控結構略微複雜,一個個部署並不簡單。另外監控Kubernetes就需要訪問內部數據,必定需要進行認證、鑒權、准入控制,
那麼這一整套下來將變得難上加難,而且還需要花費一定的時間,如果你沒有特別高的要求,我還是建議選用開源比較好的一些方案。
關於Prometheus具體介紹不再多說,可以參考另外一篇博文:Kubernetes實戰總結 – Prometheus部署(v0.3.0)
本篇主要針對Kubernetes部署Prometheus相關配置介紹,本人採用的是github開源的部署方案:
關於這個kube-prometheus目前應該是開源最好的方案了,該存儲庫收集Kubernetes清單,Grafana儀錶板和Prometheus規則,以及文檔和腳本,
以使用Prometheus Operator 通過Prometheus提供易於操作的端到端Kubernetes集群監視。以容器的方式部署到k8s集群,而且還可以自定義配置,非常的方便。
注意:本人使用的kubernetes-1.17.5 + release-0.3,由於網路問題本人已修改全部鏡像地址。
二、結構分析
kube-prometheus相關部署文件在manifests目錄中,共65個yaml,其中setup文件夾中包含所有自定義資源配置CustomResourceDefinition(一般不用修改,也不要輕易修改),所以部署時必須先執行這個文件夾。
其中包括告警(Alertmanager)、監控(Prometheus)、監控項(PrometheusRule)這三類資源定義,所以如果你想直接在k8s中修改對應控制器配置是沒有用的(比如kubectl edit sts prometheus-k8s -n monitoring) 。
這裡yaml文件看著很多,只要我們梳理一下就會很容易理解了,首先分為7個組件prometheus-operator、prometheus-adapter、prometheus、alertmanager、grafana、kube-state-metrics、node-exporter,
然後每個組件都會定義控制器、配置文件、集群許可權、訪問配置等, 但是我們一般只需要進行自定義告警配置和監控項,這樣一篩選發現只需要修改幾個文件即可(其中紅色後面重點說明,紫色可根據項目情況調整資源配置)。
[root@ymt108 manifests]# tree . ├── alertmanager-alertmanager.yaml ├── alertmanager-secret.yaml # 告警配置 ├── alertmanager-serviceAccount.yaml ├── alertmanager-serviceMonitor.yaml ├── alertmanager-service.yaml ├── grafana-dashboardDatasources.yaml ├── grafana-dashboardDefinitions.yaml ├── grafana-dashboardSources.yaml ├── grafana-deployment.yaml ├── grafana-serviceAccount.yaml ├── grafana-serviceMonitor.yaml ├── grafana-service.yaml ├── kube-state-metrics-clusterRoleBinding.yaml ├── kube-state-metrics-clusterRole.yaml ├── kube-state-metrics-deployment.yaml ├── kube-state-metrics-roleBinding.yaml ├── kube-state-metrics-role.yaml ├── kube-state-metrics-serviceAccount.yaml ├── kube-state-metrics-serviceMonitor.yaml ├── kube-state-metrics-service.yaml ├── node-exporter-clusterRoleBinding.yaml ├── node-exporter-clusterRole.yaml ├── node-exporter-daemonset.yaml ├── node-exporter-serviceAccount.yaml ├── node-exporter-serviceMonitor.yaml ├── node-exporter-service.yaml ├── prometheus-adapter-apiService.yaml ├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml ├── prometheus-adapter-clusterRoleBindingDelegator.yaml ├── prometheus-adapter-clusterRoleBinding.yaml ├── prometheus-adapter-clusterRoleServerResources.yaml ├── prometheus-adapter-clusterRole.yaml ├── prometheus-adapter-configMap.yaml ├── prometheus-adapter-deployment.yaml ├── prometheus-adapter-roleBindingAuthReader.yaml ├── prometheus-adapter-serviceAccount.yaml ├── prometheus-adapter-service.yaml ├── prometheus-clusterRoleBinding.yaml ├── prometheus-clusterRole.yaml ├── prometheus-operator-serviceMonitor.yaml ├── prometheus-prometheus.yaml # 監控配置 ├── prometheus-roleBindingConfig.yaml ├── prometheus-roleBindingSpecificNamespaces.yaml ├── prometheus-roleConfig.yaml ├── prometheus-roleSpecificNamespaces.yaml ├── prometheus-rules.yaml # 默認監控項 ├── prometheus-serviceAccount.yaml ├── prometheus-serviceMonitorApiserver.yaml ├── prometheus-serviceMonitorCoreDNS.yaml ├── prometheus-serviceMonitorKubeControllerManager.yaml ├── prometheus-serviceMonitorKubelet.yaml ├── prometheus-serviceMonitorKubeScheduler.yaml ├── prometheus-serviceMonitor.yaml ├── prometheus-service.yaml └── setup ├── 0namespace-namespace.yaml ├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml ├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml ├── prometheus-operator-0prometheusCustomResourceDefinition.yaml ├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml ├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml ├── prometheus-operator-clusterRoleBinding.yaml ├── prometheus-operator-clusterRole.yaml ├── prometheus-operator-deployment.yaml ├── prometheus-operator-serviceAccount.yaml └── prometheus-operator-service.yaml 1 directories, 65 files
三、修改Prometheus配置
為了保留原始文件,我們複製一份prometheus-prometheus.yaml進行如下修改:
1)replicas:根據項目情況調整副本數
2)retention:修改Prometheus數據保留期限,默認值為「24h」,並且必須與正則表達式「 [0-9] +(ms | s | m | h | d | w | y)」匹配。
3)additionalScrapeConfigs:增加額外監控項配置,具體配置查看第五部分「添加k8s外部監控」。
apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: labels: prometheus: k8s name: k8s namespace: monitoring spec: alerting: alertmanagers: - name: alertmanager-main namespace: monitoring port: web # baseImage: quay.io/prometheus/prometheus baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus additionalScrapeConfigs: name: additional-scrape-configs key: prometheus-additional.yaml retention: 15d nodeSelector: kubernetes.io/os: linux podMonitorSelector: {} replicas: 2 resources: requests: memory: 400Mi ruleSelector: matchLabels: prometheus: k8s role: alert-rules securityContext: fsGroup: 2000 runAsNonRoot: true runAsUser: 1000 serviceAccountName: prometheus-k8s serviceMonitorNamespaceSelector: {} serviceMonitorSelector: {} version: v2.11.0
四、修改PrometheusRule配置
首先查看默認監控項配置prometheus-rules.yaml,其中包括76個告警項,基本覆蓋了k8s常用監控點,同樣為了保留源文件,我們複製一份prometheus-rules.yaml進行一些修改。
由於我使用的版本在裸機k8s集群上存在無法獲取到scheduler和controller資源問題,被我注釋了;另外由於general-rules規則與我自定義規則衝突也被我注釋了;
最後增加了platform參數區分環境,以及進行部分提示語中譯。


apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: labels: prometheus: k8s role: alert-rules name: prometheus-k8s-rules namespace: monitoring spec: groups: - name: node-exporter.rules rules: - expr: | count without (cpu) ( count without (mode) ( node_cpu_seconds_total{job="node-exporter"} ) ) record: instance:node_num_cpu:sum - expr: | 1 - avg without (cpu, mode) ( rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m]) ) record: instance:node_cpu_utilisation:rate1m - expr: | ( node_load1{job="node-exporter"} / instance:node_num_cpu:sum{job="node-exporter"} ) record: instance:node_load1_per_cpu:ratio - expr: | 1 - ( node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} ) record: instance:node_memory_utilisation:ratio - expr: | rate(node_vmstat_pgmajfault{job="node-exporter"}[1m]) record: instance:node_vmstat_pgmajfault:rate1m - expr: | rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]) record: instance_device:node_disk_io_time_seconds:rate1m - expr: | rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]) record: instance_device:node_disk_io_time_weighted_seconds:rate1m - expr: | sum without (device) ( rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m]) ) record: instance:node_network_receive_bytes_excluding_lo:rate1m - expr: | sum without (device) ( rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[1m]) ) record: instance:node_network_transmit_bytes_excluding_lo:rate1m - expr: | sum without (device) ( rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[1m]) ) record: instance:node_network_receive_drop_excluding_lo:rate1m - expr: | sum without (device) ( rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[1m]) ) record: instance:node_network_transmit_drop_excluding_lo:rate1m - name: kube-apiserver.rules rules: - expr: | histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod)) labels: quantile: "0.99" record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod)) labels: quantile: "0.9" record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod)) labels: quantile: "0.5" record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile - name: k8s.rules rules: - expr: | sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) by (namespace) record: namespace:container_cpu_usage_seconds_total:sum_rate - expr: | sum by (namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m]) ) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info) record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate - expr: | container_memory_working_set_bytes{job="kubelet", image!=""} * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info) record: node_namespace_pod_container:container_memory_working_set_bytes - expr: | container_memory_rss{job="kubelet", image!=""} * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info) record: node_namespace_pod_container:container_memory_rss - expr: | container_memory_cache{job="kubelet", image!=""} * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info) record: node_namespace_pod_container:container_memory_cache - expr: | container_memory_swap{job="kubelet", image!=""} * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info) record: node_namespace_pod_container:container_memory_swap - expr: | sum(container_memory_usage_bytes{job="kubelet", image!="", container!="POD"}) by (namespace) record: namespace:container_memory_usage_bytes:sum - expr: | sum by (namespace, label_name) ( sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod) * on (namespace, pod) group_left(label_name) kube_pod_labels{job="kube-state-metrics"} ) record: namespace:kube_pod_container_resource_requests_memory_bytes:sum - expr: | sum by (namespace, label_name) ( sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod) * on (namespace, pod) group_left(label_name) kube_pod_labels{job="kube-state-metrics"} ) record: namespace:kube_pod_container_resource_requests_cpu_cores:sum - expr: | sum( label_replace( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)" ) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner{job="kube-state-metrics"}, "workload", "$1", "owner_name", "(.*)" ) ) by (namespace, workload, pod) labels: workload_type: deployment record: mixin_pod_workload - expr: | sum( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)" ) ) by (namespace, workload, pod) labels: workload_type: daemonset record: mixin_pod_workload - expr: | sum( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)" ) ) by (namespace, workload, pod) labels: workload_type: statefulset record: mixin_pod_workload - name: kube-scheduler.rules rules: - expr: | histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.99" record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.99" record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.99" record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.9" record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.9" record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.9" record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.5" record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.5" record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile - expr: | histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) labels: quantile: "0.5" record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile - name: node.rules rules: - expr: sum(min(kube_pod_info) by (node)) record: ':kube_pod_info_node_count:' - expr: | max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod) record: 'node_namespace_pod:kube_pod_info:' - expr: | count by (node) (sum by (node, cpu) ( node_cpu_seconds_total{job="node-exporter"} * on (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info: )) record: node:node_num_cpu:sum - expr: | sum( node_memory_MemAvailable_bytes{job="node-exporter"} or ( node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"} ) ) record: :node_memory_MemAvailable_bytes:sum - name: kube-prometheus-node-recording.rules rules: - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance) record: instance:node_cpu:rate:sum - expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})) BY (instance) record: instance:node_filesystem_usage:sum - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance) record: instance:node_network_receive_bytes:rate:sum - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance) record: instance:node_network_transmit_bytes:rate:sum - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance) record: instance:node_cpu:ratio - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) record: cluster:node_cpu:sum_rate5m - expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu)) record: cluster:node_cpu:ratio - name: node-exporter rules: - alert: NodeFilesystemSpaceFillingUp annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up. summary: "預計文件系統將在接下來的24小時內用完空間。" expr: | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: warning - alert: NodeFilesystemSpaceFillingUp annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast. summary: "預計文件系統將在接下來的4個小時內用完空間。" expr: | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: critical - alert: NodeFilesystemAlmostOutOfSpace annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. summary: "文件系統剩餘空間不到5%。" expr: | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: warning - alert: NodeFilesystemAlmostOutOfSpace annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. summary: "文件系統剩餘空間不到3%。" expr: | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: critical - alert: NodeFilesystemFilesFillingUp annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up. summary: "預計文件系統將在接下來的24小時內用盡inodes。" expr: | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: warning - alert: NodeFilesystemFilesFillingUp annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast. summary: "預計文件系統將在接下來的4小時內用盡inodes。" expr: | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: critical - alert: NodeFilesystemAlmostOutOfFiles annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. summary: "文件系統僅剩不到5%的inodes。" expr: | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: warning - alert: NodeFilesystemAlmostOutOfFiles annotations: platform: "測試平台" description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. summary: "文件系統僅剩不到3%的inodes。" expr: | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) for: 1h labels: severity: critical - alert: NodeNetworkReceiveErrs annotations: platform: "測試平台" description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.' summary: "網路介面報告許多接收錯誤。" expr: | increase(node_network_receive_errs_total[2m]) > 10 for: 1h labels: severity: warning - alert: NodeNetworkTransmitErrs annotations: platform: "測試平台" description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.' summary: "網路介面報告許多傳輸錯誤。" expr: | increase(node_network_transmit_errs_total[2m]) > 10 for: 1h labels: severity: warning - name: kubernetes-apps rules: - alert: KubePodCrashLooping annotations: platform: "測試平台" message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes. expr: | rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0 for: 15m labels: severity: critical - alert: KubePodNotReady annotations: platform: "測試平台" message: "Pod {{$labels.namespace}}/{{$labels.pod}}處於未就緒狀態的時間超過15分鐘。" expr: | sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Failed|Pending|Unknown"} * on(namespace, pod) group_left(owner_kind) kube_pod_owner{owner_kind!="Job"}) > 0 for: 15m labels: severity: critical - alert: KubeDeploymentGenerationMismatch annotations: platform: "測試平台" message: "Deployment {{$labels.namespace}}/{{$labels.deployment}}生成不匹配,這表明Deployment已失敗但尚未回滾。" expr: | kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: critical - alert: KubeDeploymentReplicasMismatch annotations: platform: "測試平台" message: "Deployment{{$labels.namespace}}/{{$labels.deployment}}超過15分鐘未匹配預期的副本數。" expr: | kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"} for: 15m labels: severity: critical - alert: KubeStatefulSetReplicasMismatch annotations: platform: "測試平台" message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}超過15分鐘未匹配預期的副本數。" expr: | kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"} for: 15m labels: severity: critical - alert: KubeStatefulSetGenerationMismatch annotations: platform: "測試平台" message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}生成不匹配,這表明StatefulSet已失敗但尚未回滾。" expr: | kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: critical - alert: KubeStatefulSetUpdateNotRolledOut annotations: platform: "測試平台" message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. expr: | max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"} ) for: 15m labels: severity: critical - alert: KubeDaemonSetRolloutStuck annotations: platform: "測試平台" message: Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready. expr: | kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00 for: 15m labels: severity: critical - alert: KubeContainerWaiting annotations: platform: "測試平台" message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. expr: | sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 for: 1h labels: severity: warning - alert: KubeDaemonSetNotScheduled annotations: platform: "測試平台" message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.' expr: | kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 for: 10m labels: severity: warning - alert: KubeDaemonSetMisScheduled annotations: platform: "測試平台" message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.' expr: | kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 for: 10m labels: severity: warning - alert: KubeCronJobRunning annotations: platform: "測試平台" message: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. expr: | time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 for: 1h labels: severity: warning - alert: KubeJobCompletion annotations: platform: "測試平台" message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete. expr: | kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 for: 1h labels: severity: warning - alert: KubeJobFailed annotations: platform: "測試平台" message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. expr: | kube_job_failed{job="kube-state-metrics"} > 0 for: 15m labels: severity: warning - alert: KubeHpaReplicasMismatch annotations: platform: "測試平台" message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. expr: | (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 for: 15m labels: severity: warning - alert: KubeHpaMaxedOut annotations: platform: "測試平台" message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. expr: | kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} for: 15m labels: severity: warning - name: kubernetes-resources rules: - alert: KubeCPUOvercommit annotations: platform: "測試平台" message: "群集已超額使用Pod的CPU資源請求,因此無法容忍節點故障。" expr: | sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores) for: 5m labels: severity: warning - alert: KubeMemOvercommit annotations: platform: "測試平台" message: "群集已過量使用Pod的記憶體資源請求,因此無法容忍節點故障。" expr: | sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes) for: 5m labels: severity: warning - alert: KubeCPUOvercommit annotations: platform: "測試平台" message: "群集已超額使用了對命名空間的CPU資源請求。" expr: | sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 for: 5m labels: severity: warning - alert: KubeMemOvercommit annotations: platform: "測試平台" message: "群集已過量使用了對命名空間的記憶體資源請求。" expr: | sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 for: 5m labels: severity: warning - alert: KubeQuotaExceeded annotations: platform: "測試平台" message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. expr: | kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90 for: 15m labels: severity: warning # - alert: CPUThrottlingHigh # annotations: # message: '{{ $value | humanizePercentage }} throttling of CPU in namespace # {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ # $labels.pod }}.' # expr: | # sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) # / # sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) # > ( 25 / 100 ) # for: 15m # labels: # severity: warning - name: kubernetes-storage rules: - alert: KubePersistentVolumeUsageCritical annotations: platform: "測試平台" message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. expr: | kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} < 0.03 for: 1m labels: severity: critical - alert: KubePersistentVolumeFullInFourDays annotations: platform: "測試平台" message: "根據最近的抽樣,{{$labels.persistentvolumeclaim}}在命名空間{{$labels.namespace}}中聲明的PersistentVolume預計將在四天內填滿,目前{{$value | humanizePercentage}}可用。" expr: | ( kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} ) < 0.15 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0 for: 1h labels: severity: critical - alert: KubePersistentVolumeErrors annotations: platform: "測試平台" message: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. expr: | kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 for: 5m labels: severity: critical - name: kubernetes-system rules: - alert: KubeVersionMismatch annotations: platform: "測試平台" message: There are {{ $value }} different semantic versions of Kubernetes components running. expr: | count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1 for: 15m labels: severity: warning - alert: KubeClientErrors annotations: platform: "測試平台" message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.' expr: | (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m])) by (instance, job)) > 0.01 for: 15m labels: severity: warning - name: kubernetes-system-apiserver rules: - alert: KubeAPILatencyHigh annotations: platform: "測試平台" message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. expr: | cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1 for: 10m labels: severity: warning - alert: KubeAPILatencyHigh annotations: platform: "測試平台" message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. expr: | cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 4 for: 10m labels: severity: critical - alert: KubeAPIErrorsHigh annotations: platform: "測試平台" message: API server is returning errors for {{ $value | humanizePercentage }} of requests. expr: | sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.03 for: 10m labels: severity: critical - alert: KubeAPIErrorsHigh annotations: platform: "測試平台" message: API server is returning errors for {{ $value | humanizePercentage }} of requests. expr: | sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.01 for: 10m labels: severity: warning - alert: KubeAPIErrorsHigh annotations: platform: "測試平台" message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. expr: | sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.10 for: 10m labels: severity: critical - alert: KubeAPIErrorsHigh annotations: platform: "測試平台" message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. expr: | sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05 for: 10m labels: severity: warning - alert: KubeClientCertificateExpiration annotations: platform: "測試平台" message: "用於驗證apiserver的客戶端證書的有效期限少於7.0天。" expr: | apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 labels: severity: warning - alert: KubeClientCertificateExpiration annotations: platform: "測試平台" message: "用於驗證apiserver的客戶端證書的有效期限少於24.0小時。" expr: | apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400 labels: severity: critical - alert: KubeAPIDown annotations: platform: "測試平台" message: "KubeAPI已從Prometheus目標發現中消失。" expr: | absent(up{job="apiserver"} == 1) for: 15m labels: severity: critical - name: kubernetes-system-kubelet rules: - alert: KubeNodeNotReady annotations: platform: "測試平台" message: "{{$labels.node}}尚未準備就緒超過15分鐘。" expr: | kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 for: 15m labels: severity: warning - alert: KubeNodeUnreachable annotations: platform: "測試平台" message: "{{$labels.node}}無法訪問,某些工作負荷可能會重新安排。" expr: | kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1 labels: severity: warning - alert: KubeletTooManyPods annotations: platform: "測試平台" message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity. expr: | max(max(kubelet_running_pod_count{job="kubelet"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"}) by(node) > 0.95 for: 15m labels: severity: warning - alert: KubeletDown annotations: platform: "測試平台" message: "Kubelet已從Prometheus目標發現中消失。" expr: | absent(up{job="kubelet"} == 1) for: 15m labels: severity: critical # - name: kubernetes-system-scheduler # rules: # - alert: KubeSchedulerDown # annotations: # message: KubeScheduler has disappeared from Prometheus target discovery. # runbook_url: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown # expr: | # absent(up{job="kube-scheduler"} == 1) # for: 15m # labels: # severity: critical # - name: kubernetes-system-controller-manager # rules: # - alert: KubeControllerManagerDown # annotations: # message: KubeControllerManager has disappeared from Prometheus target discovery. # runbook_url: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown # expr: | # absent(up{job="kube-controller-manager"} == 1) # for: 15m # labels: # severity: critical - name: prometheus rules: - alert: PrometheusBadConfig annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration. summary: "Prometheus配置重新載入失敗。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # //www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0 for: 10m labels: severity: critical - alert: PrometheusNotificationQueueRunningFull annotations: platform: "測試平台" description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full. summary: "Prometheus警報通知隊列預計將在30m以內用完。" expr: | # Without min_over_time, failed scrapes could create false negatives, see # //www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. ( predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m]) ) for: 15m labels: severity: warning - alert: PrometheusErrorSendingAlertsToSomeAlertmanagers annotations: platform: "測試平台" description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.' summary: "Prometheus在將警報發送到特定的Alertmanager時遇到了超過1%的錯誤。" expr: | ( rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) * 100 > 1 for: 15m labels: severity: warning - alert: PrometheusErrorSendingAlertsToAnyAlertmanager annotations: platform: "測試平台" description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.' summary: "Prometheus在將警報發送到任何Alertmanager時遇到3%以上的錯誤。" expr: | min without(alertmanager) ( rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) * 100 > 3 for: 15m labels: severity: critical - alert: PrometheusNotConnectedToAlertmanagers annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers. summary: "Prometheus未與任何Alertmanager連接。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # //www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1 for: 10m labels: severity: warning - alert: PrometheusTSDBReloadsFailing annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h. summary: "Prometheus從磁碟重新載入塊時遇到問題。" expr: | increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning - alert: PrometheusTSDBCompactionsFailing annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h. summary: "Prometheus在壓縮塊時遇到問題。" expr: | increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0 for: 4h labels: severity: warning - alert: PrometheusNotIngestingSamples annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples. summary: "Prometheus沒有獲取到樣本" expr: | rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0 for: 10m labels: severity: warning - alert: PrometheusDuplicateTimestamps annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp. summary: "Prometheus正在刪除帶有重複時間戳的樣本。" expr: | rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning - alert: PrometheusOutOfOrderTimestamps annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order. summary: "Prometheus丟棄帶有亂序時間戳的樣本。" expr: | rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 10m labels: severity: warning - alert: PrometheusRemoteStorageFailures annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to queue {{$labels.queue}}. summary: "Prometheus無法將樣本發送到遠程存儲。" expr: | ( rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) / ( rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m]) ) ) * 100 > 1 for: 15m labels: severity: critical - alert: PrometheusRemoteWriteBehind annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for queue {{$labels.queue}}. summary: "Prometheus遠程寫入落後了。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # //www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. ( max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m]) - on(job, instance) group_right max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m]) ) > 120 for: 15m labels: severity: critical - alert: PrometheusRemoteWriteDesiredShards annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}` $labels.instance | query | first | value }}. summary: "Prometheus遠程寫入所需的分片計算要比配置的最大分片運行更多。" expr: | # Without max_over_time, failed scrapes could create false negatives, see # //www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details. ( max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m]) ) for: 15m labels: severity: warning - alert: PrometheusRuleFailures annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m. summary: "Prometheus無法通過規則評估。" expr: | increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: critical - alert: PrometheusMissingRuleEvaluations annotations: platform: "測試平台" description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m. summary: "Prometheus由於規則組評估速度慢而缺少規則評估。" expr: | increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0 for: 15m labels: severity: warning - name: alertmanager.rules rules: - alert: AlertmanagerConfigInconsistent annotations: platform: "測試平台" message: "Alertmanager {{$labels.service}}實例的配置不同步。" expr: | count_values("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="prometheus-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "alertmanager-$1", "name", "(.*)") != 1 for: 5m labels: severity: critical - alert: AlertmanagerFailedReload annotations: platform: "測試平台" message: "Alertmanager {{$labels.namespace}}/{{$labels.pod}}重新載入配置失敗。" expr: | alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0 for: 10m labels: severity: warning - alert: AlertmanagerMembersInconsistent annotations: platform: "測試平台" message: "Alertmanager尚未找到集群的所有其他成員。" expr: | alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"} != on (service) GROUP_LEFT() count by (service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}) for: 5m labels: severity: critical # - name: general.rules # rules: # - alert: TargetDown # annotations: # platform: "測試平台" # message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }} targets in # {{ $labels.namespace }} namespace are down.' # expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, # namespace, service)) > 10 # for: 10m # labels: # severity: warning # - alert: Watchdog # annotations: # platform: "測試平台" # message: "此警報始終處於觸髮狀態,旨在確保整個警報管道均正常運行。" # runbook_url: //github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md # expr: vector(1) # labels: # severity: none - name: node-time rules: - alert: ClockSkewDetected annotations: platform: "測試平台" message: Clock skew detected on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}. Ensure NTP is configured correctly on this host. expr: | abs(node_timex_offset_seconds{job="node-exporter"}) > 0.05 for: 2m labels: severity: warning - name: node-network rules: - alert: NodeNetworkInterfaceFlapping annotations: platform: "測試平台" message: Network interface "{{ $labels.device }}" changing it's up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}" expr: | changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 for: 2m labels: severity: warning - name: prometheus-operator rules: - alert: PrometheusOperatorReconcileErrors annotations: platform: "測試平台" message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace. expr: | rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning - alert: PrometheusOperatorNodeLookupErrors annotations: platform: "測試平台" message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. expr: | rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1 for: 10m labels: severity: warning
prometheus-rules.yaml
接下來參考prometheus-rules.yaml,新建自定義的告警項prometheus-additional-rules.yaml
kind: PrometheusRule metadata: labels: prometheus: k8s role: alert-rules name: prometheus-additional-rules namespace: monitoring spec: groups: - name: general.rules rules: - alert: InstanceDown expr: up == 0 for: 30s labels: severity: critical annotations: platform: "育苗通測試平台" summary: "監控採集器 {{ $labels.instance }} 停止工作!!!" value: "{{ $value }}" - name: node.rules rules: - alert: NodeFilesystemUsage expr: 100 - (node_filesystem_free_bytes{device="rootfs"} / node_filesystem_size_bytes{device="rootfs"} * 100) > 85 for: 5m labels: severity: warning annotations: platform: "育苗通測試平台" summary: "主機 {{ $labels.instance }} : {{ $labels.mountpoint }} 分區使用率超過80%!!!" value: "{{ $value }}" - alert: NodeMemoryUsage expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80 for: 5m labels: severity: warning annotations: platform: "育苗通測試平台" summary: "主機 {{ $labels.instance }} 記憶體使用率超過80%!!!" value: "{{ $value }}" - alert: NodeCPUUsage expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80 for: 5m labels: severity: warning annotations: platform: "育苗通測試平台" summary: "主機 {{ $labels.instance }} CPU使用率超過80%!!!" value: "{{ $value }}"
五、添加k8s外部監控
一個項目開始可能很難實現全部容器化,比如資料庫、CDH集群。但是我們依然需要監控他們,如果分成兩套prometheus不利於管理,所以我們統一添加這些監控到kube-prometheus中。
那麼接下來我們新建prometheus-additional.yaml文件,添加額外監控組件配置scrape_configs。


- job_name: 'node_export_others' static_configs: - targets: - *.*.*.118:9591 - *.*.*.137:9591 - *.*.*.138:9591 - *.*.*.139:9591 - *.*.*.153:9591 - *.*.*.154:9591 - *.*.*.156:9591 - *.*.*.159:9591 - *.*.*.166:9591 - *.*.*.167:9591 - *.*.*.168:9591 - *.*.*.169:9591 - *.*.*.175:9591 - *.*.*.177:9591 - *.*.*.179:9591 - *.*.*.180:9591 - *.*.*.181:9591 - job_name: 'nginx' static_configs: - targets: - *.*.*.159:9593 - job_name: 'mysql' static_configs: - targets: - *.*.*.156:9592 - job_name: 'pushgateway-flink' static_configs: - targets: - *.*.*.137:9091 - job_name: 'kafka_jmx_export' static_configs: - targets: - *.*.*.118:9991 - *.*.*.138:9991 - *.*.*.139:9991 - job_name: 'zookeeper_cdh_export' static_configs: - targets: - *.*.*.118:9595 - *.*.*.138:9595 - *.*.*.139:9595 - job_name: 'hdfs_namenode_cdh' static_configs: - targets: - *.*.*.137:9600 - *.*.*.139:9600 - job_name: 'hdfs_datanode_cdh' static_configs: - targets: - *.*.*.118:9601 - *.*.*.138:9601 - *.*.*.139:9601 - job_name: 'yarn_resourcemanager_cdh' static_configs: - targets: - *.*.*.137:9602 - *.*.*.139:9602 - job_name: 'yarn_nodemanager_cdh' static_configs: - targets: - *.*.*.118:9603 - *.*.*.138:9603 - *.*.*.139:9603 - job_name: 'redis_exporter' static_configs: - targets: - *.*.*.166:9594 - job_name: 'elasticsearch' metrics_path: "/_prometheus/metrics" static_configs: - targets: - *.*.*.169:8970 - *.*.*.175:8970 - *.*.*.177:8970 - job_name: 'nacos' metrics_path: '/nacos/actuator/prometheus' static_configs: - targets: - *.*.*.166:8848 - *.*.*.167:8848 - *.*.*.168:8848 - job_name: 'redis_exporter_targets' static_configs: - targets: - redis://*.*.*.179:6179 - redis://*.*.*.180:6179 - redis://*.*.*.181:6179 - redis://*.*.*.179:6178 - redis://*.*.*.180:6178 - redis://*.*.*.181:6178 - redis://*.*.*.179:6177 - redis://*.*.*.179:6176 metrics_path: /scrape relabel_configs: - source_labels: [__address__] target_label: __param_target - source_labels: [__param_target] target_label: instance - target_label: __address__ replacement: *.*.*.166:9594
prometheus-additional.yaml
然後我們需要將這些監控配置以secret資源類型存儲到k8s集群中。
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring
六、配置Alertmanager
監控和告警項已經配置好了,那麼接下來我們將進行alertmanager告警配置了。
常用的接收方式就是郵件了,但這裡我們將使用企業微訊號進行接收,所以開發一個連接微信的應用appalertservice,進行消息轉發和處理。
當然,你也可以直接配置微訊號和消息模板,可參考:第3章 Prometheus告警處理。
global: resolve_timeout: 5m # smtp_smarthost: 'smtp.sina.com:25' # smtp_from: '******@sina.com' # smtp_auth_username: '******@sina.com' # smtp_auth_password: '******' route: group_by: ['job'] group_wait: 20s group_interval: 30m repeat_interval: 12h receiver: webhook receivers: - name: webhook webhook_configs: - url: '//appalertservice:20119/' # - name: 'email' # email_configs: # - to: '******@163.com'
然後我們需要將alertmanager配置以secret資源類型存儲到k8s集群中。
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
七、配置Grafana
前面我們把監控和告警已經配置好了,那接下來就剩展示了。打開grafana -> 點擊添加按鈕 ->Import ->Upload .json file,導入監控儀錶板。


{ "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "This dashboard provides cluster admins with the ability to monitor nodes and identify workload bottlenecks. It can be deployed with PSPs enabled using the following helm chart - //github.com/pivotal-cf/charts-grafana", "editable": true, "gnetId": 10000, "graphTooltip": 0, "id": 102, "iteration": 1597137794957, "links": [], "panels": [ { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 34, "panels": [], "repeat": null, "title": "Summary", "type": "row" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 0, "y": 1 }, "height": "180px", "id": 4, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}) / sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": "Cluster memory usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 8, "y": 1 }, "height": "180px", "id": 6, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) / sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": "Cluster CPU usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": true, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 5, "w": 8, "x": 16, "y": 1 }, "height": "180px", "id": 7, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_usage_bytes{id=\"/\"}) / sum (container_fs_limit_bytes{id=\"/\"}) * 100", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "metric": "", "refId": "A", "step": 10 } ], "thresholds": "65, 90", "title": "Cluster filesystem usage", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 0, "y": 6 }, "height": "1px", "id": 9, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "20%", "prefix": "", "prefixFontSize": "20%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 4, "y": 6 }, "height": "1px", "id": 10, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "none", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 8, "y": 6 }, "height": "1px", "id": 11, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": " cores", "postfixFontSize": "30%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "none", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 12, "y": 6 }, "height": "1px", "id": 12, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": " cores", "postfixFontSize": "30%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 16, "y": 6 }, "height": "1px", "id": 13, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_usage_bytes{id=\"/\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Used", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": false, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 3, "w": 4, "x": 20, "y": 6 }, "height": "1px", "id": 14, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum (container_fs_limit_bytes{id=\"/\"})", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 10 } ], "thresholds": "", "title": "Total", "type": "singlestat", "valueFontSize": "50%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 9 }, "id": 35, "panels": [], "repeat": null, "title": "Memory", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 0, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 10 }, "id": 25, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": false, "max": false, "min": false, "rightSide": true, "show": true, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "sum (container_memory_working_set_bytes{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "{{ pod }}", "metric": "container_memory_usage:sort_desc", "refId": "A", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods memory usage", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 17 }, "id": 37, "panels": [], "title": "CPU", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 3, "editable": true, "error": false, "fill": 0, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 18 }, "height": "", "id": 17, "legend": { "alignAsTable": true, "avg": true, "current": true, "max": false, "min": false, "rightSide": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": true, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_cpu_usage_seconds_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "{{ pod }}", "metric": "container_cpu", "refId": "A", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods CPU usage", "tooltip": { "msResolution": true, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "none", "label": "cores", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 25 }, "id": 33, "panels": [], "repeat": null, "title": "Network I/O", "type": "row" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 7, "w": 24, "x": 0, "y": 26 }, "id": 16, "legend": { "alignAsTable": true, "avg": true, "current": true, "max": false, "min": false, "rightSide": true, "show": true, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_network_receive_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "-> {{ pod }}", "metric": "network", "refId": "A", "step": 10 }, { "expr": "- sum (rate (container_network_transmit_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "<- {{ pod }}", "metric": "network", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Pods network I/O", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 5, "w": 24, "x": 0, "y": 33 }, "height": "200px", "id": 32, "legend": { "alignAsTable": false, "avg": true, "current": true, "max": false, "min": false, "rightSide": false, "show": false, "sideWidth": 200, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "Received", "metric": "network", "refId": "A", "step": 10 }, { "expr": "- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))", "format": "time_series", "interval": "10s", "intervalFactor": 1, "legendFormat": "Sent", "metric": "network", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "Network I/O pressure", "tooltip": { "msResolution": false, "shared": true, "sort": 0, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "Bps", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "10s", "schemaVersion": 20, "style": "dark", "tags": [ "Prometheus", "Kubernetes" ], "templating": { "list": [ { "auto": true, "auto_count": 20, "auto_min": "2m", "current": { "text": "auto", "value": "$__auto_interval_interval" }, "hide": 2, "label": null, "name": "interval", "options": [ { "selected": true, "text": "auto", "value": "$__auto_interval_interval" }, { "selected": false, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "refresh": 2, "skipUrlSync": false, "type": "interval" }, { "current": { "text": "prometheus", "value": "prometheus" }, "hide": 0, "includeAll": false, "label": null, "multi": false, "name": "datasource", "options": [], "query": "prometheus", "refresh": 1, "regex": "", "skipUrlSync": false, "type": "datasource" }, { "allValue": ".*", "current": { "text": "All", "value": "$__all" }, "datasource": "prometheus", "definition": "", "hide": 0, "includeAll": true, "label": null, "multi": false, "name": "Node", "options": [], "query": "label_values(kubernetes_io_hostname)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 0, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false } ] }, "time": { "from": "now-5m", "to": "now" }, "timepicker": { "refresh_intervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "timezone": "browser", "title": "育苗通K8S集群監控", "uid": "6KoW2MIGk", "version": 15 }
k8s-model.json


{ "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "【中文版本】2020.06.28更新,增加整體資源展示!支援 Grafana6&7,Node Exporter v0.16及以上的版本,優化重要指標展示。包含整體資源展示與資源明細圖表:CPU 記憶體 磁碟 IO 網路等監控指標。//github.com/starsliao/Prometheus", "editable": true, "gnetId": 8919, "graphTooltip": 0, "id": 72, "iteration": 1597137684806, "links": [ { "icon": "external link", "tags": [], "targetBlank": true, "title": "更新node_exporter", "tooltip": "", "type": "link", "url": "//github.com/prometheus/node_exporter/releases" }, { "icon": "external link", "tags": [], "targetBlank": true, "title": "更新當前儀錶板", "tooltip": "", "type": "link", "url": "//grafana.com/dashboards/8919" }, { "icon": "external link", "tags": [], "targetBlank": true, "title": "StarsL.cn", "tooltip": "", "type": "link", "url": "//starsl.cn" }, { "asDropdown": true, "icon": "external link", "tags": [], "targetBlank": true, "title": "", "type": "dashboards" } ], "panels": [ { "collapsed": false, "datasource": "prometheus", "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 187, "panels": [], "title": "資源總覽(關聯JOB項)當前選中主機:【$show_hostname】實例:$node", "type": "row" }, { "columns": [], "datasource": "prometheus", "description": "分區使用率、磁碟讀取、磁碟寫入、下載頻寬、上傳頻寬,如果有多個網卡或者多個分區,是採集的使用率最高的網卡或者分區的數值。", "fontSize": "100%", "gridPos": { "h": 12, "w": 24, "x": 0, "y": 1 }, "id": 185, "options": {}, "pageSize": 10, "showHeader": true, "sort": { "col": 5, "desc": false }, "styles": [ { "alias": "主機名", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "link": false, "linkTooltip": "", "linkUrl": "", "mappingType": 1, "pattern": "nodename", "thresholds": [], "type": "string", "unit": "bytes" }, { "alias": "IP(鏈接到明細)", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": true, "linkTargetBlank": false, "linkTooltip": "瀏覽主機明細", "linkUrl": "/d/9CWBz0bik/node-exporter?orgId=1&var-job=${job}&var-hostname=All&var-node=${__cell}&var-device=All", "mappingType": 1, "pattern": "instance", "thresholds": [], "type": "number", "unit": "short" }, { "alias": "記憶體", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "Value #B", "thresholds": [], "type": "number", "unit": "bytes" }, { "alias": "CPU核", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": null, "mappingType": 1, "pattern": "Value #C", "thresholds": [], "type": "number", "unit": "short" }, { "alias": " 運行時間", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #D", "thresholds": [], "type": "number", "unit": "s" }, { "alias": "分區使用率*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #E", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "CPU使用率", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #F", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "記憶體使用率", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #G", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "磁碟讀取*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #H", "thresholds": [ "10485760", "20485760" ], "type": "number", "unit": "Bps" }, { "alias": "磁碟寫入*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #I", "thresholds": [ "10485760", "20485760" ], "type": "number", "unit": "Bps" }, { "alias": "下載頻寬*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #J", "thresholds": [ "30485760", "104857600" ], "type": "number", "unit": "bps" }, { "alias": "上傳頻寬*", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #K", "thresholds": [ "30485760", "104857600" ], "type": "number", "unit": "bps" }, { "alias": "5m負載", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "Value #L", "thresholds": [], "type": "number", "unit": "short" }, { "alias": "", "align": "right", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 2, "pattern": "/.*/", "thresholds": [], "type": "hidden", "unit": "short" } ], "targets": [ { "expr": "node_uname_info{job=~\"$job\"} - 0", "format": "table", "instant": true, "interval": "", "legendFormat": "主機名", "refId": "A" }, { "expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "運行時間", "refId": "D" }, { "expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "總記憶體", "refId": "B" }, { "expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "總核數", "refId": "C" }, { "expr": "node_load5{job=~\"$job\"}", "format": "table", "instant": true, "interval": "", "legendFormat": "5分鐘負載", "refId": "L" }, { "expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "CPU使用率", "refId": "F" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "記憶體使用率", "refId": "G" }, { "expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "分區使用率", "refId": "E" }, { "expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[5m])) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "最大讀取", "refId": "H" }, { "expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[5m])) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "最大寫入", "refId": "I" }, { "expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[5m])*8) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "下載頻寬", "refId": "J" }, { "expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[5m])*8) by (instance)", "format": "table", "hide": false, "instant": true, "interval": "", "legendFormat": "上傳頻寬", "refId": "K" } ], "timeFrom": null, "timeShift": null, "title": "伺服器資源總覽表(每頁10行)", "transform": "table", "type": "table" }, { "aliasColors": { "192.168.200.241:9100_Total": "dark-red", "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁碟花費在I/O操作佔比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 13 }, "hiddenSeries": false, "id": 191, "legend": { "alignAsTable": false, "avg": false, "current": true, "hideEmpty": true, "hideZero": true, "max": false, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "總平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "總核數", "color": "#C4162A" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "count(node_cpu_seconds_total{job=~\"$job\", mode='system'})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "總核數", "refId": "B", "step": 240 }, { "expr": "sum(node_load5{job=~\"$job\"})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "總5分鐘負載", "refId": "A", "step": 240 }, { "expr": "avg(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "總平均使用率", "refId": "F", "step": 240 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整體總負載與整體平均CPU使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "short", "label": "總負載", "logBase": 1, "max": null, "min": null, "show": true }, { "decimals": 0, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_總記憶體": "dark-red", "記憶體_Avaliable": "#6ED0E0", "記憶體_Cached": "#EF843C", "記憶體_Free": "#629E51", "記憶體_Total": "#6d1f62", "記憶體_Used": "#eab839", "可用": "#9ac48a", "總記憶體": "#bf1b00" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 13 }, "height": "300", "hiddenSeries": false, "id": 195, "legend": { "alignAsTable": false, "avg": false, "current": true, "max": false, "min": false, "rightSide": false, "show": true, "sort": "current", "sortDesc": false, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "總記憶體", "color": "#C4162A", "fill": 0 }, { "alias": "總平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"})", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "總記憶體", "refId": "A", "step": 4 }, { "expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"})", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "總已用", "refId": "B", "step": 4 }, { "expr": "(sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"}) / sum(node_memory_MemTotal_bytes{job=~\"$job\"}))*100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "總平均使用率", "refId": "H" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整體總記憶體與整體平均記憶體使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "bytes", "label": "總記憶體量", "logBase": 1, "max": null, "min": "0", "show": true }, { "decimals": null, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 13 }, "hiddenSeries": false, "id": 197, "legend": { "alignAsTable": false, "avg": false, "current": true, "hideEmpty": false, "hideZero": false, "max": false, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "總平均使用率", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "總磁碟量", "color": "#C4162A" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "總磁碟量", "refId": "E" }, { "expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "總使用量", "refId": "C" }, { "expr": "(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))))", "format": "time_series", "instant": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "總平均使用率", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "$job:整體總磁碟與整體平均磁碟使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": 1, "format": "bytes", "label": "總磁碟量", "logBase": 1, "max": null, "min": "0", "show": true }, { "decimals": null, "format": "percent", "label": "平均使用率", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "collapsed": false, "datasource": "prometheus", "gridPos": { "h": 1, "w": 24, "x": 0, "y": 21 }, "id": 189, "panels": [], "title": "資源明細:【$show_hostname】", "type": "row" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorPrefix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": 0, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "s", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "threshcisLabels": false, "threshcisMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 22 }, "hideTimeOverride": true, "id": 15, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "null", "nullText": null, "options": {}, "pluginVersion": "6.4.2", "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(time() - node_boot_time_seconds{instance=~\"$node\"})", "format": "time_series", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 40 } ], "threshciss": "1,2", "thresholds": "1,3", "title": "運行時間", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "datasource": "prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "custom": {}, "decimals": 2, "displayName": "", "mappings": [ { "from": "", "id": 1, "operator": "", "text": "N/A", "to": "", "type": 1, "value": "0" } ], "max": 100, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 70 }, { "color": "#EAB839", "value": 90 } ] }, "unit": "percent" }, "overrides": [] }, "gridPos": { "h": 6, "w": 3, "x": 2, "y": 22 }, "id": 177, "options": { "displayMode": "lcd", "fieldOptions": { "calcs": [ "last" ], "defaults": { "decimals": 1, "mappings": [ { "from": "", "id": 1, "operator": "", "text": "N/A", "to": "", "type": 1, "value": "0" } ], "max": 100, "min": 0.1, "thresholds": { "0": { "color": "green", "value": null }, "1": { "color": "red", "value": 80 }, "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 70 }, { "color": "red", "value": 90 } ] }, "unit": "percent" }, "override": {}, "overrides": [], "values": false }, "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "values": false }, "showUnfilled": true }, "pluginVersion": "6.4.3", "targets": [ { "expr": "100 - (avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) * 100)", "instant": true, "interval": "", "legendFormat": "總CPU使用率", "refId": "A" }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100", "hide": true, "instant": true, "interval": "", "legendFormat": "IOwait使用率", "refId": "C" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100", "instant": true, "interval": "", "legendFormat": "記憶體使用率", "refId": "B" }, { "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"})*100 /(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}))", "hide": false, "instant": true, "interval": "", "legendFormat": "最大分區({{mountpoint}})使用率", "refId": "D" }, { "expr": "(1 - ((node_memory_SwapFree_bytes{instance=~\"$node\"} + 1)/ (node_memory_SwapTotal_bytes{instance=~\"$node\"} + 1))) * 100", "instant": true, "legendFormat": "交換分區使用率", "refId": "F" } ], "timeFrom": null, "timeShift": null, "title": "", "type": "bargauge" }, { "columns": [], "datasource": "prometheus", "description": "本看板中的:磁碟總量、使用量、可用量、使用率保持和df命令的Size、Used、Avail、Use% 列的值一致,並且Use%的值會四捨五入保留一位小數,會更加準確。\n\n註:df中Use%演算法為:(size - free) * 100 / (avail + (size - free)),結果是整除則為該值,非整除則為該值+1,結果的單位是%。\n參考df命令源碼:", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fontSize": "100%", "gridPos": { "h": 6, "w": 10, "x": 5, "y": 22 }, "id": 181, "links": [ { "targetBlank": true, "title": "//github.com/coreutils/coreutils/blob/master/src/df.c", "url": "//github.com/coreutils/coreutils/blob/master/src/df.c" } ], "options": {}, "pageSize": null, "scroll": true, "showHeader": true, "sort": { "col": 6, "desc": false }, "styles": [ { "alias": "分區", "align": "auto", "colorMode": null, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "mappingType": 1, "pattern": "mountpoint", "thresholds": [ "" ], "type": "string", "unit": "bytes" }, { "alias": "可用空間", "align": "auto", "colorMode": "value", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "mappingType": 1, "pattern": "Value #A", "thresholds": [ "10000000000", "20000000000" ], "type": "number", "unit": "bytes" }, { "alias": "使用率", "align": "auto", "colorMode": "cell", "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "rgba(245, 54, 54, 0.9)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 1, "mappingType": 1, "pattern": "Value #B", "thresholds": [ "70", "85" ], "type": "number", "unit": "percent" }, { "alias": "總空間", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 0, "link": false, "mappingType": 1, "pattern": "Value #C", "thresholds": [], "type": "number", "unit": "bytes" }, { "alias": "文件系統", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "fstype", "thresholds": [], "type": "string", "unit": "short" }, { "alias": "設備名", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 2, "link": false, "mappingType": 1, "pattern": "device", "preserveFormat": false, "sanitize": false, "thresholds": [], "type": "string", "unit": "short" }, { "alias": "", "align": "auto", "colorMode": null, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 2, "pattern": "/.*/", "preserveFormat": true, "sanitize": false, "thresholds": [], "type": "hidden", "unit": "short" } ], "targets": [ { "expr": "node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0", "format": "table", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "總量", "refId": "C" }, { "expr": "node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0", "format": "table", "hide": false, "instant": true, "interval": "10s", "intervalFactor": 1, "legendFormat": "", "refId": "A" }, { "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))", "format": "table", "hide": false, "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "B" } ], "title": "【$show_hostname】:各分區可用空間(EXT.*/XFS)", "transform": "table", "type": "table" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(50, 172, 45, 0.97)", "rgba(237, 129, 40, 0.89)", "#d44a3a" ], "datasource": "prometheus", "decimals": 2, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "percent", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 22 }, "id": 20, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "nullPointMode": "connected", "nullText": null, "options": {}, "pluginVersion": "6.4.2", "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": true, "lineColor": "#3274D9", "show": true, "ymax": null, "ymin": null }, "tableColumn": "", "targets": [ { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "20,50", "timeFrom": null, "timeShift": null, "title": "CPU iowait", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "avg" }, { "aliasColors": { "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in": "light-red", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in下載": "green", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_out上傳": "yellow", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_in下載": "purple", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out": "purple", "cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out上傳": "blue" }, "bars": true, "dashLength": 10, "dashes": false, "datasource": "prometheus", "editable": true, "error": false, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "grid": {}, "gridPos": { "h": 6, "w": 7, "x": 17, "y": 22 }, "hiddenSeries": false, "id": 183, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "show": false, "sort": "current", "sortDesc": true, "total": true, "values": true }, "lines": false, "linewidth": 2, "links": [], "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 1, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*_out上傳$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "increase(node_network_receive_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])", "interval": "60m", "intervalFactor": 1, "legendFormat": "{{device}}_in下載", "metric": "", "refId": "A", "step": 600, "target": "" }, { "expr": "increase(node_network_transmit_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])", "hide": false, "interval": "60m", "intervalFactor": 1, "legendFormat": "{{device}}_out上傳", "refId": "B", "step": 600 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每小時流量$device", "tooltip": { "msResolution": false, "shared": true, "sort": 0, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": "上傳(-)/下載(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "short", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 24 }, "id": 14, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "count(node_cpu_seconds_total{instance=~\"$node\", mode='system'})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "1,2", "title": "CPU 核數", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "short", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 24 }, "id": 179, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(node_filesystem_files_free{instance=~\"$node\",mountpoint=\"$maxmount\",fstype=~\"ext.?|xfs\"})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "100000,1000000", "title": "剩餘節點數:$maxmount ", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": 0, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "bytes", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 0, "y": 26 }, "id": 75, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "70%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "sum(node_memory_MemTotal_bytes{instance=~\"$node\"})", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{instance}}", "refId": "A", "step": 20 } ], "thresholds": "2,3", "title": "總記憶體", "type": "singlestat", "valueFontSize": "80%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "cacheTimeout": null, "colorBackground": false, "colorPostfix": false, "colorValue": true, "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "datasource": "prometheus", "decimals": null, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "format": "locale", "gauge": { "maxValue": 100, "minValue": 0, "show": false, "thresholdLabels": false, "thresholdMarkers": true }, "gridPos": { "h": 2, "w": 2, "x": 15, "y": 26 }, "id": 178, "interval": null, "links": [], "mappingType": 1, "mappingTypes": [ { "name": "value to text", "value": 1 }, { "name": "range to text", "value": 2 } ], "maxDataPoints": 100, "maxPerRow": 6, "nullPointMode": "null", "nullText": null, "options": {}, "postfix": "", "postfixFontSize": "50%", "prefix": "", "prefixFontSize": "50%", "rangeMaps": [ { "from": "null", "text": "N/A", "to": "null" } ], "sparkline": { "fillColor": "rgba(31, 118, 189, 0.18)", "full": false, "lineColor": "rgb(31, 120, 193)", "show": false }, "tableColumn": "", "targets": [ { "expr": "avg(node_filefd_maximum{instance=~\"$node\"})", "format": "time_series", "instant": true, "intervalFactor": 1, "legendFormat": "", "refId": "A", "step": 20 } ], "thresholds": "1024,10000", "title": "總文件描述符", "type": "singlestat", "valueFontSize": "70%", "valueMaps": [ { "op": "=", "text": "N/A", "value": "null" } ], "valueName": "current" }, { "aliasColors": { "192.168.200.241:9100_Total": "dark-red", "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁碟花費在I/O操作佔比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 28 }, "hiddenSeries": false, "id": 7, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*總使用率/", "color": "#C4162A", "fill": 0 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "系統使用率", "refId": "A", "step": 20 }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"user\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "用戶使用率", "refId": "B", "step": 240 }, { "expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) by (instance) *100", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "磁碟IO使用率", "refId": "D", "step": 240 }, { "expr": "(1 - avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) by (instance))*100", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "總使用率", "refId": "F", "step": 240 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "CPU使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": 0, "format": "percent", "label": "", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_總記憶體": "dark-red", "使用率": "yellow", "記憶體_Avaliable": "#6ED0E0", "記憶體_Cached": "#EF843C", "記憶體_Free": "#629E51", "記憶體_Total": "#6d1f62", "記憶體_Used": "#eab839", "可用": "#9ac48a", "總記憶體": "#bf1b00" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 28 }, "height": "300", "hiddenSeries": false, "id": 156, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "總記憶體", "color": "#C4162A", "fill": 0 }, { "alias": "使用率", "color": "rgb(0, 209, 255)", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "總記憶體", "refId": "A", "step": 4 }, { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemAvailable_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "已用", "refId": "B", "step": 4 }, { "expr": "node_memory_MemAvailable_bytes{instance=~\"$node\"}", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "可用", "refId": "F", "step": 4 }, { "expr": "node_memory_Buffers_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "記憶體_Buffers", "refId": "D", "step": 4 }, { "expr": "node_memory_MemFree_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "記憶體_Free", "refId": "C", "step": 4 }, { "expr": "node_memory_Cached_bytes{instance=~\"$node\"}", "format": "time_series", "hide": true, "intervalFactor": 1, "legendFormat": "記憶體_Cached", "refId": "E", "step": 4 }, { "expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - (node_memory_Cached_bytes{instance=~\"$node\"} + node_memory_Buffers_bytes{instance=~\"$node\"} + node_memory_MemFree_bytes{instance=~\"$node\"})", "format": "time_series", "hide": true, "intervalFactor": 1, "refId": "G" }, { "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100", "format": "time_series", "hide": false, "interval": "30m", "intervalFactor": 10, "legendFormat": "使用率", "refId": "H" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "記憶體資訊", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bytes", "label": null, "logBase": 1, "max": null, "min": "0", "show": true }, { "format": "percent", "label": "記憶體使用率", "logBase": 1, "max": "100", "min": "0", "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.10.227:9100_em1_in下載": "super-light-green", "192.168.10.227:9100_em1_out上傳": "dark-blue" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 28 }, "height": "300", "hiddenSeries": false, "id": 157, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_out上傳$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_network_receive_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_in下載", "refId": "A", "step": 4 }, { "expr": "irate(node_network_transmit_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_out上傳", "refId": "B", "step": 4 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每秒網路頻寬使用$device", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "bps", "label": "上傳(-)/下載(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "15分鐘": "#6ED0E0", "1分鐘": "#BF1B00", "5分鐘": "#CCA300" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "editable": true, "error": false, "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "grid": {}, "gridPos": { "h": 8, "w": 8, "x": 0, "y": 36 }, "height": "300", "hiddenSeries": false, "id": 13, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "maxPerRow": 6, "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "repeat": null, "seriesOverrides": [ { "alias": "/.*總核數/", "color": "#C4162A" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_load1{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "1分鐘負載", "metric": "", "refId": "A", "step": 20, "target": "" }, { "expr": "node_load5{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "5分鐘負載", "refId": "B", "step": 20 }, { "expr": "node_load15{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "15分鐘負載", "refId": "C", "step": 20 }, { "expr": " sum(count(node_cpu_seconds_total{instance=~\"$node\", mode='system'}) by (cpu,instance)) by(instance)", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "CPU總核數", "refId": "D", "step": 20 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "系統平均負載", "tooltip": { "msResolution": false, "shared": true, "sort": 2, "value_type": "cumulative" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda_write": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Read bytes 每個磁碟分區每秒讀取的比特數\nWritten bytes 每個磁碟分區每秒寫入的比特數", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "gridPos": { "h": 8, "w": 8, "x": 8, "y": 36 }, "height": "300", "hiddenSeries": false, "id": 168, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_讀取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_read_bytes_total{instance=~\"$node\"}[5m])", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_讀取", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_written_bytes_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_寫入", "refId": "B", "step": 10 } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每秒磁碟讀寫容量", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "Bps", "label": "讀取(-)/寫入(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 1, "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 36 }, "hiddenSeries": false, "id": 174, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "rightSide": false, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/Inodes.*/", "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{mountpoint}}", "refId": "A" }, { "expr": "node_filesystem_files_free{instance=~'$node',fstype=~\"ext.?|xfs\"} / node_filesystem_files{instance=~'$node',fstype=~\"ext.?|xfs\"}", "hide": true, "interval": "", "legendFormat": "Inodes:{{instance}}:{{mountpoint}}", "refId": "B" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "磁碟使用率", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "percent", "label": "", "logBase": 1, "max": "100", "min": "0", "show": true }, { "decimals": 2, "format": "percentunit", "label": null, "logBase": 1, "max": "1", "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda_write": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Reads completed: 每個磁碟分區每秒讀完成次數\n\nWrites completed: 每個磁碟分區每秒寫完成次數\n\nIO now 每個磁碟分區每秒正在處理的輸入/輸出請求數", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 9, "w": 8, "x": 0, "y": 44 }, "height": "300", "hiddenSeries": false, "id": 161, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*_讀取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_讀取", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_寫入", "refId": "B", "step": 10 }, { "expr": "node_disk_io_now{instance=~\"$node\"}", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "磁碟讀寫速率(IOPS)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "iops", "label": "讀取(-)/寫入(+)I/O ops/sec", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "Idle - Waiting for something to happen": "#052B51", "guest": "#9AC48A", "idle": "#052B51", "iowait": "#EAB839", "irq": "#BF1B00", "nice": "#C15C17", "sdb_每秒I/O操作%": "#d683ce", "softirq": "#E24D42", "steal": "#FCE2DE", "system": "#508642", "user": "#5195CE", "磁碟花費在I/O操作佔比": "#ba43a9" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": null, "description": "每一秒鐘的自然時間內,花費在I/O上的耗時。(wall-clock time)\n\nnode_disk_io_time_seconds_total:\n磁碟花費在輸入/輸出操作上的秒數。該值為累加值。(Milliseconds Spent Doing I/Os)\n\nirate(node_disk_io_time_seconds_total[1m]):\n計算每秒的速率:(last值-last前一個值)/時間戳差值,即:1秒鐘內磁碟花費在I/O操作的時間佔比。", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 9, "w": 8, "x": 8, "y": 44 }, "hiddenSeries": false, "id": 175, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "rightSide": false, "show": true, "sideWidth": null, "sort": null, "sortDesc": null, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "maxPerRow": 6, "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_每秒I/O操作%", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每1秒內I/O操作耗時佔比", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "decimals": null, "format": "percentunit", "label": "", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "vda": "#6ED0E0" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Read time seconds 每個磁碟分區讀操作花費的秒數\n\nWrite time seconds 每個磁碟分區寫操作花費的秒數\n\nIO time seconds 每個磁碟分區輸入/輸出操作花費的秒數\n\nIO time weighted seconds每個磁碟分區輸入/輸出操作花費的加權秒數", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 1, "fillGradient": 1, "gridPos": { "h": 9, "w": 8, "x": 16, "y": 44 }, "height": "300", "hiddenSeries": false, "id": 160, "legend": { "alignAsTable": true, "avg": true, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": true, "show": true, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null as zero", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/,*_讀取$/", "transform": "negative-Y" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "irate(node_disk_read_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_讀取", "refId": "B" }, { "expr": "irate(node_disk_write_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_寫入", "refId": "C" }, { "expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}", "refId": "A", "step": 10 }, { "expr": "irate(node_disk_io_time_weighted_seconds_total{instance=~\"$node\"}[5m])", "format": "time_series", "hide": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{device}}_加權", "refId": "D" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "每次IO讀寫的耗時(參考:小於100ms)(beta)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "s", "label": "讀取(-)/寫入(+)", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": false } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "192.168.200.241:9100_TCP_alloc": "semi-dark-blue", "TCP": "#6ED0E0", "TCP_alloc": "blue" }, "bars": false, "dashLength": 10, "dashes": false, "datasource": "prometheus", "decimals": 2, "description": "Sockets_used - 已使用的所有協議套接字總量\n\nCurrEstab - 當前狀態為 ESTABLISHED 或 CLOSE-WAIT 的 TCP 連接數\n\nTCP_alloc - 已分配(已建立、已申請到sk_buff)的TCP套接字數量\n\nTCP_tw - 等待關閉的TCP連接數\n\nUDP_inuse - 正在使用的 UDP 套接字數量\n\nRetransSegs - TCP 重傳報文數\n\nOutSegs - TCP 發送的報文數\n\nInSegs - TCP 接收的報文數", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 0, "gridPos": { "h": 8, "w": 16, "x": 0, "y": 53 }, "height": "300", "hiddenSeries": false, "id": 158, "interval": "", "legend": { "alignAsTable": true, "avg": false, "current": true, "hideEmpty": true, "hideZero": true, "max": true, "min": false, "rightSide": true, "show": true, "sideWidth": null, "sort": "current", "sortDesc": true, "total": false, "values": true }, "lines": true, "linewidth": 1, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/.*Sockets_used/", "color": "#E02F44", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_netstat_Tcp_CurrEstab{instance=~'$node'}", "format": "time_series", "hide": false, "instant": false, "interval": "", "intervalFactor": 1, "legendFormat": "CurrEstab", "refId": "A", "step": 20 }, { "expr": "node_sockstat_TCP_tw{instance=~'$node'}", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "TCP_tw", "refId": "D" }, { "expr": "node_sockstat_sockets_used{instance=~'$node'}", "hide": false, "interval": "30m", "intervalFactor": 1, "legendFormat": "Sockets_used", "refId": "B" }, { "expr": "node_sockstat_UDP_inuse{instance=~'$node'}", "interval": "", "legendFormat": "UDP_inuse", "refId": "C" }, { "expr": "node_sockstat_TCP_alloc{instance=~'$node'}", "interval": "", "legendFormat": "TCP_alloc", "refId": "E" }, { "expr": "irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "{{instance}}_Tcp_PassiveOpens", "refId": "G" }, { "expr": "irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "{{instance}}_Tcp_ActiveOpens", "refId": "F" }, { "expr": "irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])", "interval": "", "legendFormat": "Tcp_InSegs", "refId": "H" }, { "expr": "irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])", "interval": "", "legendFormat": "Tcp_OutSegs", "refId": "I" }, { "expr": "irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])", "hide": false, "interval": "", "legendFormat": "Tcp_RetransSegs", "refId": "J" }, { "expr": "irate(node_netstat_TcpExt_ListenDrops{instance=~'$node'}[5m])", "hide": true, "interval": "", "legendFormat": "", "refId": "K" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "網路Socket連接資訊", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "transformations": [], "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": "已使用的所有協議套接字總量", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": { "filefd_192.168.200.241:9100": "super-light-green", "switches_192.168.200.241:9100": "semi-dark-red", "使用的文件描述符_10.118.72.128:9100": "red", "每秒上下文切換次數_10.118.71.245:9100": "yellow", "每秒上下文切換次數_10.118.72.128:9100": "yellow" }, "bars": false, "cacheTimeout": null, "dashLength": 10, "dashes": false, "datasource": "prometheus", "description": "", "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] }, "fill": 0, "fillGradient": 1, "gridPos": { "h": 8, "w": 8, "x": 16, "y": 53 }, "hiddenSeries": false, "hideTimeOverride": false, "id": 16, "legend": { "alignAsTable": false, "avg": false, "current": true, "max": false, "min": false, "rightSide": false, "show": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "null", "options": { "dataLinks": [] }, "percentage": false, "pluginVersion": "6.4.2", "pointradius": 1, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "/每秒上下文切換次數.*/", "color": "#FADE2A", "lines": false, "pointradius": 1, "points": true, "yaxis": 2 }, { "alias": "/使用的文件描述符.*/", "color": "#F2495C" } ], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "node_filefd_allocated{instance=~\"$node\"}", "format": "time_series", "instant": false, "interval": "", "intervalFactor": 5, "legendFormat": "使用的文件描述符", "refId": "B" }, { "expr": "irate(node_context_switches_total{instance=~\"$node\"}[5m])", "interval": "", "intervalFactor": 5, "legendFormat": "每秒上下文切換次數", "refId": "A" }, { "expr": " (node_filefd_allocated{instance=~\"$node\"}/node_filefd_maximum{instance=~\"$node\"}) *100", "format": "time_series", "hide": true, "instant": false, "interval": "", "intervalFactor": 5, "legendFormat": "使用的文件描述符佔比_{{instance}}", "refId": "C" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "打開的文件描述符(左 )/每秒上下文切換次數(右)", "tooltip": { "shared": true, "sort": 2, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "format": "short", "label": "使用的文件描述符", "logBase": 1, "max": null, "min": null, "show": true }, { "format": "short", "label": "context_switches", "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "", "schemaVersion": 20, "style": "dark", "tags": [ "Prometheus", "node_exporter" ], "templating": { "list": [ { "allValue": null, "current": { "tags": [], "text": "node-exporter", "value": "node-exporter" }, "datasource": "prometheus", "definition": "label_values(node_uname_info, job)", "hide": 0, "includeAll": false, "index": -1, "label": "JOB", "multi": false, "name": "job", "options": [], "query": "label_values(node_uname_info, job)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "All", "value": "$__all" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\"}, nodename)", "hide": 0, "includeAll": true, "index": -1, "label": "主機名", "multi": false, "name": "hostname", "options": [], "query": "label_values(node_uname_info{job=~\"$job\"}, nodename)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allFormat": "glob", "allValue": null, "current": { "text": "ymt108", "value": "ymt108" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)", "hide": 0, "includeAll": false, "index": -1, "label": "Instance", "multi": true, "multiFormat": "regex values", "name": "node", "options": [], "query": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allFormat": "glob", "allValue": null, "current": { "text": "All", "value": "$__all" }, "datasource": "prometheus", "definition": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)", "hide": 0, "includeAll": true, "index": -1, "label": "網卡", "multi": true, "multiFormat": "regex values", "name": "device", "options": [], "query": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 1, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "/", "value": "/" }, "datasource": "prometheus", "definition": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))", "hide": 2, "includeAll": false, "index": -1, "label": "最大掛載目錄", "multi": false, "name": "maxmount", "options": [], "query": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))", "refresh": 2, "regex": "/.*\\\"(.*)\\\".*/", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false }, { "allValue": null, "current": { "text": "ymt108", "value": "ymt108" }, "datasource": "prometheus", "definition": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)", "hide": 2, "includeAll": false, "index": -1, "label": "展示使用的主機名", "multi": false, "name": "show_hostname", "options": [], "query": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 5, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false } ] }, "time": { "from": "now-12h", "to": "now" }, "timepicker": { "hidden": false, "now": true, "refresh_intervals": [ "15s", "30s", "1m", "5m", "15m", "30m" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "timezone": "browser", "title": "育苗通Node資源監控", "uid": "hb7fSE0Zz", "version": 11 }
node-model.json
當然,默認還內置了很多k8s相關的資源監控模板。
八、匯總
當我們完成了所有配置, 那接下來還需要整理一下,編寫升級腳本upgrade.sh,方便之後部署,以及修改更新。
#!/bin/sh # upgrade alertmanager configuration kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring# upgrade scrape configs kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring# upgrade prometheus rules kubectl apply -f prometheus-additional-rules.yaml kubectl apply -f prometheus-rules.yaml # upgrade prometheus configuration kubectl apply -f prometheus-prometheus.yaml
作者:Leozhanggg
出處://www.cnblogs.com/leozhanggg/p/13502983.html
本文版權歸作者和部落格園共有,歡迎轉載,但未經作者同意必須保留此段聲明,且在文章頁面明顯位置給出原文連接,否則保留追究法律責任的權利。