Pod Scheduling in k8s

Pod Scheduling

By default, the node a pod runs on is worked out by the scheduler component using the appropriate algorithm; this process is not subject to manual control.

In practice, however, this is not always what we need: in many cases we want to steer certain pods onto certain nodes. How do we do that?

That calls for understanding the rules k8s uses to schedule pods. k8s provides four broad categories of scheduling:

  • Automatic scheduling: the node a pod runs on is decided entirely by the scheduler through a series of algorithms
  • Directed scheduling: nodeName, nodeSelector
  • Affinity scheduling: nodeAffinity, podAffinity, podAntiAffinity
  • Taint (toleration) scheduling: taints, tolerations

Directed Scheduling

Directed scheduling means declaring nodeName or nodeSelector on a pod to send it to the desired node. Note that this kind of scheduling is mandatory: even if the target node does not exist, the pod is still assigned to it; it simply fails to run.

nodeName

nodeName forcibly binds a pod to the node with the given name. This approach skips the scheduler's logic entirely and writes the pod straight into the target node's pod list.

Let's try it out: create a pod-nodename.yaml file

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1         # schedule onto node1

Apply the configuration file

[root@master ~]# kubectl create -f pod-nodename.yaml 
pod/pod-nodename created
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodename   1/1     Running   0          49s   10.244.2.35   node1   <none>           <none>

We can see the pod is running on node1.

Next, delete the pod and change the config file to node3

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node3         # schedule onto node3 (a node that does not exist)

Apply the configuration file

[root@master ~]# kubectl delete -f pod-nodename.yaml 
pod "pod-nodename" deleted
[root@master ~]# vim pod-nodename.yaml 
[root@master ~]# kubectl create -f pod-nodename.yaml 
pod/pod-nodename created
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
pod-nodename   0/1     Pending   0          21s   <none>   node3   <none>           <none>

Although the pod was assigned to node3, node3 does not exist, so the pod stays Pending and never starts.

 

nodeSelector

nodeSelector schedules a pod onto nodes carrying the specified labels. It is built on the k8s label-selector mechanism: before the pod is created, the scheduler runs the MatchNodeSelector predicate to match labels and find the target nodes, then schedules the pod onto one of them. This matching rule is a hard constraint.

Let's try it out:

1. First, add labels to the nodes

[root@master ~]# kubectl label nodes node1 nodeenv=pro
node/node1 labeled
[root@master ~]# kubectl label nodes node2 nodeenv=test
node/node2 labeled

2. Create a pod-nodeselector.yaml file and use it to create a pod

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector: 
    nodeenv: pro  # schedule onto a node labeled nodeenv=pro

Apply the configuration file (steps omitted here)

[root@master ~]# kubectl get pod pod-nodeselector -n dev -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodeselector   1/1     Running   0          22m   10.244.2.36   node1   <none>           <none>

We can see the pod has been scheduled to node1.

 

Affinity Scheduling

The two directed-scheduling approaches above are very convenient, but they share a problem: if no node satisfies the condition, the pod will not run, even when usable nodes remain in the cluster. This limits where they can be used.

To address this, k8s also provides affinity scheduling. It extends nodeSelector so that, through configuration, the scheduler preferentially picks nodes that satisfy the conditions and, in its soft (preferred) form, can still fall back to nodes that don't, making scheduling more flexible.

Affinity comes in three main flavors:

  • nodeAffinity (node affinity): targets nodes; solves the problem of which nodes a pod may be scheduled onto
  • podAffinity (pod affinity): targets pods; solves the problem of which existing pods a pod may be deployed into the same topology domain with
  • podAntiAffinity (pod anti-affinity): targets pods; solves the problem of which existing pods a pod must not share a topology domain with

When to use affinity (anti-affinity):

  • Affinity: if two applications interact frequently, it pays to use affinity to keep them as close together as possible and cut the performance cost of network traffic.
  • Anti-affinity: when an application is deployed with multiple replicas, use anti-affinity to spread the instances across nodes, which improves the service's availability (a sketch follows this list).
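
As an illustration of the anti-affinity use case, here is a minimal sketch (the Deployment name and its app=web label are made up for this example, not taken from the demos below): each replica repels pods carrying its own label, so the replicas spread out one per node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: dev
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.17.1
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint: at most one replica per node
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname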

nodeAffinity

Notes on the operators:

- matchExpressions:
  - key: nodeenv         # match nodes that have a label with key nodeenv
    operator: Exists
  - key: nodeenv         # match nodes whose nodeenv label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv         # match nodes whose nodeenv label value is greater than "xxx"
    operator: Gt
    values: ["xxx"]
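
The remaining operators, NotIn and DoesNotExist, are the negated forms and can be used to express node-level anti-affinity; a minimal sketch in the same style:

- matchExpressions:
  - key: nodeenv         # match nodes whose nodeenv label value is NOT "xxx" or "yyy"
    operator: NotIn
    values: ["xxx","yyy"]
  - key: nodeenv         # match nodes that do NOT have a label with key nodeenv
    operator: DoesNotExist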

Let's first demonstrate requiredDuringSchedulingIgnoredDuringExecution.

Create pod-nodeaffinity-required.yaml

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]

Create and apply the configuration file

[root@master ~]# vim pod-nodeaffinity-required.yaml
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml 
pod/pod-nodeaffinity-required created
[root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
NAME                        READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-required   0/1     Pending   0          14s

The pod fails to start; view the detailed description

[root@master ~]# kubectl describe pod pod-nodeaffinity-required -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.

Delete the pod and edit the values in the config file

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodeenv
            operator: In
            values: ["pro","yyy"]
[root@master ~]# kubectl delete -f pod-nodeaffinity-required.yaml 
pod "pod-nodeaffinity-required" deleted
[root@master ~]# vim pod-nodeaffinity-required.yaml 
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml 
pod/pod-nodeaffinity-required created
[root@master ~]# kubectl get pod pod-nodeaffinity-required -n dev
NAME                        READY   STATUS    RESTARTS   AGE
pod-nodeaffinity-required   1/1     Running   0          35s

This time creation succeeds.

 

Next, let's demonstrate preferredDuringSchedulingIgnoredDuringExecution.

Create pod-nodeaffinity-preferred.yaml

apiVersion: v1
kind: Pod
metadata: 
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    nodeAffinity:   # node affinity
      preferredDuringSchedulingIgnoredDuringExecution:  # soft constraint
      - weight: 1
        preference:
          matchExpressions:
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]

Create and apply the configuration file

[root@master ~]# vim pod-nodeaffinity-preferred.yaml
[root@master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml 
pod/pod-nodeaffinity-preferred created
[root@master ~]# kubectl get pod pod-nodeaffinity-preferred -n dev -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-nodeaffinity-preferred   1/1     Running   0          23s   10.244.2.38   node1   <none>           <none>

The pod is scheduled to node1 even though no node carries nodeenv=xxx or nodeenv=yyy: a soft constraint only expresses a preference, so the scheduler still places the pod on an available node.

 

Notes on nodeAffinity rules:

  • If nodeSelector and nodeAffinity are both defined, both conditions must be satisfied for the pod to run on a node
  • If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is enough (see the sketch after this list)
  • If a single nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
  • If the labels of the node a pod is running on change so that they no longer satisfy the pod's node affinity, the system ignores the change (that is the IgnoredDuringExecution part)
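
A minimal sketch of rules 2 and 3 (the labels are hypothetical): the two nodeSelectorTerms below are ORed against each other, while the two matchExpressions inside the first term are ANDed:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:          # term 1: nodeenv=pro AND a disktype label exists
          - key: nodeenv
            operator: In
            values: ["pro"]
          - key: disktype
            operator: Exists
        - matchExpressions:          # term 2: nodeenv=test (enough on its own, even if term 1 fails)
          - key: nodeenv
            operator: In
            values: ["test"]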

 

podAffinity

podAffinity takes running pods as the reference and makes newly created pods land in the same topology domain as the reference pods.

topologyKey specifies the scope (topology domain) used during scheduling, for example:

  • kubernetes.io/hostname: the domain is the individual Node
  • beta.kubernetes.io/os: nodes are grouped by operating system type (a zone-level sketch follows this list)
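
For example, to co-locate pods at the zone level rather than the node level, point topologyKey at a zone label. A sketch, assuming the nodes carry the failure-domain.beta.kubernetes.io/zone label used by clusters of this vintage:

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: failure-domain.beta.kubernetes.io/zone   # same zone, not necessarily the same node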

Next, let's demonstrate requiredDuringSchedulingIgnoredDuringExecution.

Create a reference pod, pod-podaffinity-target.yaml

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: pro   # set a label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1  # pin the reference pod explicitly to node1
[root@master ~]# vim pod-podaffinity-target.yaml
[root@master ~]# kubectl create -f pod-podaffinity-target.yaml 
pod/pod-podaffinity-target created
[root@master ~]# kubectl get pod pod-podaffinity-target -n dev -o wide --show-labels
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-target   1/1     Running   0          2m47s   10.244.2.39   node1   <none>           <none>            podenv=pro

Create pod-podaffinity-required.yaml

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAffinity:   # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["xxx","yyy"]
          - key: podenv
            operator: In
            values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname
[root@master ~]# vim pod-podaffinity-required.yaml
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml 
pod/pod-podaffinity-required created
[root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   0/1     Pending   0          24s   <none>   <none>   <none>           <none>            <none>

Scheduling fails; check the scheduling events

[root@master ~]# kubectl describe pod pod-podaffinity-required -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.

Delete the pod and edit the config file again

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAffinity:   # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["pro","yyy"]
          - key: podenv
            operator: In
            values: ["pro","yyy"]
        topologyKey: kubernetes.io/hostname
[root@master ~]# kubectl delete -f pod-podaffinity-required.yaml 
pod "pod-podaffinity-required" deleted
[root@master ~]# vim pod-podaffinity-required.yaml 
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml 
pod/pod-podaffinity-required created
[root@master ~]# kubectl get pod pod-podaffinity-required -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   1/1     Running   0          11s   10.244.2.40   node1   <none>           <none>            <none>

 

podAntiAffinity

podAntiAffinity takes running pods as the reference and keeps newly created pods out of the reference pods' topology domain.

Its configuration options are identical to podAffinity's, so we skip the detailed explanation and go straight to a test case.

Continue with the target pod from the previous case

[root@master ~]# kubectl get pod -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES   LABELS
pod-podaffinity-required   1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>            <none>
pod-podaffinity-target     1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>            podenv=pro

Create pod-podantiaffinity-required.yaml with the following content

apiVersion: v1
kind: Pod
metadata: 
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:   # affinity settings
    podAntiAffinity:   # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:  # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is in ["pro"]
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname

Apply the configuration file

[root@master ~]# vim pod-podantiaffinity-required.yaml
[root@master ~]# kubectl create -f pod-podantiaffinity-required.yaml 
pod/pod-podantiaffinity-required created
[root@master ~]# kubectl get pod -n dev -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
pod-podaffinity-required       1/1     Running   2          24h   10.244.2.57   node1   <none>           <none>
pod-podaffinity-target         1/1     Running   2          24h   10.244.2.56   node1   <none>           <none>
pod-podantiaffinity-required   1/1     Running   0          10s   10.244.1.57   node2   <none>           <none>

The pod is scheduled to node2, away from the podenv=pro pod on node1.

 

Taints and Tolerations

Taints

The scheduling approaches so far all take the pod's point of view: attributes added on the pod decide whether it is scheduled to a given node. We can also take the node's point of view and add taints to a node to decide whether pods are allowed to be scheduled onto it at all.

Once a node is tainted, a repelling relationship exists between it and pods: the node refuses to let pods schedule onto it and can even evict pods already running on it.

A taint has the format key=value:effect. key and value are the taint's label; effect describes what the taint does and supports three options:

  • PreferNoSchedule: k8s tries to avoid scheduling pods onto a node with this taint, unless no other node is available
  • NoSchedule: k8s will not schedule pods onto a node with this taint, but pods already on the node are unaffected
  • NoExecute: k8s will not schedule pods onto a node with this taint and also evicts pods already running on it

The kubectl commands for setting and removing taints are as follows:

# set a taint
kubectl taint nodes nodeName key=value:effect

# remove a taint
kubectl taint nodes nodeName key:effect-

# remove all taints with the given key
kubectl taint nodes nodeName key-

Next, let's demonstrate the effect of taints:

  1. Prepare node node1 (to make the effect more obvious, temporarily stop the node2 node)
  2. Set a taint on node1: tag=ayanami:PreferNoSchedule; then create pod1 (pod1 can run)
  3. Change node1's taint to tag=ayanami:NoSchedule; then create pod2 (pod1 stays normal, pod2 fails)
  4. Change node1's taint to tag=ayanami:NoExecute; then create pod3 (all 3 pods fail)

Set a taint on node1 (PreferNoSchedule):

[root@master ~]# kubectl taint nodes node1 tag=ayanami:PreferNoSchedule
node/node1 tainted

Create pod1

[root@master ~]# kubectl run taint1 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.

[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          6m16s

Update node1's taint (remove PreferNoSchedule, set NoSchedule)

[root@master ~]# kubectl taint nodes node1 tag:PreferNoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint nodes node1 tag=ayanami:NoSchedule
node/node1 tainted

Check the pods again; nothing has changed

[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          10m

Create a new taint2 and check

[root@master ~]# kubectl run taint2 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/taint2 created
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-lhmcj   1/1     Running   0          11m
taint2-84946958cf-h9765   0/1     Pending   0          15s

The new pod cannot reach Running; inspect taint2

[root@master ~]# kubectl describe pod taint2 -n dev
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.

Update node1's taint (remove NoSchedule, set NoExecute)

[root@master ~]# kubectl taint node node1 tag:NoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint node node1 tag=ayanami:NoExecute
node/node1 tainted
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-fdtqw   0/1     Pending   0          30s
taint2-84946958cf-26rfx   0/1     Pending   0          30s

Both earlier pods have been knocked out: the NoExecute taint evicted them from node1, and their deployments recreated replicas that now sit Pending. Create one more, taint3

[root@master ~]# kubectl run taint3 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
taint1-766c47bf55-fdtqw   0/1     Pending   0          97s
taint2-84946958cf-26rfx   0/1     Pending   0          97s
taint3-57d45f9d4c-68pwr   0/1     Pending   0          9s

The new pod cannot be scheduled either.

Extension:

A cluster built with kubeadm adds a taint to the master node by default, which is why pods are not scheduled onto the master:

[root@master ~]# kubectl describe node master
Taints:             node-role.kubernetes.io/master:NoSchedule
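
If you do want pods to land on the master (for example in a single-node lab cluster), this default taint can be removed with the taint-removal syntax shown earlier; a sketch:

# remove the master's NoSchedule taint (lab use only)
kubectl taint nodes master node-role.kubernetes.io/master:NoSchedule-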

Tolerations

We saw above that taints let a node reject pod scheduling. But what if we want to schedule a pod onto a tainted node anyway? That is what tolerations are for.

A taint rejects; a toleration ignores the rejection. The node uses a taint to refuse pods, and the pod uses a toleration to ignore the refusal.

Let's look at the effect through a case first:

  1. In the previous section we put a NoExecute taint on node1, so pods cannot be scheduled there
  2. In this section we add a toleration to a pod and schedule it there anyway

Create pod-toleration.yaml with the following content

apiVersion: v1
kind: Pod
metadata: 
  name: pod-toleration
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations:    # add a toleration
  - key: "tag"   # key of the taint to tolerate
    operator: "Equal"   # operator
    value: "ayanami"   # value of the taint to tolerate
    effect: "NoExecute"   # the toleration rule; must match the effect of the taint on the node

Apply the configuration file

[root@master ~]# vim pod-toleration.yaml
[root@master ~]# kubectl create -f pod-toleration.yaml 
pod/pod-toleration created
[root@master ~]# kubectl get pod -n dev
NAME                      READY   STATUS    RESTARTS   AGE
pod-toleration            1/1     Running   0          9s
taint1-766c47bf55-fdtqw   0/1     Pending   0          34m
taint2-84946958cf-26rfx   0/1     Pending   0          34m
taint3-57d45f9d4c-68pwr   0/1     Pending   0          33m

Detailed toleration configuration

key: the key of the taint to tolerate; empty means tolerate all keys
value: the value of the taint to tolerate
operator: the operator between key and value; supports Equal (the default) and Exists
effect: matches the taint's effect; empty means match all effects
tolerationSeconds: toleration period; only meaningful when effect is NoExecute, it is how long the pod may stay on the node after the taint is applied
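
A sketch combining these fields (the values are hypothetical): with operator Exists no value is given, so the pod tolerates any taint whose key is tag; tolerationSeconds then caps how long it may stay once a matching NoExecute taint appears:

  tolerations:
  - key: "tag"
    operator: "Exists"        # no value: tolerate every taint with key "tag"
    effect: "NoExecute"
    tolerationSeconds: 60     # evicted 60s after a matching NoExecute taint is applied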