淺析Kubernetes架構之workqueue

2022 年 6 月 17 日
筆記
Kubernetes

通用隊列

在kubernetes中，使用go的channel無法滿足kubernetes的應用場景，如延遲、限速等；在kubernetes中存在三種隊列通用隊列 common queue ，延遲隊列 delaying queue，和限速隊列 rate limiters queue

Inferface

Interface作為所有隊列的一個抽象定義

type Interface interface {
	Add(item interface{})
	Len() int
	Get() (item interface{}, shutdown bool)
	Done(item interface{})
	ShutDown()
	ShuttingDown() bool
}

Implementation

type Type struct { // 一個work queue
	queue []t // queue用slice做存儲
	dirty set // 臟位，定義了需要處理的元素，類似於作業系統，表示已修改但為寫入
	processing set // 當前正在處理的元素集合
	cond *sync.Cond
	shuttingDown bool
	metrics queueMetrics
	unfinishedWorkUpdatePeriod time.Duration
	clock                      clock.Clock
}
type empty struct{}
type t interface{} // t queue中的元素
type set map[t]empty // dirty 和 processing中的元素

可以看到其中核心屬性就是 queue , dirty , processing

延遲隊列

在研究優先順序隊列前，需要對 Heap 有一定的了解，因為delay queue使用了 heap 做延遲隊列

Heap

Heap 是基於樹屬性的特殊數據結構；heap是一種完全二叉樹類型，具有兩種類型：

如：B 是 A 的子節點，則 \(key(A) \geq key(B)\) 。這就意味著具有最大Key的元素始終位於根節點，這類Heap稱為最大堆 MaxHeap。
父節點的值小於或等於其左右子節點的值叫做 MinHeap

二叉堆的存儲規則：

每個節點包含的元素大於或等於該節點子節點的元素。
樹是完全二叉樹。

那麼下列圖片中，那個是堆

heap的實現

實例：向左邊添加一個值為42的元素的過程

步驟一：將新元素放入堆中的第一個可用位置。這將使結構保持為完整的二叉樹，但它可能不再是堆，因為新元素可能具有比其父元素更大的值。

步驟二：如果新元素的值大於父元素，將新元素與父元素交換，直到達到新元素到根，或者新元素大於等於其父元素的值時將停止

這種過程被稱為 向上調整 （reheapification upward）

實例：移除根

步驟一：將根元素複製到用於返回值的變數中，將最深層的最後一個元素複製到根，然後將最後一個節點從樹中取出。該元素稱為 out-of-place 。

步驟二：而將異位元素與其最大值的子元素交換，並返回在步驟1中保存的值。

這個過程被稱為向下調整 （reheapification downward）

優先順序隊列

優先順序隊列的行為：

元素被放置在隊列中，然後被取出。
優先順序隊列中的每個元素都有一個關聯的數字，稱為優先順序。
當元素離開優先順序隊列時，最高優先順序的元素最先離開。

如何實現的：

在優先順序隊列中，heap的每個節點都包含一個元素以及元素的優先順序，並且維護樹以便它遵循使用元素的優先順序來比較節點的堆存儲規則：
- 每個節點包含的元素的優先順序大於或等於該節點子元素的優先順序。
- 樹是完全二叉樹。
實現的程式碼：golang priorityQueue

Reference

heap

Client-go 的延遲隊列

在Kubernetes中對 delaying queue 的設計非常精美，通過使用 heap 實現的延遲隊列，加上kubernetes中的通過隊列，完成了延遲隊列的功能。

// 注釋中給了一個hot-loop熱循環，通過這個loop實現了delaying
type DelayingInterface interface {
	Interface // 繼承了workqueue的功能
	AddAfter(item interface{}, duration time.Duration) // 在time後將內容添加到工作隊列中
}

具體實現了 DelayingInterface 的實例

type delayingType struct {
	Interface // 通用的queue 
	clock clock.Clock // 對比的時間 ，包含一些定時器的功能
    	type Clock interface {
            PassiveClock
            		type PassiveClock interface {
                        Now() time.Time
                        Since(time.Time) time.Duration
                    }
            After(time.Duration) <-chan time.Time
            NewTimer(time.Duration) Timer
            Sleep(time.Duration)
            NewTicker(time.Duration) Ticker
        }
	stopCh chan struct{} // 停止loop
	stopOnce sync.Once // 保證退出只會觸發一次
	heartbeat clock.Ticker // 一個定時器，保證了loop的最大空事件等待時間
	waitingForAddCh chan *waitFor // 普通的chan，用來接收數據插入到延遲隊列中
	metrics retryMetrics // 重試的指數
}

那麼延遲隊列的整個數據結構如下圖所示

而上面部分也說到了，這個延遲隊列的核心就是一個優先順序隊列，而優先順序隊列又需要滿足：

優先順序隊列中的每個元素都有一個關聯的數字，稱為優先順序。
當元素離開優先順序隊列時，最高優先順序的元素最先離開。

而 waitFor 就是這個優先順序隊列的數據結構

type waitFor struct {
	data    t // 數據
	readyAt time.Time // 加入工作隊列的時間
	index int // 優先順序隊列中的索引
}

而 waitForPriorityQueue 是對 container/heap/heap.go.Inferface 的實現，其數據結構就是使最小 readyAt 位於Root 的一個 MinHeap

type Interface interface {
	sort.Interface
	Push(x interface{}) // add x as element Len()
	Pop() interface{}   // remove and return element Len() - 1.
}

而這個的實現是 waitForPriorityQueue

type waitForPriorityQueue []*waitFor

func (pq waitForPriorityQueue) Len() int {
	return len(pq)
}
// 這個也是最重要的一個，就是哪個屬性是排序的關鍵，也是heap.down和heap.up中使用的
func (pq waitForPriorityQueue) Less(i, j int) bool {
	return pq[i].readyAt.Before(pq[j].readyAt)
}
func (pq waitForPriorityQueue) Swap(i, j int) {
	pq[i], pq[j] = pq[j], pq[i]
	pq[i].index = i
	pq[j].index = j
}
// push 和pop 必須使用heap.push 和heap.pop
func (pq *waitForPriorityQueue) Push(x interface{}) {
	n := len(*pq)
	item := x.(*waitFor)
	item.index = n
	*pq = append(*pq, item)
}


func (pq *waitForPriorityQueue) Pop() interface{} {
	n := len(*pq)
	item := (*pq)[n-1]
	item.index = -1
	*pq = (*pq)[0:(n - 1)]
	return item
}

// Peek returns the item at the beginning of the queue, without removing the
// item or otherwise mutating the queue. It is safe to call directly.
func (pq waitForPriorityQueue) Peek() interface{} {
	return pq[0]
}

而整個延遲隊列的核心就是 waitingLoop，作為了延遲隊列的主要邏輯，檢查 waitingForAddCh 有沒有要延遲的內容，取出延遲的內容放置到 Heap 中；以及保證最大的阻塞周期

func (q *delayingType) waitingLoop() {
	defer utilruntime.HandleCrash()
	never := make(<-chan time.Time) // 作為佔位符
	var nextReadyAtTimer clock.Timer // 最近一個任務要執行的定時器
	waitingForQueue := &waitForPriorityQueue{} // 優先順序隊列，heap
	heap.Init(waitingForQueue)
	waitingEntryByData := map[t]*waitFor{} // 檢查是否反覆添加

	for {
		if q.Interface.ShuttingDown() {
			return
		}

		now := q.clock.Now()
		for waitingForQueue.Len() > 0 {
			entry := waitingForQueue.Peek().(*waitFor)
			if entry.readyAt.After(now) {
				break // 時間沒到則不處理
			}

			entry = heap.Pop(waitingForQueue).(*waitFor) // 從優先順序隊列中取出一個
			q.Add(entry.data) // 添加到延遲隊列中
			delete(waitingEntryByData, entry.data) // 刪除map表中的數據
		}

		// 如果存在數據則設置最近一個內容要執行的定時器
		nextReadyAt := never
		if waitingForQueue.Len() > 0 {
			if nextReadyAtTimer != nil {
				nextReadyAtTimer.Stop()
			}
			entry := waitingForQueue.Peek().(*waitFor) // 窺視[0]和值
			nextReadyAtTimer = q.clock.NewTimer(entry.readyAt.Sub(now)) // 創建一個定時器
			nextReadyAt = nextReadyAtTimer.C()
		}

		select {
		case <-q.stopCh: // 退出
			return
		case <-q.heartbeat.C(): // 多久沒有任何動作時重新一次循環
		case <-nextReadyAt: // 如果有元素時間到了，則繼續執行循環，處理上面添加的操作
		case waitEntry := <-q.waitingForAddCh:
			if waitEntry.readyAt.After(q.clock.Now()) { // 時間沒到，是用readyAt和now對比time.Now
				// 添加到延遲隊列中，有兩個 waitingEntryByData waitingForQueue
				insert(waitingForQueue, waitingEntryByData, waitEntry)
			} else {
				q.Add(waitEntry.data)
			}

			drained := false // 保證可以取完q.waitingForAddCh // addafter
			for !drained {
				select {
                // 這裡是一個有buffer的隊列，需要保障這個隊列讀完
				case waitEntry := <-q.waitingForAddCh: 
					if waitEntry.readyAt.After(q.clock.Now()) {
						insert(waitingForQueue, waitingEntryByData, waitEntry)
					} else {
						q.Add(waitEntry.data)
					}
				default: // 保證可以退出，但限制於上一個分支的0~n的讀取
				// 如果上一個分支阻塞，則為沒有數據就是取盡了，走到這個分支
				// 如果上個分支不阻塞則讀取到上個分支阻塞為止，代表阻塞，則走default退出
					drained = true
				}
			}
		}
	}
}

限速隊列

限速隊列 RateLimiting 是在優先順序隊列是在延遲隊列的基礎上進行擴展的一個隊列

type RateLimitingInterface interface {
	DelayingInterface // 繼承延遲隊列
	// 在限速器準備完成後（即合規後）添加條目到隊列中
	AddRateLimited(item interface{})
	// drop掉條目，無論成功或失敗
	Forget(item interface{})
	// 被重新放入隊列中的次數
	NumRequeues(item interface{}) int
}

可以看到一個限速隊列的抽象對應只要滿足了 AddRateLimited() , Forget() , NumRequeues() 的延遲隊列都是限速隊列。看了解規則之後，需要對具體的實現進行分析。

type rateLimitingType struct {
	DelayingInterface
	rateLimiter RateLimiter
}

func (q *rateLimitingType) AddRateLimited(item interface{}) {
	q.DelayingInterface.AddAfter(item, q.rateLimiter.When(item))
}

func (q *rateLimitingType) NumRequeues(item interface{}) int {
	return q.rateLimiter.NumRequeues(item)
}

func (q *rateLimitingType) Forget(item interface{}) {
	q.rateLimiter.Forget(item)
}

rateLimitingType 則是對抽象規範 RateLimitingInterface 的實現，可以看出是在延遲隊列的基礎上增加了一個限速器 RateLimiter

type RateLimiter interface {
	// when決定等待多長時間
	When(item interface{}) time.Duration
	// drop掉item
	// or for success, we'll stop tracking it
	Forget(item interface{})
	// 重新加入隊列中的次數
	NumRequeues(item interface{}) int
}

抽象限速器的實現，有 BucketRateLimiter , ItemBucketRateLimiter , ItemExponentialFailureRateLimiter , ItemFastSlowRateLimiter , MaxOfRateLimiter ，下面對這些限速器進行分析

BucketRateLimiter

BucketRateLimiter 是實現 rate.Limiter 與抽象 RateLimiter 的一個令牌桶，初始化時通過 workqueue.DefaultControllerRateLimiter() 進行初始化。

func DefaultControllerRateLimiter() RateLimiter {
	return NewMaxOfRateLimiter(
		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

更多關於令牌桶演算法可以參考這裡

ItemBucketRateLimiter

ItemBucketRateLimiter 是作為列表存儲每個令牌桶的實現，每個key都是單獨的限速器

type ItemBucketRateLimiter struct {
	r     rate.Limit
	burst int

	limitersLock sync.Mutex
	limiters     map[interface{}]*rate.Limiter
}

func NewItemBucketRateLimiter(r rate.Limit, burst int) *ItemBucketRateLimiter {
	return &ItemBucketRateLimiter{
		r:        r,
		burst:    burst,
		limiters: make(map[interface{}]*rate.Limiter),
	}
}

ItemExponentialFailureRateLimiter

如名所知 ItemExponentialFailureRateLimiter 限速器是一個錯誤指數限速器，根據錯誤的次數，將指數用於delay的時長，指數的計算公式為：\(baseDelay\times2^{<num-failures>}\)。可以看出When絕定了流量整形的delay時間，根據錯誤次數為指數進行延長重試時間

type ItemExponentialFailureRateLimiter struct {
	failuresLock sync.Mutex
	failures     map[interface{}]int // 失敗的次數

	baseDelay time.Duration // 延遲基數
	maxDelay  time.Duration // 最大延遲
}

func (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	exp := r.failures[item]
	r.failures[item] = r.failures[item] + 1

	// The backoff is capped such that 'calculated' value never overflows.
	backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp))
	if backoff > math.MaxInt64 {
		return r.maxDelay
	}

	calculated := time.Duration(backoff)
	if calculated > r.maxDelay {
		return r.maxDelay
	}

	return calculated
}

func (r *ItemExponentialFailureRateLimiter) NumRequeues(item interface{}) int {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	return r.failures[item]
}

func (r *ItemExponentialFailureRateLimiter) Forget(item interface{}) {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	delete(r.failures, item)
}

ItemFastSlowRateLimiter

ItemFastSlowRateLimiter ，限速器先快速重試一定次數，然後慢速重試

type ItemFastSlowRateLimiter struct {
	failuresLock sync.Mutex
	failures     map[interface{}]int

	maxFastAttempts int // 最大嘗試次數
	fastDelay       time.Duration // 快的速度
	slowDelay       time.Duration // 慢的速度
}


func NewItemFastSlowRateLimiter(fastDelay, slowDelay time.Duration, maxFastAttempts int) RateLimiter {
	return &ItemFastSlowRateLimiter{
		failures:        map[interface{}]int{},
		fastDelay:       fastDelay,
		slowDelay:       slowDelay,
		maxFastAttempts: maxFastAttempts,
	}
}

func (r *ItemFastSlowRateLimiter) When(item interface{}) time.Duration {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	r.failures[item] = r.failures[item] + 1
	// 當錯誤次數沒超過快速的閾值使用快速，否則使用慢速
	if r.failures[item] <= r.maxFastAttempts {
		return r.fastDelay
	}

	return r.slowDelay
}

func (r *ItemFastSlowRateLimiter) NumRequeues(item interface{}) int {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	return r.failures[item]
}

func (r *ItemFastSlowRateLimiter) Forget(item interface{}) {
	r.failuresLock.Lock()
	defer r.failuresLock.Unlock()

	delete(r.failures, item)
}

MaxOfRateLimiter

MaxOfRateLimiter 是返回限速器列表中，延遲最大的那個限速器

type MaxOfRateLimiter struct {
	limiters []RateLimiter
}

func (r *MaxOfRateLimiter) When(item interface{}) time.Duration {
	ret := time.Duration(0)
	for _, limiter := range r.limiters {
		curr := limiter.When(item)
		if curr > ret {
			ret = curr
		}
	}

	return ret
}

func NewMaxOfRateLimiter(limiters ...RateLimiter) RateLimiter {
	return &MaxOfRateLimiter{limiters: limiters}
}

func (r *MaxOfRateLimiter) NumRequeues(item interface{}) int {
	ret := 0
    // 找到列表內所有的NumRequeues（失敗的次數），以最多次的為主。 
	for _, limiter := range r.limiters {
		curr := limiter.NumRequeues(item)
		if curr > ret {
			ret = curr
		}
	}

	return ret
}

func (r *MaxOfRateLimiter) Forget(item interface{}) {
	for _, limiter := range r.limiters {
		limiter.Forget(item)
	}
}

如何使用Kubernetes的限速器

基於流量管制的限速隊列實例，可以大量突發，但是需要進行整形，添加操作會根據 When() 中設計的需要等待的時間進行添加。根據不同的隊列實現不同方式的延遲

package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	stopCh := make(chan string)
	timeLayout := "2006-01-02:15:04:05.0000"
	limiter := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	length := 20 // 一共請求20次
	chs := make([]chan string, length)
	for i := 0; i < length; i++ {
		chs[i] = make(chan string, 1)
		go func(taskId string, ch chan string) {
			item := "Task-" + taskId + time.Now().Format(timeLayout)
			log.Println(item + " Added.")
            limiter.AddRateLimited(item) // 添加會根據When() 延遲添加到工作隊列中

		}(strconv.FormatInt(int64(i), 10), chs[i])

		go func() {
			for {
				key, quit := limiter.Get()
				if quit {
					return
				}
				log.Println(fmt.Sprintf("%s process done", key))
				defer limiter.Done(key)

			}
		}()
	}
	<-stopCh
}

因為默認的限速器不支援初始化 QPS，修改源碼內的為 \(BT(1, 5)\) ，執行結果可以看出，大突發流量時，超過桶內token數時，會根據token生成的速度進行放行。

圖中，任務的添加是突發性的，日誌列印的是同時添加，但是在添加前輸出的日誌，消費端可以看到實際是被延遲了。配置的是每秒一個token，實際上放行流量也是每秒一個token。

Tags: Kubernetes