分散式系統：時間、時鐘和事件序列

2019 年 10 月 3 日
筆記

在程式中，我們經常需要知道事件序列，在單體應用中，事件序列是較為簡單的，最簡單的辦法就是用時間戳，但在分散式系統中，事件序列是很困難的，Leslie Lamport大神在論文Time, Clocks, and the Ordering of Events in a Distributed System討論了在分散式系統中時間、時鐘和事件序列的問題。

【1】分散式系統中物理時鐘存在的問題

邏輯時鐘是相對物理時鐘這個概念的，為什麼要提出邏輯時鐘，因為物理時鐘在分散式系統中存在一系列問題。在一台機器上的多個進程可以從一個物理時鐘中獲取時間戳，不管這個物理時鐘是否準確，只要是從一個物理時鐘獲取時間戳，我們都能獲得多個事件的相對時間順序。但是在分散式系統中，我們無法從一個物理時鐘獲取時間戳，只能從各自機器上物理時鐘獲取時間戳，而各台機器的物理時鐘是很難完全同步的，即使有NTP，精度也是有限的。所以在分散式系統中，是不能通過物理時鐘決定事件序列的。

物理時鐘在分散式系統中也不是毫無用處，至少它一定程度上可以判斷在一台機器上的事件順序，同時分散式系統中還是有必要讓不同機器上的物理時鐘在一定精度內同步時間的，只是不作為決定事件序列的方法。

【2】偏序（Partial Ordering）

事件序列有兩種：偏序事件序列和全序事件序列。所謂的偏序指的是只能為系統中的部分事件定義先後順序。這裡的部分其實是有因果關係的事件。在論文Time, Clocks, and the Ordering of Events in a Distributed System中，偏序是由「happened before」引出的，我們先看一下"happened before"（表示為「->」）的定義：

Definition. The relation "->"on the set of events of a system is the smallest relation satisfying the following three conditions:

(1) If a and b are events in the same process, and a comes before b, then a->b.

(2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then a->b.

(3) If a->b and b->c then a->c.

在分散式系統中，只有兩個發生關聯的事件（有因果關係），我們才會去關心兩者的先來後到關係。對於並發事件，他們兩個誰先發生，誰後發生，其實我們並不關心。偏序就是用來定義兩個因果事件的發生次序，即『happens before』。而對於並發事件（沒有因果關係），並不能決定其先後，所以說這種『happens before』的關係，是一種偏序關係。

If two entities do not exchange any messages, then they probably do not need to share a common clock; events occurring on those entities are termed as concurrent events.」

【3】邏輯時鐘

論文原文中有這樣一句：We begin with an abstract point of view in which a clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred. 這句話的意思是，可以把時間進行抽象，把時間值看成是事件發生順序的一個序列號，這個值可以<20190515,20190516,20190517>，也可以是<1,2,3>。後面就有了邏輯時鐘的概念。定義如下：
we define a clock Ci for each process Pi to be a function which assigns a number Ci(a) to any event a in that process.

Clock Condition. For any events a,b: if a->b then C(a) < C(b).
C1. If a and b are events in process Pi, and a comes before b, then Ci(a) < Ci(b).
C2. If a is the sending of a message by process Pi and b is the receipt of that message by process Pi, then Ci(a) < Ci(b).

具體的，根據上面的定義條件，我們做如下實現規則：

每個事件對應一個Lamport時間戳，初始值為0
如果事件在節點內發生，本地進程中的時間戳加1
如果事件屬於發送事件，本地進程中的時間戳加1並在消息中帶上該時間戳
如果事件屬於接收事件，本地進程中的時間戳 = Max(本地時間戳，消息中的時間戳) + 1

根據上面的定義，我們知道a->b，C(a)<C(b)，但如果C(a)=C(b)，那麼a,b是什麼順序呢？它們肯定不是因果關係，所以它們之間的先後其實並不會影響結果，我們這裡只需要給出一種確定的方式來定義它們之間的先後就能得到全序關係。

一種可行的方式是利用給進程編號，利用進程編號的大小來排序。假設a、b分別在節點P、Q上發生，Pi、Qj分別表示我們給P、Q的編號，如果 C(a)=C(b) 並且 Pi<Qj，同樣定義為a發生在b之前，記作 a⇒b（全序關係）。假如我們上圖的A、B、C分別編號Ai=1、Bj=2、Ck=3，因 C(B4)=C(C3) 並且 Bj<Ck，則 B4⇒C3。

通過以上定義，我們可以對所有事件排序，獲得事件的全序關係(total order)。上圖例子，我們可以進行排序：C1⇒B1⇒B2⇒A1⇒B3⇒A2⇒C2⇒B4⇒C3⇒A3⇒B5⇒C4⇒C5⇒A4。觀察上面的全序關係你可以發現，從時間軸來看B5是早於A3發生的，但是在全序關係裡面我們根據上面的定義給出的卻是A3早於B5，這是因為Lamport邏輯時鐘只保證因果關係（偏序）的正確性，不保證絕對時序的正確性。

【4】嘗試用邏輯時鐘解決分散式鎖的問題

單機多進程程式可由鎖進行同步，那是因為這些進程都運行在作業系統上，有center為它們的請求排序，這個center知道所有需要進行同步的進程的所有資訊。但是在分散式系統中，各個進程運行在各自的主機上，沒有一個center的概念，那分散式系統中多進程該怎麼進行同步呢？或者說分散式鎖該怎麼實現呢？論文中提出了解決這一問題的演算法要滿足下面三個條件：

(I) A process which has been granted the resource must release it before it can be granted to another process.

(II) Different requests for the resource must be granted in the order in which they are made.

(III) If every process which is granted the resource eventually releases it, then every request is eventually granted.
為了簡化問題，我們做如下假設：

任何兩個進程Pi，Pj它們之間接收到的消息的順序與發送消息的順序一致，並且每個消息一定能夠被接收到。
每個進程都維護一個不被其他進程所知的請求隊列。並且請求隊列初始化為包含一個T0:P0請求，P0用於該共享資源，T0是初始值小於任何時鐘值

演算法如下：

To request the resource, process Pi sends the message Tm:Pi requests resource to every other process, and puts that message on its request queue, where Tm is the timestamp of the message.（請求資源，發送請求給其他進程，在自己的請求隊列中添加該請求）
When process Pj receives the message Tm:Pi requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi.（收到其他進程的請求，放到請求隊列中，回應發起請求的進程）
To release the resource, process Pi removes any Tm:Pi requests resource message from its request queue and sends a (timestamped) Pi releases resource message to every other process.（釋放資源，從請求隊列中移除該資源請求，發送給其他進程，告訴它們我釋放了該資源）
When process Pj receives a Pi releases resource message, it removes any Tm:Pi requests resource message from its request queue.（收到其他進程釋放資源的消息，從請求隊列中移除該資源請求）
Process Pi granted the resource when the following two conditions are satisfied: (i) There is a Tm:Pi requests resource message in its request queue which is ordered before any other request in its queue by the relation ⇒ . (ii) Pi has received a message from every other process timestamped later than Tm.
（判斷自己是否可以獲得該資源，有兩個條件：其一，按全序排序後，Tm:Pi請求在請求隊列的最前面；其二，自己Pi已經收到了所有其他進程的時戳大於Tm的消息）

下面我們舉個例子說明上面的演算法過程：
初始狀態為P0擁有資源，請求隊列中為0:0(T0:P0的簡寫)，而後P1請求資源，將1:1添加到請求隊列中，此時P0讓佔有資源，P1還無法獲取資源，等到P0釋放資源後，0:0從請求隊列中移除（下圖中沒有畫出），此時請求隊列中1:1的請求在最前面，同時P1收到了其他兩個進程的大於1的回應消息，滿足了佔有資源的條件，此時P1佔有資源。

其實關鍵思想很簡單，既然分散式系統中沒有「center」的概念，那我請求共享資源時我就讓其他所有進程都知道我要請求該資源，擁有資源的進程釋放資源時也告訴所有進程，我要釋放該資源，想請求該資源的你們可以按序（邏輯時鐘的作用，這裡再次說明一下，並不能保證在絕對物理時間上請求的排序）請求了。這樣每個進程都知道其他進程的狀態，就相當於有個「center」。

對於分散式鎖問題，多個請求不一定是一定按照絕對物理時鐘排序才可以，只要我們有這樣一個演算法，這個演算法可以保證多個進程的請求按照這個演算法總能得到同一個排序，就可以了，按照絕對物理時鐘排序只是其中一個可行的演算法。

到這裡是否就萬事大吉了呢，其實並沒有，這個實現是很脆弱的，它要求所有進程都非常可靠，一旦一個進程掛了或出現網路分區的情況，是無法工作的，同時我們提出的網路要求也非常嚴格，要求發出的消息一定被接收到，這個在實用的系統中是很難做到的。所以這是一個理想狀況下的演算法實現，並不是一個可以工業級應用的演算法實現。但它仍然是非常有意義的，給了我們關於分散式系統中解決一致性、共識演算法等思想啟迪。

參考文檔：
大神論文：Time, Clocks, and the Ordering of Events in a Distributed System
Lamport timestamps
分散式系統：Lamport 邏輯時鐘

關注微信公眾號，每天進步一點點！