TiKV Source Code Analysis Series (13): MVCC Data Reads

December 5, 2019

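In "TiKV Source Code Analysis Series (12): Distributed Transactions", we described how data is written while satisfying transactional guarantees. This article describes how data is read. Because the forward scan is the most representative read path, this article covers only the forward scan and leaves out point gets and backward scans. A point get is a simplified forward scan, and readers who understand the forward scan should be able to work out how a point get is implemented. A backward scan is similar to a forward scan; the main difference is that it scans from back to front, which makes it slightly more complex. After reading this article, you should be able to follow its implementation in the code as well.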

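Data Format

First, recall what data is stored at the RocksDB level once a transaction has been written:

Where:

To avoid ambiguity, User Key (user_key) refers to the key that a TiKV client (such as TiDB) writes or wants to read, and User Value (user_value) refers to the value corresponding to a User Key.
lock_info contains the lock type, primary key, timestamp, TTL, and other information; see src/storage/mvcc/lock.rs.
write_info contains the write type, start_ts, and other information; see src/storage/mvcc/write.rs.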

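Sample Transactions

To make the code easier to follow, assume that a TiKV client has previously executed the following transactions:

Note that the keys TiDB writes to TiKV (the user_key above) do not actually look like foo, abc, or box; most have the form tXXXXXXXX_rXXXXXXXX or tXXXXXXXX_iXXXXXXXX. The key format does not affect TiKV's processing logic, however, so we use simplified keys in the examples. The same applies to values.

After each transaction has been prewritten and committed, the data that lands in RocksDB looks like this:

Transaction #1:

Transaction #2:

Transaction #3:

Transaction #4: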

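The data actually stored in RocksDB differs slightly from what the tables above show. The main differences are:

1. The TiKV Raft layer modifies the keys actually written to RocksDB (for example, by adding the prefix z) to distinguish between kinds of data. This is transparent to MVCC and transactions, so we ignore it for now.

2. User Keys are encoded with Memory Comparable Encoding, which pads the key in units of 8 bytes. This ensures that after start_ts or commit_ts is appended to the User Key, the keys actually written keep the same order as the User Keys.

For example, suppose we write two User Keys, abc and abc\x00..\x00, in that order, without padding:

The User Keys are ordered abc < abc\x00..\x00, yet the written keys are ordered abc\x00\x00..\x05 > abc\x00\x00..\x00\x00\x00..\x10. Scanning the data in order would clearly be very difficult after that, so the User Key has to be encoded: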

Example 1:

    User Key:      abc
    Encoded:       abc\x00\x00\x00\x00\x00\xFA
                   ^^^                    ^^^^
                   Key                    Pad=5
                      ^^^^^^^^^^^^^^^^^^^^
                      Padding

Example 2:

    User Key:      abc\x00\x00\x00\x00\x00\x00\x00\x00
    Encoded[0..9]: abc\x00\x00\x00\x00\x00\xFF
                   ^^^^^^^^^^^^^^^^^^^^^^^
                   Key[0..8]
                                          ^^^^
                                          Pad=0
    Encoded[9..]:  \x00\x00\x00\x00\x00\x00\x00\x00\xFA
                   ^^^^^^^^^^^^                    ^^^^
                   Key[8..11]                      Pad=5
                               ^^^^^^^^^^^^^^^^^^^^
                               Padding

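Once encoded, a key keeps its original order no matter which 8-byte timestamp is appended to it. The byte that ends each 9-byte group is what makes this work: it is 0xFF minus the number of padding bytes, so a full group that is followed by more key bytes ends in 0xFF, while the final group of a shorter key ends in a smaller value. Two different User Keys therefore already differ somewhere in their encoded form, before the timestamp is reached.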
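The padding rule can be sketched in a few lines of Rust. This is a minimal illustration rather than TiKV's actual encoder: the name encode_bytes and the constants are assumptions made here, and the rule that each group ends in a marker of 0xFF minus the pad count is inferred from the two examples above.

```rust
/// Number of key bytes per encoded group (the article pads in 8-byte units).
const GROUP_SIZE: usize = 8;
/// Marker written after a full group; a group with `pad` padding bytes gets `MARKER - pad`.
const MARKER: u8 = 0xFF;

/// Hypothetical helper: encodes `key` so that byte-wise order is preserved
/// even after a fixed-width suffix (such as a timestamp) is appended.
pub fn encode_bytes(key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity((key.len() / GROUP_SIZE + 1) * (GROUP_SIZE + 1));
    let mut idx = 0;
    loop {
        let take = (key.len() - idx).min(GROUP_SIZE);
        let pad = GROUP_SIZE - take;
        out.extend_from_slice(&key[idx..idx + take]);
        out.extend(std::iter::repeat(0u8).take(pad));
        out.push(MARKER - pad as u8);
        idx += take;
        // A group that needed padding is the last one. A full group is always
        // followed by another group, possibly one made entirely of padding.
        if pad > 0 {
            break;
        }
    }
    out
}
```

With this encoding, abc followed by any timestamp sorts before abc\x00..\x00 followed by any timestamp, because the two encoded keys already differ at the ninth byte (0xFA versus 0xFF).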

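3. The timestamps TiKV stores in keys (both start_ts and commit_ts) are bitwise-inverted. This makes newer data (with larger timestamps) sort before older data (with smaller timestamps). The scan procedure exploits this property for performance, as the rest of this article shows. From here on, timestamps in keys are written as {!ts} to reflect the inversion.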
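The effect of the inversion can be illustrated with a small Rust sketch. The helper names below are hypothetical, and the sketch assumes the inverted timestamp is appended as 8 big-endian bytes, which is what lets byte-wise comparison follow numeric order.

```rust
/// Hypothetical helper: appends `!ts` as 8 big-endian bytes to an encoded key.
pub fn append_ts(mut encoded_key: Vec<u8>, ts: u64) -> Vec<u8> {
    encoded_key.extend_from_slice(&(!ts).to_be_bytes());
    encoded_key
}

/// Hypothetical helper: recovers the original timestamp from the last 8 bytes.
/// Returns `None` when the key is too short to carry a timestamp.
pub fn decode_ts(key: &[u8]) -> Option<u64> {
    let split = key.len().checked_sub(8)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&key[split..]);
    Some(!u64::from_be_bytes(buf))
}
```

For the same User Key, the version with ts = 0x12 ends in ..\xED and the version with ts = 0x05 ends in ..\xFA, so the newer version comes first when keys are compared byte by byte.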

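4. TiKV optimizes small User Values (<= 64 bytes): instead of storing them in the Default CF, it embeds them directly in the Lock Info or Write Info, which speeds up both scans and writes for such User Keys. We ignore this optimization in the example and assume every User Value is long enough not to be embedded.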

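Forward Scan

The forward scan code lives in src/storage/mvcc/reader/scanner/forward.rs. A forward scan is defined as follows: given scan_ts, an optional lower bound lower_bound, and an optional upper bound upper_bound, return in order all data in [lower_bound, upper_bound) that is visible at scan_ts, that is, for each User Key, the latest version with commit_ts <= scan_ts. The scan can be stopped at any time and does not have to return everything in the range.

Using the sample transactions as an example, assume all of them have been committed:

scan_ts = 0x00, forward scan of [-∞, +∞), returns in order: (nothing).
scan_ts = 0x05, forward scan of [-∞, +∞), returns in order: bar => bar_value, foo => foo_value.
scan_ts = 0x12, forward scan of [-∞, +∞), returns in order: bar => bar_value, foo => foo_value.
scan_ts = 0x15, forward scan of [-∞, +∞), returns in order: bar => bar_value, box => box_value, foo => foo_value2.
scan_ts = 0x35, forward scan of [-∞, +∞), returns in order: bar => bar_value, foo => foo_value2.
scan_ts = 0x05, forward scan of [c, +∞), returns in order: foo => foo_value.

Now assume that transaction #1 has been committed and transaction #2 has been prewritten but not committed. Then:

scan_ts = 0x05, forward scan of [-∞, +∞), returns in order: bar => bar_value, foo => foo_value.
scan_ts = 0x12, forward scan of [-∞, +∞), first returns bar => bar_value. If the scan continues, it should return a lock conflict on box. On receiving this error, TiDB waits for the lock, resolves it, and retries.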

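Forward Scan Procedure

Based on the definition and examples above, and ignoring lock conflicts, the simplest implementation is to keep moving the Write CF cursor forward from lower_bound. For each User Key, skip the versions with commit_ts > scan_ts and take the first version with commit_ts <= scan_ts, then use that version's Write Info to fetch the value from the Default CF. This yields the KV pair returned to the upper layer.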
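This lock-free approach can be sketched over in-memory maps. The sketch is an illustration under stated assumptions, not TiKV code: BTreeMap stands in for the Write CF and Default CF, keys are modeled as (user_key, !ts) tuples instead of encoded bytes, only PUT and DELETE are modeled, and all names are hypothetical.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WriteType {
    Put,
    Delete,
}

/// Write CF stand-in: (user_key, !commit_ts) -> (write type, start_ts).
pub type WriteCf = BTreeMap<(String, u64), (WriteType, u64)>;
/// Default CF stand-in: (user_key, !start_ts) -> user value.
pub type DefaultCf = BTreeMap<(String, u64), String>;

/// Walks the Write CF once, front to back, and keeps for every user key
/// the first (newest) version whose commit_ts <= scan_ts.
pub fn naive_scan(write: &WriteCf, default: &DefaultCf, scan_ts: u64) -> Vec<(String, String)> {
    let mut out = Vec::new();
    // User key whose visible version has already been chosen.
    let mut settled: Option<&str> = None;
    for ((key, inv_commit_ts), (write_type, start_ts)) in write {
        if settled == Some(key.as_str()) {
            continue; // an older version of a key we already handled
        }
        if !*inv_commit_ts > scan_ts {
            continue; // committed after scan_ts, not visible
        }
        settled = Some(key.as_str());
        // A visible DELETE hides the key; a visible PUT points into the Default CF.
        if *write_type == WriteType::Put {
            if let Some(value) = default.get(&(key.clone(), !*start_ts)) {
                out.push((key.clone(), value.clone()));
            }
        }
    }
    out
}
```

Because versions of one User Key are ordered newest first, the first version that passes the commit_ts check is the one to return, and every later version of that key can be skipped.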

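This approach is simple but cannot handle lock conflicts. When locks are present, a forward scan should only handle conflicts for data it actually reaches; locks on data that has not been reached must not affect the scanning of conflict-free data (for example, when the user's SQL has a limit). Different User Keys (and different versions of the same User Key) may be spread across both the Write CF and the Lock CF, so TiKV's approach resembles a merge sort: it moves a Write CF cursor and a Lock CF cursor together. While moving, the two cursors may point to different User Keys, and the smaller one is the User Key to process first. If that User Key comes from the Lock CF, there may be a lock conflict, and the scan must either return an error or ignore the lock. If it comes from the Write CF, there are versions available to read: the scan finds the most recent Write Info that satisfies scan_ts, uses the start_ts recorded in it to fetch the value from the Default CF, and returns the resulting KV pair to the upper layer.

Figure 1: The TiKV scan algorithm

A single iteration proceeds as follows: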

Step 1.

On the first iteration, seek the Lock CF cursor and the Write CF cursor to lower_bound. Each then points to the first key >= lower_bound. If no lower_bound is given, both cursors are moved to the first key instead.

    if !self.is_started {
        if self.cfg.lower_bound.is_some() {
            self.write_cursor.seek(
                self.cfg.lower_bound.as_ref().unwrap(),
                ...,
            )?;
            self.lock_cursor.seek(
                self.cfg.lower_bound.as_ref().unwrap(),
                ...,
            )?;
        } else {
            self.write_cursor.seek_to_first(...);
            self.lock_cursor.seek_to_first(...);
        }
        self.is_started = true;
    }

Step 2.

The keys that the Lock Cursor and the Write Cursor point to may belong to different User Keys (a cursor may also point to nothing, meaning that CF has no more data). Comparing the Lock Cursor with the Write Cursor gives the first User Key encountered:

    let w_key = if self.write_cursor.valid()? {
        Some(self.write_cursor.key(...))
    } else {
        None
    };
    let l_key = if self.lock_cursor.valid()? {
        Some(self.lock_cursor.key(...))
    } else {
        None
    };

    match (w_key, l_key) { ... }

Branch 2.1.

The Write Cursor points to nothing and the Lock Cursor points to nothing: both CFs have been fully scanned, so the scan ends.

Figure 2: One way to enter this branch. If the seek target is e, both the Write Cursor and the Lock Cursor point to nothing.

    (current_user_key_slice, has_write, has_lock) = match (w_key, l_key) {
        (None, None) => {
            // Both cursors yield `None`: we know that there is nothing remaining.
            return Ok(None);
        }
        ...
    }

Branch 2.2.

The Write Cursor points to some key w_key and the Lock Cursor points to nothing: there is a Write Info with User Key = w_key, and there is no Lock Info >= the start key. w_key is the first User Key encountered.

    (current_user_key_slice, has_write, has_lock) = match (w_key, l_key) {
        ...
        (Some(k), None) => {
            // Write cursor yields something but lock cursor yields `None`:
            // We need to further step write cursor to our desired version
            (Key::truncate_ts_for(k)?, true, false)
        }
        ...
    }

Branch 2.3.

The Write Cursor points to nothing and the Lock Cursor points to some key l_key: there is a Lock Info with User Key = l_key. l_key is the first User Key encountered.

    (current_user_key_slice, has_write, has_lock) = match (w_key, l_key) {
        ...
        (None, Some(k)) => {
            // Write cursor yields `None` but lock cursor yields something:
            // In RC, it means we got nothing.
            // In SI, we need to check if the lock will cause conflict.
            (k, false, true)
        }
        ...
    }

Branch 2.4.

The Write Cursor points to some key w_key and the Lock Cursor points to some key l_key: there is a Lock Info with User Key = l_key and a Write Info with User Key = w_key. The smaller of l_key and w_key is the first User Key encountered.

Figure 3: One way to enter this branch. If the seek target is a, both the Write Cursor and the Lock Cursor point to some key.

    (current_user_key_slice, has_write, has_lock) = match (w_key, l_key) {
        ...
        (Some(wk), Some(lk)) => {
            let write_user_key = Key::truncate_ts_for(wk)?;
            match write_user_key.cmp(lk) {
                Ordering::Less => {
                    // Write cursor user key < lock cursor, it means the lock of the
                    // current key that write cursor is pointing to does not exist.
                    (write_user_key, true, false)
                }
                Ordering::Greater => {
                    // Write cursor user key > lock cursor, it means we got a lock of a
                    // key that does not have a write. In SI, we need to check if the
                    // lock will cause conflict.
                    (lk, false, true)
                }
                Ordering::Equal => {
                    // Write cursor user key == lock cursor, it means the lock of the
                    // current key that write cursor is pointing to *exists*.
                    (lk, true, true)
                }
            }
        }
    }
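The four branches of step 2 can be collected into one standalone function. The sketch below mirrors the match shown in the excerpts but is not the TiKV code itself: the function name and the Picked alias are hypothetical, and the Write Cursor key is assumed to have had its timestamp suffix removed already.

```rust
use std::cmp::Ordering;

/// The user key to process, plus whether it was seen in the Write CF
/// and whether it was seen in the Lock CF.
pub type Picked<'a> = (&'a [u8], bool, bool);

/// `w_user_key` is the Write Cursor key without its timestamp suffix;
/// `l_key` is the Lock Cursor key. `None` means that cursor is exhausted.
pub fn pick_user_key<'a>(
    w_user_key: Option<&'a [u8]>,
    l_key: Option<&'a [u8]>,
) -> Option<Picked<'a>> {
    match (w_user_key, l_key) {
        // Branch 2.1: nothing left in either CF.
        (None, None) => None,
        // Branch 2.2: only writes remain.
        (Some(wk), None) => Some((wk, true, false)),
        // Branch 2.3: only locks remain.
        (None, Some(lk)) => Some((lk, false, true)),
        // Branch 2.4: the smaller user key goes first; on a tie both apply.
        (Some(wk), Some(lk)) => Some(match wk.cmp(lk) {
            Ordering::Less => (wk, true, false),
            Ordering::Greater => (lk, false, true),
            Ordering::Equal => (lk, true, true),
        }),
    }
}
```

A return value of None corresponds to the end of the scan; otherwise the two flags decide whether step 3, step 4, or both are executed for the chosen key.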

Step 3.

If the first User Key encountered in step 2 comes from a lock:

Step 3.1.

Check whether the Lock Info is effective for this read. For example, a lock with start_ts > scan_ts must be ignored.

    let lock = {
        let lock_value = self.lock_cursor.value(...);
        Lock::parse(lock_value)?
    };
    match super::util::check_lock(&current_user_key, self.cfg.ts, &lock)? {
        CheckLockResult::NotLocked => {}
        CheckLockResult::Locked(e) => result = Err(e),
        CheckLockResult::Ignored(ts) => get_ts = ts,
    }

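We usually construct scan_ts from the current time, so why would we see what looks like a lock from the "future"? The read request may come from a transaction that started earlier, the request may have been delayed on the network for a while, or we may be reading historical data.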
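The timestamp rule from step 3.1 can be written as a tiny function. This is a deliberately reduced sketch: the names are hypothetical, and only the start_ts > scan_ts rule stated above is modeled. The real check handles more cases, such as the Ignored result visible in the excerpt, which are left out here.

```rust
/// Simplified outcome of checking a lock against a read at `scan_ts`.
#[derive(Debug, PartialEq)]
pub enum LockCheck {
    /// The lock does not affect this read.
    NotLocked,
    /// The lock blocks this read; the caller should report a conflict.
    Locked { lock_ts: u64 },
}

/// Hypothetical, reduced version of the check in step 3.1.
/// `lock_ts` is the start_ts of the transaction that holds the lock.
pub fn check_lock_ts(lock_ts: u64, scan_ts: u64) -> LockCheck {
    if lock_ts > scan_ts {
        // The lock belongs to a transaction that started after our snapshot,
        // so whatever it writes cannot be visible at scan_ts.
        LockCheck::NotLocked
    } else {
        LockCheck::Locked { lock_ts }
    }
}
```

With the sample data, a lock with timestamp 0x11 is ignored by a scan at 0x05 and reported as a conflict by a scan at 0x12.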

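Step 3.2.

Move the Lock Cursor forward by one key so that the next iteration can continue directly from the next lock. The Lock Cursor now points to the next lock (or to nothing).

Step 3.3.

If the check in step 3.1 found the lock effective, return the lock as an error. TiDB then needs to resolve the lock.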

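Step 4.

If the first User Key encountered in step 2 comes from a write:

Note: the Lock Cursor and the Write Cursor may both point to the same User Key, at different versions. Since we only want to ignore the version that corresponds to the lock, not the whole User Key, both step 3 and step 4 are executed in this case, as shown in the figure below.

Figure 4: A case where the Write Cursor and the Lock Cursor have the same User Key; the seek target is c.

Reaching this point means we need to read, from the Write Info, the record of the User Key that satisfies scan_ts. Note that the User Key may have a lock, but that lock has already been judged ignorable.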

Step 4.1.

Seek the Write Cursor to {w_key}{!scan_ts}. (Note: see difference 3 under "Sample Transactions"; timestamps are stored inverted, so here and in the rest of this article ! marks the inversion.) If there are few versions, which is true in the vast majority of cases, the key to seek is probably very close to the current position. To avoid the relatively high cost of a seek in this case, TiKV first calls next a few times and only then falls back to seek:

    // Try to iterate to `${user_key}_${ts}`. We first `next()` for a few times,
    // and if we have not reached where we want, we use `seek()`.

    // Whether we have *not* reached where we want by `next()`.
    let mut needs_seek = true;

    for i in 0..SEEK_BOUND {
        if i > 0 {
            self.write_cursor.next(...);
            if !self.write_cursor.valid()? {
                // Key space ended.
                return Ok(None);
            }
        }
        {
            let current_key = self.write_cursor.key(...);
            if !Key::is_user_key_eq(current_key, user_key.as_encoded().as_slice()) {
                // Meet another key.
                *met_next_user_key = true;
                return Ok(None);
            }
            if Key::decode_ts_from(current_key)? <= ts {
                // Found, don't need to seek again.
                needs_seek = false;
                break;
            }
        }
    }
    // If we have not found `${user_key}_${ts}` in a few `next()`, directly `seek()`.
    if needs_seek {
        // `user_key` must have reserved space here, so its clone has reserved space too. So no
        // reallocation happens in `append_ts`.
        self.write_cursor
            .seek(&user_key.clone().append_ts(ts), ...)?;
        if !self.write_cursor.valid()? {
            // Key space ended.
            return Ok(None);
        }
        let current_key = self.write_cursor.key(...);
        if !Key::is_user_key_eq(current_key, user_key.as_encoded().as_slice()) {
            // Meet another key.
            *met_next_user_key = true;
            return Ok(None);
        }
    }
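The "next a few times, then seek" strategy can be demonstrated on a toy cursor. Everything below is a hypothetical model: the Cursor type walks a sorted slice, seek is a binary search, and the two counters exist only so the effect of the bound can be observed. It assumes, as in a forward scan, that the target is never behind the current position.

```rust
/// Toy cursor over a sorted slice of keys; counts how often each primitive is used.
pub struct Cursor<'a> {
    keys: &'a [Vec<u8>],
    pos: usize,
    pub next_calls: usize,
    pub seek_calls: usize,
}

impl<'a> Cursor<'a> {
    pub fn new(keys: &'a [Vec<u8>]) -> Self {
        Cursor { keys, pos: 0, next_calls: 0, seek_calls: 0 }
    }

    /// Key under the cursor, or `None` once the cursor has run off the end.
    pub fn key(&self) -> Option<&'a [u8]> {
        self.keys.get(self.pos).map(|k| k.as_slice())
    }

    pub fn next(&mut self) {
        self.pos += 1;
        self.next_calls += 1;
    }

    /// Positions the cursor on the first key >= `target` (binary search).
    pub fn seek(&mut self, target: &[u8]) {
        self.pos = self.keys.partition_point(|k| k.as_slice() < target);
        self.seek_calls += 1;
    }

    /// Checks the current key and up to `bound - 1` following keys via `next()`
    /// before falling back to `seek()`.
    pub fn near_seek(&mut self, target: &[u8], bound: usize) {
        for i in 0..bound {
            if i > 0 {
                self.next();
            }
            match self.key() {
                None => return,                   // ran off the end
                Some(k) if k >= target => return, // reached the target cheaply
                Some(_) => {}
            }
        }
        self.seek(target);
    }
}
```

When the target is only a couple of entries away, near_seek finishes without any seek; when it is further away than the bound allows, exactly one seek is issued after the cheap attempts.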

Step 4.2.

w_key may have no record with commit_ts <= scan_ts, so seeking {w_key}{!scan_ts} may jump past the current User Key and land on the next w_key. We therefore first check whether the User Key under the Write Cursor is still w_key. If it is, this is the newest version (Write Info) that satisfies scan_ts, and we can determine the data from it directly. If the version's type is DELETE, w_key (that is, the User Key) was deleted as of this version, so we treat it as nonexistent. If the type is PUT, we fetch the User Value from the Default CF using the start_ts stored in the version: Get {w_key}{!start_ts}. As the code below shows, a version of type LOCK or ROLLBACK carries no data, so the scan moves on to the next older version.

On the other hand, if this seek lands on the next w_key, we cannot trust the new w_key. We do nothing and return to step 2, because the new w_key may be greater than l_key, and l_key has to be re-examined first.

    // Now we must have reached the first key >= `${user_key}_${ts}`. However, we may
    // meet `Lock` or `Rollback`. In this case, more versions need to be looked up.
    loop {
        let write = Write::parse(self.write_cursor.value(...))?;
        self.statistics.write.processed += 1;

        match write.write_type {
            WriteType::Put => return Ok(Some(self.load_data_by_write(write, user_key)?)),
            WriteType::Delete => return Ok(None),
            WriteType::Lock | WriteType::Rollback => {
                // Continue iterate next `write`.
            }
        }

        ...
    }
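The version selection in step 4.2 can be sketched as a pure function over the versions of a single User Key. The types below are reduced, hypothetical stand-ins for Write Info; the short_value field reflects the small-value optimization from difference 4, and its exact representation here is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WriteType {
    Put,
    Delete,
    Lock,
    Rollback,
}

/// Reduced Write Info: only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct Write {
    pub write_type: WriteType,
    pub start_ts: u64,
    /// Small values may be embedded here instead of living in the Default CF.
    pub short_value: Option<Vec<u8>>,
}

/// Where the visible value of a key has to be read from.
#[derive(Debug, PartialEq)]
pub enum Visible {
    Inline(Vec<u8>),
    /// Look up `{user_key}{!start_ts}` in the Default CF.
    DefaultCf { start_ts: u64 },
}

/// `versions` holds (commit_ts, write) for ONE user key, newest first,
/// which is the order in which the Write CF yields them.
pub fn resolve_version(versions: &[(u64, Write)], scan_ts: u64) -> Option<Visible> {
    for (commit_ts, write) in versions {
        if *commit_ts > scan_ts {
            continue; // too new for this snapshot
        }
        match write.write_type {
            WriteType::Put => {
                return Some(match &write.short_value {
                    Some(v) => Visible::Inline(v.clone()),
                    None => Visible::DefaultCf { start_ts: write.start_ts },
                })
            }
            WriteType::Delete => return None,
            // These records carry no data; keep looking at older versions.
            WriteType::Lock | WriteType::Rollback => continue,
        }
    }
    None
}
```

Returning None covers both "deleted at this snapshot" and "no visible version at all"; in either case the scan produces nothing for this key and moves on.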

Step 4.3.

We now know the value of w_key (the User Key) that satisfies scan_ts. To allow later iterations to reach the next w_key, we move the Write Cursor past all remaining versions of the current w_key by seeking {w_key}{\xFF..\xFF}. The Write Cursor then points to the first key >= {w_key}{\xFF..\xFF}, which is the next w_key. As in step 4.1, the code first tries next up to SEEK_BOUND times before falling back to a seek; append_ts(0) produces the \xFF..\xFF suffix because the timestamp is stored inverted.

    fn move_write_cursor_to_next_user_key(&mut self, current_user_key: &Key) -> Result<()> {
        for i in 0..SEEK_BOUND {
            if i > 0 {
                self.write_cursor.next(...);
            }
            if !self.write_cursor.valid()? {
                // Key space ended. We are done here.
                return Ok(());
            }
            {
                let current_key = self.write_cursor.key(...);
                if !Key::is_user_key_eq(current_key, current_user_key.as_encoded().as_slice()) {
                    // Found another user key. We are done here.
                    return Ok(());
                }
            }
        }
        // We have not found another user key for now, so we directly `seek()`.
        // After that, we must be pointing to another key, or out of bound.
        // `current_user_key` must have reserved space here, so its clone has reserved space too.
        // So no reallocation happens in `append_ts`.
        self.write_cursor.internal_seek(
            &current_user_key.clone().append_ts(0),
            ...,
        )?;
        Ok(())
    }

Step 4.4.

Return (User Key, User Value) using the User Value obtained earlier.

Step 5.

If no value was found, return to step 2.

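Walkthrough

The steps above may be rather dry, so let us walk through the procedure using the sample transactions. Assume that transaction #1 has been committed and transaction #2 has been prewritten but not yet committed. The data stored in RocksDB for the sample transactions then looks like this:

Figure 5: Data stored in RocksDB for the sample transactions

Now try a forward scan of [-∞, +∞) with scan_ts = 0x05.

Execute step 1: this is the first iteration, so seek the Lock CF cursor and the Write CF cursor to lower_bound.

Figure 6: Cursor positions after this step

Execute step 2: compare the Lock Cursor with the Write Cursor and enter branch 2.4.
Execute branch 2.4: the Write Cursor points to bar and the Lock Cursor points to box, so the User Key is bar.
Execute step 3: User Key = bar does not come from a lock, so skip.
Execute step 4: User Key = bar comes from a write, so continue.
Execute step 4.1: seek {w_key}{!scan_ts}, that is, seek bar......\xFF\xFF..\xFA. The Write Cursor stays at its current position.

Figure 7: Cursor positions after this step

Execute step 4.2: the Write Cursor points to bar, which equals the User Key, so based on PUT (start_ts=1) we get value = bar_value from the Default CF.
Execute step 4.3: move the Write Cursor past all remaining versions of bar, that is, seek bar......\xFF\xFF..\xFF:

Figure 8: Cursor positions after this step

Execute step 4.4: return the KV pair (bar, bar_value).
If the caller needs only one KV pair (for example, limit = 1), the scan can stop here. If the caller wants more KV pairs, execution starts again from step 1.
Execute step 1: not the first iteration, so skip.
Execute step 2: compare the Lock Cursor with the Write Cursor and enter branch 2.4.
Execute branch 2.4: the Write Cursor points to foo and the Lock Cursor points to box, so the User Key is box.

Figure 9: Cursor positions after this step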

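Execute step 3: User Key = box comes from a lock, so continue.
Execute step 3.1: check the Lock Info. The lock's ts is 0x11 and scan_ts is 0x05, so the lock is ignored and no lock conflict error is returned.
Execute step 3.2: move the Lock Cursor forward by one key.

Figure 10: Cursor positions after this step

Execute step 4: User Key = box does not come from a write, so skip and return to step 2.
Execute step 2: compare the Lock Cursor with the Write Cursor and enter branch 2.4.
Execute branch 2.4: the Write Cursor points to foo and the Lock Cursor points to foo, so the User Key is foo.

Figure 11: Cursor positions after this step

Execute step 3: User Key = foo comes from a lock, so continue. As before, the lock is ignored and the Lock Cursor moves forward.

Figure 12: Cursor positions after this step

Execute step 4: User Key = foo also comes from a write, so continue.
Execute step 4.1: seek {w_key}{!scan_ts}, that is, seek foo......\xFF\xFF..\xFA. The Write Cursor stays at its current position.
Execute step 4.2: the Write Cursor points to foo, which equals the User Key, so based on PUT (start_ts=1) we get value = foo_value from the Default CF.
Execute step 4.3: move the Write Cursor past all remaining versions of foo, that is, seek foo......\xFF\xFF..\xFF:

Figure 13: Cursor positions after this step

Execute step 4.4: return the KV pair (foo, foo_value).
If the caller chooses to keep scanning, go back to step 1.
Execute step 1: not the first iteration, so skip.
Execute step 2: compare the Lock Cursor with the Write Cursor and enter branch 2.1.

Figure 14: Cursor positions after this step

Execute branch 2.1: both the Write Cursor and the Lock Cursor point to nothing, so there is no more data.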
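The whole procedure can be replayed with a compact in-memory model. This is a sketch under several assumptions, not the TiKV scanner: BTreeMap stands in for the three column families, keys are (user_key, !ts) tuples instead of encoded bytes, a single `from` position replaces the two cursors, the lock check is reduced to the timestamp rule, and all type and function names are hypothetical.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WriteType { Put, Delete, Lock, Rollback }

#[derive(Debug, PartialEq)]
pub struct KeyIsLocked {
    pub key: String,
    pub lock_ts: u64,
}

/// In-memory stand-ins for the three column families.
#[derive(Default)]
pub struct Store {
    /// Lock CF: user_key -> start_ts of the transaction holding the lock.
    pub lock: BTreeMap<String, u64>,
    /// Write CF: (user_key, !commit_ts) -> (write type, start_ts).
    pub write: BTreeMap<(String, u64), (WriteType, u64)>,
    /// Default CF: (user_key, !start_ts) -> user value.
    pub default: BTreeMap<(String, u64), String>,
}

pub struct ForwardScanner<'a> {
    store: &'a Store,
    scan_ts: u64,
    /// Smallest user key not yet processed; stands in for both cursor positions.
    from: String,
}

impl<'a> ForwardScanner<'a> {
    pub fn new(store: &'a Store, scan_ts: u64, lower_bound: &str) -> Self {
        ForwardScanner { store, scan_ts, from: lower_bound.to_string() }
    }

    /// One iteration: the next visible pair, `None` at the end, or a lock conflict.
    pub fn read_next(&mut self) -> Result<Option<(String, String)>, KeyIsLocked> {
        loop {
            // Step 2: see what each "cursor" points to and pick the smaller user key.
            let from = self.from.clone();
            let l_key = self.store.lock.range(from.clone()..).next().map(|(k, _)| k.clone());
            let w_key = self.store.write.range((from, 0u64)..).next().map(|((k, _), _)| k.clone());
            let (key, has_write, has_lock) = match (w_key, l_key) {
                (None, None) => return Ok(None),
                (Some(w), None) => (w, true, false),
                (None, Some(l)) => (l, false, true),
                (Some(w), Some(l)) => match w.cmp(&l) {
                    Ordering::Less => (w, true, false),
                    Ordering::Greater => (l, false, true),
                    Ordering::Equal => (l, true, true),
                },
            };
            // Steps 3.2 and 4.3: later iterations start after this user key.
            self.from = format!("{}\0", key);
            // Steps 3.1 and 3.3: a lock taken at or before scan_ts is a conflict.
            if has_lock {
                let lock_ts = self.store.lock[&key];
                if lock_ts <= self.scan_ts {
                    return Err(KeyIsLocked { key, lock_ts });
                }
            }
            // Step 4: read the newest version with commit_ts <= scan_ts, if any.
            if has_write {
                if let Some(value) = self.read_version(&key) {
                    return Ok(Some((key, value)));
                }
            }
            // Step 5: nothing produced for this key, so try the next one.
        }
    }

    fn read_version(&self, key: &str) -> Option<String> {
        // Step 4.1: start at {key}{!scan_ts}; versions are ordered newest first.
        let start = (key.to_string(), !self.scan_ts);
        for ((k, _), (write_type, start_ts)) in self.store.write.range(start..) {
            if k.as_str() != key {
                return None; // Step 4.2: ran into the next user key.
            }
            match write_type {
                WriteType::Put => return self.store.default.get(&(key.to_string(), !*start_ts)).cloned(),
                WriteType::Delete => return None,
                WriteType::Lock | WriteType::Rollback => continue,
            }
        }
        None
    }
}
```

Loading the model with the state from the walkthrough (bar and foo committed by a transaction with start_ts = 1, and locks with ts = 0x11 on box and foo) and calling read_next repeatedly with scan_ts = 0x05 yields (bar, bar_value), then (foo, foo_value), then the end of the scan. With scan_ts = 0x12 the second call returns a lock conflict on box instead. The commit_ts of the first transaction is not stated in this section, so any value in the range (0x00, 0x05] has to be assumed when building the data.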

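Summary

That concludes our analysis of the MVCC forward scan code. Point gets and backward scans follow a similar procedure, and the code is thoroughly commented, so you can read through them on your own. The next article will cover the implementation of pessimistic transactions in detail.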


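TiKV is an open-source distributed transactional key-value database. It supports cross-row ACID transactions and provides automatic horizontal scaling, strong data consistency, high availability across data centers, and cloud-native operation. As a foundational component, TiKV can serve as a building block for other systems. It currently acts as the storage layer of TiDB, a distributed HTAP database, and is used in production by leading companies in multiple industries. In May 2019, the CNCF TOC (Technical Oversight Committee) voted to accept TiKV as an incubating project.

· Source code: https://github.com/tikv/tikv

· More information: https://tikv.org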

This article is reposted from PingCAP.