探索Redis設計與實現2：Redis內部數據結構詳解——dict

2019 年 12 月 2 日
筆記

本文轉自http://zhangtielei.com/posts/blog-redis-dict.html

本系列文章將整理到我在GitHub上的《Java面試指南》倉庫，更多精彩內容請到我的倉庫里查看

https://github.com/h2pl/Java-Tutorial

喜歡的話麻煩點下Star哈

本系列文章將整理到我的個人博客

www.how2playlife.com

本文是微信公眾號【Java技術江湖】的《探索Redis設計與實現》其中一篇，本文部分內容來源於網絡，為了把本文主題講得清晰透徹，也整合了很多我認為不錯的技術博客內容，引用其中了一些比較好的博客文章，如有侵權，請聯繫作者。

該系列博文會告訴你如何從入門到進階，Redis基本的使用方法，Redis的基本數據結構，以及一些進階的使用方法，同時也需要進一步了解Redis的底層數據結構，再接着，還會帶來Redis主從複製、集群、分佈式鎖等方面的相關內容，以及作為緩存的一些使用方法和注意事項，以便讓你更完整地了解整個Redis相關的技術體系，形成自己的知識框架。

如果對本系列文章有什麼建議，或者是有什麼疑問的話，也可以關注公眾號【Java技術江湖】聯繫作者，歡迎你參與本系列博文的創作和修訂。

如果你使用過Redis，一定會像我一樣對它的內部實現產生興趣。《Redis內部數據結構詳解》是我準備寫的一個系列，也是我個人對於之前研究Redis的一個階段性總結，着重講解Redis在內存中的數據結構實現（暫不涉及持久化的話題）。Redis本質上是一個數據結構服務器（data structures server），以高效的方式實現了多種現成的數據結構，研究它的數據結構和基於其上的算法，對於我們自己提升局部算法的編程水平有很重要的參考意義。

當我們在本文中提到Redis的「數據結構」，可能是在兩個不同的層面來討論它。

第一個層面，是從使用者的角度。比如：

string
list
hash
set
sorted set

這一層面也是Redis暴露給外部的調用接口。

第二個層面，是從內部實現的角度，屬於更底層的實現。比如：

dict
sds
ziplist
quicklist
skiplist

第一個層面的「數據結構」，Redis的官方文檔(http://redis.io/topics/data-types-intro)有詳細的介紹。本文的重點在於討論第二個層面，Redis數據結構的內部實現，以及這兩個層面的數據結構之間的關係：Redis如何通過組合第二個層面的各種基礎數據結構來實現第一個層面的更高層的數據結構。

在討論任何一個系統的內部實現的時候，我們都要先明確它的設計原則，這樣我們才能更深刻地理解它為什麼會進行如此設計的真正意圖。在本文接下來的討論中，我們主要關注以下幾點：

存儲效率（memory efficiency）。Redis是專用於存儲數據的，它對於計算機資源的主要消耗就在於內存，因此節省內存是它非常非常重要的一個方面。這意味着Redis一定是非常精細地考慮了壓縮數據、減少內存碎片等問題。
快速響應時間（fast response time）。與快速響應時間相對的，是高吞吐量（high throughput）。Redis是用於提供在線訪問的，對於單個請求的響應時間要求很高，因此，快速響應時間是比高吞吐量更重要的目標。有時候，這兩個目標是矛盾的。
單線程（single-threaded）。Redis的性能瓶頸不在於CPU資源，而在於內存訪問和網絡IO。而採用單線程的設計帶來的好處是，極大簡化了數據結構和算法的實現。相反，Redis通過異步IO和pipelining等機制來實現高速的並發訪問。顯然，單線程的設計，對於單個請求的快速響應時間也提出了更高的要求。

本文是《Redis內部數據結構詳解》系列的第一篇，講述Redis一個重要的基礎數據結構：dict。

dict是一個用於維護key和value映射關係的數據結構，與很多語言中的Map或dictionary類似。Redis的一個database中所有key到value的映射，就是使用一個dict來維護的。不過，這只是它在Redis中的一個用途而已，它在Redis中被使用的地方還有很多。比如，一個Redis hash結構，當它的field較多時，便會採用dict來存儲。再比如，Redis配合使用dict和skiplist來共同維護一個sorted set。這些細節我們後面再討論，在本文中，我們集中精力討論dict本身的實現。

dict本質上是為了解決算法中的查找問題（Searching），一般查找問題的解法分為兩個大類：一個是基於各種平衡樹，一個是基於哈希表。我們平常使用的各種Map或dictionary，大都是基於哈希表實現的。在不要求數據有序存儲，且能保持較低的哈希值衝突概率的前提下，基於哈希表的查找性能能做到非常高效，接近O(1)，而且實現簡單。

在Redis中，dict也是一個基於哈希表的算法。和傳統的哈希算法類似，它採用某個哈希函數從key計算得到在哈希表中的位置，採用拉鏈法解決衝突，並在裝載因子（load factor）超過預定值時自動擴展內存，引發重哈希（rehashing）。Redis的dict實現最顯著的一個特點，就在於它的重哈希。它採用了一種稱為增量式重哈希（incremental rehashing）的方法，在需要擴展內存時避免一次性對所有key進行重哈希，而是將重哈希操作分散到對於dict的各個增刪改查的操作中去。這種方法能做到每次只對一小部分key進行重哈希，而每次重哈希之間不影響dict的操作。dict之所以這樣設計，是為了避免重哈希期間單個請求的響應時間劇烈增加，這與前面提到的「快速響應時間」的設計原則是相符的。

下面進行詳細介紹。

dict的數據結構定義

為了實現增量式重哈希（incremental rehashing），dict的數據結構里包含兩個哈希表。在重哈希期間，數據從第一個哈希表向第二個哈希表遷移。

dict的C代碼定義如下（出自Redis源碼dict.h）：

typedef struct dictEntry {      void *key;      union {          void *val;          uint64_t u64;          int64_t s64;          double d;      } v;      struct dictEntry *next;  } dictEntry;    typedef struct dictType {      unsigned int (*hashFunction)(const void *key);      void *(*keyDup)(void *privdata, const void *key);      void *(*valDup)(void *privdata, const void *obj);      int (*keyCompare)(void *privdata, const void *key1, const void *key2);      void (*keyDestructor)(void *privdata, void *key);      void (*valDestructor)(void *privdata, void *obj);  } dictType;    /* This is our hash table structure. Every dictionary has two of this as we   * implement incremental rehashing, for the old to the new table. */  typedef struct dictht {      dictEntry **table;      unsigned long size;      unsigned long sizemask;      unsigned long used;  } dictht;    typedef struct dict {      dictType *type;      void *privdata;      dictht ht[2];      long rehashidx; /* rehashing not in progress if rehashidx == -1 */      int iterators; /* number of iterators currently running */  } dict;

結合上面的代碼和結構圖，可以很清楚地看出dict的結構。一個dict由如下若干項組成：

一個指向dictType結構的指針（type）。它通過自定義的方式使得dict的key和value能夠存儲任何類型的數據。
一個私有數據指針（privdata）。由調用者在創建dict的時候傳進來。
兩個哈希表（ht[2]）。只有在重哈希的過程中，ht[0]和ht[1]才都有效。而在平常情況下，只有ht[0]有效，ht[1]裏面沒有任何數據。上圖表示的就是重哈希進行到中間某一步時的情況。
當前重哈希索引（rehashidx）。如果rehashidx = -1，表示當前沒有在重哈希過程中；否則，表示當前正在進行重哈希，且它的值記錄了當前重哈希進行到哪一步了。
當前正在進行遍歷的iterator的個數。這不是我們現在討論的重點，暫時忽略。

dictType結構包含若干函數指針，用於dict的調用者對涉及key和value的各種操作進行自定義。這些操作包含：

hashFunction，對key進行哈希值計算的哈希算法。
keyDup和valDup，分別定義key和value的拷貝函數，用於在需要的時候對key和value進行深拷貝，而不僅僅是傳遞對象指針。
keyCompare，定義兩個key的比較操作，在根據key進行查找時會用到。
keyDestructor和valDestructor，分別定義對key和value的析構函數。

私有數據指針（privdata）就是在dictType的某些操作被調用時會傳回給調用者。

需要詳細察看的是dictht結構。它定義一個哈希表的結構，由如下若干項組成：

一個dictEntry指針數組（table）。key的哈希值最終映射到這個數組的某個位置上（對應一個bucket）。如果多個key映射到同一個位置，就發生了衝突，那麼就拉出一個dictEntry鏈表。
size：標識dictEntry指針數組的長度。它總是2的指數。
sizemask：用於將哈希值映射到table的位置索引。它的值等於(size-1)，比如7, 15, 31, 63，等等，也就是用二進制表示的各個bit全1的數字。每個key先經過hashFunction計算得到一個哈希值，然後計算(哈希值 & sizemask)得到在table上的位置。相當於計算取余(哈希值 % size)。
used：記錄dict中現有的數據個數。它與size的比值就是裝載因子（load factor）。這個比值越大，哈希值衝突概率越高。

dictEntry結構中包含k, v和指向鏈表下一項的next指針。k是void指針，這意味着它可以指向任何類型。v是個union，當它的值是uint64t、int64t或double類型時，就不再需要額外的存儲，這有利於減少內存碎片。當然，v也可以是void指針，以便能存儲任何類型的數據。

dict的創建（dictCreate）

dict *dictCreate(dictType *type,          void *privDataPtr)  {      dict *d = zmalloc(sizeof(*d));        _dictInit(d,type,privDataPtr);      return d;  }    int _dictInit(dict *d, dictType *type,          void *privDataPtr)  {      _dictReset(&d->ht[0]);      _dictReset(&d->ht[1]);      d->type = type;      d->privdata = privDataPtr;      d->rehashidx = -1;      d->iterators = 0;      return DICT_OK;  }    static void _dictReset(dictht *ht)  {      ht->table = NULL;      ht->size = 0;      ht->sizemask = 0;      ht->used = 0;  }

dictCreate為dict的數據結構分配空間並為各個變量賦初值。其中兩個哈希表ht[0]和ht[1]起始都沒有分配空間，table指針都賦為NULL。這意味着要等第一個數據插入時才會真正分配空間。

dict的查找（dictFind）

#define dictIsRehashing(d) ((d)->rehashidx != -1)    dictEntry *dictFind(dict *d, const void *key)  {      dictEntry *he;      unsigned int h, idx, table;        if (d->ht[0].used + d->ht[1].used == 0) return NULL; /* dict is empty */      if (dictIsRehashing(d)) _dictRehashStep(d);      h = dictHashKey(d, key);      for (table = 0; table <= 1; table++) {          idx = h & d->ht[table].sizemask;          he = d->ht[table].table[idx];          while(he) {              if (key==he->key || dictCompareKeys(d, key, he->key))                  return he;              he = he->next;          }          if (!dictIsRehashing(d)) return NULL;      }      return NULL;  }

上述dictFind的源碼，根據dict當前是否正在重哈希，依次做了這麼幾件事：

如果當前正在進行重哈希，那麼將重哈希過程向前推進一步（即調用_dictRehashStep）。實際上，除了查找，插入和刪除也都會觸發這一動作。這就將重哈希過程分散到各個查找、插入和刪除操作中去了，而不是集中在某一個操作中一次性做完。
計算key的哈希值（調用dictHashKey，裏面的實現會調用前面提到的hashFunction）。
先在第一個哈希表ht[0]上進行查找。在table數組上定位到哈希值對應的位置（如前所述，通過哈希值與sizemask進行按位與），然後在對應的dictEntry鏈表上進行查找。查找的時候需要對key進行比較，這時候調用dictCompareKeys，它裏面的實現會調用到前面提到的keyCompare。如果找到就返回該項。否則，進行下一步。
判斷當前是否在重哈希，如果沒有，那麼在ht[0]上的查找結果就是最終結果（沒找到，返回NULL）。否則，在ht[1]上進行查找（過程與上一步相同）。

下面我們有必要看一下增量式重哈希的_dictRehashStep的實現。

static void _dictRehashStep(dict *d) {      if (d->iterators == 0) dictRehash(d,1);  }    int dictRehash(dict *d, int n) {      int empty_visits = n*10; /* Max number of empty buckets to visit. */      if (!dictIsRehashing(d)) return 0;        while(n-- && d->ht[0].used != 0) {          dictEntry *de, *nextde;            /* Note that rehashidx can't overflow as we are sure there are more           * elements because ht[0].used != 0 */          assert(d->ht[0].size > (unsigned long)d->rehashidx);          while(d->ht[0].table[d->rehashidx] == NULL) {              d->rehashidx++;              if (--empty_visits == 0) return 1;          }          de = d->ht[0].table[d->rehashidx];          /* Move all the keys in this bucket from the old to the new hash HT */          while(de) {              unsigned int h;                nextde = de->next;              /* Get the index in the new hash table */              h = dictHashKey(d, de->key) & d->ht[1].sizemask;              de->next = d->ht[1].table[h];              d->ht[1].table[h] = de;              d->ht[0].used--;              d->ht[1].used++;              de = nextde;          }          d->ht[0].table[d->rehashidx] = NULL;          d->rehashidx++;      }        /* Check if we already rehashed the whole table... */      if (d->ht[0].used == 0) {          zfree(d->ht[0].table);          d->ht[0] = d->ht[1];          _dictReset(&d->ht[1]);          d->rehashidx = -1;          return 0;      }        /* More to rehash... */      return 1;  }

dictRehash每次將重哈希至少向前推進n步（除非不到n步整個重哈希就結束了），每一步都將ht[0]上某一個bucket（即一個dictEntry鏈表）上的每一個dictEntry移動到ht[1]上，它在ht[1]上的新位置根據ht[1]的sizemask進行重新計算。rehashidx記錄了當前尚未遷移（有待遷移）的ht[0]的bucket位置。

如果dictRehash被調用的時候，rehashidx指向的bucket里一個dictEntry也沒有，那麼它就沒有可遷移的數據。這時它嘗試在ht[0].table數組中不斷向後遍歷，直到找到下一個存有數據的bucket位置。如果一直找不到，則最多走n*10步，本次重哈希暫告結束。

最後，如果ht[0]上的數據都遷移到ht[1]上了（即d->ht[0].used == 0），那麼整個重哈希結束，ht[0]變成ht[1]的內容，而ht[1]重置為空。

根據以上對於重哈希過程的分析，我們容易看出，本文前面的dict結構圖中所展示的正是rehashidx=2時的情況，前面兩個bucket（ht[0].table[0]和ht[0].table[1]）都已經遷移到ht[1]上去了。

dict的插入（dictAdd和dictReplace）

dictAdd插入新的一對key和value，如果key已經存在，則插入失敗。

dictReplace也是插入一對key和value，不過在key存在的時候，它會更新value。

int dictAdd(dict *d, void *key, void *val)  {      dictEntry *entry = dictAddRaw(d,key);        if (!entry) return DICT_ERR;      dictSetVal(d, entry, val);      return DICT_OK;  }    dictEntry *dictAddRaw(dict *d, void *key)  {      int index;      dictEntry *entry;      dictht *ht;        if (dictIsRehashing(d)) _dictRehashStep(d);        /* Get the index of the new element, or -1 if       * the element already exists. */      if ((index = _dictKeyIndex(d, key)) == -1)          return NULL;        /* Allocate the memory and store the new entry.       * Insert the element in top, with the assumption that in a database       * system it is more likely that recently added entries are accessed       * more frequently. */      ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];      entry = zmalloc(sizeof(*entry));      entry->next = ht->table[index];      ht->table[index] = entry;      ht->used++;        /* Set the hash entry fields. */      dictSetKey(d, entry, key);      return entry;  }    static int _dictKeyIndex(dict *d, const void *key)  {      unsigned int h, idx, table;      dictEntry *he;        /* Expand the hash table if needed */      if (_dictExpandIfNeeded(d) == DICT_ERR)          return -1;      /* Compute the key hash value */      h = dictHashKey(d, key);      for (table = 0; table <= 1; table++) {          idx = h & d->ht[table].sizemask;          /* Search if this slot does not already contain the given key */          he = d->ht[table].table[idx];          while(he) {              if (key==he->key || dictCompareKeys(d, key, he->key))                  return -1;              he = he->next;          }          if (!dictIsRehashing(d)) break;      }      return idx;  }

以上是dictAdd的關鍵實現代碼。我們主要需要注意以下幾點：

它也會觸發推進一步重哈希（_dictRehashStep）。
如果正在重哈希中，它會把數據插入到ht[1]；否則插入到ht[0]。
在對應的bucket中插入數據的時候，總是插入到dictEntry的頭部。因為新數據接下來被訪問的概率可能比較高，這樣再次查找它時就比較次數較少。
_dictKeyIndex在dict中尋找插入位置。如果不在重哈希過程中，它只查找ht[0]；否則查找ht[0]和ht[1]。
dictKeyIndex可能觸發dict內存擴展（dictExpandIfNeeded，它將哈希表長度擴展為原來兩倍，具體請參考dict.c中源碼）。

dictReplace在dictAdd基礎上實現，如下：

int dictReplace(dict *d, void *key, void *val)  {      dictEntry *entry, auxentry;        /* Try to add the element. If the key       * does not exists dictAdd will suceed. */      if (dictAdd(d, key, val) == DICT_OK)          return 1;      /* It already exists, get the entry */      entry = dictFind(d, key);      /* Set the new value and free the old one. Note that it is important       * to do that in this order, as the value may just be exactly the same       * as the previous one. In this context, think to reference counting,       * you want to increment (set), and then decrement (free), and not the       * reverse. */      auxentry = *entry;      dictSetVal(d, entry, val);      dictFreeVal(d, &auxentry);      return 0;  }

在key已經存在的情況下，dictReplace會同時調用dictAdd和dictFind，這其實相當於兩次查找過程。這裡Redis的代碼不夠優化。

dict的刪除（dictDelete）

dictDelete的源碼這裡忽略，具體請參考dict.c。需要稍加註意的是：

dictDelete也會觸發推進一步重哈希（_dictRehashStep）
如果當前不在重哈希過程中，它只在ht[0]中查找要刪除的key；否則ht[0]和ht[1]它都要查找。
刪除成功後會調用key和value的析構函數（keyDestructor和valDestructor）。

dict的實現相對來說比較簡單，本文就介紹到這。在下一篇中我們將會介紹Redis中動態字符串的實現——sds，敬請期待。

本文鏈接：http://zhangtielei.com/posts/blog-redis-dict.html