node2vec和deepwalk中的取樣

2021 年 7 月 15 日
AI

關於權重的問題，先理解一下deepwalk，deepwalk原論文里取樣使用的是均勻取樣，例如

假設start node是node3，則從3開始，節點3隨機取樣，因為節點3的一階鄰節點只有2，所以無論怎麼取樣都只能得到節點2，然後遊走到節點2，節點2可以遊走到節點1，6，3，4，即節點2到1，6，3，4的概率分別為[1/4,1/4,1/4,1/4]，cumsum之後為[0.25,0.5,0.75,1]，然後np.random.uniform()生成一個0~1之間的隨機數，這個隨機數落在0~0.25的區間則選擇節點1，落到0.25~0.5的區間則選擇節點2.。。以此類推。

當邊存在權重的時候，以節點2為例，假設節點2到1，6，3，4的權重分別為 1,1,2,2，則首先要標準化,1+1+2+2=6，然後權重normalize為[1/6,1/6,2/6,2/6]，然後cumsum，剩下的計算和選擇方法和上面是一樣的。

通過np.searchsorted可以比較快速方便的實現。

方法1：簡單的方法

當邊存在權重的時候，我們上面使用的方法已經涉及到第一種最簡單的 非均勻分布的離散型隨機變數的概率分布的取樣了。

簡單的取樣方法中，涉及到線性查找，時間複雜度為O(n),而np.searchsorted內部使用的是二分查找，因此時間複雜度下降為O(logn).值得注意的是，當graph是無權圖的時候，我們不需要線性查找，以節點2為例，我們只需要生成[0,1,2,3]的一個數組，然後均勻分布生成一個0~3之間的隨機整數，然後用數據的index，可以直接訪問到對應的事件，事件複雜度為O(1)。不過一般來說，為了統一無權和有權圖，都是傾向於使用線性查找的方式來做的，畢竟有權圖就不能直接使用這種O(1)的方法，而有權圖在我遇到的業務問題中比無權圖多的多的多。

方法2： alias sampling

alias sampling能夠達到O(1)的時間複雜度，alias sampling的做法很漂亮，用空間換時間，為了方便理解，這裡假設節點2和節點1，6，4，3的edge的權重，normalize之後為[1/2,1/3,1/12,1/12]，因為後續涉及到取樣，權重要轉化為概率值的形式。

則 alisa samping的做法是：

圖片和例子來源於：

Alias取樣演算法 – 吱吱了了 – 部落格園www.cnblogs.com

程式碼來源於：

//www.programmersought.com/article/47703588096/www.programmersought.com

1、我們設Prob_val=[1/2,1/3,1/12,1/12]，L=len(Prob_val)，初始化四個數組：

    L = len(Prob_val)
    # Initialize two arrays
    accept_prob = np.zeros(L)   # Saved probability
    alias_index = np.zeros(L, dtype=np.int)  # Save the subscript / serial number

    # Large queues are used to store node labels with an area larger than 1, and small queues are used to store node labels with an area smaller than 1.
    small_queue = []
    large_queue = []

2、節點2的鄰節點有4個，意味著存在4個隨機事件，用4乘normalize之後的edge的權重向量得到：[2,4/3,1/3,1/3]

對應程式碼

    for index, prob in enumerate(Prob_val):
        accept_prob[index] = L*prob

        if accept_prob[index] < 1.0:
            small_queue.append(index)
        else:
            large_queue.append(index)

這裡，small_queue用來存放計算結果小於1的事件的index，large_queue用來存放計算結果大於等於1的事件的index，我們可以得到small_queue=[0,1],large_queue=[2,3]。

0，1，2，3事件分別對應遊走到節點1，遊走到節點6，遊走到節點3，遊走到節點4 一共四個事件。

3、核心步驟

    # 1. Take one from each of the two queues at a time, and let the big one supplement the small one, and then the small one out of the small queue
    # 2. After looking at the big value and subtracting the small supply, the remaining value, if it is greater than 1, continue to be placed in the large queue; if it is exactly equal to 1, it will be out of the queue; if it is less than 1, join the small queue
    while small_queue and large_queue:
        small_index = small_queue.pop()
        large_index = large_queue.pop()
        # Because there is stored in alias_index: the label of another event, then now with a large probability to supplement the small probability, the label will become the label of the large event
        alias_index[small_index] = large_index
        # The principle of supplementation is: the large probability is to fill up the small probability (to the probability of 1), then the rest
        accept_prob[large_index] = accept_prob[large_index] + accept_prob[small_index] - 1.0
        # After judging the completion, the size of the remaining value
        if accept_prob[large_index] < 1.0:
            small_queue.append(large_index)
        else:
            large_queue.append(large_index)

這裡做的事情，是將上述矩陣轉化為：

這樣的形式，（看這個程式碼的邏輯頭太疼了，偷懶不解釋過程了）

leetcode上居然有alias sample的題，牛逼牛逼

Python3-alias-sample-by-la-bi-xiao-xin-sqw/” data-draft-node=”block” data-draft-type=”link-card” class=”LinkCard old LinkCard–noImage”>力扣leetcode-cn.com

4、最終得到：

return accept_prob, alias_index

accept_prob=[2/3,1,1/3,1/3],alias_index=[2,0,1,1]

alias_index的設計比較nice，我們設index=0，表示事件0，則alias_index[index]=2，表示如果不是事件0，則為事件2，通過這種做法，用一個數組的每一位存儲了兩種可能的事件。

對應的，accept_prob[index]=accept_prob[0]=2/3，表示了事件0發生的概率，對應的事件2發生的概率即為1-2/3=1/3；

到這裡可以看到，alis table的思路不複雜，就是將原來在1維向量上的取樣，變成在2維的平面上的取樣了。

5、最終的取樣：

def alias_smaple(accept_prob, alias_index):
    N = len(accept_prob)

    # Throw the first dice to generate the first random number from 1 to N, and decide which column to fall in
    random_num1 = int(np.floor(np.random.rand()*N))
    # Throw the second dice, generate a random number between 0 ~ 1, judge the size of accept_prob [random_num1]
    random_num2 = np.random.rand()

    # If less than Prab [i], sample i, if greater than Prab [i], sample Alias [i]
    if random_num2 < accept_prob[random_num1]:
        return random_num1
    else:
        alias_index[random_num1]

只需要做兩次取樣，random_num1 = int(np.floor(np.random.rand()*N)),生成一個0~3之間的隨機的整數，random_num2 = np.random.rand()，生成一個0~1之間的隨機的小數。

random_num1用來確定我們使用哪一個區間，比如random_num1=2，則通過數組的index，可以以O(1)的時間複雜度訪問到對應的那一列。

然後就很簡單了，random2和accept_prob[random_num1]存放的概率比較，直接放回對應結果。

整個過程就涉及到兩個隨機數生成和大小比較以及index的索引，常數時間複雜度。不過需要注意的是，構造accept_prob, alias_index 時，時間複雜度還是O(n)。不過總的程式運行時間當然是大大縮短的，畢竟 alias table我們只需要構建一次，存下來就可以了。

我們上面提到的兩種取樣方法，和node2vec的有偏取樣中的p和q沒有任何關係，都是針對於一般的帶權圖，換句話說，deepwalk面對帶權圖的時候使用的取樣方法可以是上述的這兩種，而node2vec中又引入的p，q參數和這裡的取樣方法無關，那又是更煩人的一件事情了。

方法3 reject sampling

上面的例子里，我們是以節點2為例，講述了整個alias sampling的計算過程，graph上有n個節點，則就需要構建n個alias table。

而node2vec中，引入的p、q參數使得問題更加複雜。

上述的圖片，描述的是節點從節點t到節點v對應的取樣的過程，（這裡補充一下，node2vec的有偏取樣的初始化，論文里給的圖片描述的是t到v之後，在v節點的時候我們面臨的遊走的決策，但是沒有提到t是怎麼到v的，實際上在node2vec的程式碼實現里，初始的步，從t到v這一步，使用的是和deepwalk一樣的取樣策略）

重點是這個公式，前面的

這個沒啥問題，對於每個節點t來說，alpha(t,x)都是固定的。主要的問題在於：

這一項，Wvx可以使用前面提到的alisa sampling的方法來計算得到，引入了p，q之後，我們只要將alpha和對應的概率相乘就可以了，舉個例子，假設節點v和4個鄰節點的邊的權重均為1，則t,x1,x2,x3四個鄰節點被遊走到的概率分別為[1/4,1/4,1/4,1/4]，然後假設p=q=0.5，則1/p=1/q=2，和[1/4,1/4,1/4,1/4]相乘可以得到[t:1/2,x1:1/4,x2:1/2,x3:1/2]，然後構建Wvx對應的 alias table或者直接cumsum之後取樣即可。

但是麻煩的地方在於，這裡的Wvx的alias table構建的前提是，v節點的上一個節點是t節點，對於節點v而言，edges是固定不變的，但是alpha對應的這個公式是根據節點v的上一個節點而變的，例如假設我們是從x1遊走到v節點，則alpha這一項對應的數值就改變了，因此alias table也不能使用原來的那個需要重新構建。

這就意味著，我們其實是根據每一個node pairs來構建alias table的，在deepwalk中，一個節點v只要構建一個alias table，但是在node2vec中，因為引入了p q 參數，一個節點v需要構建4個alias table，如果某個節點的鄰節點非常多，則需要構建大量的alias table，雖然alias table的查詢的時間複雜度為O(1)，但是這樣的alias table的構建方式，空間複雜度為原來的O(n)變成O(m)，n是節點的數量，m是邊的數量X2，因為節點v到節點X的邊和節點X到節點V的邊在node2vec中意義是不同的，因為二者對應的alpha函數是不同的。

那麼拒絕取樣是怎麼做的呢？直接看程式碼：

import numpy as np
from numba import njit


@njit(nogil=True)
def random_walk(walk_length, p, q, t):
    """sample a random walk starting from t
    """
    # Normalize the weights to compute rejection probabilities
    max_prob = max(1 / p, 1, 1 / q)
    prob_0 = 1 / p / max_prob
    prob_1 = 1 / max_prob
    prob_2 = 1 / q / max_prob

    # Initialize the walk
    walk = np.empty(walk_length, dtype=indices.dtype)
    walk[0] = t
    walk[1] = random_neighbor(t)

    for j in range(2, walk_length):
        while True:
            new_node = random_neighbor(walk[j - 1])
            r = np.random.rand()
            if new_node == walk[j - 2]:
                # back to the previous node
                if r < prob_0:
                    break
            elif is_neighbor(walk[j - 2], new_node):
                # distance 1
                if r < prob_1:
                    break
            elif r < prob_2:
                # distance 2
                break
        walk[j] = new_node
        
    return walk

可以看到，拒絕取樣的思路非常的簡單，首先是：

    walk[0] = t
    walk[1] = random_neighbor(t)

t是初始的遊走節點，random_neighbor這個函數的功能就是根據節點t和其鄰節點的權重進行取樣，這裡可以使用PartialSum（partialsum就是第一種，直接用numpy.cumsum實現的取樣的方法）的方法，也可以使用alias sampling的方法。

然後就是核心部分：

    for j in range(2, walk_length):
        while True:
            new_node = random_neighbor(walk[j - 1])
            r = np.random.rand()
            if new_node == walk[j - 2]:
                # back to the previous node
                if r < prob_0:
                    break
            elif is_neighbor(walk[j - 2], new_node):
                # distance 1
                if r < prob_1:
                    break
            elif r < prob_2:
                # distance 2
                break
        walk[j] = new_node

思路非常簡單，也比較符合「拒絕取樣」的命名，符合要求，取樣結束，否則，繼續取樣。

可以看到，這種方式非常簡單直觀，並且不需要預先存儲那麼多的alias table，但是時間複雜度上會比使用alias table的方法要大。

程式碼來自於：

//louisabraham.github.io/articles/node2vec-sampling.htmllouisabraham.github.io