茅坑殺手與Alias Method離散取樣

說起Alias,你可能第一個聯想到的是Linux中的Alias命令,就像中世紀那些躲在茅坑下面(是真的,起碼日本有糞坑忍者,沒有馬桶的年代就是社會的噩夢)進行刺殺的殺手一樣,讓人防不勝防,對於那些被這個命令坑過的人來說,電腦必須時刻出現在視野內,因為你不知道你身邊的殺手朋友什麼時候會模仿中世紀茅坑殺手在你的終端執行這樣一條命令。

alias cd=rm -rf

(如果不懂這個梗,給一個小提示,alias命令是給某個命令重命名,這裡把cd命令改成了rm -rf命令,你每次進入目錄其實是刪除了目錄)

說回正題,前段時間看同事寫的程式碼有一個應用到權重的地方,對於系統中用到的兩個數據源,我們希望有一定比例走新數據源,一定比例走老數據源(感覺用的花里胡哨,直接配置中心搞一個開關不就好了),看到他們寫的程式碼中使用了Alias Method來實現權重,所以就專門學習了一下(雖然感覺在新老數據源這個場景中使用有點雞肋,但是掌握一門權重演算法的實現還是很有幫助的)。

先貼出一篇國外寫的很好的部落格:

//www.keithschwarz.com/darts-dice-coins/

飛鏢,骰子,硬幣:離散取樣方法

一. 原理

權重演算法的本質是指定時間發生的幾率,比如用戶的每一次請求我有30%的幾率打到伺服器A,有20%的幾率到伺服器B,還有50%的幾率到伺服器C。那麼給出未來十次的用戶請求序列,要求給出它們會分別打到哪個伺服器。

最簡單的實現當然是產生一個位於0.0~1.0之間的隨機數。

  1. 當隨機數位於0~0.3時,請求伺服器A
  2. 當隨機數位於0.3~0.5時,請求伺服器B
  3. 當隨機數位於0.5~1時,請求伺服器C

這種方式實現簡單,生成一個隨機數,然後在對應時間上if-else判斷即可,但是存在精度問題。

另一種實現方式是離散演算法,通過概率分布構造幾個點,[30, 50, 100],代表三個區間,再生成1~100的整數,看它位於哪個區間,比如45位於[30~50]區間,就說明本次請求打到伺服器B。

現在進入正題Alias Method演算法,Alias Method的最終結果是要構造拼裝出一個每一列合都為1的矩形,若每一列最後都要為1,那麼要將所有元素都乘以概率類型的數量(此處為3)。

此時右邊乘以3之後的模型概率大於一或者小於一,需要用大於一去補足小於一的,而且必須滿足每列必須至多只有兩種組合。

根據填充後的上圖我們可以得到兩個數組,分別是位於下方的[9/10, 6/10, 10/10]組成的Prod數組[0.9, 0.6, 1]以及上方用來填充的顏色的序號[3, 3, NULL],第一個3代表紅色列被第三列藍色填充,第二個3代表綠色列被第三列藍色填充,第三列概率大於1,不需要填充,為NULL。

T 1 2 3
PDF 0.3 0.2 0.5
Prob 0.9 0.6 1
Alias 3 3 NULL

得到這兩個數組之後,隨機取其中的一列(代表事件),比如是第二列,讓Prob[2]的值與一個隨機小數f比較,如果f小於Prob[2],那麼結果就是2,否則就是Alias[2],即3,若為第三列,Prob[3]必定大於小數f,結果就是3。

可以來簡單驗證一下,比如隨機到第三列的概率是1/3,第三列全部為紅色,則第三列事件3發生概率就是1/3乘1,藍色再其他列上部分還有覆蓋,分別是1/3乘2/51/3乘1/10,最終的結果還是為0.5,符合原來的pdf概率。這種演算法初始化較複雜,但生成隨機結果的時間複雜度為O(1),是一種性能非常好的演算法。

二. 程式碼實現

下面是一個較為完整的程式碼實現,該實例在使用中只需要在AliasMethod中初始化好我們之前所說的Alias和Prob數組即可。

import java.util.*;

/**
 * @description 權重演算法
 */
public final class AliasMethod {
    /**
     * The random number generator used to sample from the distribution.
     */
    private final Random random;
    /**
     * The probability and alias tables.
     */
    private final int[] alias;
    private final double[] probability;

    /**
     * Constructs a new AliasMethod to sample from a discrete distribution and
     * hand back outcomes based on the probability distribution.
     * <p>
     * Given as input a list of probabilities corresponding to outcomes 0, 1,
     * ..., n - 1, this constructor creates the probability and alias tables
     * needed to efficiently sample from this distribution.
     *
     * @param probabilities The list of probabilities.
     */
    public AliasMethod(List<Double> probabilities) {
        this(probabilities, new Random());
    }

    /**
     * Constructs a new AliasMethod to sample from a discrete distribution and
     * hand back outcomes based on the probability distribution.
     * <p>
     * Given as input a list of probabilities corresponding to outcomes 0, 1,
     * ..., n - 1, along with the random number generator that should be used
     * as the underlying generator, this constructor creates the probability
     * and alias tables needed to efficiently sample from this distribution.
     *
     * @param probabilities The list of probabilities.
     * @param random        The random number generator
     */
    public AliasMethod(List<Double> probabilities, Random random) {
        /* Begin by doing basic structural checks on the inputs. */
        if (probabilities == null || random == null) {
            throw new NullPointerException();
        }
        if (probabilities.size() == 0) {
            throw new IllegalArgumentException("Probability vector must be nonempty.");
        }
        /* Allocate space for the probability and alias tables. */
        probability = new double[probabilities.size()];
        alias = new int[probabilities.size()];
        /* Store the underlying generator. */
        this.random = random;
        /* Compute the average probability and cache it for later use. */
        final double average = 1.0 / probabilities.size();
        /* Make a copy of the probabilities list, since we will be making
         * changes to it.
         */
        probabilities = new ArrayList<>(probabilities);
        /* Create two stacks to act as worklists as we populate the tables. */
        Deque<Integer> small = new ArrayDeque<>();
        Deque<Integer> large = new ArrayDeque<>();
        /* Populate the stacks with the input probabilities. */
        for (int i = 0; i < probabilities.size(); ++i) {
            /* If the probability is below the average probability, then we add
             * it to the small list; otherwise we add it to the large list.
             */
            if (probabilities.get(i) >= average) {
                large.add(i);
            } else {
                small.add(i);
            }
        }
        /* As a note: in the mathematical specification of the algorithm, we
         * will always exhaust the small list before the big list. However,
         * due to floating point inaccuracies, this is not necessarily true.
         * Consequently, this inner loop (which tries to pair small and large
         * elements) will have to check that both lists aren't empty.
         */
        while (!small.isEmpty() && !large.isEmpty()) {
            /* Get the index of the small and the large probabilities. */
            int less = small.removeLast();
            int more = large.removeLast();
            /* These probabilities have not yet been scaled up to be such that
             * 1/n is given weight 1.0. We do this here instead.
             */
            probability[less] = probabilities.get(less) * probabilities.size();
            alias[less] = more;
            /* Decrease the probability of the larger one by the appropriate
             * amount.
             */
            probabilities.set(more,
                    (probabilities.get(more) + probabilities.get(less)) - average);
            /* If the new probability is less than the average, add it into the
             * small list; otherwise add it to the large list.
             */
            if (probabilities.get(more) >= 1.0 / probabilities.size()) {
                large.add(more);
            } else {
                small.add(more);
            }
        }
        /* At this point, everything is in one list, which means that the
         * remaining probabilities should all be 1/n. Based on this, set them
         * appropriately. Due to numerical issues, we can't be sure which
         * stack will hold the entries, so we empty both.
         */
        while (!small.isEmpty()) {
            probability[small.removeLast()] = 1.0;
        }
        while (!large.isEmpty()) {
            probability[large.removeLast()] = 1.0;
        }
    }

    /**
     * Samples a value from the underlying distribution.
     *
     * @return A random value sampled from the underlying distribution.
     */
    public int next() {
        /* Generate a fair die roll to determine which column to inspect. */
        int column = random.nextInt(probability.length);
        /* Generate a biased coin toss to determine which option to pick. */
        boolean coinToss = random.nextDouble() < probability[column];
        /* Based on the outcome, return either the column or its alias. */
        return coinToss ? column : alias[column];
    }
}

另外,stackoverflow上有一個回答,關於程式中擲骰子這個事件的模型改用什麼數據結構,無疑可以使用Alias Method,感興趣的老鐵可以自己畫畫它的概率模型。

另外回答中還有各種千奇百怪的,比如說下面這個,使用平衡二叉搜索樹,骰子的每一個面設置一個節點。

三. 最後

沒有最後,點個關注點個在看,你懂我意思嗎? 我向你敬禮!salute。(該不會就只看懂個茅坑殺手吧)