Some questions about word2vec

June 23, 2020
Notes
ML/DL, nlp

CBOW vs. skip-gram

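CBOW

Predicts the center word from its context. Rare words get smoothed over, so accuracy is higher for frequent words.

skip-gram

Predicts the context from the center word. It goes through more training updates than CBOW and represents rare words better.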

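For example, given the context yesterday was a really [...] day, CBOW is likely to output beautiful or nice, while delightful gets a much lower probability. Skip-gram, by contrast, is given delightful and has to understand its meaning well enough to output yesterday was a really [...] day as a likely context. [1]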

In the “skip-gram” mode alternative to “CBOW”, rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict ‘ate’ from average(‘The’, ‘cat’, ‘the’, ‘mouse’)], the network is presented with four skip-gram examples [predict ‘ate’ from ‘The’], [predict ‘ate’ from ‘cat’], [predict ‘ate’ from ‘the’], [predict ‘ate’ from ‘mouse’]. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)
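To make the difference concrete, here is a minimal sketch that builds both kinds of training examples from the sentence in the quote. The function names are my own, the window is fixed at 2, and the random window reduction mentioned above is left out, so treat it as an illustration rather than what any particular library does.

```python
def cbow_examples(tokens, window=2):
    """One example per position: (list of context words, center word)."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples


def skipgram_examples(tokens, window=2):
    """One example per (context word, center word) pair, as in the quote."""
    return [(ctx, center)
            for context, center in cbow_examples(tokens, window)
            for ctx in context]


tokens = ["The", "cat", "ate", "the", "mouse"]
cbow = cbow_examples(tokens)
sg = skipgram_examples(tokens)

print(cbow[2])                            # the single CBOW example for 'ate'
print([p for p in sg if p[1] == "ate"])   # the four skip-gram examples for 'ate'
print(len(cbow), len(sg))                 # skip-gram yields more examples
```

The same position produces one CBOW example but one skip-gram example per context word, which is where the extra training work comes from.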

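When CBOW predicts the center word, it uses gradient descent to keep adjusting the weights (the vectors of the surrounding words). The adjustment is uniform across the context: the computed gradient is applied equally to the vector of every surrounding word. The time complexity is $O(V)$.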

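Skip-gram uses how well each surrounding word is predicted to keep adjusting the center word's vector by gradient descent. It makes more predictions than CBOW, because every word, when it is the center word, requires one prediction per surrounding word. The time complexity is $O(KV)$, where $K$ is the number of surrounding words per center word. [2]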
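The two update patterns can be sketched with NumPy. This is only an illustration under my own assumptions: a full softmax instead of negative sampling or hierarchical softmax, averaged context vectors for CBOW, made-up sizes and learning rate, and hypothetical function names.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 3                                   # vocabulary and embedding sizes (illustrative)
W_in = rng.normal(scale=0.1, size=(V, D))     # input word vectors
W_out = rng.normal(scale=0.1, size=(D, V))    # output weights


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def cbow_step(W_in, W_out, context_ids, center_id, lr=0.1):
    """One prediction per center word; every context vector gets the same update."""
    W_in, W_out = W_in.copy(), W_out.copy()
    h = W_in[context_ids].mean(axis=0)        # average the context vectors
    err = softmax(h @ W_out)                  # predicted distribution ...
    err[center_id] -= 1.0                     # ... minus the one-hot target
    grad_h = W_out @ err                      # gradient w.r.t. the hidden vector
    W_out -= lr * np.outer(h, err)
    for c in context_ids:                     # identical gradient for each context word
        W_in[c] -= lr * grad_h / len(context_ids)
    return W_in, W_out


def skipgram_step(W_in, W_out, center_id, context_ids, lr=0.1):
    """One prediction per context word; only the center vector is updated on the input side."""
    W_in, W_out = W_in.copy(), W_out.copy()
    for ctx in context_ids:
        h = W_in[center_id].copy()
        err = softmax(h @ W_out)
        err[ctx] -= 1.0
        grad_h = W_out @ err
        W_out -= lr * np.outer(h, err)
        W_in[center_id] -= lr * grad_h
    return W_in, W_out


context_ids, center_id = [0, 2, 3], 1
delta_cbow = cbow_step(W_in, W_out, context_ids, center_id)[0] - W_in
delta_sg = skipgram_step(W_in, W_out, center_id, context_ids)[0] - W_in

print(np.allclose(delta_cbow[0], delta_cbow[2]))   # context rows move identically in CBOW
print(np.allclose(delta_sg[context_ids], 0.0))     # skip-gram leaves context input rows alone
```

CBOW does one softmax per center word, while skip-gram does one per context word, which matches the $O(V)$ versus $O(KV)$ comparison above.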

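skip-gram

Input layer: the one-hot encoding of the center word; its dimension equals the vocabulary size.

Hidden layer: its size equals the word-vector dimension.

Output layer: same dimension as the input layer; it outputs a probability for each word.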

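Suppose we have the following sentences:

"the dog saw a cat", "the dog chased the cat", "the cat climbed a tree"

There are 8 distinct words. Assuming the hidden layer has three neurons, $W_I$ and $W_O$ are 8×3 and 3×8 matrices respectively, and both are initialized before training starts.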

Suppose the input is "cat", with one-hot encoding [0 1 0 0 0 0 0 0]. The hidden layer is then

\[h_t = X_t W_I = [-0.490796 \qquad -0.229903 \qquad 0.065460]\]

This is the second row of $W_I$ (the vector representing "cat"), so the multiplication is just a table lookup.
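A quick NumPy check of the lookup claim. Only the second row of the matrix comes from the example above; the other rows are random placeholders I made up, since the full initial matrix is not given.

```python
import numpy as np

rng = np.random.default_rng(0)
W_I = rng.uniform(-0.5, 0.5, size=(8, 3))     # placeholder initial weights (made up)
W_I[1] = [-0.490796, -0.229903, 0.065460]     # the row given in the text for "cat"

x = np.zeros(8)
x[1] = 1.0                                    # one-hot vector [0 1 0 0 0 0 0 0]

h = x @ W_I                                   # matrix product with the one-hot input
print(h)
print(np.allclose(h, W_I[1]))                 # same as reading row 1 directly
```

In practice this is why implementations index into the embedding matrix instead of actually multiplying by a one-hot vector.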

Actual training uses negative sampling.
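As a rough sketch of what negative sampling computes for one (center, context) pair: the loss is $-\log\sigma(u_o \cdot v_c) - \sum_k \log\sigma(-u_k \cdot v_c)$, so only the true context word and a few sampled "negative" words are scored instead of the whole vocabulary. The function name, the sizes, and the uniform choice of negatives are my own simplifications (the original word2vec samples negatives from a smoothed unigram distribution), and the output vectors are stored row-wise here, i.e. as the transpose of the 3×8 $W_O$ above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(v_center, u_context, u_negatives):
    """Loss for one training pair: reward the true context, penalize the negatives."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.log(sigmoid(-(u_negatives @ v_center))).sum()
    return -(pos + neg)


rng = np.random.default_rng(0)
V, D, k = 8, 3, 2                              # vocabulary size, vector size, negatives per pair
W_in = rng.normal(scale=0.1, size=(V, D))      # center-word vectors
W_out = rng.normal(scale=0.1, size=(V, D))     # output vectors, one row per word

center, context = 1, 0
candidates = [w for w in range(V) if w not in (center, context)]
negatives = rng.choice(candidates, size=k, replace=False)   # uniform sampling (simplification)

loss = negative_sampling_loss(W_in[center], W_out[context], W_out[negatives])
print(loss)
```

Each update touches only k + 1 output vectors, which is what makes this much cheaper than a softmax over all V words.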

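Huffman tree

Used for hierarchical softmax: it turns one N-way classification problem into roughly log(N) binary classifications.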
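A small sketch of the idea, building Huffman codes over word frequencies from the three example sentences. Each bit of a word's code stands for one binary decision on the path from the root to that word, so the code length is the number of binary classifications needed for that word. The helper name is my own, and this only builds the codes; it does not train anything.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Return {word: binary code string} for a dict of word frequencies."""
    tiebreak = count()                          # unique ids keep heap entries comparable on ties
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)       # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


corpus = "the dog saw a cat the dog chased the cat the cat climbed a tree".split()
codes = huffman_codes(Counter(corpus))

for word, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(word, code, len(code))                # frequent words get shorter paths
```

Because frequent words sit closer to the root, the average number of binary decisions per training word is lower than with a balanced tree.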

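Word2vec vs. GloVe

Predictive (word2vec): the objective is to keep improving the ability to predict other words, reducing the prediction loss, and the word vectors come out as a result.

Count-based (GloVe): word vectors are obtained by reducing the dimensionality of a co-occurrence matrix.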

[1] //stackoverflow.com/questions/38287772/cbow-v-s-skip-gram-why-invert-context-and-target-words

[2] //zhuanlan.zhihu.com/p/37477611
