Neural machine Translation of Rare Words with Subword Units

2019 年 12 月 10 日
筆記

本次總結的是一篇16年的關於NLP中分詞操作的論文，論文鏈接Subword，參考的實現程式碼subword-nmt，許多論文方法（例如BERT等）都將該方法應用到分詞處理上，相對於word-level和character-level，該方法取得了不錯的效果。

動機和創新點

機器翻譯中，通常使用固定大小的詞表，而在實際翻譯場景中，應當是open-vocabulary。這就使得翻譯數據集中的稀有詞變得困難，不能生成詞表中沒出現的詞。這就是OOV問題。
對於word-level 翻譯模型，通常使用back-off dictionary（比如將source和target兩兩對應起來，使用OOV來表示，這樣在翻譯結果中出現OOV時，就用source所對應的target來代替）來處理OOV辭彙。但是這樣做是建立在source target中的詞總是能一一對應的前提下，因為語言之間的形態合成程度不同，這種假設常常不成立；其次word-level 翻譯模型不能生成模型沒見過的詞(不在詞表中)，對於這種情況，有論文提出直接從copy unknown 詞到target words中，但是這種處理策略也僅僅限於一些實體名稱類的辭彙；同時為了節省計算時間和資源，詞表大小通常被限制在30k-50k之間，所以詞表空間是比較昂貴的，如果一些詞義類似的詞放到詞表中，例如like，liked，liking等形態上類似的詞均放到詞表中，直觀上感覺有些浪費。
在實際翻譯時，並不一定都是以word為基本單位進行翻譯，可通過比單詞更小的單位(subword)進行翻譯，例如名稱類的詞（通過subword複製或音譯）、複合詞（形態上類似的詞，前綴後綴相同，如run、runer、running等，可通過合成(run與er合成等）翻譯）、同源詞和外來詞（通過subword語音和形態轉換），以上稱為透明翻譯。直觀上來看，有時相對於以一個word作為基本單位整體去翻譯，這種通過subword來翻譯則顯得更有效率和意義。論文中提到，從德語數據集中分析，在100個稀有詞（不在出現最頻繁的5000個詞中）中，大多數的詞可以通過更小的subword units進行翻譯。
那麼，如何將word切分成合適的subword呢？論文中提出了採用Byte pair encoding（BPE)壓縮演算法，首先以字元劃分，然後再合併。也就是不斷的以出現頻次最高的2-gram進行合併操做，直到打到詞表大小為止。這種以頻次合併是比較符合常識的，例如像'er','ing,'ed'這樣比較有意義的後綴，'e'和'r'同時出現的頻次應該比較多。「ing」和「ed」類似道理。
將稀有詞劃分成subword，能比較好的處理oov問題。例如訓練集中大量出現「runner」和「thinking」，那按word-level詞頻來說，word-level詞表中應該包含這兩個詞，此時詞「running」只出現一次，則該詞很有可能是oov，但是如果以subword切詞，則subword詞表中應該含有「run」，」er」，」think」，」ing」，那麼對於」running」，其subword切詞後為「run」，」ing」均在詞表中。
通過對稀有詞進行適當的切分，得到subword units，機器翻譯模型能更好的處理透明翻譯，並且能生成一些unseen words。
機器翻譯模型通常都會用到注意力機制，在word-level模型中，模型每次只能計算在word級別上的注意力，我們希望模型可以在每一步學習中將注意力放在不同的subword上，顯然這樣更有意義和效率。

BPE

一個很簡單的壓縮演算法。具體來說，分為以下幾步：

將原始數據集劃分成單詞集合，再將每個單詞劃分成字符集，對於每個單詞最後一個字元後面加上'-'（標記單詞邊界，以後用於字元恢復成單詞），這樣就得到了一個大的字符集合。
統計上面的得到的字符集合，統計每個單詞內 2-gram字元出現頻次，得到頻次最高的2-gram字元，例如（『A』，『B』）連續出現的頻率最高，則以「AB」替換所有單詞內出現的（『A『，『B』），然後將該新詞添加到詞表中。這裡注意：該新詞以後可以作為一個整體與單詞內其他字元合併；替換或合併不跨單詞邊界。
對步驟2循環進行多次，直到達到就得到我們預設的詞表大小。最終的vocab_size = init_size（這裡為0）+ merge_operation_count， merge_operation_count是模型唯一的超參數。

上圖中，實際上將er這樣常見且很有意義的後綴作為一個subword放入到詞表中，對模型理解語言和節省詞表空間、以及OOV問題很有幫助。

實現程式碼

這篇論文所提方法很簡單，但是程式碼具體實現可以了解下，挺有趣的。這裡參考的程式碼nmt-subword。程式碼中主要有以下兩個重要文件程式碼。依次來看看。

learn_bpe.py

在原始數據集上，統計每個word出現的頻率，然後得到每個單詞的字元序列（末尾加'</w>'標記單詞邊界），和其出現頻率，組成字典，然後按照頻率進行倒排序，得到sorted_vocab。

vocab = get_vocabulary(infile, is_dict)  vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])  sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

統計每個單詞內2-gram出現的頻率

def get_pair_statistics(vocab):      """Count frequency of all symbol pairs, and create index"""        # data structure of pair frequencies      stats = defaultdict(int)        #index from pairs to words      indices = defaultdict(lambda: defaultdict(int))        for i, (word, freq) in enumerate(vocab):          prev_char = word[0]          for char in word[1:]:              stats[prev_char, char] += freq              indices[prev_char, char][i] += 1              prev_char = char        return stats, indices

在預設的合併次數內，不斷的合併出現頻次最高的2-gram詞，同時將得到的新詞加入到詞表中，同時更新與新詞左右兩邊相連的2-gram頻次。直到出現的最高頻次低於我們預設min_frequency。

for i in range(num_symbols):       if stats:            most_frequent = max(stats, key=lambda x: (stats[x], x)) ## 統計出現的最高頻次的2-gram          # we probably missed the best pair because of pruning; go back to full statistics        if not stats or (i and stats[most_frequent] < threshold):            prune_stats(stats, big_stats, threshold)            stats = copy.deepcopy(big_stats)            most_frequent = max(stats, key=lambda x: (stats[x], x))            # threshold is inspired by Zipfian assumption, but should only affect speed            threshold = stats[most_frequent] * i/(i+10000.0)            prune_stats(stats, big_stats, threshold)          if stats[most_frequent] < min_frequency:            sys.stderr.write('no pair has frequency >= {0}. Stoppingn'.format(min_frequency))            break          if verbose:            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))        outfile.write('{0} {1}n'.format(*most_frequent)) ## 將出現頻次最高的2-gram存入到詞表中        changes = replace_pair(most_frequent, sorted_vocab, indices) ##  合併和替換出現頻次最高的2-gram        update_pair_statistics(most_frequent, changes, stats, indices) ## 更新與合併後新詞左右兩邊相連的2-gram頻次        stats[most_frequent] = 0

這樣就得到了2-gram詞頻表 BPE_codes。

apply_bpe.py

在learn_bpe.py中，按照bpe演算法得到了數據集的BPE_codes，那麼將數據餵給模型之前，我們需要將輸入數據按照bpe_codes進行編碼，通俗來說就是按照BPE_codes裡面的分詞規則對輸入數據進行分詞罷了。

def segment_tokens(self, tokens):  	 """segment a sequence of tokens with BPE encoding"""  	   output = []  	   for word in tokens:  	       # eliminate double spaces  	       if not word:  	           continue  	       new_word = [out for segment in self._isolate_glossaries(word)  	                   for out in encode( ## 下面會有encode方法介紹  	                   					segment,   ## 需要進行編碼的序列  	                                     self.bpe_codes, ## learn_bpe得到的編碼，進行了字典處理，第一個為pair 元祖，第一個為對應的索引  	                                     self.bpe_codes_reverse, ## (pair[0] + pair[1], pair)  	                                     self.vocab,  	                                     self.separator,  	                                     self.version,  	                                     self.cache,  	                                     self.glossaries)] ##  已有的一些辭彙表，如城市名稱，國家名稱等，這些詞不能再切分，並且要和其他詞split開    	       for item in new_word[:-1]:  	           output.append(item + self.separator)  	       output.append(new_word[-1])    	   return output

再來看看encode方法，就是將序列內的2-ngram按照bpe_codes不斷的合併替換。如果最終pair長度為1即沒有合併的可能或者最大的pair也不再bpe_codes中，則停止循環。

def encode(orig, bpe_codes, bpe_codes_reverse, vocab, separator, version, cache, glossaries=None):      """Encode word based on list of BPE merge operations, which are applied consecutively      """        if orig in cache:          return cache[orig]        if re.match('^({})$'.format('|'.join(glossaries)), orig):          cache[orig] = (orig,)          return (orig,)        if version == (0, 1):          word = tuple(orig) + ('</w>',)      elif version == (0, 2): # more consistent handling of word-final segments          word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)      else:          raise NotImplementedError        pairs = get_pairs(word)        if not pairs:          return orig        while True:          bigram = min(pairs, key = lambda pair: bpe_codes.get(pair, float('inf')))          if bigram not in bpe_codes:              break          first, second = bigram          new_word = []          i = 0          while i < len(word):              try:                  j = word.index(first, i)                  new_word.extend(word[i:j])                  i = j              except:                  new_word.extend(word[i:])                  break                if word[i] == first and i < len(word)-1 and word[i+1] == second:                  new_word.append(first+second)                  i += 2              else:                  new_word.append(word[i])                  i += 1          new_word = tuple(new_word)          word = new_word          if len(word) == 1:              break          else:              pairs = get_pairs(word)        # don't print end-of-word symbols      if word[-1] == '</w>':          word = word[:-1]      elif word[-1].endswith('</w>'):          word = word[:-1] + (word[-1].replace('</w>',''),)        if vocab: ## 這裡的vocab是以一定閾值，統計得到的詞表，過濾掉了低頻詞，以減少低詞頻影響。         ## 論文中講到低頻詞可能是雜訊      	## 這裡結合過濾低頻詞後的辭彙表。      	##因為過濾掉低詞頻，可能會出現oov問題，如出現oov問題，則將原詞切分為更小的詞。      	## 更小的詞，就有可能在subword詞表中。          word = check_vocab_and_split(word, bpe_codes_reverse, vocab, separator)        cache[orig] = word      return word

這樣就完成了對輸入數據的subword分詞。

個人總結

論文方法比較簡單，在流程上輸入數據預處理階段
相對於word-level，subword更有意義和效率
這種subword方法可能對中文不太奏效，因為中文沒有那種後綴之類的形式。

本文轉載自公眾號：跟我一起讀論文啦啦，作者村頭陶員外