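Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

September 30, 2020
AI

Bilingual version from the subtitle group: Text Classification with NLP: A Comparison of Three Models, Tf-Idf, Word2Vec and BERT

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translated by: 雷鋒字幕組 (關山, wiige)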

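Summary

In this article, I will use NLP and Python to explain 3 different strategies for multiclass text classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous Word Embedding (with Word2Vec), and the cutting-edge Language Models (with BERT).

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content.

There are different techniques to extract information from raw text data and use it to train a classification model. This tutorial compares the old-school Bag-of-Words approach (used with a simple machine learning algorithm), the popular Word Embedding model (used with a deep learning neural network), and the state-of-the-art Language Models (used with transfer learning from attention-based transformers), which have completely changed the NLP landscape.

I will present some useful Python code that can be easily applied to other similar cases (just copy, paste, run) and comment the code line by line so that you can replicate this example (link to the full code below).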

mdipietro09/DataScience_ArtificialIntelligence_Utils

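I will use the "News category dataset", which provides news headlines obtained from HuffPost from 2012 to 2018. The task is to assign each headline the right category, which is a multiclass classification problem (link to the dataset below).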

News Category Dataset

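In particular, I will go through:

Setup: import packages, read data, preprocessing, partitioning.
Bag-of-Words: feature engineering, feature selection and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word Embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language Models: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.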

Setup

First of all, I need to import the following libraries:

## for data
import json
import re
import pandas as pd
import numpy as np
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for text preprocessing
import nltk
## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
## for explainer
from lime import lime_text
## for word embedding
import gensim
import gensim.downloader as gensim_api
## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
## for bert language model
import transformers

The dataset is contained in a JSON file, so I will first read it into a list of dictionaries with json and then transform it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )
## print the first one
lst_dics[0]

The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics, and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})
## print 5 random rows
dtf.sample(5)

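As the figure shows, the dataset is imbalanced: the proportion of Tech news is very small compared to the other categories, which will make it rather hard for the models to recognize Tech news.

Before explaining and building the models, I am going to give an example of preprocessing by cleaning the text, removing stop words, and applying lemmatization. I will write a function and apply it to the whole dataset.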

'''
Preprocess a string.
:parameter
    :param text: string - text to preprocess
    :param lst_stopwords: list - list of stopwords to remove
    :param flg_stemm: bool - whether stemming is to be applied
    :param flg_lemm: bool - whether lemmatisation is to be applied
:return
    cleaned text
'''
def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    ## clean (convert to lowercase, remove punctuation and special
    ## characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in
                    lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

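This function removes a set of words from the corpus if one is given. I can create a list of generic stop words for the English vocabulary with nltk (we could edit this list by adding or removing words).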

lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

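Now I will apply the function I wrote to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with the raw corpus or the preprocessed text.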

dtf["text_clean"] = dtf["text"].apply(lambda x:
          utils_preprocess_text(x, flg_stemm=False, flg_lemm=True,
          lst_stopwords=lst_stopwords))
dtf.head()

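If you are interested in a deeper text analysis and preprocessing, you can check this article. I am going to partition the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.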

## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)
## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

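Let's get started!

Bag-of-Words

The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. To put it another way, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For instance, let's take 3 sentences and represent them with this approach:

Feature matrix shape: number of documents x length of vocabulary

As you can imagine, this approach causes a significant dimensionality problem: the more documents you have, the larger the vocabulary becomes, so the feature matrix will be a huge sparse matrix. Therefore, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality problem.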
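As a minimal sketch of the idea, the vocabulary and the count matrix can be built in plain Python. The three sentences below are made up for illustration and are not taken from the dataset:

```python
from collections import Counter

# Three toy documents (hypothetical, for illustration only)
docs = ["i like this article", "i like this", "this article is long"]

# Vocabulary: every distinct word, sorted to get a stable column order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Feature matrix: one row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many entries as there are words in the vocabulary, and most entries are 0 as soon as the vocabulary grows, which is the sparsity problem described above.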

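Term frequency is not necessarily the best representation for text. In fact, some common words occur very frequently in the corpus yet have little predictive power over the target variable. To address this problem there is an advanced variant of the Bag-of-Words that, instead of simple counting, uses the term frequency-inverse document frequency (Tf-Idf). Basically, the value of a word increases proportionally to its count in a document, but it is inversely proportional to the number of documents in the corpus that contain it.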
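The weighting can be sketched by hand with the textbook formula tf x log(N / df). This is an illustration on a made-up corpus; scikit-learn's TfidfVectorizer uses a smoothed idf and normalises each row by default, so its numbers will differ:

```python
import math

# Toy tokenised corpus (hypothetical)
docs = [["new", "york", "news"], ["tech", "news"], ["new", "phone", "news"]]

def tf_idf(term, doc, docs):
    """Textbook Tf-Idf: relative term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "news" appears in every document, so it carries no weight;
# "york" appears in only one document, so it gets the highest weight.
for term in ["news", "new", "york"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```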

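Let's start with feature engineering, the process of creating features by extracting information from the data. I am going to use the Tf-Idf vectorizer with a limit of 10,000 words (so the length of my vocabulary will be 10k), capturing unigrams (i.e. "new" and "york") and bigrams (i.e. "new york"). I will also provide the code for the classic count vectorizer: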

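## Count (classic BoW); only the tail of this call survived in the source, the rest is reconstructed
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))
## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))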

Now I will use the vectorizer on the preprocessed corpus of the train set to extract a vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has a shape of 34,265 (number of documents in training) x 10,000 (length of vocabulary) and it is pretty sparse:

sns.heatmap(X_train.todense()[:,np.random.randint(0,X_train.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')

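Random sample from the feature matrix (non-zero values in black)

In order to know the position of a certain word, we can look it up in the vocabulary: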

word = "new york"
dic_vocabulary[word]

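If the word exists in the vocabulary, this command prints a number N, meaning that the Nth feature of the matrix is that word.

To drop some columns and reduce the matrix dimensionality, we can carry out some Feature Selection, the process of selecting a subset of relevant variables. I will proceed as follows:

treat each category as binary (for example, the "Tech" category is 1 for Tech news and 0 for the others);
perform a Chi-Square test to determine whether a feature and the (binary) target are independent;
keep only the features with a certain p-value from the Chi-Square test.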
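To make the test concrete, here is a small sketch of a classic 2x2 Pearson chi-square test of independence between "word present" and "headline is Tech". The data is made up, and scikit-learn's chi2 computes its statistic from feature totals per class, so its exact numbers can differ from this textbook version:

```python
import math

def chi2_2x2(feature_present, is_class):
    """Pearson chi-square test of independence for two binary lists."""
    n = len(feature_present)
    # Observed counts for each (feature, class) combination
    observed = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature_present, is_class):
        observed[(f, c)] += 1
    stat = 0.0
    for (f, c), obs in observed.items():
        row_total = sum(observed[(f, k)] for k in (0, 1))
        col_total = sum(observed[(k, c)] for k in (0, 1))
        expected = row_total * col_total / n   # count expected under independence
        stat += (obs - expected) ** 2 / expected
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

is_tech   = [1, 1, 1, 1, 0, 0, 0, 0]
has_apple = [1, 1, 1, 0, 0, 0, 0, 1]   # appears mostly in Tech headlines
has_new   = [1, 0, 1, 0, 1, 0, 1, 0]   # appears equally in both groups

for name, feature in [("apple", has_apple), ("new", has_new)]:
    stat, p = chi2_2x2(feature, is_tech)
    print(name, "| chi2:", round(stat, 2), "| score (1-p):", round(1 - p, 2))
```

A word that is spread evenly across the classes gets a score of 0 and would be dropped, while a word concentrated in one class gets a higher score.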

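y = dtf_train["y"]
## get_feature_names_out() needs scikit-learn >= 1.0 (older versions: get_feature_names())
X_names = vectorizer.get_feature_names_out()
p_value_limit = 0.95
lst_features = []
for cat in np.unique(y):
    ## chi-square test of every feature against the binary target (y == cat)
    chi2, p = feature_selection.chi2(X_train, y==cat)
    lst_features.append(pd.DataFrame(
                   {"feature":X_names, "score":1-p, "y":cat}))
## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dtf_features = pd.concat(lst_features)
dtf_features = dtf_features.sort_values(["y","score"],
                    ascending=[True,False])
dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()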

This reduced the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some:

for cat in np.unique(y):
   print("# {}:".format(cat))
   print("  . selected features:",
         len(dtf_features[dtf_features["y"]==cat]))
   print("  . top features:", ",".join(
         dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
   print(" ")

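We can refit the vectorizer on the corpus by giving this new set of words as input. That will produce a smaller feature matrix and a shorter vocabulary.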

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

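The new feature matrix X_train has a shape of 34,265 (number of documents in training) x 3,152 (length of the given vocabulary). Let's see if the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)

It's time to train a machine learning model and test it. I recommend using a Naive Bayes algorithm: a probabilistic classifier that makes use of Bayes' Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related. This algorithm is well suited to such a large dataset, as it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.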
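The mechanics can be sketched from scratch on a toy corpus. This is a simplified multinomial Naive Bayes with Laplace smoothing on raw word counts, written for illustration only (the function names and headlines are hypothetical, and it is not scikit-learn's implementation):

```python
import math
from collections import Counter, defaultdict

def fit_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {w for t in texts for w in t.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) /
                               (total + alpha * len(vocab)))
                   for w in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, text):
    """Return the class with the highest posterior score (unknown words are ignored)."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in text.split()
                                        if w in log_lik)
    return max(scores, key=scores.get)

texts = ["senate votes on bill", "president signs bill",
         "new phone released", "phone app update"]
labels = ["POLITICS", "POLITICS", "TECH", "TECH"]
nb_model = fit_nb(texts, labels)
print(predict_nb(nb_model, "senate bill update"))
print(predict_nb(nb_model, "new app"))
```

Each word contributes independently to the score of each class, which is the "naive" assumption that keeps the model cheap to train on large, sparse feature matrices.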

classifier = naive_bayes.MultinomialNB()

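I am going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need to build a scikit-learn pipeline: a sequential application of a list of transformations followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline allows us to transform and predict the test data in just one step.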

## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer),
                           ("classifier", classifier)])
## train classifier
model["classifier"].fit(X_train, y_train)
## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)

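We can now evaluate the performance of the Bag-of-Words model. I will use the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion Matrix: a summary table that breaks down the number of correct and incorrect predictions by class.
ROC: a plot of the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) indicates the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the fraction of retrieved instances (TP+FP) that are relevant (TP).
Recall: the fraction of all relevant instances (TP+FN) that are actually retrieved (TP).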
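As a quick sanity check of these definitions, accuracy, precision and recall for one class can be computed by hand. The labels below are made up, and the helper name is hypothetical:

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy over all classes, precision and recall for one chosen class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = ["TECH", "TECH", "TECH", "POLITICS", "POLITICS", "ENTERTAINMENT"]
y_pred = ["TECH", "POLITICS", "POLITICS", "POLITICS", "POLITICS", "TECH"]

accuracy, precision, recall = binary_metrics(y_true, y_pred, positive="TECH")
print("accuracy:", accuracy, "| precision:", precision, "| recall:", round(recall, 2))
```

In this toy case only one of the three Tech headlines is found (low recall), even though half of all predictions are correct, which is the kind of gap that accuracy alone hides on an imbalanced dataset.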

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
## the multi_class argument was truncated in the source; "ovr" (one-vs-rest) is assumed here
auc = metrics.roc_auc_score(y_test, predicted_prob,
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues,
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes,
       yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,
              label='{0} (area={1:0.2f})'.format(classes[i],
                              metrics.auc(fpr, tpr))
               )
## the line style argument was truncated in the source; a dashed diagonal is assumed
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05],
          xlabel='False Positive Rate',
          ylabel="True Positive Rate (Recall)",
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall',
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

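The BoW model got 85% of the test set right (Accuracy is 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model assigns a certain category to a piece of news and assess the explainability of these predictions. The lime package can help us build an explainer. To give an illustration, I will take an observation from the test set and see what the model predicts and why.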

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))
## show explanation
explainer = lime_text.LimeTextExplainer(class_names=
             np.unique(y_train))
explained = explainer.explain_instance(txt_instance,
             model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

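That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "Stage" is more common among Entertainment news.

Word Embedding

Word Embedding is the collective name for feature learning techniques where words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution of each word appearing before or after another. To put it another way, words that share the same context usually appear together in the corpus, so they will be close in the vector space as well. For instance, let's take the 3 sentences from the previous example:

Words embedded in a 2D vector space

In this tutorial, I am going to use the first model of this family: Google's Word2Vec (2013). Other popular Word Embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, containing each unique word in the corpus, such that words sharing common contexts in the corpus are located close to one another in the space. That can be done using 2 different approaches: starting from a single word to predict its context (Skip-gram) or starting from the context to predict a word (Continuous Bag-of-Words).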
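The difference between the two approaches is easiest to see in the training examples they generate. The sketch below only builds those examples from a toy sentence (the helper name is hypothetical); it does not train any model:

```python
def context_windows(tokens, window=2):
    """Yield (center word, surrounding words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["new", "york", "city", "news"]

# Skip-gram: the center word is the input, each context word is a separate target
skipgram_pairs = [(center, ctx)
                  for center, context in context_windows(tokens, window=1)
                  for ctx in context]

# CBOW: all context words together are the input, the center word is the target
cbow_examples = [(context, center)
                 for center, context in context_windows(tokens, window=1)]

print(skipgram_pairs)
print(cbow_examples)
```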

In Python, you can load a pre-trained Word Embedding model from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

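Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training data corpus with gensim. Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this particular case, I will try to capture unigrams ("york"), bigrams ("new york"), and trigrams ("new york city").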

corpus = dtf_train["text_clean"]
## create list of lists of unigrams
lst_corpus = []
for string in corpus:
   lst_words = string.split()
   lst_grams = [" ".join(lst_words[i:i+1])
               for i in range(0, len(lst_words), 1)]
   lst_corpus.append(lst_grams)
## detect bigrams and trigrams
## (gensim 3.x API; gensim >= 4.0 expects a str delimiter, i.e. delimiter=" ")
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
                 delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

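When fitting the Word2Vec, you need to specify:

the target size of the word vectors, I will use 300;
the window, i.e. the maximum distance between the current and the predicted word within a sentence, I will use the mean length of the texts in the corpus;
the training algorithm, I will use skip-grams (sg=1) as in general it gives better results.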
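For the window, the mean text length can be computed directly from the tokenised corpus. A small sketch on a made-up stand-in corpus (the variable names are hypothetical, and the resulting number is not the one obtained on the real dataset):

```python
# Hypothetical stand-in for the tokenised training corpus (list of lists of words)
lst_corpus_sample = [["new", "york", "news"],
                     ["tech", "stocks", "fall", "again", "today"],
                     ["senate", "votes", "on", "bill"]]

# Mean number of tokens per text, rounded to get an integer window size
avg_len = sum(len(tokens) for tokens in lst_corpus_sample) / len(lst_corpus_sample)
window = round(avg_len)
print(avg_len, window)
```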

## fit w2v
## (gensim 3.x argument names; gensim >= 4.0 renames size to vector_size and iter to epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300,
            window=8, min_count=1, sg=1, iter=30)

We have our embedding model, so I can select any word from the corpus and transform it into a vector of 300 dimensions.

word = "data"
nlp[word].shape   ## with gensim >= 4.0 use nlp.wv[word]

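We can even use it to visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm (e.g. TSNE).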

word = "data"
fig = plt.figure()
## word embedding
tot_words = [word] + [tupla[0] for tupla in
                 nlp.most_similar(word, topn=20)]
X = nlp[tot_words]
## t-SNE to reduce dimensionality from 300 to 3
## (recent scikit-learn versions may require perplexity < number of points, 21 here)
pca = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = pca.fit_transform(X)
## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
## flag the first row (the input word) without chained assignment
dtf_.iloc[0, dtf_.columns.get_loc("input")] = 1
## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'],
           dtf_[dtf_["input"]==0]['y'],
           dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'],
           dtf_[dtf_["input"]==1]['y'],
           dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[],
       yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

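It's pretty cool and all, but how can word embedding be useful for predicting the news category? Well, the word vectors can be used in a neural network as weights. This is how:

First, transform the corpus into padded sequences of word ids to get a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N is located at the Nth row.
Finally, build a neural network with an embedding layer that weighs every word in the sequences with the corresponding vector.

Let's start with feature engineering by transforming the same preprocessed corpus (list of lists of n-grams) given to the Word2Vec into a list of sequences using tensorflow/keras: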

## tokenize text
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ',
                     oov_token="NaN",
                     filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index
## create sequence
lst_text2seq= tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq,
                    maxlen=15, padding="post", truncating="post")

The feature matrix X_train has a shape of 34,265 x 15 (number of sequences x sequence max length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

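Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, then the sequence is composed of 10 ids followed by 5 zeros, 0 being the padding element (while the id for a word not in the vocabulary is 1). Let's print how a text from the train set has been transformed into a padded sequence, together with the vocabulary: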

i = 0
## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)
## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))
## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0],
      " -- idx in vocabulary -->",
      dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])
print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")
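The id and padding convention described above (0 for padding, 1 for out-of-vocabulary words, truncation and padding on the right) can be sketched in plain Python. The vocabulary and helper below are hypothetical and only mimic what the Keras tokenizer and pad_sequences do with padding="post" and truncating="post":

```python
def texts_to_padded_ids(texts, vocabulary, maxlen, oov_id=1, pad_id=0):
    """Map words to ids, then truncate/pad on the right to a fixed length."""
    padded = []
    for text in texts:
        ids = [vocabulary.get(word, oov_id) for word in text.split()]
        ids = ids[:maxlen]                                   # truncate long texts
        padded.append(ids + [pad_id] * (maxlen - len(ids)))  # pad short texts
    return padded

# Hypothetical vocabulary: id 0 is reserved for padding, id 1 for unknown words
toy_vocabulary = {"i": 2, "like": 3, "this": 4, "article": 5}

sequences = texts_to_padded_ids(["i like this article", "i like zzdata"],
                                toy_vocabulary, maxlen=6)
print(sequences)
```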

Before moving on, don't forget to do the same feature engineering on the test set as well:

corpus = dtf_test["text_clean"]
## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15,
             padding="post", truncating="post")

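X_test (14,697 x 15)

We now have X_train and X_test, so the next step is to create the embedding matrix that will be used as the weight matrix in the neural network classifier.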

## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word,idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] =  nlp[word]
    ## if word not in model then skip and the row stays all 0s
    except KeyError:
        pass

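That code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which can be obtained from the vocabulary.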

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape,
      "|vector")

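It's finally time to build a deep learning model. I am going to use the embedding matrix in the first Embedding layer of the neural network, which I will then train to classify the news. Each id in the input sequence is used as the index to access the embedding matrix. The output of this Embedding layer is a 2D matrix with a word vector for each word id in the input sequence (sequence length x vector size). Let's use the sentence "I like this article" as an example: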
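The lookup itself is just row indexing. A minimal NumPy sketch with a made-up 5-word vocabulary and 4-dimensional vectors (the values are arbitrary stand-ins, not real Word2Vec vectors):

```python
import numpy as np

# Hypothetical embedding matrix: 5 rows (ids 0..4) x 4 dimensions, row 0 = padding
toy_embeddings = np.arange(20, dtype=float).reshape(5, 4)
toy_embeddings[0] = 0.0

# "I like this article" as word ids, padded with zeros to length 6
sequence = np.array([1, 2, 3, 4, 0, 0])

# Indexing with the id sequence selects one row per token
embedded = toy_embeddings[sequence]
print(embedded.shape)   # sequence length x vector size
print(embedded)
```

The Keras Embedding layer performs the same kind of lookup for every sequence in a batch, with the embedding matrix stored as its weights.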

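My neural network is structured as follows:

An Embedding layer that takes the sequences as input and the word vectors as weights, just as described before.
A simple Attention layer that won't affect the predictions but captures the weights of each instance, allowing us to build a nice explainer (it isn't necessary for the predictions, just for explainability, so you can skip it). The Attention mechanism was presented in this paper (2014) as a way for sequence models (e.g. LSTM) to understand which parts of a long text are actually relevant.
Two layers of Bidirectional LSTM to model the order of words in a sequence in both directions.
Two final dense layers that predict the probability of each news category.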
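One simple way to realise such an attention layer is to learn, for every embedding dimension, a softmax distribution over the time steps and multiply it element-wise with the input. The NumPy sketch below illustrates that idea with random stand-in weights (all names and sizes are hypothetical; it is not the trained Keras layer):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, dim = 5, 4
inputs = rng.normal(size=(seq_len, dim))   # one embedded sequence
W = rng.normal(size=(seq_len, seq_len))    # dense kernel (random stand-in for trained weights)
b = np.zeros(seq_len)

x = inputs.T                                # permute to (dim, seq_len)
scores = softmax(x @ W + b, axis=-1)        # dense layer + softmax over the time steps
attention = scores.T                        # permute back to (seq_len, dim)
weighted = inputs * attention               # element-wise re-weighting of the input

# Averaging over the embedding dimension gives one importance score per word position
word_importance = attention.mean(axis=1)
print(word_importance)
```

Because each column of the attention matrix sums to 1, the averaged scores can be read as the relative importance of each position in the sequence, which is what an explainer can later visualise.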

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))
## embedding
x = layers.Embedding(input_dim=embeddings.shape[0],
                     output_dim=embeddings.shape[1],
                     weights=[embeddings],
                     input_length=15, trainable=False)(x_in)
## apply attention
x = attention_layer(x, neurons=15)
## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2,
                         return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)
## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)
## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

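Now we can train the model and check its performance on a subset of the training set used for validation before testing it on the actual test set.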

## encode y
dic_y_mapping = {n:label for n,label in
                 enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])
## train
training = model.fit(x=X_train, y=y_train, batch_size=256,
                     epochs=10, shuffle=True, verbose=0,
                     validation_split=0.3)
## plot loss and accuracy
## (named lst_metrics so that the sklearn "metrics" module is not overwritten)
lst_metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)
ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()
ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

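Nice! In some epochs, the accuracy reached 0.89. To complete the evaluation of the Word Embedding model, let's predict the test set and compare the same metrics used before (the code for the metrics is the same as before).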

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in
             predicted_prob]

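The model performs about as well as the previous one. In fact, it also struggles to classify Tech news.

But is it explainable as well? Yes, it is! I put an Attention layer in the neural network to extract the weights of each word and understand how much they contributed to classifying an instance. So I will try to use the Attention weights to build an explainer (similar to the one seen in the previous section):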

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## encode lst_corpus (the single preprocessed instance), not the whole test corpus
X_instance = kprocessing.sequence.pad_sequences(
              tokenizer.texts_to_sequences(lst_corpus), maxlen=15,
              padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in
         layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx
           != 0]
dic_word_weigth = {word:weights[n] for n,word in
                   enumerate(lst_corpus[0]) if word in
                   tokenizer.word_index.keys()}

### 4. barplot
top = 10   ## number of words to plot; not defined in the source, value assumed
if len(dic_word_weigth) > 0:
   ## separate name so the dataset DataFrame dtf is not overwritten
   dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index',
                                columns=["score"])
   dtf_weights.sort_values(by="score",
           ascending=True).tail(top).plot(kind="barh",
           legend=False).grid(axis='x')
   plt.show()
else:
   print("--- No word recognized ---")

### 5. produce html visualization
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
         text.append('<b><span>' + word + '</span></b>')
    else:
         text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.display import display, HTML
display(HTML(text))

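Just like before, the words "clinton" and "gop" activated the neurons of the model, but this time "high" and "benghazi" were also considered slightly relevant for the prediction.

Language Models

Language Models, or Contextualized/Dynamic Word Embeddings, overcome the biggest limitation of the classic Word Embedding approach: polysemy disambiguation. In the classic approach, a word with different meanings (e.g. "bank" or "stick") is identified by just one vector. One of the first popular models was ELMO (2018), which doesn't apply a fixed embedding but, using a bidirectional LSTM, looks at the entire sentence and then assigns an embedding to each word.

Enter Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which demonstrated that sequence models (like LSTM) can be entirely replaced by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) combines ELMO-style context embeddings with several Transformers, and it is bidirectional (which was a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, so a word can have different vectors depending on the context. Let's try it with transformers, using the input "bank river":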

txt = "bank river"
## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')
## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

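If we change the input text to "bank money", we get this instead:

In order to complete a text classification task, you can use BERT in 3 different ways:

Train it all from scratch and use it as a classifier.
Extract the word embeddings and use them in an embedding layer (like I did with Word2Vec).
Fine-tune the pre-trained model (transfer learning).

I am going with the third option and doing transfer learning from a pre-trained lighter version of BERT, called Distil-BERT (66 million parameters instead of 110 million).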

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

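As usual, before fitting the model there is some feature engineering to do, but this time it is a little trickier. To illustrate what I am going to do, let's take our sentence "I like this article" as an example, which has to be transformed into 3 vectors (Ids, Mask, Segment):

Shape: 3 x sequence length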
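The three vectors can be sketched in plain Python with a made-up vocabulary. The token ids below are illustrative assumptions (a real tokenizer ships its own vocabulary), and the segment numbering follows the article's convention of incrementing after each [SEP]:

```python
# Hypothetical token-to-id mapping, for illustration only
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "i": 1, "like": 2, "this": 3, "article": 4}

def encode(text, vocab, maxlen):
    """Return (ids, mask, segment), each of length maxlen."""
    words = text.lower().split()[:maxlen - 2]          # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    n_pad = maxlen - len(tokens)
    mask = [1] * len(tokens) + [0] * n_pad             # 1 = real token, 0 = padding
    tokens = tokens + ["[PAD]"] * n_pad
    ids = [vocab[token] for token in tokens]
    segment, current = [], 0
    for token in tokens:
        segment.append(current)
        if token == "[SEP]":                           # a new segment starts after [SEP]
            current += 1
    return ids, mask, segment

ids, mask, segment = encode("I like this article", toy_vocab, maxlen=8)
print("ids:    ", ids)
print("mask:   ", mask)
print("segment:", segment)
```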

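First of all, we need to select the maximum sequence length. This time I am going to choose a much larger number (i.e. 50) because BERT splits unknown words into sub-tokens until it finds known pieces. For example, if a made-up word like "zzdata" is given, BERT would split it into ["z", "##z", "##data"]. Moreover, we have to insert special tokens into the input text and then generate the mask and segment vectors. Finally, we put everything together in a tensor to get the feature matrix, which has the shape of 3 (ids, masks, segments) x number of documents in the corpus x sequence length.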
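The sub-token splitting can be sketched as a greedy longest-match search over the vocabulary. This is a simplified illustration of WordPiece-style tokenization with a tiny made-up vocabulary (the function and vocabulary are hypothetical, not the transformers implementation):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                            # try a shorter substring
        if piece is None:
            return [unk]                        # no known piece fits
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"z", "##z", "data", "##data"}
print(wordpiece("zzdata", toy_vocab))
print(wordpiece("data", toy_vocab))
```

Since one unknown word can expand into several sub-tokens, the tokenized sequences are longer than the word count suggests, which is why a larger maximum length is chosen here.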

Please note that I am using the raw text as the corpus here (so far I have been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
maxqnans = int((maxlen-20)/2)   ## np.int was removed in NumPy 1.24
corpus_tokenized = ["[CLS] "+
             " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
             str(txt).lower().strip()))[:maxqnans])+
             " [SEP] " for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(
           txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
idx = [tokenizer.encode(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
             i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'),
           np.asarray(masks, dtype='int32'),
           np.asarray(segments, dtype='int32')]

The feature matrix X_train has a shape of 3 x 34,265 x 50. We can check one observation from the feature matrix:

i = 0
print("txt: ", dtf_train["text"].iloc[0])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

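You can take the same code and apply it to dtf_test["text"] to get X_test.

Now I am going to build the deep learning model with transfer learning from the pre-trained BERT. Basically, I am going to summarize the output of BERT into one vector with average pooling and then add two final dense layers to predict the probability of each news category.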
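Average pooling here simply means taking the mean over the sequence axis, so that each document is summarised by a single vector. A tiny NumPy sketch with a made-up tensor standing in for the BERT output (shapes and values are illustrative only):

```python
import numpy as np

# Stand-in for the BERT output: (batch size, sequence length, hidden size)
toy_bert_out = np.arange(24, dtype=float).reshape(2, 3, 4)

# Average over the sequence axis, which is what GlobalAveragePooling1D
# computes when no mask is applied
pooled = toy_bert_out.mean(axis=1)
print(pooled.shape)   # (batch size, hidden size)
print(pooled)
```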

If you want to use the original version of BERT, here is the code (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")
## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
## first output = token-level hidden states
bert_out = nlp([idx, masks, segments])[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

Here I use the lighter Distil-BERT instead of BERT:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2,
           attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()