Machine Learning: Applying Multinomial Naive Bayes to NLP Problems

October 7, 2019
Notes

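Naive Bayes classifiers are a family of probabilistic algorithms based on Bayes' theorem and the "naive" assumption of conditional independence between every pair of features. Bayes' theorem calculates the probability P(c|x), where c is the class of a possible outcome and x is the given instance to be classified, represented by certain features.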

P(c|x) = P(x|c) * P(c) / P(x)
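
As a quick numeric illustration of the formula, the sketch below plugs in made-up probabilities. The function name `posterior` and the three values are assumptions chosen only for this example; they are not taken from the review data.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Hypothetical numbers: P(x|c) = 0.5, P(c) = 0.4, P(x) = 0.25
p = posterior(likelihood=0.5, prior=0.4, evidence=0.25)
print(p)
```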

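Naive Bayes is widely used for natural language processing (NLP) problems. A Naive Bayes classifier predicts the label of a text: it calculates the probability of each label for the given text and then outputs the label with the highest probability.

How does the Naive Bayes algorithm work?

Let's consider an example: classifying reviews as positive or negative.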

TEXT	REVIEWS
"I liked the movie"	positive
"It's a good movie. Nice story"	positive
"Nice songs. But sadly boring ending."	negative
"Hero's acting is bad but heroine looks good. Overall nice movie"	positive
"Sad, boring movie"	negative

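We want to decide whether the text "overall liked the movie" is a positive or a negative review. We have to calculate two quantities. P(positive | overall liked the movie) is the probability that the label of a sentence is positive, given that the sentence is "overall liked the movie". P(negative | overall liked the movie) is the probability that the label is negative, given the same sentence.

Before that, we first apply stop-word removal and stemming to the text.

Removing stop words: these are common words that add little real information, for example "able", "even", "other", and so on.

Stemming: reducing each word to its root form.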
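
A minimal sketch of the stop-word step, using only the standard library. The `STOP_WORDS` set and the helper `remove_stop_words` are assumptions made for illustration; a real pipeline would use a much longer list (such as NLTK's English stop-word list) and would then pass the remaining tokens through a stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch;
# real lists are much longer).
STOP_WORDS = {"i", "a", "the", "is", "it", "s", "but"}

def remove_stop_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("I liked the movie"))  # ['liked', 'movie']
```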

Now, after applying these two techniques, our text becomes:

TEXT	REVIEWS
"i liked the movi"	positive
"its a good movie nice stori"	positive
"nice songs but sadly boring end"	negative
"heros acting is bad but heroine looks good overall nice movi"	positive
"sad boring movi"	negative

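Feature engineering: the important part is to find features in the data that make the machine learning algorithm work. In this case we have text, and we need to convert it into numbers that we can compute with. We use word frequencies: each document is treated as the collection of words it contains, and our features are the counts of each word.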
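
The word-count features can be sketched in a few lines of plain Python. The two documents, the `vocab` list, and the helper `to_counts` are illustrative assumptions; scikit-learn's `CountVectorizer` does the same job at scale.

```python
from collections import Counter

docs = ["i liked the movie", "sad boring movie"]

# Vocabulary: every distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return the word-count vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_counts(doc) for doc in docs]
print(vocab)    # ['boring', 'i', 'liked', 'movie', 'sad', 'the']
print(vectors)  # [[0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0]]
```

Each document becomes a vector with one entry per vocabulary word, which is the numeric form the classifier works on.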

In our case, Bayes' theorem gives P(positive | overall liked the movie) as:

P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)

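Since our classifier only has to find out which label has the larger probability, we can discard the divisor, which is the same for both labels, and simply compare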

P(overall liked the movie | positive)* P(positive) with P(overall liked the movie | negative) * P(negative)

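But there is a problem: "overall liked the movie" does not appear in our training dataset, so the probability is zero. This is where the "naive" assumption comes in: we assume that every word in a sentence is independent of the others. This means we now look at individual words.

We can write this as: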

P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)

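The next step is to apply the same idea to the conditional probabilities in Bayes' theorem:

P(overall liked the movie | positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)

Now, these individual words do appear several times in our training data, so we can calculate these probabilities!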
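
In general notation, the naive assumption and the resulting decision rule can be written as follows. The symbols $w_1, \dots, w_n$ (the words of the sentence) and $\hat{c}$ (the predicted label) are introduced here for illustration and do not appear in the original text.

```latex
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```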

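Calculating probabilities:

First, we calculate the prior probability of each label: for a given sentence in our training data, the probability that it is positive, P(positive), is 3/5. Then P(negative) is 2/5.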
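
The priors are just label frequencies, which a short sketch can confirm. The list `labels` reproduces the five labels of the training table; the variable names are assumptions for this example.

```python
from collections import Counter
from fractions import Fraction

# Labels of the five training reviews, in table order
labels = ["positive", "positive", "negative", "positive", "negative"]

counts = Counter(labels)
priors = {label: Fraction(n, len(labels)) for label, n in counts.items()}
print(priors["positive"], priors["negative"])  # 3/5 2/5
```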

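Then, calculating P(overall | positive) means counting how many times the word "overall" appears in positive texts (1) and dividing by the total number of words in positive texts (17). Therefore P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, and P(movie | positive) = 3/17.

If a probability comes out as zero, we use Laplace smoothing: we add 1 to every count so that it is never zero. To balance this, we add the number of possible words to the divisor, so the result is never greater than 1. In our case, the total number of possible words is 21.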
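
A minimal sketch of add-one smoothing, using the word totals quoted in this example (17 words in positive reviews, 7 in negative reviews, 21 possible words). The function name `laplace` is an assumption for illustration.

```python
from fractions import Fraction

def laplace(count, total_words, vocab_size):
    """Add-one smoothed estimate: (count + 1) / (total_words + vocab_size)."""
    return Fraction(count + 1, total_words + vocab_size)

# "overall" appears once in positive reviews: (1 + 1) / (17 + 21)
print(laplace(1, 17, 21))  # 1/19
# A word that never appears in negative reviews: (0 + 1) / (7 + 21)
print(laplace(0, 7, 21))   # 1/28
```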

Applying smoothing, the results are:

WORD	P(WORD | POSITIVE)	P(WORD | NEGATIVE)
overall	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
liked	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
the	(2 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
movie	(3 + 1)/(17 + 21)	(1 + 1)/(7 + 21)

Now we multiply all the probabilities and see which is larger:

P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^{-5} = 0.0000138

P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^{-5} = 0.0000013

Our classifier assigns the positive label to "overall liked the movie".
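
The two products can be checked with exact fractions. This sketch takes the smoothed values directly from the worked example; the variable names are assumptions, and `math.prod` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import prod

# Smoothed probabilities, (count + 1) / (total + 21), in the order
# overall, liked, the, movie
pos = [Fraction(2, 38), Fraction(2, 38), Fraction(3, 38), Fraction(4, 38)]
neg = [Fraction(1, 28), Fraction(1, 28), Fraction(1, 28), Fraction(2, 28)]

score_pos = prod(pos) * Fraction(3, 5)  # times P(positive)
score_neg = prod(neg) * Fraction(2, 5)  # times P(negative)

label = "positive" if score_pos > score_neg else "negative"
print(float(score_pos), float(score_neg), label)
```

The positive score comes out roughly ten times larger than the negative one, which is why the positive label wins.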

Below is the implementation:

# Import packages (NLTK is used here)
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

dataset = [["I liked the movie", "positive"],
           ["It's a good movie. Nice story", "positive"],
           ["Hero's acting is bad but heroine looks good. Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["sad movie, boring movie", "negative"]]
dataset = pd.DataFrame(dataset)
dataset.columns = ["Text", "Reviews"]

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
for i in range(0, 5):
    # Replace non-letters with spaces so that the words stay separated
    text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
    text = text.lower()
    text = text.split()
    # Remove stop words and stem the remaining words
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    corpus.append(text)

# Create the bag-of-words model
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Split the data into a training set and a test set
# (train_test_split lives in sklearn.model_selection in current scikit-learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 0)

# Train a Gaussian Naive Bayes classifier on the training data
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
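
The listing above trains scikit-learn's GaussianNB, whereas the hand calculation earlier follows the multinomial model. As a rough, from-scratch sketch of that multinomial calculation (not the author's code), the class below counts words per label and applies add-one smoothing. The names `tokenize` and `TinyMultinomialNB` are hypothetical. It counts words directly from the raw reviews, so its intermediate numbers need not match the hand-computed ones, and it sums log probabilities instead of multiplying, which leaves the winning label unchanged while avoiding very small products.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_score(self, text, label):
        """log P(label) plus the sum of log P(word | label) over the text."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        """Return the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))

train = [("I liked the movie", "positive"),
         ("It's a good movie. Nice story", "positive"),
         ("Nice songs. But sadly boring ending.", "negative"),
         ("Hero's acting is bad but heroine looks good. Overall nice movie", "positive"),
         ("Sad, boring movie", "negative")]
texts, labels = zip(*train)
model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("overall liked the movie"))  # positive
```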
