A Deep-Learning Text-Classification Case Study: Sentiment Classification Using LSTM
In this notebook, we will train a model on a movie-review dataset, using an LSTM architecture, to predict the sentiment of each review. First, what is an LSTM?
LSTM, short for Long Short-Term Memory, is a sequence neural-network architecture whose structure lets it retain memory of earlier parts of a sequence. The first sequence model to be introduced was the RNN, but researchers soon found that RNNs retain little memory of earlier steps, which loses context in long text sequences.
LSTMs were introduced to preserve that context. Inside an LSTM cell, special structures called gates, together with the cell state, are updated and maintained so that the LSTM keeps its memory. To learn how these structures work, read this blog.
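To make the gates and cell state concrete, here is a minimal NumPy sketch of a single LSTM step (illustrative, not from the notebook; it follows the standard LSTM formulation, and the weight names and shapes are assumptions):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
# One LSTM step: the gates decide what to forget, what to write, and what to expose.
# W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias
def lstm_step(x, h_prev, c_prev, W, U, b):
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate: how much new information to write
    f = sigmoid(z[H:2*H])      # forget gate: how much old memory to keep
    g = np.tanh(z[2*H:3*H])    # candidate values for the cell state
    o = sigmoid(z[3*H:4*H])    # output gate: how much memory to expose
    c = f * c_prev + i * g     # cell state: the long-term memory
    h = o * np.tanh(c)         # hidden state: this step's output
    return h, c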
On the code side, we use TensorFlow and Keras to build and train the model. To better understand the code and concepts in this project, we used the following references.
References:
(1) Medium article on Keras LSTM
(2) Keras Embedding layer documentation
(3) Keras example of text classification from scratch
(4) Bidirectional LSTM model example
(5) Kaggle notebook for text preprocessing
Notebook:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv
/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip
/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip
train_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip', sep='\t')
# Note: this also loads train.tsv.zip, so the "validation" data used later is
# the training set itself (the Kaggle test.tsv.zip has no Sentiment labels)
test_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip', sep='\t')
train_data.head()
| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage … | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage … | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
train_data = train_data.drop(['PhraseId','SentenceId'],axis = 1)
test_data = test_data.drop(['PhraseId','SentenceId'],axis = 1)
import keras
from keras.models import Sequential
from keras.layers import Dense # fully-connected layer
from keras.layers import LSTM
from keras.layers import Activation
from keras.layers import Embedding
from keras.layers import Bidirectional
max_features = 20000 # only consider the top 20,000 words (defined here but not used below; vocab_size is used instead)
maxlen = 200
train_data.head()
| | Phrase | Sentiment |
|---|---|---|
| 0 | A series of escapades demonstrating the adage … | 1 |
| 1 | A series of escapades demonstrating the adage … | 2 |
| 2 | A series | 2 |
| 3 | A | 2 |
| 4 | series | 2 |
from nltk.corpus import stopwords
import re
# Define the text-cleaning function
def text_cleaning(text):
    # Stopwords are words that carry little meaning for understanding the text,
    # such as "the", "an", "his", "their"
    # (requires nltk's stopwords corpus, e.g. nltk.download('stopwords') on first use)
    forbidden_words = set(stopwords.words('english'))
    if text:
        text = ' '.join(text.split('.'))
        text = re.sub(r'\/', ' ', text)
        text = re.sub(r'\\', ' ', text)
        text = re.sub(r'((http)\S+)', '', text)
        text = re.sub(r'\s+', ' ', re.sub('[^A-Za-z]', ' ', text.strip().lower())).strip()
        text = re.sub(r'\W+', ' ', text.strip().lower()).strip()
        return [word for word in text.split() if word not in forbidden_words]
    return []
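As a quick illustration of what the cleaner does (the example sentence is made up, not from the notebook):
# The cleaner lowercases, strips non-letters, and drops stopwords
print(text_cleaning('This movie is GREAT!'))
# -> ['movie', 'great']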
# Convert each sentence into a list of words
train_data['flag'] = 'TRAIN'
test_data['flag'] = 'TEST'
total_docs = pd.concat([train_data,test_data],axis = 0,ignore_index = True)
total_docs['Phrase'] = total_docs['Phrase'].apply(lambda x: ' '.join(text_cleaning(x)))
phrases = total_docs['Phrase'].tolist()
from keras.preprocessing.text import one_hot
vocab_size = 50000
encoded_phrases = [one_hot(d, vocab_size) for d in phrases]
total_docs['Phrase'] = encoded_phrases
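Keras' one_hot is a hashing trick: each word is hashed to an integer id in [1, vocab_size), so repeated words always map to the same id, while two distinct words can occasionally collide. A small illustration (the ids shown are placeholders; the actual hashes will differ):
# Repeated words share an id; distinct words can collide under hashing
print(one_hot('great movie great plot', vocab_size))
# e.g. [17023, 4211, 17023, 9876] -- note the repeated id for 'great'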
train_data = total_docs[total_docs['flag'] == 'TRAIN']
test_data = total_docs[total_docs['flag'] == 'TEST']
x_train = train_data['Phrase']
y_train = train_data['Sentiment']
x_val = test_data['Phrase']
y_val = test_data['Sentiment']
x_train.head()
y_train.unique()
array([1, 2, 3, 4, 0])
Usage of tf.keras.preprocessing.sequence.pad_sequences(): https://blog.csdn.net/qq_45465526/article/details/109400926
# Pad each sequence so that they all share the same fixed length
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
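To see what the padding does (a standalone demo, not part of the notebook's pipeline): by default, pad_sequences zero-pads on the left and truncates from the left so every row has length maxlen:
# Shorter sequences are left-padded with zeros; longer ones are truncated
demo = keras.preprocessing.sequence.pad_sequences([[1, 2, 3], [4, 5]], maxlen=4)
print(demo)
# [[0 1 2 3]
#  [0 0 4 5]]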
model = Sequential()
inputs = keras.Input(shape=(None,), dtype="int32")
model.add(inputs)
# Embed each integer id in a 128-dimensional vector
model.add(Embedding(50000, 128))
# Add 2 bidirectional LSTM layers
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
# Add a classifier head (softmax would be the conventional activation for
# 5 mutually exclusive classes; the notebook uses sigmoid)
model.add(Dense(5, activation="sigmoid"))
model.summary()
result:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 128) 6400000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128) 98816
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128) 98816
_________________________________________________________________
dense (Dense) (None, 5) 645
=================================================================
Total params: 6,598,277
Trainable params: 6,598,277
Non-trainable params: 0
_________________________________________________________________
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=30, validation_data=(x_val, y_val))
result:
Epoch 1/30
4877/4877 [==============================] - 562s 115ms/step - loss: 0.9593 - accuracy: 0.6107 - val_loss: 0.7819 - val_accuracy: 0.6798
Epoch 2/30
4877/4877 [==============================] - 520s 107ms/step - loss: 0.7942 - accuracy: 0.6729 - val_loss: 0.7094 - val_accuracy: 0.7114
.....................................................................
Epoch 29/30
4877/4877 [==============================] - 539s 111ms/step - loss: 0.3510 - accuracy: 0.8117 - val_loss: 0.3220 - val_accuracy: 0.8242
Epoch 30/30
4877/4877 [==============================] - 553s 113ms/step - loss: 0.3485 - accuracy: 0.8124 - val_loss: 0.3187 - val_accuracy: 0.8238
<tensorflow.python.keras.callbacks.History at 0x7fa9b82520d0>
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))
result:
Epoch 1/5
4877/4877 [==============================] - 535s 110ms/step - loss: 0.3477 - accuracy: 0.8128 - val_loss: 0.3193 - val_accuracy: 0.8240
Epoch 2/5
4877/4877 [==============================] - 543s 111ms/step - loss: 0.3457 - accuracy: 0.8134 - val_loss: 0.3173 - val_accuracy: 0.8250
Epoch 3/5
4877/4877 [==============================] - 542s 111ms/step - loss: 0.3428 - accuracy: 0.8140 - val_loss: 0.3158 - val_accuracy: 0.8254
Epoch 4/5
4877/4877 [==============================] - 541s 111ms/step - loss: 0.3429 - accuracy: 0.8144 - val_loss: 0.3165 - val_accuracy: 0.8257
Epoch 5/5
4877/4877 [==============================] - 557s 114ms/step - loss: 0.3395 - accuracy: 0.8150 - val_loss: 0.3136 - val_accuracy: 0.8259
<tensorflow.python.keras.callbacks.History at 0x7fa8e0763150>
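As an end-to-end check, a new review can be scored by reusing the same pipeline (a hedged sketch; the sample sentence is invented):
# Clean, hash, pad, and predict a single new review
sample = 'an astonishing and moving film'
ids = one_hot(' '.join(text_cleaning(sample)), vocab_size)
padded = keras.preprocessing.sequence.pad_sequences([ids], maxlen=maxlen)
print(model.predict(padded).argmax(axis=-1))  # class 0 (negative) .. 4 (positive)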
In summary, we built a bidirectional LSTM model and trained it to detect sentiment, reaching about 81% training accuracy and 82% validation accuracy. Note that because the validation data was loaded from the same training file, these validation numbers do not measure performance on unseen reviews.
Notebook code: https://www.kaggle.com/code/ranxi169/sentiment-classification-using-lstm/notebook
Original author: 孤飛 (cnblogs)
Personal blog: https://blog.onefly.top