練習推導一個最簡單的BP神經網路訓練過程【個人作業/數學推導】
寫在前面
各式資料中關於BP神經網路的講解已經足夠全面詳盡,故不在此過多贅述。本文重點在於由一個「最簡單」的神經網路練習推導其訓練過程,和大家一起在練習中一起更好理解神經網路訓練過程。
一、BP神經網路
1.1 簡介
BP網路(Back-Propagation Network) 是1986年被提出的,是一種按誤差逆向傳播演算法訓練的
多層前饋網路,是目前應用最廣泛的神經網路模型之一,用於函數逼近、模型識別分類、數據壓縮和時間序列預測等。
一個典型的BP網路應該包括三層:輸入層、隱藏層和輸出層。各層之間全連接,同層之間無連接。隱藏層可以有很多層。
1.2 訓練(學習)過程
每一次迭代(Interation)意味著使用一批(Batch)數據對模型進行一次更新過程,被稱為「一次訓練」,包含一個正向過程和一個反向過程。
具體過程可以概括為如下過程:
- 準備樣本資訊(數據&標籤)、定義神經網路(結構、初始化參數、選取激活函數等)
- 將樣本輸入,正向計算各節點函數輸出
- 計算損失函數
- 求損失函數對各權重的偏導數,採用適當方法進行反向過程優化
- 重複2~4直至達到停止條件
以下訓練將使用均值平方差(Mean Squared Error, MSE)作為損失函數,sigmoid函數作為激活函數、梯度下降法作為優化權重方法進行推導
二、實例推導練習作業
2.1 準備工作
- 第一層是輸入層,包含兩個神經元: i1, i2 和偏置b1
- 第二層是隱藏層,包含兩個神經元: h1, h2 和偏置項b2
- 第三層是輸出: o1, o2
- 每條線上標的 wi 是層與層之間連接的權重
- 激活函數是 sigmod 函數
- 我們用 z 表示某神經元的加權輸入和,用 a 表示某神經元的輸出
2.2 第一次正向過程【個人推導】
根據上述資訊,我們可以得到另一種表達一次迭代的「環形」過程的圖示如下:
我們做一次正向過程(由於需多次迭代,因此我們將第一次正向過程標記為t=0),得各項數值如下:
由此我們可得損失函數值為MSE=0.298371109,假設這超出了我們對損失值的要求,那麼我們就需要對各個權重(wi,t=0)進行更新, 作為t=1的初始權重。
2.3 推導計算∂/∂wi【個人推導】
2.3.1 均值平方差損失函數的全微分推導
\(dMSE=\frac{\partial MSE}{\partial ao_{1}}dao_{1}+\frac{\partial MSE}{\partial ao_{2}}dao_{2}\)
\({\color{white}{dMSE}}=\frac{\partial MSE}{\partial ao_{1}}\frac{\partial ao_{1}}{\partial zo_{1}} dzo_{1}
+\frac{\partial MSE}{\partial ao_{2}}\frac{\partial ao_{2}}{\partial zo_{2}} dzo_{2}\)
\({\color{white}{dMSE}}=\frac{\partial MSE}{\partial ao_{1}}\frac{\partial ao_{1}}{\partial zo_{1}}\left ( \frac{\partial zo_{1}}{\partial ah_{1}}dah_{1}
+\frac{\partial zo_{1}}{\partial ah_{2}}dah_{2}
+
\frac{\partial zo_{1}}{\partial w\omega _{5}}d\omega _{5}
+\frac{\partial zo_{1}}{\partial w\omega _{6}}d\omega _{6}
\right )\)
\({\color{white}{dMSE=}}+\frac{\partial MSE}{\partial ao_{2}}\frac{\partial ao_{2}}{\partial zo_{2}}\left ( \frac{\partial zo_{2}}{\partial ah_{1}}dah_{1}
+\frac{\partial zo_{2}}{\partial ah_{2}}dah_{2}
+
\frac{\partial zo_{2}}{\partial w\omega _{7}}d\omega _{7}
+\frac{\partial zo_{2}}{\partial w\omega _{8}}d\omega _{8}
\right )\)
\({\color{white}{dMSE}}=\frac{\partial MSE}{\partial ao_{1}}\frac{\partial ao_{1}}{\partial zo_{1}}\left ( \frac{\partial zo_{1}}{\partial ah_{1}}\frac{\partial ah_{1}}{\partial zh_{1}}dzh_{1}
+\frac{\partial zo_{1}}{\partial ah_{2}}\frac{\partial ah_{2}}{\partial zh_{2}}dzh_{2}
+
\frac{\partial zo_{1}}{\partial w\omega _{5}}d\omega _{5}
+\frac{\partial zo_{1}}{\partial w\omega _{6}}d\omega _{6}
\right )\)
\({\color{white}{dMSE=}}+\frac{\partial MSE}{\partial ao_{2}}\frac{\partial ao_{2}}{\partial zo_{2}}\left ( \frac{\partial zo_{2}}{\partial ah_{1}}\frac{\partial ah_{1}}{\partial zh_{1}}dzh_{1}
+\frac{\partial zo_{2}}{\partial ah_{2}}\frac{\partial ah_{2}}{\partial zh_{2}}dzh_{2}
+
\frac{\partial zo_{2}}{\partial w\omega _{7}}d\omega _{7}
+\frac{\partial zo_{2}}{\partial w\omega _{8}}d\omega _{8}
\right )\)
\({\color{white}{dMSE}}=\frac{\partial MSE}{\partial ao_{1}}\frac{\partial ao_{1}}{\partial zo_{1}}\left [ \frac{\partial zo_{1}}{\partial ah_{1}}\frac{\partial ah_{1}}{\partial zh_{1}}\left ( \frac{\partial zh_{1}}{\partial \omega _{1}}d\omega _{1}+\frac{\partial zh_{1}}{\partial \omega _{2}}d\omega _{2} \right )
+\frac{\partial zo_{1}}{\partial ah_{2}}\frac{\partial ah_{2}}{\partial zh_{2}}\left ( \frac{\partial zh_{2}}{\partial \omega _{3}}d\omega _{3}+\frac{\partial zh_{2}}{\partial \omega _{4}}d\omega _{4} \right )
+
\frac{\partial zo_{1}}{\partial w\omega _{5}}d\omega _{5}
+\frac{\partial zo_{1}}{\partial w\omega _{6}}d\omega _{6}
\right ]\)
\({\color{white}{dMSE=}}+\frac{\partial MSE}{\partial ao_{2}}\frac{\partial ao_{2}}{\partial zo_{2}}\left [ \frac{\partial zo_{2}}{\partial ah_{1}}\frac{\partial ah_{1}}{\partial zh_{1}}\left ( \frac{\partial zh_{1}}{\partial \omega _{1}}d\omega _{1}+\frac{\partial zh_{1}}{\partial \omega _{2}}d\omega _{2} \right )
+\frac{\partial zo_{2}}{\partial ah_{2}}\frac{\partial ah_{2}}{\partial zh_{2}}\left ( \frac{\partial zh_{2}}{\partial \omega _{3}}d\omega _{3}+\frac{\partial zh_{2}}{\partial \omega _{4}}d\omega _{4} \right )
+
\frac{\partial zo_{2}}{\partial w\omega _{7}}d\omega _{7}
+\frac{\partial zo_{2}}{\partial w\omega _{8}}d\omega _{8}
\right ]\)
2.3.2 這一次代入訓練實例的數值和各數量名
\(dMSE=\frac{\partial \frac{1}{2}\left ( y_{1}-ao_{1} \right )^2}{\partial ao_{1}}dao_{1}+\frac{\partial \frac{1}{2}\left ( y_{2}-ao_{2} \right )^2}{\partial ao_{2}}dao_{2}\)
\({\color{white}{dMSE}}=-\left( y_{1}-ao_{1} \right )\frac{\partial ao_{1}}{\partial zo_{1}}dzo_{1}-\left( y_{2}-ao_{2} \right )\frac{\partial ao_{2}}{\partial zo_{2}}dzo_{2}\)
\({\color{white}{dMSE}}=-\left( y_{1}-ao_{1} \right )ao_{1}\left( 1-ao_{1} \right )\left ( \frac{\partial zo_{1}}{\partial ah_{1}}dah_{1}+\frac{\partial zo_{1}}{\partial ah_{2}}dah_{2}+\frac{\partial zo_{1}}{\partial \omega _{5}}d\omega _{5}+\frac{\partial zo_{1}}{\partial \omega _{6}}d\omega _{6} \right )\)
\({\color{white}{dMSE=}}-\left( y_{2}-ao_{2} \right )ao_{2}\left( 1-ao_{2} \right )\left ( \frac{\partial zo_{2}}{\partial ah_{1}}dah_{1}+\frac{\partial zo_{2}}{\partial ah_{2}}dah_{2}+\frac{\partial zo_{2}}{\partial \omega _{7}}d\omega _{7}+\frac{\partial zo_{2}}{\partial \omega _{8}}d\omega _{8} \right )\)
\({\color{white}{dMSE}}=-\left( y_{1}-ao_{1} \right )ao_{1}\left( 1-ao_{1} \right )\left ( \omega_{5}\frac{\partial ah_{1}}{\partial zh_{1}}dzh_{1}+\omega_{6}\frac{\partial ah_{2}}{\partial zh_{2}}dzh_{2}+ah_{1}d\omega _{5}+ah_{2}d\omega _{6} \right )\)
\({\color{white}{dMSE=}}-\left( y_{2}-ao_{2} \right )ao_{2}\left( 1-ao_{2} \right )\left ( \omega_{7}\frac{\partial ah_{1}}{\partial zh_{1}}dzh_{1}+\omega_{8}\frac{\partial ah_{2}}{\partial zh_{2}}dzh_{2}+ah_{1}d\omega _{7}+ah_{2}d\omega _{8} \right )\)
\({\color{white}{dMSE}}=-\left ( y_{1}-ao_{1} \right )ao_{1}\left( 1-ao_{1} \right )\left [ \omega_{5}\cdot ah_{1}\left ( 1- ah_{1} \right )\left (\frac{\partial zh_{1}}{\partial \omega_{1}}\omega_{1}+\frac{\partial zh_{1}}{\partial \omega_{2}}\omega_{2} \right )
+\omega_{6}\cdot ah_{2}\left ( 1- ah_{2} \right )\left (\frac{\partial zh_{2}}{\partial \omega_{3}}\omega_{3}+\frac{\partial zh_{2}}{\partial \omega_{4}}\omega_{4} \right )+ah_{1}d\omega _{5}+ah_{2}d\omega _{6} \right ]\)
\({\color{white}{dMSE=}}-\left ( y_{2}-ao_{2} \right )ao_{2}\left( 1-ao_{2} \right )\left [ \omega_{7}\cdot ah_{1}\left ( 1- ah_{1} \right )\left (\frac{\partial zh_{1}}{\partial \omega_{1}}\omega_{1}+\frac{\partial zh_{1}}{\partial \omega_{2}}\omega_{2} \right )
+\omega_{8}\cdot ah_{2}\left ( 1- ah_{2} \right )\left (\frac{\partial zh_{2}}{\partial \omega_{3}}\omega_{3}+\frac{\partial zh_{2}}{\partial \omega_{4}}\omega_{4} \right )+ah_{1}d\omega _{7}+ah_{2}d\omega _{8} \right ]\)
\({\color{white}{dMSE}}=-\left ( y_{1}-ao_{1} \right )ao_{1}\left( 1-ao_{1} \right )\left[ \omega_{5} \cdot ah_{1}\left ( 1-ah_{1} \right )\left ( {\color{green}{i_{1}d\omega_{1}}}+{\color{green}{i_{2}d\omega_{2}}} \right )
+\omega_{6} \cdot ah_{2}\left ( 1-ah_{2} \right )\left ( {\color{green}{i_{1}d\omega_{3}}}+{\color{green}{i_{2}d\omega_{4}}} \right )
+{\color{blue}{ah_{1}d\omega_{5}}}+{\color{blue}{ah_{2}d\omega_{6}}}\right ]\)
\({\color{white}{dMSE=}}-\left ( y_{2}-ao_{2} \right )ao_{2}\left( 1-ao_{2} \right )\left[ \omega_{7} \cdot ah_{1}\left ( 1-ah_{1} \right )\left ( {\color{green}{i_{1}d\omega_{1}}}+{\color{green}{i_{2}d\omega_{2}}} \right )
+\omega_{8} \cdot ah_{2}\left ( 1-ah_{2} \right )\left ( {\color{green}{i_{1}d\omega_{3}}}+{\color{green}{i_{2}d\omega_{4}}} \right )
+{\color{blue}{ah_{1}d\omega_{7}}}+{\color{blue}{ah_{2}d\omega_{8}}}\right ]\)
2.3.3 由此我們得到∂/∂wi的表達式
\(\frac {\partial MSE}{\partial \omega_{1}}=-\left [ \left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot \omega_{5}\cdot ah_{1}\left ( 1-ah_{1} \right )+\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot \omega_{7}\cdot ah_{1}\left ( 1-ah_{1} \right ) \right ]\cdot i_{1}\)
\(\frac {\partial MSE}{\partial \omega_{2}}=-\left [ \left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot \omega_{5}\cdot ah_{1}\left ( 1-ah_{1} \right )+\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot \omega_{7}\cdot ah_{1}\left ( 1-ah_{1} \right ) \right ]\cdot i_{2}\)
\(\frac {\partial MSE}{\partial \omega_{3}}=-\left [ \left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot \omega_{6}\cdot ah_{2}\left ( 1-ah_{2} \right )+\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot \omega_{8}\cdot ah_{2}\left ( 1-ah_{2} \right ) \right ]\cdot i_{1}\)
\(\frac {\partial MSE}{\partial \omega_{4}}=-\left [ \left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot \omega_{6}\cdot ah_{2}\left ( 1-ah_{2} \right )+\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot \omega_{8}\cdot ah_{2}\left ( 1-ah_{2} \right ) \right ]\cdot i_{2}\)
\(\frac {\partial MSE}{\partial \omega_{5}}=-\left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot ah_{1}\)
\(\frac {\partial MSE}{\partial \omega_{6}}=-\left( y_{1}-ao_{1} \right )ao_{1}\left ( 1-ao_{1} \right )\cdot ah_{2}\)
\(\frac {\partial MSE}{\partial \omega_{7}}=-\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot ah_{1}\)
\(\frac {\partial MSE}{\partial \omega_{8}}=-\left( y_{2}-ao_{2} \right )ao_{2}\left ( 1-ao_{2} \right )\cdot ah_{2}\)
當然如果你喜歡用矩陣表示也可以:
(P.S. Markdown編輯器承受不住如此「巨大」的矩陣算式而崩潰,我只好轉成svg圖片貼上了,見諒~)
2.4 根據∂/∂wi梯度下降法優化wi【個人推導】
根據梯度下降法\({\color{purple}{\omega_{i,t+1}=\omega_{i,t}+\alpha_{t}\left [ -\triangledown Loss(\omega_{i,t}) \right ]}}\),設置學習率α=0.5,計算出wi,t+1,然後重新進行下一次正向過程。(可以將該過程在Excel中輕易實現,下表中為迭代數據截取)
可以看到,經過10001次迭代之後MSE(t=10001)=3.51019E-05已經足夠小,可以停止迭代完成1代訓練。
………………………………………………………………………………………………………………………………………………
sigmoid函數求導Tips(for於初級選手)
\(\because {\color{blue}{\frac{1}{1+e^{-x}}}}=\frac{1}{1+\frac{1}{e^{x}}}=\frac{e^{x}}{e^{x}+1}={\color{blue}{1-\frac{1}{1+e^{x}}}}\)
\(\therefore \frac{d\left( {\color{red}{\frac{1}{1+e^{-x}}}} \right )}{dx}=
d\left( 1-\frac{1}{e^{x}+1} \right )\frac{1}{dx}\)
\({\color{white}{\therefore \frac{d\left( \frac{1}{1+e^{-x}} \right )}{dx}}}=
(-1)\times (-1)(e^{x}+1)^{-2}d(e^{x}+1)\frac{1}{dx}\)
\({\color{white}{\therefore \frac{d\left( \frac{1}{1+e^{-x}} \right )}{dx}}}=
\frac{e^{x}}{(e^{x}+1)^2}\frac{1}{dx}=-\left[ \frac{1}{e^x+1}-\frac{1}{(e^x+1)^2} \right ]\frac{1}{dx}\)
\({\color{white}{\therefore \frac{d\left( \frac{1}{1+e^{-x}} \right )}{dx}}}=
\frac{1}{1+e^{x}}\left [ 1-\frac{1}{e^x+1} \right ]\frac{1}{dx}\)
\({\color{white}{\therefore \frac{d\left( \frac{1}{1+e^{-x}} \right )}{dx}}}=
\left ( {\color{red}{1-\frac{1}{1+e^{-x}}}} \right )\left [ {\color{red}{\frac{1}{1+e^x}}} \right ]\frac{1}{dx}\)