DL Fundamentals Catch-Up Plan (3): Model Selection, Underfitting, and Overfitting

July 18, 2021
Notes
DL

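PS: This post reflects only my own understanding.

If it conflicts with your principles or views, please bear with me and keep the criticism civil.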

Preliminary Notes

This post is a backup of the main copy on my CSDN blog. (BlogID=107)

Environment

Windows 10
VSCode
Python 3.8.10
PyTorch 1.8.1
CUDA 10.2

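Preface

In the previous posts we worked with two regression models and met some common deep learning concepts. One interesting finding from "DL Fundamentals Catch-Up Plan (2): Softmax Regression and an Example (PyTorch, Cross-Entropy Loss)" was that softmax regression with a single linear layer already reached decent accuracy on its dataset. "Decent" here is relative to random guessing: about 10% for guessing versus roughly 80% for our model. That at least shows the classifier is effective. Later we will replace the linear layer with other layers, such as CNNs and RNNs. Those layers are topics for later posts and are not covered in detail here.

Suppose we have already designed many models with different hidden layers. An important question is then which of them to actually put to use. This post introduces some ways to approach that model-selection question.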

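A Brief Introduction to Some Basic Concepts

The basic concepts are:

Training error: the error between the model's predictions and the true values, measured on the training set after the parameters have been updated.
Generalization error: the expected error between predictions and true values under the true data distribution. The underlying data is effectively unlimited and we can only collect a finite sample, so the generalization error cannot be computed exactly and has to be estimated. The usual approach is to hold out a test set and use the error on it as the estimate.
Underfitting: the model's fitting capacity is too low. The training error and the generalization error are close to each other, but both are large. In general, the model has largely failed to learn the patterns and features we need it to learn.
Overfitting: the training error is small but the generalization error is large. In general, the model has learned the training set too closely, as if it had memorized every pattern and quirk in it, so it generalizes poorly.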
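
The difference between training error and a held-out estimate of generalization error can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the original post: the toy data `y = 2x + 1 + noise`, the helper names `make_data` and `mse`, and the two candidate models (a constant predictor and a least-squares line) are all made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def make_data(num_examples):
    """Hypothetical toy data: y = 2x + 1 plus Gaussian noise (std 0.1)."""
    x = rng.normal(0.0, 1.0, num_examples)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, num_examples)
    return x, y


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))


x_train, y_train = make_data(100)
x_test, y_test = make_data(100)  # held-out sample from the same distribution

# Too-simple model: always predict the training mean (ignores x entirely).
mean_pred = y_train.mean()
const_train_err = mse(mean_pred, y_train)
const_test_err = mse(mean_pred, y_test)

# Adequate model: a least-squares line fitted on the training set only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
line_train_err = mse(slope * x_train + intercept, y_train)
line_test_err = mse(slope * x_test + intercept, y_test)

print("constant: train", const_train_err, "test", const_test_err)
print("line:     train", line_train_err, "test", line_test_err)
```

The constant predictor shows the underfitting pattern described above: its two errors are similar but both large. The line has small errors on both sets. The test-set numbers are only estimates of the generalization error, since they come from a finite sample.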

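Underfitting is usually addressed by switching to a different network or by making the network deeper and wider. It is easy to spot, and the direction of the fix is fairly clear.

In practice we run into overfitting more often. As the field develops, models get deeper and wider while the data we can collect stays limited, so a model may simply "memorize" the training set, which leaves its generalization ability in doubt. Later posts will introduce several methods for mitigating this problem.

Below we work through an example to get a feel for normal fitting, underfitting, and overfitting.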

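An Example of Normal Fitting, Overfitting, and Underfitting

Here we use PyTorch's high-level API to build a polynomial regression example.

First, the following code generates features and labels according to \(y = w_1 \frac{x^3}{3!} + w_2 \frac{x^2}{2!} + w_3 \frac{x}{1!} + b + \epsilon,\ \epsilon \sim \mathcal{N}(0, 0.1^2)\).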

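def synthetic_data(w, num_examples): #@save
    """Generate y = w[0]*X^3/3! + w[1]*X^2/2! + w[2]*X/1! + w[3] + noise."""

    X = np.random.normal(0, 1, (num_examples, 1))

    # np.math is an alias of the standard math module in older NumPy releases;
    # newer releases drop it, so use math.factorial there instead.
    y = np.dot(X**3/np.math.factorial(3), w[0]) + np.dot(X**2/np.math.factorial(2), w[1]) + np.dot(X/np.math.factorial(1), w[2]) + w[3]
    
    # Gaussian noise with standard deviation 0.1
    y += np.random.normal(0, 0.1, y.shape)
    
    return X, y.reshape((-1, 1))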
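
To make the factorial scaling concrete, here is a small worked evaluation of the noise-free part of the label for a single input. It is only a sketch: the helper name `noiseless_label` is made up for illustration, and the parameter values are the ones set in the `__main__` block of the complete script in this post.

```python
import math

# True parameters taken from the __main__ block of the complete script.
TRUE_W1, TRUE_W2, TRUE_W3, TRUE_B = 1.65, -2.46, 3.54, 0.78


def noiseless_label(x):
    """Evaluate w1*x^3/3! + w2*x^2/2! + w3*x/1! + b, without the noise term."""
    return (TRUE_W1 * x**3 / math.factorial(3)
            + TRUE_W2 * x**2 / math.factorial(2)
            + TRUE_W3 * x / math.factorial(1)
            + TRUE_B)


print(noiseless_label(1.0))
print(noiseless_label(2.0))
```

For x = 1 the terms are 1.65/6 - 2.46/2 + 3.54 + 0.78 = 3.365, and for x = 2 they are 2.2 - 4.92 + 7.08 + 0.78 = 5.14. The generated labels scatter around these values with standard deviation 0.1.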

Next, a custom PyTorch layer takes a parameter N and computes the result of an N-term polynomial.

class TestLayer(nn.Module):
    def __init__(self, n, **kwargs):
        super(TestLayer, self).__init__(**kwargs)
        self.n = n
        self.w_array = nn.Parameter(torch.tensor( np.random.normal(0, 0.1, (1, n))).reshape(-1, 1))
        self.b = nn.Parameter(torch.tensor(np.random.normal(0, 0.1, 1)))

    def cal(self, X, n):
        # Use -1 so the layer does not depend on the global batch_size
        X = X.reshape(-1, 1, 1)
        Y = self.b
        for i in range(n):
            # w_array[i] multiplies the term X^(i+1) / (i+1)!
            Y  = Y + torch.matmul(X**(i + 1)/torch.tensor(np.math.factorial(i + 1)), self.w_array[i])
        return Y

    def forward(self, x):
        return self.cal(x, self.n)


class TestNet(nn.Module):
    def __init__(self, n):
        super(TestNet, self).__init__()
        self.test_net = nn.Sequential(
            TestLayer(n)
        )   

    def forward(self, x):
        return self.test_net(x)
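
What the layer computes can also be written as a feature matrix times a weight vector. The NumPy sketch below is an illustration only, not the code used in this post: the names `poly_features` and `predict` are made up, and no training is involved.

```python
import math

import numpy as np


def poly_features(x, n):
    """Return the (batch, n) matrix whose k-th column is x**k / k!, k = 1..n."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, 1)
    ks = np.arange(1, n + 1)
    factorials = np.array([math.factorial(int(k)) for k in ks], dtype=np.float64)
    return x ** ks / factorials  # broadcasts (batch, 1) against (n,)


def predict(x, w, b):
    """Polynomial prediction: features @ w + b, with w ordered by increasing power."""
    w = np.asarray(w, dtype=np.float64).reshape(-1, 1)
    return poly_features(x, len(w)) @ w + b


# Weights listed in the layer's order: coefficient of x first, then x^2/2!, then x^3/3!.
print(predict([1.0, 2.0], [3.54, -2.46, 1.65], 0.78))
```

One detail worth noticing: the layer indexes its weights by increasing power, while `synthetic_data` takes the cubic coefficient first. A perfectly trained layer with three terms would therefore hold the true weights in reverse order.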

The complete code is as follows:

import torch
from torch import nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils import data
from matplotlib.pyplot import MultipleLocator

fig, ax = plt.subplots()
xdata0, ydata0 = [], []
xdata1, ydata1 = [], []
line0, = ax.plot([], [], 'r-', label='TrainError')
line1, = ax.plot([], [], 'b-', label='TestError')


def init_and_show():
    ax.set_xlabel('epoch')
    ax.set_ylabel('loss')
    ax.set_title('Train/Test Loss')
    ax.set_xlim(0, epochs)
    ax.set_ylim(0.05, 100)
    ax.set_yscale('log')
    # y_locator = MultipleLocator(0.1)
    # ax.yaxis.set_major_locator(y_locator)
    ax.legend([line0, line1], ('TrainError', 'TestError'))
    
    # ax.legend([line1], ('TestError', ))
    line0.set_data(xdata0, ydata0)
    line1.set_data(xdata1, ydata1)

    plt.show()



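def synthetic_data(w, num_examples): #@save
    """Generate y = w[0]*X^3/3! + w[1]*X^2/2! + w[2]*X/1! + w[3] + noise."""

    X = np.random.normal(0, 1, (num_examples, 1))

    # np.math is an alias of the standard math module in older NumPy releases;
    # newer releases drop it, so use math.factorial there instead.
    y = np.dot(X**3/np.math.factorial(3), w[0]) + np.dot(X**2/np.math.factorial(2), w[1]) + np.dot(X/np.math.factorial(1), w[2]) + w[3]
    
    # Gaussian noise with standard deviation 0.1
    y += np.random.normal(0, 0.1, y.shape)
    
    return X, y.reshape((-1, 1))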



class TestLayer(nn.Module):
    def __init__(self, n, **kwargs):
        super(TestLayer, self).__init__(**kwargs)
        self.n = n
        self.w_array = nn.Parameter(torch.tensor( np.random.normal(0, 0.1, (1, n))).reshape(-1, 1))
        self.b = nn.Parameter(torch.tensor(np.random.normal(0, 0.1, 1)))

    def cal(self, X, n):
        # Use -1 so the layer does not depend on the global batch_size
        X = X.reshape(-1, 1, 1)
        Y = self.b
        for i in range(n):
            # w_array[i] multiplies the term X^(i+1) / (i+1)!
            Y  = Y + torch.matmul(X**(i + 1)/torch.tensor(np.math.factorial(i + 1)), self.w_array[i])
        return Y

    def forward(self, x):
        return self.cal(x, self.n)


class TestNet(nn.Module):
    def __init__(self, n):
        super(TestNet, self).__init__()
        self.test_net = nn.Sequential(
            TestLayer(n)
        )   

    def forward(self, x):
        return self.test_net(x)

# copy from d2l/torch.py
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

# def data_loader(batch_size, features, labels):
#     num_examples = len(features)
#     indices = list(range(num_examples))
#     np.random.shuffle(indices) # samples are read in random order

#     for i in range(0, num_examples, batch_size):
#         j = np.array(indices[i: min(i + batch_size, num_examples)])
#         yield torch.tensor(features.take(j, 0)), torch.tensor(labels.take(j)) # take returns the elements at the given indices

def train(dataloader, model, loss_fn, optimizer):
    size = train_examples
    num_batches = train_examples / batch_size
    train_loss_sum = 0
    for batch, (X, y) in enumerate(dataloader):
        # move X, y to gpu
        if torch.cuda.is_available():
            X = X.to('cuda')
            y = y.to('cuda')
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss_sum += loss.item()
        
        if batch % 5 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
    
    print(f"Train Error: \n Avg loss: {train_loss_sum/num_batches:>8f} \n")
    return train_loss_sum/num_batches


def test(dataloader, model, loss_fn):
    num_batches = test_examples / batch_size
    test_loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            # move X, y to gpu
            if torch.cuda.is_available():
                X = X.to('cuda')
                y = y.to('cuda')
            pred = model(X)
            test_loss += loss_fn(pred, y).item()

    test_loss /= num_batches
    print(f"Test Error: \n Avg loss: {test_loss:>8f} \n")

    return test_loss
    
if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('Using {} device'.format(device))
    
    true_w1 = [1.65]

    true_w2 = [-2.46]

    true_w3 = [3.54]

    true_b = 0.78    

    test_examples = 100
    train_examples = 100
    
    num_examples = test_examples + train_examples

    f1, labels = synthetic_data([true_w1, true_w2, true_w3, true_b], num_examples)
    print(f1.shape)
    print(labels.shape)

    num_weight = 3

    # Despite the variable name, this is mean squared error (L2), not L1 loss
    l1_loss_fn = torch.nn.MSELoss()
    
    learning_rate = 0.01

    model = TestNet(num_weight)

    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

    model = model.to(device)
    print(model)

    epochs = 1500

    model.train()


    batch_size = 10

    train_data = (torch.tensor(f1[:train_examples,]), torch.tensor(labels[:train_examples,]))
    test_data = (torch.tensor(f1[train_examples:,]), torch.tensor(labels[train_examples:,]))

    train_dataloader = load_array(train_data ,batch_size, True)
    # Evaluation does not need shuffling
    test_dataloader = load_array(test_data ,batch_size, False)
    # verify dataloader
    # for x,y in train_dataloader:
    #     print(x.shape)
    #     print(y.shape)
    #     print(torch.matmul(x**3, torch.tensor(true_w1, dtype=torch.double)) + torch.matmul(x**2, torch.tensor(true_w2, dtype=torch.double)) + torch.matmul(x, torch.tensor(true_w3, dtype=torch.double)) + true_b)
    #     print(y)
    #     break

    model.train()
    for t in range(epochs):
        print(f"Epoch {t+1}\n-------------------------------")
        train_l = train(train_dataloader, model, l1_loss_fn, optimizer)
        test_l = test(test_dataloader, model, l1_loss_fn)
        # Note: the recorded losses are scaled by 10 before plotting
        ydata0.append(train_l*10)
        ydata1.append(test_l*10)
        xdata0.append(t)
        xdata1.append(t)
    print("Done!")

    init_and_show()

    param_iter = model.parameters()
    print('W = ')
    print(next(param_iter)[: num_weight, :])
    print('b = ')
    print(next(param_iter))

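Note that this final code first generates 100 training samples and 100 test samples. num_weight controls how many polynomial terms take part in training; in other words, it controls how many parameters are fitted. The three cases below show how TrainErr and TestErr relate to the number of epochs and the number of fitted parameters for different values of num_weight.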

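Normal Fitting (num_weight = 3)

With num_weight = 3, running the training script clearly shows that the fitted parameters are almost identical to the true ones. Note that the printed W is ordered by increasing power (the coefficients of x, x^2/2!, x^3/3!), so it should be compared against true_w3, true_w2, true_w1 in that order. We can also see that TrainErr and TestErr converge quickly toward a value near 0 and stay close to each other.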

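Underfitting (num_weight = 1)

With num_weight = 1, running the training script clearly shows the loss curve flattening out after a certain point: it stops decreasing and plateaus at a high value instead of converging toward 0.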

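Overfitting (num_weight = 20)

With num_weight = 20, we would expect the model to overfit.

A typical overfitting run: note that the first three entries of w and b in the final output differ from the true w and b.

Across my repeated experiments, besides the genuine overfitting case above, there were also runs in which overfitting did not appear, as in the figure below. Again, compare the first three entries of w and b in the final output with the true w and b.

Looking at the output, terms 4 through 20 of w are close to 0, while the first three entries of w and b are fairly close to the true values. Our guess is that overfitting does not appear because terms 4 through 20 carry very little weight in the overall expression. In effect, they can be treated as having weight 0.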
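
One possible contributing factor, offered as an interpretation rather than something measured in this post, is the 1/k! scaling of the inputs. The k-th weight only ever sees the feature x^k/k!, which becomes very small for large k at typical values of x, so both its contribution to the output and its gradient are small. The sketch below just prints those feature magnitudes; the helper name `term_scale` and the sample values of x and k are chosen for illustration.

```python
import math


def term_scale(x, k):
    """Magnitude of the k-th input feature, |x|**k / k!."""
    return abs(x) ** k / math.factorial(k)


# Inputs are drawn from N(0, 1), so |x| = 1 is typical and |x| = 3 is rare.
for x in (1.0, 3.0):
    scales = [term_scale(x, k) for k in (1, 3, 4, 10, 20)]
    print(x, ["%.3g" % s for s in scales])
```

At x = 1 the fourth feature is already 1/24, and at x = 3 the twentieth feature is on the order of 1e-9. The middle-order features can still be sizeable for rare large inputs (27/6 = 4.5 for k = 3 at x = 3), which may be part of why some runs fluctuate more than others.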

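Note that the overfitting example has to be run several times before overfitting shows up, and the result fluctuates from run to run. The reason is that the parameter initialization is random, which makes convergence difficult. The underfitting and normal-fitting examples, by contrast, give stable results no matter how many times you run them.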

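Afterword

Starting from the perspective of model selection, we found three phenomena that can occur during training: underfitting, normal fitting, and overfitting. A model in the normal-fitting state is the one we want.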
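
One common way to make this choice concrete is to fit each candidate and compare errors on held-out data. The sketch below does that for the three values of num_weight used in this post. It rests on several assumptions that differ from the training script: closed-form least squares (`np.linalg.lstsq`) stands in for SGD, the data generator is re-implemented in plain NumPy with a fixed seed, and the helper names `design_matrix` and `fit_and_score` are made up for illustration.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
TRUE_W = [1.65, -2.46, 3.54]  # coefficients of x^3/3!, x^2/2!, x/1! from the post
TRUE_B = 0.78


def synthetic_data(num_examples):
    """Same generating formula as the post, written for 1-D arrays."""
    x = rng.normal(0.0, 1.0, num_examples)
    y = TRUE_W[0] * x**3 / 6 + TRUE_W[1] * x**2 / 2 + TRUE_W[2] * x + TRUE_B
    return x, y + rng.normal(0.0, 0.1, num_examples)


def design_matrix(x, n):
    """Columns: 1 (bias), then x**k / k! for k = 1..n."""
    cols = [np.ones_like(x)] + [x**k / math.factorial(k) for k in range(1, n + 1)]
    return np.stack(cols, axis=1)


def fit_and_score(n, x_tr, y_tr, x_va, y_va):
    """Fit n polynomial terms by least squares; return (train MSE, validation MSE)."""
    coef, *_ = np.linalg.lstsq(design_matrix(x_tr, n), y_tr, rcond=None)
    train_mse = np.mean((design_matrix(x_tr, n) @ coef - y_tr) ** 2)
    val_mse = np.mean((design_matrix(x_va, n) @ coef - y_va) ** 2)
    return float(train_mse), float(val_mse)


x_tr, y_tr = synthetic_data(100)
x_va, y_va = synthetic_data(100)  # held-out data, never used for fitting

scores = {n: fit_and_score(n, x_tr, y_tr, x_va, y_va) for n in (1, 3, 20)}
best_n = min(scores, key=lambda n: scores[n][1])  # lowest validation error wins

for n, (tr, va) in scores.items():
    print(f"num_weight={n:2d}  train MSE={tr:.4f}  validation MSE={va:.4f}")
print("selected num_weight:", best_n)
```

The selection is driven by the validation error, not the training error, because adding terms can only help the fit on the data it was trained on. How large the gap is for 20 terms depends on the random sample, which matches the run-to-run variation described above.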

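Underfitting means too few parameters take part in training. In other words, the model is too simple to represent the features we want to learn, so the loss cannot be driven down to a low value.

Overfitting is far less simple and clear-cut than what we saw here. In this post we only saw one major cause of large swings during training: too many parameters, which is when overfitting appears. Later models have parameters by the hundreds or thousands, and we cannot try them one by one, so in later posts we will learn some techniques for suppressing overfitting.

This also raises a question: we know that model complexity (the number of parameters) can cause overfitting on a particular dataset, so besides controlling model complexity, what other options are there?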

References

//github.com/d2l-ai/d2l-zh/releases (V1.0.0)
//github.com/d2l-ai/d2l-zh/releases (V2.0.0 alpha1)


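PS: Please respect original work. If it is not to your taste, please keep the criticism civil.

PS: If you have questions, leave a comment and I will reply as soon as I see it.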

Tags: DL