Optimizing the ridge regression parameter

  • November 13, 2019
  • Notes

Once you start using ridge regression to make predictions or learn about relationships in the system you're modeling, you'll start thinking about the choice of alpha. For example, OLS regression might show a relationship between two variables; however, when regularized by some alpha, that relationship may no longer be significant. That difference can determine whether a decision gets made.

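To make this concrete, here's a minimal sketch (not part of the recipe) comparing the coefficients that a plain OLS fit and a heavily regularized ridge fit produce on the same toy data; the variable names and the alpha of 10 are purely illustrative:

>>> from sklearn.linear_model import LinearRegression, Ridge
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_samples=100, n_features=2, effective_rank=1, noise=10)
>>> # OLS coefficients, no shrinkage
>>> LinearRegression().fit(X, y).coef_
>>> # the same data with a large alpha; the coefficients shrink toward zero
>>> Ridge(alpha=10).fit(X, y).coef_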

Getting ready

This is the first recipe where we'll tune the parameters for a model, which is typically done by cross-validation. Later recipes lay out a more general way to do this, but here we'll walk through just enough to tune ridge regression.

If you remember, in ridge regression the gamma parameter is typically represented as alpha in scikit-learn when calling Ridge; so, the question that arises is what the best alpha is. Create a regression dataset, and then let's get started:

>>> import numpy as np

>>> from sklearn.datasets import make_regression

>>> reg_data, reg_target = make_regression(n_samples=100, n_features=2, effective_rank=1, noise=10)

How to do it…

In the linear_model module, there is an object called RidgeCV, which stands for ridge cross-validation. This performs a cross-validation similar to leave-one-out cross-validation (LOOCV). Under the hood, it's going to train the model on all samples except one, and then evaluate the error in predicting this one test case:

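Before reaching for RidgeCV itself, here is a rough sketch of that under-the-hood procedure spelled out with a plain Ridge and LeaveOneOut. The loo_error helper is our own illustration, not part of scikit-learn, and RidgeCV actually uses a more efficient formulation, so treat this only as an approximation:

>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import LeaveOneOut
>>> def loo_error(alpha, X, y):
...     # fit on every sample except one, then score the held-out sample
...     errors = []
...     for train_idx, test_idx in LeaveOneOut().split(X):
...         model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
...         errors.append((model.predict(X[test_idx]) - y[test_idx]) ** 2)
...     return np.mean(errors)
>>> # the "best" alpha is the one with the smallest mean held-out error
>>> min([.1, .2, .3, .4], key=lambda a: loo_error(a, reg_data, reg_target))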

>>> from sklearn.linear_model import RidgeCV

>>> rcv = RidgeCV(alphas=np.array([.1, .2, .3, .4]))

>>> rcv.fit(reg_data, reg_target)

RidgeCV(alphas=array([ 0.1, 0.2, 0.3, 0.4]), cv=None, fit_intercept=True, gcv_mode=None, loss_func=None, normalize=False, score_func=None, scoring=None, store_cv_values=False)

After we fit the regression, the alpha_ attribute will be the best alpha choice:

>>> rcv.alpha_

0.10000000000000001

In the previous example, it was the first choice. We might want to home in on something around .1:

>>> rcv2 = RidgeCV(alphas=np.array([.08, .09, .1, .11, .12]))

>>> rcv2.fit(reg_data, reg_target)

RidgeCV(alphas=array([ 0.08, 0.09, 0.1 , 0.11, 0.12]), cv=None, fit_intercept=True, gcv_mode=None, loss_func=None, normalize=False, score_func=None, scoring=None, store_cv_values=False)

>>> rcv2.alpha_

0.08

We can continue this hunt, but hopefully, the mechanics are clear.

How it works…

The mechanics might be clear, but we should talk a little more about the why and define what was meant by "best". At each step in the cross-validation process, the model scores an error against the test sample. By default, it's essentially a squared error. Check out the There's more… section for more details.

We can force the RidgeCV object to store the cross-validation values; this will let us visualize what it's doing:

>>> alphas_to_test = np.linspace(0.01, 1)

>>> rcv3 = RidgeCV(alphas=alphas_to_test, store_cv_values=True)

>>> rcv3.fit(reg_data, reg_target)

As you can see, we test a bunch of points (50 in total) between 0.01 and 1. Since we passed store_cv_values as True, we can access these values:

>>> rcv3.cv_values_.shape

(100, 50)

So, we had 100 samples in the initial regression and tested 50 different alpha values, which gives us access to the errors for each of the 50 candidates. We can now find the smallest mean error and choose the corresponding alpha:

>>> smallest_idx = rcv3.cv_values_.mean(axis=0).argmin()

>>> alphas_to_test[smallest_idx]

The question that arises is "Does RidgeCV agree with our choice?" Use the following command to find out:

>>> rcv3.alpha_

0.01

Beautiful!

It's also worthwhile to visualize what's going on. In order to do that, we'll plot the mean for all 50 test alphas.

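The recipe doesn't show the plotting code, but a minimal matplotlib sketch along these lines (reusing rcv3 and alphas_to_test from above) is enough to see how the mean error changes across the candidate alphas:

>>> import matplotlib.pyplot as plt
>>> # average the per-sample cross-validation errors for each candidate alpha
>>> mean_errors = rcv3.cv_values_.mean(axis=0)
>>> plt.plot(alphas_to_test, mean_errors)
>>> plt.xlabel("alpha")
>>> plt.ylabel("mean CV error")
>>> plt.show()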

There's more…

If we want to use our own scoring function, we can do that as well. Since we looked at the mean absolute deviation (MAD) before, let's use it to score the differences. First, we need to define our loss function:

>>> def MAD(target, predictions):
...     absolute_deviation = np.abs(target - predictions)
...     return absolute_deviation.mean()

After we define the loss function, we can employ the make_scorer function in sklearn. This will take care of standardizing our function so that scikit's objects know how to use it. Also, because this is a loss function and not a score function, lower is better, so we need sklearn to flip the sign, turning this minimization problem into the maximization problem that scorers expect:

>>> import sklearn

>>> MAD = sklearn.metrics.make_scorer(MAD, greater_is_better=False)

>>> rcv4 = RidgeCV(alphas=alphas_to_test, store_cv_values=True, scoring=MAD)

>>> rcv4.fit(reg_data, reg_target)

>>> smallest_idx = rcv4.cv_values_.mean(axis=0).argmin()

>>> alphas_to_test[smallest_idx]

0.2322