Machine Learning: Understanding Logistic Regression

October 5, 2019
Notes

Background

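This article covers the basics of logistic regression and its implementation in Python. Logistic regression is a supervised classification algorithm. In a classification problem, the target variable (or output) y can take only discrete values for a given set of features (or inputs) X.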

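Contrary to popular belief, logistic regression is a regression model. It builds a regression model to predict the probability that a given data entry belongs to the category numbered "1". Just as linear regression assumes that the data follows a linear function, logistic regression models the data using the sigmoid function.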
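As a minimal sketch of the sigmoid function mentioned above, the snippet below maps a few scores to probabilities. The scores in z are made-up values chosen only for illustration; in a fitted model they would come from a linear combination of the features.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores; large negative values map close to 0,
# large positive values map close to 1, and 0 maps to exactly 0.5.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)
print(np.round(probabilities, 3))
```

The output is read as the estimated probability of class "1"; no class label is assigned until a threshold is applied.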

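Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. Setting the threshold is a very important aspect of logistic regression and depends on the classification problem itself.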

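The choice of threshold is driven mainly by precision and recall. Ideally we want both precision and recall to be 1, but this is seldom the case. In a precision-recall tradeoff, we use the following considerations to decide on the threshold: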

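1. Low precision/high recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision value that gives low precision or high recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as unaffected, even at the cost of some patients being wrongly diagnosed with cancer. This is because the absence of cancer can be established by further medical tests, whereas the presence of the disease cannot be detected in a candidate who has already been rejected.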

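2. High precision/low recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision value that gives high precision or low recall. For example, if we are classifying customers by whether they will react positively or negatively to a personalised advertisement, we want to be absolutely sure that a customer will react positively, because a negative reaction can cause a loss of potential sales from that customer.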
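The tradeoff described in the two cases above can be sketched by computing precision and recall at several thresholds. The labels and predicted probabilities below are made up for illustration, and precision_recall is a hypothetical helper, not a library function.

```python
import numpy as np

def precision_recall(y_true, prob, threshold):
    """Return (precision, recall) when predicting 1 for prob >= threshold."""
    y_pred = (prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up ground truth and model outputs, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
prob = np.array([0.1, 0.3, 0.45, 0.4, 0.6, 0.7, 0.8, 0.9])

for t in (0.35, 0.5, 0.75):
    p, r = precision_recall(y_true, prob, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold removes false negatives at the cost of more false positives, and raising it does the opposite.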

Based on the number of categories, logistic regression can be classified as:

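Binomial: the target variable can have only two possible types, "0" or "1", which may represent "win" vs "loss", "pass" vs "fail", "dead" vs "alive", and so on.

Multinomial: the target variable can have three or more possible types that are not ordered (that is, the types have no quantitative significance), such as "disease A" vs "disease B" vs "disease C".

Ordinal: the target variable has ordered categories. For example, a test score can be categorized as "very poor", "poor", "good" or "very good". Here each category can be given a score such as 0, 1, 2, 3.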

First, we explore the simplest form of logistic regression, binomial logistic regression.

Binomial Logistic Regression

Consider an example dataset that maps the number of hours of study to the result of an exam. The result can take only two values, pass (1) or fail (0):

HOURS (X)    PASS (Y)
0.50         0
0.75         0
1.00         0
1.25         0
1.50         0
1.75         0
2.00         1
2.25         0
2.50         1
2.75         0
3.00         1
3.25         0
3.50         1
3.75         0
4.00         1
4.25         1
4.50         1
4.75         1
5.00         1
5.50         1

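So we have:

\[
y = \begin{cases} 1, & \text{if pass} \\ 0, & \text{if fail} \end{cases}
\]

That is, y is a categorical target variable that can take only two possible values, "0" or "1".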

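To generalize our model, we assume that:

The dataset has 'p' feature variables and 'n' observations.
The feature matrix X has one row per observation. In the code below it gains a leading column of ones for the intercept, so it has p + 1 columns, and the coefficients are collected in a vector beta of the same length.

The coefficients are estimated by gradient descent. Its step size ('lr' in the code below) is called the learning rate and needs to be set explicitly.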
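The model, cost and update rule behind the learning rate can be sketched as follows. This is the standard formulation, reconstructed from the Python implementation (logistic_func, cost_func, log_gradient and grad_desc) rather than quoted from a source. The symbol \alpha for the learning rate and the notation h_\beta are assumptions made here. The cost is written as a sum over observations; dividing it by n would only rescale the learning rate.

```latex
% Hypothesis: probability that observation i belongs to class 1,
% with x_i = (1, x_{i1}, \dots, x_{ip})^T and \beta = (\beta_0, \dots, \beta_p)^T
h_\beta(x_i) = \frac{1}{1 + e^{-\beta^{T} x_i}}

% Cost: negative log-likelihood of the n observations
J(\beta) = -\sum_{i=1}^{n} \Big[ y_i \log h_\beta(x_i)
           + (1 - y_i) \log\big(1 - h_\beta(x_i)\big) \Big]

% Gradient of the cost, in matrix form
\nabla J(\beta) = \sum_{i=1}^{n} \big( h_\beta(x_i) - y_i \big)\, x_i
                = X^{T} \big( h_\beta(X) - y \big)

% Gradient descent update, repeated until the cost stops decreasing
\beta \leftarrow \beta - \alpha \, \nabla J(\beta)
```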

Let us look at a Python implementation of the above technique on a sample dataset:

import csv
import numpy as np
import matplotlib.pyplot as plt


def loadCSV(filename):
    ''' function to load dataset '''
    with open(filename, "r") as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for i in range(len(dataset)):
            dataset[i] = [float(x) for x in dataset[i]]
    return np.array(dataset)


def normalize(X):
    ''' function to normalize feature matrix, X '''
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    rng = maxs - mins
    norm_X = 1 - ((maxs - X) / rng)
    return norm_X


def logistic_func(beta, X):
    ''' logistic(sigmoid) function '''
    return 1.0 / (1 + np.exp(-np.dot(X, beta.T)))


def log_gradient(beta, X, y):
    ''' logistic gradient function '''
    first_calc = logistic_func(beta, X) - y.reshape(X.shape[0], -1)
    final_calc = np.dot(first_calc.T, X)
    return final_calc


def cost_func(beta, X, y):
    ''' cost function, J '''
    log_func_v = logistic_func(beta, X)
    y = np.squeeze(y)
    step1 = y * np.log(log_func_v)
    step2 = (1 - y) * np.log(1 - log_func_v)
    final = -step1 - step2
    return np.mean(final)


def grad_desc(X, y, beta, lr=.01, converge_change=.001):
    ''' gradient descent function '''
    cost = cost_func(beta, X, y)
    change_cost = 1
    num_iter = 1

    while change_cost > converge_change:
        old_cost = cost
        beta = beta - (lr * log_gradient(beta, X, y))
        cost = cost_func(beta, X, y)
        change_cost = old_cost - cost
        num_iter += 1

    return beta, num_iter


def pred_values(beta, X):
    ''' function to predict labels '''
    pred_prob = logistic_func(beta, X)
    pred_value = np.where(pred_prob >= .5, 1, 0)
    return np.squeeze(pred_value)


def plot_reg(X, y, beta):
    ''' function to plot decision boundary '''
    # labelled observations
    x_0 = X[np.where(y == 0.0)]
    x_1 = X[np.where(y == 1.0)]

    # plotting points with diff color for diff label
    plt.scatter([x_0[:, 1]], [x_0[:, 2]], c='b', label='y = 0')
    plt.scatter([x_1[:, 1]], [x_1[:, 2]], c='r', label='y = 1')

    # plotting decision boundary
    x1 = np.arange(0, 1, 0.1)
    x2 = -(beta[0, 0] + beta[0, 1] * x1) / beta[0, 2]
    plt.plot(x1, x2, c='k', label='reg line')

    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.legend()
    plt.show()


if __name__ == "__main__":
    # load the dataset
    dataset = loadCSV('dataset1.csv')

    # normalizing feature matrix
    X = normalize(dataset[:, :-1])

    # stacking columns with all ones in feature matrix
    X = np.hstack((np.matrix(np.ones(X.shape[0])).T, X))

    # response vector
    y = dataset[:, -1]

    # initial beta values
    beta = np.matrix(np.zeros(X.shape[1]))

    # beta values after running gradient descent
    beta, num_iter = grad_desc(X, y, beta)

    # estimated beta values and number of iterations
    print("Estimated regression coefficients:", beta)
    print("No. of iterations:", num_iter)

    # predicted labels
    y_pred = pred_values(beta, X)

    # number of correctly predicted labels
    print("Correctly predicted labels:", np.sum(y == y_pred))

    # plotting regression line
    plot_reg(X, y, beta)

Estimated regression coefficients: [[  1.70474504  15.04062212 -20.47216021]]
No. of iterations: 2612
Correctly predicted labels: 100

Note: gradient descent is only one of the many ways to estimate the coefficients (beta in the code above).

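The alternatives are, basically, more advanced algorithms that can easily be run in Python once you have defined your cost function and its gradient. They include: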

BFGS (Broyden-Fletcher-Goldfarb-Shanno algorithm)
L-BFGS (like BFGS, but uses limited memory)
Conjugate gradient

Advantages and disadvantages of using any of these algorithms over gradient descent:

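Advantages

No need to pick a learning rate
Often run faster (not always the case)
Can numerically approximate the gradient for you (which does not always work out well)

Disadvantages

More complex
More of a black box unless you learn the specifics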
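As a minimal sketch of the first advantage, the snippet below fits the hours-of-study dataset from earlier with BFGS through scipy.optimize.minimize, so no learning rate has to be chosen. The use of SciPy, the function names cost and gradient, and the numerically stable form of the cost are assumptions made for this sketch, not part of the implementation above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

# Hours-of-study dataset from the binomial example.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# Feature matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(hours), hours])

def cost(beta):
    """Negative log-likelihood, written as log(1 + e^z) - y*z for stability."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - passed * z)

def gradient(beta):
    """Gradient of the cost: X^T (sigmoid(X beta) - y)."""
    return X.T @ (expit(X @ beta) - passed)

# BFGS chooses its own step sizes, so no learning rate is passed in.
result = minimize(cost, x0=np.zeros(2), jac=gradient, method="BFGS")
beta_hat = result.x

print("coefficients:", beta_hat)
print("iterations:", result.nit)
print("P(pass | 2 hours):", expit(beta_hat @ np.array([1.0, 2.0])))
print("P(pass | 4 hours):", expit(beta_hat @ np.array([1.0, 4.0])))
```

Passing method="L-BFGS-B" or method="CG" instead selects the limited-memory and conjugate-gradient variants from the list above.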

Multinomial Logistic Regression

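In multinomial logistic regression, the output variable can take more than two possible discrete values. Consider the digits dataset. Here the output variable is the digit value, which can take any of the values (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).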

Given below is an implementation of multinomial logistic regression using scikit-learn to make predictions on the digits dataset.

from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

# load the digit dataset
digits = datasets.load_digits()

# defining feature matrix(X) and response vector(y)
X = digits.data
y = digits.target

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# create logistic regression object
reg = linear_model.LogisticRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# making predictions on the testing set
y_pred = reg.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
print("Logistic Regression model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)

Logistic Regression model accuracy(in %): 95.6884561892
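With more than two classes, a common way to turn one linear score per class into probabilities is the softmax function, which generalizes the sigmoid. The sketch below is a minimal illustration with made-up scores; it is not a description of how scikit-learn computes its predictions internally.

```python
import numpy as np

def softmax(scores):
    """Convert each row of class scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Hypothetical scores for two observations and three classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
probs = softmax(scores)

# The predicted class is the one with the highest probability.
predicted_class = np.argmax(probs, axis=1)
print(np.round(probs, 3))
print(predicted_class)
```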

Finally, here are some points about logistic regression to ponder:

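It does not assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the logit of the response and the explanatory variables.
The independent variables can even be power terms or other nonlinear transformations of the original independent variables.
The dependent variable does not need to be normally distributed, but it is typically assumed to follow a distribution from the exponential family (e.g. binomial, Poisson, multinomial, normal, ...); binary logistic regression assumes a binomial distribution of the response.
Homogeneity of variance does not need to be satisfied.
Errors need to be independent but do not need to be normally distributed.
It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.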
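The point about nonlinear transformations of the independent variables can be sketched as follows. The data is synthetic and made up for this example: class 1 occupies the middle of a one-dimensional range, so no single cut on x separates the classes, but adding x squared as a feature lets the same model draw the boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D data (an assumption for this sketch): class 1 lies in the middle.
x = np.linspace(-3.0, 3.0, 121)
y = (np.abs(x) < 1.0).astype(int)

# Original feature only, versus original feature plus its square.
X_linear = x.reshape(-1, 1)
X_quadratic = np.column_stack([x, x ** 2])

acc_linear = LogisticRegression().fit(X_linear, y).score(X_linear, y)
acc_quadratic = LogisticRegression().fit(X_quadratic, y).score(X_quadratic, y)

print("accuracy with x only:     ", acc_linear)
print("accuracy with x and x^2:  ", acc_quadratic)
```

The model stays linear in its coefficients; only the inputs were transformed.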
References:
http://cs229.stanford.edu/notes/cs229-notes1.pdf
http://machinelearningmastery.com/logistic-regression-for-machine-learning/
https://onlinecourses.science.psu.edu/stat504/node/164

This article was written by Nikhil Kumar.