機器學習-決策樹(Decision Tree)案例
- 2019 年 10 月 5 日
- 筆記
背景介紹
這是我最喜歡的演算法之一,我經常使用它。它是一種監督學習演算法,主要用於分類問題。令人驚訝的是,它適用於分類和連續因變數。在該演算法中,我們將總體分成兩個或更多個同類集。這是基於最重要的屬性/獨立變數來完成的,以儘可能地作為不同的組。有關詳細資訊,請參閱簡化決策樹:https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

在上圖中,您可以看到人口根據多個屬性分為四個不同的組,以識別「他們是否會玩」。為了將人口分成不同的異構群體,它使用各種技術,如基尼,資訊增益,卡方,熵。
理解決策樹如何工作的最好方法是玩Jezzball–一款來自微軟的經典遊戲(如下圖所示)。基本上,你有一個移動牆壁的房間,你需要創建牆壁,以便最大限度的區域被球清除。

所以,每次你用牆隔開房間時,你都試圖在同一個房間里創造2個不同的人口。決策樹以非常類似的方式工作,通過將人口分成儘可能不同的群體。
接下來看使用Python Scikit-learn的決策樹案例:
import pandas as pd from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import accuracy_score # read the train and test dataset train_data = pd.read_csv('train-data.csv') test_data = pd.read_csv('test-data.csv') # shape of the dataset print('Shape of training data :',train_data.shape) print('Shape of testing data :',test_data.shape) train_x = train_data.drop(columns=['Survived'],axis=1) train_y = train_data['Survived'] test_x = test_data.drop(columns=['Survived'],axis=1) test_y = test_data['Survived'] model = DecisionTreeClassifier() model.fit(train_x,train_y) # depth of the decision tree print('Depth of the Decision Tree :', model.get_depth()) # predict the target on the train dataset predict_train = model.predict(train_x) print('Target on train data',predict_train) # Accuray Score on train dataset accuracy_train = accuracy_score(train_y,predict_train) print('accuracy_score on train dataset : ', accuracy_train) # predict the target on the test dataset predict_test = model.predict(test_x) print('Target on test data',predict_test) # Accuracy Score on test dataset accuracy_test = accuracy_score(test_y,predict_test) print('accuracy_score on test dataset : ', accuracy_test)
上面程式碼運行結果:
Shape of training data : (712, 25) Shape of testing data : (179, 25) Depth of the Decision Tree : 19 Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 01 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 11 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 00 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 00 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 00 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 01 0 0 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 10 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 00 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 00 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 01 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 11 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 10 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 00 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 01 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 11 0 1 1 1 0 1 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 00 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 01 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 00 0 1 0 1 0 1 0 1 1 1 0 0 1 0] accuracy_score on train dataset : 0.9859550561797753 Target on test data [0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 01 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 00 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 11 0 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 01 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0] accuracy_score on test dataset : 0.770949720670391