Machine Learning - Decision Tree Example

  • October 5, 2019
  • Notes

Background

This is one of my favorite algorithms, and I use it quite frequently. It is a supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, so as to make the groups as distinct as possible. For more details, see Decision Tree Simplified: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify whether or not they will play. To split the population into different heterogeneous groups, decision trees use various techniques such as Gini, Information Gain, Chi-square, and entropy.
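To make these splitting criteria concrete, here is a minimal sketch (my own addition, not from the original article) of how Gini impurity, entropy, and information gain can be computed for a candidate split; the sample labels below are hypothetical:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_k * log2(p_k)) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical labels: 1 = "will play", 0 = "will not play"
parent = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
left   = np.array([1, 1, 1, 1, 1])   # one side of a candidate split
right  = np.array([0, 0, 0, 0, 0])   # the other side

# Information gain = parent entropy - weighted average entropy of the children
n = len(parent)
gain = entropy(parent) - (len(left) / n * entropy(left)
                          + len(right) / n * entropy(right))
print('Gini of parent :', gini(parent))     # 0.5: a perfectly mixed node
print('Information gain of split :', gain)  # 1.0: the split is perfectly pure

For two classes, a perfectly mixed node has a Gini impurity of 0.5 and an entropy of 1.0 bit, while a pure node scores 0 on both; the tree greedily prefers the split with the largest gain.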

The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft (shown below). Essentially, you have a room with moving walls, and you need to create walls so that as much area as possible gets cleared of balls.

So, every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.
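As a rough illustration of that wall-placing search, the following sketch (again my own addition, with hypothetical age/plays data) scans every candidate threshold on a single feature and keeps the split whose two sides are purest by weighted Gini impurity:

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    # Try every midpoint between consecutive sorted feature values as a "wall"
    # and keep the threshold whose two sides are purest on weighted average.
    order = np.argsort(feature)
    feature, labels = feature[order], labels[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue  # no threshold can separate identical values
        t = (feature[i] + feature[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical data: age of a player vs. whether they play (1) or not (0)
age   = np.array([10, 12, 15, 18, 25, 30, 35, 40])
plays = np.array([ 1,  1,  1,  1,  0,  0,  0,  0])
print(best_split(age, plays))  # best threshold 21.5, weighted Gini 0.0 (pure split)

A real decision tree repeats this search over every feature at every node, then recurses into the two resulting groups until a stopping condition is met.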

Next, let's look at a decision tree example using Python and scikit-learn:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# separate the independent variables from the 'Survived' target variable
train_x = train_data.drop(columns=['Survived'])
train_y = train_data['Survived']

test_x = test_data.drop(columns=['Survived'])
test_y = test_data['Survived']

# fit a decision tree classifier with default parameters
model = DecisionTreeClassifier()
model.fit(train_x, train_y)

# depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

Running the code above produces the following output:

Shape of training data : (712, 25)
Shape of testing data : (179, 25)
Depth of the Decision Tree : 19
Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1
 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0
 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 1 0 1 0 0 1
 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 1 0
 1 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0
 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0
 1 0 1 1 1 0 0 1 0]
accuracy_score on train dataset :  0.9859550561797753
Target on test data [0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0
 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0]
accuracy_score on test dataset :  0.770949720670391
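One thing worth noting in this output: the tree grows to a depth of 19 and fits the training data almost perfectly (98.6% accuracy) but reaches only 77.1% on the test set, a classic sign of overfitting. A minimal follow-up sketch (my own addition, reusing the train_x/train_y/test_x/test_y variables from the code above) that compares a few max_depth settings would look like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# limiting the depth of the tree usually trades a little training accuracy
# for better generalization; random_state fixes tie-breaking for repeatability
for depth in [3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(train_x, train_y)
    train_acc = accuracy_score(train_y, model.predict(train_x))
    test_acc = accuracy_score(test_y, model.predict(test_x))
    print(f'max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}')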