Getting Started with Kaggle: Predicting Titanic Survivors

Preface

  This is an analysis of survival rates for the Kaggle Titanic competition. Before working on it, I strongly recommend rewatching the film Titanic; it may give you some intuition, such as "women and children first". Survival was therefore not random but followed an order based on passengers' backgrounds.

1. Background

  On April 15, 1912, the luxury liner Titanic, carrying 1,316 passengers and 891 crew members, struck an iceberg and sank on her maiden voyage; 1,502 of the 2,224 people aboard died. One reason for the heavy loss of life was that there were not enough lifeboats for everyone. Although luck played some part in surviving, some groups were more likely to survive than others, such as women, children, and the upper class. This article analyzes who was likely to survive, using Python and machine-learning models to predict which passengers survived the disaster, and finally submits the results.

  The training and test data contain passengers' personal information and survival status; the task is to build a suitable model from them and predict the survival of the remaining passengers. This is a binary classification problem.

2. The Data Files

  Download the data from the Kaggle Titanic competition page: //www.kaggle.com/c/titanic

  The competition site has a background page describing the problem, a Data page with the downloadable files, and a forum page from which we can pick up all kinds of data-processing and modeling ideas. (Page screenshots omitted.)

3. The Data Variables

  Each passenger has 12 attributes. PassengerID serves only as an index here, and Survived is the target we want to predict, so there are 10 variables left to work with.

   Notes on each variable:

  1. PassengerID (just an index)
  2. Survived (survived or not; the prediction target)
  3. Pclass (passenger class; class divisions were pronounced in Britain at the time, so this is fairly important)
  4. Name (name; more information can be extracted from it)
  5. Sex (sex; fairly important)
  6. Age (age; fairly important)
  7. Parch (number of parents and children aboard; 1 means one such relative, and so on)
  8. SibSp (number of siblings and spouses aboard)
  9. Ticket (ticket number; anyone's guess what it encodes)
  10. Fare (fare; pricier tickets may mean better survival odds)
  11. Cabin (cabin number)
  12. Embarked (port of embarkation)

4. Evaluation

  We will start with basic methods, make predictions, and finally submit the predictions of the most accurate model to Kaggle to see how well we do.
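  For reference, a submission for this competition is a two-column CSV of PassengerId and Survived. A minimal sketch of generating one (test and predictions are placeholders for the loaded test set and a fitted model's 0/1 outputs; the file name is my own choice):

import pandas as pd

# build the submission frame in the format Kaggle expects
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions.astype(int),
})
submission.to_csv('submission.csv', index=False)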

5. Full Code on My GitHub

  Link: click here

   If that does not open, copy this address: //github.com/LeBron-Jian/Kaggle-learn

Data Preprocessing

  The quality of the data sets the upper bound on what a model can achieve, so preprocessing is extremely important. Machine-learning models are meant to uncover latent patterns in the data, but if the data are too messy those patterns are hard to find. Worse, a feature that has little relation to the label we want to predict acts like noise and only interferes with accurate prediction. So we need to analyze the dataset at hand and check whether each feature actually has a significant effect on the label we want to predict.

1. Overview

  The files train.csv and test.csv under Data hold the official training and test data, respectively.

train.info()

   The output shows 891 rows in total, so several features have missing values; Cabin in particular has very few non-null entries (no rush, we just take note for now). Some columns are numeric, some are text, and some are categorical; the latter cannot be inspected with the function below.
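  To quantify exactly how many values each column is missing, a one-liner helps (a small sketch, assuming train has been loaded with pd.read_csv):

print(train.isnull().sum())  # per-column count of missing values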

res = all_data.describe()

 

   The describe() function shows summary statistics, but only for numeric columns; string columns such as Name and Cabin are not displayed.

  The summary above tells us that about 0.383838 of passengers ultimately survived, that the mean passenger age was about 29.88, and so on.

2. Initial Data Analysis

  Each passenger has all these attributes, so how do we know which ones are useful and how to use them? Honestly, at this point I don't know either, but one thing is certain: understanding the data matters enormously. So we need to look at the data closely; the sections below show how.

  To see why we analyze the data this way, first read this article: //zhuanlan.zhihu.com/p/26663761, a memoir written by a Titanic survivor. I'll quote the key point:

  Note: facing the sinking ship, women and children boarded the lifeboats first; there were exceptions, but very few.

  This analysis covers the following aspects:

  • 1. Sex vs. survival rate
  • 2. Passenger class vs. survival rate
  • 3. Spouses/siblings aboard vs. survival rate
  • 4. Parents/children aboard vs. survival rate
  • 5. Age vs. survival rate
  • 6. Port of embarkation vs. survival rate
  • 7. Title (extracted from the name) vs. survival rate
  • 8. Family size vs. survival rate
  • 9. Cabin/deck vs. survival rate

  (These analyses draw on this post: //www.jianshu.com/p/e79a8c41cb1a)

  First, let's look at the proportion of survivors vs. deaths in the training data:

    res = train['Survived'].value_counts()
    print(res)
    '''
    0    549
    1    342
    Name: Survived, dtype: int64  '''

 

2.1 Sex vs. Survival Rate

  We know from above that women and children boarded first, so how did women actually fare? Let's look at the plot:

    sns.barplot(x='Sex', y='Survived', data=train)

   The plot is shown below:

   We can see that women indeed survived at a far higher rate than men, so Sex is a very important feature.
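  The same comparison can be read off numerically (a small sketch; on the official training set this gives roughly 0.74 for female and 0.19 for male):

print(train.groupby('Sex')['Survived'].mean())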

2.2 Passenger Class vs. Survival Rate

sns.barplot(x='Pclass', y='Survived', data=train)

   The plot is shown below:

   We find that the higher a passenger's class, the higher the survival rate, so Pclass is also a fairly important feature.

2.3 Spouses/Siblings Aboard vs. Survival Rate

sns.barplot(x='SibSp', y='Survived', data=train)

   The plot is shown below:

  All we can say is that passengers with a moderate number of spouses/siblings aboard survived at higher rates; traveling with one companion, perhaps a spouse, seems to do best.

2.4 Parents/Children Aboard vs. Survival Rate

sns.barplot(x='Parch', y='Survived', data=train)

   The plot is shown below:

   Likewise, passengers with a moderate number of parents/children aboard seem to survive at higher rates.

2.5 Age vs. Survival Rate

    facet = sns.FacetGrid(train, hue='Survived', aspect=2)
    facet.map(sns.kdeplot, 'Age', shade=True)
    facet.set(xlim=(0, train['Age'].max()))
    facet.add_legend()
    plt.xlabel('Age')
    plt.ylabel('density')

   The plot is shown below:

  

   The density plots for the two survival outcomes differ clearly to the left of age 15, where the non-overlapping area is large; at other ages the difference is small and can be attributed to chance. It is therefore worth splitting off the younger age range.
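  One way to act on this is a binary child indicator (a sketch; the age-15 cutoff and the IsChild column name are my own illustration, not from the original code):

# missing ages compare as False here and fall into the 0 group
train['IsChild'] = (train['Age'] < 15).astype(int)
print(train.groupby('IsChild')['Survived'].mean())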

2.6 Port of Embarkation vs. Survival Rate

Port of embarkation (Embarked):

  • Departure: S = Southampton, England
  • First stop: C = Cherbourg, France
  • Second stop: Q = Queenstown, Ireland

  First we plot embarkation port against survival:

sns.countplot(x='Embarked', hue='Survived', data=train)

       The plot is shown below:

   We find that passengers who embarked at C had a higher survival rate; this is worth keeping as a candidate feature, and we can always filter it out later.
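  countplot shows raw counts; the per-port survival rates can be computed directly (a small sketch):

print(pd.crosstab(train['Embarked'], train['Survived'], normalize='index'))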

2.7 Title vs. Survival Rate

  Everyone has a name, and each name is unique; names by themselves do not look directly related to survival, so how can we use this feature? The useful information hides in the title: given the women-first policy mentioned above, passengers addressed as Mrs or Miss were more likely to survive than those addressed as Mr. So we extract the title from Name and build a new feature column, Title.

  Notice a striking property of each Name value: every name contains a specific form of address, or title. Extracting it gives a very useful new variable that can help with prediction.

  For example:

Braund, Mr. Owen Harris
Heikkinen, Miss. Laina
Oliva y Ocana, Dona. Fermina
Peter, Master. Michael J

  Extract the title from each name:

all_data['Title'] = all_data.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

  If the one-liner above is hard to follow, this version does the same thing:

# A function to get the title from a name
def get_title(name):
    # use a regular expression to search for a title (a word followed by a period)
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # if the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ''

# get all the titles and print how often each one occurs
titles = all_data['Name'].apply(get_title)
print(pd.value_counts(titles))

  Let's look at the kinds of titles and how often each occurs:

all_data.Title.value_counts()

out:
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Ms                2
Major             2
Capt              1
Lady              1
Jonkheer          1
Don               1
Dona              1
the Countess      1
Mme               1
Sir               1
Name: Title, dtype: int64

  We will define the following title categories:

  1. Officer: officers and professionals
  2. Royalty: nobility/royalty
  3. Mr: adult men
  4. Mrs: married women
  5. Miss: young unmarried women
  6. Master: a courtesy title for boys (the median age for this title, computed later, is 4.5)

  So there are six broad classes: Mr, Miss, Mrs, Master, Royalty, Officer. The code mapping the raw title strings to these classes:

    all_data = pd.concat([train, test], ignore_index=True)
    all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
    Title_Dict = {}
    Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
    Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
    Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
    Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
    Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
    Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))

    all_data['Title'] = all_data['Title'].map(Title_Dict)
    sns.barplot(x="Title", y="Survived", data=all_data)

  If you are not comfortable with that pandas style, the following code also works:

    # map each title to an integer; some titles are very rare
    # and are compressed into the same codes as other titles
    title_mapping = {
        'Mr': 1,
        'Miss': 2,
        'Mrs': 3,
        'Master': 4,
        'Rev': 5,
        'Dr': 6,
        'Col': 7,
        'Mlle': 8,
        'Ms': 9,
        'Major': 10,
        'Don': 11,
        'Countess': 12,
        'Mme': 13,
        'Jonkheer': 14,
        'Sir': 15,
        'Dona': 16,
        'Capt': 17,
        'Lady': 18,
    }
    for k, v in title_mapping.items():
        titles[titles == k] = v
        # print(k, v)
    all_data['Title'] = titles

   How to group them is up to you; above I merely listed every title, and you can group them as you like. Grouping along the lines of the six classes above:

    title_mapping = {
        'Mr': 2,
        'Miss': 3,
        'Mrs': 4,
        'Master': 1,
        'Rev': 5,
        'Dr': 5,
        'Col': 5,
        'Mlle': 3,
        'Ms': 4,
        'Major': 6,
        'Don': 5,
        'Countess': 5,
        'Mme': 4,
        'Jonkheer': 1,
        'Sir': 5,
        'Dona': 5,
        'Capt': 6,
        'Lady': 5,
    }

   The plot is shown below:

2.8 Family Size vs. Survival Rate

  We add a FamilySize feature, equal to parents/children plus siblings/spouses plus one: FamilySize = Parch + SibSp + 1. We will then bin FamilySize into three classes to form FamilyLabel:

all_data['FamilySize']=all_data['SibSp']+all_data['Parch']+1
sns.barplot(x="FamilySize", y="Survived", data=all_data)

   The plot is shown below:

Family categories:

  • Family_Single (small): family size = 1
  • Family_Small (medium): 2 <= family size <= 4
  • Family_Large (large): family size >= 5
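  The three categories above could also be encoded directly as indicator columns (a sketch of one option; the Family_* column names follow the list and are not used later in this article):

all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data['Family_Single'] = (all_data['FamilySize'] == 1).astype(int)
all_data['Family_Small'] = all_data['FamilySize'].between(2, 4).astype(int)
all_data['Family_Large'] = (all_data['FamilySize'] >= 5).astype(int)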

   In this article, though, we bin FamilySize into three classes by survival rate (a slightly different split from the rough categories above), forming the FamilyLabel feature:

def Fam_label(s):
    if (s >= 2) & (s <= 4):
        return 2
    elif ((s > 4) & (s <= 7)) | (s == 1):
        return 1
    elif (s > 7):
        return 0

all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data['FamilyLabel']=all_data['FamilySize'].apply(Fam_label)
sns.barplot(x="FamilyLabel", y="Survived", data=all_data)

   The resulting plot is shown below:

  We can also extract a name-length feature:

    # generating a FamilySize column: all relatives aboard (note: no +1 in this variant)
    all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch']

    # the .apply method generates a new series
    all_data['NameLength'] = all_data['Name'].apply(lambda x: len(x))

   In the end we decide whether or not to use these features.

2.9 Cabin/Deck vs. Survival Rate

  The Cabin column has only 295 non-null values, so 1309 - 295 = 1014 are missing, a missing rate of 1014/1309 = 77.5%, which is substantial. This is bound to be a tricky column.

  Although there is not much Cabin data, being in a different cabin implies a different status, so we add a Deck feature: first fill empty Cabin values with "Unknown", then take the first letter of Cabin as the passenger's deck number.

    all_data['Cabin'] = all_data['Cabin'].fillna('Unknown')
    all_data['Deck'] = all_data['Cabin'].str.get(0)
    sns.barplot(x="Deck", y="Survived", data=all_data)
    

   The plot is shown below:

 

2.10 Ticket Group Size vs. Survival Rate

  We add a TicketGroup feature: for each passenger, the number of passengers sharing the same ticket number. The code:

    Ticket_Count = dict(all_data['Ticket'].value_counts())
    all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x: Ticket_Count[x])
    sns.barplot(x='TicketGroup', y='Survived', data=all_data)

   The plot is shown below:

   Bin TicketGroup into three classes by survival rate:

def Ticket_Label(s):
    if (s >= 2) & (s <= 4):
        return 2
    elif ((s > 4) & (s <= 8)) | (s == 1):
        return 1
    elif (s > 8):
        return 0

Ticket_Count = dict(all_data['Ticket'].value_counts())
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x: Ticket_Count[x])
all_data['TicketGroup'] = all_data['TicketGroup'].apply(Ticket_Label)
sns.barplot(x='TicketGroup', y='Survived', data=all_data)

   The plot is shown below:

3. Data Cleaning

3.1 Filling Missing Values

  As we saw in the overview, there are quite a few missing values, some substantial, so how to fill them is a real question.

  Many machine-learning algorithms require that the features they receive contain no missing values. Common approaches:

1. Numeric data: fill with the mean, e.g. col.fillna(col.mean())

2. Categorical data: fill with the most frequent category, via .value_counts() plus .fillna()

3. Predict the missing values with a model, e.g. KNN (see the sketch below)
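  For option 3, scikit-learn ships a KNN-based imputer (a minimal sketch, not used in the rest of this article; it requires scikit-learn >= 0.22 and numeric columns only):

from sklearn.impute import KNNImputer

# impute each missing value from its 5 nearest neighbors in feature space
imputer = KNNImputer(n_neighbors=5)
num_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
all_data[num_cols] = imputer.fit_transform(all_data[num_cols])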

  For details, see my post:

Python Machine Learning Notes: Feature Engineering and Data Mining with sklearn

  First, let's see which variables have missing values:

  Below is the training data: 891 rows in total, with missing values in Age, Cabin, and Embarked.

 

   Below is the test data: 418 rows, with missing values in Age, Fare, and Cabin.

   Here the missing Age values are filled with the median:

traindata['Age'] = traindata['Age'].fillna(traindata['Age'].median())

test['Age'] = test['Age'].fillna(test['Age'].median())

   Of course a single fill value may not be ideal. Searching around, I also found approaches that fill Age by title (recall section 2.7, where we processed the names): Miss denotes an unmarried woman, usually young, while Mrs denotes a married woman, generally older, so inferring age from the information implicit in the title is quite reasonable. Below we group by Title and fill Age within each group.

    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    all_data = pd.concat([train, test], ignore_index=True)
    all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
    Title_Dict = {}
    Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
    Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
    Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
    Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
    Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
    Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))

    all_data['Title'] = all_data['Title'].map(Title_Dict)
    # sns.barplot(x="Title", y="Survived", data=all_data)
    grouped = all_data.groupby(['Title'])
    median = grouped.Age.median()
    print(median)



Title
Master      4.5
Miss       22.0
Mr         29.0
Mrs        35.0
Officer    49.5
Royalty    40.0
Name: Age, dtype: float64

   We can see that the median age differs markedly across titles, so we simply fill the missing values by title. We use the median here (the mean would work too; they differ little on this problem, and the median looks tidier).

   The code:

    all_data['Title'] = all_data['Title'].map(Title_Dict)
    # sns.barplot(x="Title", y="Survived", data=all_data)
    grouped = all_data.groupby(['Title'])
    median = grouped.Age.median()
    for i in range(len(all_data['Age'])):
        if pd.isnull(all_data['Age'][i]):
            # use .loc to avoid pandas' chained-assignment pitfall
            all_data.loc[i, 'Age'] = median[all_data['Title'][i]]
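  The loop above can also be replaced by a single groupby/transform (a sketch with the same effect):

all_data['Age'] = all_data['Age'].fillna(
    all_data.groupby('Title')['Age'].transform('median'))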

   We can verify with all_data.info().

   The Age gaps are now filled; next we fill Cabin, Embarked, and Fare.

  Embarked has only two missing values, which barely affect the result, so we fill them with the port where the most passengers embarked (essentially the maximum-prior-probability rule). From the analysis above, S had the most embarkations, so we fill with the most frequent value: S = Southampton, England.

all_data['Embarked'] = all_data['Embarked'].fillna('S')

  As noted above, Cabin has a lot of missing values, so we fill them with 'U' for Unknown.

    # Cabin has many missing values; fill them with 'U' for Unknown
    all_data['Cabin'] = all_data['Cabin'].fillna('U')

  Fare has a single missing value, which we fill with the median fare.

all_data['Fare'] = all_data['Fare'].fillna(all_data.Fare.median())

   With that, all the missing values are filled in; let's check:
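  A one-line check (a sketch):

print(all_data.isnull().sum())  # only Survived (the 418 test rows) should still be null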

 

  The complete code:

def load_dataset(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    # print(test.info())
    all_data = pd.concat([train, test], ignore_index=True)

    titles = all_data['Name'].apply(get_title)
    # print(pd.value_counts(titles))
    # map each title to an integer some titles are very rare
    # and are compressed into the same codes as other titles
    title_mapping = {
        'Mr': 2,
        'Miss': 3,
        'Mrs': 4,
        'Master': 1,
        'Rev': 5,
        'Dr': 5,
        'Col': 5,
        'Mlle': 3,
        'Ms': 4,
        'Major': 6,
        'Don': 5,
        'Countess': 5,
        'Mme': 4,
        'Jonkheer': 1,
        'Sir': 5,
        'Dona': 5,
        'Capt': 6,
        'Lady': 5,
    }
    for k, v in title_mapping.items():
        titles[titles == k] = v
        # print(k, v)
    all_data['Title'] = titles

    grouped = all_data.groupby(['Title'])
    median = grouped.Age.median()
    for i in range(len(all_data['Age'])):
        if pd.isnull(all_data['Age'][i]):
            # use .loc to avoid pandas' chained-assignment pitfall
            all_data.loc[i, 'Age'] = median[all_data['Title'][i]]
    # print(all_data['Age'])

    # generating a FamilySize column: all relatives aboard (no +1 in this variant)
    all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch']

    # the .apply method generates a new series
    all_data['NameLength'] = all_data['Name'].apply(lambda x: len(x))
    # print(all_data['NameLength'])

    all_data['Embarked'] = all_data['Embarked'].fillna('S')
    # Cabin has many missing values; fill them with 'U' for Unknown
    all_data['Cabin'] = all_data['Cabin'].fillna('U')
    all_data['Fare'] = all_data['Fare'].fillna(all_data.Fare.median())

    all_data.loc[all_data['Embarked'] == 'S', 'Embarked'] = 0
    all_data.loc[all_data['Embarked'] == 'C', 'Embarked'] = 1
    all_data.loc[all_data['Embarked'] == 'Q', 'Embarked'] = 2

    all_data.loc[all_data['Sex'] == 'male', 'Sex'] = 0
    all_data.loc[all_data['Sex'] == 'female', 'Sex'] = 1

    traindata, testdata = all_data[:891], all_data[891:]
    # print(traindata.shape, testdata.shape, all_data.shape)  # (891, 15) (418, 15) (1309, 15)
    traindata, trainlabel = traindata.drop('Survived', axis=1), traindata['Survived']  # train.pop('Survived')
    corrDf = all_data.corr()
    '''
        correlation of each feature with the survival outcome (Survived);
        ascending=False sorts in descending order
    '''
    res = corrDf['Survived'].sort_values(ascending=False)
    print(res)
    return traindata, trainlabel, testdata



# A function to get the title from a name
def get_title(name):
    # use a regular expression to search for a title
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # if the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ''

  OK. The next step is feature engineering: further processing of the filled-in features.

3.2 Feature Processing

  After filling in the data, we inspected the dtypes: three in total, int, float, and object. We need to replace categories with numbers. Some of this was done earlier, but here is the whole pass again.

  • 1. Sex: male = 0, female = 1
  • 2. Embarked: S = 0, C = 1, Q = 2
  • 3. Name: mapped to the six title classes (see section 2.7, Title vs. Survival Rate, above)

  The code (all of it has appeared above):

all_data.loc[all_data['Embarked'] == 'S', 'Embarked'] = 0
all_data.loc[all_data['Embarked'] == 'C', 'Embarked'] = 1
all_data.loc[all_data['Embarked'] == 'Q', 'Embarked'] = 2

all_data.loc[all_data['Sex'] == 'male', 'Sex'] = 0
all_data.loc[all_data['Sex'] == 'female', 'Sex'] = 1

titles = all_data['Name'].apply(get_title)
# print(pd.value_counts(titles))
# map each title to an integer some titles are very rare
# and are compressed into the same codes as other titles
title_mapping = {
    'Mr': 2,
    'Miss': 3,
    'Mrs': 4,
    'Master': 1,
    'Rev': 5,
    'Dr': 5,
    'Col': 5,
    'Mlle': 3,
    'Ms': 4,
    'Major': 6,
    'Don': 5,
    'Countess': 5,
    'Mme': 4,
    'Jonkheer': 1,
    'Sir': 5,
    'Dona': 5,
    'Capt': 6,
    'Lady': 5,
}
for k, v in title_mapping.items():
    titles[titles == k] = v
    # print(k, v)
all_data['Title'] = titles
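  Integer codes impose an artificial ordering on nominal categories; one-hot encoding is a common alternative (a sketch, not used in the rest of this article; it assumes Sex and Embarked still hold their original string values):

# one indicator column per category, e.g. Sex_male, Embarked_S, ...
dummies = pd.get_dummies(all_data[['Sex', 'Embarked']], prefix=['Sex', 'Embarked'])
all_data = pd.concat([all_data, dummies], axis=1)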


4. Feature Extraction

  Generally speaking, a realistic data-mining project requires feature engineering: we need to see how to extract new features that push accuracy higher. All the analysis above was done for precisely this reason; the subsections below spell it out.

4.1 Finding Important Features with a Random Forest

  After filling the missing values and surveying the result, we have 14 features plus one target (Survived). Four of the features are object-typed and are dropped, leaving 10 features from which we extract the important ones, as sketched below.
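  A sketch of how such a ranking can be produced (assuming traindata and trainlabel come from the load_dataset above, and the imports at the top of the script):

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
              'Embarked', 'FamilySize', 'Title', 'NameLength']
clf = RandomForestClassifier(random_state=1, n_estimators=100)
clf.fit(traindata[predictors].astype(float), trainlabel)
# print features from most to least important
for name, score in sorted(zip(predictors, clf.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print('%-12s %.4f' % (name, score))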

 

   We train a random forest on the five most important features above: Pclass, Sex, Fare, Title, and NameLength.

def random_forestclassifier_train(traindata, trainlabel, testdata):
    # predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'Title', 'NameLength']
    predictors =['Pclass', 'Sex', 'Fare', 'Title', 'NameLength', ]
    traindata, testdata = traindata[predictors], testdata[predictors]
    # print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    # print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = RandomForestClassifier(random_state=1, n_estimators=100,
                                 min_samples_split=4, min_samples_leaf=2)
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    test_accuracy = clf.score(testSet, testlabel) * 100
    print("正確率為   %s%%" % test_accuracy)  # 正確率為   80.44692737430168%

 The accuracy is actually lower than without this feature selection... which is a little awkward.

 4.2 The Correlation-Coefficient Method

  The correlation method: compute each feature's correlation coefficient with the target.

    # correlation matrix
    corrDf = all_data.corr()
    '''
        correlation of each feature with the survival outcome (Survived);
        ascending=False sorts in descending order
    '''
    res = corrDf['Survived'].sort_values(ascending=False)
    print(res)

   The result:

 

   We train a random forest on the six positively correlated features above: Sex, NameLength, Fare, Embarked, Parch, and FamilySize.

def random_forestclassifier_train(traindata, trainlabel, testdata):
    # predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'Title', 'NameLength']
    predictors = ['Sex', 'Fare', 'Embarked', 'NameLength', 'Parch', 'FamilySize']
    traindata, testdata = traindata[predictors], testdata[predictors]
    # print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    # print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = RandomForestClassifier(random_state=1, n_estimators=100,
                                 min_samples_split=4, min_samples_leaf=2)
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    test_accuracy = clf.score(testSet, testlabel) * 100
    print("正確率為   %s%%" % test_accuracy)  # 正確率為   81.56424581005587%

   Compared with the previous random forest this is noticeably better; accuracy rises by one percentage point.

 4.3 Trying All Features in the Random Forest

def random_forestclassifier_train(traindata, trainlabel, testdata):
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'Title', 'NameLength']

    traindata, testdata = traindata[predictors], testdata[predictors]
    # print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    # print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = RandomForestClassifier(random_state=1, n_estimators=100,
                                 min_samples_split=4, min_samples_leaf=2)
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    test_accuracy = clf.score(testSet, testlabel) * 100
    print("正確率為   %s%%" % test_accuracy)  # 正確率為   83.79888268156425%

   Accuracy improves further, this time by two percentage points.

  This is probably the best result so far...

   So the next step is to try various ensemble-learning approaches and then tune hyperparameters, as sketched below.
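  A sketch of the tuning step with GridSearchCV (the parameter grid is illustrative, not tuned; assuming traindata and trainlabel as above):

from sklearn.model_selection import GridSearchCV

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
              'Embarked', 'FamilySize', 'Title', 'NameLength']
param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4],
}
# exhaustive search over the grid with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, cv=5, scoring='accuracy')
search.fit(traindata[predictors].astype(float), trainlabel)
print(search.best_params_, search.best_score_)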


Model Training and Results

1. Linear Regression

  Here are two ways of doing linear regression. My first attempt produced an accuracy so low it was scary, only 37.63003367180264%, worse than random guessing. The catch is that LinearRegression's .score() returns the R² statistic rather than a classification accuracy; to get a real accuracy, the continuous predictions must be thresholded at 0.5 and compared with the labels, as the code below does.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import RandomForestClassifier

warnings.filterwarnings('ignore')


def load_dataset(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    train['Age'] = train['Age'].fillna(train['Age'].median())
    test['Age'] = test['Age'].fillna(test['Age'].median())
    # replace all the occurences of male with the number 0
    train.loc[train['Sex'] == 'male', 'Sex'] = 0
    train.loc[train['Sex'] == 'female', 'Sex'] = 1
    test.loc[test['Sex'] == 'male', 'Sex'] = 0
    test.loc[test['Sex'] == 'female', 'Sex'] = 1
    # .fillna() fills missing values with the given value
    train['Embarked'] = train['Embarked'].fillna('S')
    train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
    train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
    train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
    test['Embarked'] = test['Embarked'].fillna('S')
    test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
    test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
    test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
    test['Fare'] = test['Fare'].fillna(test['Fare'].median())
    traindata, trainlabel = train.drop('Survived', axis=1), train['Survived']  # train.pop('Survived')
    testdata = test
    print(traindata.shape, trainlabel.shape, testdata.shape)
    # (891, 11) (891,) (418, 11)
    return traindata, trainlabel, testdata

def linear_regression_test(traindata, trainlabel, testdata):
    traindata = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    traindata['Age'] = traindata['Age'].fillna(traindata['Age'].median())
    test['Age'] = test['Age'].fillna(test['Age'].median())
    # replace all the occurences of male with the number 0
    traindata.loc[traindata['Sex'] == 'male', 'Sex'] = 0
    traindata.loc[traindata['Sex'] == 'female', 'Sex'] = 1
    # .fillna() fills missing values with the given value
    traindata['Embarked'] = traindata['Embarked'].fillna('S')
    traindata.loc[traindata['Embarked'] == 'S', 'Embarked'] = 0
    traindata.loc[traindata['Embarked'] == 'C', 'Embarked'] = 1
    traindata.loc[traindata['Embarked'] == 'Q', 'Embarked'] = 2
    # the columns we'll use to predict the target
    all_variables = ['PassengerID', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                     'Ticket', 'Fare', 'Cabin', 'Embarked']
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    # traindata, testdata = traindata[predictors], testdata[predictors]
    alg = LinearRegression()
    kf = KFold(n_splits=3)  # no shuffle, so concatenated fold predictions stay aligned with the rows
    predictions = []
    for train_index, test_index in kf.split(traindata):
        # print(train_index, test_index)
        train_predictors = (traindata[predictors].iloc[train_index, :])
        train_target = traindata['Survived'].iloc[train_index]
        alg.fit(train_predictors, train_target)
        test_predictions = alg.predict(traindata[predictors].iloc[test_index, :])
        predictions.append(test_predictions)
    # print(type(predictions))
    predictions = np.concatenate(predictions, axis=0)  #<class 'numpy.ndarray'>
    # print(type(predictions))
    predictions[predictions > 0.5] = 1
    predictions[predictions <= 0.5] = 0  # bug fix: this originally set values below 0.5 to 1 as well
    accuracy = (predictions == traindata['Survived']).mean()
    print(accuracy)  # the all-ones bug made this print 0.3838, which is just the survivor fraction

def linear_regression_train(traindata, trainlabel, testdata):
    # the columns we'll use to predict the target
    all_variables = ['PassengerID', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                     'Ticket', 'Fare', 'Cabin', 'Embarked']
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    traindata, testdata = traindata[predictors], testdata[predictors]
    print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = LinearRegression()
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    # note: LinearRegression.score() returns R^2, not classification accuracy
    r2 = clf.score(testSet, testlabel) * 100
    print("R^2: %s%%" % r2)  # R^2: 37.63003367180264%
    # for an actual accuracy, threshold the continuous predictions at 0.5
    preds = (clf.predict(testSet) > 0.5).astype(int)
    print("Accuracy: %s%%" % ((preds == testlabel).mean() * 100))



if __name__ == '__main__':
    trainfile = 'data/titanic_train.csv'
    testfile = 'data/test.csv'
    traindata, trainlabel, testdata = load_dataset(trainfile, testfile)
    # print(traindata.shape[1])  # 11
    linear_regression_train(traindata, trainlabel, testdata)
    # linear_regression_test(traindata, trainlabel, testdata)


2. Logistic Regression

  Logistic regression likewise uses only the more complete features; the accuracy is decent, reaching 81.56424581005587%. The code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import RandomForestClassifier

warnings.filterwarnings('ignore')


def load_dataset(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    train['Age'] = train['Age'].fillna(train['Age'].median())
    test['Age'] = test['Age'].fillna(test['Age'].median())
    # replace all the occurences of male with the number 0
    train.loc[train['Sex'] == 'male', 'Sex'] = 0
    train.loc[train['Sex'] == 'female', 'Sex'] = 1
    test.loc[test['Sex'] == 'male', 'Sex'] = 0
    test.loc[test['Sex'] == 'female', 'Sex'] = 1
    # .fillna() fills missing values with the given value
    train['Embarked'] = train['Embarked'].fillna('S')
    train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
    train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
    train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
    test['Embarked'] = test['Embarked'].fillna('S')
    test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
    test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
    test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
    test['Fare'] = test['Fare'].fillna(test['Fare'].median())
    traindata, trainlabel = train.drop('Survived', axis=1), train['Survived']  # train.pop('Survived')
    testdata = test
    print(traindata.shape, trainlabel.shape, testdata.shape)
    # (891, 11) (891,) (418, 11)
    return traindata, trainlabel, testdata

def logistic_regression_train(traindata, trainlabel, testdata):
    # the columns we'll use to predict the target
    all_variables = ['PassengerID', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                     'Ticket', 'Fare', 'Cabin', 'Embarked']
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    traindata, testdata = traindata[predictors], testdata[predictors]
    # print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    # print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = LogisticRegression()
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    test_accuracy = clf.score(testSet, testlabel) * 100
    print("正確率為   %s%%" % test_accuracy)  # 正確率為   81.56424581005587%
    # res = clf.predict(testdata)



if __name__ == '__main__':
    trainfile = 'data/titanic_train.csv'
    testfile = 'data/test.csv'
    traindata, trainlabel, testdata = load_dataset(trainfile, testfile)
    logistic_regression_train(traindata, trainlabel, testdata)
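  As a quick robustness check, the same model can also be scored with 3-fold cross-validation instead of a single holdout split (a sketch using the same predictors and the imports at the top of this script):

from sklearn.model_selection import cross_val_score

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
scores = cross_val_score(LogisticRegression(),
                         traindata[predictors].astype(float), trainlabel, cv=3)
print(scores.mean())  # mean accuracy across the three folds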

 

3. Random Forest

  We keep the same features as above and use a random forest; accuracy is 81.56424581005587%. Not bad, though it merely ties logistic regression. The code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import RandomForestClassifier

warnings.filterwarnings('ignore')


def load_dataset(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    train['Age'] = train['Age'].fillna(train['Age'].median())
    test['Age'] = test['Age'].fillna(test['Age'].median())
    # replace all the occurences of male with the number 0
    train.loc[train['Sex'] == 'male', 'Sex'] = 0
    train.loc[train['Sex'] == 'female', 'Sex'] = 1
    test.loc[test['Sex'] == 'male', 'Sex'] = 0
    test.loc[test['Sex'] == 'female', 'Sex'] = 1
    # .fillna() fills missing values with the given value
    train['Embarked'] = train['Embarked'].fillna('S')
    train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
    train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
    train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
    test['Embarked'] = test['Embarked'].fillna('S')
    test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
    test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
    test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
    test['Fare'] = test['Fare'].fillna(test['Fare'].median())
    traindata, trainlabel = train.drop('Survived', axis=1), train['Survived']  # train.pop('Survived')
    testdata = test
    print(traindata.shape, trainlabel.shape, testdata.shape)
    # (891, 11) (891,) (418, 11)
    return traindata, trainlabel, testdata


def random_forestclassifier_train(traindata, trainlabel, testdata):
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    traindata, testdata = traindata[predictors], testdata[predictors]
    # print(traindata.shape, trainlabel.shape, testdata.shape)  # (891, 7) (891,) (418, 7)
    # print(testdata.info())
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf = RandomForestClassifier(random_state=1, n_estimators=100,
                                 min_samples_split=4, min_samples_leaf=2)
    # training the algorithm using the predictors and target
    clf.fit(trainSet, trainlabel)
    test_accuracy = clf.score(testSet, testlabel) * 100
    print("正確率為   %s%%" % test_accuracy)  # 正確率為   81.56424581005587%


if __name__ == '__main__':
    trainfile = 'data/titanic_train.csv'
    testfile = 'data/test.csv'
    traindata, trainlabel, testdata = load_dataset(trainfile, testfile)
    random_forestclassifier_train(traindata, trainlabel, testdata)

 

 4. Ensemble Model

  Here we build a simple ensemble of two algorithms, GradientBoostingClassifier and LogisticRegression.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier
import re

warnings.filterwarnings('ignore')


# A function to get the title from a name
def get_title(name):
    # use a regular expression to search for a title
    title_search = re.search(r'([A-Za-z]+)\.', name)
    # if the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ''


def load_dataset(trainfile, testfile):
    train = pd.read_csv(trainfile)
    test = pd.read_csv(testfile)
    train['Age'] = train['Age'].fillna(train['Age'].median())
    test['Age'] = test['Age'].fillna(test['Age'].median())
    # replace all the occurences of male with the number 0
    train.loc[train['Sex'] == 'male', 'Sex'] = 0
    train.loc[train['Sex'] == 'female', 'Sex'] = 1
    test.loc[test['Sex'] == 'male', 'Sex'] = 0
    test.loc[test['Sex'] == 'female', 'Sex'] = 1
    # .fillna() fills missing values with the given value
    train['Embarked'] = train['Embarked'].fillna('S')
    train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
    train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
    train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
    test['Embarked'] = test['Embarked'].fillna('S')
    test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
    test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
    test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
    test['Fare'] = test['Fare'].fillna(test['Fare'].median())
    # generating a FamilySize column: all relatives aboard (no +1 in this variant)
    train['FamilySize'] = train['SibSp'] + train['Parch']
    test['FamilySize'] = test['SibSp'] + test['Parch']

    # the .apply method generates a new series
    train['NameLength'] = train['Name'].apply(lambda x: len(x))
    test['NameLength'] = test['Name'].apply(lambda x: len(x))

    titles_train = train['Name'].apply(get_title)
    titles_test = test['Name'].apply(get_title)
    title_mapping = {
        'Mr': 1,
        'Miss': 2,
        'Mrs': 3,
        'Master': 4,
        'Rev': 6,
        'Dr': 5,
        'Col': 7,
        'Mlle': 8,
        'Ms': 2,
        'Major': 7,
        'Don': 9,
        'Countess': 10,
        'Mme': 8,
        'Jonkheer': 10,
        'Sir': 9,
        'Dona': 9,
        'Capt': 7,
        'Lady': 10,
    }
    for k, v in title_mapping.items():
        titles_train[titles_train == k] = v
    train['Title'] = titles_train
    for k, v in title_mapping.items():
        titles_test[titles_test == k] = v
    test['Title'] = titles_test
    # print(pd.value_counts(titles_train))
    traindata, trainlabel = train.drop('Survived', axis=1), train['Survived']  # train.pop('Survived')
    testdata = test
    print(traindata.shape, trainlabel.shape, testdata.shape)
    # (891, 11) (891,) (418, 11)
    return traindata, trainlabel, testdata



def ensemble_model_cv(traindata, trainlabel, testdata):
    # the algorithms we want to ensemble
    # we're using the more linear predictors for the logistic regression,
    # and everything with the gradient boosting classifier
    algorithms = [
        [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
         ['Pclass', 'Sex', 'Fare', 'FamilySize', 'Title', 'Age', 'Embarked', ]],
        [LogisticRegression(random_state=1),
         ['Pclass', 'Sex', 'Fare', 'FamilySize', 'Title', 'Age', 'Embarked', ]]
    ]
    # initialize the cross validation folds
    kf = KFold(n_splits=3)  # no shuffle keeps fold predictions aligned with trainlabel's order
    predictions = []
    for train_index, test_index in kf.split(traindata):
        # print(train_index, test_index)
        full_test_predictions = []
        for alg, predictors in algorithms:
            train_predictors = (traindata[predictors].iloc[train_index, :])
            train_target = trainlabel.iloc[train_index]
            alg.fit(train_predictors, train_target)
            test_predictions = alg.predict(traindata[predictors].iloc[test_index, :])
            full_test_predictions.append(test_predictions)
        # use a simple ensembling scheme
        test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
        test_predictions[test_predictions <= 0.5] = 0
        test_predictions[test_predictions >= 0.5] = 1
        predictions.append(test_predictions)
    predictions = np.concatenate(predictions, axis=0)
    # compute accuracy by comparing to the training labels
    accuracy = (predictions == trainlabel).mean()
    print(accuracy)

def ensemble_model_train(traindata, trainlabel, testdata):
    # the algorithms we want to ensemble
    # we're using the more linear predictors for the logistic regression,
    # and everything with the gradient boosting classifier
    predictors = ['Pclass', 'Sex', 'Fare', 'FamilySize', 'Title', 'Age', 'Embarked', ]
    traindata, testdata = traindata[predictors], testdata[predictors]
    trainSet, testSet, trainlabel, testlabel = train_test_split(traindata, trainlabel,
                                                                test_size=0.2, random_state=12345)
    # initialize our algorithm class
    clf1 = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
    clf2 = LogisticRegression(random_state=1)
    # training the algorithm using the predictors and target
    clf1.fit(trainSet, trainlabel)
    clf2.fit(trainSet, trainlabel)
    test_accuracy1 = clf1.score(testSet, testlabel) * 100
    test_accuracy2 = clf2.score(testSet, testlabel) * 100
    print(test_accuracy1, test_accuracy2)  # 78.77094972067039   80.44692737430168
    print("正確率為   %s%%" % ((test_accuracy1+test_accuracy2)/2))  # 正確率為   79.60893854748603%


if __name__ == '__main__':
    trainfile = 'data/titanic_train.csv'
    testfile = 'data/test.csv'
    traindata, trainlabel, testdata = load_dataset(trainfile, testfile)
    ensemble_model_train(traindata, trainlabel, testdata)

   Of course, when ensembling several algorithms you can also assign them different weights (a sketch follows below). For details see:

Python Machine Learning Notes: A Summary of Ensemble Learning

  I will not belabor it here.
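  A sketch of the weighted variant (the 3:1 weights are illustrative, not tuned):

# inside the cross-validation loop of ensemble_model_cv above, weight the
# gradient-boosting predictions three times as heavily as logistic regression
test_predictions = (full_test_predictions[0] * 3 + full_test_predictions[1]) / 4
test_predictions[test_predictions <= 0.5] = 0
test_predictions[test_predictions > 0.5] = 1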


  My GitHub: //github.com/LeBron-Jian/Kaggle-learn

 References:

//www.jianshu.com/p/ee91d8880bbd

//www.jianshu.com/p/e79a8c41cb1a

//www.cnblogs.com/python-1807/p/10645170.html

//blog.csdn.net/han_xiaoyang/article/details/49797143