Handling NaN Gracefully in Python

January 9, 2020
Notes

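Background

Missing data is hard to avoid: records get lost, or respondents decline to disclose certain information when the data is collected. The result is a lot of NaN (Not a Number) values. Most models fail to run on data that contains NaN, so these values have to be dealt with first.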

Methods

1. Drop them outright

Take the DataFrame below. First, use df.isnull().sum() to check how many NaN values each column contains:

    import pandas as pd

    df = pd.DataFrame({'a': [None, 1, 2, 3],
                       'b': [4, None, None, 6],
                       'c': [1, 2, 1, 2],
                       'd': [7, 7, 9, 2]})
    print(df)
    # Count the NaN values in each column
    print(df.isnull().sum())

Output:

         a    b  c  d
    0  NaN  4.0  1  7
    1  1.0  NaN  2  7
    2  2.0  NaN  1  9
    3  3.0  6.0  2  2
    a    1
    b    2
    c    0
    d    0
    dtype: int64
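Beyond raw counts, the share of missing values per column is often what decides whether a column is worth keeping. A minimal sketch, assuming the same df as above; the names missing_share and rows_with_nan are introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3],
                   'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2],
                   'd': [7, 7, 9, 2]})

# isna() is an alias of isnull(). The mean of a boolean mask is the
# fraction of True values, i.e. the share of missing entries per column.
missing_share = df.isna().mean()
print(missing_share)

# Select the rows that contain at least one NaN
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```

Here column a is 25% missing and column b is 50% missing, and three of the four rows contain at least one NaN.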

Drop the columns that contain NaN:

    # axis=1 drops columns; the default axis=0 would drop rows instead
    data_without_NaN = df.dropna(axis=1)
    print(data_without_NaN)

Output:

       c  d
    0  1  7
    1  2  7
    2  1  9
    3  2  2
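Dropping every column that has any NaN is the bluntest option. As a rough sketch of some less drastic variants, the snippet below uses the axis, thresh, and subset parameters of pandas' dropna on the same df; the threshold of 3 and the choice of column a are arbitrary values picked for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3],
                   'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2],
                   'd': [7, 7, 9, 2]})

# Drop rows (axis=0) that contain any NaN, keeping all columns
rows_kept = df.dropna(axis=0)

# Keep only the columns that have at least 3 non-NaN values
cols_kept = df.dropna(axis=1, thresh=3)

# Drop a row only when column 'a' is missing in it
subset_kept = df.dropna(subset=['a'])

print(rows_kept)
print(cols_kept)
print(subset_kept)
```

With this data, dropping rows leaves a single row, while the threshold keeps column a (one NaN) and discards only column b (two NaN values).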

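2. Impute the missing values

Dropping columns outright often throws away a lot of valuable data, which hurts model training. An alternative is to replace each NaN with a number, though clearly not an arbitrary one. Some people like to fill in 0, which often does more harm than good. Take a survey on the relationship between salary and education level: some respondents do not want to disclose their salary, and setting those NaN values to 0 would obviously distort the data. For this purpose scikit-learn provides imputation; the algorithm behind it is not examined in detail here. The code: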

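    # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
    # SimpleImputer in sklearn.impute is its replacement
    from sklearn.impute import SimpleImputer

    my_imputer = SimpleImputer()
    data_imputed = my_imputer.fit_transform(df)
    print(type(data_imputed))
    # Convert the array back into a DataFrame
    df_data_imputed = pd.DataFrame(data_imputed, columns=df.columns)
    print(df_data_imputed)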

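Output:

    <class 'numpy.ndarray'>
         a    b    c    d
    0  2.0  4.0  1.0  7.0
    1  1.0  5.0  2.0  7.0
    2  2.0  5.0  1.0  9.0
    3  3.0  6.0  2.0  2.0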

Each NaN has been replaced with the mean of its column (2.0 for a, 5.0 for b), which is the imputer's default strategy.
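The mean is only one possible fill value. A pandas-only sketch of the same idea follows, assuming the df from above; the median is shown as an alternative that is less sensitive to outliers, which is a choice made here for illustration and not something the imputer does by default:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3],
                   'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2],
                   'd': [7, 7, 9, 2]})

# df.mean() returns one value per column; fillna matches them by column name
filled_mean = df.fillna(df.mean())

# Same pattern with the per-column median as the fill value
filled_median = df.fillna(df.median())

print(filled_mean)
print(filled_median)
```

In this tiny dataset the mean and median of each column happen to coincide, so both results are identical; on skewed data such as salaries they would usually differ.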

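3. Extended imputation

The idea behind this extension is that a NaN carries information in its own right. For example, do the respondents who decline to state their salary have something in common? In that case plain imputation is not enough: add a few columns that record where the NaN values were, and treat them as new data: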

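    from sklearn.impute import SimpleImputer

    # Work on a copy so the original stays untouched
    new_data = df.copy()

    # Add a boolean (True/False) column for each column that contains NaN
    cols_with_missing = [col for col in new_data.columns
                         if new_data[col].isnull().any()]
    for col in cols_with_missing:
        new_data[col + '_was_NaN'] = new_data[col].isnull()
    print(new_data)

    # Imputation (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
    my_imputer = SimpleImputer()
    # Cast to float so the boolean flags are passed in as 1.0/0.0
    new_data_imputed = my_imputer.fit_transform(new_data.astype(float))
    # Convert the array back into a DataFrame
    df_new_data_imputed = pd.DataFrame(new_data_imputed, columns=new_data.columns)
    print(df_new_data_imputed)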

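Output:

         a    b  c  d  a_was_NaN  b_was_NaN
    0  NaN  4.0  1  7       True      False
    1  1.0  NaN  2  7      False       True
    2  2.0  NaN  1  9      False       True
    3  3.0  6.0  2  2      False      False
         a    b    c    d  a_was_NaN  b_was_NaN
    0  2.0  4.0  1.0  7.0        1.0        0.0
    1  1.0  5.0  2.0  7.0        0.0        1.0
    2  2.0  5.0  1.0  9.0        0.0        1.0
    3  3.0  6.0  2.0  2.0        0.0        0.0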
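Recent scikit-learn versions can produce the flag columns themselves. The sketch below assumes scikit-learn 0.21 or later, where SimpleImputer accepts add_indicator=True; the '_was_NaN' suffix simply mirrors the naming used above, and result and flag_cols are names introduced here for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [None, 1, 2, 3],
                   'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2],
                   'd': [7, 7, 9, 2]})

# add_indicator=True appends one 0/1 column for every feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy='mean', add_indicator=True)
imputed = imputer.fit_transform(df)

# indicator_.features_ holds the positions of the features with missing
# values; the indicator columns come after the original features, in that order
flag_cols = [col + '_was_NaN' for col in df.columns[imputer.indicator_.features_]]
result = pd.DataFrame(imputed, columns=list(df.columns) + flag_cols)
print(result)
```

One practical difference from the manual version: the fitted imputer remembers which columns had missing values, so calling transform on new data later yields the same set of flag columns.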
