Numeric Variables: Chi-Square Binning

  • December 25, 2019
  • Notes

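The chi-square statistic assesses whether the association between two nominal (categorical) variables is significant, so it can serve as a binning criterion for categorical variables. Once a numeric variable has been discretized, the chi-square statistic can serve as a binning criterion for it as well.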

Background

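First, let us review the definition of the chi-square statistic. Let X and Y be two categorical variables, where X takes the values high, medium, and low, and Y takes the values good and bad. The contingency table of observed counts for y and x is shown in the figure below: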

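Assume y and x are unrelated. The overall share of bad samples in y is 254/1831 = 13.87%, so under this null hypothesis every level of x should have the same 13.87% bad rate. Applying that rate to each level's total gives the contingency table of expected counts: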

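The chi-square statistic is then:

$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

where A is the observed count and T is the expected count, and the sum runs over all cells of the contingency table. The degrees of freedom of the chi-square distribution = (number of x levels - 1) * (number of y levels - 1) = (3-1)*(2-1) = 2.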

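The computed chi-square value is 45.41. From the chi-square distribution table, P(chi-square > 45.41) << 0.05, so there is good reason to reject the null hypothesis that y and x are unrelated; that is, y and x are significantly associated.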
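The calculation above can be sketched in a few lines of plain Python. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the table from this post, and `chi_square` is an illustrative helper name rather than part of the code that follows.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    observed: list of rows (one per x level), each a list of counts (one per y level).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, a in enumerate(row):
            # expected count T under the null hypothesis of no association
            t = row_totals[i] * col_totals[j] / n
            stat += (a - t) ** 2 / t
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = high, medium, low; columns = bad, good
table = [[30, 170], [40, 260], [30, 470]]
stat, dof = chi_square(table)
print(round(stat, 2), dof)  # 18.15 2
```

Here the overall bad rate is 100/1000 = 10%, so the expected bad counts are 20, 30, and 50, and the statistic sums the six cell contributions.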

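For a given chi-square distribution (that is, fixed degrees of freedom), a larger chi-square value means a smaller P. When binning a feature, the chi-square value alone can therefore decide which levels to merge: the pair with the smallest chi-square value is the pair whose good/bad distributions differ least.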
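To make the relationship between the chi-square value and P concrete, the sketch below uses two closed forms that hold for small degrees of freedom: with 2 degrees of freedom the tail probability is exp(-x/2), and with 1 degree of freedom (the case of two adjacent bins against good/bad) it is erfc(sqrt(x/2)). The function names are illustrative, and other degrees of freedom would need a statistics library.

```python
import math


def chi2_sf_df2(x):
    """P(chi-square > x) with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def chi2_sf_df1(x):
    """P(chi-square > x) with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# The tail probability shrinks as the chi-square value grows.
for x in (1.0, 5.99, 45.41):
    print(x, chi2_sf_df2(x))
```

For the value 45.41 quoted above, the tail probability with 2 degrees of freedom is on the order of 1e-10, far below 0.05.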

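As the discussion above shows, the chi-square statistic is defined on the joint distribution of two nominal variables. If one of the variables is numeric, it can first be discretized and the chi-square statistic computed afterwards. For example, suppose X is a numeric variable with values in [20, 80]. Split X into 4 segments, [20, 30), [30, 40), [40, 55), and [55, 80], and treat each segment as one level of X.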
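A minimal sketch of this discretization, using the four segments from the example. The helper name `to_segment` is an assumption made for illustration; the standard-library `bisect_right` locates the segment so that each left edge is inclusive.

```python
from bisect import bisect_right

# Interior cut points for [20, 30), [30, 40), [40, 55), [55, 80]
edges = [30, 40, 55]
labels = ["[20, 30)", "[30, 40)", "[40, 55)", "[55, 80]"]


def to_segment(x):
    """Return the label of the segment that x falls into."""
    if not 20 <= x <= 80:
        raise ValueError("x is outside [20, 80]")
    return labels[bisect_right(edges, x)]


print([to_segment(v) for v in (20, 29.9, 30, 54, 55, 80)])
```

Once every value carries a segment label, X can be cross-tabulated against Y exactly like a categorical variable.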

Binning Algorithm

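First split the numeric variable into a number of small segments, then repeatedly merge adjacent segments until a stopping condition is met. Because the values of a numeric variable are ordered, merging must preserve that order, so only neighboring segments may be merged. This is the main difference between binning numeric variables and binning categorical variables.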

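The algorithm is as follows:

(1) Split the numeric variable into SplitNum equal-width segments (for example, 100); this is the initial binning.

(2) For each segment, compute the total sample count, good sample count, bad sample count, sample share, and related statistics.

(3) Compute the chi-square value of every pair of adjacent segments, and merge the adjacent pair with the smallest chi-square value.

(4) Repeat steps (2) and (3) until the number of segments <= BinMax.

(5) Check whether every segment contains both bad and good samples. If a segment contains only bad samples or only good samples, merge it with whichever adjacent segment gives the smaller chi-square value.

(6) Repeat step (5) until every segment contains both bad and good samples.

(7) Check whether every segment's sample share is >= BinPcntMin. If a segment's share is < BinPcntMin, merge it with whichever adjacent segment gives the smaller chi-square value.

(8) Repeat step (7) until every segment's sample share is >= BinPcntMin.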
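Steps (3) and (4) form the core of the algorithm and can be sketched without pandas. This is a simplified illustration on hypothetical counts, not the implementation given later: each bin is a `(good, bad)` tuple, and `pair_chi2` and `merge_until` are illustrative names.

```python
def pair_chi2(a, b):
    """Chi-square of the 2x2 table formed by two adjacent bins, each a (good, bad) tuple."""
    n = sum(a) + sum(b)
    col = [a[0] + b[0], a[1] + b[1]]
    if 0 in col:  # the pair holds only good or only bad samples
        return 0.0
    stat = 0.0
    for row in (a, b):
        for j in range(2):
            t = sum(row) * col[j] / n  # expected count
            if t > 0:
                stat += (row[j] - t) ** 2 / t
    return stat


def merge_until(bins, bin_max):
    """Merge the adjacent pair with the smallest chi-square until len(bins) <= bin_max."""
    bins = [tuple(b) for b in bins]
    while len(bins) > bin_max:
        scores = [pair_chi2(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]  # replacing in place keeps the bins in value order
    return bins


# Hypothetical (good, bad) counts for five ordered segments
bins = [(90, 10), (88, 12), (70, 30), (72, 28), (50, 50)]
print(merge_until(bins, 3))  # [(178, 22), (142, 58), (50, 50)]
```

Steps (5) to (8) reuse the same merge operation. The only difference is that the segment to fix is chosen first (a pure segment, or the one with the smallest share) and is then merged with whichever neighbor gives the smaller chi-square value.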

Implementation

1. Import modules

    import pandas as pd
    import numpy as np
    from pandas import DataFrame, Series

2. Write the equal-width splitting functions for a data column

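    def splitCol(col, SplitNum, exclude_attri=None):
        # col: data column
        # SplitNum: number of equal-width segments
        # exclude_attri: special values that do not take part in the split
        # return: list of split-point values (the upper edge of each segment)

        # avoid a mutable default argument
        exclude_attri = [] if exclude_attri is None else exclude_attri
        col = list(set(col).difference(set(exclude_attri)))
        size = (max(col) - min(col)) / SplitNum
        splitPoint = [min(col) + i * size for i in range(1, SplitNum + 1)]
        # use a large sentinel as the last edge so that every value lands in a segment
        splitPoint[-1] = 100000000.0
        return splitPoint


    def assignSplit(x, splitPoint):
        # x: scalar value
        # splitPoint: list of split-point values
        # return: upper edge of the segment that x falls into

        if x <= splitPoint[0]:
            return splitPoint[0]
        else:
            for i in range(0, len(splitPoint) - 1):
                if splitPoint[i] < x <= splitPoint[i + 1]:
                    return splitPoint[i + 1]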

3. Write the function that computes a variable's total, good, and bad sample counts and its bad rate

    def BinBadRate(df, col, target, BadRateIndicator=True):
        # df: dataset for which the good/bad counts are computed
        # col: variable to group by
        # target: good/bad label (1 = bad, 0 = good)
        # BadRateIndicator: whether to also compute the bad rate

        group = df.groupby(col)[target].agg(['count', 'sum'])
        group.columns = ['total', 'bad']
        group = group.reset_index()
        group['good'] = group['total'] - group['bad']

        if BadRateIndicator:
            group['BadRate'] = group['bad'] / group['total']

        return group

4. Write the function that computes the chi-square value

    def calcChi2(df, total_col, bad_col, good_col):
        # df: data frame holding the total, bad, and good sample counts of each level
        # total_col: column with the total sample count
        # bad_col: column with the bad sample count
        # good_col: column with the good sample count

        df2 = df.copy()
        # overall bad rate and good rate
        total = df2[total_col].sum()
        badRate = df2[bad_col].sum() * 1.0 / total
        goodRate = df2[good_col].sum() * 1.0 / total

        # if all samples are good or all are bad, the chi-square value is 0
        if badRate in [0, 1]:
            return 0

        # expected bad and good counts of each level
        df2['bad_Exp'] = df2[total_col] * badRate
        df2['good_Exp'] = df2[total_col] * goodRate

        # chi-square value: sum of (observed - expected)^2 / expected
        badChi2 = (df2[bad_col] - df2['bad_Exp']) ** 2 / df2['bad_Exp']
        goodChi2 = (df2[good_col] - df2['good_Exp']) ** 2 / df2['good_Exp']
        chi2 = badChi2.sum() + goodChi2.sum()

        return chi2
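When only two adjacent bins are compared, the table is 2x2, and the expected-count computation can be cross-checked against the standard shortcut formula N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The sketch below is a standalone check in plain Python on hypothetical counts; `chi2_expected` and `chi2_2x2` are illustrative names, not functions from this post.

```python
def chi2_expected(bins):
    """Chi-square from observed vs. expected counts; bins is a list of (bad, good)."""
    n = sum(b + g for b, g in bins)
    bad_rate = sum(b for b, _ in bins) / n
    good_rate = 1.0 - bad_rate
    if bad_rate in (0.0, 1.0):
        return 0.0
    stat = 0.0
    for b, g in bins:
        total = b + g
        stat += (b - total * bad_rate) ** 2 / (total * bad_rate)
        stat += (g - total * good_rate) ** 2 / (total * good_rate)
    return stat


def chi2_2x2(a, b, c, d):
    """Shortcut formula for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical adjacent bins as (bad, good)
lo_bin, hi_bin = (12, 88), (30, 70)
print(chi2_expected([lo_bin, hi_bin]), chi2_2x2(*lo_bin, *hi_bin))
```

Both expressions give the same value (about 9.76 for these counts), which is a convenient unit test for any chi-square helper used in the merge loop.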

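5. Next, implement the single-variable binning function. It calls the functions above and returns the binning result for one variable. Following the algorithm described earlier, the function has three parts: (1) merge adjacent groups, (2) check that every group contains both good and bad samples, and (3) check that every group's share is at least BinPcntMin. spe_attri lists the special attribute values; in the initial binning each special value forms its own group. singleIndicator flags whether the special values take part in the subsequent merging: True means they do not, False means they do.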

    ###############  split the continuous variable using Chi2 value  ###############
    def ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri=None, singleIndicator=True):
        # df: data frame containing the target variable and the variable to bin
        # col: variable to bin
        # target: target variable, taking the value 0 or 1
        # BinMax: maximum number of bins
        # BinPcntMin: minimum share of each bin
        # SplitNum: number of initial equal-width segments
        # spe_attri: special attribute values
        # singleIndicator: True: each special value is its own group and is not merged;
        #                  False: special values take part in the chi-square merging

        spe_attri = [] if spe_attri is None else spe_attri

        def mergeBins(binBadRate, combineIndex):
            # merge row combineIndex into row combineIndex+1, then drop row combineIndex
            binBadRate = binBadRate.copy()
            for c in ['total', 'bad', 'good']:
                binBadRate.loc[combineIndex + 1, c] += binBadRate.loc[combineIndex, c]
            binBadRate = binBadRate.loc[binBadRate.index != combineIndex, :]
            return binBadRate.reset_index(drop=True)

        def chooseNeighbor(binBadRate, idx):
            # return the lower index of the pair (idx, neighbor) with the smaller chi-square
            if idx == 0:
                return 0
            if idx == binBadRate.shape[0] - 1:
                return binBadRate.shape[0] - 2
            chi2_1 = calcChi2(binBadRate.loc[idx - 1:idx, :], 'total', 'bad', 'good')
            chi2_2 = calcChi2(binBadRate.loc[idx:idx + 1, :], 'total', 'bad', 'good')
            return idx - 1 if chi2_1 < chi2_2 else idx

        # take copies so that adding the 'temp' column does not touch the caller's frame
        if len(spe_attri) >= 1:
            df1 = df.loc[df[col].isin(spe_attri)].copy()
            df2 = df.loc[~df[col].isin(spe_attri)].copy()
        else:
            df2 = df.copy()

        split_col = splitCol(df2[col], SplitNum)
        df2['temp'] = df2[col].map(lambda x: assignSplit(x, split_col))
        binBadRate = BinBadRate(df2, 'temp', target, BadRateIndicator=False)

        if len(spe_attri) >= 1 and singleIndicator == False:
            df1['temp'] = df1[col]
            binBadRate1 = BinBadRate(df1, 'temp', target, BadRateIndicator=False)
            binBadRate = pd.concat([binBadRate1, binBadRate])
            binBadRate.reset_index(inplace=True, drop=True)

        if len(spe_attri) >= 1 and singleIndicator == True:
            BinMax -= len(set(df1[col]))

        # 1. Merge adjacent groups until the number of bins <= BinMax
        while binBadRate.shape[0] > max(BinMax, 1):
            chi2List = [calcChi2(binBadRate.loc[i:i + 1, :], 'total', 'bad', 'good')
                        for i in range(binBadRate.shape[0] - 1)]
            binBadRate = mergeBins(binBadRate, chi2List.index(min(chi2List)))

        # 2. Check that every group contains both good and bad samples
        #    (stop at a single bin, where no further merge is possible)
        binBadRate['BadRate'] = binBadRate['bad'] / binBadRate['total']
        while binBadRate.shape[0] > 1 and (
                binBadRate['BadRate'].min() == 0 or binBadRate['BadRate'].max() == 1):
            index_01 = binBadRate.index[binBadRate['BadRate'].isin([0, 1])][0]
            binBadRate = mergeBins(binBadRate, chooseNeighbor(binBadRate, index_01))
            binBadRate['BadRate'] = binBadRate['bad'] / binBadRate['total']

        # 3. Check that every group's share is at least BinPcntMin
        binBadRate['Percent'] = binBadRate['total'] / binBadRate['total'].sum()
        while binBadRate.shape[0] > 1 and binBadRate['Percent'].min() < BinPcntMin:
            index_minPercent = binBadRate['Percent'].idxmin()
            binBadRate = mergeBins(binBadRate, chooseNeighbor(binBadRate, index_minPercent))
            binBadRate['Percent'] = binBadRate['total'] / binBadRate['total'].sum()

        binBadRate = binBadRate.drop(['BadRate', 'Percent'], axis=1)

        if len(spe_attri) >= 1 and singleIndicator == True:
            binBadRate_single = BinBadRate(df1, col, target, BadRateIndicator=False)
            binBadRate_single.columns = ['temp', 'total', 'bad', 'good']
            bindf = pd.concat([binBadRate_single, binBadRate])
            bindf.reset_index(drop=True, inplace=True)
        else:
            bindf = binBadRate

        bindf['Percent'] = bindf['total'] / bindf['total'].sum()
        bindf['BadRate'] = bindf['bad'] / bindf['total']

        # each bin covers (lower, upper]; the first lower bound is a large negative sentinel
        bindf0 = pd.DataFrame({'bin': range(1, bindf.shape[0] + 1)})
        lowerdf = pd.DataFrame({'lower': [-100000000] + bindf['temp'].tolist()[:-1]})
        upperdf = pd.DataFrame({'upper': bindf['temp']})
        bindf = pd.concat([bindf0, lowerdf, upperdf, bindf.drop('temp', axis=1)], axis=1)

        return bindf

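Take the numeric variable orgnum as an example. train_cont is the data frame containing the numeric variables, y is the target variable, and -1 represents a missing value of the numeric variable.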

(1) Set singleIndicator = True, so that -1 forms its own group.

orgnum_bin = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = True)

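The binning result is as follows. The missing value -1 is placed in its own bin; even though its share is 2.2%, below 5%, it is not merged with an adjacent group: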

(2) Set singleIndicator = False, so that the missing value -1 takes part in the merging and may be merged with other groups.

orgnum_bin2 = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = False)

The binning result is as follows; the missing value -1 has been merged with other values:

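6. Write the batch binning function, which bins all the numeric variables in one pass. The function returns a dictionary holding each variable's binning result.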

    ########### split the continuous variable using Chi2 value by batch ############
    def ContVarChi2BinBatch(df, key, target, BinMax, BinPcntMin, SplitNum, spe_attri=None, singleIndicator=True):
        # df: data frame
        # key: primary key
        # target: target variable, taking the value 0 or 1
        # return: dictionary holding the binning result of each variable

        spe_attri = [] if spe_attri is None else spe_attri
        df_Xvar = df.drop([key, target], axis=1)
        x_vars = df_Xvar.columns.tolist()

        dict_bin = {}
        for col in x_vars:
            dict_bin[col] = ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri, singleIndicator)

        return dict_bin

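Take the training sample train_cont as an example, with primary key cus_num and target variable y. The dictionary dict_train_cont holds the binning result of each numeric variable.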

dict_train_cont = ContVarChi2BinBatch(train_cont, 'cus_num', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri=[-1], singleIndicator = False)

7. Write the function that replaces variable values with bin numbers

    ################### replace variable values with bin numbers ###################
    def txtContVarBin(df, key, target, dict_bin, testIndicator=False):
        # df: data frame whose variable values are to be replaced with bin numbers
        # key: primary key
        # target: target variable
        # dict_bin: dictionary holding the binning result of each variable
        # testIndicator: whether df is a test data frame; True: also compute the share,
        #                bad rate, etc. of each bin on the test data and store them in a dictionary

        # copy so that adding columns does not modify a slice of the caller's frame
        df_bin = df[[key, target]].copy()
        df_Xvar = df.drop([key, target], axis=1)
        DictBin = {}
        for col in df_Xvar.columns:

            Bin = dict_bin[col]
            ls = np.full(len(df), np.nan)
            for i in range(len(Bin.bin)):
                # bin i covers the interval (lower, upper]
                mask = (df[col] > Bin.lower[i]) & (df[col] <= Bin.upper[i])
                ls[mask.to_numpy()] = Bin.bin[i]
            df_bin[col] = ls

            if testIndicator:
                col_bin_BadRate = BinBadRate(df_bin, col, target, BadRateIndicator=False)
                col_bin_BadRate['Percent'] = col_bin_BadRate['total'] / col_bin_BadRate['total'].sum()
                col_bin_BadRate['BadRate'] = col_bin_BadRate['bad'] / col_bin_BadRate['total']
                col_bin_BadRate.columns = ['bin', 'total', 'bad', 'good', 'Percent', 'BadRate']
                col_bin = Bin[['bin', 'lower', 'upper']].merge(col_bin_BadRate, on='bin', how='left')
                DictBin[col] = col_bin

        if testIndicator:
            return df_bin, DictBin

        return df_bin

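Earlier, batch binning of the training sample train_cont produced the dictionary dict_train_cont. That dictionary is now used to map the values of the numeric variables in train_cont to bin numbers. Setting testIndicator=False returns only the mapped training sample train_cont_bin: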

train_cont_bin = txtContVarBin(train_cont, 'cus_num', 'y', dict_train_cont, testIndicator=False)

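The test sample must also be mapped to bin numbers using the binning result from the training sample, which gives test_cont_bin. Setting testIndicator=True additionally returns dict_test_cont, the risk distribution of each variable in the test sample under the training-sample bins: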

test_cont_bin, dict_test_cont = txtContVarBin(test_cont, 'cus_num', 'y', dict_train_cont, testIndicator=True)
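The mapping step itself comes down to finding which interval (lower, upper] contains a value. As a minimal standalone sketch, assuming the upper edges are sorted in ascending order and using hypothetical edges, the standard-library `bisect_left` does this without a loop over bins; `assign_bin` is an illustrative name.

```python
from bisect import bisect_left


def assign_bin(x, uppers):
    """1-based bin number for x, where bin k covers (uppers[k-2], uppers[k-1]].

    Returns None when x lies above the last upper edge.
    """
    i = bisect_left(uppers, x)  # index of the first upper edge >= x
    return i + 1 if i < len(uppers) else None


# Hypothetical upper edges: a special-value bin at -1, then three numeric bins
uppers = [-1, 3, 7, 100000000.0]
print([assign_bin(v, uppers) for v in (-1, 0, 3, 3.5, 8)])  # [1, 2, 2, 3, 4]
```

Because the edges come from the training sample, applying the same lookup to test data keeps the two samples on identical bins.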

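This concludes the description of the chi-square binning algorithm for numeric variables and its implementation. The chi-square statistic can also be replaced by other statistics, such as Gini variance or entropy variance, as the binning criterion. One point deserves particular attention: during the initial equal-width split, abnormally large values in the variable cause almost all observations to fall into a very small number of groups, which loses much of the variable's information. Before binning, numeric variables should therefore be checked for outliers, and abnormally large values should be capped.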
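The effect of an extreme value on the initial equal-width split can be seen in a small sketch. The data are hypothetical, and capping at the 99th percentile (nearest-rank) is one possible choice rather than a rule from this post; `equal_width_counts` and `cap` are illustrative names.

```python
def equal_width_counts(values, n_bins):
    """Number of values falling into each of n_bins equal-width segments."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last segment
        counts[k] += 1
    return counts


def cap(values, pct=99):
    """Replace values above the pct-th percentile (nearest-rank) with that percentile."""
    s = sorted(values)
    rank = -(-pct * len(s) // 100)  # ceil(pct * n / 100) in integer arithmetic
    limit = s[max(rank, 1) - 1]
    return [min(v, limit) for v in values]


values = list(range(1, 100)) + [10_000]  # 99 ordinary values and one extreme value
raw = equal_width_counts(values, 10)
capped = equal_width_counts(cap(values), 10)
print(raw)     # nearly everything lands in the first segment
print(capped)  # the values spread across all ten segments
```

Without capping, 99 of the 100 values share a single initial segment, so the merge steps have almost nothing to work with.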