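Numerical Variables: Chi-Square Binning
- December 25, 2019
- Notes
The chi-square statistic tests whether the association between two nominal (categorical) variables is significant, so it can serve as a binning criterion for categorical variables. Once a numerical variable has been discretized, the chi-square statistic can serve as a binning criterion for it as well.
Background
First, let us review the definition of the chi-square statistic. Let X and Y be two categorical variables, where X takes the values high, medium, and low, and Y takes the values good and bad. The contingency table of observed counts for y and x is shown below: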

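Assume that y and x are unrelated (the null hypothesis). The overall share of bad cases in y is 254/1831 = 13.87%. Applying this rate to every level of x, as the null hypothesis implies, gives the contingency table of expected counts: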

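The chi-square statistic is then:

χ² = Σ (A - T)² / T

where A is the observed count and T is the expected count, and the sum runs over all cells of the table. The degrees of freedom of the chi-square distribution = (number of x levels - 1) * (number of y levels - 1) = (3-1)*(2-1) = 2.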
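The computed chi-square value is 45.41. The chi-square distribution table shows that P(chi-square > 45.41) << 0.05, so there is good reason to reject the null hypothesis that y and x are unrelated; that is, y and x are significantly associated.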
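As a rough illustration of the calculation above, the sketch below computes the expected counts, the chi-square statistic, and the tail probability for a 3x2 table. The counts are hypothetical (they are not the figures from the table in this post), and the function name `chi2_from_table` is made up for this example. The p-value line relies on the fact that, for exactly 2 degrees of freedom, P(chi-square > x) = exp(-x/2).

```python
import math

def chi2_from_table(observed):
    """Chi-square statistic for a table given as rows of [good, bad] counts."""
    total = sum(sum(row) for row in observed)
    # Column totals: overall good count and overall bad count.
    col_totals = [sum(row[j] for row in observed) for j in range(2)]
    chi2 = 0.0
    for row in observed:
        n = sum(row)
        for j in range(2):
            # Expected count under the null hypothesis (y and x unrelated).
            expected = n * col_totals[j] / total
            chi2 += (row[j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts for x = high, medium, low (NOT the data from this post).
observed = [[90, 10], [80, 20], [70, 30]]
chi2 = chi2_from_table(observed)
dof = (len(observed) - 1) * (2 - 1)
p_value = math.exp(-chi2 / 2)  # closed form, valid only when dof == 2
print(chi2, dof, p_value)
```

With the same closed form, the value 45.41 quoted above corresponds to `math.exp(-45.41 / 2)`, which is many orders of magnitude below 0.05.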
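For the same distribution (that is, the same degrees of freedom), the larger the chi-square value, the smaller P is. When binning a feature, the chi-square value alone can therefore decide which levels to merge: the pair with the smallest chi-square value is the pair whose good/bad distributions differ least.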
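Comparing two bins against two outcomes gives a 2x2 table, which by the degrees-of-freedom formula above has (2-1)*(2-1) = 1 degree of freedom. The sketch below is only meant to illustrate that, in this case, ranking by chi-square value is the same as ranking by P. It uses the identity P(chi-square > x) = erfc(sqrt(x/2)), which holds for 1 degree of freedom; the helper name `chi2_sf_1dof` is hypothetical.

```python
import math

def chi2_sf_1dof(x):
    """Tail probability P(chi-square > x) for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

values = [0.1, 1.0, 3.841, 10.0]
p_values = [chi2_sf_1dof(v) for v in values]
for v, p in zip(values, p_values):
    print(v, p)
```

The printed probabilities shrink as the chi-square value grows, so the smallest chi-square value always corresponds to the largest P.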
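As the discussion above shows, the chi-square statistic is defined on the joint distribution of two nominal variables. If one of the variables is numerical, it can be discretized first and the chi-square statistic computed afterwards. For example, suppose X is a numerical variable with values in [20, 80]. X can be split into four segments, [20, 30), [30, 40), [40, 55), and [55, 80], and each segment treated as one level of X.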
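A minimal sketch of that discretization, using the four segments from the example. The function name `segment_of` and the choice of `bisect_right` are assumptions made for illustration; the code later in this post uses its own helper functions instead.

```python
from bisect import bisect_right

# Interior cut points of [20, 30), [30, 40), [40, 55), [55, 80].
inner_edges = [30, 40, 55]
labels = ["[20, 30)", "[30, 40)", "[40, 55)", "[55, 80]"]

def segment_of(x):
    """Return the label of the segment that contains x."""
    if not 20 <= x <= 80:
        raise ValueError("x is outside [20, 80]")
    # bisect_right sends a value equal to a cut point into the segment
    # on its right, which matches the left-closed intervals above.
    return labels[bisect_right(inner_edges, x)]

print([segment_of(x) for x in (20, 30, 54.9, 55, 80)])
```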
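Binning Algorithm
First split the numerical variable into many small segments, then repeatedly merge adjacent segments until a stopping condition is met. Because the values of a numerical variable are ordered, only adjacent segments may be merged so that the ordering is preserved. This is the main difference between binning a numerical variable and binning a categorical variable.
The algorithm is as follows:
(1) Split the numerical variable into SplitNum equal-width segments (for example, 100); this is the initial binning.
(2) For each segment, compute statistics such as the total sample count, the good count, the bad count, and the sample share.
(3) Compute the chi-square value of each pair of adjacent segments and merge the adjacent pair with the smallest chi-square value.
(4) Repeat steps (2) and (3) until the number of segments <= BinMax.
(5) Check whether every segment contains both bad and good samples. If a segment contains only bad samples or only good samples, merge it with whichever adjacent segment gives the smaller chi-square value.
(6) Repeat step (5) until every segment contains both bad and good samples.
(7) Check whether every segment's sample share is >= BinPcntMin. If a segment's share is < BinPcntMin, merge it with whichever adjacent segment gives the smaller chi-square value.
(8) Repeat step (7) until every segment's sample share is >= BinPcntMin.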
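The core of steps (3) and (4) can be sketched on plain lists before reading the full pandas implementation. This is a simplified illustration under stated assumptions: each segment is a hypothetical `(good, bad)` tuple, no segment is empty, and the names `chi2_pair` and `merge_until` are invented here. It omits the purity and minimum-share checks of steps (5) to (8).

```python
def chi2_pair(a, b):
    """Chi-square value of two adjacent segments, each a (good, bad) tuple."""
    total = a[0] + a[1] + b[0] + b[1]
    good_rate = (a[0] + b[0]) / total
    bad_rate = (a[1] + b[1]) / total
    if bad_rate in (0, 1):
        # Both segments are all good or all bad: treat them as identical.
        return 0.0
    chi2 = 0.0
    for good, bad in (a, b):
        n = good + bad
        chi2 += (good - n * good_rate) ** 2 / (n * good_rate)
        chi2 += (bad - n * bad_rate) ** 2 / (n * bad_rate)
    return chi2

def merge_until(segments, bin_max):
    """Merge the most similar adjacent pair until at most bin_max remain."""
    segments = list(segments)
    while len(segments) > bin_max:
        scores = [chi2_pair(segments[i], segments[i + 1])
                  for i in range(len(segments) - 1)]
        i = scores.index(min(scores))
        merged = (segments[i][0] + segments[i + 1][0],
                  segments[i][1] + segments[i + 1][1])
        segments[i:i + 2] = [merged]
    return segments

# Hypothetical counts, ordered by the value of the numerical variable.
segments = [(90, 10), (88, 12), (50, 50), (48, 52)]
result = merge_until(segments, bin_max=2)
print(result)
```

In this toy input the two low-risk segments and the two high-risk segments are merged with each other, because those pairs have the smallest chi-square values.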
Implementation
1. Load the modules
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
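2. Write the functions that split a data column into equal-width segments
def splitCol(col, SplitNum, exclude_attri=[]):
    # col: data column
    # SplitNum: number of equal-width segments
    # exclude_attri: special values excluded from the split
    # return: list of split points (upper edge of each segment)
    col = list(col)
    col = list(set(col).difference(set(exclude_attri)))
    size = (max(col) - min(col))/SplitNum
    splitPoint = [min(col)+i*size for i in range(1, SplitNum+1)]
    # Replace the last split point with a large sentinel so that the
    # top segment has no effective upper limit
    splitPoint[-1] = 100000000.0
    return splitPoint

def assignSplit(x, splitPoint):
    # x: scalar value
    # splitPoint: list of split points
    # return: upper edge of the segment that x falls into
    if x <= splitPoint[0]:
        return splitPoint[0]
    else:
        for i in range(0, len(splitPoint)-1):
            if splitPoint[i] < x <= splitPoint[i+1]:
                return splitPoint[i+1]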
3. Write the function that computes the total, good, and bad sample counts and the bad rate of a variable
def BinBadRate(df, col, target, BadRateIndicator = True):
    # df: data set for which the counts are computed
    # col: variable for which the counts are computed
    # target: good/bad label (1 = bad, 0 = good)
    # BadRateIndicator: whether to also compute the bad rate
    group = df.groupby([col])[target].agg(['count', 'sum'])
    group.columns = ['total', 'bad']
    group.reset_index(inplace=True)
    group['good'] = group['total'] - group['bad']
    if BadRateIndicator:
        group['BadRate'] = group['bad']/group['total']
    return group
4. Write the function that computes the chi-square value
def calcChi2(df, total_col, bad_col, good_col):
    # df: data frame holding the total, bad, and good counts of each level
    # total_col: column with the total sample count
    # bad_col: column with the bad sample count
    # good_col: column with the good sample count
    df2 = df.copy()
    # Overall bad rate and good rate
    badRate = sum(df2[bad_col])*1.0/sum(df2[total_col])
    goodRate = sum(df2[good_col])*1.0/sum(df2[total_col])
    # If all samples are good or all are bad, the chi-square value is 0
    if badRate in [0,1]:
        return 0
    # Expected bad and good counts of each level
    df2['bad_Exp'] = df2[total_col].map(lambda x: x*badRate)
    df2['good_Exp'] = df2[total_col].map(lambda x: x*goodRate)
    # Chi-square value: sum of (observed - expected)^2 / expected
    badzip = zip(df2['bad_Exp'], df2[bad_col])
    goodzip = zip(df2['good_Exp'], df2[good_col])
    badChi2 = [(elem[1]-elem[0])**2/elem[0] for elem in badzip]
    goodChi2 = [(elem[1]-elem[0])**2/elem[0] for elem in goodzip]
    chi2 = sum(badChi2) + sum(goodChi2)
    return chi2
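5. Next, implement the single-variable binning function, which calls the functions above and returns the binning result for one variable. Following the algorithm described earlier, the function has three parts: (1) merge adjacent bins, (2) check that every bin contains both good and bad samples, and (3) check that every bin's share is at least BinPcntMin. spe_attri lists the special values; in the initial binning, each special value forms a bin of its own. singleIndicator flags whether the special values take part in the subsequent merging: True means they do not, False means they do.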
############### split the continuous variable using Chi2 value ###############
def mergeBins(binBadRate, combineIndex):
    # Merge row combineIndex into row combineIndex+1, then drop row combineIndex.
    # .loc is used instead of chained indexing so the assignment always
    # writes to binBadRate itself.
    for c in ['total', 'bad', 'good']:
        binBadRate.loc[combineIndex+1, c] = binBadRate.loc[combineIndex:combineIndex+1, c].sum()
    binBadRate = binBadRate.loc[binBadRate.index != combineIndex, :]
    return binBadRate.reset_index(drop=True)

def pickMergeIndex(binBadRate, idx):
    # Decide which neighbour bin idx is merged with and return the index of
    # the lower bin of the pair. The first and last bins have one neighbour;
    # any other bin takes the neighbour with the smaller chi-square value.
    if idx == 0:
        return 0
    if idx == binBadRate.shape[0]-1:
        return binBadRate.shape[0]-2
    chi2_1 = calcChi2(binBadRate.loc[idx-1:idx, :], 'total', 'bad', 'good')
    chi2_2 = calcChi2(binBadRate.loc[idx:idx+1, :], 'total', 'bad', 'good')
    return idx-1 if chi2_1 < chi2_2 else idx

def ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri = [], singleIndicator = True):
    # df: data frame containing the target variable and the variable to bin
    # col: variable to bin
    # target: target variable, 0 or 1
    # BinMax: maximum number of bins
    # BinPcntMin: minimum share of each bin
    # SplitNum: number of equal-width segments in the initial split
    # spe_attri: special values
    # singleIndicator: True: each special value is a separate bin and is not merged;
    #                  False: the special values take part in the chi-square merging
    if len(spe_attri)>=1:
        df1 = df.loc[df[col].isin(spe_attri)].copy()
        df2 = df.loc[~df[col].isin(spe_attri)].copy()
    else:
        df2 = df.copy()
    split_col = splitCol(df2[col], SplitNum)
    df2['temp'] = df2[col].map(lambda x: assignSplit(x, split_col))
    binBadRate = BinBadRate(df2, 'temp', target, BadRateIndicator = False)
    if len(spe_attri)>=1 and singleIndicator==False:
        df1['temp'] = df1[col]
        binBadRate1 = BinBadRate(df1, 'temp', target, BadRateIndicator = False)
        binBadRate = pd.concat([binBadRate1, binBadRate])
        binBadRate.reset_index(inplace=True, drop=True)
    if len(spe_attri)>=1 and singleIndicator==True:
        BinMax -= len(set(df1[col]))

    # 1. Merge the most similar adjacent pair until the number of bins <= BinMax
    while binBadRate.shape[0] > BinMax:
        chi2List = []
        for i in range(0, binBadRate.shape[0]-1):
            temp_binBadRate = binBadRate.loc[i:i+1, :]
            chi2 = calcChi2(temp_binBadRate, 'total', 'bad', 'good')
            chi2List.append(chi2)
        combineIndex = chi2List.index(min(chi2List))
        binBadRate = mergeBins(binBadRate, combineIndex)

    # 2. Check that every bin contains both good and bad samples
    binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
    minBadRate, maxBadRate = min(binBadRate['BadRate']), max(binBadRate['BadRate'])
    while minBadRate == 0 or maxBadRate == 1:
        BadRate_01 = binBadRate['temp'][binBadRate['BadRate'].isin([0, 1])]
        index_01 = BadRate_01.index[0]
        combineIndex = pickMergeIndex(binBadRate, index_01)
        binBadRate = mergeBins(binBadRate, combineIndex)
        binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
        minBadRate, maxBadRate = min(binBadRate['BadRate']), max(binBadRate['BadRate'])

    # 3. Check that the share of every bin is >= BinPcntMin
    binBadRate['Percent'] = binBadRate['total']/sum(binBadRate['total'])
    minPercent = min(binBadRate['Percent'])
    while minPercent < BinPcntMin:
        minPercent_temp = binBadRate['temp'][binBadRate['Percent']==minPercent]
        index_minPercent = minPercent_temp.index[0]
        combineIndex = pickMergeIndex(binBadRate, index_minPercent)
        binBadRate = mergeBins(binBadRate, combineIndex)
        binBadRate['Percent'] = binBadRate['total']/sum(binBadRate['total'])
        minPercent = min(binBadRate['Percent'])

    binBadRate = binBadRate.drop(['BadRate', 'Percent'], axis=1)
    if len(spe_attri)>=1 and singleIndicator == True:
        # Put the special values back as separate bins ahead of the merged bins
        binBadRate_single = BinBadRate(df1, col, target, BadRateIndicator = False)
        binBadRate_single.columns = ['temp', 'total', 'bad', 'good']
        bindf = pd.concat([binBadRate_single, binBadRate])
        bindf.reset_index(drop=True, inplace=True)
    else:
        bindf = binBadRate
    bindf['Percent'] = bindf['total']/sum(bindf['total'])
    bindf['BadRate'] = bindf['bad']/bindf['total']
    # Each bin covers (lower, upper]; the first lower bound is a large negative sentinel
    bindf0 = DataFrame({'bin': range(1, bindf.shape[0]+1)})
    lowerdf = DataFrame({'lower': [-100000000] + bindf['temp'].tolist()[:-1]})
    upperdf = DataFrame({'upper': bindf['temp']})
    bindf = pd.concat([bindf0, lowerdf, upperdf, bindf.drop('temp', axis=1)], axis=1)
    return bindf
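Take the numerical variable orgnum as an example. train_cont is the data frame containing the numerical variables, y is the target variable, and -1 denotes a missing value of the numerical variable.
(1) Set singleIndicator = True, so that -1 forms a bin of its own.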
orgnum_bin = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = True)
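The binning result is as follows. The missing value -1 is placed in its own bin; even though its share is 2.2%, which is below 5%, it is not merged with an adjacent bin: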

(2) Set singleIndicator = False, so that the missing value -1 takes part in the merging and may be merged with other bins.
orgnum_bin2 = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = False)
The binning result is as follows; the missing value -1 has been merged with other values:

6. Write the batch binning function, which bins every numerical variable that needs binning and returns a dictionary holding the binning result of each variable.
########### split the continuous variable using Chi2 value by batch ############
def ContVarChi2BinBatch(df, key, target, BinMax, BinPcntMin, SplitNum, spe_attri = [], singleIndicator = True):
    # df: data frame
    # key: primary key
    # target: target variable, 0 or 1
    # return: dictionary holding the binning result of each variable
    df_Xvar = df.drop([key, target], axis=1)
    x_vars = df_Xvar.columns.tolist()
    dict_bin = {}
    for col in x_vars:
        dict_bin[col] = ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri, singleIndicator)
    return dict_bin
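Take the training sample train_cont as an example. Its primary key is cus_num and its target variable is y; the dictionary dict_train_cont holds the binning result of each numerical variable.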
dict_train_cont = ContVarChi2BinBatch(train_cont, 'cus_num', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri=[-1], singleIndicator = False)
7. Write the function that replaces variable values with bin numbers
################ replace variable values with bin numbers ################
def txtContVarBin(df, key, target, dict_bin, testIndicator=False):
    # df: data frame whose variable values are replaced with bin numbers
    # key: primary key
    # target: target variable
    # dict_bin: dictionary holding the binning result of each variable
    # testIndicator: whether df is a test data frame; True: also compute the
    #                share, bad rate, etc. of each bin on df and store them in a dictionary
    df_bin = df[[key, target]].copy()  # copy so new columns are not added to a slice of df
    df_Xvar = df.drop([key, target], axis=1)
    DictBin = {}
    for col in df_Xvar.columns:
        Bin = dict_bin[col]
        ls = Series([np.nan] * len(df))
        for i in range(len(Bin.bin)):
            # A value belongs to bin i when lower < value <= upper
            ls[((df[col] > Bin.lower[i]) & (df[col] <= Bin.upper[i])).tolist()] = Bin.bin[i]
        df_bin[col] = ls.tolist()
        if testIndicator:
            col_bin_BadRate = BinBadRate(df_bin, col, target, BadRateIndicator = False)
            col_bin_BadRate['Percent'] = col_bin_BadRate['total']/sum(col_bin_BadRate['total'])
            col_bin_BadRate['BadRate'] = col_bin_BadRate['bad']/col_bin_BadRate['total']
            col_bin_BadRate.columns = ['bin', 'total', 'bad', 'good', 'Percent', 'BadRate']
            col_bin = Bin[['bin', 'lower', 'upper']].merge(col_bin_BadRate, on='bin', how='left')
            DictBin[col] = col_bin
    if testIndicator:
        return df_bin, DictBin
    return df_bin
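Batch binning of the training sample train_cont above produced the dictionary dict_train_cont. That dictionary is now used to map the values of the numerical variables in train_cont to bin numbers. With testIndicator=False, only the mapped training sample train_cont_bin is returned: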
train_cont_bin = txtContVarBin(train_cont, 'cus_num', 'y', dict_train_cont, testIndicator=False)
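The test sample must also be mapped to bin numbers using the binning result from the training sample, giving test_cont_bin. With testIndicator=True, the function additionally returns dict_test_cont, the risk distribution of each test variable under the training-sample bins: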
test_cont_bin, dict_test_cont = txtContVarBin(test_cont, 'cus_num', 'y', dict_train_cont, testIndicator=True)
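The mapping rule itself, lower < value <= upper with edges learned on the training sample, can be shown in a few lines of standard-library code. This is an illustrative sketch only: the edges and test values are hypothetical, and `to_bin` is not part of the code in this post.

```python
from bisect import bisect_left

def to_bin(value, upper_edges):
    """Return the 1-based bin number with lower < value <= upper."""
    # bisect_left finds the first upper edge that is >= value, so a value
    # equal to an edge stays in the bin that the edge closes.
    return bisect_left(upper_edges, value) + 1

# Hypothetical upper edges learned on a training sample; the last one is
# the same large sentinel used by the splitting function in this post.
train_upper_edges = [30.0, 50.0, 100000000.0]

test_values = [12, 30, 30.5, 50, 75]
test_bins = [to_bin(v, train_upper_edges) for v in test_values]
print(test_bins)
```

Because the edges come from the training sample only, the test sample never influences where the cut points fall. Unlike the function above, this sketch does not guard against values beyond the sentinels.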

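That concludes the introduction to the chi-square binning algorithm for numerical variables and its implementation. The chi-square statistic can also be replaced with other statistics, such as Gini variance or entropy variance, as the binning criterion. One point needs particular attention: in the initial equal-width split, an abnormally large value stretches the range so that almost all values fall into a very small number of segments, which loses much of the variable's information. Before binning, therefore, check the numerical variables for outliers and cap abnormally large values.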
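The effect of a single extreme value on an equal-width split, and of capping it, can be seen in a small experiment. The sketch below makes several assumptions: the data are synthetic, the cap is taken at a simple rank-based 99th percentile, and the names `cap_upper` and `occupied_bins` are invented for this illustration. It also uses left-closed segments for brevity, unlike the right-closed segments in the code above.

```python
def cap_upper(values, q=0.99):
    """Replace values above the q-th quantile (simple rank rule) with that quantile."""
    ordered = sorted(values)
    cap = ordered[int(q * (len(ordered) - 1))]
    return [min(v, cap) for v in values]

def occupied_bins(values, n_bins):
    """Count how many of n_bins equal-width segments contain at least one value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    # The maximum value would land at index n_bins, so clamp it into the last segment.
    return len({min(int((v - lo) / width), n_bins - 1) for v in values})

# Synthetic data: 99 ordinary values and one abnormally large value.
raw = list(range(1, 100)) + [10000]
capped = cap_upper(raw, q=0.99)

print(occupied_bins(raw, 10), occupied_bins(capped, 10))  # prints: 2 10
```

Before capping, the 99 ordinary values share a single segment and the outlier sits alone in the last one; after capping, all 10 segments are used.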