Numerical Variables: Chi-Square Binning
- December 25, 2019
- Notes
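The chi-square statistic assesses whether the association between two nominal (categorical) variables is significant, so it can serve as a binning criterion for categorical variables. Once a numerical variable has been discretized, the same statistic can serve as a binning criterion for numerical variables as well.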
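Background
First, a recap of the definition of the chi-square statistic. Let X and Y be two categorical variables, where X takes the values high, medium, and low, and Y takes the values good and bad. The contingency table of observed counts for y and x is shown in the figure below:
Assume that y and x are unrelated (the null hypothesis). The overall share of bad in y is 254/1831 = 13.87%. Under the null hypothesis every level of x is expected to have this same bad rate, which gives the contingency table of expected counts: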
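The chi-square statistic is then:
$$\chi^2 = \sum \frac{(A - T)^2}{T}$$
where A is an observed count, T is the corresponding expected count, and the sum runs over all cells of the contingency table. The degrees of freedom of the chi-square distribution are (number of x levels - 1) * (number of y levels - 1) = (3-1)*(2-1) = 2.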
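The computed chi-square value is 45.41. From the chi-square distribution table, P(chi-square > 45.41) << 0.05, so there is good reason to reject the null hypothesis that y and x are unrelated; that is, the association between y and x is statistically significant.
For a chi-square distribution with a fixed number of degrees of freedom, a larger chi-square value means a smaller P. When binning a feature, the chi-square values can therefore be compared directly to decide which levels to merge.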
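To make the calculation concrete, here is a minimal sketch that computes the expected counts, the chi-square value, and the degrees of freedom from a table of observed counts. The counts below are hypothetical and are not the ones from the example above (the original table is not reproduced here); the function name chi2_from_table is also just an illustrative choice.

```python
import numpy as np

def chi2_from_table(observed):
    """Return (chi2, dof, expected) for a table whose rows are levels of x
    and whose columns are levels of y."""
    observed = np.asarray(observed, dtype=float)
    row_tot = observed.sum(axis=1, keepdims=True)   # totals per level of x
    col_tot = observed.sum(axis=0, keepdims=True)   # totals per level of y
    # Expected count T under independence: row total * column total / grand total
    expected = row_tot @ col_tot / observed.sum()
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return float(chi2), dof, expected

# Hypothetical observed counts: rows = high / medium / low, columns = bad / good
observed = [[10, 20],
            [20, 20],
            [30, 20]]
chi2, dof, expected = chi2_from_table(observed)
print(chi2, dof)
```

The p-value can then be read from a chi-square table for the returned degrees of freedom.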
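As the discussion above shows, the chi-square statistic is defined on the joint distribution of two nominal variables. If one of the variables is numerical, it can be discretized first and the chi-square statistic computed afterwards. For example, suppose X is a numerical variable with values in [20, 80]. X can be split into 4 segments, [20, 30), [30, 40), [40, 55), and [55, 80], and each segment is then treated as one level of X.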
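Binning Algorithm
First split the numerical variable into a number of small segments, then repeatedly merge adjacent segments until a stopping condition is met. Because the values of a numerical variable are ordered, merging must preserve the order of the segments, so only adjacent segments can be merged. This is the biggest difference between binning a numerical variable and binning a categorical variable.
The algorithm is as follows:
(1) Split the numerical variable into SplitNum equal-width segments (for example, 100 segments); this is the initial binning;
(2) For each segment, compute statistics such as the total number of samples, the number of good samples, the number of bad samples, and the share of samples;
(3) Compute the chi-square value of each pair of adjacent segments, and merge the pair with the smallest chi-square value;
(4) Repeat steps (2) and (3) until the number of segments is <= BinMax;
(5) Check whether every segment contains both bad and good samples; if a segment contains only bad samples or only good samples, merge it with whichever adjacent segment gives the smaller chi-square value;
(6) Repeat step (5) until every segment contains both bad and good samples;
(7) Check whether the sample share of every segment is >= BinPcntMin; if a segment's share is < BinPcntMin, merge it with whichever adjacent segment gives the smaller chi-square value;
(8) Repeat step (7) until the sample share of every segment is >= BinPcntMin.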
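The core loop of steps (2) to (4) can be sketched on plain lists of per-segment bad and good counts. This is a simplified illustration, not the full implementation: the names pair_chi2 and merge_until are hypothetical, every segment is assumed to be non-empty, and the checks in steps (5) to (8) are left out.

```python
import numpy as np

def pair_chi2(bad_pair, good_pair):
    """Chi-square value of two adjacent segments (a 2 x 2 table)."""
    obs = np.array([bad_pair, good_pair], dtype=float).T  # rows = segments
    col_tot = obs.sum(axis=0)                             # total bad, total good
    if col_tot[0] == 0 or col_tot[1] == 0:
        return 0.0                # only good or only bad samples in the pair
    expected = np.outer(obs.sum(axis=1), col_tot) / obs.sum()
    return float(((obs - expected) ** 2 / expected).sum())

def merge_until(bad, good, bin_max):
    """Merge the adjacent pair with the smallest chi-square value until at
    most bin_max segments remain. Assumes every segment is non-empty."""
    bad, good = list(bad), list(good)
    while len(bad) > bin_max:
        chi2s = [pair_chi2(bad[i:i + 2], good[i:i + 2])
                 for i in range(len(bad) - 1)]
        i = int(np.argmin(chi2s))              # first pair with the minimum
        bad[i:i + 2] = [bad[i] + bad[i + 1]]   # replace the pair by its sum
        good[i:i + 2] = [good[i] + good[i + 1]]
    return bad, good

# Hypothetical counts: the first two and the last two segments share a bad rate
merged_bad, merged_good = merge_until([1, 1, 10, 10], [9, 9, 10, 10], bin_max=2)
print(merged_bad, merged_good)
```

Segments with similar bad rates have a small chi-square value and are merged first, which is why the statistic works as a merge criterion.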
Implementation
1. Load the modules
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
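2. Write the functions for equal-width splitting of a data column
def splitCol(col, SplitNum, exclude_attri=None):
    # col: data column
    # SplitNum: number of equal-width segments
    # exclude_attri: special values that do not take part in the split
    # return: list of split points (the upper edge of each segment)
    exclude_attri = [] if exclude_attri is None else exclude_attri
    col = list(set(col).difference(set(exclude_attri)))
    size = (max(col) - min(col)) / SplitNum
    splitPoint = [min(col) + i * size for i in range(1, SplitNum + 1)]
    # Replace the last edge with a large sentinel so that larger values
    # seen later still fall into the last segment
    splitPoint[-1] = 100000000.0
    return splitPoint

def assignSplit(x, splitPoint):
    # x: scalar value
    # splitPoint: list of split points
    # return: upper edge of the segment that x falls into
    if x <= splitPoint[0]:
        return splitPoint[0]
    else:
        for i in range(0, len(splitPoint) - 1):
            if splitPoint[i] < x <= splitPoint[i + 1]:
                return splitPoint[i + 1]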
3. Write the function that computes the total, good, and bad sample counts and the bad rate of a variable
def BinBadRate(df, col, target, BadRateIndicator=True):
    # df: dataset for which the counts are computed
    # col: variable to group by
    # target: good/bad label (1 = bad, 0 = good)
    # BadRateIndicator: whether to compute the bad rate
    group = df.groupby(col)[target].agg(['count', 'sum'])
    group.columns = ['total', 'bad']
    group.reset_index(inplace=True)
    group['good'] = group['total'] - group['bad']
    if BadRateIndicator:
        group['BadRate'] = group['bad'] / group['total']
    return group
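The per-segment counts rely on the target being coded 0/1, so that count gives the total and sum gives the number of bad samples. A minimal standalone sketch of that groupby pattern on a tiny made-up data frame (the column names seg and y are hypothetical):

```python
import pandas as pd

# Hypothetical data: segment label and 0/1 target (1 = bad)
df = pd.DataFrame({"seg": [1, 1, 2, 2, 2],
                   "y":   [0, 1, 0, 0, 1]})

stats = df.groupby("seg")["y"].agg(["count", "sum"])
stats.columns = ["total", "bad"]            # count -> total, sum -> bad
stats = stats.reset_index()
stats["good"] = stats["total"] - stats["bad"]
stats["BadRate"] = stats["bad"] / stats["total"]
print(stats)
```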
4. Write the function that computes the chi-square value
def calcChi2(df, total_col, bad_col, good_col):
    # df: data frame holding the total, bad, and good sample counts of each level
    # total_col: column with the total sample count
    # bad_col: column with the bad sample count
    # good_col: column with the good sample count
    df2 = df.copy()
    # Overall bad rate and good rate
    badRate = sum(df2[bad_col]) * 1.0 / sum(df2[total_col])
    goodRate = sum(df2[good_col]) * 1.0 / sum(df2[total_col])
    # When all samples are good or all are bad, the chi-square value is 0
    if badRate in [0, 1]:
        return 0
    # Expected numbers of bad and good samples
    df2['bad_Exp'] = df2[total_col].map(lambda x: x * badRate)
    df2['good_Exp'] = df2[total_col].map(lambda x: x * goodRate)
    # Chi-square value
    badzip = zip(df2['bad_Exp'], df2[bad_col])
    goodzip = zip(df2['good_Exp'], df2[good_col])
    badChi2 = [(elem[1] - elem[0]) ** 2 / elem[0] for elem in badzip]
    goodChi2 = [(elem[1] - elem[0]) ** 2 / elem[0] for elem in goodzip]
    chi2 = sum(badChi2) + sum(goodChi2)
    return chi2
5. Next comes the function that bins a single variable. It calls the functions above and returns the binning result for that variable. Following the algorithm described earlier, the function has three parts: (1) merge adjacent bins, (2) check that every bin contains both good and bad samples, and (3) check that the share of every bin is at least BinPcntMin. spe_attri holds the special values; in the initial binning each special value forms a bin of its own. singleIndicator controls whether the special values take part in the subsequent merging: with True they do not, with False they do.
############### split the continuous variable using Chi2 value ###############
def mergeAdjacent(binBadRate, combineIndex):
    # Helper factored out of the repeated merge code: add row combineIndex into
    # row combineIndex+1 (which keeps its upper edge 'temp'), then drop row combineIndex
    binBadRate = binBadRate.copy()
    for c in ['total', 'bad', 'good']:
        binBadRate.loc[combineIndex + 1, c] = binBadRate.loc[combineIndex:combineIndex + 1, c].sum()
    binBadRate = binBadRate.loc[binBadRate.index != combineIndex, :]
    return binBadRate.reset_index(drop=True)

def chooseCombineIndex(binBadRate, idx):
    # Helper: for bin idx, return the index of the lower row of the adjacent pair to merge
    if idx == 0:
        return 0
    if idx == binBadRate.shape[0] - 1:
        return idx - 1
    chi2_1 = calcChi2(binBadRate.loc[idx - 1:idx, :], 'total', 'bad', 'good')
    chi2_2 = calcChi2(binBadRate.loc[idx:idx + 1, :], 'total', 'bad', 'good')
    return idx - 1 if chi2_1 < chi2_2 else idx

def ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri=[], singleIndicator=True):
    # df: data frame containing the target variable and the variable to bin
    # col: variable to bin
    # target: target variable, 0 or 1
    # BinMax: maximum number of bins
    # BinPcntMin: minimum share of each bin
    # SplitNum: number of initial equal-width segments
    # spe_attri: special values
    # singleIndicator: True: each special value is its own bin and is not merged;
    #                  False: special values start as their own bins and may be merged
    if len(spe_attri) >= 1:
        df1 = df.loc[df[col].isin(spe_attri)].copy()
        df2 = df.loc[~df[col].isin(spe_attri)].copy()
    else:
        df2 = df.copy()
    split_col = splitCol(df2[col], SplitNum)
    df2['temp'] = df2[col].map(lambda x: assignSplit(x, split_col))
    binBadRate = BinBadRate(df2, 'temp', target, BadRateIndicator=False)
    if len(spe_attri) >= 1 and singleIndicator == False:
        df1['temp'] = df1[col]
        binBadRate1 = BinBadRate(df1, 'temp', target, BadRateIndicator=False)
        binBadRate = pd.concat([binBadRate1, binBadRate])
        binBadRate.reset_index(inplace=True, drop=True)
    if len(spe_attri) >= 1 and singleIndicator == True:
        BinMax -= len(set(df1[col]))
    # 1. Merge adjacent bins until the number of bins is <= BinMax
    while binBadRate.shape[0] > BinMax:
        chi2List = []
        for i in range(0, binBadRate.shape[0] - 1):
            temp_binBadRate = binBadRate.loc[i:i + 1, :]
            chi2List.append(calcChi2(temp_binBadRate, 'total', 'bad', 'good'))
        combineIndex = chi2List.index(min(chi2List))
        binBadRate = mergeAdjacent(binBadRate, combineIndex)
    # 2. Check that every bin contains both good and bad samples
    #    (the shape[0] > 1 guard stops the loop when only one bin is left)
    binBadRate['BadRate'] = binBadRate['bad'] / binBadRate['total']
    while binBadRate.shape[0] > 1 and (binBadRate['BadRate'].min() == 0 or binBadRate['BadRate'].max() == 1):
        index_01 = binBadRate.index[binBadRate['BadRate'].isin([0, 1])][0]
        binBadRate = mergeAdjacent(binBadRate, chooseCombineIndex(binBadRate, index_01))
        binBadRate['BadRate'] = binBadRate['bad'] / binBadRate['total']
    # 3. Check that the share of every bin is >= BinPcntMin
    binBadRate['Percent'] = binBadRate['total'] / sum(binBadRate['total'])
    while binBadRate.shape[0] > 1 and binBadRate['Percent'].min() < BinPcntMin:
        index_minPercent = binBadRate['Percent'].idxmin()
        binBadRate = mergeAdjacent(binBadRate, chooseCombineIndex(binBadRate, index_minPercent))
        binBadRate['Percent'] = binBadRate['total'] / sum(binBadRate['total'])
    binBadRate = binBadRate.drop(['BadRate', 'Percent'], axis=1)
    if len(spe_attri) >= 1 and singleIndicator == True:
        binBadRate_single = BinBadRate(df1, col, target, BadRateIndicator=False)
        binBadRate_single.columns = ['temp', 'total', 'bad', 'good']
        bindf = pd.concat([binBadRate_single, binBadRate])
        bindf.reset_index(drop=True, inplace=True)
    else:
        bindf = binBadRate
    bindf['Percent'] = bindf['total'] / sum(bindf['total'])
    bindf['BadRate'] = bindf['bad'] / bindf['total']
    bindf0 = DataFrame({'bin': range(1, bindf.shape[0] + 1)})
    lowerdf = DataFrame({'lower': [-100000000] + bindf['temp'].tolist()[:-1]})
    upperdf = DataFrame({'upper': bindf['temp']})
    bindf = pd.concat([bindf0, lowerdf, upperdf, bindf.drop('temp', axis=1)], axis=1)
    return bindf
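Take the numerical variable orgnum as an example. train_cont is the data frame containing the numerical variables, y is the target variable, and -1 denotes a missing value of the numerical variable.
(1) Set singleIndicator = True, so that -1 forms a bin of its own.
orgnum_bin = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = True)
The binning result is as follows. The missing value -1 is placed in its own bin and, although its share of 2.2% is below 5%, it is not merged with an adjacent bin:
(2) Set singleIndicator = False, so that the missing value -1 takes part in the merging and may end up merged with other bins.
orgnum_bin2 = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = False)
The binning result is as follows; the missing value -1 has been merged with other values: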
6. Write a batch binning function that bins all the numerical variables in one pass. It returns a dictionary holding the binning result of each variable.
########### split the continuous variable using Chi2 value by batch ############
def ContVarChi2BinBatch(df, key, target, BinMax, BinPcntMin, SplitNum, spe_attri=[], singleIndicator=True):
    # df: data frame
    # key: primary key
    # target: target variable, 0 or 1
    # return: dictionary holding the binning result of each variable
    df_Xvar = df.drop([key, target], axis=1)
    x_vars = df_Xvar.columns.tolist()
    dict_bin = {}
    for col in x_vars:
        dict_bin[col] = ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri, singleIndicator)
    return dict_bin
Take the training sample train_cont as an example. Its primary key is cus_num and its target variable is y; the dictionary dict_train_cont holds the binning result of each numerical variable.
dict_train_cont = ContVarChi2BinBatch(train_cont, 'cus_num', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri=[-1], singleIndicator = False)
7. Write the function that replaces variable values with bin numbers
############################## replace variable values with bin numbers ##############################
def txtContVarBin(df, key, target, dict_bin, testIndicator=False):
    # df: data frame whose variable values are to be replaced with bin numbers
    # key: primary key
    # target: target variable
    # dict_bin: dictionary holding the binning result of each variable
    # testIndicator: whether df is a test data frame; True: also compute the share,
    #                bad rate, etc. of each bin on the test data and store them in a dictionary
    df_bin = df[[key, target]].copy()  # copy so new columns are not written to a view of df
    df_Xvar = df.drop([key, target], axis=1)
    DictBin = {}
    for col in df_Xvar.columns:
        Bin = dict_bin[col]
        ls = Series([np.nan] * len(df))
        for i in range(len(Bin.bin)):
            # bin i covers the interval (lower, upper]
            ls[((df[col] > Bin.lower[i]) & (df[col] <= Bin.upper[i])).tolist()] = Bin.bin[i]
        df_bin[col] = ls.tolist()
        if testIndicator:
            col_bin_BadRate = BinBadRate(df_bin, col, target, BadRateIndicator=False)
            col_bin_BadRate['Percent'] = col_bin_BadRate['total'] / sum(col_bin_BadRate['total'])
            col_bin_BadRate['BadRate'] = col_bin_BadRate['bad'] / col_bin_BadRate['total']
            col_bin_BadRate.columns = ['bin', 'total', 'bad', 'good', 'Percent', 'BadRate']
            col_bin = Bin[['bin', 'lower', 'upper']].merge(col_bin_BadRate, on='bin', how='left')
            DictBin[col] = col_bin
    if testIndicator:
        return df_bin, DictBin
    return df_bin
Earlier, batch binning of the training sample train_cont produced the dictionary dict_train_cont. That dictionary is now used to map the values of the numerical variables in train_cont to bin numbers. With testIndicator=False, only the mapped training sample train_cont_bin is returned:
train_cont_bin = txtContVarBin(train_cont, 'cus_num', 'y', dict_train_cont, testIndicator=False)
The test sample must also be mapped to bin numbers using the binning result from the training sample, which gives test_cont_bin. With testIndicator=True, the function additionally returns dict_test_cont, the risk distribution of each variable in the test sample under the training-sample bins:
test_cont_bin, dict_test_cont = txtContVarBin(test_cont, 'cus_num', 'y', dict_train_cont, testIndicator=True)
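The mapping assigns a value to bin i when lower < value <= upper. As an aside, the same rule can be sketched in vectorized form with np.searchsorted on the sorted upper edges. This is an alternative illustration rather than the function used above; the name assign_bins and the edge values are hypothetical.

```python
import numpy as np

def assign_bins(values, upper_edges):
    """Return 1-based bin numbers, where bin k covers
    (upper_edges[k-2], upper_edges[k-1]] and bin 1 covers everything up to
    upper_edges[0]. upper_edges must be sorted ascending."""
    values = np.asarray(values, dtype=float)
    upper_edges = np.asarray(upper_edges, dtype=float)
    # side="left" returns i with upper_edges[i-1] < v <= upper_edges[i]
    return np.searchsorted(upper_edges, values, side="left") + 1

# Hypothetical upper edges, with -1 as a special value in its own bin
upper = [-1, 3, 10, 100000000.0]
bins = assign_bins([-1, 0, 3, 3.5, 10, 50], upper)
print(bins.tolist())
```

Values above the last edge would receive a bin number one past the last bin, so the large sentinel edge matters here as well.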
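That concludes the introduction to the chi-square binning algorithm for numerical variables and its implementation. The chi-square statistic can also be replaced with other statistics, such as Gini variance or entropy variance, as the binning criterion. One point deserves particular attention: during the initial equal-width split, an abnormally large value in the variable causes the remaining values to be squeezed into a very small number of segments, which loses much of the variable's information. Before binning, therefore, run outlier detection on the numerical variables and treat abnormally large values, for example by capping them.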
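A minimal sketch of the capping step, assuming the 99th percentile as the cap. The percentile, the function names, and the data are illustrative choices, not part of the original code. It also counts how many equal-width segments are occupied before and after capping.

```python
import numpy as np

def cap_upper(values, upper_quantile=0.99):
    """Replace values above the given quantile with that quantile."""
    values = np.asarray(values, dtype=float)
    cap = np.quantile(values, upper_quantile)
    return np.minimum(values, cap), cap

def occupied_equal_width_bins(values, n_bins=10):
    """Number of equal-width segments that contain at least one value."""
    counts, _ = np.histogram(values, bins=n_bins)
    return int((counts > 0).sum())

# Hypothetical data: 99 ordinary values plus one abnormally large value
x = np.concatenate([np.arange(1, 100, dtype=float), [10000.0]])
capped, cap = cap_upper(x, 0.99)

before = occupied_equal_width_bins(x)
after = occupied_equal_width_bins(capped)
print(before, after)
```

Without capping, nearly all values land in the first segment; after capping, the ordinary values spread over more segments, so the initial split retains more information.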