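Numeric Variables: Chi-Square Binning

  • December 25, 2019
  • Notes

The chi-square statistic tests whether the association between two nominal (categorical) variables is significant, so it can serve as a binning criterion for categorical variables. Once a numeric variable has been discretized, the chi-square statistic can serve as its binning criterion as well.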

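Background

First, let us review the definition of the chi-square statistic. X and Y are two categorical variables: X takes the values high, medium, and low, and Y takes the values good and bad. The observed counts of y against x are arranged in a 3 x 2 contingency table of observed values.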

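Assume y and x are unrelated. The overall share of bad samples in y is 254/1831 = 13.87%. Under this null hypothesis, every level of x is expected to have this same bad rate, which yields the contingency table of expected values.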

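The chi-square statistic is then:

$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

where A is an observed value, T is the corresponding expected value, and the sum runs over all cells of the contingency table. The degrees of freedom of the chi-square distribution are (number of x levels - 1) * (number of y levels - 1) = (3 - 1) * (2 - 1) = 2.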

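The computed chi-square value is 45.41. The chi-square distribution table gives P(chi-square > 45.41) << 0.05, so there is good reason to reject the null hypothesis that y and x are unrelated; that is, y and x are strongly associated.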
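To make the calculation concrete, here is a minimal sketch in Python. The counts in `observed` are made-up illustrative numbers, not the table from this post, and the names `observed`, `expected`, `chi2`, and `p_value` are assumptions chosen for the example. For exactly 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), which the sketch uses in place of a table lookup.

```python
import math
import numpy as np

# Hypothetical observed counts (NOT the table from the post).
# Rows: x = high, medium, low; columns: y = bad, good.
observed = np.array([[30.0, 70.0],
                     [20.0, 80.0],
                     [10.0, 90.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # samples per x level
col_rates = observed.sum(axis=0) / observed.sum()  # overall bad / good share

# Expected counts under the null hypothesis that y and x are unrelated
expected = row_totals * col_rates

# Chi-square statistic: sum over all cells of (A - T)^2 / T
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Valid for dof == 2 only: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(chi2, dof, p_value)
```

Applied to the chi-square value of 45.41 reported above, the same closed form gives exp(-22.705), which is on the order of 1e-10 and far below 0.05.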

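For a fixed chi-square distribution (that is, the same degrees of freedom), a larger chi-square value means a smaller P. When binning a feature, the chi-square values alone can therefore decide which levels to merge: the pair of levels with the smallest chi-square value has the most similar good/bad distribution.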

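As the review above shows, the chi-square statistic is defined on the joint distribution of two nominal variables. If one of the variables is numeric, discretize it first and then compute the chi-square statistic. For example, suppose X is a numeric variable with values in [20, 80]; split X into four segments, [20, 30), [30, 40), [40, 55), and [55, 80], and treat each segment as one level of X.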
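The discretization in this example can be sketched with `numpy.digitize`. The sample values in `x` and the names `cut_points`, `labels`, and `segment_index` are assumptions made for illustration; only the four segments come from the text.

```python
import numpy as np

# Interior cut points for the segments [20, 30), [30, 40), [40, 55), [55, 80]
cut_points = [30, 40, 55]
labels = ["[20, 30)", "[30, 40)", "[40, 55)", "[55, 80]"]

# Hypothetical sample values of X
x = np.array([20, 29.9, 30, 54, 55, 80])

# np.digitize returns i such that cut_points[i-1] <= x < cut_points[i],
# so each value is mapped to the index of its left-closed segment.
segment_index = np.digitize(x, cut_points)
segment_label = [labels[i] for i in segment_index]

print(list(zip(x.tolist(), segment_label)))
```

Each label then plays the role of one categorical level of X when the chi-square statistic is computed.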

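Binning Algorithm

First split the numeric variable into many small segments, then keep merging adjacent segments until a stopping condition is met. Because the values of a numeric variable are ordered, only adjacent segments may be merged so that the order is preserved. This is the biggest difference between binning a numeric variable and binning a categorical variable.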

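The algorithm is as follows:

(1) Split the numeric variable into SplitNum equal-width segments (for example, 100); this is the initial binning;

(2) Compute per-segment statistics: total sample count, good sample count, bad sample count, sample share, and so on;

(3) Compute the chi-square value of every pair of adjacent segments and merge the pair with the smallest value;

(4) Repeat steps (2) and (3) until the number of segments is <= BinMax;

(5) Check that every segment contains both bad and good samples. If a segment contains only bad samples or only good samples, merge it with whichever adjacent segment gives the smaller chi-square value;

(6) Repeat step (5) until every segment contains both bad and good samples;

(7) Check that every segment's sample share is >= BinPcntMin. If a segment's share is < BinPcntMin, merge it with whichever adjacent segment gives the smaller chi-square value;

(8) Repeat step (7) until every segment's sample share is >= BinPcntMin.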
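The core of steps (3) and (4) can be sketched in a few lines of plain Python. This is an illustrative simplification, not the implementation given later in this post: each segment is reduced to a hypothetical `(bad, good)` pair of counts, the helper names `pair_chi2` and `merge_adjacent` are invented for the example, and steps (5) to (8) are left out.

```python
def pair_chi2(a, b):
    # a, b: (bad, good) counts of two adjacent, non-empty segments
    bad, good = a[0] + b[0], a[1] + b[1]
    total = bad + good
    if bad == 0 or good == 0:
        return 0.0  # only one class present: treat the chi-square value as 0
    chi2 = 0.0
    for seg_bad, seg_good in (a, b):
        n = seg_bad + seg_good
        exp_bad, exp_good = n * bad / total, n * good / total
        chi2 += (seg_bad - exp_bad) ** 2 / exp_bad
        chi2 += (seg_good - exp_good) ** 2 / exp_good
    return chi2


def merge_adjacent(segments, bin_max):
    # segments: ordered list of (bad, good) counts, merged until len <= bin_max
    segments = list(segments)
    while len(segments) > max(bin_max, 1):
        chi2_list = [pair_chi2(segments[i], segments[i + 1])
                     for i in range(len(segments) - 1)]
        i = chi2_list.index(min(chi2_list))  # most similar adjacent pair
        merged = (segments[i][0] + segments[i + 1][0],
                  segments[i][1] + segments[i + 1][1])
        segments[i:i + 2] = [merged]  # replace the pair with the merged segment
    return segments


result = merge_adjacent([(1, 9), (1, 9), (5, 5), (5, 5)], bin_max=2)
print(result)  # [(2, 18), (10, 10)]
```

In the toy input the first two segments share a 10% bad rate and the last two share a 50% bad rate, so each pair has a chi-square value of 0 and is merged first, leaving two bins with clearly different bad rates.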

Implementation

1. Load the modules

import pandas as pd
import numpy as np
from pandas import DataFrame, Series

2. Write the equal-width splitting functions for a data column

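def splitCol(col, SplitNum, exclude_attri=[]):

    # col: data column
    # SplitNum: number of equal-width segments
    # exclude_attri: special values that do not take part in the split
    # return: list of split points (the upper bound of each segment)

    col = list(col)
    col = list(set(col).difference(set(exclude_attri)))
    size = (max(col) - min(col))/SplitNum
    splitPoint = [min(col)+i*size for i in range(1, SplitNum+1)]
    # Replace the last split point with a large sentinel so that values
    # above the observed maximum still fall into the last segment
    splitPoint[-1] = 100000000.0
    return splitPoint


def assignSplit(x, splitPoint):

    # x: scalar value
    # splitPoint: list of split points
    # return: upper bound of the segment that x falls into

    if x <= splitPoint[0]:
        return splitPoint[0]
    else:
        for i in range(0, len(splitPoint)-1):
            if splitPoint[i] < x <= splitPoint[i+1]:
                return splitPoint[i+1]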
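For large columns, the same (lower, upper] assignment can be done in one vectorized step. The following is a sketch of an alternative, not the author's code: it uses `numpy.linspace` and `numpy.searchsorted`, and the names `values`, `split_num`, and `upper_bounds` as well as the sample data are assumptions made for illustration.

```python
import numpy as np

# Hypothetical sample column
values = np.array([0.0, 2.0, 2.5, 7.0, 10.0])
split_num = 5  # hypothetical number of equal-width segments

# Upper bounds of the equal-width segments: [2, 4, 6, 8, 10]
upper_bounds = np.linspace(values.min(), values.max(), split_num + 1)[1:]

# side="left" returns the first index i with upper_bounds[i] >= x,
# i.e. the segment (lower, upper] that contains x
segment = np.searchsorted(upper_bounds, values, side="left")
assigned_upper = upper_bounds[segment]

print(segment.tolist(), assigned_upper.tolist())
```

Unlike `splitCol`, this sketch does not replace the last bound with a large sentinel, so a value above the observed maximum would produce an out-of-range index and needs separate handling.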

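3. Write a function that computes the total, good, and bad sample counts and the bad rate for each value of a variable

def BinBadRate(df, col, target, BadRateIndicator=True):

    # df: data set on which the counts are computed
    # col: variable whose values define the groups
    # target: good/bad label (1 = bad, 0 = good)
    # BadRateIndicator: whether to also compute the bad rate

    # Because target is 0/1, 'count' gives the total and 'sum' gives the bad count
    group = df.groupby(col)[target].agg(['count', 'sum'])
    group.columns = ['total', 'bad']
    group.reset_index(inplace=True)
    group['good'] = group['total'] - group['bad']

    if BadRateIndicator:
        group['BadRate'] = group['bad']/group['total']

    return group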
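The aggregation relies on the target being coded 0/1, so that `count` is the number of samples and `sum` is the number of bad samples. A small self-contained sketch on made-up data shows the resulting table; the frame `toy` and its column names `seg` and `y` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'seg' is the segment label, 'y' is 1 for bad, 0 for good
toy = pd.DataFrame({
    "seg": [10, 10, 10, 20, 20, 30],
    "y":   [1, 0, 0, 1, 1, 0],
})

stats = toy.groupby("seg")["y"].agg(["count", "sum"])
stats.columns = ["total", "bad"]
stats = stats.reset_index()
stats["good"] = stats["total"] - stats["bad"]
stats["BadRate"] = stats["bad"] / stats["total"]

print(stats)
```

Here segment 20 contains only bad samples and segment 30 only good samples, which is exactly the situation that step (5) of the algorithm resolves by merging.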

4. Write the chi-square function

def calcChi2(df, total_col, bad_col, good_col):

    # df: data frame holding the total, bad, and good sample counts of each level
    # total_col: name of the column with the total counts
    # bad_col: name of the column with the bad counts
    # good_col: name of the column with the good counts

    df2 = df.copy()
    # Overall bad rate and good rate
    badRate = sum(df2[bad_col])*1.0/sum(df2[total_col])
    goodRate = sum(df2[good_col])*1.0/sum(df2[total_col])

    # If all samples are good or all are bad, the chi-square value is 0
    if badRate in [0, 1]:
        return 0

    # Expected numbers of bad and good samples in each level
    df2['bad_Exp'] = df2[total_col]*badRate
    df2['good_Exp'] = df2[total_col]*goodRate

    # Chi-square value: sum of (observed - expected)^2 / expected
    badChi2 = (df2[bad_col] - df2['bad_Exp'])**2/df2['bad_Exp']
    goodChi2 = (df2[good_col] - df2['good_Exp'])**2/df2['good_Exp']
    chi2 = badChi2.sum() + goodChi2.sum()

    return chi2

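5. Next, implement the single-variable binning function, which calls the functions above and returns the binning result for one variable. Following the algorithm described earlier, the function has three parts: (1) merge adjacent groups, (2) check that every group contains both good and bad samples, and (3) check that every group's share is at least BinPcntMin. spe_attri lists the special values; in the initial binning each special value forms its own group. singleIndicator states whether the special values take part in the subsequent merging: with True they do not, and with False they do.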

###############  split the continuous variable using Chi2 value  ###############
def _mergeBins(binBadRate, combineIndex):

    # Helper extracted from the repeated merge code: add row combineIndex
    # into row combineIndex+1, then drop row combineIndex.
    # .loc is used instead of chained assignment, which newer pandas rejects.

    binBadRate = binBadRate.copy()
    for c in ['total', 'bad', 'good']:
        binBadRate.loc[combineIndex+1, c] = binBadRate.loc[combineIndex:combineIndex+1, c].sum()
    return binBadRate.drop(index=combineIndex).reset_index(drop=True)


def _pickNeighbor(binBadRate, idx):

    # Return the index of the first row of the pair to merge for group idx:
    # the only neighbor at either end, otherwise the neighbor with the smaller chi-square value

    last = binBadRate.shape[0] - 1
    if idx == 0:
        return 0
    if idx == last:
        return last - 1
    chi2_1 = calcChi2(binBadRate.loc[idx-1:idx, :], 'total', 'bad', 'good')
    chi2_2 = calcChi2(binBadRate.loc[idx:idx+1, :], 'total', 'bad', 'good')
    return idx-1 if chi2_1 < chi2_2 else idx


def ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri=[], singleIndicator=True):

    # df: data frame containing the target variable and the variable to bin
    # col: variable to bin
    # target: target variable, 0 or 1
    # BinMax: maximum number of bins
    # BinPcntMin: minimum share of samples per bin
    # SplitNum: number of equal-width segments in the initial split
    # spe_attri: special values
    # singleIndicator: True: each special value is its own bin and is not merged;
    #                  False: special values take part in chi-square merging

    if len(spe_attri) >= 1:
        df1 = df.loc[df[col].isin(spe_attri)].copy()
        df2 = df.loc[~df[col].isin(spe_attri)].copy()
    else:
        df2 = df.copy()

    split_col = splitCol(df2[col], SplitNum)
    df2['temp'] = df2[col].map(lambda x: assignSplit(x, split_col))
    binBadRate = BinBadRate(df2, 'temp', target, BadRateIndicator=False)

    if len(spe_attri) >= 1 and not singleIndicator:
        df1['temp'] = df1[col]
        binBadRate1 = BinBadRate(df1, 'temp', target, BadRateIndicator=False)
        binBadRate = pd.concat([binBadRate1, binBadRate])
        binBadRate.reset_index(inplace=True, drop=True)

    if len(spe_attri) >= 1 and singleIndicator:
        BinMax -= len(set(df1[col]))

    # 1. Merge adjacent groups until the number of bins is <= BinMax
    #    (never below one group, so the loop always terminates)
    while binBadRate.shape[0] > max(BinMax, 1):
        chi2List = []
        for i in range(0, binBadRate.shape[0]-1):
            temp_binBadRate = binBadRate.loc[i:i+1, :]
            chi2List.append(calcChi2(temp_binBadRate, 'total', 'bad', 'good'))
        combineIndex = chi2List.index(min(chi2List))
        binBadRate = _mergeBins(binBadRate, combineIndex)

    # 2. Check that every group contains both good and bad samples
    binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']
    while binBadRate.shape[0] > 1 and binBadRate['BadRate'].isin([0, 1]).any():
        index_01 = binBadRate.index[binBadRate['BadRate'].isin([0, 1])][0]
        binBadRate = _mergeBins(binBadRate, _pickNeighbor(binBadRate, index_01))
        binBadRate['BadRate'] = binBadRate['bad']/binBadRate['total']

    # 3. Check that every group's share is >= BinPcntMin
    binBadRate['Percent'] = binBadRate['total']/sum(binBadRate['total'])
    while binBadRate.shape[0] > 1 and binBadRate['Percent'].min() < BinPcntMin:
        index_minPercent = binBadRate['Percent'].idxmin()
        binBadRate = _mergeBins(binBadRate, _pickNeighbor(binBadRate, index_minPercent))
        binBadRate['Percent'] = binBadRate['total']/sum(binBadRate['total'])

    binBadRate = binBadRate.drop(['BadRate', 'Percent'], axis=1)

    if len(spe_attri) >= 1 and singleIndicator:
        binBadRate_single = BinBadRate(df1, col, target, BadRateIndicator=False)
        binBadRate_single.columns = ['temp', 'total', 'bad', 'good']
        bindf = pd.concat([binBadRate_single, binBadRate])
        bindf.reset_index(drop=True, inplace=True)
    else:
        bindf = binBadRate

    bindf['Percent'] = bindf['total']/sum(bindf['total'])
    bindf['BadRate'] = bindf['bad']/bindf['total']

    # Each bin covers (lower, upper]; the special values are listed first,
    # so they are assumed to be smaller than all ordinary values (e.g. -1)
    bindf0 = DataFrame({'bin': range(1, bindf.shape[0]+1)})
    lowerdf = DataFrame({'lower': [-100000000] + bindf['temp'].tolist()[:-1]})
    upperdf = DataFrame({'upper': bindf['temp']})
    bindf = pd.concat([bindf0, lowerdf, upperdf, bindf.drop('temp', axis=1)], axis=1)

    return bindf

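Take the numeric variable orgnum as an example. train_cont is the data frame containing the numeric variables, y is the target variable, and -1 stands for a missing value of the numeric variable.

(1) Set singleIndicator = True, so that -1 forms its own group.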

orgnum_bin = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = True)

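In the result, the missing value -1 is placed in a bin of its own. Although its share is 2.2%, below 5%, it is not merged with the adjacent group.

(2) Set singleIndicator = False, so that the missing value -1 takes part in the merging and may be combined with other groups.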

orgnum_bin2 = ContVarChi2Bin(train_cont, 'orgnum', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri = [-1], singleIndicator = False)

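In the result, the missing value -1 is merged with other values.

6. Write the batch binning function, which bins all numeric variables in one call and returns a dictionary holding the binning result of each variable.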

########### split the continuous variable using Chi2 value by batch ############
def ContVarChi2BinBatch(df, key, target, BinMax, BinPcntMin, SplitNum, spe_attri=[], singleIndicator=True):

    # df: data frame
    # key: primary key
    # target: target variable, 0 or 1
    # return: dictionary holding the binning result of each variable

    # Every column other than the key and the target is treated as a numeric variable to bin
    df_Xvar = df.drop([key, target], axis=1)
    x_vars = df_Xvar.columns.tolist()

    dict_bin = {}
    for col in x_vars:
        dict_bin[col] = ContVarChi2Bin(df, col, target, BinMax, BinPcntMin, SplitNum, spe_attri, singleIndicator)

    return dict_bin

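Take the training sample train_cont as an example: its primary key is cus_num and its target variable is y. The dictionary dict_train_cont holds the binning result of each numeric variable.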

dict_train_cont = ContVarChi2BinBatch(train_cont, 'cus_num', 'y', BinMax=5, BinPcntMin=0.05, SplitNum=100, spe_attri=[-1], singleIndicator = False)

7. Write a function that replaces variable values with bin numbers

############################## Replace variable values with bin numbers ##############################
def txtContVarBin(df, key, target, dict_bin, testIndicator=False):
    # df: data frame whose variable values are to be replaced with bin numbers
    # key: primary key
    # target: target variable
    # dict_bin: dictionary holding the binning result of each variable
    # testIndicator: whether df is a test data frame; if True, also compute each bin's
    #                share, bad rate, etc. on df and store them in a dictionary

    # Copy so that adding columns does not write into a view of df
    df_bin = df[[key, target]].copy()
    df_Xvar = df.drop([key, target], axis=1)
    DictBin = {}
    for col in df_Xvar.columns:

        Bin = dict_bin[col]
        # Values that fall into no bin stay NaN
        ls = np.full(len(df), np.nan)
        for i in range(len(Bin)):
            in_bin = (df[col] > Bin['lower'][i]) & (df[col] <= Bin['upper'][i])
            ls[in_bin.to_numpy()] = Bin['bin'][i]
        df_bin[col] = ls

        if testIndicator:

            col_bin_BadRate = BinBadRate(df_bin, col, target, BadRateIndicator=False)
            col_bin_BadRate['Percent'] = col_bin_BadRate['total']/sum(col_bin_BadRate['total'])
            col_bin_BadRate['BadRate'] = col_bin_BadRate['bad']/col_bin_BadRate['total']
            col_bin_BadRate.columns = ['bin', 'total', 'bad', 'good', 'Percent', 'BadRate']
            col_bin = Bin[['bin', 'lower', 'upper']].merge(col_bin_BadRate, on='bin', how='left')
            DictBin[col] = col_bin

    if testIndicator:
        return df_bin, DictBin

    return df_bin

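Batch binning of the training sample train_cont above produced the dictionary dict_train_cont. Now use that dictionary to map the values of the numeric variables in train_cont to bin numbers. With testIndicator=False, only the mapped training sample train_cont_bin is returned. The code is as follows: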

train_cont_bin = txtContVarBin(train_cont, 'cus_num', 'y', dict_train_cont, testIndicator=False)

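The test sample must also be mapped to bin numbers, test_cont_bin, using the binning result from the training sample. With testIndicator=True, the function additionally returns dict_test_cont, the risk distribution of each variable in the test sample under the training bins. The code is as follows: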

test_cont_bin, dict_test_cont = txtContVarBin(test_cont, 'cus_num', 'y', dict_train_cont, testIndicator=True)
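The mapping rule is simply "bin i covers (lower, upper]". When the bins are contiguous, the same rule can be sketched with `pandas.cut`, whose default `right=True` closes each interval on the right. The table `bin_table` below is a made-up binning result with the same `bin`, `lower`, and `upper` columns that ContVarChi2Bin returns, and `values` is hypothetical data; this illustrates the rule and is not a replacement for txtContVarBin.

```python
import pandas as pd

# Hypothetical binning result (same column names as the output of ContVarChi2Bin)
bin_table = pd.DataFrame({
    "bin": [1, 2, 3],
    "lower": [-100000000.0, 10.0, 20.0],
    "upper": [10.0, 20.0, 100000000.0],
})

# Hypothetical values to map, including the missing-value code -1
values = pd.Series([-1, 10, 10.5, 20, 50])

# Contiguous bins share edges: first lower bound followed by every upper bound
edges = [bin_table["lower"].iloc[0]] + bin_table["upper"].tolist()

# right=True (the default) makes every interval (lower, upper]
binned = pd.cut(values, bins=edges, labels=bin_table["bin"].tolist())

print(binned.astype(int).tolist())
```

A value exactly equal to an upper bound, such as 10 or 20 here, lands in the bin that the bound closes, which matches the `> lower` and `<= upper` comparison used in txtContVarBin.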

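That concludes the introduction to the chi-square binning algorithm for numeric variables and its implementation. The chi-square statistic can also be replaced by other statistics, such as Gini variance or entropy variance, as the binning criterion. One point needs particular attention: in the initial equal-width split, an abnormally large value stretches the range, so almost all values land in a very small number of segments and much of the variable's information is lost. Before binning, therefore, run outlier detection on the numeric variable and cap abnormally large values.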
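The effect of a single extreme value on the equal-width split, and of capping it, can be seen in a short sketch. The data, the choice of the 99th percentile as the cap, and the helper name `nonempty_equal_width_bins` are all assumptions made for illustration; the appropriate cap depends on the variable.

```python
import numpy as np

# Hypothetical data: 99 ordinary values plus one extreme outlier
x = np.array(list(range(1, 100)) + [10000], dtype=float)


def nonempty_equal_width_bins(values, n_bins):
    # Count how many of n_bins equal-width segments contain at least one value
    counts, _ = np.histogram(values, bins=n_bins)
    return int((counts > 0).sum())


# Cap values above the 99th percentile (an assumed choice of threshold)
cap = np.percentile(x, 99)
x_capped = np.minimum(x, cap)

before = nonempty_equal_width_bins(x, 10)
after = nonempty_equal_width_bins(x_capped, 10)

print(before, after)
```

Without capping, the 99 ordinary values all share one segment and the outlier occupies another, so the initial split carries almost no information; after capping, the ordinary values spread over several segments.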