An Introduction to CatBoost, the Winning Model of the iFLYTEK Ad Anti-Fraud Competition

  • October 7, 2019
  • Notes

A while back, the MeteoAI team took part in the iFLYTEK Mobile Advertising Anti-Fraud Algorithm Challenge[1] and finished the second round ranked 14th out of 1,428 teams. It was the first competition we worked through seriously from start to finish. A top-1% finish is not bad, but it is still a pity that we just missed the prize zone (the top ten). The whole process was quite a roller coaster: our best ranking along the way was 11th, so we were only a whisker away from the leading pack. Even so, we gained a great deal from the competition.

What is a meteorology team doing in an ad-fraud competition???

First of all, everyone's models are much the same; in this competition, for example, almost everyone used CatBoost. What ultimately decides the outcome is data-mining ability, plus, of course, some inspiration and luck. Thorough EDA and careful feature engineering are usually what wins this kind of data competition. So make a point of developing your skills in data analysis, data mining, feature engineering, and business understanding; don't stop at model.fit() and model.predict(), because that part really is something anyone can do.

The code for this article is linked at the end of the post.

Well... today we will nevertheless talk about the model.fit() / model.predict() side of CatBoost, the big gun of this competition, precisely because that part anyone can learn. As for the feature-engineering and data-mining tricks, to be honest we have not fully figured them out ourselves, so we will not pretend otherwise. In the iFLYTEK competition most of the features were categorical, and CatBoost is very good at handling categorical features; it clearly outperformed the usual choices such as XGBoost and LightGBM.

Anyone who has used sklearn for machine learning knows that categorical features have to be preprocessed first, for example with label encoding or one-hot encoding, because sklearn estimators cannot consume categorical (string) features directly and will raise an error.
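
To make the contrast concrete, here is a minimal sketch with made-up toy data (the column and variable names are purely illustrative): sklearn needs the categorical column encoded first, while CatBoost only needs to be told which columns are categorical.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from catboost import CatBoostClassifier

# toy data: one categorical column, one numeric column (illustrative only)
df = pd.DataFrame({'city': ['bj', 'sh', 'bj', 'gz'],
                   'price': [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]

# sklearn route: encode the categorical column before fitting any estimator
X_encoded = OneHotEncoder().fit_transform(df[['city']])

# CatBoost route: pass the raw DataFrame and point cat_features at the categorical columns
model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(df, y, cat_features=['city'])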

CatBoost[2], open-sourced by the Russian company Yandex, can handle categorical features directly and performs remarkably well on many public datasets. As its name suggests (CatBoost = Category + Boosting), its main strength is its treatment of categorical features[3]; its results are also more robust, and you can get very good results without much parameter tuning (see [4] for tuning tips).

From the CatBoost documentation: "Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality."

1. Install

First install the required packages:

# with pip
pip install catboost
# or with conda
conda install -c conda-forge catboost

# install the Jupyter notebook widgets used for interactive plotting
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

2. Preprocessing

Pool

Pool is CatBoost's container for organising data. numpy arrays and pandas DataFrames also work, but Pool is recommended because it is better in both memory use and speed.

The signature of Pool[5]:

class Pool(data,
           label=None,
           cat_features=None,
           column_description=None,
           pairs=None,
           delimiter='\t',
           has_header=False,
           weight=None,
           group_id=None,
           group_weight=None,
           subgroup_id=None,
           pairs_weight=None,
           baseline=None,
           feature_names=None,
           thread_count=-1)
from catboost import CatBoostClassifier, Pool

train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])
train_data
# <catboost.core.Pool at 0x1a22af06d0>

model = CatBoostClassifier(iterations=10)
model.fit(train_data)
preds_class = model.predict(train_data)

FeaturesData

There are several ways to build a Pool; constructing it from FeaturesData[6] is the preferred one.

class FeaturesData(num_feature_data=None,
                   cat_feature_data=None,
                   num_feature_names=None,
                   cat_feature_names=None)

CatBoostClassifier[7] with FeaturesData[8]:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData

# Initialize data
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "b"], ["c", "d"]], dtype=object)
)
train_labels = [1, 1, -1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8], [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"], ["a", "d"]], dtype=object)
)

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

CatBoostClassifier[9] with Pool[10] and FeaturesData[11]:

import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool

# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6],
                                   [4, 5, 6, 7],
                                   [30, 40, 50, 60]],
                                  dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "b"],
                                   ["c", "d"]],
                                  dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8],
                                   [1, 4, 50, 60]],
                                  dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "d"]],
                                  dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2,
                           learning_rate=1,
                           depth=2,
                           loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

3. Case

The demos below use the Titanic dataset bundled with catboost.

Libraries and dataset preparation

First import the required libraries and prepare the data; the all-important feature engineering step is skipped here, since this is only a demonstration:

from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

# load the data
train_df, test_df = titanic()

# inspect missing values
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

# fill missing values
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

# split features and label
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# train/test split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
X_test = test_df

# indices of categorical features (np.float was removed from recent numpy, so use np.float64)
categorical_features_indices = np.where(X.dtypes != np.float64)[0]

Model training

The default parameters of CatBoost already give a very good baseline, so it is reasonable to start from them.

model = CatBoostClassifier(
    custom_metric=['Accuracy'],
    random_seed=666,
    logging_level='Silent'
)
# custom_metric <==> custom_loss

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    logging_level='Verbose',  # you can comment this for no text output
    plot=True
);

# OUTPUT:
"""
...
bestTest = 0.3792389991
bestIteration = 342

Shrink model to first 343 iterations.
"""

Making predictions

predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])
# OUTPUT:
"""
[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]
[[0.90866781 0.09133219]
 [0.63668717 0.36331283]
 [0.95333247 0.04666753]
 [0.91051481 0.08948519]
 [0.28010084 0.71989916]
 [0.94618962 0.05381038]
 [0.35536101 0.64463899]
 [0.81843278 0.18156722]
 [0.32829247 0.67170753]
 [0.92653732 0.07346268]]
"""

Keeping the best model (use_best_model)

When training with an eval set, it is best to leave use_best_model at True (its default in that case). The final model is then shrunk to the best iteration, which you can read back with model.tree_count_; if use_best_model is set to False, model.tree_count_ simply equals iterations. For example:

# data preparation is shown in the "Libraries and dataset preparation" section
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 666,
    'logging_level': 'Silent',
    'use_best_model': False
}
# train
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
# validation
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

# train with 'use_best_model': False
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

# train with 'use_best_model': True
best_model_params = params.copy()
best_model_params.update({'use_best_model': True})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

# show result
print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
print('')
print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation)), best_model.tree_count_))

Early stopping to prevent overfitting and save training time

Early stopping is a common way to prevent overfitting and can also cut training time dramatically.

params.update({'iterations': 1000})
params
# OUTPUT:
"""
{'iterations': 1000,
 'learning_rate': 0.1,
 'eval_metric': 'Accuracy',
 'random_seed': 42,
 'logging_level': 'Silent',
 'use_best_model': False}
"""

%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)
"""
CPU times: user 2min 11s, sys: 52.1 s, total: 3min 3s
Wall time: 27.8 s
"""

%%time
earlystop_model_1 = CatBoostClassifier(**params)
earlystop_model_1.fit(train_pool, eval_set=validate_pool, early_stopping_rounds=200, verbose=20)
"""
CPU times: user 46.6 s, sys: 15.6 s, total: 1min 2s
Wall time: 9.2 s
"""

%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 200,
    'logging_level': 'Verbose'
})
earlystop_model_2 = CatBoostClassifier(**earlystop_params)
earlystop_model_2.fit(train_pool, eval_set=validate_pool);
"""
CPU times: user 49.6 s, sys: 19.9 s, total: 1min 9s
Wall time: 10.3 s
"""

The early_stopping_rounds parameter can also be set directly in the parameter dict:

early_stopping_rounds: Set the overfitting detector type to 'Iter' ( 'od_type': 'Iter') and stop the training after the specified number of iterations since the iteration with the optimal metric value.

earlystop_params = params.copy()
earlystop_params.update({
    'early_stopping_rounds': 200,
    'logging_level': 'Verbose'
})

The results:

print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')
print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
print('Early-stopped model 1 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_1.predict(X_validation))
))
print('')
print('Early-stopped model 2 tree count: {}'.format(earlystop_model_2.tree_count_))
print('Early-stopped model 2 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_2.predict(X_validation))
))

"""
Simple model tree count: 1000
Simple model validation accuracy: 0.8206

Early-stopped model 1 tree count: 393
Early-stopped model 1 validation accuracy: 0.8296

Early-stopped model 2 tree count: 393
Early-stopped model 2 validation accuracy: 0.8296
"""

With early stopping, training takes much less time, overfitting is effectively avoided, and the resulting model is more accurate.

Feature Importance

Show the feature importances:

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)

feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

"""
Sex: 48.21061102095765
Pclass: 17.045040317206695
Age: 7.611166250335819
Parch: 5.220861205417323
SibSp: 5.16579933751564
Embarked: 4.968165121183137
Fare: 4.858908301370388
Cabin: 4.140024994004162
Ticket: 2.7794234520091585
PassengerId: 0.0
Name: 0.0
"""
# set prettified=True for richer output
importances = model.get_feature_importance(prettified=True)
print(importances)

Wrapping this in helper functions gives a nicer display.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
%matplotlib inline

def func_plot_importance(df_imp):
    sns.set(font_scale=1)
    fig = plt.figure(figsize=(3, 3), dpi=100)
    ax = sns.barplot(
        x="Importance", y="Features", data=df_imp, label="Total", color="b")
    ax.tick_params(labelcolor='k', labelsize='10', width=3)
    plt.show()

def display_importance(model_out, columns, printing=True, plotting=True):
    importances = model_out.feature_importances_
    indices = np.argsort(importances)[::-1]
    importance_list = []
    for f in range(len(columns)):
        importance_list.append((columns[indices[f]], importances[indices[f]]))
        if printing:
            print("%2d) %-*s %f" % (f + 1, 30, columns[indices[f]],
                                    importances[indices[f]]))
    if plotting:
        df_imp = pd.DataFrame(
            importance_list, columns=['Features', 'Importance'])
        func_plot_importance(df_imp)

display_importance(model_out=model, columns=X_train.columns)

Cross Validation[12]

cv(pool=None,
   params=None,
   dtrain=None,
   iterations=None,
   num_boost_round=None,
   fold_count=3,
   nfold=None,
   inverted=False,
   partition_random_seed=0,
   seed=None,
   shuffle=True,
   logging_level=None,
   stratified=None,
   as_pandas=True,
   metric_period=None,
   verbose=None,
   verbose_eval=None,
   plot=False,
   early_stopping_rounds=None,
   folds=None)

Wrap the data in a Pool first, then run cross-validation.

cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss'
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

print('Best validation accuracy score: {:.3f}±{:.3f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])))
# Best validation accuracy score: 0.833±0.007 on step 286

best_value = np.min(np.array(cv_data['test-Logloss-mean']))
best_iter_idx = np.argmin(np.array(cv_data['test-Logloss-mean']))

print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter_idx],
    best_iter_idx + 1))

Note: the iteration number equals the index plus one (iteration = index + 1).

Validation on a single holdout set easily under- or over-estimates the model's prediction error; cross-validation is the better choice.

Using Baseline

A baseline makes it possible to continue training on top of a previously trained model.

params = {'iterations': 200,
          'learning_rate': 0.1,
          'eval_metric': 'Accuracy',
          'random_seed': 42,
          'logging_level': 'Verbose',
          'use_best_model': False}

current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);

Snapshot

Snapshots can be used to resume training after an interruption, or to continue training on top of a previous run. If training will run for a long time, enabling snapshots protects against losing all progress when the machine or server reboots or otherwise fails midway.

params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

The intermediate training information is saved under catboost_info/ by default; the location can be changed with the train_dir parameter.

#!rm 'catboost_info/snapshot.bkp'
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    random_seed=43
)
model.fit(
    train_pool,
    eval_set=validate_pool,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    logging_level='Verbose'
)

DIY Loss and Metric Functions

Pay attention to the difference between the following parameters:

(1) loss_function, Alias: objective.

The objective function that training actually optimises.

(2) custom_metric, Alias: custom_loss

Evaluation metrics computed and printed during training; they serve only as a reference on training progress and are not the optimisation target.

(3) eval_metric

Used to detect overfitting and to select the best model. (loss_function and eval_metric do not have to be the same; for example, train with Logloss but select the best model / best iteration by AUC.)

model = CatBoostClassifier(
    iterations=500,
    loss_function='Logloss',
    custom_metric=['Accuracy', 'AUC'],
    eval_metric='F1',
    random_seed=666
)

# custom_metric <==> custom_loss
# only monitored for reference, not the optimisation target

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    verbose=50,
    plot=True
);

The same model trained with different metric settings:

# custom_metric=['Accuracy','AUC'], eval_metric='F1'
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333,
   'F1': 0.9294605809128631},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'F1': 0.7906976744186046,
   'AUC': 0.9018111688747275}},
 220)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Logloss'
model.best_iteration_, model.best_score_, model.tree_count_
"""
(152,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 153)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Accuracy'
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856, 'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 220)
"""

1. User Defined Objective Function[13]

class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        """
        approxes, targets, weights are indexed containers of floats
        (containers which have only __len__ and __getitem__ defined).
        weights parameter can be None.

        To understand what these parameters mean, assume that there is
        a subset of your dataset that is currently being processed.
        approxes contains current predictions for this subset,
        targets contains target values you provided with the dataset.

        This function should return a list of pairs (der1, der2), where
        der1 is the first derivative of the loss function with respect
        to the predicted value, and der2 is the second derivative.

        In our case, logloss is defined by the following formula:
        target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        where sigmoid(x) = 1 / (1 + e^(-x)).
        """
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result

model = CatBoostClassifier(
    iterations=10,
    random_seed=42,
    loss_function=LoglossObjective(),
    eval_metric="Logloss"
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

2. User Defined Metric Function[14]

class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        """
        approxes is a list of indexed containers
        (containers with only __len__ and __getitem__ defined),
        one container per approx dimension.
        Each container contains floats.
        weight is a one dimensional indexed container.
        target is a one dimensional indexed container of floats.

        weight parameter can be None.
        Returns pair (error, weights sum)
        """
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        approx = approxes[0]
        error_sum = 0.0
        weight_sum = 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum

model = CatBoostClassifier(
    iterations=10,
    random_seed=42,
    loss_function="Logloss",
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# Get raw prediction values
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

Evaluating a trained model on new data (eval_metrics)

CatBoost provides an eval_metrics method that computes a chosen set of metrics for an already trained model at every iteration, and can plot them. It is useful for evaluating a trained model on a new dataset.

model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, ['AUC', 'F1', 'Logloss'], plot=True)
# returns a dict with the keys 'AUC', 'F1', 'Logloss'

Comparing learning curves under different parameter settings

from catboost import MetricVisualizer

model1 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)

model2 = CatBoostClassifier(iterations=100, depth=8, train_dir='model_depth_8/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

widget = MetricVisualizer(['model_depth_5', 'model_depth_8'])
widget.start()

Saving and loading models

Save the model as a binary file.

model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump');

print(model.get_params())
print(model.random_seed_)
print(model.learning_rate_)

Model analysis and interpretation

SHAP
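
The original post leaves this section as a placeholder. As a minimal sketch (assuming the trained model and train_pool from the sections above, and that the third-party shap package is installed), CatBoost can return per-sample SHAP values via get_feature_importance(type='ShapValues'), which shap can then visualise:

import shap  # third-party package: pip install shap

# shape: (n_samples, n_features + 1); the last column is the expected value
shap_values = model.get_feature_importance(train_pool, type='ShapValues')
expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]

# overview of how much each feature pushes predictions up or down
shap.summary_plot(shap_values, X_train)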

Parameter tuning

Cross-validation and learning curves give us the best iterations (number of boosting steps), but a few other important parameters deserve extra tuning, notably l2_leaf_reg and learning_rate; see the official parameter reference[15] for more. Below is a tuning demo with hyperopt:

import hyperopt
from catboost import CatBoostClassifier, Pool, cv

def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=100,
        eval_metric='Accuracy',
        loss_function='Logloss',
        random_seed=42,
        logging_level='Silent'
    )

    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])

    return 1 - best_accuracy  # as hyperopt minimises

from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=10,
    trials=trials,
    rstate=RandomState(123)
)

print(best)

"""
100%|██████████| 10/10 [01:02<00:00,  6.69s/it, best loss: 0.1728395061728395]
{'l2_leaf_reg': 3.0, 'learning_rate': 0.36395429572850696}
"""

model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=100,
    eval_metric='Accuracy',
    loss_function='Logloss',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))
print(f"Best iteration: {int(np.argmax(cv_data['test-Accuracy-mean']) + 1)}")

"""
Precise validation accuracy score: 0.8271604938271605
Best iteration: 49
"""

Notes on some commonly used parameters; for the full list see the official Python Training Parameters documentation[16]:

1. iterations + learning_rate

By default CatBoost runs 1000 iterations, and learning_rate is chosen automatically based on the dataset and the iterations setting. If you lower iterations, it is usually best to raise learning_rate accordingly so that training still converges.

If training has not converged, consider increasing learning_rate; if the model is overfitting, decrease learning_rate.
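
For instance, a quick sketch reusing train_pool and validate_pool from the Titanic example above (the exact values are illustrative):

# fewer trees than the default 1000, so raise the learning rate to keep the ensemble converging
quick_model = CatBoostClassifier(iterations=200, learning_rate=0.2,
                                 random_seed=42, logging_level='Silent')
quick_model.fit(train_pool, eval_set=validate_pool)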

2. boosting_type

The default is Ordered, which gives good results and is recommended for small datasets, but it is slower than the Plain mode.

3. bootstrap_type[17]

4. one_hot_max_size

When converting categorical features, those with at most one_hot_max_size distinct values are one-hot encoded, while the remaining categorical features are encoded with category statistics. One-hot encoding is usually the faster route and computing the statistics is more expensive, so setting this parameter to a larger value can speed up training.

5. rsm, Alias: colsample_bylevel, float (0, 1]

The fraction of features considered at each split selection. With several hundred features or more this parameter is very effective: it speeds up training considerably while keeping quality. With only a few features it is not worth using.

Say you have many features and set rsm=0.1: you will typically need about 20% more iterations for the model to converge, but each iteration will be roughly 10x faster.

6. max_ctr_complexity

The maximum number of features that may be combined when CatBoost builds combinations of categorical features, which it does greedily and which is very time-consuming. Set max_ctr_complexity = 1 to disable feature combinations, or 2 to allow only pairwise combinations, as in the sketch below.
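
A small sketch (reusing the Titanic pools from above):

# disable categorical feature combinations entirely for faster training
no_combo_model = CatBoostClassifier(max_ctr_complexity=1, iterations=100, logging_level='Silent')
no_combo_model.fit(train_pool, eval_set=validate_pool)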

7. depth

Tree depth. In most cases the best value lies between 4 and 10; values between 6 and 10 are worth extra tuning.

8. l2_leaf_reg

The L2 regularisation coefficient; try several different values.

9. random_strength

Helps prevent overfitting. When scoring candidate splits, CatBoost adds a random term to the otherwise deterministic score; the term is drawn from a distribution with mean 0 and variance 1 * random_strength (the variance decays over the iterations). This injected randomness counteracts overfitting.

10. bagging_temperature: [0, inf)

Effective when bootstrap_type[18] is Bayesian; it parameterises the Bayesian bootstrap. With a value of 1, weights are sampled from an exponential distribution; with 0, all weights are 1. The larger the value, the more aggressive the bootstrap.

11. has_time

Set this if the dataset is a time series and the order of the samples matters. During Transforming categorical features to numerical features[19] and Choosing the tree structure[20], the data then keeps its original order (or is ordered by the Timestamp column if one is declared in the input data) instead of being shuffled by random permutations. See the sketch below.
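
A minimal sketch, assuming the rows of train_pool are already in chronological order:

# keep the original row order instead of applying random permutations
ts_model = CatBoostClassifier(has_time=True, iterations=100, logging_level='Silent')
ts_model.fit(train_pool, eval_set=validate_pool)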

12. grow_policy: one of [SymmetricTree, Depthwise, Lossguide]

How the decision trees are grown; the default is level-wise symmetric trees. A small sketch follows after the related parameters below.

min_data_in_leaf, Alias: min_child_samples: supported for Depthwise and Lossguide

max_leaves, Alias: num_leaves: supported for Lossguide
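
A sketch of the leaf-wise (LightGBM-style) setting; the values are illustrative, and note that older CatBoost releases support Depthwise/Lossguide only with task_type='GPU':

lossguide_model = CatBoostClassifier(grow_policy='Lossguide',
                                     max_leaves=31,         # only meaningful for Lossguide
                                     min_data_in_leaf=20,   # Depthwise / Lossguide only
                                     iterations=100,
                                     logging_level='Silent')
lossguide_model.fit(train_pool, eval_set=validate_pool)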

In a GPU environment you can set task_type="GPU".

border_count, Alias: max_bin. The number of splits (bins) for numerical features; the default is 254 on CPU and 128 on GPU. On CPU this parameter does not noticeably affect training speed; on GPU it does, so set it to 254 when quality matters most, or lower it for more speed. A GPU sketch follows below.
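
For example, a sketch of GPU training (it assumes a CUDA-capable GPU and a GPU-enabled catboost build):

gpu_model = CatBoostClassifier(task_type='GPU',
                               devices='0',       # which GPU(s) to use
                               border_count=254,  # favour quality over speed on GPU
                               iterations=1000,
                               logging_level='Silent')
gpu_model.fit(train_pool, eval_set=validate_pool)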

Putting some of these parameters together, for example:

A faster model

fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    plot=True
)

A more accurate model

tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tunned_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)

Code for this article:

https://github.com/zhangqibot/python_data_basic/tree/master/machine_learning/catboost

Previous posts

Deecamp Summer Camp: AI Precipitation Forecasting Wrap-up

Drawing Practical Meteorological Maps with Python (with code and test data)

Stanford Uses Machine Learning for Subseasonal Temperature/Precipitation Forecasts

Nature (2019): Deep Learning and Its Understanding in Earth System Science

Cross-disciplinary Trends: Using Neural Networks and Deep Learning to Forecast Precipitation, Temperature, etc. (with code/data/references)

REFERENCE

[1] iFLYTEK Mobile Advertising Anti-Fraud Algorithm Challenge: http://challenge.xfyun.cn/2019/gamedetail?type=detail/mobileAD
[2] CatBoost: https://catboost.yandex/
[3] Handling of categorical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
[4] Parameter tuning: https://catboost.ai/docs/concepts/parameter-tuning.html
[5, 10] Pool: https://catboost.ai/docs/concepts/python-reference_pool.html
[6, 8, 11] FeaturesData: https://catboost.ai/docs/concepts/python-features-data__desc.html
[7, 9] CatBoostClassifier: https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html#python-reference_catboostclassifier
[12] Cross Validation: https://catboost.ai/docs/concepts/python-reference_cv.html
[13] User Defined Objective Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-objective-function
[14] User Defined Metric Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric
[15] Official parameter reference: https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
[16] Python Training Parameters: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
[17, 18] bootstrap_type: https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html
[19] Transforming categorical features to numerical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html#algorithm-main-stages_cat-to-numberic
[20] Choosing the tree structure: https://catboost.ai/docs/concepts/algorithm-main-stages_choose-tree-structure.html#algorithm-main-stages_choose-tree-structure
[21] CatBoost on GitHub: https://github.com/catboost/catboost
[22] CatBoost paper: https://arxiv.org/pdf/1706.09516.pdf
[23] CatBoost algorithm details: https://catboost.ai/docs/concepts/algorithm-main-stages.html