CatBoost: the ace model of the iFlytek ad anti-fraud competition
- October 7, 2019
- Notes
A while back, the MeteoAI team took part in the iFlytek Mobile Ad Anti-Fraud Algorithm Challenge[1] and finished 14th out of 1,428 teams in the final round. It was the first competition we worked through seriously from start to finish. Landing in the top 1% is respectable, but it was still a pity to miss the prize zone (top ten) by a hair. The whole thing was quite a rollercoaster: our highest ranking was 11th, just a whisker away from the leading pack. Still, we learned a lot from the experience.

Why is a meteorology crew joining this party???
First, everyone's models are pretty much the same; in this competition almost everybody used CatBoost. What decides the outcome is data-mining skill, plus of course some inspiration and luck. Thorough EDA and careful feature engineering are usually what wins this kind of data competition. So make a point of building up your abilities in data analysis, data mining, feature engineering, and business understanding; knowing only model.fit() and model.predict() is not enough, because truly anyone can learn that.
The code for this post is linked at the end.
Well... today we will still talk about the model.fit() and model.predict() of CatBoost, the big gun of this competition, because truly anyone can learn that; as for the feature engineering and data mining side, honestly we have not fully figured it out ourselves, so we will not bluff our way through it here. In the iFlytek competition most of the features are categorical, and CatBoost is very good at handling categorical features, so it clearly outperformed the usual XGBoost and LightGBM models.

Anyone who has done machine learning with sklearn knows that categorical features must be preprocessed first, e.g. with label encoding or one-hot encoding, because sklearn models cannot handle categorical features directly and will raise an error.
CatBoost[2], open-sourced by the Russian company Yandex, can handle categorical features directly and performs remarkably well on many public datasets. As the name suggests (CatBoost = Category and Boosting), its strength is its treatment of categorical features[3]. Its results are also more robust: you can get very good results without heavy parameter tuning; see [4] for tuning guidance.
From the CatBoost docs: "Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality."
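To make the contrast concrete, here is a tiny sketch (toy data invented for illustration, not from this post): CatBoost accepts a raw string column directly once it is listed in cat_features, whereas an sklearn model would need it encoded first.

import pandas as pd
from catboost import CatBoostClassifier

# toy data: one categorical column kept as raw strings, one numeric column
df = pd.DataFrame({
    'channel': ['ios', 'android', 'ios', 'web'],
    'clicks':  [3, 10, 1, 7],
    'label':   [0, 1, 0, 1],
})

model = CatBoostClassifier(iterations=10, verbose=False)
# no label/one-hot encoding needed: just declare the categorical column
model.fit(df[['channel', 'clicks']], df['label'], cat_features=['channel'])
print(model.predict(df[['channel', 'clicks']]))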
1. Install
First install the required tools:
# with pip
pip install catboost
# or with conda
conda install -c conda-forge catboost

# install the interactive widgets for jupyter notebook, used for interactive plotting
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
2. Preprocessing
Pool
Pool is the data structure CatBoost uses to organize data. Numpy arrays and dataframes also work, but Pool is recommended: it is better in both memory use and speed.
Usage of Pool[5]:
class Pool(data, label=None, cat_features=None, column_description=None, pairs=None, delimiter='\t', has_header=False, weight=None, group_id=None, group_weight=None, subgroup_id=None, pairs_weight=None, baseline=None, feature_names=None, thread_count=-1)
from catboost import CatBoostClassifier, Pool

train_data = Pool(data=[[1, 4, 5, 6],
                        [4, 5, 6, 7],
                        [30, 40, 50, 60]],
                  label=[1, 1, -1],
                  weight=[0.1, 0.2, 0.3])
train_data  # <catboost.core.Pool at 0x1a22af06d0>

model = CatBoostClassifier(iterations=10)
model.fit(train_data)
preds_class = model.predict(train_data)
FeaturesData
There are several ways to create a Pool; building it through FeaturesData[6] is the preferred one.
class FeaturesData(num_feature_data=None, cat_feature_data=None, num_feature_names=None, cat_feature_names=None)
CatBoostClassifier[7] with FeaturesData[8]:
import numpy as np
from catboost import CatBoostClassifier, FeaturesData

# Initialize data
cat_features = [0, 1, 2]
train_data = FeaturesData(
    num_feature_data=np.array([[1, 4, 5, 6],
                               [4, 5, 6, 7],
                               [30, 40, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"],
                               ["a", "b"],
                               ["c", "d"]], dtype=object)
)
train_labels = [1, 1, -1]
test_data = FeaturesData(
    num_feature_data=np.array([[2, 4, 6, 8],
                               [1, 4, 50, 60]], dtype=np.float32),
    cat_feature_data=np.array([["a", "b"],
                               ["a", "d"]], dtype=object)
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
CatBoostClassifier[9] with Pool[10] and FeaturesData[11]:
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool

# Initialize data
train_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[1, 4, 5, 6],
                                   [4, 5, 6, 7],
                                   [30, 40, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "b"],
                                   ["c", "d"]], dtype=object)
    ),
    label=[1, 1, -1]
)
test_data = Pool(
    data=FeaturesData(
        num_feature_data=np.array([[2, 4, 6, 8],
                                   [1, 4, 50, 60]], dtype=np.float32),
        cat_feature_data=np.array([["a", "b"],
                                   ["a", "d"]], dtype=object)
    )
)
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data)
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
3. Case
Below we use the Titanic dataset bundled with CatBoost for a demonstration.
Libraries and dataset preparation
First import the necessary libraries and prepare the data. The feature-engineering part, which matters most, is skipped here; this is only a demo:
from catboost.datasets import titanic
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score

# load the data
train_df, test_df = titanic()

# check missing values
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

# fill missing values
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

# split features and label
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# train test split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)
X_test = test_df

# indices of categorical features (np.float is deprecated in recent numpy, so use float)
categorical_features_indices = np.where(X.dtypes != float)[0]
Model training
CatBoost's default parameters already give a very good baseline, so we might as well start from the defaults.
model = CatBoostClassifier(
    custom_metric=['Accuracy'],
    random_seed=666,
    logging_level='Silent'
)  # custom_metric <==> custom_loss

model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    logging_level='Verbose',  # you can comment this for no text output
    plot=True
);

# OUTPUT:
"""
... ... ...
bestTest = 0.3792389991
bestIteration = 342
Shrink model to first 343 iterations.
"""

Making predictions with the model
predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])

# OUTPUT:
"""
[0. 0. 0. 0. 1. 0. 1. 0. 1. 0.]
[[0.90866781 0.09133219]
 [0.63668717 0.36331283]
 [0.95333247 0.04666753]
 [0.91051481 0.08948519]
 [0.28010084 0.71989916]
 [0.94618962 0.05381038]
 [0.35536101 0.64463899]
 [0.81843278 0.18156722]
 [0.32829247 0.67170753]
 [0.92653732 0.07346268]]
"""
Selecting the best model (use_best_model)
When training, it is best to leave use_best_model at its default of True, which means the final model is shrunk to the best iteration (you can read the best iteration count from model.tree_count_). If use_best_model is set to False, then model.tree_count_ = iterations. See the example below:
# see "Libraries and dataset preparation" above for the data
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': 'Accuracy',
    'random_seed': 666,
    'logging_level': 'Silent',
    'use_best_model': False
}

# train
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
# validation
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

# train with 'use_best_model': False
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

# train with 'use_best_model': True
best_model_params = params.copy()
best_model_params.update({'use_best_model': True})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

# show result
print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
print('')
print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation)), best_model.tree_count_))
Using early stopping to prevent overfitting and save training time
Early stopping is a common way to prevent overfitting, and it can also cut training time substantially.
params.update({'iterations': 1000})
params
# OUTPUT:
"""
{'iterations': 1000,
 'learning_rate': 0.1,
 'eval_metric': 'Accuracy',
 'random_seed': 42,
 'logging_level': 'Silent',
 'use_best_model': False}
"""
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)
"""
CPU times: user 2min 11s, sys: 52.1 s, total: 3min 3s
Wall time: 27.8 s
"""
%%time
earlystop_model_1 = CatBoostClassifier(**params)
earlystop_model_1.fit(train_pool, eval_set=validate_pool,
                      early_stopping_rounds=200, verbose=20)
"""
CPU times: user 46.6 s, sys: 15.6 s, total: 1min 2s
Wall time: 9.2 s
"""
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 200,
    'logging_level': 'Verbose'
})
earlystop_model_2 = CatBoostClassifier(**earlystop_params)
earlystop_model_2.fit(train_pool, eval_set=validate_pool);
"""
CPU times: user 49.6 s, sys: 19.9 s, total: 1min 9s
Wall time: 10.3 s
"""
You can also set the early_stopping_rounds parameter directly:
early_stopping_rounds: Set the overfitting detector type to 'Iter' ('od_type': 'Iter') and stop the training after the specified number of iterations since the iteration with the optimal metric value.
earlystop_params = params.copy()
earlystop_params.update({
    'early_stopping_rounds': 200,
    'logging_level': 'Verbose'
})
Compare the results:
print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')
print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
print('Early-stopped model 1 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_1.predict(X_validation))
))
print('')
print('Early-stopped model 2 tree count: {}'.format(earlystop_model_2.tree_count_))
print('Early-stopped model 2 validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model_2.predict(X_validation))
))
"""
Simple model tree count: 1000
Simple model validation accuracy: 0.8206

Early-stopped model 1 tree count: 393
Early-stopped model 1 validation accuracy: 0.8296

Early-stopped model 2 tree count: 393
Early-stopped model 2 validation accuracy: 0.8296
"""
With early stopping the training is faster, overfitting is effectively avoided, and the resulting model is more accurate.
Feature Importance
Display feature importances:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))
"""
Sex: 48.21061102095765
Pclass: 17.045040317206695
Age: 7.611166250335819
Parch: 5.220861205417323
SibSp: 5.16579933751564
Embarked: 4.968165121183137
Fare: 4.858908301370388
Cabin: 4.140024994004162
Ticket: 2.7794234520091585
PassengerId: 0.0
Name: 0.0
"""

# set prettified=True for richer output
importances = model.get_feature_importance(prettified=True)
print(importances)
Wrap this in a function for a nicer display.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
%matplotlib inline


def func_plot_importance(df_imp):
    sns.set(font_scale=1)
    fig = plt.figure(figsize=(3, 3), dpi=100)
    ax = sns.barplot(
        x="Importance", y="Features", data=df_imp, label="Total", color="b")
    ax.tick_params(labelcolor='k', labelsize='10', width=3)
    plt.show()


def display_importance(model_out, columns, printing=True, plotting=True):
    importances = model_out.feature_importances_
    indices = np.argsort(importances)[::-1]
    importance_list = []
    for f in range(len(columns)):
        importance_list.append((columns[indices[f]], importances[indices[f]]))
        if printing:
            print("%2d) %-*s %f" %
                  (f + 1, 30, columns[indices[f]], importances[indices[f]]))
    if plotting:
        df_imp = pd.DataFrame(
            importance_list, columns=['Features', 'Importance'])
        func_plot_importance(df_imp)


display_importance(model_out=model, columns=X_train.columns)

Cross Validation[12]
cv(pool=None, params=None, dtrain=None, iterations=None, num_boost_round=None, fold_count=3, nfold=None, inverted=False, partition_random_seed=0, seed=None, shuffle=True, logging_level=None, stratified=None, as_pandas=True, metric_period=None, verbose=None, verbose_eval=None, plot=False, early_stopping_rounds=None, folds=None)
Wrap the data in a Pool first, then run cross-validation.
cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss'
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)
print('Best validation accuracy score: {:.3f}±{:.3f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])))
# Best validation accuracy score: 0.833±0.007 on step 286
best_value = np.min(np.array(cv_data['test-Logloss-mean']))
best_iter_idx = np.argmin(np.array(cv_data['test-Logloss-mean']))
print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter_idx],
    best_iter_idx + 1))
Note: iteration = index + 1.
A single holdout split easily under- or over-estimates the model's error; cross-validation is the better way to evaluate it.
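As a quick extension (a sketch, not from the original post): the cv function also accepts stratified and fold_count arguments (see its signature above), so the same comparison can be rerun with stratified folds, reusing X, y, categorical_features_indices and cv_params from the cells above.

cv_data_stratified = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    fold_count=5,        # 5 folds instead of the default 3
    stratified=True,     # keep the label ratio in every fold
    plot=False
)
best_idx = np.argmin(np.array(cv_data_stratified['test-Logloss-mean']))
print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    cv_data_stratified['test-Logloss-mean'][best_idx],
    cv_data_stratified['test-Logloss-std'][best_idx],
    best_idx + 1))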
Using Baseline
A baseline lets you continue training on top of a previously trained model.
params = {'iterations': 200,
          'learning_rate': 0.1,
          'eval_metric': 'Accuracy',
          'random_seed': 42,
          'logging_level': 'Verbose',
          'use_best_model': False}

current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);
Snapshot
Snapshots can be used to resume training after an interruption, or to keep training on top of a previous run. If a run is going to take a long time, enabling snapshots protects you from losing everything when the machine or server reboots or fails halfway through.
params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)
By default the intermediate training information is saved under the catboost_info/ directory; you can change this with the train_dir parameter.
#!rm 'catboost_info/snapshot.bkp'
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=40,
    random_seed=43
)
model.fit(
    train_pool,
    eval_set=validate_pool,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    logging_level='Verbose'
)
DIY Loss and Metric Functions
Be careful to distinguish these parameters:
(1) loss_function, alias objective: the objective function that training actually optimizes.
(2) custom_metric, alias custom_loss: evaluation metrics printed during training. They are only a reference for how training is going, not the optimization target.
(3) eval_metric: the metric used to detect overfitting and to select the best model. (loss_function and eval_metric do not have to match; for example, you can train with Logloss and pick the best model / best iteration with AUC.)
model = CatBoostClassifier(
    iterations=500,
    loss_function='Logloss',
    custom_metric=['Accuracy', 'AUC'],
    eval_metric='F1',
    random_seed=666
)
# custom_metric <==> custom_loss
# only reported for reference, not the optimization target
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    verbose=50,
    plot=True
);

Testing different settings:
# custom_metric=['Accuracy','AUC'], eval_metric='F1',
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333,
   'F1': 0.9294605809128631},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'F1': 0.7906976744186046,
   'AUC': 0.9018111688747275}},
 220)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Logloss',
model.best_iteration_, model.best_score_, model.tree_count_
"""
(152,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 153)
"""

# custom_metric=['Accuracy','AUC'], eval_metric='Accuracy',
model.best_iteration_, model.best_score_, model.tree_count_
"""
(219,
 {'learn': {'Accuracy': 0.9491017964071856,
   'Logloss': 0.1747009677350333},
  'validation': {'Accuracy': 0.8385650224215246,
   'Logloss': 0.39249638575985446,
   'AUC': 0.9018111688747275}},
 220)
"""
1. User Defined Objective Function[13]
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        """
        approxes, targets, weights are indexed containers of floats
        (containers which have only __len__ and __getitem__ defined).
        weights parameter can be None.

        To understand what these parameters mean, assume that there is
        a subset of your dataset that is currently being processed.
        approxes contains current predictions for this subset,
        targets contains target values you provided with the dataset.

        This function should return a list of pairs (der1, der2), where
        der1 is the first derivative of the loss function with respect
        to the predicted value, and der2 is the second derivative.

        In our case, logloss is defined by the following formula:
        target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        where sigmoid(x) = 1 / (1 + e^(-x)).
        """
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result


model = CatBoostClassifier(
    iterations=10,
    random_seed=42,
    loss_function=LoglossObjective(),
    eval_metric="Logloss"
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')
2. User Defined Metric Function[14]
class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        """
        approxes is a list of indexed containers (containers with only
        __len__ and __getitem__ defined), one container per approx dimension.
        Each container contains floats.
        weight is a one dimensional indexed container.
        target is float.
        weight parameter can be None.

        Returns pair (error, weights sum)
        """
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum


model = CatBoostClassifier(
    iterations=10,
    random_seed=42,
    loss_function="Logloss",
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')
Evaluating a trained model on new data (Eval Metrics)
CatBoost provides an eval_metrics method that computes specified metrics for a trained model at every iteration and can also plot them. It is handy for evaluating a trained model on a new dataset.
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, ['AUC', 'F1', 'Logloss'], plot=True)
# returns a dict with the keys 'AUC', 'F1' and 'Logloss'

Comparing learning curves across parameter settings
from catboost import MetricVisualizer

model1 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)

model2 = CatBoostClassifier(iterations=100, depth=8, train_dir='model_depth_8/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

widget = MetricVisualizer(['model_depth_5', 'model_depth_8'])
widget.start()

Saving and loading models
Save the model to a binary file:
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')

model = CatBoostClassifier()
model.load_model('catboost_model.dump');
print(model.get_params())
print(model.random_seed_)
print(model.learning_rate_)
Model analysis and interpretation
SHAP (SHapley Additive exPlanations)
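Below is a minimal sketch (assuming the shap package is installed, and reusing model, train_pool and X_train from the Titanic example above) of getting SHAP values out of a trained CatBoost model via get_feature_importance(type='ShapValues'):

import shap
from catboost import EFstrType

# shape: (n_samples, n_features + 1); the last column is the expected value (bias)
shap_values = model.get_feature_importance(train_pool, type=EFstrType.ShapValues)
expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]

shap.initjs()
shap.summary_plot(shap_values, X_train)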
Parameter tuning
Cross-validation and learning curves give us the best number of iterations (boosting steps), but a few other parameters are worth tuning as well, notably l2_leaf_reg and learning_rate; see the official docs[15] for the full list. Below is a tuning demo with hyperopt:
import hyperopt
from catboost import CatBoostClassifier, Pool, cv


def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=100,
        eval_metric='Accuracy',
        loss_function='Logloss',
        random_seed=42,
        logging_level='Silent'
    )
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params()
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    return 1 - best_accuracy  # as hyperopt minimises
from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}
trials = hyperopt.Trials()
best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=10,
    trials=trials,
    rstate=RandomState(123)
)
print(best)
"""
100%|██████████| 10/10 [01:02<00:00,  6.69s/it, best loss: 0.1728395061728395]
{'l2_leaf_reg': 3.0, 'learning_rate': 0.36395429572850696}
"""
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=100,
    eval_metric='Accuracy',
    loss_function='Logloss',
    random_seed=42,
    logging_level='Silent'
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))
print(f"Best iteration: {int(np.argmax(cv_data['test-Accuracy-mean']) + 1)}")
"""
Precise validation accuracy score: 0.8271604938271605
Best iteration: 49
"""
Notes on some commonly used parameters follow; see the official docs, Python Training Parameters[16], for the full list:
1. iterations + learning_rate
By default CatBoost runs 1000 iterations, and learning_rate is chosen automatically based on the dataset and the iterations setting. If you lower iterations, it is usually best to raise learning_rate accordingly so the model still converges.
If training does not converge, consider increasing learning_rate; if the model overfits, decrease learning_rate.
2. boosting_type
The default is Ordered, which works well and is recommended for small datasets, but it is slower than the Plain mode.
3. bootstrap_type[17]
4. one_hot_max_size
When converting categorical features, features with at most one_hot_max_size distinct values are one-hot encoded; the remaining categorical features are converted using various statistics. One-hot encoding is usually the faster route, while computing statistics costs more time, so raising this parameter can speed up training.
5. rsm, alias colsample_bylevel, a float in (0, 1]
The fraction of features considered for each split. With several hundred features or more this parameter is very effective: it speeds up training considerably while keeping quality. With few features it is not needed.
Say you have many features and set rsm=0.1: you will typically need about 20% more iterations for the model to converge, but each iteration will be roughly 10 times faster.
6. max_ctr_complexity
The maximum number of features that can be combined. CatBoost builds combinations of categorical features greedily, which is very time-consuming. Set max_ctr_complexity = 1 to disable combinations, or max_ctr_complexity = 2 to only combine pairs of features.
7. depth
Tree depth. In most cases 4-10 works; 6-10 is a good range to explore more carefully.
8. l2_leaf_reg
The L2 regularization coefficient; try several different values.
9. random_strength
Helps prevent overfitting. When scoring candidate splits, a random term is added to each feature's score. The score itself is deterministic; the added noise has mean 0 and variance 1 * random_strength (the variance shrinks over iterations), which injects randomness and reduces overfitting.
10. bagging_temperature, in [0, inf)
Only effective when bootstrap_type[18] is Bayesian; it controls the Bayesian bootstrap. With a value of 1, weights are sampled from an exponential distribution; with 0, all weights are 1. The larger the value, the more aggressive the bootstrap.
11. has_time
Set this if the dataset is a time series and the order of samples matters. During Transforming categorical features to numerical features[19] and Choosing the tree structure[20], the data then keeps its original order (or is ordered by a Timestamp column if one is declared in the input data) instead of being shuffled with random permutations. (See the combined sketch after this list.)
12. grow_policy, one of [SymmetricTree, Depthwise, Lossguide]
How the decision trees are grown; the default is level-wise symmetric trees. Related parameters (also in the combined sketch after this list):
min_data_in_leaf, alias min_child_samples: supported with Depthwise and Lossguide.
max_leaves, alias num_leaves: supported with Lossguide.
In a GPU environment you can set task_type="GPU".
border_count, alias max_bin: the number of splits for numerical features. The default is 254 on CPU and 128 on GPU. On CPU this parameter does not noticeably affect training speed; on GPU it does. Set it to 254 for better quality, or lower it for faster training. (Both options appear in the combined sketch below.)
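As a combined sketch (illustrative values chosen by us, not settings from the post) of how items 11-12 and the GPU-related options above fit together:

time_aware_model = CatBoostClassifier(
    iterations=500,
    has_time=True,              # time-series data: keep the original sample order
    grow_policy='Lossguide',    # leaf-wise growth instead of symmetric trees
    max_leaves=31,              # only meaningful with Lossguide
    min_data_in_leaf=20,        # supported with Depthwise / Lossguide
    task_type='GPU',            # remove if no GPU is available
    border_count=254,           # more numeric splits: better quality, slower on GPU
    verbose=100
)
# time_aware_model.fit(train_pool, eval_set=validate_pool)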
For example:
A faster model
from catboost import CatBoostClassifier

fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    plot=True
)
A more accurate model
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tunned_model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)
Code for this post:
https://github.com/zhangqibot/python_data_basic/tree/master/machine_learning/catboost
REFERENCE
[1] Mobile Ad Anti-Fraud Algorithm Challenge: http://challenge.xfyun.cn/2019/gamedetail?type=detail/mobileAD
[2] CatBoost: https://catboost.yandex/
[3] Handling of categorical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
[4] Parameter tuning: https://catboost.ai/docs/concepts/parameter-tuning.html
[5, 10] Pool: https://catboost.ai/docs/concepts/python-reference_pool.html
[6, 8, 11] FeaturesData: https://catboost.ai/docs/concepts/python-features-data__desc.html
[7, 9] CatBoostClassifier: https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html#python-reference_catboostclassifier
[12] Cross Validation: https://catboost.ai/docs/concepts/python-reference_cv.html
[13] User Defined Objective Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-objective-function
[14] User Defined Metric Function: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric
[15] Official parameter reference: https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
[16] Python Training Parameters: https://catboost.ai/docs/concepts/python-reference_parameters-list.html
[17, 18] bootstrap_type: https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html
[19] Transforming categorical features to numerical features: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html#algorithm-main-stages_cat-to-numberic
[20] Choosing the tree structure: https://catboost.ai/docs/concepts/algorithm-main-stages_choose-tree-structure.html#algorithm-main-stages_choose-tree-structure
[21] CatBoost on GitHub: https://github.com/catboost/catboost
[22] CatBoost paper: https://arxiv.org/pdf/1706.09516.pdf
[23] CatBoost algorithm details: https://catboost.ai/docs/concepts/algorithm-main-stages.html