New scikit-learn Release: Upgrade in Seconds with One Line of Code

December 10, 2019

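Scikit-learn, the powerful Python package, has long been a favorite of machine learning practitioners.

The scikit-learn team has just released the final version of 0.22.

This release fixes many bugs from earlier versions and adds several new features.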

Installing the latest version of scikit-learn is simple.

With pip:

pip install --upgrade scikit-learn

With conda:

conda install scikit-learn

Here are the ten highlights of this release.

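A New Plotting API

scikit-learn now ships a new plotting API for building visualizations.

Each plotting function returns a display object that stores the computed values, so the look of a figure can be adjusted quickly without recomputing anything.

Different plots can also be added to the same figure.

For example: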

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import plot_roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Note: plot_roc_curve is the 0.22 API. It was later deprecated and
# removed in scikit-learn 1.2 in favor of RocCurveDisplay.from_estimator.

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

# Draw the SVC curve first, then reuse its axes for the random forest curve.
svc_disp = plot_roc_curve(svc, X_test, y_test)
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
rfc_disp.figure_.suptitle("ROC curve comparison")

plt.show()

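StackingClassifier and StackingRegressor

StackingClassifier and StackingRegressor let you build a stack of estimators topped by a final classifier or regressor.

Stacked generalization stacks the outputs of the individual estimators and uses a final estimator to compute the prediction.

The base estimators are fitted on the full X, while the final estimator is trained on cross-validated predictions of the base estimators, obtained with cross_val_predict.

For example: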

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Base estimators: a small random forest and a scaled linear SVM.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)))
]

# The final estimator learns from the base estimators' predictions.
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
print(clf.fit(X_train, y_train).score(X_test, y_test))

Output: 0.9473684210526315
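To illustrate why the final estimator is trained on cross-validated predictions rather than on in-sample ones, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helper names fit_1nn and out_of_fold_predictions are made up for this illustration, and the interleaved fold assignment is a simplification. The base learner memorizes its training data, so its in-sample predictions are perfect and would tell a final estimator nothing about how it behaves on unseen data.

```python
def fit_1nn(X, y):
    """Toy base learner (hypothetical): 1-nearest-neighbour regressor on 1-D inputs."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i] - x))
        return y[nearest]
    return predict


def out_of_fold_predictions(fit, X, y, n_splits=3):
    """Predict every sample with a model that was trained without it."""
    preds = [None] * len(X)
    for fold in range(n_splits):
        # Interleaved folds: sample i belongs to fold i % n_splits (a simplification).
        train_idx = [i for i in range(len(X)) if i % n_splits != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in range(fold, len(X), n_splits):
            preds[i] = model(X[i])
    return preds


X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2]

# In-sample predictions simply reproduce the labels the model memorized.
in_sample = [fit_1nn(X, y)(x) for x in X]

# Out-of-fold predictions form one column of the final estimator's training data.
meta_feature = out_of_fold_predictions(fit_1nn, X, y)

print(in_sample)
print(meta_feature)
```

With several base learners, one such column per learner would make up the training matrix of the final estimator; the base learners themselves are then refitted on the full X for use at prediction time.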

Permutation-Based Feature Importance

inspection.permutation_importance can be used to estimate the importance of each feature for any fitted estimator:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=5, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0,
                                n_jobs=-1)

fig, ax = plt.subplots()
# Sort features by mean importance and label each box with its feature index.
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=[str(i) for i in sorted_idx])
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
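The idea behind permutation importance can be sketched in a few lines of plain Python: shuffle one column at a time and measure how much the score drops. This is an illustrative sketch, not scikit-learn's code. The function permutation_importance_sketch and the toy model are hypothetical, and accuracy is assumed as the score.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is noise.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9], [0, 4], [1, 7]]
y = [row[0] for row in X]


def model(row):
    # Stand-in for a fitted estimator: reads feature 0, ignores feature 1.
    return row[0]


importances = permutation_importance_sketch(model, X, y)
print(importances)
```

Shuffling the feature the model ignores leaves the score untouched, so its importance is zero, while shuffling the feature it relies on lowers the score.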

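Native Support for Missing Values in Gradient Boosting

ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now support missing values (NaNs) natively. This means there is no need to impute data before training or prediction.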

# In 0.22 these estimators are still experimental and must be enabled explicitly.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

# The last sample has a missing value; no imputation step is needed.
X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))

Output: [0 0 1 1]

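Precomputed Sparse Nearest Neighbors Graph

Most estimators based on nearest neighbors graphs now accept a precomputed sparse graph as input, so the same graph can be reused across multiple estimator fits.

To use this feature in a pipeline, combine the memory parameter with either neighbors.KNeighborsTransformer or neighbors.RadiusNeighborsTransformer.

The precomputation can also be performed by a custom estimator.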

from tempfile import TemporaryDirectory
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)

with TemporaryDirectory(prefix="sklearn_cache_") as tmpdir:
    estimator = make_pipeline(
        KNeighborsTransformer(n_neighbors=10, mode='distance'),
        Isomap(n_neighbors=10, metric='precomputed'),
        memory=tmpdir)
    estimator.fit(X)

    # We can decrease the number of neighbors and the graph will not be
    # recomputed.
    estimator.set_params(isomap__n_neighbors=5)
    estimator.fit(X)
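Why can the number of neighbors be lowered without recomputing the graph? A rough pure-Python sketch of the idea, not of scikit-learn's internals: if the neighbors of each point are stored sorted by distance, the graph for a smaller neighborhood is just a prefix of the stored one. The helpers knn_graph and restrict are hypothetical names used only here.

```python
import math


def knn_graph(points, n_neighbors):
    """For each point, its n_neighbors closest other points as
    (distance, index) pairs, sorted by distance."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        graph.append(dists[:n_neighbors])
    return graph


def restrict(graph, n_neighbors):
    """Reuse a precomputed graph for a smaller neighborhood size."""
    return [row[:n_neighbors] for row in graph]


points = [(i, (i * i) % 7) for i in range(8)]

big = knn_graph(points, n_neighbors=4)   # the expensive step, done once
small = restrict(big, n_neighbors=2)     # cheap: no distances recomputed

print(small == knn_graph(points, n_neighbors=2))
```

The reverse direction does not work: a graph computed for few neighbors cannot serve an estimator that asks for more.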

KNN-Based Imputation

scikit-learn now supports filling in missing values using k-nearest neighbors.

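import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows that have a value for it.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))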

Output:

[[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  7. ]]
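To see where the filled-in values come from, here is a hand-rolled pure-Python sketch of the procedure. It assumes uniform weights and a NaN-aware Euclidean distance that only uses coordinates present in both rows and rescales by the share of coordinates used, as documented for scikit-learn's nan_euclidean_distances. The functions nan_euclidean and knn_impute are written for this illustration and are not the library's code.

```python
import math

nan = float("nan")


def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates)."""
    shared = [(p, q) for p, q in zip(a, b)
              if not (math.isnan(p) or math.isnan(q))]
    if not shared:
        return math.inf
    squared = sum((p - q) ** 2 for p, q in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def knn_impute(X, n_neighbors=2):
    """Fill each missing entry with the mean of that feature over the
    n_neighbors closest rows that have a value for it."""
    filled = [list(row) for row in X]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows where feature j is observed.
            donors = [(nan_euclidean(row, other), other[j])
                      for k, other in enumerate(X)
                      if k != i and not math.isnan(other[j])]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:n_neighbors]]
            if nearest:  # leave the entry missing if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
result = knn_impute(X)
print(result)
```

On the matrix from the example above, the sketch fills in the same two values: 4 for the first row (the mean of 3 and 5) and 5.5 for the third row (the mean of 3 and 8).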

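Tree Pruning

Most tree-based estimators can now be pruned once their trees have been built. Pruning is controlled by the ccp_alpha parameter (minimal cost-complexity pruning).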

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

# ccp_alpha=0 disables pruning.
rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print("Average number of nodes without pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))

# A larger ccp_alpha prunes more aggressively.
rf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print("Average number of nodes with pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))

Output:

Average number of nodes without pruning 22.3
Average number of nodes with pruning 6.4

Retrieving Dataframes from OpenML

datasets.fetch_openml can now return a pandas dataframe, and so properly handles datasets with heterogeneous data:

from sklearn.datasets import fetch_openml

titanic = fetch_openml('titanic', version=1, as_frame=True)
print(titanic.data.head()[['pclass', 'embarked']])

Output:

   pclass embarked
0     1.0        S
1     1.0        S
2     1.0        S
3     1.0        S
4     1.0        S

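Checking an Estimator's scikit-learn Compatibility

Developers can use check_estimator to verify that their estimators are compatible with scikit-learn.

scikit-learn now also provides a pytest-specific decorator that lets pytest run every check independently and report the ones that fail.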

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils.estimator_checks import parametrize_with_checks


# Estimator instances are passed here; releases after 0.22 no longer
# accept bare classes.
@parametrize_with_checks([LogisticRegression(), DecisionTreeRegressor()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

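ROC AUC Now Supports Multiclass Classification

The roc_auc_score function can now also be used for multiclass classification.

Two averaging strategies are currently supported:

The one-vs-one algorithm computes the average of the pairwise ROC AUC scores.
The one-vs-rest algorithm computes the average of the ROC AUC scores of each class against all other classes.

In both cases, the multiclass ROC AUC score is computed from the model's probability estimates that a sample belongs to a particular class.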

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_classes=4, n_informative=16)
clf = SVC(decision_function_shape='ovo', probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovo'))

Output (it varies from run to run, since no random_state is set): 0.9957333333333332
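The two averaging strategies can be written out by hand to see how they differ. The following pure-Python sketch assumes unweighted (macro) averaging and uses made-up probabilities; binary_auc, ovr_auc and ovo_auc are illustrative helpers, not scikit-learn functions.

```python
from itertools import combinations


def binary_auc(scores, is_positive):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_auc(y, proba, classes):
    """One-vs-rest: average of 'class c against all other classes' AUCs."""
    aucs = [binary_auc([row[k] for row in proba], [label == c for label in y])
            for k, c in enumerate(classes)]
    return sum(aucs) / len(aucs)


def ovo_auc(y, proba, classes):
    """One-vs-one: average over class pairs, each scored in both directions."""
    pair_scores = []
    for (ka, a), (kb, b) in combinations(enumerate(classes), 2):
        # Keep only the samples that belong to one of the two classes.
        rows = [(label, row) for label, row in zip(y, proba) if label in (a, b)]
        auc_a = binary_auc([r[ka] for _, r in rows], [l == a for l, _ in rows])
        auc_b = binary_auc([r[kb] for _, r in rows], [l == b for l, _ in rows])
        pair_scores.append((auc_a + auc_b) / 2)
    return sum(pair_scores) / len(pair_scores)


# Made-up class probabilities for five samples and three classes.
y = [0, 0, 1, 1, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]

ovr = ovr_auc(y, proba, classes=[0, 1, 2])
ovo = ovo_auc(y, proba, classes=[0, 1, 2])
print(ovr, ovo)
```

The class sizes here are deliberately unequal. When every class has the same number of samples, the two unweighted averages coincide, so the choice of strategy matters mostly for imbalanced data.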

Links

Twitter: https://twitter.com/scikit_learn/status/1201847227561529346

Release highlights: https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_22_0.html#new-plotting-api

User guide: https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics
