Using dummy estimators to compare results

  • December 18, 2019
  • Notes

This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it's worthwhile to have a reference point for the model you'll eventually build.

Getting ready

In this recipe, we'll perform the following tasks:

1. Create some random data.

2. Fit the various dummy estimators.

We'll perform these two steps for regression data and then for classification data.

How to do it…

First, we'll create the random data:

>>> from sklearn.datasets import make_regression, make_classification

# classification is for later

>>> X, y = make_regression()

>>> from sklearn import dummy

>>> dumdum = dummy.DummyRegressor()

>>> dumdum.fit(X, y)

DummyRegressor(constant=None, strategy='mean')

By default, the estimator predicts by taking the mean of the training targets and returning that mean for every sample:

>>> dumdum.predict(X)[:5]

array([2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907])

There are two other strategies we can try. We can predict a supplied constant (note constant=None in the preceding output), and we can also predict the median value.

A supplied constant is only considered when the strategy is "constant".

Let's have a look:

>>> predictors = [("mean", None), ("median", None), ("constant", 10)]

>>> for strategy, constant in predictors:
...     dumdum = dummy.DummyRegressor(strategy=strategy, constant=constant)
...     dumdum.fit(X, y)
...     print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))

strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733

strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248

strategy: constant 10.0,10.0,10.0,10.0,10.0
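
The same baseline logic is useful for scoring regressors. As a quick sanity check (a sketch; LinearRegression and r2_score are my additions here, not part of the original recipe), note that a mean-predicting dummy scores an R² of exactly zero on its own training data, so any real regressor should clear that bar:

from sklearn import dummy
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(random_state=0)

# DummyRegressor predicts mean(y) everywhere, so its R^2 on the
# training data is 0.0 by the definition of the R^2 score.
dumdum = dummy.DummyRegressor().fit(X, y)
print(r2_score(y, dumdum.predict(X)))

# A real model should beat the zero baseline comfortably.
lr = LinearRegression().fit(X, y)
print(r2_score(y, lr.predict(X)))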

We actually have four options for classifiers. These strategies are similar to the continuous case; they're just slanted toward classification problems: constant always predicts the supplied class, stratified samples predictions according to the training class distribution, uniform picks classes uniformly at random, and most_frequent always predicts the majority class:

>>> predictors = [("constant", 0), ("stratified", None), ("uniform", None), ("most_frequent", None)]

We'll also need to create some classification data:

>>> X, y = make_classification()

>>> for strategy, constant in predictors:
...     dumdum = dummy.DummyClassifier(strategy=strategy, constant=constant)
...     dumdum.fit(X, y)
...     print("strategy: {}".format(strategy), ",".join(map(str, dumdum.predict(X)[:5])))

strategy: constant 0,0,0,0,0

strategy: stratified 1,0,0,1,0

strategy: uniform 0,0,0,1,1

strategy: most_frequent 1,1,1,1,1
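
If you want to see what the stratified strategy is actually doing, a quick check (a sketch using numpy; not from the original text) is to compare the class proportions of its predictions with those of the training labels, which should roughly match:

import numpy as np
from sklearn import dummy
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
clf = dummy.DummyClassifier(strategy='stratified', random_state=0).fit(X, y)

# Predictions are drawn from the empirical class distribution,
# so their proportions should roughly mirror those of y.
print(np.bincount(y) / len(y))
print(np.bincount(clf.predict(X)) / len(X))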

How it works…

It's always good to test your models against the simplest models, and that's exactly what the dummy estimators give you. For example, imagine a fraud model in which only 5 percent of the dataset is fraudulent. Therefore, we could probably fit a seemingly good model just by never predicting any fraud.

We can create this model with the most_frequent strategy, using the following command. It also gives a good example of why class imbalance causes problems:

>>> X, y = make_classification(20000, weights=[.95, .05])

>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')

>>> dumdum.fit(X, y)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

>>> from sklearn.metrics import accuracy_score

>>> print(accuracy_score(y, dumdum.predict(X)))

0.94575

We were actually correct very often (the majority class makes up roughly 95 percent of the samples, so always predicting it scores about 0.95), but that's not the point. The point is that this is our baseline. If we cannot create a fraud model that is more accurate than this, then it isn't worth our time.
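
To put the baseline to work, here's a sketch of the comparison you'd actually run; LogisticRegression is my stand-in for "a real model" and isn't part of the original recipe:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Reuse the imbalanced X, y from the previous example.
lr = LogisticRegression().fit(X, y)

# Only an improvement over the dummy's 0.94575 accuracy suggests
# the model has learned anything about the minority (fraud) class.
print(accuracy_score(y, lr.predict(X)))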