Kaggle Web Traffic Top Solutions, Part 1

January 4, 2021
AI

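First, some background on the competition.

The data setting is much the same as an e-commerce forecasting problem: each page is effectively a SKU, followed by its daily values. The training set runs from 2015-07-01 to 2017-09-10, and the task is to predict the 63 days starting from 2017-09-13 (there is a two-day gap, 09-11 and 09-12, between the end of training and the first forecast day).

I'll start by summarizing the code of the second-place finisher, CPMP, mainly because his code is simple and easy to follow. His repository combines vector-output neural networks with traditional machine learning methods, which complements the first-place pure seq2seq approach nicely. His solution is itself made up of several smaller parts, so let's first look at the README in his repository and his write-up on Kaggle.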

https://github.com/jfpuget/Kaggle/tree/master/WebTrafficPrediction

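Second-place solution for the Web Traffic Prediction competition on Kaggle.
Kaggle competition site: https://www.kaggle.com/c/web-traffic-time-series-forecasting
I gave some explanation of the approach on the competition forum: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39395
To reproduce the solution:
Clone the Kaggle repository.
Download the competition data into the Kaggle/input directory.
Go to the Kaggle/WebTrafficPrediction directory.
Run the keras-kf-12-stage2-sept-10.ipynb notebook. This trains the base deep learning models and computes predictions from them. It should produce several files in the Kaggle/submissions directory, including: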

keras_kf_12_stage2_sept_10_train.csv
keras_kf_12_stage2_sept_10_test.csv

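The file keras_kf_12_stage2_sept_10_test.csv was my first submission. It scored 36.91121, which would rank 4th overall.

Run the Pred_11-stage2-sept-10.ipynb notebook. This creates a median-based model and computes predictions from it. It should produce files in the Kaggle/submissions directory, including: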

pred_10_stage2_sept_10_train.csv
pred_10_stage2_sept_10_test.csv

Run the first_stage2.ipynb notebook. It computes the first date on which each page has non-zero data. It should create one file in the Kaggle/data directory:

first.csv

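Run the xgb_23_keras_7_2_stage2-sept-10-2.ipynb notebook. This creates the final model by running xgboost on the residuals of the neural network predictions. It uses past visits plus the outputs of the two notebooks above as features. It should produce files in the Kaggle/submission directory, including: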

xgb_1_2017-09-12-19-14-14_test.csv

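That file was my second submission. It scored 36.78499, which earned me second place.
Kaggle asked for a simpler model that delivers 90% of the performance if possible. Such a model is provided in keras_simple.ipynb. Its feature set is much simpler: basically the median of visits for each of the last 8 weeks of the training data, plus the site (e.g. es.wikipedia.org) and the access/agent method. Its output scores 37.58692, which would rank 9th.

Here we start by running the simplest NN model, the one in keras-simple. Its architecture is the same as the more complex NN model; only the features and data are trimmed down.

https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39395

That is CPMP's discussion post. I found it a bit hard to follow, so I went straight to the code.

First, the code in the keras-simple notebook: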

import numpy as np
import pandas as pd
import datetime
from matplotlib import pyplot as plt
%matplotlib inline

pd.options.display.max_rows = 10
pd.options.display.max_colwidth = 100
pd.options.display.max_columns = 600
from tqdm import tqdm
import gc

from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.decomposition import PCA

from tensorflow.keras.layers import BatchNormalization

from tensorflow.keras.models import Sequential, Model

from tensorflow.keras.layers import Input, Embedding, Dense, Activation, Dropout, Flatten

from tensorflow.keras import regularizers 

from tensorflow import keras

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GroupKFold

def init():
    # Seed NumPy's global random number generator for reproducibility.
    np.random.seed(0)

init()

def smape(y_true, y_pred):
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return np.nanmean(diff)

def smape2D(y_true, y_pred):
    return smape(np.ravel(y_true), np.ravel(y_pred))
    
def smape_mask(y_true, y_pred, threshold):
    denominator = (np.abs(y_true) + np.abs(y_pred)) 
    diff = np.abs(y_true - y_pred) 
    diff[denominator == 0] = 0.0
    
    return diff <= (threshold / 2.0) * denominator

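file = r'E:\机器学习之路\机器学习比赛\web traffic' + '\\'  # local data directory; adjust to wherever you store the files

That covers the usual imports and metric helpers. SMAPE is implemented here because it is the competition's evaluation metric, so local validation stays consistent with the leaderboard. Matching the training objective to the evaluation metric is not mandatory, though: in the more recent M5 competition the training objective and the evaluation metric differed. Sometimes the evaluation metric is non-differentiable and hard to optimize directly, and sometimes a different objective simply gives better results.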
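As a quick sanity check on the metric, here is a minimal sketch of SMAPE on a toy array. The function name `smape_toy` and the sample values are made up for illustration; the convention that a 0/0 term counts as zero error follows the smape helper above.

```python
import numpy as np

def smape_toy(y_true, y_pred):
    """Mean of |y - yhat| / ((|y| + |yhat|) / 2), with 0/0 terms counted as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_true - y_pred)
    # Divide only where the denominator is non-zero; leave 0 elsewhere.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return ratio.mean()

score = smape_toy([100, 0, 50], [110, 0, 0])
print(round(score, 4))
```

Each term lies between 0 and 2: a perfect prediction contributes 0, and predicting 0 for a non-zero value (or the reverse) contributes the maximum of 2.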

max_size = 181 # number of days in 2015 with 3 days before end

offset = 1/2

train_all = pd.read_csv(file+'train_2.csv')
train_all.head()

all_page = train_all.Page.copy()
train_key = train_all[['Page']].copy()
train_all = train_all.iloc[:,1:] * offset 
train_all.head()

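train_key stores the page information (the page names shown in the output above) and is used later for feature derivation. Then there is the offset: the original author scales all the raw data by 0.5. I don't really understand why the scaling is needed, especially since it is undone again later at prediction time.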
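One visible effect of the 0.5 scaling shows up in the rounding step of the training loop further down: predictions are rounded to multiples of offset, which corresponds to whole visits on the original scale. The sketch below uses made-up values, and the final division by offset is my assumption about how the scaling is undone, since that step is not shown in this post.

```python
import numpy as np

offset = 1 / 2
visits = np.array([3.0, 10.0, 0.0])           # hypothetical raw daily visits
scaled = visits * offset                       # what the notebook trains on

pred_scaled = np.array([1.4, 5.2, 0.1])        # hypothetical model output, scaled space
pred_scaled[pred_scaled < 0.5 * offset] = 0    # tiny predictions are zeroed
pred_rounded = offset * np.round(pred_scaled / offset)  # snap to multiples of offset

pred_visits = pred_rounded / offset            # assumed restore step: whole visits
print(pred_rounded, pred_visits)
```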

def get_date_index(date, train_all=train_all):
    # Return the position of the column named `date`, or None if it is absent.
    for idx, c in enumerate(train_all.columns):
        if date == c:
            return idx
    return None

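This function looks up the column index of a date. That works because the raw data is already aligned: each row is one page (the equivalent of one SKU in e-commerce sales forecasting) and each column holds that page's value for one day.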
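For comparison, pandas can do this lookup directly. The sketch below uses a tiny made-up frame and a hypothetical helper name `date_index`; `Index.get_loc` raises KeyError for a missing label, which is caught here to return None.

```python
import pandas as pd

# Toy wide table: one row per page, one column per date (values are made up).
toy = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=['2016-09-09', '2016-09-10', '2016-09-11'],
)

def date_index(date, frame):
    """Return the positional index of a date column, or None if absent."""
    try:
        return frame.columns.get_loc(date)
    except KeyError:
        return None

print(date_index('2016-09-10', toy))
print(date_index('2016-09-12', toy))
```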

train_end = get_date_index('2016-09-10') + 1
test_start = get_date_index('2016-09-13')

train = train_all.iloc[ : , (train_end - max_size) : train_end].copy().astype('float32')
test = train_all.iloc[:, test_start : (63 + test_start)].copy().astype('float32')
train = train.iloc[:,::-1].copy().astype('float32')

train_all = train_all.iloc[:,-(max_size):].astype('float32')
train_all = train_all.iloc[:,::-1].copy().astype('float32')

test_3_date = test.columns

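This defines the index of the last training date and the index of the first test date. Note that the author uses only the most recent 181 days (max_size = 181) to predict the following 63 days, and that the columns of train and train_all are reversed so that column 0 is the most recent day. The resulting windows are:

train_all covers 2017-03-14 to 2017-09-10,

train covers 2016-03-14 to 2016-09-10,

and test covers 2016-09-13 to 2016-11-14: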

data = [page.split('_') for page in tqdm(train_key.Page)]

access = ['_'.join(page[-2:]) for page in data]

site = [page[-3] for page in data]

page = ['_'.join(page[:-3]) for page in data]
page[:2]

train_key['PageTitle'] = page
train_key['Site'] = site
train_key['AccessAgent'] = access
train_key.head()

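This derives features from the page name.

Three features are derived: the page title, the site, and the access/agent type. The first-place solution derives page features too, but in more detail, going as far as extracting the country.

This is a typical static feature derivation. Feature engineering for time series forecasting is a superset of ordinary tabular feature engineering: besides the usual tabular techniques, it also uses time windows, rolling means, differencing and so on to hand-craft temporal features. What happens here is really just the simple text-feature derivation common in tabular data: a few meaningful fields are extracted from a string based on prior knowledge of its format.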
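The split above can be sketched on a single string. The page name below is made up but follows the title_site_access_agent format, and `parse_page` is a hypothetical helper; splitting from the right keeps underscores inside the title intact, which is the same effect as the list slicing used above.

```python
def parse_page(page):
    """Split 'Title_site_access_agent' from the right so titles may contain '_'."""
    title, site, access, agent = page.rsplit('_', 3)
    return {'PageTitle': title, 'Site': site, 'AccessAgent': access + '_' + agent}

example = parse_page('Some_Article_Title_en.wikipedia.org_all-access_spider')
print(example)
```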

train_norm = np.log1p(train).astype('float32')
train_norm.head()

train_all_norm = np.log1p(train_all).astype('float32')
train_all_norm.head()

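All visit counts are log1p-transformed. This is a common normalization for time series: it reduces the skew of the raw data and greatly shrinks the scale differences between series, which helps the model converge. I wrote about the relationship between MAE, MAPE and SMAPE earlier:

马东什么: "MAPE and SMAPE, MAE-based regression evaluation metrics" (zhuanlan.zhihu.com)

There are also a few things to watch out for with normalization:

马东什么: "Some issues in time series normalization" (zhuanlan.zhihu.com)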
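To make the effect of the log1p transform concrete, here is a small round trip on made-up visit counts. Counts spanning four orders of magnitude end up within a range of about 0 to 9, and expm1 recovers the original values.

```python
import numpy as np

visits = np.array([0.0, 9.0, 99.0, 9999.0])   # made-up daily visit counts
logged = np.log1p(visits)                      # log(1 + x), well defined at x = 0
restored = np.expm1(logged)                    # inverse of log1p

print(logged.round(3))
print(np.allclose(restored, visits))
```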

first_day = 1 # 2016-09-13 is a Tuesday
test_columns_date = list(test.columns)
test_columns_code = ['w%d_d%d' % (i // 7, (first_day + i) % 7) for i in range(63)]
test.columns = test_columns_code

test.head()

The column names are simply renamed:
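The naming scheme is easy to inspect in isolation. In this sketch the w index counts weeks into the 63-day horizon and the d index is a weekday code that cycles modulo 7, starting from first_day as in the code above.

```python
first_day = 1  # weekday code of the first forecast day, as in the code above
codes = ['w%d_d%d' % (i // 7, (first_day + i) % 7) for i in range(63)]

print(codes[:8])
print(codes[-1])
```

Every (week, weekday) pair occurs exactly once, so the 63 column names are unique.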

test.fillna(0, inplace=True)

test['Page'] = all_page
test.sort_values(by='Page', inplace=True)
test.reset_index(drop=True, inplace=True)

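Missing values are filled with 0. (By the way, most of the top time series solutions I have seen so far simply fill missing values with 0. Does missing-value handling really matter that little?) Then Page, the name of each page, is added back for the merges that follow.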

test = test.merge(train_key, how='left', on='Page', copy=False)

test.head()

This merges in the features derived from page.

test_all_id = pd.read_csv(file+'key_2.csv')

test_all_id['Date'] = [page[-10:] for page in tqdm(test_all_id.Page)]
test_all_id['Page'] = [page[:-11] for page in tqdm(test_all_id.Page)]
test_all_id.head()

This starts building the final prediction dataset: the key file is read, and its Page field is split into the page name and the date.
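The fixed-width slicing relies on every key ending in an underscore plus a 10-character date. A minimal sketch on one made-up key:

```python
# Made-up key in the same format: page name, underscore, ISO date.
key_page = 'Some_Article_Title_en.wikipedia.org_all-access_spider_2017-09-13'

date = key_page[-10:]    # last 10 characters: 'YYYY-MM-DD'
page = key_page[:-11]    # drop the date and the underscore before it
print(page, date)
```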

test_all = test_all_id.drop('Id', axis=1)
test_all['Visits_true'] = np.nan

test_all.Visits_true = test_all.Visits_true * offset  # has no effect: every value is NaN
test_all = test_all.pivot(index='Page', columns='Date', values='Visits_true').astype('float32').reset_index()

test_all['2017-11-14'] = np.nan
test_all.sort_values(by='Page', inplace=True)
test_all.reset_index(drop=True, inplace=True)

test_all.head()

This builds the dataset used for prediction.

test_all_columns_date = list(test_all.columns[1:])
first_day = 2 # 2017-09-13 is a Wednesday
test_all_columns_code = ['w%d_d%d' % (i // 7, (first_day + i) % 7) for i in range(63)]
cols = ['Page']
cols.extend(test_all_columns_code)
test_all.columns = cols
test_all.head()

The columns are renamed to make the later merge easier.

test_all = test_all.merge(train_key, how='left', on='Page')
test_all.head()

This merges in the page-derived features.

Get the labels:

y_cols = test.columns[:63]
y_cols
test = test.reset_index()
test_all = test_all.reset_index()
test_all = test_all[test.columns].copy()

After this series of operations, both DataFrames have changed shape.

[Figures: test_all before and after; test before and after]

Next:

train_cols = ['d_%d' % i for i in range(train_norm.shape[1])]
train_norm.columns = train_cols
train_all_norm.columns = train_cols

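This renames the columns of train_norm and train_all_norm to d_0, d_1, and so on.

[Figures: train_norm and train_all_norm before and after the renaming]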

sites = train_key.Site.unique()
test_site = pd.factorize(test.Site)[0]
test['Site_label'] = test_site
test_all['Site_label'] = test_site[:test_all.shape[0]]
accesses = train_key.AccessAgent.unique()
test_access = pd.factorize(test.AccessAgent)[0]
test['Access_label'] = test_access
test_all['Access_label'] = test_access[:test_all.shape[0]]
test0 = test.copy()
test_all0 = test_all.copy()
y_norm_cols = [c+'_norm' for c in y_cols]
y_pred_cols = [c+'_pred' for c in y_cols]

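Site and access are label-encoded here with pd.factorize; the one-hot columns that the network actually consumes are built later, inside add_median. (I wonder whether an embedding layer would work better. With few categories, one-hot keeps the network simpler, since there is no need to set up embeddings and then flatten and concatenate them.)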
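The difference between the two encodings can be seen on a toy column. The site values below are made up; `pd.factorize` returns integer codes plus the unique values, and the one-hot frame is built with the same comparison trick that add_median uses.

```python
import pandas as pd

site = pd.Series(['en.wikipedia.org', 'zh.wikipedia.org', 'en.wikipedia.org'])  # made-up rows

labels, uniques = pd.factorize(site)     # integer codes in order of first appearance
onehot = pd.DataFrame({s: (site == s).astype('float32') for s in uniques})

print(labels)
print(onehot)
```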

max_periods = 16
periods = [(0,1), (1,2), (2,3), (3,4), 
           (4,5), (5,6), (6,7), (7,8),  
           ]


site_cols = list(sites)
access_cols = list(accesses)

for c in y_pred_cols:
    test[c] = np.nan
    test_all[c] = np.nan

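Here y_pred_cols is the list of y_cols names with a _pred suffix (w0_d1_pred, w0_d2_pred, and so on), which reserves columns in advance for the predictions.

Next: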

test1 = add_median(test, train_norm,
                   train_key, periods, max_periods, 3)

test_all1 = add_median(test_all, train_all_norm, 
                       train_key, periods, max_periods, 5)

add_median normalizes the data and adds features that measure trend. Let's look at its code; my notes are in the comments.

# all visits is median
def add_median(test, train,
               train_key, periods, max_periods, first_train_weekday):
    # Columns are ordered most recent first, so this keeps only the
    # latest 7 * 16 = 112 days.
    train =  train.iloc[:,:7*max_periods]
    
    df = train_key[['Page']].copy()
    # AllVisits is the median of the (log1p) visits over those 112 days. The median is used
    # because the mean is sensitive to extreme values; a trimmed mean would also work.
    df['AllVisits'] = train.median(axis=1).fillna(0)
    # Merge the overall median into test.
    test = test.merge(df, how='left', on='Page', copy=False)
    # Again a plain zero fill.
    test.AllVisits = test.AllVisits.fillna(0).astype('float32')
    
    for site in sites:
        test[site] = (1 * (test.Site == site)).astype('float32')  # one-hot encode site
    
    for access in accesses:
        test[access] = (1 * (test.AccessAgent == access)).astype('float32')  # one-hot encode access

    # periods = [(0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), (7,8)]
    for (w1, w2) in periods:
        df = train_key[['Page']].copy()
        d = 'median_%d_%d' % (w1, w2)
        # Weekly median: (0,1) is the most recent 7 days, (1,2) the 7 days
        # before that (columns 7:14), and so on.
        df[d] = train.iloc[:,7*w1:7*w2].median(axis=1, skipna=True)
        # Merge each page's median for that period in as column d.
        test = test.merge(df, how='left', on='Page', copy=False)
        # Subtract the overall 112-day median from column d; fill missing values with 0.
        test[d] = (test[d] - test.AllVisits).fillna(0).astype('float32')

    for c_norm, c in zip(y_norm_cols, y_cols):
        # log1p-transform the targets in y_cols, then subtract the 112-day median.
        test[c_norm] = (np.log1p(test[c]) - test.AllVisits).astype('float32')
    gc.collect()

    return test

Core code, part 1:

        df[d] = train.iloc[:,7*w1:7*w2].median(axis=1, skipna=True)
# Weekly medians within the most recent 112 days: (0,1) is the most recent 7 days,
# (1,2) is the 7 days before that (columns 7:14), and so on.
        test = test.merge(df, how='left', on='Page', copy=False)
# Merge each page's median for that period in as column d.
        test[d] = (test[d] - test.AllVisits).fillna(0).astype('float32')
# Subtract the overall 112-day median from column d; fill missing values with 0.

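This is a typical way to derive trend features; similar processing shows up in risk control as well. An example helps. The author takes the most recent 112 days of data and splits off the following weekly intervals (periods):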

periods = [(0,1), (1,2), (2,3), (3,4), 
           (4,5), (5,6), (6,7), (7,8),  
           ]

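For each interval, the median of visits is computed, and the overall 112-day median is subtracted from it so that the trend becomes a feature. For example, if the 112-day median is 500 and some week's median is 5000, then 5000 - 500 = 4500 is a clear upward trend; if some week's median is 0, then 0 - 500 = -500 is a clear downward trend, and the size of the difference conveys how strong the trend is. The trend could also be expressed as a ratio, e.g. 5000/500 and 0/500. (In the actual code both medians are taken on log1p-transformed values, so the subtraction already behaves roughly like a log-ratio on the original scale.)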
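The trend feature can be sketched on a tiny example. The values below are made up: one page with two weeks of log1p visits, ordered most recent first as in the notebook, where the recent week sits about one log unit above the older week.

```python
import pandas as pd

# One made-up page, 14 days of log1p visits, most recent day first.
recent_week = [6.0, 6.1, 5.9, 6.2, 6.0, 6.3, 6.1]
older_week = [5.0, 5.1, 4.9, 5.2, 5.0, 5.3, 5.1]
train = pd.DataFrame([recent_week + older_week])

overall = train.median(axis=1)                       # median over the whole window
features = {}
for w1, w2 in [(0, 1), (1, 2)]:
    weekly = train.iloc[:, 7 * w1:7 * w2].median(axis=1)
    features['median_%d_%d' % (w1, w2)] = float((weekly - overall).iloc[0])

print(features)
```

The recent week comes out positive and the older week negative, which is the upward trend expressed as two numbers.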

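Core code, part 2, the normalization:

    for c_norm, c in zip(y_norm_cols, y_cols):
        test[c_norm] = (np.log1p(test[c]) - test.AllVisits).astype('float32')
# log1p-transform the y_cols in test, then subtract the 112-day median of visits
    gc.collect()

Note that AllVisits has already been log1p-transformed, whereas the raw y_cols series in test have not, which is why log1p is applied to them here.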

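So the normalization works as follows: for every page (SKU), each day of the label data has the median of a recent historical window subtracted from it.

[Figure: schematic of the history window and the 63-day forecast window]

For the 63 forecast days, each day's value has the median of the preceding 112 days subtracted, which removes the trend. Comparing a few pages' series before and after the step shows the change.

[Figure: example page series before and after normalization]

Compared with a plain log transform, there is one extra step: subtracting the median of a recent window. In essence this is a simple median forecast plus an NN model; at prediction time the two are combined by summation.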
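The "median forecast plus NN" idea amounts to a reversible transform of the target, sketched below on made-up numbers for one page. To show the round trip, the true target stands in for the network's prediction.

```python
import numpy as np

history = np.array([20.0, 22.0, 18.0, 21.0, 19.0])   # made-up recent visits for one page
future = np.array([25.0, 17.0])                       # made-up visits to forecast

level = np.median(np.log1p(history))                  # per-page median in log1p space
target = np.log1p(future) - level                     # what the network is trained on

# At prediction time the median is added back before inverting the log.
restored = np.expm1(target + level)
print(target.round(3), restored)
```

A day above the recent median gets a positive target and a day below it gets a negative one, so the network only has to learn deviations from each page's own level.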

Finally, model construction and training:

from tensorflow.keras import backend as K



group = pd.factorize(test1.Page)[0]

n_bag = 20
kf = GroupKFold(n_bag)
batch_size=4096

#print('week:', week)
test2 = test1
test_all2 = test_all1

def smape_error(y_true, y_pred):
    return K.mean(K.clip(K.abs(y_pred - y_true),  0.0, 1.0), axis=-1)


def get_model(input_dim, num_sites, num_accesses, output_dim):
    
    dropout = 0.5
    regularizer = 0.00004
    main_input = Input(shape=(input_dim,), dtype='float32', name='main_input')
    site_input = Input(shape=(num_sites,), dtype='float32', name='site_input')
    access_input = Input(shape=(num_accesses,), dtype='float32', name='access_input')
    
    
    x0 = keras.layers.concatenate([main_input, site_input, access_input])
    x = Dense(200, activation='relu', 
              kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(regularizer))(x0)
    x = Dropout(dropout)(x)
    x = keras.layers.concatenate([x0, x])
    x = Dense(200, activation='relu', 
              kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(regularizer))(x)
    x = BatchNormalization(beta_regularizer=regularizers.l2(regularizer),
                           gamma_regularizer=regularizers.l2(regularizer)
                          )(x)
    x = Dropout(dropout)(x)
    x = Dense(100, activation='relu', 
              kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(regularizer))(x)
    x = Dropout(dropout)(x)

    x = Dense(200, activation='relu', 
              kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(regularizer))(x)
    x = Dropout(dropout)(x)
    x = Dense(output_dim, activation='linear', 
              kernel_initializer='lecun_uniform', kernel_regularizer=regularizers.l2(regularizer))(x)

    model =  Model(inputs=[main_input, site_input, access_input], outputs=[x])
    model.compile(loss=smape_error, optimizer='adam')
    return model

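The model is built as above: a typical multi-input, single-output network, where the output is a vector. For guidance on designing network architectures, search Baidu or Google for the various deep learning tuning guides. I also recommend two books. One is Jason Brownlee's "Better Deep Learning", which is simple and clear, with demos plus explanations of the underlying ideas. The other is "Neural Networks: Tricks of the Trade", a collection written by many prominent researchers; its second edition came out in 2012. Beyond that, well-designed networks from Kaggle like this one make good templates to start from.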
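One detail of the model code worth spelling out is the loss. Despite its name, smape_error is a mean absolute error with each element clipped to at most 1, computed on the normalized log-space targets. A NumPy sketch of the same arithmetic, with a hypothetical function name and made-up values:

```python
import numpy as np

def clipped_mae(y_true, y_pred):
    """Per-sample mean of |error| clipped to [0, 1], mirroring the Keras loss above."""
    return np.mean(np.clip(np.abs(y_pred - y_true), 0.0, 1.0), axis=-1)

y_true = np.array([[0.0, 0.0, 0.0]])
y_pred = np.array([[0.2, -0.5, 3.0]])   # the last error is large and gets clipped to 1
loss = clipped_mae(y_true, y_pred)
print(loss)
```

Because of the clipping, an error of 3.0 costs no more than an error of 1.0, so very large errors stop contributing gradient, which limits the influence of outliers.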

Next, building the samples:

X, Xs, Xa, y = test2[num_cols].values, test2[site_cols].values, test2[access_cols].values, test2[y_norm_cols].values
X_all, Xs_all, Xa_all, y_all = test_all2[num_cols].values, test_all2[site_cols].values, test_all2[access_cols].values, test_all2[y_norm_cols].fillna(0).values

y_true = test2[y_cols]
y_all_true = test_all2[y_cols]

Here site_cols and access_cols are static features that do not change over time; the only dynamic features are num_cols, which are:

['median_0_1',
'median_1_2',
'median_2_3',
'median_3_4',
'median_4_5',
'median_5_6',
'median_6_7',
'median_7_8']
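The excerpts in this post never define num_cols (or the models list used in the training loop). A minimal sketch of how num_cols could be built so that it matches the column names created in add_median is shown below; this construction is my assumption, not code from the notebook.

```python
periods = [(0, 1), (1, 2), (2, 3), (3, 4),
           (4, 5), (5, 6), (6, 7), (7, 8)]

# Assumed construction: one feature name per weekly period, matching add_median.
num_cols = ['median_%d_%d' % (w1, w2) for (w1, w2) in periods]
print(num_cols)

# Also an assumption, not shown in the post: one network per fold, e.g.
# models = [get_model(len(num_cols), len(site_cols), len(access_cols), len(y_cols))
#           for _ in range(n_bag)]
```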


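From the add_median code shown earlier, we can see that only the 112 days closest to the forecast date are used from the training data. Going further, for each week (periods defines them as weeks 1 through 8, counting back from the most recent day), the only dynamic feature kept is that week's median traffic minus the overall median of the 112 days.

In short, the dynamic features are the weekly medians of the 8 most recent weeks within the past 112 days (112 days, not the 104 shown in the diagram above). Finally, the training part: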

group = pd.factorize(test1.Page)[0]
print(group)
n_bag = 20
kf = GroupKFold(n_bag)  # every page appears exactly once, so GroupKFold behaves
# much like a plain KFold here; this is 20-fold cross-validation
batch_size=4096 

#print('week:', week)
test2 = test1
test_all2 = test_all1



best_score = 100
best_all_score = 100


save_pred = 0
saved_pred_all = 0

for n_epoch in range(10, 201, 10):
    print('************** start %d epochs **************************' % n_epoch)

    y_pred0 = np.zeros((y.shape[0], y.shape[1]))  # out-of-fold predictions for the validation folds
    y_all_pred0 = np.zeros((n_bag, y_all.shape[0], y_all.shape[1]))  # per-fold predictions
    # for the final test set
    for fold, (train_idx, test_idx) in enumerate(kf.split(X, y, group)):
        print('train fold', fold, end=' ')    
        model = models[fold]  # 20 models are built; each fold trains its own
        X_train, Xs_train, Xa_train, y_train = X[train_idx,:], Xs[train_idx,:], Xa[train_idx,:], y[train_idx,:]
        X_test, Xs_test, Xa_test, y_test = X[test_idx,:], Xs[test_idx,:], Xa[test_idx,:], y[test_idx,:]

        model.fit([ X_train, Xs_train, Xa_train],  y_train, 
                  epochs=10, batch_size=batch_size, verbose=0, shuffle=True, 
                  #validation_data=([X_test, Xs_test, Xa_test],  y_test)
                 )  # each call trains for another 10 epochs
        y_pred = model.predict([ X_test, Xs_test, Xa_test], batch_size=batch_size)
        y_all_pred = model.predict([X_all, Xs_all, Xa_all], batch_size=batch_size)

        y_pred0[test_idx,:] = y_pred  # store the validation-fold predictions
        y_all_pred0[fold,:,:]  = y_all_pred  # store the test-set predictions

        # Ensemble with the median model: add back the 112-day median.
        y_pred += test2.AllVisits.values[test_idx].reshape((-1,1))
        y_pred = np.expm1(y_pred)  # invert the log1p transform
        y_pred[y_pred < 0.5 * offset] = 0  # predictions below 0.5 * 0.5 = 0.25 are set to 0
        res = smape2D(test2[y_cols].values[test_idx, :], y_pred)  # validation SMAPE
        y_pred = offset*((y_pred / offset).round())  # round to multiples of offset
        res_round = smape2D(test2[y_cols].values[test_idx, :], y_pred)  # validation SMAPE
        # after rounding

        # Same steps as above, applied to the test set.
        y_all_pred += test_all2.AllVisits.values.reshape((-1,1))
        y_all_pred = np.expm1(y_all_pred)
        y_all_pred[y_all_pred < 0.5 * offset] = 0
        res_all = smape2D(test_all2[y_cols], y_all_pred)
        y_all_pred = offset*((y_all_pred / offset).round())
        res_all_round = smape2D(test_all2[y_cols], y_all_pred)
        print('smape train: %0.5f' % res, 'round: %0.5f' % res_round,
              '     smape LB: %0.5f' % res_all, 'round: %0.5f' % res_all_round)

    #y_pred0  = np.nanmedian(y_pred0, axis=0)
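    y_all_pred0  = np.nanmedian(y_all_pred0, axis=0)  # bag the folds: median across all fold predictions

    # Same post-processing as inside the fold loop: add the median back,
    # invert log1p, zero out tiny values, compute SMAPE.
    y_pred0  += test2.AllVisits.values.reshape((-1,1))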
    y_pred0 = np.expm1(y_pred0)
    y_pred0[y_pred0 < 0.5 * offset] = 0
    res = smape2D(y_true, y_pred0)
    print('smape train: %0.5f' % res, end=' ')
    y_pred0 = offset*((y_pred0 / offset).round())
    res_round = smape2D(y_true, y_pred0)
    print('round: %0.5f' % res_round)

    y_all_pred0 += test_all2.AllVisits.values.reshape((-1,1))  # same as above
    y_all_pred0 = np.expm1(y_all_pred0)
    y_all_pred0[y_all_pred0 < 0.5 * offset] = 0
    #y_all_pred0 = y_all_pred0.round()
    res_all = smape2D(y_all_true, y_all_pred0)
    print('     smape LB: %0.5f' % res_all, end=' ')
    y_all_pred0 = offset*((y_all_pred0 / offset).round())
    res_all_round = smape2D(y_all_true, y_all_pred0)
    print('round: %0.5f' % res_all_round, end=' ')

    if res_round < best_score:  # keep the predictions whenever the score improves,
        # much like a model checkpoint
        print('saving')
        best_score = res_round
        best_all_score = res_all_round
        test.loc[:, y_pred_cols] = y_pred0  # save the predictions
        test_all.loc[:, y_pred_cols] = y_all_pred0
    else:
        print()
    print('*************** end %d epochs **************************' % n_epoch)
print('best saved LB score:', best_all_score)
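The bagging step in the loop above, taking the median across fold models and then undoing the normalization, can be sketched in isolation. All numbers below are made up: three folds, two pages, two forecast days.

```python
import numpy as np

# Made-up predictions from 3 folds for 2 pages x 2 days (normalized log space).
fold_preds = np.array([
    [[0.10, 0.20], [-0.30, 0.00]],
    [[0.12, 0.18], [-0.28, 0.05]],
    [[0.90, 0.22], [-0.31, 0.02]],   # one fold is far off on the first cell
])
level = np.array([3.0, 1.0])          # made-up per-page log1p medians (AllVisits)

bagged = np.nanmedian(fold_preds, axis=0)         # median across folds
visits = np.expm1(bagged + level.reshape(-1, 1))  # add the median back, invert log1p
print(bagged)
print(visits.round(2))
```

Taking the median rather than the mean means the one fold that predicted 0.90 for the first cell does not drag the bagged value away from the other two.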

My laptop is about to run out of battery, so I'll stop here for now.
