Deep Learning: Using PyTorch for Tabular Data

Author | Offir Inbar

Source | Medium

Editor | 代码医生团队

This post provides a detailed end-to-end guide to using PyTorch for tabular data, walked through with a practical example. By the end of the article, you will be able to build a PyTorch model of your own.

  • Use Python's set_trace() to get a full understanding of each step.
  • The full code can be found here:

https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb

数据集

I chose to work on Kaggle's New York City Taxi Fare Prediction competition, where the goal is to predict the fare of a taxi ride. Note that this is a regression problem. More details and the full dataset can be found here:

https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

The training data contains more than 2 million samples (5.31 GB). To minimize training time, a random sample of 100k training rows was taken.

import pandas
import random

filename = r"C:\Users\User\Desktop\offir\study\computer learning\Kaggle_comp\NYC Taxi\train.csv"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 100000  # desired sample size
skip = sorted(random.sample(range(1, n+1), n-s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")

GPU

The code was written using the free Google Colab.

To use the GPU: Runtime -> Change runtime type -> Hardware accelerator -> GPU.

The Code

Import the relevant libraries.

## import libraries

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
from torchvision import models
from torch.nn import init
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils import data
from torch.optim import lr_scheduler

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

# other
from IPython.core.debugger import set_trace
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
import math
from google.colab import files
import io
import datetime as dt
import re
import pandas_profiling as pp
from math import sqrt

# graphs
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import pylab
from IPython.display import display
import plotly.graph_objects as go

%matplotlib inline

# load tqdm
#!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
from tqdm import tqdm, tqdm_notebook, tnrange

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

After running the following commands, you will need to upload the CSV file from your computer. Check that the CSV file you upload is named sub_train.

# upload df_train csv file

uploaded = files.upload()
df_train = pd.read_csv(io.BytesIO(uploaded['sub_train.csv']))
df_train.head()

And upload the test set:

# upload df_test csv file

uploaded = files.upload()
df_test = pd.read_csv(io.BytesIO(uploaded['test.csv']))
df_test.head()

Data Preprocessing

The next step is to remove all fares that are less than 0 (they make no sense), as sketched below.
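A minimal sketch of that filter (the exact cell lives in the notebook; fare_amount is the target column from the dataset):

df_train = df_train[df_train.fare_amount >= 0]  # drop nonsensical negative fares
len(df_train)  # 99,990 rows remain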

The length of df_train is now 99,990. It is very important to keep track of the types and lengths of the different datasets at every step.

Stack the train and test sets so that they go through the same preprocessing.

The goal is to predict the fare amount, so it is dropped from the train_X dataframe.

In addition, I chose to predict the log of the price at training time.

train_X = df_train.drop(columns=['fare_amount'])
Y = np.log(df_train.fare_amount)

test_X = df_test
df = train_X.append(test_X, sort=False)

Feature Engineering

Time for some feature engineering.

Define a haversine_distance function and add DateTime columns to derive useful statistics. The full process can be seen in the GitHub repo; a reference sketch of the distance function follows the link below.

https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
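For reference, a standard vectorized haversine implementation looks roughly like this. This is a sketch, not the repo's exact code, and it assumes the pickup/dropoff column names shown later in this post:

import numpy as np

def haversine_distance(df, lat1, long1, lat2, long2):
    """Approximate great-circle distance in km between two (lat, long) column pairs."""
    r = 6371  # mean Earth radius in kilometers
    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[long2] - df[long1])
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# hypothetical usage matching the dataset's column names
df['dist_km'] = haversine_distance(df, 'pickup_latitude', 'pickup_longitude',
                                   'dropoff_latitude', 'dropoff_longitude')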

After this stage, the dataframe (inspect it with df.head()) contains the new time and distance features alongside the original columns.

Preparing the Model

Define the categorical and continuous columns, and keep only the relevant columns.

cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']

# keep only the cols for the model
df = df[['Hour', 'AMorPM', 'Weekday', 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]

Set the categorical columns to the 'category' dtype and label-encode them.

for col in df.columns:
    if col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])
        df[col] = df[col].astype('category')

Define the embedding sizes for the categorical columns. The rule of thumb for determining an embedding size is to divide the number of unique entries in each column by 2, capped at 50.

cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs

Now, handle the continuous variables. Before normalizing them, it is important to split back into the train and test sets in order to prevent data leakage.

df_train = df[:99990]
df_test = df[99990:]

# Normalizing

from pandas.api.types import is_numeric_dtype

# "Compute the means and stds of the `cont_names` columns to normalize them."
def Normalize(df):
    means, stds = {}, {}
    cont_names = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
    for n in cont_names:
        assert is_numeric_dtype(df[n]), (f"""Cannot normalize '{n}' column as it isn't numerical. Are you sure it doesn't belong in the categorical set of columns?""")
        means[n], stds[n] = df[n].mean(), df[n].std()
        df[n] = (df[n] - means[n]) / (1e-7 + stds[n])

Normalize(df_train)
Normalize(df_test)
X = df_train

Train-Validation Split

Split between the train and validation sets. In this case, the validation set is 20% of the total training set.

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42, shuffle=True)

After completing this step, it is important to look at the shapes of the different pieces, as in the check below.
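A quick sanity check (a sketch; the expected values in the comments follow from the 99,990-row sample, the 9 model columns, and the 80/20 split):

print(X_train.shape, y_train.shape)  # expected: (79992, 9) (79992,)
print(X_val.shape, y_val.shape)      # expected: (19998, 9) (19998,)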

The Model

At the moment, the data is stored in pandas dataframes, while PyTorch knows how to work with tensors. The following steps convert the data into the right types. Keep track of the data types at each step; comments with the current types have been added.

class RegressionColumnarDataset(data.Dataset):
    def __init__(self, df, cats, y):
        self.dfcats = df[cats]  # type: pandas.core.frame.DataFrame
        self.dfconts = df.drop(cats, axis=1)  # type: pandas.core.frame.DataFrame

        self.cats = np.stack([c.values for n, c in self.dfcats.items()], axis=1).astype(np.int64)  # type: numpy.ndarray
        self.conts = np.stack([c.values for n, c in self.dfconts.items()], axis=1).astype(np.float32)  # type: numpy.ndarray
        self.y = y.values.astype(np.float32)

    def __len__(self): return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

trainds = RegressionColumnarDataset(X_train, cat_cols, y_train)  # type: __main__.RegressionColumnarDataset
valds = RegressionColumnarDataset(X_val, cat_cols, y_val)  # type: __main__.RegressionColumnarDataset

Now it is time to use the PyTorch DataLoader. I chose a batch size of 128; feel free to play with it.

params = {'batch_size': 128,
          'shuffle': True}

traindl = DataLoader(trainds, **params)  # type: torch.utils.data.dataloader.DataLoader
valdl = DataLoader(valds, **params)  # type: torch.utils.data.dataloader.DataLoader

Define a TabularModel.

The goal is to define a model based on the number of continuous columns plus the number of categorical columns and their embeddings. Since this is a regression task, the output is a single float value.

  • ps: dropout probability for each layer
  • emb_drop: dropout applied to the embeddings
  • emb_szs: a list of tuples: the cardinality of each categorical variable paired with an embedding size
  • n_cont: number of continuous variables
  • out_sz: output size
# helper functions

from collections.abc import Iterable

def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn=None):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers

def ifnone(a, b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a

def listify(p, q):
    "Make `p` listy and the same length as `q`."
    if p is None: p = []
    elif isinstance(p, str): p = [p]
    elif not isinstance(p, Iterable): p = [p]
    # Rank 0 tensors in PyTorch are Iterable but don't have a length.
    else:
        try: a = len(p)
        except: p = [p]
    n = q if type(q) == int else len(p) if q is None else len(q)
    if len(p) == 1: p = p * n
    assert len(p) == n, f'List len mismatch ({len(p)} vs {n})'
    return list(p)


class TabularModel(nn.Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont:int, out_sz:int, layers, ps=None,
                 emb_drop:float=0., y_range=None, use_bn:bool=True, bn_final:bool=False):
        super().__init__()
        ps = ifnone(ps, [0]*len(layers))
        ps = listify(ps, layers)
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])  # type: torch.nn.modules.container.ModuleList
        self.emb_drop = nn.Dropout(emb_drop)  # type: torch.nn.modules.dropout.Dropout
        self.bn_cont = nn.BatchNorm1d(n_cont)  # type: torch.nn.modules.batchnorm.BatchNorm1d
        n_emb = sum(e.embedding_dim for e in self.embeds)  # n_emb = 17, type: int
        self.n_emb, self.n_cont, self.y_range = n_emb, n_cont, y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]  # type: list, len: 4
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]  # type: list, len: 3. the last is None because we finish with a linear layer
        layers = []
        for i, (n_in, n_out, dp, act) in enumerate(zip(sizes[:-1], sizes[1:], [0.]+ps, actns)):
            layers += bn_drop_lin(n_in, n_out, bn=use_bn and i != 0, p=dp, actn=act)
        if bn_final: layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*layers)  # type: torch.nn.modules.container.Sequential

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]  # look up an embedding for each categorical column
            x = torch.cat(x, 1)  # concatenate on dim 1 (remember dim 0 is the batch dimension)
            x = self.emb_drop(x)  # pass it through a dropout layer
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)  # BatchNorm1d
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont  # combine the categorical and continuous variables on dim 1
        x = self.layers(x)
        if self.y_range is not None:
            x = (self.y_range[1] - self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]  # squash the output into y_range
        return x.squeeze()

Set the y_range for the prediction (optional), and call the model.

y_range = (0, y_train.max()*1.2)
model = TabularModel(emb_szs=emb_szs, n_cont=len(cont_cols), out_sz=1, layers=[1000,500,250], ps=[0.001,0.01,0.01], emb_drop=0.04, y_range=y_range).to(device)

The model looks as follows; you can inspect the architecture by printing it:
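print(model)  # lists the embedding layers and each BatchNorm1d / Dropout / Linear / ReLU block in order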

Define an optimizer. I chose Adam with a learning rate of 1e-2. The learning rate is the first hyperparameter you should tune. Furthermore, there are different strategies for scheduling the learning rate (fit one cycle, cosine, etc.). Here, a constant learning rate is used; a scheduler sketch follows the next block.

from collections import defaultdict

opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # can add: weight_decay=

lr = defaultdict(list)
tloss = defaultdict(list)
vloss = defaultdict(list)
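If you want to experiment with a non-constant schedule, a minimal cosine-annealing sketch using PyTorch's built-in scheduler (not used in this post) would look like:

# a sketch: cosine annealing over the 10 training epochs instead of a constant rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
# then call scheduler.step() during training (the fit function below has a commented scheduler.step() hook)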

Train and Evaluate

Train the model. Try to track and understand every step; using the set_trace() command is very helpful. The evaluation metric is RMSE.

def inv_y(y): return np.exp(y)

def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(inv_y(y_pred), inv_y(targ)))

# note: this second definition overrides the one above, so RMSE is reported in log-fare space
def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(y_pred, targ))


def fit(model, train_dl, val_dl, loss_fn, opt, epochs=3):
    num_batch = len(train_dl)
    for epoch in tnrange(epochs):
        y_true_train = list()
        y_pred_train = list()
        total_loss_train = 0

        t = tqdm_notebook(iter(train_dl), leave=False, total=num_batch)
        for cat, cont, y in t:
            cat = cat.cuda()
            cont = cont.cuda()
            y = y.cuda()

            t.set_description(f'Epoch {epoch}')

            opt.zero_grad()  # zero out the gradients
            pred = model(cat, cont)
            loss = loss_fn(pred, y)

            loss.backward()  # do backprop
            lr[epoch].append(opt.param_groups[0]['lr'])
            tloss[epoch].append(loss.item())
            opt.step()
            # scheduler.step()

            t.set_postfix(loss=loss.item())

            y_true_train += list(y.cpu().data.numpy())
            y_pred_train += list(pred.cpu().data.numpy())
            total_loss_train += loss.item()

        train_acc = rmse(y_true_train, y_pred_train)
        train_loss = total_loss_train / len(train_dl)  # average loss per batch; len(train_dl) = number of training examples / batch size (128)

        if val_dl:
            y_true_val = list()
            y_pred_val = list()
            total_loss_val = 0
            for cat, cont, y in tqdm_notebook(val_dl, leave=False):
                cat = cat.cuda()
                cont = cont.cuda()
                y = y.cuda()
                pred = model(cat, cont)
                loss = loss_fn(pred, y)

                y_true_val += list(y.cpu().data.numpy())
                y_pred_val += list(pred.cpu().data.numpy())
                total_loss_val += loss.item()
                vloss[epoch].append(loss.item())
            valacc = rmse(y_true_val, y_pred_val)
            valloss = total_loss_val / len(valdl)
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f} | val_loss: {valloss:.4f} val_rmse: {valacc:.4f}')
        else:
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f}')

    return lr, tloss, vloss

Pass the inputs to the fit function. In this case, the loss function is MSELoss.

lr, tloss, vloss = fit(model=model, train_dl=traindl, val_dl=valdl, loss_fn=nn.MSELoss(), opt=opt, epochs=10)

Plot the training vs. validation loss.

t = [np.mean(tloss[el]) for el in tloss]
v = [np.mean(vloss[el]) for el in vloss]

plt.plot(t, label='Training loss')
plt.plot(v, label='Validation loss')
plt.title("Train VS Validation Loss over Epochs")
plt.xlabel("Epochs")
plt.legend(frameon=False)

That completes the training part.

After playing with the model and tuning the hyperparameters, you will reach a point where you are satisfied with it. Only then should you move on to the next step: testing the model on the test set.

The Test Set

Remember: the test set must go through the same process as the training set. The next steps "prepare" it for evaluation.

Split it into categorical and continuous columns, and turn them into tensors.

df_test_cats = df_test[['Hour', 'AMorPM', 'Weekday']]
test_cats = df_test_cats.astype(np.int64)
test_cats = torch.tensor(test_cats.values).cuda()

df_test_conts = df_test[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]
test_conts = df_test_conts.astype(np.float32)
test_conts = torch.tensor(test_conts.values).cuda()

Make predictions.

with torch.no_grad():
    model.eval()
    output = model.forward(test_cats, test_conts).cuda()
output

Note that the predictions are now a tensor. If you want to turn it into a pandas dataframe, follow the steps in the repo. Next, you can export it to a CSV file, as sketched below.

If you are participating in the Kaggle competition, upload it to Kaggle to see your score.
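A minimal sketch of that export, assuming the log-fare target used above and the key id column from the original Kaggle test file:

# move predictions to the CPU and undo the log transform (a sketch)
preds = np.exp(output.cpu().numpy())

# 'key' is the row id expected by the Kaggle submission format
submission = pd.DataFrame({'key': pd.read_csv('test.csv')['key'],
                           'fare_amount': preds})
submission.to_csv('submission.csv', index=False)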

Conclusion

To sum up, you have learned how to build a PyTorch model for tabular data from scratch. Be sure to dive into the full code and try to understand every line.