Deep Learning: Using PyTorch for Tabular Data

March 4, 2020
Notes

Author | Offir Inbar

Source | Medium

Editor | 代碼醫生團隊

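This article provides a detailed, end-to-end guide to using PyTorch for tabular data, built around a practical example. By the end of it, you should be able to build your own PyTorch model.

Using Python's set_trace() at each step is a good way to understand exactly what is happening.
The complete code can be found here: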

https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb

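Dataset

I chose to work on the New York City Taxi Fare Prediction competition from Kaggle, where the goal is to predict the fare of a taxi ride. Note that this is a regression problem. More details and the full dataset can be found here:

https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

The training data contains more than 2 million samples (5.31 GB). To minimize training time, I took a random sample of 100k training rows.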

import pandas
import random

# Path to the full Kaggle training file. The path separators were lost in this copy;
# point this at your local train.csv.
filename = r"C:UsersUserDesktopoffirstudycomputer learningKaggle_compNYC Taxitrain.csv"

with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 100000  # desired sample size
# rows to skip; row 0 is the header, so it is never in the skip list
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")

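GPU

The code was written in Google Colab, which is free to use.

To use a GPU, go to: Runtime -> Change runtime type -> Hardware accelerator -> GPU.

Code

Import the relevant libraries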

## import libraries

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
from torchvision import models
from torch.nn import init
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils import data
from torch.optim import lr_scheduler

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

# other
from IPython.core.debugger import set_trace
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
import math
from google.colab import files
import io
import datetime as dt
import re
import pandas_profiling
import pandas_profiling as pp
from math import sqrt

# graphs
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import matplotlib.dates as dates
import pylab
import matplotlib
import matplotlib.dates
from IPython.display import display
import plotly.graph_objects as go

%matplotlib inline

# load tqdm
#!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
from tqdm import tqdm, tqdm_notebook, tnrange

# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

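After running the following cell, you will be asked to upload the CSV file from your computer. Make sure the CSV file you upload is named sub_train.csv.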

# upload df_train csv file
uploaded = files.upload()
df_train = pd.read_csv(io.BytesIO(uploaded['sub_train.csv']))
df_train.head()

Upload the test set

# upload df_test csv file
uploaded = files.upload()
df_test = pd.read_csv(io.BytesIO(uploaded['test.csv']))
df_test.head()

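Data preprocessing

The next step is to remove all fares below 0 (they make no sense).

df_train now has 99,990 rows. It is important to keep track of the type and length of the different datasets at every step.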
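The filtering code itself is not shown in this write-up. A minimal sketch of how it might look with pandas follows, using a tiny made-up frame in place of the real data. The values are invented for illustration; only the fare_amount column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for df_train; the fares are invented for illustration.
df_train = pd.DataFrame({
    "fare_amount": [4.5, -2.5, 16.9, 0.0, -52.0],
    "passenger_count": [1, 1, 2, 1, 3],
})

rows_before = len(df_train)
# Keep only rows whose fare is not negative.
df_train = df_train[df_train["fare_amount"] >= 0].reset_index(drop=True)
rows_after = len(df_train)

print(f"removed {rows_before - rows_after} rows, {rows_after} left")
```

A fare of exactly 0 passes this filter, and np.log(0) is -inf, so a stricter `> 0` filter may be worth considering if such rows exist in the sample.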

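Stack the train and test sets so that they go through the same preprocessing.

The goal is to predict the fare, so it is dropped from the train_X dataframe.

In addition, I chose to predict the log of the fare during training.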

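train_X = df_train.drop(columns=['fare_amount'])
Y = np.log(df_train.fare_amount)  # train on the log of the fare

test_X = df_test
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([train_X, test_X], sort=False)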

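Feature engineering

Let's do some feature engineering.

Define the haversine distance function and add DateTime columns to derive useful statistics. The full process can be seen in the GitHub repo: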

https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
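The helper functions are not reproduced here. As a rough sketch, a haversine distance and the three time features used later (Hour, AMorPM, Weekday) could be computed as below. The function names, the Earth radius of 6371 km, and the timestamp format are assumptions for illustration, not taken from the notebook.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def time_features(timestamp):
    """Return (Hour, AMorPM, Weekday) for a 'YYYY-MM-DD HH:MM:SS' string.

    Weekday is an integer with Monday = 0.
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return ts.hour, ("AM" if ts.hour < 12 else "PM"), ts.weekday()

# Hypothetical ride: coordinates and time are made up.
dist_km = haversine_km(40.7614, -73.9798, 40.6513, -73.8803)
hour, am_or_pm, weekday = time_features("2015-01-27 13:08:24")
print(round(dist_km, 2), hour, am_or_pm, weekday)
```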

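After this stage, the dataframe looks like this:

Preparing the model

Define the categorical and continuous columns, and keep only the relevant ones.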

cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']

# keep only the cols for the model
df = df[['Hour', 'AMorPM', 'Weekday', 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]

Set the categorical columns to the 'category' dtype and label-encode them.

for col in df.columns:
    if col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])  # values -> integer codes
        df[col] = df[col].astype('category')
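To make the encoding step concrete, here is a small sketch on a made-up AMorPM column. LabelEncoder maps each distinct value to an integer code (its classes are sorted, so 'AM' becomes 0 and 'PM' becomes 1), and the 'category' dtype then lets pandas report the number of distinct levels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values, for illustration only.
toy = pd.DataFrame({"AMorPM": ["PM", "AM", "PM", "AM", "AM"]})

encoder = LabelEncoder()
toy["AMorPM"] = encoder.fit_transform(toy["AMorPM"])  # strings -> integer codes
toy["AMorPM"] = toy["AMorPM"].astype("category")      # mark the column as categorical

codes = toy["AMorPM"].tolist()
n_levels = len(toy["AMorPM"].cat.categories)
print(list(encoder.classes_), codes, n_levels)
```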

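Define the embedding sizes for the categorical columns. The rule of thumb is to take half the number of unique entries in each column (rounded up), but no more than 50.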

cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
emb_szs
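As a worked example of the rule, assume the three categorical columns have 24, 2 and 7 distinct values (hours of the day, AM/PM, days of the week). These cardinalities are an assumption about the engineered features, though they agree with the n_emb = 17 noted in the model code.

```python
# Assumed number of distinct values per categorical column.
cat_szs = {"Hour": 24, "AMorPM": 2, "Weekday": 7}

# Half the cardinality (rounded up), capped at 50.
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs.values()]
total_emb_width = sum(width for _, width in emb_szs)

print(emb_szs, total_emb_width)
```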

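Now deal with the continuous variables. Before normalizing them, it is important to split the data back into the train and test sets, to prevent data leakage.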

# .copy() so the in-place normalization below does not act on a view of df
df_train = df[:99990].copy()
df_test = df[99990:].copy()

# Normalizing
from pandas.api.types import is_numeric_dtype

# "Compute the means and stds of `self.cont_names` columns to normalize them."
def Normalize(df):
    means, stds = {}, {}
    cont_names = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
    for n in cont_names:
        assert is_numeric_dtype(df[n]), (f"""Cannot normalize '{n}' column as it isn't numerical. Are you sure it doesn't belong in the categorical set of columns?""")
        means[n], stds[n] = df[n].mean(), df[n].std()
        df[n] = (df[n] - means[n]) / (1e-7 + stds[n])

Normalize(df_train)
Normalize(df_test)
X = df_train
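Note that the code above standardizes the test set with its own mean and standard deviation. A common alternative, sketched here as an option rather than as what the notebook does, is to compute the statistics on the training rows only and reuse them for the test rows. The helper names and the toy values are hypothetical.

```python
import pandas as pd

def fit_stats(df, cols):
    """Mean and standard deviation of each column, computed on df only."""
    return {c: (df[c].mean(), df[c].std()) for c in cols}

def apply_stats(df, stats, eps=1e-7):
    """Return a standardized copy of df using previously computed stats."""
    out = df.copy()
    for c, (mean, std) in stats.items():
        out[c] = (out[c] - mean) / (std + eps)
    return out

# Toy data, invented for illustration.
train = pd.DataFrame({"dist_km": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"dist_km": [2.5, 10.0]})

stats = fit_stats(train, ["dist_km"])
train_norm = apply_stats(train, stats)
test_norm = apply_stats(test, stats)  # same mean/std as the training set
print(test_norm)
```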

Train-validation split

Split the data into training and validation sets. In this case, the validation set is 20% of the total training set.

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42, shuffle=True)

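After this step, it is important to look at the different shapes.

Model

At this point the data is stored in pandas dataframes, while PyTorch works with tensors. The following steps convert the data to the right type. Keep track of the data type at every step; comments with the current data type have been added.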

class RegressionColumnarDataset(data.Dataset):
    def __init__(self, df, cats, y):
        self.dfcats = df[cats]  # type: pandas.core.frame.DataFrame
        self.dfconts = df.drop(cats, axis=1)  # type: pandas.core.frame.DataFrame

        self.cats = np.stack([c.values for n, c in self.dfcats.items()], axis=1).astype(np.int64)  # type: numpy.ndarray
        self.conts = np.stack([c.values for n, c in self.dfconts.items()], axis=1).astype(np.float32)  # type: numpy.ndarray
        self.y = y.values.astype(np.float32)

    def __len__(self): return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

trainds = RegressionColumnarDataset(X_train, cat_cols, y_train)  # type: __main__.RegressionColumnarDataset
valds = RegressionColumnarDataset(X_val, cat_cols, y_val)  # type: __main__.RegressionColumnarDataset

Now it is time to use the PyTorch DataLoader. The batch size chosen is 128; feel free to experiment with it.

params = {'batch_size': 128,
          'shuffle': True}

traindl = DataLoader(trainds, **params)  # type: torch.utils.data.dataloader.DataLoader
valdl = DataLoader(valds, **params)  # type: torch.utils.data.dataloader.DataLoader
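To see what the loader hands to the model, the sketch below builds a tiny stand-in dataset with the same three-part items (categorical codes, continuous values, target) and pulls one batch. The dataset class and the random data are placeholders, not the article's RegressionColumnarDataset.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyColumnarDataset(Dataset):
    """Placeholder dataset returning [cats, conts, y] items."""
    def __init__(self, n_rows=10, n_cat=3, n_cont=6):
        rng = np.random.default_rng(0)
        self.cats = rng.integers(0, 2, size=(n_rows, n_cat)).astype(np.int64)
        self.conts = rng.normal(size=(n_rows, n_cont)).astype(np.float32)
        self.y = rng.normal(size=n_rows).astype(np.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

loader = DataLoader(ToyColumnarDataset(), batch_size=4, shuffle=True)
cat, cont, y = next(iter(loader))

# The default collate function stacks the numpy items into tensors.
print(cat.shape, cat.dtype)    # categorical codes for the batch
print(cont.shape, cont.dtype)  # continuous features for the batch
print(y.shape)                 # targets for the batch
print(len(loader))             # number of batches: ceil(10 / 4)
```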

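Define a TabularModel

The goal is to define a model based on the number of continuous columns plus the number of categorical columns and their embeddings. Since this is a regression task, the output is a single float value.

ps: dropout probability for each layer
emb_drop: dropout applied to the embeddings
emb_szs: list of tuples, each pairing a categorical variable's size with an embedding size
n_cont: number of continuous variables
out_sz: output size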

# help functions
from collections.abc import Iterable

def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn=None):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers

def ifnone(a, b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a

def listify(p, q):
    "Make `p` listy and the same length as `q`."
    if p is None: p = []
    elif isinstance(p, str): p = [p]
    elif not isinstance(p, Iterable): p = [p]
    # Rank 0 tensors in PyTorch are Iterable but don't have a length.
    else:
        try: a = len(p)
        except: p = [p]
    n = q if type(q)==int else len(p) if q is None else len(q)
    if len(p)==1: p = p * n
    assert len(p)==n, f'List len mismatch ({len(p)} vs {n})'
    return list(p)

class TabularModel(nn.Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont:int, out_sz:int, layers, ps=None,
                 emb_drop:float=0., y_range=None, use_bn:bool=True, bn_final:bool=False):
        super().__init__()
        ps = ifnone(ps, [0]*len(layers))
        ps = listify(ps, layers)
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni,nf in emb_szs])  # type: torch.nn.modules.container.ModuleList
        self.emb_drop = nn.Dropout(emb_drop)  # type: torch.nn.modules.dropout.Dropout
        self.bn_cont = nn.BatchNorm1d(n_cont)  # type: torch.nn.modules.batchnorm.BatchNorm1d
        n_emb = sum(e.embedding_dim for e in self.embeds)  # n_emb = 17, type: int
        self.n_emb, self.n_cont, self.y_range = n_emb, n_cont, y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]  # type: list, len: len(layers) + 2
        # type: list, len: len(layers) + 1. The last entry is None because the model finishes with a linear layer
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]
        layers = []
        for i,(n_in,n_out,dp,act) in enumerate(zip(sizes[:-1], sizes[1:], [0.]+ps, actns)):
            layers += bn_drop_lin(n_in, n_out, bn=use_bn and i!=0, p=dp, actn=act)
        if bn_final: layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*layers)  # type: torch.nn.modules.container.Sequential

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            # look up each categorical column in its own embedding table
            x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
            x = torch.cat(x, 1)  # concatenate on dim 1; remember that dim 0 is the batch size
            x = self.emb_drop(x)  # pass it through a dropout layer
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)  # batchnorm1d
            # combine the categorical and continuous variables on dim 1
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None:
            # squash the output into y_range
            x = (self.y_range[1]-self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x.squeeze()
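The core of the forward pass is easier to see in isolation. The sketch below, with made-up sizes and a batch of two rows, looks up one embedding per categorical column and concatenates the result with the continuous features, which is the tensor the linear layers then receive.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# (number of categories, embedding width) per categorical column; illustrative sizes.
emb_szs = [(24, 12), (2, 1), (7, 4)]
embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])

# Two rows: integer codes for the 3 categorical columns, 6 continuous values each.
x_cat = torch.tensor([[13, 1, 2], [7, 0, 5]])
x_cont = torch.randn(2, 6)

# One lookup per column, then concatenate along the feature dimension.
embedded = [emb(x_cat[:, i]) for i, emb in enumerate(embeds)]
x = torch.cat(embedded + [x_cont], dim=1)

print(x.shape)  # rows x (12 + 1 + 4 + 6) features
```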

Set y_range for the predictions (optional), then instantiate the model.

y_range = (0, y_train.max()*1.2)
model = TabularModel(emb_szs=emb_szs, n_cont=len(cont_cols), out_sz=1, layers=[1000, 500, 250], ps=[0.001, 0.01, 0.01], emb_drop=0.04, y_range=y_range).to(device)
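The effect of y_range can be checked with plain arithmetic: a sigmoid squashes any raw output into (0, 1), and the result is then stretched to the interval (low, high). The function name and the bounds below are arbitrary choices for illustration.

```python
import math

def scale_to_range(raw, low, high):
    """Map an unbounded raw output into the open interval (low, high) via a sigmoid."""
    sigmoid = 1.0 / (1.0 + math.exp(-raw))
    return (high - low) * sigmoid + low

low, high = 0.0, 6.0  # example bounds
for raw in (-10.0, 0.0, 10.0):
    print(raw, round(scale_to_range(raw, low, high), 4))
```

Because the target here is the log of the fare, the upper bound in the article's code is 1.2 times the largest log fare in the training set.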

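The model looks like this:

Define an optimizer. I chose Adam with a learning rate of 1e-2. The learning rate is the first hyperparameter that should be tuned. There are also different strategies for varying the learning rate (fit one cycle, cosine annealing, etc.). Here, a constant learning rate is used.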

from collections import defaultdict

opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # can add: weight_decay=

# per-epoch history of learning rates, training losses and validation losses
lr = defaultdict(list)
tloss = defaultdict(list)
vloss = defaultdict(list)
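For comparison with the constant rate used here, this is roughly how a cosine schedule could be attached to the same kind of optimizer. It is a sketch on a placeholder linear model; the scheduler choice, T_max and the number of epochs are illustrative assumptions, not settings from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
# Anneal the learning rate from 1e-2 towards 0 over 5 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=5)

x, y = torch.randn(8, 4), torch.randn(8, 1)
lrs = []
for epoch in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    lrs.append(opt.param_groups[0]["lr"])  # rate used in this epoch
    scheduler.step()                        # update the rate for the next epoch

print([round(v, 5) for v in lrs])
```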

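Training and evaluation

Train the model. Try to trace and understand each step; the set_trace() command is very helpful for this. The evaluation metric is RMSE.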

def inv_y(y): return np.exp(y)

def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(inv_y(y_pred), inv_y(targ)))  #.detach().numpy()

# This second definition overrides the one above, so the RMSE reported below is computed on the log fares.
def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(y_pred, targ))  #.detach().numpy()

def fit(model, train_dl, val_dl, loss_fn, opt, epochs=3):
    num_batch = len(train_dl)
    for epoch in tnrange(epochs):
        model.train()  # enable dropout and batchnorm updates for training
        y_true_train = list()
        y_pred_train = list()
        total_loss_train = 0

        t = tqdm_notebook(iter(train_dl), leave=False, total=num_batch)
        for cat, cont, y in t:
            # move the batch to the same device as the model
            cat = cat.to(device)
            cont = cont.to(device)
            y = y.to(device)

            t.set_description(f'Epoch {epoch}')

            opt.zero_grad()  # reset the gradients left over from the previous step
            pred = model(cat, cont)
            loss = loss_fn(pred, y)

            loss.backward()  # do backprop
            lr[epoch].append(opt.param_groups[0]['lr'])
            tloss[epoch].append(loss.item())
            opt.step()
            #scheduler.step()

            t.set_postfix(loss=loss.item())

            y_true_train += list(y.cpu().data.numpy())
            y_pred_train += list(pred.cpu().data.numpy())
            total_loss_train += loss.item()

        train_acc = rmse(y_true_train, y_pred_train)
        # mean loss per batch. In the author's run len(train_dl) = 704:
        # number of train examples (89991) / batch size (128), rounded up
        train_loss = total_loss_train/len(train_dl)

        if val_dl:
            model.eval()  # use running batchnorm statistics and disable dropout for validation
            y_true_val = list()
            y_pred_val = list()
            total_loss_val = 0
            with torch.no_grad():  # no gradients are needed for validation
                for cat, cont, y in tqdm_notebook(val_dl, leave=False):
                    cat = cat.to(device)
                    cont = cont.to(device)
                    y = y.to(device)
                    pred = model(cat, cont)
                    loss = loss_fn(pred, y)

                    y_true_val += list(y.cpu().data.numpy())
                    y_pred_val += list(pred.cpu().data.numpy())
                    total_loss_val += loss.item()
                    vloss[epoch].append(loss.item())
            valacc = rmse(y_true_val, y_pred_val)
            valloss = total_loss_val/len(val_dl)
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f} | val_loss: {valloss:.4f} val_rmse: {valacc:.4f}')
        else:
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f}')

    return lr, tloss, vloss
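Since the model is trained on log fares, an RMSE computed directly on its outputs is in log units; an RMSE in fare units requires inverting the log first. The sketch below, on made-up numbers and with a plain-Python rmse helper, shows both figures side by side.

```python
import math

def rmse(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))

# Made-up fares and predictions, both stored as logs like the training target.
true_fares = [5.0, 10.0, 40.0]
log_true = [math.log(f) for f in true_fares]
log_pred = [math.log(6.0), math.log(9.0), math.log(30.0)]

rmse_log = rmse(log_true, log_pred)  # error in log units
rmse_fare = rmse([math.exp(v) for v in log_true],
                 [math.exp(v) for v in log_pred])  # error in fare units

print(round(rmse_log, 4), round(rmse_fare, 4))
```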

Pass the inputs to the fit function. In this case, the loss function is MSELoss.

lr, tloss, vloss = fit(model=model, train_dl=traindl, val_dl=valdl, loss_fn=nn.MSELoss(), opt=opt,  epochs=10)

Plot training vs. validation loss

# mean loss per epoch
t = [np.mean(tloss[el]) for el in tloss]
v = [np.mean(vloss[el]) for el in vloss]

plt.plot(t, label='Training loss')
plt.plot(v, label='Validation loss')
plt.title("Train VS Validation Loss over Epochs")
plt.xlabel("Epochs")
plt.legend(frameon=False)

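Finishing the training part

After experimenting with the model and tuning the hyperparameters, you will reach a point where you are satisfied with it. Only then should you move on to the next step: testing the model on the test set.

Test set

Remember: the test set must go through the same process as the training set. The next steps prepare it for evaluation.

Split it into categorical and continuous columns and turn them into tensors.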

df_test_cats = df_test[['Hour', 'AMorPM', 'Weekday']]
test_cats = df_test_cats.astype(np.int64)
test_cats = torch.tensor(test_cats.values).to(device)

df_test_conts = df_test[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]
test_conts = df_test_conts.astype(np.float32)
test_conts = torch.tensor(test_conts.values).to(device)

Make predictions

model.eval()  # switch dropout and batchnorm to inference mode
with torch.no_grad():
    output = model(test_cats, test_conts)  # already on the model's device
output

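Note that the predictions are now a tensor. If you want to turn them into a pandas DataFrame, follow the steps in the repo. You can then export them to a CSV file.

If you are taking part in the Kaggle competition, upload the file to Kaggle to see your score.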
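One possible way to turn the output tensor into a submission file is sketched below. The tensor and keys are dummy values, and the column names key and fare_amount are assumed to match the competition's sample submission, so check them against the actual file. Since the model predicts log fares, the sketch applies exp before writing.

```python
import pandas as pd
import torch

# Dummy stand-ins for the model output (log fares) and the test keys.
output = torch.tensor([2.1, 2.8, 3.4])
keys = ["ride_a", "ride_b", "ride_c"]

# Undo the log transform, move to CPU, and convert to a plain numpy array.
fares = torch.exp(output).cpu().numpy()

submission = pd.DataFrame({"key": keys, "fare_amount": fares})
submission.to_csv("submission.csv", index=False)
print(submission)
```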

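Conclusion

To sum up, you have learned how to build a PyTorch model for tabular data from scratch. Dig into the complete code and try to understand every line.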
