Deep Learning: Tabular Data with PyTorch
- March 4, 2020
- Notes

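Author | Offir Inbar
Source | Medium
Editor | 代碼醫生團隊
This article is a detailed end-to-end guide to using PyTorch for tabular data, built around a practical example. By the end of it, you will be able to build your own PyTorch model.
- Using Python's set_trace() lets you inspect every step in full.
- The complete code is available here: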
https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
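Dataset
I chose to work on the New York City Taxi Fare Prediction competition from Kaggle, where the goal is to predict the taxi fare of a ride. Note that this is a regression problem. More details and the full dataset are available here: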
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
The training data contains more than 2 million samples (5.31 GB). To keep training time down, I randomly sampled 100k training samples.
import pandas
import random

# adjust this to the location of train.csv on your machine
filename = r"C:UsersUserDesktopoffirstudycomputer learningKaggle_compNYC Taxitrain.csv"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 100000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")
GPU
I wrote the code in the free Google Colab.
To use the GPU, go to: Runtime -> Change runtime settings -> Hardware accelerator -> GPU.
Code
Import the relevant libraries
## import libraries
# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
from torchvision import models
from torch.nn import init
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils import data
from torch.optim import lr_scheduler
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
# other
from IPython.core.debugger import set_trace
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
import math
from google.colab import files
import io
import datetime as dt
import re
import pandas_profiling
import pandas_profiling as pp
from math import sqrt
# graphs
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import pylab as plt
import matplotlib.dates as dates
import pylab
import matplotlib
import matplotlib.dates
from IPython.display import display
import plotly.graph_objects as go
%matplotlib inline
# load tqdm
#!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
from tqdm import tqdm, tqdm_notebook, tnrange

# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
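After running the following command, you will be asked to upload the CSV file from your computer. Check that the file you upload is named sub_train.csv.
# upload df_train csv file
uploaded = files.upload()
df_train = pd.read_csv(io.BytesIO(uploaded['sub_train.csv']))
df_train.head()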

Upload the test set
# upload df_test csv file
uploaded = files.upload()
df_test = pd.read_csv(io.BytesIO(uploaded['test.csv']))
df_test.head()

Data preprocessing
The next step is to remove all fares below 0 (they make no sense).
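A minimal sketch of this filtering step, run on a small made-up dataframe. Only the `fare_amount` column name comes from the dataset; the sample values and the exact condition used in the original notebook are assumptions.

```python
import pandas as pd

# Toy stand-in for df_train; the values are invented for illustration.
df_train = pd.DataFrame({
    "fare_amount": [7.5, -2.9, 12.0, 0.0, -52.0],
    "passenger_count": [1, 2, 1, 3, 1],
})

# Keep only rows whose fare is not negative, then rebuild a clean 0..n-1 index.
df_train = df_train[df_train["fare_amount"] >= 0].reset_index(drop=True)

print(len(df_train))  # 3 rows survive
```

Note that a fare of exactly 0 passes this filter but has no finite logarithm, which matters if the target is log-transformed later on.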

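df_train now has 99,990 rows. It is important to keep track of the types and lengths of the different datasets at every step.
Stack the train and test sets so that they go through the same preprocessing.
The goal is to predict the fare amount, so it is dropped from the train_X dataframe.
In addition, I chose to predict the log of the price during training.
train_X = df_train.drop(columns=['fare_amount'])
Y = np.log(df_train.fare_amount)
test_X = df_test
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df = pd.concat([train_X, test_X], sort=False)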
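Feature engineering
Let's do some feature engineering.
Define a haversine distance function and add DateTime columns to derive useful statistics. The full process can be seen in the GitHub repo: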
https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
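As a rough illustration of these two features, here is a sketch of a haversine distance and a few time-derived fields. The function names, the Earth radius constant, and the way `AMorPM` and `Weekday` are encoded are assumptions made for this example; the notebook linked above holds the author's actual version.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def time_features(ts):
    """Derive Hour, AMorPM and Weekday (0 = Monday) from a datetime."""
    return {"Hour": ts.hour, "AMorPM": "am" if ts.hour < 12 else "pm", "Weekday": ts.weekday()}

# One degree of longitude along the equator is about 111.19 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
print(time_features(datetime(2020, 3, 4, 15, 30)))
```

Applied to the pickup and dropoff coordinates of each ride, the distance function would yield a column like `dist_km`.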
After this stage, the dataframe looks like this:

Preparing the model
Define the categorical and continuous columns, and keep only the relevant columns.
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
# keep only the cols for the model
df = df[['Hour', 'AMorPM', 'Weekday', 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]
Label-encode the categorical columns and set their dtype to 'category'.
for col in df.columns:
    if col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])  # map each distinct value to an integer code
        df[col] = df[col].astype('category')
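To make the label-encoding step concrete, the sketch below imitates the idea in plain Python: each distinct value gets an integer code, with codes assigned in sorted order. The helper `label_encode` is hypothetical and written only for this illustration; it is not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to an integer code, with codes following sorted order."""
    classes = sorted(set(values))
    lookup = {value: code for code, value in enumerate(classes)}
    return [lookup[value] for value in values], classes

codes, classes = label_encode(["pm", "am", "pm", "am", "am"])
print(classes)  # ['am', 'pm']
print(codes)    # [1, 0, 1, 0, 0]
```

The integer codes are what the embedding layers later use as row indices, which is why they need to start at 0 and be contiguous.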
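Define the embedding sizes for the categorical columns. A rule of thumb for the embedding size is to divide the number of unique entries in each column by 2, without exceeding 50.
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
emb_szs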
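A short worked example of this rule of thumb. The cardinalities below (24 hours, 2 values for AM/PM, 7 weekdays) are assumed for illustration, and `embedding_sizes` is a hypothetical helper that wraps the same expression as the code above.

```python
def embedding_sizes(cardinalities, max_dim=50):
    """Pair each cardinality with half of it (rounded up), capped at max_dim."""
    return [(n, min(max_dim, (n + 1) // 2)) for n in cardinalities]

emb_szs = embedding_sizes([24, 2, 7])
print(emb_szs)                         # [(24, 12), (2, 1), (7, 4)]
print(sum(dim for _, dim in emb_szs))  # 17 embedding outputs in total
print(embedding_sizes([200]))          # [(200, 50)]: the cap applies
```

Under these assumed cardinalities the embeddings contribute 17 inputs to the network, which agrees with the `n_emb = 17` comment in the model code.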

Now, handle the continuous variables. Before normalizing them, it is important to split the data back into train and test sets to prevent data leakage.
df_train = df[:99990].copy()  # copy so that the in-place normalization does not act on a slice of df
df_test = df[99990:].copy()

# Normalizing
from pandas.api.types import is_numeric_dtype

def Normalize(df):
    "Compute the means and stds of the continuous columns and normalize them in place."
    means, stds = {}, {}
    cont_names = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']
    for n in cont_names:
        assert is_numeric_dtype(df[n]), (f"""Cannot normalize '{n}' column as it isn't numerical. Are you sure it doesn't belong in the categorical set of columns?""")
        means[n], stds[n] = df[n].mean(), df[n].std()
        df[n] = (df[n] - means[n]) / (1e-7 + stds[n])  # the small constant avoids division by zero

Normalize(df_train)
Normalize(df_test)
X = df_train
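The code above standardizes the train and test frames separately, each with its own mean and standard deviation. A common alternative, sketched here as an option rather than as the author's method, is to compute the statistics on the training set only and reuse them for the test set. The helper names and the toy data are made up for this example.

```python
import pandas as pd

def fit_normalizer(df, cont_names):
    """Return {column: (mean, std)} computed from df only."""
    return {n: (df[n].mean(), df[n].std()) for n in cont_names}

def apply_normalizer(df, stats, eps=1e-7):
    """Return a normalized copy of df using previously computed statistics."""
    out = df.copy()
    for n, (mean, std) in stats.items():
        out[n] = (out[n] - mean) / (eps + std)
    return out

train = pd.DataFrame({"dist_km": [1.0, 2.0, 3.0]})  # invented values
test = pd.DataFrame({"dist_km": [2.0, 5.0]})        # invented values

stats = fit_normalizer(train, ["dist_km"])  # mean 2.0, sample std 1.0
train_n = apply_normalizer(train, stats)
test_n = apply_normalizer(test, stats)      # scaled with the training statistics
print(test_n["dist_km"].round(3).tolist())
```

With this variant the test set is transformed exactly as the training set was, so a given raw distance maps to the same normalized value in both.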
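Train-validation split
Split the data into train and validation sets. In this case, the validation set is 20% of the total training set.
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=42, shuffle=True)
After this step, it is important to look at the shapes of the resulting sets.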
Model
At this point the data is stored in pandas dataframes, while PyTorch works with tensors. The following steps convert the data to the right types. Keep track of the data type at every step; I added comments with the current types.
class RegressionColumnarDataset(data.Dataset):
    def __init__(self, df, cats, y):
        self.dfcats = df[cats]  # type: pandas.core.frame.DataFrame
        self.dfconts = df.drop(cats, axis=1)  # type: pandas.core.frame.DataFrame
        self.cats = np.stack([c.values for n, c in self.dfcats.items()], axis=1).astype(np.int64)  # type: numpy.ndarray
        self.conts = np.stack([c.values for n, c in self.dfconts.items()], axis=1).astype(np.float32)  # type: numpy.ndarray
        self.y = y.values.astype(np.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]

trainds = RegressionColumnarDataset(X_train, cat_cols, y_train)  # type: __main__.RegressionColumnarDataset
valds = RegressionColumnarDataset(X_val, cat_cols, y_val)  # type: __main__.RegressionColumnarDataset
Now it is time to use the PyTorch DataLoader. I chose a batch size of 128; feel free to experiment with it.
params = {'batch_size': 128, 'shuffle': True}
traindl = DataLoader(trainds, **params)  # type: torch.utils.data.dataloader.DataLoader
valdl = DataLoader(valds, **params)  # type: torch.utils.data.dataloader.DataLoader
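As a quick sanity check on what a loader will yield, the number of batches per epoch can be worked out by hand. The sketch below assumes 79,992 training rows (80% of the 99,990 rows kept earlier) and a loader that keeps the final, smaller batch; `num_batches` is a hypothetical helper.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Batches per epoch: the last partial batch counts unless drop_last is set."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

n_train = 79_992  # assumed: 80% of 99,990 rows
print(num_batches(n_train, 128))                  # 625
print(n_train % 128)                              # 120 samples in the last batch
print(num_batches(n_train, 128, drop_last=True))  # 624
```

Changing the batch size changes the number of optimizer steps per epoch, which is one reason batch size and learning rate are usually tuned together.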
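Define a TabularModel
The goal is to define a model based on the number of continuous columns plus the number of categorical columns and their embeddings. Since this is a regression task, the output is a single float value.
- ps: the dropout probability for each layer
- emb_drop: the dropout applied to the embeddings
- emb_szs: a list of tuples, each pairing a categorical variable's size with an embedding size
- n_cont: the number of continuous variables
- out_sz: the output size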
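To make these arguments concrete, the sketch below computes the width of each linear layer in a model of this kind: the embedding outputs and the continuous columns form the input, followed by the hidden sizes and the output size. The embedding sizes and hidden widths are assumed values for illustration, and `layer_sizes` is a hypothetical helper.

```python
def layer_sizes(emb_szs, n_cont, layers, out_sz):
    """Input width (embeddings + continuous columns), hidden widths, then output width."""
    n_emb = sum(dim for _, dim in emb_szs)
    return [n_emb + n_cont] + list(layers) + [out_sz]

sizes = layer_sizes([(24, 12), (2, 1), (7, 4)], n_cont=6, layers=[1000, 500, 250], out_sz=1)
print(sizes)  # [23, 1000, 500, 250, 1]

# Consecutive pairs are the (in_features, out_features) of each linear layer.
print(list(zip(sizes[:-1], sizes[1:])))
```

With three hidden layers there are four linear layers in total; every one except the last is followed by an activation.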
# helper functions
from collections.abc import Iterable

def bn_drop_lin(n_in: int, n_out: int, bn: bool = True, p: float = 0., actn=None):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers

def ifnone(a, b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a

def listify(p, q):
    "Make `p` listy and the same length as `q`."
    if p is None: p = []
    elif isinstance(p, str): p = [p]
    elif not isinstance(p, Iterable): p = [p]
    # Rank 0 tensors in PyTorch are Iterable but don't have a length.
    else:
        try: a = len(p)
        except: p = [p]
    n = q if type(q) == int else len(p) if q is None else len(q)
    if len(p) == 1: p = p * n
    assert len(p) == n, f'List len mismatch ({len(p)} vs {n})'
    return list(p)

class TabularModel(nn.Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont: int, out_sz: int, layers, ps=None,
                 emb_drop: float = 0., y_range=None, use_bn: bool = True, bn_final: bool = False):
        super().__init__()
        ps = ifnone(ps, [0] * len(layers))
        ps = listify(ps, layers)
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])  # type: torch.nn.modules.container.ModuleList
        self.emb_drop = nn.Dropout(emb_drop)  # type: torch.nn.modules.dropout.Dropout
        self.bn_cont = nn.BatchNorm1d(n_cont)  # type: torch.nn.modules.batchnorm.BatchNorm1d
        n_emb = sum(e.embedding_dim for e in self.embeds)  # n_emb = 17, type: int
        self.n_emb, self.n_cont, self.y_range = n_emb, n_cont, y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]  # type: list, len: len(layers) + 2
        # one ReLU per hidden layer; the last entry is None because we finish with a linear layer
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes) - 2)] + [None]  # type: list
        layers = []
        for i, (n_in, n_out, dp, act) in enumerate(zip(sizes[:-1], sizes[1:], [0.] + ps, actns)):
            layers += bn_drop_lin(n_in, n_out, bn=use_bn and i != 0, p=dp, actn=act)
        if bn_final: layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*layers)  # type: torch.nn.modules.container.Sequential

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            # look up column i of the categorical batch in its own embedding table
            x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
            x = torch.cat(x, 1)  # concatenate on dim 1 (dim 0 is the batch)
            x = self.emb_drop(x)  # pass it through a dropout layer
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)  # batchnorm1d
            # combine the categorical and continuous variables on dim 1
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
        x = self.layers(x)
        if self.y_range is not None:
            # squash the output into (y_range[0], y_range[1])
            x = (self.y_range[1] - self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x.squeeze(-1)  # drop only the trailing output dim, so a batch of one stays 1-D
Set y_range for the predictions (optional), then instantiate the model.
y_range = (0, y_train.max() * 1.2)
model = TabularModel(emb_szs=emb_szs, n_cont=len(cont_cols), out_sz=1, layers=[1000, 500, 250],
                     ps=[0.001, 0.01, 0.01], emb_drop=0.04, y_range=y_range).to(device)
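The effect of `y_range` can be seen in isolation: the raw network output is passed through a sigmoid and rescaled, so every prediction falls strictly inside the range. This plain-Python sketch mirrors that arithmetic; the range `(0.0, 6.0)` is a hypothetical stand-in for `(0, y_train.max() * 1.2)`, which is in log-fare units because the target is the log of the fare.

```python
import math

def scale_to_range(x, y_range):
    """Squash a raw output x into the open interval (lo, hi) with a sigmoid."""
    lo, hi = y_range
    return (hi - lo) / (1.0 + math.exp(-x)) + lo

y_range = (0.0, 6.0)  # hypothetical bounds, in log-fare units
for raw in (-10.0, 0.0, 10.0):
    print(raw, scale_to_range(raw, y_range))
```

A raw output of 0 maps to the middle of the range, and large positive or negative outputs approach the bounds without reaching them, which is why the upper bound is set somewhat above the largest target.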
The model looks like this:

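Define an optimizer. I chose Adam with a learning rate of 1e-2. The learning rate is the first hyperparameter you should tune. There are also different strategies for scheduling the learning rate (fit one cycle, cosine, etc.). Here, I use a constant learning rate.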
from collections import defaultdict

opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # can add: weight_decay=
lr = defaultdict(list)
tloss = defaultdict(list)
vloss = defaultdict(list)
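For comparison with the constant rate used here, this is roughly what a cosine schedule computes at each step. It is a hand-written sketch of the standard cosine annealing formula, not code from the article; `cosine_lr` and its default values are illustrative. PyTorch provides ready-made schedulers in `torch.optim.lr_scheduler`, which would normally be used instead.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-2, lr_min=0.0):
    """Learning rate at `step` when annealing from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

total = 10
for step in (0, 5, 10):
    print(step, cosine_lr(step, total))
```

The rate starts at `lr_max`, passes through the midpoint halfway through training, and ends at `lr_min`.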
Training and evaluation
Train the model. Try to follow and understand every step; the set_trace() command is very helpful for this. The evaluation metric is RMSE.
def inv_y(y): return np.exp(y)

def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(inv_y(y_pred), inv_y(targ)))  # RMSE in the original fare scale

# NOTE: this second definition overrides the one above, so the RMSE printed below is on the log scale
def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(y_pred, targ))

def fit(model, train_dl, val_dl, loss_fn, opt, epochs=3):
    num_batch = len(train_dl)
    for epoch in tnrange(epochs):
        y_true_train = list()
        y_pred_train = list()
        total_loss_train = 0
        model.train()  # enable dropout and batchnorm updates
        t = tqdm_notebook(iter(train_dl), leave=False, total=num_batch)
        for cat, cont, y in t:
            cat = cat.to(device)
            cont = cont.to(device)
            y = y.to(device)
            t.set_description(f'Epoch {epoch}')
            opt.zero_grad()  # reset the gradients left over from the previous step
            pred = model(cat, cont)
            loss = loss_fn(pred, y)
            loss.backward()  # do backprop
            lr[epoch].append(opt.param_groups[0]['lr'])
            tloss[epoch].append(loss.item())
            opt.step()
            #scheduler.step()
            t.set_postfix(loss=loss.item())
            y_true_train += list(y.cpu().data.numpy())
            y_pred_train += list(pred.cpu().data.numpy())
            total_loss_train += loss.item()
        train_acc = rmse(y_true_train, y_pred_train)
        # len(train_dl) is the number of batches: ceil(number of train examples / batch size)
        train_loss = total_loss_train / len(train_dl)
        if val_dl:
            y_true_val = list()
            y_pred_val = list()
            total_loss_val = 0
            model.eval()  # disable dropout and use running batchnorm statistics
            with torch.no_grad():  # no gradients are needed for validation
                for cat, cont, y in tqdm_notebook(val_dl, leave=False):
                    cat = cat.to(device)
                    cont = cont.to(device)
                    y = y.to(device)
                    pred = model(cat, cont)
                    loss = loss_fn(pred, y)
                    y_true_val += list(y.cpu().data.numpy())
                    y_pred_val += list(pred.cpu().data.numpy())
                    total_loss_val += loss.item()
                    vloss[epoch].append(loss.item())
            valacc = rmse(y_true_val, y_pred_val)
            valloss = total_loss_val / len(val_dl)
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f} | val_loss: {valloss:.4f} val_rmse: {valacc:.4f}')
        else:
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f}')
    return lr, tloss, vloss
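Because the model is trained on log fares, RMSE can be measured either on the log scale or, after exponentiating, in the original fare units, and the two numbers are not comparable. The sketch below shows the difference on a few invented values; the plain-Python `rmse` here is a stand-in written for this example.

```python
import math

def rmse(targets, preds):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))

true_fares = [5.0, 10.0, 40.0]  # invented fares
pred_fares = [6.0, 9.0, 30.0]   # invented predictions

log_targets = [math.log(v) for v in true_fares]
log_preds = [math.log(v) for v in pred_fares]

rmse_log = rmse(log_targets, log_preds)  # what a model trained on log fares sees
rmse_fare = rmse([math.exp(t) for t in log_targets], [math.exp(p) for p in log_preds])  # fare units

print(round(rmse_log, 4), round(rmse_fare, 4))
```

When comparing against a leaderboard that scores fares directly, the second figure is the relevant one.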
Pass the inputs to the fit function. In this case, the loss function is MSELoss.
lr, tloss, vloss = fit(model=model, train_dl=traindl, val_dl=valdl, loss_fn=nn.MSELoss(), opt=opt, epochs=10)
Plot the training vs. validation loss
t = [np.mean(tloss[el]) for el in tloss]  # mean training loss per epoch
v = [np.mean(vloss[el]) for el in vloss]  # mean validation loss per epoch
plt.plot(t, label='Training loss')
plt.plot(v, label='Validation loss')
plt.title("Train VS Validation Loss over Epochs")
plt.xlabel("Epochs")
plt.legend(frameon=False)

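Finishing the training part
After playing with the model and tuning the hyperparameters, you will reach a point where you are satisfied with it. Only then should you move on to the next step: testing the model on the test set.
Test set
Remember: the test set must go through the same process as the training set. The next steps prepare it for evaluation.
Split it into categorical and continuous columns and convert them to tensors.
df_test_cats = df_test[['Hour', 'AMorPM', 'Weekday']]
test_cats = df_test_cats.astype(np.int64)
test_cats = torch.tensor(test_cats.values).to(device)

df_test_conts = df_test[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]
test_conts = df_test_conts.astype(np.float32)
test_conts = torch.tensor(test_conts.values).to(device)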

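Make predictions
model.eval()  # switch off dropout and use running batchnorm statistics
with torch.no_grad():
    output = model(test_cats, test_conts)
output
Note that the predictions are now a tensor. If you want to convert them to a pandas dataframe, go through the steps in the repo. After that, you can export them to a CSV file.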
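One possible way to do that conversion is sketched below; it is not the repo's code. It assumes the predictions are log fares that must be exponentiated, and that the submission file needs a `key` column next to `fare_amount`; both the ids and the prediction values are invented, with a NumPy array standing in for the model output moved to the CPU.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the model output converted to NumPy; values are invented log fares.
log_preds = np.array([2.0, 2.5, 3.0], dtype=np.float32)
keys = ["ride_a", "ride_b", "ride_c"]  # hypothetical row ids

# Undo the log transform so the fares are back in their original units.
submission = pd.DataFrame({"key": keys, "fare_amount": np.exp(log_preds)})

# Write to an in-memory buffer here; pass a file path instead to create a CSV file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Forgetting the `np.exp` step would submit log fares rather than fares.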
If you are taking part in the Kaggle competition, upload the file to Kaggle to see your score.

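Conclusion
To sum up, you have learned how to build a PyTorch model for tabular data from scratch. Work through the complete code and try to understand every line.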