Deep Learning: Tabular Data with PyTorch
- March 4, 2020
- Notes

Author | Offir Inbar
Source | Medium
Editor | Code Doctor Team
This post provides a detailed, end-to-end guide to using PyTorch for tabular data, built around a practical example. By the end of it, you will be able to build your own PyTorch model.
- Use Python's set_trace() to get a full picture of every step.
- The full code can be found here:
https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
Dataset
I chose to work on Kaggle's New York City Taxi Fare Prediction competition, where the goal is to predict the fare of a taxi ride. Note that this is a regression problem. More details and the full dataset can be found here:
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
The training data contains over 2 million samples (5.31 GB). To minimize training time, a random sample of 100k training rows was drawn.
```python
import pandas
import random

filename = r"C:\Users\User\Desktop\offir\study\computer learning\Kaggle_comp\NYC Taxi\train.csv"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 100000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")
```
GPU
The code was written in the free Google Colab.
To use a GPU: Runtime -> Change runtime type -> Hardware accelerator -> GPU.
The code
Import the relevant libraries:
```python
## import libraries

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.utils import data
from torch.optim import lr_scheduler
from torch.autograd import Variable
from torch.nn import init
from torchvision import models

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

# other
from IPython.core.debugger import set_trace
from IPython.display import display
from google.colab import files
import pandas as pd
import numpy as np
import pandas_profiling as pp
from collections import Counter
from datetime import datetime
import datetime as dt
import io
import re
import math
from math import sqrt

# graphs
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import seaborn as sns
import pylab
import plotly.graph_objects as go
%matplotlib inline

# load tqdm
#!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
from tqdm import tqdm, tqdm_notebook, tnrange

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device
```
After running the following cell, upload the CSV file from your computer. Make sure the CSV file you upload is named sub_train.csv.
```python
# upload df_train csv file
uploaded = files.upload()
df_train = pd.read_csv(io.BytesIO(uploaded['sub_train.csv']))
df_train.head()
```

Upload the test set:
```python
# upload df_test csv file
uploaded = files.upload()
df_test = pd.read_csv(io.BytesIO(uploaded['test.csv']))
df_test.head()
```

Data preprocessing
The next step is to remove all fares that are less than 0 (they make no sense).
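The original article shows this step as a screenshot; a minimal sketch of the filter, using the fare_amount column that appears later in the code, looks like this:

```python
# drop rows with a negative fare; they make no sense for this task
df_train = df_train[df_train['fare_amount'] >= 0]
len(df_train)
```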

The length of df_train is now 99,990. It is important to keep track of the types and lengths of the different datasets at every step.
Stack the train and test sets so that they go through the same preprocessing.
The goal is to predict the fare, so it has been removed from the train_X dataframe.
In addition, I chose to predict the log of the price at training time.
```python
train_X = df_train.drop(columns=['fare_amount'])
Y = np.log(df_train.fare_amount)
test_X = df_test
df = train_X.append(test_X, sort=False)
```
Feature engineering
Time for some feature engineering.
Define a haversine_distance function and add DateTime columns to derive useful statistics. The full process can be seen in the GitHub repo; a sketch of the distance function follows below.
https://github.com/offirinbar/NYC_Taxi/blob/master/NYC_Taxi_PyTorch.ipynb
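For reference, a typical vectorized haversine implementation along these lines (a sketch under standard assumptions, not necessarily identical to the repo's version):

```python
import numpy as np

def haversine_distance(df, lat1, long1, lat2, long2):
    "Great-circle distance in km between two (lat, long) column pairs of df."
    r = 6371  # average radius of the Earth in km
    phi1, phi2 = np.radians(df[lat1]), np.radians(df[lat2])
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[long2] - df[long1])
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# the dist_km feature used below would then be:
# df['dist_km'] = haversine_distance(df, 'pickup_latitude', 'pickup_longitude',
#                                    'dropoff_latitude', 'dropoff_longitude')
```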
After this stage, the dataframe looks like this:

Preparing the model
Define the categorical and continuous columns, and keep only the relevant columns.
```python
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
             'dropoff_longitude', 'passenger_count', 'dist_km']

# keep only the cols for the model
df = df[['Hour', 'AMorPM', 'Weekday', 'pickup_latitude', 'pickup_longitude',
         'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'dist_km']]
```
Set the categorical columns to the 'category' dtype and label-encode them.
```python
for col in df.columns:
    if col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])
        df[col] = df[col].astype('category')
```
Define the embedding sizes for the categorical columns. The rule of thumb for determining an embedding size is to divide the number of unique entries in each column by 2, without exceeding 50.
```python
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
emb_szs
```

Now, handle the continuous variables. Before normalizing them, it is important to split the data between the train and test sets to prevent data leakage.
```python
df_train = df[:99990]
df_test = df[99990:]

# Normalizing
from pandas.api.types import is_numeric_dtype

# "Compute the means and stds of `self.cont_names` columns to normalize them."
# Note that, as written, train and test are each normalized with their own statistics.
def Normalize(df):
    means, stds = {}, {}
    cont_names = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
                  'dropoff_longitude', 'passenger_count', 'dist_km']
    for n in cont_names:
        assert is_numeric_dtype(df[n]), (f"""Cannot normalize '{n}' column as it isn't numerical. Are you sure it doesn't belong in the categorical set of columns?""")
        means[n], stds[n] = df[n].mean(), df[n].std()
        df[n] = (df[n] - means[n]) / (1e-7 + stds[n])

Normalize(df_train)
Normalize(df_test)
X = df_train
```
Train-validation split
Split between the training and validation sets. In this case, the validation set is 20% of the total training set.
```python
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20,
                                                  random_state=42, shuffle=True)
```
After completing this step, it is important to look at the different shapes, as sketched below.
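A quick sanity check along these lines (a sketch; the expected shapes assume the nine model columns kept above):

```python
# an 80/20 split of the 99,990 training rows
print(X_train.shape, y_train.shape)  # expected: (79992, 9) (79992,)
print(X_val.shape, y_val.shape)      # expected: (19998, 9) (19998,)
```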
The model
At the moment, the data is stored in pandas dataframes, while PyTorch works with tensors. The following steps convert the data into the right types. Keep track of the data types at each step; comments with the current types have been added.
```python
class RegressionColumnarDataset(data.Dataset):
    def __init__(self, df, cats, y):
        self.dfcats = df[cats]                # type: pandas.core.frame.DataFrame
        self.dfconts = df.drop(cats, axis=1)  # type: pandas.core.frame.DataFrame
        self.cats = np.stack([c.values for n, c in self.dfcats.items()], axis=1).astype(np.int64)      # type: numpy.ndarray
        self.conts = np.stack([c.values for n, c in self.dfconts.items()], axis=1).astype(np.float32)  # type: numpy.ndarray
        self.y = y.values.astype(np.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return [self.cats[idx], self.conts[idx], self.y[idx]]
```
```python
trainds = RegressionColumnarDataset(X_train, cat_cols, y_train)  # type: __main__.RegressionColumnarDataset
valds = RegressionColumnarDataset(X_val, cat_cols, y_val)        # type: __main__.RegressionColumnarDataset
```
Now it is time to use the PyTorch DataLoader. I chose a batch size of 128; feel free to play with it.
```python
params = {'batch_size': 128, 'shuffle': True}

traindl = DataLoader(trainds, **params)  # type: torch.utils.data.dataloader.DataLoader
valdl = DataLoader(valds, **params)      # type: torch.utils.data.dataloader.DataLoader
```
Define a TabularModel
The aim is to define a model based on the number of continuous columns plus the number of categorical columns and their embeddings. Since this is a regression task, the output is a single float value.
- ps: dropout probability for each layer
- emb_drop: embedding dropout
- emb_szs: a list of tuples: each categorical variable's cardinality paired with an embedding size
- n_cont: the number of continuous variables
- out_sz: the output size
```python
# helper functions
from collections.abc import Iterable

def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn=None):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0:
        layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None:
        layers.append(actn)
    return layers

def ifnone(a, b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a

def listify(p, q):
    "Make `p` listy and the same length as `q`."
    if p is None:
        p = []
    elif isinstance(p, str):
        p = [p]
    elif not isinstance(p, Iterable):
        p = [p]
    # Rank 0 tensors in PyTorch are Iterable but don't have a length.
    else:
        try:
            a = len(p)
        except:
            p = [p]
    n = q if type(q) == int else len(p) if q is None else len(q)
    if len(p) == 1:
        p = p * n
    assert len(p) == n, f'List len mismatch ({len(p)} vs {n})'
    return list(p)

class TabularModel(nn.Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont:int, out_sz:int, layers, ps=None,
                 emb_drop:float=0., y_range=None, use_bn:bool=True, bn_final:bool=False):
        super().__init__()
        ps = ifnone(ps, [0]*len(layers))
        ps = listify(ps, layers)
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])  # type: torch.nn.modules.container.ModuleList
        self.emb_drop = nn.Dropout(emb_drop)   # type: torch.nn.modules.dropout.Dropout
        self.bn_cont = nn.BatchNorm1d(n_cont)  # type: torch.nn.modules.batchnorm.BatchNorm1d
        n_emb = sum(e.embedding_dim for e in self.embeds)  # n_emb = 17, type: int
        self.n_emb, self.n_cont, self.y_range = n_emb, n_cont, y_range
        sizes = [n_emb + n_cont] + layers + [out_sz]  # type: list, len: 4
        actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]  # type: list, len: 3; the last is None because we finish with a linear layer
        layers = []
        for i, (n_in, n_out, dp, act) in enumerate(zip(sizes[:-1], sizes[1:], [0.]+ps, actns)):
            layers += bn_drop_lin(n_in, n_out, bn=use_bn and i != 0, p=dp, actn=act)
        if bn_final:
            layers.append(nn.BatchNorm1d(sizes[-1]))
        self.layers = nn.Sequential(*layers)  # type: torch.nn.modules.container.Sequential

    def forward(self, x_cat, x_cont):
        if self.n_emb != 0:
            x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]  # grab each embedding and pass in the corresponding column
            x = torch.cat(x, 1)   # concatenate on dim 1; remember that dim 0 is the batch
            x = self.emb_drop(x)  # pass it through a dropout layer
        if self.n_cont != 0:
            x_cont = self.bn_cont(x_cont)  # batchnorm1d over the continuous inputs
            x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont  # combine the categorical and continuous variables on dim 1
        x = self.layers(x)
        if self.y_range is not None:
            x = (self.y_range[1] - self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]  # squash the output into y_range
        return x.squeeze()
```
Set y_range for the predictions (optional), and instantiate the model.
```python
y_range = (0, y_train.max() * 1.2)

model = TabularModel(emb_szs=emb_szs,
                     n_cont=len(cont_cols),
                     out_sz=1,
                     layers=[1000, 500, 250],
                     ps=[0.001, 0.01, 0.01],
                     emb_drop=0.04,
                     y_range=y_range).to(device)
```
The model looks like this:
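The printout in the original screenshot can be reproduced with:

```python
print(model)  # lists the embedding layers followed by the BatchNorm/Dropout/Linear stack
```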

Define an optimizer. I chose Adam with a learning rate of 1e-2. The learning rate is the first hyperparameter you should tune. There are also different strategies for scheduling the learning rate (one-cycle, cosine, etc.); here, a constant learning rate is used, but see the sketch after the next cell for how a schedule could be plugged in.
```python
from collections import defaultdict

opt = torch.optim.Adam(model.parameters(), lr=1e-2)  # can add: weight_decay=

lr = defaultdict(list)
tloss = defaultdict(list)
vloss = defaultdict(list)
```
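If you want to try one of the schedules mentioned above instead of a constant rate, a minimal sketch using PyTorch's built-in cosine annealing (the fit function below already contains a commented-out scheduler.step() call where it would go):

```python
# optional: cosine-annealed learning rate instead of a constant one
scheduler = lr_scheduler.CosineAnnealingLR(opt, T_max=10)  # T_max = total number of epochs
# then uncomment scheduler.step() inside the training loop of fit() below
```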
Training and evaluation
Train the model. Try to track and understand every step; using the set_trace() command is very helpful here. The evaluation metric is RMSE.
```python
def inv_y(y):
    return np.exp(y)

# the first rmse undoes the log transform before scoring; the second definition
# overrides it, so RMSE is computed on the log-scale values
def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(inv_y(y_pred), inv_y(targ)))  # .detach().numpy()

def rmse(targ, y_pred):
    return np.sqrt(mean_squared_error(y_pred, targ))  # .detach().numpy()

def fit(model, train_dl, val_dl, loss_fn, opt, epochs=3):
    num_batch = len(train_dl)
    for epoch in tnrange(epochs):
        y_true_train = list()
        y_pred_train = list()
        total_loss_train = 0

        t = tqdm_notebook(iter(train_dl), leave=False, total=num_batch)
        for cat, cont, y in t:
            cat = cat.cuda()
            cont = cont.cuda()
            y = y.cuda()

            t.set_description(f'Epoch {epoch}')

            opt.zero_grad()  # zero out the gradients
            pred = model(cat, cont)
            loss = loss_fn(pred, y)
            loss.backward()  # do backprop
            lr[epoch].append(opt.param_groups[0]['lr'])
            tloss[epoch].append(loss.item())
            opt.step()
            #scheduler.step()

            t.set_postfix(loss=loss.item())
            y_true_train += list(y.cpu().data.numpy())
            y_pred_train += list(pred.cpu().data.numpy())
            total_loss_train += loss.item()

        train_acc = rmse(y_true_train, y_pred_train)
        train_loss = total_loss_train / len(train_dl)  # average loss per batch: len(train_dl) = ceil(n_train / batch_size)

        if val_dl:
            y_true_val = list()
            y_pred_val = list()
            total_loss_val = 0
            for cat, cont, y in tqdm_notebook(val_dl, leave=False):
                cat = cat.cuda()
                cont = cont.cuda()
                y = y.cuda()
                pred = model(cat, cont)
                loss = loss_fn(pred, y)
                y_true_val += list(y.cpu().data.numpy())
                y_pred_val += list(pred.cpu().data.numpy())
                total_loss_val += loss.item()
                vloss[epoch].append(loss.item())
            valacc = rmse(y_true_val, y_pred_val)
            valloss = total_loss_val / len(val_dl)
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f} | val_loss: {valloss:.4f} val_rmse: {valacc:.4f}')
        else:
            print(f'Epoch {epoch}: train_loss: {train_loss:.4f} train_rmse: {train_acc:.4f}')

    return lr, tloss, vloss
```
Pass the inputs to the fit function. In this case, the loss function is nn.MSELoss().
```python
lr, tloss, vloss = fit(model=model, train_dl=traindl, val_dl=valdl,
                       loss_fn=nn.MSELoss(), opt=opt, epochs=10)
```
Plot the training vs. validation loss:
```python
t = [np.mean(tloss[el]) for el in tloss]
v = [np.mean(vloss[el]) for el in vloss]

plt.plot(t, label='Training loss')
plt.plot(v, label='Validation loss')
plt.title("Train VS Validation Loss over Epochs")
plt.xlabel("Epochs")
plt.legend(frameon=False)
```

Finishing the training part
Only after playing with the model and tuning its hyperparameters to your satisfaction should you move on to the next step: testing the model on the test set.
Test set
Remember: the test set must go through the same pipeline as the training set. The next steps "prepare" it for evaluation.
Split it into categorical and continuous columns, and turn them into tensors.
```python
df_test_cats = df_test[['Hour', 'AMorPM', 'Weekday']]
test_cats = df_test_cats.astype(np.int64)
test_cats = torch.tensor(test_cats.values).cuda()

df_test_conts = df_test[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
                         'dropoff_longitude', 'passenger_count', 'dist_km']]
test_conts = df_test_conts.astype(np.float32)
test_conts = torch.tensor(test_conts.values).cuda()
```

Make predictions:
```python
with torch.no_grad():
    model.eval()
    output = model.forward(test_cats, test_conts).cuda()

output
```
Note that the predictions are now a tensor. If you want to convert them to a pandas DataFrame, walk through the steps in the repository. From there, they can be exported to a CSV file.
If you are taking part in the Kaggle competition, upload the file to Kaggle to see your score; a sketch of the export follows below.
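A minimal sketch of that conversion and export, assuming the uploaded test file is still in memory so its key column can be recovered (the column names follow the competition's sample submission):

```python
# move predictions off the GPU and undo the log transform applied to the target
preds = np.exp(output.cpu().numpy())

# df_test was overwritten during preprocessing, so re-read the original test.csv for its key column
keys = pd.read_csv(io.BytesIO(uploaded['test.csv']))['key']

submission = pd.DataFrame({'key': keys, 'fare_amount': preds})
submission.to_csv('submission.csv', index=False)
files.download('submission.csv')  # download from Colab
```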

Conclusion
To sum up, you have learned how to build a PyTorch model for tabular data from scratch. You should dive into the full code and try to understand every line.