(PyStan)零售价格贝叶斯策略建模（上）

2019 年 10 月 7 日
筆記

编辑 | sunlei 发布 | ATYUN订阅号

本文中蓝色字体为外部链接，部分外部链接无法从文章中直接跳转，请点击【阅读原文】以访问。

定价是任何电子商务企业都面临的一个普遍问题，可以通过贝叶斯统计方法得到有效的解决。

Kaggle的Mercari Price建议数据集似乎是我想学习的贝叶斯模型的一个很好的候选。

如果你还记得，数据集的目的是为Mercari网站卖家建立一个模型，自动为任何给定的产品给出正确的价格。我在这里尝试看看我们是否可以用通过使用pystan的贝叶斯统计方法来解决这个问题。

下面的定价分析复制了Fonnesbeck教授对家庭氡水平的案例研究。事实上，方法和代码在很大程度上借鉴了他的教程。

数据

在此分析中，我们将评估类别中存在的单个产品价格的参数。测量价格是运输条件（买方支付运费或卖方支付运费）和总价格的函数。

最后，我们对产品价格参数的估计可以看作是一种预测。

简单地说，我们使用的自变量是：类别名称和运输。因变量是：价格。

from scipy import stats  import arviz as az  import numpy as np  import matplotlib.pyplot as plt  import pystan  import seaborn as sns  import pandas as pd  from theano import shared  from sklearn import preprocessing  plt.style.use('bmh')df = pd.read_csv('train.tsv', sep = 't')  df = df.sample(frac=0.01, random_state=99)  df = df[pd.notnull(df['category_name'])]df.category_name.nunique()

为了让事情更有趣，我将为所有这689个产品类别建模。如果你想更快地产生更好的结果，你可能先为前10或前20个类别建模。

shipping_0 = df.loc[df['shipping'] == 0, 'price']  shipping_1 = df.loc[df['shipping'] == 1, 'price']  fig, ax = plt.subplots(figsize=(10,5))  ax.hist(shipping_0, color='#8CB4E1', alpha=1.0, bins=50, range = [0, 100],  label=0)  ax.hist(shipping_1, color='#007D00', alpha=0.7, bins=50, range = [0, 100],  label=1)  plt.xlabel('price', fontsize=12)  plt.ylabel('frequency', fontsize=12)  plt.title('Price Distribution by Shipping Type', fontsize=15)  plt.tick_params(labelsize=12)  plt.legend()  plt.show();

“shipping = 0”系指买方支付的运费，“shipping = 1”系指卖方支付的运费。一般来说，买方支付运费时价格较高。

建模

对于斯坦模型的构建，将相关变量作为本地副本是很方便的——这有助于可读性。

category：每个类别名称的索引代码

price：价格

category_names：唯一类别名称

categories：类别的数量

log_price：价格日志

shipping：谁支付运费

category_lookup：使用查找字典索引类别

le = preprocessing.LabelEncoder()  df['category_code'] = le.fit_transform(df['category_name'])  category_names = df.category_name.unique()  categories = len(category_names)  category = df['category_code'].values  price = df.price  df['log_price'] = log_price = np.log(price + 0.1).values  shipping = df.shipping.values  category_lookup = dict(zip(category_names, range(len(category_names))))

我们应始终探索数据中的价格分布（对数尺度）：

df.price.apply(lambda x: np.log(x+0.1)).hist(bins=25)    plt.title('Distribution of price (log scale)')  plt.xlabel('log (price)')  plt.ylabel('Frequency');

常规方法

有两种传统的价格建模方法代表了偏差-方差权衡的两个极端：

1.完全池：

对所有类别进行相同的处理，并使用以下公式估算单个价格水平：

要在Stan中指定这个模型，我们首先构造数据块，其中包括log-price度量(y)和谁支付运输协变量(x)的向量，以及样本数量(N)。

完整的池模型:

pooled_data = """  data {    int<lower=0> N;    vector[N] x;    vector[N] y;  }  """  pooled_parameters = """  parameters {    vector[2] beta;    real<lower=0> sigma;  }  """  pooled_model = """  model {    y ~ normal(beta[1] + beta[2] * x, sigma);  }  """

拟合模型:

当将代码、数据和参数传递给Stan函数时，我们指定2条长度为1000的采样链:

pooled_data_dict = {'N': len(log_price),                      'x': shipping,                      'y': log_price}    sm = pystan.StanModel(model_code=pooled_data + pooled_parameters + pooled_model)  pooled_fit = sm.sampling(data=pooled_data_dict, iter=1000, chains=2)

检查配合

一旦运行了fit，该方法将提取并指定permuted=True提取样本到数组字典中，以便进行可视化和总结。

我们感兴趣的是样本参数的这些估计值的平均值。

b0 = alpha=跨类别平均价格

m0 = beta=价格的平均变化，随运费支付方的变化而变化

我们现在可以可视化这个池化模型与观测数据的吻合程度。

pooled_sample = pooled_fit.extract(permuted=True)  b0, m0 = pooled_sample['beta'].T.mean(1)plt.scatter(df.shipping, np.log(df.price+0.1))  xvals = np.linspace(-0.2, 1.2)  plt.xticks([0, 1])  plt.plot(xvals, m0*xvals+b0, 'r--')  plt.title("Fitted model")  plt.xlabel("Shipping")  plt.ylabel("log(price)");

观察：

拟合线贯穿数据中心，描述了趋势。

然而，拟合模型的观测点差异很大，并且存在多个异常值，表明原始价格变化很大。

如果我们选择不同的数据子集，我们可能会期望不同的梯度。

上池化

上池化时，我们分别对每个类别的价格进行建模，公式如下：

其中j = 1，…，689

Unpooled模型

unpooled_model = """data {    int<lower=0> N;    int<lower=1,upper=689> category[N];    vector[N] x;    vector[N] y;  }  parameters {    vector[689] a;    real beta;    real<lower=0,upper=100> sigma;  }  transformed parameters {    vector[N] y_hat;      for (i in 1:N)      y_hat[i] <- beta * x[i] + a[category[i]];  }  model {    y ~ normal(y_hat, sigma);  }"

拟合模型：

在Stan中运行上池化模型时，我们再次将Python变量映射到Stan模型中使用的变量，然后将数据、参数和模型传递给Stan。我们再次指定两个链的1000次迭代。

unpooled_data = {'N': len(log_price),                   'category': category+1, # Stan counts starting at 1                   'x': shipping,                   'y': log_price}    sm = pystan.StanModel(model_code=unpooled_model)  unpooled_fit = sm.sampling(data=unpooled_data, iter=1000, chains=2)

检查配合

为了检验预测价格在类别水平上的变化，我们绘制了每个估计的平均值及其相关的标准误差。为了直观地构造这一结构，我们将重新排序类别，以便从最低到最高绘制类别。

unpooled_estimates = pd.Series(unpooled_fit['a'].mean(0), index=category_names)  unpooled_se = pd.Series(unpooled_fit['a'].std(0), index=category_names)order = unpooled_estimates.sort_values().indexplt.figure(figsize=(18, 6))  plt.scatter(range(len(unpooled_estimates)), unpooled_estimates[order])  for i, m, se in zip(range(len(unpooled_estimates)), unpooled_estimates[order], unpooled_se[order]):      plt.plot([i,i], [m-se, m+se], 'b-')      plt.xlim(-1,690);  plt.ylabel('Price estimate (log scale)');plt.xlabel('Ordered category');plt.title('Variation in category price estimates');

观察：

预测价格水平相对较低的有多个类别，预测价格水平相对较高的也有多个类别。它们的距离可能很大。

所有价格水平的单一全类别估计不能很好地表示这种变化。

Pooled和unpooled估计值的比较

我们可以对所有类别的汇总和未汇总估计值进行直观的比较，我们将展示几个例子，我特意选择了一些产品较多的类别，以及一些产品很少的类别。

# Define subset of categories  sample_categories = ('Women/Tops & Blouses/T-Shirts', "Women/Women's Handbags/Cosmetic Bags",                     'Kids/Diapering/Diaper Bags', 'Women/Underwear/Bras', 'Beauty/Makeup/Body',                     'Beauty/Fragrance/Women', 'Women/Sweaters/Full Zip', 'Home/Bedding/Quilts')    fig, axes = plt.subplots(2, 4, figsize=(16, 8), sharey=True, sharex=True)  axes = axes.ravel()  m = unpooled_fit['beta'].mean(0)  for i,c in enumerate(sample_categories):      y = df.log_price[df.category_name==c]      x = df.shipping[df.category_name==c]      axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)        # No pooling model      b = unpooled_estimates[c]        # Plot both models and data      xvals = np.linspace(-0.2, 1.2)      axes[i].plot(xvals, m*xvals+b)  #unpooled      axes[i].plot(xvals, m0*xvals+b0, 'r--')  # pooled      axes[i].set_xticks([0,1])      axes[i].set_title(c)      if not i%2:          axes[i].set_ylabel('log price level');

让我试着解释一下上面的可视化告诉我们什么:

每个类别中的池模型(红色虚线)都是相同的，这意味着所有类别的模型都是相同的，这表明池是无用的。
对于观测很少的类别，拟合估计数与观测值非常接近，表明存在过拟合。因此，我们不能相信使用少量观测值的模型得出的估计值。

2.多级和层次模型

Partial Pooling –最简单

电子商务价格数据集最简单的可能部分池模型是一个简单估计价格的模型，没有其他预测因素（即忽略运输的影响）。这是集合（所有类别的平均值）和未合并（类别级别的平均值）之间的折衷，并近似未合并类别估计值和合并估计值的加权平均值(按样本大小)，公式为:

最简单的partial pooling模型：

partial_pooling = """  data {    int<lower=0> N;    int<lower=1,upper=689> category[N];    vector[N] y;  }  parameters {    vector[689] a;    real mu_a;    real<lower=0,upper=100> sigma_a;    real<lower=0,upper=100> sigma_y;  }  transformed parameters {    vector[N] y_hat;    for (i in 1:N)      y_hat[i] <- a[category[i]];  }  model {    mu_a ~ normal(0, 1);    a ~ normal (10 * mu_a, sigma_a);      y ~ normal(y_hat, sigma_y);  }"""

现在我们有两个标准差，一个是描述观测值的残差，另一个是描述平均值附近类别均值的可变性。

partial_pool_data = {'N': len(log_price),                       'category': category+1, # Stan counts starting at 1                       'y': log_price}    sm = pystan.StanModel(model_code=partial_pooling)  partial_pool_fit = sm.sampling(data=partial_pool_data, iter=1000, chains=2)

我们主要对价格的类别级别估计感兴趣，因此我们获得“a”的样本估计：

sample_trace = partial_pool_fit['a']    fig, axes = plt.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)  samples, categories = sample_trace.shape  jitter = np.random.normal(scale=0.1, size=categories)    n_category = df.groupby('category_name')['train_id'].count()  unpooled_means = df.groupby('category_name')['log_price'].mean()  unpooled_sd = df.groupby('category_name')['log_price'].std()  unpooled = pd.DataFrame({'n':n_category, 'm':unpooled_means, 'sd':unpooled_sd})  unpooled['se'] = unpooled.sd/np.sqrt(unpooled.n)    axes[0].plot(unpooled.n + jitter, unpooled.m, 'b.')  for j, row in zip(jitter, unpooled.iterrows()):      name, dat = row      axes[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se], 'b-')  axes[0].set_xscale('log')  axes[0].hlines(sample_trace.mean(), 1, 1000, linestyles='--')    samples, categories = sample_trace.shape  means = sample_trace.mean(axis=0)  sd = sample_trace.std(axis=0)  axes[1].scatter(n_category.values + jitter, means)  axes[1].set_xscale('log')  # axes[1].set_xlim(100,1000)  # axes[1].set_ylim(2, 4)  axes[1].hlines(sample_trace.mean(), 1, 1000, linestyles='--')  for j,n,m,s in zip(jitter, n_category.values, means, sd):      axes[1].plot([n+j]*2, [m-s, m+s], 'b-');    axes[0].set_title("Unpooled model estimates")  axes[1].set_title("Partially pooled model estimates");

观察：

分类价格的未汇总估计值和部分汇总估计值之间存在显著差异，部分汇总估计值看起来不那么极端。

Partial Pooling变截距

简单地说，多级建模在类别之间共享优势，允许在数据较少的类别中进行更合理的推理，公式如下：

变截距模型:

varying_intercept = """  data {    int<lower=0> J;    int<lower=0> N;    int<lower=1,upper=J> category[N];    vector[N] x;    vector[N] y;  }  parameters {    vector[J] a;    real b;    real mu_a;    real<lower=0,upper=100> sigma_a;    real<lower=0,upper=100> sigma_y;  }  transformed parameters {      vector[N] y_hat;      for (i in 1:N)      y_hat[i] <- a[category[i]] + x[i] * b;  }  model {    sigma_a ~ uniform(0, 100);    a ~ normal (mu_a, sigma_a);      b ~ normal (0, 1);      sigma_y ~ uniform(0, 100);    y ~ normal(y_hat, sigma_y);  }  """

拟合模型：

varying_intercept_data = {'N': len(log_price),                            'J': len(n_category),                            'category': category+1, # Stan counts starting at 1                            'x': shipping,                            'y': log_price}    sm = pystan.StanModel(model_code=varying_intercept)  varying_intercept_fit = sm.sampling(data=varying_intercept_data, iter=1000, chains=2)

无法将所有689个类别一起可视化，因此我将对其中20个类别进行可视化。

a_sample = pd.DataFrame(varying_intercept_fit['a'])  plt.figure(figsize=(20, 5))  g = sns.boxplot(data=a_sample.iloc[:,0:20], whis=np.inf, color="g")  # g.set_xticklabels(df.category_name.unique(), rotation=90) # label counties  g.set_title("Estimates of log(price), by category")  g;

今天的内容已经够丰富了，相信你也需要一定的时间消化吸收一下，好吧，我们明天继续。

原文链接：

https://towardsdatascience.com/bayesian-strategy-for-modeling-retail-price-with-pystan-fd0571ed778

end