(PyStan)零售價格貝葉斯策略建模（上）

2019 年 10 月 7 日
筆記

編輯 | sunlei 發佈 | ATYUN訂閱號

本文中藍色字體為外部鏈接，部分外部鏈接無法從文章中直接跳轉，請點擊【閱讀原文】以訪問。

定價是任何電子商務企業都面臨的一個普遍問題，可以通過貝葉斯統計方法得到有效的解決。

Kaggle的Mercari Price建議數據集似乎是我想學習的貝葉斯模型的一個很好的候選。

如果你還記得，數據集的目的是為Mercari網站賣家建立一個模型，自動為任何給定的產品給出正確的價格。我在這裡嘗試看看我們是否可以用通過使用pystan的貝葉斯統計方法來解決這個問題。

下面的定價分析複製了Fonnesbeck教授對家庭氡水平的案例研究。事實上，方法和代碼在很大程度上借鑒了他的教程。

數據

在此分析中，我們將評估類別中存在的單個產品價格的參數。測量價格是運輸條件（買方支付運費或賣方支付運費）和總價格的函數。

最後，我們對產品價格參數的估計可以看作是一種預測。

簡單地說，我們使用的自變量是：類別名稱和運輸。因變量是：價格。

from scipy import stats  import arviz as az  import numpy as np  import matplotlib.pyplot as plt  import pystan  import seaborn as sns  import pandas as pd  from theano import shared  from sklearn import preprocessing  plt.style.use('bmh')df = pd.read_csv('train.tsv', sep = 't')  df = df.sample(frac=0.01, random_state=99)  df = df[pd.notnull(df['category_name'])]df.category_name.nunique()

為了讓事情更有趣，我將為所有這689個產品類別建模。如果你想更快地產生更好的結果，你可能先為前10或前20個類別建模。

shipping_0 = df.loc[df['shipping'] == 0, 'price']  shipping_1 = df.loc[df['shipping'] == 1, 'price']  fig, ax = plt.subplots(figsize=(10,5))  ax.hist(shipping_0, color='#8CB4E1', alpha=1.0, bins=50, range = [0, 100],  label=0)  ax.hist(shipping_1, color='#007D00', alpha=0.7, bins=50, range = [0, 100],  label=1)  plt.xlabel('price', fontsize=12)  plt.ylabel('frequency', fontsize=12)  plt.title('Price Distribution by Shipping Type', fontsize=15)  plt.tick_params(labelsize=12)  plt.legend()  plt.show();

「shipping = 0」系指買方支付的運費，「shipping = 1」系指賣方支付的運費。一般來說，買方支付運費時價格較高。

建模

對於斯坦模型的構建，將相關變量作為本地副本是很方便的——這有助於可讀性。

category：每個類別名稱的索引代碼

price：價格

category_names：唯一類別名稱

categories：類別的數量

log_price：價格日誌

shipping：誰支付運費

category_lookup：使用查找字典索引類別

le = preprocessing.LabelEncoder()  df['category_code'] = le.fit_transform(df['category_name'])  category_names = df.category_name.unique()  categories = len(category_names)  category = df['category_code'].values  price = df.price  df['log_price'] = log_price = np.log(price + 0.1).values  shipping = df.shipping.values  category_lookup = dict(zip(category_names, range(len(category_names))))

我們應始終探索數據中的價格分佈（對數尺度）：

df.price.apply(lambda x: np.log(x+0.1)).hist(bins=25)    plt.title('Distribution of price (log scale)')  plt.xlabel('log (price)')  plt.ylabel('Frequency');

常規方法

有兩種傳統的價格建模方法代表了偏差-方差權衡的兩個極端：

1.完全池：

對所有類別進行相同的處理，並使用以下公式估算單個價格水平：

要在Stan中指定這個模型，我們首先構造數據塊，其中包括log-price度量(y)和誰支付運輸協變量(x)的向量，以及樣本數量(N)。

完整的池模型:

pooled_data = """  data {    int<lower=0> N;    vector[N] x;    vector[N] y;  }  """  pooled_parameters = """  parameters {    vector[2] beta;    real<lower=0> sigma;  }  """  pooled_model = """  model {    y ~ normal(beta[1] + beta[2] * x, sigma);  }  """

擬合模型:

當將代碼、數據和參數傳遞給Stan函數時，我們指定2條長度為1000的採樣鏈:

pooled_data_dict = {'N': len(log_price),                      'x': shipping,                      'y': log_price}    sm = pystan.StanModel(model_code=pooled_data + pooled_parameters + pooled_model)  pooled_fit = sm.sampling(data=pooled_data_dict, iter=1000, chains=2)

檢查配合

一旦運行了fit，該方法將提取並指定permuted=True提取樣本到數組字典中，以便進行可視化和總結。

我們感興趣的是樣本參數的這些估計值的平均值。

b0 = alpha=跨類別平均價格

m0 = beta=價格的平均變化，隨運費支付方的變化而變化

我們現在可以可視化這個池化模型與觀測數據的吻合程度。

pooled_sample = pooled_fit.extract(permuted=True)  b0, m0 = pooled_sample['beta'].T.mean(1)plt.scatter(df.shipping, np.log(df.price+0.1))  xvals = np.linspace(-0.2, 1.2)  plt.xticks([0, 1])  plt.plot(xvals, m0*xvals+b0, 'r--')  plt.title("Fitted model")  plt.xlabel("Shipping")  plt.ylabel("log(price)");

觀察：

擬合線貫穿數據中心，描述了趨勢。

然而，擬合模型的觀測點差異很大，並且存在多個異常值，表明原始價格變化很大。

如果我們選擇不同的數據子集，我們可能會期望不同的梯度。

上池化

上池化時，我們分別對每個類別的價格進行建模，公式如下：

其中j = 1，…，689

Unpooled模型

unpooled_model = """data {    int<lower=0> N;    int<lower=1,upper=689> category[N];    vector[N] x;    vector[N] y;  }  parameters {    vector[689] a;    real beta;    real<lower=0,upper=100> sigma;  }  transformed parameters {    vector[N] y_hat;      for (i in 1:N)      y_hat[i] <- beta * x[i] + a[category[i]];  }  model {    y ~ normal(y_hat, sigma);  }"

擬合模型：

在Stan中運行上池化模型時，我們再次將Python變量映射到Stan模型中使用的變量，然後將數據、參數和模型傳遞給Stan。我們再次指定兩個鏈的1000次迭代。

unpooled_data = {'N': len(log_price),                   'category': category+1, # Stan counts starting at 1                   'x': shipping,                   'y': log_price}    sm = pystan.StanModel(model_code=unpooled_model)  unpooled_fit = sm.sampling(data=unpooled_data, iter=1000, chains=2)

檢查配合

為了檢驗預測價格在類別水平上的變化，我們繪製了每個估計的平均值及其相關的標準誤差。為了直觀地構造這一結構，我們將重新排序類別，以便從最低到最高繪製類別。

unpooled_estimates = pd.Series(unpooled_fit['a'].mean(0), index=category_names)  unpooled_se = pd.Series(unpooled_fit['a'].std(0), index=category_names)order = unpooled_estimates.sort_values().indexplt.figure(figsize=(18, 6))  plt.scatter(range(len(unpooled_estimates)), unpooled_estimates[order])  for i, m, se in zip(range(len(unpooled_estimates)), unpooled_estimates[order], unpooled_se[order]):      plt.plot([i,i], [m-se, m+se], 'b-')      plt.xlim(-1,690);  plt.ylabel('Price estimate (log scale)');plt.xlabel('Ordered category');plt.title('Variation in category price estimates');

觀察：

預測價格水平相對較低的有多個類別，預測價格水平相對較高的也有多個類別。它們的距離可能很大。

所有價格水平的單一全類別估計不能很好地表示這種變化。

Pooled和unpooled估計值的比較

我們可以對所有類別的匯總和未匯總估計值進行直觀的比較，我們將展示幾個例子，我特意選擇了一些產品較多的類別，以及一些產品很少的類別。

# Define subset of categories  sample_categories = ('Women/Tops & Blouses/T-Shirts', "Women/Women's Handbags/Cosmetic Bags",                     'Kids/Diapering/Diaper Bags', 'Women/Underwear/Bras', 'Beauty/Makeup/Body',                     'Beauty/Fragrance/Women', 'Women/Sweaters/Full Zip', 'Home/Bedding/Quilts')    fig, axes = plt.subplots(2, 4, figsize=(16, 8), sharey=True, sharex=True)  axes = axes.ravel()  m = unpooled_fit['beta'].mean(0)  for i,c in enumerate(sample_categories):      y = df.log_price[df.category_name==c]      x = df.shipping[df.category_name==c]      axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)        # No pooling model      b = unpooled_estimates[c]        # Plot both models and data      xvals = np.linspace(-0.2, 1.2)      axes[i].plot(xvals, m*xvals+b)  #unpooled      axes[i].plot(xvals, m0*xvals+b0, 'r--')  # pooled      axes[i].set_xticks([0,1])      axes[i].set_title(c)      if not i%2:          axes[i].set_ylabel('log price level');

讓我試着解釋一下上面的可視化告訴我們什麼:

每個類別中的池模型(紅色虛線)都是相同的，這意味着所有類別的模型都是相同的，這表明池是無用的。
對於觀測很少的類別，擬合估計數與觀測值非常接近，表明存在過擬合。因此，我們不能相信使用少量觀測值的模型得出的估計值。

2.多級和層次模型

Partial Pooling –最簡單

電子商務價格數據集最簡單的可能部分池模型是一個簡單估計價格的模型，沒有其他預測因素（即忽略運輸的影響）。這是集合（所有類別的平均值）和未合併（類別級別的平均值）之間的折衷，並近似未合併類別估計值和合併估計值的加權平均值(按樣本大小)，公式為:

最簡單的partial pooling模型：

partial_pooling = """  data {    int<lower=0> N;    int<lower=1,upper=689> category[N];    vector[N] y;  }  parameters {    vector[689] a;    real mu_a;    real<lower=0,upper=100> sigma_a;    real<lower=0,upper=100> sigma_y;  }  transformed parameters {    vector[N] y_hat;    for (i in 1:N)      y_hat[i] <- a[category[i]];  }  model {    mu_a ~ normal(0, 1);    a ~ normal (10 * mu_a, sigma_a);      y ~ normal(y_hat, sigma_y);  }"""

現在我們有兩個標準差，一個是描述觀測值的殘差，另一個是描述平均值附近類別均值的可變性。

partial_pool_data = {'N': len(log_price),                       'category': category+1, # Stan counts starting at 1                       'y': log_price}    sm = pystan.StanModel(model_code=partial_pooling)  partial_pool_fit = sm.sampling(data=partial_pool_data, iter=1000, chains=2)

我們主要對價格的類別級別估計感興趣，因此我們獲得「a」的樣本估計：

sample_trace = partial_pool_fit['a']    fig, axes = plt.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)  samples, categories = sample_trace.shape  jitter = np.random.normal(scale=0.1, size=categories)    n_category = df.groupby('category_name')['train_id'].count()  unpooled_means = df.groupby('category_name')['log_price'].mean()  unpooled_sd = df.groupby('category_name')['log_price'].std()  unpooled = pd.DataFrame({'n':n_category, 'm':unpooled_means, 'sd':unpooled_sd})  unpooled['se'] = unpooled.sd/np.sqrt(unpooled.n)    axes[0].plot(unpooled.n + jitter, unpooled.m, 'b.')  for j, row in zip(jitter, unpooled.iterrows()):      name, dat = row      axes[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se], 'b-')  axes[0].set_xscale('log')  axes[0].hlines(sample_trace.mean(), 1, 1000, linestyles='--')    samples, categories = sample_trace.shape  means = sample_trace.mean(axis=0)  sd = sample_trace.std(axis=0)  axes[1].scatter(n_category.values + jitter, means)  axes[1].set_xscale('log')  # axes[1].set_xlim(100,1000)  # axes[1].set_ylim(2, 4)  axes[1].hlines(sample_trace.mean(), 1, 1000, linestyles='--')  for j,n,m,s in zip(jitter, n_category.values, means, sd):      axes[1].plot([n+j]*2, [m-s, m+s], 'b-');    axes[0].set_title("Unpooled model estimates")  axes[1].set_title("Partially pooled model estimates");

觀察：

分類價格的未匯總估計值和部分匯總估計值之間存在顯著差異，部分匯總估計值看起來不那麼極端。

Partial Pooling變截距

簡單地說，多級建模在類別之間共享優勢，允許在數據較少的類別中進行更合理的推理，公式如下：

變截距模型:

varying_intercept = """  data {    int<lower=0> J;    int<lower=0> N;    int<lower=1,upper=J> category[N];    vector[N] x;    vector[N] y;  }  parameters {    vector[J] a;    real b;    real mu_a;    real<lower=0,upper=100> sigma_a;    real<lower=0,upper=100> sigma_y;  }  transformed parameters {      vector[N] y_hat;      for (i in 1:N)      y_hat[i] <- a[category[i]] + x[i] * b;  }  model {    sigma_a ~ uniform(0, 100);    a ~ normal (mu_a, sigma_a);      b ~ normal (0, 1);      sigma_y ~ uniform(0, 100);    y ~ normal(y_hat, sigma_y);  }  """

擬合模型：

varying_intercept_data = {'N': len(log_price),                            'J': len(n_category),                            'category': category+1, # Stan counts starting at 1                            'x': shipping,                            'y': log_price}    sm = pystan.StanModel(model_code=varying_intercept)  varying_intercept_fit = sm.sampling(data=varying_intercept_data, iter=1000, chains=2)

無法將所有689個類別一起可視化，因此我將對其中20個類別進行可視化。

a_sample = pd.DataFrame(varying_intercept_fit['a'])  plt.figure(figsize=(20, 5))  g = sns.boxplot(data=a_sample.iloc[:,0:20], whis=np.inf, color="g")  # g.set_xticklabels(df.category_name.unique(), rotation=90) # label counties  g.set_title("Estimates of log(price), by category")  g;

今天的內容已經夠豐富了，相信你也需要一定的時間消化吸收一下，好吧，我們明天繼續。

原文鏈接：

https://towardsdatascience.com/bayesian-strategy-for-modeling-retail-price-with-pystan-fd0571ed778

end