Used Car Price Prediction | Building an AI Model and Deploying a Web App ⛵

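August 10, 2022
Notes
Used cars, Illustrated Machine Learning Algorithms | Beginner-to-Expert Tutorial Series, Data mining, Data exploration, Machine learning, Machine Learning in Practice | A Hands-On Guide, Feature engineering, Hyperparameter tuning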

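💡 Author: 韓信子@ShowMeAI
📘 Data Analysis in Practice series: //www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: //www.showmeai.tech/tutorials/41
📘 This article: //www.showmeai.tech/article-detail/300
📢 Notice: All rights reserved. To republish, contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more content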

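A used-car report from RESEARCH AND MARKETS projects that the global used-car market will grow at a compound annual growth rate of 6.1% from 2022 to 2030, reaching USD 2.67 trillion by 2030. The widespread use of artificial intelligence has increased transparency between owners and buyers and improved the buying experience, which has strongly driven the growth of the used-car market.

Estimating used-car transaction prices with machine learning is already widely used on used-car trading platforms. In this article, ShowMeAI builds a complete model for used-car price estimation and deploys it as a web application.

💡 Data analysis, processing & feature engineering

The dataset used in this case study is available from 🏆 Kaggle car price prediction, or can be downloaded directly from ShowMeAI's Baidu Netdisk.

🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the WeChat official account 『ShowMeAI研究中心』, or click here to get, for this article, [11] Building an AI model and deploying a web app to predict used car prices: 『CarPrice used car price prediction dataset』

⭐ ShowMeAI official GitHub: //github.com/ShowMeAI-Hub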

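① Data exploration

For the tools and skills used in data analysis and processing, see ShowMeAI's corresponding tutorials and cheat sheets.

Illustrated Data Analysis: Beginner-to-Expert Tutorial Series

Data Science Toolkit Cheat Sheets | Pandas Cheat Sheet

Data Science Toolkit Cheat Sheets | Seaborn Cheat Sheet

We first load the data and take an initial look.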

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
%matplotlib inline

df=pd.read_csv('CarPrice_Assignment.csv')
df.head()

A preview of the DataFrame:

Next we analyze the fields to see which ones are most correlated with price. We first compute the correlation matrix:

# numeric_only=True skips the text columns (needed on pandas 2.0+)
df.corr(numeric_only=True)

Then we visualize the correlations as a heatmap.

sns.set(rc={"figure.figsize":(20, 20)})
sns.heatmap(df.corr(numeric_only=True), annot = True)

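The correlation of each field with price is shown in the figure below. Some fields are very strongly correlated with the target.

We can also plot each numerical field against the target field price for a closer look: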

for col in df.columns: 
    if df[col].dtypes != 'object':
        sns.lmplot(data = df, x = col, y = 'price')

The resulting plots:

We drop some fields that are only weakly correlated with price (|r| < 0.15):

df.drop(['car_ID'], axis = 1, inplace = True) 
to_drop = ['peakrpm', 'compressionratio', 'stroke', 'symboling']
df.drop(columns = to_drop, inplace = True)
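
The list of weakly correlated fields can also be derived from the correlation matrix instead of being typed by hand. The following is a minimal sketch on made-up toy data; the helper name `low_corr_columns` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def low_corr_columns(frame, target, threshold=0.15):
    """Return numeric columns whose absolute Pearson r with `target` is below `threshold`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Toy data (made up): 'enginesize' tracks price, 'noise' is uncorrelated with it
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 170, 190],
    "noise":      [1, -1, -1, -1, -1, 1],
    "price":      [5000, 7000, 9000, 11000, 13000, 15000],
})
weak = low_corr_columns(toy, "price")
print(weak)
```

On the real dataset the same call would be `low_corr_columns(df, 'price')`; whether to drop every column it returns remains a judgment call.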

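② Feature engineering

For the methods and skills used in feature engineering, see ShowMeAI's corresponding tutorial article.

Machine Learning in Practice | A Complete Guide to Feature Engineering

The car name column contains both brand and model. We split it and keep only the brand: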

df['CarName'] = df['CarName'].apply(lambda x: x.split()[0])

Output:

Some brands appear under aliases or with spelling mistakes, so we clean the data a little:

df['CarName'] = df['CarName'].str.lower()
df['CarName']=df['CarName'].replace({'vw':'volkswagen','vokswagen':'volkswagen','toyouta':'toyota','maxda':'mazda','porcshce':'porsche'})

Then we plot the number of cars per brand:

sns.set(rc={'figure.figsize':(30,10)})
sns.countplot(data = df, x='CarName')

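③ Feature encoding & data transformation

Next we do some further feature engineering:

Categorical features

Most machine learning models cannot handle categorical data directly, so we encode it ourselves. Categorical features can be encoded with ordinal encoding or one-hot encoding (see the ShowMeAI article Machine Learning in Practice | A Complete Guide to Feature Engineering). A diagram of one-hot encoding: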
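
To make the difference between the two encodings concrete, here is a small sketch on a made-up three-row column; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy column (made up for illustration)
toy = pd.DataFrame({"fueltype": ["gas", "diesel", "gas"]})

# Ordinal encoding: each category becomes an integer code (categories sorted alphabetically)
ordinal = toy["fueltype"].astype("category").cat.codes

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(toy, columns=["fueltype"], dtype=int)

print(ordinal.tolist())
print(one_hot)
```

Ordinal codes imply an order between categories, which is rarely meaningful for fields such as brand or fuel type; this is one reason the article one-hot encodes them.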

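Numerical features

Different models call for different treatments, such as scaling and distribution adjustment.

We first split the dataset's fields into two groups, categorical and numerical: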

categorical = []
numerical = []
for col in df.columns:
   if df[col].dtypes == 'object':
      categorical.append(col)
   else:
      numerical.append(col)

Next we use pandas' dummy-variable transform to one-hot encode all the features marked as "categorical".

# One-hot encoding
x1 = pd.get_dummies(df[categorical], drop_first = False)
x2 = df[numerical]
X = pd.concat([x2,x1], axis = 1)
X.drop('price', axis = 1, inplace = True)

Now we process the numerical features. We start with the label field price and plot its distribution:

sns.histplot(data=df, x="price", kde=True)

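As the plot shows, this is a skewed distribution. We apply a log transform to bring it closer to a normal distribution. (Another consideration: if we train on the log-transformed value as the label, recovering price requires exponentiation, which guarantees that the predicted price is always positive.) The code:

# Fix the skewed distribution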
df["price_log"]=np.log(df["price"])
sns.histplot(data=df, x="price_log", kde=True)

After the correction, the distribution is closer to normal. With this basic processing done, we are ready to start modeling.
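
The round trip between price and log-price can be checked on a few numbers. This is a small sketch with illustrative values, not data taken from the article's results.

```python
import numpy as np

prices = np.array([5118.0, 10295.0, 45400.0])   # illustrative prices
log_prices = np.log(prices)                      # the scale the model is trained on

# Any real-valued prediction on the log scale maps back to a positive price
fake_log_predictions = np.array([-2.0, 0.0, 9.5])
restored = np.exp(fake_log_predictions)

print(np.exp(log_prices))   # exp undoes log, up to floating-point error
print(restored)
```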

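💡 Machine learning modeling

① Dataset split & data transformation

Let's split the dataset into training and test sets and apply basic data transformations: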

# Split the data
from sklearn.model_selection import train_test_split

y = df['price_log']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.333, random_state=1)

# Feature engineering: standardize the numerical columns (fitted on the training set only)
from sklearn.preprocessing import StandardScaler
num_cols = [c for c in numerical if c != 'price']
X_train = X_train.astype({c: float for c in num_cols})
X_test = X_test.astype({c: float for c in num_cols})
sc = StandardScaler()
X_train[num_cols] = sc.fit_transform(X_train[num_cols])
X_test[num_cols] = sc.transform(X_test[num_cols])
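
The scaler is fitted on the training split and only applied to the test split, so that no test-set statistics leak into training. A minimal sketch of that behavior on a made-up one-column matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (made up): a single numeric column
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)   # learns mean and std from the training rows only
test_scaled = sc_demo.transform(test)         # reuses the training statistics

print(sc_demo.mean_)
print(test_scaled)
```

The scaled training column has mean zero, while the test value is expressed in training-set standard deviations and is not re-centered.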

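② Modeling & tuning

For the methods and skills used in modeling, see ShowMeAI's corresponding tutorial article.

Machine Learning in Practice | A Complete Guide to Using SKLearn

Our dataset is not large (few samples). Weighing model complexity against performance, we first test 4 models to see which one performs best.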

Lasso regression
Ridge regression
Random forest regressor
XGBoost regressor

We first import the corresponding models from scikit-learn and XGBoost:

# Regression models
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

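③ Modeling pipeline

To keep the whole modeling process compact and concise, we build a pipeline to train and tune the models. The steps are:

Train and evaluate each model with initial, untuned hyperparameters.
Tune each model's hyperparameters with grid search.
Retrain and evaluate each model with the best parameters found.

We first import grid search from scikit-learn: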

from sklearn.model_selection import GridSearchCV

Next we build an evaluation function that prints the metrics of each fitted model (R squared, root mean squared error, and mean absolute error):

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def metrics(model):
   # Fit on the training set and predict on the test set (log-price scale)
   model.fit(X_train, y_train)
   Y_pred = model.predict(X_test)

   # R squared, computed on the log scale
   r2 = round(r2_score(y_test, Y_pred), 4)
   print('R2_Score: ', r2)

   # RMSE, computed on the original price scale
   rmse = round(np.sqrt(mean_squared_error(np.exp(y_test), np.exp(Y_pred))), 2)
   print("RMSE: ", rmse)

   # MAE, computed on the original price scale
   mae = round(mean_absolute_error(np.exp(y_test), np.exp(Y_pred)), 2)
   print("MAE: ", mae)

   return r2, rmse, mae

Now we build the pipeline:

# Candidate models with initial hyperparameters
models={
   'rfr':RandomForestRegressor(bootstrap=False, max_depth=15, max_features='sqrt', min_samples_split=2, n_estimators=100),
   'lasso':Lasso(alpha=0.005, fit_intercept=True),
   'ridge':Ridge(alpha = 10, fit_intercept=True),
   # XGBoost does not take the random-forest arguments (bootstrap, max_features, min_samples_split)
   'xgb':xgb.XGBRegressor(max_depth=2, n_estimators = 100)
}

# Search grid for each family of models
rf_params = {
   "n_estimators"      : [10, 100, 1000, 2000, 4000, 6000],
   "max_features"      : [1.0, "sqrt", "log2"],   # 1.0 replaces "auto", removed in recent scikit-learn
   "max_depth"         : [2, 4, 8, 12, 15],
   "min_samples_split" : [2, 4, 8],
   "bootstrap"         : [True, False],
}
xgb_params = {
   "n_estimators" : [10, 100, 1000, 2000, 4000, 6000],
   "max_depth"    : [2, 4, 8, 12, 15],
}
linear_params = {
   "alpha": [0.005, 0.05, 0.1, 1, 10, 100, 290, 500],
   "fit_intercept": [True, False]
}
grids = {'rfr': rf_params, 'xgb': xgb_params, 'lasso': linear_params, 'ridge': linear_params}

# Evaluate each untuned model, then grid-search a fresh instance of the same class
for mod, estimator in models.items():
   print('Untuned metrics for: ', mod)
   metrics(estimator)
   print('\n')
   print('Starting grid search for: ', mod)
   grid = GridSearchCV(type(estimator)(), grids[mod], verbose=5, cv=2)
   grid.fit(X_train, y_train)
   print("Best score: ", grid.best_score_)
   print("Best params: ", grid.best_params_)

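Below are the results for the untuned models:

Without hyperparameter tuning, the R-squared values differ little, but Lasso has the smallest error.

Next, the grid search results, which give the best parameters for each model:

Now we apply these parameters to each model and look at the results:

After tuning, every model improves over its initial hyperparameters, but Lasso regression still performs best (this is related to the sample size and feature correlations of this dataset). We therefore keep the Lasso regression model and save it locally.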

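# Fit the chosen model before saving it; an unfitted estimator cannot predict
lasso_reg = Lasso(alpha = 0.005, fit_intercept = True)
lasso_reg.fit(X_train, y_train)
pickle.dump(lasso_reg, open('model.pkl','wb'))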
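
The web app has to apply the same scaling that was used during training. One option, which is an assumption here and not what the article's code does, is to pickle the fitted scaler together with the fitted model. The sketch below uses made-up toy data and a hypothetical file name `model_bundle.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy training data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + 9.0   # stand-in for log-prices

scaler = StandardScaler().fit(X_toy)
model = Lasso(alpha=0.005).fit(scaler.transform(X_toy), y_toy)

# Hypothetical file name; the article itself saves only 'model.pkl'
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"model": model, "scaler": scaler}, f)

with open(path, "rb") as f:
    bundle = pickle.load(f)

# The reloaded pair reproduces the original predictions
pred_before = model.predict(scaler.transform(X_toy[:2]))
pred_after = bundle["model"].predict(bundle["scaler"].transform(X_toy[:2]))
```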

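💡 Web application development

Next we deploy the model above as a web page, giving an application that estimates prices in real time. We use the gradio library to build the web app. A prediction in the web app involves these steps:

The user enters data in the web form
The data is processed (feature encoding & transformation)
The data is reshaped to match the model's input format
The price is predicted and shown to the user

① Basic development

First, we load the original dataset and the processed (one-hot encoded) dataset, and keep the columns of each.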

# Columns of the original df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0','price'], axis = 1, inplace=True)
cols = df.columns

# Columns of the one-hot encoded df
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis = 1, inplace=True)
cols_to_use = dummy.columns
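
The article does not show how the two files read above were created. Since an `Unnamed: 0` column is dropped after loading, they were presumably written with `to_csv` and its default index. The sketch below shows one way they might be exported; the toy frames stand in for the notebook's `df` and `X`, and the whole export step is an assumption.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the notebook's `df` (raw columns) and `X` (encoded features)
df_toy = pd.DataFrame({"CarName": ["audi", "bmw"], "horsepower": [102, 121], "price": [13950, 16430]})
X_toy = pd.concat(
    [df_toy[["horsepower"]], pd.get_dummies(df_toy[["CarName"]], dtype=int)], axis=1
)

out_dir = tempfile.mkdtemp()
# The default index=True writes an unnamed index column, which reads back as 'Unnamed: 0'
df_toy.to_csv(os.path.join(out_dir, "df_columns"))
X_toy.to_csv(os.path.join(out_dir, "dummy_df"))

reloaded_df = pd.read_csv(os.path.join(out_dir, "df_columns"))
reloaded_dummy = pd.read_csv(os.path.join(out_dir, "dummy_df"))
print(reloaded_df.columns.tolist())
print(reloaded_dummy.columns.tolist())
```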

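Next, for the categorical features, we build the dropdown options for the web app:

# Build the candidate values used in the app

# Car brands, capitalized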
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
   carNameCap.append(col.capitalize())

# fueltype field
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
   fuelCap.append(fu.capitalize())

# Car body, engine type, and fuel system fields
carb = df['carbody'].unique().tolist()
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()

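OK. For the categorical fields above that the model needs for prediction, we build dropdowns and fill in the candidate options.

Next we define a function that processes the data, runs the prediction, and returns the price: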

# Transform the input so that it matches the model's expected format
def transform(data):
   # Load the trained model
   model = pickle.load(open('model.pkl','rb'))

   # Build a one-row DataFrame from the new data
   new_df = pd.DataFrame([data], columns = cols)
   # Separate categorical and numerical features
   cat = []
   num = []
   for col in new_df.columns:
      if new_df[col].dtypes == 'object':
         cat.append(col)
      else:
         num.append(col)
   x1_new = pd.get_dummies(new_df[cat], drop_first = False)
   x2_new = new_df[num]

   # Align with the training columns; dummy columns absent from this row become 0
   X_new = pd.concat([x2_new, x1_new], axis = 1)
   final_df = X_new.reindex(columns = cols_to_use, fill_value = 0).astype(float)

   # Fit the scaler on the saved feature table: a scaler fitted on a single row
   # would return zeros. Reusing the scaler fitted at training time is preferable.
   sc = StandardScaler().fit(dummy[num].astype(float))
   final_df[num] = sc.transform(final_df[num])
   output = model.predict(final_df)
   return "The price of the car " + str(round(np.exp(output)[0],2)) + "$"
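
The key step in this function is lining up a single encoded row with the full set of training columns: `get_dummies` on one row only produces columns for the categories that row contains. A compact way to express the alignment is `DataFrame.reindex`. The sketch below uses hypothetical column names.

```python
import pandas as pd

# Hypothetical training-time columns after one-hot encoding
train_columns = ["horsepower", "fueltype_diesel", "fueltype_gas"]

# One new observation: get_dummies only sees the category present in this row
new_row = pd.DataFrame([{"horsepower": 111, "fueltype": "gas"}])
encoded = pd.get_dummies(new_row, columns=["fueltype"], dtype=int)

# Align to the training layout; missing dummy columns are filled with 0
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Reindexing also fixes the column order, which matters because the model identifies features by position or by name, not by meaning.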

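Next we create the elements of the gradio web app. We build dropdown menus or radio buttons for the categorical fields and sliders for the numerical fields. Sample code:

# Categorical
car = gr.Dropdown(label = "Car brand", choices=carNameCap)
# Numerical
curbweight = gr.Slider(label = "Weight of the car (in pounds)", minimum = 500, maximum = 6000)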

Now we add everything to the interface:

With everything in place, we can deploy!

② Deployment

Now we deploy the app built above. First we configure the app's IP and port:

export GRADIO_SERVER_NAME=0.0.0.0
export GRADIO_SERVER_PORT="$PORT"
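
Gradio picks up these two variables on its own at launch. If you would rather make the choice explicit in code, the values can be read and passed to `launch` through its `server_name` and `server_port` arguments. The helper `resolve_server` below is a hypothetical name used only for this sketch, and the fallbacks are Gradio's usual local defaults.

```python
import os

def resolve_server(env=os.environ):
    """Read host and port from the environment, falling back to local defaults."""
    host = env.get("GRADIO_SERVER_NAME", "127.0.0.1")
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    return host, port

# Example with an explicit mapping instead of the real environment
host, port = resolve_server({"GRADIO_SERVER_NAME": "0.0.0.0", "GRADIO_SERVER_PORT": "8080"})
print(host, port)
# In WebApp.py this could be passed on as: app.launch(server_name=host, server_port=port)
```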

Make sure the following dependencies are installed with pip:

numpy                            
pandas                             
scikit-learn                             
gradio                             
Flask                             
argparse                             
gunicorn                             
rq

Then run python WebApp.py to test the app. The contents of WebApp.py are as follows:

import gradio as gr
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler

# Lookup dictionaries mapping UI labels to dataset values
asp = {
    'Standard': 'std',
    'Turbo': 'turbo'
}

drivew = {
    'Rear wheel drive': 'rwd',
    'Front wheel drive': 'fwd', 
    '4 wheel drive': '4wd'
}

cylnum = {
    2: 'two',
    3: 'three', 
    4: 'four',
    5: 'five', 
    6: 'six', 
    8: 'eight',
    12: 'twelve'
}

# Column names of the original df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0','price'], axis = 1, inplace=True)
cols = df.columns

# Column names after one-hot encoding
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis = 1, inplace=True)
cols_to_use = dummy.columns

# Car brand names
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())

# fuel
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())

# Car body, engine type, and fuel system
carb = df['carbody'].unique().tolist() 
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()

# Transform the input data to fit the model
def transform(data):
    # Load the trained model
    lasso_reg = pickle.load(open('model.pkl','rb'))

    # Build a one-row DataFrame from the new data
    new_df = pd.DataFrame([data],columns = cols)

    # Separate categorical and numerical fields
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)

    # Build the data in the format the model expects
    x1_new = pd.get_dummies(new_df[cat], drop_first = False)
    x2_new = new_df[num]
    X_new = pd.concat([x2_new,x1_new], axis = 1)

    # Align with the training columns; dummy columns absent from this row become 0
    final_df = X_new.reindex(columns = cols_to_use, fill_value = 0).astype(float)

    # Fit the scaler on the saved feature table and scale the new row with it.
    # Reusing the scaler fitted at training time is preferable if it was saved.
    sc = StandardScaler().fit(dummy[num].astype(float))
    final_df[num] = sc.transform(final_df[num])

    # Predict on the new row (the first and only row of final_df)
    output = lasso_reg.predict(final_df)
    return "The price of the car " + str(round(np.exp(output)[0],2)) + "$"

# Main function for price prediction
def predict_price(car, fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, wheelbase, carlength, carwidth, 
                carheight, curbweight, enginetype, cylindernumber, enginesize, fuelsystem, boreratio, horsepower, citympg, highwaympg): 

    new_data = [car.lower(), fueltype.lower(), asp[aspiration], doornumber.lower(), carbody, drivew[drivewheel], enginelocation.lower(),
                wheelbase, carlength, carwidth, carheight, curbweight, enginetype, cylnum[cylindernumber], enginesize, fuelsystem, 
                boreratio, horsepower, citympg, highwaympg]
    
    return transform(new_data) 


car = gr.Dropdown(label = "Car brand", choices=carNameCap)

fueltype = gr.Radio(label = "Fuel Type", choices = fuelCap)

aspiration = gr.Radio(label = "Aspiration type", choices = ["Standard", "Turbo"])

doornumber = gr.Radio(label = "Number of doors", choices = ["Two", "Four"])

carbody = gr.Dropdown(label ="Car body type", choices = carb)

drivewheel = gr.Radio(label = "Drive wheel", choices = ['Rear wheel drive', 'Front wheel drive', '4 wheel drive'])

enginelocation = gr.Radio(label = "Engine location", choices = ['Front', 'Rear'])

wheelbase = gr.Slider(label = "Distance between the wheels on the side of the car (in inches)", minimum = 50, maximum = 300)

carlength = gr.Slider(label = "Length of the car (in inches)", minimum = 50, maximum = 300)

carwidth = gr.Slider(label = "Width of the car (in inches)", minimum = 50, maximum = 300)

carheight = gr.Slider(label = "Height of the car (in inches)", minimum = 50, maximum = 300)

curbweight = gr.Slider(label = "Weight of the car (in pounds)", minimum = 500, maximum = 6000)

enginetype = gr.Dropdown(label = "Engine type", choices = engtype)

cylindernumber = gr.Radio(label = "Cylinder number", choices = [2, 3, 4, 5, 6, 8, 12])

enginesize = gr.Slider(label = "Engine size (swept volume of all the pistons inside the cylinders)", minimum = 50, maximum = 500)

fuelsystem = gr.Dropdown(label = "Fuel system (link to ressource: ", choices = fuelsys)

boreratio = gr.Slider(label = "Bore ratio (ratio between cylinder bore diameter and piston stroke)", minimum = 1, maximum = 6)

horsepower = gr.Slider(label = "Horse power of the car", minimum = 25, maximum = 400)

citympg = gr.Slider(label = "Fuel economy in city (in miles per gallon)", minimum = 0, maximum = 100)

highwaympg = gr.Slider(label = "Fuel economy on highway (in miles per gallon)", minimum = 0, maximum = 100)

Output = gr.Textbox()

app = gr.Interface(title="Predict the price of a car based on its specs", 
                    fn=predict_price,
                    inputs=[car,
                            fueltype,
                            aspiration,
                            doornumber,
                            carbody,
                            drivewheel, 
                            enginelocation, 
                            wheelbase,
                            carlength, 
                            carwidth, 
                            carheight, 
                            curbweight,
                            enginetype, 
                            cylindernumber, 
                            enginesize,
                            fuelsystem,
                            boreratio,
                            horsepower, 
                            citympg, 
                            highwaympg
                            ],
                    outputs=Output)

app.launch()

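The final app looks like this. You can select and fill in the features yourself to get a model prediction!

References

🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the WeChat official account 『ShowMeAI研究中心』, or click here to get, for this article, [11] Building an AI model and deploying a web app to predict used car prices: 『CarPrice used car price prediction dataset』
⭐ ShowMeAI official GitHub: //github.com/ShowMeAI-Hub
📘 Illustrated Data Analysis: Beginner-to-Expert Tutorial Series //www.showmeai.tech/tutorials/33
📘 Data Science Toolkit Cheat Sheets | Pandas Cheat Sheet //www.showmeai.tech/article-detail/101
📘 Data Science Toolkit Cheat Sheets | Seaborn Cheat Sheet //www.showmeai.tech/article-detail/105
📘 Machine Learning in Practice | A Complete Guide to Feature Engineering //www.showmeai.tech/article-detail/208
📘 Machine Learning in Practice | A Complete Guide to Using SKLearn //www.showmeai.tech/article-detail/203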
