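AI Hears the World: How an Art Novice Like Me Used an AI Model to Quickly Identify Music Genres and Choose Music ⛵

August 26, 2022
Notes
Python, Artificial Intelligence, Data Analysis, Data Mining, Machine Learning, Machine Learning in Practice | A Hands-On Guide to Machine Learning, Music Genres

💡 Author: 韓信子@ShowMeAI
📘 Data Analysis in Practice series: //www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: //www.showmeai.tech/tutorials/41
📘 This article: //www.showmeai.tech/article-detail/309
📢 Notice: All rights reserved. To republish, contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more content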

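Given enough relevant information, an AI model can quickly learn a new problem domain and build a solid knowledge and prediction system. Music is one example: using information about a song, a model can classify it by genre from its audio and lyric features. In this article, ShowMeAI walks through how to recognize and classify music with machine learning.

This article uses the 🏆Spotify music dataset, which can also be downloaded quickly from ShowMeAI's Baidu Netdisk link.

🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the official account 『ShowMeAI研究中心』, or click here to get the 『Spotify music dataset』 for this article, [18] Building and Tuning a Machine Learning System for Music Genre Recognition

⭐ ShowMeAI official GitHub: //github.com/ShowMeAI-Hub

We will use LightGBM, one of the most widely used boosting ensemble libraries, and tune its hyperparameters with the optuna library to improve model performance.

For a detailed explanation of how LightGBM works and how to use it, see these ShowMeAI articles:

📘Illustrated Machine Learning Algorithms (11) | LightGBM Model Explained

📘Machine Learning in Practice (5) | LightGBM Modeling and Application Explained

This article covers the following sections:

Data overview and preprocessing
EDA (exploratory data analysis)
Lyric features & dimensionality reduction
Modeling and hyperparameter optimization
Summary & takeaways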

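💡 Data Overview and Preprocessing

The dataset contains information on more than 18,000 songs, including audio features (such as energy, tempo, and key) and the lyrics of each song.

We load the data and take a quick look:

import pandas as pd
# Load the data
data = pd.read_csv("spotify_songs.csv")
# Quick look at the first rows
data.head()

# Basic information about the data
data.info()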

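The fields are described below:

Field	Meaning
track_id	Unique song ID
track_name	Song title
track_artist	Artist
lyrics	Lyrics
track_popularity	Track popularity
track_album_id	Unique album ID
track_album_name	Album name
track_album_release_date	Album release date
playlist_name	Playlist name
playlist_id	Playlist ID
playlist_genre	Playlist genre
playlist_subgenre	Playlist subgenre
danceability	How suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 0.0 is least danceable and 1.0 is most danceable.
energy	A measure from 0.0 to 1.0 representing perceived intensity and activity. Energetic tracks typically feel fast and loud. For example, death metal has high energy, while a Bach prelude scores low on the scale.
key	The estimated overall key of the track. Integers map to pitches using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	The overall loudness of the track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks.
mode	The modality (major or minor) of the track, i.e. the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
speechiness	Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, rap, poetry), the closer the value is to 1.0.
acousticness	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
instrumentalness	Predicts whether a track contains vocals. The closer the value is to 1.0, the more likely the track contains no vocal content.
liveness	Detects the presence of an audience in the recording. The closer the track is to a live performance, the higher the value.
valence	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks close to 1 sound more positive (e.g. happy, cheerful, excited), while tracks close to 0 sound more negative (e.g. sad, depressed, angry).
tempo	The overall estimated tempo of the track in beats per minute (BPM).
duration_ms	Duration of the song in milliseconds
language	Language of the lyrics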
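The key and mode encodings in the table are easier to read once decoded. The following is a minimal sketch of such a decoder; describe_key and PITCH_CLASSES are hypothetical names introduced here for illustration, and the pitch-class order simply continues the pattern the table gives (0 = C, 1 = C♯/D♭, 2 = D, ...).

```python
# Pitch-class names in the order described by the field table (0 = C, 1 = C♯/D♭, ...)
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a label such as 'C major' (hypothetical helper for illustration)."""
    if key == -1:
        return "no key detected"
    if not 0 <= key < len(PITCH_CLASSES):
        raise ValueError(f"key must be -1 or 0..11, got {key}")
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


print(describe_key(0, 1))
print(describe_key(9, 0))
print(describe_key(-1, 1))
```

Because key and mode are codes rather than quantities, it makes sense that the modeling code later treats both as categorical features.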

The raw data is somewhat messy, so we filter and clean it first.

# Data libraries
import pandas as pd
import re

# NLP libraries for processing lyrics
import nltk
from nltk.corpus import stopwords
from collections import Counter
# nltk.download('stopwords')

# Load the data
data = pd.read_csv("spotify_songs.csv")
# Select columns: drop the track_* and playlist_* metadata, but keep the genre label
keep_cols = [x for x in data.columns if not x.startswith("track") and not x.startswith("playlist")]
keep_cols.append("playlist_genre")
df = data[keep_cols].copy()
# Keep only English songs and drop the "latin" genre
subdf = df[(df.language == "en") & (df.playlist_genre != "latin")].copy().drop(columns = "language")


# Normalize lyrics: lowercase and strip everything except letters and spaces
pattern = r"[^a-zA-Z ]"
subdf.lyrics = subdf.lyrics.apply(lambda x: re.sub(pattern, "", x.lower()))

# Remove stop words (build the set once; calling stopwords.words() for every word is very slow)
stop_words = set(stopwords.words("english"))
subdf.lyrics = subdf.lyrics.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Count how often each word appears in the lyrics

# Join all lyrics
all_text = " ".join(subdf.lyrics)
# Count word frequencies
word_count = Counter(all_text.split())
# Keep words that occur more than 200 times across all lyrics; treat the rest as low-frequency noise
keep_words = {k for k, v in word_count.items() if v > 200}
# Make a copy
lyricdf = subdf.copy().reset_index(drop=True)
# Normalize column names: prefix the audio features with "audio_"
lyricdf.columns = ["audio_"+ x if x not in ["lyrics", "playlist_genre"] else x for x in lyricdf.columns]
# Per-song counts of the retained words
lyricdf.lyrics = lyricdf.lyrics.apply(lambda x: Counter([word for word in x.split() if word in keep_words]))
# Build the word-count DataFrame
unpacked_lyrics = pd.DataFrame.from_records(lyricdf.lyrics.tolist()).add_prefix("lyrics_")
# Fill missing values with 0
unpacked_lyrics = unpacked_lyrics.fillna(0)
# Concatenate and drop the original lyrics column
lyricdf = pd.concat([lyricdf, unpacked_lyrics], axis = 1).drop(columns = "lyrics")
# Order columns: non-lyric columns first, then lyric columns alphabetically
reordered_cols = [col for col in lyricdf.columns if not col.startswith("lyrics_")] + sorted([col for col in lyricdf.columns if col.startswith("lyrics_")])
lyricdf = lyricdf[reordered_cols]

# Save as a new csv file
lyricdf.to_csv("music_data.csv", index = False)

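The main preprocessing steps are described in the code comments above. In summary:

Filter the data to keep only English songs and drop the "latin" genre (those songs are almost entirely in Spanish, so very few survive the English filter, which would leave a severe class imbalance).
Tidy the lyrics by lowercasing them and removing punctuation and stop words. Count how often each remaining word occurs, then filter out the words that occur least often across all songs (messy data/noise).
Clean up and sort the columns.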
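The steps above can be sketched on a toy example. The three lyrics and the tiny stop-word set below are invented for illustration, and the frequency threshold is lowered to "at least 2" so that the effect is visible on such a small sample (the article's code keeps words occurring more than 200 times).

```python
import re
from collections import Counter

import pandas as pd

# Toy lyrics and a tiny stop-word list (both invented for illustration)
lyrics = pd.Series(["Love me, love me do!", "I feel the rhythm", "Feel the love tonight"])
stop_words = {"i", "me", "the", "do"}

# Lowercase, strip non-letters, split into words and drop stop words
cleaned = lyrics.apply(lambda s: re.sub(r"[^a-zA-Z ]", "", s.lower()))
cleaned = cleaned.apply(lambda s: [w for w in s.split() if w not in stop_words])

# Keep only words that occur at least twice in total
totals = Counter(w for words in cleaned for w in words)
keep_words = {w for w, n in totals.items() if n >= 2}

# One row per song, one "lyrics_<word>" column per retained word
counts = [Counter(w for w in words if w in keep_words) for words in cleaned]
bow = (pd.DataFrame.from_records(counts)
         .fillna(0).astype(int)
         .add_prefix("lyrics_")
         .sort_index(axis=1))
print(bow)
```

Each retained word becomes one numeric column, which is why the real dataset ends up with well over a thousand sparse lyric features.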

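💡 EDA (Exploratory Data Analysis)

As with every project, we first analyze the data to understand it better. This is the EDA (exploratory data analysis) step.

For the libraries used in this EDA section, see the cheat sheets and tutorials prepared by ShowMeAI:
📘Data Science Tool Cheat Sheets | Pandas Cheat Sheet
📘Illustrated Data Analysis: From Beginner to Expert tutorial series

First, we check the class distribution and balance of our label (genre).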

import matplotlib.pyplot as plt

# Load the preprocessed data (its columns carry the audio_/lyrics_ prefixes)
data = pd.read_csv("music_data.csv")
# Count songs per genre
by_genre = data.groupby("playlist_genre")["audio_key"].count().reset_index()
fig, ax = plt.subplots()

# Plot
ax.bar(by_genre.playlist_genre, by_genre.audio_key)
ax.set_ylabel("Number of Observations")
ax.set_xlabel("Genre")
ax.set_title("Observations per Class")
ax.set_ylim(0, 4000)

# Label each bar with its count
rects = ax.patches
for rect in rects:
    height = rect.get_height()
    ax.text(
        rect.get_x() + rect.get_width() / 2, height + 5, f"{int(height)}", ha="center", va="bottom"
    )

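There is a slight class imbalance, so we stratify (preserve the class proportions) when we cross-validate and when we split the data into training and test sets.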
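To make "stratify" concrete, here is a minimal sketch using scikit-learn's train_test_split on invented labels. The class sizes and the placeholder feature are assumptions chosen only so the proportions are easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels: 60 "pop", 30 "rock", 10 "rap"
y = np.array(["pop"] * 60 + ["rock"] * 30 + ["rap"] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# stratify=y keeps the class shares the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def proportions(labels):
    """Map each class to its share of the given labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): float(c) / len(labels) for v, c in zip(values, counts)}


print(proportions(y_tr))
print(proportions(y_te))
```

Without stratify, a small class such as "rap" could by chance be over- or under-represented in the test set, which would distort a per-class metric.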

# Split the columns into audio and lyric features
audio = data[[x for x in data.columns if x.startswith("audio")]].copy()
lyric = data[[x for x in data.columns if x.startswith("lyric")]].copy()
# Shorten the column names by dropping the prefixes
audio.columns = audio.columns.str.replace("audio_", "")
lyric.columns = lyric.columns.str.replace("lyrics_", "")

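💡 Lyric Features & Dimensionality Reduction

Machine learning algorithms can run into performance problems on high-dimensional data, so we sometimes reduce the dimensionality first.

In essence, dimensionality reduction projects high-dimensional data onto a lower-dimensional subspace while retaining as much of the information in the data as possible. For more, see ShowMeAI's article on the underlying algorithms: 📘Illustrated Machine Learning | Dimensionality Reduction Algorithms Explained

We explore whether dimensionality reduction (PCA and t-SNE) suits our lyric data, and make a few adjustments.

📌 PCA (Principal Component Analysis)

PCA is one of the most common dimensionality reduction algorithms. It lets us reduce the data and see roughly how much of the original information is retained. In our case, reducing the lyrics to 400 dimensions still retains about 60% of the information (variance) in the lyrics, and 800 dimensions covers about 80%. The lyric features are very sparse, so reducing them can also make them easier to model.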

# General data libraries
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
# Preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Load the data
data = pd.read_csv("music_data.csv")
# Split into audio and lyric features
audio = data[[x for x in data.columns if x.startswith("audio")]]
lyric = data[[x for x in data.columns if x.startswith("lyric")]]
# Labels
y = data.playlist_genre

# Min-max scaling, then PCA
scaler = MinMaxScaler()
audio_features = scaler.fit_transform(audio)
lyric_features = scaler.fit_transform(lyric)

pca = PCA()
lyric_pca  = pca.fit_transform(lyric_features)
var_explained_ratio = pca.explained_variance_ratio_
   
# Plot graph
fig, ax = plt.subplots()
# Reduce margins
plt.margins(x=0.01)
# Get cumulative sum of variance explained
cum_var_explained = np.cumsum(var_explained_ratio)
# Plot cumulative sum
ax.fill_between(range(len(cum_var_explained)), cum_var_explained,
                alpha = 0.4, color = "tab:orange",
                label = "Cum. Var.")
ax.set_ylim(0, 1)
# Plot actual proportions
ax2 = ax.twinx()
ax2.plot(range(len(var_explained_ratio)), var_explained_ratio,
         alpha = 1, color = "tab:blue", lw  = 4, ls = "--",
         label = "Var per PC")
ax2.set_ylim(0, 0.005)

# Add lines to indicate where good values of components may be
ax.hlines(0.6, 0, var_explained_ratio.shape[0], color = "tab:green", lw = 3, alpha = 0.6, ls=":")
ax.hlines(0.8, 0, var_explained_ratio.shape[0], color = "tab:green", lw = 3, alpha = 0.6, ls=":")
# Plot both legends together
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2)
# Format axis as percentages
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
ax2.yaxis.set_major_formatter(mtick.PercentFormatter(1)) 

# Add titles and labels
ax.set_ylabel("Cum. Prop. of Variance Explained")
ax2.set_ylabel("Prop. of Variance Explained per PC", rotation = 270, labelpad=30)
ax.set_title("Variance Explained by Number of Principal Components")
ax.set_xlabel("Number of Principal Components")
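The 60% and 80% thresholds marked in the plot can also be read off numerically rather than by eye. The sketch below does this on synthetic data standing in for the scaled lyric matrix; components_for is a hypothetical helper introduced for illustration, and the numbers it prints describe the synthetic data only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled lyric matrix (200 songs x 50 features),
# with feature variances that decay from left to right
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(3.0, 0.1, 50)

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)


def components_for(threshold: float) -> int:
    """Smallest number of components whose cumulative variance reaches threshold."""
    # searchsorted returns the first index where cum_var >= threshold
    return int(np.searchsorted(cum_var, threshold) + 1)


for threshold in (0.6, 0.8):
    k = components_for(threshold)
    print(f"{threshold:.0%} of variance needs {k} components")
```

scikit-learn can also do this selection itself: passing a float between 0 and 1 as n_components, for example PCA(n_components=0.8), keeps just enough components to reach that share of the variance.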

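📌 t-SNE Visualization

We can go a step further and visualize how separable the data is at different levels of reduction. t-SNE is a very effective nonlinear dimensionality reduction method for visualization: it lets us plot the data on a 2D plane and see how it spreads out. The t-SNE plots below show what the lyric data looks like when projected into two dimensions, using either all 1806 features or only the first 1000, 500, or 100 principal components.

The code is as follows: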

from sklearn.manifold import TSNE
import seaborn as sns

# Merge labels with normalised audio data and lyric principal components
tsne_processed = pd.concat([
    pd.Series(y, name = "genre"),
    pd.DataFrame(audio_features, columns=audio.columns),
    # Add prefix to make selecting pcs easier later on
    pd.DataFrame(lyric_pca).add_prefix("lyrics_pc_")
          ], axis = 1)

# Select principal components by position, so a cutoff beyond the last column cannot raise a KeyError
pc_cols = [c for c in tsne_processed.columns if c.startswith("lyrics_pc_")]

# Get t-SNE values for a range of principal component cutoffs, starting with all PCs
all_tsne = pd.DataFrame()
for cutoff in [len(pc_cols), 1000, 500, 100]:
    # Create t-SNE object
    tsne = TSNE(init = "random", learning_rate = "auto")
    # Fit on normalised features (excluding the y/label column)
    tsne_results = tsne.fit_transform(tsne_processed[list(audio.columns) + pc_cols[:cutoff]])
    
    # neater graph
    label = f"All {cutoff}" if cutoff == len(pc_cols) else str(cutoff)
    # Get results
    tsne_df = pd.DataFrame({"y":y,
                        "tsne-2d-one":tsne_results[:,0],
                       "tsne-2d-two":tsne_results[:,1],
                           "Cutoff":label})
    # Store results
    all_tsne = pd.concat([all_tsne, tsne_df], axis = 0)
    
# Plot gridplot (one colour per genre; there are 5 genres after dropping "latin")
g = sns.FacetGrid(all_tsne, col="Cutoff", hue = "y",
                col_wrap = 2, height = 6,
                palette=sns.color_palette("hls", 5),
               )
# Add plots
g.map(sns.scatterplot, "tsne-2d-one", "tsne-2d-two", alpha = 0.3)
# Add titles/legends
g.fig.suptitle("t-SNE Plots vs Number of Principal Components Included", y = 1)
g.add_legend()

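Ideally, we would see the genres become more separable at some number of principal components (for example, cutoff = 1000).

However, the t-SNE plots above show that no choice of the number of principal components in the PCA step makes the labels more separable.

📌 Autoencoder Dimensionality Reduction

There are several ways to reduce dimensionality. The code below offers three candidates, PCA, truncated SVD, and a Keras autoencoder, which you can switch between by changing the configuration.

The autoencoder is built by the autoencode function, which the original project keeps in its custom_functions.py file; it is reproduced in the code below.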

# General libraries
import pandas as pd
import numpy as np
# Modeling libraries
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# Neural network
from keras.layers import Dense, Input, LeakyReLU, BatchNormalization
from keras.callbacks import EarlyStopping
from keras import Model

# Define the autoencoder
def autoencode(lyric_tr, n_components):
    """Build, compile and fit an autoencoder for
    lyric data using Keras. Uses a batch normalised,
    undercomplete encoder with leaky ReLU activations.
    It will take a while to train.
    --------------------------------------------------
    lyric_tr = df of lyric training data
    n_components = int, number of output dimensions
    from encoder
    """
    n_inputs = lyric_tr.shape[1]
    # Define the encoder input
    visible = Input(shape=(n_inputs,))

    # Encoder block 1
    e = Dense(n_inputs*2)(visible)
    e = BatchNormalization()(e)
    e = LeakyReLU()(e)
    # Encoder block 2
    e = Dense(n_inputs)(e)
    e = BatchNormalization()(e)
    e = LeakyReLU()(e)
    bottleneck = Dense(n_components)(e)

    # Decoder block 1
    d = Dense(n_inputs)(bottleneck)
    d = BatchNormalization()(d)
    d = LeakyReLU()(d)
    # Decoder block 2
    d = Dense(n_inputs*2)(d)
    d = BatchNormalization()(d)
    d = LeakyReLU()(d)
    # Output layer
    output = Dense(n_inputs, activation='linear')(d)
    # Full autoencoder model
    model = Model(inputs=visible, outputs=output)

    # Compile
    model.compile(optimizer='adam', loss='mse')
    # Early-stopping callback
    callbacks = EarlyStopping(patience = 20, restore_best_weights = True)
    # Train the model to reconstruct its own input
    model.fit(lyric_tr, lyric_tr, epochs=200,
                        batch_size=16, verbose=1, validation_split=0.2,
             callbacks = callbacks)
    
    # For dimensionality reduction we only need the encoder part (it compresses the data)
    encoder = Model(inputs=visible, outputs=bottleneck)

    return encoder

# Preprocessing function: reduces the lyric features and encodes the label column
def pre_process(train: pd.DataFrame,
                test: pd.DataFrame,
                reduction_method: str = "pca",
                n_components: int = 400):
    # Split X and y
    y_train = train.playlist_genre
    y_test = test.playlist_genre
    X_train = train.drop(columns = "playlist_genre")
    X_test = test.drop(columns = "playlist_genre")
    
    # Encode the labels as integers
    label_encoder = LabelEncoder()
    label_train = label_encoder.fit_transform(y_train)
    label_test = label_encoder.transform(y_test)

    # Min-max scale the features (fit on the training set only)
    scaler = MinMaxScaler()
    X_norm_tr = scaler.fit_transform(X_train)
    X_norm_te = scaler.transform(X_test)

    # Rebuild the DataFrames
    X_norm_tr = pd.DataFrame(X_norm_tr, columns = X_train.columns)
    X_norm_te = pd.DataFrame(X_norm_te, columns = X_test.columns)

    # Treat mode and key as categorical
    X_norm_tr["audio_mode"] = X_train["audio_mode"].astype("category").reset_index(drop = True)
    X_norm_tr["audio_key"] = X_train["audio_key"].astype("category").reset_index(drop = True)
    X_norm_te["audio_mode"] = X_test["audio_mode"].astype("category").reset_index(drop = True)
    X_norm_te["audio_key"] = X_test["audio_key"].astype("category").reset_index(drop = True)
    
    # Lyric features
    lyric_tr = X_norm_tr.loc[:, "lyrics_aah":]
    lyric_te = X_norm_te.loc[:, "lyrics_aah":]

    # PCA
    if reduction_method == "pca":
        pca = PCA(n_components)
        # Fit on the training set
        reduced_tr = pd.DataFrame(pca.fit_transform(lyric_tr)).add_prefix("lyrics_pca_")
        # Transform (reduce) the test set
        reduced_te = pd.DataFrame(pca.transform(lyric_te)).add_prefix("lyrics_pca_")
    
    # Truncated SVD
    elif reduction_method == "svd":
        svd = TruncatedSVD(n_components)
        # Fit on the training set
        reduced_tr = pd.DataFrame(svd.fit_transform(lyric_tr)).add_prefix("lyrics_svd_")
        # Transform (reduce) the test set
        reduced_te = pd.DataFrame(svd.transform(lyric_te)).add_prefix("lyrics_svd_")
    
    # Autoencoder (note: training the neural network takes longer, so be patient)
    elif reduction_method == "keras":
        # Build and train the autoencoder
        encoder = autoencode(lyric_tr, n_components)
        
        # Reduce the data with the encoder part
        reduced_tr = pd.DataFrame(encoder.predict(lyric_tr)).add_prefix("lyrics_keras_")
        reduced_te = pd.DataFrame(encoder.predict(lyric_te)).add_prefix("lyrics_keras_")

    else:
        raise ValueError(f"Unknown reduction_method: {reduction_method}")
        
    # Combine the audio features with the reduced lyric features
    X_norm_tr = pd.concat([X_norm_tr.loc[:, :"audio_duration_ms"],
                          reduced_tr
                          ], axis = 1)

    X_norm_te = pd.concat([X_norm_te.loc[:, :"audio_duration_ms"],
                           reduced_te
                           ], axis = 1)


    return X_norm_tr, label_train, X_norm_te, label_test, label_encoder


# Stratified train/test split
train_raw, test_raw = train_test_split(data, test_size = 0.2,
                                       shuffle = True, random_state = 42, # random, reproducible split
                                       stratify = data.playlist_genre)
# Target number of dimensions after reduction
n_components = 500
# Reduction method, one of: "pca", "svd", "keras"
reduction_method = "pca"

# Run the full preprocessing
X_train, y_train, X_test, y_test, label_encoder = pre_process(train_raw, test_raw,
                                                      reduction_method = reduction_method,
                                                     n_components = n_components)

After these steps the data has been scaled, encoded, and reduced, and we can use it for modeling.
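One detail of pre_process worth isolating is that the scaler and the reducer are fitted on the training rows only and then merely applied to the test rows, so no information from the test set leaks into the features. A minimal sketch of that pattern follows, using synthetic count data as a stand-in for the lyric matrix (the sizes and the Poisson draw are assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

# Synthetic non-negative "word count" matrices (invented for illustration)
rng = np.random.default_rng(42)
train = rng.poisson(1.0, size=(80, 30)).astype(float)
test = rng.poisson(1.0, size=(20, 30)).astype(float)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Same pattern for the reducer: fit on train, transform test
shapes = {}
for reducer in (PCA(n_components=5), TruncatedSVD(n_components=5, random_state=0)):
    reduced_train = reducer.fit_transform(train_scaled)
    reduced_test = reducer.transform(test_scaled)
    shapes[type(reducer).__name__] = (reduced_train.shape, reduced_test.shape)
    print(type(reducer).__name__, reduced_train.shape, reduced_test.shape)
```

The two reducers differ mainly in that PCA centers the data before decomposing it, while TruncatedSVD does not, which is why the latter is often preferred for sparse matrices.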

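💡 Modeling and Hyperparameter Optimization

📌 Building the Model

Before modeling, we first choose an evaluation metric to measure model performance and to guide further optimization. Because our final label (genre/class) is slightly imbalanced, the macro F1 score (macro f1-score) is a reasonable choice, since it weights every class equally. Below we define this metric and fix some of the LightGBM parameters.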
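To see what "weights every class equally" means, the sketch below computes macro F1 by hand on a few invented labels and compares it with scikit-learn's f1_score. per_class_f1 is a hypothetical helper written for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for three classes; class 2 is rare
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 2])


def per_class_f1(y_true, y_pred, label):
    """F1 for one class from its true positives, false positives and false negatives."""
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


scores = [per_class_f1(y_true, y_pred, c) for c in np.unique(y_true)]
macro_manual = float(np.mean(scores))  # unweighted mean over classes
macro_sklearn = f1_score(y_true, y_pred, average="macro")
micro_sklearn = f1_score(y_true, y_pred, average="micro")
print(scores, macro_manual, macro_sklearn, micro_sklearn)
```

The rare class contributes a full third of the macro score even though it has a single sample, whereas the micro average pools all samples and is dominated by the larger classes.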

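from sklearn.metrics import f1_score

# Custom evaluation metric (macro F1)
def lgb_f1_score(preds, data):
    labels = data.get_label()
    # Older LightGBM versions pass a flat, class-major array here;
    # newer ones pass an array of shape (n_samples, n_classes)
    if preds.ndim == 1:
        preds = preds.reshape(5, -1).T
    preds = preds.argmax(axis = 1)
    f_score = f1_score(labels , preds,  average = 'macro')
    # Returns (metric name, value, is_higher_better)
    return 'f1_score', f_score, True

# Parameters held fixed during tuning
fixed_params = {
        'objective': 'multiclass',
        'metric': "None",   # disable built-in metrics so that only our custom F1 score is used
        'num_class': 5,
        'verbosity': -1,
}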
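The reshape(5, -1).T step in lgb_f1_score is easy to get wrong, so here is a small NumPy-only sketch of what it does. It assumes the flat, class-major layout (all scores for class 0 first, then class 1, and so on) that the function expects for one-dimensional input; the probabilities are invented, and three classes are used instead of five to keep the arrays readable.

```python
import numpy as np

n_classes, n_samples = 3, 4

# Class probabilities in the usual (n_samples, n_classes) layout (invented values)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.3, 0.2]])

# Assumed class-major flat layout: all scores for class 0, then class 1, ...
flat = probs.T.ravel()

# reshape(n_classes, -1) rebuilds one row per class; .T puts samples back on rows
restored = flat.reshape(n_classes, -1).T
predicted = restored.argmax(axis=1)
print(predicted)

# Reshaping straight to (n_samples, n_classes) would scramble the scores
scrambled = flat.reshape(n_samples, n_classes)
print(np.array_equal(scrambled, probs))
```

If the array is already two-dimensional with one row per sample, applying the reshape would scramble it in the same way, so it is worth checking the shape of preds in whichever LightGBM version you run.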

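LightGBM has a large number of tunable hyperparameters, and they strongly affect the final results.

For a detailed explanation of LightGBM's hyperparameters, see ShowMeAI's article:

📘Machine Learning in Practice (5) | LightGBM Modeling and Application Explained

Next we tune LightGBM's hyperparameters with the Optuna library. We define the hyperparameter search space in param, and Optuna searches over it and selects the hyperparameters.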


# Modeling
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
from optuna.integration import LightGBMPruningCallback

# Objective function
def objective(trial, X, y):    
    # Candidate hyperparameters
    # (min_child_samples is an alias of min_data_in_leaf, so only min_data_in_leaf is tuned)
    param = {**fixed_params,
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 2, 3000, step = 20),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.2, 0.99, step = 0.05),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.2, 0.99, step = 0.05),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        "n_estimators": trial.suggest_int("n_estimators", 200, 5000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 5, 2000, step=5),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 10),
        "max_bin": trial.suggest_int("max_bin", 200, 300),
    }
    
    # Stratified cross-validation
    cv = StratifiedKFold(n_splits = 5, shuffle = True)
    # Scores for the 5 folds
    cv_scores = np.empty(5)
    
    # Split into K folds, each serving in turn as the validation set
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        # Split the data
        X_train_cv, X_test_cv = X.iloc[train_idx], X.iloc[test_idx]
        y_train_cv, y_test_cv = y[train_idx], y[test_idx]

        # Convert to LightGBM Dataset format
        train_data = lgb.Dataset(X_train_cv, label = y_train_cv, categorical_feature="auto")
        val_data = lgb.Dataset(X_test_cv, label = y_test_cv,  categorical_feature="auto",
                              reference = train_data)
        
        # Callbacks
        callbacks = [
            # Let Optuna prune unpromising trials based on the custom metric
            LightGBMPruningCallback(trial, metric = "f1_score"),
                     # Log progress periodically
                    lgb.log_evaluation(period = 100),
                     # Early stopping to prevent overfitting
                    lgb.early_stopping(50)]

        # Train the model
        model = lgb.train(params = param,  train_set = train_data,
                          valid_sets = val_data,   
                          callbacks = callbacks,
                          feval = lgb_f1_score # custom evaluation metric
                         )
        
        # Predict
        preds = np.argmax(model.predict(X_test_cv), axis = 1)
        # Compute the macro F1 score
        cv_scores[idx] = f1_score(y_test_cv, preds, average = "macro")

    return np.mean(cv_scores)
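The objective relies on StratifiedKFold to keep the genre proportions the same in every fold. A minimal sketch on invented integer labels shows the effect; the class sizes are assumptions chosen to divide evenly into five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented, imbalanced integer labels: 50 / 30 / 20 samples per class
y = np.array([0] * 50 + [1] * 30 + [2] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant to how the folds are built

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Count how many samples of each class land in this validation fold
    counts = np.bincount(y[val_idx], minlength=3)
    fold_counts.append(counts.tolist())
    print(f"fold {fold}: validation class counts = {counts.tolist()}")
```

The sketch fixes random_state so the folds are reproducible; the objective above leaves it unset, so each trial sees a different random partition, which adds some noise when comparing trials.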

📌 Hyperparameter Optimization

With the objective function defined, we can now use Optuna to tune the model's hyperparameters.

# Hyperparameter optimization
import optuna

# Number of Optuna trials
n_trials = 200
# Create an Optuna study to search for and tune the hyperparameters
study = optuna.create_study(direction = "maximize", # maximize the cross-validated F1 score
                            study_name = "LGBM Classifier",
                           pruner=optuna.pruners.HyperbandPruner())
func = lambda trial: objective(trial, X_train, y_train)
study.optimize(func, n_trials = n_trials)

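We can then use 📘Optuna's visualization module to inspect how different hyperparameter combinations performed. For example, plot_param_importances(study) shows which hyperparameters matter most for model performance and for the optimization.

from optuna.visualization import plot_param_importances, plot_parallel_coordinate, plot_optimization_history

plot_param_importances(study)

plot_parallel_coordinate(study) shows which hyperparameter combinations/ranges were tried and which of them lead to high objective values (good performance).

plot_parallel_coordinate(study)

plot_optimization_history shows how the objective value evolved over the trials.

plot_optimization_history(study)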

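Once Optuna has finished tuning:

The best hyperparameters are stored in the study.best_params attribute. We define the model's final parameters as params = {**fixed_params, **study.best_params}, as shown in the code that follows.
You can also narrow the search space/hyperparameter ranges and tune again more precisely.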
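The {**a, **b} merge used for the final parameters builds a new dictionary in which later entries override earlier ones. A short sketch follows; the values in best_params are placeholders made up for illustration, not tuned results from this article.

```python
# Placeholder values, not tuned results from the article
fixed_params = {"objective": "multiclass", "num_class": 5, "verbosity": -1}
best_params = {"num_leaves": 62, "learning_rate": 0.05, "verbosity": 0}

# Unpack both dicts into a new one; on duplicate keys the right-most dict wins
params = {**fixed_params, **best_params}
print(params)

# Neither input dict is modified
print(fixed_params["verbosity"])
```

Putting study.best_params last therefore means that a tuned value would silently replace a fixed one if the two dictionaries ever shared a key.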

# Cross-validate the model with the best hyperparameters
cv = StratifiedKFold(n_splits = 5, shuffle = True)
# Scores for the 5 folds
cv_scores = np.empty(5)

# Split into K folds, each serving in turn as the validation set
for idx, (train_idx, test_idx) in enumerate(cv.split(X_train, y_train)):
    # Split the data
    X_train_cv, X_test_cv = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_train_cv, y_test_cv = y_train[train_idx], y_train[test_idx]

    # Convert to LightGBM Dataset format
    train_data = lgb.Dataset(X_train_cv, label = y_train_cv, categorical_feature="auto")
    val_data = lgb.Dataset(X_test_cv, label = y_test_cv,  categorical_feature="auto",
                          reference = train_data)
    
    # Callbacks (no pruning callback here: there is no Optuna trial outside the study)
    callbacks = [
                 # Log progress periodically
                lgb.log_evaluation(period = 100),
                 # Early stopping to prevent overfitting
                lgb.early_stopping(50)]

    # Train the model
    model = lgb.train(params = {**fixed_params, **study.best_params},  train_set = train_data,
                      valid_sets = val_data,   
                      callbacks = callbacks,
                      feval = lgb_f1_score # custom evaluation metric
                     )
    
    # Predict
    preds = np.argmax(model.predict(X_test_cv), axis = 1)
    # Compute the macro F1 score
    cv_scores[idx] = f1_score(y_test_cv, preds, average = "macro")

💡 Final Evaluation

With the steps above we have our final model. Let's evaluate it!


# Note: `model` is the model trained on the last cross-validation fold.
# Despite the *_error names, the values below are macro F1 scores (higher is better).

# Predict on and evaluate the training set
train_preds = model.predict(X_train)
train_predictions = np.argmax(train_preds, axis = 1)
train_error = f1_score(y_train, train_predictions, average = "macro")

# Cross-validation result
cv_error = np.mean(cv_scores)

# Evaluate on the test set
test_preds = model.predict(X_test)
test_predictions = np.argmax(test_preds, axis = 1)
test_error = f1_score(y_test, test_predictions, average = "macro")

# Store the evaluation results
results = pd.DataFrame({"n_components": n_components,
                        "reduction_method": reduction_method,
                        "train_error": train_error,
                        "cv_error": cv_error,
                        "test_error": test_error,
                        "n_trials": n_trials
                       }, index = [0])
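The predictions above are integer class indices. The label_encoder returned by pre_process can map them back to genre names with inverse_transform; the sketch below shows the idea on a small, self-contained example in which the genre list and probability rows are invented.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Genre names used only for illustration
genres = ["rock", "pop", "rap", "pop", "edm", "rock"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(genres)  # classes are sorted alphabetically

# Invented class-probability rows, one column per encoded class
probs = np.array([[0.1, 0.2, 0.1, 0.6],
                  [0.7, 0.1, 0.1, 0.1]])
predicted_ids = probs.argmax(axis=1)
predicted_genres = label_encoder.inverse_transform(predicted_ids)

print(list(label_encoder.classes_))
print(list(predicted_genres))
```

In the article's pipeline the equivalent call would be label_encoder.inverse_transform(test_predictions), which is handy for per-genre error analysis.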

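We can experiment with and compare different reduction methods and target dimensions, then retune and check the results. As the figure below shows, in our experiments so far PCA with 400 dimensions produced the best model, with a macro f1-score of 66.48%.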
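One simple way to organize such a comparison is to collect the single-row results frame from each run and rank the runs. The sketch below assumes frames shaped like results above; every score in it is an invented placeholder, not a result from this article.

```python
import pandas as pd

# One single-row frame per run, shaped like `results` above.
# All scores are invented placeholders, NOT results from the article.
runs = [
    pd.DataFrame({"n_components": 100, "reduction_method": "pca",
                  "cv_error": 0.50, "test_error": 0.49}, index=[0]),
    pd.DataFrame({"n_components": 100, "reduction_method": "svd",
                  "cv_error": 0.52, "test_error": 0.51}, index=[0]),
    pd.DataFrame({"n_components": 200, "reduction_method": "pca",
                  "cv_error": 0.55, "test_error": 0.53}, index=[0]),
]

# Stack the runs and rank them by cross-validated macro F1 (higher is better)
comparison = (pd.concat(runs, ignore_index=True)
                .sort_values("cv_error", ascending=False)
                .reset_index(drop=True))
best = comparison.iloc[0]
print(comparison)
```

Ranking by the cross-validation score rather than the test score keeps the test set as an untouched final check.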

💡 Summary

In this article, ShowMeAI showed how to classify songs by genre from song information and lyric text, covering text processing, feature engineering, modeling, and hyperparameter optimization. You can use the whole pipeline as a template for other tasks.

References

📘 Illustrated Data Analysis: From Beginner to Expert tutorial series: //www.showmeai.tech/tutorials/3
📘 Data Science Tool Cheat Sheets | Pandas Cheat Sheet: //www.showmeai.tech/article-detail/101
📘 Illustrated Machine Learning Algorithms | Dimensionality Reduction Algorithms Explained: //www.showmeai.tech/article-detail/198
📘 Illustrated Machine Learning Algorithms | LightGBM Model Explained: //www.showmeai.tech/article-detail/195
📘 Machine Learning in Practice | LightGBM Modeling and Application Explained: //www.showmeai.tech/article-detail/205
📘 Optuna's visualization module
📘 Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019, July). Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 2623–2631).
📘 Autoencoder Feature Extractions
📘 Kaggler's Guide to LightGBM Hyperparameter Tuning with Optuna in 2021
📘 You Are Missing Out on LightGBM. It Crushes XGBoost in Every Aspect
