[Semi-final Front-Row Sharing (Part 2)] Keep This Ace Optimization Guide Handy and Climb the Leaderboard Without the Stress
- December 16, 2020
- AI
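The semi-final round of the 2020 Tencent Advertising Algorithm Competition has concluded, and the final defense will take place at 14:00 on August 3 at the Tencent Binhai Building in Shenzhen.
While the public track was fiercely contested, Tencent also teamed up with 碼客 to open an internal track for employees. Team Daxiong (大雄), which took second place on the internal semi-final leaderboard, was invited to this front-row sharing session to explain how they approached the problem. Their strategy throughout the competition shows strong time management and plenty of hands-on experience. How do you lighten the training load without giving up optimization quality? Here is what they have to say.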
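01 Problem Overview
This year's Tencent Advertising Algorithm Competition is about user profiling: predicting a user's age and gender from their ad-click behavior and the attributes of the ads they clicked.
02 Data Fields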
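– time: day-level timestamp; nunique: 91
– user_id: randomly assigned number from 1 to N; nunique: 4 million
– creative_id: id of the ad creative the user clicked; nunique: 2618159
– click_times: number of times the user clicked that creative on that day; nunique: 54
– ad_id: id of the ad the creative belongs to (one ad may have several displayable creatives); nunique: 2379475
– product_id: id of the product promoted by the ad; nunique: 34111
– product_category: category id of the product promoted by the ad; nunique: 18
– advertiser_id: id of the advertiser; nunique: 52861
– industry: id of the advertiser's industry; nunique: 326
– age: user age bracket [1-10]
– gender: user gender [1, 2]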
03 Model Input
The final solution used only five id sequences as model input:
'creative_id'
'ad_id'
'advertiser_id'
'product_id'
'industry'
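As an illustration of what an "id sequence" looks like, the sketch below groups click records by user and orders each user's clicks by `time`, giving one sequence per id column. The record layout, the toy data, and the helper name `build_id_sequences` are assumptions for illustration; the article does not show how the sequences were actually built.

```python
from collections import defaultdict

ID_COLS = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']


def build_id_sequences(clicks, id_cols=ID_COLS):
    """Group click records by user and return {user_id: {col: [ids in time order]}}.

    `clicks` is an iterable of dicts with 'user_id', 'time' and the id columns.
    """
    by_user = defaultdict(list)
    for row in clicks:
        by_user[row['user_id']].append(row)
    sequences = {}
    for user_id, rows in by_user.items():
        rows.sort(key=lambda r: r['time'])  # stable sort keeps same-day clicks in input order
        sequences[user_id] = {col: [r[col] for r in rows] for col in id_cols}
    return sequences


# Toy data, made up for illustration
clicks = [
    {'user_id': 1, 'time': 3, 'creative_id': 30, 'ad_id': 300, 'advertiser_id': 7, 'product_id': 5, 'industry': 2},
    {'user_id': 1, 'time': 1, 'creative_id': 10, 'ad_id': 100, 'advertiser_id': 7, 'product_id': 5, 'industry': 2},
    {'user_id': 2, 'time': 2, 'creative_id': 20, 'ad_id': 200, 'advertiser_id': 8, 'product_id': 6, 'industry': 3},
]
seqs = build_id_sequences(clicks)
print(seqs[1]['creative_id'])  # [10, 30]
```

On the real data a pandas `groupby` would be the more natural tool; plain Python is used here only to keep the example dependency-free.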
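Because I could only compete outside working hours, I gave up on feature engineering and simply left jobs running every day while I tuned the inputs and the architecture. The final solution is not complicated. As long as you keep the time cost of trial and error under control, I believe anyone can reach a good result. Below I describe what I learned while tuning each of these two parts.
The model input directly determines the model's ceiling. After trying a number of schemes, I found that three things affect the input most directly: the selection of effective tokens, the word2vec embedding-generation stage, and shuffling of the input. Token selection decides both whether training is effective and how much memory the embedding matrix consumes. Since the organizers did not provide TI-ONE, it went a long way toward easing the memory shortage. Here, ids that appear only once are treated as uninformative and are merged, together with ids that are not shared by the training and test sets, into a single id. This greatly reduces the training load and had no effect on training quality.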
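```python
# Assumes `train`/`test` are DataFrames, `col` is the id column being processed,
# and `val_cnt` holds the value counts of `col` (assumed: over train and test combined).
train_ids = set(train[col].unique())
test_ids = set(test[col].unique())
differ = train_ids.symmetric_difference(test_ids)  # ids present in only one of the two sets
common = train_ids & test_ids                      # ids present in both sets
id_map = {}
for v in val_cnt[val_cnt == 1].index:  # ids that occur only once collapse into id 0
    id_map[v] = 0
for v in differ:                       # ids not shared by train and test also collapse into id 0
    id_map[v] = 0
for i, v in enumerate(common):         # shared ids are numbered consecutively from 1
    id_map[v] = i + 1
```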
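To put the memory argument in numbers, here is a back-of-the-envelope estimate of the embedding matrix size. It assumes float32 storage (4 bytes per value), which the article does not state; the vocabulary size is the `nunique` figure listed for `creative_id`, and the helper name is hypothetical.

```python
def embedding_matrix_bytes(vocab_size, dim, bytes_per_value=4):
    """Size in bytes of a dense vocab_size x dim matrix (float32 by assumption)."""
    return vocab_size * dim * bytes_per_value


full = embedding_matrix_bytes(2618159, 256)
print(round(full / 1024 ** 3, 2), "GiB")  # 2.5 GiB
```

Under the same assumption, every id folded into the shared id 0 removes one row, which is 1 KiB at 256 dimensions.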
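The final word2vec training used the skip-gram variant, with key parameters min_count=1, size=256, and window=10. The size and window values do not generalize well, so it is worth running a few settings and comparing.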
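```python
import numpy as np
from gensim import models
from tqdm import tqdm

# gensim 3.x API; gensim 4 renamed size -> vector_size, iter -> epochs,
# and wv.index2word -> wv.index_to_key.
# `list_d` is the list of per-user id sequences (as string tokens) built elsewhere.
model = models.Word2Vec(list_d, sg=1, min_count=1, size=256, window=10, workers=48, iter=10)

We = []
if '0' in model.wv:
    # id 0 is in the vocabulary: row i holds the vector for id i
    for i in tqdm(range(len(model.wv.index2word))):
        We.append(model.wv[str(i)].reshape((1, -1)))
else:
    # id 0 never occurs: reserve an all-zero row for it (width must match size=256)
    We.append(np.zeros((1, 256)))
    for i in tqdm(range(len(model.wv.index2word))):
        We.append(model.wv[str(i + 1)].reshape((1, -1)))
We = np.vstack(We)
```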
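The key property of the matrix built above is that row `i` holds the vector for id `i`, so remapped ids can index it directly. The dependency-free sketch below shows that alignment on toy vectors. It is a variation rather than the author's code: it zero-fills any id that has no vector, not just id 0, and the function name and data are made up for illustration.

```python
def build_embedding_rows(vectors, dim):
    """Return rows such that rows[i] is the vector for token str(i).

    `vectors` maps string tokens to lists of floats (a stand-in for trained
    word2vec vectors). Ids without a vector get an all-zero row.
    """
    max_id = max(int(tok) for tok in vectors)
    rows = []
    for i in range(max_id + 1):
        rows.append(list(vectors.get(str(i), [0.0] * dim)))
    return rows


toy = {'1': [0.1, 0.2], '2': [0.3, 0.4]}  # id 0 never occurred in the corpus
rows = build_embedding_rows(toy, dim=2)
print(rows)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
```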
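Several operations are available when building the input: forward order, reverse order, random shuffling, and repeating each id according to its click_times. After repeating by click_times, sequence_length should be increased accordingly; taking the 95th percentile of the sequence lengths is sufficient.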
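The 95% rule can be sketched as follows: expand each sequence by its click counts, then pick a length that covers 95% of users. This is a plain-Python illustration on made-up data; the nearest-rank percentile definition and the helper names are assumptions, not the author's code.

```python
import math


def expand_by_clicks(ids, clicks):
    """Repeat each id as many times as it was clicked."""
    out = []
    for v, c in zip(ids, clicks):
        out.extend([v] * c)
    return out


def length_at_percentile(lengths, pct=95):
    """Nearest-rank percentile: smallest length covering at least pct% of the sequences."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Toy data: 20 users whose expanded sequences have lengths 1..20
expanded = [expand_by_clicks([7] * n, [1] * n) for n in range(1, 21)]
seq_len = length_at_percentile([len(s) for s in expanded])
print(seq_len)  # 19
```

The resulting length plays the role of `config.sequenceLength` in the author's listing.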
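```python
import numpy as np
import pandas as pd
from tqdm import tqdm

# `config` (sequenceLength, embeddingSize) and `click_times` (per-user click
# counts aligned with each id list) are assumed to be defined elsewhere.
# product_category is prepared here too, although the final model uses only the other five columns.
for col in tqdm(['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry', 'product_category']):
    list_d = pd.read_pickle('./idlist/{}_list.pkl'.format(col))
    We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(col))
    We = np.vstack([We, np.zeros(config.embeddingSize)])  # extra all-zero row used for padding
    list_d = list(list_d)
    for i in range(len(list_d)):
        ret = []
        for j in range(len(list_d[i])):
            ret += [list_d[i][j]] * click_times[i][j]  # repeat each id click_times times
        list_d[i] = ret
        if len(list_d[i]) > config.sequenceLength:
            list_d[i] = list_d[i][:config.sequenceLength]  # truncate long sequences
        else:
            # pad short sequences with the index of the zero row
            list_d[i] += [len(We) - 1] * (config.sequenceLength - len(list_d[i]))
    list_d = np.array(list_d)
    list_d = list_d.astype(np.int32)  # reduce memory usage
```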
```python
import numpy as np
from keras.utils import Sequence


class DataSequence(Sequence):
    def __init__(self, xs, y, batch_size=128, shuffle=True):
        self.xs = xs
        self.y = y
        self.batch_size = batch_size
        self.size = xs[0].shape[0]
        self.shuffle = shuffle
        if self.shuffle:
            # Shuffle every input array and the labels in place with the same
            # RNG state so that samples stay aligned across arrays.
            state = np.random.get_state()
            for x in self.xs:
                np.random.set_state(state)
                np.random.shuffle(x)
            np.random.set_state(state)
            np.random.shuffle(self.y)

    def __len__(self):
        return int(np.ceil(self.size / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_idx = np.arange(idx * self.batch_size, min((idx + 1) * self.batch_size, self.size))
        batch_xs = [x[batch_idx] for x in self.xs]
        batch_y = self.y[batch_idx]
        if self.shuffle:
            # Copy first so the stored sequences keep their original order.
            x = [bx.copy() for bx in batch_xs]
            for i in range(len(x[0])):
                # With probability 0.8, shuffle the positions within sample i,
                # applying the same permutation to all id sequences.
                if np.random.rand() < 0.8:
                    state = np.random.get_state()
                    for j in range(len(x)):
                        np.random.set_state(state)
                        np.random.shuffle(x[j][i])
            batch_xs = x
        return batch_xs, batch_y
```
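The `DataSequence` class keeps the id columns aligned by replaying the same random state for each array. Another way to get the same guarantee, shown here as a plain-Python sketch rather than the author's method, is to draw one permutation and apply it to every column. The function name and toy data are illustrative.

```python
import random


def shuffle_aligned(columns, rng=random):
    """Apply one random permutation to every sequence in `columns` (all the same length)."""
    order = list(range(len(columns[0])))
    rng.shuffle(order)
    return [[col[k] for k in order] for col in columns]


creative = [11, 12, 13, 14]
ad = [21, 22, 23, 24]  # ad[k] belongs to creative[k]
new_creative, new_ad = shuffle_aligned([creative, ad], random.Random(0))
# Pairs stay together whatever order was drawn
print(all(a - c == 10 for c, a in zip(new_creative, new_ad)))  # True
```

Because the inputs are not modified, the original order remains available for runs that use forward or reverse order instead.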
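04 Model Architecture
On the architecture side I tried LSTM and CNN_Inception structures. The CNN eventually reached about 1.47 as well, and a transformer combined with LSTM also worked reasonably, but I never managed to tune it past the pure LSTM. You can of course use a transformer on its own, but it did not work well for me; if you are interested, try tuning the open-source implementations from CyberZHG or Hugging Face. My impression is that on this dataset fewer heads beat more heads, and more layers beat fewer layers. One option is to keep only Q and K, drop the dense layers, and build a stripped-down multi-head attention. In the end I implemented the model framework in both Keras and Torch (competing solo, I had to find some way to prepare for the final ensemble). The model structures are as follows: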
```python
## LSTM, Keras version
# Keras 2.x imports; CuDNNLSTM requires a GPU.
import gc
import numpy as np
from keras.layers import (Input, Embedding, Concatenate, SpatialDropout1D, Bidirectional,
                          CuDNNLSTM, GlobalMaxPooling1D, Dropout, Dense)
from keras.models import Model


def LSTM(config, n_cls=10):
    cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
    n_in = len(cols)
    inputs = []
    outputs = []
    # Frozen embedding front end: one pretrained word2vec lookup per id column.
    for i in range(n_in):
        We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(cols[i]))
        We = np.vstack([We, np.zeros(config.embeddingSize)])  # zero row for the padding index
        inp = Input(shape=(config.sequenceLength,), dtype="int32")
        x = Embedding(We.shape[0], We.shape[1], weights=[We], trainable=False)(inp)
        inputs.append(inp)
        outputs.append(x)
        del We
        gc.collect()
    embedding_model = Model(inputs, outputs)

    # Trainable part: two stacked bidirectional LSTMs over the concatenated embeddings.
    inputs = []
    for i in range(n_in):
        inp = Input(shape=(config.sequenceLength, config.embeddingSize,))
        inputs.append(inp)
    all_input = Concatenate()(inputs)
    all_input = SpatialDropout1D(0.2)(all_input)
    lstm1 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(all_input)
    lstm2 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(lstm1)
    # Max-pool over time on both LSTM layers and concatenate the results.
    pool_1 = GlobalMaxPooling1D()(lstm1)
    pool_2 = GlobalMaxPooling1D()(lstm2)
    pool = Concatenate()([pool_1, pool_2])
    pool = Dropout(0.2)(pool)
    outputs = Dense(n_cls, activation='softmax')(pool)
    lstm_model = Model(inputs, outputs)

    # Full model: id inputs -> frozen embeddings -> LSTM classifier.
    model = Model(embedding_model.inputs, lstm_model(embedding_model.outputs))
    return model, lstm_model
```
```python
## LSTM, Torch version
import gc
import numpy as np
import torch as t
from torch import nn


class LSTM(nn.Module):
    def __init__(self, n_cls=10):
        super(LSTM, self).__init__()
        emb_outputs = []
        cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
        n_in = len(cols)
        # Two frozen word2vec embedding sets per id column: 256-d first, then 128-d.
        for i in range(n_in):
            We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(256)])  # zero row for the padding index
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
        for i in range(n_in):
            We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(128)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
        del We
        gc.collect()
        self.encoders = nn.ModuleList(emb_outputs)
        self.emb_drop = nn.Dropout(p=0.2)
        self.lstm = nn.LSTM(input_size=(256 + 128) * 5, hidden_size=384, num_layers=2, bias=True, batch_first=True,
                            dropout=0.2, bidirectional=True)
        self.max_pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.fc = nn.Sequential(nn.Linear(384, n_cls))
        self.fc_drop = nn.Dropout(p=0.2)

    def forward(self, xs):
        # xs: list of 5 LongTensors of shape (batch, seq_len); each column is embedded twice.
        inp = [self.encoders[i](x) for i, x in enumerate(xs)] + [self.encoders[i + 5](x) for i, x in enumerate(xs)]
        x = t.cat(inp, 2)        # (batch, seq_len, 1920)
        x = self.emb_drop(x)
        x = self.lstm(x)[0]      # (batch, seq_len, 768)
        x = self.max_pool(x)     # pools adjacent feature pairs -> (batch, seq_len, 384)
        x = t.max(x, dim=1)[0]   # max over time -> (batch, 384)
        x = self.fc_drop(x)
        logits = self.fc(x)
        return logits
```
```python
## CNN_Inception, Torch version
from collections import OrderedDict

import torch as t
from torch import nn


class Inception(nn.Module):
    def __init__(self, cin, co, relu=True, norm=True):
        super(Inception, self).__init__()
        assert co % 4 == 0
        cos = [co // 4] * 4  # each of the four branches produces a quarter of the output channels
        self.activa = nn.Sequential()
        if norm:
            self.activa.add_module('norm', nn.BatchNorm1d(co))
        if relu:
            self.activa.add_module('relu', nn.ReLU(True))
        self.branch1 = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin, cos[0], 1, stride=1)),
        ]))
        self.branch2 = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin, cos[1], 1)),
            ('norm1', nn.BatchNorm1d(cos[1])),
            ('relu1', nn.ReLU(inplace=True)),
            ('conv3', nn.Conv1d(cos[1], cos[1], 3, stride=1, padding=1)),
        ]))
        self.branch3 = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv1d(cin, cos[2], 3, padding=1)),
            ('norm1', nn.BatchNorm1d(cos[2])),
            ('relu1', nn.ReLU(inplace=True)),
            ('conv3', nn.Conv1d(cos[2], cos[2], 5, stride=1, padding=2)),
        ]))
        self.branch4 = nn.Sequential(OrderedDict([
            # ('pool', nn.MaxPool1d(2)),
            ('conv3', nn.Conv1d(cin, cos[3], 3, stride=1, padding=1)),
        ]))

    def forward(self, x):
        # All branches preserve the sequence length, so they can be concatenated on channels.
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        result = self.activa(t.cat((branch1, branch2, branch3, branch4), 1))
        return result
```
```python
import gc
import numpy as np
import torch as t
from torch import nn


class CNN(nn.Module):
    def __init__(self, n_cls=10):
        super(CNN, self).__init__()
        emb_outputs = []
        cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
        n_in = len(cols)
        # Two frozen word2vec embedding sets per id column: 256-d first, then 128-d.
        for i in range(n_in):
            We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(256)])  # zero row for the padding index
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
        for i in range(n_in):
            We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(128)])
            embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1], padding_idx=len(We) - 1,
                                 _weight=t.FloatTensor(We))
            for p in embed.parameters():
                p.requires_grad = False
            emb_outputs.append(embed)
        del We
        gc.collect()
        self.encoders = nn.ModuleList(emb_outputs)
        self.emb_drop = nn.Dropout(p=0.2)
        self.embed_conv = nn.Sequential(
            Inception(1920, 1024),  # (batch, 1920, seq_len) -> (batch, 1024, seq_len)
            Inception(1024, 1024),
            # nn.MaxPool1d(opt.title_seq_len)
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 1024),  # input is the 1024-d max-pooled feature from forward()
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.2),
            nn.Linear(1024, n_cls)
        )

    def forward(self, xs):
        # xs: list of 5 LongTensors of shape (batch, seq_len); each column is embedded twice.
        inp = [self.encoders[i](x) for i, x in enumerate(xs)] + [self.encoders[i + 5](x) for i, x in enumerate(xs)]
        x = t.cat(inp, 2)                          # (batch, seq_len, 1920)
        x = self.emb_drop(x)
        x = self.embed_conv(x.permute(0, 2, 1))    # (batch, 1024, seq_len)
        x = t.max(x.permute(0, 2, 1), dim=1)[0]    # max over time -> (batch, 1024)
        logits = self.fc(x)
        return logits
```
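The layer sizes in the Torch LSTM can be checked with simple arithmetic. The sketch below only tracks tensor shapes in plain Python, using the numbers from the listing; `batch` and `seq_len` are arbitrary illustrative values and the function name is hypothetical.

```python
def torch_lstm_shapes(batch, seq_len, n_cols=5, dims=(256, 128), hidden=384, n_cls=10):
    """Trace tensor shapes through the Torch LSTM model described above."""
    emb = (batch, seq_len, n_cols * sum(dims))    # concatenated embeddings
    lstm = (batch, seq_len, 2 * hidden)           # bidirectional doubles the width
    pooled_feat = (batch, seq_len, lstm[2] // 2)  # MaxPool1d(2, 2) halves the last axis
    pooled_time = (batch, pooled_feat[2])         # max over the time axis
    logits = (batch, n_cls)
    return emb, lstm, pooled_feat, pooled_time, logits


for shape in torch_lstm_shapes(batch=32, seq_len=100):
    print(shape)
```

The 384-wide pooled vector is what `nn.Linear(384, n_cls)` expects, and 1920 is also the input width of the first `Inception` block in the CNN.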
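05 Results
The best single model scored about 1.475x on semi-final leaderboard A. After a round of ensembling, the score on leaderboard B reached 1.479952, just short of 1.48. That put me second on the internal leaderboard and fourteenth on the external one, which is a bit of a pity.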
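The article does not say how the ensembling was done. One common and simple option, shown here purely as an assumption, is to average the class probabilities predicted by several models and take the argmax. All names and numbers below are illustrative.

```python
def average_probs(model_probs):
    """Average per-class probabilities across models.

    `model_probs[m][n][k]` is model m's probability of class k for sample n.
    """
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(model_probs[m][n][k] for m in range(n_models)) / n_models for k in range(n_classes)]
        for n in range(n_samples)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


keras_probs = [[0.6, 0.4], [0.45, 0.55]]   # made-up outputs of one model
torch_probs = [[0.7, 0.3], [0.75, 0.25]]   # made-up outputs of another model
blended = average_probs([keras_probs, torch_probs])
print([argmax(p) for p in blended])  # [0, 0]
```

In the second sample the two models disagree, and the more confident one decides the blended prediction.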
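Thanks for sharing. Team Daxiong's efficient style is truly impressive. Every team has its own strengths, and we look forward to seeing the finalists in action!
The finals of the Tencent Advertising Algorithm Competition open at 14:00 on August 3, a showdown among the top algorithm teams. To watch the live stream, reserve a spot through the registration page.
Contestants are also welcome to upload their résumés on the official website's "Personal Information" page. Now is the time to join Tencent!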