What does the attention layer seen on Kaggle actually implement?
- June 22, 2021
- AI
LSTM + Attention Baseline [0.672 LB]
//www.kaggle.com/braquino/5-fold-lstm-attention-fully-commented-0-694
In Kaggle kernels you often see an attention implementation like this:
# imports assumed by this snippet (old standalone Keras / TF 1.x, the era of CuDNNLSTM)
from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.engine.topology import Layer


class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # expects input of shape (batch, steps, features)
        assert len(input_shape) == 3

        # a single weight vector of length `features`
        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            # one bias per time step
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # the mask is consumed here and not passed on
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        # score each time step: eij = x . W, shape (batch, steps)
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))),
                        (-1, step_dim))

        if self.bias:
            eij += self.b
        eij = K.tanh(eij)

        # softmax over the time dimension, with optional masking
        a = K.exp(eij)
        if mask is not None:
            a *= K.cast(mask, K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # weighted sum over the time steps
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim
from keras.layers import Input, Dense, Dropout, Bidirectional, CuDNNLSTM
from keras.models import Model

# SEQ_LEN (the padded sequence length) is defined elsewhere in the kernel;
# each time step is a 300-dimensional vector (e.g. pretrained word embeddings)
inp = Input(shape=(SEQ_LEN, 300))
x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(inp)   # (batch, SEQ_LEN, 128)
x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)     # (batch, SEQ_LEN, 128)
x = Attention(SEQ_LEN)(x)                                      # (batch, 128)
x = Dense(256, activation="relu")(x)
# x = Dropout(0.25)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
When I first read this I was quite confused: what kind of attention is this? There is no Q/K/V computation at all; it simply initializes a single weight vector at random and uses it to produce the weights for a weighted sum.
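In symbols (my own notation, not taken from the kernel), if h_1, ..., h_T are the T time-step vectors fed into the layer, the call method above computes nothing more than

$$
e_t = \tanh(h_t^{\top} W + b_t), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T}\exp(e_s)}, \qquad
\text{output} = \sum_{t=1}^{T} \alpha_t\, h_t ,
$$

i.e. a single learned vector W scores every time step, the scores are softmax-normalized, and the time steps are averaged with those weights.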
Today, while browsing Kaggle, I finally found where it comes from:

It is just a simplified version of attention, and it really is very simple...

The weights alpha_1 through alpha_t in the figure above come from a randomly initialized (and then trained) weight vector, and the output is simply the weighted sum over the time steps.
This module essentially performs "feature selection" inside the network, and it is a soft feature selection: h_1 through h_t can be the hidden states of the LSTM's t time steps, and instead of simply averaging them, the layer takes a weighted sum in which the weights themselves are learned.
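To make the "learned weighted sum" concrete, here is a tiny numpy sketch of the same computation; the sizes (4 time steps, 3-dimensional hidden states) and the random tensors are my own toy example, not from the kernel:

import numpy as np

T, d = 4, 3                          # toy sizes: 4 time steps, 3-dim hidden states
h = np.random.randn(T, d)            # h_1 ... h_T, e.g. LSTM hidden states
W = np.random.randn(d)               # the layer's single weight vector
b = np.zeros(T)                      # one (zero-initialized) bias per time step

e = np.tanh(h @ W + b)               # one score per time step, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()  # softmax over time: the alpha_t weights
output = (alpha[:, None] * h).sum(axis=0)  # weighted sum of the h_t, shape (d,)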
The same design also works on ordinary dense features, and it is easy to implement: say you have 10 features; pass each one through its own dense layer to obtain 10 new 64-dimensional features, then apply this attention layer on top of those 10 vectors.
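A minimal sketch of that idea, reusing the Attention layer defined above; the Lambda-based slicing/stacking and the sigmoid output head are my own choices, and 10 features / 64 dimensions are just the numbers from the example:

from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input(shape=(10,))                                       # 10 scalar features

# give each feature its own Dense projection to a 64-dim vector,
# then stack the 10 vectors into a (batch, 10, 64) tensor
projected = []
for i in range(10):
    feat_i = Lambda(lambda t, idx=i: t[:, idx:idx + 1])(inp)   # (batch, 1)
    projected.append(Dense(64, activation='relu')(feat_i))     # (batch, 64)
stacked = Lambda(lambda ts: K.stack(ts, axis=1))(projected)    # (batch, 10, 64)

x = Attention(10)(stacked)              # soft feature selection -> (batch, 64)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=inp, outputs=out)

The learned attention weights then tell you how much each of the 10 projected features contributes to the final representation.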