What exactly does the attention layer you see on Kaggle implement?

  • June 22, 2021
  • AI

LSTM + Attention Baseline [0.672 LB] (www.kaggle.com)

https://www.kaggle.com/braquino/5-fold-lstm-attention-fully-commented-0-694

You often see an attention implementation like this in Kaggle kernels:

# Imports this snippet needs (standalone Keras 2.x on the TF 1.x backend, which is
# what these older kernels assume; very old versions import Layer from
# keras.engine.topology instead of keras.layers)
from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer, Input, Dense, Bidirectional, CuDNNLSTM
from keras.models import Model


class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3  # expects (batch, time_steps, features)

        # Context vector w: one learned weight per feature dimension,
        # used to score every time step
        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            # Per-time-step bias, shape (step_dim,)
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # The output is a single vector per sample, so the mask is not propagated
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        # e_t = w . h_t for every time step, reshaped back to (batch, step_dim)
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        # Softmax over the time axis, written out by hand so the mask can be applied
        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # Weighted sum of the hidden states; output shape (batch, features_dim)
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim


# SEQ_LEN (sequence length) and the 300-dim inputs (word embeddings in these
# kernels) are defined elsewhere in the notebook
inp = Input(shape=(SEQ_LEN, 300))
x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(inp)
x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
x = Attention(SEQ_LEN)(x)          # (batch, SEQ_LEN, 128) -> (batch, 128)
x = Dense(256, activation="relu")(x)
# x = Dropout(0.25)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

When I first read it I was quite confused: what kind of attention is this? There is no Q/K/V computation at all; the layer just randomly initializes a weight vector and uses it to produce a weighted sum.
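Written out, for hidden states h_1, ..., h_T (each of dimension features_dim), the learned vector w (the layer's W), and the per-time-step bias b_t, the call() above computes roughly:

$$
e_t = \tanh\!\big(w^\top h_t + b_t\big), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \qquad
\text{output} = \sum_{t=1}^{T} \alpha_t\, h_t
$$

(masked time steps are zeroed out before the normalization). In Q/K/V terms, you can read w as a single learned query shared by every sample, with the hidden states playing the role of both keys and values; there is no pairwise query-key interaction, which is why the code looks so bare.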

Today, while browsing Kaggle, I finally found where this comes from:

It is just a simplified version of attention, and it really is very simple...

The weight vector behind α1 through αt in that diagram is simply randomly initialized (and then trained along with the rest of the network), and the output is just the weighted sum of the hidden states.

The design of this module implements a kind of "feature selection" inside the neural network, and a soft one at that: h1 through ht can be the LSTM's hidden states at its t time steps, and instead of simply averaging them the layer takes a weighted sum, with the weights themselves learned.
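As a concrete walk-through, here is the same forward pass in plain NumPy, with illustrative shapes and a random, untrained vector standing in for the learned w:

import numpy as np

# Illustrative shapes: batch of 2, T = 5 time steps, d = 8 hidden units
batch, T, d = 2, 5, 8
h = np.random.randn(batch, T, d)       # stand-in for the LSTM hidden states h_1..h_T
w = np.random.randn(d)                 # the layer's weight vector (learned in practice)
b = np.zeros(T)                        # per-time-step bias

e = np.tanh(h @ w + b)                 # one score per time step, shape (batch, T)
a = np.exp(e)
a /= a.sum(axis=1, keepdims=True)      # softmax over the time axis
out = (h * a[:, :, None]).sum(axis=1)  # weighted sum of hidden states, shape (batch, d)
print(out.shape)                       # (2, 8)

Each row of a is that sample's α_1..α_T, which is exactly the "learned soft average" described above.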

The same design also works with ordinary dense layers, and it is easy to implement: say you have 10 features; map each one through its own dense layer into a 64-dimensional vector, stack the 10 vectors, and then apply this attention layer over them.
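A rough sketch of that idea, assuming the Attention class above and the same Keras setup (the variable names and layer sizes here are just illustrative):

from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

feat_inp = Input(shape=(10,))                                  # 10 raw scalar features
columns = [Lambda(lambda t, i=i: t[:, i:i + 1])(feat_inp)      # slice out feature i
           for i in range(10)]
embedded = [Dense(64, activation="relu")(c) for c in columns]  # each feature -> its own 64-dim vector
stacked = Lambda(lambda ts: K.stack(ts, axis=1))(embedded)     # shape (batch, 10, 64)
selected = Attention(10)(stacked)                              # soft selection over the 10 features -> (batch, 64)
out = Dense(1, activation="sigmoid")(selected)
feat_model = Model(inputs=feat_inp, outputs=out)

Here Attention(10) attends over the 10 feature embeddings instead of over time steps, so the learned α values act as per-sample soft feature importances.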