Several Keras attention layer implementations, plus a quick tour of the history of attention

December 16, 2020
AI

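First, the attention mechanism in seq2seq

This is the basic seq2seq model, without teacher forcing (teacher forcing takes a while to explain, so I will just use the simplest, most primitive seq2seq as the example). The code is very simple: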

from tensorflow.keras.layers import GRU, Dense, RepeatVector, TimeDistributed
from tensorflow.keras.models import Sequential


def build_model(input_size, seq_len, hidden_size):
    """Build a sequence-to-sequence model."""
    model = Sequential()
    # Encoder: variable-length input with input_size features per time step
    model.add(GRU(hidden_size, input_shape=(None, input_size), return_sequences=False))
    model.add(Dense(hidden_size, activation="relu"))
    # Repeat the context vector once for each of the seq_len output steps
    model.add(RepeatVector(seq_len))
    # Decoder: emits one hidden state per output step
    model.add(GRU(hidden_size, return_sequences=True))
    model.add(TimeDistributed(Dense(input_size, activation="linear")))
    model.compile(loss="mse", optimizer="adam")

    return model

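The context vector serves as the input for every output step of the decoder, and that is all there is to it. I use a time series problem as the example here because it is simpler; explaining this with NLP is a real pain.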
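To make the role of RepeatVector concrete, here is a minimal NumPy sketch (not Keras itself) of what happens to the encoder's context vector: it is copied once per output time step, so the decoder receives the same context at every step. The array sizes are arbitrary illustration values, not taken from the model above.

```python
import numpy as np

batch_size, hidden_size, seq_len = 2, 3, 4  # arbitrary illustration values

# Pretend this is the encoder's final output: one context vector per sample.
context = np.arange(batch_size * hidden_size, dtype=float).reshape(batch_size, hidden_size)

# Mimic RepeatVector(seq_len): add a time axis and copy the vector along it.
decoder_input = np.repeat(context[:, np.newaxis, :], seq_len, axis=1)

print(decoder_input.shape)  # (2, 4, 3)
```

Every time step of decoder_input holds the same vector, which is why this plain seq2seq has no way to focus on different parts of the input at different output steps.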

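Say our input has 4 time steps and we want to predict the next 3, so each input sample has 4 time slices. To keep things simple, take the univariate case: each time step has 1 feature, namely the series value itself. The label is also a sequence, with 3 time slices and 1 feature per slice. A single sample looks roughly like this: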

input

[[1]]

[[2]]

[[3]]

[[4]]

output:

[[5]]

[[6]]

[[7]]
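As a rough sketch of how such samples could be cut out of a longer series, the helper below slides a window over a plain Python list. The function name make_windows and the toy series are hypothetical, chosen only for illustration.

```python
def make_windows(series, n_in, n_out):
    """Slice a 1-D series into (input, target) pairs.

    Each input has shape (n_in, 1) and each target has shape (n_out, 1),
    expressed as nested lists.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_in - n_out + 1):
        past = series[start:start + n_in]
        future = series[start + n_in:start + n_in + n_out]
        inputs.append([[v] for v in past])     # one feature per time step
        targets.append([[v] for v in future])
    return inputs, targets


X, y = make_windows([1, 2, 3, 4, 5, 6, 7, 8], n_in=4, n_out=3)
print(X[0])  # [[1], [2], [3], [4]]
print(y[0])  # [[5], [6], [7]]
```

Stacking all windows gives tensors of shape (num_samples, 4, 1) and (num_samples, 3, 1), which is the layout the model expects.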

The code then becomes:

    model = Sequential()
    # 4 historical time steps with 1 feature each; hidden_size is chosen by the user
    model.add(GRU(hidden_size, input_shape=(4, 1), return_sequences=False))
    model.add(Dense(hidden_size, activation="relu"))
    # seq_len is the number of future time steps to predict: we use 4 historical
    # time steps to predict the next 3, so the context vector is repeated 3 times
    model.add(RepeatVector(3))
    model.add(GRU(hidden_size, return_sequences=True))
    model.add(TimeDistributed(Dense(1, activation="linear")))
    model.compile(loss="mse", optimizer="adam")

In fact, NLP data and multivariate time series data have essentially the same form. After embedding, a sentence looks roughly like this:

[[1,2,1,3]

[2,3,3,1]

[5,5,1,1]]

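Suppose a sentence has 3 words and each word is embedded into 4 dimensions. Each sentence then has shape (3, 4): 3 time steps with 4 dimensions per step. A 3-D tensor built from 10000 such sentences has shape (10000, 3, 4).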

A multivariate time series is exactly the same:

[[1,2,1,3]

[2,3,3,1]

[5,5,1,1]]

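This data can equally be viewed as one sample for multivariate time series forecasting: [1,2,1,3], for example, is like the temperature, pressure, humidity and wind force in a temperature forecasting problem. So if you treat a multivariate time series as the post-embedding form of a sentence, many NLP methods carry over easily. It is all just a matter of calling packages back and forth anyway...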

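Implementing this takes a little more code: an encoder and a decoder written with the Keras functional API.

Encoder part:

from tensorflow.keras.layers import Input, LSTM, Dense

encoder_inputs = Input(shape=(None, num_encoder_tokens))

encoder = LSTM(latent_dim, return_state=True)

# Call the encoder to get its outputs plus the states state_h and state_c
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# Discard encoder_outputs; we only need the encoder states
encoder_state = [state_h, state_c]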

Decoder part:

decoder_inputs = Input(shape=(None, num_decoder_tokens))

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

# Use the encoder's final states as the decoder's initial state
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_state)

# Add a fully connected output layer
decoder_dense = Dense(num_decoder_tokens, activation='linear')
decoder_outputs = decoder_dense(decoder_outputs)

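Adapting this to the time series problem above only takes a few tweaks; nothing difficult.

Next is the vanilla seq2seq attention from 2014, i.e. soft attention, which is the easiest kind to understand.

The original seq2seq looks like this:

[Figure: plain seq2seq]

With soft attention added, it looks like this:

[Figure: seq2seq with soft attention]

It is hard to keep using the time series example here (and the attention mechanisms used for time series problems are different anyway), so let's go back to the NLP example.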

The official TensorFlow tutorial explains this most clearly and authoritatively:

https://www.tensorflow.org/tutorials/text/nmt_with_attention

Exhausted; time for a break.

First, we use an RNN to obtain the encoder hidden states $(h_1, h_2, ..., h_T)$.
Suppose the current decoder hidden state is $s_{t-1}$. We can compute the relevance of each input position $j$ to the current output position, $e_{tj}=a(s_{t-1}, h_j)$, or in vector form $\vec{e_t}=(a(s_{t-1},h_1),...,a(s_{t-1},h_T))$, where $a$ is a scoring function. Common choices are the dot product $\vec{e_t}=\vec{s_{t-1}}^T\vec{h}$, the weighted dot product $\vec{e_t}=\vec{s_{t-1}}^TW\vec{h}$, and the additive form $\vec{e_t}=\vec{v}^T\tanh(W_1\vec{h}+W_2\vec{s_{t-1}})$.
Applying softmax to $\vec{e_t}$ normalizes it into the attention distribution $\vec{\alpha_t}=\mathrm{softmax}(\vec{e_t})$, which expands to $\alpha_{tj}=\frac{\exp(e_{tj})}{\sum_{k=1}^T\exp(e_{tk})}$.
Using $\vec{\alpha_t}$, we take a weighted sum to obtain the context vector $\vec{c_t} = \sum_{j=1}^T\alpha_{tj}h_j$.
From this we can compute the decoder's next hidden state $s_t = f(s_{t-1},y_{t-1},c_t)$ and the output at that position, $p(y_t|y_1,...,y_{t-1}, \vec{x}) = g(y_{t-1}, s_t, c_t)$.
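The steps above can be traced numerically. The following is a minimal NumPy sketch of one decoding step using the additive score; the function name additive_attention, the dimension names and the random weights are all assumptions made for illustration, not part of any library.

```python
import numpy as np


def softmax(x):
    x = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()


def additive_attention(s_prev, H, W1, W2, v):
    """One decoding step of additive soft attention.

    s_prev: (d_s,)     decoder state s_{t-1}
    H:      (T, d_h)   encoder states h_1..h_T
    W1:     (d_a, d_h), W2: (d_a, d_s), v: (d_a,)
    """
    # e_{tj} = v^T tanh(W1 h_j + W2 s_{t-1}), computed for all j at once
    e = np.tanh(H @ W1.T + W2 @ s_prev) @ v   # shape (T,)
    alpha = softmax(e)                        # attention distribution over inputs
    c = alpha @ H                             # context vector, shape (d_h,)
    return alpha, c


rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6                 # arbitrary illustration sizes
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W1 = rng.normal(size=(d_a, d_h))
W2 = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)

alpha, c = additive_attention(s_prev, H, W1, W2, v)
print(alpha.shape, c.shape)  # (5,) (4,)
```

The weights alpha sum to 1 across the T input positions, and c is the corresponding weighted average of the encoder states. In a real model W1, W2 and v would be trained rather than sampled.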

https://github.com/philipperemy/keras-attention-mechanism

The classic one:

from attention import Attention

# [...]

m = Sequential([
      LSTM(128, input_shape=(seq_length, 1), return_sequences=True),
      Attention(), # <--------- here.
      Dense(1, activation='linear')
])

As the snippet above shows, it is also very simple to use.

from tensorflow.keras.layers import Dense, Lambda, dot, Activation, concatenate
from tensorflow.keras.layers import Layer


class Attention(Layer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def __call__(self, hidden_states):
        """
        Many-to-one attention mechanism for Keras.
        @param hidden_states: 3D tensor with shape (batch_size, time_steps, input_dim).
        @return: 2D tensor with shape (batch_size, 128)
        @author: felixhao28.
        """
        hidden_size = int(hidden_states.shape[2])
        # Inside dense layer
        #              hidden_states            dot               W            =>           score_first_part
        # (batch_size, time_steps, hidden_size) dot (hidden_size, hidden_size) => (batch_size, time_steps, hidden_size)
        # W is the trainable weight matrix of attention Luong's multiplicative style score
        score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
        #            score_first_part           dot        last_hidden_state     => attention_weights
        # (batch_size, time_steps, hidden_size) dot   (batch_size, hidden_size)  => (batch_size, time_steps)
        h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
        score = dot([score_first_part, h_t], [2, 1], name='attention_score')
        attention_weights = Activation('softmax', name='attention_weight')(score)
        # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
        context_vector = dot([hidden_states, attention_weights], [1, 1], name='context_vector')
        pre_activation = concatenate([context_vector, h_t], name='attention_output')
        attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
        return attention_vector

The source code is very simple and clean.

According to the comments in the code, this implements Luong-style attention.
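To follow the tensor shapes in the layer above, here is a NumPy sketch that mirrors its steps outside Keras. The random matrices W and W_c merely stand in for the two trainable Dense kernels, and the sizes are arbitrary; this is an illustration of the computation, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, time_steps, hidden_size = 2, 4, 3     # arbitrary illustration sizes

hidden_states = rng.normal(size=(batch_size, time_steps, hidden_size))
W = rng.normal(size=(hidden_size, hidden_size))   # stands in for 'attention_score_vec'
W_c = rng.normal(size=(2 * hidden_size, 128))     # stands in for 'attention_vector'

score_first_part = hidden_states @ W              # (batch, time, hidden)
h_t = hidden_states[:, -1, :]                     # last hidden state, (batch, hidden)

# score[b, i] = score_first_part[b, i, :] . h_t[b, :]
score = np.einsum('bth,bh->bt', score_first_part, h_t)

# Softmax over the time axis gives one weight per input step
score = score - score.max(axis=1, keepdims=True)
attention_weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)

# context[b, :] = sum_i attention_weights[b, i] * hidden_states[b, i, :]
context_vector = np.einsum('bt,bth->bh', attention_weights, hidden_states)

pre_activation = np.concatenate([context_vector, h_t], axis=1)  # (batch, 2 * hidden)
attention_vector = np.tanh(pre_activation @ W_c)                # (batch, 128)

print(attention_weights.shape, attention_vector.shape)  # (2, 4) (2, 128)
```

Note that the query is always the last hidden state, which is what makes this a many-to-one mechanism: the layer returns a single vector per sample rather than one per time step.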

Take this classic figure as an example. Here:

 h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)

The code above extracts the final hidden state for x4, which is h_t in the code.

score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)

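This step applies a bias-free Dense layer to the hidden states, which, per the code comments, multiplies them by the trainable weight matrix W.

https://github.com/keras-team/keras/blob/ce5728bbd36004c7a17b86e69a8e59b21d6ee6d4/keras/layers/dense_attention.py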

  def _calculate_scores(self, query, key):
    """Calculates attention scores as a query-key dot product.
    Args:
      query: Query tensor of shape `[batch_size, Tq, dim]`.
      key: Key tensor of shape `[batch_size, Tv, dim]`.
    Returns:
      Tensor of shape `[batch_size, Tq, Tv]`.
    """
    scores = tf.matmul(query, key, transpose_b=True)
    if self.scale is not None:
      scores *= self.scale
    return scores

The score is computed with a dot product, i.e. dot-product similarity is used as the similarity measure, and the scale parameter rescales the scores.
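For intuition, the score computation above can be reproduced in NumPy and extended with the softmax and value-weighting steps. This is a sketch under assumptions: dot_product_attention is a hypothetical helper, scale is treated as a plain float, and the steps after the score are my reading of the documented output shapes rather than the library's code.

```python
import numpy as np


def dot_product_attention(query, value, key=None, scale=None):
    """query: (batch, Tq, dim); value and key: (batch, Tv, dim)."""
    if key is None:
        key = value                                 # the most common case
    scores = query @ np.swapaxes(key, -1, -2)       # (batch, Tq, Tv)
    if scale is not None:
        scores = scores * scale
    # Softmax over the Tv axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights                 # (batch, Tq, dim), (batch, Tq, Tv)


rng = np.random.default_rng(0)
query = rng.normal(size=(2, 3, 8))
value = rng.normal(size=(2, 5, 8))

output, weights = dot_product_attention(query, value, scale=0.5)
print(output.shape, weights.shape)  # (2, 3, 8) (2, 3, 5)
```

Each of the Tq query positions gets its own distribution over the Tv value positions, so unlike the many-to-one layer shown earlier the output keeps a time axis.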

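The attention base class; note the docstring:

"Base Attention class for Dense networks. This class is suitable for Dense or CNN networks, and not for RNN networks.

Implementations of attention mechanisms should inherit from this class, and reuse the `apply_attention_scores()` method."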

Most attention implementations we see on GitHub are built on LSTM. For example, a widely used example that turns up on Baidu comes from:

https://github.com/philipperemy/keras-attention-mechanism

From the source code shown earlier, we can see that this Keras attention implementation is essentially written for the LSTM structure, and that the code is very concise and clean.

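This implements Luong-style attention; the scores are computed with dot, which is simple and easy to follow:

score = dot([score_first_part, h_t], [2, 1], name='attention_score')

Both types can be seen in the figure above. Besides these, there is one more, most basic type of attention, introduced at the end of the article.

Now back to the Keras implementation. Skipping the attention base class for the moment, let's look at the two kinds of attention layer that can be called directly.

First, the Luong-style attention mechanism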

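Args:

  causal: Boolean. Set to `True` for decoder self-attention. Adds a mask such that position `i` cannot attend to positions `j > i`. This prevents the flow of information from the future towards the past.

  dropout: Float between 0 and 1. Fraction of the units to drop for the attention scores.

Call arguments:

  inputs: List of the following tensors:

    * query: Query `Tensor` of shape `[batch_size, Tq, dim]`.

    * value: Value `Tensor` of shape `[batch_size, Tv, dim]`.

    * key: Optional key `Tensor` of shape `[batch_size, Tv, dim]`. If not given, will use `value` for both `key` and `value`, which is the most common case.

  mask: List of the following tensors:

    * query_mask: A boolean mask `Tensor` of shape `[batch_size, Tq]`. If given, the output will be zero at the positions where `mask==False`.

    * value_mask: A boolean mask `Tensor` of shape `[batch_size, Tv]`. If given, will apply the mask such that values at positions where `mask==False` do not contribute to the result.

  training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout).

  return_attention_scores: Boolean. If `True`, returns the attention scores (after masking and softmax) as an additional output argument.

Output:

  Attention outputs of shape `[batch_size, Tq, dim]`.

  [Optional] Attention scores after masking and softmax, with shape `[batch_size, Tq, Tv]`.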
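The causal option described above can be illustrated with a small NumPy sketch. The helper names causal_mask and masked_softmax are hypothetical, and filling masked positions with a large negative constant is an assumption made for illustration, not a statement about how Keras does it internally.

```python
import numpy as np


def causal_mask(T):
    # mask[i, j] is True when position i may attend to position j, i.e. j <= i
    return np.tril(np.ones((T, T), dtype=bool))


def masked_softmax(scores, mask):
    # Push disallowed positions to a very negative score so softmax gives them ~0 weight
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)


scores = np.zeros((4, 4))                 # uniform scores make the effect easy to see
weights = masked_softmax(scores, causal_mask(4))
print(weights[0])  # [1. 0. 0. 0.]
```

Position 0 can only look at itself, while the last position spreads its weight evenly over all four steps; no row puts weight on a later position, which is what stops information flowing from the future to the past.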
