[NLP] Reformer: The Efficient Transformer

July 1, 2020
Notes
Deep Learning, LSH, nlp, reformer, reversible residual network, Transformer, Algorithms and Models

1. Current state

(1) Models keep getting deeper

(2) Parameter counts keep growing

(3) Hard to train

(4) Hard to fine-tune

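2. Parameter count and memory footprint of a single layer

Layer	Settings	Size and memory
1 layer	0.5 billion parameters	0.5 billion × 4 bytes = 2 GB
embedding layer	64K tokens, emb_size 1024, batch_size 8	64K × 1K × 8 = 0.5B floats (these are activations, not parameters), 2 GB of memory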
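The arithmetic in the table can be checked with a few lines of Python. This is only a sketch: it assumes 32-bit floats (4 bytes each), and the helper name `gib` is made up for illustration. The "2 GB" in the table is a rounded figure; the exact values depend on whether "0.5 billion" means 0.5 × 10^9 or 2^29.

```python
BYTES_PER_FLOAT32 = 4  # assumption: everything is stored as 32-bit floats


def gib(num_floats):
    """Memory in GiB needed to hold `num_floats` 32-bit floats."""
    return num_floats * BYTES_PER_FLOAT32 / 2**30


layer_params = 0.5e9                     # 0.5 billion parameters in one layer
embedding_floats = 64 * 1024 * 1024 * 8  # 64K tokens x 1024 emb_size x batch_size 8

print(f"one layer:  {gib(layer_params):.2f} GiB")
print(f"embeddings: {embedding_floats:,} floats, {gib(embedding_floats):.2f} GiB")
```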

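3. Transformer memory problems and Reformer's solutions

Problem in the Transformer	Reformer's solution	Effect	Impact on the model
memory(N layers) > N * memory(1 layer), because activations must be kept for backpropagation	Reversible layers	Activations stored for N layers → activations stored for 1 layer	negligible
Feed-forward layers have $d_{ff}$ >> $d_{model}$ and take a large share of memory	Split activations and process them in chunks	$d_{ff}$ → the dimension processed within each chunk	numerically identical
Attention is $O(L^2)$ in both memory and computation	Locality-Sensitive Hashing (LSH)	$O(L^2)$ → $O(L \log L)$, so long sequences are handled much better	major change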

4. Attention → LSH

4.1 Recall Attention && Thoughts

(1) dot-product attention && multi-head attention

Formula (1): $Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V$

Q, K, V: [batch_size, length, d_model]

$QK^T$: [batch_size, length, length]

e.g. length = 64K → 64K × 64K × 4 bytes = 16 GB, so this term alone takes too much memory

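(2) Ideas for optimizing formula (1):

(a) Optimize $QK^T$

→ 4.2 Reduce the large matrix-matrix product to vector-matrix products;

→ 4.3 Q = K

(b) Optimize softmax(x)

→ 4.4 softmax is dominated by the largest entries of x. Here that means we only need the $k_j$ that are similar to $q_i$, since their dot products are larger; the remaining keys are ignored and skipped, which saves computation.

→ 4.5 But how do we decide which $k_j$ are close to $q_i$? This is where LSH comes in.

→ 4.6 How is LSH applied to attention? LSH attention

→ 4.7 With a single round of hashing, some similar items do not land in the same bucket → run several rounds of hashing and take the union: Multi-round LSH attention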

4.2 Optimizing $QK^T$ → Memory-efficient attention

$QK^T$ → $q_iK^T$: compute attention for each query separately instead of computing the whole matrix product at once
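A minimal pure-Python sketch of this idea, assuming standard scaled dot-product attention. The function names are hypothetical and the toy sizes (L = 6, d = 4) are chosen only for the demo. The row-wise version holds one length-L score row at a time, while the reference version builds the full L × L matrix; both should give the same output.

```python
import math
import random


def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_one_query(q, K, V):
    """softmax(q K^T / sqrt(d_k)) V for a single query vector q."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]  # one row only
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]


def attention_rowwise(Q, K, V):
    """Process queries one by one; never stores the L x L score matrix."""
    return [attend_one_query(q, K, V) for q in Q]


def attention_full(Q, K, V):
    """Reference version that materialises the whole L x L score matrix first."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
            for row in weights]


rng = random.Random(0)
L, d = 6, 4
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]

out_full = attention_full(Q, K, V)
out_row = attention_rowwise(Q, K, V)
```

The trade-off is that the row-wise form gives up one large batched matrix product in exchange for a much smaller peak memory footprint.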

4.3 Optimizing $QK^T$ → Shared-QK Transformer (Q = K)

Transformer	Reformer (LSH attention)
A → linear projection 1 → Q	A → linear projection 1 → Q
A → linear projection 2 → K	A → linear projection 1 → K
A → linear projection 3 → V	A → linear projection 3 → V

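4.4 Optimizing softmax(x) → Hashing Attention

$QK^T$ is only an intermediate value; what we actually care about is the result after softmax.

softmax is dominated by the largest entries. For length = 64K, finding the 32 or 64 keys $k_j$ closest to $q_i$ is enough and cuts the computation dramatically.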
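To see why dropping most keys changes little, here is a small sketch with made-up scores: 3 keys that are similar to the query and 61 that are not. `topk_softmax` is a hypothetical helper that simply keeps the k largest scores. It is not how Reformer selects keys (that is what LSH is for); it only illustrates how concentrated softmax weights are.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; every other weight is set to 0."""
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    kept = softmax([scores[j] for j in keep])
    weights = [0.0] * len(scores)
    for j, w in zip(keep, kept):
        weights[j] = w
    return weights


scores = [9.0, 8.5, 8.0] + [0.0] * 61   # toy dot products for one query
full = softmax(scores)
approx = topk_softmax(scores, k=3)

mass_in_top3 = sum(full[:3])
max_err = max(abs(a - b) for a, b in zip(full, approx))
print(f"softmax mass on the top 3 keys: {mass_in_top3:.4f}")
print(f"largest weight error after truncation: {max_err:.4f}")
```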

4.5 Optimizing softmax(x) → Locality sensitive hashing

(1) LSH

Nearby vectors are hashed into the same bucket with high probability.

(2) How is the hash function chosen?

To get b hash values, first fix a random matrix R of shape [$d_k$, b/2]

Define: h(x) = arg max([xR; −xR]), where [u; v] denotes the concatenation of two vectors

LSH scheme: Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. CoRR, abs/1509.02897, 2015. URL http://arxiv.org/abs/1509.02897.
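The hash function defined above can be sketched in plain Python as follows. Drawing the entries of R from a standard Gaussian is an assumption (the notes only say "random matrix"), and `make_projection` and `lsh_hash` are hypothetical names.

```python
import random


def make_projection(d_k, n_buckets, seed=0):
    """Random matrix R of shape [d_k, n_buckets / 2]; Gaussian entries are an assumption."""
    assert n_buckets % 2 == 0
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(d_k)]


def lsh_hash(x, R):
    """h(x) = argmax([xR; -xR]): index of the largest entry of the concatenation."""
    xR = [sum(x[i] * R[i][c] for i in range(len(x))) for c in range(len(R[0]))]
    concat = xR + [-v for v in xR]
    return max(range(len(concat)), key=lambda j: concat[j])


R = make_projection(d_k=4, n_buckets=8)
x = [1.0, -2.0, 0.5, 3.0]
h = lsh_hash(x, R)
h_scaled = lsh_hash([2.0 * v for v in x], R)   # same direction, different length
h_flipped = lsh_hash([-v for v in x], R)       # opposite direction
print(h, h_scaled, h_flipped)
```

Two properties follow directly from the definition: the bucket depends only on the direction of x, so positive rescaling never changes it, and negating x moves it to the "opposite" bucket (h + b/2) mod b. That is consistent with a hash designed for angular distance.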

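4.6 Optimizing softmax(x) → LSH Attention

With the LSH mechanism in place, how is it applied to attention?

Formula (2): $o_i = \sum_{j \in P_i} \exp(q_i \cdot k_j - z(i, P_i)) v_j$, where $P_i = \{j : i \ge j\}$

→

Formula (3): $o_i = \sum_{j \in \tilde{P}_i} \exp(q_i \cdot k_j - m(j, P_i) - z(i, P_i)) v_j$

where

$m(j, P_i) = \infty$ if $j \notin P_i$ and 0 otherwise, and $z$ is the normalizing term of the softmax. For LSH attention, $P_i = \{j : h(q_i) = h(k_j)\}$. (Formulas as given in the Reformer paper.)

Figure 2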

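(a) Normal → sparse

(b) Bucketed → q and k are sorted by hash value. Items with the same color have the same hash value.

Vectors with the same hash value are close to each other, so the all-pairs dot products in (a) can be approximated by dot products within each bucket only.

(c) Q = K

→ Problem: the numbers of q and k in a bucket may differ, which makes the computation awkward.

→ Solution:

First, set $k_j = q_j / ||q_j||$ to guarantee $h(k_j) = h(q_j)$

Sort the buckets by bucket number

Within each bucket, sort by sequence position. This defines the permutation $i$ → $s_i$

→ Effect: the pairs in a bucket are concentrated near the diagonal.

(d) Chunked

How to chunk?

Chunk size: $m = 2l / n_{buckets}$, where l is the sequence length

The average bucket size is $l / n_{buckets}$ → a bucket is unlikely to grow to twice that size → so it is unlikely to exceed the chunk size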
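The sorting and chunking steps can be sketched as below. The sketch treats m as the chunk length, as in the Reformer paper, and lets each query look at its own chunk and the one before it. The bucket ids are hypothetical and the function names are made up for illustration.

```python
def sorted_order(buckets):
    """Positions sorted by bucket id, ties broken by sequence position (i -> s_i)."""
    return sorted(range(len(buckets)), key=lambda i: (buckets[i], i))


def chunk_candidates(buckets, n_buckets):
    """For each position i, the positions j it may attend to:
    same bucket, and j's chunk is i's chunk or the chunk right before it."""
    l = len(buckets)
    m = 2 * l // n_buckets                  # chunk size m = 2l / n_buckets
    order = sorted_order(buckets)
    s = {pos: rank for rank, pos in enumerate(order)}   # s_i: rank after sorting
    cands = {}
    for i in range(l):
        ci = s[i] // m                      # chunk index of query i
        cands[i] = sorted(
            j for j in range(l)
            if buckets[j] == buckets[i] and ci - 1 <= s[j] // m <= ci
        )
    return order, cands


buckets = [1, 0, 1, 3, 0, 2, 1, 2]          # hypothetical bucket ids for l = 8
order, cands = chunk_candidates(buckets, n_buckets=4)
print("sorted positions:", order)
print("candidates for position 6:", cands[6])
```

In this toy case bucket 1 straddles a chunk boundary (positions 0 and 2 end up in the first chunk, position 6 in the second), and looking back one chunk is what lets position 6 still reach positions 0 and 2.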

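4.7 Optimizing softmax(x) → Multi-round LSH attention

With a single round of hashing, some similar items do not land in the same bucket → run several rounds of hashing and take the union

The rounds run in parallel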
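Taking the union over rounds can be sketched as follows. The per-round bucket assignments are hypothetical, and the sketch loops over rounds sequentially for clarity even though the rounds are independent and can be computed in parallel.

```python
def same_bucket_sets(buckets):
    """For each position i, the set of positions that share i's bucket in one round."""
    return [{j for j, b in enumerate(buckets) if b == bi} for bi in buckets]


def multi_round_sets(rounds):
    """Union over all rounds of the same-bucket sets."""
    merged = [set() for _ in range(len(rounds[0]))]
    for buckets in rounds:
        for i, members in enumerate(same_bucket_sets(buckets)):
            merged[i] |= members
    return merged


round_1 = [0, 0, 1, 1, 2]   # hypothetical bucket ids from the first hash
round_2 = [0, 1, 1, 2, 2]   # hypothetical bucket ids from a second, independent hash
merged = multi_round_sets([round_1, round_2])
print(merged)
```

Position 1 shares a bucket with position 0 only in the first round and with position 2 only in the second, so the union recovers both neighbours that a single round would have missed.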

4.8 Causal masking for shared-QK attention

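The original Transformer mask: mask out later positions

In LSH attention the order of the items changes, so how is the mask in formula (3) implemented?

Attach a position index to every query and key, reorder the indices with the same permutation used for sorting, and implement the mask with a comparison.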
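A sketch of that comparison, assuming `order[a]` holds the original sequence position of the item that ended up in sorted slot a. Both the name and the example permutation are hypothetical.

```python
def causal_mask_after_sort(order):
    """mask[a][b] is True when the query in sorted slot a must NOT attend to
    the key in sorted slot b, i.e. when the key comes later in the original sequence."""
    n = len(order)
    return [[order[b] > order[a] for b in range(n)] for a in range(n)]


order = [1, 4, 0, 2, 6, 5, 7, 3]   # original positions after sorting by bucket
mask = causal_mask_after_sort(order)

for a, row in enumerate(mask):
    visible = [order[b] for b, blocked in enumerate(row) if not blocked]
    print(f"query at position {order[a]} may see positions {sorted(visible)}")
```

Because the mask is computed from the carried position indices rather than from slot numbers, it stays correct for any permutation produced by the sort.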

4.9 Complexity analysis

4l?

4.10 Experimental results

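5. Reversible Transformer

5.1 Recall RevNets

Gradients in backpropagation: $F = AX^T$

The gradient of F with respect to A is X → the input of each layer

So having access to each layer's input matters.

~ The Reversible Residual Network: Backpropagation Without Storing Activations

Residual layer type	Input → output	Form	Where the inputs come from when computing gradients in backpropagation
Normal residual layer	x → y	y = x + F(x)	Stored for every layer during the forward pass
Reversible residual layer	(x1, x2) → (y1, y2)	y1 = x1 + F(x2); y2 = x2 + G(y1)	Recomputed from the following layer's activations: x2 = y2 − G(y1); x1 = y1 − F(x2)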
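The reversible layer in the table can be checked numerically with a short sketch. `toy_F` and `toy_G` are arbitrary stand-in functions chosen for the demo (in a reversible network they would be learned sub-layers), and vectors are plain Python lists.

```python
import math


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def toy_F(v):
    return [math.tanh(x) for x in v]     # stand-in for F


def toy_G(v):
    return [0.5 * x * x for x in v]      # stand-in for G; need not be invertible


def rev_forward(x1, x2, F, G):
    y1 = add(x1, F(x2))
    y2 = add(x2, G(y1))
    return y1, y2


def rev_inverse(y1, y2, F, G):
    x2 = sub(y2, G(y1))                  # undo the second update first
    x1 = sub(y1, F(x2))
    return x1, x2


x1, x2 = [0.5, -1.0, 2.0], [1.5, 0.25, -0.75]
y1, y2 = rev_forward(x1, x2, toy_F, toy_G)
r1, r2 = rev_inverse(y1, y2, toy_F, toy_G)
print(r1, r2)
```

The inputs come back up to floating-point rounding, so they do not need to be stored during the forward pass. Note that F and G themselves are never inverted; only the additions are undone.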

5.2 Reversible Transformer

How can the reversible residual layer idea be applied to the Transformer?

~ Attention is all you need

reversible residual layer	Transformer
y1 = x1 + F(x2)	Y1 = X1 + Attention(X2)
y2 = x2 + G(y1)	Y2 = X2 + FeedForward(Y1)

5.3 Chunking

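→ Problem: 5.2 optimizes across residual layers, but within each residual layer the FeedForward block still uses a lot of memory.

e.g. in Attention is all you need, $d_{ff}$ = 2048 >> $d_{model}$ = 512

→ Solution: FeedForward acts on each sequence position independently, so the computation can be split into c chunks

The FeedForward layer from 5.2, written in chunks:

$Y_2 = [Y_2^{(1)}; \ldots; Y_2^{(c)}] = [X_2^{(1)} + FeedForward(Y_1^{(1)}); \ldots; X_2^{(c)} + FeedForward(Y_1^{(c)})]$

→ The FeedForward layer is chunked here, and the forward and reverse computations of the reversible layer must be chunked accordingly.

→ When the vocabulary is very large, the log-probabilities at the output are chunked too, and the loss is computed for one section of the sequence at a time.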
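A sketch of chunking the feed-forward computation over sequence positions. It assumes a two-layer ReLU feed-forward block without biases, uses toy sizes (d_model = 4, d_ff = 16) and random weights, and all function names are hypothetical. The point is that only a [chunk, d_ff] hidden activation exists at any time while the output stays the same.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def feed_forward(X, W1, W2):
    """Position-wise FFN relu(X W1) W2; the hidden activation is [len(X), d_ff]."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)


def feed_forward_chunked(X, W1, W2, n_chunks):
    """Same computation, but only one chunk of positions is expanded to d_ff at a time."""
    size = -(-len(X) // n_chunks)        # ceiling division
    out = []
    for start in range(0, len(X), size):
        out.extend(feed_forward(X[start:start + size], W1, W2))
    return out


rng = random.Random(0)
length, d_model, d_ff = 8, 4, 16
X = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(length)]
W1 = [[rng.gauss(0, 1) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]

full = feed_forward(X, W1, W2)
chunked = feed_forward_chunked(X, W1, W2, n_chunks=4)
print(full == chunked)
```

Each output row depends only on its own input row, which is why the chunked result is numerically identical rather than an approximation.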

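5.4 Complexity analysis

6. Borrowed ideas

Reformer does not propose a brand-new algorithm; it adapts existing algorithms to the Transformer.

LSH was originally used to find nearby vectors in large collections. Reformer uses it inside softmax to find the $k_j$ close to $q_i$, and designs LSH attention around the needs of the attention computation.

Reversible ResNet was originally built for ResNet; here the F and G functions are replaced by Attention and FeedForward.

Chunking is also well known; here it is used to split the FeedForward computation.

Taken together, these techniques greatly reduce the memory and compute cost of the Transformer.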

7. References

(1) Reformer: Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The efficient transformer.” arXiv preprint arXiv:2001.04451 (2020).

(2) LSH: Andoni, Alexandr, et al. “Practical and optimal LSH for angular distance.” Advances in neural information processing systems. 2015.

(3) LSH: CS369G: Algorithmic Techniques for Big Data. Lecture 16: Approximate Nearest Neighbor Search. Spring 2015-2016

(4) LSH: CS369G: Algorithmic Techniques for Big Data. Lecture 17: Locality Sensitive Hashing and Dimension Reduction. Spring 2015-2016

(5) Reversible ResNet: Gomez, Aidan N., et al. “The reversible residual network: Backpropagation without storing activations.” Advances in neural information processing systems. 2017.

(6) Transformer: Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

(7) Reformer Pytorch Github: //github.com/lucidrains/reformer-pytorch
