ALBERT Paper + Code Notes

Paper: 1909.11942.pdf

Code: google-research/albert (ALBERT: A Lite BERT for Self-supervised Learning of Language Representations)

Core idea: an improved version of BERT: factorized embedding parameters, cross-layer parameter sharing, and SOP instead of NSP.

What

Motivation and Core Problem

Large models perform well, but memory limits and long training times keep them from growing further. This paper proposes two parameter-reduction techniques to lower BERT's memory consumption and speed up training.

Existing solutions to the memory problem include model parallelism and smarter memory management. ALBERT combines two techniques that address both the memory and the training-time problems:

  • Factorized embedding parameterization
  • Cross-layer parameter sharing

An additional benefit is that parameter sharing acts as a form of regularization, which stabilizes training and helps generalization.

Model and Algorithm

ALBERT makes three adjustments to the BERT model:

  • Factorized embedding parameterization: the WordPiece embeddings learn context-independent representations, while the hidden-layer embeddings learn context-dependent ones. The former can use a much smaller size to reduce the parameter count, so the paper factorizes the embedding parameters into two smaller matrices: the one-hot vectors are first projected into a lower-dimensional embedding space of size E (128), and then projected into the hidden space. The parameter count drops from O(V × H) to O(V × E + E × H) (a small parameter-count calculation follows this list).
  • Cross-layer parameter sharing: all parameters are shared across layers. The authors compare the input/output similarity of each layer for BERT and ALBERT and find that ALBERT's curve is much smoother, suggesting that weight sharing has a stabilizing effect on the network parameters. Also, the similarity oscillates rather than converging to an equilibrium point as claimed for DQEs (see "Related Work"), i.e., a point at which a layer's input and output embeddings stay the same.
  • Sentence coherence loss: BERT's NSP (Next Sentence Prediction) has been found unreliable. The authors conjecture that the task is too easy compared to MLM: it effectively mixes topic prediction and coherence prediction into one task, and topic prediction is easy and overlaps with MLM. The paper therefore proposes SOP (Sentence Order Prediction), which focuses on modeling inter-sentence coherence. Concretely, positives are the same as in BERT, two consecutive segments from the same document; negatives use the same two segments with their order swapped. It turns out that NSP cannot solve the SOP task at all (i.e., it ends up learning the easier topic-prediction signal and performs at the random-baseline level on SOP), while SOP can solve the NSP task to a reasonable degree.
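To make the first adjustment concrete, here is a small back-of-the-envelope calculation (my own illustration; V = 30000 is ALBERT's vocabulary size, H = 768 and E = 128 are the base-configuration values mentioned in the paper and in the code below) comparing the embedding parameter count before and after factorization:

V, H, E = 30000, 768, 128  # vocab size, hidden size, factorized embedding size

bert_style = V * H            # a single V x H embedding table
albert_style = V * E + E * H  # a V x E table plus an E x H projection

print(f"O(V*H)       = {bert_style:,}")                    # 23,040,000
print(f"O(V*E + E*H) = {albert_style:,}")                  # 3,938,304
print(f"reduction    = {bert_style / albert_style:.1f}x")  # ~5.9x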

One more thing to note: the paper does not use dropout (more precisely, dropout is removed in the xxlarge configuration).

Next comes the code. We read the official code directly (note: it is TensorFlow 1.x, not the newer 2.x). The overall model architecture can be simplified as follows:

class AlbertModel:
    def __init__(self, config, input_ids, input_mask, token_type_ids):
        input_shape = get_shape_list(input_ids, expected_rank=2)
        batch_size, seq_length = input_shape

        # Token Embedding
        # (batch_size, seq_length, embedding_size)
        word_embedding_output = embedding_lookup(
            input_ids=input_ids, vocab_size=config.vocab_size, embedding_size=128
        )

        # Add positional embeddings and token type embeddings,
        # then layer normalize and perform dropout.
        # (batch_size, seq_length, embedding_size)
        embedding_output = embedding_postprocessor(
            input_tensor=word_embedding_output, token_type_ids=token_type_ids,
            token_type_vocab_size=2, max_position_embeddings=512,
            dropout_prob=config.hidden_dropout_prob,  # 0.1, 0.1, 0.1, 0
            token_type_embedding_name="token_type_embeddings",
            position_embedding_name="position_embeddings"
        )

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        # The commented values correspond to base, large, xlarge and xxlarge.
        all_encoder_layers = transformer_model(
            input_tensor=embedding_output, attention_mask=input_mask,
            hidden_size=config.hidden_size,  # 768, 1024, 2048, 4096
            num_hidden_layers=config.num_hidden_layers,  # 12, 24, 24, 12
            num_hidden_groups=1,
            num_attention_heads=config.num_attention_heads,  # 12, 16, 16, 64
            intermediate_size=config.intermediate_size,  # 3072, 4096, 8192, 16384
            inner_group_num=1,
            intermediate_act_fn=get_activation("gelu"),
            hidden_dropout_prob=config.hidden_dropout_prob,  # 0.1, 0.1, 0.1, 0
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,  # as above
            initializer_range=0.02, do_return_all_layers=True, use_einsum=True
        )

        sequence_output = all_encoder_layers[-1]
        # The "pooler" converts the encoded sequence tensor of shape
        # [batch_size, seq_length, hidden_size] to a tensor of shape
        # [batch_size, hidden_size]. This is necessary for segment-level
        # (or segment-pair-level) classification tasks where we need a fixed
        # dimensional representation of the segment.
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained.
        first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)
        pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range)
        )

The model consists of three blocks: Embedding, Transformer, and Pooling. The Embedding block includes the Token Embedding, Position Embedding, and Token Type Embedding; the Transformer block is a stack of (parameter-shared) encoder layers; the Pooling block turns the sequence output into a fixed-size vector by taking the first token's hidden state.

Embedding

The Token Embedding is straightforward, nothing special. One thing worth mentioning is that the embedding tables are all initialized with tf.truncated_normal_initializer(stddev=0.02), which also applies to the position and token type embeddings below. The other thing is the first adjustment: a low-dimensional embedding (128) is used here, and it is projected up to hidden_size once it enters the Transformer.

Then comes the Token Type Embedding, where type_vocab_size equals 2: 0 and 1 simply mark the two sentences, and the label indicates whether the second sentence follows the first. This is BERT's basic setup; see here.

Next is the Position Embedding, which uses absolute position encoding: a table of length max_sequence_len (512) is created first and then sliced according to the actual sequence length.
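To tie the three embeddings together, here is a minimal sketch (my own simplification, not the official embedding_lookup / embedding_postprocessor; the function and variable names are mine) showing the tables created with tf.truncated_normal_initializer(stddev=0.02) and the position table being sliced to the sequence length:

import tensorflow as tf  # TensorFlow 1.x

def simple_embeddings(input_ids, token_type_ids, vocab_size=30000,
                      embedding_size=128, type_vocab_size=2,
                      max_position_embeddings=512):
    """Sketch only: token + token-type + position embeddings (assumes static shapes)."""
    seq_length = input_ids.shape.as_list()[1]
    init = tf.truncated_normal_initializer(stddev=0.02)

    # Token embedding: the low-dimensional table (E=128), the first ALBERT adjustment.
    word_table = tf.get_variable("word_embeddings", [vocab_size, embedding_size],
                                 initializer=init)
    output = tf.nn.embedding_lookup(word_table, input_ids)

    # Token-type embedding: 0/1 marks which of the two segments a token belongs to.
    type_table = tf.get_variable("token_type_embeddings",
                                 [type_vocab_size, embedding_size], initializer=init)
    output += tf.nn.embedding_lookup(type_table, token_type_ids)

    # Absolute position embedding: build the full 512-row table, then slice it
    # to the actual sequence length; broadcasting adds it to every batch item.
    pos_table = tf.get_variable("position_embeddings",
                                [max_position_embeddings, embedding_size],
                                initializer=init)
    output += pos_table[:seq_length]
    return output  # (batch_size, seq_length, embedding_size)

The official embedding_postprocessor additionally applies layer normalization and dropout after the sum, as noted in the code comments above.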

Some personal musings here; see the "Brainstorming" section.

Transformer

The first step is to project the incoming embedding to hidden size (BERT does not need this step because the two sizes are equal):

# (batch_size, seq_length, embedding_size) =>  (batch_size, seq_length, hidden_size)
prev_output = dense_layer_2d(input_tensor, hidden_size)

Then comes the stacked Transformer:

# Simplified loop (defaults: num_hidden_layers=12, num_hidden_groups=1, inner_group_num=1)
for layer_idx in range(num_hidden_layers):
    group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups)
    layer_output = prev_output
    for inner_group_idx in range(inner_group_num):
        layer_output = attention_ffn_block(...)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

# The original code:
def transformer_model(input_tensor,
                      attention_mask,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_hidden_groups=1,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      inner_group_num=1,
                      intermediate_act_fn="gelu",
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      use_einsum=True):
    attention_head_size = hidden_size // num_attention_heads
    input_shape = get_shape_list(input_tensor, expected_rank=3)
    input_width = input_shape[2]

    all_layer_outputs = []

    if input_width != hidden_size:
        # ALBERT's case (the first adjustment)
        prev_output = dense_layer_2d(
            input_tensor, hidden_size, create_initializer(initializer_range),
            None, use_einsum=use_einsum, name="embedding_hidden_mapping_in")
    else:
        # The normal case (e.g. BERT)
        prev_output = input_tensor

    # Cross-layer parameter sharing (the second adjustment)
    with tf.variable_scope("transformer", reuse=tf.AUTO_REUSE):
        for layer_idx in range(num_hidden_layers):
            group_idx = int(layer_idx / num_hidden_layers * num_hidden_groups)
            with tf.variable_scope("group_%d" % group_idx):
                with tf.name_scope("layer_%d" % layer_idx):
                    layer_output = prev_output
                    for inner_group_idx in range(inner_group_num):
                        with tf.variable_scope("inner_group_%d" % inner_group_idx):
                            layer_output = attention_ffn_block(
                                layer_input=layer_output,
                                hidden_size=hidden_size,
                                attention_mask=attention_mask,
                                num_attention_heads=num_attention_heads,
                                attention_head_size=attention_head_size,
                                attention_probs_dropout_prob=attention_probs_dropout_prob,
                                intermediate_size=intermediate_size,
                                intermediate_act_fn=intermediate_act_fn,
                                initializer_range=initializer_range,
                                hidden_dropout_prob=hidden_dropout_prob,
                                use_einsum=use_einsum)
                            prev_output = layer_output
                            all_layer_outputs.append(layer_output)
    if do_return_all_layers:
        return all_layer_outputs
    else:
        return all_layer_outputs[-1]

Since there is no actual grouping here (both group counts are 1), this is simply a stack of 12 attention_ffn_block layers. While reading this I had a question about normalization. When I first studied the Transformer I read The Annotated Transformer, where each block first applies layer norm, passes the result to the attention layer, and then adds a residual connection with the input. I noticed at the time that this differs from the paper's figure but didn't think much of it; later I saw OpenNMT implemented it the same way, so I thought about it even less. Looking back at BERT's implementation this time, however, it matches the figure: the input and the attention layer's output are added first (the residual connection), and layer norm is applied afterwards. If anyone reading this knows the reason, I'd appreciate an explanation.
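For reference, here is a minimal sketch of the two orderings (my own illustration; sublayer stands for the attention or feed-forward sub-layer, and layer_norm / dropout are placeholders). BERT/ALBERT use the first form, The Annotated Transformer and OpenNMT use the second:

# Post-LN (BERT / ALBERT, matching the original paper's figure):
# residual connection first, then layer norm.
def post_ln_block(x, sublayer, layer_norm, dropout):
    return layer_norm(x + dropout(sublayer(x)))

# Pre-LN (The Annotated Transformer / OpenNMT):
# normalize first, then wrap the residual around the sub-layer output.
def pre_ln_block(x, sublayer, layer_norm, dropout):
    return x + dropout(sublayer(layer_norm(x)))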

Next, let's look at attention_ffn_block:

# All default values are from the base configuration
def attention_ffn_block(layer_input,
                        attention_mask,
                        hidden_size=768,
                        num_attention_heads=12,
                        attention_head_size=64,
                        attention_probs_dropout_prob=0.1,
                        intermediate_size=3072,
                        intermediate_act_fn="gelu",
                        initializer_range=0.02,
                        hidden_dropout_prob=0.1,
                        use_einsum=True):
    with tf.variable_scope("attention_1"):
        with tf.variable_scope("self"):
            attention_output = attention_layer(
                from_tensor=layer_input,
                to_tensor=layer_input,
                attention_mask=attention_mask,
                num_attention_heads=num_attention_heads,
                attention_probs_dropout_prob=attention_probs_dropout_prob,
                initializer_range=initializer_range,
                use_einsum=use_einsum
            )
        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
            attention_output = dense_layer_3d_proj(
                attention_output,
                hidden_size,
                attention_head_size,
                create_initializer(initializer_range),
                None,
                use_einsum=use_einsum,
                name="dense"
            )
        attention_output = dropout(attention_output, hidden_dropout_prob)
    attention_output = layer_norm(attention_output + layer_input)

    with tf.variable_scope("ffn_1"):
        with tf.variable_scope("intermediate"):
            intermediate_output = dense_layer_2d(
                attention_output,
                intermediate_size,
                create_initializer(initializer_range),
                intermediate_act_fn,
                use_einsum=use_einsum,
                num_attention_heads=num_attention_heads,
                name="dense")
        with tf.variable_scope("output"):
            ffn_output = dense_layer_2d(
                intermediate_output,
                hidden_size,
                create_initializer(initializer_range),
                None,
                use_einsum=use_einsum,
                num_attention_heads=num_attention_heads,
                name="dense")
        ffn_output = dropout(ffn_output, hidden_dropout_prob)
    ffn_output = layer_norm(ffn_output + attention_output)
    return ffn_output

# Simplified version (pseudocode) with the parameter-sharing variable scopes stripped out
def attention_ffn_block(...):
    # Multi-Head Attention
    attention_output = attention_layer(...)
    attention_output = dense_layer_3d_proj(...)
    attention_output = dropout(attention_output, hidden_dropout_prob)
    attention_output = layer_norm(attention_output + layer_input)

    # Feed Forward
    intermediate_output = dense_layer_2d(...)
    ffn_output = dense_layer_2d(...)
    ffn_output = dropout(ffn_output, hidden_dropout_prob)
    ffn_output = layer_norm(ffn_output + attention_output)

    return ffn_output

This is one block of the Transformer (more precisely, of the encoder). Comparing it with the original BERT code, the ALBERT version pulls the block out into its own function (which also makes parameter sharing convenient) and factors the dense connections out into separate functions (dense_layer_xd), so the code reads more clearly. dense_layer_2d outputs a tensor of shape (batch_size, seq_length, hidden_size), while dense_layer_3d outputs (batch_size, seq_length, num_heads, head_size).
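To make the shape difference concrete, here is a rough einsum-style sketch in the spirit of dense_layer_2d / dense_layer_3d (my own simplification, not the official functions; the kernel shapes and names are assumptions):

import tensorflow as tf  # TensorFlow 1.x

def my_dense_2d(x, output_size):
    """(B, F, H_in) -> (B, F, output_size): an ordinary dense projection."""
    h_in = x.shape.as_list()[-1]
    w = tf.get_variable("kernel_2d", [h_in, output_size],
                        initializer=tf.truncated_normal_initializer(stddev=0.02))
    return tf.einsum("BFH,HO->BFO", x, w)

def my_dense_3d(x, num_heads, head_size):
    """(B, F, H_in) -> (B, F, num_heads, head_size): per-head projection for q/k/v."""
    h_in = x.shape.as_list()[-1]
    w = tf.get_variable("kernel_3d", [h_in, num_heads, head_size],
                        initializer=tf.truncated_normal_initializer(stddev=0.02))
    return tf.einsum("BFH,HND->BFND", x, w)

The 3D variant keeps a separate head_size-dimensional projection per head, which is exactly the shape the q/k/v projections in attention_layer below expect.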

Next comes the attention layer, i.e., self-attention, the Transformer encoder's attention, where q, k and v all come from the previous layer's output (you can refer to the code here):

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask,
                    num_attention_heads=12,
                    attention_probs_dropout_prob=0.1,
                    initializer_range=0.02,
                    use_einsum=True):

    # (batch_size, seq_length, hidden_size)
    from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
    to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
    # 768 / 12 = 64
    size_per_head = int(from_shape[2] / num_attention_heads)

    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]

    # `q, k, v` = [B, F, N, H]
    # from_tensor == to_tensor == prev_layer_output
    q = dense_layer_3d(from_tensor, num_attention_heads, size_per_head)
    k = dense_layer_3d(to_tensor, num_attention_heads, size_per_head)
    v = dense_layer_3d(to_tensor, num_attention_heads, size_per_head)

    # [B, F, N, H] -> [B, N, F, H]
    q = tf.transpose(q, [0, 2, 1, 3])
    k = tf.transpose(k, [0, 2, 1, 3])
    v = tf.transpose(v, [0, 2, 1, 3])

    if attention_mask is not None:
        attention_mask = tf.reshape(attention_mask, [batch_size, 1, to_seq_length, 1])

    # `new_embeddings` = [B, N, F, H]
    new_embeddings = dot_product_attention(
        q, k, v, attention_mask, attention_probs_dropout_prob)
    # [B, N, F, H] -> [B, F, N, H]
    return tf.transpose(new_embeddings, [0, 2, 1, 3])

Let's also briefly review the attention computation. I have written related notes before (Bahdanau Attention, Luong Attention), and my Transformer notes also covered OpenNMT's implementation; feel free to refer to those. The implementation below is from [Attention is All You Need](Attention Is All You Need Note | Yam).

def dot_product_attention(q, k, v, mask, dropout_rate=0.1):
    # logits: (batch_size, num_heads, q_length, kv_length)
    # softmax(q·k^T / sqrt(d_k)) · v, with d_k = size_per_head
    logits = tf.matmul(q, k, transpose_b=True)
    logits = tf.multiply(logits, 1.0 / math.sqrt(float(get_shape_list(q)[-1])))
    if mask is not None:
        # `mask` = [B, 1, T, 1] (reshaped in attention_layer above)
        from_shape = get_shape_list(q)
        broadcast_ones = tf.ones([from_shape[0], 1, from_shape[2], 1], tf.float32)
        mask = tf.matmul(broadcast_ones,
                         tf.cast(mask, tf.float32), transpose_b=True)

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        adder = (1.0 - mask) * -10000.0

        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        logits += adder
    else:
        adder = 0.0

    attention_probs = tf.nn.softmax(logits, name="attention_probs")
    attention_probs = dropout(attention_probs, dropout_rate)
    return tf.matmul(attention_probs, v)

That wraps up the Transformer. Although the source looks long, it is not hard to read; overall it is very clear and fluent.

Pooling

This part is very simple:

first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
# (batch_size, hidden_size)
pooled_output = tf.layers.dense(
    first_token_tensor,
    config.hidden_size,
    activation=tf.tanh,
    kernel_initializer=create_initializer(config.initializer_range))

The first_token here is the sentence marker [CLS]. In this model its pooled representation is used to predict a 0/1 label: 0 means the two segments are in their original, coherent order, and 1 means their order was swapped. The pooling is done so the loss can be computed next (only the [CLS] position's output is needed).

That covers all of the model and algorithm. A quick summary:

  • The model is still based on BERT, i.e., the Transformer encoder architecture.
  • Three adjustments are made to the model: factorized embedding, cross-layer parameter sharing, and SOP instead of NSP (for the code, see "How to Start Training").

Highlights and Innovations

  • Factorized embedding parameters
  • SOP instead of NSP
  • Showing that dropout hurts Transformer-based models

How

How the Data Is Constructed

The main difference is the use of n-gram masking: n-grams are randomly chosen to be masked, with n at most 3, following the distribution:

p(n) = \frac{1/n}{\sum_{k=1}^{N} 1/k}

When n = 1, a single word is masked.

# Probabilities of choosing a unigram, bigram, or trigram
pvals = 1. / np.arange(1, 3 + 1)
pvals /= pvals.sum(keepdims=True)
# pvals = array([0.54545455, 0.27272727, 0.18181818])

The preprocessing code is fairly tedious, so here is a slightly simplified version, using word-level tokens as the example:

import collections
import random

import numpy as np

# MaskedLmInstance is a namedtuple in the original create_pretraining_data.py
MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])


def create_masked_lm_predictions(
        tokens,
        masked_lm_prob=0.15,
        max_predictions_per_seq=20,
        vocab_words=list(tokenizer.vocab.keys()),
        rng=random.Random(12345)):
    cand_indexes = []
    for (i, token) in enumerate(tokens):
        if token == "[CLS]" or token == "[SEP]":
            continue
        cand_indexes.append([i])

    output_tokens = list(tokens)
    masked_lm_positions = []
    masked_lm_labels = []
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))

    print("num_to_predict: ", num_to_predict)
    # Probabilities of the different n-grams
    ngrams = np.arange(1, 3 + 1, dtype=np.int64)
    pvals = 1. / np.arange(1, 3 + 1)
    pvals /= pvals.sum(keepdims=True)

    # For each token, the three candidate n-grams starting at it
    ngram_indexes = []
    for idx in range(len(cand_indexes)):
        ngram_index = []
        for n in ngrams:
            ngram_index.append(cand_indexes[idx:idx + n])
        ngram_indexes.append(ngram_index)
    rng.shuffle(ngram_indexes)

    masked_lms = []
    # Collect the masked tokens.
    # cand_index_set is simply the three n-grams for one token,
    # e.g. [[[13]], [[13], [14]], [[13], [14], [15]]]
    for cand_index_set in ngram_indexes:
        if len(masked_lms) >= num_to_predict:
            break
        # Choose n according to p(n), truncated to len(cand_index_set) and renormalized
        n = np.random.choice(
            ngrams[:len(cand_index_set)],
            p=pvals[:len(cand_index_set)] / pvals[:len(cand_index_set)].sum(keepdims=True))
        # [16, 17] = sum([[16], [17]], [])
        index_set = sum(cand_index_set[n - 1], [])
        # For each chosen index: 80% [MASK], 10% keep the original token,
        # 10% replace with a random token
        for index in index_set:
            masked_token = None
            if rng.random() < 0.8:
                masked_token = "[MASK]"
            else:
                if rng.random() < 0.5:
                    masked_token = tokens[index]
                else:
                    masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
            output_tokens[index] = masked_token
            masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

    masked_lms = sorted(masked_lms, key=lambda x: x.index)
    for p in masked_lms:
        masked_lm_positions.append(p.index)
        masked_lm_labels.append(p.label)
    return (output_tokens, masked_lm_positions, masked_lm_labels)

This is quite simplified, but it is enough to show the idea: n-grams are selected as prediction targets according to the given p(n). Besides handling the WordPiece case, the original code also takes care of other details, such as not exceeding num_to_predict and not processing the same position twice. It reminds me of a line from Chen Hao: "The devil is in the details."

Finally, an example of input and output:

print(" ".join(tokens))
"""
[CLS] Text should be one-sentence-per-line, with empty lines between documents. [SEP] This sample text is public domain and was randomly selected from Project Guttenberg. [SEP]
"""
print(" ".join(output_tokens))
"""
[CLS] Text should be one-sentence-per-line, with empty lines between [MASK] [SEP] [MASK] [MASK] text is public domain and was randomly 屿 from Project Guttenberg. [SEP]
"""
masked_lm_positions  # [9, 11, 12, 20]
masked_lm_labels  # ['documents.', 'This', 'sample', 'selected']

How to Start Training

The main thing to mention here is the SOP (Sentence Order Prediction) loss; the rest is similar to BERT. The idea was described above; the code is as follows:

def get_sentence_order_output(pooled_output, labels):
    # Simple binary classification. Note that 0 is "next sentence" and 1 is
    # "random sentence". This weight matrix is not used after pre-training.
    with tf.variable_scope("cls/seq_relationship"):
        output_weights = tf.get_variable(
            "output_weights",
            shape=[2, 768],
            initializer=create_initializer(0.02))
        output_bias = tf.get_variable(
            "output_bias",
            shape=[2],
            initializer=tf.zeros_initializer())

        # (batch_size, hidden_size) * (2, hidden_size)^T
        # => (batch_size, 2), where 2 is the number of classes
        logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        labels = tf.reshape(labels, [-1])
        # (batch_size, 2)
        one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, per_example_loss, log_probs)

So it is really just a (binary) classification. The MLM loss is the same as in BERT, so I won't repeat it here.
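As a complement to the loss above, here is a rough sketch of how SOP training pairs can be built on the data side (my own simplification of the idea described in "Model and Algorithm", not the actual create_pretraining_data.py code; the function name is mine):

import random

def make_sop_pair(segment_a, segment_b, rng=random.Random(12345)):
    """Hypothetical sketch: build one SOP example from two consecutive segments.

    Returns (tokens_a, tokens_b, label), where label 0 means the original
    order and label 1 means the swapped order.
    """
    if rng.random() < 0.5:
        # Positive: keep the two consecutive segments in their original order.
        return segment_a, segment_b, 0
    else:
        # Negative: the same two segments, order swapped.
        return segment_b, segment_a, 1

Positives keep two consecutive segments from the same document in order; negatives reuse the same two segments with the order swapped, which is what forces the model to learn coherence rather than topic.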

How to Use the Results

The same as BERT.

Data and Experiments

Layer-wise input/output similarity, corresponding to the second adjustment in "Model and Algorithm":

Parameter Counts

RoBERTa's parameter counts can be found here; DistilBERT has 66M parameters.

BERT vs. ALBERT

Speedup measures training time, with BERT-large as the baseline: ALBERT-large is 1.7x faster, while xxlarge is 3x slower than BERT-large.

Effect of Embedding Size

Effect of Cross-layer Parameter Sharing

NSP vs. SOP

Dropout

Performance on the various tasks can be consulted here.

Discussion

Related Work

Scaling Up Language Representation Models

  • Language representations are useful; the most notable shift in recent years is from word or contextual representations to pre-training a full network and then fine-tuning it on downstream tasks.
    • Word representations: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
    • Sentence representations: Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st ICML, Beijing, China, 2014.
    • Word representations: Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. EMNLP 2014.
    • Downstream tasks: Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087, 2015.
    • Contextual: Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. NIPS 2017.
    • Contextual: Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. ACL 2018.
    • Downstream tasks: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
    • Downstream tasks: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ACL 2019.
    • Bigger models, better results: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Existing approaches:
    • Trading compute for memory: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
    • Trading compute for memory: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.
    • Parallel training: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Cross-layer Parameter Sharing

Focused on the encoder-decoder architecture: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Cross-layer sharing works better (inconsistent with the observations in this paper): Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

DQEs: input and output can reach an equilibrium point (inconsistent with the observations in this paper): Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Neural Information Processing Systems (NeurIPS), 2019.

Adding a Transformer with shared parameters: Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. Modeling recurrence for transformer. NAACL 2019.

Sentence Ordering

  • Coherence and cohesion:

    • Jerry R. Hobbs. Coherence and coreference. Cognitive Science, 3(1):67–90, 1979.
    • M.A.K. Halliday and Ruqaiya Hasan. Cohesion in English. Routledge, 1976.
    • Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225, 1995.
  • Predicting words in neighboring sentences from a given sentence:

    • Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. NIPS 2015.
    • Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. ACL 2016.
  • Predicting future sentences: Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. Learning generic sentence representations using convolutional neural networks. ACL 2017.

  • Predicting discourse markers:

    • Yacine Jernite, Samuel R Bowman, and David Sontag. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557, 2017. (similar to this paper)
    • Allen Nie, Erin Bennett, and Noah Goodman. DisSent: Learning sentence representations from explicit discourse relations. ACL 2019.
  • Predicting whether the second segment of a pair has been replaced by a sentence from another document: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ACL 2019. (BERT)

  • Combining sentence-order prediction into a three-way classification task: Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577, 2019.

Dropout

Using batch normalization and dropout together in a CNN can lead to worse results:

  • Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019.

Speeding Up ALBERT Training and Inference

  • Sparse attention: Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  • Block attention: Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self-attention for fast and memory-efficient sequence modeling. arXiv preprint arXiv:1804.00857, 2018.

Brainstorming

Time for some free discussion. First, I have to say that Google's code is really solid: plain, straightforward, and clearly commented. I remember Zhang Junlin once commented on BERT in a Zhihu article about pre-training, saying that it is not particularly innovative as a model, but that it solves the problem in a natural, simple, elegant, and effective way. I think that is exactly what the "engineering" in "engineer" is worth, and it is something we should pay attention to and learn from: most people cannot help turning something that was originally simple into something complicated. If research needs it, chasing one or two percentage points that way is fine, but in engineering it is rarely necessary; it is twice the effort for half the result. The more I read Google-style code, the more I like it, to the point that I now basically only follow Google's latest research. Another reason, of course, is that the major NLP breakthroughs of recent years (Word2Vec, Transformer, BERT) have basically all involved Google; that is the combined result of algorithms, engineering, and data, and should not come as a surprise. Facebook, for its part, always follows up promptly with more usable, better-optimized products (FastText, RoBERTa), though its real strengths are recommendation and video. Since my own work is mostly NLP these days, I naturally pay more attention to Google.

Now to ramble about the paper itself. Its results are close to Facebook's RoBERTa, but with considerably fewer parameters (the first two changes are, after all, parameter-reduction techniques), and except for xxlarge it has fewer parameters than DistilBERT (and it is much simpler to use; personally, I don't think distillation is an elegant method). So in practice you can weigh ALBERT against RoBERTa; for Chinese, where wwm (Whole Word Masking) matters, only RoBERTa is currently open-sourced.

Another interesting point is that ALBERT uses n-gram masking when constructing the data, which is somewhat similar to wwm and can be counted as an improvement. It does not adopt RoBERTa's dynamic masking (perhaps because the gain is small; it did swap NSP for SOP, since RoBERTa showed NSP to be useless and simply dropped it), nor DistilBERT's token-probability masking (which makes mask selection pay more attention to low-frequency words, smoothing the sampling of masks; I find it a very sensible idea). Probably the papers were just too close in time: checking the first submissions on arXiv, they are indeed only 6 days apart. So maybe the next innovation is to add that in? Or perhaps it already exists and I simply haven't noticed.

Appendix