Study notes: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

程序员文章站 2024-03-21 08:07:04


T-TA

Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

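This ACL 2020 paper improves on BERT's pre-training task, masked language modeling (MLM). In standard BERT, MLM masks only 15% of the tokens at a time, so each training step produces a prediction signal for only a small fraction of the sequence, which makes training inefficient. The paper modifies the Q, K, and V inputs of the transformer so that every token can be predicted in a single pass during training, with no extra [MASK] symbol, which also keeps pre-training and fine-tuning consistent.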

Model overview

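To address the inefficiency of MLM, the paper proposes a new pre-training task, LAE (language autoencoding), whose goal is to predict every token of a text sequence at once, so the same sequence serves as both input and output during training. For this task the language model must prevent input information from leaking to the output (since both are the same text) while also avoiding overfitting. Otherwise the model simply outputs a copy of its input representation and learns nothing about the statistics of the language.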

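To achieve this, the authors modify the original transformer as follows:

Remove the token input from Q: the Q of the first attention layer must not contain token information, only the position vectors.
Remove the token leak through K and V: in self-attention, mask the diagonal of the attention_scores matrix.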
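The two modifications can be checked numerically for a single layer. The following is a minimal NumPy sketch, not the paper's implementation: it assumes one attention head, no learned projection matrices, and random stand-ins for the embeddings, and the names `masked_attention`, `E`, `P`, and `diag_mask` are illustrative only. It changes one token and checks that the output at that position does not move.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention; mask[i, j] == 0 blocks i -> j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)          # blocked pairs get ~zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
E = rng.normal(size=(seq_len, dim))    # token embeddings (random stand-ins)
P = rng.normal(size=(seq_len, dim))    # position vectors (random stand-ins)
diag_mask = 1.0 - np.eye(seq_len)      # position i may not attend to itself

# Q holds positions only; K and V hold tokens plus positions.
out = masked_attention(P, E + P, E + P, diag_mask)

# Change only the token at position 2 and recompute.
E_changed = E.copy()
E_changed[2] += 1.0
out_changed = masked_attention(P, E_changed + P, E_changed + P, diag_mask)

row2_unchanged = np.allclose(out[2], out_changed[2])
other_rows_changed = not np.allclose(out[0], out_changed[0])
print(row2_unchanged, other_rows_changed)
```

In this sketch the output at position 2 is built only from the other positions, so it cannot copy its own token, while the outputs at the other positions still see token 2 as context.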

Unless we give up the shortcut connections or use only a single-layer network, the diagonal masking above loses its effect.

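In the later layers, $K_{i+1}, V_{i+1}$ are produced from the previous layer's output $Q_i$, and $Q_i$ is itself a weighted sum of $V_i$. So even if the attention_scores matrix in layer $i+1$ is diagonally masked, token information still leaks, because the off-diagonal entries also carry information about the token that was masked in the first layer.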
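The leak can be reproduced with two stacked layers. This is a hedged NumPy sketch under the same simplifying assumptions as the rest of these notes (one head, no projections, no residuals or FFN, random embeddings); `naive_two_layers` is a hypothetical helper, not code from the paper. The second layer takes its K and V from the first layer's output, and both layers use the diagonal mask.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention; mask[i, j] == 0 blocks i -> j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

def naive_two_layers(E, P, mask):
    """Layer 1 uses Q = P; layer 2 reuses layer 1's output as Q, K and V."""
    q1 = masked_attention(P, E + P, E + P, mask)
    q2 = masked_attention(q1, q1, q1, mask)
    return q1, q2

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
E = rng.normal(size=(seq_len, dim))
P = rng.normal(size=(seq_len, dim))
diag_mask = 1.0 - np.eye(seq_len)

q1_a, q2_a = naive_two_layers(E, P, diag_mask)

E_changed = E.copy()
E_changed[2] += 1.0                      # change only the token at position 2
q1_b, q2_b = naive_two_layers(E_changed, P, diag_mask)

layer1_leaks = not np.allclose(q1_a[2], q1_b[2])
layer2_leaks = not np.allclose(q2_a[2], q2_b[2])
print(layer1_leaks, layer2_leaks)
```

Here the first layer's output at position 2 is unaffected by token 2, but the second layer's output at position 2 changes: positions 0, 1, 3, and 4 of the first layer already absorbed token 2, and the second layer reads them back.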


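The paper proposes a simple and effective fix: every attention layer shares the K and V generated for the first layer. Let E be the sequence of token embeddings and P the corresponding position vectors; the computations of BERT and T-TA can then be abbreviated as: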
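$\begin{aligned} Q_0 &= E + P \\ Q_1 &= \mathrm{Attention}(Q_0, Q_0, Q_0) \\ Q_2 &= \mathrm{Attention}(Q_1, Q_1, Q_1) \\ &\;\;\vdots \\ Q_n &= \mathrm{Attention}(Q_{n-1}, Q_{n-1}, Q_{n-1}) \\ &\text{BERT computation (schematic)} \end{aligned}$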

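$\begin{aligned} Q_0 &= P \\ Q_1 &= \mathrm{Attention}(Q_0, E + P, E + P) \\ Q_2 &= \mathrm{Attention}(Q_1, E + P, E + P) \\ &\;\;\vdots \\ Q_n &= \mathrm{Attention}(Q_{n-1}, E + P, E + P) \\ &\text{T-TA computation (schematic)} \end{aligned}$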

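The formulas above omit details such as the residual connections and the FFN and keep only the core computation. During pre-training, T-TA's attention uses the diagonal attention mask; for fine-tuning on downstream tasks the mask can be removed.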
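The T-TA recurrence can be sketched in a few lines of NumPy. As with the formulas, this sketch drops residuals, the FFN, multiple heads, and the learned projections, so it only illustrates the data flow; `tta_stack` and the all-ones `full_mask` used for the fine-tuning case are assumptions made for illustration.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention; mask[i, j] == 0 blocks i -> j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v

def tta_stack(E, P, n_layers, mask):
    """Q_0 = P; Q_l = Attention(Q_{l-1}, E + P, E + P) for l = 1..n_layers."""
    kv = E + P                 # the same K and V are shared by every layer
    q = P                      # the query stream starts from positions only
    for _ in range(n_layers):
        q = masked_attention(q, kv, kv, mask)
    return q

rng = np.random.default_rng(0)
seq_len, dim, n_layers = 5, 8, 4
E = rng.normal(size=(seq_len, dim))
P = rng.normal(size=(seq_len, dim))
E_changed = E.copy()
E_changed[2] += 1.0                            # change only the token at position 2

diag_mask = 1.0 - np.eye(seq_len)              # pre-training: self-blind
full_mask = np.ones((seq_len, seq_len))        # fine-tuning: mask removed

pretrain_leaks = not np.allclose(
    tta_stack(E, P, n_layers, diag_mask)[2],
    tta_stack(E_changed, P, n_layers, diag_mask)[2])
finetune_sees_self = not np.allclose(
    tta_stack(E, P, n_layers, full_mask)[2],
    tta_stack(E_changed, P, n_layers, full_mask)[2])
print(pretrain_leaks, finetune_sees_self)
```

Because K and V never pass through an earlier attention layer, position 2 stays blind to its own token at every depth under the diagonal mask, and sees it again once the mask is replaced by all ones.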

Code walkthrough

Diagonal masking

Generating the diagonal mask matrix

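# Standard BERT padding mask, shape [batch_size, seq_length, seq_length].
attention_mask = create_attention_mask_from_input_mask(
    input_ids, input_mask)
# band_part(x, 0, 0) keeps only the main diagonal of each matrix, so after the
# subtraction the diagonal entry of every real token is 0 (no self-attention).
attention_mask = 1.0 - tf.linalg.band_part(attention_mask, 0, 0)  # self-blind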

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  """Create 3D attention mask from a 2D tensor mask.
  Args:
    from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
    to_mask: int32 Tensor of shape [batch_size, to_seq_length].
  Returns:
    float Tensor of shape [batch_size, from_seq_length, to_seq_length].
  """
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  to_mask = tf.cast(
      tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # We don't assume that `from_tensor` is a mask (although it could be). We
  # don't actually care if we attend *from* padding tokens (only *to* padding)
  # tokens so we create a tensor of all ones.
  #
  # `broadcast_ones` = [batch_size, from_seq_length, 1]
  broadcast_ones = tf.ones(
      shape=[batch_size, from_seq_length, 1], dtype=tf.float32)

  # Here we broadcast along two dimensions to create the mask.
  mask = broadcast_ones * to_mask

  return mask
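To see what the two mask lines produce, here is a small NumPy re-creation for one sequence of length 4 whose last position is padding. It is a sketch rather than the repository code: `np.eye` stands in for `tf.linalg.band_part(x, 0, 0)`, and `padding_aware` is a hypothetical variant that does not appear in the source.

```python
import numpy as np

input_mask = np.array([[1, 1, 1, 0]], dtype=np.float32)   # last position is padding
batch_size, seq_len = input_mask.shape

# Equivalent of create_attention_mask_from_input_mask: mask[b, i, j] = input_mask[b, j]
broadcast_ones = np.ones((batch_size, seq_len, 1), dtype=np.float32)
attention_mask = broadcast_ones * input_mask[:, None, :]

# Equivalent of tf.linalg.band_part(attention_mask, 0, 0): keep the main diagonal only
eye = np.eye(seq_len, dtype=np.float32)
diagonal_only = attention_mask * eye

# The self-blind mask as written in the excerpt
self_blind = 1.0 - diagonal_only

# Hypothetical variant (not in the source): zero the diagonal but keep padding blocked
padding_aware = attention_mask * (1.0 - eye)

print(self_blind[0])
print(padding_aware[0])
```

Under these assumptions, `self_blind` has zeros on the diagonal for the three real tokens but ones everywhere else, including the padding column, so the padding information from `input_mask` is no longer visible in the mask. The `padding_aware` variant keeps that column at zero; whether this matters in practice depends on how padding is handled elsewhere in the model.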

The attention code is identical to the original BERT code; the main change is in the transformer part, which passes in the diagonal mask.

attention_head = attention_layer(
              from_tensor=layer_input_q,
              to_tensor=layer_input_kv,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)

Reference:

修改Transformer结构，设计一个更快更好的MLM模型 (Modifying the Transformer architecture to design a faster and better MLM model)
