
Implementing InferSent

程序员文章站 2024-03-15 19:24:30

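I recently found the time to finish a new GitHub project: InferSent. An earlier post introduced InferSent itself. I implemented it for two reasons: the algorithm is simple, and its performance on a range of NLP tasks is comparable to other state-of-the-art models.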

The model structure of InferSent is as follows:
[Figure: InferSent model structure]

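InferSent trains sentence embeddings on the NLI task, using the SNLI dataset. The dataset was covered in an earlier post, so I will not repeat the details here. The premise and hypothesis sentences share the same sentence encoder. The paper experimented with LSTM and GRU, BiLSTM with mean/max pooling, a self-attentive network, and a hierarchical ConvNet. Its conclusion is that BiLSTM with max pooling gives the best overall results in transfer learning.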

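I implemented two encoders in the code: DAN (deep averaging network) and BiLSTM with max pooling. I implemented the latter because it performs best as an encoder, and DAN only because it is simple and can serve as a baseline.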
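To make the baseline concrete, here is a minimal sketch of the idea behind a deep averaging network: average the word vectors of a sentence, then pass the average through feed-forward layers. It is plain Python rather than the TensorFlow code from the project. The names `average_vectors`, `dense_tanh`, and `dan_encode`, the tanh activation, and the toy weights are all assumptions made for illustration, not the author's implementation.

```python
import math


def average_vectors(word_vectors):
    """Average a list of equal-length word vectors element-wise."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def dense_tanh(vector, weights, bias):
    """One feed-forward layer: tanh(vector @ weights + bias).

    weights is a nested list of shape [len(vector)][len(bias)].
    """
    return [
        math.tanh(sum(x * row[j] for x, row in zip(vector, weights)) + bias[j])
        for j in range(len(bias))
    ]


def dan_encode(word_vectors, layers):
    """Hypothetical DAN encoder: average the words, then apply each layer."""
    encoded = average_vectors(word_vectors)
    for weights, bias in layers:
        encoded = dense_tanh(encoded, weights, bias)
    return encoded


# Toy example: a 3-word sentence with 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# One layer mapping 2 dimensions to 3 (weights chosen arbitrarily).
layers = [([[0.5, -0.5, 0.0], [0.5, 0.5, 0.0]], [0.0, 0.0, 0.0])]
embedding = dan_encode(sentence, layers)
print(len(embedding))  # 3
```

Because the average ignores word order, this kind of encoder is cheap to train, which is what makes it a convenient baseline next to the BiLSTM.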

import tensorflow as tf  # written against the TensorFlow 1.x API


def bilstm_as_encoder(sent_padded_as_tensor, word_embeddings,
                      layer_size, hidden_size=100, sent_length=50,
                      embedding_size=300):
    # Map token ids to word vectors: [batch, sent_length, embedding_size].
    embed_input = tf.nn.embedding_lookup(word_embeddings,
                                         sent_padded_as_tensor)
    print("sent_padded_as_tensor: " + str(sent_padded_as_tensor))
    print("embed_input: " + str(embed_input))

    # One LSTM cell per direction.
    cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    print('build fw cell: ' + str(cell_fw))
    cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    print('build bw cell: ' + str(cell_bw))

    # rnn_outputs is a (forward, backward) tuple of per-step outputs.
    rnn_outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
                                                     inputs=embed_input,
                                                     dtype=tf.float32)
    print('rnn outputs: ' + str(rnn_outputs))

    # Join both directions: [batch, sent_length, 2 * hidden_size].
    concatenated_rnn_outputs = tf.concat(rnn_outputs, 2)
    print('concatenated rnn outputs: ' + str(concatenated_rnn_outputs))

    # Pool over the whole sentence: [batch, 1, 2 * hidden_size].
    max_pooled = tf.layers.max_pooling1d(concatenated_rnn_outputs,
                                         sent_length, strides=1)
    print('max_pooled: ' + str(max_pooled))

    # Drop the singleton time axis: [batch, 2 * hidden_size].
    max_pooled_formated = tf.reshape(max_pooled, [-1, 2 * hidden_size])
    print('max_pooled_formated: ' + str(max_pooled_formated))

    # Linear projection to the sentence embedding size, layer_size[0].
    w1 = tf.get_variable(name="w1", dtype=tf.float32,
                         shape=[2 * hidden_size, layer_size[0]])
    b1 = tf.get_variable(name="b1", dtype=tf.float32,
                         shape=[layer_size[0]])
    encoded = tf.matmul(max_pooled_formated, w1) + b1

    return encoded

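The code above is the encoder for BiLSTM with max pooling. The drawback of a BiLSTM is that it is fairly slow. Speeding it up a little is easy: just replace cell_fw and cell_bw with GRU cells. I tried this, and training got faster while the accuracy gap stayed small. The word_embeddings used here are the 300-dimensional GloVe vectors trained on 840B tokens, the same ones the paper uses. I set the hidden size of the LSTM cell to 100. A larger value would probably work better, but limited resources forced me to compromise. If you have the resources, I strongly recommend experimenting with larger hidden vectors. The layer_size argument in the code holds the dimension of the final encoded sentence (the code reads it from layer_size[0]); in my experiments I used 512, as the paper suggests.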
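The max pooling step in the encoder uses a pool size equal to `sent_length`, so it collapses the whole time axis into one vector: each output dimension is the largest value that dimension takes across all time steps. Below is a minimal plain-Python sketch of that operation for a single sentence. The function name `max_pool_over_time` and the toy numbers are assumptions for illustration.

```python
def max_pool_over_time(hidden_states):
    """Take the element-wise maximum over the time axis.

    hidden_states is a list of per-token vectors (time steps x features),
    standing in for the concatenated forward/backward BiLSTM outputs of
    one sentence.
    """
    return [max(feature_over_time) for feature_over_time in zip(*hidden_states)]


# Three time steps, four features (e.g. 2 * hidden_size with hidden_size = 2).
states = [
    [0.1, -0.3, 0.5, 0.0],
    [0.4, -0.1, 0.2, 0.9],
    [0.2, -0.7, 0.6, 0.3],
]
sentence_vector = max_pool_over_time(states)
print(sentence_vector)  # [0.4, -0.1, 0.6, 0.9]
```

The result has the same length no matter how many time steps go in, which is why the encoder can follow it with a fixed-size projection. In the encoder as written, padded positions also take part in the maximum, since no sequence lengths are passed to the RNN.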

import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.x API


def build_graph(
    inputs1,
    inputs2,
    emb_matrix,
    encoder,
    embedding_size=300,
    layer_size=None,
    nclasses=3
    ):

    print(" input1 shape: " + str(inputs1.shape))
    print(" input2 shape: " + str(inputs2.shape))

    # The pretrained embedding matrix becomes a constant tensor.
    word_embeddings = tf.convert_to_tensor(emb_matrix, np.float32)
    print("word_embeddings shape: " + str(word_embeddings.shape))
    print(word_embeddings)

    # the encoders: both inputs go through the same encoder variables
    with tf.variable_scope("encoder_vars") as encoder_scope:
        encoded_input1 = encoder(inputs1, word_embeddings, layer_size)
        encoder_scope.reuse_variables()
        encoded_input2 = encoder(inputs2, word_embeddings, layer_size)
        print("encoded inputs1 shape: " + str(encoded_input1.shape))
        print("encoded inputs2 shape: " + str(encoded_input2.shape))

    # element-wise absolute difference of the two sentence vectors
    abs_diffed = tf.abs(tf.subtract(encoded_input1, encoded_input2))
    print(abs_diffed)

    # element-wise product of the two sentence vectors
    multiplied = tf.multiply(encoded_input1, encoded_input2)
    print(multiplied)
    concatenated = tf.concat([encoded_input1, encoded_input2,
                              abs_diffed, multiplied], 1)
    print(concatenated)
    concatenated_dim = concatenated.shape.as_list()[1]

    # the fully-connected dnn layer
    # fix it as 512
    fully_connected_layer_size = 512
    with tf.variable_scope("dnn_vars"):
        wd = tf.get_variable(name="wd", dtype=tf.float32,
                             shape=[concatenated_dim, fully_connected_layer_size])
        bd = tf.get_variable(name="bd", dtype=tf.float32,
                             shape=[fully_connected_layer_size])
    dnned = tf.matmul(concatenated, wd) + bd
    print(dnned)

    # the output layer: one logit per class
    with tf.variable_scope("out"):
        w_out = tf.get_variable(name="w_out", dtype=tf.float32,
                                shape=[fully_connected_layer_size, nclasses])
        b_out = tf.get_variable(name="b_out", dtype=tf.float32,
                                shape=[nclasses])
    logits = tf.matmul(dnned, w_out) + b_out

    return logits

The code above builds the computation graph. It uses the encoder from the first snippet to complete the rest of the model.
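The part of the graph that follows the encoders combines the two sentence vectors into one feature vector: the two vectors themselves, their element-wise absolute difference, and their element-wise product. Below is a minimal plain-Python sketch of that combination. The function name `combine_features`, the names `u` and `v`, and the toy vectors are assumptions for illustration; the project itself does this with `tf.abs`, `tf.multiply`, and `tf.concat`.

```python
def combine_features(u, v):
    """Build [u, v, |u - v|, u * v] from two equal-length sentence vectors."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same length")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


u = [1.0, -2.0, 0.5]   # stands in for the encoded premise
v = [0.5, 2.0, 0.5]    # stands in for the encoded hypothesis
features = combine_features(u, v)
print(features)
# [1.0, -2.0, 0.5, 0.5, 2.0, 0.5, 0.5, 4.0, 0.0, 0.5, -4.0, 0.25]
```

The combined vector is four times as long as one sentence vector, so 512-dimensional sentence embeddings give the fully connected layer a 2048-dimensional input.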
Note this piece of code in particular:

    # the encoders
    with tf.variable_scope("encoder_vars") as encoder_scope:
        encoded_input1 = encoder(inputs1, word_embeddings, layer_size)
        encoder_scope.reuse_variables()
        encoded_input2 = encoder(inputs2, word_embeddings, layer_size)
        print("encoded inputs1 shape: "+str(encoded_input1.shape))
        print("encoded inputs2 shape: "+str(encoded_input2.shape))

The encoder parameters are shared between encoded_input1 and encoded_input2.
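The sharing works because, in TensorFlow 1.x, `tf.get_variable` returns the already-created variable of the same name once the enclosing scope is in reuse mode, instead of creating a new one. The toy class below imitates only that lookup behaviour in plain Python. `ToyVariableScope`, `toy_encoder`, and their methods are hypothetical stand-ins written for this illustration, not part of TensorFlow.

```python
class ToyVariableScope:
    """Toy imitation of a TF 1.x variable scope (hypothetical, not TensorFlow)."""

    def __init__(self):
        self.variables = {}
        self.reuse = False

    def reuse_variables(self):
        # From now on, get_variable only looks up existing variables.
        self.reuse = True

    def get_variable(self, name, initial_value):
        if self.reuse:
            if name not in self.variables:
                raise ValueError("variable %s does not exist" % name)
            return self.variables[name]
        if name in self.variables:
            raise ValueError("variable %s already exists" % name)
        self.variables[name] = initial_value
        return initial_value


def toy_encoder(scope, inputs):
    """Scale the inputs by a weight obtained from the scope."""
    w1 = scope.get_variable("w1", initial_value=[2.0])
    return [w1[0] * x for x in inputs], w1


scope = ToyVariableScope()
encoded1, weights1 = toy_encoder(scope, [1.0, 2.0])  # creates w1
scope.reuse_variables()
encoded2, weights2 = toy_encoder(scope, [3.0])       # looks up the same w1
print(weights1 is weights2)  # True
```

Without the call to `reuse_variables()`, the second `toy_encoder` call raises `ValueError` in this toy, mirroring the error TensorFlow 1.x reports when a variable of the same name already exists in a scope that is not in reuse mode.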

The model is saved to the logs folder at the end of every epoch. I also provide a sentence_encoder.py, which loads a saved model and encodes input sentences into sentence embeddings.

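Were it not for the resource limits, I believe the accuracy reported in the paper could be fully reproduced. I did not have time to run the downstream NLP tasks; if you are interested in trying them, you are welcome to share your results.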