Text Classification -- Convolutional Networks、sentence level Attentional RNN、Hierarchical attention

程序员文章站 2022-03-11 10:56:52

...

from：https://richliao.github.io/

Text Classification, Part I - Convolutional Networks

Text classification is a very classical problem. The goal is to classify documents into a fixed number of predefined categories, given a variable length of text bodies. It is widely use in sentimental analysis (IMDB, YELP reviews classification), stock market sentimental analysis, to GOOGLE’s smart email reply. This is a very active research area both in academia and industry. In the following series of posts, I will try to present a few different approaches and compare their performances. Ultimately, the goal for me is to implement the paper Hierarchical Attention Networks for Document Classification.

Given the limitation of data set I have, all exercises are based on Kaggle’s IMDB dataset. And implementation are all based on Keras.

Text classification using CNN

In this first post, I will look into how to use convolutional neural network to build a classifier, particularly Convolutional Neural Networks for Sentence Classification - Yoo Kim.

First use BeautifulSoup to remove some html tags and remove some unwanted characters.

def clean_str(string):
    """
    Tokenization/string cleaning for dataset
    Every dataset is lower cased except
    """
    string = re.sub(r"\\", "", string)    
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()

texts = []
labels = []

for idx in range(data_train.review.shape[0]):
    text = BeautifulSoup(data_train.review[idx])
    texts.append(clean_str(text.get_text().encode('ascii','ignore')))
    labels.append(data_train.sentiment[idx])

Keras has provide very nice text processing functions.

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

For this project, I have used Google Glove 6B vector 100d. For Unknown word, the following code will just randomize its vector.

GLOVE_DIR = "~/data/glove"
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

A simplified Convolutional

First, I will just use a very simple convolutional architecture here. Simply use total 128 filters with size 5 and max pooling of 5 and 35, following the sample from this blog

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_cov1= Conv1D(128, 5, activation='relu')(embedded_sequences)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
l_pool2 = MaxPooling1D(5)(l_cov2)
l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2)
l_pool3 = MaxPooling1D(35)(l_cov3)  # global max pooling
l_flat = Flatten()(l_pool3)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(2, activation='softmax')


Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_1 (InputLayer)             (None, 1000)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1000, 100)     8057000     input_1[0][0]
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 996, 128)      64128       embedding_1[0][0]
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 199, 128)      0           convolution1d_1[0][0]
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 195, 128)      82048       maxpooling1d_1[0][0]
____________________________________________________________________________________________________
maxpooling1d_2 (MaxPooling1D)    (None, 39, 128)       0           convolution1d_2[0][0]
____________________________________________________________________________________________________
convolution1d_3 (Convolution1D)  (None, 35, 128)       82048       maxpooling1d_2[0][0]
____________________________________________________________________________________________________
maxpooling1d_3 (MaxPooling1D)    (None, 1, 128)        0           convolution1d_3[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 128)           0           maxpooling1d_3[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 128)           16512       flatten_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             258         dense_1[0][0]
====================================================================================================
Total params: 8301994
____________________________________________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 43s - loss: 0.6347 - acc: 0.6329 - val_loss: 0.6107 - val_acc: 0.7024
Epoch 2/10
20000/20000 [==============================] - 43s - loss: 0.4141 - acc: 0.8188 - val_loss: 0.4098 - val_acc: 0.8180
Epoch 3/10
20000/20000 [==============================] - 43s - loss: 0.3252 - acc: 0.8651 - val_loss: 0.4162 - val_acc: 0.8148
Epoch 4/10
20000/20000 [==============================] - 44s - loss: 0.2651 - acc: 0.8929 - val_loss: 0.3545 - val_acc: 0.8640
Epoch 5/10
20000/20000 [==============================] - 43s - loss: 0.2170 - acc: 0.9140 - val_loss: 0.2764 - val_acc: 0.8906
Epoch 6/10
20000/20000 [==============================] - 43s - loss: 0.1666 - acc: 0.9382 - val_loss: 0.4196 - val_acc: 0.8496
Epoch 7/10
20000/20000 [==============================] - 43s - loss: 0.1223 - acc: 0.9568 - val_loss: 0.4271 - val_acc: 0.8680
Epoch 8/10
20000/20000 [==============================] - 43s - loss: 0.0896 - acc: 0.9683 - val_loss: 0.8233 - val_acc: 0.8308
Epoch 9/10
20000/20000 [==============================] - 43s - loss: 0.0830 - acc: 0.9770 - val_loss: 0.5868 - val_acc: 0.8852
Epoch 10/10
20000/20000 [==============================] - 43s - loss: 0.0667 - acc: 0.9794 - val_loss: 0.5159 - val_acc: 0.8872

The accuracy we can achieve is 89%

Deeper Convolutional neural network

In Yoon Kim’s paper, multiple filters have been applied. This can be easily implemented using Keras Merge Layer.

Convolutional network with multiple filter sizes

convs = []
filter_sizes = [3,4,5]

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

for fsz in filter_sizes:
    l_conv = Conv1D(nb_filter=128,filter_length=fsz,activation='relu')(embedded_sequences)
    l_pool = MaxPooling1D(5)(l_conv)
    convs.append(l_pool)

l_merge = Merge(mode='concat', concat_axis=1)(convs)
l_cov1= Conv1D(128, 5, activation='relu')(l_merge)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
l_pool2 = MaxPooling1D(30)(l_cov2)
l_flat = Flatten()(l_pool2)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(2, activation='softmax')(l_dense)

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 1000)          0
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1000, 100)     8057000     input_2[0][0]
____________________________________________________________________________________________________
convolution1d_4 (Convolution1D)  (None, 998, 128)      38528       embedding_2[0][0]
____________________________________________________________________________________________________
convolution1d_5 (Convolution1D)  (None, 997, 128)      51328       embedding_2[0][0]
____________________________________________________________________________________________________
convolution1d_6 (Convolution1D)  (None, 996, 128)      64128       embedding_2[0][0]
____________________________________________________________________________________________________
maxpooling1d_4 (MaxPooling1D)    (None, 199, 128)      0           convolution1d_4[0][0]
____________________________________________________________________________________________________
maxpooling1d_5 (MaxPooling1D)    (None, 199, 128)      0           convolution1d_5[0][0]
____________________________________________________________________________________________________
maxpooling1d_6 (MaxPooling1D)    (None, 199, 128)      0           convolution1d_6[0][0]
____________________________________________________________________________________________________
merge_1 (Merge)                  (None, 597, 128)      0           maxpooling1d_4[0][0]
                                                                   maxpooling1d_5[0][0]
                                                                   maxpooling1d_6[0][0]
____________________________________________________________________________________________________
convolution1d_7 (Convolution1D)  (None, 593, 128)      82048       merge_1[0][0]
____________________________________________________________________________________________________
maxpooling1d_7 (MaxPooling1D)    (None, 118, 128)      0           convolution1d_7[0][0]
____________________________________________________________________________________________________
convolution1d_8 (Convolution1D)  (None, 114, 128)      82048       maxpooling1d_7[0][0]
____________________________________________________________________________________________________
maxpooling1d_8 (MaxPooling1D)    (None, 3, 128)        0           convolution1d_8[0][0]
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 384)           0           maxpooling1d_8[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 2)             770         flatten_2[0][0]
====================================================================================================
Total params: 8375850
____________________________________________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 117s - loss: 0.4950 - acc: 0.7472 - val_loss: 0.2895 - val_acc: 0.8830
Epoch 2/10
20000/20000 [==============================] - 117s - loss: 0.2868 - acc: 0.8807 - val_loss: 0.2460 - val_acc: 0.9036
Epoch 3/10
20000/20000 [==============================] - 118s - loss: 0.2040 - acc: 0.9202 - val_loss: 0.2530 - val_acc: 0.8986
Epoch 4/10
20000/20000 [==============================] - 117s - loss: 0.1293 - acc: 0.9530 - val_loss: 0.2931 - val_acc: 0.8870
Epoch 5/10
20000/20000 [==============================] - 117s - loss: 0.0596 - acc: 0.9788 - val_loss: 0.4155 - val_acc: 0.8896
Epoch 6/10
20000/20000 [==============================] - 117s - loss: 0.0334 - acc: 0.9881 - val_loss: 0.5213 - val_acc: 0.8954
Epoch 7/10
20000/20000 [==============================] - 117s - loss: 0.0173 - acc: 0.9934 - val_loss: 0.5742 - val_acc: 0.8910
Epoch 8/10
20000/20000 [==============================] - 118s - loss: 0.0166 - acc: 0.9949 - val_loss: 0.6220 - val_acc: 0.8944
Epoch 9/10
20000/20000 [==============================] - 117s - loss: 0.0114 - acc: 0.9970 - val_loss: 0.6947 - val_acc: 0.8934
Epoch 10/10
20000/20000 [==============================] - 117s - loss: 0.0095 - acc: 0.9967 - val_loss: 0.8724 - val_acc: 0.8974

As you can see, the result slighly improved to 90.3%

To achieve the best performances, we can 1) fine tune hyper parameters 2) further improve text preprocessing 3) use drop out layer

Full source code is in my repository in github.

Conclusion

Based on the observation, the complexity of convolutional neural network doesn’t seem to improve performance, at least using this small dataset. We might be able to see performance improvement using larger dataset, which I won’t be able to verify here. One observation I have is allowing the embedding layer training or not does significantly impact the performance, same did pretrained Google Glove word vectors. In both cases, I can see performance improved from 82% to 90%.

Text Classification, Part 2 - sentence level Attentional RNN

In the second post, I will try to tackle the problem by using recurrent neural network and attention based LSTM encoder. Further, to make one step closer to implement Hierarchical Attention Networks for Document Classification, I will implement an Attention Network on top of LSTM/GRU for the classification task.

Please note that all exercises are based on Kaggle’s IMDB dataset. And implementation are all based on Keras

Text classification using LSTM

By using LSTM encoder, we intent to encode all information of the text in the last output of recurrent neural network before running feed forward network for classification. This is very similar to neural translation machine and sequence to sequence learning. See the following figure that came from A Hierarchical Neural Autoencoder for Paragraphs and Documents.

I’m going to use LSTM layer in Keras to implement this. Other than forward LSTM, here I am going to use bidirectional LSTM and concatenate both last output of LSTM outputs. Keras has provide a very nice wrapper called bidirectional, which will make this coding exercise effortless. You can see the sample code here

The following code snippet is pretty much the same as Keras sample code except that I didn’t use any drop out layer.

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
preds = Dense(2, activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - Bidirectional LSTM")
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_1 (InputLayer)             (None, 1000)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 1000, 100)     8057000     input_1[0][0]
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 200)           160800      embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 2)             402         bidirectional_1[0][0]
====================================================================================================
Total params: 8218202
____________________________________________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 1088s - loss: 0.5343 - acc: 0.7304 - val_loss: 0.3738 - val_acc: 0.8414
Epoch 2/10
20000/20000 [==============================] - 1092s - loss: 0.3348 - acc: 0.8605 - val_loss: 0.3199 - val_acc: 0.8678
Epoch 3/10
20000/20000 [==============================] - 1091s - loss: 0.2382 - acc: 0.9083 - val_loss: 0.2758 - val_acc: 0.8912
Epoch 4/10
20000/20000 [==============================] - 1092s - loss: 0.1808 - acc: 0.9309 - val_loss: 0.2562 - val_acc: 0.8988
Epoch 5/10
20000/20000 [==============================] - 1087s - loss: 0.1383 - acc: 0.9492 - val_loss: 0.2572 - val_acc: 0.9068
Epoch 6/10
20000/20000 [==============================] - 1091s - loss: 0.1032 - acc: 0.9634 - val_loss: 0.2666 - val_acc: 0.9040
Epoch 7/10
20000/20000 [==============================] - 1088s - loss: 0.0736 - acc: 0.9750 - val_loss: 0.3069 - val_acc: 0.9042
Epoch 8/10
20000/20000 [==============================] - 1087s - loss: 0.0488 - acc: 0.9834 - val_loss: 0.3886 - val_acc: 0.8950
Epoch 9/10
20000/20000 [==============================] - 1081s - loss: 0.0328 - acc: 0.9892 - val_loss: 0.3788 - val_acc: 0.8984
Epoch 10/10
20000/20000 [==============================] - 1087s - loss: 0.0197 - acc: 0.9944 - val_loss: 0.5636 - val_acc: 0.8734

The best peformance I can see is about 90.4%.

Attention Network

In the following, I am going to implement an attention layer which is well studied in many papers including sequence to sequence learning. Particularly for this text classification task, I have followed the implementation of FEED-FORWARD NETWORKS WITH ATTENTION CAN SOLVE SOME LONG-TERM MEMORY PROBLEMS by Colin Raffel

To implement the attention layer, we need to build a custom Keras layer. You can follow the instruction here

The following code can only strictly run on Theano backend since tensorflow matrix dot product doesn’t behave the same as np.dot. I don’t know how to get a 2D tensor by dot product of 3D tensor of recurrent layer output and 1D tensor of weight.

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        #self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(** kwargs)

    def build(self, input_shape):
        assert len(input_shape)==3
        #self.W = self.init((input_shape[-1],1))
        self.W = self.init((input_shape[-1],))
        #self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))

        ai = K.exp(eij)
        weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')

        weighted_input = x*weights.dimshuffle(0,1,'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

Then following code is pretty much the same as the previous one except I will add an attention layer on top of GRU Output

embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)



sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_gru = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_att = AttLayer()(l_gru)
preds = Dense(2, activation='softmax')(l_att)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

print("model fitting - attention GRU network")
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=50)

model fitting - attention GRU network
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 1000)          0
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 1000, 100)     8057000     input_2[0][0]
____________________________________________________________________________________________________
bidirectional_2 (Bidirectional)  (None, 1000, 200)     120600      embedding_2[0][0]
____________________________________________________________________________________________________
attlayer_1 (AttLayer)            (None, 200)           200         bidirectional_2[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             402         attlayer_1[0][0]
====================================================================================================
Total params: 8178202
____________________________________________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 936s - loss: 0.4635 - acc: 0.7666 - val_loss: 0.3315 - val_acc: 0.8602
Epoch 2/10
20000/20000 [==============================] - 937s - loss: 0.2563 - acc: 0.8980 - val_loss: 0.2848 - val_acc: 0.8824
Epoch 3/10
20000/20000 [==============================] - 933s - loss: 0.1851 - acc: 0.9294 - val_loss: 0.2445 - val_acc: 0.9046
Epoch 4/10
20000/20000 [==============================] - 935s - loss: 0.1322 - acc: 0.9535 - val_loss: 0.2519 - val_acc: 0.9010
Epoch 5/10
20000/20000 [==============================] - 935s - loss: 0.0901 - acc: 0.9687 - val_loss: 0.3053 - val_acc: 0.8922
Epoch 6/10
20000/20000 [==============================] - 937s - loss: 0.0556 - acc: 0.9826 - val_loss: 0.3063 - val_acc: 0.9038
Epoch 7/10
20000/20000 [==============================] - 936s - loss: 0.0317 - acc: 0.9913 - val_loss: 0.4064 - val_acc: 0.8980
Epoch 8/10
20000/20000 [==============================] - 936s - loss: 0.0187 - acc: 0.9946 - val_loss: 0.3858 - val_acc: 0.9012
Epoch 9/10
20000/20000 [==============================] - 934s - loss: 0.0099 - acc: 0.9975 - val_loss: 0.4575 - val_acc: 0.9062
Epoch 10/10
20000/20000 [==============================] - 933s - loss: 0.0046 - acc: 0.9986 - val_loss: 0.5417 - val_acc: 0.9008

The accuracy we can achieve is 90.4%

Compare to previous approach, the result is pretty much the same.

To achieve the best performances, we may 1) fine tune hyper parameters 2) further improve text preprocessing. 3) apply drop out layer

Full source code is in my repository in github.

Conclusion

Based on the observations, performances of both approaches are quite good. Long sentence sequence trainings are quite slow, in both approaches, training time took more than 15 minutes for each epoch.

Text Classification, Part 3 - Hierarchical attention network

After the exercise of building convolutional, RNN, sentence level attention RNN, finally I have come to implement Hierarchical Attention Networks for Document Classification. I’m very thankful to Keras, which make building this project painless. The custom layer is very powerful and flexible to build your custom logic to embed into the existing frame work. Functional API makes the Hierarchical InputLayers very easy to implement.

Please note that all exercises are based on Kaggle’s IMDB dataset.

Text classification using Hierarchical LSTM

Before fully implement Hierarchical attention network, I want to build a Hierarchical LSTM network as a base line. To have it implemented, I have to construct the data input as 3D other than 2D in previous two posts. So the input tensor would be [# of reviews each batch, # of sentences, # of words in each sentence].

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)

data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

for i, sentences in enumerate(reviews):
    for j, sent in enumerate(sentences):
        if j< MAX_SENTS:
            wordTokens = text_to_word_sequence(sent)
            #update 1/10/2017 - bug fixed - set max number of words
            k=0
            for _, word in enumerate(wordTokens):
                if k<MAX_SENT_LENGTH and tokenizer.word_index[word]<MAX_NB_WORDS:
                    data[i,j,k] = tokenizer.word_index[word]
                    k=k+1

After that we can use Keras magic function TimeDistributed to construct the Hierarchical input layers as following. This is what I have learned from this post

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(2, activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)

Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 15, 100)       0
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribute(None, 15, 200)       8217800     input_2[0][0]
____________________________________________________________________________________________________
bidirectional_2 (Bidirectional)  (None, 200)           240800      timedistributed_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 2)             402         bidirectional_2[0][0]
====================================================================================================
Total params: 8459002
____________________________________________________________________________________________________
None
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 494s - loss: 0.5558 - acc: 0.6976 - val_loss: 0.4443 - val_acc: 0.7962
Epoch 2/10
20000/20000 [==============================] - 494s - loss: 0.3135 - acc: 0.8659 - val_loss: 0.3219 - val_acc: 0.8552
Epoch 3/10
20000/20000 [==============================] - 495s - loss: 0.2319 - acc: 0.9076 - val_loss: 0.2627 - val_acc: 0.8948
Epoch 4/10
20000/20000 [==============================] - 494s - loss: 0.1753 - acc: 0.9323 - val_loss: 0.2784 - val_acc: 0.8920
Epoch 5/10
20000/20000 [==============================] - 495s - loss: 0.1306 - acc: 0.9517 - val_loss: 0.2884 - val_acc: 0.8944
Epoch 6/10
20000/20000 [==============================] - 495s - loss: 0.0901 - acc: 0.9696 - val_loss: 0.3073 - val_acc: 0.8972
Epoch 7/10
20000/20000 [==============================] - 494s - loss: 0.0586 - acc: 0.9796 - val_loss: 0.4159 - val_acc: 0.8874
Epoch 8/10
20000/20000 [==============================] - 495s - loss: 0.0369 - acc: 0.9880 - val_loss: 0.4317 - val_acc: 0.8956
Epoch 9/10
20000/20000 [==============================] - 495s - loss: 0.0233 - acc: 0.9936 - val_loss: 0.4392 - val_acc: 0.8818
Epoch 10/10
20000/20000 [==============================] - 494s - loss: 0.0148 - acc: 0.9960 - val_loss: 0.5817 - val_acc: 0.8840

The performance is slightly worser than previous post at about 89.4%. However, the training time is much faster than one level of LSTM in the second post.

Attention Network

To implement the attention layer, we need to build a custom Keras layer. You can follow the instruction here

class AttLayer(Layer):
    def __init__(self, **kwargs):
        self.init = initializations.get('normal')
        #self.input_spec = [InputSpec(ndim=3)]
        super(AttLayer, self).__init__(** kwargs)

    def build(self, input_shape):
        assert len(input_shape)==3
        #self.W = self.init((input_shape[-1],1))
        self.W = self.init((input_shape[-1],))
        #self.input_spec = [InputSpec(shape=input_shape)]
        self.trainable_weights = [self.W]
        super(AttLayer, self).build(input_shape)  # be sure you call this somewhere!

    def call(self, x, mask=None):
        eij = K.tanh(K.dot(x, self.W))

        ai = K.exp(eij)
        weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')

        weighted_input = x*weights.dimshuffle(0,1,'x')
        return weighted_input.sum(axis=1)

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[-1])

Following the paper, Hierarchical Attention Networks for Document Classification. I have also added a dense layer taking the output from GRU before feeding into attention layer. In the following implementation, there’re two layers of attention network built in, one at sentence level and the other at review level.

sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
l_dense = TimeDistributed(Dense(200))(l_lstm)
l_att = AttLayer()(l_dense)
sentEncoder = Model(sentence_input, l_att)

review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
l_att_sent = AttLayer()(l_dense_sent)
preds = Dense(2, activation='softmax')(l_att_sent)
model = Model(review_input, preds)

model fitting - Hierachical attention network
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_4 (InputLayer)             (None, 15, 100)       0
____________________________________________________________________________________________________
timedistributed_3 (TimeDistribute(None, 15, 200)       8218000     input_4[0][0]
____________________________________________________________________________________________________
bidirectional_4 (Bidirectional)  (None, 15, 200)       180600      timedistributed_3[0][0]
____________________________________________________________________________________________________
timedistributed_4 (TimeDistribute(None, 15, 200)       40200       bidirectional_4[0][0]
____________________________________________________________________________________________________
attlayer_2 (AttLayer)            (None, 200)           200         timedistributed_4[0][0]
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 2)             402         attlayer_2[0][0]
====================================================================================================
Total params: 8439402
____________________________________________________________________________________________________
None
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 441s - loss: 0.5509 - acc: 0.7072 - val_loss: 0.3391 - val_acc: 0.8564
Epoch 2/10
20000/20000 [==============================] - 440s - loss: 0.2972 - acc: 0.8776 - val_loss: 0.2767 - val_acc: 0.8850
Epoch 3/10
20000/20000 [==============================] - 442s - loss: 0.2212 - acc: 0.9141 - val_loss: 0.2670 - val_acc: 0.8898
Epoch 4/10
20000/20000 [==============================] - 440s - loss: 0.1635 - acc: 0.9392 - val_loss: 0.2500 - val_acc: 0.9040
Epoch 5/10
20000/20000 [==============================] - 441s - loss: 0.1183 - acc: 0.9582 - val_loss: 0.2795 - val_acc: 0.9040
Epoch 6/10
20000/20000 [==============================] - 440s - loss: 0.0793 - acc: 0.9721 - val_loss: 0.3198 - val_acc: 0.8924
Epoch 7/10
20000/20000 [==============================] - 441s - loss: 0.0479 - acc: 0.9849 - val_loss: 0.3575 - val_acc: 0.8948
Epoch 8/10
20000/20000 [==============================] - 441s - loss: 0.0279 - acc: 0.9913 - val_loss: 0.3876 - val_acc: 0.8934
Epoch 9/10
20000/20000 [==============================] - 440s - loss: 0.0158 - acc: 0.9954 - val_loss: 0.6058 - val_acc: 0.8838
Epoch 10/10
20000/20000 [==============================] - 440s - loss: 0.0109 - acc: 0.9968 - val_loss: 0.8289 - val_acc: 0.8816

The best performance is pretty much still cap at 90.4%

What has remained to do is deriving attention weights so that we can visualize the importance of words and sentences, which is not hard to do. By using K.function in Keras, we can derive GRU and dense layer output and compute the attention weights on the fly. I will update the post as long as I have it completed.

Full source code is in my repository in github.

Also see the Keras group discussion about this implementation

Conclusion

The result is a bit disappointing. I couldn’t achieve a better accuracy although the training time is much faster, comparing to different approaches from using convolutional, bidirectional RNN, to one level attention network. Maybe the dataset is too small for Hierarchical attention network to be powerful. However, given the potential power of explaining the importance of words and sentences, Hierarchical attention network could have the potential to be the best text classification method. At last, please contact me or comment below if I have made any mistaken in the exercise or anything I can improve. Thank you!

Update - 1/11/2017

Ben on Keras google group nicely pointed out to me where to download emnlp data. So I have used the same code run against Yelp-2013 dataset. I can’t match author’s performance. The one level LSTM attention and Hierarchical attention network can only achieve 65%, while BiLSTM achieves roughly 64%. However, I didn’t follow exactly author’s text preprocessing. I am still using Keras data preprocessing logic that takes top 20,000 or 50,000 tokens, skip the rest and pad remaining with 0. I felt there could be some major improvement in performance if we can do the text processing right, such as replacing time and money to unique tokens and attaching POS information on the sequence etc.

Update - 6/22/2017

Took couple hours and tried to finish the long due attention weight visualization job. The idea is just to do a forward pass. The steps and codes are following:

Define a K.function and derive GRU or whatever layer output before Attention input

get_layer_output = K.function([model.layers[0].input, K.learning_phase()], [model.layers[2].output])
test_seq = pad_sequences([sequences[index]], maxlen=MAX_SEQUENCE_LENGTH)
out = get_layer_output([test_seq, 0])[0]  # test mode
print(out[0].shape)

Repeat the process in attention weights calculation.

eij = np.tanh(np.dot(out[0], att_w[0]))
ai = np.exp(eij)
weights = ai/np.sum(ai)

weights will be the attention weights, the dimension is 1000 for this program.

Now you can get what are the top weights of words

K = 10
topKeys = np.argpartition(weights,-K)[-K:]
print topKeys
print test_seq[0][topKeys]

However, the top keywords I am getting are not the desire words I am hopping - some make senses but some don’t. I will continue to investigate when time is allowed. Please message me if you observe something wrong.