
Convolutional Neural Networks for Sentence Classification


Overall Structure of the Paper

1. Abstract

          Applies convolutional neural networks to sentence-level text classification and achieves strong results on multiple datasets.

2. Introduction (Background)

          Proposes an effective classification model built on pre-trained word vectors and a convolutional neural network.

       Main motivations of this paper:

      1. The rise of deep learning (2012)

      2. Pre-trained word vector methods

      3. Convolutional neural network methods

       Historical significance of this paper:

      1. Opened the era of deep-learning-based text classification

      2. Advanced the use of convolutional neural networks in natural language processing

3. Model

          TextCNN model structure and regularization

          

[Figure: TextCNN model architecture, from the paper]

 

As shown in the figure above, the model first applies convolution, computing each filter's response over the input; it then applies max pooling, and finally a fully connected layer followed by softmax.

Note: the dimension of the vector concatenated before the fully connected layer equals the total number of filters, and the channel count of each convolution output equals the number of filters of that size. For example, 100 filters for each of the sizes 3, 4 and 5 yield a 300-dimensional concatenated vector. A shape-check sketch follows.
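The minimal sketch below (illustrative values only: batch size 2, sentence length 50, embedding size 300, 100 filters of width 3) shows how max-over-time pooling reduces each filter's feature map to a single value, so the concatenated vector gets one dimension per filter:

import torch
import torch.nn as nn

x = torch.randn(2, 300, 50)                # batch_size * embed_size * sentence_length
conv = nn.Conv1d(300, 100, kernel_size=3)  # 100 filters of width 3
feat = torch.relu(conv(x))                 # shape: 2 * 100 * 48
pooled = torch.max(feat, dim=2).values     # shape: 2 * 100, one value per filter
print(pooled.shape)                        # torch.Size([2, 100])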

To prevent overfitting, the paper uses two techniques (a minimal sketch follows the list):

1. Dropout:

      During forward propagation, each neuron is dropped with probability p, which improves the model's ability to generalize.

2. L2 regularization:

      Constrains the weights so that learning does not overfit, again improving generalization.
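The sketch below is illustrative rather than the paper's exact setup (l2_lambda is a hypothetical hyperparameter name): dropout is applied to the pooled feature vector, and an L2 penalty on the classifier weights is added to the loss, which is also how the training code in section 7 handles regularization.

import torch
import torch.nn as nn

fc = nn.Linear(300, 2)                  # classifier over the concatenated pooled features
dropout = nn.Dropout(p=0.5)
criterion = nn.CrossEntropyLoss()

features = torch.randn(8, 300)          # pooled features for a batch of 8 sentences
labels = torch.randint(0, 2, (8,))

logits = fc(dropout(features))          # dropout is only active in training mode
l2_lambda = 1e-3                        # hypothetical regularization strength
l2_penalty = l2_lambda * torch.sum(fc.weight ** 2)
loss = criterion(logits, labels) + l2_penalty
loss.backward()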

4. Dataset and Experiment Setup

          Dataset description, experimental hyperparameter settings, and experimental results

[Tables: dataset statistics and experimental results, from the paper]

The hyperparameter settings and the comparison with baseline models show that the proposed model performs well.

5. Results and Discussion

          Further experiments, a discussion of the number of channels, and a discussion of how the word vectors are used (a sketch of the two-channel input follows)
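The multichannel variant feeds the model two copies of the embedded sentence, one kept static and one fine-tuned. The sketch below is one minimal way to realize this idea (it is not the repository's code): each filter is applied to both channels and the responses are added.

import torch
import torch.nn as nn

vocab_size, embed_size, seq_len = 10000, 300, 50
pretrained = torch.randn(vocab_size, embed_size)  # stand-in for real word2vec weights

embed_static = nn.Embedding.from_pretrained(pretrained, freeze=True)      # channel 1: frozen
embed_nonstatic = nn.Embedding.from_pretrained(pretrained, freeze=False)  # channel 2: fine-tuned
conv = nn.Conv1d(embed_size, 100, kernel_size=3)

x = torch.randint(0, vocab_size, (2, seq_len))    # a batch of 2 token-id sequences
c1 = embed_static(x).transpose(1, 2)              # batch_size * embed_size * seq_len
c2 = embed_nonstatic(x).transpose(1, 2)
feat = torch.relu(conv(c1) + conv(c2))            # each filter sees both channels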

6. Conclusion

          Summarizes the whole paper.

         Key points:

         1. Pre-trained word vectors - word2vec, GloVe

         2. Convolutional network structure - 1-D convolution, pooling

         3. Hyperparameter choices - filter sizes, word vector settings

         Innovations:

         1. Proposes TextCNN, a CNN-based text classification model

         2. Proposes several ways of setting up the word vectors (see the sketch after this list)

         3. Achieves the best results on four text classification tasks

         4. Conducts extensive experiments and analysis of hyperparameters

         Takeaways:

         1. Fine-tuning on top of pre-trained word vectors gives very good results, which suggests that the pre-trained vectors capture general-purpose features

         2. On top of pre-trained word vectors, a simple model can outperform more complex models

         3. For words not covered by the pre-trained vectors, fine-tuning learns more meaningful representations
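The sketch below summarizes the three single-channel word-vector settings compared in the paper (the variable names are illustrative, and the random tensor stands in for real word2vec weights): random initialization, static (frozen) pre-trained vectors, and non-static (fine-tuned) pre-trained vectors.

import torch
import torch.nn as nn

vocab_size, embed_size = 10000, 300
pretrained = torch.randn(vocab_size, embed_size)  # stand-in for word2vec weights

embed_rand = nn.Embedding(vocab_size, embed_size)                          # random init, trained from scratch
embed_static = nn.Embedding.from_pretrained(pretrained, freeze=True)       # pre-trained, kept fixed
embed_nonstatic = nn.Embedding.from_pretrained(pretrained, freeze=False)   # pre-trained, fine-tuned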

 

7. Code Implementation

# ************* Data preprocessing **********

# -*- coding: utf-8 -*-

from torch.utils import data
import os
import random
import numpy as np
from gensim.test.utils import datapath,get_tmpfile
from gensim.models import KeyedVectors


"""加载预训练词向量模型"""
wvmodel = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz",binary=True)


""" 读取论文中的实验数据 """
pos_data = open("./data/MR/rt-polarity.pos",errors='ignore').readlines()
neg_data = open("./data/MR/rt-polarity.neg",errors='ignore').readlines()
datas = pos_data + neg_data
datas = [data.split() for data in datas]
labels = [1] * len(pos_data) + [0] * len(neg_data)


""" 构建词表,以句子最大长度为标准做padding"""
max_sentence_length = max([len(sentence) for sentence in datas])

word2id = {'<pad>':0}

for i,data in enumerate(datas):
    for j,word in enumerate(data):
        if word2id.get(word) is None:
            word2id[word] = len(word2id)
            
        datas[i][j] = word2id[word]
        
    datas[i] = datas[i] + [0] * (max_sentence_length - len(datas[i]))

""" 计算已有词向量的均值和方差 """
tmp = []

for word,index in word2id.items():
    try:
        tmp.append(wvmodel.get_vector(word))
    except KeyError:  # skip words not covered by the pre-trained vocabulary
        pass
    
mean = np.mean(np.array(tmp))
std = np.std(np.array(tmp))
print(mean,std)

""" 如果词在预训练的词向量模型中,则使用词向量,否则使用已有词向量计算的均值和方差构造的随机初始化向量 """
vocab_size = len(word2id)
embed_size = 300

embedding_weights = np.random.normal(mean,std,[vocab_size,embed_size])

for word,index in word2id.items():
    try:
        embedding_weights[index,:] = wvmodel.get_vector(word)
    except KeyError:  # keep the random initialization for out-of-vocabulary words
        pass

""" 打乱数据顺序 """
c = list(zip(datas,labels))
random.seed(1)
random.shuffle(c)
datas[:],labels[:] = zip(*c)

""" 生成训练集、验证集、 测试集 """

k = 0  # index of the cross-validation fold held out as the test set
train_datas = datas[:int(k*len(datas)/10)] + datas[int((k+1)*len(datas)/10):]
train_labels = labels[:int(k*len(datas)/10)] + labels[int((k+1)*len(datas)/10):]

valid_datas = np.array(train_datas[int(0.9*len(train_datas)):])
valid_labels = np.array(train_labels[int(0.9*len(train_labels)):])


train_datas = np.array(train_datas[:int(0.9*len(train_datas))])
train_labels = np.array(train_labels[:int(0.9*len(train_labels))])

test_datas = np.array(datas[int(k*len(datas)/10):int((k+1)*len(datas)/10)])
test_labels = np.array(labels[int(k*len(datas)/10):int((k+1)*len(datas)/10)])

 

# ***********  Model definition  ***********


# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary


class BasicModule(nn.Module):
    def __init__(self):
        super(BasicModule,self).__init__()
        self.model_name = str(type(self))
        
    def load(self, path):
        self.load_state_dict(torch.load(path))
        
    def save(self, path):
        torch.save(self.state_dict(),path)
        
    def forward(self):
        pass


class TextCNN(BasicModule):
    
    def __init__(self,config):
        super(TextCNN,self).__init__()
        
        # Embedding layer
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed_size)
            
        # Convolution layers, one per filter size
        self.conv1d_1 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[0])
        self.conv1d_2 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[1])
        self.conv1d_3 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[2])
        
        # Max-over-time pooling layers
        self.max_pool_1 = nn.MaxPool1d(config.sentence_max_size - config.filters[0] + 1)
        self.max_pool_2 = nn.MaxPool1d(config.sentence_max_size - config.filters[1] + 1)
        self.max_pool_3 = nn.MaxPool1d(config.sentence_max_size - config.filters[2] + 1)
        
        # Dropout layer
        self.dropout = nn.Dropout(config.dropout)
        
        # Classification (fully connected) layer
        self.fc = nn.Linear(config.filter_num * len(config.filters), config.label_num)
        
    def forward(self, x):
        
        x = x.long()
        out = self.embedding(x) # batch_size * sentence_length * embed_size
        out = out.transpose(1,2).contiguous() # batch_size * embed_size * sentence_length
        
        x1 = F.relu(self.conv1d_1(out))
        x2 = F.relu(self.conv1d_2(out))
        x3 = F.relu(self.conv1d_3(out))
        
        x1 = self.max_pool_1(x1).squeeze()
        x2 = self.max_pool_2(x2).squeeze()
        x3 = self.max_pool_3(x3).squeeze()
        
        out = torch.cat([x1,x2,x3],1)
        out = self.dropout(out)
        out = self.fc(out)
        
        return out


class config:
    def __init__(self):
        self.embedding_pretrained = None # pre-trained embedding weights (None = random initialization)
        self.n_vocab = 100 # vocabulary size
        self.embed_size = 300 # word vector dimension
        self.cuda = False # whether to use the GPU
        self.filter_num = 100 # number of filters per filter size
        self.filters = [3,4,5] # filter sizes
        self.label_num = 2 # number of classes
        self.dropout = 0.5 # dropout probability
        self.sentence_max_size = 50 # maximum sentence length



configs = config()
textcnn = TextCNN(configs)
summary(textcnn,input_size=(50,))

 

# ******** Model training **********


from pytorchtools import EarlyStopping
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import TextCNN
from data import MR_Dataset
import numpy as np
import config as argumentparser


config = argumentparser.ArgumentParser()
config.filters = list(map(int,config.filters.split(",")))
i = 0  # cross-validation fold index (the full script in the repository loops over all folds)
early_stopping = EarlyStopping(patience=10, verbose=True,cv_index=i)


training_set = MR_Dataset(state="train",k=i)
config.n_vocab = training_set.n_vocab
training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                            batch_size=config.batch_size,
                                            shuffle=True,
                                            num_workers=2)

if config.use_pretrained_embed:
    config.embedding_pretrained = torch.from_numpy(training_set.weight).float()
else:
    pass


valid_set = MR_Dataset(state="valid", k=i)
valid_iter = torch.utils.data.DataLoader(dataset=valid_set,
                                         batch_size=config.batch_size,
                                         shuffle=False,
                                         num_workers=2)

test_set = MR_Dataset(state="test", k=i)
test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                        batch_size=config.batch_size,
                                        shuffle=False,
                                        num_workers=2)

model = TextCNN(config)
if config.cuda and torch.cuda.is_available():
    model.cuda()


criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
count = 0
loss_sum = 0

def get_test_result(data_iter,data_set):
    model.eval()
    data_loss = 0
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        out = model(data)
        loss = criterion(out, autograd.Variable(label.long()))
        data_loss += loss.data.item()
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy()) #(0,0.5)
    acc = true_sample_num / data_set.__len__()
    return data_loss,acc

for epoch in range(config.epoch):
    # start of a training epoch
    model.train()
    for data, label in training_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        # L2 penalty: config.l2 * sum(w^2) over one of the weight matrices
        l2_loss = config.l2*torch.sum(torch.pow(list(model.parameters())[1],2))
        loss = criterion(out, autograd.Variable(label.long()))+l2_loss
        loss_sum += loss.data.item()
        count += 1
        if count % 100 == 0:
            print("epoch", epoch, end='  ')
            print("The loss is: %.5f" % (loss_sum / 100))
            loss_sum = 0
            count = 0
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # end of one training epoch
    # evaluate on the validation set
    valid_loss,valid_acc = get_test_result(valid_iter,valid_set)
    early_stopping(valid_loss, model)
    print ("The valid acc is: %.5f" % valid_acc)
    if early_stopping.early_stop:
        print("Early stopping")
        break
# training finished, evaluate on the test set
model.load_state_dict(torch.load('./checkpoints/checkpoint%d.pt'%i))
test_loss, test_acc = get_test_result(test_iter, test_set)
print("The test acc is: %.5f" % test_acc)

""" EarlyStopping """

import numpy as np
import torch

class EarlyStopping:
    """Early stops the training if validation loss doesn't improve after a given patience."""
    def __init__(self, patience=7, verbose=False, delta=0,cv_index = 0):
        """
        Args:
            patience (int): How long to wait after last time validation loss improved.
                            Default: 7
            verbose (bool): If True, prints a message for each validation loss improvement. 
                            Default: False
            delta (float): Minimum change in the monitored quantity to qualify as an improvement.
                            Default: 0
        """
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.val_loss_min = np.inf
        self.delta = delta
        self.cv_index = cv_index

    def __call__(self, val_loss, model):

        score = -val_loss

        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
        elif score < self.best_score + self.delta:
            self.counter += 1
            print('EarlyStopping counter: %d out of %d'%(self.counter,self.patience))
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
            self.counter = 0

    def save_checkpoint(self, val_loss, model):
        '''Saves model when validation loss decrease.'''
        if self.verbose:
            print('Validation loss decreased (%.5f --> %.5f).  Saving model ...'%(self.val_loss_min,val_loss))
        torch.save(model.state_dict(), './checkpoints/checkpoint%d.pt'%self.cv_index)
        self.val_loss_min = val_loss


The full code is available at: https://github.com/wangtao666666/NLP/tree/master/TextCNN
