
Keras NLP: Predicting the Sentiment of New Reviews

程序员文章站 2024-01-25 08:50:04

1. Python Code

#!/usr/bin/env python3
# encoding: utf-8
'''
@file: Keras_predict_sentiment.py
@time: 2020/7/5 0005 11:58
@author: Jack
@contact: jack18588951684@163.com
'''

import string
import re
from os import listdir
from numpy import array
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.utils import plot_model  # only needed if the plot_model call in define_model is enabled
from keras.models import Sequential
from keras.layers import Dense


def load_doc(filename):
    # Read the whole file into memory; the context manager closes it even on error
    with open(filename, 'r') as file:
        return file.read()


def clean_doc(doc):
    # Split on whitespace
    tokens = doc.split()
    # Strip punctuation from each token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    # Keep purely alphabetic tokens
    tokens = [w for w in tokens if w.isalpha()]
    # Drop English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Drop single-character tokens
    tokens = [w for w in tokens if len(w) > 1]
    return tokens


def doc_to_line(filename, vocab):
    doc = load_doc(filename)
    tokens = clean_doc(doc)
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)


def process_docs(directory, vocab):
    # Convert every file in the directory into one line of vocabulary tokens
    lines = list()
    for filename in listdir(directory):
        path = directory + '/' + filename
        line = doc_to_line(path, vocab)
        lines.append(line)
    return lines


def load_clean_dataset(vocab):
    # Load all negative and positive reviews; label 0 = negative, 1 = positive
    neg = process_docs('txt_sentoken/neg', vocab)
    pos = process_docs('txt_sentoken/pos', vocab)
    docs = neg + pos
    labels = array([0] * len(neg) + [1] * len(pos))
    return docs, labels


def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer


def define_model(n_words):
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    # plot_model(model, to_file='model.png', show_shapes=True)
    return model


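def predict_sentiment(review, vocab, tokenizer, model):
    # Prepare the review exactly like the training data: clean, filter by vocabulary, join
    tokens = clean_doc(review)
    tokens = [w for w in tokens if w in vocab]
    line = ' '.join(tokens)
    # Encode as a binary bag-of-words vector with the fitted tokenizer
    encoded = tokenizer.texts_to_matrix([line], mode='binary')
    # The sigmoid output is the predicted probability of the positive class
    yhat = model.predict(encoded, verbose=0)
    percent_pos = yhat[0, 0]
    # Rounding to 0 means negative; report the probability of the predicted class
    if round(percent_pos) == 0:
        return (1 - percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'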


if __name__ == "__main__":
    # Load the vocabulary file
    vocab_filename = 'vocab.txt'
    vocab = set(load_doc(vocab_filename).split())
    # The final model is fitted on all available reviews, so no separate test set is built
    train_docs, ytrain = load_clean_dataset(vocab)
    tokenizer = create_tokenizer(train_docs)
    Xtrain = tokenizer.texts_to_matrix(train_docs, mode='binary')
    n_words = Xtrain.shape[1]
    model = define_model(n_words)
    model.fit(Xtrain, ytrain, epochs=10, verbose=2)
    # Predict the sentiment of two new reviews
    for text in ('Best movie ever! It was great!.', 'This is a bad movie.'):
        percent, sentiment = predict_sentiment(text, vocab, tokenizer, model)
        print("Review:{}\tSentiment:{}({})".format(text, sentiment, percent * 100))
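
The encoding step in the listing above, tokenizer.texts_to_matrix(..., mode='binary'), can be pictured with a small pure-Python sketch. This is a simplified stand-in and not the Keras implementation: the helper names fit_word_index and texts_to_binary_matrix are hypothetical, indices are assigned in first-seen order (Keras orders them by word frequency), and no lowercasing or punctuation filtering is done. What it does share with the listing is the idea that each review becomes a fixed-length 0/1 vector with one column per known word plus an unused column 0.

```python
def fit_word_index(lines):
    """Assign each distinct word an integer index starting at 1 (0 stays unused)."""
    word_index = {}
    for line in lines:
        for word in line.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def texts_to_binary_matrix(lines, word_index):
    """Return one 0/1 row per line; column i is 1 if word i occurs in the line."""
    n_cols = len(word_index) + 1  # column 0 is reserved and always 0
    matrix = []
    for line in lines:
        row = [0] * n_cols
        for word in line.split():
            idx = word_index.get(word)
            if idx is not None:  # words not seen during fitting are ignored
                row[idx] = 1
        matrix.append(row)
    return matrix


# Hypothetical toy data, not taken from the review corpus
train_lines = ['great movie great cast', 'bad movie']
word_index = fit_word_index(train_lines)
encoded = texts_to_binary_matrix(['great great unknown movie'], word_index)
print(word_index)  # {'great': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(encoded)     # [[0, 1, 1, 0, 0]]
```

Repeated words still produce a single 1, and a word that was never seen during fitting contributes nothing, which is why a new review is first filtered against the vocabulary.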

2. Code Explanation

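The final model is trained on all available data (training set plus test set) and then used to predict the class of new movie reviews. A new review must go through the same preparation steps as the test data: load the text, clean it, encode it, and only then make a prediction. This is implemented in predict_sentiment(), which takes the review text, the vocabulary, the tokenizer, and the model. It calls predict() on the fitted model, which returns the sigmoid output, a value between 0 and 1 that is read as the probability that the review is positive. Rounding this value gives the class: 0 for a negative review, 1 for a positive one. The function returns the probability of the predicted class together with its label, NEGATIVE or POSITIVE.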
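
The decision rule at the end of predict_sentiment() can be isolated as a minimal sketch that needs no model. The function name label_from_probability and the two input values are hypothetical; they only stand in for the sigmoid output that model.predict() would return.

```python
def label_from_probability(percent_pos):
    """Map the sigmoid output (probability of POSITIVE) to (confidence, label)."""
    if round(percent_pos) == 0:
        # Rounds down to class 0: report confidence in the NEGATIVE class
        return 1 - percent_pos, 'NEGATIVE'
    return percent_pos, 'POSITIVE'


for p in (0.519, 0.324):  # hypothetical sigmoid outputs
    confidence, label = label_from_probability(p)
    print('{} -> {} ({:.1f}%)'.format(p, label, confidence * 100))
```

The reported percentage is always the probability of the predicted class, so it never falls below 50%. A value just above 50% means the model is close to undecided.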

3. Output

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 50)                1288950   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
=================================================================
Total params: 1,289,001
Trainable params: 1,289,001
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
63/63 - 0s - loss: 0.4540 - accuracy: 0.7920
Epoch 2/10
63/63 - 0s - loss: 0.0531 - accuracy: 0.9965
Epoch 3/10
63/63 - 0s - loss: 0.0151 - accuracy: 1.0000
Epoch 4/10
63/63 - 0s - loss: 0.0069 - accuracy: 1.0000
Epoch 5/10
63/63 - 0s - loss: 0.0037 - accuracy: 1.0000
Epoch 6/10
63/63 - 1s - loss: 0.0020 - accuracy: 1.0000
Epoch 7/10
63/63 - 1s - loss: 0.0012 - accuracy: 1.0000
Epoch 8/10
63/63 - 0s - loss: 7.2241e-04 - accuracy: 1.0000
Epoch 9/10
63/63 - 0s - loss: 4.8940e-04 - accuracy: 1.0000
Epoch 10/10
63/63 - 0s - loss: 3.5379e-04 - accuracy: 1.0000
Review:Best movie ever! It was great!.	Sentiment:POSITIVE(51.90609693527222)
Review:This is a bad movie.	Sentiment:NEGATIVE(67.59185492992401)
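
The numbers in the model summary and the training log can be cross-checked with a little arithmetic. The sketch below relies on two assumptions that the article does not state: that the corpus holds 2000 reviews in total, and that model.fit() ran with the Keras default batch size of 32. The parameter counts themselves are taken from the summary above.

```python
import math

hidden_units = 50
first_layer_params = 1288950  # "dense" row of the model summary

# A Dense layer has inputs * units weights plus one bias per unit
n_words = (first_layer_params - hidden_units) // hidden_units
output_params = hidden_units * 1 + 1  # "dense_1": 50 weights + 1 bias

# Assumptions: 2000 reviews in total, default batch size of 32
steps_per_epoch = math.ceil(2000 / 32)

print(n_words)                             # 25778
print(first_layer_params + output_params)  # 1289001
print(steps_per_epoch)                     # 63
```

Under these assumptions the input vector has 25778 columns, and the 63 steps per epoch match the 63/63 shown in the log. The accuracy of 1.0000 is measured on the same data the model was trained on, so it does not indicate how well the model handles unseen reviews.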

Original article: https://blog.csdn.net/u013010473/article/details/107137434