Text Classification Exercise 2: Classifying News Articles by Category on a Subset of THUCNews
程序员文章站
2022-06-12 16:01:54
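1. Characteristics: Chinese-language dataset, ten categories
2. Tool: TensorFlow
3. Dataset description and example code: https://github.com/gaussic/text-classification-cnn-rnn
4. Modify run_cnn.py from the example code as shown below (run_rnn.py can be modified in the same way) and place the cnews data subset in the data folder. The change comments out the command-line argument check and calls train() and test() one after the other, so the script can be run directly from PyCharm without arguments (macOS + PyCharm + TensorFlow 1.12.0 + Python 3.6):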
if __name__ == '__main__':
    # The original argument check is disabled so no CLI argument is required:
    # if len(sys.argv) != 2 or sys.argv[1] not in ['train', 'test']:
    #     raise ValueError("""usage: python run_cnn.py [train / test]""")

    print('Configuring CNN model...')
    config = TCNNConfig()
    if not os.path.exists(vocab_dir):  # rebuild the vocabulary if it does not exist
        build_vocab(train_dir, vocab_dir, config.vocab_size)
    categories, cat_to_id = read_category()
    words, word_to_id = read_vocab(vocab_dir)
    config.vocab_size = len(words)
    model = TextCNN(config)

    # The original train/test dispatch is replaced by running both phases:
    # if sys.argv[1] == 'train':
    #     train()
    # else:
    #     test()
    train()
    test()
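As an alternative to commenting the argument check out, the script could keep its command-line interface and simply fall back to running both phases when no argument is given. The following is only a minimal sketch of that idea: `resolve_modes` and `VALID_MODES` are hypothetical names that do not exist in the repository, and the sketch only decides which phases to run; in run_cnn.py the returned names would be mapped to the existing train() and test() calls.

```python
VALID_MODES = ("train", "test")  # hypothetical constant, not part of the repository


def resolve_modes(argv):
    """Return the phases to run for a given argv list (hypothetical helper).

    No argument  -> both phases, matching the modified __main__ block above.
    One argument -> only that phase, matching the original script.
    Anything else raises the same usage error as the original check.
    """
    args = argv[1:]
    if not args:
        return list(VALID_MODES)
    if len(args) == 1 and args[0] in VALID_MODES:
        return [args[0]]
    raise ValueError("usage: python run_cnn.py [train / test]")


# argv as an IDE run configuration without parameters would pass it,
# and as the command line would pass it with an explicit mode.
print(resolve_modes(["run_cnn.py"]))
print(resolve_modes(["run_cnn.py", "test"]))
```

With this approach the same file would work both from PyCharm and from a terminal, at the cost of a few extra lines.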
5. Program output
/Users/gaoxuanxuan/anaconda3/envs/tensorflow/bin/python /Users/gaoxuanxuan/PycharmProjects/NLP/TextClassification/text-classification-cnn-rnn/run_cnn.py
Configuring CNN model...
WARNING:tensorflow:From /Users/gaoxuanxuan/PycharmProjects/NLP/TextClassification/text-classification-cnn-rnn/cnn_model.py:66: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:24
2019-03-03 21:34:01.237824: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Training and evaluating...
Epoch: 1
Iter: 0, Train Loss: 2.3, Train Acc: 3.12%, Val Loss: 2.3, Val Acc: 9.92%, Time: 0:00:07 *
Iter: 100, Train Loss: 0.86, Train Acc: 78.12%, Val Loss: 1.2, Val Acc: 68.96%, Time: 0:01:17 *
Iter: 200, Train Loss: 0.36, Train Acc: 89.06%, Val Loss: 0.72, Val Acc: 80.48%, Time: 0:02:24 *
Iter: 300, Train Loss: 0.18, Train Acc: 96.88%, Val Loss: 0.43, Val Acc: 89.72%, Time: 0:03:24 *
Iter: 400, Train Loss: 0.14, Train Acc: 96.88%, Val Loss: 0.36, Val Acc: 91.00%, Time: 0:04:22 *
Iter: 500, Train Loss: 0.22, Train Acc: 93.75%, Val Loss: 0.39, Val Acc: 91.16%, Time: 0:05:26 *
Iter: 600, Train Loss: 0.3, Train Acc: 90.62%, Val Loss: 0.33, Val Acc: 92.28%, Time: 0:06:42 *
Iter: 700, Train Loss: 0.11, Train Acc: 95.31%, Val Loss: 0.29, Val Acc: 92.92%, Time: 0:08:45 *
Epoch: 2
Iter: 800, Train Loss: 0.051, Train Acc: 98.44%, Val Loss: 0.29, Val Acc: 92.84%, Time: 0:10:57
Iter: 900, Train Loss: 0.21, Train Acc: 93.75%, Val Loss: 0.3, Val Acc: 90.86%, Time: 0:12:56
Iter: 1000, Train Loss: 0.044, Train Acc: 100.00%, Val Loss: 0.29, Val Acc: 91.52%, Time: 0:14:52
Iter: 1100, Train Loss: 0.13, Train Acc: 98.44%, Val Loss: 0.28, Val Acc: 92.72%, Time: 0:17:00
Iter: 1200, Train Loss: 0.06, Train Acc: 98.44%, Val Loss: 0.29, Val Acc: 93.14%, Time: 0:19:00 *
Iter: 1300, Train Loss: 0.084, Train Acc: 98.44%, Val Loss: 0.29, Val Acc: 90.76%, Time: 0:21:04
Iter: 1400, Train Loss: 0.13, Train Acc: 93.75%, Val Loss: 0.19, Val Acc: 94.68%, Time: 0:23:24 *
Iter: 1500, Train Loss: 0.066, Train Acc: 98.44%, Val Loss: 0.21, Val Acc: 94.20%, Time: 0:25:49
Epoch: 3
Iter: 1600, Train Loss: 0.0086, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 94.92%, Time: 0:27:49 *
Iter: 1700, Train Loss: 0.0068, Train Acc: 100.00%, Val Loss: 0.21, Val Acc: 94.60%, Time: 0:46:25
Iter: 1800, Train Loss: 0.039, Train Acc: 98.44%, Val Loss: 0.18, Val Acc: 94.94%, Time: 3:12:36 *
Iter: 1900, Train Loss: 0.043, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 94.70%, Time: 4:55:34
Iter: 2000, Train Loss: 0.0047, Train Acc: 100.00%, Val Loss: 0.21, Val Acc: 94.46%, Time: 7:22:15
Iter: 2100, Train Loss: 0.015, Train Acc: 100.00%, Val Loss: 0.17, Val Acc: 95.26%, Time: 9:09:48 *
Iter: 2200, Train Loss: 0.13, Train Acc: 96.88%, Val Loss: 0.22, Val Acc: 93.24%, Time: 10:55:41
Iter: 2300, Train Loss: 0.091, Train Acc: 95.31%, Val Loss: 0.22, Val Acc: 93.14%, Time: 12:44:59
Epoch: 4
Iter: 2400, Train Loss: 0.1, Train Acc: 96.88%, Val Loss: 0.21, Val Acc: 94.24%, Time: 13:53:55
Iter: 2500, Train Loss: 0.021, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 94.96%, Time: 15:43:02
Iter: 2600, Train Loss: 0.012, Train Acc: 100.00%, Val Loss: 0.2, Val Acc: 95.00%, Time: 18:03:03
Iter: 2700, Train Loss: 0.037, Train Acc: 98.44%, Val Loss: 0.18, Val Acc: 95.18%, Time: 19:59:22
Iter: 2800, Train Loss: 0.041, Train Acc: 98.44%, Val Loss: 0.2, Val Acc: 94.46%, Time: 22:22:35
Iter: 2900, Train Loss: 0.022, Train Acc: 100.00%, Val Loss: 0.22, Val Acc: 94.10%, Time: 1 day, 0:11:13
Iter: 3000, Train Loss: 0.051, Train Acc: 98.44%, Val Loss: 0.18, Val Acc: 95.42%, Time: 1 day, 2:37:24 *
Iter: 3100, Train Loss: 0.14, Train Acc: 96.88%, Val Loss: 0.22, Val Acc: 94.08%, Time: 1 day, 4:24:21
Epoch: 5
Iter: 3200, Train Loss: 0.0027, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 95.60%, Time: 1 day, 6:07:38 *
Iter: 3300, Train Loss: 0.001, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 95.16%, Time: 1 day, 8:19:46
Iter: 3400, Train Loss: 0.0047, Train Acc: 100.00%, Val Loss: 0.2, Val Acc: 95.36%, Time: 1 day, 10:04:01
Iter: 3500, Train Loss: 0.0057, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 95.18%, Time: 1 day, 12:14:28
Iter: 3600, Train Loss: 0.011, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 95.26%, Time: 1 day, 13:10:23
Iter: 3700, Train Loss: 0.076, Train Acc: 98.44%, Val Loss: 0.2, Val Acc: 94.50%, Time: 1 day, 13:12:36
Iter: 3800, Train Loss: 0.0061, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 95.64%, Time: 1 day, 13:14:34 *
Iter: 3900, Train Loss: 0.014, Train Acc: 100.00%, Val Loss: 0.2, Val Acc: 94.86%, Time: 1 day, 13:16:45
Epoch: 6
Iter: 4000, Train Loss: 0.016, Train Acc: 98.44%, Val Loss: 0.22, Val Acc: 94.34%, Time: 1 day, 13:18:47
Iter: 4100, Train Loss: 0.034, Train Acc: 96.88%, Val Loss: 0.22, Val Acc: 94.82%, Time: 1 day, 13:20:49
Iter: 4200, Train Loss: 0.0029, Train Acc: 100.00%, Val Loss: 0.23, Val Acc: 94.72%, Time: 1 day, 13:23:00
Iter: 4300, Train Loss: 0.0052, Train Acc: 100.00%, Val Loss: 0.16, Val Acc: 96.10%, Time: 1 day, 13:24:04 *
Iter: 4400, Train Loss: 0.025, Train Acc: 98.44%, Val Loss: 0.18, Val Acc: 95.36%, Time: 1 day, 13:25:03
Iter: 4500, Train Loss: 0.0013, Train Acc: 100.00%, Val Loss: 0.21, Val Acc: 95.06%, Time: 1 day, 13:26:02
Iter: 4600, Train Loss: 0.028, Train Acc: 98.44%, Val Loss: 0.25, Val Acc: 93.72%, Time: 1 day, 13:26:59
Epoch: 7
Iter: 4700, Train Loss: 0.014, Train Acc: 98.44%, Val Loss: 0.24, Val Acc: 94.42%, Time: 1 day, 13:27:55
Iter: 4800, Train Loss: 0.0071, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 95.98%, Time: 1 day, 13:28:51
Iter: 4900, Train Loss: 0.00074, Train Acc: 100.00%, Val Loss: 0.2, Val Acc: 95.42%, Time: 1 day, 13:29:53
Iter: 5000, Train Loss: 0.00081, Train Acc: 100.00%, Val Loss: 0.18, Val Acc: 95.60%, Time: 1 day, 13:31:02
Iter: 5100, Train Loss: 0.00093, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 95.78%, Time: 1 day, 13:32:01
Iter: 5200, Train Loss: 0.001, Train Acc: 100.00%, Val Loss: 0.22, Val Acc: 94.86%, Time: 1 day, 13:33:02
Iter: 5300, Train Loss: 0.0074, Train Acc: 100.00%, Val Loss: 0.19, Val Acc: 95.26%, Time: 1 day, 13:34:01
No optimization for a long time, auto-stopping...
Loading test data...
Testing...
Test Loss: 0.12, Test Acc: 97.08%
Precision, Recall and F1-Score...
                          precision    recall  f1-score   support

体育 (sports)                  1.00      0.99      0.99      1000
财经 (finance)                 0.96      0.98      0.97      1000
房产 (real estate)             1.00      1.00      1.00      1000
家居 (home furnishing)         0.98      0.91      0.94      1000
教育 (education)               0.95      0.94      0.95      1000
科技 (technology)              0.96      0.99      0.98      1000
时尚 (fashion)                 0.96      0.98      0.97      1000
时政 (politics)                0.93      0.97      0.95      1000
游戏 (games)                   0.99      0.97      0.98      1000
娱乐 (entertainment)           0.98      0.97      0.98      1000

micro avg                      0.97      0.97      0.97     10000
macro avg                      0.97      0.97      0.97     10000
weighted avg                   0.97      0.97      0.97     10000
Confusion Matrix...
[[990   0   0   0   2   2   1   4   1   0]
 [  0 985   0   1   1   3   0  10   0   0]
 [  0   0 997   1   1   0   0   1   0   0]
 [  0  20   2 906  17   5  10  36   1   3]
 [  0   6   0   5 945  13   8  18   2   3]
 [  0   3   0   2   0 988   3   1   2   1]
 [  2   0   0   2   4   1 982   0   2   7]
 [  0   9   0   2  13   5   0 970   1   0]
 [  1   1   1   3   7   2   9   3 970   3]
 [  0   2   0   5   3   5   7   0   3 975]]
Time usage: 0:00:30
Process finished with exit code 0
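The per-class figures in the report can be cross-checked against the confusion matrix printed above. The sketch below is not code from the repository; it copies the matrix from the output and assumes rows are true classes and columns are predicted classes, which is consistent with every row summing to the support of 1000. The helper names `per_class_metrics` and `accuracy` are introduced here for illustration.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true classes, columns are predicted classes.
CONFUSION = [
    [990, 0, 0, 0, 2, 2, 1, 4, 1, 0],
    [0, 985, 0, 1, 1, 3, 0, 10, 0, 0],
    [0, 0, 997, 1, 1, 0, 0, 1, 0, 0],
    [0, 20, 2, 906, 17, 5, 10, 36, 1, 3],
    [0, 6, 0, 5, 945, 13, 8, 18, 2, 3],
    [0, 3, 0, 2, 0, 988, 3, 1, 2, 1],
    [2, 0, 0, 2, 4, 1, 982, 0, 2, 7],
    [0, 9, 0, 2, 13, 5, 0, 970, 1, 0],
    [1, 1, 1, 3, 7, 2, 9, 3, 970, 3],
    [0, 2, 0, 5, 3, 5, 7, 0, 3, 975],
]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]                                 # correctly predicted as class k
        predicted = sum(cm[i][k] for i in range(n))   # column sum: all predicted as k
        actual = sum(cm[k])                           # row sum: all truly in class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(cm):
    """Overall accuracy: diagonal sum divided by the total number of samples."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


if __name__ == "__main__":
    print("accuracy: %.4f" % accuracy(CONFUSION))
    for k, (p, r, f) in enumerate(per_class_metrics(CONFUSION)):
        print("class %d: precision=%.2f recall=%.2f f1=%.2f" % (k, p, r, f))
```

The diagonal sums to 9708 out of 10000 samples, which agrees with the reported Test Acc of 97.08%. The fourth row (家居) has the weakest recall, 906 of 1000, with most of its errors going to the 时政 and 财经 columns.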