Time-series forecasting with an LSTM: the approach, a TensorFlow implementation, and the input data format
To start, I recommend a blog post that explains some of the LSTM-related class functions: function reference.
My goal was to predict the price of a certain fruit with an LSTM. My first approach was to pass in the fruit's prices for the previous n days as input variables, so the DataFrame I fed in had n+1 columns. The trained model performed poorly, nowhere near the fit I had previously obtained with an ARIMA time-series model on the price curve.
After reading many more blog posts and other material, I finally understood the meaning of one parameter: time_step. In an LSTM (long short-term memory network), time_step is the parameter that embodies the network's memory. For example, to predict today's price from the previous 100 days' prices, time_step must be 100, and this only works if the data is continuous. Along the way I also hit a pitfall: the format of the input data. I'll get to that after the code.
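To make time_step concrete, here is a minimal windowing sketch (hypothetical prices, independent of the code below): a 1-D price series is cut into overlapping windows of length time_step, which is what gives the LSTM its memory of the preceding days.

import numpy as np

prices = np.arange(10, dtype=np.float32)  # hypothetical daily prices
time_step = 3

# each sample is the previous time_step prices; the label is the next day's price
x = np.array([prices[i:i + time_step] for i in range(len(prices) - time_step)])
y = prices[time_step:]
x = x[:, :, np.newaxis]  # -> (samples, time_step, 1), the layout the LSTM expects
print(x.shape, y.shape)  # (7, 3, 1) (7,)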
My code is adapted from the code in this blog post. The main thing I changed is the data format. The problem does not arise when the input X has many columns, but it showed up once I used price alone to predict price. If the data is processed the same way as before, it raises an error:
ValueError: Cannot feed value of shape (1,) for Tensor 'train/Placeholder:0', which has shape '(?, 100, 1)'
Errors like this are all caused by shapes. In general, the inputs argument of tf.nn.dynamic_rnn() has the shape [batch_size, time_step_size, input_size]: time_step_size is the number of days to look back, input_size is the number of input variables, and batch_size is the chunk size, determined by the amount of data. In this code the training data is built correctly; the problem lies in the test data. Say there are 600 training rows and 200 test rows. In the [batch_size, time_step_size, input_size] layout, train_x might be [590, 10, 10] and train_y [590, 10, 1], while test_x is [20, 10, 10] and test_y is a flat array, used mainly to compute the accuracy. Note that when test_x's input_size is greater than 1, the processing is straightforward: DataFrame's .iloc[].values returns a numpy ndarray, for example:
data = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])   # numpy.ndarray
x = data[i * time_step:(i + 1) * time_step, 1:101]   # slicing multiple columns returns a 2-D array
x = data[i * time_step:(i + 1) * time_step, 1]       # slicing a single column returns a 1-D array
You can print these out to see for yourself; this is exactly where the main shape problem arises.
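Here is a small standalone demonstration of the difference (my own example, not from the original post): slicing a single column drops a dimension, and np.newaxis restores it so that each sample has shape [time_step, 1] instead of [time_step].

import numpy as np

data = np.arange(12, dtype=np.float32).reshape(4, 3)
time_step = 2

print(data[0:time_step, 1].shape)              # (2,)   -> feeding this causes the ValueError above
print(data[0:time_step, 1, np.newaxis].shape)  # (2, 1) -> matches (?, time_step, input_size=1)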
# -*- coding: utf-8 -*-
# @Time : 18-10-19
# @Author : lin
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
pd.set_option('max_colwidth', 5000)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 1000)
# Define constants
rnn_unit = 10  # hidden layer units
input_size = 1  # one input variable
output_size = 1  # one output variable
lr = 0.0006  # learning rate
# —————————————————— Load the data ——————————————————
f = open("/home/user_name/下载/vegetable_data/veg/xxx.csv")
df = pd.read_csv(f)  # read in the data
df = pd.concat([df['price'], df['pre_price_100']], axis=1)
print(df)
data = df.iloc[:800, :].values  # two columns in total
# Get the training set
def get_train_data(batch_size=60, time_step=10, train_begin=0, train_end=600):
    """
    train_x has shape [samples, time_step, number of input variables]
    train_y has shape [samples, time_step, number of output variables]
    one sample starts at each day, so samples = (train_end - train_begin) - time_step
    :param batch_size:
    :param time_step:
    :param train_begin:
    :param train_end:
    :return:
    """
    batch_index = []
    data_train = data[train_begin:train_end]
    normalized_train_data = data_train
    train_x, train_y = [], []  # training set
    for i in range(len(normalized_train_data) - time_step):
        if i % batch_size == 0:
            batch_index.append(i)
        x = normalized_train_data[i:i + time_step, 1, np.newaxis]
        y = normalized_train_data[i:i + time_step, 0, np.newaxis]
        train_x.append(x.tolist())
        train_y.append(y.tolist())
    batch_index.append((len(normalized_train_data) - time_step))
    return batch_index, train_x, train_y
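# A quick sanity check of the returned shapes (hypothetical usage, not in the original code):
# with 600 training rows and time_step=10 there are 600 - 10 = 590 samples, so
#   _, tr_x, tr_y = get_train_data(batch_size=60, time_step=10, train_begin=0, train_end=600)
#   np.array(tr_x).shape -> (590, 10, 1), np.array(tr_y).shape -> (590, 10, 1)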
# Get the test set
def get_test_data(time_step=10, test_begin=600):
    # the returned test_y is a flat array whose length equals the amount of test data
    data_test = data[test_begin:]
    mean = np.mean(data_test, axis=0)
    std = np.std(data_test, axis=0)
    # normalized_test_data = (data_test - mean) / std  # standardization, which I did not use
    normalized_test_data = data_test
    size = (len(normalized_test_data) + time_step - 1) // time_step  # there are `size` samples
    # print(size)
    test_x, test_y = [], []
    for i in range(size - 1):
        x = normalized_test_data[i * time_step:(i + 1) * time_step, 1, np.newaxis]
        y = normalized_test_data[i * time_step:(i + 1) * time_step, 0]
        test_x.append(x)
        test_y.extend(y)
    x = (normalized_test_data[(i + 1) * time_step:, 1, np.newaxis])
    test_x.append(x)
    test_y.extend((normalized_test_data[(i + 1) * time_step:, 0]).tolist())
    return mean, std, test_x, test_y
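# Shape check (hypothetical, assuming 200 test rows and time_step=10): test_x ends up as
# 20 arrays of shape (10, 1) - 19 full windows plus one trailing window with the remaining
# rows - while test_y is a flat list of 200 prices, which the accuracy code below expects.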
# —————————————————— Define the network's variables ——————————————————
# weights and biases of the input and output layers
weights = {
    'in': tf.Variable(tf.random_normal([input_size, rnn_unit])),
    'out': tf.Variable(tf.random_normal([rnn_unit, 1]))
}
biases = {
    'in': tf.Variable(tf.constant(0.1, shape=[rnn_unit, ])),
    'out': tf.Variable(tf.constant(0.1, shape=[1, ]))
}
# —————————————————— Define the LSTM network ——————————————————
def lstm(X):
    batch_size = tf.shape(X)[0]
    time_step = tf.shape(X)[1]
    w_in = weights['in']
    b_in = biases['in']
    # -1 lets the first dimension be inferred from the second
    input = tf.reshape(X, [-1, input_size])  # reshape the tensor to 2-D; the result feeds the hidden layer
    input_rnn = tf.matmul(input, w_in) + b_in
    input_rnn = tf.reshape(input_rnn, [-1, time_step, rnn_unit])  # reshape back to 3-D as the LSTM cell's input
    cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
    init_state = cell.zero_state(batch_size, dtype=tf.float32)
    # output_rnn holds the output of every LSTM step; final_states is the state of the last cell
    output_rnn, final_states = tf.nn.dynamic_rnn(cell,
                                                 input_rnn, initial_state=init_state,
                                                 dtype=tf.float32)
    output = tf.reshape(output_rnn, [-1, rnn_unit])  # input to the output layer
    w_out = weights['out']
    b_out = biases['out']
    pred = tf.matmul(output, w_out) + b_out
    print(pred, final_states)
    return pred, final_states
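# The reshape trick above, traced by hand (shapes only, illustrative):
#   X: (batch, time_step, input_size) --reshape--> (batch*time_step, input_size)
#   matmul with w_in (input_size, rnn_unit)   -->  (batch*time_step, rnn_unit)
#   --reshape--> (batch, time_step, rnn_unit), the 3-D input dynamic_rnn expects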
# —————————————————— Train the model ——————————————————
def train_lstm(batch_size=80, time_step=100, train_begin=0, train_end=500):
    X = tf.placeholder(tf.float32, shape=[None, time_step, input_size])
    Y = tf.placeholder(tf.float32, shape=[None, time_step, output_size])
    batch_index, train_x, train_y = get_train_data(batch_size, time_step, train_begin, train_end)
    mean, std, test_x, test_y = get_test_data(time_step)
    # test_y is a list
    pred, _ = lstm(X)
    # loss function
    loss = tf.reduce_mean(tf.square(tf.reshape(pred, [-1]) - tf.reshape(Y, [-1])))
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=1)
    module_file = tf.train.latest_checkpoint('/home/lin/PycharmProjects/lins/vegetable/model/')
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        try:
            saver.restore(sess, module_file)
        except Exception as error:
            pass
        # A model is restored with the restore() function, which takes two arguments, restore(sess, save_path),
        # where save_path is the path of the saved model. tf.train.latest_checkpoint() can be used
        # to fetch the last saved checkpoint automatically.
        # train repeatedly (1001 iterations here)
        for i in range(1001):
            for step in range(len(batch_index) - 1):
                _, loss_ = sess.run([train_op, loss],
                                    feed_dict={X: train_x[batch_index[step]:batch_index[step + 1]],
                                               Y: train_y[batch_index[step]:batch_index[step + 1]]})
            if i % 50 == 0:
                print(i, loss_)
        test_predict = []
        for step in range(len(test_x)):
            prob = sess.run(pred, feed_dict={X: [test_x[step]]})
            predict = prob.reshape((-1))
            test_predict.extend(predict)
        # Print the predictions and the original data for comparison, then the share of
        # predictions whose error is within 10%, 5% and 1%
        print(test_predict)
        print(test_y)
        boolen_list = [abs(test_y[i] - test_predict[i]) / test_predict[i] < 0.1 for i in range(len(test_y))]
        num_list = tf.cast(boolen_list, tf.float32)
        accuracy = tf.reduce_mean(num_list)
        print(sess.run(accuracy))
        boolen_list = [abs(test_y[i] - test_predict[i]) / test_predict[i] < 0.05 for i in range(len(test_y))]
        num_list = tf.cast(boolen_list, tf.float32)
        accuracy = tf.reduce_mean(num_list)
        print(sess.run(accuracy))
        boolen_list = [abs(test_y[i] - test_predict[i]) / test_predict[i] < 0.01 for i in range(len(test_y))]
        num_list = tf.cast(boolen_list, tf.float32)
        accuracy = tf.reduce_mean(num_list)
        print(sess.run(accuracy))
with tf.variable_scope('train'):
    train_lstm()
So in the data-processing step I added an np.newaxis when building test_x, giving it the shape [20, 10, 1]; without it, test_x would come out with shape [20, 10].
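Incidentally, the accuracy block above creates new TensorFlow ops inside the session for each threshold. The same three numbers can be computed with plain numpy; here is a small sketch of that alternative (my own illustration, not from the original code; the helper name within_tolerance is made up):

import numpy as np

def within_tolerance(test_y, test_predict, tol):
    # fraction of predictions whose relative error is below tol
    y = np.asarray(test_y, dtype=np.float32)
    p = np.asarray(test_predict, dtype=np.float32)
    return np.mean(np.abs(y - p) / p < tol)

# usage:
# for tol in (0.1, 0.05, 0.01):
#     print(within_tolerance(test_y, test_predict, tol))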