Bike sharing demand prediction 【kaggle competition】

程序员文章站 2024-03-07 18:00:03


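(I) Dataset introduction

This dataset comes from a Kaggle competition. It records hourly bike-sharing data for Washington, D.C. over 2011-2012. Two files are provided: train and test. The test file lacks the three columns casual, registered, and count, which are the values we need to predict. The train file covers the 1st through the 19th of each month, and the test file covers the 20th through the end of the month. The data contains 12 variables in total.

1.1 Independent variables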

Name	Type	Introduction
Datetime	yyyy-mm-dd hh:mm:ss	Hourly date + timestamp
Season	Integer	1=spring 2=summer 3=fall 4=winter
Holiday	Integer	1=holiday 0=not a holiday
Weather	Integer	1=Clear, Few clouds, Partly cloudy 2=Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3=Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4=Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
Temp	Float	Temperature in Celsius
Atemp	Float	"Feels like" temperature in Celsius
Humidity	Integer	Relative humidity
Windspeed	Float	Wind speed
Workingday	Integer	1=working day 0=not a working day

1.2 Dependent variables

Name	Type	Introduction
Casual	Integer	number of non-registered user rentals initiated
Registered	Integer	number of registered user rentals initiated
Count	Integer	Count=casual + registered

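1.3 Evaluation metric

Submissions are scored with the root mean squared logarithmic error (RMSLE):

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

where:

1. n is the number of hours in the test set

2. p_i is the predicted count

3. a_i is the actual count

4. log(x) is the natural logarithm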
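The metric defined above can be written as a small NumPy function. This is a minimal sketch for checking predictions locally; the function name `rmsle` and the sample counts are illustrative and do not come from the original write-up.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two arrays of counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) computes log(x + 1) using the natural logarithm
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up counts for three hours, purely for illustration
example_pred = [10, 0, 100]
example_true = [12, 0, 90]
print(rmsle(example_pred, example_true))
```

Because the error is taken on the logarithms, missing a count of 10 by 2 is penalized more heavily than missing a count of 1000 by 2.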

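(II) Data preprocessing and visualization

2.1 Data overview

import pandas as pd

# Load the Kaggle train and test files from the local dataset folder
train_df = pd.read_csv("E:/final repoet _ML/dataset/train.csv")
test_df = pd.read_csv("E:/final repoet _ML/dataset/test.csv")
# Column names, dtypes, and non-null counts of the training set
train_df.info()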

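The training set has 12 columns. Apart from datetime, all of them are non-null numeric data (integer or float), so almost no cleaning is needed. The only point that needs care is how datetime is split and processed: it may carry redundant information, so some of its components may have to be dropped.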
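One way to split datetime into separate components is pandas' `.dt` accessor. The sketch below runs on two made-up rows in the dataset's "yyyy-mm-dd hh:mm:ss" layout; the new column names (year, month, day, hour, weekday) are illustrative choices, not names fixed by the original.

```python
import pandas as pd

# Two made-up rows in the same "yyyy-mm-dd hh:mm:ss" layout as the dataset
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-07-15 17:00:00"]})

parsed = pd.to_datetime(sample["datetime"])
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day
sample["hour"] = parsed.dt.hour
sample["weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6

# Minutes and seconds are zero in these hourly rows, so they are not extracted
print(sample)
```

Once the components exist as numeric columns, the raw datetime string can be dropped before the data is passed to a model.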

test_df.info()

train_df.describe()

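2.2 Data visualization

# Add placeholder target columns to the test set so both frames share the same columns
test_df["casual"] = 0
test_df["registered"] = 0
test_df["count"] = 0
# Tag every row with its origin so the two sets can be separated again later
test_df["traintest"] = 'test'
train_df["traintest"] = 'train'
all_df = pd.concat((train_df, test_df))
# Split the "yyyy-mm-dd hh:mm:ss" string into its parts
all_df["date"] = all_df.datetime.apply(lambda x: x.split()[0])
all_df["monthnum"] = all_df.datetime.apply(lambda x: int(x.split()[0].split('-')[1]))
all_df["daynum"] = all_df.datetime.apply(lambda x: int(x.split()[0].split('-')[2]))
# hour and weekday are needed by the hourly plots in this section
all_df["hour"] = all_df.datetime.apply(lambda x: int(x.split()[1].split(':')[0]))
all_df["weekday"] = pd.to_datetime(all_df["date"]).dt.dayofweek
# Number of records per month
all_df.monthnum.value_counts().sort_index().plot(kind='bar')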


[Figure panels: Season, Weather, Humidity, Holiday, Workingday, Windspeed, Temp, Atemp]

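Analysis and conclusions:

1. The seasons have almost the same effect on bike usage.

2. Usage is far higher under weather category 1 than under the other categories.

3. Because humidity, temp, and atemp are strongly related, their influence and curves are similar, so a single one of them is enough for prediction.

4. The effect of workingday and holiday is hard to read directly from the plots.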

import seaborn as sn
# Assumes all_df has hour and weekday columns derived from datetime
# Mean count per hour in the training rows, one line per weekday
hourAggregated = all_df.loc[all_df.traintest == 'train'].groupby(["hour", "weekday"], sort=True)["count"].mean().reset_index()
sn.pointplot(data=hourAggregated, x="hour", y="count", hue="weekday")
# Same aggregation, one line per season
hourAggregated = all_df.loc[all_df.traintest == 'train'].groupby(["hour", "season"], sort=True)["count"].mean().reset_index()
sn.pointplot(data=hourAggregated, x="hour", y="count", hue="season")


[Figure panels: Season, Weekday, Holiday, Workingday]

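Analysis and conclusions:

1. Spring and winter are colder, so usage is lower, which matches common sense.

2. Working days show two peaks, at the morning and evening commutes. Weekends show a single peak that looks roughly like a normal distribution, which is also reasonable.

3. Whether or not the day is a holiday appears to make little difference.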

(III) Correlation analysis

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

dailyData = pd.read_csv("E:/final repoet _ML/dataset/train.csv")
corrMatt = dailyData[["temp", "atemp", "casual", "registered", "humidity", "windspeed", "count"]].corr()
# Mask the upper triangle so each pair of variables is shown only once
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sn.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True, ax=ax)

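Pearson correlation coefficient:

ρ = Cor(X,Y) = Cov(X,Y) / sqrt(Var(X) * Var(Y))

It measures the strength of the linear relationship between two variables.

Its value lies in the range [-1, 1].

Positive correlation: > 0; negative correlation: < 0; no linear correlation: = 0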
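The formula above can be computed directly with NumPy. This is a minimal sketch on made-up data; the helper name `pearson` is illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(cov / np.sqrt(x.var() * y.var()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson(x, 2 * x + 1))   # perfectly increasing linear relation
print(pearson(x, -3 * x))      # perfectly decreasing linear relation
```

The covariance and both variances use the same population normalization, so the normalization cancels and the value agrees with `np.corrcoef`.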

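1. The correlation coefficient between temp and atemp reaches 0.98, so the two are highly linearly correlated. This matches common sense, and one of them can be dropped.

2. registered and count are also highly correlated, but since both are prediction targets, nothing is done about it.

3. The correlations among the remaining variables are weak.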
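Dropping one variable from each highly correlated pair can be automated. The sketch below uses synthetic data that imitates the temp/atemp relationship; the helper `correlated_pairs` and the 0.9 threshold are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b) for a in upper.index for b in upper.columns
            if upper.loc[a, b] > threshold]

rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, size=200),  # near copy of temp
    "windspeed": rng.uniform(0, 50, size=200),   # generated independently of temp
})

pairs = correlated_pairs(demo)
print(pairs)

# Drop the second member of each flagged pair
reduced = demo.drop(columns=[b for _, b in pairs])
print(list(reduced.columns))
```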

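(IV) Algorithm selection and implementation

Because the bike sharing demand dataset is numeric and the target is a count, we consider regression models. The Machine Learning and Pattern Recognition course mainly covered linear regression, locally weighted regression, and tree regression. A search also suggested that random forest and XGBoost perform well on this task, so they are tested too.

4.1 Linear regression

Linear regression tends to underfit, because it computes the unbiased estimate with the minimum mean squared error. Still, it was the first regression model covered in the course, so it is used here as a first attempt.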

import numpy as np
from sklearn.linear_model import LinearRegression

# train, test, and submmit_datetime are assumed to be prepared beforehand:
# the train/test feature frames and the datetime column of the test set
dataTrain = train.drop(['date', 'casual', 'registered', 'count'], axis=1)
dataTest = test.drop(['date'], axis=1)
yLabels = train["count"]
yLabelsRegistered = train["registered"]
yLabelsCasual = train["casual"]

# Fit on log(count + 1), matching the log-based competition metric
model = LinearRegression()
yLabelsLog = np.log1p(yLabels)
model.fit(X=dataTrain, y=yLabelsLog)

# Invert log1p with expm1 (not exp) to get back to counts
LR_preds = np.expm1(model.predict(dataTest))
LR_preds.mean()

submission = pd.DataFrame({
    "datetime": submmit_datetime,
    "count": LR_preds
})
submission.to_csv('lr-submission.csv', index=False)
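When a model is trained on log1p(count), its predictions have to be mapped back to counts with expm1. Plain exp leaves out the "- 1" and overstates every count by exactly 1. A short check on made-up counts:

```python
import numpy as np

counts = np.array([0.0, 1.0, 16.0, 977.0])  # made-up hourly counts

log_counts = np.log1p(counts)     # log(count + 1), defined even when count is 0
restored = np.expm1(log_counts)   # exp(x) - 1, the inverse of log1p
wrong = np.exp(log_counts)        # plain exp leaves out the "- 1"

print(restored)
print(wrong - counts)
```

The offset matters most for hours with very few rentals, where an extra 1 is a large relative error under a logarithmic metric.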

Submission score:

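4.2 Random forest

A random forest is an ensemble of decision trees. It is usually introduced as a classifier, but it can also be used for regression. Combining many decision trees reduces overfitting. The trees are trained independently of one another, so training can run in parallel. Because randomness is injected into the algorithm, each tree differs slightly from the others. Combining the predictions of all trees reduces the variance of the prediction and improves performance on the test set.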
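The averaging step can be made visible with scikit-learn's `RandomForestRegressor`, whose fitted trees are exposed through the `estimators_` attribute. The sketch below uses synthetic data (a noisy sine curve), not the bike sharing data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
# One row of predictions per tree: the trees disagree slightly with each other
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.std(axis=0))                          # spread between individual trees
print(np.allclose(averaged, forest.predict(X_new)))  # forest output is the tree average
```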

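import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, new_y, and bikes_test are assumed to be prepared beforehand; new_y is assumed
# to be log1p(count), since the predictions are inverted with exp(...) - 1 below
X_train, X_test, y_train, y_test = train_test_split(X, new_y, test_size=0.33, random_state=42)
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
prediction = rf.predict(X_test)
mean_squared_error(y_test, prediction)  # hold-out error in log space

# Refit on all training rows before predicting the test set
rf.fit(X, new_y)
test_datetime = bikes_test['datetime']  # kept for the submission file
bikes_test = bikes_test.drop(['datetime'], axis=1)
prediction = np.expm1(rf.predict(bikes_test))

# Assemble the submission frame (datetime, count) and write it out
df = pd.DataFrame({'datetime': test_datetime, 'count': prediction})
df.to_csv("E:/final repoet _ML/dataset/ramdonfrSubmission.csv", index=False, sep=',')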

Submission score:

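4.3 XGBoost prediction

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm that is increasingly popular in industry and can improve a model's predictive power.

Under squared loss, each new regression tree fits the residuals of the current model. Under a general loss function, it fits an approximation of the residuals given by the gradient of the loss. When a node is split, the values of every feature are enumerated and the best split point is chosen. The final prediction is the sum of the predictions of all trees.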
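The residual-fitting idea can be sketched with depth-1 trees (stumps) in plain NumPy. This is a simplified illustration of gradient boosting under squared loss on synthetic one-feature data, not XGBoost's actual algorithm, which adds regularization and second-order information. The names `fit_stump` and `predict_stump`, the learning rate, and the number of rounds are all illustrative.

```python
import numpy as np

def fit_stump(x, residual):
    """Try every split point on one feature; return (threshold, left_value, right_value)."""
    best = None
    values = np.unique(x)
    for threshold in (values[:-1] + values[1:]) / 2:  # midpoints between neighbours
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
stumps, mse_history = [], []
for _ in range(50):
    residual = y - prediction           # what the current model still gets wrong
    stump = fit_stump(x, residual)
    stumps.append(stump)
    prediction = prediction + learning_rate * predict_stump(stump, x)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(mse_history[0], mse_history[-1])
```

Each round can only lower the training error, and the final prediction is the initial constant plus the scaled sum of all stump outputs.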

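Advantages of XGBoost:

1. Regularization: reduces overfitting

2. Parallel processing: faster than GBM

3. High flexibility: users can define their own optimization objectives and evaluation metrics

4. Missing-value handling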

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

def evalerror(preds, dtrain):
    # Labels are already log1p(count), so RMSE in log space equals RMSLE on the raw counts
    labels = dtrain.get_label()
    assert len(preds) == len(labels)
    return 'error', float(np.sqrt(np.mean((np.maximum(preds, 0) - labels) ** 2)))

X = dailyData.drop(['datetime', 'casual', 'registered', 'count'], axis=1)
y = np.log1p(dailyData['count'])
# test is assumed to be the test.csv frame with its original columns
x_test = test.drop(['datetime'], axis=1)
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=4242)

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
d_test = xgb.DMatrix(x_test)

params = {
    'objective': 'reg:squarederror',  # current name of the deprecated 'reg:linear'
    'eta': 0.1,
    'max_depth': 5,
}

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# custom_metric is the current name of the deprecated feval argument (XGBoost 1.6 and later)
clf = xgb.train(params, d_train, num_boost_round=2000, evals=watchlist, early_stopping_rounds=50, custom_metric=evalerror, maximize=False, verbose_eval=10)
xgb.plot_importance(clf)

# Invert the log1p transform and write the Kaggle submission file
p_test = np.expm1(clf.predict(d_test))
dataframe = pd.DataFrame({'datetime': test['datetime'], 'count': p_test})
dataframe.to_csv("E:/final repoet _ML/dataset/sampleSubmission.csv", index=False, sep=',')

Result: