[Data Analysis] Visualizing a Correlation Matrix (Heatmap)
程序员文章站
2022-07-14 10:05:37
Data overview
# Example: the Kaggle House Prices (Ames, Iowa) training set
import pandas as pd
train=pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
print(train)  # print the frame itself; train.head without parentheses is only the bound method
        Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
0        1          60        RL         65.0     8450    Pave    NaN       Reg
1        2          20        RL         80.0     9600    Pave    NaN       Reg
2        3          60        RL         68.0    11250    Pave    NaN       IR1
3        4          70        RL         60.0     9550    Pave    NaN       IR1
4        5          60        RL         84.0    14260    Pave    NaN       IR1
...    ...         ...       ...          ...      ...     ...    ...       ...
1455  1456          60        RL         62.0     7917    Pave    NaN       Reg
1456  1457          20        RL         85.0    13175    Pave    NaN       Reg
1457  1458          70        RL         66.0     9042    Pave    NaN       Reg
1458  1459          20        RL         68.0     9717    Pave    NaN       Reg
1459  1460          20        RL         75.0     9937    Pave    NaN       Reg
      LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  \
0             Lvl     AllPub  ...         0     NaN    NaN          NaN        0
1             Lvl     AllPub  ...         0     NaN    NaN          NaN        0
2             Lvl     AllPub  ...         0     NaN    NaN          NaN        0
3             Lvl     AllPub  ...         0     NaN    NaN          NaN        0
4             Lvl     AllPub  ...         0     NaN    NaN          NaN        0
...           ...        ...  ...       ...     ...    ...          ...      ...
1455          Lvl     AllPub  ...         0     NaN    NaN          NaN        0
1456          Lvl     AllPub  ...         0     NaN  MnPrv          NaN        0
1457          Lvl     AllPub  ...         0     NaN  GdPrv         Shed     2500
1458          Lvl     AllPub  ...         0     NaN    NaN          NaN        0
1459          Lvl     AllPub  ...         0     NaN    NaN          NaN        0
      MoSold  YrSold  SaleType  SaleCondition  SalePrice
0          2    2008        WD         Normal     208500
1          5    2007        WD         Normal     181500
2          9    2008        WD         Normal     223500
3          2    2006        WD        Abnorml     140000
4         12    2008        WD         Normal     250000
...      ...     ...       ...            ...        ...
1455       8    2007        WD         Normal     175000
1456       2    2010        WD         Normal     210000
1457       5    2010        WD         Normal     266500
1458       4    2010        WD         Normal     142125
1459       6    2008        WD         Normal     147500
[1458 rows x 81 columns]
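Note that `head` is a method of the DataFrame: it has to be called with parentheses to return the first rows, and without them Python only hands back the bound method object. A minimal sketch of the difference, using a made-up toy frame (the values and the names `method` and `first_rows` are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame with made-up values; only the method-call behaviour matters here.
df = pd.DataFrame({"Id": range(1, 11), "SalePrice": range(100, 110)})

method = df.head        # no parentheses: the bound method object, nothing is called
first_rows = df.head()  # called: a new DataFrame holding the first 5 rows

print(callable(method), first_rows.shape)
```

For a quick look at a large file, `df.head()` is usually enough; printing the whole frame also reports the total row and column counts.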
Computing the correlation matrix
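import numpy as np
# train_drop is not defined anywhere in this article; it is presumably train with a few rows
# removed beforehand (the overview above reports 1458 rows), but that step is not shown.
k = 10  # number of features to keep
# Pairwise Pearson correlations; numeric_only=True (pandas >= 1.5) skips the string columns
corrmat = train_drop.corr(numeric_only=True)
# Labels of the k columns most correlated with SalePrice (SalePrice itself comes first)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# Correlation matrix of just those k columns; rows of the transposed array are variables
cm = np.corrcoef(train_drop[cols].values.T)
print(repr(cm))  # repr keeps the array(...) wrapper shown in the output below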
array([[1.        , 0.79577427, 0.73496816, 0.65115291, 0.64104701,
        0.63153038, 0.62921745, 0.56216475, 0.53776882, 0.5236084 ],
       [0.79577427, 1.        , 0.58941358, 0.53859453, 0.60074082,
        0.46909184, 0.55723004, 0.54841971, 0.4206215 , 0.57136809],
       [0.73496816, 0.58941358, 1.        , 0.40879348, 0.47544152,
        0.53369718, 0.4563575 , 0.63837846, 0.8294982 , 0.19439712],
       [0.65115291, 0.53859453, 0.40879348, 1.        , 0.45188972,
        0.80382963, 0.47506909, 0.32772043, 0.26614613, 0.40026576],
       [0.64104701, 0.60074082, 0.47544152, 0.45188972, 1.        ,
        0.44919454, 0.8873045 , 0.46819822, 0.36115155, 0.5373007 ],
       [0.63153038, 0.46909184, 0.53369718, 0.80382963, 0.44919454,
        1.        , 0.47729916, 0.38212   , 0.39638135, 0.28125344],
       [0.62921745, 0.55723004, 0.4563575 , 0.47506909, 0.8873045 ,
        0.47729916, 1.        , 0.4040763 , 0.32871405, 0.47799759],
       [0.56216475, 0.54841971, 0.63837846, 0.32772043, 0.46819822,
        0.38212   , 0.4040763 , 1.        , 0.55303847, 0.46714602],
       [0.53776882, 0.4206215 , 0.8294982 , 0.26614613, 0.36115155,
        0.39638135, 0.32871405, 0.55303847, 1.        , 0.09122031],
       [0.5236084 , 0.57136809, 0.19439712, 0.40026576, 0.5373007 ,
        0.28125344, 0.47799759, 0.46714602, 0.09122031, 1.        ]])
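The selection step can be tried without the Kaggle file. The sketch below is an illustration under stated assumptions: the data is synthetic and random, and the column names `Quality`, `Area` and `Noise` are hypothetical stand-ins for the housing features. It shows that `nlargest` ranks columns by their correlation with the target, and that slicing the pandas matrix gives the same values as `np.corrcoef` when no values are missing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical columns, random values).
rng = np.random.default_rng(0)
n = 200
quality = rng.normal(size=n)
area = rng.normal(size=n)
df = pd.DataFrame({
    "SalePrice": 3.0 * quality + 2.0 * area + rng.normal(size=n),
    "Quality": quality,
    "Area": area,
    "Noise": rng.normal(size=n),
})

k = 3
corrmat = df.corr()  # every column is numeric here
# nlargest sorts rows by the 'SalePrice' column, so the target itself (r = 1) comes first
cols = corrmat.nlargest(k, "SalePrice")["SalePrice"].index

# Two routes to the same k x k matrix when the data has no missing values
cm_numpy = np.corrcoef(df[cols].values.T)
cm_pandas = corrmat.loc[cols, cols]

print(list(cols))
print(np.allclose(cm_numpy, cm_pandas.values))
```

Keeping the result as a DataFrame (`corrmat.loc[cols, cols]`) has the side benefit that the row and column labels travel with the numbers.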
Visualization
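import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.25)  # scale all font sizes
# Draw the k x k matrix: each cell annotated to two decimals, feature names as tick labels
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()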
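For readers who want to try the plotting step on its own, here is a minimal self-contained sketch. Its assumptions: the data is random, the feature names `a` to `d` and the output file name `corr_heatmap.png` are hypothetical, and the `Agg` backend is chosen only so the code runs without a display. Unlike the article's call, it passes a labelled DataFrame to `sns.heatmap`, which then takes the tick labels from the index and columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random data with a shared component so the features are positively correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
df = pd.DataFrame(base + rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

corr = df.corr()  # labelled 4 x 4 correlation matrix

fig, ax = plt.subplots(figsize=(5, 4))
# vmin/vmax pin the colour scale to the full correlation range [-1, 1]
sns.heatmap(corr, annot=True, fmt=".2f", square=True,
            vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
fig.tight_layout()
fig.savefig("corr_heatmap.png")  # hypothetical output file name
```

Fixing the colour range to [-1, 1] with a diverging colormap is a design choice: it keeps colours comparable between plots and makes negative correlations visually distinct from positive ones.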