Alibaba Tianchi Beginner Competition: Used Car Transaction Price Prediction
Task 2: EDA (Exploratory Data Analysis)
Official competition page: https://tianchi.aliyun.com/competition/entrance/231784/information
Contents
1 Load the data science and visualization libraries
- Data science libraries: pandas, numpy, scipy
- Visualization libraries: matplotlib, seaborn
- Others
2 Load the data
- Load the training and test sets
3 Data overview
- Take a quick look at the data
- Use describe() to get familiar with the summary statistics
- Use info() to get familiar with the data types
4 Check for missing values and anomalies
- Check the NaN situation in each column
- Outlier detection
5 Understand the distribution of the target
- Overall distribution (unbounded Johnson distribution, etc.)
- Check skewness and kurtosis
- Check the frequency counts of the target
6 Split features into categorical and numeric
6-1 Numeric feature analysis
- Correlation analysis
- Check the skewness and kurtosis of several features
- Visualize the distribution of each numeric feature
- Visualize the relationships between numeric features
- Visualize pairwise regression relationships between several variables
6-2 Categorical feature analysis
- unique value distribution
- Box plots of the categorical features
- Violin plots of the categorical features
- Bar plots of the categorical features
- Frequency counts of each category
7 Generate a data report
1 Load the data science and visualization libraries
## Basic tools
import numpy as np
import pandas as pd
import pandas_profiling
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
import scipy.stats as st
from IPython.display import display, clear_output
import time
warnings.filterwarnings('ignore')
%matplotlib inline
## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb
## Hyperparameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
2 Load the data
## Read the data with pandas (pandas provides very friendly data-loading functions)
Train_data = pd.read_csv('data/used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('data/used_car_testA_20200313.csv', sep=' ')
## Print the shape of the data
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)
3 Data overview
## Take a quick look at the data (first and last rows)
# DataFrame.append was removed in pandas 2.0, so concatenate head and tail with pd.concat
pd.concat([Train_data.head(), Train_data.tail()])
## .info() gives a brief view of the column names, dtypes, and NaN/missing information
Train_data.info()
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
## Use .columns to view the column names
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
TestA_data.info()
## .describe() shows summary statistics for the numeric columns
Train_data.describe()
TestA_data.describe()
4 Check for missing values and anomalies
# Missing values in the training set
train_total = Train_data.isnull().sum().sort_values(ascending=False)
train_percent = (Train_data.isnull().sum() / Train_data.isnull().count()).sort_values(ascending=False)
missing_train_data = pd.concat([train_total, train_percent], axis=1, keys=['train_total', 'train_percent'])
missing_train_data.head(10)
# Missing values in the test set
test_total = TestA_data.isnull().sum().sort_values(ascending=False)
test_percent = (TestA_data.isnull().sum() / TestA_data.isnull().count()).sort_values(ascending=False)
missing_test_data = pd.concat([test_total, test_percent], axis=1, keys=['test_total', 'test_percent'])
missing_test_data.head(10)
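Beyond the tables above, a simple bar chart of the per-column NaN counts makes the pattern easier to see at a glance. A minimal sketch using matplotlib and the DataFrames loaded above (it only plots columns that actually contain missing values):
# Bar chart of missing-value counts per column (training set)
nan_counts = Train_data.isnull().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
nan_counts.plot.bar()
plt.title('Missing values per column (train)')
plt.show()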
## Check for anomalous values and handle them
# All columns are numeric except notRepairedDamage, which is of object dtype, so inspect it first
Train_data['notRepairedDamage'].value_counts()
TestA_data['notRepairedDamage'].value_counts()
# Replace '-' in the notRepairedDamage column with NaN
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
Train_data['notRepairedDamage'].value_counts()
TestA_data['notRepairedDamage'].replace('-', np.nan, inplace=True)
TestA_data['notRepairedDamage'].value_counts()
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
0.0 37249
1.0 4720
Name: notRepairedDamage, dtype: int64
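After replacing '-' with NaN the column still has object dtype. If it should later be treated as a numeric 0/1 flag, one possible (optional) step is a plain cast, sketched below; it is not required by the rest of this walkthrough, which keeps the column categorical in section 6-2:
# Optional: cast the cleaned column to float so NaN is handled natively
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].astype('float64')
TestA_data['notRepairedDamage'] = TestA_data['notRepairedDamage'].astype('float64')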
5 Understand the distribution of the target
# Look at the overall distribution of the target and check whether it is roughly normal
y = Train_data['price']
plt.figure(1)
plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2)
plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3)
plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
# Check the skewness and kurtosis of the target
# Skewness: a normal distribution has skewness = 0; a right-skewed (positively skewed) distribution has skewness > 0; a left-skewed (negatively skewed) distribution has skewness < 0
print("Skewness: %f" % Train_data['price'].skew())
print("kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
kurtosis: 18.995183
# Summary statistics of the target price
Train_data['price'].describe()
count 150000.000000
mean 5923.327333
std 7501.998477
min 11.000000
25% 1300.000000
50% 3250.000000
75% 7700.000000
max 99999.000000
Name: price, dtype: float64
# Histogram of the target
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='blue')
plt.show()
# Tip: applying a log transform to the target makes the distribution look much more even
# plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='blue')
# plt.show()
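To quantify the tip above, here is a minimal sketch that applies np.log1p (log(1 + x), which is safe at zero) and re-checks skewness and kurtosis; the variable name price_log is just for illustration:
# Skewness/kurtosis before vs. after a log1p transform of the target
price_log = np.log1p(Train_data['price'])
print("Skewness after log1p: %f" % price_log.skew())
print("Kurtosis after log1p: %f" % price_log.kurt())
plt.hist(price_log, orientation='vertical', histtype='bar', color='blue')
plt.show()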
6 Split features into categorical and numeric
# Split the columns into numeric and non-numeric (categorical) features
y_train = Train_data['price']
numerical_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
6-1 Numeric feature analysis
# Correlation analysis: include price so its correlation with each numeric feature can be read directly
price_numeric = Train_data[numerical_features + ['price']]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending=False), '\n')
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
# Check the skewness and kurtosis of each numeric feature
for col in numerical_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          ' ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
          )
power Skewness: 65.86 Kurtosis: 5733.45
kilometer Skewness: -1.53 Kurtosis: 001.14
v_0 Skewness: -1.32 Kurtosis: 003.99
v_1 Skewness: 00.36 Kurtosis: -01.75
v_2 Skewness: 04.84 Kurtosis: 023.86
v_3 Skewness: 00.11 Kurtosis: -00.42
v_4 Skewness: 00.37 Kurtosis: -00.20
v_5 Skewness: -4.74 Kurtosis: 022.93
v_6 Skewness: 00.37 Kurtosis: -01.74
v_7 Skewness: 05.13 Kurtosis: 025.85
v_8 Skewness: 00.20 Kurtosis: -00.64
v_9 Skewness: 00.42 Kurtosis: -00.32
v_10 Skewness: 00.03 Kurtosis: -00.58
v_11 Skewness: 03.03 Kurtosis: 012.57
v_12 Skewness: 00.37 Kurtosis: 000.27
v_13 Skewness: 00.27 Kurtosis: -00.44
v_14 Skewness: -1.19 Kurtosis: 002.39
# Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numerical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
# Visualize the pairwise relationships between several features
sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# newer seaborn versions renamed the size parameter to height
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()
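The contents list also calls for visualizing pairwise regression relationships, which the code above stops short of. A minimal sketch using sns.regplot; the feature columns chosen here (v_12, v_8, v_0, power) are just examples taken from the pairplot above:
# Regression plots of a few numeric features against price
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, col in zip(axes.ravel(), ['v_12', 'v_8', 'v_0', 'power']):
    sns.regplot(x=col, y='price', data=Train_data, scatter_kws={'alpha': 0.1}, ax=ax)
plt.tight_layout()
plt.show()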
6-2 Categorical feature analysis
## Check the unique value distribution
# Training set
for train_feat in categorical_features:
    print('Distribution of feature {}:'.format(train_feat))
    print('{} has {} unique values'.format(train_feat, Train_data[train_feat].nunique()))
    print(Train_data[train_feat].value_counts())
# Test set
for test_feat in categorical_features:
    print('Distribution of feature {}:'.format(test_feat))
    print('{} has {} unique values'.format(test_feat, TestA_data[test_feat].nunique()))
    print(TestA_data[test_feat].value_counts())
# Box plots: keep only the relatively dense categorical features and exclude sparse, high-cardinality ones such as name and regionCode
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
for cat_feat in categorical_features:
    Train_data[cat_feat] = Train_data[cat_feat].astype('category')
    if Train_data[cat_feat].isnull().any():
        Train_data[cat_feat] = Train_data[cat_feat].cat.add_categories(['MISSING'])
        Train_data[cat_feat] = Train_data[cat_feat].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# size was renamed to height in newer seaborn versions
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")
# Violin plots
cat_list = categorical_features
target = 'price'
for feature in cat_list:
    sns.violinplot(x=feature, y=target, data=Train_data)
    plt.show()
# Bar plots (mean price per category)
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")
# Count plots: frequency of each category
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")
7 Generate a data report
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./report.html")
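Note that the pandas_profiling package has since been renamed to ydata-profiling; with the newer package the equivalent call would look roughly like the sketch below (the title argument is optional):
from ydata_profiling import ProfileReport
pfr = ProfileReport(Train_data, title="Used Car EDA Report")
pfr.to_file("./report.html")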