Preprocessing the Titanic Survival Prediction Dataset
2022-06-01 16:33:42
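Introduction
Hexo blog: Yanbin’s blog
The dataset preprocessing in my earlier post, Titanic survival prediction, did not feel thorough enough. After reading some Kernels on Kaggle, I redid the preprocessing (for deep learning)…
Feature processing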
%matplotlib inline
import pandas as pd
import numpy as np
import re
train = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\train.csv')
test = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\test.csv')
full_data = [train, test]
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
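The non-null counts above show which columns have gaps: Age (891 - 714 = 177 missing), Cabin (891 - 204 = 687 missing) and Embarked (891 - 889 = 2 missing). A minimal sketch of reading those gaps directly is given below. It uses a small toy frame rather than the real train.csv, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, the same call on train would presumably report the three gaps listed above.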
1. Pclass
Ticket class: a proxy for socio-economic status
Pclass | Ticket class |
---|---|
1 | First class |
2 | Second class |
3 | Third class |
# One-hot encoding
# train
train['P1'] = np.array(train['Pclass'] == 1).astype(np.int32)
train['P2'] = np.array(train['Pclass'] == 2).astype(np.int32)
train['P3'] = np.array(train['Pclass'] == 3).astype(np.int32)
# test
test['P1'] = np.array(test['Pclass'] == 1).astype(np.int32)
test['P2'] = np.array(test['Pclass'] == 2).astype(np.int32)
test['P3'] = np.array(test['Pclass'] == 3).astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3
0 0 A/5 21171 7.25 NaN S 0 0 1
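The same one-hot columns can probably be produced in a single call with pd.get_dummies. The sketch below is an alternative to the explicit comparisons above, not the approach used in this post. It runs on a toy frame, and the prefix settings are chosen only to reproduce the names P1, P2 and P3.

```python
import pandas as pd

# Toy frame standing in for train; only the Pclass column matters here.
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

# get_dummies creates one indicator column per distinct value.
# prefix='P' with an empty separator yields the names P1, P2, P3.
dummies = pd.get_dummies(toy['Pclass'], prefix='P', prefix_sep='').astype('int32')
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

One caveat: get_dummies only creates columns for values that actually occur, so if a class were absent from one of the two files, that file would end up with fewer columns. The explicit comparisons used above do not have this problem.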
2. Sex
Sex: male or female
Sex | label |
---|---|
male | 1 |
female | 0 |
# Convert male/female to 1/0
train['Sex'] = [1 if i == 'male' else 0 for i in train.Sex]
test['Sex'] = [1 if i == 'male' else 0 for i in test.Sex]
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3
0 0 A/5 21171 7.25 NaN S 0 0 1
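As a side note, the label table can also be applied with Series.map. This is a sketch on toy data, assuming the column contains only the two strings from the table.

```python
import pandas as pd

# Toy column; assumed to contain only 'male' and 'female'.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# map looks each value up in the dictionary from the label table.
toy['Sex'] = toy['Sex'].map({'male': 1, 'female': 0})
print(toy)
```

Unlike the list comprehension above, which sends every non-'male' value to 0, map leaves NaN for any value missing from the dictionary, which makes unexpected entries easier to spot.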
3. SibSp and Parch
- SibSp
the number of siblings/spouses
- Parch
the number of parents/children
# 'FamilySize': number of family members, counting the passenger
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3 FamilySize
0 0 A/5 21171 7.25 NaN S 0 0 1 2
# 'IsAlone': whether the passenger is travelling alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3 FamilySize IsAlone
0 0 A/5 21171 7.25 NaN S 0 0 1 2 0
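To make the two derived features concrete, here is a small worked example on made-up rows. It computes IsAlone with a boolean comparison instead of the two-step assignment above, which should give the same result.

```python
import pandas as pd

# Made-up rows: a couple, a solo traveller, and a large family.
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Family size counts relatives aboard plus the passenger.
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# A passenger is alone exactly when the family size is 1.
toy['IsAlone'] = (toy['FamilySize'] == 1).astype('int32')
print(toy)
```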
4. Embarked
Port of embarkation. The column has missing values, so those are filled first.
C = Cherbourg, Q = Queenstown, S = Southampton
# Fill missing values
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# One-hot encoding
# train
train['E1'] = np.array(train['Embarked'] == 'S').astype(np.int32)
train['E2'] = np.array(train['Embarked'] == 'C').astype(np.int32)
train['E3'] = np.array(train['Embarked'] == 'Q').astype(np.int32)
# test
test['E1'] = np.array(test['Embarked'] == 'S').astype(np.int32)
test['E2'] = np.array(test['Embarked'] == 'C').astype(np.int32)
test['E3'] = np.array(test['Embarked'] == 'Q').astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3 FamilySize IsAlone E1 \
0 0 A/5 21171 7.25 NaN S 0 0 1 2 0 1
E2 E3
0 0 0
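The fill value 'S' is hard-coded above. One possible variation is to derive the fill value from the data by taking the most frequent port. The sketch below does this on a toy column and is an assumption about how one might generalize the step, not what this post does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing port.
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})

# mode() ignores NaN by default and returns the most frequent value(s).
most_common = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common)
print(most_common, toy['Embarked'].tolist())
```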
5. Fare
Passenger fare
# train
# Split Fare into four quantile bins labelled 1-4
train['CategoricalFare'] = pd.qcut(train['Fare'], 4, labels=[1, 2, 3, 4])
# One-hot encoding
train['F1'] = np.array(train['CategoricalFare'] == 1).astype(np.int32)
train['F2'] = np.array(train['CategoricalFare'] == 2).astype(np.int32)
train['F3'] = np.array(train['CategoricalFare'] == 3).astype(np.int32)
train['F4'] = np.array(train['CategoricalFare'] == 4).astype(np.int32)
# test
# Split Fare into four quantile bins labelled 1-4
test['CategoricalFare'] = pd.qcut(test['Fare'], 4, labels=[1, 2, 3, 4])
# One-hot encoding
test['F1'] = np.array(test['CategoricalFare'] == 1).astype(np.int32)
test['F2'] = np.array(test['CategoricalFare'] == 2).astype(np.int32)
test['F3'] = np.array(test['CategoricalFare'] == 3).astype(np.int32)
test['F4'] = np.array(test['CategoricalFare'] == 4).astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare ... FamilySize IsAlone E1 E2 E3 CategoricalFare \
0 0 A/5 21171 7.25 ... 2 0 1 0 0 1
F1 F2 F3 F4
0 1 0 0 0
[1 rows x 25 columns]
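Because pd.qcut is called separately on train and test, each file gets its own bin edges, so the same fare could land in different bins in the two files. A possible alternative, sketched below on made-up fares, is to compute the edges on the training data and reuse them for the test data. This is an assumption about what one might want, not the method used in this post.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration.
train_fare = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])
test_fare = pd.Series([7.83, 7.00, 9.69, 8.66, 12.29, 300.0])

# retbins=True also returns the quartile edges computed on the training fares.
train_bins, edges = pd.qcut(train_fare, 4, labels=[1, 2, 3, 4], retbins=True)

# Widen the outer edges so test fares outside the training range still get a bin.
edges = np.array(edges, dtype=float)
edges[0], edges[-1] = -np.inf, np.inf

# pd.cut applies the fixed training edges to the test fares.
test_bins = pd.cut(test_fare, bins=edges, labels=[1, 2, 3, 4])
print(test_bins.tolist())
```

With shared edges, a fare of 7.00 (below the smallest training fare) and a fare of 300.0 (above the largest) still fall into the first and last bins respectively.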
6. Age
Missing-value handling
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # Draw random integer ages from [mean - std, mean + std) for the missing entries
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc so the DataFrame itself is modified, not a temporary copy
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22 1
Parch Ticket Fare ... FamilySize IsAlone E1 E2 E3 CategoricalFare \
0 0 A/5 21171 7.25 ... 2 0 1 0 0 1
F1 F2 F3 F4
0 1 0 0 0
[1 rows x 25 columns]
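Since the missing ages are filled with random draws, every run produces slightly different data. If repeatable results matter, the draw can be seeded. The sketch below shows the same imputation idea on a toy column with NumPy's Generator API; the seed value 0 and the toy ages are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed (arbitrary choice) so the imputed ages are the same on every run.
rng = np.random.default_rng(0)

# Toy column with two missing ages.
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan, 35.0]})

age_avg = toy['Age'].mean()
age_std = toy['Age'].std()
missing = toy['Age'].isnull()

# Integer bounds of the interval [mean - std, mean + std).
low, high = int(age_avg - age_std), int(age_avg + age_std)

# Fill only the missing rows, then convert the whole column to int.
toy.loc[missing, 'Age'] = rng.integers(low, high, size=int(missing.sum()))
toy['Age'] = toy['Age'].astype(int)
print(toy['Age'].tolist())
```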
# train
# Split Age into five quantile bins labelled 1-5
train['CategoricalAge'] = pd.qcut(train['Age'], 5, labels=[1, 2, 3, 4, 5])
train['A1'] = np.array(train['CategoricalAge'] == 1).astype(np.int32)
train['A2'] = np.array(train['CategoricalAge'] == 2).astype(np.int32)
train['A3'] = np.array(train['CategoricalAge'] == 3).astype(np.int32)
train['A4'] = np.array(train['CategoricalAge'] == 4).astype(np.int32)
train['A5'] = np.array(train['CategoricalAge'] == 5).astype(np.int32)
# test
# Split Age into five quantile bins labelled 1-5
test['CategoricalAge'] = pd.qcut(test['Age'], 5, labels=[1, 2, 3, 4, 5])
test['A1'] = np.array(test['CategoricalAge'] == 1).astype(np.int32)
test['A2'] = np.array(test['CategoricalAge'] == 2).astype(np.int32)
test['A3'] = np.array(test['CategoricalAge'] == 3).astype(np.int32)
test['A4'] = np.array(test['CategoricalAge'] == 4).astype(np.int32)
test['A5'] = np.array(test['CategoricalAge'] == 5).astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22 1
Parch Ticket Fare ... F1 F2 F3 F4 CategoricalAge A1 A2 A3 A4 \
0 0 A/5 21171 7.25 ... 1 0 0 0 2 0 1 0 0
A5
0 0
[1 rows x 31 columns]
7. Name
Add a new feature column 'Title': the honorific contained in the name
def get_title(name):
    # A title is a run of letters preceded by a space and followed by a period
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
print(pd.crosstab(train['Title'], train['Sex']))
Sex 0 1
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
for dataset in full_data:
    # Group infrequent titles under 'Rare'
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    # Merge spelling variants into the common titles
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
# train
train['T1'] = np.array(train['Title'] == 'Master').astype(np.int32)
train['T2'] = np.array(train['Title'] == 'Miss').astype(np.int32)
train['T3'] = np.array(train['Title'] == 'Mr').astype(np.int32)
train['T4'] = np.array(train['Title'] == 'Mrs').astype(np.int32)
train['T5'] = np.array(train['Title'] == 'Rare').astype(np.int32)
# test
test['T1'] = np.array(test['Title'] == 'Master').astype(np.int32)
test['T2'] = np.array(test['Title'] == 'Miss').astype(np.int32)
test['T3'] = np.array(test['Title'] == 'Mr').astype(np.int32)
test['T4'] = np.array(test['Title'] == 'Mrs').astype(np.int32)
test['T5'] = np.array(test['Title'] == 'Rare').astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22 1
Parch Ticket Fare ... A2 A3 A4 A5 Title T1 T2 T3 T4 T5
0 0 A/5 21171 7.25 ... 1 0 0 0 Mr 0 0 1 0 0
[1 rows x 37 columns]
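To see what the title extraction and grouping do to individual names, here is a small self-contained sketch. Apart from the first name, which appears in the output above, the strings are sample names written in the dataset's "Surname, Title. Given names" format. The dictionary COMMON_TITLES is a hypothetical helper: it sends every title not listed to 'Rare', which is slightly broader than the explicit replace list used above.

```python
import re
import pandas as pd

def get_title(name):
    # A title is a run of letters preceded by a space and followed by a period.
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ''

# Hypothetical lookup: common titles and their spelling variants.
COMMON_TITLES = {
    'Mr': 'Mr', 'Mrs': 'Mrs', 'Miss': 'Miss', 'Master': 'Master',
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
}

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Sagesser, Mlle. Emma',
    'Uruchurtu, Don. Manuel E',
])

titles = names.apply(get_title)
# Titles missing from the lookup become NaN under map, then 'Rare'.
grouped = titles.map(COMMON_TITLES).fillna('Rare')
print(pd.DataFrame({'Title': titles, 'Grouped': grouped}))
```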
Data cleaning
Build the data for training the neural network: train_x and train_y_,
as well as the samples to predict: test_x
train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'P1', 'P2', 'P3',
'FamilySize', 'IsAlone', 'E1', 'E2', 'E3', 'CategoricalFare', 'F1',
'F2', 'F3', 'F4', 'CategoricalAge', 'A1', 'A2', 'A3', 'A4', 'A5',
'Title', 'T1', 'T2', 'T3', 'T4', 'T5'],
dtype='object')
train_x = train[[
'P1', 'P2', 'P3', 'Sex', 'IsAlone', 'E1', 'E2', 'E3', 'F1',
'F2', 'F3', 'F4', 'A1', 'A2', 'A3', 'A4', 'A5', 'T1', 'T2',
'T3', 'T4', 'T5'
]]
print(train_x.head(1))
P1 P2 P3 Sex IsAlone E1 E2 E3 F1 F2 ... A1 A2 A3 A4 A5 T1 \
0 0 0 1 1 0 1 0 0 1 0 ... 0 1 0 0 0 0
T2 T3 T4 T5
0 0 1 0 0
[1 rows x 22 columns]
train_y_ = train[['Survived']]
print(train_y_.head(1))
Survived
0 0
test_x = test[[
'P1', 'P2', 'P3', 'Sex', 'IsAlone', 'E1', 'E2', 'E3', 'F1',
'F2', 'F3', 'F4', 'A1', 'A2', 'A3', 'A4', 'A5', 'T1', 'T2',
'T3', 'T4', 'T5'
]]
print(test_x.head(1))
P1 P2 P3 Sex IsAlone E1 E2 E3 F1 F2 ... A1 A2 A3 A4 A5 T1 \
0 0 0 1 1 1 0 0 1 1 0 ... 0 0 0 1 0 0
T2 T3 T4 T5
0 0 1 0 0
[1 rows x 22 columns]
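Most neural-network libraries expect plain numeric arrays rather than DataFrames. The post does not say which framework is used, so the sketch below only shows one plausible last step: converting the selected columns to float32 NumPy arrays. It uses a shortened, made-up feature list and toy frames; the float32 dtype is an assumption.

```python
import numpy as np
import pandas as pd

# Shortened feature list for illustration; the real one has 22 columns.
feature_cols = ['P1', 'P2', 'P3', 'Sex']

# Toy stand-ins for the processed train and test frames.
toy_train = pd.DataFrame({'P1': [0, 1], 'P2': [0, 0], 'P3': [1, 0],
                          'Sex': [1, 0], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'P1': [0], 'P2': [1], 'P3': [0], 'Sex': [1]})

# Using the same column list for both frames keeps the feature order aligned.
train_x = toy_train[feature_cols].to_numpy(dtype=np.float32)
train_y_ = toy_train[['Survived']].to_numpy(dtype=np.float32)
test_x = toy_test[feature_cols].to_numpy(dtype=np.float32)

print(train_x.shape, train_y_.shape, test_x.shape)
```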