Preprocessing the Titanic Survival Prediction Dataset

程序员文章站 2022-06-01 16:33:42
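Introduction

Hexo blog: Yanbin's blog

The dataset preprocessing in my earlier post, Titanic survival prediction, did not feel thorough enough. After reading some Kernels on Kaggle, I redid the preprocessing (this time for deep learning).

Feature processing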

%matplotlib inline
import pandas as pd
import numpy as np
import re
train = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\train.csv')
test = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\test.csv')
full_data = [train, test]
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

1. Pclass

Ticket class: a proxy for socio-economic status

Value  Ticket class
1      First class
2      Second class
3      Third class
# One-hot encoding
# train
train['P1'] = np.array(train['Pclass'] == 1).astype(np.int32)
train['P2'] = np.array(train['Pclass'] == 2).astype(np.int32)
train['P3'] = np.array(train['Pclass'] == 3).astype(np.int32)
# test
test['P1'] = np.array(test['Pclass'] == 1).astype(np.int32)
test['P2'] = np.array(test['Pclass'] == 2).astype(np.int32)
test['P3'] = np.array(test['Pclass'] == 3).astype(np.int32)
print(train.head(1))
   PassengerId  Survived  Pclass                     Name   Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris  male  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  P1  P2  P3  
0      0  A/5 21171  7.25   NaN        S   0   0   1  
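
The same three indicator columns can also be produced in one step. The following is a minimal sketch, assuming a small toy frame (`toy`, with illustrative values) in place of the real training set; fixing the category list keeps all three columns present even when a class is absent from the data.

```python
import pandas as pd

# Toy frame standing in for the real training set (illustrative values only)
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

# Declare the full category set so P1, P2 and P3 are always created
pclass = toy['Pclass'].astype(pd.CategoricalDtype(categories=[1, 2, 3]))

# prefix='P' with an empty separator yields the column names P1, P2, P3
dummies = pd.get_dummies(pclass, prefix='P', prefix_sep='').astype('int32')

toy = toy.join(dummies)
print(toy)
```

Applying the same categorical dtype to the test set keeps the column layout of both sets identical.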

2. Sex

Sex: male or female

Sex label
male 1
female 0
# Convert male/female to 1/0
train['Sex'] = [1 if i == 'male' else 0 for i in train.Sex]
test['Sex'] = [1 if i == 'male' else 0 for i in test.Sex]
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  P1  P2  P3  
0      0  A/5 21171  7.25   NaN        S   0   0   1  
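
Note that the list comprehension maps every value other than 'male' to 0, including missing or misspelled entries. As a hedged alternative, a dictionary lookup with `Series.map` leaves unexpected values as NaN so they can be spotted; the toy data below is illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy column with one missing entry (illustrative values only)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', None]})

# Values that are not keys of the dictionary become NaN instead of 0
sex_codes = toy['Sex'].map({'male': 1, 'female': 0})

# Count the entries that could not be mapped
unmapped = int(sex_codes.isna().sum())
print(unmapped)
```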

3. SibSp and Parch

  • SibSp

the number of siblings/spouses aboard

  • Parch

the number of parents/children aboard

# 'FamilySize': number of family members aboard, including the passenger
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  P1  P2  P3  FamilySize  
0      0  A/5 21171  7.25   NaN        S   0   0   1           2  
# 'IsAlone': whether the passenger is travelling alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  P1  P2  P3  FamilySize  IsAlone  
0      0  A/5 21171  7.25   NaN        S   0   0   1           2        0  
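
The two derived features can be checked on a few hand-made rows. This is a small sketch in which `add_family_features` is a hypothetical helper name and the toy values are illustrative; it computes `IsAlone` directly from a boolean comparison.

```python
import pandas as pd

def add_family_features(df):
    """Return a copy of df with FamilySize and IsAlone added."""
    out = df.copy()
    # The passenger counts as one family member, hence the +1
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    # A family of size 1 means the passenger travels alone
    out['IsAlone'] = (out['FamilySize'] == 1).astype('int32')
    return out

toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})
toy_features = add_family_features(toy)
print(toy_features)
```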

4. Embarked

Port of embarkation. The column has missing values, so they are filled in first.

C = Cherbourg, Q = Queenstown, S = Southampton

# Fill missing values with 'S' (Southampton)
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# One-hot encoding
# train
train['E1'] = np.array(train['Embarked'] == 'S').astype(np.int32)
train['E2'] = np.array(train['Embarked'] == 'C').astype(np.int32)
train['E3'] = np.array(train['Embarked'] == 'Q').astype(np.int32)
# test
test['E1'] = np.array(test['Embarked'] == 'S').astype(np.int32)
test['E2'] = np.array(test['Embarked'] == 'C').astype(np.int32)
test['E3'] = np.array(test['Embarked'] == 'Q').astype(np.int32)
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  P1  P2  P3  FamilySize  IsAlone  E1  \
0      0  A/5 21171  7.25   NaN        S   0   0   1           2        0   1   

   E2  E3  
0   0   0  
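
The fill value 'S' is hard-coded above. One way to derive it from the data instead is to take the most frequent port in the training set and reuse it for both sets. The sketch below assumes toy frames with illustrative values.

```python
import pandas as pd

# Toy frames standing in for train and test (illustrative values only)
toy_train = pd.DataFrame({'Embarked': ['S', 'C', None, 'S', 'Q']})
toy_test = pd.DataFrame({'Embarked': ['Q', None]})

# mode() ignores missing values; take the first entry in case of ties
most_common = toy_train['Embarked'].mode()[0]

# Use the training-set value for both frames so they stay consistent
for df in (toy_train, toy_test):
    df['Embarked'] = df['Embarked'].fillna(most_common)

print(most_common)
```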

5. Fare

Passenger fare

# train: split Fare into quartile bins labelled 1-4
# (labels= replaces assigning to .cat.categories, which recent pandas no longer allows)
train['CategoricalFare'] = pd.qcut(train['Fare'], 4, labels=[1, 2, 3, 4])
# One-hot encoding
train['F1'] = np.array(train['CategoricalFare'] == 1).astype(np.int32)
train['F2'] = np.array(train['CategoricalFare'] == 2).astype(np.int32)
train['F3'] = np.array(train['CategoricalFare'] == 3).astype(np.int32)
train['F4'] = np.array(train['CategoricalFare'] == 4).astype(np.int32)

# test: the quartiles are computed on the test set separately
test['CategoricalFare'] = pd.qcut(test['Fare'], 4, labels=[1, 2, 3, 4])
# One-hot encoding
test['F1'] = np.array(test['CategoricalFare'] == 1).astype(np.int32)
test['F2'] = np.array(test['CategoricalFare'] == 2).astype(np.int32)
test['F3'] = np.array(test['CategoricalFare'] == 3).astype(np.int32)
test['F4'] = np.array(test['CategoricalFare'] == 4).astype(np.int32)

print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1  22.0      1   

   Parch     Ticket  Fare ... FamilySize IsAlone  E1  E2  E3  CategoricalFare  \
0      0  A/5 21171  7.25 ...          2       0   1   0   0                1   

   F1  F2  F3  F4  
0   1   0   0   0  

[1 rows x 25 columns]
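
Because train and test are binned with separately computed quartiles, the same label can cover different fare ranges in the two sets. A possible alternative, sketched here on toy data with illustrative values, is to learn the bin edges on the training set with `retbins=True` and apply them to the test set with `pd.cut`. Opening the outer edges to infinity is an assumption made so that test fares outside the training range still fall into a bin.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train and test (illustrative values only)
toy_train = pd.DataFrame({'Fare': [7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 7.9, 512.33]})
toy_test = pd.DataFrame({'Fare': [7.0, 14.5, 600.0, np.nan]})

labels = [1, 2, 3, 4]

# Learn the quartile edges on the training fares only
train_bins, edges = pd.qcut(toy_train['Fare'], 4, labels=labels, retbins=True)

# Keep the inner edges, but open the outer ones to +/- infinity
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the training edges to the test fares; missing fares stay missing
test_bins = pd.cut(toy_test['Fare'], bins=open_edges, labels=labels)

print(test_bins.tolist())
```

A missing fare stays NaN after binning, so all four indicator columns for that row would be 0 unless the value is imputed beforehand.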

6. Age

Missing-value handling

for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_mask = dataset['Age'].isnull()
    # Draw random integer ages from [mean - std, mean + std) for the missing entries
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_mask.sum())
    # Assign through .loc rather than chained indexing, which triggers SettingWithCopyWarning
    dataset.loc[age_null_mask, 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex  Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1   22      1   

   Parch     Ticket  Fare ... FamilySize IsAlone  E1  E2  E3  CategoricalFare  \
0      0  A/5 21171  7.25 ...          2       0   1   0   0                1   

   F1  F2  F3  F4  
0   1   0   0   0  

[1 rows x 25 columns]

# train: split Age into five quantile bins labelled 1-5
train['CategoricalAge'] = pd.qcut(train['Age'], 5, labels=[1, 2, 3, 4, 5])
# One-hot encoding
train['A1'] = np.array(train['CategoricalAge'] == 1).astype(np.int32)
train['A2'] = np.array(train['CategoricalAge'] == 2).astype(np.int32)
train['A3'] = np.array(train['CategoricalAge'] == 3).astype(np.int32)
train['A4'] = np.array(train['CategoricalAge'] == 4).astype(np.int32)
train['A5'] = np.array(train['CategoricalAge'] == 5).astype(np.int32)
# test: the quantiles are computed on the test set separately
test['CategoricalAge'] = pd.qcut(test['Age'], 5, labels=[1, 2, 3, 4, 5])
# One-hot encoding
test['A1'] = np.array(test['CategoricalAge'] == 1).astype(np.int32)
test['A2'] = np.array(test['CategoricalAge'] == 2).astype(np.int32)
test['A3'] = np.array(test['CategoricalAge'] == 3).astype(np.int32)
test['A4'] = np.array(test['CategoricalAge'] == 4).astype(np.int32)
test['A5'] = np.array(test['CategoricalAge'] == 5).astype(np.int32)
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex  Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1   22      1   

   Parch     Ticket  Fare ... F1 F2  F3  F4  CategoricalAge  A1  A2  A3  A4  \
0      0  A/5 21171  7.25 ...  1  0   0   0               2   0   1   0   0   

   A5  
0   0  

[1 rows x 31 columns]
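
Since the missing ages are filled with random draws, the resulting Age bins can differ from run to run. The following is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` with a fixed seed; `impute_age` is a hypothetical helper name and the toy ages are illustrative.

```python
import numpy as np
import pandas as pd

def impute_age(ages, rng):
    """Fill missing ages with integers drawn from [mean - std, mean + std)."""
    mean, std = ages.mean(), ages.std()
    missing = ages.isna()
    # Generator.integers excludes the upper bound, like np.random.randint
    draws = rng.integers(int(mean - std), int(mean + std), size=int(missing.sum()))
    filled = ages.copy()
    filled.loc[missing] = draws
    return filled.astype(int)

# Toy ages with two gaps (illustrative values only)
toy_ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 54.0])

# The same seed yields the same imputed values on every run
first = impute_age(toy_ages, np.random.default_rng(0))
second = impute_age(toy_ages, np.random.default_rng(0))
print(first.equals(second))
```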

7. Name

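Add a new feature column 'Title': the honorific extracted from the name

def get_title(name):
    # Raw string so that '\.' is a regex escape rather than an invalid string escape
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""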

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

print(pd.crosstab(train['Title'], train['Sex']))
Sex         0    1
Title             
Capt        0    1
Col         0    2
Countess    1    0
Don         0    1
Dr          1    6
Jonkheer    0    1
Lady        1    0
Major       0    2
Master      0   40
Miss      182    0
Mlle        2    0
Mme         1    0
Mr          0  517
Mrs       125    0
Ms          1    0
Rev         0    6
Sir         0    1
for dataset in full_data:
    # Group the infrequent titles under a single 'Rare' label
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev',
         'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # Merge the French and alternative forms into Miss/Mrs
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
# train
train['T1'] = np.array(train['Title'] == 'Master').astype(np.int32)
train['T2'] = np.array(train['Title'] == 'Miss').astype(np.int32)
train['T3'] = np.array(train['Title'] == 'Mr').astype(np.int32)
train['T4'] = np.array(train['Title'] == 'Mrs').astype(np.int32)
train['T5'] = np.array(train['Title'] == 'Rare').astype(np.int32)
# test
test['T1'] = np.array(test['Title'] == 'Master').astype(np.int32)
test['T2'] = np.array(test['Title'] == 'Miss').astype(np.int32)
test['T3'] = np.array(test['Title'] == 'Mr').astype(np.int32)
test['T4'] = np.array(test['Title'] == 'Mrs').astype(np.int32)
test['T5'] = np.array(test['Title'] == 'Rare').astype(np.int32)
print(train.head(1))
   PassengerId  Survived  Pclass                     Name  Sex  Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris    1   22      1   

   Parch     Ticket  Fare ... A2 A3  A4  A5  Title  T1  T2  T3  T4  T5  
0      0  A/5 21171  7.25 ...  1  0   0   0     Mr   0   0   1   0   0  

[1 rows x 37 columns]
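
The explicit list of rare titles only covers titles that were noticed in advance; any other title would get zeros in all five T columns. A hedged variant is sketched below: it keeps the four common titles and sends everything else, including an empty string, to 'Rare'. The names `TITLE_ALIASES`, `COMMON_TITLES` and `normalize_title` are hypothetical, and the fallback behaviour is a deliberate change from the code above.

```python
# Spelling variants that are merged into a common title
TITLE_ALIASES = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}

# Titles that keep their own one-hot column
COMMON_TITLES = {'Master', 'Miss', 'Mr', 'Mrs'}

def normalize_title(title):
    """Map a raw title to Master, Miss, Mr, Mrs or Rare."""
    title = TITLE_ALIASES.get(title, title)
    # Anything not in the common set, including '', falls back to 'Rare'
    return title if title in COMMON_TITLES else 'Rare'

raw_titles = ['Mr', 'Mlle', 'Dr', 'Dona', '']
normalized = [normalize_title(t) for t in raw_titles]
print(normalized)
```

With pandas this could be applied as `dataset['Title'].apply(normalize_title)`.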

Data cleaning

Obtain the data for training the neural network: train_x and train_y_

and the samples to predict: test_x

train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'P1', 'P2', 'P3',
       'FamilySize', 'IsAlone', 'E1', 'E2', 'E3', 'CategoricalFare', 'F1',
       'F2', 'F3', 'F4', 'CategoricalAge', 'A1', 'A2', 'A3', 'A4', 'A5',
       'Title', 'T1', 'T2', 'T3', 'T4', 'T5'],
      dtype='object')
train_x = train[[
    'P1', 'P2', 'P3', 'Sex', 'IsAlone', 'E1', 'E2', 'E3', 'F1',
    'F2', 'F3', 'F4', 'A1', 'A2', 'A3', 'A4', 'A5', 'T1', 'T2',
    'T3', 'T4', 'T5'
]]
print(train_x.head(1))
   P1  P2  P3  Sex  IsAlone  E1  E2  E3  F1  F2 ...  A1  A2  A3  A4  A5  T1  \
0   0   0   1    1        0   1   0   0   1   0 ...   0   1   0   0   0   0   

   T2  T3  T4  T5  
0   0   1   0   0  

[1 rows x 22 columns]
train_y_ = train[['Survived']]
print(train_y_.head(1))
   Survived
0         0
test_x = test[[
    'P1', 'P2', 'P3', 'Sex', 'IsAlone', 'E1', 'E2', 'E3', 'F1',
    'F2', 'F3', 'F4', 'A1', 'A2', 'A3', 'A4', 'A5', 'T1', 'T2',
    'T3', 'T4', 'T5'
]]
print(test_x.head(1))
   P1  P2  P3  Sex  IsAlone  E1  E2  E3  F1  F2 ...  A1  A2  A3  A4  A5  T1  \
0   0   0   1    1        1   0   0   1   1   0 ...   0   0   0   1   0   0   

   T2  T3  T4  T5  
0   0   1   0   0  

[1 rows x 22 columns]
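
Before feeding these frames to a neural network, it can help to confirm that the test set has every feature column and to convert everything to float arrays. This is a minimal sketch on toy frames with a shortened, illustrative column list; the `float32` dtype is an assumption about what the downstream framework expects.

```python
import numpy as np
import pandas as pd

# Shortened feature list for illustration; the real list has 22 columns
feature_cols = ['P1', 'P2', 'P3', 'Sex']

# Toy frames standing in for the processed train and test sets
toy_train = pd.DataFrame({'P1': [0, 1], 'P2': [0, 0], 'P3': [1, 0],
                          'Sex': [1, 0], 'Survived': [0, 1]})
toy_test = pd.DataFrame({'P1': [0], 'P2': [1], 'P3': [0], 'Sex': [1]})

# Fail early if a feature column is absent from the test set
missing = [c for c in feature_cols if c not in toy_test.columns]
if missing:
    raise ValueError(f"test set lacks columns: {missing}")

# Convert to float32 arrays; the label keeps a column shape of (n, 1)
x_train = toy_train[feature_cols].to_numpy(dtype=np.float32)
y_train = toy_train[['Survived']].to_numpy(dtype=np.float32)
x_test = toy_test[feature_cols].to_numpy(dtype=np.float32)

print(x_train.shape, y_train.shape, x_test.shape)
```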