Kaggle - Titanic Survival Prediction
This is my first Kaggle competition, with Titanic as the entry point. The goal is to predict each passenger's survival from the passenger information provided. Everything below is done in Python 3.

I. Data Overview

From the Kaggle page we know the training set has 891 records and the test set has 418. The variables provided are:

Variable   Definition           Key
survival   Survival             0 = No, 1 = Yes
pclass     Ticket class; a proxy for socio-economic status (1st = Upper, 2nd = Middle, 3rd = Lower)   1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
age        Age in years; fractional if less than 1, and estimated ages are of the form xx.5
sibsp      # of siblings / spouses aboard the Titanic; sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored)
parch      # of parents / children aboard the Titanic; parent = mother, father; child = daughter, son, stepdaughter, stepson (some children travelled only with a nanny, so parch = 0 for them)
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation  C = Cherbourg, Q = Queenstown, S = Southampton

First, look at the basic information of the training and test sets to get an overview of the data size, each feature's dtype, and where values are missing:

import os
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pydotplus

from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
#sklearn.cross_validation was removed in 0.20; GridSearchCV and cross_val_score live in model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score

#read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
#DataFrame.append is deprecated; use pd.concat to stack the two sets
train_test_combined = pd.concat([train, test], ignore_index=True, sort=False)

#inspect the basic info
print (train.info())
print (test.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

So: in the training set, Age, Cabin and Embarked have missing values; in the test set, Age, Cabin and Fare do.
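
As a quick cross-check, the per-column missing counts can also be computed directly; a minimal sketch:

#count missing values per column (train and test were loaded above)
print (train.isnull().sum())
print (test.isnull().sum())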

Next, look at the actual records:

#head() prints the first 5 rows by default
print (train.head())

I ran this in Sublime Text, and with this many columns the output wraps onto multiple lines and is hard to read, so I looked at the data directly on Kaggle instead. Below is a screenshot of the data from Kaggle.

[Figure: screenshot of the first rows of the training data on Kaggle]
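
Alternatively, pandas can be told to print all columns without wrapping; a small sketch (purely a display tweak, not part of the pipeline):

#widen pandas' console output so head() is readable
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
print (train.head())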

II. Preliminary Data Analysis

1. Basic passenger attributes

For the categorical variables Survived, Sex, Pclass and Embarked, pie charts show their composition. For the discrete numeric variables SibSp and Parch, bar charts show their distributions. For the continuous numeric variables Age and Fare, histograms show their distributions.

# Pie charts for the categorical variables
# labeldistance: how far the labels sit from the center; 1.1 means 1.1x the radius
# autopct: format of the text inside the wedges; "%5.2f%%" prints a percentage with two decimal places
# shadow: whether the pie casts a shadow
# startangle: starting angle; wedges are drawn counterclockwise from it, and starting at 90 degrees usually looks best
# pctdistance: how far the percentage text sits from the center

plt.subplot(2,2,1)
survived_counts = train['Survived'].value_counts()
survived_labels = ['Died','Survived']
plt.pie(x=survived_counts,labels=survived_labels,autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
#draw the pie as a true circle
plt.axis('equal')
#plt.show()

plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts,labels=gender_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')

plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts,labels=pclass_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')

plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts,labels=embarked_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')

plt.show()

plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()),list(sibsp_counts.values()))
plt.title('SibSp')

plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()),list(parch_counts.values()))
plt.title('Parch')

plt.style.use( 'ggplot')
plt.subplot(2,2,3)
plt.hist(train.Age,bins=np.arange(0,100,5),range=(0,100),color = 'steelblue', edgecolor = 'k')
plt.title('Age')

plt.subplot(2,2,4)
plt.hist(train.Fare,bins=20,color = 'steelblue', edgecolor = 'k')
plt.title('Fare')

plt.show()

[Figure: pie charts of Survived, Gender, Pclass and Embarked]

[Figure: bar charts of SibSp and Parch; histograms of Age and Fare]

2. Relationship between each factor and survival

(1) Sex:

Compute the survival rate by sex:

print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())

Output:

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109

Sex
female    0.742038
male      0.188908

So: women survived at 74.20% versus only 18.89% for men. Since women's survival rate is far higher, sex is an important factor.

(2) Age:

Compute the survival rate at each age:

fig, axis1 = plt.subplots(1,1,figsize=(18,4))
#copy to avoid a SettingWithCopyWarning when adding new columns
train_age = train.dropna(subset=['Age']).copy()
train_age["Age_int"] = train_age["Age"].astype(int)
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar')
plt.show()

Output:

[Figure: survival rate by integer age]

So: young children have a relatively high survival rate, while several of the elderly age groups have a survival rate of 0. Let's also look at the exact counts of survivors and non-survivors at each age:

print (train_age.groupby('Age_int')['Survived'].value_counts())

Output:

Age_int  Survived
0        1            7
1        1            5
         0            2
2        0            7
         1            3
3        1            5
         0            1
4        1            7
         0            3
5        1            4
6        1            2
         0            1
7        0            2
         1            1
8        0            2
         1            2
9        0            6
         1            2
10       0            2
11       0            3
         1            1
12       1            1
13       1            2
14       0            4
         1            3
15       1            4
         0            1
16       0           11
         1            6
17       0            7
         1            6
18       0           17
         1            9
19       0           16
         1            9
20       0           13
         1            3
21       0           19
         1            5
22       0           16
         1           11
23       0           11
         1            5
24       0           16
         1           15
25       0           17
         1            6
26       0           12
         1            6
27       1           11
         0            7
28       0           20
         1            7
29       0           12
         1            8
30       0           17
         1           10
31       0            9
         1            8
32       0           10
         1           10
33       0            9
         1            6
34       0           10
         1            6
35       1           11
         0            7
36       0           12
         1           11
37       0            5
         1            1
38       0            6
         1            5
39       0            9
         1            5
40       0            9
         1            6
41       0            4
         1            2
42       0            7
         1            6
43       0            4
         1            1
44       0            6
         1            3
45       0            9
         1            5
46       0            3
47       0            8
         1            1
48       1            6
         0            3
49       1            4
         0            2
50       0            5
         1            5
51       0            5
         1            2
52       0            3
         1            3
53       1            1
54       0            5
         1            3
55       0            2
         1            1
56       0            2
         1            2
57       0            2
58       1            3
         0            2
59       0            2
60       0            2
         1            2
61       0            3
62       0            2
         1            2
63       1            2
64       0            2
65       0            3
66       0            1
70       0            3
71       0            2
74       0            1
80       1            1

Next, consider splitting age into a few bands and computing the survival rate of each. Children under 1 year old all survived (100%), so they can form their own group; the rest are split into 1-15, 15-55 and >55:

train_age['Age_derived'] = pd.cut(train_age['Age'], bins=[0,0.99,14.99,54.99,100])
print (train_age.groupby('Age_derived')['Survived'].value_counts())
print (train_age.groupby('Age_derived')['Survived'].mean())

Output:

Age_derived     Survived
(0.0, 0.99]     1             7
(0.99, 14.99]   1            38
                0            33
(14.99, 54.99]  0           362
                1           232
(54.99, 100.0]  0            29
                1            13

Age_derived
(0.0, 0.99]       1.000000
(0.99, 14.99]     0.535211
(14.99, 54.99]    0.390572
(54.99, 100.0]    0.309524

So: children survive at a higher rate than adults and the elderly.

(3) Ticket class:

Compute the survival rate by ticket class:

print (train.groupby('Pclass')['Survived'].value_counts())
print (train.groupby('Pclass')['Survived'].mean())

Output:

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119

Pclass
1    0.629630
2    0.472826
3    0.242363

So: first class survived at 62.96%, second class at 47.28%, and third class at 24.24%. Ticket class is therefore another important factor.

(4) Port of embarkation:

Compute the survival rate by port of embarkation:

print (train.groupby('Embarked')['Survived'].value_counts())
print (train.groupby('Embarked')['Survived'].mean())

Output:

Embarked  Survived
C         1            93
          0            75
Q         0            47
          1            30
S         0           427
          1           217

Embarked
C    0.553571
Q    0.389610
S    0.336957

So: port C's survival rate is 55.36%, port Q's is 38.96%, and port S's is 33.70%. Since C is noticeably higher, the port of embarkation may also influence survival.

(5) Siblings/spouses aboard (SibSp) and parents/children aboard (Parch)

Compute the survival rate for each value of SibSp and Parch:

print (train.groupby('SibSp')['Survived'].value_counts())
print (train.groupby('SibSp')['Survived'].mean())

print (train.groupby('Parch')['Survived'].value_counts())
print (train.groupby('Parch')['Survived'].mean())

Output:

SibSp  Survived
0      0           398
       1           210
1      1           112
       0            97
2      0            15
       1            13
3      0            12
       1             4
4      0            15
       1             3
5      0             5
8      0             7

SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000

Parch  Survived
0      0           445
       1           233
1      1            65
       0            53
2      0            40
       1            40
3      1             3
       0             2
4      0             4
5      0             4
       1             1
6      0             1

Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000

So: travelling alone lowers the survival rate, but having too many relatives aboard lowers it as well.

(6) Cabin:

Cabin is missing far too often to impute. For now, split passengers into Cabin-missing and Cabin-present and compute each group's survival rate:

train.loc[train['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train.loc[train['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
print (train.groupby('Cabin_derived')['Survived'].value_counts())
print (train.groupby('Cabin_derived')['Survived'].mean())

Output:

Cabin_derived  Survived
Missing        0           481
               1           206
Not Missing    1           136
               0            68

Cabin_derived
Missing        0.299854
Not Missing    0.666667

So: passengers with a missing Cabin survived at 29.99% versus 66.67% for those with a recorded Cabin, so whether Cabin is missing may be related to survival.
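
This association can also be checked more formally with a chi-square test of independence; a minimal sketch (scipy is an extra dependency beyond the imports above):

from scipy.stats import chi2_contingency

#contingency table of Cabin-missingness vs. survival
table = pd.crosstab(train['Cabin'].isnull(), train['Survived'])
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print (p_value)  #a very small p-value suggests missingness and survival are not independent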

(7) Fare

First, check whether fares differ between survivors and non-survivors:

print (train['Fare'][train['Survived'] == 0].describe())
print (train['Fare'][train['Survived'] == 1].describe())

Output:

count    549.000000
mean      22.117887
std       31.388207
min        0.000000
25%        7.854200
50%       10.500000
75%       26.000000
max      263.000000

count    342.000000
mean      48.395408
std       66.596998
min        0.000000
25%       12.475000
50%       26.000000
75%       57.000000
max      512.329200

The median fare of survivors is 26 versus 10.5 for non-survivors, a fairly clear difference.
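
The same comparison can be written more compactly with a single groupby; a small sketch:

#fare statistics of non-survivors (0) and survivors (1) side by side
print (train.groupby('Survived')['Fare'].describe())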

(8) Name:

At first glance every name is unique, so the feature seems worthless. That intuition is badly wrong: in Titanic, Name is very informative and important signals can be extracted from it. First, look at what Name actually contains:

print (train.Name)

Output:

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Williams, Mr. Charles Eugene
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
19                               Masselmani, Mrs. Fatima
...

So: each Name embeds a title (Mr., Mrs., Miss., Master., and so on). Let's extract it as a standalone feature, Title:

train['Title'] = train['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
print (train['Title'].value_counts())

Output:

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Major             2
Mlle              2
Mme               1
Ms                1
Don               1
Sir               1
Jonkheer          1
Capt              1
Lady              1
the Countess      1
Name: Title, dtype: int64
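
To make the extraction rule explicit: the pattern ", (.*?)\." captures whatever sits between the comma after the surname and the next period. A tiny standalone example:

import re
name = "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
print (re.compile(r", (.*?)\.").findall(name))  #['Mrs']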

The title Master is fairly common; let's see who it refers to:

print (train[train['Title'] == 'Master'][['Survived','Title','Sex','Parch','SibSp','Fare','Age','Embarked']])

Output:

     Survived   Title   Sex  Parch  SibSp      Fare    Age Embarked
7           0  Master  male      1      3   21.0750   2.00        S
16          0  Master  male      1      4   29.1250   2.00        Q
50          0  Master  male      1      4   39.6875   7.00        S
59          0  Master  male      2      5   46.9000  11.00        S
63          0  Master  male      2      3   27.9000   4.00        S
65          1  Master  male      1      1   15.2458    NaN        C
78          1  Master  male      2      0   29.0000   0.83        S
125         1  Master  male      0      1   11.2417  12.00        C
159         0  Master  male      2      8   69.5500    NaN        S
164         0  Master  male      1      4   39.6875   1.00        S
165         1  Master  male      2      0   20.5250   9.00        S
171         0  Master  male      1      4   29.1250   4.00        Q
176         0  Master  male      1      3   25.4667    NaN        S
...

As the table shows, Master denotes young boys.
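
A quick look at the ages behind this title supports that reading; a small sketch:

#age distribution of passengers titled 'Master' in the training set
print (train[train['Title'] == 'Master']['Age'].describe())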

There are quite a few distinct titles, so merge them into a few groups, then check whether survival differs across titles:

train['Title'] = train['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don', 
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace(['Mlle', 'Ms'],'Miss')
train['Title'] = train['Title'].replace('Mme','Mrs')

print (train.groupby('Title')['Survived'].value_counts())
print (train.groupby('Title')['Survived'].mean())

Output:

Title   Survived
Master  1            23
        0            17
Miss    1           130
        0            55
Mr      0           436
        1            81
Mrs     1           100
        0            26
Rare    0            15
        1             8

Title
Master    0.575000
Miss      0.702703
Mr        0.156673
Mrs       0.793651
Rare      0.347826

So: Title is a factor that influences survival.

III. Data Preprocessing

This stage covers missing-value imputation, discretization of the continuous numeric variables, and dummy (one-hot) encoding of the categorical variables. Preprocessing is performed on the combined training and test sets.

1. Missing-value imputation

From the analysis above, Age, Cabin and Embarked are missing in the training set, and Age, Cabin and Fare in the test set. Cabin's missing rate (>70%) is too high to impute sensibly, so we leave it as is.

(1) Embarked:

Embarked is a categorical variable (port of embarkation: C, Q or S). Fill missing values with the most frequent value:

train_test_combined['Embarked'].fillna(train_test_combined['Embarked'].mode().iloc[0], inplace=True)

(2) Fare:

Fare is numeric; fill its missing value with the mean fare of the corresponding Pclass:

train_test_combined['Fare'] = train_test_combined['Fare'].fillna(train_test_combined.groupby('Pclass')['Fare'].transform('mean'))

(3) Age:

Age is numeric; fill missing values with the mean age of passengers sharing the same Title (Mr, Mrs, Miss, Master, ...):

train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
train_test_combined['Age'] = train_test_combined['Age'].fillna(train_test_combined.groupby('Title')['Age'].transform('mean'))

After imputation, check the data again:

print (train_test_combined.info())

Output:

Data columns (total 13 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
Title          1309 non-null object
dtypes: float64(3), int64(4), object(6)

2. Discretizing the continuous numeric variables

(1) Age:

Based on the earlier analysis of age and survival, split age into four bands: <1, 1-<15, 15-<55, >=55.

train_test_combined['Age_derived'] = pd.cut(train_test_combined['Age'], bins=[0,0.99,14.99,54.99,100],labels=['baby','child','adult','older'])
age_dummy = pd.get_dummies(train_test_combined['Age_derived']).rename(columns=lambda x: 'Age_' + str(x))
train_test_combined = pd.concat([train_test_combined,age_dummy],axis=1)

(2) Fare:

Looking at Ticket shows that several passengers can share a ticket number, i.e. there are group tickets, so each group fare needs to be split evenly across its holders.

print (train_test_combined.Ticket.value_counts())

Output:

CA. 2343              11
CA 2144                8
1601                   8
347082                 7
3101295                7
PC 17608               7
S.O.C. 14879           7
347077                 7
19950                  6
347088                 6
113781                 6
382652                 6
...

Split the group fares evenly:

train_test_combined['Group_ticket'] = train_test_combined['Fare'].groupby(by=train_test_combined['Ticket']).transform('count')
train_test_combined['Fare'] = train_test_combined['Fare']/train_test_combined['Group_ticket']
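
As a sanity check, a shared ticket should now carry the per-person fare, i.e. the original fare divided by the group size; a small sketch using the largest group above:

#ticket 'CA. 2343' is held by 11 passengers, so Fare should be the original fare / 11
print (train_test_combined.loc[train_test_combined['Ticket'] == 'CA. 2343', ['Ticket', 'Fare', 'Group_ticket']])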

Check the mean, median and other statistics of the adjusted Fare:

print (train_test_combined['Fare'].describe())

Output:

count    1309.000000
mean       14.756516
std        13.550515
min         0.000000
25%         7.550000
50%         8.050000
75%        15.000000
max       128.082300
Name: Fare, dtype: float64

Split Fare into three bands at P25 and P75: Low_fare: <=7.55, Median_fare: 7.55-15.00, High_fare: >15.00.

train_test_combined['Fare_derived'] = pd.cut(train_test_combined['Fare'], bins=[-1,7.55,15.00,130], labels=['Low_fare','Median_fare','High_fare'])
fare_dummy = pd.get_dummies(train_test_combined['Fare_derived']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,fare_dummy],axis=1)

3. Family Size

SibSp and Parch both count relatives aboard, so add them into a new variable, Family_size:

train_test_combined['Family_size'] = train_test_combined['Parch'] + train_test_combined['SibSp']

print (train_test_combined.groupby('Family_size')['Survived'].value_counts())
print (train_test_combined.groupby('Family_size')['Survived'].mean())

Output:

Family_size  Survived
0            0           374
             1           163
1            1            89
             0            72
2            1            59
             0            43
3            1            21
             0             8
4            0            12
             1             3
5            0            19
             1             3
6            0             8
             1             4
7            0             6
10           0             7

Family_size
0     0.303538
1     0.552795
2     0.578431
3     0.724138
4     0.200000
5     0.136364
6     0.333333
7     0.000000
10    0.000000

Again, being alone or having a very large family both lower the survival rate. Split Family_size into three categories: Single, Small family and Large family.

def family_size_category(Family_size):
	if Family_size == 0:
		return 'Single'
	elif Family_size <=3:
		return 'Small family'
	else:
		return 'Large family'

train_test_combined['Family_size_category'] = train_test_combined['Family_size'].map(family_size_category)
family_dummy = pd.get_dummies(train_test_combined['Family_size_category']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,family_dummy],axis=1)

4. Title

Extract the Title feature from Name and merge the rare titles, as before:

train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])

train_test_combined['Title'] = train_test_combined['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don', 
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train_test_combined['Title'] = train_test_combined['Title'].replace(['Mlle', 'Ms'],'Miss')
train_test_combined['Title'] = train_test_combined['Title'].replace('Mme','Mrs')

title_dummy = pd.get_dummies(train_test_combined['Title']).rename(columns=lambda x: 'Title_' + str(x))
train_test_combined = pd.concat([train_test_combined,title_dummy],axis=1)

5. Cabin

Derive a new variable from whether Cabin is missing:

train_test_combined.loc[train_test_combined['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train_test_combined.loc[train_test_combined['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
cabin_dummy = pd.get_dummies(train_test_combined['Cabin_derived']).rename(columns=lambda x: 'Cabin_' + str(x))
train_test_combined = pd.concat([train_test_combined,cabin_dummy],axis=1)

6. Pclass, Sex, Embarked

These three variables only need dummy encoding; nothing else.

#dummy-encode Pclass
pclass_dummy = pd.get_dummies(train_test_combined['Pclass']).rename(columns=lambda x: 'Pclass_' + str(x))
train_test_combined = pd.concat([train_test_combined,pclass_dummy],axis=1)

#dummy-encode Sex
sex_dummy = pd.get_dummies(train_test_combined['Sex']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,sex_dummy],axis=1)

#dummy-encode Embarked
embarked_dummy = pd.get_dummies(train_test_combined['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))
train_test_combined = pd.concat([train_test_combined,embarked_dummy],axis=1)

Finally, split the combined data back into training and test sets and keep only the useful features.

train = train_test_combined[:891].copy()
test = train_test_combined[891:].copy()

selected_features = ['Embarked_C','female', 'male',
       'Embarked_Q', 'Embarked_S', 'Age_baby', 'Age_child',
       'Age_adult', 'Age_older', 'Low_fare',
       'Median_fare', 'High_fare',
       'Large family', 'Single', 'Small family', 'Title_Master', 'Title_Miss',
       'Title_Mr', 'Title_Mrs', 'Title_Rare', 
       'Cabin_Missing', 'Cabin_Not Missing', 'Pclass_1',
       'Pclass_2', 'Pclass_3']

x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']

That completes the data preprocessing; we can now build models and make predictions.

IV. Modeling

1. Logistic regression

Use GridSearchCV to find the optimal regularization hyperparameter C:

lr = LogisticRegression(random_state=33)
param_lr = {'C':np.logspace(-4,4,9)}
grid_lr = GridSearchCV(estimator = lr, param_grid = param_lr, cv = 5)
grid_lr.fit(x_train,y_train)
#grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same information
print (pd.DataFrame(grid_lr.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_lr.best_params_, '\n', 'Best score: ', grid_lr.best_score_)

Output:

[mean: 0.64646, std: 0.00833, params: {'C': 0.0001}, 
mean: 0.70595, std: 0.01292, params: {'C': 0.001}, 
mean: 0.80471, std: 0.02215, params: {'C': 0.01}, 
mean: 0.82043, std: 0.00361, params: {'C': 0.10000000000000001}, 
mean: 0.82492, std: 0.02629, params: {'C': 1.0}, 
mean: 0.82379, std: 0.02747, params: {'C': 10.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 100.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 1000.0}, 
mean: 0.82492, std: 0.02813, params: {'C': 10000.0}] 
 Best param:  {'C': 1.0} 
 Best score:  0.8249158249158249

Print each feature's coefficient:

print (pd.DataFrame({"columns":list(x_train.columns), "coef":list(grid_lr.best_estimator_.coef_.T)}))

Output:

              columns                coef
0          Embarked_C     [0.23649956536]
1              female    [0.892754957337]
2                male   [-0.817790866598]
3          Embarked_Q   [0.0560917611675]
4          Embarked_S   [-0.217627235788]
5            Age_baby    [0.903880875824]
6           Age_child    [0.307975441906]
7           Age_adult    [-0.12853864715]
8           Age_older    [-1.00835357984]
9            Low_fare   [-0.343780990932]
10        Median_fare   [-0.102505740604]
11          High_fare    [0.521250822275]
12       Large family    [-1.40958453387]
13             Single    [0.864627435362]
14       Small family    [0.619921189252]
15       Title_Master     [1.76928521042]
16         Title_Miss  [0.00766966811902]
17           Title_Mr    [-1.21722551405]
18          Title_Mrs    [0.469708936608]
19         Title_Rare   [-0.954474210357]
20      Cabin_Missing    [-0.35111453535]
21  Cabin_Not Missing    [0.426078626089]
22           Pclass_1    [0.279724883526]
23           Pclass_2    [0.295636224026]
24           Pclass_3   [-0.500397016812]

Now use the fitted model to predict on the test set and save the result locally:

#predict with the grid-searched model (lr itself was never fitted)
lr_y_predict = grid_lr.predict(x_test).astype('int')
lr_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':lr_y_predict})
lr_submission.to_csv('../lr_submission.csv', index=False)

Finally, make a submission on Kaggle. The score is 0.7799.

2. Decision tree

Use GridSearchCV to find the optimal max_depth and min_samples_split:

clf = tree.DecisionTreeClassifier(random_state=33)
param_clf = {'max_depth':[3,5,10,15,20,25],'min_samples_split':[2,4,6,8,10,15,20]}
grid_clf = GridSearchCV(estimator = clf, param_grid = param_clf, cv = 5)
grid_clf.fit(x_train,y_train)
print (pd.DataFrame(grid_clf.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_clf.best_params_, '\n', 'Best score: ', grid_clf.best_score_)

#print the feature importances
feature_imp_sorted_clf = pd.DataFrame({'feature': list(x_train.columns),
                                          'importance': grid_clf.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
print (feature_imp_sorted_clf)

#output the predictions
clf_y_predict = grid_clf.predict(x_test).astype('int')
clf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':clf_y_predict})
clf_submission.to_csv('../clf_submission.csv', index=False)

Output:

[mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 2}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 4}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 6}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 8}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 15}, 
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 20}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 2}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 4}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 6}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 8}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 10}, 
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 15}, 
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 20}, 
mean: 0.81930, std: 0.01400, params: {'max_depth': 10, 'min_samples_split': 2}, 
mean: 0.81930, std: 0.01848, params: {'max_depth': 10, 'min_samples_split': 4}, 
mean: 0.82043, std: 0.01939, params: {'max_depth': 10, 'min_samples_split': 6}, 
mean: 0.82267, std: 0.02194, params: {'max_depth': 10, 'min_samples_split': 8}, 
mean: 0.82492, std: 0.02281, params: {'max_depth': 10, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 10, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 10, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 15, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 15, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 15, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 15, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 15, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 15, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 15, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 20, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 20, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 20, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 20, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 20, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 20, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 20, 'min_samples_split': 20}, 
mean: 0.81818, std: 0.01438, params: {'max_depth': 25, 'min_samples_split': 2}, 
mean: 0.81706, std: 0.01711, params: {'max_depth': 25, 'min_samples_split': 4}, 
mean: 0.81818, std: 0.01787, params: {'max_depth': 25, 'min_samples_split': 6}, 
mean: 0.82379, std: 0.02051, params: {'max_depth': 25, 'min_samples_split': 8}, 
mean: 0.82828, std: 0.02255, params: {'max_depth': 25, 'min_samples_split': 10}, 
mean: 0.82604, std: 0.02161, params: {'max_depth': 25, 'min_samples_split': 15}, 
mean: 0.82716, std: 0.01968, params: {'max_depth': 25, 'min_samples_split': 20}] 
 Best param:  {'max_depth': 5, 'min_samples_split': 8} 
 Best score:  0.8338945005611672

              feature  importance
17           Title_Mr    0.579502
12       Large family    0.135564
19         Title_Rare    0.066667
21  Cabin_Not Missing    0.065133
24           Pclass_3    0.045870
9            Low_fare    0.041589
4          Embarked_S    0.020851
2                male    0.014137
7           Age_adult    0.008480
23           Pclass_2    0.007741
11          High_fare    0.007008
22           Pclass_1    0.002868
13             Single    0.001521
14       Small family    0.001146
0          Embarked_C    0.001003
3          Embarked_Q    0.000633
18          Title_Mrs    0.000288
20      Cabin_Missing    0.000000
5            Age_baby    0.000000
16         Title_Miss    0.000000
6           Age_child    0.000000
1              female    0.000000
10        Median_fare    0.000000
8           Age_older    0.000000
15       Title_Master    0.000000

Visualize the decision tree:

print (grid_clf.best_estimator_)
#GridSearchCV refits the best estimator (max_depth=5, min_samples_split=8) on the full training set, so reuse it directly
clf = grid_clf.best_estimator_

os.environ["PATH"] += os.pathsep + '/usr/local/Cellar/graphviz/2.40.1/bin/'

data_feature_name = list(x_train.columns)

dot_data = tree.export_graphviz(clf, out_file=None, feature_names = data_feature_name,filled=True, rounded=True,special_characters=True) 

graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("TitanicTree.pdf")
print('Visible tree plot saved as pdf.')

Output:

[Figure: the exported decision tree (TitanicTree.pdf)]

Finally, make a submission on Kaggle; the accuracy is 0.78947.

3. Random Forest

Tune with GridSearchCV in stages: first pick n_estimators, then max_features, and finally max_depth, min_samples_leaf and min_samples_split:

rf = RandomForestClassifier(random_state=33)
param_rf = {'n_estimators':[i for i in range(10,50,5)]}
#param_rf = {'n_estimators':[10,50,100,200,500,1000]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (pd.DataFrame(grid_rf.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

rf = RandomForestClassifier(random_state=33, n_estimators=20)
param_rf = {'max_features':[i for i in range(2,23,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (pd.DataFrame(grid_rf.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18)
param_rf = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (pd.DataFrame(grid_rf.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18, max_depth=10)
param_rf = {'min_samples_split':[i for i in range(12,25,2)],'min_samples_leaf':[i for i in range(2,21,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (pd.DataFrame(grid_rf.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)

rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18, max_depth=10, min_samples_leaf = 2, min_samples_split = 22, oob_score = True)
rf.fit(x_train,y_train)
#print (rf.oob_score_)
rf_y_predict = rf.predict(x_test).astype('int')
rf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rf_y_predict})
rf_submission.to_csv('../rf_submission.csv', index=False)

The final parameter combination is n_estimators=20, max_features=18, max_depth=10, min_samples_leaf=2, min_samples_split=22, with a best cross-validation score of 0.8439955106621774.
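
For comparison with the decision tree, the forest's feature importances can be printed the same way; a sketch reusing the rf fitted above:

feature_imp_sorted_rf = pd.DataFrame({'feature': list(x_train.columns),
                                      'importance': rf.feature_importances_}).sort_values('importance', ascending=False)
print (feature_imp_sorted_rf)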

Submitting on Kaggle gives 0.79425.

4. AdaBoost

AdaBoost has relatively few hyperparameters. Use GridSearchCV to find the best n_estimators and learning_rate; since a smaller learning rate generally needs more estimators to compensate, the two interact and are searched jointly:

ada = AdaBoostClassifier(random_state=33)
param_ada = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.01,0.1]}
grid_ada = GridSearchCV(estimator = ada, param_grid = param_ada, cv = 5)
grid_ada.fit(x_train,y_train)
print (pd.DataFrame(grid_ada.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_ada.best_params_, '\n', 'Best score: ', grid_ada.best_score_)

ada_y_predict = grid_ada.predict(x_test).astype('int')
ada_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':ada_y_predict})
ada_submission.to_csv('../ada_submission.csv', index=False)

Output:

[mean: 0.77890, std: 0.01317, params: {'learning_rate': 0.001, 'n_estimators': 500}, 
mean: 0.78676, std: 0.01813, params: {'learning_rate': 0.001, 'n_estimators': 1000}, 
mean: 0.79125, std: 0.01352, params: {'learning_rate': 0.001, 'n_estimators': 2000}, 
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.001, 'n_estimators': 5000}, 
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.01, 'n_estimators': 500}, 
mean: 0.82941, std: 0.01887, params: {'learning_rate': 0.01, 'n_estimators': 1000}, 
mean: 0.82828, std: 0.02010, params: {'learning_rate': 0.01, 'n_estimators': 2000}, 
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.01, 'n_estimators': 5000}, 
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.1, 'n_estimators': 500}, 
mean: 0.82155, std: 0.02737, params: {'learning_rate': 0.1, 'n_estimators': 1000}, 
mean: 0.82267, std: 0.02647, params: {'learning_rate': 0.1, 'n_estimators': 2000}, 
mean: 0.82379, std: 0.02674, params: {'learning_rate': 0.1, 'n_estimators': 5000}] 

 Best param:  {'learning_rate': 0.01, 'n_estimators': 1000} 
 Best score:  0.8294051627384961

Submitting on Kaggle gives 0.78947.

5. Gradient tree boosting

Tune with GridSearchCV in stages: first pick n_estimators and learning_rate, then max_depth, min_samples_leaf and min_samples_split:

gtb = GradientBoostingClassifier(random_state=33, subsample=0.8)
param_gtb = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.005,0.01,0.02]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (pd.DataFrame(grid_gtb.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001)
param_gtb = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (pd.DataFrame(grid_gtb.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001, max_depth=10)
param_gtb = {'min_samples_split':[i for i in range(10,18,2)],'min_samples_leaf':[i for i in range(14,19,2)]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (pd.DataFrame(grid_gtb.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
print ('Best param: ', grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)

gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001, max_depth=10, min_samples_split=10 , min_samples_leaf=16)
gtb.fit(x_train,y_train)
gtb_y_predict = gtb.predict(x_test).astype('int')
print (cross_val_score(gtb,x_train,y_train,cv=5).mean())
gtb_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':gtb_y_predict})
gtb_submission.to_csv('../gtb_submission.csv', index=False)

The final parameter combination is n_estimators=1000, learning_rate=0.001, max_depth=10, min_samples_split=10, min_samples_leaf=16, with a best score of 0.8417508417508418.

Submitting on Kaggle gives 0.80382.

V. An Alternative Prediction Method

We know that most women survived and most men did not. How do we pick out the women who did not survive and the men who did? A reasonable assumption is that families tended to live or die together: if a mother survived, her children probably survived too, and if a child died, the mother probably died as well. Some families appear in both the training and the test set, so the fate of a family's women and children in the training set can be used to predict the fate of the same family's members in the test set. The rule: for a young boy in the test set, if all of the women and boys in his family survived, predict that he survived too; for a woman in the test set, if all of the women and boys in her family died, predict that she died too. Everyone else is predicted by sex alone: women survive, men die.

#read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')

#Surname: append Pclass to the surname so unrelated families that share a name are kept apart
train['Surname'] = [train.iloc[i]['Name'].split(',')[0] + str(train.iloc[i]['Pclass']) for i in range(len(train))] #members of one family should share a ticket class
test['Surname'] = [test.iloc[i]['Name'].split(',')[0] + str(test.iloc[i]['Pclass']) for i in range(len(test))] #members of one family should share a ticket class

train['Family_size'] = train['Parch'] + train['SibSp']
test['Family_size'] = test['Parch'] + test['SibSp']

train['Title'] = train['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
test['Title'] = test['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])


boy = (train.Name.str.contains('Master')) | ((train.Sex=='male') & (train.Age<13))
female = train.Sex=='female'
boy_or_female = boy | female

boy_femSurvival = train[boy_or_female].groupby('Surname')['Survived'].mean().to_frame()

boy_femSurvived = list(boy_femSurvival[boy_femSurvival['Survived']==1].index)
boy_femDied = list(boy_femSurvival[boy_femSurvival['Survived']==0].index)

def boy_female_survival(input_dataset):
	for i in range(len(input_dataset)):
		if input_dataset.iloc[i]['Surname'] in boy_femSurvived and input_dataset.iloc[i]['Family_size']>0 and (input_dataset.iloc[i]['Sex']=='female' or 
			(input_dataset.iloc[i]['Title']=='Master' or (input_dataset.iloc[i]['Sex']=='male' and input_dataset.iloc[i]['Age']<13))):
			input_dataset.loc[i,'Survived'] = 1
		elif input_dataset.iloc[i]['Surname'] in boy_femDied and input_dataset.iloc[i]['Family_size']>0:
			input_dataset.loc[i,'Survived'] = 0


boy_female_survival(test)	
#print (test[test['Survived'] == 1][['Name', 'Age', 'Sex', 'Pclass','Family_size']])

test_out1 = test[test['Survived'].notnull()]
test1 = test[test['Survived'].isnull()].copy()  #copy to avoid a SettingWithCopyWarning below
test1.index = range(0,len(test1))

#for everyone left, predict by sex alone: women survive, men die
def gender_survival(sex):
	if sex == 'female':
		return 1
	else:
		return 0

test1['Survived'] = test1['Sex'].map(gender_survival)

#merge the two sets of predictions
test_out = pd.concat([test_out1, test1], axis=0).sort_values(by = 'PassengerId')
test_submission = test_out[['PassengerId','Survived']].copy()
test_submission['Survived'] = test_submission['Survived'].astype('int')

test_submission.to_csv('../test_submission.csv', index=False)

Submitting on Kaggle gives 0.81339, better than any of the modeling approaches above.

 

References:

1. How to score over 82% Titanic 

2. Kaggle_Titanic生存预测 -- 详细流程吐血梳理 (a detailed Kaggle Titanic walkthrough, in Chinese)