Kaggle - Titanic Survival Prediction
This is my first Kaggle competition, and Titanic is a good one to get started with. The goal is to predict each passenger's survival from their personal information. Python 3 is used throughout.
I. Data Overview
From the Kaggle platform we know that the training set contains 891 records and the test set contains 418 records. The variables provided are:
Variable | Definition | Key
---|---|---
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd (a proxy for socio-economic status, SES)
sex | Sex |
age | Age in years | Age is fractional if less than 1; if the age is estimated, it is in the form xx.5
sibsp | # of siblings / spouses aboard the Titanic | Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
parch | # of parents / children aboard the Titanic | Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so parch = 0 for them
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
First, look at the basic information of the training and test sets to get an overall picture of the data size, the data type of each feature, and whether there are missing values:
import pandas as pd
import numpy as np
import re
import os
import matplotlib.pyplot as plt
import pydotplus
from sklearn.feature_selection import chi2
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
# Read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
train_test_combined = pd.concat([train, test], ignore_index=True)
# Check basic information
print (train.info())
print (test.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
We can see that Age, Cabin, and Embarked have missing values in the training set, and Age, Cabin, and Fare have missing values in the test set.
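As a quick cross-check, the number of missing values per column can also be printed directly; this is a small sketch using the train and test DataFrames loaded above.
# Count missing values per column in both sets
print (train.isnull().sum())
print (test.isnull().sum())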
Next, let's look at what the data actually looks like:
# head() prints the first 5 rows by default
print (train.head())
I am using the Sublime editor, and with so many columns the output wraps across several lines and is hard to read, so I looked at the data directly on Kaggle instead. Below is a screenshot of the data from Kaggle (screenshot omitted).
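Alternatively, instead of a screenshot, the pandas display options can be widened so that head() fits on one screen; this is a small sketch using standard pandas display settings.
# Show all columns and widen the console output so head() is readable
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
print (train.head())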
II. Preliminary Data Analysis
1. Basic passenger attributes
For categorical variables such as Survived, Sex, Pclass, and Embarked, we use pie charts to show their composition. For discrete numerical variables such as SibSp and Parch, we use bar charts to show their distributions. For continuous numerical variables such as Age and Fare, we use histograms.
# Pie charts for the categorical variables
# labeldistance: how far the labels sit from the center; 1.1 means 1.1 times the radius
# autopct: format of the percentage text inside the pie, e.g. "%3.1f%%" for a float with one decimal place
# shadow: whether the pie has a shadow
# startangle: starting angle; wedges are drawn counter-clockwise from this angle, and starting at 90 degrees usually looks better
# pctdistance: distance of the percentage text from the center
plt.subplot(2,2,1)
survived_counts = train['Survived'].value_counts()
survived_labels = ['Died','Survived']
plt.pie(x=survived_counts,labels=survived_labels,autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
# Make the pie chart a perfect circle
plt.axis('equal')
#plt.show()
plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts,labels=gender_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')
plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts,labels=pclass_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')
plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts,labels=embarked_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')
plt.show()
plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()),list(sibsp_counts.values()))
plt.title('SibSp')
plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()),list(parch_counts.values()))
plt.title('Parch')
plt.style.use( 'ggplot')
plt.subplot(2,2,3)
plt.hist(train.Age,bins=np.arange(0,100,5),range=(0,100),color = 'steelblue', edgecolor = 'k')
plt.title('Age')
plt.subplot(2,2,4)
plt.hist(train.Fare,bins=20,color = 'steelblue', edgecolor = 'k')
plt.title('Fare')
plt.show()
2. Relationship between each factor and survival
(1) Sex:
Compute the survival rate for each sex:
print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())
Output:
Sex Survived
female 1 233
0 81
male 0 468
1 109
Sex
female 0.742038
male 0.188908
We can see that the survival rate is 74.20% for women but only 18.89% for men; women's survival rate is far higher than men's, so sex is an important factor.
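The same numbers can also be read off a normalized crosstab; this is a small sketch on the training DataFrame (normalize='index' assumes a reasonably recent pandas).
# Survival rate by sex: each row sums to 1
print (pd.crosstab(train['Sex'], train['Survived'], normalize='index'))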
(2) Age:
Compute the survival rate at each age:
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
train_age = train.dropna(subset=['Age']).copy()
train_age["Age_int"] = train_age["Age"].astype(int)
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar')
plt.show()
Output: a bar chart of survival rate by integer age (figure omitted).
We can see that young children have a relatively high survival rate, while several of the older age groups have a survival rate of 0. Let's also look at the exact numbers of survivors and non-survivors at each age.
print (train_age.groupby('Age_int')['Survived'].value_counts())
Output:
Age_int Survived
0 1 7
1 1 5
0 2
2 0 7
1 3
3 1 5
0 1
4 1 7
0 3
5 1 4
6 1 2
0 1
7 0 2
1 1
8 0 2
1 2
9 0 6
1 2
10 0 2
11 0 3
1 1
12 1 1
13 1 2
14 0 4
1 3
15 1 4
0 1
16 0 11
1 6
17 0 7
1 6
18 0 17
1 9
19 0 16
1 9
20 0 13
1 3
21 0 19
1 5
22 0 16
1 11
23 0 11
1 5
24 0 16
1 15
25 0 17
1 6
26 0 12
1 6
27 1 11
0 7
28 0 20
1 7
29 0 12
1 8
30 0 17
1 10
31 0 9
1 8
32 0 10
1 10
33 0 9
1 6
34 0 10
1 6
35 1 11
0 7
36 0 12
1 11
37 0 5
1 1
38 0 6
1 5
39 0 9
1 5
40 0 9
1 6
41 0 4
1 2
42 0 7
1 6
43 0 4
1 1
44 0 6
1 3
45 0 9
1 5
46 0 3
47 0 8
1 1
48 1 6
0 3
49 1 4
0 2
50 0 5
1 5
51 0 5
1 2
52 0 3
1 3
53 1 1
54 0 5
1 3
55 0 2
1 1
56 0 2
1 2
57 0 2
58 1 3
0 2
59 0 2
60 0 2
1 2
61 0 3
62 0 2
1 2
63 1 2
64 0 2
65 0 3
66 0 1
70 0 3
71 0 2
74 0 1
80 1 1
Next we bin age into several groups and compute the survival rate for each group. Children under 1 year old all survived, so we treat them as a separate group, and then split the rest into 1-15, 15-55, and >55.
train_age['Age_derived'] = pd.cut(train_age['Age'], bins=[0,0.99,14.99,54.99,100])
print (train_age.groupby('Age_derived')['Survived'].value_counts())
print (train_age.groupby('Age_derived')['Survived'].mean())
Output:
Age_derived Survived
(0.0, 0.99] 1 7
(0.99, 14.99] 1 38
0 33
(14.99, 54.99] 0 362
1 232
(54.99, 100.0] 0 29
1 13
Age_derived
(0.0, 0.99] 1.000000
(0.99, 14.99] 0.535211
(14.99, 54.99] 0.390572
(54.99, 100.0] 0.309524
We can see that children have a higher survival rate than adults and the elderly.
(3) Ticket class (Pclass):
Compute the survival rate for each ticket class:
print (train.groupby('Pclass')['Survived'].value_counts())
print (train.groupby('Pclass')['Survived'].mean())
Output:
Pclass Survived
1 1 136
0 80
2 0 97
1 87
3 0 372
1 119
Pclass
1 0.629630
2 0.472826
3 0.242363
We can see that the survival rate is 62.96% for first class, 47.28% for second class, and 24.24% for third class, so ticket class is also an important factor for survival.
(4) Port of embarkation (Embarked):
Compute the survival rate for passengers from each port of embarkation:
print (train.groupby('Embarked')['Survived'].value_counts())
print (train.groupby('Embarked')['Survived'].mean())
Output:
Embarked Survived
C 1 93
0 75
Q 0 47
1 30
S 0 427
1 217
Embarked
C 0.553571
Q 0.389610
S 0.336957
The survival rate is 55.36% for port C, 38.96% for port Q, and 33.70% for port S. Port C has a noticeably higher survival rate, so the port of embarkation may also affect survival.
(5) Siblings/spouses aboard (SibSp) and parents/children aboard (Parch)
Compute the survival rate for each value of SibSp and Parch:
print (train.groupby('SibSp')['Survived'].value_counts())
print (train.groupby('SibSp')['Survived'].mean())
print (train.groupby('Parch')['Survived'].value_counts())
print (train.groupby('Parch')['Survived'].mean())
Output:
SibSp Survived
0 0 398
1 210
1 1 112
0 97
2 0 15
1 13
3 0 12
1 4
4 0 15
1 3
5 0 5
8 0 7
SibSp
0 0.345395
1 0.535885
2 0.464286
3 0.250000
4 0.166667
5 0.000000
8 0.000000
Parch Survived
0 0 445
1 233
1 1 65
0 53
2 0 40
1 40
3 1 3
0 2
4 0 4
5 0 4
1 1
6 0 1
Parch
0 0.343658
1 0.550847
2 0.500000
3 0.600000
4 0.000000
5 0.200000
6 0.000000
We can see that passengers travelling alone have a lower survival rate, but having too many relatives aboard also lowers it.
(6) Cabin:
Cabin has a very high missing rate, so imputation is not practical. For now we simply split it into missing vs. not missing and compute the survival rate for each group.
train.loc[train['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train.loc[train['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
print (train.groupby('Cabin_derived')['Survived'].value_counts())
print (train.groupby('Cabin_derived')['Survived'].mean())
Output:
Cabin_derived Survived
Missing 0 481
1 206
Not Missing 1 136
0 68
Cabin_derived
Missing 0.299854
Not Missing 0.666667
The survival rate is 29.99% for passengers with a missing Cabin and 66.67% for those with a recorded Cabin, so whether Cabin is missing may be related to survival.
(7) Fare
First, check whether fares differ between survivors and non-survivors:
print (train['Fare'][train['Survived'] == 0].describe())
print (train['Fare'][train['Survived'] == 1].describe())
Output:
count 549.000000
mean 22.117887
std 31.388207
min 0.000000
25% 7.854200
50% 10.500000
75% 26.000000
max 263.000000
count 342.000000
mean 48.395408
std 66.596998
min 0.000000
25% 12.475000
50% 26.000000
75% 57.000000
max 512.329200
The median fare is 26 for survivors and 10.5 for non-survivors, a fairly clear difference.
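The same comparison can be made in one line; a small sketch on the training set.
# Median fare for non-survivors (0) and survivors (1)
print (train.groupby('Survived')['Fare'].median())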
(8) Name:
At first glance every passenger's name is unique, so this feature seems to have little value. That intuition is quite wrong: in the Titanic data, Name is very informative and important features can be extracted from it. First, let's see what Name looks like:
print (train.Name)
Output:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
10 Sandstrom, Miss. Marguerite Rut
11 Bonnell, Miss. Elizabeth
12 Saundercock, Mr. William Henry
13 Andersson, Mr. Anders Johan
14 Vestrom, Miss. Hulda Amanda Adolfina
15 Hewlett, Mrs. (Mary D Kingcome)
16 Rice, Master. Eugene
17 Williams, Mr. Charles Eugene
18 Vander Planke, Mrs. Julius (Emelia Maria Vande...
19 Masselmani, Mrs. Fatima
...
We can see that Name contains a title: Mr., Mrs., Miss., Master., and so on. So let's first extract a separate feature, Title:
train['Title'] = train['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
print (train['Title'].value_counts())
Output:
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Col 2
Major 2
Mlle 2
Mme 1
Ms 1
Don 1
Sir 1
Jonkheer 1
Capt 1
Lady 1
the Countess 1
Name: Title, dtype: int64
Master appears fairly often; let's see which group of passengers it refers to:
print (train[train['Title'] == 'Master'][['Survived','Title','Sex','Parch','SibSp','Fare','Age','Embarked']])
Output:
Survived Title Sex Parch SibSp Fare Age Embarked
7 0 Master male 1 3 21.0750 2.00 S
16 0 Master male 1 4 29.1250 2.00 Q
50 0 Master male 1 4 39.6875 7.00 S
59 0 Master male 2 5 46.9000 11.00 S
63 0 Master male 2 3 27.9000 4.00 S
65 1 Master male 1 1 15.2458 NaN C
78 1 Master male 2 0 29.0000 0.83 S
125 1 Master male 0 1 11.2417 12.00 C
159 0 Master male 2 8 69.5500 NaN S
164 0 Master male 1 4 39.6875 1.00 S
165 1 Master male 2 0 20.5250 9.00 S
171 0 Master male 1 4 29.1250 4.00 Q
176 0 Master male 1 3 25.4667 NaN S
...
We can see that Master refers to young boys.
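To back this up, we can summarize the ages of passengers with the title Master; a quick sketch reusing the Title column created above.
# Age distribution of passengers titled "Master"
print (train.loc[train['Title'] == 'Master', 'Age'].describe())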
There are quite a few distinct titles, so let's merge them and then check whether the survival rate differs across titles:
train['Title'] = train['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace(['Mlle', 'Ms'],'Miss')
train['Title'] = train['Title'].replace('Mme','Mrs')
print (train.groupby('Title')['Survived'].value_counts())
print (train.groupby('Title')['Survived'].mean())
Output:
Title Survived
Master 1 23
0 17
Miss 1 130
0 55
Mr 0 436
1 81
Mrs 1 100
0 26
Rare 0 15
1 8
Title
Master 0.575000
Miss 0.702703
Mr 0.156673
Mrs 0.793651
Rare 0.347826
We can see that Title is a factor that affects the survival rate.
III. Data Preprocessing
This includes missing value imputation, discretization of continuous numerical variables, and dummy encoding of categorical variables. Preprocessing is performed on the combined training and test sets.
1. Missing value imputation
From the analysis above, Age, Cabin, and Embarked have missing values in the training set, and Age, Cabin, and Fare have missing values in the test set. Cabin's missing rate (>70%) is too high, so we do not impute it.
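As a sanity check, the Cabin missing rate on the combined data can be computed directly; a small sketch (from the info() output above, 1014 of 1309 Cabin values are missing, roughly 77%).
# Fraction of missing Cabin values in the combined data
print (train_test_combined['Cabin'].isnull().mean())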
(1) Imputing Embarked:
Embarked is a categorical variable (port of embarkation: C, Q, or S). We fill the missing values with the most frequent value.
train_test_combined['Embarked'].fillna(train_test_combined['Embarked'].mode().iloc[0], inplace=True)
(2) Imputing Fare:
Fare is a numerical variable; we fill the missing value with the mean Fare of the corresponding Pclass.
train_test_combined['Fare'] = train_test_combined['Fare'].fillna(train_test_combined.groupby('Pclass')['Fare'].transform('mean'))
(3) Imputing Age:
Age is a numerical variable; we fill missing values with the mean age of passengers with the same Title (Mr, Mrs, Miss, Master, etc.).
train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
train_test_combined['Age'] = train_test_combined['Age'].fillna(train_test_combined.groupby('Title')['Age'].transform('mean'))
After filling the missing values, check the data again:
print (train_test_combined.info())
Output:
Data columns (total 13 columns):
Age 1309 non-null float64
Cabin 295 non-null object
Embarked 1309 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
Title 1309 non-null object
dtypes: float64(3), int64(4), object(6)
2. Discretizing continuous numerical variables
(1) Age:
Based on the earlier analysis of age vs. survival, we split age into four groups: <1, 1-<15, 15-<55, and >=55.
train_test_combined['Age_derived'] = pd.cut(train_test_combined['Age'], bins=[0,0.99,14.99,54.99,100],labels=['baby','child','adult','older'])
age_dummy = pd.get_dummies(train_test_combined['Age_derived']).rename(columns=lambda x: 'Age_' + str(x))
train_test_combined = pd.concat([train_test_combined,age_dummy],axis=1)
(2) Fare:
Analyzing Ticket shows that some passengers share the same ticket number, i.e., there are group tickets, so the group fare needs to be split evenly among the passengers on that ticket.
print (train_test_combined.Ticket.value_counts())
Output:
CA. 2343 11
CA 2144 8
1601 8
347082 7
3101295 7
PC 17608 7
S.O.C. 14879 7
347077 7
19950 6
347088 6
113781 6
382652 6
...
Split the group fares evenly:
train_test_combined['Group_ticket'] = train_test_combined['Fare'].groupby(by=train_test_combined['Ticket']).transform('count')
train_test_combined['Fare'] = train_test_combined['Fare']/train_test_combined['Group_ticket']
Check the mean, median, and other summary statistics of Fare:
print (train_test_combined['Fare'].describe())
Output:
count 1309.000000
mean 14.756516
std 13.550515
min 0.000000
25% 7.550000
50% 8.050000
75% 15.000000
max 128.082300
Name: Fare, dtype: float64
We use the 25th and 75th percentiles to split Fare into three bands: Low_fare: <=7.55, Median_fare: 7.55-15.00, High_fare: >15.00.
train_test_combined['Fare_derived'] = pd.cut(train_test_combined['Fare'], bins=[-1,7.55,15.00,130], labels=['Low_fare','Median_fare','High_fare'])
fare_dummy = pd.get_dummies(train_test_combined['Fare_derived']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,fare_dummy],axis=1)
3. Family Size
SibSp and Parch both reflect the number of relatives aboard, so we add them together to form a new variable, Family_size.
train_test_combined['Family_size'] = train_test_combined['Parch'] + train_test_combined['SibSp']
print (train_test_combined.groupby('Family_size')['Survived'].value_counts())
print (train_test_combined.groupby('Family_size')['Survived'].mean())
Output:
Family_size Survived
0 0 374
1 163
1 1 89
0 72
2 1 59
0 43
3 1 21
0 8
4 0 12
1 3
5 0 19
1 3
6 0 8
1 4
7 0 6
10 0 7
Family_size
0 0.303538
1 0.552795
2 0.578431
3 0.724138
4 0.200000
5 0.136364
6 0.333333
7 0.000000
10 0.000000
Passengers travelling alone or with a very large family both have lower survival rates. We therefore group Family_size into three categories: Single, Small family, and Large family.
def family_size_category(Family_size):
    if Family_size == 0:
        return 'Single'
    elif Family_size <= 3:
        return 'Small family'
    else:
        return 'Large family'
train_test_combined['Family_size_category'] = train_test_combined['Family_size'].map(family_size_category)
family_dummy = pd.get_dummies(train_test_combined['Family_size_category']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,family_dummy],axis=1)
4. Title
Extract the Title feature from Name:
train_test_combined['Title'] = train_test_combined['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
train_test_combined['Title'] = train_test_combined['Title'].replace(['Lady', 'the Countess','Capt', 'Col','Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train_test_combined['Title'] = train_test_combined['Title'].replace(['Mlle', 'Ms'],'Miss')
train_test_combined['Title'] = train_test_combined['Title'].replace('Mme','Mrs')
title_dummy = pd.get_dummies(train_test_combined['Title']).rename(columns=lambda x: 'Title_' + str(x))
train_test_combined = pd.concat([train_test_combined,title_dummy],axis=1)
5. Cabin
Create a new variable based on whether Cabin is missing:
train_test_combined.loc[train_test_combined['Cabin'].isnull(),'Cabin_derived'] = 'Missing'
train_test_combined.loc[train_test_combined['Cabin'].notnull(),'Cabin_derived'] = 'Not Missing'
cabin_dummy = pd.get_dummies(train_test_combined['Cabin_derived']).rename(columns=lambda x: 'Cabin_' + str(x))
train_test_combined = pd.concat([train_test_combined,cabin_dummy],axis=1)
6. Pclass, Sex, Embarked
These three variables only need dummy encoding, with no further processing.
# Dummy-encode Pclass
pclass_dummy = pd.get_dummies(train_test_combined['Pclass']).rename(columns=lambda x: 'Pclass_' + str(x))
train_test_combined = pd.concat([train_test_combined,pclass_dummy],axis=1)
# Dummy-encode Sex
sex_dummy = pd.get_dummies(train_test_combined['Sex']).rename(columns=lambda x: str(x))
train_test_combined = pd.concat([train_test_combined,sex_dummy],axis=1)
# Dummy-encode Embarked
embarked_dummy = pd.get_dummies(train_test_combined['Embarked']).rename(columns=lambda x: 'Embarked_' + str(x))
train_test_combined = pd.concat([train_test_combined,embarked_dummy],axis=1)
Finally, split the combined data back into the training and test sets and keep the useful features.
train = train_test_combined[:891]
test = train_test_combined[891:]
selected_features = ['Embarked_C','female', 'male',
'Embarked_Q', 'Embarked_S', 'Age_baby', 'Age_child',
'Age_adult', 'Age_older', 'Low_fare',
'Median_fare', 'High_fare',
'Large family', 'Single', 'Small family', 'Title_Master', 'Title_Miss',
'Title_Mr', 'Title_Mrs', 'Title_Rare',
'Cabin_Missing', 'Cabin_Not Missing', 'Pclass_1',
'Pclass_2', 'Pclass_3']
x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']
At this point data preprocessing is complete, and we can move on to modeling and prediction.
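Before tuning individual models, it can be helpful to establish a quick untuned baseline; this is a minimal sketch using the imports above, and its score is not part of the original results.
# 5-fold cross-validated accuracy of an untuned logistic regression as a baseline
baseline_lr = LogisticRegression(random_state=33)
print (cross_val_score(baseline_lr, x_train, y_train, cv=5).mean())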
IV. Modeling and Analysis
1. Logistic regression
Use grid search with cross-validation (GridSearchCV) to find the best hyperparameter C:
lr = LogisticRegression(random_state=33)
param_lr = {'C':np.logspace(-4,4,9)}
grid_lr = GridSearchCV(estimator = lr, param_grid = param_lr, cv = 5)
grid_lr.fit(x_train,y_train)
print (grid_lr.grid_scores_,'\n', 'Best param: ' ,grid_lr.best_params_, '\n', 'Best score: ', grid_lr.best_score_)
Output:
[mean: 0.64646, std: 0.00833, params: {'C': 0.0001},
mean: 0.70595, std: 0.01292, params: {'C': 0.001},
mean: 0.80471, std: 0.02215, params: {'C': 0.01},
mean: 0.82043, std: 0.00361, params: {'C': 0.10000000000000001},
mean: 0.82492, std: 0.02629, params: {'C': 1.0},
mean: 0.82379, std: 0.02747, params: {'C': 10.0},
mean: 0.82492, std: 0.02813, params: {'C': 100.0},
mean: 0.82492, std: 0.02813, params: {'C': 1000.0},
mean: 0.82492, std: 0.02813, params: {'C': 10000.0}]
Best param: {'C': 1.0}
Best score: 0.8249158249158249
Print the coefficient of each feature:
print (pd.DataFrame({"columns":list(x_train.columns), "coef":list(grid_lr.best_estimator_.coef_.T)}))
Output:
columns coef
0 Embarked_C [0.23649956536]
1 female [0.892754957337]
2 male [-0.817790866598]
3 Embarked_Q [0.0560917611675]
4 Embarked_S [-0.217627235788]
5 Age_baby [0.903880875824]
6 Age_child [0.307975441906]
7 Age_adult [-0.12853864715]
8 Age_older [-1.00835357984]
9 Low_fare [-0.343780990932]
10 Median_fare [-0.102505740604]
11 High_fare [0.521250822275]
12 Large family [-1.40958453387]
13 Single [0.864627435362]
14 Small family [0.619921189252]
15 Title_Master [1.76928521042]
16 Title_Miss [0.00766966811902]
17 Title_Mr [-1.21722551405]
18 Title_Mrs [0.469708936608]
19 Title_Rare [-0.954474210357]
20 Cabin_Missing [-0.35111453535]
21 Cabin_Not Missing [0.426078626089]
22 Pclass_1 [0.279724883526]
23 Pclass_2 [0.295636224026]
24 Pclass_3 [-0.500397016812]
Now use the tuned model to predict on the test set and save the result locally.
lr_y_predict = grid_lr.predict(x_test).astype('int')
lr_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':lr_y_predict})
lr_submission.to_csv('../lr_submission.csv', index=False)
Finally, make a submission on Kaggle. The score is 0.7799.
2. Decision tree
Use GridSearchCV to find the best max_depth and min_samples_split:
clf = tree.DecisionTreeClassifier(random_state=33)
param_clf = {'max_depth':[3,5,10,15,20,25],'min_samples_split':[2,4,6,8,10,15,20]}
grid_clf = GridSearchCV(estimator = clf, param_grid = param_clf, cv = 5)
grid_clf.fit(x_train,y_train)
print (grid_clf.grid_scores_,'\n', 'Best param: ' ,grid_clf.best_params_, '\n', 'Best score: ', grid_clf.best_score_)
# Print the feature importances
feature_imp_sorted_clf = pd.DataFrame({'feature': list(x_train.columns),
'importance': grid_clf.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
print (feature_imp_sorted_clf)
# Output the predictions
clf_y_predict = grid_clf.predict(x_test).astype('int')
clf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':clf_y_predict})
clf_submission.to_csv('../clf_submission.csv', index=False)
Output:
[mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 2},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 4},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 6},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 8},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 10},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 15},
mean: 0.82604, std: 0.02448, params: {'max_depth': 3, 'min_samples_split': 20},
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 2},
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 4},
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 6},
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 8},
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 10},
mean: 0.83389, std: 0.02877, params: {'max_depth': 5, 'min_samples_split': 15},
mean: 0.83277, std: 0.02694, params: {'max_depth': 5, 'min_samples_split': 20},
mean: 0.81930, std: 0.01400, params: {'max_depth': 10, 'min_samples_split': 2},
mean: 0.81930, std: 0.01848, params: {'max_depth': 10, 'min_samples_split': 4},
mean: 0.82043, std: 0.01939, params: {'max_depth': 10, 'min_samples_split': 6},
mean: 0.82267, std: 0.02194, params: {'max_depth': 10, 'min_samples_split': 8},
mean: 0.82492, std: 0.02281, params: {'max_depth': 10, 'min_samples_split': 10},
mean: 0.82604, std: 0.02161, params: {'max_depth': 10, 'min_samples_split': 15},
mean: 0.82716, std: 0.01968, params: {'max_depth': 10, 'min_samples_split': 20},
mean: 0.81818, std: 0.01438, params: {'max_depth': 15, 'min_samples_split': 2},
mean: 0.81706, std: 0.01711, params: {'max_depth': 15, 'min_samples_split': 4},
mean: 0.81818, std: 0.01787, params: {'max_depth': 15, 'min_samples_split': 6},
mean: 0.82379, std: 0.02051, params: {'max_depth': 15, 'min_samples_split': 8},
mean: 0.82828, std: 0.02255, params: {'max_depth': 15, 'min_samples_split': 10},
mean: 0.82604, std: 0.02161, params: {'max_depth': 15, 'min_samples_split': 15},
mean: 0.82716, std: 0.01968, params: {'max_depth': 15, 'min_samples_split': 20},
mean: 0.81818, std: 0.01438, params: {'max_depth': 20, 'min_samples_split': 2},
mean: 0.81706, std: 0.01711, params: {'max_depth': 20, 'min_samples_split': 4},
mean: 0.81818, std: 0.01787, params: {'max_depth': 20, 'min_samples_split': 6},
mean: 0.82379, std: 0.02051, params: {'max_depth': 20, 'min_samples_split': 8},
mean: 0.82828, std: 0.02255, params: {'max_depth': 20, 'min_samples_split': 10},
mean: 0.82604, std: 0.02161, params: {'max_depth': 20, 'min_samples_split': 15},
mean: 0.82716, std: 0.01968, params: {'max_depth': 20, 'min_samples_split': 20},
mean: 0.81818, std: 0.01438, params: {'max_depth': 25, 'min_samples_split': 2},
mean: 0.81706, std: 0.01711, params: {'max_depth': 25, 'min_samples_split': 4},
mean: 0.81818, std: 0.01787, params: {'max_depth': 25, 'min_samples_split': 6},
mean: 0.82379, std: 0.02051, params: {'max_depth': 25, 'min_samples_split': 8},
mean: 0.82828, std: 0.02255, params: {'max_depth': 25, 'min_samples_split': 10},
mean: 0.82604, std: 0.02161, params: {'max_depth': 25, 'min_samples_split': 15},
mean: 0.82716, std: 0.01968, params: {'max_depth': 25, 'min_samples_split': 20}]
Best param: {'max_depth': 5, 'min_samples_split': 8}
Best score: 0.8338945005611672
Feature importances:
17 Title_Mr 0.579502
12 Large family 0.135564
19 Title_Rare 0.066667
21 Cabin_Not Missing 0.065133
24 Pclass_3 0.045870
9 Low_fare 0.041589
4 Embarked_S 0.020851
2 male 0.014137
7 Age_adult 0.008480
23 Pclass_2 0.007741
11 High_fare 0.007008
22 Pclass_1 0.002868
13 Single 0.001521
14 Small family 0.001146
0 Embarked_C 0.001003
3 Embarked_Q 0.000633
18 Title_Mrs 0.000288
20 Cabin_Missing 0.000000
5 Age_baby 0.000000
16 Title_Miss 0.000000
6 Age_child 0.000000
1 female 0.000000
10 Median_fare 0.000000
8 Age_older 0.000000
15 Title_Master 0.000000
Export and visualize the decision tree:
print (grid_clf.best_estimator_)
# Re-create the best estimator found by the grid search (the parameters printed above) and refit it for visualization
clf = tree.DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=8,
min_weight_fraction_leaf=0.0, presort=False, random_state=33,
splitter='best')
clf.fit(x_train,y_train)
os.environ["PATH"] += os.pathsep + '/usr/local/Cellar/graphviz/2.40.1/bin/'
data_feature_name = list(x_train.columns)
dot_data = tree.export_graphviz(clf, out_file=None, feature_names = data_feature_name,filled=True, rounded=True,special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("TitanicTree.pdf")
print('Visible tree plot saved as pdf.')
Output: 'Visible tree plot saved as pdf.' The rendered tree is saved to TitanicTree.pdf (figure omitted).
Finally, make a submission on Kaggle; the score is 0.78947.
3. Random Forest
Use GridSearchCV to tune the parameters: first fix n_estimators, then max_features, and finally max_depth, min_samples_leaf, and min_samples_split.
rf = RandomForestClassifier(random_state=33)
param_rf = {'n_estimators':[i for i in range(10,50,5)]}
#param_rf = {'n_estimators':[10,50,100,200,500,1000]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)
rf = RandomForestClassifier(random_state=33, n_estimators=20)
param_rf = {'max_features':[i for i in range(2,23,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18)
param_rf = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18, max_depth=10)
param_rf = {'min_samples_split':[i for i in range(12,25,2)],'min_samples_leaf':[i for i in range(2,21,2)]}
grid_rf = GridSearchCV(estimator = rf, param_grid = param_rf, cv = 5)
grid_rf.fit(x_train,y_train)
print (grid_rf.grid_scores_,'\n', 'Best param: ' ,grid_rf.best_params_, '\n', 'Best score: ', grid_rf.best_score_)
rf = RandomForestClassifier(random_state=33, n_estimators=20, max_features = 18, max_depth=10, min_samples_leaf = 2, min_samples_split = 22, oob_score = True)
rf.fit(x_train,y_train)
#print (rf.oob_score_)
rf_y_predict = rf.predict(x_test).astype('int')
rf_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rf_y_predict})
rf_submission.to_csv('../rf_submission.csv', index=False)
The final parameter combination is n_estimators=20, max_features=18, max_depth=10, min_samples_leaf=2, min_samples_split=22, with a best cross-validation score of 0.8439955106621774.
Submitting to Kaggle gives 0.79425.
4. AdaBoost
AdaBoost has relatively few parameters to tune. Use GridSearchCV to find the best n_estimators and learning_rate; these two parameters are tuned together.
ada = AdaBoostClassifier(random_state=33)
param_ada = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.01,0.1]}
grid_ada = GridSearchCV(estimator = ada, param_grid = param_ada, cv = 5)
grid_ada.fit(x_train,y_train)
print (grid_ada.grid_scores_,'\n', 'Best param: ' ,grid_ada.best_params_, '\n', 'Best score: ', grid_ada.best_score_)
ada_y_predict = grid_ada.predict(x_test).astype('int')
ada_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':ada_y_predict})
ada_submission.to_csv('../ada_submission.csv', index=False)
Output:
[mean: 0.77890, std: 0.01317, params: {'learning_rate': 0.001, 'n_estimators': 500},
mean: 0.78676, std: 0.01813, params: {'learning_rate': 0.001, 'n_estimators': 1000},
mean: 0.79125, std: 0.01352, params: {'learning_rate': 0.001, 'n_estimators': 2000},
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.001, 'n_estimators': 5000},
mean: 0.81818, std: 0.01382, params: {'learning_rate': 0.01, 'n_estimators': 500},
mean: 0.82941, std: 0.01887, params: {'learning_rate': 0.01, 'n_estimators': 1000},
mean: 0.82828, std: 0.02010, params: {'learning_rate': 0.01, 'n_estimators': 2000},
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.01, 'n_estimators': 5000},
mean: 0.82492, std: 0.02700, params: {'learning_rate': 0.1, 'n_estimators': 500},
mean: 0.82155, std: 0.02737, params: {'learning_rate': 0.1, 'n_estimators': 1000},
mean: 0.82267, std: 0.02647, params: {'learning_rate': 0.1, 'n_estimators': 2000},
mean: 0.82379, std: 0.02674, params: {'learning_rate': 0.1, 'n_estimators': 5000}]
Best param: {'learning_rate': 0.01, 'n_estimators': 1000}
Best score: 0.8294051627384961
Submitting to Kaggle gives 0.78947.
5. Gradient tree boosting
Use GridSearchCV: first determine n_estimators and learning_rate, then max_depth, min_samples_leaf, and min_samples_split.
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8)
param_gtb = {'n_estimators':[500,1000,2000,5000],'learning_rate':[0.001,0.005,0.01,0.02]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001)
param_gtb = {'max_depth':[i for i in range(10,25,5)],'min_samples_split':[i for i in range(12,21,2)]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001, max_depth=10)
param_gtb = {'min_samples_split':[i for i in range(10,18,2)],'min_samples_leaf':[i for i in range(14,19,2)]}
grid_gtb = GridSearchCV(estimator = gtb, param_grid = param_gtb, cv = 5)
grid_gtb.fit(x_train,y_train)
print (grid_gtb.grid_scores_,'\n', 'Best param: ' ,grid_gtb.best_params_, '\n', 'Best score: ', grid_gtb.best_score_)
gtb = GradientBoostingClassifier(random_state=33, subsample=0.8, n_estimators=1000, learning_rate=0.001, max_depth=10, min_samples_split=10 , min_samples_leaf=16)
gtb.fit(x_train,y_train)
gtb_y_predict = gtb.predict(x_test).astype('int')
print (cross_val_score(gtb,x_train,y_train,cv=5).mean())
gtb_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':gtb_y_predict})
gtb_submission.to_csv('../gtb_submission.csv', index=False)
The final parameter combination is n_estimators=1000, learning_rate=0.001, max_depth=10, min_samples_split=10, min_samples_leaf=16, with a best cross-validation score of 0.8417508417508418.
Submitting to Kaggle gives 0.80382.
V. An Alternative Prediction Method
We know that most women survived and most men did not. So how do we identify which women did not survive and which men did? A reasonable assumption is: if a mother survived, her children survived too; if a child died, the mother died as well. Since some families appear in both the training and test sets, we can use the survival of the women and boys of a family in the training set to predict the survival of the women and boys of the same family in the test set. The rule is: for a young boy in the test set, if all the women and boys in his family survived, predict that he survived; for a woman in the test set, if all the women and boys in her family died, predict that she died. For all remaining passengers, fall back on sex: predict that women survived and men did not.
# Read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
#Surname
train['Surname'] = [train.iloc[i]['Name'].split(',')[0] + str(train.iloc[i]['Pclass']) for i in range(len(train))] # members of the same family should share the same Pclass
test['Surname'] = [test.iloc[i]['Name'].split(',')[0] + str(test.iloc[i]['Pclass']) for i in range(len(test))] # members of the same family should share the same Pclass
train['Family_size'] = train['Parch'] + train['SibSp']
test['Family_size'] = test['Parch'] + test['SibSp']
train['Title'] = train['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
test['Title'] = test['Name'].map(lambda x: re.compile(", (.*?)\.").findall(x)[0])
boy = (train.Name.str.contains('Master')) | ((train.Sex=='male') & (train.Age<13))
female = train.Sex=='female'
boy_or_female = boy | female
boy_femSurvival = train[boy_or_female].groupby('Surname')['Survived'].mean().to_frame()
boy_femSurvived = list(boy_femSurvival[boy_femSurvival['Survived']==1].index)
boy_femDied = list(boy_femSurvival[boy_femSurvival['Survived']==0].index)
def boy_female_survival(input_dataset):
    for i in range(len(input_dataset)):
        if (input_dataset.iloc[i]['Surname'] in boy_femSurvived and input_dataset.iloc[i]['Family_size'] > 0 and
                (input_dataset.iloc[i]['Sex'] == 'female' or input_dataset.iloc[i]['Title'] == 'Master' or
                 (input_dataset.iloc[i]['Sex'] == 'male' and input_dataset.iloc[i]['Age'] < 13))):
            input_dataset.loc[i,'Survived'] = 1
        elif input_dataset.iloc[i]['Surname'] in boy_femDied and input_dataset.iloc[i]['Family_size'] > 0:
            input_dataset.loc[i,'Survived'] = 0
boy_female_survival(test)
#print (test[test['Survived'] == 1][['Name', 'Age', 'Sex', 'Pclass','Family_size']])
test_out1 = test[test['Survived'].notnull()]
test1 = test[test['Survived'].isnull()].copy()
test1.index = range(0,len(test1))
# For the remaining passengers, predict survival from sex alone
def gender_survival(sex):
    if sex == 'female':
        return 1
    else:
        return 0
test1['Survived'] = test1['Sex'].map(gender_survival)
# Combine the two sets of predictions
test_out = pd.concat([test_out1, test1], axis=0).sort_values(by = 'PassengerId')
test_submission = test_out[['PassengerId','Survived']].copy()
test_submission['Survived'] = test_submission['Survived'].astype('int')
test_submission.to_csv('../test_submission.csv', index=False)
Finally, submitting to Kaggle gives 0.81339, better than the modeling approaches above.
References:
1. How to score over 82% Titanic
2. Kaggle_Titanic生存预测 -- 详细流程吐血梳理