
DA-002 [Issue 4] Python Data Analysis: Using get_dummies to Encode Categorical Features as Indicator Variables

Posted on 程序员文章站, 2022-05-25 18:01:33

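When building a logistic regression model, all input features must be numeric, so categorical features are first converted into indicator (dummy) variables.

For example, a population survey may record sex as "Man" or "Woman". Before analysis, such an attribute has to be converted to numeric form.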

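import pandas as pd
# Sample categorical values
num_sex=['man','woman','man','man','woman','man','woman']
# dtype=int keeps the 0/1 output shown below; pandas 2.x returns True/False by default
pd.get_dummies(num_sex,prefix='kind',dtype=int)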

Output:

  kind_man kind_woman
0 1 0
1 0 1
2 1 0
3 1 0
4 0 1
5 1 0
6 0 1

get_dummies is called as follows:

pd.get_dummies(data,prefix="kind", prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
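'''
data : array-like, Series, or DataFrame
    the input data
prefix : prefix for the column names that get_dummies creates
prefix_sep : separator placed between the prefix and the category value
columns : names of the DataFrame columns to encode; if None, every column with object, string, or category dtype is encoded
dummy_na : add a column that flags missing values (NaN); if False, missing values are ignored
sparse : whether the dummy columns are stored as sparse arrays
drop_first : return k-1 dummy columns for k categories by dropping the first category
'''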
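To make the effect of dummy_na and drop_first concrete, here is a minimal sketch. The sample Series, including its missing value, is made up for illustration and is not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value (not from the article).
sex = pd.Series(['man', 'woman', np.nan, 'man'])

# dtype=int keeps 0/1 values on pandas 2.x, where the default dtype is bool.
plain = pd.get_dummies(sex, prefix='kind', dtype=int)
with_na = pd.get_dummies(sex, prefix='kind', dummy_na=True, dtype=int)
dropped = pd.get_dummies(sex, prefix='kind', drop_first=True, dtype=int)

print(list(plain.columns))    # ['kind_man', 'kind_woman']
print(list(with_na.columns))  # ['kind_man', 'kind_woman', 'kind_nan']
print(list(dropped.columns))  # ['kind_woman']
```

With the default dummy_na=False, the row holding the missing value is all zeros. With drop_first=True, "man" becomes the implicit baseline: a row with kind_woman equal to 0 is either "man" or missing.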

The same can be done with data loaded from a file.

Take the Titanic survival prediction dataset as an example.

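import pandas as pd
# Assumes train.csv was already preprocessed so that Cabin holds "Yes"/"No", as in the table below
data = pd.read_csv("train.csv")
# Encode each categorical column with get_dummies; prefix sets the prefix of the new column names
# dtype=int keeps 0/1 values on pandas 2.x, where the default dtype is bool
dummies_Cabin = pd.get_dummies(data['Cabin'], prefix='Cabin', dtype=int)
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked', dtype=int)
dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex', dtype=int)
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass', dtype=int)
# Append the dummy columns to the original data; axis=1 concatenates column-wise
data = pd.concat([data, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
# Drop the original categorical columns, plus Name and Ticket
data.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)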
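As an alternative to encoding each column separately and then concatenating and dropping, get_dummies also accepts a whole DataFrame together with the columns parameter. The sketch below uses a three-row stand-in frame with made-up values, not the real train.csv.

```python
import pandas as pd

# Hypothetical three-row stand-in for train.csv (values made up for illustration).
data = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
    'Pclass': [3, 1, 3],
})

# columns= names the columns to encode. Each listed column is replaced by
# its dummy columns, which are prefixed with the original column name.
encoded = pd.get_dummies(data, columns=['Sex', 'Embarked', 'Pclass'], dtype=int)

print(list(encoded.columns))
# ['PassengerId', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S', 'Pclass_1', 'Pclass_3']
```

Listing Pclass explicitly matters here: it is an integer column, so it would not be encoded if columns were left as None.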

Inspect the data with data.head(20); the first 20 rows are shown below.

Before processing:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.000000 1 0 A/5 21171 7.2500 No S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.000000 1 0 PC 17599 71.2833 Yes C
2 3 1 3 Heikkinen, Miss. Laina female 26.000000 0 0 STON/O2. 3101282 7.9250 No S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.000000 1 0 113803 53.1000 Yes S
4 5 0 3 Allen, Mr. William Henry male 35.000000 0 0 373450 8.0500 No S
5 6 0 3 Moran, Mr. James male 23.828953 0 0 330877 8.4583 No Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.000000 0 0 17463 51.8625 Yes S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.000000 3 1 349909 21.0750 No S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.000000 0 2 347742 11.1333 No S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.000000 1 0 237736 30.0708 No C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.000000 1 1 PP 9549 16.7000 Yes S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.000000 0 0 113783 26.5500 Yes S
12 13 0 3 Saundercock, Mr. William Henry male 20.000000 0 0 A/5. 2151 8.0500 No S
13 14 0 3 Andersson, Mr. Anders Johan male 39.000000 1 5 347082 31.2750 No S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.000000 0 0 350406 7.8542 No S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.000000 0 0 248706 16.0000 No S
16 17 0 3 Rice, Master. Eugene male 2.000000 4 1 382652 29.1250 No Q
17 18 1 2 Williams, Mr. Charles Eugene male 32.066493 0 0 244373 13.0000 No S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.000000 1 0 345763 18.0000 No S
19 20 1 3 Masselmani, Mrs. Fatima female 29.518205 0 0 2649 7.2250 No C

After processing:

PassengerId Survived Age SibSp Parch Fare Cabin_No Cabin_Yes Embarked_C Embarked_Q Embarked_S Sex_female Sex_male Pclass_1 Pclass_2 Pclass_3
0 1 0 22.000000 1 0 7.2500 1 0 0 0 1 0 1 0 0 1
1 2 1 38.000000 1 0 71.2833 0 1 1 0 0 1 0 1 0 0
2 3 1 26.000000 0 0 7.9250 1 0 0 0 1 1 0 0 0 1
3 4 1 35.000000 1 0 53.1000 0 1 0 0 1 1 0 1 0 0
4 5 0 35.000000 0 0 8.0500 1 0 0 0 1 0 1 0 0 1
5 6 0 23.828953 0 0 8.4583 1 0 0 1 0 0 1 0 0 1
6 7 0 54.000000 0 0 51.8625 0 1 0 0 1 0 1 1 0 0
7 8 0 2.000000 3 1 21.0750 1 0 0 0 1 0 1 0 0 1
8 9 1 27.000000 0 2 11.1333 1 0 0 0 1 1 0 0 0 1
9 10 1 14.000000 1 0 30.0708 1 0 1 0 0 1 0 0 1 0
10 11 1 4.000000 1 1 16.7000 0 1 0 0 1 1 0 0 0 1
11 12 1 58.000000 0 0 26.5500 0 1 0 0 1 1 0 1 0 0
12 13 0 20.000000 0 0 8.0500 1 0 0 0 1 0 1 0 0 1
13 14 0 39.000000 1 5 31.2750 1 0 0 0 1 0 1 0 0 1
14 15 0 14.000000 0 0 7.8542 1 0 0 0 1 1 0 0 0 1
15 16 1 55.000000 0 0 16.0000 1 0 0 0 1 1 0 0 1 0
16 17 0 2.000000 4 1 29.1250 1 0 0 1 0 0 1 0 0 1
17 18 1 32.066493 0 0 13.0000 1 0 0 0 1 0 1 0 1 0
18 19 0 31.000000 1 0 18.0000 1 0 0 0 1 1 0 0 0 1
19 20 1 29.518205 0 0 7.2250 1 0 1 0 0 1 0 0

