DA-002 [Issue 4] Python Data Analysis: Using get_dummies to Factorize Categorical Features
2022-05-25 18:01:33
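When we build a logistic regression model, every input feature must be numeric, so categorical features are first factorized (one-hot encoded).
For example, census data may record a sex attribute with the two values "man" and "woman"; before analysis, this attribute has to be converted into numeric form.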
import pandas as pd
num_sex=['man','woman','man','man','woman','man','woman']
pd.get_dummies(num_sex,prefix='kind')
The output is as follows (pandas 2.0 and later display True/False here unless dtype=int is passed):
| | kind_man | kind_woman |
---|---|---|
0 | 1 | 0 |
1 | 0 | 1 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 0 | 1 |
5 | 1 | 0 |
6 | 0 | 1 |
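To make the table concrete, the short sketch below checks two properties of the encoding: every row has exactly one indicator set, and the original labels can be recovered from the indicator columns. The variable names `dummies`, `row_sums`, and `recovered` are illustrative only, and `dtype=int` is passed so that the values are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

num_sex = ['man', 'woman', 'man', 'man', 'woman', 'man', 'woman']

# dtype=int keeps 0/1 integers instead of booleans on newer pandas versions
dummies = pd.get_dummies(num_sex, prefix='kind', dtype=int)

# Each row is "one-hot": exactly one indicator column is set
row_sums = dummies.sum(axis=1)

# Recover the original label: take the column holding the 1, then strip the prefix
recovered = dummies.idxmax(axis=1).str.replace('kind_', '', regex=False)

print(row_sums.tolist())
print(recovered.tolist())
```

The recovered list matches `num_sex`, which shows that the encoding loses no information about the original category.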
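The usage of get_dummies is:
pd.get_dummies(data, prefix="kind", prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
'''
data : array-like, Series, or DataFrame; the input data
prefix : prefix for the column names produced by get_dummies
columns : names of the columns to encode
dummy_na : add a column that flags missing values; if False, missing values are ignored
drop_first : keep k-1 of the k category levels by dropping the first one
'''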
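The effect of `dummy_na` and `drop_first` is easier to see on a small example. The sketch below uses a made-up Series (`port`, with one missing value) that is not part of the original article; only the column counts and the treatment of the missing row matter here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (index 2)
port = pd.Series(['S', 'C', np.nan, 'Q', 'S'], name='Embarked')

# Default: three columns (C, Q, S); the missing row is all zeros
default = pd.get_dummies(port, prefix='Embarked', dtype=int)

# dummy_na=True appends one extra indicator column for missing values
with_na = pd.get_dummies(port, prefix='Embarked', dummy_na=True, dtype=int)

# drop_first=True drops the first level ('C'), leaving k-1 = 2 columns
k_minus_1 = pd.get_dummies(port, prefix='Embarked', drop_first=True, dtype=int)

print(default.shape, with_na.shape, k_minus_1.shape)
print(list(k_minus_1.columns))
```

For a linear model with an intercept, such as logistic regression, `drop_first=True` is often worth considering, because the full set of k indicator columns always sums to 1 and is therefore perfectly collinear with the intercept.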
The same approach works for data loaded from a file.
Take the Titanic survivor prediction dataset as an example:
# Load the training set; read_csv accepts the path directly
data = pd.read_csv("train.csv")
# One-hot encode each categorical column with get_dummies; prefix sets the prefix of the new column names
# (the tables below suggest Cabin was reduced to Yes/No beforehand; that step is not shown here)
dummies_Cabin = pd.get_dummies(data['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
# Concatenate the dummy columns onto the original data; axis=1 means column-wise
data = pd.concat([data, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
# Drop the original categorical columns, together with Name and Ticket
data.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
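Use data.head() to inspect the data (the tables here show the first 20 rows, i.e. data.head(20)).
Before processing (in this table Cabin already holds Yes/No and the missing Age values are filled, which points to an earlier preprocessing step not shown in the code):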
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | No | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | No | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.000000 | 1 | 0 | 113803 | 53.1000 | Yes | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.000000 | 0 | 0 | 373450 | 8.0500 | No | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | 23.828953 | 0 | 0 | 330877 | 8.4583 | No | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.000000 | 0 | 0 | 17463 | 51.8625 | Yes | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.000000 | 3 | 1 | 349909 | 21.0750 | No | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.000000 | 0 | 2 | 347742 | 11.1333 | No | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.000000 | 1 | 0 | 237736 | 30.0708 | No | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.000000 | 1 | 1 | PP 9549 | 16.7000 | Yes | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.000000 | 0 | 0 | 113783 | 26.5500 | Yes | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.000000 | 0 | 0 | A/5. 2151 | 8.0500 | No | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.000000 | 1 | 5 | 347082 | 31.2750 | No | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.000000 | 0 | 0 | 350406 | 7.8542 | No | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.000000 | 0 | 0 | 248706 | 16.0000 | No | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.000000 | 4 | 1 | 382652 | 29.1250 | No | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | 32.066493 | 0 | 0 | 244373 | 13.0000 | No | S |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.000000 | 1 | 0 | 345763 | 18.0000 | No | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | 29.518205 | 0 | 0 | 2649 | 7.2250 | No | C |
After processing:
| | PassengerId | Survived | Age | SibSp | Parch | Fare | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | Pclass_1 | Pclass_2 | Pclass_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 22.000000 | 1 | 0 | 7.2500 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 38.000000 | 1 | 0 | 71.2833 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 26.000000 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 35.000000 | 1 | 0 | 53.1000 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
4 | 5 | 0 | 35.000000 | 0 | 0 | 8.0500 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
5 | 6 | 0 | 23.828953 | 0 | 0 | 8.4583 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
6 | 7 | 0 | 54.000000 | 0 | 0 | 51.8625 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
7 | 8 | 0 | 2.000000 | 3 | 1 | 21.0750 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
8 | 9 | 1 | 27.000000 | 0 | 2 | 11.1333 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
9 | 10 | 1 | 14.000000 | 1 | 0 | 30.0708 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
10 | 11 | 1 | 4.000000 | 1 | 1 | 16.7000 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
11 | 12 | 1 | 58.000000 | 0 | 0 | 26.5500 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
12 | 13 | 0 | 20.000000 | 0 | 0 | 8.0500 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
13 | 14 | 0 | 39.000000 | 1 | 5 | 31.2750 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
14 | 15 | 0 | 14.000000 | 0 | 0 | 7.8542 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
15 | 16 | 1 | 55.000000 | 0 | 0 | 16.0000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
16 | 17 | 0 | 2.000000 | 4 | 1 | 29.1250 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
17 | 18 | 1 | 32.066493 | 0 | 0 | 13.0000 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
18 | 19 | 0 | 31.000000 | 1 | 0 | 18.0000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
19 | 20 | 1 | 29.518205 | 0 | 0 | 7.2250 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
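As a possible shortcut, the encode, concat, and drop steps can be folded into a single call by passing the whole DataFrame together with the `columns` parameter. The sketch below is not the article's code: it uses a hypothetical three-row stand-in for the Titanic frame (values taken from the first rows of the table above) so that it runs without train.csv.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data
data = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Cabin': ['No', 'Yes', 'No'],
    'Embarked': ['S', 'C', 'S'],
})

# columns= names the columns to encode; each is replaced by its dummy columns,
# and the original column name is used as the prefix by default
encoded = pd.get_dummies(
    data, columns=['Cabin', 'Embarked', 'Sex', 'Pclass'], dtype=int
)

print(encoded.columns.tolist())
```

Columns that are not listed (here `Survived`) are kept unchanged, and the listed source columns are removed automatically, so no separate drop is needed for them. Columns such as Name and Ticket would still have to be dropped explicitly.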