Implementing the functionality of chapters 5, 12, and 13 of R for Data Science (https://r4ds.had.co.nz/) in a Python environment
1. Data transformation
Reference: the dfply documentation, https://github.com/kieferk/dfply/blob/master/README.md
!pip install dfply
from dfply import *
import numpy as np
import pandas as pd
Load the built-in diamonds dataset. It has 53,940 rows and 10 columns — carat, cut, color, clarity, depth, table, price, x, y, z — giving the measurements of each diamond.
diamonds.head()
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
1.1 Filter rows
diamonds >> filter_by(X.cut == 'Ideal') >> head()
|    | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0  | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 11 | 0.23 | Ideal | J | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 |
| 13 | 0.31 | Ideal | J | SI2 | 62.2 | 54.0 | 344 | 4.35 | 4.37 | 2.71 |
| 16 | 0.30 | Ideal | I | SI2 | 62.0 | 54.0 | 348 | 4.31 | 4.34 | 2.68 |
| 39 | 0.33 | Ideal | I | SI2 | 61.8 | 55.0 | 403 | 4.49 | 4.51 | 2.78 |
A pipe passes the result of one step directly as the argument of the next function, so intermediate assignments can be skipped; this keeps fewer temporary objects in memory.
diamonds >> filter_by(X.cut == 'Ideal', X.price < 337)
|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
|       | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 26683 | 0.33 | Ideal | E | SI2 | 62.2 | 54.0 | 427 | 4.44 | 4.46 | 2.77 |
| 32297 | 0.34 | Ideal | E | SI2 | 62.4 | 54.0 | 454 | 4.49 | 4.52 | 2.81 |
| 40928 | 0.30 | Ideal | E | SI1 | 61.6 | 54.0 | 499 | 4.32 | 4.35 | 2.67 |
| 50623 | 0.30 | Ideal | E | SI2 | 62.1 | 54.0 | 401 | 4.32 | 4.35 | 2.69 |
| 50625 | 0.30 | Ideal | E | SI2 | 62.0 | 54.0 | 401 | 4.33 | 4.35 | 2.69 |
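filter_by and its alias mask correspond to boolean indexing in plain pandas. A minimal sketch on a toy frame (the data here is made up for illustration, not taken from diamonds):

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal"],
    "price": [326, 326, 500],
})

# Multiple conditions passed to filter_by/mask are combined with logical AND;
# in plain pandas that is the element-wise & operator (parentheses required).
subset = df[(df["cut"] == "Ideal") & (df["price"] < 400)]
print(subset)
```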
1.2 Arrange rows
diamonds >> arrange(X.table, ascending=False) >> head(5)
|       | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 24932 | 2.01 | Fair | F | SI1 | 58.6 | 95.0 | 13387 | 8.32 | 8.31 | 4.87 |
| 50773 | 0.81 | Fair | F | SI2 | 68.8 | 79.0 | 2301 | 5.26 | 5.20 | 3.58 |
| 51342 | 0.79 | Fair | G | SI1 | 65.3 | 76.0 | 2362 | 5.52 | 5.13 | 3.35 |
| 52860 | 0.50 | Fair | E | VS2 | 79.0 | 73.0 | 2579 | 5.21 | 5.18 | 4.09 |
| 49375 | 0.70 | Fair | H | VS1 | 62.0 | 73.0 | 2100 | 5.65 | 5.54 | 3.47 |
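arrange maps directly onto pandas' sort_values, including the ascending keyword; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"table": [55.0, 95.0, 58.0], "price": [326, 13387, 334]})

# arrange(X.table, ascending=False) in dfply is sort_values in plain pandas.
out = df.sort_values("table", ascending=False)
print(out)
```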
1.3 Select columns
diamonds >> select(X.cut, 'price', X.x) >> head(3)
|   | cut | price | x |
|---|---|---|---|
| 0 | Ideal | 326 | 3.95 |
| 1 | Premium | 326 | 3.89 |
| 2 | Good | 327 | 4.05 |
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)
|   | cut | price | x | y |
|---|---|---|---|---|
| 0 | Ideal | 326 | 3.95 | 3.98 |
| 1 | Premium | 326 | 3.89 | 3.84 |
starts_with('c'): select columns whose names start with the prefix 'c';
ends_with('c'): select columns whose names end with the suffix 'c';
contains('c'): select columns whose names contain the character 'c';
everything(): all columns.
diamonds >> select(starts_with('c')) >> head(3)
|   | carat | cut | color | clarity |
|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 |
| 1 | 0.21 | Premium | E | SI1 |
| 2 | 0.23 | Good | E | VS1 |
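The selection helpers map onto DataFrame.filter in plain pandas, using regular expressions or substring matching; a sketch (column names chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.23], "cut": ["Ideal"], "price": [326]})

# starts_with('c') -> regex anchored at the start of the column name
starts_c = df.filter(regex="^c")
# ends_with('t') -> regex anchored at the end
ends_t = df.filter(regex="t$")
# contains('ri') -> plain substring match via the `like` argument
has_ri = df.filter(like="ri")

print(list(starts_c.columns), list(ends_t.columns), list(has_ri.columns))
```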
1.4 Add new variables
New variables are created with mutate(); several variables can be created in a single call.
diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
|   | x | y | z | x_plus_y |
|---|---|---|---|---|
| 0 | 3.95 | 3.98 | 2.43 | 7.93 |
| 1 | 3.89 | 3.84 | 2.31 | 7.73 |
| 2 | 4.05 | 4.07 | 2.31 | 8.12 |
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
|   | x_plus_y | y_div_z |
|---|---|---|
| 0 | 7.93 | 1.637860 |
| 1 | 7.73 | 1.662338 |
| 2 | 8.12 | 1.761905 |
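In plain pandas, mutate corresponds to assign (existing columns are kept and new ones appended), while transmute keeps only the newly created columns; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [3.95, 3.89], "y": [3.98, 3.84]})

# mutate: keep existing columns, append the new one
with_sum = df.assign(x_plus_y=df["x"] + df["y"])

# transmute: keep only the newly created columns
only_sum = with_sum[["x_plus_y"]]

print(with_sum)
```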
1.5 Grouped summaries
diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
|   | price_mean | price_std |
|---|---|---|
| 0 | 3932.799722 | 3989.439738 |
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
|   | cut | price_mean | price_std |
|---|---|---|---|
| 0 | Fair | 4358.757764 | 3560.386612 |
| 1 | Good | 3928.864452 | 3681.589584 |
| 2 | Ideal | 3457.541970 | 3808.401172 |
| 3 | Premium | 4584.257704 | 4349.204961 |
| 4 | Very Good | 3981.759891 | 3935.862161 |
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
|   | price_mean | price_var | depth_mean | depth_var |
|---|---|---|---|---|
| 0 | 3932.799722 | 1.591533e+07 | 61.749405 | 2.052366 |
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)
|   | cut | price_mean | price_var | depth_mean | depth_var |
|---|---|---|---|---|---|
| 0 | Fair | 4358.757764 | 1.266848e+07 | 64.041677 | 13.266319 |
| 1 | Good | 3928.864452 | 1.355134e+07 | 62.365879 | 4.705224 |
| 2 | Ideal | 3457.541970 | 1.450325e+07 | 61.709401 | 0.516274 |
| 3 | Premium | 4584.257704 | 1.891421e+07 | 61.264673 | 1.342755 |
| 4 | Very Good | 3981.759891 | 1.548973e+07 | 61.818275 | 1.900466 |
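group_by followed by summarize corresponds to pandas' groupby plus named aggregation; a sketch on a toy frame (values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good"],
    "price": [100, 300, 250],
})

# Named aggregation reproduces the column names of the dfply summarize call.
out = (df.groupby("cut", as_index=False)
         .agg(price_mean=("price", "mean"), price_std=("price", "std")))
print(out)
```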
2. Tidy data
2.1 Pivoting
import pandas as pd
import numpy as np
mydata = pd.DataFrame({
    "Name": ["苹果", "谷歌", "脸书", "亚马逊", "腾讯"],
    "Conpany": ["Apple", "Google", "Facebook", "Amozon", "Tencent"],
    "Sale2013": [5000, 3500, 2300, 2100, 3100],
    "Sale2014": [5050, 3800, 2900, 2500, 3300],
    "Sale2015": [5050, 3800, 2900, 2500, 3300],
    "Sale2016": [5050, 3800, 2900, 2500, 3300]
})
mydata
|   | Name | Conpany | Sale2013 | Sale2014 | Sale2015 | Sale2016 |
|---|---|---|---|---|---|---|
| 0 | 苹果 | Apple | 5000 | 5050 | 5050 | 5050 |
| 1 | 谷歌 | Google | 3500 | 3800 | 3800 | 3800 |
| 2 | 脸书 | Facebook | 2300 | 2900 | 2900 | 2900 |
| 3 | 亚马逊 | Amozon | 2100 | 2500 | 2500 | 2500 |
| 4 | 腾讯 | Tencent | 3100 | 3300 | 3300 | 3300 |
pandas' melt function (wide-to-long) plays the role of R's pivot_longer: it lengthens a data frame by folding several columns into name/value pairs.
mydata1 = mydata.melt(id_vars=["Name", "Conpany"],
                      var_name="Year",
                      value_name="Sale")
mydata1
|   | Name | Conpany | Year | Sale |
|---|---|---|---|---|
| 0 | 苹果 | Apple | Sale2013 | 5000 |
| 1 | 谷歌 | Google | Sale2013 | 3500 |
| 2 | 脸书 | Facebook | Sale2013 | 2300 |
| 3 | 亚马逊 | Amozon | Sale2013 | 2100 |
| 4 | 腾讯 | Tencent | Sale2013 | 3100 |
| 5 | 苹果 | Apple | Sale2014 | 5050 |
R's pivot_wider can be reproduced with pandas' pivot_table function (long-to-wide), which spreads key-value pairs across multiple columns.
mydata1.pivot_table(index=["Name","Conpany"],
columns=["Year"],
values=["Sale"])
| Name | Conpany | Sale2013 | Sale2014 | Sale2015 | Sale2016 |
|---|---|---|---|---|---|
| 亚马逊 | Amozon | 2100 | 2500 | 2500 | 2500 |
| 脸书 | Facebook | 2300 | 2900 | 2900 | 2900 |
| 腾讯 | Tencent | 3100 | 3300 | 3300 | 3300 |
| 苹果 | Apple | 5000 | 5050 | 5050 | 5050 |
| 谷歌 | Google | 3500 | 3800 | 3800 | 3800 |
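pivot_table returns a frame indexed by the grouping keys with the spread variable on the columns; resetting the index and clearing the column name gives back an ordinary flat frame. A self-contained sketch with hypothetical toy data:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Year": ["2013", "2014", "2013", "2014"],
    "Sale": [1, 2, 3, 4],
})

# Long -> wide: spread Year across the columns.
wide = long_df.pivot_table(index="Name", columns="Year", values="Sale")

# Back to an ordinary flat frame.
flat = wide.reset_index()
flat.columns.name = None
print(flat)
```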
Data tidying can also be done with the plydata library, which is based on the R packages dplyr, tidyr, and forcats; many of its function names are borrowed directly from them:
https://plydata.readthedocs.io/en/stable/api.html
2.2 Separating and uniting
df = pd.DataFrame({'alpha': 1,'x': ['a,1', 'b,2', 'c,3'],'zeta': 6})
print(df)
df >> separate('x', into=['A', 'B'], remove=False)
|   | alpha | x | zeta | A | B |
|---|---|---|---|---|---|
| 0 | 1 | a,1 | 6 | a | 1 |
| 1 | 1 | b,2 | 6 | b | 2 |
| 2 | 1 | c,3 | 6 | c | 3 |
df = pd.DataFrame({'c1': [1, 2, 3, 4, None],'c2': list('abcde'),'c3': list('vwxyz')})
df >> unite('c1c2', 'c1', 'c2', na_rm=True)
|   | c3 | c1c2 |
|---|---|---|
| 0 | v | 1.0_a |
| 1 | w | 2.0_b |
| 2 | x | 3.0_c |
| 3 | y | 4.0_d |
| 4 | z | NaN |
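Without dfply, separate and unite can be expressed with pandas string methods; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a,1", "b,2", "c,3"]})

# separate -> str.split with expand=True splits one column into several
df[["A", "B"]] = df["x"].str.split(",", expand=True)

# unite -> str.cat joins columns back together with a separator
df["AB"] = df["A"].str.cat(df["B"], sep="_")
print(df)
```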
2.3 Missing values
df
|   | c1 | c2 | c3 |
|---|---|---|---|
| 0 | 1.0 | a | v |
| 1 | 2.0 | b | w |
| 2 | 3.0 | c | x |
| 3 | 4.0 | d | y |
| 4 | NaN | e | z |
print(df.isnull())
print(df.isnull().sum())
Missing values can be dropped with df.dropna(); passing how="all" drops a row only when every value in it is missing.
df.fillna('*')
|   | c1 | c2 | c3 |
|---|---|---|---|
| 0 | 1.0 | a | v |
| 1 | 2.0 | b | w |
| 2 | 3.0 | c | x |
| 3 | 4.0 | d | y |
| 4 | * | e | z |
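The dropna behavior described above can be sketched as follows (toy data, invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"c1": [1.0, np.nan, np.nan], "c2": ["a", "b", np.nan]})

# Default: drop every row that contains at least one missing value.
any_dropped = df.dropna()

# how="all": drop a row only when every value in it is missing.
all_dropped = df.dropna(how="all")
print(len(any_dropped), len(all_dropped))
```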
3. Relational data
3.1 pd.merge() and pd.concat()
Official pd.merge() documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
print(df1)
print(df2)
df1.merge(df2, how='inner', on='a')
df1.merge(df2, how='cross')
|   | a_x | b | a_y | c |
|---|---|---|---|---|
| 0 | foo | 1 | foo | 3 |
| 1 | foo | 1 | baz | 4 |
| 2 | bar | 2 | foo | 3 |
| 3 | bar | 2 | baz | 4 |
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
|   | letter | number |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 0 | c | 3 |
| 1 | d | 4 |
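For contrast with join="inner" above, the default join="outer" keeps the union of columns and fills the gaps with NaN; a self-contained sketch rebuilding the same frames:

```python
import pandas as pd

df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df3 = pd.DataFrame([["c", 3, "cat"], ["d", 4, "dog"]],
                   columns=["letter", "number", "animal"])

# join="outer" (the default) keeps `animal`, with NaN for df1's rows.
outer = pd.concat([df1, df3], join="outer")
print(outer)
```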
3.2 The join family of functions in dfply
a = pd.DataFrame({'x1':['A','B','C'],'x2':[1,2,3]})
b = pd.DataFrame({'x1':['A','B','D'],'x3':[True,False,True]})
a >> inner_join(b, by='x1')
a >> outer_join(b, by='x1')
a >> left_join(b, by='x1')
a >> right_join(b, by='x1')
a >> semi_join(b, by='x1')
a >> anti_join(b, by='x1')
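Of these, inner/outer/left/right map directly onto pd.merge(how=...), but semi_join and anti_join have no single pandas counterpart; they can be written with isin. A sketch, rebuilding the a and b frames from above:

```python
import pandas as pd

a = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
b = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

# semi_join: rows of `a` that have a match in `b`, keeping only a's columns
semi = a[a["x1"].isin(b["x1"])]

# anti_join: rows of `a` that have no match in `b`
anti = a[~a["x1"].isin(b["x1"])]

print(semi)
print(anti)
```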