《利用Python进行数据分析》第5章 pandas的数据汇总与处理缺失数据
汇总和计算描述统计
pandas对象拥有一组常用的数学和统计方法。它们大部分都属于约简和汇总统计,用于从Series中提取单个值(如sum或mean)或从DataFrame的行或列中提取一个Series。
In [1]: from pandas import Series,DataFrame
In [2]: import pandas as pd
In [3]: import numpy as np
用DataFrame的sum方法将会返回一个含有列小计的Series:
In [5]: df=DataFrame([[2.6,np.nan],[6.2,-3.5],
...: [np.nan,np.nan],[0.75,-1.3]],
...: index=['a','b','c','d'],
...: columns=['one','two'])
In [6]: df
Out[6]:
one two
a 2.60 NaN
b 6.20 -3.5
c NaN NaN
d 0.75 -1.3
In [7]: df.sum()
Out[7]:
one 9.55
two -4.80
dtype: float64
传入axis=1将会按行进行求和运算
In [8]: df.sum(axis=1)
Out[8]:
a 2.60
b 2.70
c 0.00
d -0.55
dtype: float64
NA值会自动被排除,除非整个切片(这里指的是行或列)都是NA。通过skipna选项可以禁用该功能
In [9]: df.mean(axis=1)
Out[9]:
a 2.600
b 1.350
c NaN
d -0.275
dtype: float64
In [10]: df.mean(axis=1,skipna=False)
Out[10]:
a NaN
b 1.350
c NaN
d -0.275
dtype: float64
idxmin和idxmax返回的是间接统计(比如达到最小值或最大值的索引
In [11]: df.idxmax()
Out[11]:
one b
two d
dtype: object
In [12]: df.idxmin()
Out[12]:
one d
two b
dtype: object
累计型的
In [13]: df.cumsum()
Out[13]:
one two
a 2.60 NaN
b 8.80 -3.5
c NaN NaN
d 9.55 -4.8
describe用于一次性产生多个汇总统计
In [14]: df.describe()
Out[14]:
one two
count 3.000000 2.000000
mean 3.183333 -2.400000
std 2.771432 1.555635
min 0.750000 -3.500000
25% 1.675000 -2.950000
50% 2.600000 -2.400000
75% 4.400000 -1.850000
max 6.200000 -1.300000
对于非数值型数据,describe会产生另外一种汇总统计
In [15]: obj=Series(['a','a','b','c']*4)
In [16]: obj
Out[16]:
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
In [17]: obj.describe()
Out[17]:
count 16
unique 3
top a
freq 8
dtype: object
唯一值、值计数以及成员资格
第一个函数是unique,它可以得到Series中的唯一值数组
In [4]: obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [5]: uniques=obj.unique()
In [6]: uniques
Out[6]: array(['c', 'a', 'd', 'b'], dtype=object)
返回的唯一值是未排序的,如果需要的话,可以对结果再次进行排序(uniques.sort())。value_counts用于计算一个Series中各值出现的频率
In [7]: obj.value_counts()
Out[7]:
c 3
a 3
b 2
d 1
dtype: int64
结果Series是按值频率降序排列的。value_counts还是一个*pandas方法,可用于任何数组或序列
In [12]: pd.value_counts(obj.values,sort=False)
Out[12]:
d 1
a 3
b 2
c 3
dtype: int64
isin,它用于判断矢量化集合的成员资格,可用于选取Series中或DataFrame列中数据的子集
In [13]: mask=obj.isin(['b','c'])
In [14]: mask
Out[14]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
In [15]: obj[mask]
Out[15]:
0 c
5 b
6 b
7 c
8 c
dtype: object
有时可能希望得到DataFrame中多个相关列的一张柱状图
In [16]: data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
...: 'Qu2': [2, 3, 1, 2, 3],
...: 'Qu3': [1, 5, 2, 4, 4]})
In [17]: data
Out[17]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
In [18]: result = data.apply(pd.value_counts).fillna(0)
In [19]: result
Out[19]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
处理缺失数据
pandas使用浮点值NaN(Not a Number)表示浮点和非浮点数组中的缺失数据
In [20]: string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [21]: string_data
Out[21]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [22]: string_data.isnull()
Out[22]:
0 False
1 False
2 True
3 False
dtype: bool
Python内置的None值也会被当做NA处理
In [23]: string_data[0] = None
In [24]: string_data
Out[24]:
0 None
1 artichoke
2 NaN
3 avocado
dtype: object
In [25]: string_data.isnull()
Out[25]:
0 True
1 False
2 True
3 False
dtype: bool
滤除缺失数据
纯手工操作永远都是一个办法,但dropna可能会更实用一些。对于一个Series,dropna返回一个仅含非空数据和索引值的Series
In [26]: from numpy import nan as NA
In [27]: data = Series([3, NA, 3.5, NA, 9,5,NA])
In [28]: data
Out[28]:
0 3.0
1 NaN
2 3.5
3 NaN
4 9.0
5 5.0
6 NaN
dtype: float64
In [29]: data.dropna()
Out[29]:
0 3.0
2 3.5
4 9.0
5 5.0
dtype: float64
也可以通过布尔型索引达到这个目的
In [30]: data[data.notnull()]
Out[30]:
0 3.0
2 3.5
4 9.0
5 5.0
dtype: float64
对于DataFrame对象,可能希望丢弃全NA或含有NA的行或列。dropna默认丢弃任何含有缺失值的行:
In [31]: data = DataFrame([[3, 6.5, 9], [6., NA, NA],
...: [NA, NA, NA], [NA, 6.5, 8]])
In [32]: data
Out[32]:
0 1 2
0 3.0 6.5 9.0
1 6.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 8.0
In [33]: cleaned=data.dropna()
In [34]: cleaned
Out[34]:
0 1 2
0 3.0 6.5 9.0
传入how=’all’将只丢弃全为NA的那些行
In [35]: data.dropna(how='all')
Out[35]:
0 1 2
0 3.0 6.5 9.0
1 6.0 NaN NaN
3 NaN 6.5 8.0
In [36]: data[4]=NA
In [37]: data
Out[37]:
0 1 2 4
0 3.0 6.5 9.0 NaN
1 6.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 8.0 NaN
要用这种方式丢弃列,只需传入axis=1
In [38]: data.dropna(axis=1,how='all')
Out[38]:
0 1 2
0 3.0 6.5 9.0
1 6.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 8.0
另一个滤除DataFrame行的问题涉及时间序列数据。假设你只想留下一部分观测数据,可以用thresh参数实现此目的
In [39]: df = DataFrame(np.random.randn(7, 3))
In [40]: df
Out[40]:
0 1 2
0 -0.668307 0.695528 -1.519538
1 -0.623535 -0.157634 -0.197022
2 0.064474 1.691703 -0.505063
3 -1.146467 0.851013 0.203486
4 1.910628 -0.309901 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
In [41]: df.ix[:4,1]=NA
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
"""Entry point for launching an IPython kernel.
In [42]: df.loc[:4,1]=NA
In [43]: df
Out[43]:
0 1 2
0 -0.668307 NaN -1.519538
1 -0.623535 NaN -0.197022
2 0.064474 NaN -0.505063
3 -1.146467 NaN 0.203486
4 1.910628 NaN 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
In [44]: df.ix[:2,2]=NA
In [45]: df
Out[45]:
0 1 2
0 -0.668307 NaN NaN
1 -0.623535 NaN NaN
2 0.064474 NaN NaN
3 -1.146467 NaN 0.203486
4 1.910628 NaN 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
填充缺失数据
你可能不想滤除缺失数据(有可能会丢弃跟它有关的其他数据),而是希望通过其他方式填补那些“空洞”。对于大多数情况而言,fillna方法是最主要的函数。通过一个常数调用fillna就会将缺失值替换为那个常数值。
In [46]: df.fillna(0)
Out[46]:
0 1 2
0 -0.668307 0.000000 0.000000
1 -0.623535 0.000000 0.000000
2 0.064474 0.000000 0.000000
3 -1.146467 0.000000 0.203486
4 1.910628 0.000000 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
若是通过一个字典调用fillna,就可以实现对不同的列填充不同的值
In [47]: df.fillna({1:0.6,2:-3})
Out[47]:
0 1 2
0 -0.668307 0.600000 -3.000000
1 -0.623535 0.600000 -3.000000
2 0.064474 0.600000 -3.000000
3 -1.146467 0.600000 0.203486
4 1.910628 0.600000 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
In [48]: df
Out[48]:
0 1 2
0 -0.668307 NaN NaN
1 -0.623535 NaN NaN
2 0.064474 NaN NaN
3 -1.146467 NaN 0.203486
4 1.910628 NaN 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
fillna默认会返回新对象,但也可以对现有对象进行就地修改,总是返回被填充对象的引用
In [49]: _=df.fillna(0,inplace=True)
In [50]: df
Out[50]:
0 1 2
0 -0.668307 0.000000 0.000000
1 -0.623535 0.000000 0.000000
2 0.064474 0.000000 0.000000
3 -1.146467 0.000000 0.203486
4 1.910628 0.000000 0.232673
5 1.113025 0.883580 0.434787
6 0.382675 1.488205 0.195079
对reindex有效的那些插值方法也可用于fillna
In [56]: df = DataFrame(np.random.randn(6, 4))
In [57]: df.loc[2:,1]=NA
In [58]: df
Out[58]:
0 1 2 3
0 -0.104469 -0.311641 -0.190943 1.291739
1 0.207564 1.413073 0.025226 1.215066
2 -0.190017 NaN -1.240705 -0.177461
3 0.270217 NaN -0.356850 -0.796403
4 -1.628883 NaN -0.208336 -0.075219
5 -1.576697 NaN 0.781700 -0.028192
In [59]: df.loc[4:,2]=NA
In [60]: df
Out[60]:
0 1 2 3
0 -0.104469 -0.311641 -0.190943 1.291739
1 0.207564 1.413073 0.025226 1.215066
2 -0.190017 NaN -1.240705 -0.177461
3 0.270217 NaN -0.356850 -0.796403
4 -1.628883 NaN NaN -0.075219
5 -1.576697 NaN NaN -0.028192
In [62]: df.fillna(method='ffill')
Out[62]:
0 1 2 3
0 -0.104469 -0.311641 -0.190943 1.291739
1 0.207564 1.413073 0.025226 1.215066
2 -0.190017 1.413073 -1.240705 -0.177461
3 0.270217 1.413073 -0.356850 -0.796403
4 -1.628883 1.413073 -0.356850 -0.075219
5 -1.576697 1.413073 -0.356850 -0.028192
可以连续填充的最大数量
In [64]: df.fillna(method='ffill',limit=2)
Out[64]:
0 1 2 3
0 -0.104469 -0.311641 -0.190943 1.291739
1 0.207564 1.413073 0.025226 1.215066
2 -0.190017 1.413073 -1.240705 -0.177461
3 0.270217 1.413073 -0.356850 -0.796403
4 -1.628883 NaN -0.356850 -0.075219
5 -1.576697 NaN -0.356850 -0.028192
可以传入Series的平均值或中位数
In [65]: data = Series([1., NA, 3.5, NA, 7])
In [66]: data
Out[66]:
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
In [67]: data.fillna(data.mean())
Out[67]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
层次化索引
层次化索引(hierarchical indexing)是pandas的一项重要功能,它使你能在一个轴上拥有多个(两个以上)索引级别。抽象点说,它使你能以低维度形式处理高维度数据。
In [1]: from pandas import Series,DataFrame
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: data = Series(np.random.randn(10),
...: index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
...: [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [5]: data
Out[5]:
a 1 -1.155262
2 1.141384
3 -1.046202
b 1 -0.621072
2 -0.200051
3 -0.812091
c 1 1.762207
2 0.293709
d 2 -0.646778
3 1.622014
dtype: float64
这就是带有MultiIndex索引的Series的格式化输出形式。索引之间的“间隔”表示“直接使用上面的标签
In [6]: data.index
Out[6]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
对于一个层次化索引的对象,选取数据子集的操作很简单
In [8]: data['b']
Out[8]:
1 -0.621072
2 -0.200051
3 -0.812091
dtype: float64
In [9]: data['b':'c']
Out[9]:
b 1 -0.621072
2 -0.200051
3 -0.812091
c 1 1.762207
2 0.293709
dtype: float64
In [10]: data.loc[[]]
Out[10]: Series([], dtype: float64)
In [10]:
In [11]: data.loc[['b','d']]
Out[11]:
b 1 -0.621072
2 -0.200051
3 -0.812091
d 2 -0.646778
3 1.622014
dtype: float64
有时甚至还可以在“内层”中进行选取
In [12]: data[:2]
Out[12]:
a 1 -1.155262
2 1.141384
dtype: float64
In [13]: data[:,2]
Out[13]:
a 1.141384
b -0.200051
c 0.293709
d -0.646778
dtype: float64
层次化索引在数据重塑和基于分组的操作(如透视表生成)中扮演着重要的角色。比如说,这段数据可以通过其unstack方法被重新安排到一个DataFrame中
In [14]: data.unstack()
Out[14]:
1 2 3
a -1.155262 1.141384 -1.046202
b -0.621072 -0.200051 -0.812091
c 1.762207 0.293709 NaN
d NaN -0.646778 1.622014
unstack的逆运算是stack
In [15]: data.unstack().stack()
Out[15]:
a 1 -1.155262
2 1.141384
3 -1.046202
b 1 -0.621072
2 -0.200051
3 -0.812091
c 1 1.762207
2 0.293709
d 2 -0.646778
3 1.622014
dtype: float64
对于一个DataFrame,每条轴都可以有分层索引
In [16]: frame = DataFrame(np.arange(12).reshape((4, 3)),
...: index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
...: columns=[['Ohio', 'Ohio', 'Colorado'],
...: ['Green', 'Red', 'Green']])
In [17]: frame
Out[17]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [18]: frame.index.names=['key1','key2']
In [19]: frame.columns.names=['state','color']
In [20]: frame
Out[20]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [21]: frame['Ohio']
Out[21]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
重排分级顺序
有时,你需要重新调整某条轴上各级别的顺序,或根据指定级别上的值对数据进行排序。swaplevel接受两个级别编号或名称,并返回一个互换了级别的新对象(但数据不会发生变化)
In [23]: frame.swaplevel('key1', 'key2')
Out[23]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
sortlevel则根据单个级别中的值对数据进行排序(稳定的)。交换级别时,常常也会用到sortlevel,这样最终结果就是有序的
In [24]: frame.sortlevel(1)
Out[24]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [25]: frame.swaplevel(0, 1)
Out[25]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [26]: frame.swaplevel(0, 1).sortlevel(0)
Out[26]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
注意:在层次化索引的对象上,如果索引是按字典方式从外到内排序(即调用sortlevel(0)或sort_index()的结果),数据选取操作的性能要好很多。
根据级别汇总统计
许多对DataFrame和Series的描述和汇总统计都有一个level选项,它用于指定在某条轴上求和的级别。再以上面那个DataFrame为例,我们可以根据行或列上的级别来进行求和。
In [27]: frame
Out[27]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [28]: frame.sum(level='key2')
Out[28]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
In [29]: frame.sum(level='color', axis=1)
Out[29]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
使用DataFrame的列
人们经常想要将DataFrame的一个或多个列当做行索引来用,或者可能希望将行索引变成DataFrame的列。
In [30]: frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
...: 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
...: 'd': [0, 1, 2, 0, 1, 2, 3]})
In [31]: frame
Out[31]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
DataFrame的set_index函数会将其一个或多个列转换为行索引,并创建一个新的DataFrame
In [32]: frame2=frame.set_index(['c','d'])
In [33]: frame2
Out[33]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
默认情况下,那些列会从DataFrame中移除,但也可以将其保留下来
In [34]: frame.set_index(['c','d'],drop=False)
Out[34]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
reset_index的功能跟set_index刚好相反,层次化索引的级别会被转移到列里面
In [35]: frame2.reset_index()
Out[35]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
接下来练习下一章节