pandas Essentials: Basic Functionality
程序员文章站
2022-07-05 14:41:17
1. Reindexing
reindex rearranges the data to match a new index; any label that does not exist in the original gets a missing value (NaN):
In [3]: obj = Series([4.5,7.2,-5.3,3.6], index=["d","b","a","c"])

In [4]: obj
Out[4]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [6]: obj2 = obj.reindex(["a","b","c","d","e"])

In [7]: obj2
Out[7]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
Passing method="ffill" forward-fills the gaps, carrying the last valid value forward:
In [8]: obj3 = Series(["blue","purple","yellow"], index=[0,2,4])

In [9]: obj3.reindex(range(6), method="ffill")
Out[9]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
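As a runnable recap of the two reindex behaviors above, here is a minimal sketch. It assumes pandas is imported as pd; the fill_value argument is not used in the original session and is included only as a common alternative to NaN.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Labels missing from the original index get NaN by default.
with_nan = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value (not shown in the original session) substitutes a constant instead.
with_zero = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# Forward fill carries the last valid value into the new positions.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")

print(with_nan)
print(with_zero)
print(filled)
```

Note that method="ffill" expects the original index to be sorted (monotonic), which is the case here.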
2. Dropping entries from an axis
The drop method returns a new object with the specified values removed from the given axis:
In [12]: obj = Series(np.arange(5.), index=["a","b","c","d","e"])

In [13]: new_obj = obj.drop("c")

In [14]: new_obj
Out[14]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
With a DataFrame, index values can be dropped from either axis (rows or columns).
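A small sketch of dropping from each axis of a DataFrame. The frame built here is an assumption: the article does not show one in this section, so it mirrors the 4x4 frame that appears later.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, shaped like the one used later in the article.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["ohio", "colorado", "utah", "new york"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis=0 is the default).
without_rows = data.drop(["colorado", "ohio"])

# Drop columns by label with axis=1.
without_cols = data.drop(["two", "four"], axis=1)

print(without_rows)
print(without_cols)
print(data.shape)  # the original frame is left unchanged
```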
3. Indexing, selection, and filtering
Series indexing is not limited to integers; labels and boolean conditions work as well:
In [4]: obj = Series(np.arange(4.), index=["a","b","c","d"])

Out[6]:
a    0.0
b    1.0
dtype: float64

In [7]: obj[obj<2]
Out[7]:
a    0.0
b    1.0
dtype: float64
Slicing a Series by label differs from ordinary Python slicing: the endpoint is included:
In [8]: obj["b":"c"]
Out[8]:
b    1.0
c    2.0
dtype: float64
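To make the endpoint difference concrete, the sketch below contrasts a label slice with a positional slice on the same Series. The explicit loc and iloc accessors are my choice for clarity; the original session uses plain square brackets.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label-based slice: both endpoints are included.
by_label = obj.loc["b":"c"]

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```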
Indexing a DataFrame: a column name selects a column, while a slice selects rows:
In [10]: data
Out[10]:
          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15

In [11]: data['two']
Out[11]:
ohio         1
colorado     5
utah         9
new york    13
Name: two, dtype: int32

In [12]: data[:2]
Out[12]:
          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
A comparison produces a boolean DataFrame, which can in turn be used for indexing:
In [13]: data > 5
Out[13]:
            one    two  three   four
ohio      False  False  False  False
colorado  False  False   True   True
utah       True   True   True   True
new york   True   True   True   True
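The session above only prints the boolean frame. As a hedged illustration of what such a mask is typically used for, the sketch below filters rows with a boolean Series and overwrites values through a boolean DataFrame. The construction of data is an assumption reconstructed from the printed output.

```python
import numpy as np
import pandas as pd

# Assumed construction, consistent with the printed frame.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["ohio", "colorado", "utah", "new york"],
    columns=["one", "two", "three", "four"],
)

# A boolean Series selects rows.
rows = data[data["three"] > 5]

# A boolean DataFrame can be used to assign element-wise.
clipped = data.copy()
clipped[clipped < 5] = 0

print(rows)
print(clipped)
```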
ix selects a subset of rows and columns (note that ix was deprecated in pandas 0.20 and removed in 1.0):
In [18]: data.ix['colorado',['two','three']]
Out[18]:
two      5
three    6
Name: colorado, dtype: int32

In [19]: data.ix[['colorado','utah'],[3,0,1]]
Out[19]:
          four  one  two
colorado     7    4    5
utah        11    8    9
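On current pandas releases, the same two selections can be written with loc (labels) and iloc (positions). This is a sketch of one possible translation, not the author's code; the data frame is again reconstructed from the printed output.

```python
import numpy as np
import pandas as pd

# Assumed construction, consistent with the printed frame.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["ohio", "colorado", "utah", "new york"],
    columns=["one", "two", "three", "four"],
)

# Pure label selection maps directly onto loc.
a = data.loc["colorado", ["two", "three"]]

# Mixed label/position selection: pick rows by label, then columns by position.
b = data.loc[["colorado", "utah"]].iloc[:, [3, 0, 1]]

print(a)
print(b)
```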
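4. Arithmetic and data alignment
When objects with different indexes are combined arithmetically, the index of the result is the union of the two indexes; labels that appear in only one operand yield NaN:
In [20]: s1 = Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])

In [21]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a','c','e','f','g'])

In [22]: s1+s2
Out[22]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64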
For a DataFrame, alignment happens on both the rows and the columns:
In [26]: df1
Out[26]:
          b     d     e
utah    0.0   1.0   2.0
ohio    3.0   4.0   5.0
texas   6.0   7.0   8.0
oregon  9.0  10.0  11.0

In [27]: df2
Out[27]:
            b    c    d
ohio      0.0  1.0  2.0
texas     3.0  4.0  5.0
colorado  6.0  7.0  8.0

In [28]: df1+df2
Out[28]:
            b   c     d   e
colorado  NaN NaN   NaN NaN
ohio      3.0 NaN   6.0 NaN
oregon    NaN NaN   NaN NaN
texas     9.0 NaN  12.0 NaN
utah      NaN NaN   NaN NaN
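The add method accepts a fill_value: a value missing from one operand is treated as that fill value, while positions missing from both operands remain NaN:
In [30]: df2.add(df1,fill_value=0)
Out[30]:
            b    c     d     e
colorado  6.0  7.0   8.0   NaN
ohio      3.0  1.0   6.0   5.0
oregon    9.0  NaN  10.0  11.0
texas     9.0  4.0  12.0   8.0
utah      0.0  NaN   1.0   2.0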
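The definitions of df1 and df2 are not shown in the article. The sketch below uses constructions that reproduce the printed frames (an assumption on my part) and contrasts the plain + operator with add and fill_value.

```python
import numpy as np
import pandas as pd

# Assumed constructions that match the printed df1 and df2.
df1 = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["utah", "ohio", "texas", "oregon"],
)
df2 = pd.DataFrame(
    np.arange(9.0).reshape((3, 3)),
    columns=list("bcd"),
    index=["ohio", "texas", "colorado"],
)

# Plain addition: any cell missing from either operand becomes NaN.
plain = df1 + df2

# add with fill_value: a cell missing from only one operand is treated as 0.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Cells that are absent from both frames, such as ("colorado", "e"), stay NaN even with fill_value, because there is nothing to add the fill value to.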
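5. Operations between DataFrame and Series
As a starting point, consider the difference between a two-dimensional NumPy array and one of its rows; the row is broadcast across every row of the array:
In [31]: arr = np.arange(12.).reshape((3,4))

In [32]: arr
Out[32]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [33]: arr - arr[1]
Out[33]:
array([[-4., -4., -4., -4.],
       [ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.]])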
Operations between a DataFrame and a Series work similarly: by default the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows:
In [35]: frame = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['utah','ohio','texas','oregon'])

In [39]: series = frame.iloc[0]

In [40]: frame
Out[40]:
          b     d     e
utah    0.0   1.0   2.0
ohio    3.0   4.0   5.0
texas   6.0   7.0   8.0
oregon  9.0  10.0  11.0

In [41]: series
Out[41]:
b    0.0
d    1.0
e    2.0
Name: utah, dtype: float64

In [43]: frame - series
Out[43]:
          b    d    e
utah    0.0  0.0  0.0
ohio    3.0  3.0  3.0
texas   6.0  6.0  6.0
oregon  9.0  9.0  9.0
If a label is found in only one of the DataFrame's columns and the Series's index, the two objects are reindexed to form the union:
In [45]: frame + series2
Out[45]:
          b   d     e   f
utah    0.0 NaN   3.0 NaN
ohio    3.0 NaN   6.0 NaN
texas   6.0 NaN   9.0 NaN
oregon  9.0 NaN  12.0 NaN
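The article does not show how series2 was defined. The sketch below uses a hypothetical series2 with index b, e, f and values 0, 1, 2, chosen only because it is consistent with the printed result; the author's actual definition may differ.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["utah", "ohio", "texas", "oregon"],
)

# Hypothetical definition: consistent with the output above, not taken from the article.
series2 = pd.Series(range(3), index=["b", "e", "f"])

# Columns b and e match; d (frame only) and f (series only) become NaN.
result = frame + series2

print(result)
```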
To match on the row index instead and broadcast across the columns, use an arithmetic method and pass axis=0:
In [46]: series3 = frame['d']

In [47]: frame.sub(series3, axis=0)
Out[47]:
          b    d    e
utah   -1.0  0.0  1.0
ohio   -1.0  0.0  1.0
texas  -1.0  0.0  1.0
oregon -1.0  0.0  1.0
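To see why axis=0 matters here, the sketch below compares it with the default behavior. With the plain - operator, the row labels of series3 are matched against the column names of frame, nothing lines up, and every cell in the result is NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["utah", "ohio", "texas", "oregon"],
)
series3 = frame["d"]

# axis=0: align series3 with the row index, broadcast across the columns.
by_rows = frame.sub(series3, axis=0)

# Default: align series3's index (row labels) with the column names.
# No label matches, so the result is entirely NaN.
mismatched = frame - series3

print(by_rows)
print(mismatched)
```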
6. Function application and mapping
NumPy ufuncs (element-wise array functions) also work on pandas objects:
In [49]: frame = DataFrame(np.random.randn(4,3), columns=list('bde'),index=['utah','ohio','texas','oregon'])

In [50]: frame
Out[50]:
               b         d         e
utah    0.913051 -1.289725 -0.590573
ohio    1.417612 -1.835357 -0.010755
texas   0.328839 -0.121878 -1.209583
oregon  1.315330 -1.026557 -1.777427

In [51]: np.abs(frame)
Out[51]:
               b         d         e
utah    0.913051  1.289725  0.590573
ohio    1.417612  1.835357  0.010755
texas   0.328839  0.121878  1.209583
oregon  1.315330  1.026557  1.777427

The DataFrame apply method applies a function to the one-dimensional arrays formed by each column (the default) or, with axis=1, each row:
In [52]: f = lambda x:x.max() - x.min()

In [53]: frame.apply(f)
Out[53]:
b    1.088773
d    1.713479
e    1.766671
dtype: float64

In [54]: frame.apply(f, axis=1)
Out[54]:
utah      2.202776
ohio      3.252969
texas     1.538421
oregon    3.092757
dtype: float64
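Because the frame above comes from np.random.randn, rerunning the session gives different numbers. The sketch below hard-codes the printed values so the results are reproducible. It also shows that the function passed to apply may return a Series, in which case the result is a DataFrame; min_max is a hypothetical helper name introduced for this illustration.

```python
import pandas as pd

# Values copied from the printed frame so the example is reproducible.
frame = pd.DataFrame(
    [[0.913051, -1.289725, -0.590573],
     [1.417612, -1.835357, -0.010755],
     [0.328839, -0.121878, -1.209583],
     [1.315330, -1.026557, -1.777427]],
    columns=list("bde"),
    index=["utah", "ohio", "texas", "oregon"],
)

def value_range(x):
    """Spread between the largest and smallest value of a 1-D slice."""
    return x.max() - x.min()

def min_max(x):
    """Hypothetical helper: return several statistics at once as a Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

col_range = frame.apply(value_range)          # one value per column
row_range = frame.apply(value_range, axis=1)  # one value per row
summary = frame.apply(min_max)                # a DataFrame with rows min and max

print(col_range)
print(row_range)
print(summary)
```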
7. Sorting and ranking
The sort_index method returns a new object sorted by index; on a DataFrame, axis=1 sorts by column labels instead of row labels:
In [57]: obj = Series(range(4), index=['d','a','b','c'])

In [58]: obj
Out[58]:
d    0
a    1
b    2
c    3
dtype: int64

In [59]: obj.sort_index()
Out[59]:
a    1
b    2
c    3
d    0
dtype: int64

In [62]: frame.sort_index()
Out[62]:
               b         d         e
ohio    1.417612 -1.835357 -0.010755
oregon  1.315330 -1.026557 -1.777427
texas   0.328839 -0.121878 -1.209583
utah    0.913051 -1.289725 -0.590573

In [63]: frame.sort_index(axis=1)
Out[63]:
               b         d         e
utah    0.913051 -1.289725 -0.590573
ohio    1.417612 -1.835357 -0.010755
texas   0.328839 -0.121878 -1.209583
oregon  1.315330 -1.026557 -1.777427
Sorting in descending order:
In [65]: frame.sort_index(axis=1,ascending=False)
Out[65]:
               e         d         b
utah   -0.590573 -1.289725  0.913051
ohio   -0.010755 -1.835357  1.417612
texas  -1.209583 -0.121878  0.328839
oregon -1.777427 -1.026557  1.315330
Sorting by the values in a column:
In [67]: frame.sort_values(by='b')
Out[67]:
               b         d         e
texas   0.328839 -0.121878 -1.209583
utah    0.913051 -1.289725 -0.590573
oregon  1.315330 -1.026557 -1.777427
ohio    1.417612 -1.835357 -0.010755
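Two related sort_values behaviors that the session above does not cover are sketched here: sorting by several columns, and the placement of missing values. The small frame and Series are made-up illustration data, not taken from the article.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by column a first, then by column b within equal values of a.
by_two = frame.sort_values(by=["a", "b"])

# In a Series, missing values are placed at the end by default.
s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_s = s.sort_values()

print(by_two)
print(sorted_s)
```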
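Ranking (rank) is closely related to sorting: it assigns each value a rank and breaks ties according to a rule. By default, tied values each receive the mean of the ranks they would occupy:
In [70]: obj
Out[70]:
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [71]: obj.rank()
Out[71]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64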
With method='first', tied values are ranked in the order in which they appear in the data:
In [72]: obj.rank(method='first')
Out[72]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
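For a side-by-side view of the tie-breaking rules, the sketch below ranks the same values with several method options. The min and max methods and the ascending=False argument do not appear in the article; they are included as further standard options of rank.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking rule, for easy comparison.
ranks = pd.DataFrame({
    "value": obj,
    "average": obj.rank(),              # default: mean rank of the tied group
    "first": obj.rank(method="first"),  # order of appearance in the data
    "min": obj.rank(method="min"),      # lowest rank in the tied group
    "max": obj.rank(method="max"),      # highest rank in the tied group
})

# Ranking from largest to smallest.
descending = obj.rank(ascending=False, method="max")

print(ranks)
print(descending)
```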
8. Axis indexes with duplicate labels
Use the index's is_unique property to check whether its labels are unique:
In [73]: obj = Series(range(5),index=['a','a','b','b','c'])

In [74]: obj
Out[74]:
a    0
a    1
b    2
b    3
c    4
dtype: int64

In [75]: obj.index.is_unique
Out[75]: False
Selecting a duplicated label returns a Series containing every matching entry rather than a single scalar:
In [76]: obj['a']
Out[76]:
a    0
a    1
dtype: int64
The same logic applies to a DataFrame.
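A short sketch of the DataFrame case, using made-up data: selecting a duplicated row label returns a DataFrame with all matching rows, while a unique label returns a single row as a Series.

```python
import numpy as np
import pandas as pd

# Made-up frame with the row label "a" appearing twice.
df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=["a", "a", "b", "c"],
    columns=["x", "y", "z"],
)

unique_flag = df.index.is_unique  # False, because "a" is repeated

dup_rows = df.loc["a"]    # duplicated label: a DataFrame with two rows
single_row = df.loc["c"]  # unique label: a Series

print(unique_flag)
print(dup_rows)
print(single_row)
```

Code that expects a single row from loc can break silently when labels repeat, so checking is_unique up front is often worthwhile.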