
pandas Series and DataFrame: A Comprehensive Study

程序员文章站 2022-06-05 20:16:24

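Comprehensive study and analysis

Index objects

The index objects in pandas are responsible for holding the axis labels and other metadata (such as the axis name).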

from pandas import Series

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index) # Index(['a', 'b', 'c'], dtype='object')
print(index[1:]) # Index(['b', 'c'], dtype='object')

Index objects are immutable: users cannot modify them.

index[1] = 'd'
# Traceback (most recent call last):
#   File "E:/pandas_study/comone/a.py", line 8, in <module>
#     index[1] = 'd'
#   File "C:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 1724, in __setitem__
#     raise TypeError("Index does not support mutable operations")
# TypeError: Index does not support mutable operations

Immutability matters: it is what allows an Index object to be shared safely among multiple data structures.

from pandas import Series
import pandas as pd
import numpy as np

index = pd.Index(np.arange(3))
obj = Series([1.5, -2.5, 0], index=index)

print(index is obj.index)
print(obj.index is index)
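As a small illustration of why this matters, the sketch below builds two Series from one Index and then tries to assign into that Index. The variable names first, second, and mutated are made up for this example; the only behavior assumed is the TypeError shown in the traceback above.

```python
import numpy as np
import pandas as pd

index = pd.Index(np.arange(3))
first = pd.Series([1.5, -2.5, 0.0], index=index)
second = pd.Series(['x', 'y', 'z'], index=index)

# Assigning into an Index is rejected, so neither Series can have its
# labels changed behind its back through the shared label object.
try:
    index[1] = 10
    mutated = True
except TypeError:
    mutated = False

print(mutated)
print(first.index.equals(second.index))
```

Both Series keep identical labels because the labels themselves can never change in place.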

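Basic functionality

This section covers the basic mechanics of working with the data held in a Series or DataFrame.

1 Reindexing
reindex creates a new object whose data conforms to a new index.
The code below compares three cases: building a Series without an explicit index, building it with an explicit index, and reindexing an existing Series.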

from pandas import Series
import pandas as pd
import numpy as np

data = {"a": -5.3, "c": 3.6, "b": 7.2, 'd': 4.5}
# obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj = Series(data)
print(obj)
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# dtype: float64
print("=================")
obj2 = Series(data, index=['d', 'b', 'a', 'c'])
print(obj2)
# d    4.5
# b    7.2
# a   -5.3
# c    3.6
# dtype: float64
print("=================")
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# e    NaN
# dtype: float64

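If an index value is not already present, a missing value is introduced for it.

To fill those gaps with something other than NaN, pass fill_value.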

obj3 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=10)
print(obj3)
# a    -5.3
# b     7.2
# c     3.6
# d     4.5
# e    10.0
# dtype: float64

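For ordered data, reindexing sometimes needs interpolation or filling of values. The method option does this; ffill, for example, carries the previous value forward.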

obj = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj)
# 0      blue
# 2    purple
# 4    yellow
# dtype: object
print("====")
obj2 = obj.reindex(range(6), method='ffill')
print(obj2)
# 0      blue
# 1      blue
# 2    purple
# 3    purple
# 4    yellow
# 5    yellow
# dtype: object


ffill: fill forward, carrying the previous value
bfill: fill backward, carrying the next value
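To see the two fill directions side by side, here is a minimal sketch that reuses the color Series from above. The names forward and backward are chosen only for this illustration.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill copies the last value seen before each new label.
forward = obj.reindex(range(6), method='ffill')
# bfill copies the next value found after each new label.
backward = obj.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

With bfill, label 5 has nothing after it to copy from, so it stays missing.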

Modifying the index

For a DataFrame, reindex can alter the row index, the columns, or both. If it is passed just a sequence, it reindexes the rows.

from pandas import Series, DataFrame
import numpy as np

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['yang', 'xiao', 'dong']
                  )
print(frame)
#    yang  xiao  dong
# a     0     1     2
# c     3     4     5
# d     6     7     8
print("=========")
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
#    yang  xiao  dong
# a   0.0   1.0   2.0
# b   NaN   NaN   NaN
# c   3.0   4.0   5.0
# d   6.0   7.0   8.0
print("==============")
state = ['yang', 'yan', 'dong']
frame3 = frame.reindex(columns=state)
print(frame3)
#    yang  yan  dong
# a     0  NaN     2
# c     3  NaN     5
# d     6  NaN     8
print("=============")

Both rows and columns can be reindexed, but interpolation is only applied along the rows (axis 0).

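# Reindex both rows and columns. Recent pandas versions may apply method to
# every axis being reindexed and require that axis to be monotonic, so the
# rows are forward-filled first and the columns are reindexed afterwards.
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(columns=state)
# The original shorthand for the same task used the ix indexer:
# frame.ix[['a', 'b', 'c', 'd'], state]

With the label indexing of ix, reindexing could be written more concisely. ix was deprecated in pandas 0.20 and removed in pandas 1.0, and loc raises a KeyError for missing labels, so on current versions reindex is the way to do this.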
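The following sketch shows one way to reindex rows and columns that should also run on current pandas versions. It assumes the same 3x3 frame as above; rows_filled and result are names invented for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['yang', 'xiao', 'dong'])

# Step 1: add row 'b' and forward-fill it from row 'a'.
rows_filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill')
# Step 2: pick the columns; 'yan' does not exist, so it becomes all NaN.
result = rows_filled.reindex(columns=['yang', 'yan', 'dong'])

print(result)
```

Splitting the work into two calls keeps the forward fill confined to the rows, which is the behavior the text describes.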

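The main parameters of the reindex function
index: the new sequence to use as the index
method: the fill method, such as ffill or bfill
fill_value: the value to substitute where reindexing introduces missing data
limit: the maximum number of consecutive entries to fill when filling forward or backward
level: match against one level of a MultiIndex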

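Dropping entries from an axis

Because it has to do some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.
Note that what comes back is a new object; the original is left unchanged.

Dropping from a Series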

from pandas import Series
import pandas as pd
import numpy as np

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# e    4.0
# dtype: float64
print("=============")
new_obj = obj.drop("c")
print(new_obj)
# a    0.0
# b    1.0
# d    3.0
# e    4.0
# dtype: float64
print("===============")
new_obj2 = obj.drop(['a', 'b'])
print(new_obj2)
# c    2.0
# d    3.0
# e    4.0
# dtype: float64

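Dropping from a DataFrame

Understanding axis=0 and axis=1

axis=0: runs across the rows, vertically downward
axis=1: runs across the columns, horizontally

With drop, axis=1 means the labels are column labels, and axis=0 means they are row labels.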
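The direction of each axis is easiest to see with a reduction. The sketch below uses sum, which does not appear in the original text and is chosen purely as an illustration; down_rows and across_cols are made-up names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# axis=0 moves down the rows, leaving one result per column.
down_rows = frame.sum(axis=0)
# axis=1 moves across the columns, leaving one result per row.
across_cols = frame.sum(axis=1)

print(down_rows)
print(across_cols)
```

Note the contrast with drop: there the axis says which set of labels to look in, while in a reduction it says which direction gets collapsed.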

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4,4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame)
#   one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# d   12   13     14    15
print("==============")
frame2 = frame.drop(['a', 'b'])
print(frame2)
#    one  two  three  four
# c    8    9     10    11
# d   12   13     14    15
print("======")
frame3 = frame.drop('two', axis=1)
print(frame3)
#    one  three  four
# a    0      2     3
# b    4      6     7
# c    8     10    11
# d   12     14    15
print("============")
frame4 = frame.drop(['two', 'four'], axis=1)
print(frame4)
#    one  three
# a    0      2
# b    4      6
# c    8     10
# d   12     14

The default is axis=0.
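A short sketch of two points made in this section: drop leaves the original untouched, and the axis can also be named with the index= and columns= keywords (available in pandas 0.21 and later) instead of a number. The result names are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

# Two spellings of the same column drop.
by_axis = frame.drop('two', axis=1)
by_keyword = frame.drop(columns=['two'])
# Row drop using the index= keyword instead of the default axis=0.
without_rows = frame.drop(index=['a', 'b'])

print(by_axis.equals(by_keyword))
print(list(frame.columns))  # the original still has every column
```

The keyword form can be easier to read when the axis number is not obvious from context.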

Indexing, selection, and filtering

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# dtype: float64
print("==")
print(obj['b'])
print(obj.b)
print(obj.iloc[1])  # by integer position; plain obj[1] is deprecated in pandas 2.1+
print(obj.iloc[3])
# 1.0
# 1.0
# 1.0
# 3.0
print("============")
print(obj[2:4])
print(obj[['b', 'c', 'd']])
print(obj.iloc[[1, 3]])  # by integer position
print(obj[obj < 2])

# c    2.0
# d    3.0
# dtype: float64
# b    1.0
# c    2.0
# d    3.0
# dtype: float64
# b    1.0
# d    3.0
# dtype: float64
# a    0.0
# b    1.0
# dtype: float64

Slicing with labels behaves differently from ordinary Python slicing: the endpoint is included.

print(obj['b':'c'])
#b    1.0
#c    2.0
#dtype: float64

Assigning a value to the positions covered by a slice

obj['b':'c'] = 5
print(obj)
# a    0.0
# b    5.0
# c    5.0
# d    3.0
# dtype: float64
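To make the inclusive endpoint concrete, this sketch compares a label slice with a position slice over the same Series. It uses the explicit loc and iloc indexers; by_label and by_position are names invented here.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both 'b' and 'c' are included.
by_label = obj.loc['b':'c']
# Position slice: the stop position 2 is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```

The label form is inclusive because there is no general way to name "the label just after 'c'" without knowing the index contents.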

Indexing into a DataFrame retrieves one or more columns.

Special cases in indexing

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame[:2])
#   one  two  three  four
#a    0    1      2     3
#b    4    5      6     7
print("========")
print(frame[frame['three'] > 5])
#   one  two  three  four
#b    4    5      6     7
#c    8    9     10    11
#d   12   13     14    15
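The sketch below collects the different meanings of [] on a DataFrame in one place: a list of labels picks columns, while a slice or a boolean Series picks rows. The last case, assignment through a boolean DataFrame, is not shown in the text above and is added here as an assumption about what readers may want next; all result names are made up.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

cols = frame[['three', 'one']]           # list of labels -> columns
rows = frame[:2]                         # slice -> rows
mask_rows = frame[frame['three'] > 5]    # boolean Series -> rows

# A boolean DataFrame selects element by element; work on a copy so
# the original frame is not modified.
clipped = frame.copy()
clipped[clipped < 5] = 0

print(clipped)
```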

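The ix indexing field

The special indexing field ix provides label indexing on the rows of a DataFrame. It lets you select a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the code below runs only on older versions; loc (labels) and iloc (integer positions) are its replacements.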

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )
print(frame)
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# d   12   13     14    15
print(frame.ix['a', ['two', 'three']])
# two      1
# three    2
# Name: a, dtype: int32
print("=======")
print(frame.ix[['b', 'c'], [3, 0, 1]])
#    four  one  two
# b     7    4    5
# c    11    8    9
print(frame.ix[['b', 'c'], ["four", "one", "two"]])
#    four  one  two
# b     7    4    5
# c    11    8    9
print("=======")
print(frame.ix[2])
# one       8
# two       9
# three    10
# four     11
# Name: c, dtype: int32
print(frame.ix[:'c', 'two'])
# a    1
# b    5
# c    9
# Name: two, dtype: int32
print("=========")
print(frame.ix[frame.three > 5, :3])
#    one  two  three
# b    4    5      6
# c    8    9     10
# d   12   13     14
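For readers on pandas 1.0 or later, the selections above can be written with loc (labels) and iloc (integer positions). This is a sketch of one possible translation rather than the only one; the names sel_a through sel_e are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

sel_a = frame.loc['a', ['two', 'three']]         # row label, column labels
sel_b = frame.iloc[[1, 2], [3, 0, 1]]            # row and column positions
sel_c = frame.iloc[2]                            # third row by position
sel_d = frame.loc[:'c', 'two']                   # label slice, inclusive
# Boolean row mask combined with the first three column labels.
sel_e = frame.loc[frame.three > 5, frame.columns[:3]]

print(sel_b)
print(sel_e)
```

Because ix guessed between labels and positions, each call has to be assigned to one indexer or the other; mixed cases such as the last one can look up the labels first.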

pandas offers many ways to select and rearrange the data held in its objects: plain [] indexing, label-based and position-based indexers, boolean masks, and reindex.

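Arithmetic and data alignment

An important pandas feature is arithmetic between objects with different indexes. When objects are added together, if any index pairs are not the same, the index of the result is the union of the index pairs.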

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s3 = s1 + s2
print(s1)
#a    7.3
#c   -2.5
#d    3.4
#e    1.5
#dtype: float64
print(s2)
#a   -2.1
#c    3.6
#e   -1.5
#f    4.0
#g    3.1
#dtype: float64
print(s3)
#a    5.2
#c    1.1
#d    NaN
#e    0.0
#f    NaN
#g    NaN
#dtype: float64

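Automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through subsequent arithmetic.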
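If the missing labels should count as zero instead of producing NaN, the add method accepts fill_value. The sketch below reuses s1 and s2 from above; plain and filled are names chosen for this example.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# The + operator leaves NaN wherever a label exists on one side only.
plain = s1 + s2
# add() with fill_value treats the absent side as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(plain.isna().sum())
print(filled)
```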

For a DataFrame, alignment happens on both the rows and the columns.

df = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
               index=['one', 'two', 'three']
               )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )

print(df)
#          b    c    d
# one    0.0  1.0  2.0
# two    3.0  4.0  5.0
# three  6.0  7.0  8.0
print(df2)
#         b     d     e
# five  0.0   1.0   2.0
# one   3.0   4.0   5.0
# two   6.0   7.0   8.0
# six   9.0  10.0  11.0
print(df + df2)
#          b   c     d   e
# five   NaN NaN   NaN NaN
# one    3.0 NaN   6.0 NaN
# six    NaN NaN   NaN NaN
# three  NaN NaN   NaN NaN
# two    9.0 NaN  12.0 NaN

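The result above contains many NaN values, which we now want to fill in.
Use the add method with fill_value. The rule is this: where a (row, column) cell exists in only one of the two objects, the missing side is replaced by fill_value before the operation. Where the cell is missing from both objects, the result is still NaN.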

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['one', 'two', 'three']
                )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )
print(df1 + df2)
#         b   c     d   e
# five   NaN NaN   NaN NaN
# one    3.0 NaN   6.0 NaN
# six    NaN NaN   NaN NaN
# three  NaN NaN   NaN NaN
# two    9.0 NaN  12.0 NaN
print(df1)
#          b    c    d
# one    0.0  1.0  2.0
# two    3.0  4.0  5.0
# three  6.0  7.0  8.0
print(df2)
#         b     d     e
# five  0.0   1.0   2.0
# one   3.0   4.0   5.0
# two   6.0   7.0   8.0
# six   9.0  10.0  11.0
df3 = df1.add(df2, fill_value=0)
print(df3)
#          b    c     d     e
# five   0.0  NaN   1.0   2.0
# one    3.0  1.0   6.0   5.0
# six    9.0  NaN  10.0  11.0
# three  6.0  7.0   8.0   NaN
# two    9.0  4.0  12.0   8.0
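fill_value is not specific to add. As a sketch, assuming the same df1 and df2, the example below uses it with mul and with reindex; neither call appears in the original text, and aligned and product are made-up names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['five', 'one', 'two', 'six'])

# Conform df1 to df2's columns, using 0 for the column df1 lacks.
aligned = df1.reindex(columns=df2.columns, fill_value=0)
# For multiplication, 1 is the neutral fill value.
product = df1.mul(df2, fill_value=1)

print(aligned)
print(product)
```

Cells that are absent from both frames, such as row five in column c, remain NaN in product, following the rule described above.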


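Operations between DataFrame and Series

Operations between the two are all broadcasts. Let's first look at an operation between NumPy arrays, then switch to operations between a DataFrame and a Series.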

import numpy as np

arr = np.arange(12.).reshape((3, 4))
print(arr)
#[[ 0.  1.  2.  3.]
# [ 4.  5.  6.  7.]
# [ 8.  9. 10. 11.]]
print(arr[0]) # [0. 1. 2. 3.]
print("=====")
arr2 = arr - arr[0]
print(arr2)
#[[0. 0. 0. 0.]
# [4. 4. 4. 4.]
# [8. 8. 8. 8.]]
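The subtraction above repeats a row down the array. For comparison, here is a minimal sketch of the other direction in NumPy, repeating a column across the array; first_col and col_diff are names invented for this example.

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))

# Indexing with a list keeps the column two-dimensional: shape (3, 1).
first_col = arr[:, [0]]
# A (3, 1) operand is stretched across the 4 columns of the (3, 4) array.
col_diff = arr - first_col

print(first_col.shape)
print(col_diff)
```

Keeping the extra dimension matters: a flat array of length 3 would not broadcast against a (3, 4) array.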

Now look at an operation between a DataFrame and a Series.

frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
                  index=['one', 'two', 'three', 'four']
                  )
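series = frame.iloc[0]  # first row by position (frame.ix[0] before pandas 1.0)
print(series)
series2 = frame.loc["one"]  # the same row by label (formerly frame.ix["one"])
print(series2)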

aa = frame - series
print(aa)

#         b    d    e
#one    0.0  0.0  0.0
#two    3.0  3.0  3.0
#three  6.0  6.0  6.0
#four   9.0  9.0  9.0

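By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows.

If an index value is not found in either the DataFrame's columns or the Series's index, both objects are reindexed to form the union.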

series = Series(range(3), index=['b', 'e', 'f'])
print(frame - series)
#          b   d     e   f
# one    0.0 NaN   1.0 NaN
# two    3.0 NaN   4.0 NaN
# three  6.0 NaN   7.0 NaN
# four   9.0 NaN  10.0 NaN

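Note that everything above broadcasts over the rows. Broadcasting over the columns needs extra care, so pay attention here: you have to use one of the arithmetic methods.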

series = frame['d']
print(series)
# one       1.0
# two       4.0
# three     7.0
# four     10.0
# Name: d, dtype: float64
print(frame.sub(series, axis=0))
#          b    d    e
# one   -1.0  0.0  1.0
# two   -1.0  0.0  1.0
# three -1.0  0.0  1.0
# four  -1.0  0.0  1.0

The axis you pass is the axis to match on. In this example the goal is to match on the DataFrame's row index and broadcast across the columns.
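A common use of matching on the row index is centering each row on its own mean. The sketch below assumes the same 4x3 frame; the row-mean example itself is an illustration that is not in the original text, and row_means, centered, and mismatched are hypothetical names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['one', 'two', 'three', 'four'])

# One mean per row, indexed by the row labels.
row_means = frame.mean(axis=1)
# axis=0 matches row_means against the row index of frame.
centered = frame.sub(row_means, axis=0)
# The plain operator matches against the columns instead, finds no
# common labels, and yields a frame that is entirely NaN.
mismatched = frame - row_means

print(centered)
print(mismatched.shape)
```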

Function application and mapping
