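[Data Preprocessing] Handling Missing Data in Pandas

The examples below assume the standard imports, import numpy as np and import pandas as pd.
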
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])
   ...: 

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]: 
        one       two     three four   five
a -0.166778  0.501113 -0.355322  bar  False
c -0.337890  0.580967  0.983801  bar  False
e  0.057802  0.761948 -0.712964  bar   True
f -0.443160 -0.974602  1.047704  bar  False
h -0.717852 -1.053898 -0.019369  bar  False

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]: 
        one       two     three four   five
a -0.166778  0.501113 -0.355322  bar  False
b       NaN       NaN       NaN  NaN    NaN
c -0.337890  0.580967  0.983801  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.057802  0.761948 -0.712964  bar   True
f -0.443160 -0.974602  1.047704  bar  False
g       NaN       NaN       NaN  NaN    NaN
h -0.717852 -1.053898 -0.019369  bar  False

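Values considered "missing"

Because data comes in many shapes and forms, pandas aims to be flexible in how it handles missing data. NaN is the default missing value marker for reasons of computational speed and convenience, but we need to be able to detect it easily in data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will appear, and we want to treat it as "missing", "not available", or "NA" as well.

Note: If you want inf and -inf to be treated as "NA" in computations, you can set pandas.options.mode.use_inf_as_na = True.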
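
The option above is global and affects every computation in the session. As a minimal sketch of a local alternative (the sample Series is made up for illustration), the infinities can be mapped to NaN explicitly before checking for missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing both infinities and one NaN.
s = pd.Series([1.0, np.inf, -np.inf, np.nan])

# By default only NaN counts as missing; +/-inf are ordinary floats.
default_mask = s.isna()

# Explicitly map both infinities to NaN, then test for missing values.
cleaned = s.replace([np.inf, -np.inf], np.nan)
inf_as_na_mask = cleaned.isna()

print(default_mask.tolist())    # [False, False, False, True]
print(inf_as_na_mask.tolist())  # [False, True, True, True]
```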

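To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects: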

In [7]: df2['one']
Out[7]: 
a   -0.166778
b         NaN
c   -0.337890
d         NaN
e    0.057802
f   -0.443160
g         NaN
h   -0.717852
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]: 
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]: 
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [10]: df2.isna()
Out[10]: 
     one    two  three   four   five
a  False  False  False  False  False
b   True   True   True   True   True
c  False  False  False  False  False
d   True   True   True   True   True
e  False  False  False  False  False
f  False  False  False  False  False
g   True   True   True   True   True
h  False  False  False  False  False

Warning

Keep in mind that in Python (and NumPy), nan's do not compare equal, but None's do. Note that pandas/NumPy relies on the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False
So, unlike the checks above, a scalar equality comparison against None/np.nan does not provide useful information.
In [13]: df2['one'] == np.nan
Out[13]: 
a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool
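
A short sketch of the practical consequence, using a made-up Series: a mask built with == np.nan never selects anything, while isna() finds the missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

eq_mask = s == np.nan   # all False, because NaN never equals NaN
na_mask = s.isna()      # True exactly where the value is missing

print(eq_mask.any())    # False
print(na_mask.sum())    # 1
```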

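Datetimes

For datetime64[ns] types, NaT represents missing values. It is a pseudo-native sentinel value that NumPy can represent in a single dtype (datetime64[ns]). pandas objects provide interoperability between NaT and NaN.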

In [14]: df2 = df.copy()

In [15]: df2['timestamp'] = pd.Timestamp('20120101')

In [16]: df2
Out[16]: 
        one       two     three four   five  timestamp
a -0.166778  0.501113 -0.355322  bar  False 2012-01-01
c -0.337890  0.580967  0.983801  bar  False 2012-01-01
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h -0.717852 -1.053898 -0.019369  bar  False 2012-01-01

In [17]: df2.loc[['a','c','h'],['one','timestamp']] = np.nan

In [18]: df2
Out[18]: 
        one       two     three four   five  timestamp
a       NaN  0.501113 -0.355322  bar  False        NaT
c       NaN  0.580967  0.983801  bar  False        NaT
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h       NaN -1.053898 -0.019369  bar  False        NaT

# get_dtype_counts() was removed in pandas 1.0; on newer versions use df2.dtypes.value_counts()
In [19]: df2.get_dtype_counts()
Out[19]: 
float64           3
object            1
bool              1
datetime64[ns]    1
dtype: int64

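Inserting missing data

You can insert missing values simply by assigning to a container. The actual missing value used is chosen based on the dtype.

For example, numeric containers will always use NaN, regardless of the missing value type you assign: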

In [20]: s = pd.Series([1, 2, 3])

In [21]: s.loc[0] = None

In [22]: s
Out[22]: 
0    NaN
1    2.0
2    3.0
dtype: float64

Likewise, datetime containers will always use NaT.
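
A minimal sketch of the datetime case, with made-up dates: assigning None to a datetime Series stores NaT and keeps the datetime dtype.

```python
import pandas as pd

# Hypothetical datetime Series.
ts = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]))

# Assigning None stores NaT rather than the Python None object.
ts.loc[0] = None

print(ts.isna().tolist())   # [True, False, False]
print(ts.loc[0] is pd.NaT)  # True
```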

For object containers, pandas will use the value given:

In [23]: s = pd.Series(["a", "b", "c"])

In [24]: s.loc[0] = None

In [25]: s.loc[1] = np.nan

In [26]: s
Out[26]: 
0    None
1     NaN
2       c
dtype: object

Calculations with missing data

Missing values propagate naturally through arithmetic operations between pandas objects.

In [27]: a
Out[27]: 
        one       two
a       NaN  0.501113
c       NaN  0.580967
e  0.057802  0.761948
f -0.443160 -0.974602
h -0.443160 -1.053898

In [28]: b
Out[28]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369

In [29]: a + b
Out[29]: 
        one  three       two
a       NaN    NaN  1.002226
c       NaN    NaN  1.161935
e  0.115604    NaN  1.523896
f -0.886321    NaN -1.949205
h       NaN    NaN -2.107796

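The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data. For example:

When summing data, NA (missing) values are treated as zero.
If the data are all NA, the result is 0.
Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.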

In [30]: df
Out[30]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369

In [31]: df['one'].sum()
Out[31]: -0.38535826528461409

In [32]: df.mean(1)
Out[32]: 
a    0.072895
c    0.782384
e    0.035595
f   -0.123353
h   -0.536633
dtype: float64

In [33]: df.cumsum()
Out[33]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  1.082080  0.628479
e  0.057802  1.844028 -0.084485
f -0.385358  0.869426  0.963219
h       NaN -0.184472  0.943850

In [34]: df.cumsum(skipna=False)
Out[34]: 
   one       two     three
a  NaN  0.501113 -0.355322
c  NaN  1.082080  0.628479
e  NaN  1.844028 -0.084485
f  NaN  0.869426  0.963219
h  NaN -0.184472  0.943850

Sum/Prod of Empties/Nans

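Warning

This behavior is standard as of v0.22.0 and is consistent with the default in numpy; previously, sum/prod of an all-NA or empty Series/DataFrame returned NaN. See the v0.22.0 whatsnew for more information.

The sum of an empty or all-NA Series or column of a DataFrame is 0.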

In [35]: pd.Series([np.nan]).sum()
Out[35]: 0.0

In [36]: pd.Series([]).sum()
Out[36]: 0.0

The product of an empty or all-NA Series or column of a DataFrame is 1.

In [37]: pd.Series([np.nan]).prod()
Out[37]: 1.0

In [38]: pd.Series([]).prod()
Out[38]: 1.0
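
If the older NaN result is preferred for all-NA input, the min_count keyword of sum() and prod() (available since v0.22.0) can request it. A small sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical all-NA Series.
all_na = pd.Series([np.nan, np.nan])

default_sum = all_na.sum()     # 0.0
default_prod = all_na.prod()   # 1.0

# min_count=1 requires at least one non-NA value; otherwise NaN is returned.
strict_sum = all_na.sum(min_count=1)
strict_prod = all_na.prod(min_count=1)

print(default_sum, default_prod)  # 0.0 1.0
print(strict_sum, strict_prod)    # nan nan
```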

NA values in GroupBy

NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example:

In [39]: df
Out[39]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369

In [40]: df.groupby('one').mean()
Out[40]: 
                two     three
one                          
-0.443160 -0.974602  1.047704
 0.057802  0.761948 -0.712964

See the groupby section of the pandas documentation for more information.
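
The exclusion described above is the default. Later pandas releases (1.1 and newer, so newer than the version this text was written against) add a dropna keyword to groupby() that keeps the NA group. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing grouping key.
df = pd.DataFrame({"key": ["x", np.nan, "x", np.nan], "val": [1, 2, 3, 4]})

# Default: rows whose key is NA are left out of every group.
default = df.groupby("key")["val"].sum()

# dropna=False (pandas >= 1.1) keeps NA as its own group.
with_na = df.groupby("key", dropna=False)["val"].sum()

print(len(default))  # 1
print(len(with_na))  # 2
```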

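Cleaning / filling missing data

pandas objects are equipped with various data manipulation methods for dealing with missing data.

Filling missing values: fillna

fillna() can "fill in" NA values with non-NA data in a couple of ways, which we illustrate:

Replace NA with a scalar value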

In [41]: df2
Out[41]: 
        one       two     three four   five  timestamp
a       NaN  0.501113 -0.355322  bar  False        NaT
c       NaN  0.580967  0.983801  bar  False        NaT
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h       NaN -1.053898 -0.019369  bar  False        NaT

In [42]: df2.fillna(0)
Out[42]: 
        one       two     three four   five            timestamp
a  0.000000  0.501113 -0.355322  bar  False                    0
c  0.000000  0.580967  0.983801  bar  False                    0
e  0.057802  0.761948 -0.712964  bar   True  2012-01-01 00:00:00
f -0.443160 -0.974602  1.047704  bar  False  2012-01-01 00:00:00
h  0.000000 -1.053898 -0.019369  bar  False                    0

In [43]: df2['one'].fillna('missing')
Out[43]: 
a     missing
c     missing
e    0.057802
f    -0.44316
h     missing
Name: one, dtype: object

Fill gaps forward or backward

Using the same filling arguments as reindexing, we can propagate non-NA values forward or backward:

In [44]: df
Out[44]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369

In [45]: df.fillna(method='pad')
Out[45]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h -0.443160 -1.053898 -0.019369

Limit the amount of filling

If we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword:

In [46]: df
Out[46]: 
   one       two     three
a  NaN  0.501113 -0.355322
c  NaN  0.580967  0.983801
e  NaN       NaN       NaN
f  NaN       NaN       NaN
h  NaN -1.053898 -0.019369

In [47]: df.fillna(method='pad', limit=1)
Out[47]: 
   one       two     three
a  NaN  0.501113 -0.355322
c  NaN  0.580967  0.983801
e  NaN  0.580967  0.983801
f  NaN       NaN       NaN
h  NaN -1.053898 -0.019369

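As a reminder, these are the available filling methods:

Method	Action
pad / ffill	Fill values forward
bfill / backfill	Fill values backward

With time series data, using pad/ffill is extremely common, so that the "last known value" is available at every time point.

ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill').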
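
A small sketch of the dedicated methods on a made-up Series, including the limit keyword. Recent pandas releases deprecate the method= keyword of fillna() in favour of these methods, so they are the safer spelling in new code.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a two-element gap and a trailing gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # carry the last valid value forward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap
backward = s.bfill()         # pull the next valid value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(limited.tolist())   # [1.0, 1.0, nan, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
```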

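Filling with a PandasObject

You can also fill using a dict or Series that is alignable. The labels of the dict or the index of the Series must match the columns of the frame you wish to fill. A typical use case is filling a DataFrame with the mean of each column.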

In [48]: dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))

In [49]: dff.iloc[3:5,0] = np.nan

In [50]: dff.iloc[4:6,1] = np.nan

In [51]: dff.iloc[5:8,2] = np.nan

In [52]: dff
Out[52]: 
          A         B         C
0  0.758887  2.340598  0.219039
1 -1.235583  0.031785  0.701683
2 -1.557016 -0.636986 -1.238610
3       NaN -1.002278  0.654052
4       NaN       NaN  1.053999
5  0.651981       NaN       NaN
6  0.109001 -0.533294       NaN
7 -1.037831 -1.150016       NaN
8 -0.687693  1.921056 -0.121113
9 -0.258742 -0.706329  0.402547

In [53]: dff.fillna(dff.mean())
Out[53]: 
          A         B         C
0  0.758887  2.340598  0.219039
1 -1.235583  0.031785  0.701683
2 -1.557016 -0.636986 -1.238610
3 -0.407125 -1.002278  0.654052
4 -0.407125  0.033067  1.053999
5  0.651981  0.033067  0.238800
6  0.109001 -0.533294  0.238800
7 -1.037831 -1.150016  0.238800
8 -0.687693  1.921056 -0.121113
9 -0.258742 -0.706329  0.402547

In [54]: dff.fillna(dff.mean()['B':'C'])
Out[54]: 
          A         B         C
0  0.758887  2.340598  0.219039
1 -1.235583  0.031785  0.701683
2 -1.557016 -0.636986 -1.238610
3       NaN -1.002278  0.654052
4       NaN  0.033067  1.053999
5  0.651981  0.033067  0.238800
6  0.109001 -0.533294  0.238800
7 -1.037831 -1.150016  0.238800
8 -0.687693  1.921056 -0.121113
9 -0.258742 -0.706329  0.402547

The following gives the same result as dff.fillna(dff.mean()) above, but aligns the 'fill' value, which is a Series in this case:

In [55]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
Out[55]: 
          A         B         C
0  0.758887  2.340598  0.219039
1 -1.235583  0.031785  0.701683
2 -1.557016 -0.636986 -1.238610
3 -0.407125 -1.002278  0.654052
4 -0.407125  0.033067  1.053999
5  0.651981  0.033067  0.238800
6  0.109001 -0.533294  0.238800
7 -1.037831 -1.150016  0.238800
8 -0.687693  1.921056 -0.121113
9 -0.258742 -0.706329  0.402547
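
To make the label alignment concrete, here is a sketch on a tiny made-up frame: a dict keyed by column name and a Series indexed by column name are both matched against the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
dff = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 4.0]})

# dict keys are matched against column names
by_dict = dff.fillna({"A": 0.0, "B": -1.0})

# a Series indexed by column name is aligned the same way
by_mean = dff.fillna(dff.mean())

print(by_dict.loc[1, "A"], by_dict.loc[0, "B"])  # 0.0 -1.0
print(by_mean.loc[1, "A"], by_mean.loc[0, "B"])  # 2.0 3.0
```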

Dropping axis labels with missing data: dropna

You may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna():

In [56]: df
Out[56]: 
   one       two     three
a  NaN  0.501113 -0.355322
c  NaN  0.580967  0.983801
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -1.053898 -0.019369

In [57]: df.dropna(axis=0)
Out[57]: 
Empty DataFrame
Columns: [one, two, three]
Index: []

In [58]: df.dropna(axis=1)
Out[58]: 
        two     three
a  0.501113 -0.355322
c  0.580967  0.983801
e  0.000000  0.000000
f  0.000000  0.000000
h -1.053898 -0.019369

In [59]: df['one'].dropna()
Out[59]: Series([], Name: one, dtype: float64)

An equivalent dropna() is available for Series. DataFrame.dropna has considerably more options than Series.dropna, which can be examined in the API.
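
A sketch of some of those DataFrame.dropna options (how, thresh and subset), using a made-up frame in which each row has a different number of missing values:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical frame: row a is all NA, b has one value, c has two, d is complete.
df = pd.DataFrame(
    {"one": [nan, nan, nan, 1.0],
     "two": [nan, 2.0, 2.0, 3.0],
     "three": [nan, nan, 5.0, 6.0]},
    index=list("abcd"),
)

complete_rows = df.dropna()               # drop rows with any NA
not_all_na = df.dropna(how="all")         # drop rows where every value is NA
at_least_two = df.dropna(thresh=2)        # keep rows with >= 2 non-NA values
two_present = df.dropna(subset=["two"])   # only look at column 'two'

print(list(complete_rows.index))  # ['d']
print(list(not_all_na.index))     # ['b', 'c', 'd']
print(list(at_least_two.index))   # ['c', 'd']
print(list(two_present.index))    # ['b', 'c', 'd']
```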

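Interpolation

New in version 0.23.0: the limit_area keyword argument was added.

Both Series and DataFrame objects have an interpolate() method that, by default, performs linear interpolation at missing data points.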

In [60]: ts
Out[60]: 
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
2000-06-30         NaN
2000-07-31         NaN
                ...   
2007-10-31   -3.305259
2007-11-30   -5.485119
2007-12-31   -6.854968
2008-01-31   -7.809176
2008-02-29   -6.346480
2008-03-31   -8.089641
2008-04-30   -8.916232
Freq: BM, Length: 100, dtype: float64

In [61]: ts.count()
Out[61]: 61

In [62]: ts.interpolate().count()
Out[62]: 100

In [63]: ts.interpolate().plot()
Out[63]: <matplotlib.axes._subplots.AxesSubplot at 0x7f20cf59ca58>

Index-aware interpolation is available via the method keyword:

In [64]: ts2
Out[64]: 
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.689738
2005-01-31         NaN
2008-04-30   -8.916232
dtype: float64

In [65]: ts2.interpolate()
Out[65]: 
2000-01-31    0.469112
2000-02-29   -2.610313
2002-07-31   -5.689738
2005-01-31   -7.302985
2008-04-30   -8.916232
dtype: float64

In [66]: ts2.interpolate(method='time')
Out[66]: 
2000-01-31    0.469112
2000-02-29    0.273272
2002-07-31   -5.689738
2005-01-31   -7.095568
2008-04-30   -8.916232
dtype: float64

For a floating-point index, use method='values':

In [67]: ser
Out[67]: 
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [68]: ser.interpolate()
Out[68]: 
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [69]: ser.interpolate(method='values')
Out[69]: 
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

You can also interpolate with a DataFrame:

In [70]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
   ....: 

In [71]: df
Out[71]: 
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [72]: df.interpolate()
Out[72]: 
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
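
The filled values can be checked by hand. Default linear interpolation treats rows as equally spaced, so the two missing points in column B sit one third and two thirds of the way from 0.25 to 4.0. A sketch that reproduces the arithmetic on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   "B": [.25, np.nan, np.nan, 4, 12.2, 14.4]})
result = df.interpolate()

# Column B: two gaps between 0.25 (row 0) and 4.0 (row 3), i.e. three equal steps.
step = (4.0 - 0.25) / 3
manual_b = [0.25 + step, 0.25 + 2 * step]   # 1.5 and 2.75

# Column A: one gap halfway between 2.1 (row 1) and 4.7 (row 3).
manual_a = (2.1 + 4.7) / 2                  # 3.4

print(result.loc[[1, 2], "B"].tolist(), manual_b)
print(result.loc[2, "A"], manual_a)
```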

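The method argument gives access to fancier interpolation methods. If you have scipy installed, you can pass the name of a 1-d interpolation routine to method. Consult the full scipy interpolation documentation and reference guide for details. The appropriate interpolation method depends on the type of data you are working with.

If you are dealing with a time series that is growing at an increasing rate, method='quadratic' may be appropriate.
If you have values approximating a cumulative distribution function, then method='pchip' should work well.
To fill missing values with the goal of smooth plotting, consider method='akima'.

Warning: These methods require scipy.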

In [73]: df.interpolate(method='barycentric')
Out[73]: 
      A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400

In [74]: df.interpolate(method='pchip')
Out[74]: 
         A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000

In [75]: df.interpolate(method='akima')
Out[75]: 
          A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

In [76]: df.interpolate(method='spline', order=2)
Out[76]: 
          A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

In [77]: df.interpolate(method='polynomial', order=2)
Out[77]: 
          A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

Compare several methods:

In [78]: np.random.seed(2)

In [79]: ser = pd.Series(np.arange(1, 10.1, .25)**2 + np.random.randn(37))

In [80]: bad = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [81]: ser[bad] = np.nan

In [82]: methods = ['linear', 'quadratic', 'cubic']

In [83]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})

In [84]: df.plot()
Out[84]: <matplotlib.axes._subplots.AxesSubplot at 0x7f20cf573fd0>

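Another use case is interpolation at new values. Suppose you have 100 observations from some distribution, and that you are particularly interested in what is happening around the middle. You can combine pandas' reindex and interpolate methods to interpolate at the new values.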

In [85]: ser = pd.Series(np.sort(np.random.uniform(size=100)))

# interpolate at new_index
In [86]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))

In [87]: interp_s = ser.reindex(new_index).interpolate(method='pchip')

In [88]: interp_s[49:51]
Out[88]: 
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64
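
The same reindex-then-interpolate pattern also works without scipy. The sketch below swaps in method='index' (linear interpolation that respects the index values) instead of 'pchip', on a small made-up Series, so the results are easy to verify by hand:

```python
import pandas as pd

# Hypothetical observations on a float index.
ser = pd.Series([0.0, 10.0, 20.0], index=[0.0, 1.0, 2.0])

# Add the positions we want values for, keeping the index sorted.
new_index = ser.index.union(pd.Index([0.25, 1.5]))

# The new labels start out as NaN and are then filled using the index values.
interp = ser.reindex(new_index).interpolate(method="index")

print(interp.loc[0.25])  # 2.5
print(interp.loc[1.5])   # 15.0
```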

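Interpolation limits

Like other pandas fill methods, interpolate() accepts a limit keyword argument. Use this argument to limit the number of consecutive NaN values filled since the last valid observation: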

In [89]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

# fill all consecutive values in a forward direction
In [90]: ser.interpolate()
Out[90]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

# fill one consecutive value in a forward direction
In [91]: ser.interpolate(limit=1)
Out[91]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

By default, NaN values are filled in a forward direction. Use the limit_direction parameter to fill backward or from both directions.

# fill one consecutive value backwards
In [92]: ser.interpolate(limit=1, limit_direction='backward')
Out[92]: 
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill one consecutive value in both directions
In [93]: ser.interpolate(limit=1, limit_direction='both')
Out[93]: 
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64

# fill all consecutive values in both directions
In [94]: ser.interpolate(limit_direction='both')
Out[94]: 
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

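By default, NaN values are filled whether they are inside (surrounded by) existing valid values or outside existing valid values. The limit_area parameter, introduced in v0.23, restricts filling to either inside or outside values.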

# fill one consecutive inside value in both directions
In [95]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
Out[95]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values backward
In [96]: ser.interpolate(limit_direction='backward', limit_area='outside')
Out[96]: 
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values in both directions
In [97]: ser.interpolate(limit_direction='both', limit_area='outside')
Out[97]: 
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64

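Replacing generic values

Often we want to replace arbitrary values with other values.

replace() in Series and replace() in DataFrame provide an efficient yet flexible way to perform such replacements.

For a Series, you can replace a single value or a list of values by another value: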

In [98]: ser = pd.Series([0., 1., 2., 3., 4.])

In [99]: ser.replace(0, 5)
Out[99]: 
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

You can replace a list of values by a list of other values:

In [100]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[100]: 
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

You can also specify a mapping dict:

In [101]: ser.replace({0: 10, 1: 100})
Out[101]: 
0     10.0
1    100.0
2      2.0
3      3.0
4      4.0
dtype: float64

For a DataFrame, you can specify individual values by column:

In [102]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [103]: df.replace({'a': 0, 'b': 5}, 100)
Out[103]: 
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9

Instead of replacing with specified values, you can treat all given values as missing and fill over them:

In [104]: ser.replace([1, 2, 3], method='pad')
Out[104]: 
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64
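
An equivalent two-step sketch, which avoids the method= keyword of replace() (deprecated in recent pandas releases): first turn the unwanted values into NaN, then fill or interpolate. The sentinel values -999 and -1 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -999 and -1 are sentinel codes for "missing".
ser = pd.Series([0.0, -999.0, 2.0, -1.0, 4.0])

# Step 1: mark the sentinels as missing.
cleaned = ser.replace([-999.0, -1.0], np.nan)

# Step 2: fill the gaps, here by forward fill and by linear interpolation.
padded = cleaned.ffill()
filled = cleaned.interpolate()

print(int(cleaned.isna().sum()))  # 2
print(padded.tolist())            # [0.0, 0.0, 2.0, 2.0, 4.0]
print(filled.tolist())            # [0.0, 1.0, 2.0, 3.0, 4.0]
```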

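String/regular expression replacement

Note: Python strings prefixed with the r character, such as r'hello world', are so-called "raw" strings. They have different semantics regarding backslashes than strings without this prefix. A backslash in a raw string is kept as a literal backslash, e.g. r'\n' == '\\n'. You should read about them if this is unclear.

Replace the '.' with NaN (str -> str):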

In [105]: d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}

In [106]: df = pd.DataFrame(d)

In [107]: df.replace('.', np.nan)
Out[107]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Now do it with a regular expression that also removes surrounding whitespace (regex -> regex):

In [108]: df.replace(r'\s*\.\s*', np.nan, regex=True)
Out[108]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d
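
The frame above has no whitespace around its dots, so the effect of the \s* parts is not visible there. The sketch below uses a made-up column that does, and anchors the pattern with ^ and $ so that only cells consisting solely of a dot (plus optional whitespace) are replaced:

```python
import numpy as np
import pandas as pd

# Hypothetical column: padded dots, a bare dot, and a value that merely ends in a dot.
df = pd.DataFrame({"b": ["a", " . ", ".", "b  ."]})

# Anchored pattern: the whole cell must be a dot with optional surrounding whitespace.
out = df.replace(r"^\s*\.\s*$", np.nan, regex=True)

print(out["b"].isna().tolist())  # [False, True, True, False]
```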

Replace a few different values (list -> list):

In [109]: df.replace(['a', '.'], ['b', np.nan])
Out[109]: 
   a    b    c
0  0    b    b
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

List of regex -> list of regex:

In [110]: df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
Out[110]: 
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

Only search in column 'b' (dict -> dict):

In [111]: df.replace({'b': '.'}, {'b': np.nan})
Out[111]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Same as the previous example, but use a regular expression for searching instead (dict of regex -> dict):

In [112]: df.replace({'b': r'\s*\.\s*'}, {'b': np.nan}, regex=True)
Out[112]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

You can pass nested dictionaries of regular expressions that use regex=True:

In [113]: df.replace({'b': {'b': r''}}, regex=True)
Out[113]: 
   a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d

Alternatively, you can pass the nested dictionary like so:

In [114]: df.replace(regex={'b': {r'\s*\.\s*': np.nan}})
Out[114]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

You can also use the group of a regular expression match when replacing (dict of regex -> dict of regex); this works for lists as well.

In [115]: df.replace({'b': r'\s*(\.)\s*'}, {'b': r'\1ty'}, regex=True)
Out[115]: 
   a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d

You can pass a list of regular expressions, of which those that match will be replaced with a scalar (list of regex -> regex).

In [116]: df.replace([r'\s*\.\s*', r'a|b'], np.nan, regex=True)
Out[116]: 
   a   b    c
0  0 NaN  NaN
1  1 NaN  NaN
2  2 NaN  NaN
3  3 NaN    d

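All of the regular expression examples can also be written by passing the to_replace argument as the regex argument. In this case the value argument must be passed explicitly by name, or regex must be a nested dictionary. The previous example then becomes: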

In [117]: df.replace(regex=[r'\s*\.\s*', r'a|b'], value=np.nan)
Out[117]: 
   a   b    c
0  0 NaN  NaN
1  1 NaN  NaN
2  2 NaN  NaN
3  3 NaN    d

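This can be convenient if you do not want to pass regex=True every time you want to use a regular expression.

Note: Anywhere in the above replace examples that you see a regular expression, a compiled regular expression is valid as well.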

Numeric replacement

replace() is similar to fillna().

In [118]: df = pd.DataFrame(np.random.randn(10, 2))

In [119]: df[np.random.rand(df.shape[0]) > 0.5] = 1.5

In [120]: df.replace(1.5, np.nan)
Out[120]: 
          0         1
0 -0.844214 -1.021415
1  0.432396 -0.323580
2  0.423825  0.799180
3  1.262614  0.751965
4       NaN       NaN
5       NaN       NaN
6 -0.498174 -1.060799
7  0.591667 -0.183257
8  1.019855 -1.482465
9       NaN       NaN

Replacing more than one value is possible by passing a list.

In [121]: df00 = df.values[0, 0]

In [122]: df.replace([1.5, df00], [np.nan, 'a'])
Out[122]: 
          0         1
0         a  -1.02141
1  0.432396  -0.32358
2  0.423825   0.79918
3   1.26261  0.751965
4       NaN       NaN
5       NaN       NaN
6 -0.498174   -1.0608
7  0.591667 -0.183257
8   1.01985  -1.48247
9       NaN       NaN

In [123]: df[1].dtype
Out[123]: dtype('float64')

You can also operate on the DataFrame in place:

In [124]: df.replace(1.5, np.nan, inplace=True)

Warning:

When replacing multiple bool or datetime64 objects, the first argument to replace (to_replace) must match the type of the value being replaced. For example,
s = pd.Series([True, False, True])
s.replace({'a string': 'new value', True: False})  # raises

TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
will raise a TypeError because one of the dict keys is not of the correct type for replacement.

However, when replacing a single object such as,
In [125]: s = pd.Series([True, False, True])

In [126]: s.replace('a string', 'another string')
Out[126]: 
0     True
1    False
2     True
dtype: bool
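the original NDFrame object will be returned untouched. We are working on unifying this API, but for backwards compatibility reasons we cannot break the latter behavior. See GH6354 for more details.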

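Missing data casting rules and indexing

While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data. Until we can switch to using a native NA type in NumPy, we have established some "casting rules". When a reindexing operation introduces missing data, the Series will be cast according to the rules in the table below.

Data type	Cast to
integer	float
boolean	object
float	no cast
object	no cast

For example: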

In [127]: s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])

In [128]: s > 0
Out[128]: 
0    True
2    True
4    True
6    True
7    True
dtype: bool

In [129]: (s > 0).dtype
Out[129]: dtype('bool')

In [130]: crit = (s > 0).reindex(list(range(8)))

In [131]: crit
Out[131]: 
0    True
1     NaN
2    True
3     NaN
4    True
5     NaN
6    True
7    True
dtype: object

In [132]: crit.dtype
Out[132]: dtype('O')
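
The integer row of the table can be sketched the same way. The example below also shows the nullable "Int64" extension dtype from later pandas releases (0.24 and newer, so not covered by the rules above), which keeps integers and marks the gaps as missing instead of casting to float. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical integer Series with gaps in its index.
ints = pd.Series([1, 2, 3], index=[0, 2, 4])

# Reindexing introduces missing data, so int64 is cast to float64.
as_float = ints.reindex(list(range(5)))

# With the nullable Int64 dtype the values stay integers.
nullable = ints.astype("Int64").reindex(list(range(5)))

print(as_float.dtype)               # float64
print(nullable.dtype)               # Int64
print(int(nullable.isna().sum()))   # 2
```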

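Ordinarily NumPy will complain if you try to use an object array (even if it contains boolean values) instead of a boolean array to get or set values from an ndarray (e.g. selecting values based on some criteria). If a boolean vector contains NAs, an exception will be generated: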

In [133]: reindexed = s.reindex(list(range(8))).fillna(0)

In [134]: reindexed[crit]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-134-0dac417a4890> in <module>()
----> 1 reindexed[crit]

/pandas/pandas/core/series.py in __getitem__(self, key)
    805             key = list(key)
    806 
--> 807         if com.is_bool_indexer(key):
    808             key = check_bool_indexer(self.index, key)
    809 

/pandas/pandas/core/common.py in is_bool_indexer(key)
    105             if not lib.is_bool_array(key):
    106                 if isna(key).any():
--> 107                     raise ValueError('cannot index with vector containing '
    108                                      'NA / NaN values')
    109                 return False

ValueError: cannot index with vector containing NA / NaN values

However, these can be filled in using fillna() and it will work fine:

In [135]: reindexed[crit.fillna(False)]
Out[135]: 
0    0.126504
2    0.696198
4    0.697416
6    0.601516
7    0.003659
dtype: float64

In [136]: reindexed[crit.fillna(True)]
Out[136]: 
0    0.126504
1    0.000000
2    0.696198
3    0.000000
4    0.697416
5    0.000000
6    0.601516
7    0.003659
dtype: float64
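
Another way to avoid the object-dtype mask in the first place, sketched on made-up data, is to give reindex() a fill_value so that no NA is ever introduced and the mask stays boolean:

```python
import pandas as pd

# Hypothetical Series with gaps in its index.
s = pd.Series([0.5, -1.0, 2.0], index=[0, 2, 4])

# fill_value=False keeps the mask as dtype bool instead of object.
crit = (s > 0).reindex(list(range(5)), fill_value=False)

reindexed = s.reindex(list(range(5))).fillna(0)
selected = reindexed[crit]

print(crit.dtype)            # bool
print(list(selected.index))  # [0, 4]
```

Filling with False means labels that were missing from the original mask are simply not selected; pass fill_value=True if the opposite is wanted.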