【Pandas-3】十分钟入门Pandas (下)
程序员文章站
2022-07-01 19:27:10
...
9 时间序列
Pandas 在对频率转换进行重新采样时拥有简单、强大且高效的功能(如将按秒采样的数据转换为按5分钟为单位进行采样的数据)。这种操作在金融领域非常常见。
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
1、 时区表示:
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]:
2012-03-06 0.464000
2012-03-07 0.227371
2012-03-08 -0.496922
2012-03-09 0.306389
2012-03-10 -2.290613
Freq: D, dtype: float64
In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 0.464000
2012-03-07 00:00:00+00:00 0.227371
2012-03-08 00:00:00+00:00 -0.496922
2012-03-09 00:00:00+00:00 0.306389
2012-03-10 00:00:00+00:00 -2.290613
Freq: D, dtype: float64
2、 时区转换:
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 0.464000
2012-03-06 19:00:00-05:00 0.227371
2012-03-07 19:00:00-05:00 -0.496922
2012-03-08 19:00:00-05:00 0.306389
2012-03-09 19:00:00-05:00 -2.290613
Freq: D, dtype: float64
3、 时间跨度转换:
In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [119]: ts
Out[119]:
2012-01-31 -1.134623
2012-02-29 -1.561819
2012-03-31 -0.260838
2012-04-30 0.281957
2012-05-31 1.523962
Freq: M, dtype: float64
In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01 -1.134623
2012-02 -1.561819
2012-03 -0.260838
2012-04 0.281957
2012-05 1.523962
Freq: M, dtype: float64
In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 -1.134623
2012-02-01 -1.561819
2012-03-01 -0.260838
2012-04-01 0.281957
2012-05-01 1.523962
Freq: MS, dtype: float64
4、 时期和时间戳之间的转换使得可以使用一些方便的算术函数。
In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -0.902937
1990-06-01 09:00 0.068159
1990-09-01 09:00 -0.057873
1990-12-01 09:00 -0.368204
1991-03-01 09:00 -1.144073
Freq: H, dtype: float64
10 Categorical
从 0.15 版本开始,pandas 可以在DataFrame中支持 Categorical 类型的数据
In [127]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
1、 将原始的grade
转换为 Categorical 数据类型:
In [128]: df["grade"] = df["raw_grade"].astype("category")
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
2、 将 Categorical 类型数据重命名为更有意义的名称:
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
3、 对类别进行重新排序,增加缺失的类别:
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
4、 排序是按照 Categorical 的顺序进行的而不是按照字典顺序进行:
In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
5、 对 Categorical 列进行排序时存在空的类别:
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
11 画图
In [135]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [136]: ts = ts.cumsum()
In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff2ab2af550>
对于DataFrame
来说,plot
是一种将所有列及其标签进行绘制的简便方法:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [139]: df = df.cumsum()
In [140]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[140]: <matplotlib.legend.Legend at 0x7ff29c8163d0>
12 导入和保存数据
12.1 CSV
1、 写入 csv 文件:
In [141]: df.to_csv('foo.csv')
2、 从 csv 文件中读取:
In [142]: pd.read_csv('foo.csv')
Out[142]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
12.2 HDF5
1、 写入 HDF5 存储:
In [143]: df.to_hdf('foo.h5','df')
2、 从 HDF5 存储中读取:
In [144]: pd.read_hdf('foo.h5','df')
Out[144]:
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
12.3 Excel
1、 写入excel文件:
In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
2、 从excel文件中读取:
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]:
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
上一篇: 一、mvc模式和mvvm模式
下一篇: “笨“方法学习Vue中的MVVM模式