Basic Usage of pandas

程序员文章站 2024-03-24 22:04:58


Introduction

What it is

pandas is a powerful Python toolkit for data analysis, built on top of NumPy.


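Main features

DataFrame and Series data structures with automatic index alignment
Integrated time-series functionality
A rich set of mathematical operations
Flexible handling of missing data

Installation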

pip install pandas

Import

import pandas as pd

Series

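A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of data labels (the index).

A Series behaves like a cross between a list (array) and a dictionary: it is ordered, and it supports access both by position and by key.

Creation

# Default creation
sr = pd.Series([1,5,-6,9])
""" An integer index is generated by default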
0    1
1    5
2   -6
3    9
dtype: int64
"""

# Create with explicit labels
pd.Series([1,5,-6,9], index=['a', 'b', 'c', 'd'])
"""
a    1
b    5
c   -6
d    9
dtype: int64
"""

# Create from a dictionary
pd.Series({'a':1, 'b':2})
"""
a    1
b    2
dtype: int64
"""

# The value array and the index array: the values and index attributes
sr = pd.Series([1,5,-6,9])
sr.index
sr.values
"""
RangeIndex(start=0, stop=4, step=1)
array([ 1,  5, -6,  9], dtype=int64)
"""

sr = pd.Series([1,5,-6,9], index=['a', 'b', 'c', 'd'])
sr.index
sr.values
""" Note: string labels are stored with dtype object
Index(['a', 'b', 'c', 'd'], dtype='object')
array([ 1,  5, -6,  9], dtype=int64)
"""

Features

Series supports NumPy-style features (positional access)

Creating a Series from an ndarray: Series(arr)

import numpy as np

a = np.array([1,2,3,4])
sr = pd.Series(a, index=['a','b','c','d'])
"""
a    1
b    2
c    3
d    4
dtype: int32   (int64 on platforms where NumPy's default integer is 64-bit)
"""

Arithmetic with a scalar: sr*2

sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
sr * 2
"""
a    2
b    4
c    6
d    8
dtype: int64
"""

Arithmetic between two Series: sr1+sr2

sr1 = pd.Series([1,2,3,4])
sr2 = pd.Series([3,1,3,4])
sr1 + sr2
"""
0    4
1    3
2    6
3    8
dtype: int64
"""

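Indexing: sr.iloc[0], sr.iloc[[1,2,4]]

sr = pd.Series([1,5,-6,9,8], index=['a', 'b', 'c', 'd', 'e'])

sr.iloc[0]  # 1  simple positional indexing (plain sr[0] on a labelled index is deprecated in recent pandas versions)
sr.iloc[[1,2,4]]  # fancy indexing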
"""
b    5
c   -6
e    8
dtype: int64
"""

Slicing: sr[0:2] (a slice is still a view); the start is included and the end is excluded

sr = pd.Series([1,5,-6,9,8], index=['a', 'b', 'c', 'd', 'e'])
sr[0:2]
"""
a    1
b    5
dtype: int64
"""

Universal functions: np.abs(sr); see the NumPy documentation
Boolean filtering: sr[sr>0]
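
The two items above have no example in the original, so here is a minimal sketch. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the original article)
sr = pd.Series([1, -5, 6, -9], index=['a', 'b', 'c', 'd'])

abs_sr = np.abs(sr)     # universal function: element-wise, the index is preserved
positive = sr[sr > 0]   # boolean filter: keep only the entries greater than 0

print(abs_sr.tolist())      # [1, 5, 6, 9]
print(positive.to_dict())   # {'a': 1, 'c': 6}
```

The comparison sr > 0 itself produces a boolean Series, which is then used as a mask.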

Statistical functions: mean() sum() cumsum()


# cumsum() returns the running (prefix) sum

sr = pd.Series([1,2,3,4,5])
sr.cumsum()
"""
0     1  
1     3  
2     6
3    10
4    15
dtype: int64
"""

Series supports dictionary-style features (labels)

Creating a Series from a dictionary: Series(dic)

The in operator: 'a' in sr, for x in sr

sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
'a' in sr  # True
sr.get('a') # 1  


# Note: unlike a Python dict, iterating over a Series yields the values, not the keys

for i in sr:
    print(i)
"""
1
2
3
4
"""

Key indexing: sr['a'], sr[['a', 'b', 'd']]

sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
sr['a'] # 1


# Fancy indexing

sr[['a','b','d']] 
"""
a    1
b    2
d    4
dtype: int64
"""

Key slicing: sr['a':'c']; both endpoints are included

sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
sr['a':'c']
"""
Same rows as sr[0:3], but sliced by label, so both endpoints are included:
a    1
b    2
c    3
dtype: int64
"""

Other functions: get('a', default=0), etc.

sr = pd.Series([1,2,3,4], index=['a','b','c','d'])
sr.get('f', default=0)  # 0

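Integer indexes

The index of a Series can consist of labels or of integers.

Important: if the index is of integer type, integer-based access through [] is always label-oriented. (That is, when a key could be read either as a position or as a label, it is read as a label.)

Solution: use sr.loc[] to go by label and sr.iloc[] to go by position.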

import pandas as pd
import numpy as np

sr = pd.Series(np.random.uniform(1,10,5))
"""
0    2.699248
1    7.649924
2    5.232440
3    6.690911
4    5.734944
dtype: float64
"""
sr[-1]  # raises KeyError: -1, because the key is read as a label and the label -1 does not exist

sr.iloc[-1] # 5.734944 (the data is random, so the value differs on every run)
sr.loc[-1]  # KeyError: 'the label [-1] is not in the [index]'

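Series data alignment

When two Series are combined arithmetically, they are first aligned by index. If the indexes differ, the index of the result is the union of the two operands' indexes.

# When both indexes hold the same labels, values are aligned by label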
sr1 = pd.Series([12,23,34], index=['c','a','d'])
sr2 = pd.Series([11,20,10], index=['d','c','a',])
sr1+sr2
"""
a    33
c    32
d    45
dtype: int64
"""

# When the indexes differ, labels missing from either side produce NaN
sr3 = pd.Series([11,20,10,14], index=['d','c','a','b'])
sr1+sr3
"""
a    33.0
b     NaN   # missing value
c    32.0
d    45.0
dtype: float64
"""

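# How can missing values be treated as 0 when two Series are combined arithmetically?
# Use the methods (add, sub, div, mul) instead of the operators, and pass the fill_value argument
sr1.add(sr3, fill_value=0)
"""
a    33.0
b    14.0   # the missing side is treated as 0, so the existing value passes through unchanged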
c    32.0
d    45.0
dtype: float64
"""

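Missing data in a Series

Missing data:

NaN (Not a Number) represents missing data; its value is np.nan. The built-in None is also treated as NaN.

Methods for handling missing data:

isnull() returns a boolean array that is True where a value is missing; similar to isnan() in NumPy
notnull() returns a boolean array that is False where a value is missing
dropna() drops the rows whose value is NaN
fillna() fills in missing data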

import pandas as pd
import numpy as np

sr = pd.Series([33, np.nan, 32, np.nan], index=['a','b','c','d'])
sr.notnull()
"""
a     True
b    False
c     True
d    False
dtype: bool
"""

# Boolean filtering
sr[sr.notnull()]
"""
a    33.0
c    32.0
dtype: float64
"""

# Same effect as the boolean filter
sr.dropna()
"""
a    33.0
c    32.0
dtype: float64
"""

# Fill (here with the mean; note that mean() automatically skips NaN when computing the average)
sr.fillna(sr.mean())
"""
a    33.0
b    32.5
c    32.0
d    32.5
dtype: float64
"""

DataFrame

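A DataFrame is a tabular (two-dimensional) data structure containing an ordered set of columns. It can be viewed as a dictionary of Series that all share one index.

Creation

# Create from a dictionary: each key becomes a column name (think of a field in a database table), and an integer index is generated automatically.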
pd.DataFrame({'one': [1,2,3,4,], 'two': [4,3,2,1]})
"""
    one two
0   1   4
1   2   3
2   3   2
3   4   1
"""

# Create from a dictionary of Series: the union of the Series labels becomes the index, and missing values are NaN
pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
              'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
"""
    one two
a   1.0 2
b   2.0 1
c   3.0 3
d   NaN 4
"""

A DataFrame resembles a database table. In practice we rarely build one by hand; we usually read it from a file, which is covered later.

Reading and writing CSV files:

df = pd.read_csv('601318.csv')
df.to_csv()

Common attributes and methods

T transpose
index the row index
columns the column index
values the array of values

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
              'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})
"""
    one two
a   1.0 2
b   2.0 1
c   3.0 3
d   NaN 4
"""

# Transpose
df.T
"""
    a   b   c   d
one 1.0 2.0 3.0 NaN
two 2.0 1.0 3.0 4.0
"""

# Get the row index
df.index
"""
Index(['a', 'b', 'c', 'd'], dtype='object')
"""

# Get the column index
df.columns
"""
Index(['one', 'two'], dtype='object')
"""

# Get the value array
df.values
"""
array([[  1.,   2.],
       [  2.,   1.],
       [  3.,   3.],
       [ nan,   4.]])
"""

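Indexing and slicing

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
              'two':pd.Series([5,4,7,2],index=['b','a','c','d']),
                'three':pd.Series([6,8,3,9], index=['a','b','c','d'])},
                columns=['one', 'three', 'two'])  # fix the column order so the positional examples match on any pandas version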
"""

    one     three   two
a   1.0     6       4
b   2.0     8       5
c   3.0     3       7
d   NaN     9       2
"""

########  Positional indexing: iloc  ################################

# Get one-dimensional data (a Series), row-wise
df.iloc[3]  # iloc indexes by position
"""
one      NaN
three    9.0
two      2.0
Name: d, dtype: float64
"""

# Get a single value
df.iloc[3,1]  # 9  (row 3, column 1)

# Slice a region; rows go left of the comma, columns right
df.iloc[0:3,1:]
"""
  three  two
a   6   4
b   8   5
c   3   7
"""

# Fancy indexing
df.iloc[[0,3],[0,2]]
"""
    one two
a   1.0 4
d   NaN 2
"""

####### Label indexing: column first, then row ############################

# Get one-dimensional data (a Series), column-wise
df['one']
"""
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
"""


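# Get a single value by label, column first, then row
df['two']['a'] # 4  (column two, row a)

# Note: to go row first, then column, never chain two plain brackets; use loc
df['b']['two']  # raises KeyError: 'b'
df.loc['b', 'two']  # 5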
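
As a small side-by-side sketch of row-first access, loc and iloc both accept a row selector and a column selector in a single pair of brackets. The frame below is a simplified, made-up version of the one used in this section.

```python
import pandas as pd

# Simplified illustrative frame (not the exact one from the section)
df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4, 5, 7]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b', 'two']      # row label, then column label
by_position = df.iloc[1, 1]        # row position, then column position
block = df.loc['a':'b', ['one', 'two']]   # label slices include the end label

print(by_label, by_position)   # 5 5
print(block.shape)             # (2, 2)
```

Using one bracket pair avoids the intermediate object that chained brackets create.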

###### Boolean indexing ###################################

df[df['one']>1]
"""
    one     three   two
b   2.0     8       5
c   3.0     3       7
"""

# isin([]) tests membership
df[df['two'].isin([2,4])]
"""
    one three   two
a   1.0 6   4
d   NaN 9   2
"""

# Every entry that fails the condition becomes a missing value
df[df>6] 
"""

    one     three   two
a   NaN     NaN     NaN
b   NaN     8.0     NaN
c   NaN     NaN     7.0
d   NaN     9.0     NaN
"""


"""
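Summary:
To select by position, use iloc
To select by label, go column first, then row; to go row first, then column, use loc
(A DataFrame resembles a database table, so selecting by column (field) is usually the more meaningful operation)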
"""

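Data alignment and missing data

DataFrame objects are also aligned during arithmetic: the row index and column index of the result are the unions of the two operands' row indexes and column indexes respectively.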
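
A minimal sketch of this alignment, using two small made-up frames that overlap in only one row label and one column label:

```python
import pandas as pd

# Illustrative frames: they share only row 'b' and column 'two'
df1 = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'two': [10, 20], 'three': [30, 40]}, index=['b', 'c'])

total = df1 + df2   # aligned on both axes; non-overlapping cells become NaN
print(sorted(total.index))      # ['a', 'b', 'c']
print(sorted(total.columns))    # ['one', 'three', 'two']
print(total.loc['b', 'two'])    # 14.0

# As with Series, the add method can treat a value missing on one side as 0
filled = df1.add(df2, fill_value=0)
print(filled.loc['a', 'one'])   # 1.0
print(filled.loc['c', 'two'])   # 20.0
```

Only the cell at row b, column two exists in both frames, so it is the only non-NaN cell in total.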

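Methods for handling missing data in a DataFrame:

isnull()
notnull()
dropna(axis=0, how='any',…)
- axis=0 is the row axis (the default); axis=1 is the column axis
- how='any' drops a row or column if it contains at least one NaN; how='all' drops it only if every value is NaN
fillna()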

import pandas as pd
import numpy as np

df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),
              'two':pd.Series([3,2,np.nan],index=['a','b','c'])})
"""

    one two
a   1   3.0
b   2   2.0
c   3   NaN
"""

df.dropna(how='any')
"""
    one two
a   1   3.0
b   2   2.0
"""

df.dropna(how='all')
"""
    one two
a   1   3.0
b   2   2.0
c   3   NaN
"""

df.dropna(axis=1, how='any')
"""
    one
a   1
b   2
c   3
"""

df.fillna(0)
"""
    one two
a   1   3.0
b   2   2.0
c   3   0.0
"""

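Other common methods (Series and DataFrame)

mean(axis=0, skipna=True) mean
- axis=0 takes the mean of each column (the default); axis=1 takes the mean of each row
- skipna defaults to True; with skipna=False, any NaN in the data makes the mean NaN, which is rarely useful
sum(axis=0)
sort_index(axis, …, ascending=True) sorts by row or column index
sort_values(by=column, axis=0, ascending=True) sorts by value
- by names the column to sort on (required for a DataFrame, and must be a column); ascending=False sorts in descending order
apply(func, axis=0) applies a custom function to each column or each row; func may return a scalar or a Series
applymap(func) applies a function to every element of a DataFrame (deprecated since pandas 2.1 in favour of DataFrame.map)
map(func) applies a function to every element of a Series
NumPy universal functions also work on pandas objects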
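
The methods listed above can be tried on a small frame. This is a sketch with made-up data; the variable names are chosen only for illustration.

```python
import pandas as pd

# Illustrative frame with one missing value and an unsorted index
df = pd.DataFrame({'one': [3.0, 1.0, None], 'two': [4, 6, 5]},
                  index=['c', 'a', 'b'])

col_means = df.mean()            # column means, NaN skipped: one=2.0, two=5.0
row_sums = df.sum(axis=1)        # row sums, NaN skipped: c=7.0, a=7.0, b=5.0
by_index = df.sort_index()       # rows ordered a, b, c
by_value = df.sort_values(by='two', ascending=False)   # two: 6, 5, 4

# apply receives each column as a Series; here it returns a scalar per column
col_range = df.apply(lambda col: col.max() - col.min())

# map works element by element on a Series
doubled = df['two'].map(lambda x: x * 2)

print(col_means.to_dict())       # {'one': 2.0, 'two': 5.0}
print(list(by_value.index))      # ['a', 'b', 'c']
print(doubled.tolist())          # [8, 12, 10]
```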

Hierarchical indexing

Hierarchical indexing is an important pandas feature that allows multiple index levels on a single axis.
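
As a rough sketch of what two index levels allow, the example below builds a small two-level Series with made-up values and selects from it by outer label and by a pair of labels.

```python
import numpy as np
import pandas as pd

# Illustrative data: outer level 'a'/'b', inner level 1..3
sr = pd.Series(np.arange(6),
               index=[['a', 'a', 'a', 'b', 'b', 'b'],
                      [1, 2, 3, 1, 2, 3]])

outer = sr.loc['a']          # all entries under outer label 'a', indexed by the inner level
single = sr.loc[('b', 2)]    # one value, addressed by (outer, inner)
wide = sr.unstack()          # inner level becomes columns: a 2 x 3 DataFrame

print(outer.tolist())        # [0, 1, 2]
print(single)                # 4
print(wide.shape)            # (2, 3)
```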

pd.Series(np.random.rand(9), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                                    [1,2,3,1,2,3,1,2,3]])
"""
a  1    0.869598
   2    0.411018
   3    0.489448
b  1    0.805159
   2    0.202590
   3    0.926205
c  1    0.779833
   2    0.860388
   3    0.906701
dtype: float64
"""

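Reading from and writing to files

Reading files

Reading: load data from a file name, a URL, or a file object

read_csv uses a comma as the default delimiter
read_table uses \t as the default delimiter
read_excel reads Excel files
For more formats, type pd.read_*? in IPython
Main parameters of the read functions:
sep the delimiter; a regular expression such as '\s+' is allowed
header=None states that the file has no header row, so the first line is not used as column names
names the column names; use this when the file has no header row, for example names=['username', 'phone', 'email']
index_col the column to use as the index; for example, index_col=0 uses column 0
skiprows the rows to skip
na_values strings to treat as missing values; for example, na_values=['None', 'null'] turns these strings into NaN
parse_dates whether to parse columns as dates; a boolean or a list of columns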
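
The parameters above can be combined in one call. The following is a sketch on a made-up, whitespace-separated text with no header row; the column names and contents are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical input: one junk line, then whitespace-separated records without a header
raw = ("# exported by a hypothetical tool\n"
       "alice 2020-01-05 None\n"
       "bob   2020-02-10 42\n")

df = pd.read_csv(io.StringIO(raw),
                 sep=r'\s+',                 # any run of whitespace separates fields
                 skiprows=1,                 # skip the first line
                 header=None,                # the data has no header row
                 names=['username', 'joined', 'score'],
                 index_col=0,                # use the first column as the index
                 na_values=['None', 'null'], # read these strings as NaN
                 parse_dates=['joined'])     # parse this column as dates

print(list(df.index))            # ['alice', 'bob']
print(df.loc['bob', 'score'])    # 42.0
```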

See the examples at the end of this article.

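Writing files

Writing to a file:

to_csv

Main parameters:

sep the delimiter
na_rep the string written in place of missing values; defaults to the empty string
header=False omits the header row
index=False omits the row-index column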
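Other file types:

JSON, XML, HTML, databases

Converting pandas objects to a binary file format (pickle):

to_pickle (replaces the save method of very old pandas versions)
read_pickle (replaces the load function of very old pandas versions)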
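
In current pandas the pickle round trip goes through DataFrame.to_pickle and pd.read_pickle. A minimal sketch, writing to a temporary directory; the file name is arbitrary.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3.5, None]}, index=['a', 'b'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')   # arbitrary file name for the demo
    df.to_pickle(path)                      # write the binary pickle file
    restored = pd.read_pickle(path)         # read it back

print(restored.equals(df))   # True
```

Unlike CSV, a pickle keeps dtypes and the index exactly, but it should only be loaded from trusted sources.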

Time objects

Kinds of time-series data:

Timestamp: a specific moment
Fixed period: for example, July 2017
Time interval: start time to end time

import datetime

a = datetime.datetime(2001, 1, 1, 0, 0)
b = datetime.datetime(2021, 3, 1, 0, 0)
c = b - a # datetime.timedelta(days=7364), a time-interval object
c.days  # 7364

Python standard library: datetime

date time datetime timedelta
strftime() format: converts an object to a string
strptime() parse: converts a string to an object
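
A minimal sketch of the two conversions, with a made-up date string and format:

```python
from datetime import datetime, timedelta

# strptime: string -> datetime, the format must describe the string exactly
parsed = datetime.strptime('2017-07-15 09:30', '%Y-%m-%d %H:%M')

# strftime: datetime -> string, in whatever layout is needed
text = parsed.strftime('%Y/%m/%d')

# timedelta: shift a datetime by a fixed interval
later = parsed + timedelta(days=20)

print(text)    # 2017/07/15
print(later)   # 2017-08-04 09:30:00
```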

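Third-party package: dateutil

It is installed together with pandas as a dependency and is easier to use: no format string is needed, and common English date formats are supported: dateutil.parser.parse()

import dateutil.parser

dateutil.parser.parse('2001-01-01')  
# datetime.datetime(2001, 1, 1, 0, 0), parsed into a datetime object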

dateutil.parser.parse('2001-01-01 09:30:00')
# datetime.datetime(2001, 1, 1, 9, 30)

dateutil.parser.parse('200/01/01')
# datetime.datetime(200, 1, 1, 0, 0)

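Working with time objects

Generating an array of time objects: date_range

start start time
end end time
periods number of periods to generate
freq frequency
- defaults to 'D' (day); other options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), … (newer pandas versions rename some of these aliases, for example 'h', 'min' and 's')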
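
The periods parameter is an alternative to end: give a start and a count. A small sketch with arbitrarily chosen dates:

```python
import pandas as pd

# Three consecutive days starting on 2017-07-01
days = pd.date_range(start='2017-07-01', periods=3, freq='D')

# Three weekly points; 'W' anchors on Sundays, so the first one is 2017-07-02
weeks = pd.date_range(start='2017-07-01', periods=3, freq='W')

print(list(days.strftime('%Y-%m-%d')))    # ['2017-07-01', '2017-07-02', '2017-07-03']
print(list(weeks.strftime('%Y-%m-%d')))   # ['2017-07-02', '2017-07-09', '2017-07-16']
```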

# Generate a date range; B means business days, so weekends are skipped automatically
pd.date_range('2017-07-15', '2017-07-30', freq='B')
"""
DatetimeIndex(['2017-07-17', '2017-07-18', '2017-07-19', '2017-07-20',
               '2017-07-21', '2017-07-24', '2017-07-25', '2017-07-26',
               '2017-07-27', '2017-07-28'],
              dtype='datetime64[ns]', freq='B')
"""

# Generate a date range; SM means semi-monthly intervals
pd.date_range('2017-05-01', '2017-07-01', freq='SM')
"""
DatetimeIndex(['2017-05-15', '2017-05-31', '2017-06-15', '2017-06-30'], dtype='datetime64[ns]', freq='SM-15')
"""

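Time series

Concept

A time series is a Series or DataFrame indexed by time objects.

When datetime objects are used as an index, they are stored in a DatetimeIndex.

Features

Slice by passing a year or a year-month
Slice by passing a date range

For example, take a file of historical stock data, 601318.csv, and read it in for processing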


# index_col='date' makes the date column the index (otherwise slicing by date range is not possible)
# parse_dates=['date'] parses the date column into time objects
df = pd.read_csv('601318.csv',index_col='date', parse_dates=['date'],  na_values=['None'])

# Slice by time range
df.loc['2013-11':'2014']
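
Because 601318.csv is not available here, the same slicing can be tried on synthetic data. This sketch builds a made-up daily series; only the DatetimeIndex matters for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the stock file
idx = pd.date_range('2013-10-01', '2015-01-31', freq='D')
ts = pd.DataFrame({'close': np.arange(len(idx), dtype=float)}, index=idx)

one_month = ts.loc['2013-11']          # every row in November 2013
span = ts.loc['2013-11':'2014']        # from 2013-11-01 through the end of 2014

print(len(one_month))                  # 30
print(span.index[0].date(), span.index[-1].date())   # 2013-11-01 2014-12-31
```

A partial date string such as '2014' expands to the whole period it names, which is why the slice runs to December 31.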

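Exercises

Select the dates of all bullish candles (close above open) in 601318.csv

df = pd.read_csv('601318.csv',
                 index_col='date', # use the date column as the index
                 parse_dates=['date'],
                 na_values=['None'])
df[df['close'] > df['open']].index  # keep the rows whose close is above the open, then take the date index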
"""
DatetimeIndex(['2007-03-06', '2007-03-07', '2007-03-13', '2007-03-14',
               '2007-03-16', '2007-03-26', '2007-03-27', '2007-03-28',
               ...
               '2017-07-19', '2017-07-20', '2017-07-24', '2017-07-27',
               '2017-07-31', '2017-08-01'],
              dtype='datetime64[ns]', name='date', length=1228, freq=None)
"""

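# The result can also be converted to an array with .values, or to a list with .tolist()

Add two columns holding the 5-day and 10-day moving averages

df['ma5'] = np.nan  # initialise to NaN
df['ma10'] = np.nan

for i in range(4, len(df)):
    df.loc[df.index[i], 'ma5'] = df['close'].iloc[i-4:i+1].mean()
    # mean of the closing price from 4 days earlier through the current day, stored as that day's ma5

for i in range(9, len(df)):  # a 10-day window is first complete at position 9
    df.loc[df.index[i], 'ma10'] = df['close'].iloc[i-9:i+1].mean()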
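
The explicit loops make the window visible, but pandas also has a rolling-window method that computes the same averages in one call. This is an alternative sketch, not the article's method, shown on a made-up closing-price series:

```python
import pandas as pd

# Made-up closing prices: 1.0, 2.0, ..., 12.0
close = pd.Series([float(x) for x in range(1, 13)])

ma5 = close.rolling(5).mean()     # NaN until 5 values are available
ma10 = close.rolling(10).mean()   # NaN until 10 values are available

print(ma5.iloc[4])    # 3.0  (mean of 1..5)
print(ma10.iloc[9])   # 5.5  (mean of 1..10)
```

On the real data this would read df['ma5'] = df['close'].rolling(5).mean(), assuming the close column used above.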


Select all golden-cross and death-cross dates


golden_cross = []
death_cross = []

sr = df['ma10'] >= df['ma5']

""" 
...
2017-07-03    False
2017-07-04     True
2017-07-05     True
2017-07-06    False
2017-07-07    False
2017-07-10    False
...
T --> F : golden cross (ma5 rises above ma10)
F --> T : death cross (ma5 falls to or below ma10)
"""

for i in range(1, len(sr)):
    if sr.iloc[i] == True and sr.iloc[i-1] == False:
        death_cross.append(sr.index[i])

    if sr.iloc[i] == False and sr.iloc[i-1] == True:
        golden_cross.append(sr.index[i])
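
The same transitions can be found without a loop by comparing each flag with the previous one. This is a sketch of that alternative on made-up moving averages; the values are chosen so that one golden cross and one death cross occur.

```python
import pandas as pd

# Made-up moving averages on six business days
idx = pd.date_range('2017-07-03', periods=6, freq='B')
ma5 = pd.Series([1.0, 1.0, 1.0, 3.0, 3.0, 1.0], index=idx)
ma10 = pd.Series([2.0] * 6, index=idx)

above = ma10 >= ma5                   # True, True, True, False, False, True
change = above.astype(int).diff()     # -1 marks True -> False, +1 marks False -> True

golden_cross = change[change == -1].index   # ma5 rises above ma10
death_cross = change[change == 1].index     # ma5 falls to or below ma10

print(list(golden_cross.strftime('%Y-%m-%d')))   # ['2017-07-06']
print(list(death_cross.strftime('%Y-%m-%d')))    # ['2017-07-10']
```

The first element of change is NaN, so the first day can never be reported as a cross, which matches the loop starting at position 1.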
