
Pandas Basics 2.1 | Python Study Notes

程序员文章站 2022-05-26 21:40:32
import numpy as np
import pandas as pd
df = pd.read_csv('./data/table.csv',index_col='ID')
df
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101           0     S_1    C_1       M  street_1     173      63  34.0       A+
1102           1     S_1    C_1       F  street_2     192      73  32.5       B+
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1104           3     S_1    C_1       F  street_2     167      81  80.4       B-
1105           4     S_1    C_1       F  street_4     159      64  84.8       B+
1201           5     S_1    C_2       M  street_5     188      68  97.0       A-
1202           6     S_1    C_2       F  street_4     176      94  63.5       B-
1203           7     S_1    C_2       M  street_6     160      53  58.8       A+
1204           8     S_1    C_2       F  street_5     162      63  33.8        B
1205           9     S_1    C_2       F  street_6     167      63  68.4       B-
1301          10     S_1    C_3       M  street_4     161      68  31.5       B+

I. Single-Level Indexing

  • The three most common indexers: iloc (position-based indexing); loc (label-based indexing); and the [] operator
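
As a quick orientation, the three indexers can be compared on a tiny table. This is only a minimal sketch: the `demo` DataFrame is a hypothetical three-row stand-in for table.csv, built by hand from values in the first rows shown above.

```python
import pandas as pd

# Hypothetical three-row stand-in for table.csv (not the real file)
demo = pd.DataFrame(
    {"Gender": ["M", "F", "M"], "Height": [173, 192, 186]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

by_label = demo.loc[1102]      # loc: the row whose label is 1102
by_position = demo.iloc[1]     # iloc: the row at position 1 (the second row)
column = demo["Height"]        # []: on a DataFrame, a single key selects a column

print(by_label.equals(by_position))  # True
print(column.tolist())               # [173, 192, 186]
```

Here the label 1102 happens to sit at position 1, so loc and iloc return the same row.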

The loc method (note: every slice used in loc includes its right endpoint)

Single-row indexing:

df.loc[1103]
Unnamed: 0           2
School             S_1
Class              C_1
Gender               M
Address       street_2
Height             186
Weight              82
Math              87.2
Physics             B+
Name: 1103, dtype: object

Multi-row indexing:

df.loc[[1103,1104]]
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1104           3     S_1    C_1       F  street_2     167      81  80.4       B-
df.loc[2402:].head(5)  # every row from label 2402 onward
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
2402          31     S_2    C_4       M  street_7     166      82  48.7        B
2403          32     S_2    C_4       F  street_6     158      60  59.7       B+
2404          33     S_2    C_4       F  street_2     160      84  67.7        B
2405          34     S_2    C_4       F  street_6     193      54  47.6        B
df.loc[2402:2304:-1].head(5)  # walk backward from 2402; loc includes the endpoint 2304
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
2402          31     S_2    C_4       M  street_7     166      82  48.7        B
2401          30     S_2    C_4       F  street_2     192      62  45.3        A
2305          29     S_2    C_3       M  street_4     187      73  48.9        B
2304          28     S_2    C_3       F  street_6     164      81  95.5       A-
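  • Note: every slice used in loc includes its right endpoint.
    As a pandas user, you rarely care which label comes right after the last one you want. If slices were closed on the left and open on the right, you would first need to know the name of the following label, which is inconvenient.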
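
The endpoint rule is easiest to see side by side with iloc. The following is a minimal sketch on a hypothetical four-row frame (`demo` is made up for illustration and is not the real table.csv):

```python
import pandas as pd

# Hypothetical four-row stand-in for table.csv
demo = pd.DataFrame(
    {"Height": [173, 192, 186, 167],
     "Weight": [63, 73, 82, 81],
     "Math": [34.0, 32.5, 87.2, 80.4]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# Label slice: both endpoints, 1102 and 1104, are returned
label_rows = demo.loc[1102:1104].index.tolist()

# Position slice: position 3 (the right endpoint) is excluded
position_rows = demo.iloc[1:3].index.tolist()

# Column label slices are inclusive as well
label_cols = demo.loc[:, "Height":"Weight"].columns.tolist()

# A negative step walks backward and still includes the final label
reversed_rows = demo.loc[1104:1102:-1].index.tolist()

print(label_rows)      # [1102, 1103, 1104]
print(position_rows)   # [1102, 1103]
print(label_cols)      # ['Height', 'Weight']
print(reversed_rows)   # [1104, 1103, 1102]
```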

Single-column indexing:

df.loc[:,'Height'].head()
ID
1101    173
1102    192
1103    186
1104    167
1105    159
Name: Height, dtype: int64

Multi-column indexing:

df.loc[1201:2405,['Math','Physics']].head(5)
      Math  Physics
ID
1201  97.0       A-
1202  63.5       B-
1203  58.8       A+
1204  33.8        B
1205  68.4       B-
df.loc[:,'Gender':'Weight'].head()
      Gender   Address  Height  Weight
ID
1101       M  street_1     173      63
1102       F  street_2     192      73
1103       M  street_2     186      82
1104       F  street_2     167      81
1105       F  street_4     159      64

Combined row and column indexing:

df.loc[1101:2405:4,'Address':'Math'].head()
       Address  Height  Weight  Math
ID
1101  street_1     173      63  34.0
1105  street_4     159      64  84.8
1204  street_5     162      63  33.8
1303  street_7     188      82  49.7
2102  street_6     161      61  50.6

Function-based indexing:

  • lambda: an anonymous function

g = lambda x: x+1

def g(x): return x+1

The two definitions are equivalent; lambda is simply a shorter way to write a function definition.

df.loc[lambda x:x['Height'] >170 ].head()
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101           0     S_1    C_1       M  street_1     173      63  34.0       A+
1102           1     S_1    C_1       F  street_2     192      73  32.5       B+
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1201           5     S_1    C_2       M  street_5     188      68  97.0       A-
1202           6     S_1    C_2       F  street_4     176      94  63.5       B-

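loc accepts a function. The function receives the whole table as its input, and its return value must be a scalar, a slice, a valid list (every element appears in the index), or a valid index.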

def f(x):
    return [1101,1202]
df.loc[f].head()
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101           0     S_1    C_1       M  street_1     173      63  34.0       A+
1202           6     S_1    C_2       F  street_4     176      94  63.5       B-
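
To make the allowed return types concrete, here is a minimal sketch with one callable per kind of result. The `demo` frame is a hypothetical stand-in for table.csv, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical four-row stand-in for table.csv
demo = pd.DataFrame(
    {"Gender": ["M", "F", "M", "F"], "Height": [173, 192, 186, 167]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# Each callable receives the whole DataFrame as its argument x
as_mask = demo.loc[lambda x: x["Height"] > 170]    # returns a Boolean Series
as_list = demo.loc[lambda x: [1101, 1104]]         # returns a list of existing labels
as_slice = demo.loc[lambda x: slice(1102, 1103)]   # returns a label slice (inclusive)
as_scalar = demo.loc[lambda x: x.index[0]]         # returns a single label

print(as_mask.index.tolist())    # [1101, 1102, 1103]
print(as_list.index.tolist())    # [1101, 1104]
print(as_slice.index.tolist())   # [1102, 1103]
print(as_scalar["Height"])       # 173
```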

Boolean indexing:

df_1 = df['Gender'].isin(['M'])
df_1.head()
ID
1101     True
1102    False
1103     True
1104    False
1105    False
Name: Gender, dtype: bool
df.loc[df_1].head()
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101           0     S_1    C_1       M  street_1     173      63  34.0       A+
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1201           5     S_1    C_2       M  street_5     188      68  97.0       A-
1203           7     S_1    C_2       M  street_6     160      53  58.8       A+
1301          10     S_1    C_3       M  street_4     161      68  31.5       B+
df_2 = [i[-1] in ('4', '7') for i in df['Address'].values]
# df_2 is a plain Python list of Booleans
df_2
[False,
 False,
 False,
 False,
 True,
  ...]
df.loc[df_2].head()
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1105           4     S_1    C_1       F  street_4     159      64  84.8       B+
1202           6     S_1    C_2       F  street_4     176      94  63.5       B-
1301          10     S_1    C_3       M  street_4     161      68  31.5       B+
1303          12     S_1    C_3       M  street_7     188      82  49.7        B
2101          15     S_2    C_1       M  street_7     174      84  83.3        C

When a list is passed to loc, it must be either a Boolean list or a list whose elements all appear in the index.
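
A minimal sketch of both accepted list forms, plus the failure case. It assumes pandas 1.0 or later, where a label list containing a missing label raises KeyError; `demo` and the label 9999 are hypothetical.

```python
import pandas as pd

# Hypothetical three-row stand-in for table.csv
demo = pd.DataFrame(
    {"Address": ["street_1", "street_2", "street_4"]},
    index=pd.Index([1101, 1102, 1105], name="ID"),
)

# Form 1: a Boolean list with one entry per row
mask = [addr[-1] in ("4", "7") for addr in demo["Address"]]
picked_by_mask = demo.loc[mask].index.tolist()

# Form 2: a list made up only of labels that exist in the index
picked_by_labels = demo.loc[[1101, 1105]].index.tolist()

# 9999 is not in the index, so this lookup is rejected (assumes pandas >= 1.0)
try:
    demo.loc[[1101, 9999]]
    missing_label_rejected = False
except KeyError:
    missing_label_rejected = True

print(picked_by_mask)          # [1105]
print(picked_by_labels)        # [1101, 1105]
print(missing_label_rejected)  # True
```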

The iloc method (slices exclude the right endpoint)

Single-row indexing:

df.iloc[-1]
Unnamed: 0          34
School             S_2
Class              C_4
Gender               F
Address       street_6
Height             193
Weight              54
Math              47.6
Physics              B
Name: 2405, dtype: object

Multi-row indexing:

df.iloc[0:10:2]
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1101           0     S_1    C_1       M  street_1     173      63  34.0       A+
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1105           4     S_1    C_1       F  street_4     159      64  84.8       B+
1202           6     S_1    C_2       F  street_4     176      94  63.5       B-
1204           8     S_1    C_2       F  street_5     162      63  33.8        B

Single-column indexing:

df.iloc[:,-1].head()
ID
1101    A+
1102    B+
1103    B+
1104    B-
1105    B+
Name: Physics, dtype: object

Multi-column indexing:

df.iloc[:,-1::-2].head()
      Physics  Weight   Address  Class  Unnamed: 0
ID
1101       A+      63  street_1    C_1           0
1102       B+      73  street_2    C_1           1
1103       B+      82  street_2    C_1           2
1104       B-      81  street_2    C_1           3
1105       B+      64  street_4    C_1           4

Mixed indexing:

df.iloc[3::4,-1::-3].head()
      Physics  Height  Class
ID
1104       B-     167    C_1
1203       A+     160    C_2
1302       A-     175    C_3
2101        C     174    C_1
2105        A     170    C_1
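
The negative-step column slices used above can be read off from the column order alone. A minimal sketch, assuming a one-row hypothetical frame that reuses the column names of the table:

```python
import pandas as pd

# Column names copied from the table above; the single row only gives the frame a shape
columns = ["Unnamed: 0", "School", "Class", "Gender", "Address",
           "Height", "Weight", "Math", "Physics"]
demo = pd.DataFrame(
    [[0, "S_1", "C_1", "M", "street_1", 173, 63, 34.0, "A+"]],
    columns=columns,
)

# -1::-2 starts at the last column and takes every second column moving left
every_second = demo.iloc[:, -1::-2].columns.tolist()

# -1::-3 starts at the last column and takes every third column moving left
every_third = demo.iloc[:, -1::-3].columns.tolist()

print(every_second)  # ['Physics', 'Weight', 'Address', 'Class', 'Unnamed: 0']
print(every_third)   # ['Physics', 'Height', 'Class']
```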

Function-based indexing:

df.iloc[lambda x:[-3],-1::-2].head()
      Physics  Weight   Address  Class  Unnamed: 0
ID
2403       B+      60  street_6    C_4          32

iloc accepts only integers, integer lists, integer slices, or Boolean lists. It does not accept a Boolean Series; to use one, extract its values first.

df_3 = (df['Address']=='street_2').values
df_3
array([False,  True,  True,  True, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False,  True, False])
df.iloc[df_3].head()
      Unnamed: 0  School  Class  Gender   Address  Height  Weight  Math  Physics
ID
1102           1     S_1    C_1       F  street_2     192      73  32.5       B+
1103           2     S_1    C_1       M  street_2     186      82  87.2       B+
1104           3     S_1    C_1       F  street_2     167      81  80.4       B-
1304          13     S_1    C_3       M  street_2     195      70  85.2        A
2401          30     S_2    C_4       F  street_2     192      62  45.3        A
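
For contrast, this sketch shows what happens when the Boolean Series itself is passed to iloc. The `demo` frame is hypothetical, and the exception type caught here is an assumption based on recent pandas releases, so both likely types are handled.

```python
import pandas as pd

# Hypothetical three-row stand-in for table.csv
demo = pd.DataFrame(
    {"Address": ["street_1", "street_2", "street_2"]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)
mask = demo["Address"] == "street_2"   # a Boolean Series indexed by ID

# iloc refuses the Series; which exception is raised may depend on the pandas version
try:
    demo.iloc[mask]
    series_rejected = False
except (NotImplementedError, ValueError):
    series_rejected = True

# The underlying Boolean array carries no index, so iloc accepts it
rows = demo.iloc[mask.to_numpy()].index.tolist()

print(series_rejected)  # True
print(rows)             # [1102, 1103]
```

`mask.to_numpy()` is used here in place of `.values`; both return the underlying NumPy array.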

The [] operator

[] on a Series

Single-element indexing:

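# df['*'] is a Series, so passing it as data also passes its index; if a different index is then supplied, automatic alignment (the supplied index wins) turns the values into NaN
# pass df['*'].tolist() or df['*'].values instead; with a bare df['*'] it is ambiguous whether the index or the values of Math are meant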
s = pd.Series(df['Math'].values,index = df['Address'])
s['street_2']
street_2    32.5
street_2    87.2
street_2    80.4
street_2    85.2
street_2    45.3
street_2    67.7
dtype: float64
m = pd.Series(df['Math'],index=df.index)
m[2105]
34.2
m[0:4]
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64
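
The alignment pitfall described in the comments above can be reproduced on a small frame. A minimal sketch, with `demo` as a hypothetical stand-in for table.csv:

```python
import pandas as pd

# Hypothetical three-row stand-in for table.csv
demo = pd.DataFrame(
    {"Address": ["street_1", "street_2", "street_2"], "Math": [34.0, 32.5, 87.2]},
    index=pd.Index([1101, 1102, 1103], name="ID"),
)

# Passing the Series itself: its ID labels are aligned against the street names,
# nothing matches, and every value becomes NaN
aligned = pd.Series(demo["Math"], index=demo["Address"])

# Passing the bare values: no alignment happens, the values are simply relabeled
relabeled = pd.Series(demo["Math"].values, index=demo["Address"])

print(aligned.isna().all())            # True
print(relabeled["street_2"].tolist())  # [32.5, 87.2]
```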

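Function-based indexing:

# lambda x: x.index[16::-6] slices the index by absolute position and returns the labels found there
# lambda x: 16::-6 would be an element slice (not valid Python as written; the slice object would be slice(16, None, -6))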
m[lambda x: x.index[16::-6]]
ID
2102    50.6
1301    31.5
1105    84.8
Name: Math, dtype: float64
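
The callable above does two things at once, which can be split into separate steps. This is a minimal sketch on a hypothetical seven-element Series (`m_demo`); `.loc` is used instead of `[]` only to make the label lookup explicit.

```python
import pandas as pd

# Hypothetical seven-element stand-in for the Math column
m_demo = pd.Series(
    [34.0, 32.5, 87.2, 80.4, 84.8, 97.0, 63.5],
    index=pd.Index([1101, 1102, 1103, 1104, 1105, 1201, 1202], name="ID"),
    name="Math",
)

# Step 1: slice the index by position, giving the labels at positions 6, 3, 0
labels = m_demo.index[6::-3]

# Step 2: look those labels up in the Series
picked = m_demo.loc[labels]

# The callable form performs both steps in one expression
picked_by_callable = m_demo.loc[lambda x: x.index[6::-3]]

print(labels.tolist())                    # [1202, 1104, 1101]
print(picked.tolist())                    # [63.5, 80.4, 34.0]
print(picked.equals(picked_by_callable))  # True
```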

Boolean indexing:

m>80
ID
1101    False
1102    False
1103     True
1104     True
1105     True


Name: Math, dtype: bool

m[m>80]
ID
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1302    87.7
1304    85.2
2101    83.3
2205    85.4
2304    95.5
Name: Math, dtype: float64

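Note: on a Series with a float index, slicing with [] compares against the index values rather than positions, so avoid the [] operator where possible when the row index is float.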

s_int = pd.Series([1,2,3,4],index = [1,3,5,6])
s_float = pd.Series([1,2,3,4],index=[1.,3.,5.,6.])
s_int
1    1
3    2
5    3
6    4
dtype: int64
s_float[2:]  # 2 is treated as a label value
3.0    2
5.0    3
6.0    4
dtype: int64
s_int[2:]  # 2 is treated as a position
5    3
6    4
dtype: int64
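
One way to sidestep the ambiguity is to state the intent explicitly with loc or iloc. A minimal sketch reusing the two Series defined above:

```python
import pandas as pd

s_int = pd.Series([1, 2, 3, 4], index=[1, 3, 5, 6])
s_float = pd.Series([1, 2, 3, 4], index=[1.0, 3.0, 5.0, 6.0])

# loc always compares against labels: keep every label >= 2
int_by_label = s_int.loc[2:].index.tolist()
float_by_label = s_float.loc[2:].index.tolist()

# iloc always counts positions: skip the first two elements
int_by_position = s_int.iloc[2:].index.tolist()
float_by_position = s_float.iloc[2:].index.tolist()

print(int_by_label)       # [3, 5, 6]
print(float_by_label)     # [3.0, 5.0, 6.0]
print(int_by_position)    # [5, 6]
print(float_by_position)  # [5.0, 6.0]
```

With loc and iloc the result no longer depends on whether the index happens to be integer or float.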