
Pandas - Data Wrangling

程序员文章站 2022-03-12 17:34:08


1 Data Wrangling - Combining Datasets

1.1 merge: inner, outer, left, and right joins

  • merge performs an "inner" join by default, keeping only the intersection of the keys and discarding everything else
  • if no join column is specified, merge uses the overlapping column names as the keys; it is best to specify one explicitly with on
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from numpy import nan as NA
df1 = DataFrame({'key':['b','b','a','c','a','a','b'],
               'data1':range(7)})
df2 = DataFrame({'key':list('abd'),
               'data2':range(3)})
pd.merge(df1,df2)
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
pd.merge(df1,df2,on='key')  # how='inner' by default
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
pd.merge(df1,df2,on='key',how='outer')
       data1 key  data2
    0    0.0   b    1.0
    1    1.0   b    1.0
    2    6.0   b    1.0
    3    2.0   a    0.0
    4    4.0   a    0.0
    5    5.0   a    0.0
    6    3.0   c    NaN
    7    NaN   d    2.0
  • how='left': keep every row of the left dataset; matching values from the right are joined in, and unmatched ones are filled with NaN
  • how='right': keep every row of the right dataset; matching values from the left are joined in, and unmatched ones are filled with NaN
pd.merge(df1,df2,on='key',how='left')
       data1 key  data2
    0      0   b    1.0
    1      1   b    1.0
    2      2   a    0.0
    3      3   c    NaN
    4      4   a    0.0
    5      5   a    0.0
    6      6   b    1.0
pd.merge(df1,df2,on='key',how='right')
       data1 key  data2
    0    0.0   b      1
    1    1.0   b      1
    2    6.0   b      1
    3    2.0   a      0
    4    4.0   a      0
    5    5.0   a      0
    6    NaN   d      2

If the two objects' key columns have different names, they can be specified separately:

df3 = DataFrame({'lkey':['b','b','a','c','a','a','b'],
               'data1':range(7)})
df4 = DataFrame({'rkey':list('abd'),
               'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')
       data1 lkey  data2 rkey
    0      0    b      1    b
    1      1    b      1    b
    2      6    b      1    b
    3      2    a      0    a
    4      4    a      0    a
    5      5    a      0    a

Summary of the merge method

'''
pd.merge(df1, df2, on/left_on, right_on, how='')
on:
    the column name(s) to join on; if omitted, the overlapping column names are used as keys
left_on, right_on:
    use these to specify the join columns when the two frames have no common column name
how='':
    (1) inner    inner join   intersection of keys (the default)
    (2) outer    outer join   union of keys
    (3) left     left join    use df1 as the primary "table"
    (4) right    right join   use df2 as the primary "table"

'''
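One case the summary above does not cover: when the two frames share non-key column names, merge disambiguates them with suffixes (`_x`/`_y` by default, configurable via the `suffixes` parameter). A minimal sketch with hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'value': [10, 20]})

# The overlapping non-key column 'value' gets a suffix on both sides.
merged = pd.merge(left, right, on='key', suffixes=('_left', '_right'))
print(merged.columns.tolist())  # ['key', 'value_left', 'value_right']
```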

1.2 Concatenating Series

1.2.1 concat()

  • concat joins along the vertical axis by default (axis=0), stacking the Series end to end
  • with axis=1, the Series are joined side by side from left to right; the row index is the union of all the input Series' indexes, and the result is a DataFrame
  • when concatenating, you can pass keys to attach a label to each piece of the result
s1 = Series([0,1],index=["a","b"])
s2 = Series([2,3,4],index=list("cde"))
s3 = Series([5,6],index=list("fg"))
pd.concat([s1,s2,s3])  # axis=0 by default
    a    0
    b    1
    c    2
    d    3
    e    4
    f    5
    g    6
    dtype: int64
# join the Series horizontally, side by side
pd.concat([s1,s2,s3],axis=1)
         0    1    2
    a  0.0  NaN  NaN
    b  1.0  NaN  NaN
    c  NaN  2.0  NaN
    d  NaN  3.0  NaN
    e  NaN  4.0  NaN
    f  NaN  NaN  5.0
    g  NaN  NaN  6.0
  • pass join='inner' to concat to keep only the intersection of the joined indexes
s4 = pd.concat([s1*5,s3])
s4
    a    0
    b    5
    f    5
    g    6
    dtype: int64
pd.concat([s1,s4],axis=1)
         0  1
    a  0.0  0
    b  1.0  5
    f  NaN  5
    g  NaN  6
# join='inner' keeps only the intersection
pd.concat([s1,s4],axis=1,join='inner')
       0  1
    a  0  0
    b  1  5
  • passing a list as keys creates a hierarchical index along the concatenation axis
# create a hierarchical index with keys
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
    one    a    0
           b    1
    two    c    2
           d    3
           e    4
    three  f    5
           g    6
    dtype: int64
result.unstack()
             a    b    c    d    e    f    g
    one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
    two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
    three  NaN  NaN  NaN  NaN  NaN  5.0  6.0
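When concatenating along axis=1, keys label the columns instead of creating outer index levels. A small sketch (s1 and s2 are rebuilt here so the snippet is self-contained):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# Along axis=1, keys become the column labels of the resulting DataFrame.
df = pd.concat([s1, s2], axis=1, keys=['one', 'two'])
print(df.columns.tolist())  # ['one', 'two']
print(df.shape)             # (5, 2)
```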

1.2.2 Combining overlapping data: combine_first & append

a = Series([NA,2.5,NA,3.5,4.5,NA],index=list("fedcba"))
b = Series(np.arange(len(a)),dtype=np.float64,index=list("fedcba"))
print(a,b)
b.iloc[-1] = np.nan            # set the last element ('a') to NaN; positional b[-1] is deprecated
b[:-2].combine_first(a[2:])    # fill the holes in b[:-2] with values from a[2:]
  a:
    f    NaN
    e    2.5
    d    NaN
    c    3.5
    b    4.5
    a    NaN
  b:
    f    0.0
    e    1.0
    d    2.0
    c    3.0
    b    4.0
    a    5.0


    a    NaN
    b    4.5
    c    3.0
    d    2.0
    e    1.0
    f    0.0
    dtype: float64
a.append(b)  # Series.append was removed in pandas 2.0; use pd.concat([a, b]) instead
    f    NaN
    e    2.5
    d    NaN
    c    3.5
    b    4.5
    a    NaN
    f    0.0
    e    1.0
    d    2.0
    c    3.0
    b    4.0
    a    NaN
    dtype: float64
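combine_first works on DataFrames as well, patching NaNs cell by cell after aligning on both the index and the columns. A minimal sketch with made-up frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
df2 = pd.DataFrame({'a': [5.0, 6.0], 'b': [7.0, 8.0]})

# Every NaN in df1 is replaced by the value at the same position in df2.
patched = df1.combine_first(df2)
print(patched['a'].tolist(), patched['b'].tolist())  # [1.0, 6.0] [7.0, 4.0]
```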

2 Data Wrangling - Reshaping and Pivoting

2.1 Hierarchical indexing

Hierarchical indexing is an important feature of pandas: it lets you have multiple (two or more)
index levels on a single axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.

# hierarchical index
data =Series(np.random.randn(10),
             index=[list("aaabbbccdd"),[1,2,3,1,2,3,1,2,2,3]])
data
    a  1   -0.467731
       2   -1.288590
       3   -2.002361
    b  1   -0.075169
       2   -0.666990
       3    0.725769
    c  1    0.583614
       2    0.866867
    d  2    0.424541
       3    0.888412
    dtype: float64
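A hierarchical index also enables concise partial selection. Here is a sketch that uses fixed values instead of random ones so the results are reproducible:

```python
import pandas as pd

data = pd.Series(range(6), index=[list('aabbcc'), [1, 2] * 3])

# select the outer label 'b' -> a Series indexed by the inner level
print(data['b'].tolist())       # [2, 3]
# select inner level 1 across all outer labels
print(data.loc[:, 1].tolist())  # [0, 2, 4]
```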

2.2 Reshaping with hierarchical indexing

By default, unstack operates on the innermost level (so does stack). Pass a level number or name to unstack a different level.

# reshaping the hierarchical index
data.unstack()
              1         2         3
    a -0.467731 -1.288590 -2.002361
    b -0.075169 -0.666990  0.725769
    c  0.583614  0.866867       NaN
    d       NaN  0.424541  0.888412
data.unstack(0)
              a         b         c         d
    1 -0.467731 -0.075169  0.583614       NaN
    2 -1.288590 -0.666990  0.866867  0.424541
    3 -2.002361  0.725769       NaN  0.888412
data = DataFrame(np.arange(6).reshape(2,3),
                 index=pd.Index(["Ohio","Colorado"],name="state"),
                 columns=pd.Index(["one","two","three"],name="numbers"))
data
    numbers   one  two  three
    state
    Ohio        0    1      2
    Colorado    3    4      5
# pivot the columns into rows
result = data.stack()
result
    state     numbers
    Ohio      one        0
              two        1
              three      2
    Colorado  one        3
              two        4
              three      5
    dtype: int32
# partial indexing on the outer level
result['Ohio']
    numbers
    one      0
    two      1
    three    2
    dtype: int32
# pivot the rows back into columns
result.unstack('state')
    state    Ohio  Colorado
    numbers
    one         0         3
    two         1         4
    three       2         5
df = DataFrame({'left':result,'right':result+5},
              columns=pd.Index(['left','right'],
                              name='side'))
df
    side              left  right
    state    numbers
    Ohio     one         0      5
             two         1      6
             three       2      7
    Colorado one         3      8
             two         4      9
             three       5     10
df.index
    MultiIndex(levels=[['Ohio', 'Colorado'], ['one', 'two', 'three']],
               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
               names=['state', 'numbers'])
temp = df.unstack('state')
temp
    side     left           right
    state    Ohio Colorado   Ohio Colorado
    numbers
    one         0        3      5        8
    two         1        4      6        9
    three       2        5      7       10
temp.left
    state    Ohio  Colorado
    numbers
    one         0         3
    two         1         4
    three       2         5
temp.stack('side')  # the stacked level becomes the innermost index level
    state           Colorado  Ohio
    numbers side
    one     left           3     0
            right          8     5
    two     left           4     1
            right          9     6
    three   left           5     2
            right         10     7
  • unstack may introduce missing data if some level values are absent from a given group
  • stack filters out missing data by default, so the round trip is reversible
s1 = Series([0,1,2,3],index=list('abcd'))
s2 = Series([4,5,6],index=list('cde'))
data2 = pd.concat([s1,s2],keys=['one','two'])
data2.unstack()
           a    b    c    d    e
    one  0.0  1.0  2.0  3.0  NaN
    two  NaN  NaN  4.0  5.0  6.0
data2.unstack().stack()
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
    two  c    4.0
         d    5.0
         e    6.0
    dtype: float64
data2.unstack().stack(dropna=False)
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    NaN
    two  a    NaN
         b    NaN
         c    4.0
         d    5.0
         e    6.0
    dtype: float64

3 Data Wrangling - Data Transformation

3.1 Removing duplicate data

data = DataFrame({'k1':['one']*3+['two']*4,
                'k2':[1,1,2,3,3,4,4]})
data
        k1  k2
    0  one   1
    1  one   1
    2  one   2
    3  two   3
    4  two   3
    5  two   4
    6  two   4

3.1.1 The duplicated() method

The DataFrame duplicated method returns a boolean Series indicating whether each row is a duplicate of one seen earlier:

data.duplicated()
    0    False
    1     True
    2    False
    3    False
    4     True
    5    False
    6     True
    dtype: bool

3.1.2 The drop_duplicates() method

The drop_duplicates method returns a DataFrame with the duplicate rows removed:

data.drop_duplicates()
        k1  k2
    0  one   1
    2  one   2
    3  two   3
    5  two   4
data['v1']=range(7)
data
        k1  k2  v1
    0  one   1   0
    1  one   1   1
    2  one   2   2
    3  two   3   3
    4  two   3   4
    5  two   4   5
    6  two   4   6
data.drop_duplicates(['k1'])
        k1  k2  v1
    0  one   1   0
    3  two   3   3
# drop_duplicates keeps the first occurrence of each combination by default; keep='first'/'last'
data.drop_duplicates(['k1','k2'],keep='last')
        k1  k2  v1
    1  one   1   1
    2  one   2   2
    4  two   3   4
    6  two   4   6
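Besides 'first' and 'last', duplicated also accepts keep=False, which flags every member of a duplicated group rather than only the repeats. A sketch reusing the same toy data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks ALL rows of each duplicated group,
# which is handy for inspecting duplicates before dropping them.
mask = data.duplicated(['k1', 'k2'], keep=False)
print(data[mask].index.tolist())  # [0, 1, 3, 4, 5, 6]
```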

3.2 Transforming data with functions and mappings

foods = DataFrame({"food":["bacon","pulled pork","bacon","Pastrami",
                           "corned beef","Bacon","pastrami","honey ham",
                           "nova lox"],
                   "ounces":[4,3,12,6,7.5,8,3,5,6]})
meat_to_animal = {"bacon":"pig",
                  "pulled pork":"pig",
                  "pastrami":"cow",
                 "corned beef":"cow",
                 "honey ham":"pig",
                 "nova lox":"salmon"}
foods
              food  ounces
    0        bacon     4.0
    1  pulled pork     3.0
    2        bacon    12.0
    3     Pastrami     6.0
    4  corned beef     7.5
    5        Bacon     8.0
    6     pastrami     3.0
    7    honey ham     5.0
    8     nova lox     6.0

1. First write a mapping.
2. Then apply it with the map method.

foods['animal'] = foods['food'].map(str.lower).map(meat_to_animal)
foods
              food  ounces  animal
    0        bacon     4.0     pig
    1  pulled pork     3.0     pig
    2        bacon    12.0     pig
    3     Pastrami     6.0     cow
    4  corned beef     7.5     cow
    5        Bacon     8.0     pig
    6     pastrami     3.0     cow
    7    honey ham     5.0     pig
    8     nova lox     6.0  salmon

Alternatively, pass a single function that does the whole job:

foods['food'].map(lambda x : meat_to_animal[x.lower()])
    0       pig
    1       pig
    2       pig
    3       cow
    4       cow
    5       pig
    6       cow
    7       pig
    8    salmon
    Name: food, dtype: object
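An equivalent and arguably more idiomatic spelling uses the .str accessor for the lowercasing step; a sketch on a trimmed-down copy of the table:

```python
import pandas as pd

foods = pd.DataFrame({'food': ['bacon', 'Pastrami', 'nova lox']})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# .str.lower() is NaN-safe, whereas map(str.lower) raises on missing values.
animals = foods['food'].str.lower().map(meat_to_animal)
print(animals.tolist())  # ['pig', 'cow', 'salmon']
```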

3.3 Discretization and binning

3.3.1 Binning values into intervals with pd.cut()

ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# the bin each sample falls into, zero-indexed
cats.codes
    array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1
    dtype: int64
# bin the data and attach a label to each bin
group_names = ['Youth','YoungAdult','middleAged','senior']
new_cats = pd.cut(ages,bins,labels=group_names)
new_cats
    [Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, senior, middleAged, middleAged, YoungAdult]
    Length: 12
    Categories (4, object): [Youth < YoungAdult < middleAged < senior]
pd.value_counts(new_cats)
    Youth         5
    middleAged    3
    YoungAdult    3
    senior        1
    dtype: int64
import matplotlib.pyplot as plt
# plot the bin counts; passing a Categorical straight to plt.hist is unreliable,
# so plot the value counts as a bar chart instead
pd.value_counts(new_cats).plot(kind='bar')
plt.show()

[Figure: bar chart of the counts per age group]

data = np.random.rand(10)
pd.cut(data,4,precision=2)  # precision: number of significant digits kept for the bin edges
    [(0.04, 0.19], (0.04, 0.19], (0.34, 0.48], (0.48, 0.63], (0.48, 0.63], (0.19, 0.34], (0.48, 0.63], (0.19, 0.34], (0.04, 0.19], (0.04, 0.19]]
    Categories (4, interval[float64]): [(0.04, 0.19] < (0.19, 0.34] < (0.34, 0.48] < (0.48, 0.63]]
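pd.cut has a couple of other options worth knowing: right=False makes the intervals left-closed instead of right-closed, and labels=False returns the bin codes directly. A sketch with values chosen to sit on the edges:

```python
import pandas as pd

ages = [20, 25, 26, 35, 36]
bins = [18, 25, 35, 60]

# right=False gives left-closed intervals: [18, 25), [25, 35), [35, 60);
# labels=False returns integer bin codes instead of a Categorical.
codes = pd.cut(ages, bins, right=False, labels=False)
print(list(codes))  # [0, 1, 1, 2, 2]
```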

3.3.2 Binning by quantiles with pd.qcut()

data = np.random.rand(10)
cat = pd.qcut(data,4)  # cut at the quartiles
cat
    [(0.675, 0.856], (0.37, 0.675], (0.675, 0.856], (0.856, 0.971], (0.103, 0.37], (0.103, 0.37], (0.103, 0.37], (0.37, 0.675], (0.856, 0.971], (0.856, 0.971]]
    Categories (4, interval[float64]): [(0.103, 0.37] < (0.37, 0.675] < (0.675, 0.856] < (0.856, 0.971]]
pd.value_counts(cat,sort=True)
    (0.856, 0.971]    3
    (0.103, 0.37]     3
    (0.675, 0.856]    2
    (0.37, 0.675]     2
    dtype: int64
condition = np.where((data>0.856)&(data<=0.971))
condition
    (array([3, 8, 9], dtype=int64),)
data[condition]
    array([ 0.90659059,  0.93586907,  0.97081758])
# like cut, qcut accepts custom quantiles (values between 0 and 1, endpoints included)
new_cat = pd.qcut(data,[0,0.1,0.5,0.9,1.0])
new_cat
    [(0.675, 0.939], (0.31, 0.675], (0.675, 0.939], (0.675, 0.939], (0.31, 0.675], (0.103, 0.31], (0.31, 0.675], (0.31, 0.675], (0.675, 0.939], (0.939, 0.971]]
    Categories (4, interval[float64]): [(0.103, 0.31] < (0.31, 0.675] < (0.675, 0.939] < (0.939, 0.971]]
pd.value_counts(new_cat)
    (0.675, 0.939]    4
    (0.31, 0.675]     4
    (0.939, 0.971]    1
    (0.103, 0.31]     1
    dtype: int64
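The key contrast with cut: cut produces equal-width bins, while qcut produces bins holding roughly equal numbers of observations. A quick check with deterministic data containing one large outlier:

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # one large outlier

# cut: equal-WIDTH bins -- the outlier stretches the range,
# so almost everything lands in the first bin.
cut_counts = pd.Series(pd.cut(data, 4)).value_counts(sort=False).tolist()
# qcut: equal-SIZE bins -- each quartile holds roughly the same number of points.
qcut_counts = pd.Series(pd.qcut(data, 4)).value_counts(sort=False).tolist()
print(cut_counts)   # [9, 0, 0, 1]
print(qcut_counts)  # [3, 2, 2, 3]
```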

3.4 Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of array operations.

data =DataFrame(np.random.randn(1000,4))
data.describe()
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean     -0.020880     0.001643    -0.019453    -0.026122
    std       0.980023     1.025441     0.995088     0.960486
    min      -3.108915    -3.645860    -3.481593    -3.194414
    25%      -0.697479    -0.697678    -0.694020    -0.700987
    50%      -0.005279     0.031774    -0.014728    -0.038483
    75%       0.618116     0.690065     0.651287     0.649747
    max       2.859053     3.189940     3.525865     3.023720
# find values in column 3 whose absolute value exceeds 3
col = data[3]
col[np.abs(col)>3]
    97     3.927528
    305   -3.399312
    400   -3.745356
    Name: 3, dtype: float64
# to select all rows containing a value whose absolute value exceeds 3, use any
data[(np.abs(data)>3).any(axis=1)]
                0         1         2         3
    46  -0.658090 -0.207434  3.525865  0.283070
    67   0.599947 -3.645860  0.255475 -0.549574
    289 -1.559625  0.336788 -3.333767 -1.240685
    371 -1.116332 -3.018842 -0.298748  0.406954
    396 -3.108915  1.117755 -0.152780 -0.340173
    526  1.188742 -3.183867  1.050471 -1.042736
    573 -2.214074 -3.140963 -1.509976 -0.389818
    738 -0.088202  1.090038 -0.848098 -3.194414
    768  0.474358  0.003349 -0.011807  3.023720
    797  2.368010  0.452649 -3.481593  0.789944
    966  0.164293  3.082067 -0.516982  0.251909
    994 -0.843849  3.189940  0.070978  0.516982
# get the sign (+1/-1) of each value
np.sign(data).head(10)
         0    1    2    3
    0  1.0 -1.0  1.0 -1.0
    1 -1.0  1.0 -1.0 -1.0
    2 -1.0  1.0 -1.0  1.0
    3  1.0  1.0 -1.0  1.0
    4 -1.0 -1.0 -1.0  1.0
    5 -1.0  1.0 -1.0 -1.0
    6  1.0  1.0 -1.0  1.0
    7  1.0  1.0  1.0  1.0
    8 -1.0  1.0  1.0  1.0
    9  1.0 -1.0  1.0 -1.0
data[np.abs(data)>3] = np.sign(data)*3  # cap values outside [-3, 3] at ±3
data.describe()
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean     -0.020772     0.002361    -0.019163    -0.025951
    std       0.979685     1.021487     0.990725     0.959788
    min      -3.000000    -3.000000    -3.000000    -3.000000
    25%      -0.697479    -0.697678    -0.694020    -0.700987
    50%      -0.005279     0.031774    -0.014728    -0.038483
    75%       0.618116     0.690065     0.651287     0.649747
    max       2.859053     3.000000     3.000000     3.000000
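The same capping can be expressed with DataFrame.clip, which is shorter and avoids the boolean-mask assignment. A sketch using a seeded generator so the result is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# clip replaces anything below -3 with -3 and anything above 3 with 3
capped = data.clip(lower=-3, upper=3)
print(float(capped.values.min()) >= -3, float(capped.values.max()) <= 3)
```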