1 Data Wrangling: Combining Datasets
1.1 merge: inner, outer, left, and right joins
- merge defaults to an "inner" join: only rows whose keys appear in both inputs are kept, and everything else is dropped.
- If no join column is specified, merge uses the overlapping column name(s) as the key by default; it is generally best to specify the key explicitly with on.
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from numpy import nan as NA
df1 = DataFrame({'key':['b','b','a','c','a','a','b'],
                 'data1':range(7)})
df2 = DataFrame({'key':list('abd'),
                 'data2':range(3)})
pd.merge(df1,df2)
|   | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0 | b | 1 |
| 1 | 1 | b | 1 |
| 2 | 6 | b | 1 |
| 3 | 2 | a | 0 |
| 4 | 4 | a | 0 |
| 5 | 5 | a | 0 |
pd.merge(df1,df2,on='key')
|   | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0 | b | 1 |
| 1 | 1 | b | 1 |
| 2 | 6 | b | 1 |
| 3 | 2 | a | 0 |
| 4 | 4 | a | 0 |
| 5 | 5 | a | 0 |
pd.merge(df1,df2,on='key',how='outer')
|   | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0.0 | b | 1.0 |
| 1 | 1.0 | b | 1.0 |
| 2 | 6.0 | b | 1.0 |
| 3 | 2.0 | a | 0.0 |
| 4 | 4.0 | a | 0.0 |
| 5 | 5.0 | a | 0.0 |
| 6 | 3.0 | c | NaN |
| 7 | NaN | d | 2.0 |
- how='left': keep every row of the left DataFrame; columns coming from the right side are filled only where the key matches, otherwise with NaN.
- how='right': keep every row of the right DataFrame; columns coming from the left side are filled only where the key matches, otherwise with NaN.
df5 = DataFrame({'key':['b','b','a','c','a','a'],
                 'data1':range(6)})
df6 = DataFrame({'key':list('ababd'),
                 'data2':range(5)})
pd.merge(df1,df2,on='key',how='left')
|   | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0 | b | 1.0 |
| 1 | 1 | b | 1.0 |
| 2 | 2 | a | 0.0 |
| 3 | 3 | c | NaN |
| 4 | 4 | a | 0.0 |
| 5 | 5 | a | 0.0 |
| 6 | 6 | b | 1.0 |
df5 = DataFrame({'key':['b','b','a','c','a','a'],
                 'data1':range(6)})
df6 = DataFrame({'key':list('ababd'),
                 'data2':range(5)})
pd.merge(df1,df2,on='key',how='right')
|   | data1 | key | data2 |
| --- | --- | --- | --- |
| 0 | 0.0 | b | 1 |
| 1 | 1.0 | b | 1 |
| 2 | 6.0 | b | 1 |
| 3 | 2.0 | a | 0 |
| 4 | 4.0 | a | 0 |
| 5 | 5.0 | a | 0 |
| 6 | NaN | d | 2 |
If the join columns have different names in the two objects, they can be specified separately:
df3 = DataFrame({'lkey':['b','b','a','c','a','a','b'],
                 'data1':range(7)})
df4 = DataFrame({'rkey':list('abd'),
                 'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')
|   | data1 | lkey | data2 | rkey |
| --- | --- | --- | --- | --- |
| 0 | 0 | b | 1 | b |
| 1 | 1 | b | 1 | b |
| 2 | 6 | b | 1 | b |
| 3 | 2 | a | 0 | a |
| 4 | 4 | a | 0 | a |
| 5 | 5 | a | 0 | a |
Summary of the merge method
'''
pd.merge(df1, df2, on=..., left_on=..., right_on=..., how='')
on:
    column name to join on; if omitted, the overlapping column name is used as the key
left_on, right_on:
    used to specify the join columns when the two frames have no column name in common
how='':
    (1) inner  inner join, intersection of the keys (the default)
    (2) outer  outer join, union of the keys
    (3) left   left join, df1 is the "primary" table
    (4) right  right join, df2 is the "primary" table
'''
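As a quick cross-check of the summary above (an added sketch, not part of the original notebook), the four values of how can be compared by re-using df1 and df2; the printed shapes match the row counts of the tables shown earlier (6, 8, 7 and 7 rows).
# Compare the four join types on df1/df2 defined above; only the number of
# result rows differs, as summarised in the docstring.
for how in ['inner', 'outer', 'left', 'right']:
    print(how, pd.merge(df1, df2, on='key', how=how).shape)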
1.2 Concatenating Series
1.2.1 concat()
- By default concat works along axis=0, stacking the Series vertically one after another.
- With axis=1 the Series are lined up side by side; the row index becomes the union of all the input indexes and the result is a DataFrame.
- keys can be passed to concat to label each piece, which produces a hierarchical index on the result.
s1 = Series([0,1],index=["a","b"])
s2 = Series([2,3,4],index=list("cde"))
s3 = Series([5,6],index=list("fg"))
pd.concat([s1,s2,s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd.concat([s1,s2,s3],axis=1)
|   | 0 | 1 | 2 |
| --- | --- | --- | --- |
| a | 0.0 | NaN | NaN |
| b | 1.0 | NaN | NaN |
| c | NaN | 2.0 | NaN |
| d | NaN | 3.0 | NaN |
| e | NaN | 4.0 | NaN |
| f | NaN | NaN | 5.0 |
| g | NaN | NaN | 6.0 |
- The join='inner' argument to concat keeps only the intersection of the indexes instead of the union.
s4 = pd.concat([s1*5,s3])
s4
a 0
b 5
f 5
g 6
dtype: int64
pd.concat([s1,s4],axis=1)
|   | 0 | 1 |
| --- | --- | --- |
| a | 0.0 | 0 |
| b | 1.0 | 5 |
| f | NaN | 5 |
| g | NaN | 6 |
pd.concat([s1,s4],axis=1,join='inner')
|   | 0 | 1 |
| --- | --- | --- |
| a | 0 | 0 |
| b | 1 | 5 |
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result.unstack()
|   | a | b | c | d | e | f | g |
| --- | --- | --- | --- | --- | --- | --- | --- |
| one | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| two | NaN | NaN | 2.0 | 3.0 | 4.0 | NaN | NaN |
| three | NaN | NaN | NaN | NaN | NaN | 5.0 | 6.0 |
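Not shown above, but a natural combination of the two previous ideas (an added sketch): when concatenating along axis=1, the keys become the column headers of the resulting DataFrame instead of an outer index level.
# With axis=1 the keys label the columns rather than forming a hierarchical
# row index; missing positions are filled with NaN as before.
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])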
1.2.2 Combining overlapping data: combine_first & append
combine_first patches the missing (NaN) values of the calling object with values from the object passed in, aligning on the index; append simply stacks one Series below the other.
a = Series([NA,2.5,NA,3.5,4.5,NA],index=list("fedcba"))
b = Series(np.arange(len(a)),dtype=np.float64,index=list("fedcba"))
print('a:')
print(a)
print('b:')
print(b)
b[-1] = np.nan
b[:-2].combine_first(a[2:])
a:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
b:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
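As a rough mental model (an added sketch, not the author's code): for Series that share the same index, combine_first behaves like an index-aware np.where — keep the caller's value unless it is NaN, otherwise take the argument's value.
# Sketch: patch a's missing values with b's, first with np.where on the raw
# values, then with combine_first; for identically indexed Series the two
# results should agree.
patched = Series(np.where(pd.isnull(a), b, a), index=a.index)
print(patched)
print(a.combine_first(b))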
a.append(b)
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
2 Data Wrangling: Reshaping and Pivoting
2.1 Hierarchical indexing
Hierarchical indexing is an important feature of pandas: it lets you have multiple (two or more) index levels on a single axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.
data = Series(np.random.randn(10),
              index=[list("aaabbbccdd"),[1,2,3,1,2,3,1,2,2,3]])
data
a 1 -0.467731
2 -1.288590
3 -2.002361
b 1 -0.075169
2 -0.666990
3 0.725769
c 1 0.583614
2 0.866867
d 2 0.424541
3 0.888412
dtype: float64
2.2 Reshaping with a hierarchical index
By default, unstack operates on the innermost level (as does stack). To unstack a different level, pass the level number or name.
data.unstack()
|   | 1 | 2 | 3 |
| --- | --- | --- | --- |
| a | -0.467731 | -1.288590 | -2.002361 |
| b | -0.075169 | -0.666990 | 0.725769 |
| c | 0.583614 | 0.866867 | NaN |
| d | NaN | 0.424541 | 0.888412 |
data.unstack(0)
|   | a | b | c | d |
| --- | --- | --- | --- | --- |
| 1 | -0.467731 | -0.075169 | 0.583614 | NaN |
| 2 | -1.288590 | -0.666990 | 0.866867 | 0.424541 |
| 3 | -2.002361 | 0.725769 | NaN | 0.888412 |
data = DataFrame(np.arange(6).reshape(2,3),
                 index=pd.Index(["Ohio","Colorado"],name="state"),
                 columns=pd.Index(["one","two","three"],name="numbers"))
data
| numbers | one | two | three |
| --- | --- | --- | --- |
| state |  |  |  |
| Ohio | 0 | 1 | 2 |
| Colorado | 3 | 4 | 5 |
result = data.stack()
result
state numbers
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
result['Ohio']
numbers
one 0
two 1
three 2
dtype: int32
result.unstack('state')
| state | Ohio | Colorado |
| --- | --- | --- |
| numbers |  |  |
| one | 0 | 3 |
| two | 1 | 4 |
| three | 2 | 5 |
df = DataFrame({'left':result,'right':result+5},
               columns=pd.Index(['left','right'],name='side'))
df
|   | side | left | right |
| --- | --- | --- | --- |
| state | numbers |  |  |
| Ohio | one | 0 | 5 |
|  | two | 1 | 6 |
|  | three | 2 | 7 |
| Colorado | one | 3 | 8 |
|  | two | 4 | 9 |
|  | three | 5 | 10 |
df.index
MultiIndex(levels=[['Ohio', 'Colorado'], ['one', 'two', 'three']],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
names=['state', 'numbers'])
temp = df.unstack('state')
temp
| side | left |  | right |  |
| --- | --- | --- | --- | --- |
| state | Ohio | Colorado | Ohio | Colorado |
| numbers |  |  |  |  |
| one | 0 | 3 | 5 | 8 |
| two | 1 | 4 | 6 | 9 |
| three | 2 | 5 | 7 | 10 |
temp.left
| state | Ohio | Colorado |
| --- | --- | --- |
| numbers |  |  |
| one | 0 | 3 |
| two | 1 | 4 |
| three | 2 | 5 |
temp.stack('side')
|   | state | Colorado | Ohio |
| --- | --- | --- | --- |
| numbers | side |  |  |
| one | left | 3 | 0 |
|  | right | 8 | 5 |
| two | left | 4 | 1 |
|  | right | 9 | 6 |
| three | left | 5 | 2 |
|  | right | 10 | 7 |
- If not all of the level values are found in each of the subgroups, unstack may introduce missing data.
- stack filters out missing data by default, so the operation is easily invertible.
s1 = Series([0,1,2,3],index=list('abcd'))
s2 = Series([4,5,6],index=list('cde'))
data2 = pd.concat([s1,s2],keys=['one','two'])
data2.unstack()
|   | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| one | 0.0 | 1.0 | 2.0 | 3.0 | NaN |
| two | NaN | NaN | 4.0 | 5.0 | 6.0 |
data2.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data2.unstack().stack(dropna=False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
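A quick check of the claim in the bullets above (an added sketch): because the default stack() drops the NaN values that unstack() introduced, the round trip gives back the original values, just converted to floats.
# Element-wise comparison confirms that unstack() followed by stack()
# reproduces data2 (apart from the int -> float conversion).
roundtrip = data2.unstack().stack()
(roundtrip == data2).all()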
3 Data Wrangling: Data Transformation
3.1 Removing duplicate data
data = DataFrame({'k1':['one']*3+['two']*4,
                  'k2':[1,1,2,3,3,4,4]})
data
|   | k1 | k2 |
| --- | --- | --- |
| 0 | one | 1 |
| 1 | one | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 4 | two | 3 |
| 5 | two | 4 |
| 6 | two | 4 |
3.1.1 The duplicated() method
The DataFrame duplicated method returns a boolean Series indicating whether each row is a duplicate of a row seen earlier.
data.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
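Since duplicated() is just a boolean mask, it can be used for boolean indexing directly; selecting with its negation is equivalent to the drop_duplicates() call shown in the next subsection (an added sketch).
# Keep only the rows that are not flagged as duplicates.
data[~data.duplicated()]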
3.1.2 The drop_duplicates() method
drop_duplicates returns a DataFrame with the duplicate rows removed.
data.drop_duplicates()
|   | k1 | k2 |
| --- | --- | --- |
| 0 | one | 1 |
| 2 | one | 2 |
| 3 | two | 3 |
| 5 | two | 4 |
data['v1']=range(7)
data
|   | k1 | k2 | v1 |
| --- | --- | --- | --- |
| 0 | one | 1 | 0 |
| 1 | one | 1 | 1 |
| 2 | one | 2 | 2 |
| 3 | two | 3 | 3 |
| 4 | two | 3 | 4 |
| 5 | two | 4 | 5 |
| 6 | two | 4 | 6 |
drop_duplicates can also be told to consider only a subset of the columns when detecting duplicates:
data.drop_duplicates(['k1'])
|   | k1 | k2 | v1 |
| --- | --- | --- | --- |
| 0 | one | 1 | 0 |
| 3 | two | 3 | 3 |
By default the first occurrence of each duplicate is kept; pass keep='last' to keep the last one instead:
data.drop_duplicates(['k1','k2'],keep='last')
|   | k1 | k2 | v1 |
| --- | --- | --- | --- |
| 1 | one | 1 | 1 |
| 2 | one | 2 | 2 |
| 4 | two | 3 | 4 |
| 6 | two | 4 | 6 |
3.2 Transforming data using a function or mapping
foods = DataFrame({"food":["bacon","pulled pork","bacon","Pastrami",
                           "corned beef","Bacon","pastrami","honey ham",
                           "nova lox"],
                   "ounces":[4,3,12,6,7.5,8,3,5,6]})
meat_to_animal = {"bacon":"pig",
                  "pulled pork":"pig",
                  "pastrami":"cow",
                  "corned beef":"cow",
                  "honey ham":"pig",
                  "nova lox":"salmon"}
foods
|   | food | ounces |
| --- | --- | --- |
| 0 | bacon | 4.0 |
| 1 | pulled pork | 3.0 |
| 2 | bacon | 12.0 |
| 3 | Pastrami | 6.0 |
| 4 | corned beef | 7.5 |
| 5 | Bacon | 8.0 |
| 6 | pastrami | 3.0 |
| 7 | honey ham | 5.0 |
| 8 | nova lox | 6.0 |
1. First write the mapping (here, a dict from meat type to animal).
2. Then apply it with the map method (lower-casing first so that "Bacon" and "Pastrami" also match):
foods['animal'] = foods['food'].map(str.lower).map(meat_to_animal)
foods
|   | food | ounces | animal |
| --- | --- | --- | --- |
| 0 | bacon | 4.0 | pig |
| 1 | pulled pork | 3.0 | pig |
| 2 | bacon | 12.0 | pig |
| 3 | Pastrami | 6.0 | cow |
| 4 | corned beef | 7.5 | cow |
| 5 | Bacon | 8.0 | pig |
| 6 | pastrami | 3.0 | cow |
| 7 | honey ham | 5.0 | pig |
| 8 | nova lox | 6.0 | salmon |
Alternatively, pass a single function that does the whole job:
foods['food'].map(lambda x : meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
3.3 Binning data
3.3.1 Binning by value intervals: pd.cut()
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
group_names = ['Youth','YoungAdult','middleAged','senior']
new_cats = pd.cut(ages,bins,labels=group_names)
new_cats
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, senior, middleAged, middleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < middleAged < senior]
pd.value_counts(new_cats)
Youth 5
middleAged 3
YoungAdult 3
senior 1
dtype: int64
import matplotlib.pyplot as plt
plt.hist(new_cats,histtype='bar',density=False)
plt.show()
cut can also be given just the number of bins; it then computes equal-width bins from the data's minimum and maximum (precision controls how many decimals are shown for the bin edges):
data = np.random.rand(10)
pd.cut(data,4,precision=2)
[(0.04, 0.19], (0.04, 0.19], (0.34, 0.48], (0.48, 0.63], (0.48, 0.63], (0.19, 0.34], (0.48, 0.63], (0.19, 0.34], (0.04, 0.19], (0.04, 0.19]]
Categories (4, interval[float64]): [(0.04, 0.19] < (0.19, 0.34] < (0.34, 0.48] < (0.48, 0.63]]
3.3.2 Binning by quantiles: pd.qcut()
data = np.random.rand(10)
cat = pd.qcut(data,4)
cat
[(0.675, 0.856], (0.37, 0.675], (0.675, 0.856], (0.856, 0.971], (0.103, 0.37], (0.103, 0.37], (0.103, 0.37], (0.37, 0.675], (0.856, 0.971], (0.856, 0.971]]
Categories (4, interval[float64]): [(0.103, 0.37] < (0.37, 0.675] < (0.675, 0.856] < (0.856, 0.971]]
pd.value_counts(cat,sort=True)
(0.856, 0.971] 3
(0.103, 0.37] 3
(0.675, 0.856] 2
(0.37, 0.675] 2
dtype: int64
condition = np.where((data>0.856)&(data<=0.971))
condition
(array([3, 8, 9], dtype=int64),)
data[condition]
array([ 0.90659059, 0.93586907, 0.97081758])
qcut also accepts custom quantiles (numbers between 0 and 1, inclusive):
new_cat = pd.qcut(data,[0,0.1,0.5,0.9,1.0])
new_cat
[(0.675, 0.939], (0.31, 0.675], (0.675, 0.939], (0.675, 0.939], (0.31, 0.675], (0.103, 0.31], (0.31, 0.675], (0.31, 0.675], (0.675, 0.939], (0.939, 0.971]]
Categories (4, interval[float64]): [(0.103, 0.31] < (0.31, 0.675] < (0.675, 0.939] < (0.939, 0.971]]
pd.value_counts(new_cat)
(0.675, 0.939] 4
(0.31, 0.675] 4
(0.939, 0.971] 1
(0.103, 0.31] 1
dtype: int64
3.4 Detecting and filtering outliers
Filtering or transforming outliers is largely a matter of array operations.
data = DataFrame(np.random.randn(1000,4))
data.describe()
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | -0.020880 | 0.001643 | -0.019453 | -0.026122 |
| std | 0.980023 | 1.025441 | 0.995088 | 0.960486 |
| min | -3.108915 | -3.645860 | -3.481593 | -3.194414 |
| 25% | -0.697479 | -0.697678 | -0.694020 | -0.700987 |
| 50% | -0.005279 | 0.031774 | -0.014728 | -0.038483 |
| 75% | 0.618116 | 0.690065 | 0.651287 | 0.649747 |
| max | 2.859053 | 3.189940 | 3.525865 | 3.023720 |
col = data[3]
col[np.abs(col)>3]
97 3.927528
305 -3.399312
400 -3.745356
Name: 3, dtype: float64
data[(np.abs(data)>3).any(1)]
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 46 | -0.658090 | -0.207434 | 3.525865 | 0.283070 |
| 67 | 0.599947 | -3.645860 | 0.255475 | -0.549574 |
| 289 | -1.559625 | 0.336788 | -3.333767 | -1.240685 |
| 371 | -1.116332 | -3.018842 | -0.298748 | 0.406954 |
| 396 | -3.108915 | 1.117755 | -0.152780 | -0.340173 |
| 526 | 1.188742 | -3.183867 | 1.050471 | -1.042736 |
| 573 | -2.214074 | -3.140963 | -1.509976 | -0.389818 |
| 738 | -0.088202 | 1.090038 | -0.848098 | -3.194414 |
| 768 | 0.474358 | 0.003349 | -0.011807 | 3.023720 |
| 797 | 2.368010 | 0.452649 | -3.481593 | 0.789944 |
| 966 | 0.164293 | 3.082067 | -0.516982 | 0.251909 |
| 994 | -0.843849 | 3.189940 | 0.070978 | 0.516982 |
np.sign returns 1.0 or -1.0 according to the sign of each value:
np.sign(data).head(10)
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | -1.0 | 1.0 | -1.0 |
| 1 | -1.0 | 1.0 | -1.0 | -1.0 |
| 2 | -1.0 | 1.0 | -1.0 | 1.0 |
| 3 | 1.0 | 1.0 | -1.0 | 1.0 |
| 4 | -1.0 | -1.0 | -1.0 | 1.0 |
| 5 | -1.0 | 1.0 | -1.0 | -1.0 |
| 6 | 1.0 | 1.0 | -1.0 | 1.0 |
| 7 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8 | -1.0 | 1.0 | 1.0 | 1.0 |
| 9 | 1.0 | -1.0 | 1.0 | -1.0 |
The following statement caps the values outside the interval [-3, 3]: np.sign(data)*3 is +3 or -3 depending on each value's sign, and the boolean mask assigns it only where the absolute value exceeds 3.
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | -0.020772 | 0.002361 | -0.019163 | -0.025951 |
| std | 0.979685 | 1.021487 | 0.990725 | 0.959788 |
| min | -3.000000 | -3.000000 | -3.000000 | -3.000000 |
| 25% | -0.697479 | -0.697678 | -0.694020 | -0.700987 |
| 50% | -0.005279 | 0.031774 | -0.014728 | -0.038483 |
| 75% | 0.618116 | 0.690065 | 0.651287 | 0.649747 |
| max | 2.859053 | 3.000000 | 3.000000 | 3.000000 |
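An equivalent way to cap the values (an added note, not from the original) is DataFrame.clip, which bounds every element to a given interval in a single call; applied to the raw data it gives the same result as the np.sign assignment above (on the already-capped data it is a no-op).
# clip(-3, 3) limits every value to the interval [-3, 3].
capped = data.clip(-3, 3)
capped.describe()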