欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

python 清洗数据

程序员文章站 2024-03-07 16:54:21
...

1、导入数据

import pandas as pd
import os 
import numpy as np
os.chdir(r"E:\Python_learning\data_science\train_0529\5Preprocessing")
camp = pd.read_csv('teleco_camp_orig.csv')
camp.head()
Out[1]: 
    ID  Suc_flag  ARPU  PromCnt12  PromCnt36  PromCntMsg12  PromCntMsg36  \
0   12         1  50.0          6         10             2             3   
1   53         0   NaN          5          9             1             4   
2   67         1  25.0          6         11             2             4   
3   71         1  80.0          7         10             2             4   
4  142         1  15.0          6         11             2             4   

   Class   Age Gender HomeOwner    AvgARPU  AvgHomeValue  AvgIncome  
0      4  57.0      M         H  49.894904         33400      39460  
1      3  55.0      M         H  48.574742         37600      33545  
2      1  57.0      F         H  49.272646        100400      42091  
3      1  52.0      F         H  47.334953         39900      39313  
4      1   NaN      F         U  47.827404         47500          0  

2、查看数据某属性分布情况,并进行统计描述

import matplotlib.pyplot as plt
plt.hist(camp['AvgIncome'], bins=20, normed=True)#查看分布情况
camp['AvgIncome'].describe(include='all')
Out[3]: 
count      9686.000000
mean      40491.444249
std       28707.494146
min           0.000000
25%       24464.000000
50%       43100.000000
75%       56876.000000
max      200001.000000
Name: AvgIncome, dtype: float64

python 清洗数据

由上图可以看出异常值(平均收入不可能为0)

plt.hist(camp['AvgHomeValue'], bins=20, normed=True)#查看分布情况
camp['AvgHomeValue'].describe(include='all')
Out[4]: 
count      9686.000000
mean     110986.299814
std       98670.855450
min           0.000000
25%       52300.000000
50%       76900.000000
75%      128175.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64

python 清洗数据

同样也出现了0,为异常值

3、对异常值进行处理

#这里的0值应该是缺失值
camp['AvgIncome']=camp['AvgIncome'].replace({0: np.NaN})
#像这种外部获取的数据要比较小心,经常出现意义不清晰或这错误值。AvgHomeValue也有这种情况
plt.hist(camp['AvgIncome'], bins=20, normed=True,range=(camp.AvgIncome.min(),camp.AvgIncome.max()))
#由于数据中存在缺失值,需要指定绘图的值域
camp['AvgIncome'].describe(include='all')
Out[5]: 
count      7329.000000
mean      53513.457361
std       19805.168339
min        2499.000000
25%       40389.000000
50%       48699.000000
75%       62385.000000
max      200001.000000
Name: AvgIncome, dtype: float64

python 清洗数据

同理对另一个属性进行处理

camp['AvgHomeValue']=camp['AvgHomeValue'].replace({0: np.NaN})
plt.hist(camp['AvgHomeValue'], bins=20, normed=True,range=(camp.AvgHomeValue.min(),camp.AvgHomeValue.max()))
#由于数据中存在缺失值,需要指定绘图的值域
camp['AvgHomeValue'].describe(include='all')
Out[6]: 
count      9583.000000
mean     112179.202755
std       98522.888583
min        7500.000000
25%       53200.000000
50%       77700.000000
75%      129350.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64

python 清洗数据


4、对各属性值去重

camp.describe()
#如果count数量少于样本量,说明存在缺失
#缺失最多的两个变量是Age和AvgIncome,缺失了大概20%。   

                  ID     Suc_flag         ARPU    PromCnt12    PromCnt36  \
count    9686.000000  9686.000000  4843.000000  9686.000000  9686.000000   
mean    97975.474086     0.500000    78.121722     3.495251     7.466963   
std     56550.171120     0.500026    62.225686     1.270258     1.977909   
min        12.000000     0.000000     5.000000     1.000000     1.000000   
25%     48835.500000     0.000000    50.000000     3.000000     6.000000   
50%     99106.000000     0.500000    65.000000     3.000000     8.000000   
75%    148538.750000     1.000000   100.000000     4.000000     8.000000   
max    191779.000000     1.000000  1000.000000    15.000000    20.000000   

       PromCntMsg12  PromCntMsg36        Class          Age      AvgARPU  \
count   9686.000000   9686.000000  9686.000000  7279.000000  9686.000000   
mean       1.034586      2.323044     2.424530    49.567386    52.905156   
std        0.244171      0.904083     1.049047     6.991306     4.993775   
min        0.000000      0.000000     1.000000    16.000000    46.138968   
25%        1.000000      1.000000     2.000000    45.000000    49.760116   
50%        1.000000      3.000000     2.000000    50.000000    50.876672   
75%        1.000000      3.000000     3.000000    55.000000    54.452822   
max        4.000000      6.000000     4.000000    60.000000    99.444787   

        AvgHomeValue      AvgIncome  
count    9583.000000    7329.000000  
mean   112179.202755   53513.457361  
std     98522.888583   19805.168339  
min      7500.000000    2499.000000  
25%     53200.000000   40389.000000  
50%     77700.000000   48699.000000  
75%    129350.000000   62385.000000  
max    600000.000000  200001.000000

 对重复值打标签

camp['dup'] = camp.duplicated() # 生成重复标识变量
camp.dup.head()
Out[7]: 
0    False
1    False
2    False
3    False
4    False
Name: dup, dtype: bool

    根据标签提取数据

camp_dup = camp[camp['dup'] == True] # 把有重复的数据保存出来,以备核查
camp_nodup = camp[camp['dup'] == False] # 注意与camp.drop_duplicates()的区别
camp_nodup.head()
camp['dup1'] = camp['ID'].duplicated() # 按照主键进行重复记录标识

5、对缺失值进行填充

现对属性age进行处理:

camp['Age'].describe()
Out[9]: 
count    7279.000000
mean       49.567386
std         6.991306
min        16.000000
25%        45.000000
50%        50.000000
75%        55.000000
max        60.000000
Name: Age, dtype: float64
计算age的均值
vmean = camp['Age'].mean(axis=0, skipna=True)
vmean
Out[10]: 49.56738562989422
对age缺失值打标签
camp['Age_empflag'] = camp['Age'].isnull()
camp.head()
Out[11]: 
    ID  Suc_flag  ARPU  PromCnt12  PromCnt36  PromCntMsg12  PromCntMsg36  \
0   12         1  50.0          6         10             2             3   
1   53         0   NaN          5          9             1             4   
2   67         1  25.0          6         11             2             4   
3   71         1  80.0          7         10             2             4   
4  142         1  15.0          6         11             2             4   

   Class   Age Gender HomeOwner    AvgARPU  AvgHomeValue  AvgIncome  \
0      4  57.0      M         H  49.894904       33400.0    39460.0   
1      3  55.0      M         H  48.574742       37600.0    33545.0   
2      1  57.0      F         H  49.272646      100400.0    42091.0   
3      1  52.0      F         H  47.334953       39900.0    39313.0   
4      1   NaN      F         U  47.827404       47500.0        NaN   

   Age_empflag  
0        False  
1        False  
2        False  
3        False  
4         True  
对age中的缺失值用均值填充
camp['Age']= camp['Age'].fillna(vmean)
camp['Age'].describe()
Out[12]: 
count    9686.000000
mean       49.567386
std         6.060585
min        16.000000
25%        47.000000
50%        49.567386
75%        54.000000
max        60.000000
Name: Age, dtype: float64

同理对AvgHomeValue属性进行处理

vmean = camp['AvgHomeValue'].mean(axis=0, skipna=True)
camp['AvgHomeValue_empflag'] = camp['AvgHomeValue'].isnull()
camp['AvgHomeValue']= camp['AvgHomeValue'].fillna(vmean)
camp['AvgHomeValue'].describe()
Out[13]: 
count      9686.000000
mean     112179.202755
std       97997.592632
min        7500.000000
25%       53500.000000
50%       78450.000000
75%      128175.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64
vmean = camp['AvgIncome'].mean(axis=0, skipna=True)
camp['AvgIncome_empflag'] = camp['AvgIncome'].isnull()
camp['AvgIncome']= camp['AvgIncome'].fillna(vmean)
camp['AvgIncome'].describe()
Out[14]: 
count      9686.000000
mean      53513.457361
std       17227.468161
min        2499.000000
25%       42775.000000
50%       53513.457361
75%       56876.000000
max      200001.000000
Name: AvgIncome, dtype: float64















上一篇: 复习——Top K问题

下一篇: