Understanding the Data
2022-07-13 09:01:06
A quick look at the data
from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
peek = data.head(10)
print(peek)
Data dimensions
from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
print(data.shape)
Attributes and data types
from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
print(data.dtypes)
Descriptive statistics
from pandas import read_csv
from pandas import set_option
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
set_option('display.width',100)
set_option('display.precision',4)
print(data.describe())
Class distribution
from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
print(data.groupby('class').size())
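Raw counts per class are easier to judge as proportions. The following is a minimal sketch, not part of the original article; the `toy` DataFrame and its values are made up to stand in for the `class` column of pima.csv.

```python
import pandas as pd

# Hypothetical labels standing in for the 'class' column of pima.csv.
toy = pd.DataFrame({'class': [0, 0, 0, 1, 1, 0, 1, 0]})

# Absolute number of rows in each class, as in the example above.
counts = toy.groupby('class').size()

# Fraction of rows in each class; normalize=True divides by the total row count.
shares = toy['class'].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

A strongly uneven split between classes is worth knowing about before training a classifier, since it may call for resampling or a metric other than plain accuracy.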
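Attribute correlation
Correlation between attributes describes whether two attributes of the data vary together, and in which direction. It can be measured with the Pearson correlation coefficient, which ranges from -1 to 1: 1 means a perfect positive linear correlation, 0 means no linear correlation, and -1 means a perfect negative linear correlation.
When attributes are highly correlated, consider applying dimensionality reduction to the features. The code below uses the DataFrame corr() method to compute the correlation matrix.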
from pandas import read_csv
from pandas import set_option
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
set_option('display.width',100)
set_option('display.precision',2)
print(data.corr(method='pearson'))
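To make the Pearson coefficient concrete, it can be computed by hand and compared with what corr() returns. This is a sketch under assumptions: the helper `pearson` and the `toy` DataFrame are hypothetical illustrations and do not come from the original article.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical data: 'b' rises with 'a', 'c' falls as 'a' rises.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 4, 3, 2, 1],
})

r_ab = pearson(toy['a'], toy['b'])  # perfectly positive linear relation
r_ac = pearson(toy['a'], toy['c'])  # perfectly negative linear relation

# The DataFrame method should agree with the hand computation.
corr = toy.corr(method='pearson')
print(r_ab, r_ac)
print(corr)
```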
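Distribution analysis
Compute the skew of each attribute, that is, how far its distribution departs from a symmetric Gaussian shape.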
from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
print(data.skew())
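The sign of the skew indicates which side the long tail is on. The sketch below uses made-up columns (`symmetric`, `right_tail`, `left_tail` are hypothetical names, not attributes of pima.csv) to show how skew() behaves on each shape.

```python
import pandas as pd

# Hypothetical columns chosen to show the three cases.
toy = pd.DataFrame({
    'symmetric':  [1, 2, 3, 4, 5],     # balanced around the mean
    'right_tail': [1, 1, 1, 2, 10],    # one large value stretches the right side
    'left_tail':  [1, 9, 10, 10, 10],  # one small value stretches the left side
})

# skew() returns one value per column: about 0 for symmetric data,
# positive for a long right tail, negative for a long left tail.
skew = toy.skew()
print(skew)
```

Many algorithms assume roughly Gaussian inputs, so attributes with a skew far from 0 may be candidates for a transformation before modeling.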