Understanding the Data

程序员文章站 2022-07-13 09:01:06

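A quick look at the data

from pandas import read_csv
filename = 'pima.csv'
# names= supplies the column labels; this assumes pima.csv has no header row
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
# Show the first 10 rows
peek = data.head(10)
print(peek)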

Data dimensions

from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
# shape is a (rows, columns) tuple
print(data.shape)

Attributes and their types

from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
# dtypes lists the inferred type of each column
print(data.dtypes)

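Descriptive statistics

from pandas import read_csv
from pandas import set_option
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
set_option('display.width',100)
# Use the full option name; the abbreviated 'precision' matches more than one option in recent pandas versions
set_option('display.precision',4)
# count, mean, std, min, quartiles and max for each column
print(data.describe())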
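
The display options only change how numbers are printed, not the values stored in the DataFrame. A minimal sketch of that, assuming a one-value toy DataFrame (the name `toy` and its value are invented for illustration) and using `option_context` so the setting is undone afterwards:

```python
import pandas as pd

# Invented single-value frame, used only to show the formatting effect.
toy = pd.DataFrame({'v': [1.23456789]})

# Inside the block, floats are rendered with 2 decimal places.
with pd.option_context('display.precision', 2):
    short_text = toy.to_string()

print(short_text)

# The stored value keeps its full precision.
stored = toy['v'].iloc[0]
print(stored)
```

Using `option_context` instead of `set_option` keeps the change local to one block, which can be convenient in a notebook that prints several tables.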

Class distribution

from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
# Number of rows in each class
print(data.groupby('class').size())
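
Raw counts per class are often easier to judge as proportions. A minimal sketch, assuming a small hand-made DataFrame in place of pima.csv (the name `toy` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the 'class' column of pima.csv; values are invented.
toy = pd.DataFrame({'class': [0, 0, 0, 1, 0, 1, 0, 1]})

# Absolute number of rows per class, the same call as in the section above.
counts = toy.groupby('class').size()

# Share of each class; a strongly lopsided split signals class imbalance.
proportions = toy['class'].value_counts(normalize=True).sort_index()

print(counts)
print(proportions)
```

If one class dominates, plain accuracy can be a misleading metric later on, so it is worth checking this split before modelling.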

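Correlation between attributes

The correlation between two attributes describes whether they vary together and in which direction. It can be measured with the Pearson correlation coefficient, which ranges from -1 to 1: 1 indicates a perfect positive linear correlation, 0 indicates no linear correlation, and -1 indicates a perfect negative linear correlation.
When attributes are highly correlated, consider reducing the dimensionality of the features. The code below uses the DataFrame corr() method to compute the correlation matrix.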

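from pandas import read_csv
from pandas import set_option
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
set_option('display.width',100)
# Use the full option name; the abbreviated 'precision' matches more than one option in recent pandas versions
set_option('display.precision',2)
# Pairwise Pearson correlation between all columns
print(data.corr(method='pearson'))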
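
To make the coefficient concrete, the sketch below computes Pearson's r by hand and compares it with `corr()`, then lists the attribute pairs whose absolute correlation exceeds a threshold. The toy data, the helper `pearson`, and the 0.9 threshold are assumptions made for illustration, not part of the original workflow:

```python
import numpy as np
import pandas as pd

# Invented toy data: y rises with x, z falls exactly linearly with x.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.1, 9.8],
    'z': [10.0, 8.0, 6.0, 4.0, 2.0],
})

def pearson(a, b):
    """Pearson r: sum of products of deviations, scaled by both spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

corr = toy.corr(method='pearson')

# The hand-written formula should agree with pandas.
print(pearson(toy['x'], toy['y']), corr.loc['x', 'y'])
print(pearson(toy['x'], toy['z']), corr.loc['x', 'z'])

# Pairs above an arbitrary threshold are candidates for dimensionality reduction.
threshold = 0.9
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Only the upper triangle of the matrix is scanned, so each pair is reported once and the diagonal (always 1) is skipped.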

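Distribution analysis

Compute the skew of each attribute, i.e. how far its distribution departs from the symmetric shape of a Gaussian.

from pandas import read_csv
filename = 'pima.csv'
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filename,names=names)
# Positive skew: longer right tail; negative skew: longer left tail; near 0: roughly symmetric
print(data.skew())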
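
How to read the sign of the skew can be checked on data whose shape is known in advance. A minimal sketch, assuming three invented columns (the name `toy` and all values are made up for illustration):

```python
import pandas as pd

# Invented columns: one symmetric, one with a long right tail, and its mirror image.
toy = pd.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0],
    'right_tail': [1.0, 1.0, 2.0, 2.0, 10.0],
    'left_tail': [-10.0, -2.0, -2.0, -1.0, -1.0],
})

# skew() returns one value per column.
skew = toy.skew()
print(skew)
```

The symmetric column has zero skew, the column with one large value on the right is positive, and its mirror image has the same magnitude with a negative sign.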