数据分析——pandas
程序员文章站
2022-07-09 21:20:23
简介 数据类型 票房分析 运行结果 标注: 统计拍片数前10的某导演,指导电影的总票房 票房分析 特征 导入类库 准备数据 测试代码 案例源码 DATA-->INFOMATION-->KNOWLEDGE-->WISDOM 数据-->信息-->知识-->智慧 爬虫-->数据库-->数据分析-->机器学 ......
简介
1 import pandas as pd 2 3 # 在数据挖掘前一个数据分析、筛选、清理的多功能工具 4 ''' 5 pandas 可以读入excel、csv等文件;可以创建series序列,dataframe表格,日期数组data_range 6 '''
数据类型
1 # 将excel文件,csv文件读取并转换为pandas的dataframe 2 # df_score = pd.read_csv() 3 df_score = pd.read_excel('./score.xlsx') 4 # df_score.values #数据 5 # df_score.columns #列名 6 # print df_score.describe() #计算表的各项数据,count,mean,std,中位数等 7 8 # 创建一个默认索引从0开始的series 9 s = pd.series([1, 2, 3, 4, 5, 6]) 10 # 创建自定义索引的数组,索引由index指定,和前面数组依次对应 11 s = pd.series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'], dtype=int) 12 # 使用字典创建一个dataframe,字典的key会自动成为列名,一个key默认对应一列数据 13 df1 = pd.dataframe({'math': [1, 2, 3, 4, 5], 'physic': [5, 6, 7, 8, 9]}, index=['a', 'b', 'c', 'd', 'e']) 14 ''' 15 # df1.values 数据 16 # df1.head(2) 前两行数据 17 # df1.tail(2) 最后两行数据 18 # df1.index 索引 19 # df1.columns 列名 20 ''' 21 # 生成从20180101开始的时间序列,peroids是增加量,默认增加单位是天d,h小时,s秒 22 dates = pd.date_range('20180101', periods=10, freq='d') 23 # 创建使用时间索引的series 24 # s = pd.series(range(10),index=dates) 25 # 取出指定间隔的行数据 26 # s['2018-01-01':'2018-01-05'] 27 # print dates
票房分析
1 df_imdb = pd.read_csv('./imdb.csv') 2 3 # print df_imdb 4 # print df_imdb.columns 5 # df_imdb['title'].head(5) #选出title列的前五行 6 # df_imdb['title'].tail(3) 7 # df_imdb.title.head(3) #同[]的形式 8 # df_imdb['revenue (millions)'].max() #最大票房 9 # df_imdb['revenue (millions)'].idxmax() #最大票房的索引 10 # df_imdb[50:51] 11 # df_imdb[50:51]['title'] 12 # df_imdb[50:51]['revenue (millions)'] #取出50行,不包括51行 13 # 取出50-56行,收尾都包含,第一维度是行,第二维度是列 14 # df_imdb.loc[50:56,['director','year']] 15 # df_imdb[50:56].loc[:,'director','year'] 16 # 取出1-5行(不包含第5行),2-4(不包含第4列)列的数据,使用整数索引操作,与numpy用法类似 17 # df_imdb.iloc[1:5,2:4] 18 # 统计director列中不同导演出现的次数 19 # df_imdb['director'].value_counts() 20 # 将票房大于5亿美元的电影选出来 21 # df_imdb[df_imdb['revenue (millions)']>500].director 22 # df_imdb[df_imdb['revenue (millions)']>700]['title'] 23 # 将电影风格描述中含有sci-fi(科幻) 关键字的找出 24 # df_imdb[df_imdb['genre'].str.contains('sci-fi')] 25 26 # 将缺失数据(nan)填充为0,也可以自己根据项目需求指定其他数据 27 # df_score.fillna(0) 28 # 将缺失数据的行移除(默认操作,可以使用axis=1指定删除列df_score.dropna(axis=1)) 29 # 0删除行,1删除列 30 # df_score.dropna() 31 # 在datafram中增加一列平均值avg,计算当前datafram中每行的平均值作为avg的数据 32 # 前后赋值数据的行数要对应,axis=1表示按行计算,axis=0(默认值),表示按列计算 33 # df_score['avg'] = df_score.mean(axis=1) 34 # 按照性别分组并求和指定成绩 35 # df_score.iloc[:,4:7].groupby(u'性别').sum() 36 # df_score.loc[:,[u'音乐',u'性别']].groupby(u'性别').sum() 37 # 按照男女分组并绘图,bar柱状图,pie饼状图 38 # df_score[u'性别'].value_counts().plot(kind='bar') 39 # df_score[u'性别'].value_counts().plot(kind='pie') 40 # & 数学大于80且化学大于60 41 # df_score[(df_score[u'数学']>80) &(df_score[u'化学']>60) ] 42 43 # 使用lambda,配合apply方法将日期中的指定年份或月份等提取出来 44 # apply函数会将lambda一次作用到数据集的每个元素 45 # datas = pd.series(['20190901','20190902','20190903']) 46 # datas.apply(lambda x:x[0:4]) 47 # datas.apply(lambda x:x[4:6]) 48 49 # 创建一个数据的副本 50 # df_copy = df.copy() 51 # df_copy['r_sum'] = df['sibsp']+df['parch'] 52 53 # 计算数学列的总和、平均值等,里面的字符串必须有同名函数 54 # df[u'数学'].agg(['sum','mean','max','std']) 55 56 # pandas(series、dataframe)类型转换为numpy(array)类型 57 # df[u'数学'].values 58 # df.loc[:,[u'数学',u'化学']].values 59 60 # 按照指定列的值排序,可指定正序倒序,默认正序 61 # df[u'数学'].sort_values() 62 # 按照索引排序 63 # df[u'数学'].sort_index() 64 # df[u'数学'].sort_values(ascending=false) 65 # 添加新列sum,值为每行总和,并倒序排列 66 # df['sum'] = df.sum(axis=1) 67 # df[u'sum'].sort_values(ascending=false) 68 69 70 # 取出embarked,survived字段,按照两个字段顺序做层次分组,然后做计算总和 71 # r = df.loc[:,['embarked','survived']].groupby(['embarked','survived']).size() 72 # r.c 73 # r.c[1] 74 # r.q 75 # r.q[0] 76 # r.q[1] 77 # r1 = df.loc[:,['embarked','survived']].groupby(['survived','embarked']).size() 78 # r2 = df.loc[:,['embarked','survived']].groupby('embarked').size() 79 # r3 = df.loc[:,['embarked','survived']].groupby('survived').size()
运行结果
"""
上面的运行结果 r
embarked survived
c 0 75
1 93
q 0 47
1 30
s 0 427
1 217
dtype: int64
r.c结果
survived
0 75
1 93
dtype: int64
r.c[1]结果
93
r1结果
survived embarked
0 c 75
q 47
s 427
1 c 93
q 30
s 217
dtype: int64
r2结果
embarked
c 168
q 77
s 644
dtype: int64
r3结果
survived
0 549
1 342
dtype: int64
"""
标注:
'''
1.axis转换行列
2.dataframe筛选一行或一列时会转化为series类型,可以直接后面加[数字]直接进行选择,但series不能使用dataframe的方法(groupby等)
3.筛选出来的数据的索引仍是原索引,不会重新排列新索引
'''
统计拍片数前10的某导演,指导电影的总票房
1 def piaofang(): 2 director10 = df_imdb['director'].value_counts().head(10) 3 # print director10.index[0] 4 revenues = 0 5 for d in director10.index: 6 print df_imdb[df_imdb['director'] == d]['revenue (millions)'].sum() 7 8 # piaofang() 9 10 # df_imdb[df_imdb['director']=='']['revenue (millions)'].sum()
票房分析
特征
'''
passengerid:乘客的唯一标志
survived:1获救,0死亡
pclass:座舱等级 3最好,1最差
name,sex,age,
sibsp:船上有没有兄弟姐妹
parch:父母等直系亲属是否在船上
ticket,
fare:票价或消费
cabin:座舱号
embarked:从哪个港口登船
891
'''
导入类库
1 import numpy as np 2 import matplotlib.pyplot as pt 3 import pandas as pd
准备数据
1 titanic = pd.read_csv('./titanic.csv') 2 3 titanic.fillna(int(titanic[u'age'].mean()))
测试代码
1 # print titanic['age'] 2 3 # print titanic[u'age'].mean() 4 # print titanic.loc[:,u'survived'].value_counts() #存活比例 5 # print titanic.loc[:,u'survived'].count() #总人数 6 7 # print titanic.loc[:, u'sex'].value_counts() #男女分类 8 # print titanic[titanic[u'sex'] == u'male']['survived'].value_counts() #男性生死分类 9 10 # print titanic.columns 11 # print titanic[titanic[u'age'] <= 18][u'survived'].value_counts() 12 # print titanic[(titanic[u'age'] > 18) & (titanic[u'age'] < 60)][u'survived'].value_counts() 13 # print titanic[titanic[u'age'] >= 60][u'survived'].value_counts() 14 15 # print titanic[u'fare'] 16 # print titanic[u'fare'].max() #贫富差距 17 # print titanic[u'fare'].min() 18 19 # print titanic[u'pclass'].value_counts() 20 # print titanic[u'pclass'].value_counts()[1] 21 # print titanic[u'pclass'].value_counts()[3] #座舱 22 # print titanic[titanic[u'pclass'] == 1]['survived'].value_counts() 23 # print titanic[titanic[u'pclass'] == 3]['survived'].value_counts() 24 25 # print titanic[u'sibsp'].value_counts() 26 # print titanic[u'parch'].value_counts() 27 28 # print titanic[u'embarked'].value_counts()
案例源码
1 class titanic(object): 2 def __init__(self): 3 self.data = titanic 4 5 # 1.存活率是多少 6 def rate_survive(self): 7 survived = self.data.loc[:, 'survived'].value_counts()[1] 8 death = self.data.loc[:, 'survived'].value_counts()[0] 9 rate = float(survived) / (float(death) + float(survived)) 10 print '总人数:{},存活人数:{},死亡人数:{}'.format(survived + death, survived, death) 11 return u'存活率:' + '%.2f' % rate 12 13 # 2.哪个年龄段存活率最高 14 def max_survive(self): 15 age18_survived = self.data[self.data[u'age'] <= 18][u'survived'].value_counts()[1] 16 age18_death = self.data[self.data[u'age'] <= 18][u'survived'].value_counts()[0] 17 age18_rate = float(age18_survived) / (float(age18_survived) + float(age18_death)) 18 19 age1860_survived = self.data[(self.data[u'age'] > 18) & (self.data[u'age'] < 60)][u'survived'].value_counts()[1] 20 age1860_death = self.data[(self.data[u'age'] > 18) & (self.data[u'age'] < 60)][u'survived'].value_counts()[0] 21 age1860_rate = float(age1860_survived) / (float(age1860_survived) + float(age1860_death)) 22 23 age60_survived = self.data[self.data[u'age'] >= 60][u'survived'].value_counts()[1] 24 age60_death = self.data[self.data[u'age'] >= 60][u'survived'].value_counts()[0] 25 age60_rate = float(age60_survived) / (float(age60_survived) + float(age60_death)) 26 27 rate = [age18_rate, age60_rate, age1860_rate] 28 age_data = ['18岁以下', '18-60岁', '60岁以上'] 29 max_rate = max(rate) 30 age_range = age_data[rate.index(max(rate))] 31 return '存活率最高的年龄段是{},存活率为{}'.format(age_range, max_rate) 32 33 # 3.女性存活率是否高于男性 34 def than_survive(self): 35 male_survied = self.data[self.data[u'sex'] == u'male'][u'survived'].value_counts()[1] 36 male_death = self.data[self.data[u'sex'] == u'male'][u'survived'].value_counts()[0] 37 rate_male = float(male_survied) / (float(male_survied) + float(male_death)) 38 print '男性共有{}人,存活{}人,死亡{}人'.format(male_death + male_survied, male_survied, male_death) 39 female_survied = self.data[self.data[u'sex'] == u'female'][u'survived'].value_counts()[1] 40 female_death = self.data[self.data[u'sex'] == u'female'][u'survived'].value_counts()[0] 41 rate_female = float(female_survied) / (float(female_survied) + float(female_death)) 42 print '女性共有{}人,存活{}人,死亡{}人'.format(female_death + female_survied, female_survied, female_death) 43 if rate_male > rate_female: 44 return u'男性存活率更高,存活率为:%.2f' % rate_male 45 else: 46 return u'女性存活率更高,存活率为:%.2f' % rate_female 47 48 # 4.船上是否出现贫富差距 49 def poor_wealth(self): 50 max_wealth = self.data[u'fare'].max() 51 max_poor = self.data[u'fare'].min() 52 if max_wealth - max_poor > 500: 53 return '船上乘客最多消费了{},最少消费了{},存在贫富差距'.format(max_wealth, max_poor) 54 else: 55 return '船上乘客最多消费了{},最少消费了{},不存在贫富差距'.format(max_wealth, max_poor) 56 57 # 5.头等舱乘客的存活率是否高于经济舱 58 def pclass_survive(self): 59 pclass1_survived = self.data[self.data[u'pclass'] == 1]['survived'].value_counts()[1] 60 pclass1_death = self.data[self.data[u'pclass'] == 1]['survived'].value_counts()[0] 61 pclass1_rate = float(pclass1_survived) / (float(pclass1_survived) + float(pclass1_death)) 62 63 pclass3_survived = self.data[self.data[u'pclass'] == 3]['survived'].value_counts()[1] 64 pclass3_death = self.data[self.data[u'pclass'] == 3]['survived'].value_counts()[0] 65 pclass3_rate = float(pclass3_survived) / (float(pclass3_survived) + float(pclass3_death)) 66 67 if pclass3_rate > pclass1_rate: 68 return '头等舱乘客存活率更高,存活率为{}'.format(pclass3_rate) 69 else: 70 return '经济舱乘客存活率更高,存活率为{}'.format(pclass1_rate) 71 72 # 6.有亲属在船上乘客比率,有亲属是否会影响存活率 73 def family_survive(self): 74 has_family = self.data[(self.data[u'parch'] != 0) | (self.data[u'sibsp'] != 0)][u'passengerid'].count() 75 no_family = self.data[(self.data[u'parch'] == 0) & (self.data[u'sibsp'] == 0)][u'passengerid'].count() 76 rate_family = float(has_family) / (float(has_family) + float(no_family)) 77 78 has_family_survived = \ 79 self.data[(self.data[u'parch'] != 0) | (self.data[u'sibsp'] != 0)][u'survived'].value_counts()[1] 80 has_family_death = \ 81 self.data[(self.data[u'parch'] != 0) | (self.data[u'sibsp'] != 0)][u'survived'].value_counts()[0] 82 has_family_rate = float(has_family_survived) / (float(has_family_survived) + float(has_family_death)) 83 84 no_family_survived = \ 85 self.data[(self.data[u'parch'] == 0) & (self.data[u'sibsp'] == 0)][u'survived'].value_counts()[1] 86 no_family_death = \ 87 self.data[(self.data[u'parch'] == 0) & (self.data[u'sibsp'] == 0)][u'survived'].value_counts()[0] 88 no_family_rate = float(no_family_survived) / (float(no_family_survived) + float(no_family_death)) 89 90 print '船上乘客中有亲属也在船上的有{}人,无亲属在船上的有{}人,有亲属在船上的乘客的比率为{}'.format(has_family, no_family, rate_family) 91 if has_family_rate > no_family_rate: 92 return '有亲属在船上的乘客存活率更高,存活率为{}'.format(has_family_rate) 93 else: 94 return '无亲属在船上的乘客存活率更高,存活率为{}'.format(no_family_rate) 95 96 # 7.从哪个港口登船是否影响获救 97 def emarked_survive(self): 98 embarked_s_survived = self.data[self.data[u'embarked'] == 's'][u'survived'].value_counts()[1] 99 embarked_s_death = self.data[self.data[u'embarked'] == 's'][u'survived'].value_counts()[0] 100 embarked_s_rate = float(embarked_s_survived) / (float(embarked_s_survived) + float(embarked_s_death)) 101 102 embarked_c_survived = self.data[self.data[u'embarked'] == 'c'][u'survived'].value_counts()[1] 103 embarked_c_death = self.data[self.data[u'embarked'] == 'c'][u'survived'].value_counts()[0] 104 embarked_c_rate = float(embarked_c_survived) / (float(embarked_c_survived) + float(embarked_c_death)) 105 106 embarked_q_survived = self.data[self.data[u'embarked'] == 'q'][u'survived'].value_counts()[1] 107 embarked_q_death = self.data[self.data[u'embarked'] == 'q'][u'survived'].value_counts()[0] 108 embarked_q_rate = float(embarked_q_survived) / (float(embarked_q_survived) + float(embarked_q_death)) 109 110 embarked = ['s港口', 'c港口', 'q港口'] 111 rate = [embarked_s_rate, embarked_c_rate, embarked_q_rate] 112 max_rate = max(rate) 113 return '{}存活率最大,为{}'.format(embarked[rate.index(max_rate)], max_rate) 114 115 # 8.不同年龄段女性的获救率 116 def female_survive(self): 117 female18_survived = \ 118 self.data[(self.data[u'age'] <= 18) & (self.data[u'sex'] == u'female')][u'survived'].value_counts()[1] 119 female18_death = \ 120 self.data[(self.data[u'age'] <= 18) & (self.data[u'sex'] == u'female')][u'survived'].value_counts()[0] 121 female18_rate = float(female18_survived) / (float(female18_survived) + float(female18_death)) 122 123 female1850_survived = \ 124 self.data[(self.data[u'age'] > 18) & (self.data[u'age'] < 50) & (self.data[u'sex'] == u'female')][ 125 u'survived'].value_counts()[1] 126 female1850_death = \ 127 self.data[(self.data[u'age'] > 18) & (self.data[u'age'] < 50) & (self.data[u'sex'] == u'female')][ 128 u'survived'].value_counts()[0] 129 female1850_rate = float(female1850_survived) / (float(female1850_survived) + float(female1850_death)) 130 131 female50_survived = \ 132 self.data[(self.data[u'age'] >= 50) & (self.data[u'sex'] == u'female')][u'survived'].value_counts()[1] 133 female50_death = \ 134 self.data[(self.data[u'age'] >= 50) & (self.data[u'sex'] == u'female')][u'survived'].value_counts()[0] 135 female50_rate = float(female50_survived) / (float(female50_survived) + float(female50_death)) 136 137 return '18岁以下女性存活率:{},18-50岁女性存活率:{},50岁以上女性存活率:{}'.format(female18_rate, female1850_rate, female50_rate) 138 139 140 if __name__ == '__main__': 141 tt = titanic() 142 # print tt.rate_survive() 143 # print tt.than_survive() 144 # print tt.max_survive() 145 # print tt.poor_wealth() 146 # print tt.pclass_survive() 147 # print tt.family_survive() 148 # print tt.emarked_survive() 149 print tt.female_survive()
data-->infomation-->knowledge-->wisdom
数据-->信息-->知识-->智慧
爬虫-->数据库-->数据分析-->机器学习
- 信息:通过某种方式组织和处理数据,分析数据间的关系,数据就有了意义
- 知识:如果说数据是一个事实的集合,从中可以得出关于事实的结论。那么知识(knowledge)就是信息的集合,它使信息变得有用。知识是对信息的应用,是一个对信息判断和确认的过程,这个过程结合了经验、上下文、诠释和反省。知识可以回答“如何?”的问题,可以帮助我们建模和仿真
- 智慧:智慧可以简单的归纳为做正确判断和决定的能力,包括对知识的最佳使用。智慧可以回答“为什么”的问题。回到前面的例子,根据故障对客户的业务影响可以识别改进点