9.Pandas练习:美国个州的统计数据
Pandas练习:美国各州的统计数据
前面讲解了那么多Pandas中的内容,下面将结合真实的数据集来进行Pandas数据分析练习,
我们这个练习只进行最简单的分析,目的是找出人口密度最高和最稀疏的州
数据集为美国各州的统计数据,包括每个州的缩写,人口信息和面积
数据集原地址为原书作者的github地址(https://github.com/jakevdp/data-USstates/)
这里给出一个链接(按住Ctrl
鼠标左键单击访问)
我也提供了一个百度网盘下载地址(链接:百度网盘下载地址 ,提取码:6666)
文件存放地址根据个人喜好而定,我是放在当前python文件下得同一文件夹内
准备部分
准备部分非常简单,就是导入需要用的库和加载需要处理的文件
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
population=pd.read_csv('./data-USstates-master/state-population.csv')
areas=pd.read_csv('./data-USstates-master/state-areas.csv')
abbreviation=pd.read_csv('./data-USstates-master/state-abbrevs.csv')
了解数据集
通常进行数据分析时,上手的第一步不是直接对数据集进行操作,而是首先了解我们需要处理的数据集,包括数据集的类型,数据集中数据的多少,读取到数据集的形状,数据集大致的内容等等
所以我们分成两个维度进行,数据集的内容和数据集的附属信息.类似于Linux中文件除了本身的内容外还具有归属,权限,创建时间等等附属信息
print('---------------数据集内容---------------\n')
print(abbreviation.head())
print(areas.head())
print(population.head())
print('---------------数据集信息---------------\n')
print(type(abbreviation),'\t\t\t\t\t',abbreviation.shape)
print(abbreviation.columns)
print(abbreviation.index,'\n')
print(type(areas),'\t\t\t\t\t',areas.shape)
print(areas.columns)
print(areas.index,'\n')
print(type(population),'\t\t\t\t\t',population.shape)
print(population.columns)
print(population.index)
>>>
---------------数据集内容---------------
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
---------------数据集信息---------------
<class 'pandas.core.frame.DataFrame'> (51, 2)
Index(['state', 'abbreviation'], dtype='object')
RangeIndex(start=0, stop=51, step=1)
<class 'pandas.core.frame.DataFrame'> (52, 2)
Index(['state', 'area (sq. mi)'], dtype='object')
RangeIndex(start=0, stop=52, step=1)
<class 'pandas.core.frame.DataFrame'> (2544, 4)
Index(['state/region', 'ages', 'year', 'population'], dtype='object')
RangeIndex(start=0, stop=2544, step=1)
从上面的结果中,我们能够得知
-
读取到的三个数据表都被转化成了DataFrame对象,
-
abbreviation表中储存的是美国各个州的全程与对应的缩写,
areas表中储存的是美国各州的面积
population表中储存的美国各州若干年份的未成年人和成年人数据
问题一
表一中有51个州,但是表二却有52列,那么到底是哪里不同?
可以借助外连接之后不是两个表共有的值会以NaN填充
所以我们可以查找NaN值来首先看看那一列具有缺失值,然后再在缺失的列中进行查找
关键在于逐步减小缺失值的范围
whichstate=pd.merge(abbreviation,areas,on='state',how='outer')
print(whichstate.isnull().any())
>>>
state False
abbreviation True
area (sq. mi) False
dtype: bool
大显示abbreviation这一列中有缺失值,那么我们接下来直接答应出这一列中缺失值的
whichstate=pd.merge(abbreviation,areas,on='state',how='outer')
print(whichstate[whichstate['abbreviation'].isnull()])
>>>
state abbreviation area (sq. mi)
51 Puerto Rico NaN 3515
发现是Puerto Rico这个州缺失了缩写,百度一下这个名字,发现是波多黎各
再进一步了解得知,波多黎各之前是美国的一个自治区,2017年全民公投希望成为美国的第51个州,所以这里没有缩写
再验证一下,发现果然如此
print('Puerto Rico' in abbreviation['state'])
>>>
False
我们实际上只是对数据集进行了初步的了解就已经获得了在背后的隐藏信息
问题二
算上波多戈里,美国一共有51个州,但是在population表中,却有2544行,那么population这个表的构造到底是什么样的呢?
为此我们可以首先挑出阿拉巴马州(Alabama)来看看population中一个州有多少个年份
print(population.set_index('state/region'))
print(population.set_index('state/region').T['AL'].shape[1])
>>>
ages year population
state/region
AL under18 2012 1117489.0
AL total 2012 4817528.0
AL under18 2010 1130966.0
AL total 2010 4785570.0
AL under18 2011 1125763.0
... ... ... ...
USA total 2010 309326295.0
USA under18 2011 73902222.0
USA total 2011 311582564.0
USA under18 2012 73708179.0
USA total 2012 313873685.0
[2544 rows x 3 columns]
48
这里我们查询一个州有多少个年份的时候首先对原DataFrame进行转置,这是因为我们只能对列索引(columns)进行查找取值
此外,由于每年的数据包括未成年和总体,所以实际上一共有24年
然后我们继续查看一共记录了多少个州
print(population.shape[0]/48)
>>>
53.0
发现一共有罗列了53个周
这显然和前面的50 / 51个州对不上,因此我们需要继续查看是哪里不对,到底多了哪些内容
问题三
根据前面的论述,发现在population中具有多的"州",我们下面就要找出多的"周"
同样,结合前面查找波多黎各的例子,我们依旧是使用pd.merge拼接population和abbreviation,然后查找NaN在哪
首先确定合并的列,而用于合并的列关键在于具有相同的值,所以首先看看那一列的值相同
print(population.head())
print(abbreviation.head())
>>>
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
所以我们可以指定合并的列为’state/region’和’state’,然后查找具有缺失的列
whichstat_1=pd.merge(abbreviation,population,left_on='abbreviation',right_on='state/region',how='outer')
print(whichstat_1.isnull().any())
>>>
state True
abbreviation True
state/region False
ages False
year False
population True
dtype: bool
所以我们查找state中的缺失值即可
whichstat_1=pd.merge(abbreviation,population,left_on='abbreviation',right_on='state/region',how='outer')
print(population['state/region'][whichstat_1['state'].isnull()])
print(population['state/region'][whichstat_1['state'].isnull()].size)
>>>
2448 PR
2449 PR
2450 PR
2451 PR
2452 PR
...
2539 USA
2540 USA
2541 USA
2542 USA
2543 USA
Name: state/region, Length: 96, dtype: object
96
所以刚好多了PR和USA这两个’州’
结合前面所讲,population相比与abbreviation多了波多黎各和全美国的人口信息
接下来同理,转置之后查看PR和USA分别具体的内容
print(population.set_index('state/region').T['PR'].T)
print(population.set_index('state/region').T['USA'].T)
>>>
ages year population
state/region
PR under18 1990 NaN
PR total 1990 NaN
PR total 1991 NaN
PR under18 1991 NaN
PR total 1993 NaN
PR under18 1993 NaN
PR under18 1992 NaN
PR total 1992 NaN
PR under18 1994 NaN
PR total 1994 NaN
PR total 1995 NaN
PR under18 1995 NaN
PR under18 1996 NaN
PR total 1996 NaN
PR under18 1998 NaN
PR total 1998 NaN
PR total 1997 NaN
PR under18 1997 NaN
PR total 1999 NaN
PR under18 1999 NaN
PR total 2000 3.81060e+06
PR under18 2000 1.08906e+06
PR total 2001 3.81877e+06
PR under18 2001 1.07757e+06
PR total 2002 3.8237e+06
PR under18 2002 1.06505e+06
PR total 2004 3.82688e+06
PR under18 2004 1.03592e+06
PR total 2003 3.8261e+06
PR under18 2003 1.05062e+06
PR total 2005 3.82136e+06
PR under18 2005 1.01945e+06
PR total 2006 3.80521e+06
PR under18 2006 998543
PR total 2007 3.783e+06
PR under18 2007 973613
PR total 2008 3.76087e+06
PR under18 2008 945705
PR under18 2013 814068
PR total 2013 3.61509e+06
PR total 2009 3.74041e+06
PR under18 2009 920794
PR total 2010 3.72121e+06
PR under18 2010 896945
PR under18 2011 869327
PR total 2011 3.68658e+06
PR under18 2012 841740
PR total 2012 3.65154e+06
ages year population
state/region
USA under18 1990 6.42185e+07
USA total 1990 2.49623e+08
USA total 1991 2.52981e+08
USA under18 1991 6.5313e+07
USA under18 1992 6.65092e+07
USA total 1992 2.56514e+08
USA total 1993 2.59919e+08
USA under18 1993 6.75949e+07
USA under18 1994 6.86409e+07
USA total 1994 2.63126e+08
USA under18 1995 6.94731e+07
USA under18 1996 7.02335e+07
USA total 1995 2.66278e+08
USA total 1996 2.69394e+08
USA total 1997 2.72647e+08
USA under18 1997 7.09207e+07
USA under18 1998 7.14314e+07
USA total 1998 2.75854e+08
USA under18 1999 7.19461e+07
USA total 2000 2.82162e+08
USA under18 2000 7.23762e+07
USA total 1999 2.7904e+08
USA total 2001 2.84969e+08
USA under18 2001 7.26712e+07
USA total 2002 2.87625e+08
USA under18 2002 7.29365e+07
USA total 2003 2.90108e+08
USA under18 2003 7.31008e+07
USA total 2004 2.92805e+08
USA under18 2004 7.32977e+07
USA total 2005 2.95517e+08
USA under18 2005 7.35237e+07
USA total 2006 2.9838e+08
USA under18 2006 7.37577e+07
USA total 2007 3.01231e+08
USA under18 2007 7.40194e+07
USA total 2008 3.04094e+08
USA under18 2008 7.41046e+07
USA under18 2013 7.35859e+07
USA total 2013 3.16129e+08
USA total 2009 3.06772e+08
USA under18 2009 7.41342e+07
USA under18 2010 7.41196e+07
USA total 2010 3.09326e+08
USA under18 2011 7.39022e+07
USA total 2011 3.11583e+08
USA under18 2012 7.37082e+07
USA total 2012 3.13874e+08
至此我们已经初步了解了我们即将需要处理的数据集。
解决问题
我们的目标就是找出人口密度最大和最低的州,为此我们就需要让各州的历年的人口除以各州的面积,然后调用reduce类方法来获取即可
但是从我们前面的讲解中我们知道我们即将处理的数据集具有缺失值,是比较“脏”的,所以我们首先要进行数据预处理,去除缺失值和极端值等
数据预处理
这里我们的操作上面已经说过了,所以为了确保我们能够相处得到正确的结果,我们首先需要排除掉各州人口和面积中的缺失值
print(areas['area (sq. mi)'].isnull().sum(),'\n')
print(population['population'].isnull().sum())
>>>
0
20
发现各州的面积没有缺失值,但是各州历年的人口中有缺失值,那么我们首先找出来所有的缺失值在哪,然后用0来填充
print(population['population'].isnull())
print(population[population['population'].isnull()])
>>>
0 False
1 False
2 False
3 False
4 False
...
2539 False
2540 False
2541 False
2542 False
2543 False
Name: population, Length: 2544, dtype: bool
state/region ages year population
2448 PR under18 1990 NaN
2449 PR total 1990 NaN
2450 PR total 1991 NaN
2451 PR under18 1991 NaN
2452 PR total 1993 NaN
2453 PR under18 1993 NaN
2454 PR under18 1992 NaN
2455 PR total 1992 NaN
2456 PR under18 1994 NaN
2457 PR total 1994 NaN
2458 PR total 1995 NaN
2459 PR under18 1995 NaN
2460 PR under18 1996 NaN
2461 PR total 1996 NaN
2462 PR under18 1998 NaN
2463 PR total 1998 NaN
2464 PR total 1997 NaN
2465 PR under18 1997 NaN
2466 PR total 1999 NaN
2467 PR under18 1999 NaN
这里我们发现,我们对population的population这一列查找缺失值,得到的是一个只有一列的布尔DataFrame
但是我们依旧能够以这个布尔DataFrame来作为花哨的索引,这是因为进行了广播
接下来我们用0来填充然后检查填充后的population是否存在缺失值
population['population']=population['population'].fillna(0)
print(population['population'].isnull().sum())
>>>
0
数据处理
由于population中只有各州的缩写,而areas中只有各州的全写,所以我们首先把abbreviation和areas按照全写合并起来,接下来再让abbreviation与areas的合并DataFrame对象与population合并
我们对数据集的了解得知,population中具有areas不具有的USA以及缺少波多黎各州的缩写,所以在合并abbreviation与areas的合并DataFrame与popultion前,需要去掉所有含USA的行并且补充波多黎各的缩写
population['population']=population['population'].fillna(0)
population=population.set_index('state/region').T.drop('USA',axis=1).T
merged_half=pd.merge(abbreviation,areas,left_on='state',right_on='state',how='outer')
merged_half['abbreviation'].values[-1]='PR'
merged_all=pd.merge(population,merged_half,left_on='state/region',right_on='abbreviation',how='outer')
merged_all['density']=merged_all['population']/merged_all['area (sq. mi)']
merged_all.sort_values(ascending=False,inplace=True,by='density')
print(merged_all)
>>>
ages year population state abbreviation area (sq. mi) density
391 total 2013 646449 District of Columbia DC 68 9506.6
385 total 2012 633427 District of Columbia DC 68 9315.1
387 total 2011 619624 District of Columbia DC 68 9112.12
431 total 1990 605321 District of Columbia DC 68 8901.78
389 total 2010 605125 District of Columbia DC 68 8898.9
... ... ... ... ... ... ... ...
2461 total 1996 0 Puerto Rico PR 3515 0
2462 under18 1998 0 Puerto Rico PR 3515 0
2463 total 1998 0 Puerto Rico PR 3515 0
2465 under18 1997 0 Puerto Rico PR 3515 0
2448 under18 1990 0 Puerto Rico PR 3515 0
[2496 rows x 7 columns]
上面我们最后调用了sort_values方法来指定按照density的大小排序,并且将排序后的结果覆盖原数组
获取结果
我们上面已经成功的进行了排序,我们下面将使用后面会讲到的query方法来获取特定的值
print(merged_all.query("year==2010 & ages == 'total'").head())
>>>
ages year population state abbreviation area (sq. mi) density
389 total 2010 605125 District of Columbia DC 68 8898.9
2490 total 2010 3.72121e+06 Puerto Rico PR 3515 1058.67
1445 total 2010 8.80271e+06 New Jersey NJ 8722 1009.25
1914 total 2010 1.05267e+06 Rhode Island RI 1545 681.339
293 total 2010 3.57921e+06 Connecticut CT 5544 645.601
这里我们指定查询2010年各州所有年龄的人口密度,显示出前几个即人口密度前几的地方
发现2010年全美人工密度最高的是华盛顿特区的哥伦比亚区,各州中人口密度最高的是新泽西州
我们再看看人口密度最少的州
print(merged_all.query("year==2010 & ages == 'total'").tail())
>>>
ages year population state abbreviation area (sq. mi) density
2010 total 2010 816211 South Dakota SD 77121 10.5835
1637 total 2010 674344 North Dakota ND 70704 9.53757
1253 total 2010 990527 Montana MT 147046 6.73617
2405 total 2010 564222 Wyoming WY 97818 5.76808
91 total 2010 713868 Alaska AK 656425 1.08751
发现人口密度最低的州是阿拉斯加州,人口刚过1万人每平方公里
至此,我们已经成功实现了我们的目的,最后重构下代码,添加查询和绘图功能
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
'''
API说明:
FindAns(state,year, age)
state是需要查询的州,只有state参数时会输出查询州的所有人口信息
year是要查询的年份,只有year参数是会输出查询年份的所有人口信息
age是指定查询的对象,可选total或under18,只有year参数会输出指定所有州所有年份指定对象的人口信息
三个参数可以任意搭配
当指定state参数和age参数时会绘制该州指定人口历年的人口图像
示例:
FindAns(age='total')
FindAns(year=2010)
FindAns(state='AK')
FindAns(state='WI',age='total',year=2010)
FindAns(state='CA',age='total')
FindAns(age='total',year=1998)
'''
global population,areas,abbreviation
population=pd.read_csv('./data-USstates-master/state-population.csv')
areas=pd.read_csv('./data-USstates-master/state-areas.csv')
abbreviation=pd.read_csv('./data-USstates-master/state-abbrevs.csv')
global All_sorted,All_unsorted,Ans_sorted,Ans_unsorted
def Description():
print('---------------数据集内容---------------\n')
print(abbreviation.head())
print('\n\n')
print(areas.head())
print('\n\n')
print(population.head())
print('\n\n\n')
print('---------------数据集信息---------------\n')
print('abbreviation:')
print('\t',type(abbreviation), '\t\t\t\t\t', abbreviation.shape)
print('\t',abbreviation.columns)
print('\t',abbreviation.index,'\n')
print('\n\n')
print('areas:')
print('\t',type(areas),'\t\t\t\t\t',areas.shape)
print('\t',areas.columns)
print('\t',areas.index,'\n')
print('population:')
print('\t',type(population),'\t\t\t\t\t',population.shape)
print('\t',population.columns)
print('\t',population.index)
print('\n\n\n')
def GetAll():
global All_sorted,population,All_unsorted
population['population'].fillna(0)
population['population']=population['population'].fillna(0)
population=population.set_index('state/region').T.drop('USA',axis=1).T
merged_half=pd.merge(abbreviation,areas,left_on='state',right_on='state',how='outer')
merged_half['abbreviation'].values[-1]='PR'
merged_all=pd.merge(population,merged_half,left_on='state/region',right_on='abbreviation',how='outer')
merged_all['density']=merged_all['population']/merged_all['area (sq. mi)']
All_unsorted=merged_all.sort_values(ascending=True,by='year')
All_sorted=merged_all.sort_values(ascending=False,by='density')
def FindAns(age=0,state=0,year=0):
global Ans_sorted,Ans_unsorted,All_sorted,All_unsorted
if age==0 and state==0 and year==0:
print('Error value, you must indicate at least one parameter in year,state,age')
return 0
elif age!=0 and state!=0 and year==0:
str_find='ages== \''+str(age)+'\' & '+'abbreviation== \''+str(state)+'\''
elif age!=0 and state==0 and year!=0:
str_find='ages==\''+str(age)+'\' & '+'year=='+str(year)
elif age==0 and state!=0 and year!=0:
str_find='abbreviation==\''+str(state)+'\' & '+'year=='+str(year)
elif age!=0 and state==0 and year==0:
str_find='ages==\''+str(age)+'\''
elif age==0 and state!=0 and year==0:
str_find='abbreviation==\''+str(state)+'\''
elif age==0 and state==0 and year!=0:
str_find='year=='+str(year)
elif age!=0 and state!=0 and year!=0:
str_find='ages==\''+str(age)+'\' & '+'abbreviation==\''+str(state)+'\' & '+'year=='+str(year)
Ans_sorted=All_sorted.query(str_find)
Ans_unsorted=All_unsorted.query(str_find)
print(Ans_unsorted)
if age!=0 and state!=0 and year==0:
PlotState()
def PlotState():
global Ans_sorted,Ans_unsorted
plt.plot(Ans_unsorted['year'],Ans_unsorted['density'])
plt.xlabel('Year')
plt.ylabel('Population Density')
plt.title('Population Density of '+str(Ans_unsorted.iloc[0,3]))
plt.show()
Description()
GetAll()