Introduction to Data Science in Python 第 2 周 Assignment
Introduction to Data Science in Python
第 2 周 Assignment
记录下问题和自己的答案
Part 1
The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.
The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.
Question 1
Which country has won the most gold medals in summer games?
This function should return a single string value.
【IN】
def answer_one():
return df[df['Gold'] == df['Gold'].max()].index[0]
answer_one()
【OUT】
'United States'
Question 2
Which country had the biggest difference between their summer and winter gold medal counts?
This function should return a single string value.
【IN】
def answer_two():
df['diff'] = abs(df['Gold'] - df['Gold.1'])
max = df['diff'].max()
return df[df['diff'] == max].index[0]
answer_two()
【OUT】
'United States'
Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?
Only include countries that have won at least 1 gold in both summer and winter.
This function should return a single string value.
【IN】
def answer_three():
result = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
result['diff'] = (result['Gold'] - result['Gold.1']) / result['Gold.2']
max = result['diff'].max()
return result[result['diff'] == max].index[0]
answer_three()
【OUT】
'Bulgaria'
Question 4
Write a function that creates a Series called “Points” which is a weighted value where each gold medal (Gold.2
) counts for 3 points, silver medals (Silver.2
) for 2 points, and bronze medals (Bronze.2
) for 1 point. The function should return only the column (a Series object) which you created, with the country names as indices.
This function should return a Series named Points
of length 146
【IN】
def answer_four():
result = df['Gold.2'] * 3 + df['Silver.2'] * 2 + df['Bronze.2']
return pd.Series(result, name="Points")
answer_four()
【OUT】
Afghanistan 2
Algeria 27
Argentina 130
Armenia 16
Australasia 22
Australia 923
Austria 569
Azerbaijan 43
Bahamas 24
Bahrain 1
Barbados 1
Belarus 154
Belgium 276
Bermuda 1
Bohemia 5
Botswana 2
Brazil 184
British West Indies 2
Bulgaria 411
Burundi 3
Cameroon 12
Canada 846
Chile 24
China 1120
Colombia 29
Costa Rica 7
Ivory Coast 2
Croatia 67
Cuba 420
Cyprus 2
...
Spain 268
Sri Lanka 4
Sudan 2
Suriname 4
Sweden 1217
Switzerland 630
Syria 6
Chinese Taipei 32
Tajikistan 4
Tanzania 4
Thailand 44
Togo 1
Tonga 2
Trinidad and Tobago 27
Tunisia 19
Turkey 191
Uganda 14
Ukraine 220
United Arab Emirates 3
United States 5684
Uruguay 16
Uzbekistan 38
Venezuela 18
Vietnam 4
Virgin Islands 2
Yugoslavia 171
Independent Olympic Participants 4
Zambia 3
Zimbabwe 18
Mixed team 38
Name: Points, dtype: int64
Part 2
For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.
The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.
Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You’ll need this for future questions too…)
This function should return a single string value.
【IN】
def answer_five():
city_level = census_df[census_df['SUMLEV'] == 50]
st_list = pd.DataFrame(city_level.groupby("STNAME").size(), columns=["size"])
max = st_list['size'].max()
return st_list[st_list['size'] == max].index[0]
answer_five()
【OUT】
'Texas'
Question 6
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP
.
This function should return a list of string values.
【IN】
def answer_six():
state_list = census_df[census_df['SUMLEV'] == 50].groupby('STNAME').apply(lambda x: x.sort_values('CENSUS2010POP', ascending=False).head(3).sum())
result = state_list.sort_values('CENSUS2010POP', ascending=False)
return result.index[0:3].tolist()
answer_six()
【OUT】
['California', 'Texas', 'Illinois']
以下第一次提交的答案错了,审题不仔细
def answer_six():
cty_list = census_df[census_df['SUMLEV'] == 50].sort_values('CENSUS2010POP', ascending=False)
return cty_list['CTYNAME'].tolist()[0:3]
answer_six()
# 漏了关键:Only looking at the three most populous counties for each state
# 题目要求只计算每个STATE里population前三的COUNTRY,而不是所有
Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)
e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.
This function should return a single string value.
【IN】
def answer_seven():
cty_list = census_df[census_df['SUMLEV'] == 50]
population = cty_list[['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']]
pop_max = population.max(axis=1)
pop_min = population.min(axis=1)
diff = pop_max - pop_min
cty_list['diff'] = diff
return cty_list.loc[diff.argmax()]['CTYNAME']
answer_seven()
【OUT】
'Harris County'
以下第一次提交的答案错了,审题不清。。。
def answer_seven():
population = census_df[['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']]
diff = abs(population.max(axis=1) - population.min(axis=1))
census_df['diff'] = diff
return census_df.iloc[census_df['diff'].argmax()]['CTYNAME']
answer_seven()
# 忘记去掉 “洲”!!这里得到的答案 Texas 是STATE,而不是COUNTRY
# 需要先过滤掉 SUMLEV == 40 的(即 STATE)
以下第二次提交的答案还是错了,纠结了很久错误的位置。。最后发现是 iloc 和 loc 使用混淆。。
def answer_seven():
cty_list = census_df[census_df['SUMLEV'] == 50]
population = cty_list[['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']]
pop_max = population.max(axis=1)
pop_min = population.min(axis=1)
population['max'] = pop_max
population['min'] = pop_min
diff = pop_max - pop_min
cty_list['diff'] = diff
return cty_list.iloc[diff.argmax()]['CTYNAME']
# return census_df[census_df['diff'] == diff.max()]['CTYNAME']
# return census_df[census_df['CTYNAME'] == 'Texas'][['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015', 'diff']]
answer_seven()
# 不能用 iloc,这里要用 loc。 需要搞清楚区别和使用
Question 8
In this datafile, the United States is broken up into four regions using the “REGION” column.
Create a query that finds the counties that belong to regions 1 or 2, whose name starts with ‘Washington’, and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.
This function should return a 5x2 DataFrame with the columns = [‘STNAME’, ‘CTYNAME’] and the same index ID as the census_df (sorted ascending by index).
【IN】
def answer_eight():
region_1 = census_df['REGION'] == 1
region_2 = census_df['REGION'] == 2
cty_washington = census_df['CTYNAME'].str.startswith('Washington')
pop_2015_greater_2014 = census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']
region_list = census_df[(region_1 | region_2) & cty_washington & pop_2015_greater_2014]
return region_list[['STNAME', 'CTYNAME']]
answer_eight()
【OUT】
STNAME CTYNAME
896 Iowa Washington County
1419 Minnesota Washington County
2345 Pennsylvania Washington County
2355 Rhode Island Washington County
3163 Wisconsin Washington County