Assignment 2
Assignment 2
(Olympic Medals)
necromuralist
Source
Part 1
The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia
entry on All Time Olympic Games Medals, and does some basic data cleaning.
Imports
# third party
import numpy
import pandas
from tabulate import tabulate
head olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№
Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
Australia (AUS) [AUS]
[Z],25,139,152,177,468,18,5,3,4,12,43,144,155,181,480
Austria (AUT),26,18,33,35,86,22,59,78,81,218,48,77,111,116,304
Azerbaijan (AZE),5,6,5,15,26,5,0,0,0,0,10,6,5,15,26
Looking at the output of our source data-file (olympics.csv) you can see that the
first row is a column-index that we'll want to get rid of. The first column is also
missing a header and there are columns with unicode and mysterious numeric
names with exclamation points at the end.
If you go to the wikipedia page and look at the NOC's with medals table
(NOC stands for National Olympic Committee - each country that participates in
the Olympics has one, unless they are part of a combined committee) you'll see
that the numbers with leading zeros correspond to the medal types, the decimal
numbers are the game (Summer (no trailing decimal), Winter (.1), or combined
(.2)). So 01 !.2 means the total number of gold medals for both the summer and
winter games combined. The No columns is the count of the number of times a
country participated in a game. So, looking at the output above, Afghanistan has
participated in 13 Summer games (this data was apparently pulled before the 2016
games in Rio de Janeiro, as their count is currently 14 on the source page). As for
the ! notation, the wikipedia page has arrows that you can click on that will let you
sort the rows by a particular column and I think this is how their screen-scraper
translated them into ascii.
description = data.describe()
print(tabulate(description, headers="keys", tablefmt="orgtbl"))
data.head()
№ Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 \
Afghanistan (AFG) 13 0 0 2 2 0 0
Algeria (ALG) 12 5 2 8 15 3 0
Argentina (ARG) 23 18 24 28 70 18 0
Armenia (ARM) 5 1 2 9 12 6 0
Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0
02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 \
Afghanistan (AFG) 0 0 0 13 0 0
Algeria (ALG) 0 0 0 15 5 2
Argentina (ARG) 0 0 0 41 18 24
Armenia (ARM) 0 0 0 11 1 2
Australasia (ANZ) [ANZ] 0 0 0 2 3 4
It looks like the countries got read as the index for the data-frame since the first
column was empty.
data.tail()
№ Summer 01 ! 02 ! 03 ! \
Independent Olympic Participants (IOP) [IOP] 1 0 1 2
Zambia (ZAM) [ZAM] 12 0 1 1
Zimbabwe (ZIM) [ZIM] 12 3 4 1
Mixed team (ZZX) [ZZX] 3 8 5 4
Totals 27 4809 4775 5130
Combined total
Independent Olympic Participants (IOP) [IOP] 3
Zambia (ZAM) [ZAM] 2
Zimbabwe (ZIM) [ZIM] 8
Mixed team (ZZX) [ZZX] 17
Totals 17579
print((data["Combined total"].sum() - data.loc["Totals"]["Combined total"]) *
2)
print(data["Combined total"].sum())
35158
35158
It also looks like the last row is a sum of the columns, which we will need to get
rid of (as you can see from the sum it doubles any math we do on a column).
new_columns = ("summer_participations",
"summer_gold",
"summer_silver",
"summer_bronze",
"summer_total",
"winter_participations",
"winter_gold",
"winter_silver",
"winter_bronze",
"winter_total",
"total_participations",
"total_gold",
"total_silver",
"total_bronze",
"total_combined")
assert len(new_columns) == len(data.columns)
data.head(1)
summer_participations summer_gold summer_silver \
Afghanistan (AFG) 13 0 0
total_bronze total_combined
Afghanistan (AFG) 2 2
data = data.drop('Totals')
data.head(1)
summer_participations summer_gold summer_silver
summer_bronze \
Afghanistan 13 0 0 2
Question 0 (Example)
def answer_zero():
"""This function returns the row for Afghanistan
Returns
-------
Questions
Question 1
Which country has won the most gold medals in summer games?
def answer_one():
"""
Returns
-------
Str: name of the country with the most summer gold medals
"""
return data.summer_gold.argmax()
answer_one()
United States
Since we put the NOCs in the index (and most are country names), finding the
index for the maximum value returns the country name that we want.
Question 2
Which country had the biggest difference between their summer and winter gold medal counts?
def answer_two():
"""
Returns
-------
Str: country with most difference between summer and winter gold
"""
return (data.summer_gold - data.winter_gold).abs().argmax()
answer_two()
United States
Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold
medal counts relative to their total gold medal count?
Summer Gold−Winter GoldTotal Gold
Only include countries that have won at least 1 gold in both summer and winter.
def answer_three():
"""
find country with the biggest difference between the
summer and winter gold counts relative to total gold metal count
if they have at least one gold medal in both summer and winter
Returns
-------
str: country with biggest summer/winter gold differe
"""
elegible = data[(data.summer_gold>=1) & (data.winter_gold>=1)]
ratios = (elegible.summer_gold -
elegible.winter_gold).abs()/elegible.total_gold
return ratios.argmax()
answer_three()
Bulgaria
Question 4
Write a function that creates a Series called "Points" which is a weighted value where each gold medal
(Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2)
for 1 point. The function should return only the column (a Series object) which you created.
def answer_four():
"""
Creates weighted points based on medals
* Gold: 3 points
* Silver: 2 points
* Bronze: 1 point
Returns
-------
Part 2
For the next set of questions, we will be using census data from the United States Census Bureau.
Counties are political and geographic subdivisions of states in the United States. This dataset contains
population data for counties and states in the US from 2010 to 2015. See this document for a
description of the variable names.
census_data = pandas.read_csv('census.csv')
census_data.head()
SUMLEV REGION DIVISION STATE COUNTY STNAME
CTYNAME \
0 40 3 6 1 0 Alabama Alabama
1 50 3 6 1 1 Alabama Autauga County
2 50 3 6 1 3 Alabama Baldwin County
3 50 3 6 1 5 Alabama Barbour County
4 50 3 6 1 7 Alabama Bibb County
CENSUS2010POP ESTIMATESBASE2010
POPESTIMATE2010 ... \
0 4779736 4780127 4785161 ...
1 54571 54571 54660 ...
2 182265 182265 183193 ...
3 27457 27457 27341 ...
4 22915 22919 22861 ...
RNETMIG2015
0 0.712594
1 -2.187333
2 18.293499
3 -10.543299
4 1.107861
class CensusVariables:
state_name = "STNAME"
county_name = "CTYNAME"
census_population = "CENSUS2010POP"
region = "REGION"
population_2014 = "POPESTIMATE2014"
population_2015 = "POPESTIMATE2015"
population_estimates = ["POPESTIMATE2010",
"POPESTIMATE2011",
"POPESTIMATE2012",
"POPESTIMATE2013",
population_2014,
population_2015]
county_level = 50
summary_level = "SUMLEV"
Counties Only
counties = census_data[census_data[
CensusVariables.summary_level]==CensusVariables.county_level]
# this throws off the numeric index for the argmax method so reset it
counties = counties.reset_index()
def answer_five():
"""finds state with the most counties
Returns
-------
answer_five()
Texas
Question 6
Only looking at the three most populous counties for each state, what are the three most populous states
(in order of highest population to lowest population)? Use CENSUS2010POP.
def answer_six():
"""finds three most populous states based on top three counties in each
Returns
-------
answer_six()
California Texas Illinois
Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint:
population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you
need to consider all six columns.)
e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then
its largest change in the period would be |130-80| = 50.
This function should return a single string value.
def answer_seven():
"""Find county with largest absolute population variance
Returns
-------
answer_seven()
Harris County
Question 8
In this datafile, the United States is broken up into four regions using the "REGION" column.
Create a query that finds the counties that belong to regions 1 or 2, whose name
starts with 'Washington', and whose POPESTIMATE2015 was greater than
their POPESTIMATE2014.
This function should return a 5x2 DataFrame with the columns =
['STNAME', 'CTYNAME'] and the same index ID as the census_data (sorted
ascending by index).
def answer_eight():
"""find region 1 or 2 counties:
.. note:: the index in the final data-frame has to match the original
census data
Returns
-------
outcome = answer_eight()
assert outcome.shape == (5,2)
assert list(outcome.columns) == ['STNAME', 'CTYNAME']
print(tabulate(outcome, headers=["index"] + list(outcome.columns),
tablefmt="orgtbl"))
index STNAME CTYNAME
896 Iowa Washington County
1419 Minnesota Washington County
2345 Pennsylvania Washington County
2355 Rhode Island Washington County
3163 Wisconsin Washington County