
Assignment 2

The document summarizes an assignment to analyze Olympic medal data from a CSV file. It loads the data, cleans it by removing unnecessary columns and rows and renaming columns to be more descriptive. It also provides some descriptive statistics about the data.

Uploaded by

madhu4urao

Assignment 2 - Pandas Introduction

(Olympic Medals)
necromuralist 
 
Source

This is the notebook for assignment 2 of the Coursera Python Data Analysis course.
It was converted from a Jupyter notebook that you can download as part of the
course downloads zip file. Submissions work by uploading an ipynb file, so some
cutting and pasting is needed to get the code from here to there. I also didn't use
their variable names, so the final outcome needs some aliasing for the grader to
pass it.

Part 1
The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia
entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as:

• # of Summer games
• Summer medals
• # of Winter games
• Winter medals
• total # of games
• total # of medals

Imports

# third party

import numpy
import pandas
from tabulate import tabulate

Load the Data

A quick look at the data-source


The data is stored in a comma-separated file named olympics.csv.

head olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
Australia (AUS) [AUS] [Z],25,139,152,177,468,18,5,3,4,12,43,144,155,181,480
Austria (AUT),26,18,33,35,86,22,59,78,81,218,48,77,111,116,304
Azerbaijan (AZE),5,6,5,15,26,5,0,0,0,0,10,6,5,15,26

Looking at the source data file (olympics.csv), you can see that the first row is a
column index that we'll want to get rid of. The first column is also missing a
header, and there are columns with unicode and mysterious numeric names with
exclamation points at the end.

If you go to the Wikipedia page and look at the NOCs with medals table (NOC
stands for National Olympic Committee; each country that participates in the
Olympics has one, unless it is part of a combined committee), you'll see that the
numbers with leading zeros correspond to the medal types (01 is gold, 02 is silver,
03 is bronze). Once the file is loaded, pandas adds a decimal suffix to each
repeated column name, and that suffix identifies the games: Summer (no suffix),
Winter (.1), or combined (.2). So 01 !.2 means the total number of gold medals for
the Summer and Winter games combined. The № columns count the number of times
a country participated in the games. So, looking at the output above, Afghanistan
has participated in 13 Summer games (this data was apparently pulled before the
2016 games in Rio de Janeiro, as their count is currently 14 on the source page).
As for the ! notation, the Wikipedia page has arrows that you can click to sort the
rows by a particular column, and I think this is how their screen-scraper
translated them into ASCII.
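
The .1 and .2 suffixes are not in the file itself; pandas appends them when read_csv meets repeated header names. Here is a minimal sketch of that behavior, using a cut-down, in-memory copy of the first rows shown above. The column subset and the names sample_csv and sample are illustrative assumptions, not part of the assignment.

```python
import io

import pandas

# A trimmed sample: the numeric index row, the header row, and two data rows
sample_csv = """0,1,2,3,4
,№ Summer,01 !,№ Winter,01 !
Afghanistan (AFG),13,0,0,0
Algeria (ALG),12,5,3,0
"""

# skiprows=1 drops the numeric index row; index_col=0 uses the unnamed
# first column (the NOC names) as the row index
sample = pandas.read_csv(io.StringIO(sample_csv), index_col=0, skiprows=1)

# The second "01 !" column gets a ".1" suffix so the names stay unique
print(list(sample.columns))
print(list(sample.index))
```

The same call on the full file is what produces the 01 !.1 and 01 !.2 names seen in the rest of this notebook.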

Load the data into pandas

data = pandas.read_csv('olympics.csv', index_col=0, skiprows=1)

description = data.describe()
print(tabulate(description, headers="keys", tablefmt="orgtbl"))

  № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 №


count 147 147 147 147 147 147 147 147 147 147 1
mean 13.4762 65.4286 64.966 69.7959 200.19 6.70068 13.0476 13.034 12.898 38.9796 2
std 7.07236 405.55 399.31 427.187 1231.31 7.43319 80.7992 80.6344 79.5884 240.917 1
min 1 0 0 0 0 0 0 0 0 0 1
25% 8 0 1 1 2 0 0 0 0 0 1
50% 13 3 4 6 12 5 0 0 0 0 1
75% 18.5 24 28 29 86 10 1 2 1 5 2
max 27 4809 4775 5130 14714 22 959 958 948 2865 4

data.head()
№ Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 \
Afghanistan (AFG) 13 0 0 2 2 0 0
Algeria (ALG) 12 5 2 8 15 3 0
Argentina (ARG) 23 18 24 28 70 18 0
Armenia (ARM) 5 1 2 9 12 6 0
Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0
02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 \
Afghanistan (AFG) 0 0 0 13 0 0
Algeria (ALG) 0 0 0 15 5 2
Argentina (ARG) 0 0 0 41 18 24
Armenia (ARM) 0 0 0 11 1 2
Australasia (ANZ) [ANZ] 0 0 0 2 3 4

03 !.2 Combined total


Afghanistan (AFG) 2 2
Algeria (ALG) 8 15
Argentina (ARG) 28 70
Armenia (ARM) 9 12
Australasia (ANZ) [ANZ] 5 12

The countries were read in as the index for the data-frame because index_col=0
tells read_csv to use the first (unnamed) column as the index.

data.tail()
№ Summer 01 ! 02 ! 03 ! \
Independent Olympic Participants (IOP) [IOP] 1 0 1 2
Zambia (ZAM) [ZAM] 12 0 1 1
Zimbabwe (ZIM) [ZIM] 12 3 4 1
Mixed team (ZZX) [ZZX] 3 8 5 4
Totals 27 4809 4775 5130

Total № Winter 01 !.1 02 !.1 \


Independent Olympic Participants (IOP) [IOP] 3 0 0 0
Zambia (ZAM) [ZAM] 2 0 0 0
Zimbabwe (ZIM) [ZIM] 8 1 0 0
Mixed team (ZZX) [ZZX] 17 0 0 0
Totals 14714 22 959 958

03 !.1 Total.1 № Games \


Independent Olympic Participants (IOP) [IOP] 0 0 1
Zambia (ZAM) [ZAM] 0 0 12
Zimbabwe (ZIM) [ZIM] 0 0 13
Mixed team (ZZX) [ZZX] 0 0 3
Totals 948 2865 49

01 !.2 02 !.2 03 !.2 \


Independent Olympic Participants (IOP) [IOP] 0 1 2
Zambia (ZAM) [ZAM] 0 1 1
Zimbabwe (ZIM) [ZIM] 3 4 1
Mixed team (ZZX) [ZZX] 8 5 4
Totals 5768 5733 6078

Combined total
Independent Olympic Participants (IOP) [IOP] 3
Zambia (ZAM) [ZAM] 2
Zimbabwe (ZIM) [ZIM] 8
Mixed team (ZZX) [ZZX] 17
Totals 17579
print((data["Combined total"].sum() - data.loc["Totals"]["Combined total"]) * 2)
print(data["Combined total"].sum())
35158
35158

It also looks like the last row is a sum of the columns, which we will need to get
rid of (as the two matching values above show, leaving it in doubles any sum taken
over a column).
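
To make the doubling concrete, here is a small sketch using the combined totals of the first three NOCs from the output above. The "Totals" row here sums only those three rows (the real one covers every NOC), and the variable names are illustrative.

```python
import pandas

# Combined totals for three NOCs from the output above, plus a "Totals"
# row that holds their sum (the real file's Totals row covers every NOC)
combined = pandas.Series(
    [2, 15, 70, 87],
    index=["Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)", "Totals"],
    name="Combined total",
)

# Summing with the Totals row counts every medal twice
with_totals = combined.sum()
without_totals = combined.drop("Totals").sum()

print(with_totals, without_totals)
```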

Fixing the Column Names


First the unicode is removed from the column names and the columns are given more human-readable
names.

new_columns = ("summer_participations",
               "summer_gold",
               "summer_silver",
               "summer_bronze",
               "summer_total",
               "winter_participations",
               "winter_gold",
               "winter_silver",
               "winter_bronze",
               "winter_total",
               "total_participations",
               "total_gold",
               "total_silver",
               "total_bronze",
               "total_combined")
assert len(new_columns) == len(data.columns)

column_remap = dict(zip(data.columns, new_columns))

if data.columns[0] != "summer_participations":
    for column in data.columns:
        data.rename(columns={column: column_remap[column]},
                    inplace=True)
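
rename also accepts a whole mapping at once, so the loop is optional. This is a sketch of that alternative on a two-column frame built from the Afghanistan row; the names sample, new_names, and remap are illustrative, and this is not the notebook's own code.

```python
import pandas

sample = pandas.DataFrame({"№ Summer": [13], "01 !": [0]},
                          index=["Afghanistan (AFG)"])
new_names = ("summer_participations", "summer_gold")

# Pair each old name with its replacement, then rename in a single call
remap = dict(zip(sample.columns, new_names))
renamed = sample.rename(columns=remap)

print(list(renamed.columns))
```

Without inplace=True the original frame keeps its old names, which makes the step safe to re-run in a notebook cell.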

data.head(1)
summer_participations summer_gold summer_silver \
Afghanistan (AFG) 13 0 0

summer_bronze summer_total winter_participations \


Afghanistan (AFG) 2 2 0

winter_gold winter_silver winter_bronze winter_total \


Afghanistan (AFG) 0 0 0 0

total_participations total_gold total_silver \


Afghanistan (AFG) 13 0 0

total_bronze total_combined
Afghanistan (AFG) 2 2

Changing the Indices


Since the index has both a country name and a country code, we'll create a new column with just the
IDs. Each index label is split at the space and left parenthesis (" ("), creating a list whose first item is
the name of the country and whose second item contains the country codes. To avoid the trailing
parentheses and any other extra characters, the country code is sliced down to its first three
characters. Finally, the last row ("Totals") gets dropped.

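# split the index on the whitespace and '(' that precede the country code
# (a raw string keeps the regex escapes from being read as string escapes)
names_ids = data.index.str.split(r'\s\(')
data.index = names_ids.str[0]
data['ID'] = names_ids.str[1].str[:3]

data = data.drop('Totals')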
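
The split-and-slice step can be tried in isolation. Below is a minimal sketch on three labels taken from the file; the names labels, parts, names, and codes are illustrative.

```python
import pandas

labels = pandas.Index(["Afghanistan (AFG)",
                       "Australia (AUS) [AUS] [Z]",
                       "Totals"])

# Split each label on the whitespace + "(" before the country code
parts = labels.str.split(r"\s\(")

names = parts.str[0]          # text before the parenthesis
codes = parts.str[1].str[:3]  # first three characters after it

print(list(names))
print(list(codes))
```

The "Totals" label has no parenthesis, so its code comes back as a missing value, which is harmless here since that row is dropped anyway.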

data.head(1)
summer_participations summer_gold summer_silver summer_bronze \
Afghanistan 13 0 0 2

summer_total winter_participations winter_gold winter_silver \


Afghanistan 2 0 0 0

winter_bronze winter_total total_participations total_gold \


Afghanistan 0 0 13 0

total_silver total_bronze total_combined ID


Afghanistan 0 2 2 AFG

Question 0 (Example)

What is the first country in the data-frame?


The autograder will call this function and compare the return value against the correct solution value.

def answer_zero():
    """This function returns the row for Afghanistan

    Returns
    -------

    Series: first row in the data
    """
    return data.iloc[0]

answer_zero().name
Afghanistan

Questions

Question 1
Which country has won the most gold medals in summer games?

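def answer_one():
    """
    Returns
    -------

    Str: name of the country with the most summer gold medals
    """
    # idxmax returns the index label (the country name); in current
    # pandas, argmax returns the integer position instead
    return data.summer_gold.idxmax()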

answer_one()
United States

Since we put the NOCs in the index (and most are country names), finding the
index for the maximum value returns the country name that we want.
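
A small sketch of the difference between the label and the position of the maximum, using the summer gold counts of five NOCs from the head of the file. The Series name summer_gold and the other variable names are illustrative.

```python
import pandas

# Summer gold counts for a few NOCs from the head of the file
summer_gold = pandas.Series(
    [0, 5, 18, 139, 18],
    index=["Afghanistan", "Algeria", "Argentina", "Australia", "Austria"],
)

label = summer_gold.idxmax()                # index label of the maximum
position = summer_gold.to_numpy().argmax()  # integer position of the maximum

print(label, position)
```

In this five-row sample the label is Australia; on the full data it is the United States, as shown above.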
Question 2
Which country had the biggest difference between their summer and winter gold medal counts?

def answer_two():
    """
    Returns
    -------

    Str: country with most difference between summer and winter gold
    """
    return (data.summer_gold - data.winter_gold).abs().idxmax()

answer_two()
United States

Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold
medal counts relative to their total gold medal count?
$$\frac{\text{Summer Gold} - \text{Winter Gold}}{\text{Total Gold}}$$
Only include countries that have won at least 1 gold in both summer and winter.

def answer_three():
    """
    find country with the biggest difference between the
    summer and winter gold counts relative to total gold medal count
    if they have at least one gold medal in both summer and winter

    Returns
    -------
    str: country with biggest summer/winter gold difference
    """
    eligible = data[(data.summer_gold >= 1) & (data.winter_gold >= 1)]
    ratios = (eligible.summer_gold -
              eligible.winter_gold).abs() / eligible.total_gold
    return ratios.idxmax()

answer_three()
Bulgaria
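
The filter-then-ratio steps can be checked by hand on a few rows. This sketch uses the gold counts of three NOCs from the file (Argentina has no winter gold, so it should be filtered out); the frame name sample is illustrative.

```python
import pandas

# (summer gold, winter gold, total gold) for three NOCs from the file
sample = pandas.DataFrame(
    {"summer_gold": [18, 139, 18],
     "winter_gold": [0, 5, 59],
     "total_gold": [18, 144, 77]},
    index=["Argentina", "Australia", "Austria"],
)

# Keep only NOCs with at least one gold in both seasons
eligible = sample[(sample.summer_gold >= 1) & (sample.winter_gold >= 1)]

# Absolute summer/winter gap as a fraction of all golds won
ratios = (eligible.summer_gold - eligible.winter_gold).abs() / eligible.total_gold

print(ratios.round(3).to_dict())
print(ratios.idxmax())
```

For Australia the ratio is 134/144 and for Austria 41/77, so Australia leads within this three-row sample.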

Question 4
Write a function that creates a Series called "Points" which is a weighted value where each gold medal
(Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2)
for 1 point. The function should return only the column (a Series object) which you created.

def answer_four():
    """
    Creates weighted points based on medals
    * Gold: 3 points
    * Silver: 2 points
    * Bronze: 1 point

    Returns
    -------

    Series: column of points for each NOC
    """
    points = numpy.zeros(len(data))
    points += data.total_gold * 3
    points += data.total_silver * 2
    points += data.total_bronze
    return pandas.Series(points, index=data.index)

points = answer_four()
assert points.loc["United States"] == 5684
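
The weighted sum can also be written as a dot product between the medal columns and a Series of weights. This is an alternative sketch, not the notebook's method, shown on the combined medal counts of three NOCs from the file; the names sample and weights are illustrative.

```python
import pandas

sample = pandas.DataFrame(
    {"total_gold": [0, 5, 18],
     "total_silver": [0, 2, 24],
     "total_bronze": [2, 8, 28]},
    index=["Afghanistan", "Algeria", "Argentina"],
)

# One weight per medal column; dot() aligns the weights to the column names
weights = pandas.Series({"total_gold": 3, "total_silver": 2, "total_bronze": 1})
points = sample.dot(weights)

print(points.to_dict())
```

Algeria, for example, scores 5 * 3 + 2 * 2 + 8 = 27 points.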

Part 2
For the next set of questions, we will be using census data from the United States Census Bureau.
Counties are political and geographic subdivisions of states in the United States. This dataset contains
population data for counties and states in the US from 2010 to 2015. See this document for a
description of the variable names.

Load the Census Data

census_data = pandas.read_csv('census.csv')

census_data.head()
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME \
0 40 3 6 1 0 Alabama Alabama
1 50 3 6 1 1 Alabama Autauga County
2 50 3 6 1 3 Alabama Baldwin County
3 50 3 6 1 5 Alabama Barbour County
4 50 3 6 1 7 Alabama Bibb County

CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... \
0 4779736 4780127 4785161 ...
1 54571 54571 54660 ...
2 182265 182265 183193 ...
3 27457 27457 27341 ...
4 22915 22919 22861 ...

RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 \
0 0.002295 -0.193196 0.381066 0.582002
1 7.242091 -2.915927 -3.012349 2.265971
2 14.832960 17.647293 21.845705 19.243287
3 -4.728132 -2.500690 -7.056824 -3.904217
4 -5.527043 -5.068871 -6.201001 -0.177537

RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 \
0 -0.467369 1.030015 0.826644 1.383282 1.724718
1 -2.530799 7.606016 -2.626146 -2.722002 2.592270
2 17.197872 15.844176 18.559627 22.727626 20.317142
3 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583
4 0.177258 -5.088389 -4.363636 -5.403729 0.754533

RNETMIG2015
0 0.712594
1 -2.187333
2 18.293499
3 -10.543299
4 1.107861

[5 rows x 100 columns]


Census Variable Names

class CensusVariables:
    state_name = "STNAME"
    county_name = "CTYNAME"
    census_population = "CENSUS2010POP"
    region = "REGION"
    population_2014 = "POPESTIMATE2014"
    population_2015 = "POPESTIMATE2015"
    population_estimates = ["POPESTIMATE2010",
                            "POPESTIMATE2011",
                            "POPESTIMATE2012",
                            "POPESTIMATE2013",
                            population_2014,
                            population_2015]
    county_level = 50
    summary_level = "SUMLEV"

Counties Only

counties = census_data[
    census_data[CensusVariables.summary_level] == CensusVariables.county_level]
# this throws off the numeric index for the argmax method so reset it
counties = counties.reset_index()

# but the last question wants the original index
counties_original_index = census_data[
    census_data[CensusVariables.summary_level] == CensusVariables.county_level]
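
A short sketch of what the filter and reset_index do to the index, using the first three census rows shown above, reduced to three columns. The names sample, original_index, and renumbered are illustrative.

```python
import pandas

# The first three rows of the census data, reduced to three columns
sample = pandas.DataFrame(
    {"SUMLEV": [40, 50, 50],
     "STNAME": ["Alabama", "Alabama", "Alabama"],
     "CTYNAME": ["Alabama", "Autauga County", "Baldwin County"]},
)

# Boolean mask: keep county-level rows (SUMLEV == 50), drop the state row
original_index = sample[sample["SUMLEV"] == 50]

# reset_index renumbers from 0 and keeps the old labels in an "index" column
renumbered = original_index.reset_index()

print(list(original_index.index))
print(list(renumbered.index), list(renumbered["index"]))
```

The filtered frame keeps labels 1 and 2 from the full table, while the reset copy counts from 0 again, which is why both versions are kept around.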
Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for
future questions too…)

def answer_five():
    """finds state with the most counties

    Returns
    -------

    str name of state with the most counties
    """
    return counties.groupby(
        CensusVariables.state_name).count().COUNTY.idxmax()

answer_five()
Texas

Question 6
Only looking at the three most populous counties for each state, what are the three most populous states
(in order of highest population to lowest population)? Use CENSUS2010POP.

def answer_six():
    """finds three most populous states based on top three counties in each

    Returns
    -------

    List: top three state-names (highest to lowest)
    """
    top_threes = counties.groupby(
        CensusVariables.state_name
    )[CensusVariables.census_population].nlargest(3)
    states = top_threes.groupby(level=0).sum()
    return list(states.nlargest(3).index)

answer_six()
California Texas Illinois
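
The two-stage grouping is easier to follow on a toy table. In this sketch the states "A", "B", "C" and all populations are invented for illustration; only the column names come from the census file.

```python
import pandas

# Hypothetical states and county populations, invented for illustration
toy = pandas.DataFrame(
    {"STNAME": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"],
     "CENSUS2010POP": [10, 20, 30, 40, 100, 1, 5, 5, 5, 5]},
)

# Top three county populations within each state (fewer if the state is small)
top_threes = toy.groupby("STNAME")["CENSUS2010POP"].nlargest(3)

# The result is indexed by (state, original row); level 0 is the state
state_totals = top_threes.groupby(level=0).sum()

print(state_totals.to_dict())
print(list(state_totals.nlargest(2).index))
```

State "A" has four counties but only its largest three (40, 30, 20) count toward its total.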

Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint:
population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you
need to consider all six columns.)

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then
its largest change in the period would be |130-80| = 50.
This function should return a single string value.

def answer_seven():
    """Find county with largest absolute population change

    Returns
    -------

    str: name of the county
    """
    # largest minus smallest yearly estimate for each county
    estimates = counties[CensusVariables.population_estimates]
    change = estimates.max(axis=1) - estimates.min(axis=1)
    return counties.iloc[change.argmax()][CensusVariables.county_name]

answer_seven()
Harris County
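
The worked example from the question can be reproduced directly. In this sketch row 0 holds the question's numbers, while row 1 and both county names are made up for illustration.

```python
import pandas

years = ["POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE2012",
         "POPESTIMATE2013", "POPESTIMATE2014", "POPESTIMATE2015"]

# Row 0 is the example from the question; row 1 is a made-up county
toy = pandas.DataFrame(
    [[100, 120, 80, 105, 100, 130],
     [10, 12, 11, 13, 14, 15]],
    columns=years,
)
toy["CTYNAME"] = ["Example County", "Other County"]

# Largest minus smallest estimate in each row (axis=1 works across columns)
spread = toy[years].max(axis=1) - toy[years].min(axis=1)

print(list(spread))
print(toy.loc[spread.idxmax(), "CTYNAME"])
```

Row 0 gives 130 - 80 = 50, matching the value stated in the question.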

Question 8
In this datafile, the United States is broken up into four regions using the "REGION" column.

Create a query that finds the counties that belong to regions 1 or 2, whose name
starts with 'Washington', and whose POPESTIMATE2015 was greater than
their POPESTIMATE2014.
This function should return a 5x2 DataFrame with the columns =
['STNAME', 'CTYNAME'] and the same index ID as the census_data (sorted
ascending by index).

def answer_eight():
    """find region 1 or 2 counties:

    * with names that start with Washington
    * whose population grew from 2014 to 2015

    .. note:: the index in the final data-frame has to match the original
       census data

    Returns
    -------

    DataFrame: with the county and state-name columns
    """
    regions = counties_original_index[
        (counties_original_index[CensusVariables.region] == 1) |
        (counties_original_index[CensusVariables.region] == 2)]
    washingtons = regions[
        regions[CensusVariables.county_name].str.startswith("Washington")]
    grew = washingtons[washingtons[CensusVariables.population_2015] >
                       washingtons[CensusVariables.population_2014]]
    return grew[[CensusVariables.state_name,
                 CensusVariables.county_name]]

outcome = answer_eight()
assert outcome.shape == (5,2)
assert list(outcome.columns) == ['STNAME', 'CTYNAME']
print(tabulate(outcome, headers=["index"] + list(outcome.columns), tablefmt="orgtbl"))
index STNAME CTYNAME
896 Iowa Washington County
1419 Minnesota Washington County
2345 Pennsylvania Washington County
2355 Rhode Island Washington County
3163 Wisconsin Washington County
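
The three conditions can also be combined into a single boolean mask. This is an alternative sketch, not the notebook's code; every row below is hypothetical, and only the column names match the census file.

```python
import pandas

# Hypothetical rows, invented for illustration; the labels 3, 7, 9, 12
# stand in for positions in the full census table
toy = pandas.DataFrame(
    {"REGION": [1, 2, 3, 2],
     "STNAME": ["State A", "State B", "State C", "State D"],
     "CTYNAME": ["Washington County", "Washington County",
                 "Washington County", "Lincoln County"],
     "POPESTIMATE2014": [100, 200, 300, 400],
     "POPESTIMATE2015": [110, 190, 310, 410]},
    index=[3, 7, 9, 12],
)

# Region 1 or 2, name starts with "Washington", population grew
mask = (
    toy["REGION"].isin([1, 2])
    & toy["CTYNAME"].str.startswith("Washington")
    & (toy["POPESTIMATE2015"] > toy["POPESTIMATE2014"])
)

# Boolean indexing keeps the original index labels of the matching rows
result = toy.loc[mask, ["STNAME", "CTYNAME"]]
print(result)
```

Only the row labelled 3 passes all three tests: row 7 shrank, row 9 is in region 3, and row 12 is not a Washington county.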
