Assignment 2 - Pandas Introduction (Olympic Medals)
This is the notebook for assignment 2 of the Coursera Python Data Analysis course. This was converted from a jupyter notebook that you can download it as part of the course downloads zip file. The submissions work by uploading a ipynb file so there's a bit of cutting and pasting needed to get the code from here to there. Also I didn't use their variable names so the final outcome needs some aliasing for the grader to pass it.
Part 1
The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.
The columns are organized as
- # of Summer games,
- Summer medals
- # of Winter games
- Winter medals
- total # number of games
- total # of medals
Imports
# third party
import numpy
import pandas
from tabulate import tabulate
Load the Data
A quick look at the data-source
The data is stored in a comma-separated file named olympics.csv.
head olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2 Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15 Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70 Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12 Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12 Australia (AUS) [AUS] [Z],25,139,152,177,468,18,5,3,4,12,43,144,155,181,480 Austria (AUT),26,18,33,35,86,22,59,78,81,218,48,77,111,116,304 Azerbaijan (AZE),5,6,5,15,26,5,0,0,0,0,10,6,5,15,26
Looking at the output of our source data-file (olympics.csv) you can see that the first row is a column-index that we'll want to get rid of. The first column is also missing a header and there are columns with unicode and mysterious numeric names with exclamation points at the end.
If you go to the wikipedia page and look at the NOC's with medals table (NOC stands for National Olympic Committee - each country that participates in the Olympics has one, unless they are part of a combined committee) you'll see that the numbers with leading zeros correspond to the medal types, the decimal numbers are the game (Summer (no trailing decimal), Winter (.1), or combined (.2)). So 01 !.2
means the total number of gold medals for both the summer and winter games combined. The No
columns is the count of the number of times a country participated in a game. So, looking at the output above, Afghanistan has participated in 13 Summer games (this data was apparently pulled before the 2016 games in Rio de Janeiro, as their count is currently 14 on the source page). As for the !
notation, the wikipedia page has arrows that you can click on that will let you sort the rows by a particular column and I think this is how their screen-scraper translated them into ascii.
Load the data into pandas
data = pandas.read_csv('olympics.csv', index_col=0, skiprows=1)
description = data.describe()
print(tabulate(description, headers="keys", tablefmt="orgtbl"))
№ Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 | 147 |
mean | 13.4762 | 65.4286 | 64.966 | 69.7959 | 200.19 | 6.70068 | 13.0476 | 13.034 | 12.898 | 38.9796 | 20.1769 | 78.4762 | 78 | 82.6939 | 239.17 |
std | 7.07236 | 405.55 | 399.31 | 427.187 | 1231.31 | 7.43319 | 80.7992 | 80.6344 | 79.5884 | 240.917 | 13.257 | 485.013 | 478.86 | 505.855 | 1469.07 |
min | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
25% | 8 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 1 | 2.5 |
50% | 13 | 3 | 4 | 6 | 12 | 5 | 0 | 0 | 0 | 0 | 15 | 3 | 4 | 7 | 12 |
75% | 18.5 | 24 | 28 | 29 | 86 | 10 | 1 | 2 | 1 | 5 | 27 | 25.5 | 29 | 32.5 | 89 |
max | 27 | 4809 | 4775 | 5130 | 14714 | 22 | 959 | 958 | 948 | 2865 | 49 | 5768 | 5733 | 6078 | 17579 |
data.head()
№ Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 \ Afghanistan (AFG) 13 0 0 2 2 0 0 Algeria (ALG) 12 5 2 8 15 3 0 Argentina (ARG) 23 18 24 28 70 18 0 Armenia (ARM) 5 1 2 9 12 6 0 Australasia (ANZ) [ANZ] 2 3 4 5 12 0 0 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 \ Afghanistan (AFG) 0 0 0 13 0 0 Algeria (ALG) 0 0 0 15 5 2 Argentina (ARG) 0 0 0 41 18 24 Armenia (ARM) 0 0 0 11 1 2 Australasia (ANZ) [ANZ] 0 0 0 2 3 4 03 !.2 Combined total Afghanistan (AFG) 2 2 Algeria (ALG) 8 15 Argentina (ARG) 28 70 Armenia (ARM) 9 12 Australasia (ANZ) [ANZ] 5 12
It looks like the countries got read as the index for the data-frame since the first column was empty.
data.tail()
№ Summer 01 ! 02 ! 03 ! \ Independent Olympic Participants (IOP) [IOP] 1 0 1 2 Zambia (ZAM) [ZAM] 12 0 1 1 Zimbabwe (ZIM) [ZIM] 12 3 4 1 Mixed team (ZZX) [ZZX] 3 8 5 4 Totals 27 4809 4775 5130 Total № Winter 01 !.1 02 !.1 \ Independent Olympic Participants (IOP) [IOP] 3 0 0 0 Zambia (ZAM) [ZAM] 2 0 0 0 Zimbabwe (ZIM) [ZIM] 8 1 0 0 Mixed team (ZZX) [ZZX] 17 0 0 0 Totals 14714 22 959 958 03 !.1 Total.1 № Games \ Independent Olympic Participants (IOP) [IOP] 0 0 1 Zambia (ZAM) [ZAM] 0 0 12 Zimbabwe (ZIM) [ZIM] 0 0 13 Mixed team (ZZX) [ZZX] 0 0 3 Totals 948 2865 49 01 !.2 02 !.2 03 !.2 \ Independent Olympic Participants (IOP) [IOP] 0 1 2 Zambia (ZAM) [ZAM] 0 1 1 Zimbabwe (ZIM) [ZIM] 3 4 1 Mixed team (ZZX) [ZZX] 8 5 4 Totals 5768 5733 6078 Combined total Independent Olympic Participants (IOP) [IOP] 3 Zambia (ZAM) [ZAM] 2 Zimbabwe (ZIM) [ZIM] 8 Mixed team (ZZX) [ZZX] 17 Totals 17579
print((data["Combined total"].sum() - data.loc["Totals"]["Combined total"]) * 2)
print(data["Combined total"].sum())
35158 35158
It also looks like the last row is a sum of the columns, which we will need to get rid of (as you can see from the sum it doubles any math we do on a column).
Fixing the Column Names
First the unicode is removed from the column names and the columns are given more human-readable names.
new_columns = ("summer_participations",
"summer_gold",
"summer_silver",
"summer_bronze",
"summer_total",
"winter_participations",
"winter_gold",
"winter_silver",
"winter_bronze",
"winter_total",
"total_participations",
"total_gold",
"total_silver",
"total_bronze",
"total_combined")
assert len(new_columns) == len(data.columns)
column_remap = dict(zip(data.columns, new_columns))
if data.columns[0] != "summer_participations":
for column in data.columns:
data.rename(columns={column:column_remap[column]}, inplace=True)
data.head(1)
summer_participations summer_gold summer_silver \ Afghanistan (AFG) 13 0 0 summer_bronze summer_total winter_participations \ Afghanistan (AFG) 2 2 0 winter_gold winter_silver winter_bronze winter_total \ Afghanistan (AFG) 0 0 0 0 total_participations total_gold total_silver \ Afghanistan (AFG) 13 0 0 total_bronze total_combined Afghanistan (AFG) 2 2
Changing the Indices
Since the index has both a country name and a country code, we'll create a new column with just the ID's. The indices are split on the first left parenthesis ("(") creating a tuple with the first item being the name of the country and the second containing the country codes. To avoid the trailing parentheses and any other extra characters, the country-code gets sliced. Finally the last row ("Totals") gets dropped
# split the index by '('
names_ids = data.index.str.split('\s\(')
data.index = names_ids.str[0]
data['ID'] = names_ids.str[1].str[:3]
data = data.drop('Totals')
data.head(1)
summer_participations summer_gold summer_silver summer_bronze \ Afghanistan 13 0 0 2 summer_total winter_participations winter_gold winter_silver \ Afghanistan 2 0 0 0 winter_bronze winter_total total_participations total_gold \ Afghanistan 0 0 13 0 total_silver total_bronze total_combined ID Afghanistan 0 2 2 AFG
Question 0 (Example)
What is the first country in in the data-frame?
The autograder will call this function and compare the return value against the correct solution value.
def answer_zero():
"""This function returns the row for Afghanistan
Returns
-------
Series: first row in the data
"""
return data.iloc[0]
answer_zero().name
Afghanistan
Questions
Question 1
Which country has won the most gold medals in summer games?
def answer_one():
"""
Returns
-------
Str: name of the country with the most summer gold medals
"""
return data.summer_gold.argmax()
answer_one()
United States
Since we put the NOCs in the index (and most are country names), finding the index for the maximum value returns the country name that we want.
Question 2
Which country had the biggest difference between their summer and winter gold medal counts?
def answer_two():
"""
Returns
-------
Str: country with most difference between summer and winter gold
"""
return (data.summer_gold - data.winter_gold).abs().argmax()
answer_two()
United States
Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?
\(\frac{Summer~Gold - Winter~Gold}{Total~Gold}\)
Only include countries that have won at least 1 gold in both summer and winter.
def answer_three():
"""
find country with the biggest difference between the
summer and winter gold counts relative to total gold metal count
if they have at least one gold medal in both summer and winter
Returns
-------
str: country with biggest summer/winter gold differe
"""
elegible = data[(data.summer_gold>=1) & (data.winter_gold>=1)]
ratios = (elegible.summer_gold - elegible.winter_gold).abs()/elegible.total_gold
return ratios.argmax()
answer_three()
Bulgaria
Question 4
Write a function that creates a Series called "Points" which is a weighted value where each gold medal (`Gold.2`) counts for 3 points, silver medals (`Silver.2`) for 2 points, and bronze medals (`Bronze.2`) for 1 point. The function should return only the column (a Series object) which you created.
def answer_four():
"""
Creates weighted points based on medals
* Gold: 3 points
* Silver: 2 points
* Bronze: 1 point
Returns
-------
Series: column of points for each NOC
"""
points = numpy.zeros(len(data))
points += data.total_gold * 3
points += data.total_silver * 2
points += data.total_bronze
return pandas.Series(points, index=data.index)
points = answer_four()
assert points.loc["United States"] == 5684
Part 2
For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.
Load the Census Data
census_data = pandas.read_csv('census.csv')
census_data.head()
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME \ 0 40 3 6 1 0 Alabama Alabama 1 50 3 6 1 1 Alabama Autauga County 2 50 3 6 1 3 Alabama Baldwin County 3 50 3 6 1 5 Alabama Barbour County 4 50 3 6 1 7 Alabama Bibb County CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... \ 0 4779736 4780127 4785161 ... 1 54571 54571 54660 ... 2 182265 182265 183193 ... 3 27457 27457 27341 ... 4 22915 22919 22861 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 \ 0 0.002295 -0.193196 0.381066 0.582002 1 7.242091 -2.915927 -3.012349 2.265971 2 14.832960 17.647293 21.845705 19.243287 3 -4.728132 -2.500690 -7.056824 -3.904217 4 -5.527043 -5.068871 -6.201001 -0.177537 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 \ 0 -0.467369 1.030015 0.826644 1.383282 1.724718 1 -2.530799 7.606016 -2.626146 -2.722002 2.592270 2 17.197872 15.844176 18.559627 22.727626 20.317142 3 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 4 0.177258 -5.088389 -4.363636 -5.403729 0.754533 RNETMIG2015 0 0.712594 1 -2.187333 2 18.293499 3 -10.543299 4 1.107861 [5 rows x 100 columns]
Census Variable Names
class CensusVariables:
state_name = "STNAME"
county_name = "CTYNAME"
census_population = "CENSUS2010POP"
region = "REGION"
population_2014 = "POPESTIMATE2014"
population_2015 = "POPESTIMATE2015"
population_estimates = ["POPESTIMATE2010",
"POPESTIMATE2011",
"POPESTIMATE2012",
"POPESTIMATE2013",
population_2014,
population_2015]
county_level = 50
summary_level = "SUMLEV"
Counties Only
counties = census_data[census_data[
CensusVariables.summary_level]==CensusVariables.county_level]
# this throws off the numeric index for the argmax method so reset it
counties = counties.reset_index()
# but the last question wants the original index
counties_original_index = census_data[census_data[
CensusVariables.summary_level]==CensusVariables.county_level]
Question 5
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too…)
def answer_five():
"""finds state with the most counties
Returns
-------
str name of state with the most counties
"""
return counties.groupby(
CensusVariables.state_name).count().COUNTY.argmax()
answer_five()
Texas
Question 6
Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP
.
def answer_six():
"""finds three most populous states based on top three counties in each
Returns
-------
List: top three state-names (highest to lowest)
"""
top_threes = counties.groupby(
CensusVariables.state_name
)[CensusVariables.census_population].nlargest(3)
states = top_threes.groupby(level=0).sum()
return list(states.nlargest(3).index)
answer_six()
California | Texas | Illinois |
Question 7
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010
through POPESTIMATE2015
, you need to consider all six columns.)
e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.
This function should return a single string value.
def answer_seven():
"""Find county with largest absolute population variance
Returns
-------
str: name of the county
"""
return counties.iloc[
(counties[
CensusVariables.population_estimates].max(axis=1) -
counties[
CensusVariables.population_estimates].min(axis=1)
).argmax()][CensusVariables.county_name]
answer_seven()
Harris County
Question 8
In this datafile, the United States is broken up into four regions using the "REGION" column.
Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015
was greater than their POPESTIMATE2014
.
This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_data
(sorted ascending by index).
def answer_eight():
"""find region 1 or 2 counties:
* with names that start with Washington
* whose population grew from 2014 to 2015
.. note:: the index in the final data-frame has to match the original
census data
Returns
-------
DataFrame: with the county and state-name columns
"""
regions = counties_original_index[
(counties_original_index[CensusVariables.region]==1) |
(counties_original_index[CensusVariables.region]==2)]
washingtons = regions[
regions[CensusVariables.county_name].str.startswith("Washington")]
grew = washingtons[washingtons[CensusVariables.population_2015] >
washingtons[CensusVariables.population_2014]]
return grew[[CensusVariables.state_name,
CensusVariables.county_name]]
outcome = answer_eight()
assert outcome.shape == (5,2)
assert list(outcome.columns) == ['STNAME', 'CTYNAME']
print(tabulate(outcome, headers=["index"] + list(outcome.columns),
tablefmt="orgtbl"))
index | STNAME | CTYNAME |
---|---|---|
896 | Iowa | Washington County |
1419 | Minnesota | Washington County |
2345 | Pennsylvania | Washington County |
2355 | Rhode Island | Washington County |
3163 | Wisconsin | Washington County |