Assignment 2 - Pandas Introduction (Olympic Medals)

This is the notebook for assignment 2 of the Coursera Python Data Analysis course. This was converted from a jupyter notebook that you can download it as part of the course downloads zip file. The submissions work by uploading a ipynb file so there's a bit of cutting and pasting needed to get the code from here to there. Also I didn't use their variable names so the final outcome needs some aliasing for the grader to pass it.

Part 1

The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as

  • # of Summer games,
  • Summer medals
  • # of Winter games
  • Winter medals
  • total # number of games
  • total # of medals

Imports

# third party
import numpy
import pandas
from tabulate import tabulate

Load the Data

A quick look at the data-source

The data is stored in a comma-separated file named olympics.csv.

head olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
Australia (AUS) [AUS] [Z],25,139,152,177,468,18,5,3,4,12,43,144,155,181,480
Austria (AUT),26,18,33,35,86,22,59,78,81,218,48,77,111,116,304
Azerbaijan (AZE),5,6,5,15,26,5,0,0,0,0,10,6,5,15,26

Looking at the output of our source data-file (olympics.csv) you can see that the first row is a column-index that we'll want to get rid of. The first column is also missing a header and there are columns with unicode and mysterious numeric names with exclamation points at the end.

If you go to the wikipedia page and look at the NOC's with medals table (NOC stands for National Olympic Committee - each country that participates in the Olympics has one, unless they are part of a combined committee) you'll see that the numbers with leading zeros correspond to the medal types, the decimal numbers are the game (Summer (no trailing decimal), Winter (.1), or combined (.2)). So 01 !.2 means the total number of gold medals for both the summer and winter games combined. The No columns is the count of the number of times a country participated in a game. So, looking at the output above, Afghanistan has participated in 13 Summer games (this data was apparently pulled before the 2016 games in Rio de Janeiro, as their count is currently 14 on the source page). As for the ! notation, the wikipedia page has arrows that you can click on that will let you sort the rows by a particular column and I think this is how their screen-scraper translated them into ascii.

Load the data into pandas

data = pandas.read_csv('olympics.csv', index_col=0, skiprows=1)
description = data.describe()
print(tabulate(description, headers="keys", tablefmt="orgtbl"))
  № Summer 01 ! 02 ! 03 ! Total № Winter 01 !.1 02 !.1 03 !.1 Total.1 № Games 01 !.2 02 !.2 03 !.2 Combined total
count 147 147 147 147 147 147 147 147 147 147 147 147 147 147 147
mean 13.4762 65.4286 64.966 69.7959 200.19 6.70068 13.0476 13.034 12.898 38.9796 20.1769 78.4762 78 82.6939 239.17
std 7.07236 405.55 399.31 427.187 1231.31 7.43319 80.7992 80.6344 79.5884 240.917 13.257 485.013 478.86 505.855 1469.07
min 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1
25% 8 0 1 1 2 0 0 0 0 0 11 0 1 1 2.5
50% 13 3 4 6 12 5 0 0 0 0 15 3 4 7 12
75% 18.5 24 28 29 86 10 1 2 1 5 27 25.5 29 32.5 89
max 27 4809 4775 5130 14714 22 959 958 948 2865 49 5768 5733 6078 17579
data.head()
                         № Summer  01 !  02 !  03 !  Total  № Winter  01 !.1  \
Afghanistan (AFG)              13     0     0     2      2         0       0   
Algeria (ALG)                  12     5     2     8     15         3       0   
Argentina (ARG)                23    18    24    28     70        18       0   
Armenia (ARM)                   5     1     2     9     12         6       0   
Australasia (ANZ) [ANZ]         2     3     4     5     12         0       0   

                         02 !.1  03 !.1  Total.1  № Games  01 !.2  02 !.2  \
Afghanistan (AFG)             0       0        0       13       0       0   
Algeria (ALG)                 0       0        0       15       5       2   
Argentina (ARG)               0       0        0       41      18      24   
Armenia (ARM)                 0       0        0       11       1       2   
Australasia (ANZ) [ANZ]       0       0        0        2       3       4   

                         03 !.2  Combined total  
Afghanistan (AFG)             2               2  
Algeria (ALG)                 8              15  
Argentina (ARG)              28              70  
Armenia (ARM)                 9              12  
Australasia (ANZ) [ANZ]       5              12

It looks like the countries got read as the index for the data-frame since the first column was empty.

data.tail()
                                              № Summer  01 !  02 !  03 !  \
Independent Olympic Participants (IOP) [IOP]         1     0     1     2   
Zambia (ZAM) [ZAM]                                  12     0     1     1   
Zimbabwe (ZIM) [ZIM]                                12     3     4     1   
Mixed team (ZZX) [ZZX]                               3     8     5     4   
Totals                                              27  4809  4775  5130   

                                              Total  № Winter  01 !.1  02 !.1  \
Independent Olympic Participants (IOP) [IOP]      3         0       0       0   
Zambia (ZAM) [ZAM]                                2         0       0       0   
Zimbabwe (ZIM) [ZIM]                              8         1       0       0   
Mixed team (ZZX) [ZZX]                           17         0       0       0   
Totals                                        14714        22     959     958   

                                              03 !.1  Total.1  № Games  \
Independent Olympic Participants (IOP) [IOP]       0        0        1   
Zambia (ZAM) [ZAM]                                 0        0       12   
Zimbabwe (ZIM) [ZIM]                               0        0       13   
Mixed team (ZZX) [ZZX]                             0        0        3   
Totals                                           948     2865       49   

                                              01 !.2  02 !.2  03 !.2  \
Independent Olympic Participants (IOP) [IOP]       0       1       2   
Zambia (ZAM) [ZAM]                                 0       1       1   
Zimbabwe (ZIM) [ZIM]                               3       4       1   
Mixed team (ZZX) [ZZX]                             8       5       4   
Totals                                          5768    5733    6078   

                                              Combined total  
Independent Olympic Participants (IOP) [IOP]               3  
Zambia (ZAM) [ZAM]                                         2  
Zimbabwe (ZIM) [ZIM]                                       8  
Mixed team (ZZX) [ZZX]                                    17  
Totals                                                 17579
print((data["Combined total"].sum() - data.loc["Totals"]["Combined total"]) * 2)
print(data["Combined total"].sum())
35158
35158

It also looks like the last row is a sum of the columns, which we will need to get rid of (as you can see from the sum it doubles any math we do on a column).

Fixing the Column Names

First the unicode is removed from the column names and the columns are given more human-readable names.

new_columns = ("summer_participations",
	       "summer_gold",
	       "summer_silver",
	       "summer_bronze",
	       "summer_total",
	       "winter_participations",
	       "winter_gold",
	       "winter_silver",
	       "winter_bronze",
	       "winter_total",
	       "total_participations",
	       "total_gold",
	       "total_silver",
	       "total_bronze",
	       "total_combined")
assert len(new_columns) == len(data.columns)

column_remap = dict(zip(data.columns, new_columns))
if data.columns[0] != "summer_participations":
    for column in data.columns:
	data.rename(columns={column:column_remap[column]}, inplace=True)

data.head(1)
                   summer_participations  summer_gold  summer_silver  \
Afghanistan (AFG)                     13            0              0   

                   summer_bronze  summer_total  winter_participations  \
Afghanistan (AFG)              2             2                      0   

                   winter_gold  winter_silver  winter_bronze  winter_total  \
Afghanistan (AFG)            0              0              0             0   

                   total_participations  total_gold  total_silver  \
Afghanistan (AFG)                    13           0             0   

                   total_bronze  total_combined  
Afghanistan (AFG)             2               2

Changing the Indices

Since the index has both a country name and a country code, we'll create a new column with just the ID's. The indices are split on the first left parenthesis ("(") creating a tuple with the first item being the name of the country and the second containing the country codes. To avoid the trailing parentheses and any other extra characters, the country-code gets sliced. Finally the last row ("Totals") gets dropped

# split the index by '('
names_ids = data.index.str.split('\s\(')

data.index = names_ids.str[0]
data['ID'] = names_ids.str[1].str[:3]

data = data.drop('Totals')

data.head(1)
             summer_participations  summer_gold  summer_silver  summer_bronze  \
Afghanistan                     13            0              0              2   

             summer_total  winter_participations  winter_gold  winter_silver  \
Afghanistan             2                      0            0              0   

             winter_bronze  winter_total  total_participations  total_gold  \
Afghanistan              0             0                    13           0   

             total_silver  total_bronze  total_combined   ID  
Afghanistan             0             2               2  AFG

Question 0 (Example)

What is the first country in in the data-frame?

The autograder will call this function and compare the return value against the correct solution value.

def answer_zero():
    """This function returns the row for Afghanistan

    Returns
    -------

    Series: first row in the data
    """
    return data.iloc[0]
answer_zero().name
Afghanistan

Questions

Question 1

Which country has won the most gold medals in summer games?

def answer_one():
    """
    Returns
    -------

    Str: name of the country with the most summer gold medals
    """
    return data.summer_gold.argmax()

answer_one()
United States

Since we put the NOCs in the index (and most are country names), finding the index for the maximum value returns the country name that we want.

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

def answer_two():
    """
    Returns
    -------

    Str: country with most difference between summer and winter gold
    """
    return (data.summer_gold - data.winter_gold).abs().argmax()

answer_two()
United States

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?

\(\frac{Summer~Gold - Winter~Gold}{Total~Gold}\)

Only include countries that have won at least 1 gold in both summer and winter.

def answer_three():
    """
    find country with the biggest difference between the
    summer and winter gold counts relative to total gold metal count
    if they have at least one gold medal in both summer and winter

    Returns
    -------

    str: country with biggest summer/winter gold differe
    """
    elegible = data[(data.summer_gold>=1) & (data.winter_gold>=1)]
    ratios = (elegible.summer_gold - elegible.winter_gold).abs()/elegible.total_gold
    return ratios.argmax()

answer_three()
Bulgaria

Question 4

Write a function that creates a Series called "Points" which is a weighted value where each gold medal (`Gold.2`) counts for 3 points, silver medals (`Silver.2`) for 2 points, and bronze medals (`Bronze.2`) for 1 point. The function should return only the column (a Series object) which you created.

 def answer_four():
     """
     Creates weighted points based on medals
       * Gold: 3 points
       * Silver: 2 points
       * Bronze: 1 point

     Returns
     -------

     Series: column of points for each NOC
     """
     points = numpy.zeros(len(data))
     points += data.total_gold * 3
     points += data.total_silver * 2
     points += data.total_bronze
     return pandas.Series(points, index=data.index)
points = answer_four()
assert points.loc["United States"] == 5684

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

Load the Census Data

census_data = pandas.read_csv('census.csv')
census_data.head()
   SUMLEV  REGION  DIVISION  STATE  COUNTY   STNAME         CTYNAME  \
0      40       3         6      1       0  Alabama         Alabama   
1      50       3         6      1       1  Alabama  Autauga County   
2      50       3         6      1       3  Alabama  Baldwin County   
3      50       3         6      1       5  Alabama  Barbour County   
4      50       3         6      1       7  Alabama     Bibb County   

   CENSUS2010POP  ESTIMATESBASE2010  POPESTIMATE2010     ...       \
0        4779736            4780127          4785161     ...        
1          54571              54571            54660     ...        
2         182265             182265           183193     ...        
3          27457              27457            27341     ...        
4          22915              22919            22861     ...        

   RDOMESTICMIG2011  RDOMESTICMIG2012  RDOMESTICMIG2013  RDOMESTICMIG2014  \
0          0.002295         -0.193196          0.381066          0.582002   
1          7.242091         -2.915927         -3.012349          2.265971   
2         14.832960         17.647293         21.845705         19.243287   
3         -4.728132         -2.500690         -7.056824         -3.904217   
4         -5.527043         -5.068871         -6.201001         -0.177537   

   RDOMESTICMIG2015  RNETMIG2011  RNETMIG2012  RNETMIG2013  RNETMIG2014  \
0         -0.467369     1.030015     0.826644     1.383282     1.724718   
1         -2.530799     7.606016    -2.626146    -2.722002     2.592270   
2         17.197872    15.844176    18.559627    22.727626    20.317142   
3        -10.543299    -4.874741    -2.758113    -7.167664    -3.978583   
4          0.177258    -5.088389    -4.363636    -5.403729     0.754533   

   RNETMIG2015  
0     0.712594  
1    -2.187333  
2    18.293499  
3   -10.543299  
4     1.107861  

[5 rows x 100 columns]

Census Variable Names

class CensusVariables:
    state_name = "STNAME"
    county_name = "CTYNAME"
    census_population = "CENSUS2010POP"
    region = "REGION"
    population_2014 = "POPESTIMATE2014"
    population_2015 = "POPESTIMATE2015"
    population_estimates = ["POPESTIMATE2010",
			    "POPESTIMATE2011",
			    "POPESTIMATE2012",
			    "POPESTIMATE2013",
			    population_2014,
			    population_2015]
    county_level = 50
    summary_level = "SUMLEV"

Counties Only

counties = census_data[census_data[
    CensusVariables.summary_level]==CensusVariables.county_level]
# this throws off the numeric index for the argmax method so reset it
counties = counties.reset_index()

# but the last question wants the original index
counties_original_index = census_data[census_data[
    CensusVariables.summary_level]==CensusVariables.county_level]

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too…)

def answer_five():
    """finds state with the most counties

    Returns
    -------

    str name of state with the most counties
    """
    return counties.groupby(
	CensusVariables.state_name).count().COUNTY.argmax()

answer_five()
Texas

Question 6

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.

def answer_six():
    """finds three most populous states based on top three counties in each

    Returns
    -------

    List: top three state-names (highest to lowest)
    """
    top_threes = counties.groupby(
	CensusVariables.state_name
    )[CensusVariables.census_population].nlargest(3)
    states = top_threes.groupby(level=0).sum()
    return list(states.nlargest(3).index)

answer_six()
California Texas Illinois

Question 7

Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.

This function should return a single string value.

def answer_seven():
    """Find county with largest absolute population variance

    Returns
    -------

    str: name of the county
    """
    return counties.iloc[
	(counties[
	    CensusVariables.population_estimates].max(axis=1) -
	 counties[
	     CensusVariables.population_estimates].min(axis=1)
	).argmax()][CensusVariables.county_name]

answer_seven()
Harris County

Question 8

In this datafile, the United States is broken up into four regions using the "REGION" column.

Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014.

This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_data (sorted ascending by index).

def answer_eight():
    """find region 1 or 2 counties:

      * with names that start with Washington
      * whose population grew from 2014 to 2015

    .. note:: the index in the final data-frame has to match the original
    census data

    Returns
    -------

    DataFrame: with the county and state-name columns
    """
    regions = counties_original_index[
	(counties_original_index[CensusVariables.region]==1) |
	(counties_original_index[CensusVariables.region]==2)]
    washingtons = regions[
	regions[CensusVariables.county_name].str.startswith("Washington")]
    grew = washingtons[washingtons[CensusVariables.population_2015] >
		       washingtons[CensusVariables.population_2014]]
    return grew[[CensusVariables.state_name,
		 CensusVariables.county_name]]

outcome = answer_eight()
assert outcome.shape == (5,2)
assert list(outcome.columns) == ['STNAME', 'CTYNAME']
print(tabulate(outcome, headers=["index"] + list(outcome.columns),
	       tablefmt="orgtbl"))
index STNAME CTYNAME
896 Iowa Washington County
1419 Minnesota Washington County
2345 Pennsylvania Washington County
2355 Rhode Island Washington County
3163 Wisconsin Washington County