Pulling Top Names From the Past 100 Years
Table of Contents
Beginning
Set Up
Imports
# python
from functools import partial
# pypi
import hvplot.pandas
import pandas
import requests
from graeae import EmbedHoloviews
Plotting
SLUG = "pulling-top-100-names"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")
Middle
Pulling the Table
Trying a Naive Request
URL = "https://www.ssa.gov/OACT/babynames/decades/century.html"
response = requests.get(URL)
assert response.ok
tables = pandas.read_html(response.text)
print(len(tables))
1
table = tables[0]
print(table.head(1))
Unnamed: 0_level_0 Males Females Rank Name Number Name Number 0 1.0 James 4764644 Mary 3328565
So it looks like it's a plain table so we don't need Selenium or something else to render javascript, but the header is a little messed up.
Cleaning Up the Table
males = table[table.columns[1:3]]
females = table[table.columns[3:]]
print(males.head(1))
Males Name Number 0 James 4764644
The Last Row
print(males.iloc[-1])
Males Name Source: 100% sample based on Social Security c... Number Source: 100% sample based on Social Security c... Name: 100, dtype: object
print(table.tail(1)[table.columns[-1]].values[0])
Source: 100% sample based on Social Security card application data as of of March 2019. See the limitations of this data source.
That's interesting, but it doesn't help with what I want, which is just the names.
males = males.iloc[:-1]
females = females.iloc[:-1]
assert len(males) == 100
assert len(females) == 100
print(females.head(1))
Females Name Number 0 Mary 3328565
So, now to re-add the rank and add gender columns so the multi-level column headers can be re-done.
males.columns = ["name", "count"]
females.columns = ["name", "count"]
males["gender"] = "male"
females["gender"] = "female"
males["rank"] = list(range(1, 101))
females["rank"] = list(range(1, 101))
print(males.head(1))
name count gender rank 0 James 4764644 male 1
Now to re-combine them.
names = pandas.concat([males, females])
Some Plotting
Before we can plot them we have to change the count
column to be (you guessed it) a number since that last row with the source information prevents pandas' ability to convert the column itself.
names.loc[:, "count"] = names["count"].astype(int)
plot = names.hvplot.bar(x="rank", y="count", by="gender").opts(
title="Top Male and Female Names of the Last 100 Years",
width=990,
height=780,
fontscale=2,
)
outcome = Embed(plot=plot, file_name="gender_plot")()
print(outcome)
Well, pretend that you don't see the x-ticks… It's interesting that the highest ranked male names have a higher count than the highest ranked female names.
End
Time to save the table.
names.to_csv("apeiron/top_100_ssa_names.csv", index=False)