Pulling Top Names From the Past 100 Years

The Cloistered Monkey

2020-08-18 17:33

Beginning

Set Up

Imports

# python
from functools import partial

# pypi
import hvplot.pandas
import pandas
import requests

from graeae import EmbedHoloviews

Plotting

SLUG = "pulling-top-100-names"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Pulling the Table

Trying a Naive Request

URL = "https://www.ssa.gov/OACT/babynames/decades/century.html"
response = requests.get(URL)
assert response.ok

tables = pandas.read_html(response.text)
print(len(tables))

table = tables[0]
print(table.head(1))

  Unnamed: 0_level_0  Males          Females         
                Rank   Name   Number    Name   Number
0                1.0  James  4764644    Mary  3328565

So it looks like it's a plain table so we don't need Selenium or something else to render javascript, but the header is a little messed up.

Cleaning Up the Table

males = table[table.columns[1:3]]
females = table[table.columns[3:]]
print(males.head(1))

   Males         
    Name   Number
0  James  4764644

The Last Row

print(males.iloc[-1])

Males  Name      Source: 100% sample based on Social Security c...
       Number    Source: 100% sample based on Social Security c...
Name: 100, dtype: object

print(table.tail(1)[table.columns[-1]].values[0])

Source: 100% sample based on Social Security card application data as of of March 2019. See the limitations of  this data source.

That's interesting, but it doesn't help with what I want, which is just the names.

males = males.iloc[:-1]
females = females.iloc[:-1]

assert len(males) == 100
assert len(females) == 100
print(females.head(1))

  Females         
     Name   Number
0    Mary  3328565

So, now to re-add the rank and add gender columns so the multi-level column headers can be re-done.

males.columns = ["name", "count"]
females.columns = ["name", "count"]
males["gender"] = "male"
females["gender"] = "female"

males["rank"] = list(range(1, 101))
females["rank"] = list(range(1, 101))

print(males.head(1))

    name    count gender  rank
0  James  4764644   male     1

Now to re-combine them.

names = pandas.concat([males, females])

Some Plotting

Before we can plot them we have to change the count column to be (you guessed it) a number since that last row with the source information prevents pandas' ability to convert the column itself.

names.loc[:, "count"] = names["count"].astype(int)

plot = names.hvplot.bar(x="rank", y="count", by="gender").opts(
    title="Top Male and Female Names of the Last 100 Years",
    width=990,
    height=780,
    fontscale=2,
)

outcome = Embed(plot=plot, file_name="gender_plot")()

print(outcome)

Well, pretend that you don't see the x-ticks… It's interesting that the highest ranked male names have a higher count than the highest ranked female names.

End

Time to save the table.

names.to_csv("apeiron/top_100_ssa_names.csv", index=False)

Table of Contents