XML To Pandas

Beginning

I went to the Social Security website to change my password (I had forgotten it so they mailed a temporary one to me) and noticed that they have your yearly reported earnings both as a PDF and as a data download. Unfortunately the data is given as an XML file instead of a CSV (it has more than just the earnings data, so it would have to be a series of CSVs instead of one file) so these are my notes on converting it to a pandas DataFrame using BeautifulSoup.

Imports

To actually do the conversion you only need BeautifulSoup and pandas, all the rest of the stuff comes into play because I'm making a post.

# python
from argparse import Namespace
from functools import partial
from pathlib import Path

import os
import random

# from pypi
from bs4 import BeautifulSoup
from dotenv import load_dotenv

import hvplot.pandas
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

The Dotenv

I'm using a .env file to point to the location of the file so this call will load it.

load_dotenv(override=True)

The Plotting

This is just a central place to set up some constants so I don't have to re-type them or hunt around for them if I want to change things.

SLUG = "xml-to-pandas"
Embed =  partial(EmbedHoloviews,
                 create_folder=True,
                folder_path=f"files/posts/{SLUG}")

Plot = Namespace(
    width=990,
    height=780,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    font_scale=2,
)

The Random Seed

I decided since I'm talking about social security stuff I'd make at least some attempt at obfuscating things so I'm adding a random offset to the years.

OFFSET = random.randint(10, 20)

Middle

The XML

Loading It

First, I'll load the XML into BeautifulSoup.

path = Path(os.environ["SOCIAL_SECURITY"]).expanduser()
assert path.is_file()

with path.open() as reader:
    xml = BeautifulSoup(reader.read(), "xml")

Note that you have to pass in the "xml" argument - BeautifulSoup is primarily an HTML parser so it assumes an HTML document by default. Normally I'd do some exploring at this point, but I didn't find it such an easy thing to do (I don't work with XML data generally) and found that it was easier to look at it with less and figure out what I needed.

The Earnings

The data I wanted is in a set of tags that look like this:

<osss:Earnings endYear="1998" startYear="1998">
<osss:FicaEarnings>526</osss:FicaEarnings>
<osss:MedicareEarnings>526</osss:MedicareEarnings>
</osss:Earnings>

This set represents what was collected in 1998 - $526 for Social Security and Medicare (FICA stands for Federal Insurance Contributions Act and is the name of the tax that funds Social Security). Yes, I only made $526 in 1998 because I was a teenager working a part-time job taking store inventories - still, I can't believe how little I got paid…

Anyway, so the first thing to do is to grab all the nodes representing earning.

earnings = xml.find_all("Earnings")
print(len(earnings))
print(earnings[0])
34
<osss:Earnings endYear="1998" startYear="1998">
<osss:FicaEarnings>526</osss:FicaEarnings>
<osss:MedicareEarnings>526</osss:MedicareEarnings>
</osss:Earnings>

The Years

Now that we have the earnings, we can see about getting the years. Although they have endYear and startYear they're always the same so I'll use startYear. I'm adding the OFFSET here just to obfuscate what years I'm looking at.

for year in earnings[:5]:
    print(int(year.get("startYear")) + OFFSET)
1998
1999
2000
2001
2002

Collected

Next I'll see about grabbing the amounts collected for each year. The FICA and Medicare amounts are always the same so I'll just use the FICA amount.

for year in earnings[:5]:
    print(year.find("FicaEarnings").string)
526
1123
1546
0
0

That looks right. It drops to 0 because I went to college and started working on campus and since I went to a state university they didn't collect FICA.

To a DataFrame

Anyway, that's basically all you need to get it going.

data = pandas.DataFrame.from_dict(
    dict(
        year = [int(year.get("startYear"))  for year in earnings],
        amount = [int(year.find("FicaEarnings").string) for year in earnings]
        )
)

data["offset"] = data.year + OFFSET
print(data.head(5)[["offset", "amount"]])
   offset  amount
0    1998     526
1    1999    1123
2    2000    1546
3    2001       0
4    2002       0

Save It

del(data["offset"])
csv = Path(os.environ["SOCIAL_SECURITY_CSV"]).expanduser()
data.to_csv(csv, index=False)

End

I usually like to put in a little plotting so I'm going to plot the amount over time. Since I already shared my first five years I'm going to start after that and obscure the numbers a little.

Time Series

First I'll add the offset year back in, then I'm going to scale it.

data["year"] = data.year + OFFSET
to_plot = data.iloc[5:]
MAX = to_plot.amount.max()
to_plot.loc[:, "amount"] = to_plot.amount/MAX

plot = to_plot.hvplot(x="year", y="amount").opts(
    title="Income Over Time",
    width=Plot.width, height=Plot.height,
    fontscale=Plot.font_scale,
    xaxis="bare",
    color=Plot.blue,
)

outcome = Embed(plot=plot, file_name="income_over_time")()
print(outcome)

Figure Missing

Well, there you go. Since I'm adding a random offset to the years I removed the x-axis labels, but the plot starts out with me still working for the State, so there's no reported income, then there's an uptick when I took a retail job, then a drop again when I went back to school for a little while to look into getting a design degree, then there's another uptick when I gave up on the design degree and went to work as a clerk for a real estate office, then the next drop comes when I went back to get a Computer Science degree and the rising after that shows the difference between working retail/clerical versus being a computer programmer, with the two plateaus representing the two companies I've worked for.