XML To Pandas
Table of Contents
Beginning
I went to the Social Security website to change my password (I had forgotten it so they mailed a temporary one to me) and noticed that they have your yearly reported earnings both as a PDF and as a data download. Unfortunately the data is given as an XML file instead of a CSV (it has more than just the earnings data, so it would have to be a series of CSVs instead of one file) so these are my notes on converting it to a pandas DataFrame using BeautifulSoup.
Imports
To actually do the conversion you only need BeautifulSoup and pandas, all the rest of the stuff comes into play because I'm making a post.
# python
from argparse import Namespace
from functools import partial
from pathlib import Path
import os
import random
# from pypi
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import hvplot.pandas
import pandas
# my stuff
from graeae import EmbedHoloviews
Set Up
The Dotenv
I'm using a .env
file to point to the location of the file so this call will load it.
load_dotenv(override=True)
The Plotting
This is just a central place to set up some constants so I don't have to re-type them or hunt around for them if I want to change things.
SLUG = "xml-to-pandas"
Embed = partial(EmbedHoloviews,
create_folder=True,
folder_path=f"files/posts/{SLUG}")
Plot = Namespace(
width=990,
height=780,
tan="#ddb377",
blue="#4687b7",
red="#ce7b6d",
font_scale=2,
)
The Random Seed
I decided since I'm talking about social security stuff I'd make at least some attempt at obfuscating things so I'm adding a random offset to the years.
OFFSET = random.randint(10, 20)
Middle
The XML
Loading It
First, I'll load the XML into BeautifulSoup.
path = Path(os.environ["SOCIAL_SECURITY"]).expanduser()
assert path.is_file()
with path.open() as reader:
xml = BeautifulSoup(reader.read(), "xml")
Note that you have to pass in the "xml"
argument - BeautifulSoup is primarily an HTML parser so it assumes an HTML document by default. Normally I'd do some exploring at this point, but I didn't find it such an easy thing to do (I don't work with XML data generally) and found that it was easier to look at it with less and figure out what I needed.
The Earnings
The data I wanted is in a set of tags that look like this:
<osss:Earnings endYear="1998" startYear="1998">
<osss:FicaEarnings>526</osss:FicaEarnings>
<osss:MedicareEarnings>526</osss:MedicareEarnings>
</osss:Earnings>
This set represents what was collected in 1998 - $526 for Social Security and Medicare (FICA stands for Federal Insurance Contributions Act and is the name of the tax that funds Social Security). Yes, I only made $526 in 1998 because I was a teenager working a part-time job taking store inventories - still, I can't believe how little I got paid…
Anyway, so the first thing to do is to grab all the nodes representing earning.
earnings = xml.find_all("Earnings")
print(len(earnings))
print(earnings[0])
34 <osss:Earnings endYear="1998" startYear="1998"> <osss:FicaEarnings>526</osss:FicaEarnings> <osss:MedicareEarnings>526</osss:MedicareEarnings> </osss:Earnings>
The Years
Now that we have the earnings, we can see about getting the years. Although they have endYear
and startYear
they're always the same so I'll use startYear
. I'm adding the OFFSET
here just to obfuscate what years I'm looking at.
for year in earnings[:5]:
print(int(year.get("startYear")) + OFFSET)
1998 1999 2000 2001 2002
Collected
Next I'll see about grabbing the amounts collected for each year. The FICA and Medicare amounts are always the same so I'll just use the FICA amount.
for year in earnings[:5]:
print(year.find("FicaEarnings").string)
526 1123 1546 0 0
That looks right. It drops to 0 because I went to college and started working on campus and since I went to a state university they didn't collect FICA.
To a DataFrame
Anyway, that's basically all you need to get it going.
data = pandas.DataFrame.from_dict(
dict(
year = [int(year.get("startYear")) for year in earnings],
amount = [int(year.find("FicaEarnings").string) for year in earnings]
)
)
data["offset"] = data.year + OFFSET
print(data.head(5)[["offset", "amount"]])
offset amount 0 1998 526 1 1999 1123 2 2000 1546 3 2001 0 4 2002 0
Save It
del(data["offset"])
csv = Path(os.environ["SOCIAL_SECURITY_CSV"]).expanduser()
data.to_csv(csv, index=False)
End
I usually like to put in a little plotting so I'm going to plot the amount over time. Since I already shared my first five years I'm going to start after that and obscure the numbers a little.
Time Series
First I'll add the offset year back in, then I'm going to scale it.
data["year"] = data.year + OFFSET
to_plot = data.iloc[5:]
MAX = to_plot.amount.max()
to_plot.loc[:, "amount"] = to_plot.amount/MAX
plot = to_plot.hvplot(x="year", y="amount").opts(
title="Income Over Time",
width=Plot.width, height=Plot.height,
fontscale=Plot.font_scale,
xaxis="bare",
color=Plot.blue,
)
outcome = Embed(plot=plot, file_name="income_over_time")()
print(outcome)
Well, there you go. Since I'm adding a random offset to the years I removed the x-axis labels, but the plot starts out with me still working for the State, so there's no reported income, then there's an uptick when I took a retail job, then a drop again when I went back to school for a little while to look into getting a design degree, then there's another uptick when I gave up on the design degree and went to work as a clerk for a real estate office, then the next drop comes when I went back to get a Computer Science degree and the rising after that shows the difference between working retail/clerical versus being a computer programmer, with the two plateaus representing the two companies I've worked for.