Selecting Tabular Data
Set Up
Imports
Python
from functools import partial
from pathlib import Path
import os
PyPI
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas
My Projects
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
Holoviews Bokeh
I don't know why, but you have to specify that you're using bokeh, even though it looks like it's working when you don't.
holoviews.extension("bokeh")
The Embedder
files_path = Path("../../files/posts/libraries/selecting-tabular-data/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)
Dotenv
I have the path to the data-set in a .env file, so I'll load it into the environment dictionary.
load_dotenv(".env")
Load the Data
This is the same measles/pertussis data that I used before.
path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    # read from the handle that was just opened rather than re-opening the path
    diseases = pandas.read_csv(reader)
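# assumed definition: an org-mode style table without the index column
orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl", showindex=False)
print(orgtable(diseases.head()))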
| Year | Week | State   | measles | pertussis |
|------|------|---------|---------|-----------|
| 1928 | 1    | Alabama | 3.67    | nan       |
| 1928 | 2    | Alabama | 6.25    | nan       |
| 1928 | 3    | Alabama | 7.95    | nan       |
| 1928 | 4    | Alabama | 12.58   | nan       |
| 1928 | 5    | Alabama | 8.03    | nan       |
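One fragile spot in the loading code: if DISEASES isn't set, os.environ.get returns None and Path(None) raises a TypeError that never mentions the variable. Here is a minimal sketch of a more explicit lookup. The name path_from_environment is a hypothetical helper made up for illustration, not part of python-dotenv or the code above.

```python
import os
from pathlib import Path


def path_from_environment(variable: str) -> Path:
    """Return the file path stored in an environment variable.

    Hypothetical helper (not part of python-dotenv): it only makes the
    failure messages more explicit than ``Path(os.environ.get(...))``.
    """
    value = os.environ.get(variable)
    if value is None:
        # the variable was never loaded, e.g. the .env file was not found
        raise KeyError(f"environment variable {variable!r} is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{variable} points to a missing file: {path}")
    return path
```

After load_dotenv(".env") this would be called as path = path_from_environment("DISEASES"), replacing both the os.environ.get line and the assert.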
Convert the DataFrame to a Dataset
key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)
Aggregate The Data
dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)
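Since no dimensions are passed in, aggregate seems to group by the key dimensions (Year and State), so numpy.mean collapses the weekly rows into one value per state per year. Here is a rough pure-Python sketch of that grouping, not how HoloViews does it internally. It uses only the five Alabama rows printed earlier, so the result is the mean of those five weeks and not the real 1928 figure. The name yearly_mean is made up for illustration.

```python
from collections import defaultdict
from statistics import fmean

# (Year, Week, State, measles) rows copied from the head of the data
rows = [
    (1928, 1, "Alabama", 3.67),
    (1928, 2, "Alabama", 6.25),
    (1928, 3, "Alabama", 7.95),
    (1928, 4, "Alabama", 12.58),
    (1928, 5, "Alabama", 8.03),
]


def yearly_mean(rows):
    """Average the measles values for each (Year, State) pair."""
    groups = defaultdict(list)
    for year, _week, state, measles in rows:
        # the week is dropped: it is not one of the key dimensions
        groups[(year, state)].append(measles)
    return {key: fmean(values) for key, values in groups.items()}


means = yearly_mean(rows)
print(means)
```

The five weekly rows end up as a single (1928, "Alabama") entry, which is the same collapse that takes the weekly data down to the 4,284 rows reported above.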
Plot a Subset
northwest = ["Oregon", "Washington", "Idaho"]
bars = dataset.select(State=northwest, Year=(1928, 1938)).to(
    holoviews.Bars, ["Year", "State"], "measles").sort()
plot = bars.opts(
    opts.Bars(width=800, height=400, tools=["hover"], xrotation=90, show_legend=False)
)
Embed(plot, "northwest_measles")()
As with all things HoloViews, there is a lot here that is unclear, but the thing that really tripped me up was the selection of years. Although I passed in a tuple, HoloViews treated it as a range (start, stop) rather than as two separate values, so my original plot had a century instead of two years. Oh, well.
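To make the tuple-versus-list distinction concrete, here is a toy model of the behaviour described above. It is not HoloViews code: matches is a hypothetical function, and the half-open range (stop year excluded) is an assumption about how HoloViews treats tuples, not something the plot above proves.

```python
def matches(value, criterion):
    """Toy model of a selection criterion (hypothetical, not the HoloViews API)."""
    if isinstance(criterion, tuple):
        # a tuple is read as a (start, stop) range; stop is assumed to be excluded
        start, stop = criterion
        return start <= value < stop
    if isinstance(criterion, (list, set)):
        # a list or set is read as a collection of discrete values
        return value in criterion
    # anything else is compared directly
    return value == criterion


years = range(1928, 2012)
as_range = [year for year in years if matches(year, (1928, 1938))]
as_values = [year for year in years if matches(year, [1928, 1938])]

print(len(as_range))  # ten consecutive years
print(as_values)      # just the two listed years
```

By the same logic, passing a list such as Year=[1928, 1938] to select should pick out just those two years, the same way State=northwest picks out three states.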
Interesting how Oregon spiked up in 1935 and 1936, then dropped down in 1937. According to the CDC, the measles vaccine didn't come out until 1963, so I guess those ebbs and flows are just the normal way diseases cycle through populations. Oregon and Washington probably had more immigrants than Idaho, as well as higher population densities in their main cities, which might account for their higher rates. Hard to say without correlating data.
Facets
The bar-plot is okay for comparing the states within any given year, but it's hard to see trends in it. Here's how to use selection to plot a line for each state.
grouped = dataset.select(State=northwest, Year=(1928, 2011)).to(holoviews.Curve, "Year", "measles")
gridspace = grouped.grid("State")
plot = gridspace.opts(
    opts.Curve(width=200, color="crimson", tools=["hover"])
)
Embed(plot, "northwest_measles_grid")()
Overlays
While the side-by-side plots are clearer than the bar-plots, it's harder to compare the states year-by-year, so it might be better to plot them over each other.
overlay = grouped.overlay("State")
plot = overlay.opts(
    opts.Curve(height=500, width=1000, color=holoviews.Cycle(values=["crimson", "slateblue", "cadetblue"]))
)
Embed(plot, "northwest_measles_overlay")()