Selecting Tabular Data

Cloistered Monkey

2019-03-02 16:25

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPI

from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Holoviews Bokeh

I don't know why, but you have to tell HoloViews that you're using bokeh, even though it looks like it works when you don't.

holoviews.extension("bokeh")

The Embedder

files_path = Path("../../files/posts/libraries/selecting-tabular-data/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Dotenv

I have the path to the data-set in a .env file so I'll load it into the environment dictionary.

load_dotenv(".env")

Load the Data

This is the same measles/pertussis data that I used before.

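# indexing (rather than .get) raises a KeyError naming the variable if it isn't set
path = Path(os.environ["DISEASES"]).expanduser()
assert path.is_file()
# read_csv opens the file itself, so no context manager is needed
diseases = pandas.read_csv(path)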

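# orgtable isn't defined anywhere above; this is an assumed definition built on tabulate
orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl", showindex=False)
print(orgtable(diseases.head()))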

Year	Week	State	measles	pertussis
1928	1	Alabama	3.67	nan
1928	2	Alabama	6.25	nan
1928	3	Alabama	7.95	nan
1928	4	Alabama	12.58	nan
1928	5	Alabama	8.03	nan

Convert the DataFrame to a Dataset

key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)

Aggregate The Data

dataset = dataset.aggregate(function=numpy.mean)

print(dataset)
print(dataset.shape)

:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)
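Since Year and State are the key dimensions, the aggregation collapses the weekly rows into one mean per year and state. Here's a rough pandas equivalent on toy data, just to show the idea. The Alabama numbers are the first two rows of the table above and the Oregon numbers are made up; this is a sketch of the concept, not what HoloViews does internally.

```python
import pandas

# Toy weekly data: the Alabama values are the first rows of the table above,
# the Oregon values are made up for illustration.
weekly = pandas.DataFrame({
    "Year": [1928, 1928, 1928, 1928],
    "Week": [1, 2, 1, 2],
    "State": ["Alabama", "Alabama", "Oregon", "Oregon"],
    "measles": [3.67, 6.25, 1.0, 3.0],
})

# One row per (Year, State), averaging over the weeks.
yearly = weekly.groupby(["Year", "State"], as_index=False)["measles"].mean()
print(yearly)
```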

Plot a Subset

northwest = ["Oregon", "Washington", "Idaho"]
bars = dataset.select(State=northwest, Year=(1928, 1938)).to(
    holoviews.Bars, ["Year", "State"], "measles").sort()
plot = bars.opts(
    opts.Bars(width=800, height=400, tools=["hover"], xrotation=90, show_legend=False)
)

Embed(plot, "northwest_measles")()

As with all things HoloViews, there are many things that are unclear here, but the one that really tripped me up was the selection of years. Although I passed in a tuple, select treated it as a (start, stop) range rather than as two separate years, so my original plot had a century instead of two years. Oh, well.

Interesting how Oregon spiked up in 1935 and 1936 then dropped down in 1937. According to the CDC, the measles vaccine didn't come out until 1963, so I guess those ebbs and flows are just the normal way diseases cycle through populations. Oregon and Washington probably had more immigrants than Idaho, as well as higher population densities in their main cities, which might account for their higher rates. Hard to say without correlating data.

Facets

The bar-plot is okay for comparing the states within any given year, but it's hard to see trends in it. Here's how to use selecting to plot a line for each state.

grouped = dataset.select(State=northwest, Year=(1928, 2011)).to(holoviews.Curve, "Year", "measles")
gridspace = grouped.grid("State")
plot = gridspace.opts(
    opts.Curve(width=200, color="crimson", tools=["hover"])
)

Embed(plot, "northwest_measles_grid")()

Overlays

While the side-by-side plots are clearer than the bar-plots, it's harder to compare the states year-by-year, so it might be better to plot them over each other.

overlay = grouped.overlay("State")
plot = overlay.opts(
    opts.Curve(height=500, width=1000, color=holoviews.Cycle(values=["crimson", "slateblue", "cadetblue"]))
)

Embed(plot, "northwest_measles_overlay")()
