HoloViews Tabular Datasets

Introduction

This is a walk through the HoloViews page on Tabular Datasets. The dataset was created by the Wall Street Journal using Project Tycho, but I'm getting it from the HoloViews GitHub repository. The Wall Street Journal page is here. Unfortunately it has mixed content types (https and http) as well as some other problems which prevent Firefox and Chrome-based browsers from rendering the visualization, so I don't know what it actually looks like. Given that it's a commercial site I'm assuming it's an old page that they don't care about anymore.

Warning: I originally did this with modin and it wouldn't plot correctly. Save modin for pre-processing and use the real pandas when plotting.

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from bokeh.io import output_notebook
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Holoviews Bokeh

I don't know why but you have to specify that you're using bokeh, even though it looks like it's working when you don't.

holoviews.extension("bokeh")
output_notebook()

The Embedder

files_path = Path("../../files/posts/libraries/holoviews-tabular-datasets/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Dotenv

I have the path to the data-set in a .env file so I'll load it into the environment dictionary.

load_dotenv(".env")
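When the path is read back out of the environment later on, a missing variable produces a fairly cryptic error. Here is a minimal sketch, using only the standard library, of a lookup that fails with a readable message instead. The name dataset_path is hypothetical (it isn't part of python-dotenv or HoloViews), and it assumes load_dotenv has already run.

```python
import os
from pathlib import Path


def dataset_path(variable: str = "DISEASES") -> Path:
    """Return the expanded path stored in an environment variable.

    Hypothetical helper: raises a KeyError with a hint if the variable is unset.
    """
    value = os.environ.get(variable)
    if value is None:
        raise KeyError(f"{variable} is not set; check the .env file")
    # expanduser() turns a leading "~" into the home directory
    return Path(value).expanduser()
```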

Tabulate

This will print a table that org knows how to render.

orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl", 
                   showindex=False)

Load the Data

The data comes from Project Tycho, which provides health-related data sets for research. The .env file assumes that I cloned the HoloViews repository so that I can load the data from it.

# os.environ[...] raises a KeyError if DISEASES isn't set
path = Path(os.environ["DISEASES"]).expanduser()
assert path.is_file()
# read_csv opens the file itself, so no separate open() is needed
diseases = pandas.read_csv(path)
print(orgtable(diseases.head()))
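| Year | Week | State   | measles | pertussis |
|------+------+---------+---------+-----------|
| 1928 |    1 | Alabama |    3.67 |       nan |
| 1928 |    2 | Alabama |    6.25 |       nan |
| 1928 |    3 | Alabama |    7.95 |       nan |
| 1928 |    4 | Alabama |   12.58 |       nan |
| 1928 |    5 | Alabama |    8.03 |       nan |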
print(len(diseases))
print(len(diseases.State.unique()))
print(diseases.Year.min())
print(diseases.Year.max())
print(diseases.Week.unique())
222768
51
1928
2011
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52]

The State column includes Washington D.C., which is why there are 51 states. It has 52 weeks of data for each year from 1928 through 2011. Since the measles and pertussis values are greater than 1 I assume that they're some kind of rate (like incidence per million), but the page doesn't say, and the article they link to doesn't render in my browser (and I don't have an account to download a dataset).
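A quick arithmetic check, using only the numbers printed above, suggests that the table has exactly one row for every combination of state, year, and week:

```python
states = 51               # 50 states plus Washington D.C.
years = 2011 - 1928 + 1   # inclusive range of years
weeks = 52

expected_rows = states * years * weeks
print(years, expected_rows)  # 84 222768
```

That matches the 222768 rows reported by len(diseases), so there don't appear to be any missing rows, only missing values (the nan entries).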

Create a Dataset

HoloViews has a class called Dataset that lets you declare the independent variables (key dimensions, or kdims) and the dependent variables (value dimensions, or vdims).

key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"),
                    ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)

The value_dimensions list holds tuples of the form (<column-name>, <output-name>), so when you make a plot it will use the <output-name> for any labels that are created.

Aggregate The Data

The one column that I didn't add is the Week column. The Dataset has a rather confusing aggregate method (confusing because you only pass in the function to aggregate with) that apparently uses the key dimensions we passed in to decide what to group by, applying the function across everything else (here, the weeks).

dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)
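The 4284 rows are 51 states times 84 years, which fits the idea that the weeks were averaged away. For comparison, here is a rough pandas sketch of what I assume the aggregation amounts to: group by the key dimensions and take the mean of the value column. The rows are made up, and this is not a claim about how HoloViews implements it.

```python
import pandas

# Made-up rows: two weeks for each of two (Year, State) pairs.
toy = pandas.DataFrame({
    "Year": [1928, 1928, 1929, 1929],
    "Week": [1, 2, 1, 2],
    "State": ["Alabama"] * 4,
    "measles": [2.0, 4.0, 6.0, 10.0],
})

# Group on the key dimensions and average the value column;
# Week is neither a key nor a value, so it is collapsed away.
yearly = (toy.groupby(["Year", "State"], as_index=False)["measles"]
          .mean())
print(yearly)
```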
layout = (dataset.to(holoviews.Curve, "Year", "measles")
          + dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
layout.opts(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
Embed(layout, "measles_pertusis")()

Two things to note. One is that HoloViews picked up the nicer names without us having to specify them. Another is that only Alabama is displayed. In the demonstration HoloViews created a drop-down menu to select a state but it didn't do it here. Maybe you need to run it in a jupyter notebook…

Actually, I think it might be a conflict with nikola, this is a page saved from a jupyter notebook without any nikola pre-processing:


Save the HTML

I'll see if I can save the plot directly here, without using Jupyter.

save_file = "diseases_2.html"
output = files_path.joinpath(save_file)
holoviews.save(layout, output)
print("[[file:{}][This is the plot.]]".format(save_file))

This is the plot.