HoloViews Tabular Datasets
Table of Contents
Introduction
This is a walk through the HoloViews page on Tabular Datasets. The data-set was created by the Wall Street Journal using Project Tycho, but I'm getting it from the HoloViews github repository. The Wall Street Journal page is here. Unfortunately it has mixed content types (https and http) as well as some other problems which prevent Firefox and Chrome-based browsers from rendering it the visualization so I don't know what it actually looks like. Given that it's a commercial site I'm assuming it's an old page that they don't care about anymore.
Warning: I originally did this with modin
and it wouldn't plot correctly. Save it for pre-processing and just use the real pandas
when plotting.
Set Up
Imports
Python
from functools import partial
from pathlib import Path
import os
PyPi
from bokeh.io import output_notebook
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas
My Projects
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
Holoviews Bokeh
I don't know why but you have to specify that you're using bokeh
, even though it looks like it's working when you don't.
holoviews.extension("bokeh")
output_notebook()
The Embedder
files_path = Path("../../files/posts/libraries/holoviews-tabular-datasets/")
Embed = partial(
EmbedBokeh,
folder_path=files_path)
Dotenv
I have the path to the data-set in a .env
file so I'll load it into the environment dictionary.
load_dotenv(".env")
Tabulate
This will print a table that org knows how to render.
orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl",
showindex=False)
Load the Data
The data comes from Project Tycho, which provides health-related data sets for research. The .env
file assumes that I cloned the HoloViews
repository so that I can load the data from it.
path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
diseases = pandas.read_csv(path)
print(orgtable(diseases.head()))
Year | Week | State | measles | pertussis |
---|---|---|---|---|
1928 | 1 | Alabama | 3.67 | nan |
1928 | 2 | Alabama | 6.25 | nan |
1928 | 3 | Alabama | 7.95 | nan |
1928 | 4 | Alabama | 12.58 | nan |
1928 | 5 | Alabama | 8.03 | nan |
print(len(diseases))
print(len(diseases.State.unique()))
print(diseases.Year.min())
print(diseases.Year.max())
print(diseases.Week.unique())
222768 51 1928 2011 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52]
The State
column includes Washington D.C., which is why there are 51 states. It has 52 weeks of data for each year from 1928 through 2011. Since measles and pertussis are greater than 1 I assume that this is some kind of rate (like incidence per million), but the page doesn't say and I the article they link to doesn't seem to render in my browser (and I don't have an account to download a dataset).
Create a Dataset
HoloViews has a class called a Dataset that lets you declare the dependent (value dimensions (vdims)) and independent variables (key dimensions (kdims)).
key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertusis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)
The value_dimensions
list has tuples - these take the form (<column-name>, <output-name>)
so when you make a plot it will use the <output-name>
for any labels that are created.
Aggregate The Data
The one column that I didn't add is the Week
column. The Dataset
has a rather confusing aggregate
method (confusing because you only pass in the function to aggregate with) that apparently knows how to use the key_dimensions
variables we passed in to figure out what to aggregate.
dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset [Year,State] (measles,pertussis) (4284, 4)
layout = (dataset.to(holoviews.Curve, "Year", "measles")
+ dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
layout.opts(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
Embed(layout, "measles_pertusis")()
Two things to note. One is that HoloViews picked up the nicer names without us having to specify them. Another is that only Alabama is displayed. In the demonstration HoloViews created a drop-down menu to select a state but it didn't do it here. Maybe you need to run it in a jupyter notebook…
Actually, I think it might be a conflict with nikola
, this is a page saved from a jupyter notebook without any nikola pre-processing:
Save the HTML
I'll see if you can do it directly here without using jupyter.
save_file = "diseases_2.html"
output = files_path.joinpath(save_file)
holoviews.save(layout, output)
print("[[file:{}][This is the plot.]]".format(save_file))