Walking Through the Pyviz Workflow Example

Cloistered Monkey

2019-03-14 12:48

Beginning

This is a replication of (most of) the pyviz workflow introduction to see if I can replicate it. Nothing new here.

Set Up

Imports

From Python.

from functools import partial
from pathlib import Path
import os

From PyPi.

from bokeh.models import CrosshairTool, HoverTool
from dotenv import load_dotenv
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

My Stuff.

from graeae.visualization import EmbedBokeh

Presets

Load the dotenv.

load_dotenv(".env")

Set up the bokeh.

OUTPUT_FOLDER = Path("../../files/posts/libraries/walking-through-the-pyviz-workflow-example/")
Embed = partial(EmbedBokeh, folder_path=OUTPUT_FOLDER)
holoviews.extension("bokeh", width=800)

Set up the table printer.

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

Constants

class Plots:
    width = 1000
    height = 600
    font_size = 18

Load Some Data

Measles and pertusis by state.

diseases = pandas.read_csv(Path(os.environ.get("DISEASES")))

What does the data look like?

print(TABLE(diseases.head(5)))

Year	Week	State	measles	pertussis
1928	1	Alabama	3.67	nan
1928	2	Alabama	6.25	nan
1928	3	Alabama	7.95	nan
1928	4	Alabama	12.58	nan
1928	5	Alabama	8.03	nan

print(TABLE(diseases.sample(5)))

Year	Week	State	measles	pertussis
1945	18	Tennessee	2.26	0.66
1988	7	Mississippi	nan	nan
1973	14	North Dakota	0	nan
1963	29	Montana	5.55	nan
1932	13	Utah	0.39	nan

It looks like this data set has weekly measles… counts? I don't know why the values are fractions. It also looks like pertusis (whooping cough) is missing values.

Looking at the Missing Data

missing_measles = diseases[diseases.measles.isna()]
print("Missing Measles: {:.2f} %".format(
    100 * len(missing_measles)/len(diseases)))

missing_pertussis = diseases[diseases.pertussis.isna()]
print("Missing Pertussis: {:.2f} %".format(
    100 * len(missing_pertussis)/len(diseases)))

Missing Measles: 34.84 %
Missing Pertussis: 51.04 %

It looks like they both have missing values, but there are more missing pertussis values. I'll look at how the missing values are distributed.

Measles

decade_hover = HoverTool(
    tooltips=[
        ("Missing", "@Year_count{0,0}"),
        ("Decade Center Year", "@Year{0}"),
    ],
    formatters={
        "Count": "numeral",
        "Year": "numeral",
    },
    mode="vline",
)

histogram = (missing_measles[(missing_measles.Year >1929) 
                             & (missing_measles.Year < 2011)]
             .hvplot.hist("Year", bins=8, tools=[decade_hover],
                          width=Plots.width, height=Plots.height,
                          ylabel="Count",
                          title="Missing Weekly Measles Reports by Decade")
             .opts(fontsize=Plots.font_size))
Embed(histogram, "missing_measles_histogram")()

Since we have roughly eight decades of data I trimmed the years to exactly eight and then made a histogram with eight bins to get the counts per decade. Since this is a post about doing the plotting and not really about measles or pertussis I should probably mention that the way I got the Year_count variable name for the HoverTool was by printing the histogram object after I'd plotted it once.

print(histogram)

:Histogram   [Year]   (Year_count)

year_hover = HoverTool(
    tooltips=[
        ("Missing", "@count{0,0}"),
        ("Year", "@Year{0}"),
    ],
    formatters={
        "Count": "numeral",
        "Year": "numeral",
    },
    mode="vline",
)

year_counts = (missing_measles.groupby("Year")
               .agg({"Year": "count"})
               .rename(columns={"Year": "count"}).reset_index())
plot = (year_counts.hvplot(tools=[year_hover], x="Year", y="count",
                          width=Plots.width, height=Plots.height,
                          xlabel="Missing Count",
                          title="Missing Measles Reports by Year")
        .opts(fontsize=Plots.font_size))
Embed(plot, "aggregated_missing_measles")()

It looks like there are more missing values from 1981 onward. That kind of surprised me, but I guess that prior to this latest outbreak the reporting might have become less necessary since measles was less common (it was declared eliminated from the Americas in 2016).

print(histogram)

:Histogram   [Year]   (Year_count)

print(len(diseases.State.unique()) * len(diseases.Week.unique()))

By 2003 there are 2,652 missing values, which is our maxmimum value so it looks like there was no reporting in this data set from this year forward.

Pertussis

histogram = (missing_pertussis[(missing_pertussis.Year > 1929) 
                              & (missing_pertussis.Year < 2011)]
             .hvplot.hist("Year", bins=8, 
                          ylabel="Count of Missing",
                          title="Missing Pertussis by Decade", 
                          tools=[decade_hover])
             .opts(fontsize=Plots.font_size))
Embed(histogram, "missing_pertussis_distribution")()

Strangely the missing reports seem to peak in the 1960s. This seems problematic if you're going to look at the incident rates, but I'm only going to look at measles anyway.

Middle

Looking at Measles

By Year

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

measles_by_year = diseases[["Year", "measles"]].groupby("Year").agg(numpy.sum)
plot = measles_by_year.hvplot(title="Measles In the U.S. by Year", 
                              xlabel="Year", 
                              ylabel="Cases", 
                              width=1000,
                              tools=[hover])
Embed(plot, "measles_by_year")()

You can see from the plot that cases of measles have dropped dramatically over the years, with a particularly sharp drop in the 1960's.

Vaccines Enter The Picture

According to the CDC, the Edmonston-D vaccine was released in the United States in 1963 and the Edmonston-Enders vaccine (which is still currently in use) was released in 1968. Let's draw some lines to mark when certain events happened.

first = holoviews.VLine(1963).opts(color="black", alpha=0.5)
first_label = holoviews.Text(1963 - 1, 27000, "Vaccine Introduced", 
                             halign="right")
current = holoviews.VLine(1968).opts(color="black", alpha=0.5)
current_label = holoviews.Text(1968 + 1, 27000, "Newer Vaccine", halign="left")
highest = holoviews.VLine(int(measles_by_year.idxmax())).opts(color="red")
highest_label = holoviews.Text(int(measles_by_year.idxmax()) + 1, 27000, 
                               "Year of the Most Cases", halign="left")
lowest = holoviews.VLine(int(measles_by_year.idxmin())).opts(color="blue", 
                                                             alpha=0.5)
lowest_label = holoviews.Text(int(measles_by_year.idxmin()) - 1, 27000, 
                              "Zero Cases", halign="right")

plot_2 = (plot
          * first * first_label 
          * current * current_label 
          * highest * highest_label 
          * lowest * lowest_label).opts(fontsize=Plots.font_size)
Embed(plot_2, "measles_with_landmarks")()

It does look like the introduction of the vaccine(s) had a dramatic effect on the incidence of measles in the United States.

It appears that there were zero cases in 2002, but the actual value is 0.31, but my formatter cuts off the decimal place. 2003 is the first true zero, but as we saw above, this is also the first N/A value. Maybe N/A means zero, not missing. It's hard to say without some documentation about the data.

Measles By State

This creates a dropdown menu so we can see the states' measles cases separately. It doesn't work in this template so I'm saving it as a separate page.

measles_by_state = diseases.groupby(["Year", "State"])["measles"].sum()
states_plot = measles_by_state.hvplot(x="Year", groupby="State", width=800, dynamic=False)
file_name = "measles_by_state.html"
holoviews.save(states_plot, OUTPUT_FOLDER.joinpath(file_name))
print("[[file:{}][Link to plot]]".format(file_name))

Link to plot

Oregon Vs Hawaii

Instead of looking at all the states one at a time, we can choose two states and lay them out side by side to make them easier to compare. The addition sign is used to make plots next to each other.

hover = HoverTool(
    tooltips=[
        ("Measles Cases", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

oregon_plot = states_plot["Oregon"].relabel("Measles in Oregon").opts(
    tools=[hover],
    width=550, 
    fontsize=Plots.font_size)
hawaii_plot = states_plot["Hawaii"].relabel("Measles in Hawaii").opts(
    tools=[hover],
    width=550, 
    fontsize=Plots.font_size)
plot = (oregon_plot * first * current * current_label * first_label
        + hawaii_plot * first * first_label * current * current_label)
Embed(plot, "oregon_vs_hawaii")()

I don't know why but the labels don't work. This is one of the problems with HoloViews, I think - they make some things really easy but the minute you step outside of what they have documented there's no way to figure out what's going on and how to fix it (or do it in the first place). It's an impressive programming feat but not documented enough to be as useful as it might be. More of a thing for quick sketching after which you have to switch back over to bokeh if you want anything other than the canned views (kind of like Excel… except with less documentation).

Surprisingly (to me), Hawaii had more cases in their peak years and huge swings. Perhaps since it's an island the sailors and other travelers introduced epidemics. Or maybe they weren't as good at keeping records back then.

According to the 1960 Census Count Oregon was quite a bit more populous than Hawaii.

Four States

Since Oregon had a much larger population that Hawaii I thought I'd plot the states closer to it in size. The four states nearest to Hawaii in population (from the same 1960 popurlation report) are (in descending order):

Montana (674,767)
Idaho (667,191)
Hawaii (632,772)
North Dakota (632,446)

While the side-by side plot lets you see groupings and heights you can't easily compare values for each year so this time I'll put them all on the same plot. HoloViews automatically creates a legend which lets you dim a line by clicking on it in the legend.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
        ("State", "@State"),
    ],
    formatters={"measles": "numeral"},
)

states = ["Montana", "Idaho", "Hawaii", "North Dakota"]
start_year, end_year = 1930, 2005
plot = (measles_by_state.loc[start_year:end_year, states].hvplot(
    by="State",
    title="Measles 1930 - 2005",
    tools=[hover],
    fontsize=Plots.font_size,
    width=Plots.width) 
        * first * first_label 
        * current * current_label)
Embed(plot, "four_states_measles")()

Surprisingly Hawaii had the highest values in 1951 and 1955 and it looks like they had a really bad outbreak from 1955 through 1958, although Montana had higher values overall. Once again, the introduction of a vaccine seems to have a dramatic effect, although there continued to be mini outbreaks in the 1970s.

Faceting

Another way to compare the states is to plot them side-by side.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
    ],
    formatters={"measles": "numeral"},
)

crosshairs = CrosshairTool()
plot = (measles_by_state.loc[start_year:end_year, states].hvplot(
    x="Year", col="State", width=300, height=200, rot=90, tools=[crosshairs])
        * first * first_label * current * current_label).opts(title="Measles By Year")
Embed(plot, "faceted_measles_states")()

This makes it harder to compare year by year, but it looks kind of elegant and it's clearer that Hawaii only had this one intense period recorded (starting in 1950) before the vaccine came out, while Idaho and North Dakota had this steady low-level amount of cases and something was much worse in Montana (although, really, their pattern looks closer to that of the United States as a whole) and all of them benefitted from the vaccines.

As far as the plot goes, it might not be obvious but the main difference with what I did to get this plot is that the State column was assigned to the col argument instead of the by argument. I also had to set the title using the opts for the total plot, setting it for the hvplot didn't do anything.

Bar Chart

This time I'll plot the counts for the states using a bar-chart instead of a line-plot.

hover = HoverTool(
    tooltips=[
        ("Measles", "@measles{0,0}"),
        ("Year", "@Year"),
        ("State", "@State"),
    ],
    formatters={"measles": "numeral"},
    mode="vline",
)

plot = measles_by_state.loc[1960:1970, states].hvplot.bar(
    "Year",
    height=Plots.height,
    width=Plots.width,
    fontsize=Plots.font_size,
    title="Measles Count by Year",
    by="State", 
    tools=[hover],
    rot=90)
Embed(plot, "measles_bar_chart")()

This gives pretty much the same information as the earlier line-plot, except it makes it easier to see a states' case-count for a given year. On the other hand it's harder to really see the year-to-year patterns and you can't turn off states to highlight other states. This looks like something business people would use more than what scientists would use. I think there is a certain aesthetic advantage to it which is traded off by the extra trend information given by a line plot. It does have a major advantage in the way it simplifies the output, though. It also allows you to use the "vline" option so the user only has to be on the same x-axis as a bar to trigger the pop-up. When I did this with the line plots the ones for the different states ended up covering each other up (since they all shared the same x-value).

I tried to add the previous vertical lines to indicate when vaccines were introduced but a bar-plot uses different inputs so it raised an error. It's probably less important for a bar-chart anyway, since you aren't emphasizing the time line quite so much as with a line plot.

print(first)
print(plot)

:VLine   [x,y]
:Bars   [Year,State]   (measles)

I think (guess) that it needs both the year and state, while the line plot only needed the year.

Nationwide Mean With Error Bars

error = diseases.groupby("Year").agg({"measles": [numpy.mean, numpy.std]}).xs(
    "measles", axis=1)
plot = (error.hvplot(y="mean", 
                     title="Mean National Measles Cases By Year") 
        * holoviews.ErrorBars(error, "Year").redim.range(mean=(0, None)) 
        * first * first_label * current * current_label).opts(
            fontsize=Plots.font_size,
            width=Plots.width,
        )
Embed(plot, "error_bars")()

So looking at this we can see that the national mean actually paints a slightly different picture from the raw counts for measles. My guess would be that the different populations for each state offset each other enough that the mean is flattened out.

Actually, in thinking about it, the mean is really a weekly mean for the year so… I'll have to think about this.

Table of Contents