A First Look At HVPlot


This is a look at HVPlot, a HoloViews-based plotting adapter that works directly with pandas and pandas-like libraries (e.g. dask). I'm starting with their Introduction, but I might branch out after that. We'll see.

Set Up


From Python

from datetime import datetime
from functools import partial
from pathlib import Path
from typing import Union
import textwrap

From PyPI

from sklearn.datasets import load_iris
from tabulate import tabulate
import numpy
import pandas

My Stuff

from neurotic.tangles.timer import Timer

The Bokeh Imports

from bokeh.embed import autoload_static
import bokeh.resources

Set Up the HVPlot

I haven't looked at how it does it, but importing hvplot.pandas has the side effect of adding an hvplot attribute to pandas' DataFrames (and Series), and that attribute is what does the actual plotting.

import holoviews
import hvplot.pandas
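
To give a feel for how an import can bolt a new attribute onto every DataFrame, here is a minimal sketch using pandas' public accessor-registration decorator. The accessor name demo and its column_count method are made up for illustration, and hvplot's own patching code may well work differently.

```python
import pandas
from pandas.api.extensions import register_dataframe_accessor


# "demo" is a hypothetical accessor name used only for this illustration
@register_dataframe_accessor("demo")
class DemoAccessor:
    """A toy namespace attached to every DataFrame as ``frame.demo``."""

    def __init__(self, frame: pandas.DataFrame) -> None:
        self._frame = frame

    def column_count(self) -> int:
        """Return the number of columns in the wrapped frame."""
        return len(self._frame.columns)


frame = pandas.DataFrame({"a": [1, 2], "b": [3, 4]})
print(frame.demo.column_count())
```

Once the module holding the decorated class has been imported, every DataFrame gets the new attribute, which is the same import-for-its-side-effect pattern as above.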


PathType = Union[str, Path]


FOLDER_PATH = "../files/posts/libraries/a-first-look-at-hvplot/"


table = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)


EmbedBokeh Class
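
A minimal sketch of what a class like this could look like, assuming it only needs to render the HoloViews plot to a bokeh figure, write the javascript next to the post, and print the script tag inside an org-mode HTML export block. The method names, the automatic ".js" suffix, and the Embed shortcut are my assumptions, inferred from how the class gets called later on.

```python
from functools import partial
from pathlib import Path
from typing import Union

PathType = Union[str, Path]


class EmbedBokeh:
    """Saves a HoloViews plot as bokeh javascript and emits the script tag.

    This is a hypothetical helper, not a library class.

    Args:
     plot: a HoloViews object (e.g. the output of ``frame.hvplot()``)
     file_name: name for the javascript file (".js" is added if missing)
     folder_path: folder to save the javascript file in
    """
    def __init__(self, plot, file_name: str, folder_path: PathType) -> None:
        self.plot = plot
        self.folder_path = Path(folder_path)
        self.file_name = (file_name if file_name.endswith(".js")
                          else file_name + ".js")

    def save(self) -> str:
        """Writes the javascript file and returns the HTML script tag."""
        # imported here so the class can be defined without the plotting stack
        import holoviews
        from bokeh.embed import autoload_static
        from bokeh.resources import CDN

        figure = holoviews.render(self.plot, backend="bokeh")
        javascript, tag = autoload_static(figure, CDN, self.file_name)
        self.folder_path.mkdir(parents=True, exist_ok=True)
        (self.folder_path / self.file_name).write_text(javascript)
        return tag

    def __call__(self) -> None:
        """Prints the script tag wrapped in an org-mode HTML export block."""
        print("#+begin_export html\n{}\n#+end_export".format(self.save()))


FOLDER_PATH = "../files/posts/libraries/a-first-look-at-hvplot/"

# shortcut so that only the plot and the file name need to be passed in
Embed = partial(EmbedBokeh, folder_path=FOLDER_PATH)
```

Putting file_name ahead of folder_path in the signature is what lets the partial be called as Embed(plot, "some_name") without the positional argument colliding with the pre-set folder.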

The Timer

TIMER = Timer()

The Data

Portland Crime

This is taken from the Portland Crime Statistics page.

portland_path = Path("~/data/datasets/portland/crime-to-january-2018.csv").expanduser()
assert portland_path.is_file()
with TIMER:
    crime = pandas.read_csv(portland_path)
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217224 entries, 0 to 217223
Data columns (total 17 columns):
Address              196626 non-null object
Case Number          217224 non-null object
Crime Against        217224 non-null object
Neighborhood         210788 non-null object
Number of Records    217224 non-null int64
Occur Month Year     217224 non-null object
Occur Date           217224 non-null object
Occur Time           217224 non-null int64
Offense Category     217224 non-null object
Offense Count        217224 non-null int64
Offense Type         217224 non-null object
OpenDataLat          193352 non-null float64
OpenDataLon          193352 non-null float64
OpenDataX            193352 non-null float64
OpenDataY            193352 non-null float64
Report Date          217224 non-null object
ReportMonthYear      217224 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 28.2+ MB

Here's a possible categorical column to use.

crime["type"] = crime["Crime Against"].astype("category")
crime = crime.drop(columns=["Crime Against"])
print(table(crime.type.value_counts().reset_index(), headers=["Type", "Count"]))
| Type     |   Count |
|----------+---------|
| Property |  175567 |
| Person   |   32109 |
| Society  |    9548 |

Making the Plot

HoloViews expects you to be working in a Jupyter notebook and isn't quite as easy to work with in org-mode, so I'll make the plot with hvplot and then convert it to a bokeh figure to embed it in this post.

The Plot

with TIMER:
    crime["date"] = pandas.to_datetime(crime["Occur Date"])
    crime["id"] = crime["Case Number"]
    crime = crime.drop(columns=["Occur Date", "Case Number"])
    crime_dates = crime.set_index("date")
weekly = crime_dates.resample("W").count()
plot = weekly.id.hvplot()
Embed(plot, "weekly_crime.js")()

That didn't work out as planned. It turns out that the data starts in 1972, but is mostly empty until around May of 2015. It also looks like January is missing values. I think I'll trim the data set.


crime_dates = crime_dates[(crime_dates.index >= datetime(2015, 5, 31))
                          & (crime_dates.index < datetime(2019, 1, 1))]
weekly = crime_dates.resample("W").count()
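
The observation that the data is mostly empty before 2015 can be checked by counting rows per year. The sketch below uses a handful of made-up dates rather than the Portland data, so the numbers mean nothing; only the pattern matters.

```python
import pandas

# made-up dates standing in for the "date" column built earlier
dates = pandas.to_datetime(["1972-03-01", "2015-06-01", "2015-07-15",
                            "2016-01-02", "2016-05-09", "2016-11-30"])
toy = pandas.DataFrame({"date": dates})

# count the rows in each calendar year, ordered by year
per_year = toy["date"].dt.year.value_counts().sort_index()
print(per_year)
```

A long run of tiny counts followed by a jump is a reasonable hint for where to put the cutoff.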

By Type

HoloViews uses this rather odd way of composing figures. Instead of the object-oriented way you might expect, it overloads the multiplication operator (* to overlay plots on the same axes) and the addition operator (+ to put plots next to each other), so to plot the types I'll have to multiply their plots.

types = {name: crime_dates[crime_dates.type==name]
         for name in crime_dates.type.unique()}
weekly_types = {name: data.resample("W").count()
                for name, data in types.items()}
keys = list(weekly_types.keys())
first = keys[0]
plot = weekly_types[first].hvplot(y="id", label=first)
for key in keys[1:]:
    plot *= weekly_types[key].hvplot(y="id", label=key)
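
The operator composition used above can be illustrated with a pair of toy classes. These are stand-ins I made up to show the idea of overloading * and +; they are not HoloViews classes and say nothing about how HoloViews implements it. The sketch also shows that the loop can be folded into a single reduce call.

```python
from functools import reduce
import operator


class ToyPlot:
    """A stand-in for a plot element; not a HoloViews class."""

    def __init__(self, *labels: str) -> None:
        self.labels = list(labels)

    def __mul__(self, other: "ToyPlot") -> "ToyPlot":
        # "*" overlays: both sets of curves share one set of axes
        return ToyPlot(*(self.labels + other.labels))

    def __add__(self, other: "ToyPlot") -> "ToyLayout":
        # "+" lays out: the plots sit next to each other
        return ToyLayout([self, other])


class ToyLayout:
    """A stand-in for side-by-side plots; not a HoloViews class."""

    def __init__(self, panels: list) -> None:
        self.panels = panels


curves = [ToyPlot(name) for name in ("Property", "Person", "Society")]

# folding "*" over the list replaces the explicit loop
overlay = reduce(operator.mul, curves)
layout = overlay + ToyPlot("table")

print(overlay.labels)
print(len(layout.panels))
```

The same reduce(operator.mul, ...) call should work on a list of real hvplot outputs, since all it relies on is the * operator.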

It looks like it could use more trimming, but it also looks like it's mostly property crimes, which is what you'd expect, I guess. Actually I tried another trim and it looks like it always starts at zero because of the way the resampling works, so trimming doesn't make that first anomaly go away. Maybe trimming the weekly would help.
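
One possible explanation for the dip at the start: pandas' default weekly bins end on Sunday, and the trim date (May 31, 2015) is a Sunday, so the first bin only covers that one day. The sketch below uses one made-up record per day to show the partial bins at both ends and how trimming the weekly counts, rather than the raw rows, removes them.

```python
import pandas

# one made-up record per day, starting on the trim date (a Sunday)
days = pandas.date_range("2015-05-31", "2015-06-20", freq="D")
toy = pandas.DataFrame({"id": range(len(days))}, index=days)

# weekly bins end on Sunday, so the first bin holds only May 31
weekly = toy.resample("W").count()
print(weekly)

# drop the first and last bins, which may cover only part of a week
trimmed = weekly.iloc[1:-1]
print(trimmed)
```

With the real data the first bin wouldn't be exactly zero, just much lower than the full weeks that follow it.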

Looking a Little More at the Crimes

By Neighborhood

top_ten = crime_dates.Neighborhood.value_counts()[:10].reset_index()
print(table(top_ten, headers="Neighborhood Count".split()))
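| Neighborhood        |   Count |
|---------------------+---------|
| Downtown            |   10237 |
| Hazelwood           |   10127 |
| Lents               |    5681 |
| Powellhurst-Gilbert |    5605 |
| Centennial          |    5016 |
| Old Town/Chinatown  |    4966 |
| Northwest           |    4648 |
| Montavilla          |    4026 |
| Pearl               |    3905 |
| Lloyd               |    3699 |
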
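# one indicator column per neighborhood, one row per case
neighborhoods = pandas.get_dummies(crime_dates["Neighborhood"])
# keep the ten busiest neighborhoods and add up the cases in each month
# (the first column of top_ten holds the names, whatever pandas calls it)
neighborhoods = neighborhoods[top_ten.iloc[:, 0]].resample("M").sum()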
plot = (neighborhoods.hvplot(title="Top Ten Monthly Neighborhood Crime Counts")
        + neighborhoods.hvplot.table(columns=["Downtown", "Hazelwood",
                                              "Lents", "Powellhurst-Gilbert"]))
Embed(plot, "neighborhoods")()
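
The get_dummies-then-sum step above is easier to see on a tiny example. This sketch uses four made-up cases, and it groups by monthly period instead of calling resample("M"), only because newer pandas releases spell the month-end alias differently and I wanted the sketch to not depend on that.

```python
import pandas

# made-up records: one row per reported case
toy = pandas.DataFrame(
    {"Neighborhood": ["Downtown", "Lents", "Downtown", "Lents"]},
    index=pandas.to_datetime(["2016-01-05", "2016-01-20",
                              "2016-01-25", "2016-02-03"]))

# one indicator column per neighborhood, one row per case
dummies = pandas.get_dummies(toy["Neighborhood"])

# summing the indicators within each month gives cases per neighborhood
monthly = dummies.groupby(dummies.index.to_period("M")).sum()
print(monthly)
```

Each column of the result is a monthly count for one neighborhood, which is the wide shape that hvplot draws as one line per column.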

So the first thing to notice is that Downtown and Hazelwood dominate the case counts. There doesn't seem to be any strong upward or downward trend.

I live in Powellhurst-Gilbert, about a block north of Lents, and it looks like if you treated the two as one big neighborhood (they are adjacent), they would form the highest-crime neighborhood, but, sticking to the arbitrariness of the boundaries, we are relegated to numbers three and four.


plot = neighborhoods.hvplot.kde(
    title="Distributions of Top Ten Crime Neighborhoods")
Embed(plot, "neighborhoods_kde")()

I don't know what that mysterious bulge around zero is; all the neighborhoods are in the other peaks.


Since the previous data was time-series data, I thought I'd load a data set that wasn't, to illustrate the use of the by parameter.

irises = load_iris()
iris_data = pandas.DataFrame(irises.data, columns=irises.feature_names)
print(iris_data.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

I don't know where this convention came from, but you can use the by keyword to specify a categorical column to differentiate the data points. In this case I'll use it to differentiate the species.

target = pandas.Series(irises.target)
target_map = dict(zip(range(3), irises.target_names))
iris_data["target"] = target.map(target_map)
plot = iris_data.hvplot.scatter(x="sepal length (cm)", y="petal length (cm)",
                                by="target", alpha=0.5,
                                title="Iris Sepal Length vs Petal Length")
EmbedBokeh(plot, folder_path=FOLDER_PATH, file_name="irises.js")()
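
As far as I can tell, by splits the rows into one group per category and draws each group as its own series, so the closest pandas analogue is groupby. A minimal sketch with a few hand-typed rows (not the full iris set) showing the groups that a by="target" plot would be built from:

```python
import pandas

# a few hand-typed rows with a categorical "target" column
toy = pandas.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 7.0, 6.4, 6.3],
    "petal length (cm)": [1.4, 1.4, 4.7, 4.5, 6.0],
    "target": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

# each group corresponds to one color and one legend entry in the plot
for name, group in toy.groupby("target"):
    print(name, len(group))

sizes = toy.groupby("target").size()
```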

Scatter Matrix

plot = hvplot.scatter_matrix(iris_data, c="target")
Embed(plot, "iris_scatter_matrix")()

Parallel Coordinates

plot = hvplot.parallel_coordinates(iris_data, "target")
Embed(plot, "iris_parallel_coordinates")()