HoloViews Tabular Data Take Two

In [1]:
from pathlib import Path
import os
In [50]:
from bokeh.io import output_notebook
from bokeh.embed import autoload_static
from bokeh.resources import CDN
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas
In [36]:
output_notebook()
Loading BokehJS ...
In [37]:
load_dotenv(".env")
Out[37]:
True
In [38]:
holoviews.extension("bokeh")
In [39]:
path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(path)
In [40]:
diseases.head()
Out[40]:
   Year  Week    State  measles  pertussis
0  1928     1  Alabama     3.67        NaN
1  1928     2  Alabama     6.25        NaN
2  1928     3  Alabama     7.95        NaN
3  1928     4  Alabama    12.58        NaN
4  1928     5  Alabama     8.03        NaN
In [41]:
key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)
In [42]:
dataset = dataset.aggregate(function=numpy.mean)
In [43]:
dataset
Out[43]:
:Dataset   [Year,State]   (measles,pertussis)
In [48]:
layout = (dataset.to(holoviews.Curve, "Year", "measles")
          + dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
plot = layout.options(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
plot
Out[48]:

If you viewed this in a Jupyter notebook with a running server, the plot above would have working dropdown menus. As a static page, though, it doesn't.

In [49]:
holoviews.save(plot, "diseases.html")
In [51]:
renderer = holoviews.renderer("bokeh")
figure = renderer.get_plot(plot).state
javascript, tag = autoload_static(figure, CDN, "diseases.js")
In [52]:
print(tag)
<script src="diseases.js" id="5e346ec9-db34-4143-bfa3-a2fe8d2e0da0"></script>
In [53]:
%%HTML
<script src="diseases.js" id="5e346ec9-db34-4143-bfa3-a2fe8d2e0da0"></script>
In [56]:
with open("../../files/posts/libraries/holoviews-tabular-data-take-two/diseases.js", "w") as writer:
    writer.write(javascript)

Okay, so weirdly, autoload_static seems to be what's breaking it: even the javascript exported by this notebook doesn't have the dropdown menu.
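
If I come back to this, one alternative worth trying is bokeh.embed.components, which returns a script/div pair to paste into the page instead of a standalone javascript file (a sketch; I haven't verified that it keeps the widgets working):

from bokeh.embed import components

# script is a <script> block, div is the placeholder <div> for the page
script, div = components(figure)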


HoloViews Tabular Datasets

Introduction

This is a walk through the HoloViews page on Tabular Datasets. The data-set was created by the Wall Street Journal using Project Tycho, but I'm getting it from the HoloViews github repository. The Wall Street Journal page is here. Unfortunately it has mixed content types (https and http) as well as some other problems which prevent Firefox and Chrome-based browsers from rendering the visualization, so I don't know what it actually looks like. Given that it's a commercial site, I'm assuming it's an old page that they don't care about anymore.

Warning: I originally did this with modin and it wouldn't plot correctly. Save it for pre-processing and just use the real pandas when plotting.

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from bokeh.io import output_notebook
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Holoviews Bokeh

I don't know why, but you have to specify that you're using bokeh, even though it looks like it's working when you don't.

holoviews.extension("bokeh")
output_notebook()

The Embedder

files_path = Path("../../files/posts/libraries/holoviews-tabular-datasets/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Dotenv

I have the path to the data-set in a .env file so I'll load it into the environment dictionary.

load_dotenv(".env")

Tabulate

This will print a table that org knows how to render.

orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl", 
                   showindex=False)

Load the Data

The data comes from Project Tycho, which provides health-related data sets for research. The .env file assumes that I cloned the HoloViews repository so that I can load the data from it.

path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(path)
print(orgtable(diseases.head()))
| Year | Week | State   | measles | pertussis |
|------+------+---------+---------+-----------|
| 1928 |    1 | Alabama |    3.67 |       nan |
| 1928 |    2 | Alabama |    6.25 |       nan |
| 1928 |    3 | Alabama |    7.95 |       nan |
| 1928 |    4 | Alabama |   12.58 |       nan |
| 1928 |    5 | Alabama |    8.03 |       nan |
print(len(diseases))
print(len(diseases.State.unique()))
print(diseases.Year.min())
print(diseases.Year.max())
print(diseases.Week.unique())
222768
51
1928
2011
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52]

The State column includes Washington D.C., which is why there are 51 states. It has 52 weeks of data for each year from 1928 through 2011. Since the measles and pertussis values are greater than 1 I assume this is some kind of rate (like incidence per million), but the page doesn't say, and the article they link to doesn't render in my browser (and I don't have an account to download the dataset).

Create a Dataset

HoloViews has a class called a Dataset that lets you declare the dependent (value dimensions (vdims)) and independent variables (key dimensions (kdims)).

key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)

The value_dimensions list has tuples - these take the form (<column-name>, <output-name>) so when you make a plot it will use the <output-name> for any labels that are created.

Aggregate The Data

The one column that I didn't add is the Week column. The Dataset has a rather confusing aggregate method (confusing because you only pass in the function to aggregate with) that apparently knows how to use the key_dimensions variables we passed in to figure out what to aggregate.
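
To convince myself of what it's doing, here's a sketch of the plain-pandas equivalent as I understand it: group over the key dimensions and take the mean of the value columns, which averages away the Week column.

# a pandas sketch of what I understand aggregate to be doing here
equivalent = diseases.groupby(["Year", "State"], as_index=False)[
    ["measles", "pertussis"]].mean()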

dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)
layout = (dataset.to(holoviews.Curve, "Year", "measles")
          + dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
layout.opts(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
Embed(layout, "measles_pertusis")()

Two things to note. One is that HoloViews picked up the nicer names without us having to specify them. Another is that only Alabama is displayed. In the demonstration HoloViews created a drop-down menu to select a state but it didn't do it here. Maybe you need to run it in a jupyter notebook…

Actually, I think it might be a conflict with nikola; this is a page saved from a jupyter notebook without any nikola pre-processing.

Save the HTML

I'll see if you can do it directly here without using jupyter.

save_file = "diseases_2.html"
output = files_path.joinpath(save_file)
holoviews.save(layout, output)
print("[[file:{}][This is the plot.]]".format(save_file))

This is the plot.

GISS Global/Hemispheric Temperatures

Set Up

Imports

Python

from functools import partial
from pathlib import Path
import os

PyPi

from bokeh.layouts import column
from bokeh.palettes import Set1
from bokeh.models import (
    BoxZoomTool,
    HoverTool,
    Legend,
    PanTool,
    ResetTool,
    SaveTool,
    Span,
    WheelZoomTool,
)
from bokeh.models.widgets import Panel, Tabs
from bokeh.plotting import (figure, 
                            ColumnDataSource,
)
from dotenv import load_dotenv
import holoviews
import pandas

This Project

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Setup the Embed

files_path = Path("../../files/posts/giss/giss-globalhemispheric-temperatures/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Set Up Bokeh

I originally created EmbedBokeh to use HoloViews to do the rendering, so you have to set bokeh as the backend or it will try to use matplotlib instead.

holoviews.extension("bokeh")

Load the Data

load_dotenv(".env")
path = Path(os.environ.get("ZONES")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(path)
giss.loc[:, "Year"] = giss.Year.astype("int32")
print(giss.describe())
              Year        Glob        NHem        SHem     24N-90N  \
count   139.000000  139.000000  139.000000  139.000000  139.000000   
mean   1949.000000    0.032302    0.056043    0.008561    0.077698   
std      40.269923    0.336896    0.393435    0.301848    0.464606   
min    1880.000000   -0.490000   -0.540000   -0.490000   -0.580000   
25%    1914.500000   -0.200000   -0.220000   -0.235000   -0.280000   
50%    1949.000000   -0.070000   -0.010000   -0.080000    0.020000   
75%    1983.500000    0.215000    0.210000    0.265000    0.255000   
max    2018.000000    0.980000    1.260000    0.710000    1.500000   

          24S-24N     90S-24S     64N-90N     44N-64N     24N-44N     EQU-24N  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000   
mean     0.036115   -0.018561    0.111079    0.117770    0.027698    0.027626   
std      0.331384    0.295695    0.917715    0.516729    0.356416    0.326111   
min     -0.650000   -0.470000   -1.640000   -0.710000   -0.590000   -0.720000   
25%     -0.215000   -0.250000   -0.545000   -0.270000   -0.200000   -0.230000   
50%     -0.030000   -0.100000    0.020000    0.000000   -0.070000    0.000000   
75%      0.255000    0.230000    0.660000    0.360000    0.135000    0.240000   
max      0.970000    0.700000    3.050000    1.440000    1.060000    0.930000   

          24S-EQU     44S-24S     64S-44S     90S-64S  
count  139.000000  139.000000  139.000000  139.000000  
mean     0.045683    0.020432   -0.069353   -0.078129  
std      0.343385    0.312688    0.269380    0.732359  
min     -0.580000   -0.430000   -0.540000   -2.570000  
25%     -0.210000   -0.220000   -0.265000   -0.490000  
50%     -0.030000   -0.080000   -0.090000    0.050000  
75%      0.290000    0.260000    0.180000    0.410000  
max      1.020000    0.780000    0.450000    1.270000  
print(giss.iloc[0])
print()
print(giss.iloc[-1])
Year       1880.00
Glob         -0.18
NHem         -0.31
SHem         -0.06
24N-90N      -0.38
24S-24N      -0.17
90S-24S      -0.01
64N-90N      -0.97
44N-64N      -0.47
24N-44N      -0.25
EQU-24N      -0.21
24S-EQU      -0.13
44S-24S      -0.04
64S-44S       0.05
90S-64S       0.67
Name: 0, dtype: float64

Year       2018.00
Glob          0.82
NHem          0.99
SHem          0.66
24N-90N       1.19
24S-24N       0.64
90S-24S       0.70
64N-90N       1.87
44N-64N       1.09
24N-44N       1.03
EQU-24N       0.69
24S-EQU       0.59
44S-24S       0.78
64S-44S       0.37
90S-64S       1.07
Name: 138, dtype: float64
print(giss.columns)
giss = giss.rename(columns=dict(
    Glob="Global", 
    NHem="Northern Hemisphere", 
    SHem="Southern Hemisphere"))
print(giss.columns)
Index(['Year', 'Glob', 'NHem', 'SHem', '24N-90N', '24S-24N', '90S-24S',
       '64N-90N', '44N-64N', '24N-44N', 'EQU-24N', '24S-EQU', '44S-24S',
       '64S-44S', '90S-64S'],
      dtype='object')
Index(['Year', 'Global', 'Northern Hemisphere', 'Southern Hemisphere',
       '24N-90N', '24S-24N', '90S-24S', '64N-90N', '44N-64N', '24N-44N',
       'EQU-24N', '24S-EQU', '44S-24S', '64S-44S', '90S-64S'],
      dtype='object')

Plot

Global/Hemispheric

class Plot:
    width = 1000
    height = 800
    line_width = 4
    alpha = 0.8
    light_alpha = 0.2
    title_font_size = "14pt"
hover = HoverTool(
    tooltips = [
        ("Year", "@year"),
        ("Difference From Normal", "@anomaly")
    ]
)

tools = [
    hover,
    PanTool(),
    WheelZoomTool(),
    BoxZoomTool(),
    ResetTool(),
    SaveTool(),
]

plot = figure(plot_width=Plot.width, plot_height=Plot.height, 
              x_range=(giss.Year.min(), giss.Year.max()),
              x_axis_label="Year",
              y_axis_label="Difference (Celsius)",
              tools=tools)

plot.title.text = "Yearly Temperature Difference from Mean 1931-1980 Temperature by Hemisphere"
plot.title.text_font_size = Plot.title_font_size

horizontal = Span(location=0, dimension="width", line_color="darkgray",
                  line_width=Plot.line_width, 
                  line_cap="round",
                  line_dash="dashed")
plot.renderers.extend([horizontal])
locations = ["Global", "Northern Hemisphere", "Southern Hemisphere"]
for location, color in zip(locations, Set1[3]):
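    # smooth each series with a five-year rolling mean for the solid line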
    columns = ColumnDataSource(
        data=dict(
            year=giss.Year,
            anomaly=giss[location],
            smoothed=giss.rolling(5, on="Year", min_periods=1)[location].mean(),
        )
    )
    line = plot.circle("year", "anomaly", source=columns, 
                       color=color, 
                       line_width=Plot.line_width, 
                       alpha=Plot.light_alpha,
                       legend=location)
    line = plot.line("year", "smoothed", source=columns,
                     color=color,
                     line_width=Plot.line_width, alpha=Plot.alpha,
                     legend="{} Rolling 5 Year Mean".format(location))
plot.legend.click_policy = "hide"
plot.legend.location = "top_left"

Embed the Plot

I need to fix the EmbedBokeh class: it expects a HoloViews object that it can render, so for a plot that's already a bokeh figure I have to set the _figure attribute directly.

embed = Embed(plot, "global_temperature_anomalies")
embed._figure = plot
embed()

GISS Surface Temperature Analysis (GISTEMP v3)

Introduction

This is a look at the Goddard Institute for Space Studies' surface temperature data. In particular it is the Global-mean monthly, seasonal, and annual means data which has data from 1880 to the present (CSV Download Link).

Set Up

Imports

Python

from pathlib import Path
import os

PyPi

from dotenv import load_dotenv
import pandas

Load Dotenv

load_dotenv(".env")

Load the Data

Take One

path = Path(os.environ.get("GLOBAL")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(path)
print(giss.head())
                                                                                          Land-Ocean: Global Means
Year Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  J-D  D-N  DJF  MAM  JJA                       SON
1880 -.29 -.18 -.11 -.19 -.11 -.23 -.20 -.09 -.16 -.23 -.20 -.22 -.18 ***  ***  -.14 -.17                     -.19
1881 -.15 -.17 .04  .04  .02  -.20 -.06 -.02 -.14 -.21 -.22 -.11 -.10 -.11 -.18 .03  -.10                     -.19
1882 .14  .15  .04  -.19 -.16 -.26 -.21 -.05 -.10 -.25 -.16 -.24 -.11 -.10 .06  -.10 -.17                     -.17
1883 -.31 -.39 -.13 -.17 -.20 -.12 -.08 -.15 -.20 -.14 -.22 -.16 -.19 -.20 -.31 -.16 -.12                     -.19

One thing to notice is that the title line got read in as the column header, while the real column names got read in as the first row.

print(giss.iloc[0])
Land-Ocean: Global Means    SON
Name: (Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J-D, D-N, DJF, MAM, JJA), dtype: object

So we're going to have to skip the first row.

Take Two

path = Path(os.environ.get("GLOBAL")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(path, skiprows=1)
print(giss.head())
   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov  \
0  1880 -0.29  -.18  -.11  -.19  -.11  -.23  -.20  -.09  -.16  -.23  -.20   
1  1881 -0.15  -.17   .04   .04   .02  -.20  -.06  -.02  -.14  -.21  -.22   
2  1882  0.14   .15   .04  -.19  -.16  -.26  -.21  -.05  -.10  -.25  -.16   
3  1883 -0.31  -.39  -.13  -.17  -.20  -.12  -.08  -.15  -.20  -.14  -.22   
4  1884 -0.15  -.08  -.37  -.42  -.36  -.40  -.34  -.26  -.27  -.24  -.30   

    Dec   J-D   D-N   DJF   MAM   JJA   SON  
0  -.22  -.18   ***   ***  -.14  -.17  -.19  
1  -.11  -.10  -.11  -.18   .03  -.10  -.19  
2  -.24  -.11  -.10   .06  -.10  -.17  -.17  
3  -.16  -.19  -.20  -.31  -.16  -.12  -.19  
4  -.29  -.29  -.28  -.13  -.38  -.34  -.27  
print(giss.columns)
Index(['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec', 'J-D', 'D-N', 'DJF', 'MAM', 'JJA', 'SON'],
      dtype='object')
print(giss.describe())
            Year         Jan
count   140.0000  140.000000
mean   1949.5000    0.027500
std      40.5586    0.396867
min    1880.0000   -0.790000
25%    1914.7500   -0.265000
50%    1949.5000   -0.020000
75%    1984.2500    0.290000
max    2019.0000    1.150000

So most of the columns weren't read as numeric, probably because of the use of *** for missing data.

Take Three

with path.open() as reader:
    giss = pandas.read_csv(path, skiprows=1, na_values="***")
print(giss.describe())
            Year         Jan         Feb         Mar         Apr         May  \
count   140.0000  140.000000  139.000000  139.000000  139.000000  139.000000   
mean   1949.5000    0.027500    0.038201    0.052806    0.026187    0.016043   
std      40.5586    0.396867    0.393732    0.387470    0.363309    0.348825   
min    1880.0000   -0.790000   -0.610000   -0.600000   -0.600000   -0.560000   
25%    1914.7500   -0.265000   -0.235000   -0.230000   -0.260000   -0.240000   
50%    1949.5000   -0.020000   -0.040000   -0.020000   -0.050000   -0.050000   
75%    1984.2500    0.290000    0.325000    0.275000    0.250000    0.260000   
max    2019.0000    1.150000    1.330000    1.300000    1.070000    0.900000   

              Jun         Jul         Aug         Sep         Oct         Nov  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000   
mean     0.003022    0.026043    0.030863    0.041367    0.060072    0.048561   
std      0.339148    0.317524    0.330365    0.323767    0.335174    0.341057   
min     -0.530000   -0.540000   -0.540000   -0.530000   -0.570000   -0.540000   
25%     -0.245000   -0.210000   -0.210000   -0.180000   -0.190000   -0.185000   
50%     -0.070000   -0.050000   -0.050000   -0.060000    0.000000   -0.020000   
75%      0.190000    0.195000    0.190000    0.205000    0.190000    0.180000   
max      0.780000    0.820000    1.000000    0.880000    1.060000    1.020000   

              Dec         J-D         D-N         DJF         MAM         JJA  \
count  139.000000  139.000000  138.000000  138.000000  139.000000  139.000000   
mean     0.021727    0.032302    0.033116    0.026449    0.031583    0.020360   
std      0.364511    0.336896    0.338215    0.369663    0.361006    0.324987   
min     -0.790000   -0.490000   -0.510000   -0.660000   -0.560000   -0.520000   
25%     -0.220000   -0.200000   -0.215000   -0.240000   -0.255000   -0.220000   
50%     -0.050000   -0.070000   -0.060000   -0.070000   -0.060000   -0.070000   
75%      0.275000    0.215000    0.230000    0.280000    0.265000    0.195000   
max      1.100000    0.980000    1.010000    1.190000    1.090000    0.860000   

              SON  
count  139.000000  
mean     0.050504  
std      0.327437  
min     -0.490000  
25%     -0.190000  
50%     -0.020000  
75%      0.190000  
max      0.970000  

Actually I just looked at the "official" file given by Coursera and I downloaded the wrong one.

The Real Data

The data I was supposed to pull was the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies' Zonal Annual Means, which gives the annual mean temperature difference (anomaly) for each zone in a given year (rather than monthly global averages).

zone_path = Path(os.environ.get("ZONES")).expanduser()
assert zone_path.is_file()
with zone_path.open() as reader:
    giss = pandas.read_csv(reader)
print(giss.describe())
              Year        Glob        NHem        SHem     24N-90N  \
count   139.000000  139.000000  139.000000  139.000000  139.000000   
mean   1949.000000    0.032302    0.056043    0.008561    0.077698   
std      40.269923    0.336896    0.393435    0.301848    0.464606   
min    1880.000000   -0.490000   -0.540000   -0.490000   -0.580000   
25%    1914.500000   -0.200000   -0.220000   -0.235000   -0.280000   
50%    1949.000000   -0.070000   -0.010000   -0.080000    0.020000   
75%    1983.500000    0.215000    0.210000    0.265000    0.255000   
max    2018.000000    0.980000    1.260000    0.710000    1.500000   

          24S-24N     90S-24S     64N-90N     44N-64N     24N-44N     EQU-24N  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000   
mean     0.036115   -0.018561    0.111079    0.117770    0.027698    0.027626   
std      0.331384    0.295695    0.917715    0.516729    0.356416    0.326111   
min     -0.650000   -0.470000   -1.640000   -0.710000   -0.590000   -0.720000   
25%     -0.215000   -0.250000   -0.545000   -0.270000   -0.200000   -0.230000   
50%     -0.030000   -0.100000    0.020000    0.000000   -0.070000    0.000000   
75%      0.255000    0.230000    0.660000    0.360000    0.135000    0.240000   
max      0.970000    0.700000    3.050000    1.440000    1.060000    0.930000   

          24S-EQU     44S-24S     64S-44S     90S-64S  
count  139.000000  139.000000  139.000000  139.000000  
mean     0.045683    0.020432   -0.069353   -0.078129  
std      0.343385    0.312688    0.269380    0.732359  
min     -0.580000   -0.430000   -0.540000   -2.570000  
25%     -0.210000   -0.220000   -0.265000   -0.490000  
50%     -0.030000   -0.080000   -0.090000    0.050000  
75%      0.290000    0.260000    0.180000    0.410000  
max      1.020000    0.780000    0.450000    1.270000  
print(giss.iloc[0])
Year       1880.00
Glob         -0.18
NHem         -0.31
SHem         -0.06
24N-90N      -0.38
24S-24N      -0.17
90S-24S      -0.01
64N-90N      -0.97
44N-64N      -0.47
24N-44N      -0.25
EQU-24N      -0.21
24S-EQU      -0.13
44S-24S      -0.04
64S-44S       0.05
90S-64S       0.67
Name: 0, dtype: float64

Criteria

Appropriate Chart Selection and Variables

Did you select the appropriate chart and use the correct chart elements to visualize the nominal, ordinal, discrete, and continuous variables, as described in lecture 2.1.3? Continuous data variables should be assigned to continuous chart elements (e.g., lines between data points), whereas discrete variables should be assigned to discrete chart elements (e.g., separate bars). Furthermore, the assignment of variables to elements should follow the priorities in lecture 2.1.2.

Design of the Chart

Does the chart effectively display the data, based on the design rules in lecture 2.3.1?

Content

How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data? For example, was a streamgraph used instead of an ordinary bar chart?

Grading

Appropriate Chart Selection and Variables

  • Poor (1–2 points): Chart is indecipherable or significantly misleading because of poor chart type or assignment of variables to elements
  • Fair (3 points): Major problem(s) with chart selection or assignment of elements to variables
  • Good (4 points): Minor problem(s) with chart selection or assignment of elements to variables
  • Great (5 points): Chart selection is appropriate for data and its elements properly assigned to appropriate data variables

Design of the Chart

  • Poor (1–2 points): No apparent attention paid to design
  • Fair (3 points): Evidence that several of the design rules should have been followed but were not
  • Good (4 points): Evidence that one of the design rules should have been followed but was not
  • Great (5 points): Attention paid to all design rules

Content

  • Poor (1–2 points): Misleading
  • Fair (3 points): Boring
  • Good (4 points): Not boring
  • Great (5 points): Interesting


Interactive Bokeh Legends

Introduction

This is a reproduction of the bokeh interactive legends example.

Set Up

Imports

Python

I have a class that I use to embed the bokeh and I'm going to make it easier to reuse by binding some values to it with partial.

from functools import partial

PyPi

  • Bokeh

    The Spectral color scheme from bokeh.palettes is a categorical color scheme from Color Brewer, a color scheme helper for cartographers.

    from bokeh.palettes import Spectral4
    from bokeh.plotting import figure
    from bokeh.sampledata.stocks import AAPL, IBM, MSFT, GOOG
    
  • Pandas And HoloViews

Pandas is for the data; holoviews is because I created the EmbedBokeh class expecting to always use it.

    import holoviews
    import pandas
    

My Stuff

This is a convenience class for embedding the javascript that bokeh creates into this post.

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

The Embedder

The files_path is where the javascript needs to be stored for nikola to copy it to the right place.

files_path = Path("../../files/posts/libraries/interactive-bokeh-legends/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

The Bokeh Backend

For some reason the backend is defaulting to matplotlib so this fixes it.

holoviews.extension("bokeh")

Hide Plots

This creates a plot where the entry for each line in the legend becomes a button that toggles that line's visibility.

plot = figure(plot_width=800, plot_height=400, 
              x_axis_type="datetime", 
              tools="hover,pan,wheel_zoom,box_zoom,reset")
plot.title.text = "Some Stock Prices (Click On the Legend To Hide Plots)"
for data, name, color in zip([AAPL, IBM, MSFT, GOOG], "Apple IBM Microsoft Google".split(), Spectral4):
    frame = pandas.DataFrame(data)
    frame["date"] = pandas.to_datetime(frame["date"])
    plot.line(frame["date"], frame["close"], line_width=2, color=color, alpha=0.8, legend=name)
plot.legend.location = "top_left"
plot.legend.click_policy = "hide"
plot.title.text_font_size = "14pt"

Output The Plot

Fade Plots

Clicking on a line's entry in the legend will change the alpha-value for the line (to make it less visible). This has the effect of keeping the hover tool working for the line, whereas hiding a line disables its hover tool.

fade_plot = figure(plot_width=800, plot_height=400, 
                   x_axis_type="datetime", 
                   tools="hover,pan,wheel_zoom,box_zoom,reset")
fade_plot.title.text = "Some Stock Prices (Click On the Legend To Mute Plots)"
fade_plot.title.text_font_size = "14pt"
for data, name, color in zip([AAPL, IBM, MSFT, GOOG], "Apple IBM Microsoft Google".split(), Spectral4):
    frame = pandas.DataFrame(data)
    frame["date"] = pandas.to_datetime(frame["date"])
    fade_plot.line(frame["date"], frame["close"], line_width=2, color=color, muted_alpha=0.2, muted_color=color, alpha=0.8, legend=name)
fade_plot.legend.location = "top_left"
fade_plot.legend.click_policy = "mute"

Output The Plot

Customizing HoloViews

Introduction

This is another exploration - this time looking at what they call Customization. In my introduction post, when I made a scatter plot with a hover tool, I first had to make the Scatter element and then add the hover tool as part of the options. HoloViews does this to try and emphasize a separation of content and presentation. When making the Scatter element I was supposed to be thinking only about the data I wanted to add; when working with the options, I turned my focus to the aesthetics.

Set Up

Imports

Python

from datetime import datetime
from functools import partial
from pathlib import Path
import os

PyPi

Related Projects

from neurotic.tangles.timer import Timer

This Project

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

The Timer

TIMER = Timer()

The Embedder

files_path = Path("../../files/posts/libraries/customizing-holoviews/")
Embed = partial(
    EmbedBokeh,
    folder_path=files_path)

Bokeh Backend

When I ran the code further down in the notebook to render the javascript I was getting this error:

ValueError: autoload_static expects a single Model or Document

It was because I forgot the next step and it was defaulting to Matplotlib for some reason.

holoviews.extension("bokeh")

The Data

load_dotenv(".env")
path = Path(os.environ.get("PORTLAND_CRIME")).expanduser()
assert path.exists()
with TIMER:
    data = pandas.read_csv(path)
Started: 2019-03-02 14:25:02.818262
Ended: 2019-03-02 14:25:03.296873
Elapsed: 0:00:00.478611
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217224 entries, 0 to 217223
Data columns (total 17 columns):
Address              196626 non-null object
Case Number          217224 non-null object
Crime Against        217224 non-null object
Neighborhood         210788 non-null object
Number of Records    217224 non-null int64
Occur Month Year     217224 non-null object
Occur Date           217224 non-null object
Occur Time           217224 non-null int64
Offense Category     217224 non-null object
Offense Count        217224 non-null int64
Offense Type         217224 non-null object
OpenDataLat          193352 non-null float64
OpenDataLon          193352 non-null float64
OpenDataX            193352 non-null float64
OpenDataY            193352 non-null float64
Report Date          217224 non-null object
ReportMonthYear      217224 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 28.2+ MB
None
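# Occur Time is an integer like 903, so zero-pad to four digits before parsing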
date = (data["Occur Date"]
        + " "
        + data["Occur Time"].astype(str).str.zfill(4))
data["date"] = pandas.to_datetime(date, format="%m/%d/%Y %H%M")

print(data.date[:5])
0   2017-08-26 00:00:00
1   2017-08-29 16:00:00
2   2017-08-12 19:00:00
3   2017-08-27 01:00:00
4   2017-07-24 09:03:00
Name: date, dtype: datetime64[ns]
data = data[(data.date >= datetime(2015, 5, 31))
            & (data.date < datetime(2019, 1, 1))]
selection = data[data.date > datetime(2018, 12, 24)].sort_values("date")

Plot time vs Latitude.

First we get our content.

curve = holoviews.Curve(selection, ("date", "Date-Time"), ("OpenDataLat", "Latitude"))
timestamps = holoviews.Spikes(selection, ("date", "Date-Time"), [])
layout = curve + timestamps

Now we make our presentation.

Take Two

Although the defaults give us a plot that's hard to read, by adjusting the width of the plot we can make it more interpretable.

layout = layout.opts(
    opts.Curve(height=200, width=900, xaxis=None, color="red", line_width=1.5, tools=["hover"]),
    opts.Spikes(height=150, width=900, xaxis=None, color="grey")
).cols(1)

HoloViews Introduction

Introduction

I've already taken an initial look at HVPlot, so now I'm going to look at HoloViews, which acts as an intermediate layer between the main plotting libraries (like bokeh and matplotlib) and the upper layer given by HVPlot. I haven't used it before, so I'm not really sure when you would use what. HVPlot gives you access to the pandas plots in bokeh without a lot of work, which is nice, although I noticed that the plots tended to be missing things sometimes (like the Hover tool), so if you want to add those back in you probably have to understand HoloViews. HoloViews itself sometimes doesn't give you what you want either (like the ability to render in org-mode posts), so you still need bokeh too, sometimes. And of course I'm only using the static-page versions of everything, not the features that work with a bokeh or jupyter server. But I guess that's for later.

I'm going to be working from the Introduction of their Getting Started guide.

Set Up

Imports

Python

from functools import partial
from pathlib import Path

From PyPi

from holoviews import opts
from sklearn.datasets import fetch_california_housing
import holoviews
import numpy
import pandas

This Project

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

The HoloViews Backend

If you use HVPlot you don't need to set the backend (because it defaults to 'bokeh', I think) but this is going to be about HoloViews so I'm going to do it their way, rather than relying on all the pandas methods.

holoviews.extension("bokeh")

A Partial Bokeh Embedder

Since the output folder is always the same I'm going to bind it to the EmbedBokeh definition.

plot_path = Path("../../files/posts/libraries/holoviews-introduction/")
Embed = partial(EmbedBokeh, folder_path=plot_path)

The Data Set

Load It

Sklearn downloads it as a 'bunch' so we need to get it in that form first and then turn it into a data frame (I'm sure there's a way to skip this step but this is the way I already know how to do it).
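
(For the record, I believe newer scikit-learn releases can skip the manual step entirely; this sketch assumes the installed version is at least 0.23, where the as_frame parameter was added.)

# assumes scikit-learn >= 0.23, which added as_frame to this fetcher
frame_bunch = fetch_california_housing(data_home=folder, as_frame=True)
housing_frame = frame_bunch.frame  # already a DataFrame, target included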

folder = Path("~/data/datasets/california-housing").expanduser()
assert folder.is_dir()
print(folder)
bunch = fetch_california_housing(folder)
print(bunch.DESCR)
/home/hades/data/datasets/california-housing
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297

Make The DataFrame

data = pandas.DataFrame(bunch.data, columns=bunch.feature_names)
data["median_value"] = bunch.target
print(data.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  median_value  
0    -122.23         4.526  
1    -122.22         3.585  
2    -122.24         3.521  
3    -122.25         3.413  
4    -122.25         3.422  

A Plot

Our target is the median value of the house. Does that correlate with median income?

scatter = holoviews.Scatter(data,
                            ("MedInc", "Median Income"),
                            ("median_value", "Median Value"),
                            label="California Housing")

After setting up the basic plot we can do things to affect the appearance like setting the color or adding tools.

scatter = scatter.opts(opts.Scatter(color="red", tools=["hover"]))

Adding To the Layout

What if we want to add a distribution to the plot? HoloViews uses the + operator to indicate that you want to append a plot to another one.

layout = scatter + holoviews.Histogram(
    numpy.histogram(data.HouseAge, bins=24), kdims=["HouseAge"])
layout = layout.opts(opts.Histogram(tools=["hover"]))

A First Look At HVPlot

Introduction

This is a look at HVPlot, a HoloViews based plotting adapter that works directly with pandas or other pandas-like libraries (e.g. dask). I'm starting with their Introduction but might branch out after that. We'll see.

Set Up

Imports

From Python

from datetime import datetime
from functools import partial
from pathlib import Path
from typing import Union
import textwrap

From PyPi

from sklearn.datasets import load_iris
from tabulate import tabulate
import numpy
import pandas

My Stuff

from neurotic.tangles.timer import Timer

The Bokeh Imports

from bokeh.embed import autoload_static
import bokeh.plotting
import bokeh.resources

Set Up the HVPlot

I'm not sure exactly what it's doing, but this next import adds an hvplot method to pandas' DataFrames to do the actual plotting.

import holoviews
import hvplot.pandas
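
For the curious, pandas has a public hook for exactly this kind of patching. Here's a minimal sketch of the mechanism (the rows accessor is hypothetical, not what hvplot actually registers):

from pandas.api.extensions import register_dataframe_accessor

@register_dataframe_accessor("rows")  # hypothetical accessor name
class RowsAccessor:
    """Adds a .rows() method to every DataFrame."""
    def __init__(self, frame: pandas.DataFrame) -> None:
        self._frame = frame

    def __call__(self) -> int:
        return len(self._frame)

After the decorator runs, any DataFrame has the method (e.g. pandas.DataFrame(dict(a=[1, 2])).rows() returns 2), which is presumably how importing hvplot.pandas attaches hvplot.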

Typing

PathType = Union[str, Path]

Constants

FOLDER_PATH = "../files/posts/libraries/a-first-look-at-hvplot/"

Tables

table = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

Helpers

EmbedBokeh Class

class EmbedBokeh:
    """Embed a bokeh figure

    Args:
     plot: a hvplot to embed
     file_name: name of the file to save the javascript in
     folder_path: path to the folder to save the file
     create_folder: if the folder doesn't exist create it
     make_parents: if creating a folder add the missing folders in the path
    """
    def __init__(self, plot: holoviews.core.overlay.NdOverlay,
                 file_name: str,
                 folder_path: PathType,
                 create_folder: bool=True,
                 make_parents: bool=True) -> None:
        self.plot = plot
        self._figure = None
        self.create_folder = create_folder
        self.make_parents = make_parents
        self._folder_path = None
        self.folder_path = folder_path
        self._file_name = None
        self.file_name = file_name
        self._source = None
        self._javascript = None
        self._bokeh_source = None
        self._export_string = None
        return

    @property
    def folder_path(self) -> Path:
        """The path to the folder to store javascript"""
        return self._folder_path

    @folder_path.setter
    def folder_path(self, path: PathType) -> None:
        """Sets the path to the javascript folder"""
        self._folder_path = Path(path)
        if self.create_folder and not self._folder_path.is_dir():
            self._folder_path.mkdir(parents=self.make_parents)
        return

    @property
    def file_name(self) -> str:
        """The name of the javascript file"""
        return self._file_name

    @file_name.setter
    def file_name(self, name: str) -> None:
        """Sets the filename

        Args:
         name: name to save the javascript (without the folder)
        """
        name = Path(name)
        self._file_name = "{}.js".format(name.stem)
        return

    @property
    def figure(self) -> bokeh.plotting.Figure:
        """The Figure to plot"""
        if self._figure is None:
            self._figure = holoviews.render(self.plot)
        return self._figure

    @property
    def bokeh_source(self) -> bokeh.resources.Resources:
        """The javascript source
        """
        if self._bokeh_source is None:
            self._bokeh_source = bokeh.resources.CDN
        return self._bokeh_source

    @property
    def source(self) -> str:
        """The HTML fragment to export"""
        if self._source is None:
            self._javascript, self._source = autoload_static(self.figure,
                                                             self.bokeh_source,
                                                             self.file_name)
        return self._source

    @property
    def javascript(self) -> str:
        """javascript to save"""
        if self._javascript is None:
            self._javascript, self._source = autoload_static(self.figure,
                                                             self.bokeh_source,
                                                             self.file_name)
        return self._javascript

    @property
    def export_string(self) -> str:
        """The string to embed the figure into org-mode"""
        if self._export_string is None:
            self._export_string = textwrap.dedent(
                """#+BEGIN_EXPORT html
{}
#+END_EXPORT""".format(self.source))
        return self._export_string

    def save_figure(self) -> None:
        """Saves the javascript file"""
        with open(self.folder_path.joinpath(self.file_name), "w") as writer:
            writer.write(self.javascript)
        return

    def __call__(self) -> None:
        """Creates the bokeh javascript and emits it"""
        self.save_figure()
        print(self.export_string)
        return

    def reset(self) -> None:
        """Sets the generated (bokeh) properties back to None"""
        self._export_string = None
        self._javascript = None
        self._source = None
        self._figure = None
        return
Embed = partial(EmbedBokeh, folder_path=FOLDER_PATH)

The Timer

TIMER = Timer()

The Data

Portland Crime

This is taken from the Portland Crime Statistics page.

portland_path = Path("~/data/datasets/portland/crime-to-january-2018.csv").expanduser()
assert portland_path.is_file()
with TIMER:
    crime = pandas.read_csv(portland_path)
Started: 2019-02-02 18:38:59.025251
Ended: 2019-02-02 18:39:00.170796
Elapsed: 0:00:01.145545
print(crime.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217224 entries, 0 to 217223
Data columns (total 17 columns):
Address              196626 non-null object
Case Number          217224 non-null object
Crime Against        217224 non-null object
Neighborhood         210788 non-null object
Number of Records    217224 non-null int64
Occur Month Year     217224 non-null object
Occur Date           217224 non-null object
Occur Time           217224 non-null int64
Offense Category     217224 non-null object
Offense Count        217224 non-null int64
Offense Type         217224 non-null object
OpenDataLat          193352 non-null float64
OpenDataLon          193352 non-null float64
OpenDataX            193352 non-null float64
OpenDataY            193352 non-null float64
Report Date          217224 non-null object
ReportMonthYear      217224 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 28.2+ MB
None

Here's a possible categorical column to use.

crime["type"] = crime["Crime Against"].astype("category")
crime = crime.drop(columns=["Crime Against"])
print(table(crime.type.value_counts().reset_index(), headers=["Type", "Count"]))
| Type     |  Count |
|----------+--------|
| Property | 175567 |
| Person   |  32109 |
| Society  |   9548 |

Making the Plot

Holoviews is expecting you to work in a jupyter notebook and isn't quite so easy to work with in org-mode, so I'll make the plot with hvplot but then convert it to a bokeh figure to embed it in this post.
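
The conversion itself is the one-liner buried in the EmbedBokeh class's figure property:

# turn the hvplot output (a holoviews object) into a plain bokeh figure
figure = holoviews.render(plot)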

The Plot

with TIMER:
    crime["date"] = pandas.to_datetime(crime["Occur Date"])
    crime["id"] = crime["Case Number"]
    crime = crime.drop(columns=["Occur Date", "Case Number"])
    crime_dates = crime.set_index("date")
Started: 2019-02-01 20:31:47.668915
Ended: 2019-02-01 20:32:09.889378
Elapsed: 0:00:22.220463
weekly = crime_dates.resample("W").count()
plot = weekly.id.hvplot()
Embed(plot, "weekly_crime.js")()

That didn't work out as planned. It turns out that the data starts in 1972, but is mostly empty until around May of 2015. It also looks like January is missing values. I think I'll trim the data set.

Trimmed

crime_dates = crime_dates[(crime_dates.index >= datetime(2015, 5, 31))
                          & (crime_dates.index < datetime(2019, 1, 1))]
weekly = crime_dates.resample("W").count()

By Type

HoloViews uses a rather odd way of composing figures. Instead of the object-oriented way you might expect, it overrides the multiplication sign (* overlays curves on the same plot) and the addition sign (+ puts plots side by side), so to plot the types I'll have to multiply their plots.

types = {name: crime_dates[crime_dates.type==name]
         for name in crime_dates.type.unique()}
weekly_types = {name: data.resample("W").count()
                for name, data in types.items()}
keys = list(weekly_types.keys())
first = keys[0]
plot = weekly_types[first].hvplot(y="id", label=first)
for key in keys[1:]:
    plot *= weekly_types[key].hvplot(y="id", label=key)

It looks like it could use more trimming, but it also looks like it's mostly property crimes, which is what you'd expect, I guess. Actually, I tried another trim and the plot always starts at zero because of the way the resampling works, so trimming doesn't make that first anomaly go away. Maybe trimming the weekly counts would help.

Looking a Little More at the Crimes

By Neighborhood

top_ten = crime_dates.Neighborhood.value_counts()[:10].reset_index()
print(table(top_ten, headers="Neighborhood Count".split()))
| Neighborhood        | Count |
|---------------------+-------|
| Downtown            | 10237 |
| Hazelwood           | 10127 |
| Lents               |  5681 |
| Powellhurst-Gilbert |  5605 |
| Centennial          |  5016 |
| Old Town/Chinatown  |  4966 |
| Northwest           |  4648 |
| Montavilla          |  4026 |
| Pearl               |  3905 |
| Lloyd               |  3699 |
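# one-hot encode the neighborhoods so a monthly sum gives per-neighborhood counts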
neighborhoods = crime_dates["Neighborhood"]
neighborhoods = pandas.get_dummies(neighborhoods)
neighborhoods = neighborhoods[top_ten["index"]].resample("M").sum()
plot = (neighborhoods.hvplot(title="Top Ten Monthly Neighborhood Crime Counts")
        + neighborhoods.hvplot.table(columns=["Downtown", "Hazelwood",
                                              "Lents", "Powellhurst-Gilbert"]))
Embed(plot, "neighborhoods")()

So the first thing to notice is that Downtown and Hazelwood dominate the case counts. There doesn't seem to be any strong upward or downward trend.

I live in Powellhurst-Gilbert, about a block north of Lents, and if you considered them one big neighborhood (they are adjacent) they would form the highest-crime neighborhood; but, sticking to the arbitrariness of the boundaries, we are relegated to numbers three and four.

Distribution

plot = neighborhoods.hvplot.kde(
    by="Neighborhood",
    title="Distributions of Top Ten Crime Neighborhoods")
Embed(plot, "neighborhoods_kde")()

I don't know what that mysterious bulge around zero is; all the neighborhoods are in the other peaks.

Irises

Since the previous data was time-series data, I thought I'd load a data set that isn't, to illustrate the use of the by parameter.

irises = load_iris()
print(irises.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
iris_data = pandas.DataFrame(irises.data, columns=irises.feature_names)
print(iris_data.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

I don't know where this convention came from, but you can use the by keyword to specify a categorical column to differentiate the data points. In this case I'll use it to differentiate the species.

target = pandas.Series(irises.target)
target_map = dict(zip(range(3), irises.target_names))
iris_data["target"] = target.apply(lambda x: target_map[x])
plot = iris_data.hvplot.scatter(x="sepal length (cm)", y="petal length (cm)",
                                by="target", alpha=0.5,
                                title="Iris Sepal Length vs Petal Length")
EmbedBokeh(plot, folder_path=FOLDER_PATH, file_name="irises.js")()

Scatter Matrix

plot = hvplot.scatter_matrix(iris_data, c="target")
Embed(plot, "iris_scatter_matrix")()

Parallel Coordinates

plot = hvplot.parallel_coordinates(iris_data, "target")
Embed(plot, "iris_parallel_coordinates")()

Portland Daily Temperatures Data

Introduction

I'm going to work with the Daily Temperatures data set for Portland, Oregon (measured at the airport) taken from the National Weather Service. I cleaned it up a little already, removing the extra header rows and adding a missing column header (Metric), but the data is arranged with the year and month as columns and each day given its own column, which isn't how I want to work with it, so I'm going to transform it a little to make it look more like what I expect.

Set Up

Imports

Python

from functools import partial
from datetime import datetime
from pathlib import Path
from typing import Union
import os

From PyPi

from dotenv import load_dotenv
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas
import seaborn

My Stuff

from graeae.timers import Timer
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh

Plotting

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)

The Timer

TIMER = Timer()

The Embedder

Embed = partial(
    EmbedBokeh, 
    folder_path="../../files/posts/portland-daily-climate/portland-daily-temperatures-data/")

Loading the Data

load_dotenv()
path = Path(os.environ.get("CSV")).expanduser()
print(path)
assert path.is_file()
/home/athena/data/datasets/necromuralist/daily-climate-data/portland_1940_to_april_2018.csv

Some Preparation

The first thing to deal with is that there are three characters (that I noticed) representing "missing" data: "M", "T", and "-". We have to tell pandas about them when we use read_csv.

missing = ["M", "T", "-"]

I was going to load the measurement type (e.g. "TX"), but I realized that I was planning to turn those into column headers so maybe it's not a good idea.

with TIMER:
    data = pandas.read_csv(path, na_values=missing)
print(data.shape)
Started: 2019-03-10 18:50:58.399150
Ended: 2019-03-10 18:50:58.410684
Elapsed: 0:00:00.011534
(3756, 35)
print(data.columns)
Index(['YR', 'MO', 'Metric', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '30', '31', 'AVG or Total'],
      dtype='object')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3756 entries, 0 to 3755
Data columns (total 35 columns):
YR              3756 non-null int64
MO              3756 non-null int64
Metric          3756 non-null object
1               3602 non-null float64
2               3554 non-null float64
3               3583 non-null float64
4               3604 non-null float64
5               3599 non-null float64
6               3610 non-null float64
7               3587 non-null object
8               3590 non-null float64
9               3595 non-null float64
10              3614 non-null float64
11              3602 non-null float64
12              3600 non-null float64
13              3583 non-null float64
14              3582 non-null float64
15              3591 non-null float64
16              3604 non-null float64
17              3598 non-null float64
18              3615 non-null float64
19              3611 non-null float64
20              3588 non-null float64
21              3606 non-null float64
22              3609 non-null float64
23              3595 non-null float64
24              3605 non-null float64
25              3598 non-null float64
26              3600 non-null float64
27              3598 non-null float64
28              3593 non-null float64
29              3371 non-null float64
30              3294 non-null float64
31              2097 non-null float64
AVG or Total    3616 non-null float64
dtypes: float64(31), int64(2), object(2)
memory usage: 1.0+ MB
None

For some reason column 7 wasn't converted to a float.

for index, row in enumerate(data["7"]):
    try:
        float(row)
    except Exception as error:
        print(error)
        print("Row: {}".format(index))
        print("Value: {}".format(row))
could not convert string to float: 
Row: 1835
Value:  

It turns out that this one row also had a space (' ') for one of the values. Strange.

missing.append(" ")
data = pandas.read_csv(path, na_values=missing)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3756 entries, 0 to 3755
Data columns (total 35 columns):
YR              3756 non-null int64
MO              3756 non-null int64
Metric          3756 non-null object
1               3602 non-null float64
2               3554 non-null float64
3               3583 non-null float64
4               3604 non-null float64
5               3599 non-null float64
6               3610 non-null float64
7               3586 non-null float64
8               3590 non-null float64
9               3595 non-null float64
10              3614 non-null float64
11              3602 non-null float64
12              3600 non-null float64
13              3583 non-null float64
14              3582 non-null float64
15              3591 non-null float64
16              3604 non-null float64
17              3598 non-null float64
18              3615 non-null float64
19              3611 non-null float64
20              3588 non-null float64
21              3606 non-null float64
22              3609 non-null float64
23              3595 non-null float64
24              3605 non-null float64
25              3598 non-null float64
26              3600 non-null float64
27              3598 non-null float64
28              3593 non-null float64
29              3371 non-null float64
30              3294 non-null float64
31              2097 non-null float64
AVG or Total    3616 non-null float64
dtypes: float64(32), int64(2), object(1)
memory usage: 1.0+ MB
None

Cleaning

Drop the Last Column

Besides being a calculated column, the last column is ambiguous (I guess you can tell by its size whether it's an average or a total, but still), so I'm going to get rid of it (using drop).

cleaned = data.drop(data.columns[-1], axis="columns")
print(cleaned.shape)
assert len(cleaned.columns) == len(data.columns) - 1
(3756, 34)

Rotate the Days

Now I'm going to move the day-columns into row-values using melt.

melted = pandas.melt(cleaned, id_vars=["YR", "MO", "Metric"], var_name="Day", value_name="Value")
print(melted.head())
     YR  MO Metric Day  Value
0  1940  10     TX   1    NaN
1  1940  10     TN   1    NaN
2  1940  10     PR   1    NaN
3  1940  10     SN   1    NaN
4  1940  11     TX   1   52.0
print(melted.shape)
assert len(melted) == len(data) * 31
(116436, 5)

Casting the Days to Integers

Although they look like integers, the Day values were converted from column headers, so they're strings. Maybe I could have cast them at the time of the conversion, but, oh well.

print(type(melted.iloc[0].Day))
<class 'str'>
melted["Day"] = melted.Day.astype(int)
print(type(melted.iloc[0].Day))
<class 'numpy.int64'>
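
For the record, here's a sketch of what casting at conversion time might have looked like (untested against this exact frame): rename the day columns to ints before melting, and melt will copy them into the Day column as ints.

day_columns = {str(day): day for day in range(1, 32)}
melted = pandas.melt(cleaned.rename(columns=day_columns),
                     id_vars=["YR", "MO", "Metric"],
                     var_name="Day", value_name="Value")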

Make a Date Column

Now I'll make a single date column.

with TIMER:
    melted["date"] = melted.apply(lambda row: datetime(year=row.YR,
                                                       month=row.MO,
                                                       day=row.Day),
                                  axis="columns")
print(melted.head())

That raised an error:

ValueError: ('day is out of range for month', 'occurred at index 105184')
print(melted.iloc[105184])
YR        1941
MO           2
Metric      TX
Day         29
Value      NaN
Name: 105184, dtype: object

February 29? Was 1941 a leap year? According to Wikipedia, a leap year has to be divisible by four.

print(melted.iloc[105184].YR/4)
485.25
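
The standard library can do this check for us - calendar.isleap implements the full rule (divisible by four, except centuries not divisible by 400):

from calendar import isleap
# isleap(1941) returns False - 1941 wasn't a leap year
print(isleap(1941))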

It doesn't look like there was a February 29 in 1941, so we have a problem: not all of the dates in the melted table actually exist. Also, for some reason the '-' didn't get converted to a NaN, but one thing at a time.

def to_datetime(row: pandas.Series) -> Union[datetime, None]:
    """Converts the row to a datetime

    Args:
     row: row in the dataframe with year, month, and day
    Returns:
     row converted to datetime or None if it isn't valid
    """
    if not pandas.isnull(row.Value):
        try:
            return datetime(year=row.YR, month=row.MO, day=row.Day)
        except ValueError as error:
            print(error)
    return None
with TIMER:
    melted["date"] = melted.apply(to_datetime, axis="columns")
    print(melted.head())
Started: 2019-03-10 18:56:57.314885
day is out of range for month
     YR  MO Metric  Day  Value       date
0  1940  10     TX    1    NaN        NaT
1  1940  10     TN    1    NaN        NaT
2  1940  10     PR    1    NaN        NaT
3  1940  10     SN    1    NaN        NaT
4  1940  11     TX    1   52.0 1940-11-01
Ended: 2019-03-10 18:57:01.094165
Elapsed: 0:00:03.779280

It looks like there was only one case where the date didn't exist, but there are multiple entries with missing values.

print("Fraction Missing: {:.2f}".format(
    len(melted[melted.Value.isnull()])/len(melted)))
Fraction Missing: 0.06
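
As an alternative to the row-wise apply, pandas.to_datetime can assemble dates from a frame with year/month/day columns, and errors="coerce" turns impossible dates into NaT. A vectorized sketch that should be much faster (the rename just gives the columns the names to_datetime expects):

parts = melted[["YR", "MO", "Day"]].rename(
    columns=dict(YR="year", MO="month", Day="day"))
# unlike the apply version, rows with a null Value get a real date here,
# but the later dropna on Value removes them just the same
melted["date"] = pandas.to_datetime(parts, errors="coerce")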

Drop the Missing

I'll drop the dates that didn't have data.

cleaned = melted.dropna(subset=["Value"])
print(cleaned.head())
     YR  MO Metric  Day  Value       date
4  1940  11     TX    1  52.00 1940-11-01
5  1940  11     TN    1  40.00 1940-11-01
6  1940  11     PR    1   0.17 1940-11-01
7  1940  11     SN    1   0.00 1940-11-01
8  1940  12     TX    1  51.00 1940-12-01

Drop the Extra Date Columns

Since we have a date column I'll get rid of the columns that I used to make it.

cleaned = cleaned.drop(["YR", "MO", "Day"], axis="columns")
print(cleaned.head())
  Metric  Value       date
4     TX  52.00 1940-11-01
5     TN  40.00 1940-11-01
6     PR   0.17 1940-11-01
7     SN   0.00 1940-11-01
8     TX  51.00 1940-12-01

Figuring Out the Missing Date

One of the entries has values but no date.

print(cleaned[cleaned.date.isnull()])
       Metric  Value date
105427     SN   34.0  NaT
print(melted.iloc[105427])
YR        1946
MO           2
Metric      SN
Day         29
Value       34
date       NaT
Name: 105427, dtype: object
print(melted.iloc[105427].YR/4)
486.5

Okay, this is another non-leap year. What's going on?

print(data[(data.YR==1946) & (data.MO==2)])
       YR  MO Metric      1      2     3      4      5      6      7  ...  \
256  1946   2     TX  48.00  47.00  45.0  43.00  48.00  48.00  43.00  ...   
257  1946   2     TN  44.00  35.00  32.0  32.00  37.00  39.00  33.00  ...   
258  1946   2     PR   0.05   0.02   NaN   0.01   1.54   0.63   0.06  ...   
259  1946   2     SN   0.00   0.00   0.0   0.00   0.00   0.00   0.00  ...   

       23     24    25     26     27     28    29  30  31  AVG or Total  
256  58.0  52.00  53.0  49.00  53.00  55.00   NaN NaN NaN         49.40  
257  43.0  40.00  39.0  35.00  44.00  40.00   NaN NaN NaN         36.00  
258   0.1   0.26   NaN   0.57   0.64   0.04   NaN NaN NaN          4.99  
259   0.0   0.00   0.0   0.00   0.00   0.00  34.0 NaN NaN          0.00  

[4 rows x 35 columns]

It looks like there's something wrong with the snowfall measurement for that date; the other measurements don't have values there.

print(data[(data.YR==1946) & (data.MO==2) & (data.Metric=="SN")])
       YR  MO Metric    1    2    3    4    5    6    7  ...   23   24   25  \
259  1946   2     SN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   

      26   27   28    29  30  31  AVG or Total  
259  0.0  0.0  0.0  34.0 NaN NaN           0.0  

[1 rows x 35 columns]

It was just all 0's and then there's this mysterious 34 inches of snow on the 29th of February. I'm pretty sure that's a mistake. I'll have to delete that.

Although I have the row's index in the original data frame, I've already done all this cleaning, so I think it's easier to just drop the rows with missing dates.

rows, columns = cleaned.shape
cleaned = cleaned.dropna(subset=["date"])
assert cleaned.shape[0] == rows - 1

Pivot the Metric Column

So, besides getting the dates into a single column, one of the points of this exercise was to pivot the metric types out into their own columns. You could argue that the metric is just a category, but since each date gets all four of the values I think this makes sense.

pivoted = cleaned.pivot(index="date", columns="Metric", values="Value")
print(pivoted.head())
Metric        PR   SN    TN    TX
date                             
1940-10-13  0.01  0.0  57.0  75.0
1940-10-14   NaN  0.0  53.0  70.0
1940-10-15   NaN  0.0  52.0  64.0
1940-10-16  0.00  0.0  50.0  72.0
1940-10-17  0.13  0.0  58.0  72.0

It looks like there's some missing precipitation data. I don't really have a solution for that; I think decisions about how to impute missing values should be made when the data set is actually being used.

for metric in ("PR", "SN", "TN", "TX"):
    print("{} Missing: {:,}".format(metric, len(pivoted[pivoted[metric].isnull()])))
PR Missing: 3,297
SN Missing: 523
TN Missing: 0
TX Missing: 0

So it looks like we're okay with the temperatures but maybe not so well off with the precipitation.

# copy to avoid pandas' SettingWithCopyWarning when adding the column
missing = pivoted[pivoted.PR.isnull()].copy()
missing.loc[:, "missing"] = 1
monthly = missing.missing.resample("M")
figure, axe = pyplot.subplots()
figure.suptitle("Missing Monthly Precipitation", weight="bold")
counts = monthly.count()
stem = axe.stem(counts.index, counts)

[plot: Missing Monthly Precipitation]

So, I was expecting this to be a problem that happened early and then died out, but it appears there's an ongoing problem with collecting precipitation - or maybe they use a symbol for 0 that I'm interpreting as missing?
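
Incidentally, the missing helper column isn't strictly necessary - resampling the boolean mask directly should give the same counts, since True sums as 1. A sketch:

monthly_missing = pivoted.PR.isnull().resample("M").sum()

Here's the same count aggregated by year.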

yearly = missing.missing.resample("Y")
figure, axe = pyplot.subplots()
figure.suptitle("Missing Yearly Precipitation", weight="bold")
counts = yearly.count()
stem = axe.stem(counts.index, counts)

[plot: Missing Yearly Precipitation]

This does seem problematic, if I do anything with precipitation I'll have to figure out what's going on here.

Updating the Columns

The whole TX, TN, etc. encoding scheme seems like it causes too much mental overhead so I'm going to rename the metric columns.

renamed = pivoted.rename(dict(PR="precipitation",
                              SN="snowfall",
                              TN="minimum_temperature",
                              TX="maximum_temperature"),
                         axis="columns")
print(renamed.head())
Metric      precipitation  snowfall  minimum_temperature  maximum_temperature
date                                                                         
1940-10-13           0.01       0.0                 57.0                 75.0
1940-10-14            NaN       0.0                 53.0                 70.0
1940-10-15            NaN       0.0                 52.0                 64.0
1940-10-16           0.00       0.0                 50.0                 72.0
1940-10-17           0.13       0.0                 58.0                 72.0

Save the Message Pack

Now that we have the cleaned-up data, I'll save it as a message pack.

pack_path = Path(os.environ.get("MESSAGE_PACK")).expanduser()
print(pack_path)
/home/hades/pCloudDrive/data/daily-climate-data/portland_1940_to_april_2018.msg
renamed.to_msgpack(pack_path)
assert pack_path.is_file()
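
One caveat for anyone following along later: to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so on a current install you'd need a different format. to_parquet is one substitute - a sketch, assuming pyarrow or fastparquet is installed:

# the suffix swap is just a convention; parquet preserves the datetime index
renamed.to_parquet(pack_path.with_suffix(".parquet"))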

Looking at Some Plots

maximum_temperature = renamed.maximum_temperature.resample("Y")
medians = maximum_temperature.median()
maxes = maximum_temperature.max()
mins = maximum_temperature.min()
figure, axe = pyplot.subplots()
figure.suptitle("Portland, OR Yearly Daily Temperatures", weight="bold")
axe.stem(maxes.index, maxes, markerfmt="ro",label="Maximum")
axe.stem(mins.index, mins, markerfmt="go", label="Minimum")
stem = axe.stem(medians.index, medians, label="Median")
axe.set_xlabel("Year")
axe.set_ylabel("Temperature (F)")
legend = axe.legend(bbox_to_anchor=(1, 1))

[plot: Portland, OR Yearly Daily Temperatures]

maximum_temperature = renamed.maximum_temperature.resample("Y")
frame = pandas.DataFrame.from_dict(
    {"Maximum": maximum_temperature.max(),
     "Median": maximum_temperature.median(),
     "Minimum": maximum_temperature.min(),
    }
)
output = frame.hvplot(width=1000, height=600, 
                      title="Mean Maximum Portland Temperatures Per Year",
                      fontsize="18pt")
Embed(output, "min_median_max")()

On the one hand, it's pretty neat what you get for such little code, on the other hand it's not at all obvious how to fix all the styling to make it a better plot.

Kaggle On Time-Series Visualization

Introduction

This is a walk-through of the kaggle notebook on Time-Series Plotting by Aleksey Bilogur.

Set Up

Imports

From Python

from datetime import datetime
from functools import partial
from pathlib import Path
import os

From PyPi

from dotenv import load_dotenv
from bokeh.io.doc import curdoc
from bokeh.models import CrosshairTool, HoverTool
from bokeh.themes import Theme
from bokeh.palettes import Category20
from holoviews import opts
from holoviews.plotting.links import RangeToolLink
from hvplot import hvPlot
from pandas.plotting import autocorrelation_plot, lag_plot
from tabulate import tabulate
import holoviews
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
import seaborn

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
from graeae.tables import CountPercentage

Plotting

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)

Holoviews Backend

holoviews.extension("bokeh")

Bokeh

class Plots:
    width = 1100
    height = 600
    font = "Open Sans"
    font_size = "24pt"
    line_width = 3
    tools =  ["hover"]
    blue = seaborn.color_palette()[0]
    light_blue = Category20[3][1]
    red = seaborn.color_palette()[3]
    yellow = seaborn.color_palette()[1]
    green = seaborn.color_palette()[2]
    gray = seaborn.color_palette()[7]

theme = Theme(json={
    "attrs": {
        "Figure": {
            "text_font": "Open Sans",
            "text_font_size": "18pt",
            "line_color": Category20[3][0],
            "plot_width": Plots.width,
            "plot_height": Plots.height,
            "tools": ["pan", "zoom_in", "hover", "reset"],
        },
        "Title": {
            "text_font_style": "bold",
        },
    },
})
curdoc().theme = theme

Setup Libraries

load_dotenv()
table = partial(tabulate, headers="keys", tablefmt="orgtbl")
kaggle_path = Path(os.environ.get("KAGGLE")).expanduser()
assert kaggle_path.is_dir()

The Embedder

Embed = partial(
    EmbedBokeh, 
    folder_path="../../files/posts/tutorials/kaggle-on-time-series-visualization/")

The Data

New York Stock Exchange Prices

nyse_path = kaggle_path.joinpath("nyse/prices.csv")
assert nyse_path.is_file()
nyse = pandas.read_csv(nyse_path, parse_dates=["date"])
nyse.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 851264 entries, 0 to 851263
Data columns (total 7 columns):
date      851264 non-null datetime64[ns]
symbol    851264 non-null object
open      851264 non-null float64
close     851264 non-null float64
low       851264 non-null float64
high      851264 non-null float64
volume    851264 non-null float64
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 45.5+ MB
nyse = nyse.set_index("date")
print(table(nyse.head()))
| date                | symbol | open   | close  | low    | high   | volume     |
|---------------------+--------+--------+--------+--------+--------+------------|
| 2016-01-05 00:00:00 | WLTW   | 123.43 | 125.84 | 122.31 | 126.25 | 2.1636e+06 |
| 2016-01-06 00:00:00 | WLTW   | 125.24 | 119.98 | 119.94 | 125.54 | 2.3864e+06 |
| 2016-01-07 00:00:00 | WLTW   | 116.38 | 114.95 | 114.93 | 119.74 | 2.4895e+06 |
| 2016-01-08 00:00:00 | WLTW   | 115.48 | 116.62 | 113.5  | 117.44 | 2.0063e+06 |
| 2016-01-11 00:00:00 | WLTW   | 117.01 | 114.97 | 114.09 | 117.33 | 1.4086e+06 |

The notebook describes this as an example of a "strong" date case because the dates act as an explicit index for the data and are, in this case, an aggregate for a day of trading.

UPS

Some of the correlational plots don't show anything meaningful when you use the market as a whole (I guess because different stocks are moving in different directions) so I'm going to pull out the UPS stock information to use later.

ups = nyse[nyse.symbol=="UPS"]
print(ups.shape)
(1762, 6)

Shelter Outcomes

shelter_path = kaggle_path.joinpath(
    "austin-animal-center-shelter-outcomes/aac_shelter_outcomes.csv")
assert shelter_path.is_file()
shelter = pandas.read_csv(shelter_path, parse_dates=["datetime", "date_of_birth"])
shelter.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    78248 non-null object
animal_id           78256 non-null object
animal_type         78256 non-null object
breed               78256 non-null object
color               78256 non-null object
date_of_birth       78256 non-null datetime64[ns]
datetime            78256 non-null datetime64[ns]
monthyear           78256 non-null object
name                54370 non-null object
outcome_subtype     35963 non-null object
outcome_type        78244 non-null object
sex_upon_outcome    78254 non-null object
dtypes: datetime64[ns](2), object(10)
memory usage: 7.2+ MB

Some of the columns are only identifiers (like a name) so we'll drop them to make it easier to inspect the data (although we aren't really going to do anything with it here anyway).

shelter = shelter[["outcome_type", "age_upon_outcome", "datetime",
                   "animal_type", "breed", "color", "sex_upon_outcome",
                   "date_of_birth"]]
print(table(shelter.head(), showindex=False))
| outcome_type | age_upon_outcome | datetime            | animal_type | breed                   | color        | sex_upon_outcome | date_of_birth       |
|--------------+------------------+---------------------+-------------+-------------------------+--------------+------------------+---------------------|
| Transfer     | 2 weeks          | 2014-07-22 16:04:00 | Cat         | Domestic Shorthair Mix  | Orange Tabby | Intact Male      | 2014-07-07 00:00:00 |
| Transfer     | 1 year           | 2013-11-07 11:47:00 | Dog         | Beagle Mix              | White/Brown  | Spayed Female    | 2012-11-06 00:00:00 |
| Adoption     | 1 year           | 2014-06-03 14:20:00 | Dog         | Pit Bull                | Blue/White   | Neutered Male    | 2013-03-31 00:00:00 |
| Transfer     | 9 years          | 2014-06-15 15:50:00 | Dog         | Miniature Schnauzer Mix | White        | Neutered Male    | 2005-06-02 00:00:00 |
| Euthanasia   | 5 months         | 2014-07-07 14:04:00 | Other       | Bat Mix                 | Brown        | Unknown          | 2014-01-07 00:00:00 |

The notebook describes this as an example of a "weak" date case because the dates are only there for record-keeping and, while they might be significant for modeling, aren't acting as an index for the records.

Cryptocurrency

currency_path = kaggle_path.joinpath("all-crypto-currencies/crypto-markets.csv")
assert currency_path.is_file()
currency = pandas.read_csv(currency_path, parse_dates=["date"])
currency = currency.set_index("date")
print(table(currency.head(), showindex=True))
| date                | slug    | symbol | name    | ranknow | open   | high   | low    | close  | volume | market      | close_ratio | spread |
|---------------------+---------+--------+---------+---------+--------+--------+--------+--------+--------+-------------+-------------+--------|
| 2013-04-28 00:00:00 | bitcoin | BTC    | Bitcoin | 1       | 135.3  | 135.98 | 132.1  | 134.21 | 0      | 1.48857e+09 | 0.5438      | 3.88   |
| 2013-04-29 00:00:00 | bitcoin | BTC    | Bitcoin | 1       | 134.44 | 147.49 | 134    | 144.54 | 0      | 1.60377e+09 | 0.7813      | 13.49  |
| 2013-04-30 00:00:00 | bitcoin | BTC    | Bitcoin | 1       | 144    | 146.93 | 134.05 | 139    | 0      | 1.54281e+09 | 0.3843      | 12.88  |
| 2013-05-01 00:00:00 | bitcoin | BTC    | Bitcoin | 1       | 139    | 139.89 | 107.72 | 116.99 | 0      | 1.29895e+09 | 0.2882      | 32.17  |
| 2013-05-02 00:00:00 | bitcoin | BTC    | Bitcoin | 1       | 116.38 | 125.6  | 92.28  | 105.21 | 0      | 1.16852e+09 | 0.3881      | 33.32  |

Grouping

Birth Dates

Per Date

Here's a plot of the birth dates of the animals in the shelter.

births = shelter.date_of_birth.value_counts()
births_peak = births.idxmax()
births = births.reset_index().sort_values(by="index")
births.columns = ["birth_date", "Births"]
hover = HoverTool(
    tooltips=[
        ("Date", "@birth_date{%Y-%m-%d}"),
        ("Births", "@Births{0,0}"),
    ],
    formatters={"birth_date": "datetime",
                "Births": "numeral"},
    mode="vline",
)
line = holoviews.VLine(births_peak)
curve = holoviews.Curve(
    births, ("birth_date", "Date of Birth"), "Births",
)

main = curve.relabel("Count of Births By Date").opts(labelled=["y"], 
                                                     tools=[hover], 
                                                     height=Plots.height, 
                                                     ylabel="Births", 
                                                     xaxis=None)
range_finder = curve.opts(height=100, yaxis=None, default_tools=[], 
                          xlabel="Birth Dates")

link = RangeToolLink(range_finder, main)

combination = (main * line + range_finder * line)

layout = combination.opts(
    opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
    opts.Curve(width=Plots.width, 
               color=Category20[3][0], 
               fontsize=Plots.font_size,
               line_width=Plots.line_width),
    opts.VLine(color=Plots.red, line_dash="dotted")
).cols(1)
Embed(layout, "shelter_births")()

It looks like there was an upward trend until about 2016, when it started to taper off, but since we're counting by day there's a lot of variance, so we're going to group the data using pandas' resample method.

Note: One interesting problem I found is that unless I zoom in I can't get my mouse to trigger the hover-tool for the day with the greatest value (May 5, 2014).

There are a couple of different ways to do the grouping of the days, but the simplest is to take the count for each date using value_counts. This leaves us with a Series with the dates in the index and the counts as values. Once we have this we can aggregate the dates by year and count how many births there were per year.

By Year

First I'll get the counts for each day using value_counts and print off the first values to see what it looks like. Calling reset_index changes the Series to a DataFrame with the dates as a column.

counts = shelter.date_of_birth.value_counts()
print(table(counts.head().reset_index(), showindex=False))
| index               |   date_of_birth |
|---------------------+-----------------|
| 2014-05-05 00:00:00 |             112 |
| 2015-09-01 00:00:00 |             110 |
| 2014-04-21 00:00:00 |             105 |
| 2015-04-28 00:00:00 |             104 |
| 2016-05-01 00:00:00 |             102 |

Now we can aggregate the birth-counts by year using resample.

year_counts = counts.resample("Y")
print(year_counts)
DatetimeIndexResampler [freq=<YearEnd: month=12>, axis=0, closed=right, label=right, convention=start, base=0]

Note that this is a resampler (a kind of grouper); we don't get what we want until we call an aggregating method (like count) on it. In this case, since we already have per-date counts, we want to sum all of the counts within each year (so we need sum).
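
To make the distinction concrete, here's a quick sketch of both:

# count(): how many dates that year had at least one birth
# sum(): how many births there were that year
print(year_counts.count().head())
print(year_counts.sum().head())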

Now I'm going to aggregate the yearly counts using the sum method.

sums = year_counts.sum()

Calling sum gives us a Series with the dates in the index and the sums as the values.

print(sums.head())
1991-12-31    1
1992-12-31    1
1993-12-31    1
1994-12-31    9
1995-12-31    7
Freq: A-DEC, Name: date_of_birth, dtype: int64

The idxmax method gives us the index of the greatest value and since the dates are in the index, using it will give us the date of the year with the most births, which I'll call sum_peak.

sum_peak = sums.idxmax()

As you may have noticed, all the dates are set to December 31, but for plotting it's better to have them set to January 1, so I'll fix that here and do some other cleanup.

sums = sums.reset_index()
sums.columns = ["birth_date", "Births"]
sum_peak = datetime(sum_peak.year, 1, 1)
sums["birth_date"] = sums.birth_date.apply(lambda date: datetime(date.year, 1, 1))

And now for the plot.

  • The Tools

    First set up the tools

    hover = HoverTool(
        tooltips=[
            ("Year", "@birth_date{%Y}"),
            ("Births", "@Births{0,0}"),
        ],
        formatters={"birth_date": "datetime",
                    "Births": "numeral"},
        mode="vline",
    )
    
  • The Plot Parts

    Now I'll create our plotting objects.

    The vertical line will mark the peak year.

    line = holoviews.VLine(sum_peak, label=sum_peak.strftime("%Y"))
    

    And I'm going to add an annotation to it.

    x = datetime(sum_peak.year, 3, 1)
    text = holoviews.Text(x, sums.max()[1]/4,
                          "Max Year: {}".format(sum_peak.year), 
                          halign="left")
    

    Now our data-curve.

    curve = holoviews.Curve(
        sums, ("birth_date", "Date of Birth"), "Births",
    )
    

    Next I'll make two copies of the curve - main will be the larger curve and range_finder will create a plot below it to let us select a range of dates which get linked using the RangeToolLink.

    main = curve.relabel("Births Per Year (1991-2017)").opts(
        labelled=["y"], 
        tools=[hover], 
        xaxis=None,
        ylabel="Births",
        height=Plots.height)
    range_finder = curve.opts(height=100, yaxis=None, 
                              default_tools=[],
                              xlabel="Year")
    
    link = RangeToolLink(range_finder, main)
    

    Now combine the parts to make our visible plot.

    combination = (line * main * text + line * range_finder)
    

    This next bit is to set some styling on the plot.

  • The Options
    layout = combination.opts(
        opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
        opts.Curve(width=Plots.width, 
                   color=Plots.blue,
                   padding=0.01,
                   fontsize=Plots.font_size, 
                   line_width=Plots.line_width),
        opts.VLine(color=Plots.red, line_dash="dotted")
    ).cols(1)
    
  • Embed

    Finally, create the javascript and embed it in this notebook.

    Embed(layout, "shelter_births_per_year")()
    

Lollipop Plot

An alternative way to look at this would be a lollipop plot.

# The Tools
hover = HoverTool(
    tooltips=[
        ("Year", "@birth_date{%Y}"),
        ("Births", "@Births{0,0}"),
    ],
    formatters={"birth_date": "datetime",
                "Births": "numeral"},
    mode="vline",
)

# The Parts
line = holoviews.VLine(sum_peak, label=sum_peak.strftime("%Y"))
spikes = holoviews.Spikes(sums, ("birth_date", "Date of Birth"), "Births")
circles = holoviews.Scatter(sums, "birth_date", "Births")

# The Range Finder
main = circles.relabel().opts(
    labelled=["y"], 
    tools=[hover], 
    xaxis=None,
    ylabel="Births",
    height=Plots.height,
    size=10,
    padding=(0, (0, 0.1)))
range_finder = circles.opts(height=100, 
                            yaxis=None, 
                            default_tools=[],
                            size=5,
                            fontsize={"ticks": "14pt"},
                            xlabel="Year of Birth")

link = RangeToolLink(range_finder, main)

# The Layout
combination = (spikes * line * main + spikes * line * range_finder)

# The Styling Options
layout = combination.opts(
    opts.Layout(shared_axes=False, 
                merge_tools=False,
                title="Shelter Animal Births Per Year (1991- 2017)",
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                color=Plots.red, 
                fontsize=Plots.font_size,
                line_width=Plots.line_width),
    opts.Scatter(color=Plots.blue, fontsize={"ticks": "14pt"}, legend_position="left"),
    opts.VLine(color=Plots.green),
).cols(1)

# The HTML and Javascript
Embed(layout, "births_per_year_spikes")()

Note that putting the title in the Layout changes the font. I was trying to set it to Open Sans but HoloViews is horribly documented for most things so I couldn't figure out how to do it.

Animal Shelter Outcomes

While knowing the birthdates of the animals in the shelter is interesting, what about the dates when their cases were resolved? I originally called this Animal Shelter Adoptions but "outcome" doesn't always mean "adopted", unfortunately.

CountPercentage(shelter.outcome_type)()
| Value           | Count | Percentage |
|-----------------+-------+------------|
| Adoption        | 33112 |      42.32 |
| Transfer        | 23499 |      30.03 |
| Return to Owner | 14354 |      18.35 |
| Euthanasia      |  6080 |       7.77 |
| Died            |   680 |       0.87 |
| Disposal        |   307 |       0.39 |
| Rto-Adopt       |   150 |       0.19 |
| Missing         |    46 |       0.06 |
| Relocate        |    16 |       0.02 |

I don't know what Disposal means, but it doesn't sound good. Neither does Missing, really, especially if there are any restaurants nearby. Anyway, on to the plotting. I'll aggregate the outcome-counts by year.

outcome_counts = shelter.datetime.value_counts()
outcomes = outcome_counts.resample("Y").sum()
print(table(outcome_counts.head().reset_index(), showindex=False))
outcomes = outcomes.reset_index()
outcomes.columns = ["date", "count"]
outcomes["date"] = outcomes.date.apply(lambda date: datetime(date.year, 1, 1))
| index               |   datetime |
|---------------------+------------|
| 2016-04-18 00:00:00 |         39 |
| 2015-08-11 00:00:00 |         25 |
| 2017-10-17 00:00:00 |         25 |
| 2015-11-17 00:00:00 |         22 |
| 2015-07-02 00:00:00 |         22 |

This next part isn't really necessary, but I think keeping the names consistent is helpful, especially since I was struggling enough with HoloViews that I didn't need the extra confusion of mismatched column names.

sums = sums.rename(columns=dict(birth_date="date", Births="count"))

This is going to be like the previous plot but I'm going to add a crosshair tool to make it easier to see how things line up with the axis.

# The Tools
hover = HoverTool(
    tooltips=[
        ("Year", "@date{%Y}"),
        ("Count", "@count{0,0}"),
    ],
    formatters={"date": "datetime",
                "count": "numeral"},
    mode="vline",
)
crosshairs = CrosshairTool(line_color=Plots.light_blue)

# The Parts
births = holoviews.Scatter(sums, "date", "count", label="Births")
outcome_circles = holoviews.Scatter(outcomes, "date", "count", 
                                    group="outcome", label="Outcomes")
spikes = holoviews.Spikes(outcomes, ("date", 'Year'), ("count", "Count"), 
                          group="outcome")

# The Layout
combination = spikes * outcome_circles * births

# The Styling
layout = combination.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                height=Plots.height,
                title="Shelter Animal Births vs Outcomes Per Year",
                show_title=True,
                fontsize=Plots.font_size,
                padding=(0, (0, .1)),
                color=Plots.blue,
                line_width=Plots.line_width),
    opts.Scatter("outcome", color=Plots.blue, size=10, legend_position="top_left"),
    opts.Scatter(fontsize={"ticks": "14pt"}, color=Plots.red, size=10, 
                 tools=[hover, crosshairs]),
)

# The HTML
Embed(layout, "outcome_lollipops")()

You can see that there are only six years of adoption outcomes although there are sixteen years of birth dates, with a sudden uptick to the peak year of 2014. It's interesting that the births drop off much faster than the outcomes - the animals seemed to be getting older for some reason.

Trading Volume

The previous plot was a count plot, but you can also use other summary statistics, like the mean, to see how things changed over time. I'll plot the mean volume per year for the New York Stock Exchange.

volume = nyse.volume.resample("Y")
means = volume.mean().reset_index()
means["date"] = means.date.apply(lambda date: datetime(date.year, 1, 1))

Along with the standard deviations.

deviations = volume.std().reset_index()
means["two_sigma"] = means.volume + 2 * deviations.volume

And now our plot.

# The Tools
hover = HoverTool(
    tooltips=[
        ("Year", "@date{%Y}"),
        ("Volume", "@volume{0,0.00}"),
    ],
    formatters={"date": "datetime",
                "volume": "numeral"},
    mode="vline",
)

# The Parts
top_spread = holoviews.ErrorBars((means.date, means.volume, means.two_sigma),
                              group="volume")

volume_curve = holoviews.Curve(means, 
                               ("date", "Year"), 
                               ("volume", "Volume"), 
                               group="volume")

zero_line = holoviews.HLine(0)

# The Layout
layout = volume_curve * top_spread * zero_line

# The Styling
layout = layout.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Curve(width=Plots.width, 
               height=Plots.height,
               title="Mean NYSE Trading Volume Per Year",
               show_title=True,
               fontsize=Plots.font_size,
               padding=(0, (0, .1)),
               color=Plots.blue,
               line_width=Plots.line_width,
               tools=[hover]),
    opts.HLine(line_color=Plots.gray)
)

# The HTML
Embed(layout, "stock_mean_volume")()

While the standard deviation is important, in this case it's so large that it smashes the mean down flat (although maybe the fact that it's so large tells us that the mean isn't so accurate).

hover = HoverTool(
    tooltips=[
        ("Year", "@date{%Y}"),
        ("Volume", "@volume{0,0.00}"),
    ],
    formatters={"date": "datetime",
                "volume": "numeral"},
    mode="vline",
)

volume_circles = holoviews.Scatter(means, "date", "volume")
volume_spikes = holoviews.Spikes(means, ("date", "Date"), 
                                 ("volume", "Volume"))
combination = volume_spikes * volume_circles
crosshairs = CrosshairTool(line_color=Plots.light_blue, dimensions="height")

layout = combination.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                height=Plots.height,
                title="NYSE Mean Annual Trading Volume",
                show_title=True,
                fontsize=Plots.font_size,
                padding=(0, (0, .1)),
                color=Plots.blue,
                line_width=Plots.line_width),
    opts.Scatter(color=Plots.blue,
                 size=10, 
                 tools=[hover, crosshairs]),
)
Embed(layout, "stock_lollipops")()

I took the cross-hairs out of the plot with the standard deviations, but they're (a little) more helpful for the lollipop plot because you have to be directly above a point to trigger the hover tool there, whereas in the Curve plot hovering anywhere above a segment triggers it.

Lag Plots

The lag plot helps you check whether the ordering of the data matters. You plot each value against the next value (e.g. one day against the following day); if the ordering carries no information, the plot will look random.
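
Under the hood it's just a scatter of each value against the value one (or lag) steps later - a minimal hand-rolled sketch, in case it helps to see it:

import matplotlib.pyplot as pyplot
import pandas

def hand_rolled_lag_plot(series: pandas.Series, lag: int=1, ax=None):
    """Scatter each value against the value `lag` steps later."""
    ax = ax if ax is not None else pyplot.gca()
    values = series.values
    # x is y(t), y is y(t + lag)
    ax.scatter(values[:-lag], values[lag:], s=10)
    ax.set_xlabel("y(t)")
    ax.set_ylabel("y(t + {})".format(lag))
    return ax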

NYSE

The lag_plot function isn't one of the DataFrame methods so I don't think it will work with HoloViews, although I haven't tried yet.

volume = nyse.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("NYSE Volume Lag Plot", weight="bold")
subplot = lag_plot(volume.sum().tail(365), ax=axe)

[plot: NYSE Volume Lag Plot]

So, the center points do seem to show a relationship, since the next day's volume goes up along with the previous day's volume, but I don't know what those bands around 0 are. One thing I noticed is that there are holidays in the data.

print(volume.sum().index[-6])
2016-12-25 00:00:00

And there are also weekends in there.

print(volume.sum().index[-13].strftime("%a"))
Sun

So it's likely that there are days in there where there was no trading and so they won't correlate with the days that preceded the start of a break or the ones that followed the end of a break. I think. I don't really know if there's trading all year round.

volume_sums = volume.sum()
for day in volume_sums[volume_sums==0][-9:].index:
    print("{} {}".format(day.strftime("%a"), day))
Sat 2016-12-03 00:00:00
Sun 2016-12-04 00:00:00
Sat 2016-12-10 00:00:00
Sun 2016-12-11 00:00:00
Sat 2016-12-17 00:00:00
Sun 2016-12-18 00:00:00
Sat 2016-12-24 00:00:00
Sun 2016-12-25 00:00:00
Mon 2016-12-26 00:00:00

So it does look like the zeros are weekends and holidays.
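
If that's what's causing the bands, then dropping the zero-volume days before plotting should make them disappear - a sketch (I haven't actually run this version):

trading_days = volume_sums[volume_sums > 0]
figure, axe = pyplot.subplots()
figure.suptitle("NYSE Volume Lag Plot (Trading Days Only)", weight="bold")
subplot = lag_plot(trading_days.tail(365), ax=axe)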

UPS

Here's what just the UPS trading volumes look like.

figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Lag Plot", weight="bold")
subplot = lag_plot(ups.volume, ax=axe)

[plot: UPS Trading Volume Lag Plot]

I don't know why, but that looks much better. I guess the market as a whole doesn't move together day by day the way a single stock does.

Autocorrelation Plot

UPS

figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Daily Autocorrelation", weight="bold")
subplot = autocorrelation_plot(ups.volume, ax=axe)

[plot: UPS Trading Volume Daily Autocorrelation]

This plot shows the correlation of the series with itself at different lag intervals. It looks like up to about 500 days of lag the correlation is positive, but it becomes more negative after that. The horizontal lines are the confidence intervals - the solid grey lines are the 95% interval and the dashed grey lines are the 99% interval. Points that fall outside these intervals are statistically significant.
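
If you want to spot-check a point on the curve, pandas' Series.autocorr computes the same statistic for a single lag - a sketch:

# the correlation of the series with itself shifted by the given number of rows
print(ups.volume.autocorr(lag=100))
print(ups.volume.autocorr(lag=600))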

Cryptocurrency

Lag Plot

crypto_daily = currency.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("Cryptocurrency Volume Lag Plot", weight="bold")
subplot = lag_plot(crypto_daily.sum(), ax=axe)

[plot: Cryptocurrency Volume Lag Plot]

Unlike the stock-exchange, the cryptocurrencies seem to move together and don't take days off.

Autocorrelation Plot

figure, axe = pyplot.subplots()
figure.suptitle("Dogecoin Auto Correlation", weight="bold")
dogecoin = currency[currency.name=="Dogecoin"]
subplot = autocorrelation_plot(dogecoin.volume, ax=axe)

[plot: Dogecoin Auto Correlation]

If my understanding of how this plot works is correct, there is some kind of significance to lags of 125 and 250 days. Is this really true? Possibly.