HoloViews Tabular Data Take Two
from pathlib import Path
import os
from bokeh.io import output_notebook
from bokeh.embed import autoload_static
from bokeh.resources import CDN
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas
output_notebook()
load_dotenv(".env")
True
holoviews.extension("bokeh")
path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(reader)
diseases.head()
 | Year | Week | State | measles | pertussis |
---|---|---|---|---|---|
0 | 1928 | 1 | Alabama | 3.67 | NaN |
1 | 1928 | 2 | Alabama | 6.25 | NaN |
2 | 1928 | 3 | Alabama | 7.95 | NaN |
3 | 1928 | 4 | Alabama | 12.58 | NaN |
4 | 1928 | 5 | Alabama | 8.03 | NaN |
key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)
dataset = dataset.aggregate(function=numpy.mean)
dataset
:Dataset [Year,State] (measles,pertussis)
layout = (dataset.to(holoviews.Curve, "Year", "measles")
+ dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
plot = layout.options(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
plot
If you looked at this in a jupyter notebook with a running server, the plot above would have working dropdown menus for selecting the state. Rendered statically, it doesn't.
holoviews.save(plot, "diseases.html")
renderer = holoviews.renderer("bokeh")
figure = renderer.get_plot(plot).state
javascript, tag = autoload_static(figure, CDN, "diseases.js")
print(tag)
<script src="diseases.js" id="5e346ec9-db34-4143-bfa3-a2fe8d2e0da0"></script>
%%HTML
<script src="diseases.js" id="5e346ec9-db34-4143-bfa3-a2fe8d2e0da0"></script>
with open("../../files/posts/libraries/holoviews-tabular-data-take-two/diseases.js", "w") as writer:
writer.write(javascript)
Okay, so weirdly, using autoload_static seems to be what's breaking it: even the javascript exported by this notebook doesn't have the dropdown menu.
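One thing I haven't tried yet (a sketch, not a confirmed fix): export a complete HTML document with bokeh's file_html instead of autoload_static and see whether the dropdowns survive that path. The output file name here is just for illustration.
from bokeh.embed import file_html
# figure and CDN are the same objects used with autoload_static above
html = file_html(figure, CDN, "diseases")
with open("diseases_file_html.html", "w") as writer:
    writer.write(html)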
HoloViews Tabular Datasets
Table of Contents
Introduction
This is a walk through the HoloViews page on Tabular Datasets. The data-set was created by the Wall Street Journal using Project Tycho, but I'm getting it from the HoloViews github repository. The Wall Street Journal page is here. Unfortunately it has mixed content types (https and http) as well as some other problems which prevent Firefox and Chrome-based browsers from rendering the visualization, so I don't know what it actually looks like. Given that it's a commercial site I'm assuming it's an old page that they don't care about anymore.
Warning: I originally did this with modin and it wouldn't plot correctly. Save modin for pre-processing and just use the real pandas when plotting.
Set Up
Imports
Python
from functools import partial
from pathlib import Path
import os
PyPi
from bokeh.io import output_notebook
from dotenv import load_dotenv
from holoviews import opts
from tabulate import tabulate
import holoviews
import numpy
import pandas
My Projects
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
Holoviews Bokeh
I don't know why, but you have to specify that you're using bokeh, even though it looks like it's working when you don't.
holoviews.extension("bokeh")
output_notebook()
The Embedder
files_path = Path("../../files/posts/libraries/holoviews-tabular-datasets/")
Embed = partial(
EmbedBokeh,
folder_path=files_path)
Dotenv
I have the path to the data-set in a .env file, so I'll load it into the environment dictionary.
load_dotenv(".env")
Tabulate
This will print a table that org knows how to render.
orgtable = partial(tabulate, headers="keys", tablefmt="orgtbl",
showindex=False)
Load the Data
The data comes from Project Tycho, which provides health-related data sets for research. The .env file assumes that I cloned the HoloViews repository so that I can load the data from it.
path = Path(os.environ.get("DISEASES")).expanduser()
assert path.is_file()
with path.open() as reader:
    diseases = pandas.read_csv(reader)
print(orgtable(diseases.head()))
Year | Week | State | measles | pertussis |
---|---|---|---|---|
1928 | 1 | Alabama | 3.67 | nan |
1928 | 2 | Alabama | 6.25 | nan |
1928 | 3 | Alabama | 7.95 | nan |
1928 | 4 | Alabama | 12.58 | nan |
1928 | 5 | Alabama | 8.03 | nan |
print(len(diseases))
print(len(diseases.State.unique()))
print(diseases.Year.min())
print(diseases.Year.max())
print(diseases.Week.unique())
222768
51
1928
2011
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52]
The State column includes Washington D.C., which is why there are 51 states. It has 52 weeks of data for each year from 1928 through 2011. Since the measles and pertussis values are greater than 1, I assume this is some kind of rate (like incidence per million), but the page doesn't say, and the article they link to doesn't seem to render in my browser (and I don't have an account to download a dataset).
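As a quick sanity check on that reading (a sketch using the diseases frame loaded above), you can count the weeks recorded for each state-year pair - if the data is complete this should print just [52]:
# unique counts of weeks per (State, Year) group
print(diseases.groupby(["State", "Year"]).Week.count().unique())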
Create a Dataset
HoloViews has a class called Dataset that lets you declare the independent variables (key dimensions, or kdims) and the dependent variables (value dimensions, or vdims).
key_dimensions = "Year State".split()
value_dimensions = [("measles", "Measles Incidence"), ("pertussis", "Pertussis Incidence")]
dataset = holoviews.Dataset(diseases, key_dimensions, value_dimensions)
The value_dimensions list holds tuples of the form (<column-name>, <output-name>), so when you make a plot it will use the <output-name> for any labels that are created.
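You can see what those tuples became by inspecting the dimensions on the Dataset (a quick check using the dataset defined above):
# the vdims carry the human-friendly labels from the tuples
print(dataset.kdims)
print(dataset.vdims)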
Aggregate The Data
The one column that I didn't add is the Week
column. The Dataset
has a rather confusing aggregate
method (confusing because you only pass in the function to aggregate with) that apparently knows how to use the key_dimensions
variables we passed in to figure out what to aggregate.
dataset = dataset.aggregate(function=numpy.mean)
print(dataset)
print(dataset.shape)
:Dataset   [Year,State]   (measles,pertussis)
(4284, 4)
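If the aggregate call feels opaque, the equivalent operation in plain pandas is a group-by over the key dimensions (a conceptual sketch, not a claim about how HoloViews implements it):
# mean of each value dimension for every (Year, State) pair
means = diseases.groupby(["Year", "State"])[["measles", "pertussis"]].mean().reset_index()
print(means.shape)  # should match the (4284, 4) printed above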
layout = (dataset.to(holoviews.Curve, "Year", "measles")
+ dataset.to(holoviews.Curve, "Year", "pertussis")).cols(1)
layout.opts(opts.Curve(width=600, height=300, framewise=True, tools=["hover"]))
Embed(layout, "measles_pertusis")()
Two things to note. One is that HoloViews picked up the nicer names without us having to specify them. Another is that only Alabama is displayed. In the demonstration HoloViews created a drop-down menu to select a state but it didn't do it here. Maybe you need to run it in a jupyter notebook…
Actually, I think it might be a conflict with nikola: this is a page saved from a jupyter notebook without any nikola pre-processing:
Save the HTML
I'll see if you can do it directly here without using jupyter.
save_file = "diseases_2.html"
output = files_path.joinpath(save_file)
holoviews.save(layout, output)
print("[[file:{}][This is the plot.]]".format(save_file))
GISS Global/Hemispheric Temperatures
Table of Contents
Set Up
Imports
Python
from functools import partial
from pathlib import Path
import os
PyPi
from bokeh.layouts import column
from bokeh.palettes import Set1
from bokeh.models import (
BoxZoomTool,
HoverTool,
Legend,
PanTool,
ResetTool,
SaveTool,
Span,
WheelZoomTool,
)
from bokeh.models.widgets import Panel, Tabs
from bokeh.plotting import (figure,
ColumnDataSource,
)
from dotenv import load_dotenv
import holoviews
import pandas
This Project
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
Setup the Embed
files_path = Path("../../files/posts/giss/giss-globalhemispheric-temperatures/")
Embed = partial(
EmbedBokeh,
folder_path=files_path)
Set Up Bokeh
I originally created EmbedBokeh to use HoloViews to do the rendering, so you have to set bokeh as the backend or it will try to use matplotlib instead.
holoviews.extension("bokeh")
Load the Data
load_dotenv(".env")
path = Path(os.environ.get("ZONES")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(reader)
giss.loc[:, "Year"] = giss.Year.astype("int32")
print(giss.describe())
              Year        Glob        NHem        SHem     24N-90N  \
count   139.000000  139.000000  139.000000  139.000000  139.000000
mean   1949.000000    0.032302    0.056043    0.008561    0.077698
std      40.269923    0.336896    0.393435    0.301848    0.464606
min    1880.000000   -0.490000   -0.540000   -0.490000   -0.580000
25%    1914.500000   -0.200000   -0.220000   -0.235000   -0.280000
50%    1949.000000   -0.070000   -0.010000   -0.080000    0.020000
75%    1983.500000    0.215000    0.210000    0.265000    0.255000
max    2018.000000    0.980000    1.260000    0.710000    1.500000

          24S-24N     90S-24S     64N-90N     44N-64N     24N-44N     EQU-24N  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000
mean     0.036115   -0.018561    0.111079    0.117770    0.027698    0.027626
std      0.331384    0.295695    0.917715    0.516729    0.356416    0.326111
min     -0.650000   -0.470000   -1.640000   -0.710000   -0.590000   -0.720000
25%     -0.215000   -0.250000   -0.545000   -0.270000   -0.200000   -0.230000
50%     -0.030000   -0.100000    0.020000    0.000000   -0.070000    0.000000
75%      0.255000    0.230000    0.660000    0.360000    0.135000    0.240000
max      0.970000    0.700000    3.050000    1.440000    1.060000    0.930000

          24S-EQU     44S-24S     64S-44S     90S-64S
count  139.000000  139.000000  139.000000  139.000000
mean     0.045683    0.020432   -0.069353   -0.078129
std      0.343385    0.312688    0.269380    0.732359
min     -0.580000   -0.430000   -0.540000   -2.570000
25%     -0.210000   -0.220000   -0.265000   -0.490000
50%     -0.030000   -0.080000   -0.090000    0.050000
75%      0.290000    0.260000    0.180000    0.410000
max      1.020000    0.780000    0.450000    1.270000
print(giss.iloc[0])
print()
print(giss.iloc[-1])
Year       1880.00
Glob         -0.18
NHem         -0.31
SHem         -0.06
24N-90N      -0.38
24S-24N      -0.17
90S-24S      -0.01
64N-90N      -0.97
44N-64N      -0.47
24N-44N      -0.25
EQU-24N      -0.21
24S-EQU      -0.13
44S-24S      -0.04
64S-44S       0.05
90S-64S       0.67
Name: 0, dtype: float64

Year       2018.00
Glob          0.82
NHem          0.99
SHem          0.66
24N-90N       1.19
24S-24N       0.64
90S-24S       0.70
64N-90N       1.87
44N-64N       1.09
24N-44N       1.03
EQU-24N       0.69
24S-EQU       0.59
44S-24S       0.78
64S-44S       0.37
90S-64S       1.07
Name: 138, dtype: float64
print(giss.columns)
giss = giss.rename(columns=dict(
Glob="Global",
NHem="Northern Hemisphere",
SHem="Southern Hemisphere"))
print(giss.columns)
Index(['Year', 'Glob', 'NHem', 'SHem', '24N-90N', '24S-24N', '90S-24S',
       '64N-90N', '44N-64N', '24N-44N', 'EQU-24N', '24S-EQU', '44S-24S',
       '64S-44S', '90S-64S'],
      dtype='object')
Index(['Year', 'Global', 'Northern Hemisphere', 'Southern Hemisphere',
       '24N-90N', '24S-24N', '90S-24S', '64N-90N', '44N-64N', '24N-44N',
       'EQU-24N', '24S-EQU', '44S-24S', '64S-44S', '90S-64S'],
      dtype='object')
Plot
Global/Hemispheric
class Plot:
width = 1000
height = 800
line_width = 4
alpha = 0.8
light_alpha = 0.2
title_font_size = "14pt"
hover = HoverTool(
tooltips = [
("Year", "@year"),
("Difference From Normal", "@anomaly")
]
)
tools = [
hover,
PanTool(),
WheelZoomTool(),
BoxZoomTool(),
ResetTool(),
SaveTool(),
]
plot = figure(plot_width=Plot.width, plot_height=Plot.height,
x_range=(giss.Year.min(), giss.Year.max()),
x_axis_label="Year",
y_axis_label="Difference (Celsius)",
tools=tools)
plot.title.text = "Yearly Temperature Difference from Mean 1931-1980 Temperature by Hemisphere"
plot.title.text_font_size = Plot.title_font_size
horizontal = Span(location=0, dimension="width", line_color="darkgray",
line_width=Plot.line_width,
line_cap="round",
line_dash="dashed")
plot.renderers.extend([horizontal])
locations = ["Global", "Northern Hemisphere", "Southern Hemisphere"]
for location, color in zip(locations, Set1[3]):
columns = ColumnDataSource(
data=dict(
year=giss.Year,
anomaly=giss[location],
smoothed=giss.rolling(5, on="Year", min_periods=1)[location].mean(),
)
)
dots = plot.circle("year", "anomaly", source=columns,
                   color=color,
                   line_width=Plot.line_width,
                   alpha=Plot.light_alpha,
                   legend=location)
line = plot.line("year", "smoothed", source=columns,
color=color,
line_width=Plot.line_width, alpha=Plot.alpha,
legend="{} Rolling 5 Year Mean".format(location))
plot.legend.click_policy = "hide"
plot.legend.location = "top_left"
Embed the Plot
I need to fix the EmbedBokeh class.
embed = Embed(plot, "global_temperature_anomalies")
embed._figure = plot
embed()
GISS Surface Temperature Analysis (GISTEMP v32)
Table of Contents
Introduction
This is a look at the Goddard Institute for Space Studies' surface temperature data. In particular it is the Global-mean monthly, seasonal, and annual means data, which has data from 1880 to the present (CSV Download Link).
Set Up
Imports
Python
from pathlib import Path
import os
PyPi
from dotenv import load_dotenv
import pandas
Load Dotenv
load_dotenv(".env")
Load the Data
Take One
path = Path(os.environ.get("GLOBAL")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(reader)
print(giss.head())
  Land-Ocean: Global Means
      Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec   J-D   D-N   DJF   MAM   JJA   SON
      1880  -.29  -.18  -.11  -.19  -.11  -.23  -.20  -.09  -.16  -.23  -.20  -.22  -.18   ***   ***  -.14  -.17  -.19
      1881  -.15  -.17   .04   .04   .02  -.20  -.06  -.02  -.14  -.21  -.22  -.11  -.10  -.11  -.18   .03  -.10  -.19
      1882   .14   .15   .04  -.19  -.16  -.26  -.21  -.05  -.10  -.25  -.16  -.24  -.11  -.10   .06  -.10  -.17  -.17
      1883  -.31  -.39  -.13  -.17  -.20  -.12  -.08  -.15  -.20  -.14  -.22  -.16  -.19  -.20  -.31  -.16  -.12  -.19
One thing to notice is that the first line got read in as columns and the columns got read in as the first row.
print(giss.iloc[0])
Land-Ocean: Global Means    SON
Name: (Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J-D, D-N, DJF, MAM, JJA), dtype: object
So we're going to have to skip the first row.
Take Two
path = Path(os.environ.get("GLOBAL")).expanduser()
assert path.is_file()
with path.open() as reader:
    giss = pandas.read_csv(reader, skiprows=1)
print(giss.head())
   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov  \
0  1880 -0.29  -.18  -.11  -.19  -.11  -.23  -.20  -.09  -.16  -.23  -.20
1  1881 -0.15  -.17   .04   .04   .02  -.20  -.06  -.02  -.14  -.21  -.22
2  1882  0.14   .15   .04  -.19  -.16  -.26  -.21  -.05  -.10  -.25  -.16
3  1883 -0.31  -.39  -.13  -.17  -.20  -.12  -.08  -.15  -.20  -.14  -.22
4  1884 -0.15  -.08  -.37  -.42  -.36  -.40  -.34  -.26  -.27  -.24  -.30

    Dec   J-D   D-N   DJF   MAM   JJA   SON
0  -.22  -.18   ***   ***  -.14  -.17  -.19
1  -.11  -.10  -.11  -.18   .03  -.10  -.19
2  -.24  -.11  -.10   .06  -.10  -.17  -.17
3  -.16  -.19  -.20  -.31  -.16  -.12  -.19
4  -.29  -.29  -.28  -.13  -.38  -.34  -.27
print(giss.columns)
Index(['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'J-D', 'D-N', 'DJF', 'MAM', 'JJA', 'SON'], dtype='object')
print(giss.describe())
            Year         Jan
count   140.0000  140.000000
mean   1949.5000    0.027500
std      40.5586    0.396867
min    1880.0000   -0.790000
25%    1914.7500   -0.265000
50%    1949.5000   -0.020000
75%    1984.2500    0.290000
max    2019.0000    1.150000
So most of the columns weren't read as numeric, probably because of the use of *** for missing data.
Take Three
with path.open() as reader:
    giss = pandas.read_csv(reader, skiprows=1, na_values="***")
print(giss.describe())
           Year         Jan         Feb         Mar         Apr         May  \
count  140.0000  140.000000  139.000000  139.000000  139.000000  139.000000
mean  1949.5000    0.027500    0.038201    0.052806    0.026187    0.016043
std     40.5586    0.396867    0.393732    0.387470    0.363309    0.348825
min   1880.0000   -0.790000   -0.610000   -0.600000   -0.600000   -0.560000
25%   1914.7500   -0.265000   -0.235000   -0.230000   -0.260000   -0.240000
50%   1949.5000   -0.020000   -0.040000   -0.020000   -0.050000   -0.050000
75%   1984.2500    0.290000    0.325000    0.275000    0.250000    0.260000
max   2019.0000    1.150000    1.330000    1.300000    1.070000    0.900000

              Jun         Jul         Aug         Sep         Oct         Nov  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000
mean     0.003022    0.026043    0.030863    0.041367    0.060072    0.048561
std      0.339148    0.317524    0.330365    0.323767    0.335174    0.341057
min     -0.530000   -0.540000   -0.540000   -0.530000   -0.570000   -0.540000
25%     -0.245000   -0.210000   -0.210000   -0.180000   -0.190000   -0.185000
50%     -0.070000   -0.050000   -0.050000   -0.060000    0.000000   -0.020000
75%      0.190000    0.195000    0.190000    0.205000    0.190000    0.180000
max      0.780000    0.820000    1.000000    0.880000    1.060000    1.020000

              Dec         J-D         D-N         DJF         MAM         JJA  \
count  139.000000  139.000000  138.000000  138.000000  139.000000  139.000000
mean     0.021727    0.032302    0.033116    0.026449    0.031583    0.020360
std      0.364511    0.336896    0.338215    0.369663    0.361006    0.324987
min     -0.790000   -0.490000   -0.510000   -0.660000   -0.560000   -0.520000
25%     -0.220000   -0.200000   -0.215000   -0.240000   -0.255000   -0.220000
50%     -0.050000   -0.070000   -0.060000   -0.070000   -0.060000   -0.070000
75%      0.275000    0.215000    0.230000    0.280000    0.265000    0.195000
max      1.100000    0.980000    1.010000    1.190000    1.090000    0.860000

              SON
count  139.000000
mean     0.050504
std      0.327437
min     -0.490000
25%     -0.190000
50%     -0.020000
75%      0.190000
max      0.970000
Actually, I just looked at the "official" file given by Coursera and realized I downloaded the wrong one.
The Real Data
The data I was supposed to pull was the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies' Zonal Annual Means, which gives the annual mean anomaly for each latitude zone in a given year (rather than monthly global averages).
zone_path = Path(os.environ.get("ZONES")).expanduser()
assert zone_path.is_file()
with zone_path.open() as reader:
giss = pandas.read_csv(reader)
print(giss.describe())
              Year        Glob        NHem        SHem     24N-90N  \
count   139.000000  139.000000  139.000000  139.000000  139.000000
mean   1949.000000    0.032302    0.056043    0.008561    0.077698
std      40.269923    0.336896    0.393435    0.301848    0.464606
min    1880.000000   -0.490000   -0.540000   -0.490000   -0.580000
25%    1914.500000   -0.200000   -0.220000   -0.235000   -0.280000
50%    1949.000000   -0.070000   -0.010000   -0.080000    0.020000
75%    1983.500000    0.215000    0.210000    0.265000    0.255000
max    2018.000000    0.980000    1.260000    0.710000    1.500000

          24S-24N     90S-24S     64N-90N     44N-64N     24N-44N     EQU-24N  \
count  139.000000  139.000000  139.000000  139.000000  139.000000  139.000000
mean     0.036115   -0.018561    0.111079    0.117770    0.027698    0.027626
std      0.331384    0.295695    0.917715    0.516729    0.356416    0.326111
min     -0.650000   -0.470000   -1.640000   -0.710000   -0.590000   -0.720000
25%     -0.215000   -0.250000   -0.545000   -0.270000   -0.200000   -0.230000
50%     -0.030000   -0.100000    0.020000    0.000000   -0.070000    0.000000
75%      0.255000    0.230000    0.660000    0.360000    0.135000    0.240000
max      0.970000    0.700000    3.050000    1.440000    1.060000    0.930000

          24S-EQU     44S-24S     64S-44S     90S-64S
count  139.000000  139.000000  139.000000  139.000000
mean     0.045683    0.020432   -0.069353   -0.078129
std      0.343385    0.312688    0.269380    0.732359
min     -0.580000   -0.430000   -0.540000   -2.570000
25%     -0.210000   -0.220000   -0.265000   -0.490000
50%     -0.030000   -0.080000   -0.090000    0.050000
75%      0.290000    0.260000    0.180000    0.410000
max      1.020000    0.780000    0.450000    1.270000
print(giss.iloc[0])
Year       1880.00
Glob         -0.18
NHem         -0.31
SHem         -0.06
24N-90N      -0.38
24S-24N      -0.17
90S-24S      -0.01
64N-90N      -0.97
44N-64N      -0.47
24N-44N      -0.25
EQU-24N      -0.21
24S-EQU      -0.13
44S-24S      -0.04
64S-44S       0.05
90S-64S       0.67
Name: 0, dtype: float64
Criteria
Appropriate Chart Selection and Variables
Did you select the appropriate chart and use the correct chart elements to visualize the nominal, ordinal, discrete, and continuous variables, as described in lecture 2.1.3? Continuous data variables should be assigned to continuous chart elements (e.g., lines between data points), whereas discrete variables should be assigned to discrete chart elements (e.g., separate bars). Furthermore, the assignment of variables to elements should follow the priorities in lecture 2.1.2.
Design of the Chart
Does the chart effectively display the data, based on the design rules in lecture 2.3.1?
Content
How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data? For example, was a streamgraph used instead of an ordinary bar chart?
Grading
Criteria | Poor (1–2 points) | Fair (3 points) | Good (4 points) | Great (5 points) |
---|---|---|---|---|
Appropriate Chart Selection and Variables | Chart is indecipherable or significantly misleading because of poor chart type or assignment of variables to elements | Major problem(s) with chart selection or assignment of elements to variables | Minor problem(s) with chart selection or assignment of elements to variables | Chart selection is appropriate for data and its elements properly assigned to appropriate data variables |
Design of the Chart | No apparent attention paid to design | Evidence that several of the design rules should have been followed but were not | Evidence that one of the design rules should have been followed but was not | Attention paid to all design rules |
Content | Misleading | Boring | Not boring | Interesting |
Citation
- GISTEMP Team, 2019: GISS Surface Temperature Analysis (GISTEMP). NASA Goddard Institute for Space Studies. Dataset accessed 2019-02-27 at https://data.giss.nasa.gov/gistemp/.
- Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change, Rev. Geophys., 48, RG4004, https://doi.org/10.1029/2010RG000345.
Interactive Bokeh Legends
Table of Contents
Introduction
This is a reproduction of the bokeh interactive legends example.
Set Up
Imports
Python
I have a class that I use to embed the javascript that bokeh creates, and I'm going to make it easier to reuse by binding some values to it with partial.
from functools import partial
PyPi
- Bokeh
The Spectral color scheme from bokeh.palettes is a categorical color scheme from Color Brewer, a color scheme helper for cartographers.
from bokeh.palettes import Spectral4
from bokeh.plotting import figure
from bokeh.sampledata.stocks import AAPL, IBM, MSFT, GOOG
- Pandas And HoloViews
Pandas is for the data; holoviews is because I created the EmbedBokeh class expecting to always use it.
import holoviews
import pandas
My Stuff
This is a convenience class for embedding the javascript that bokeh creates into this post.
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
The Embedder
The files_path is where the javascript needs to be stored for nikola to copy it to the right place.
files_path = Path("../../files/posts/libraries/interactive-bokeh-legends/")
Embed = partial(
EmbedBokeh,
folder_path=files_path)
The Bokeh Backend
For some reason the backend is defaulting to matplotlib, so this fixes it.
holoviews.extension("bokeh")
Hide Plots
This creates a plot where the entry for each line in the legend becomes a button that toggles the plot's visibility.
plot = figure(plot_width=800, plot_height=400,
x_axis_type="datetime",
tools="hover,pan,wheel_zoom,box_zoom,reset")
plot.title.text = "Some Stock Prices (Click On the Legend To Hide Plots)"
for data, name, color in zip([AAPL, IBM, MSFT, GOOG], "Apple IBM Microsoft Google".split(), Spectral4):
frame = pandas.DataFrame(data)
frame["date"] = pandas.to_datetime(frame["date"])
plot.line(frame["date"], frame["close"], line_width=2, color=color, alpha=0.8, legend=name)
plot.legend.location = "top_left"
plot.legend.click_policy = "hide"
plot.title.text_font_size = "14pt"
Output The Plot
Fade Plots
Clicking on a line's entry in the legend changes the alpha-value for the line (to make it less visible). This keeps the hover tool working for the line, whereas hiding it disables the hover tool.
fade_plot = figure(plot_width=800, plot_height=400,
x_axis_type="datetime",
tools="hover,pan,wheel_zoom,box_zoom,reset")
fade_plot.title.text = "Some Stock Prices (Click On the Legend To Mute Plots)"
fade_plot.title.text_font_size = "14pt"
for data, name, color in zip([AAPL, IBM, MSFT, GOOG], "Apple IBM Microsoft Google".split(), Spectral4):
frame = pandas.DataFrame(data)
frame["date"] = pandas.to_datetime(frame["date"])
fade_plot.line(frame["date"], frame["close"], line_width=2, color=color, muted_alpha=0.2, muted_color=color, alpha=0.8, legend=name)
fade_plot.legend.location = "top_left"
fade_plot.legend.click_policy = "mute"
Output The Plot
Customizing HoloViews
Table of Contents
Introduction
This is another exploration - this time looking at what they call Customization. In my introduction post, when I made a scatter plot with a hover tool, I first had to make the Scatter element and then add the hover tool as part of the options. HoloViews does this to emphasize a separation of content and presentation: when making the Scatter element I was supposed to be thinking only about the data I wanted to add, and then when working with the options I turn to focus on the aesthetics.
Set Up
Imports
Python
from datetime import datetime
from functools import partial
from pathlib import Path
import os
PyPi
from dotenv import load_dotenv
from holoviews import opts
import holoviews
import pandas
Related Projects
from neurotic.tangles.timer import Timer
This Project
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
The Timer
TIMER = Timer()
The Embedder
files_path = Path("../../files/posts/libraries/customizing-holoviews/")
Embed = partial(
EmbedBokeh,
folder_path=files_path)
Bokeh Backend
When I ran the code further down in the notebook to render the javascript I was getting this error:
ValueError: autoload_static expects a single Model or Document
It was because I forgot the next step and it was defaulting to Matplotlib for some reason.
holoviews.extension("bokeh")
The Data
load_dotenv(".env")
path = Path(os.environ.get("PORTLAND_CRIME")).expanduser()
assert path.exists()
with TIMER:
data = pandas.read_csv(path)
Started: 2019-03-02 14:25:02.818262
Ended: 2019-03-02 14:25:03.296873
Elapsed: 0:00:00.478611
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217224 entries, 0 to 217223
Data columns (total 17 columns):
Address              196626 non-null object
Case Number          217224 non-null object
Crime Against        217224 non-null object
Neighborhood         210788 non-null object
Number of Records    217224 non-null int64
Occur Month Year     217224 non-null object
Occur Date           217224 non-null object
Occur Time           217224 non-null int64
Offense Category     217224 non-null object
Offense Count        217224 non-null int64
Offense Type         217224 non-null object
OpenDataLat          193352 non-null float64
OpenDataLon          193352 non-null float64
OpenDataX            193352 non-null float64
OpenDataY            193352 non-null float64
Report Date          217224 non-null object
ReportMonthYear      217224 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 28.2+ MB
None
date = (data["Occur Date"]
+ " "
+ data["Occur Time"].astype(str).str.zfill(4))
data["date"] = pandas.to_datetime(date, format="%m/%d/%Y %H%M")
print(data.date[:5])
0   2017-08-26 00:00:00
1   2017-08-29 16:00:00
2   2017-08-12 19:00:00
3   2017-08-27 01:00:00
4   2017-07-24 09:03:00
Name: date, dtype: datetime64[ns]
data = data[(data.date >= datetime(2015, 5, 31))
& (data.date < datetime(2019, 1, 1))]
selection = data[data.date > datetime(2018, 12, 24)].sort_values("date")
Plot time vs Latitude.
First we get our content.
curve = holoviews.Curve(selection, ("date", "Date-Time"), ("OpenDataLat", "Latitude"))
timestamps = holoviews.Spikes(selection, ("date", "Date-Time"), [])
layout = curve + timestamps
Now we make our presentation.
Take Two
Although the defaults give us a plot that's hard to read, by adjusting the width of the plot we can make it something more interpretable.
layout = layout.opts(
opts.Curve(height=200, width=900, xaxis=None, color="red", line_width=1.5, tools=["hover"]),
opts.Spikes(height=150, width=900, xaxis=None, color="grey")
).cols(1)
HoloViews Introduction
Table of Contents
Introduction
I've already taken an initial look at HVPlot, so now I'm going to look at HoloViews, which acts as an intermediate layer between the main plotting libraries like bokeh and matplotlib and the upper layer given by HVPlot. I haven't used HoloViews before, so I'm not really sure when you would use which. I guess HVPlot gives you access to the pandas plots in bokeh without a lot of work, which is nice, although I noticed that the plots tended to be missing things sometimes (like the Hover tool), so if you want to add more back in you probably have to understand HoloViews, which itself sometimes doesn't give you what you want (like the ability to render in org-mode posts), so you still need bokeh too, sometimes. And of course I'm only using the static-page versions of everything, not the features that work with a bokeh or jupyter server. But I guess that's for later.
I'm going to be working from the Introduction of their Getting Started guide.
Set Up
Imports
Python
from functools import partial
from pathlib import Path
From PyPi
from holoviews import opts
from sklearn.datasets import fetch_california_housing
import holoviews
import numpy
import pandas
This Project
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
The HoloViews Backend
If you use HVPlot you don't need to set the backend (because it defaults to 'bokeh', I think), but this is going to be about HoloViews, so I'm going to do it their way rather than relying on all the pandas methods.
holoviews.extension("bokeh")
A Partial Bokeh Embedder
Since the output folder is always the same I'm going to bind it to the EmbedBokeh definition.
plot_path = Path("../../files/posts/libraries/holoviews-introduction/")
Embed = partial(EmbedBokeh, folder_path=plot_path)
The Data Set
Load It
Sklearn downloads it as a 'bunch' so we need to get it in that form first and then turn it into a data frame (I'm sure there's a way to skip this step but this is the way I already know how to do it).
folder = Path("~/data/datasets/california-housing").expanduser()
assert folder.is_dir()
print(folder)
bunch = fetch_california_housing(folder)
print(bunch.DESCR)
/home/hades/data/datasets/california-housing

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the
U.S. Census Bureau publishes sample data (a block group typically has a
population of 600 to 3,000 people).

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
Make The DataFrame
data = pandas.DataFrame(bunch.data, columns=bunch.feature_names)
data["median_value"] = bunch.target
print(data.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude  median_value
0    -122.23         4.526
1    -122.22         3.585
2    -122.24         3.521
3    -122.25         3.413
4    -122.25         3.422
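As an aside, newer versions of scikit-learn (0.23 and later, if I remember right) can skip the bunch-to-frame step entirely:
# as_frame=True returns the data as a DataFrame (features plus the target column)
bunch = fetch_california_housing(as_frame=True)
data = bunch.frame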
A Plot
Our target is the median value of the house. Does that correlate with median income?
scatter = holoviews.Scatter(data,
("MedInc", "Median Income"),
("median_value", "Median Value"),
label="California Housing")
After setting up the basic plot we can do things to affect the appearance like setting the color or adding tools.
scatter = scatter.opts(opts.Scatter(color="red", tools=["hover"]))
Adding To the Layout
What if we want to add a distribution to the plot? HoloViews uses the + operator to indicate that you want to append a plot to another one.
layout = scatter + holoviews.Histogram(
numpy.histogram(data.HouseAge, bins=24), kdims=["HouseAge"])
layout = layout.opts(opts.Histogram(tools=["hover"]))
A First Look At HVPlot
Table of Contents
Introduction
This is a look at HVPlot, a HoloViews based plotting adapter that works directly with pandas or other pandas-like libraries (e.g. dask). I'm starting with their Introduction but might branch out after that. We'll see.
Set Up
Imports
From Python
from datetime import datetime
from functools import partial
from pathlib import Path
from typing import Union
import textwrap
From PyPi
from sklearn.datasets import load_iris
from tabulate import tabulate
import numpy
import pandas
My Stuff
from neurotic.tangles.timer import Timer
The Bokeh Imports
from bokeh.embed import autoload_static
# bokeh.plotting is imported for the type hints in the class below
import bokeh.plotting
import bokeh.resources
Set Up the HVPlot
I'm not sure exactly what it's doing, but this next import adds an hvplot method to pandas' DataFrames to do the actual plotting.
import holoviews
import hvplot.pandas
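As far as I can tell it registers an accessor, so after the import every DataFrame and Series grows an hvplot attribute (a minimal sketch with a throwaway frame):
# calling .hvplot() returns a HoloViews object; nothing is drawn yet
frame = pandas.DataFrame({"x": range(10)})
curve = frame.hvplot()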
Typing
PathType = Union[str, Path]
Constants
FOLDER_PATH = "../files/posts/libraries/a-first-look-at-hvplot/"
Tables
table = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)
Helpers
EmbedBokeh Class
class EmbedBokeh:
"""Embed a bokeh figure
Args:
plot: a hvplot to embed
folder_path: path to the folder to save the file
file_name: name of the file to save the javascript in
create_folder: if the folder doesn't exist create it
make_parents: if creating a folder add the missing folders in the path
"""
def __init__(self, plot: holoviews.core.overlay.NdOverlay,
file_name: str,
folder_path: PathType,
create_folder: bool=True,
make_parents: bool=True) -> None:
self.plot = plot
self._figure = None
self.create_folder = create_folder
self.make_parents = make_parents
self._folder_path = None
self.folder_path = folder_path
self._file_name = None
self.file_name = file_name
self._source = None
self._javascript = None
self._bokeh_source = None
self._export_string = None
return
@property
def folder_path(self) -> Path:
"""The path to the folder to store javascript"""
return self._folder_path
@folder_path.setter
def folder_path(self, path: PathType) -> None:
"""Sets the path to the javascript folder"""
self._folder_path = Path(path)
if self.create_folder and not self._folder_path.is_dir():
self._folder_path.mkdir(parents=self.make_parents)
return
@property
def file_name(self) -> str:
"""The name of the javascript file"""
return self._file_name
@file_name.setter
def file_name(self, name: str) -> None:
"""Sets the filename
Args:
name: name to save the javascript (without the folder)
"""
name = Path(name)
self._file_name = "{}.js".format(name.stem)
return
@property
def figure(self) -> bokeh.plotting.Figure:
"""The Figure to plot"""
if self._figure is None:
self._figure = holoviews.render(self.plot)
return self._figure
@property
def bokeh_source(self) -> bokeh.resources.Resources:
"""The javascript source
"""
if self._bokeh_source is None:
self._bokeh_source = bokeh.resources.CDN
return self._bokeh_source
@property
def source(self) -> str:
"""The HTML fragment to export"""
if self._source is None:
self._javascript, self._source = autoload_static(self.figure,
self.bokeh_source,
self.file_name)
return self._source
@property
def javascript(self) -> str:
"""javascript to save"""
if self._javascript is None:
self._javascript, self._source = autoload_static(self.figure,
self.bokeh_source,
self.file_name)
return self._javascript
@property
def export_string(self) -> str:
"""The string to embed the figure into org-mode"""
if self._export_string is None:
self._export_string = textwrap.dedent(
"""#+BEGIN_EXPORT html{}
#+END_EXPORT""".format(self.source))
return self._export_string
def save_figure(self) -> None:
"""Saves the javascript file"""
with open(self.folder_path.joinpath(self.file_name), "w") as writer:
writer.write(self.javascript)
return
def __call__(self) -> None:
"""Creates the bokeh javascript and emits it"""
self.save_figure()
print(self.export_string)
return
def reset(self) -> None:
"""Sets the generated (bokeh) properties back to None"""
self._export_string = None
self._javascript = None
self._source = None
self._figure = None
return
Embed = partial(EmbedBokeh, folder_path=FOLDER_PATH)
The Timer
TIMER = Timer()
The Data
Portland Crime
This is taken from the Portland Crime Statistics page.
portland_path = Path("~/data/datasets/portland/crime-to-january-2018.csv").expanduser()
assert portland_path.is_file()
with TIMER:
crime = pandas.read_csv(portland_path)
Started: 2019-02-02 18:38:59.025251
Ended: 2019-02-02 18:39:00.170796
Elapsed: 0:00:01.145545
print(crime.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217224 entries, 0 to 217223
Data columns (total 17 columns):
Address              196626 non-null object
Case Number          217224 non-null object
Crime Against        217224 non-null object
Neighborhood         210788 non-null object
Number of Records    217224 non-null int64
Occur Month Year     217224 non-null object
Occur Date           217224 non-null object
Occur Time           217224 non-null int64
Offense Category     217224 non-null object
Offense Count        217224 non-null int64
Offense Type         217224 non-null object
OpenDataLat          193352 non-null float64
OpenDataLon          193352 non-null float64
OpenDataX            193352 non-null float64
OpenDataY            193352 non-null float64
Report Date          217224 non-null object
ReportMonthYear      217224 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 28.2+ MB
None
Here's a possible categorical column to use.
crime["type"] = crime["Crime Against"].astype("category")
crime = crime.drop(columns=["Crime Against"])
print(table(crime.type.value_counts().reset_index(), headers=["Type", "Count"]))
Type | Count |
---|---|
Property | 175567 |
Person | 32109 |
Society | 9548 |
Making the Plot
Holoviews expects you to work in a jupyter notebook and isn't quite so easy to work with in org-mode, so I'll make the plot with hvplot but then convert it to a bokeh figure to embed it in this post.
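The conversion itself is a single call - it's what the figure property of the EmbedBokeh class above does (plot here stands for any hvplot/HoloViews object):
# render the HoloViews object to a bokeh figure (bokeh being the active backend)
figure = holoviews.render(plot)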
The Plot
with TIMER:
crime["date"] = pandas.to_datetime(crime["Occur Date"])
crime["id"] = crime["Case Number"]
crime = crime.drop(columns=["Occur Date", "Case Number"])
crime_dates = crime.set_index("date")
Started: 2019-02-01 20:31:47.668915
Ended: 2019-02-01 20:32:09.889378
Elapsed: 0:00:22.220463
weekly = crime_dates.resample("W").count()
plot = weekly.id.hvplot()
Embed(plot, "weekly_crime.js")()
That didn't work out as planned. It turns out that the data starts in 1972 but is mostly empty until around May of 2015. It also looks like January is missing values. I think I'll trim the data set.
Trimmed
crime_dates = crime_dates[(crime_dates.index >= datetime(2015, 5, 31))
& (crime_dates.index < datetime(2019, 1, 1))]
weekly = crime_dates.resample("W").count()
By Type
HoloViews uses a rather odd way of composing figures. Instead of the object-oriented way you might expect, it overrides the multiplication sign (* for adding to the same plot) and the addition sign (+ for adding an adjacent plot), so to plot the types I'll have to multiply their plots.
types = {name: crime_dates[crime_dates.type==name]
for name in crime_dates.type.unique()}
weekly_types = {name: data.resample("W").count()
for name, data in types.items()}
keys = list(weekly_types.keys())
first = keys[0]
plot = weekly_types[first].hvplot(y="id", label=first)
for key in keys[1:]:
plot *= weekly_types[key].hvplot(y="id", label=key)
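The loop works, but the same overlay can be built in one step - holoviews.Overlay accepts a list of elements (a sketch equivalent to the repeated multiplication above):
# one overlay from all the per-type curves
plot = holoviews.Overlay(
    [weekly_types[key].hvplot(y="id", label=key) for key in keys])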
It looks like it could use more trimming, but it also looks like it's mostly property crimes, which is what you'd expect, I guess. Actually I tried another trim and it looks like it always starts at zero because of the way the resampling works, so trimming doesn't make that first anomaly go away. Maybe trimming the weekly would help.
Looking a Little More at the Crimes
By Neighborhood
top_ten = crime_dates.Neighborhood.value_counts()[:10].reset_index()
print(table(top_ten, headers="Neighborhood Count".split()))
Neighborhood | Count |
---|---|
Downtown | 10237 |
Hazelwood | 10127 |
Lents | 5681 |
Powellhurst-Gilbert | 5605 |
Centennial | 5016 |
Old Town/Chinatown | 4966 |
Northwest | 4648 |
Montavilla | 4026 |
Pearl | 3905 |
Lloyd | 3699 |
neighborhoods = crime_dates["Neighborhood"]
neighborhoods = pandas.get_dummies(neighborhoods)
neighborhoods = neighborhoods[top_ten["index"]].resample("M").sum()
plot = (neighborhoods.hvplot(title="Top Ten Monthly Neighborhood Crime Counts")
+ neighborhoods.hvplot.table(columns=["Downtown", "Hazelwood",
"Lents", "Powellhurst-Gilbert"]))
Embed(plot, "neighborhoods")()
So the first thing to notice is that Downtown and Hazelwood dominate the case counts. There doesn't seem to be any strong upward or downward trend.
I live in Powellhurst-Gilbert, about a block north of Lents, and it looks like if you considered them one big neighborhood (they are adjacent), they would form the highest-crime neighborhood, but, sticking to the arbitrariness of the boundaries, we are relegated to numbers three and four.
Distribution
plot = neighborhoods.hvplot.kde(
by="Neighborhood",
title="Distributions of Top Ten Crime Neighborhoods")
Embed(plot, "neighborhoods_kde")()
I don't know what that mysterious bulge around zero is, all the neighborhoods are in the other peaks.
Irises
Since the previous data was time-series data, I thought I'd load a data set that isn't, to illustrate the use of the by parameter.
irises = load_iris()
print(irises.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
iris_data = pandas.DataFrame(irises.data, columns=irises.feature_names)
print(iris_data.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
I don't know where this convention came from, but you can use the by keyword to specify a categorical column to differentiate the data points. In this case I'll use it to differentiate the species.
target = pandas.Series(irises.target)
target_map = dict(zip(range(3), irises.target_names))
iris_data["target"] = target.apply(lambda x: target_map[x])
plot = iris_data.hvplot.scatter(x="sepal length (cm)", y="petal length (cm)",
by="target", alpha=0.5,
title="Iris Sepal Length vs Petal Length")
EmbedBokeh(plot, folder_path=FOLDER_PATH, file_name="irises.js")()
Scatter Matrix
plot = hvplot.scatter_matrix(iris_data, c="target")
Embed(plot, "iris_scatter_matrix")()
Parallel Coordinates
plot = hvplot.parallel_coordinates(iris_data, "target")
Embed(plot, "iris_parallel_coordinates")()
Portland Daily Temperatures Data
Table of Contents
Introduction
I'm going to work with the Daily Temperatures data set for Portland, Oregon (measured at the airport), taken from the National Weather Service. I cleaned it up a little already, removing the extra header rows and adding a missing column header (Metric), but the data is arranged with the year and month as columns and each day given its own column, which isn't how I want to work with it, so I'm going to transform it a little to make it more like what I expect it to look like.
Set Up
Imports
Python
from functools import partial
from datetime import datetime
from pathlib import Path
from typing import Union
import os
From PyPi
from dotenv import load_dotenv
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas
import seaborn
My Stuff
from graeae.timers import Timer
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
Plotting
get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
rc={"axes.grid": False,
"font.family": ["sans-serif"],
"font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
"figure.figsize": (8, 6)},
font_scale=1)
The Timer
TIMER = Timer()
The Embedder
Embed = partial(
EmbedBokeh,
folder_path="../../files/posts/portland-daily-climate/portland-daily-temperatures-data/")
Loading the Data
load_dotenv()
path = Path(os.environ.get("CSV")).expanduser()
print(path)
assert path.is_file()
/home/athena/data/datasets/necromuralist/daily-climate-data/portland_1940_to_april_2018.csv
Some Preparation
The first thing to work with is that there are three characters representing "missing" data (that I noticed): M, T, and -. We have to tell pandas about them when we use read_csv.
missing = ["M", "T", "-"]
I was going to load the measurement type column (e.g. "TX") as a categorical, but I realized that I'm planning to turn those values into column headers, so maybe it's not a good idea.
with TIMER:
data = pandas.read_csv(path, na_values=missing)
print(data.shape)
Started: 2019-03-10 18:50:58.399150
Ended: 2019-03-10 18:50:58.410684
Elapsed: 0:00:00.011534
(3756, 35)
print(data.columns)
Index(['YR', 'MO', 'Metric', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', 'AVG or Total'], dtype='object')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3756 entries, 0 to 3755
Data columns (total 35 columns):
YR              3756 non-null int64
MO              3756 non-null int64
Metric          3756 non-null object
1               3602 non-null float64
2               3554 non-null float64
3               3583 non-null float64
4               3604 non-null float64
5               3599 non-null float64
6               3610 non-null float64
7               3587 non-null object
8               3590 non-null float64
9               3595 non-null float64
10              3614 non-null float64
11              3602 non-null float64
12              3600 non-null float64
13              3583 non-null float64
14              3582 non-null float64
15              3591 non-null float64
16              3604 non-null float64
17              3598 non-null float64
18              3615 non-null float64
19              3611 non-null float64
20              3588 non-null float64
21              3606 non-null float64
22              3609 non-null float64
23              3595 non-null float64
24              3605 non-null float64
25              3598 non-null float64
26              3600 non-null float64
27              3598 non-null float64
28              3593 non-null float64
29              3371 non-null float64
30              3294 non-null float64
31              2097 non-null float64
AVG or Total    3616 non-null float64
dtypes: float64(31), int64(2), object(2)
memory usage: 1.0+ MB
None
For some reason column 7 wasn't converted to a float.
for index, row in enumerate(data["7"]):
try:
float(row)
except Exception as error:
print(error)
print("Row: {}".format(index))
print("Value: {}".format(row))
could not convert string to float:
Row: 1835
Value:
It turns out that this one row also had a space (' ') for one of the values. Strange.
missing.append(" ")
data = pandas.read_csv(path, na_values=missing)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3756 entries, 0 to 3755
Data columns (total 35 columns):
YR              3756 non-null int64
MO              3756 non-null int64
Metric          3756 non-null object
1               3602 non-null float64
2               3554 non-null float64
3               3583 non-null float64
4               3604 non-null float64
5               3599 non-null float64
6               3610 non-null float64
7               3586 non-null float64
8               3590 non-null float64
9               3595 non-null float64
10              3614 non-null float64
11              3602 non-null float64
12              3600 non-null float64
13              3583 non-null float64
14              3582 non-null float64
15              3591 non-null float64
16              3604 non-null float64
17              3598 non-null float64
18              3615 non-null float64
19              3611 non-null float64
20              3588 non-null float64
21              3606 non-null float64
22              3609 non-null float64
23              3595 non-null float64
24              3605 non-null float64
25              3598 non-null float64
26              3600 non-null float64
27              3598 non-null float64
28              3593 non-null float64
29              3371 non-null float64
30              3294 non-null float64
31              2097 non-null float64
AVG or Total    3616 non-null float64
dtypes: float64(32), int64(2), object(1)
memory usage: 1.0+ MB
None
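In retrospect, pandas has a more direct way to hunt for a value like that than my try/except loop (a sketch; run against the first read of the file, before ' ' was added to the missing list, it pulls out the offending row):
# coerce turns anything non-numeric into NaN, so compare against the original column
coerced = pandas.to_numeric(data["7"], errors="coerce")
print(data["7"][coerced.isna() & data["7"].notna()])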
Cleaning
Drop the Last Column
Besides being a calculated column, the last one is ambiguous (I guess you can tell by how big the value is whether it's a total or an average, but still), so I'm going to get rid of it (using drop).
cleaned = data.drop(data.columns[-1], axis="columns")
print(cleaned.shape)
assert len(cleaned.columns) == len(data.columns) - 1
(3756, 34)
Rotate the Days
Now I'm going to move the day-columns into row-values using melt.
melted = pandas.melt(cleaned, id_vars=["YR", "MO", "Metric"], var_name="Day", value_name="Value")
print(melted.head())
     YR  MO Metric Day  Value
0  1940  10     TX   1    NaN
1  1940  10     TN   1    NaN
2  1940  10     PR   1    NaN
3  1940  10     SN   1    NaN
4  1940  11     TX   1   52.0
print(melted.shape)
assert len(melted) == len(data) * 31
(116436, 5)
Casting the Days to Integers
Although they look like integers, the values in the Day column were converted from column headers, so they're strings. Maybe I could have cast them at the time of the conversion, but, oh, well.
print(type(melted.iloc[0].Day))
<class 'str'>
melted["Day"] = melted.Day.astype(int)
print(type(melted.iloc[0].Day))
<class 'numpy.int64'>
Make a Date Column
Now I'll make a single date column.
with TIMER:
melted["date"] = melted.apply(lambda row: datetime(year=row.YR,
month=row.MO,
day=row.Day),
axis="columns")
print(melted.head())
That raised an error:
ValueError: ('day is out of range for month', 'occurred at index 105184')
print(melted.iloc[105184])
YR         1941
MO            2
Metric       TX
Day          29
Value       NaN
Name: 105184, dtype: object
February 29? Was 1941 a leap year? According to wikipedia, leap years have to be divisible by four.
print(melted.iloc[105184].YR/4)
485.25
It doesn't look like there was a February 29 in 1941, so here we have a problem: not all the dates exist. Also, for some reason the '-' didn't get converted to a NaN, but one thing at a time.
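The standard library has a more trustworthy check than dividing by four (century years complicate the rule):
from calendar import isleap
print(isleap(1941))  # False - no February 29 in 1941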
def to_datetime(row: pandas.Series) -> Union[datetime, None]:
"""Converts the row to a datetime
Args:
row: row in the dataframe with year, month, and day
Returns:
row converted to datetime or None if it isn't valid
"""
if not pandas.isnull(row.Value):
try:
return datetime(year=row.YR, month=row.MO, day=row.Day)
except ValueError as error:
print(error)
return
with TIMER:
melted["date"] = melted.apply(to_datetime, axis="columns")
print(melted.head())
Started: 2019-03-10 18:56:57.314885
day is out of range for month
     YR  MO Metric  Day  Value       date
0  1940  10     TX    1    NaN        NaT
1  1940  10     TN    1    NaN        NaT
2  1940  10     PR    1    NaN        NaT
3  1940  10     SN    1    NaN        NaT
4  1940  11     TX    1   52.0 1940-11-01
Ended: 2019-03-10 18:57:01.094165
Elapsed: 0:00:03.779280
It looks like there was only one case where the date didn't exist, but there are multiple entries with missing values.
print("Fraction Missing: {:.2f}".format(
len(melted[melted.Value.isnull()])/len(melted)))
Fraction Missing: 0.06
Drop the Missing
I'll drop the dates that didn't have data.
cleaned = melted.dropna(subset=["Value"])
print(cleaned.head())
     YR  MO Metric  Day  Value       date
4  1940  11     TX    1  52.00 1940-11-01
5  1940  11     TN    1  40.00 1940-11-01
6  1940  11     PR    1   0.17 1940-11-01
7  1940  11     SN    1   0.00 1940-11-01
8  1940  12     TX    1  51.00 1940-12-01
Drop the Extra Date Columns
Since we have a date column I'll get rid of the columns that I used to make it.
cleaned = cleaned.drop(["YR", "MO", "Day"], axis="columns")
print(cleaned.head())
  Metric  Value       date
4     TX  52.00 1940-11-01
5     TN  40.00 1940-11-01
6     PR   0.17 1940-11-01
7     SN   0.00 1940-11-01
8     TX  51.00 1940-12-01
Figuring Out the Missing Date
One of the entries has values but no date.
print(cleaned[cleaned.date.isnull()])
       Metric  Value date
105427     SN   34.0  NaT
print(melted.iloc[105427])
YR        1946
MO           2
Metric      SN
Day         29
Value       34
date       NaT
Name: 105427, dtype: object
print(melted.iloc[105427].YR/4)
486.5
Okay, this is another non-leap year. What's going on?
print(data[(data.YR==1946) & (data.MO==2)])
       YR  MO Metric      1      2     3      4      5      6      7  ...  \
256  1946   2     TX  48.00  47.00  45.0  43.00  48.00  48.00  43.00  ...
257  1946   2     TN  44.00  35.00  32.0  32.00  37.00  39.00  33.00  ...
258  1946   2     PR   0.05   0.02   NaN   0.01   1.54   0.63   0.06  ...
259  1946   2     SN   0.00   0.00   0.0   0.00   0.00   0.00   0.00  ...

       23     24    25     26     27     28    29  30  31  AVG or Total
256  58.0  52.00  53.0  49.00  53.00  55.00   NaN NaN NaN         49.40
257  43.0  40.00  39.0  35.00  44.00  40.00   NaN NaN NaN         36.00
258   0.1   0.26   NaN   0.57   0.64   0.04   NaN NaN NaN          4.99
259   0.0   0.00   0.0   0.00   0.00   0.00  34.0 NaN NaN          0.00

[4 rows x 35 columns]
It looks like there's something wrong with the snowfall measurement for that date; the other measurements don't have values.
print(data[(data.YR==1946) & (data.MO==2) & (data.Metric=="SN")])
       YR  MO Metric    1    2    3    4    5    6    7  ...   23   24   25  \
259  1946   2     SN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0

      26   27   28    29  30  31  AVG or Total
259  0.0  0.0  0.0  34.0 NaN NaN           0.0

[1 rows x 35 columns]
It was just all 0's and then there's this mysterious 34 inches of snow on the 29th of February. I'm pretty sure that's a mistake. I'll have to delete that.
Although I have the index into the original data frame, I've already done all this cleaning, so I think it's easier just to drop the missing dates.
rows, columns = cleaned.shape
cleaned = cleaned.dropna(subset=["date"])
assert cleaned.shape[0] == rows - 1
Pivot the Metric Column
So, besides getting the dates into a column one of the points of this was to get the metric types into columns by pivoting. I guess you could argue that this is just a category, but since each date gets all four of the values I think this makes sense.
pivoted = cleaned.pivot(index="date", columns="Metric", values="Value")
print(pivoted.head())
Metric        PR   SN    TN    TX
date
1940-10-13  0.01  0.0  57.0  75.0
1940-10-14   NaN  0.0  53.0  70.0
1940-10-15   NaN  0.0  52.0  64.0
1940-10-16  0.00  0.0  50.0  72.0
1940-10-17  0.13  0.0  58.0  72.0
It looks like there's some missing precipitation data. I don't really have a solution for that. I think decisions to impute missing values should come when the data set is being used.
for metric in ("PR", "SN", "TN", "TX"):
print("{} Missing: {:,}".format(metric, len(pivoted[pivoted[metric].isnull()])))
PR Missing: 3,297
SN Missing: 523
TN Missing: 0
TX Missing: 0
So it looks like we're okay with the temperatures but maybe not so well off with the precipitation.
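For the record, if imputation does become necessary later, the simple options are one-liners in pandas (a sketch, not something I'm applying here):
# treat missing precipitation as no rain
filled = pivoted.PR.fillna(0)
# or interpolate over the datetime index
interpolated = pivoted.PR.interpolate(method="time")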
missing = pivoted[pivoted.PR.isnull()].copy()
missing.loc[:, "missing"] = 1
monthly = missing.missing.resample("M")
figure, axe = pyplot.subplots()
figure.suptitle("Missing Monthly Precipitation", weight="bold")
counts = monthly.count()
stem = axe.stem(counts.index, counts)
So, I was expecting this to be a problem that happened early and then died out, but it appears there's an ongoing problem with collecting precipitation - or maybe they use a symbol for 0 that I'm interpreting as missing? One suspect is T, which in NWS reports usually means a trace amount (effectively zero) and which I listed as a missing-value marker above.
yearly = missing.missing.resample("Y")
figure, axe = pyplot.subplots()
figure.suptitle("Missing Yearly Precipitation", weight="bold")
counts = yearly.count()
stem = axe.stem(counts.index, counts)
This does seem problematic, if I do anything with precipitation I'll have to figure out what's going on here.
Updating the Columns
The whole TX
, TN
, etc. encoding scheme seems like it causes too much mental overhead so I'm going to rename the metric columns.
renamed = pivoted.rename(dict(PR="precipitation",
SN="snowfall",
TN="minimum_temperature",
TX="maximum_temperature"),
axis="columns")
print(renamed.head())
Metric      precipitation  snowfall  minimum_temperature  maximum_temperature
date
1940-10-13           0.01       0.0                 57.0                 75.0
1940-10-14            NaN       0.0                 53.0                 70.0
1940-10-15            NaN       0.0                 52.0                 64.0
1940-10-16           0.00       0.0                 50.0                 72.0
1940-10-17           0.13       0.0                 58.0                 72.0
Save the Message Pack
Now that we have the cleaned-up data, I'll save it as a message pack.
pack_path = Path(os.environ.get("MESSAGE_PACK")).expanduser()
print(pack_path)
/home/hades/pCloudDrive/data/daily-climate-data/portland_1940_to_april_2018.msg
renamed.to_msgpack(pack_path)
assert pack_path.is_file()
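As a quick sanity check, I'll round-trip the file (read_msgpack was the matching reader in the pandas versions that shipped message-pack support):
# reload the message pack and make sure nothing was lost in transit
reloaded = pandas.read_msgpack(pack_path)
assert reloaded.shape == renamed.shape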
Looking at Some Plots
maximum_temperature = renamed.maximum_temperature.resample("Y")
medians = maximum_temperature.median()
maxes = maximum_temperature.max()
mins = maximum_temperature.min()
figure, axe = pyplot.subplots()
figure.suptitle("Portland, OR Yearly Daily Temperatures", weight="bold")
axe.stem(maxes.index, maxes, markerfmt="ro", label="Maximum")
axe.stem(mins.index, mins, markerfmt="go", label="Minimum")
stem = axe.stem(medians.index, medians, label="Median")
axe.set_xlabel("Year")
axe.set_ylabel("Temperature (F)")
legend = axe.legend(bbox_to_anchor=(1, 1))
maximum_temperature = renamed.maximum_temperature.resample("Y")
frame = pandas.DataFrame.from_dict(
{"Maximum": maximum_temperature.max(),
"Median": maximum_temperature.median(),
"Minimum": maximum_temperature.min(),
}
)
output = frame.hvplot(width=1000, height=600,
title="Mean Maximum Portland Temperatures Per Year",
fontsize="18pt")
Embed(output, "min_median_max")()
On the one hand, it's pretty neat what you get for so little code; on the other hand, it's not at all obvious how to fix the styling to make it a better plot.
Kaggle On Time-Series Visualization
Table of Contents
Introduction
This is a walk-through of the kaggle notebook on Time-Series Plotting by Aleksey Bilogur.
Set Up
Imports
From Python
from datetime import datetime
from functools import partial
from pathlib import Path
import os
From PyPi
from dotenv import load_dotenv
from bokeh.io.doc import curdoc
from bokeh.models import CrosshairTool, HoverTool
from bokeh.themes import Theme
from bokeh.palettes import Category20
from holoviews import opts
from holoviews.plotting.links import RangeToolLink
from hvplot import hvPlot
from pandas.plotting import autocorrelation_plot, lag_plot
from tabulate import tabulate
import holoviews
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
import seaborn
My Projects
from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
from graeae.tables import CountPercentage
Plotting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
seaborn.set(style="whitegrid",
rc={"axes.grid": False,
"font.family": ["sans-serif"],
"font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
"figure.figsize": (8, 6)},
font_scale=1)
Holoviews Backend
holoviews.extension("bokeh")
Bokeh
class Plots:
width = 1100
height = 600
font = "Open Sans"
font_size = "24pt"
line_width = 3
tools = ["hover"]
blue = seaborn.color_palette()[0]
light_blue = Category20[3][1]
red = seaborn.color_palette()[3]
yellow = seaborn.color_palette()[1]
green = seaborn.color_palette()[2]
gray = seaborn.color_palette()[7]
theme = Theme(json={
"attrs": {
"Figure": {
"text_font": "Open Sans",
"text_font_size": "18pt",
"line_color": Category20[3][0],
"plot_width": Plots.width,
"plot_height": Plots.height,
"tools": ["pan", "zoom_in", "hover", "reset"],
},
"Title": {
"text_font_style": "bold",
},
},
})
curdoc().theme = theme
Setup Libraries
load_dotenv()
table = partial(tabulate, headers="keys", tablefmt="orgtbl")
kaggle_path = Path(os.environ.get("KAGGLE")).expanduser()
assert kaggle_path.is_dir()
The Embedder
Embed = partial(
EmbedBokeh,
folder_path="../../files/posts/tutorials/kaggle-on-time-series-visualization/")
The Data
New York Stock Exchange Prices
nyse_path = kaggle_path.joinpath("nyse/prices.csv")
assert nyse_path.is_file()
nyse = pandas.read_csv(nyse_path, parse_dates=["date"])
nyse.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 851264 entries, 0 to 851263
Data columns (total 7 columns):
date      851264 non-null datetime64[ns]
symbol    851264 non-null object
open      851264 non-null float64
close     851264 non-null float64
low       851264 non-null float64
high      851264 non-null float64
volume    851264 non-null float64
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 45.5+ MB
nyse = nyse.set_index("date")
print(table(nyse.head()))
date | symbol | open | close | low | high | volume |
---|---|---|---|---|---|---|
2016-01-05 00:00:00 | WLTW | 123.43 | 125.84 | 122.31 | 126.25 | 2.1636e+06 |
2016-01-06 00:00:00 | WLTW | 125.24 | 119.98 | 119.94 | 125.54 | 2.3864e+06 |
2016-01-07 00:00:00 | WLTW | 116.38 | 114.95 | 114.93 | 119.74 | 2.4895e+06 |
2016-01-08 00:00:00 | WLTW | 115.48 | 116.62 | 113.5 | 117.44 | 2.0063e+06 |
2016-01-11 00:00:00 | WLTW | 117.01 | 114.97 | 114.09 | 117.33 | 1.4086e+06 |
The notebook describes this as an example of a "strong" date case because the dates act as an explicit index for the data and are, in this case, an aggregate for a day of trading.
UPS
Some of the correlational plots don't show anything meaningful when you use the market as a whole (I guess because different stocks are moving in different directions) so I'm going to pull out the UPS stock information to use later.
ups = nyse[nyse.symbol=="UPS"]
print(ups.shape)
(1762, 6)
Shelter Outcomes
shelter_path = kaggle_path.joinpath(
"austin-animal-center-shelter-outcomes/aac_shelter_outcomes.csv")
assert shelter_path.is_file()
shelter = pandas.read_csv(shelter_path, parse_dates=["datetime", "date_of_birth"])
shelter.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    78248 non-null object
animal_id           78256 non-null object
animal_type         78256 non-null object
breed               78256 non-null object
color               78256 non-null object
date_of_birth       78256 non-null datetime64[ns]
datetime            78256 non-null datetime64[ns]
monthyear           78256 non-null object
name                54370 non-null object
outcome_subtype     35963 non-null object
outcome_type        78244 non-null object
sex_upon_outcome    78254 non-null object
dtypes: datetime64[ns](2), object(10)
memory usage: 7.2+ MB
Some of the columns are only identifiers (like a name) so we'll drop them to make it easier to inspect the data (although we aren't really going to do anything with it here anyway).
shelter = shelter[["outcome_type", "age_upon_outcome", "datetime",
"animal_type", "breed", "color", "sex_upon_outcome",
"date_of_birth"]]
print(table(shelter.head(), showindex=False))
outcome_type | age_upon_outcome | datetime | animal_type | breed | color | sex_upon_outcome | date_of_birth |
---|---|---|---|---|---|---|---|
Transfer | 2 weeks | 2014-07-22 16:04:00 | Cat | Domestic Shorthair Mix | Orange Tabby | Intact Male | 2014-07-07 00:00:00 |
Transfer | 1 year | 2013-11-07 11:47:00 | Dog | Beagle Mix | White/Brown | Spayed Female | 2012-11-06 00:00:00 |
Adoption | 1 year | 2014-06-03 14:20:00 | Dog | Pit Bull | Blue/White | Neutered Male | 2013-03-31 00:00:00 |
Transfer | 9 years | 2014-06-15 15:50:00 | Dog | Miniature Schnauzer Mix | White | Neutered Male | 2005-06-02 00:00:00 |
Euthanasia | 5 months | 2014-07-07 14:04:00 | Other | Bat Mix | Brown | Unknown | 2014-01-07 00:00:00 |
The notebook describes this as an example of a "weak" date case because the dates are only there for record-keeping and, while they might be significant for modeling, aren't acting as an index for the records.
Cryptocurrency
currency_path = kaggle_path.joinpath("all-crypto-currencies/crypto-markets.csv")
assert currency_path.is_file()
currency = pandas.read_csv(currency_path, parse_dates=["date"])
currency = currency.set_index("date")
print(table(currency.head(), showindex=True))
date | slug | symbol | name | ranknow | open | high | low | close | volume | market | close_ratio | spread |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-04-28 00:00:00 | bitcoin | BTC | Bitcoin | 1 | 135.3 | 135.98 | 132.1 | 134.21 | 0 | 1.48857e+09 | 0.5438 | 3.88 |
2013-04-29 00:00:00 | bitcoin | BTC | Bitcoin | 1 | 134.44 | 147.49 | 134 | 144.54 | 0 | 1.60377e+09 | 0.7813 | 13.49 |
2013-04-30 00:00:00 | bitcoin | BTC | Bitcoin | 1 | 144 | 146.93 | 134.05 | 139 | 0 | 1.54281e+09 | 0.3843 | 12.88 |
2013-05-01 00:00:00 | bitcoin | BTC | Bitcoin | 1 | 139 | 139.89 | 107.72 | 116.99 | 0 | 1.29895e+09 | 0.2882 | 32.17 |
2013-05-02 00:00:00 | bitcoin | BTC | Bitcoin | 1 | 116.38 | 125.6 | 92.28 | 105.21 | 0 | 1.16852e+09 | 0.3881 | 33.32 |
Grouping
Birth Dates
Per Date
Here's a plot of the birth dates of the animals in the shelter.
births = shelter.date_of_birth.value_counts()
births_peak = births.idxmax()
births = births.reset_index().sort_values(by="index")
births.columns = ["birth_date", "Births"]
hover = HoverTool(
tooltips=[
("Date", "@birth_date{%Y-%m-%d}"),
("Births", "@Births{0,0}"),
],
formatters= {"birth_date": "datetime",
"Births": "numeral"},
mode="vline",
)
line = holoviews.VLine(births_peak)
curve = holoviews.Curve(
births, ("birth_date", "Date of Birth"), "Births",
)
main = curve.relabel("Count of Births By Date").opts(labelled=["y"],
tools=[hover],
height=Plots.height,
ylabel="Births",
xaxis=None)
range_finder = curve.opts(height=100, yaxis=None, default_tools=[],
xlabel="Birth Dates")
link = RangeToolLink(range_finder, main)
combination = (main * line + range_finder * line)
layout = combination.opts(
opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
opts.Curve(width=Plots.width,
color=Category20[3][0],
fontsize=Plots.font_size,
line_width=Plots.line_width),
opts.VLine(color=Plots.red, line_dash="dotted")
).cols(1)
Embed(layout, "shelter_births")()
It looks like there was an upward trend until about 2016 when it started to taper off, but since we're counting by days there's a lot of variance, so we're going to group the data using pandas' resample method.
Note: One interesting problem I found is that unless I zoom in I can't get my mouse to trigger the hover-tool for the day with the greatest value (May 5, 2014).
There are a couple of different ways to do the grouping of the days, but the simplest is to take the count for each date using value_counts. This leaves us with a Series with the dates in the index and the counts as values. Once we have this we can aggregate the dates by year and count how many births there were per year.
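As an aside, you can get the same yearly totals in one line by grouping on the calendar year directly. I'm using resample because it keeps a datetime index, which is handier for plotting, but here's a sketch of the alternative:
# group on the calendar year directly; this gives integer years in the
# index rather than the year-end timestamps that resample produces
by_year = shelter.date_of_birth.dt.year.value_counts().sort_index()
print(by_year.tail())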
By Year
First I'll get the counts for each day using value_counts and print the first few values to see what it looks like. Calling reset_index changes the Series to a DataFrame with the dates as a column.
counts = shelter.date_of_birth.value_counts()
print(table(counts.head().reset_index(), showindex=False))
index | date_of_birth |
---|---|
2014-05-05 00:00:00 | 112 |
2015-09-01 00:00:00 | 110 |
2014-04-21 00:00:00 | 105 |
2015-04-28 00:00:00 | 104 |
2016-05-01 00:00:00 | 102 |
Now we can aggregate the birth-counts by year using resample.
year_counts = counts.resample("Y")
print(year_counts)
DatetimeIndexResampler [freq=<YearEnd: month=12>, axis=0, closed=right, label=right, convention=start, base=0]
Note that this is a grouper; we don't get what we want until we call a method (like count) on it. In this case, since we already have value counts, we want to sum all of the counts for each year (so we need sum).
Now I'm going to aggregate the yearly counts using the sum method.
sums = year_counts.sum()
Calling sum gives us a Series with the dates in the index and the sums as the values.
print(sums.head())
1991-12-31    1
1992-12-31    1
1993-12-31    1
1994-12-31    9
1995-12-31    7
Freq: A-DEC, Name: date_of_birth, dtype: int64
The idxmax method gives us the index of the greatest value, and since the dates are in the index, using it gives us the year with the most births, which I'll call sum_peak.
sum_peak = sums.idxmax()
As you may have noticed, all the dates are set to December 31, but for plotting it's better to have them set to January 1, so I'll fix that here and do some other cleanup.
sums = sums.reset_index()
sums.columns = ["birth_date", "Births"]
sum_peak = datetime(sum_peak.year, 1, 1)
sums["birth_date"] = sums.birth_date.apply(lambda date: datetime(date.year, 1, 1))
And now for the plot.
- The Tools
First set up the tools
hover = HoverTool(
    tooltips=[
        ("Year", "@birth_date{%Y}"),
        ("Births", "@Births{0,0}"),
    ],
    formatters={"birth_date": "datetime",
                "Births": "numeral"},
    mode="vline",
)
- The Plot Parts
Now I'll create our plotting objects.
The vertical line will mark the peak year.
line = holoviews.VLine(sum_peak, label=sum_peak.strftime("%Y"))
And I'm going to add an annotation to it.
x = datetime(sum_peak.year, 3, 1)
text = holoviews.Text(x, sums.max()[1]/4,
                      "Max Year: {}".format(sum_peak.year),
                      halign="left")
Now our data-curve.
curve = holoviews.Curve(
    sums, ("birth_date", "Date of Birth"), "Births",
)
Next I'll make two copies of the curve - main will be the larger curve and range_finder will create a plot below it that lets us select a range of dates; the two get linked using the RangeToolLink.
main = curve.relabel("Births Per Year (1991 - 2017)").opts(
    labelled=["y"],
    tools=[hover],
    xaxis=None,
    ylabel="Births",
    height=Plots.height)
range_finder = curve.opts(height=100, yaxis=None, default_tools=[],
                          xlabel="Year")
link = RangeToolLink(range_finder, main)
Now combine the parts to make our visible plot.
combination = (line * main * text + line * range_finder)
This next bit is to set some styling on the plot.
- The Options
layout = combination.opts(
    opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
    opts.Curve(width=Plots.width,
               color=Plots.blue,
               padding=0.01,
               fontsize=Plots.font_size,
               line_width=Plots.line_width),
    opts.VLine(color=Plots.red, line_dash="dotted")
).cols(1)
- Embed
Finally, create the javascript and embed it in this notebook.
Embed(layout, "shelter_births_per_year")()
Lollipop Plot
An alternative way to look at this would be a lollipop plot.
# The Tools
hover = HoverTool(
tooltips=[
("Year", "@birth_date{%Y}"),
("Births", "@Births{0,0}"),
],
formatters= {"birth_date": "datetime",
"Births": "numeral"},
mode="vline",
)
# The Parts
line = holoviews.VLine(sum_peak, label=sum_peak.strftime("%Y"))
spikes = holoviews.Spikes(sums, ("birth_date", "Date of Birth"), "Births")
circles = holoviews.Scatter(sums, "birth_date", "Births")
# The Range Finder
main = circles.relabel().opts(
labelled=["y"],
tools=[hover],
xaxis=None,
ylabel="Births",
height=Plots.height,
size=10,
padding=(0, (0, 0.1)))
range_finder = circles.opts(height=100,
yaxis=None,
default_tools=[],
size=5,
fontsize={"ticks": "14pt"},
xlabel="Year of Birth")
link = RangeToolLink(range_finder, main)
# The Layout
combination = (spikes * line * main + spikes * line * range_finder)
# The Styling Options
layout = combination.opts(
opts.Layout(shared_axes=False,
merge_tools=False,
title="Shelter Animal Births Per Year (1991- 2017)",
show_title=True,
fontsize=Plots.font_size),
opts.Spikes(width=Plots.width,
color=Plots.red,
fontsize=Plots.font_size,
line_width=Plots.line_width),
opts.Scatter(color=Plots.blue, fontsize={"ticks": "14pt"}, legend_position="left"),
opts.VLine(color=Plots.green),
).cols(1)
# The HTML and Javascript
Embed(layout, "births_per_year_spikes")()
Note that putting the title in the Layout changes the font. I was trying to set it to Open Sans but HoloViews is horribly documented for most things so I couldn't figure out how to do it.
Animal Shelter Outcomes
While knowing the birthdates of the animals in the shelter is interesting, what about the dates when their cases were resolved? I originally called this Animal Shelter Adoptions but "outcome" doesn't always mean "adopted", unfortunately.
CountPercentage(shelter.outcome_type)()
Value | Count | Percentage |
---|---|---|
Adoption | 33112 | 42.32 |
Transfer | 23499 | 30.03 |
Return to Owner | 14354 | 18.35 |
Euthanasia | 6080 | 7.77 |
Died | 680 | 0.87 |
Disposal | 307 | 0.39 |
Rto-Adopt | 150 | 0.19 |
Missing | 46 | 0.06 |
Relocate | 16 | 0.02 |
I don't know what Disposal means, but it doesn't sound good. Neither does Missing, really, especially if there are any restaurants nearby. Anyway, on to the plotting. I'll aggregate the outcome-counts by year.
outcome_counts = shelter.datetime.value_counts()
outcomes = outcome_counts.resample("Y").sum()
print(table(outcome_counts.head().reset_index(), showindex=False))
outcomes = outcomes.reset_index()
outcomes.columns = ["date", "count"]
outcomes["date"] = outcomes.date.apply(lambda date: datetime(date.year, 1, 1))
index | datetime |
---|---|
2016-04-18 00:00:00 | 39 |
2015-08-11 00:00:00 | 25 |
2017-10-17 00:00:00 | 25 |
2015-11-17 00:00:00 | 22 |
2015-07-02 00:00:00 | 22 |
This next part isn't really necessary but I think keeping the names consistent is helpful, especially since I was struggling so much with HoloViews and didn't need the extra confusion about column-names being wrong.
sums = sums.rename(columns=dict(birth_date="date", Births="count"))
This is going to be like the previous plot but I'm going to add a crosshair tool to make it easier to see how things line up with the axis.
# The Tools
hover = HoverTool(
tooltips=[
("Year", "@date{%Y}"),
("Count", "@count{0,0}"),
],
formatters= {"date": "datetime",
"count": "numeral"},
mode="vline",
)
crosshairs = CrosshairTool(line_color=Plots.light_blue)
# The Parts
births = holoviews.Scatter(sums, "date", "count", label="Births")
outcome_circles = holoviews.Scatter(outcomes, "date", "count",
group="outcome", label="Outcomes")
spikes = holoviews.Spikes(outcomes, ("date", 'Year'), ("count", "Count"),
group="outcome")
# The Layout
combination = spikes * outcome_circles * births
# The Styling
layout = combination.opts(
opts.Layout(shared_axes=False,
height=Plots.height,
merge_tools=False,
show_title=True,
fontsize=Plots.font_size),
opts.Spikes(width=Plots.width,
height=Plots.height,
title="Shelter Animal Births vs Outcomes Per Year",
show_title=True,
fontsize=Plots.font_size,
padding=(0, (0, .1)),
color=Plots.blue,
line_width=Plots.line_width),
opts.Scatter("outcome", color=Plots.blue, size=10, legend_position="top_left"),
opts.Scatter(fontsize={"ticks": "14pt"}, color=Plots.red, size=10,
tools=[hover, crosshairs]),
)
# The HTML
Embed(layout, "outcome_lollipops")()
You can see that there are only six years of adoption outcomes although there are sixteen years of birth dates, with a sudden uptick to the peak year of 2014. It's interesting that the births drop off much faster than the outcomes - the animals seemed to be getting older for some reason.
Trading Volume
The previous plot was a count-plot. You can also use other summary-statistics like a mean to see how things changed over time. I'll plot the mean volume per year for the New York Stock Exchange.
volume = nyse.volume.resample("Y")
means = volume.mean().reset_index()
means["date"] = means.date.apply(lambda date: datetime(date.year, 1, 1))
Along with the standard deviations.
deviations = volume.std().reset_index()
means["two_sigma"] = means.volume + 2 * deviations.volume
And now our plot.
# The Tools
hover = HoverTool(
tooltips=[
("Year", "@date{%Y}"),
("Volume", "@volume{0,0.00}"),
],
formatters= {"date": "datetime",
"volume": "numeral",
},
mode="vline",
)
# The Parts
top_spread = holoviews.ErrorBars((means.date, means.volume, means.two_sigma),
group="volume")
volume_curve = holoviews.Curve(means,
("date", "Year"),
("volume", "Volume"),
group="volume")
zero_line = holoviews.HLine(0)
# The Layout
layout = volume_curve * top_spread * zero_line
# The Styling
layout = layout.opts(
opts.Layout(shared_axes=False,
height=Plots.height,
merge_tools=False,
show_title=True,
fontsize=Plots.font_size),
opts.Curve(width=Plots.width,
height=Plots.height,
title="Mean NYSE Trading Volume Per Year",
show_title=True,
fontsize=Plots.font_size,
padding=(0, (0, .1)),
color=Plots.blue,
line_width=Plots.line_width,
tools=[hover]),
opts.HLine(line_color=Plots.gray)
)
# The HTML
Embed(layout, "stock_mean_volume")()
While the standard deviation is important, in this case it's so large that it smashes the mean down flat (although maybe the fact that it's so large tells us that the mean isn't so accurate).
hover = HoverTool(
tooltips=[
("Year", "@date{%Y}"),
("Volume", "@volume{0,0.00}"),
],
formatters= {"date": "datetime",
"volume": "numeral"},
mode="vline",
)
volume_circles = holoviews.Scatter(means, "date", "volume")
volume_spikes = holoviews.Spikes(means, ("date", "Date"),
("volume", "Volume"))
combination = volume_spikes * volume_circles
crosshairs = CrosshairTool(line_color=Plots.light_blue, dimensions="height")
layout = combination.opts(
opts.Layout(shared_axes=False,
height=Plots.height,
merge_tools=False,
show_title=True,
fontsize=Plots.font_size),
opts.Spikes(width=Plots.width,
height=Plots.height,
title="NYSE Mean Annual Trading Volume",
show_title=True,
fontsize=Plots.font_size,
padding=(0, (0, .1)),
color=Plots.blue,
line_width=Plots.line_width),
opts.Scatter(color=Plots.blue,
size=10,
tools=[hover, crosshairs]),
)
Embed(layout, "stock_lollipops")()
I took the cross-hairs out of the plot with the standard deviations, but they were (a little) more helpful for the lollipop plots because you have to be directly above a point to trigger the hover tool, whereas in the Curve plot you can be above any part of a segment and it triggers.
Lag Plots
The Lag Plot helps you check whether there is any significance to the ordering of the data. You plot each value against the value that follows it (e.g. one day against the next day). If the ordering doesn't matter, the plot will look random.
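There isn't much magic to it: lag_plot is essentially a scatter of y(t) against y(t + 1). Here's a minimal hand-rolled equivalent, assuming series is any pandas Series whose ordering is meaningful:
# scatter each value against the one that follows it
values = series.dropna().to_numpy()
figure, axe = pyplot.subplots()
axe.scatter(values[:-1], values[1:], s=2)
axe.set_xlabel("y(t)")
axe.set_ylabel("y(t + 1)")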
NYSE
The lag_plot function isn't one of the DataFrame methods, so I don't think it will work with HoloViews, although I haven't tried yet.
volume = nyse.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("NYSE Volume Lag Plot", weight="bold")
subplot = lag_plot(volume.sum().tail(365), ax=axe)
So the center points do seem to show a relationship, with the next day's volume rising along with the previous day's volume, but I don't know what those bands around 0 are. One thing I noticed is that there are holidays in the data.
print(volume.sum().index[-6])
2016-12-25 00:00:00
And there are also weekends in there.
print(volume.sum().index[-13].strftime("%a"))
Sun
So it's likely that there are days in there where there was no trading and so they won't correlate with the days that preceded the start of a break or the ones that followed the end of a break. I think. I don't really know if there's trading all year round.
volume_sums = volume.sum()
for day in volume_sums[volume_sums==0][-9:].index:
print("{} {}".format(day.strftime("%a"), day))
Sat 2016-12-03 00:00:00
Sun 2016-12-04 00:00:00
Sat 2016-12-10 00:00:00
Sun 2016-12-11 00:00:00
Sat 2016-12-17 00:00:00
Sun 2016-12-18 00:00:00
Sat 2016-12-24 00:00:00
Sun 2016-12-25 00:00:00
Mon 2016-12-26 00:00:00
So it does look like the zeros are weekends and holidays.
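If that's right, dropping the no-trading days before making the lag plot should get rid of those bands around zero. A sketch, reusing volume_sums from above:
# keep only days with actual trading and re-draw the lag plot
trading_days = volume_sums[volume_sums > 0]
figure, axe = pyplot.subplots()
figure.suptitle("NYSE Volume Lag Plot (Trading Days Only)", weight="bold")
subplot = lag_plot(trading_days.tail(365), ax=axe)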
UPS
Here's what just the UPS trading volumes look like.
figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Lag Plot", weight="bold")
subplot = lag_plot(ups.volume, ax=axe)
I don't know why but that makes it look better. I guess the market as a whole doesn't move quite so well together day by day as a single stock does.
Autocorrelation Plot
UPS
figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Daily Autocorrelation", weight="bold")
subplot = autocorrelation_plot(ups.volume, ax=axe)
This plot shows the correlation of the series with itself over different lag intervals. It looks like up to about 500 days of lag the correlation is positive, but it starts to become more negative after that. The horizontal lines are the confidence intervals: the solid grey lines are the 95% interval and the dashed grey lines are the 99% interval. The points that fall outside these intervals are statistically significant.
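As a rough check on that reading, the y-value at lag k is just the correlation of the series with itself shifted by k, which pandas' Series.autocorr computes directly (it uses a slightly different estimator than autocorrelation_plot, so the numbers won't match exactly, but the signs should agree):
# positive at short lags, drifting negative at long ones (if the plot is right)
print(ups.volume.autocorr(lag=1))
print(ups.volume.autocorr(lag=500))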
Cryptocurrency
Lag Plot
crypto_daily = currency.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("Cryptocurrency Volume Lag Plot", weight="bold")
subplot = lag_plot(crypto_daily.sum(), ax=axe)
Unlike the stock-exchange, the cryptocurrencies seem to move together and don't take days off.
Autocorrelation Plot
figure, axe = pyplot.subplots()
figure.suptitle("Dogecoin Auto Correlation", weight="bold")
dogecoin = currency[currency.name=="Dogecoin"]
subplot = autocorrelation_plot(dogecoin.volume, ax=axe)
If my understanding of how this plot works is correct, there is some kind of significance to lags of 125 and 250 days. Is this really true? Possibly.