Kaggle On Time-Series Visualization

Introduction

This is a walk-through of the kaggle notebook on Time-Series Plotting by Aleksey Bilogur.

Set Up

Imports

From Python

from datetime import datetime
from functools import partial
from pathlib import Path
import os

From PyPi

from dotenv import load_dotenv
from bokeh.io.doc import curdoc
from bokeh.models import CrosshairTool, HoverTool
from bokeh.themes import Theme
from bokeh.palettes import Category20
from holoviews import opts
from holoviews.plotting.links import RangeToolLink
from hvplot import hvPlot
from pandas.plotting import autocorrelation_plot, lag_plot
from tabulate import tabulate
import holoviews
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
import seaborn

My Projects

from bartleby_the_penguin.tangles.embed_bokeh import EmbedBokeh
from graeae.tables import CountPercentage

Plotting

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)

Holoviews Backend

holoviews.extension("bokeh")

Bokeh

class Plots:
    width = 1100
    height = 600
    font = "Open Sans"
    font_size = "24pt"
    line_width = 3
    tools =  ["hover"]
    blue = seaborn.color_palette()[0]
    light_blue = Category20[3][1]
    red = seaborn.color_palette()[3]
    yellow = seaborn.color_palette()[1]
    green = seaborn.color_palette()[2]
    gray = seaborn.color_palette()[7]

theme = Theme(json={
    "attrs": {
        "Figure": {
            "text_font": "Open Sans",
            "text_font_size": "18pt",
            "line_color": Category20[3][0],
            "plot_width": Plots.width,
            "plot_height": Plots.height,
            "tools": ["pan", "zoom_in", "hover", "reset"],
        },
        "Title": {
            "text_font_style": "bold",
        },
    },
})
curdoc().theme = theme

Setup Libraries

load_dotenv()
table = partial(tabulate, headers="keys", tablefmt="orgtbl")
kaggle_path = Path(os.environ.get("KAGGLE")).expanduser()
assert kaggle_path.is_dir()

The Embedder

Embed = partial(
    EmbedBokeh, 
    folder_path="../../files/posts/tutorials/kaggle-on-time-series-visualization/")

The Data

New York Stock Exchange Prices

nyse_path = kaggle_path.joinpath("nyse/prices.csv")
assert nyse_path.is_file()
nyse = pandas.read_csv(nyse_path, parse_dates=["date"])
nyse.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 851264 entries, 0 to 851263
Data columns (total 7 columns):
date      851264 non-null datetime64[ns]
symbol    851264 non-null object
open      851264 non-null float64
close     851264 non-null float64
low       851264 non-null float64
high      851264 non-null float64
volume    851264 non-null float64
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 45.5+ MB
nyse = nyse.set_index("date")
print(table(nyse.head()))
date symbol open close low high volume
2016-01-05 00:00:00 WLTW 123.43 125.84 122.31 126.25 2.1636e+06
2016-01-06 00:00:00 WLTW 125.24 119.98 119.94 125.54 2.3864e+06
2016-01-07 00:00:00 WLTW 116.38 114.95 114.93 119.74 2.4895e+06
2016-01-08 00:00:00 WLTW 115.48 116.62 113.5 117.44 2.0063e+06
2016-01-11 00:00:00 WLTW 117.01 114.97 114.09 117.33 1.4086e+06

The notebook describes this as an example of a "strong" date case because the dates act as an explicit index for the data and are, in this case, an aggregate for a day of trading.

UPS

Some of the correlational plots don't show anything meaningful when you use the market as a whole (I guess because different stocks are moving in different directions) so I'm going to pull out the UPS stock information to use later.

ups = nyse[nyse.symbol=="UPS"]
print(ups.shape)
(1762, 6)

Shelter Outcomes

shelter_path = kaggle_path.joinpath(
    "austin-animal-center-shelter-outcomes/aac_shelter_outcomes.csv")
assert shelter_path.is_file()
shelter = pandas.read_csv(shelter_path, parse_dates=["datetime", "date_of_birth"])
shelter.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    78248 non-null object
animal_id           78256 non-null object
animal_type         78256 non-null object
breed               78256 non-null object
color               78256 non-null object
date_of_birth       78256 non-null datetime64[ns]
datetime            78256 non-null datetime64[ns]
monthyear           78256 non-null object
name                54370 non-null object
outcome_subtype     35963 non-null object
outcome_type        78244 non-null object
sex_upon_outcome    78254 non-null object
dtypes: datetime64[ns](2), object(10)
memory usage: 7.2+ MB

Some of the columns are only identifiers (like a name) so we'll drop them to make it easier to inspect the data (although we aren't really going to do anything with it here anyway).

shelter = shelter[["outcome_type", "age_upon_outcome", "datetime",
                   "animal_type", "breed", "color", "sex_upon_outcome",
                   "date_of_birth"]]
print(table(shelter.head(), showindex=False))
outcome_type age_upon_outcome datetime animal_type breed color sex_upon_outcome date_of_birth
Transfer 2 weeks 2014-07-22 16:04:00 Cat Domestic Shorthair Mix Orange Tabby Intact Male 2014-07-07 00:00:00
Transfer 1 year 2013-11-07 11:47:00 Dog Beagle Mix White/Brown Spayed Female 2012-11-06 00:00:00
Adoption 1 year 2014-06-03 14:20:00 Dog Pit Bull Blue/White Neutered Male 2013-03-31 00:00:00
Transfer 9 years 2014-06-15 15:50:00 Dog Miniature Schnauzer Mix White Neutered Male 2005-06-02 00:00:00
Euthanasia 5 months 2014-07-07 14:04:00 Other Bat Mix Brown Unknown 2014-01-07 00:00:00

The notebook describes this as an example of a "weak" date case because the dates are only there for record-keeping and, while they might be significant for modeling, aren't acting as an index for the records.

Cryptocurrency

currency_path = kaggle_path.joinpath("all-crypto-currencies/crypto-markets.csv")
assert currency_path.is_file()
currency = pandas.read_csv(currency_path, parse_dates=["date"])
currency = currency.set_index("date")
print(table(currency.head(), showindex=True))
date slug symbol name ranknow open high low close volume market close_ratio spread
2013-04-28 00:00:00 bitcoin BTC Bitcoin 1 135.3 135.98 132.1 134.21 0 1.48857e+09 0.5438 3.88
2013-04-29 00:00:00 bitcoin BTC Bitcoin 1 134.44 147.49 134 144.54 0 1.60377e+09 0.7813 13.49
2013-04-30 00:00:00 bitcoin BTC Bitcoin 1 144 146.93 134.05 139 0 1.54281e+09 0.3843 12.88
2013-05-01 00:00:00 bitcoin BTC Bitcoin 1 139 139.89 107.72 116.99 0 1.29895e+09 0.2882 32.17
2013-05-02 00:00:00 bitcoin BTC Bitcoin 1 116.38 125.6 92.28 105.21 0 1.16852e+09 0.3881 33.32

Grouping

Birth Dates

Per Date

Here's a plot of the birth dates of the animals in the shelter.

births = shelter.date_of_birth.value_counts()
births_peak = births.idxmax()
births = births.reset_index().sort_values(by="index")
births.columns = ["birth_date", "Births"]
hover = HoverTool(
tooltips=[
    ("Date", "@birth_date{%Y-%m-%d}"),
    ("Births", "@Births{0,0}"),
],
    formatters= {"birth_date": "datetime", 
                 "Births": "numeral"},
    mode="vline",
)
line = holoviews.VLine(births_peak)
curve = holoviews.Curve(
    births, ("birth_date", "Date of Birth"), "Births",
)

main = curve.relabel("Count of Births By Date").opts(labelled=["y"], 
                                                     tools=[hover], 
                                                     height=Plots.height, 
                                                     ylabel="Births", 
                                                     xaxis=None)
range_finder = curve.opts(height=100, yaxis=None, default_tools=[], 
                          xlabel="Birth Dates")

link = RangeToolLink(range_finder, main)

combination = (main * line + range_finder * line)

layout = combination.opts(
    opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
    opts.Curve(width=Plots.width, 
               color=Category20[3][0], 
               fontsize=Plots.font_size,
               line_width=Plots.line_width),
    opts.VLine(color=Plots.red, line_dash="dotted")
).cols(1)
Embed(layout, "shelter_births")()

It lools like there was an upward trend until about 2016 when it started to taper off, but since we're counting by days there's a lot of variance so we're going to group the data using pandas' resample method.

Note: One interesting problem I found is that unless I zoom in I can't get my mouse to trigger the hover-tool for the day with the greatest value (May 5, 2014).

There's a couple of different ways to do the grouping of the days, but the simplest way is to take the count for each date using value_counts. This will leave us with a Series with the dates in the index and the counts as values. Once we have this we can aggregate the dates by year and then count how many births there were per year.

By Year

First I'll get the counts for each day using value_counts and print off the first values to see what it looks like. Calling reset_index changes the Series to a DataFrame with the dates as a column.

counts = shelter.date_of_birth.value_counts()
print(table(counts.head().reset_index(), showindex=False))
index date_of_birth
2014-05-05 00:00:00 112
2015-09-01 00:00:00 110
2014-04-21 00:00:00 105
2015-04-28 00:00:00 104
2016-05-01 00:00:00 102

Now we can aggregate the birth-counts by year using resample.

year_counts = counts.resample("Y")
print(year_counts)
DatetimeIndexResampler [freq=<YearEnd: month=12>, axis=0, closed=right, label=right, convention=start, base=0]

Note that this is a grouper, we don't get what we want until we call a method (like count) on it. In this case since we have value counts we want to sum all of the counts for a year (so we need sum).

Now I'm going to aggregate the yearly counts using the sum method.

sums = year_counts.sum()

Calling sum gives us a Series with the dates in the index and the sums as the values.

print(sums.head())
1991-12-31    1
1992-12-31    1
1993-12-31    1
1994-12-31    9
1995-12-31    7
Freq: A-DEC, Name: date_of_birth, dtype: int64

The idxmax method gives us the index of the greatest value and since the dates are in the index, using it will give us the date of the year with the most births, which I'll call sum_peak.

sum_peak = sums.idxmax()

As you may have noticed, all the dates are set for December 31, but for plotting it's better to have them set to January 1 so I'll set it here and do a some other cleanup.

sums = sums.reset_index()
sums.columns = ["birth_date", "Births"]
sum_peak = datetime(sum_peak.year, 1, 1)
sums["birth_date"] = sums.birth_date.apply(lambda date: datetime(date.year, 1, 1))

And now for the plot.

  • The Tools

    First set up the tools

    hover = HoverTool(
    tooltips=[
        ("Year", "@birth_date{%Y}"),
        ("Births", "@Births{0,0}"),
    ],
        formatters= {"birth_date": "datetime", 
                     "Births": "numeral"},
        mode="vline",
    )
    
  • The Plot Parts

    Now I'll create our plotting objects.

    The vertical line will mark the peak year.

    line = holoviews.VLine(sum_peak, label=sum_peak.strftime("%Y"))
    

    And I'm going to add an annotation to it.

    x = datetime(sum_peak.year, 3, 1)
    text = holoviews.Text(x, sums.max()[1]/4,
                          "Max Year: {}".format(sum_peak.year), 
                          halign="left")
    

    Now our data-curve.

    curve = holoviews.Curve(
        sums, ("birth_date", "Date of Birth"), "Births",
    )
    

    Next I'll make two copies of the curve - main will be the larger curve and range_finder will create a plot below it to let us select a range of dates which get linked using the RangeToolLink.

    main = curve.relabel("Births Per Year (1991- 2017)").opts(
        labelled=["y"], 
        tools=[hover], 
        xaxis=None,
        ylabel="Births",
        height=Plots.height)
    range_finder = curve.opts(height=100, yaxis=None, 
                              default_tools=[],
                              xlabel="Year")
    
    link = RangeToolLink(range_finder, main)
    

    Now combine the parts to make our visible plot.

    combination = (line * main * text + line * range_finder)
    

    This next bit is to set some styling on the plot.

  • The Options
    layout = combination.opts(
        opts.Layout(shared_axes=False, merge_tools=False, fontsize=Plots.font_size),
        opts.Curve(width=Plots.width, 
                   color=Plots.blue,
                   padding=0.01,
                   fontsize=Plots.font_size, 
                   line_width=Plots.line_width),
        opts.VLine(color=Plots.red, line_dash="dotted")
    ).cols(1)
    
  • Embed

    Finally, create the javascript and embed it in this notebook.

    Embed(layout, "shelter_births_per_year")()
    

Lollipop Plot

An alternative way to look at this would be a lollipop plot.

# The Tools
hover = HoverTool(
tooltips=[
    ("Year", "@birth_date{%Y}"),
    ("Births", "@Births{0,0}"),
],
    formatters= {"birth_date": "datetime", 
                 "Births": "numeral"},
    mode="vline",
)

# The Parts
line = holoviews.VLine(sum_peak, label=peak.strftime("%Y"))
spikes = holoviews.Spikes(sums, ("birth_date", "Date of Birth"), "Births")
circles = holoviews.Scatter(sums, "birth_date", "Births")

# The Range Finder
main = circles.relabel().opts(
    labelled=["y"], 
    tools=[hover], 
    xaxis=None,
    ylabel="Births",
    height=Plots.height,
    size=10,
    padding=(0, (0, 0.1)))
range_finder = circles.opts(height=100, 
                            yaxis=None, 
                            default_tools=[],
                            size=5,
                            fontsize={"ticks": "14pt"},
                            xlabel="Year of Birth")

link = RangeToolLink(range_finder, main)

# The Layout
combination = (spikes * line * main + spikes * line * range_finder)

# The Styling Options
layout = combination.opts(
    opts.Layout(shared_axes=False, 
                merge_tools=False,
                title="Shelter Animal Births Per Year (1991- 2017)",
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                color=Plots.red, 
                fontsize=Plots.font_size,
                line_width=Plots.line_width),
    opts.Scatter(color=Plots.blue, fontsize={"ticks": "14pt"}, legend_position="left"),
    opts.VLine(color=Plots.green),
).cols(1)

# The HTML and Javascript
Embed(layout, "births_per_year_spikes")()

Note that putting the title in the Layout changes the font. I was trying to set it to Open Sans but HoloViews is horribly documented for most things so I couldn't figure out how to do it.

Animal Shelter Outcomes

While knowing the birthdates of the animals in the shelter is interesting, what about the dates when their cases were resolved? I originally called this Animal Shelter Adoptions but "outcome" doesn't always mean "adopted", unfortunately.

CountPercentage(shelter.outcome_type)()
Value Count Percentage
Adoption 33112 42.32
Transfer 23499 30.03
Return to Owner 14354 18.35
Euthanasia 6080 7.77
Died 680 0.87
Disposal 307 0.39
Rto-Adopt 150 0.19
Missing 46 0.06
Relocate 16 0.02

I don't know what Disposal means, but it doesn't sound good. Neither does Missing, really, especially if there are any restaurants nearby. Anyway, on to the plotting. I'll aggregate the outcome-counts by year.

outcome_counts = shelter.datetime.value_counts()
outcomes = outcome_counts.resample("Y").sum()
print(table(outcome_counts.head().reset_index(), showindex=False))
outcomes = outcomes.reset_index()
outcomes.columns = ["date", "count"]
outcomes["date"] = outcomes.date.apply(lambda date: datetime(date.year, 1, 1))
index datetime
2016-04-18 00:00:00 39
2015-08-11 00:00:00 25
2017-10-17 00:00:00 25
2015-11-17 00:00:00 22
2015-07-02 00:00:00 22

This next part isn't really necessary but I think keeping the names consistent is helpful, especially since I was struggling so much with HoloViews and didn't need the extra confusion about column-names being wrong.

sums = sums.rename(columns=dict(birth_date="date", Births="count"))

This is going to be like the previous plot but I'm going to add a crosshair tool to make it easier to see how things line up with the axis.

# The Tools
hover = HoverTool(
tooltips=[
    ("Year", "@date{%Y}"),
    ("Count", "@count{0,0}"),
],
    formatters= {"date": "datetime", 
                 "count": "numeral"},
    mode="vline",
)
crosshairs = CrosshairTool(line_color=Plots.light_blue)

# The Parts
births = holoviews.Scatter(sums, "date", "count", label="Births")
outcome_circles = holoviews.Scatter(outcomes, "date", "count", 
                                    group="outcome", label="Outcomes")
spikes = holoviews.Spikes(outcomes, ("date", 'Year'), ("count", "Count"), 
                          group="outcome")

# The Layout
combination = spikes * outcome_circles * births

# The Styling
layout = combination.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                height=Plots.height,
                title="Shelter Animal Births vs Outcomes Per Year",
                show_title=True,
                fontsize=Plots.font_size,
                padding=(0, (0, .1)),
                color=Plots.blue,
                line_width=Plots.line_width),
    opts.Scatter("outcome", color=Plots.blue, size=10, legend_position="top_left"),
    opts.Scatter(fontsize={"ticks": "14pt"}, color=Plots.red, size=10, 
                 tools=[hover, crosshairs]),
)

# The HTML
Embed(layout, "outcome_lollipops")()

You can see that there are only six years of adoption outcomes although there are sixteen years of birth dates, with a sudden uptick to the peak year of 2014. It's interesting that the births drop off much faster than the outcomes - the animals seemed to be getting older for some reason.

Trading Volume

The previous plot was a count-plot. You can also use other summary-statistics like a mean to see how things changed over time. I'll plot the mean volume per year for the New York Stock Exchange.

volume = nyse.volume.resample("Y")
means = volume.mean().reset_index()
means["date"] = means.date.apply(lambda date: datetime(date.year, 1, 1))

Along with the standard deviations.

deviations = volume.std().reset_index()
means["two_sigma"] = means.volume + 2 * deviations.volume

And now our plot.

# The Tools
hover = HoverTool(
tooltips=[
    ("Year", "@date{%Y}"),
    ("Volume", "@volume{0,0.00}"),
],
    formatters= {"date": "datetime", 
                 "volume": "numeral",
    },
    mode="vline",
)

# The Parts
top_spread = holoviews.ErrorBars((means.date, means.volume, means.two_sigma),
                              group="volume")

volume_curve = holoviews.Curve(means, 
                               ("date", "Year"), 
                               ("volume", "Volume"), 
                               group="volume")

zero_line = holoviews.HLine(0)

# The Layout
layout = volume_curve * top_spread * zero_line

# The Styling
layout = layout.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Curve(width=Plots.width, 
               height=Plots.height,
               title="Mean NYSE Trading Volume Per Year",
               show_title=True,
               fontsize=Plots.font_size,
               padding=(0, (0, .1)),
               color=Plots.blue,
               line_width=Plots.line_width,
               tools=[hover]),
    opts.HLine(line_color=Plots.gray)
)

# The HTML
Embed(layout, "stock_mean_volume")()

While the standard deviation is important, in this case it's so large that it smashes the mean down flat (although maybe the fact that it's so large tells us that the mean isn't so accurate).

hover = HoverTool(
tooltips=[
    ("Year", "@date{%Y}"),
    ("Volume", "@volume{0,0.00}"),
],
    formatters= {"date": "datetime", 
                 "volume": "numeral"},
    mode="vline",
)

volume_circles = holoviews.Scatter(means, "date", "volume")
volume_spikes = holoviews.Spikes(means, ("date", "Date"), 
                                 ("volume", "Volume"))
combination = volume_spikes * volume_circles
crosshairs = CrosshairTool(line_color=Plots.light_blue, dimensions="height")

layout = combination.opts(
    opts.Layout(shared_axes=False,
                height=Plots.height,
                merge_tools=False,
                show_title=True,
                fontsize=Plots.font_size),
    opts.Spikes(width=Plots.width, 
                height=Plots.height,
                title="NYSE Mean Annual Trading Volume",
                show_title=True,
                fontsize=Plots.font_size,
                padding=(0, (0, .1)),
                color=Plots.blue,
                line_width=Plots.line_width),
    opts.Scatter(color=Plots.blue,
                 size=10, 
                 tools=[hover, crosshairs]),
)
Embed(layout, "stock_lollipops")()

I took the cross-hairs out of the plot with the standard deviations but it was (a little) more helpful for the lollipop plots because you have to be directly above the points to trigger the hover tool, whereas you can be above any part of a segment in the Curve plot and it triggers the hover tool.

Lag Plots

The Lag Plot helps you check if there is a significance to the ordering of the data. You are plotting the value in the inputs vs the next value (e.g. one day against the following day). If there is no significance to the ordering then the plot will look random.

NYSE

The lag_plot function isn't one of the DataFrame methods so I don't think it will work with HoloViews, although I haven't tried yet.

volume = nyse.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("NYSE Volume Lag Plot", weight="bold")
subplot = lag_plot(volume.sum().tail(365), ax=axe)

nil

So, the center points do seem to show a relationship, as the next-days volume goes up along with the previous day's volume, but I don't know what those bands around 0 are. One thing I noticed is that there are holidays in the data.

print(volume.sum().index[-6])
2016-12-25 00:00:00

And there are also weekends in there.

print(volume.sum().index[-13].strftime("%a"))
Sun

So it's likely that there are days in there where there was no trading and so they won't correlate with the days that preceded the start of a break or the ones that followed the end of a break. I think. I don't really know if there's trading all year round.

volume_sums = volume.sum()
for day in volume_sums[volume_sums==0][-9:].index:
    print("{} {}".format(day.strftime("%a"), day))
Sat 2016-12-03 00:00:00
Sun 2016-12-04 00:00:00
Sat 2016-12-10 00:00:00
Sun 2016-12-11 00:00:00
Sat 2016-12-17 00:00:00
Sun 2016-12-18 00:00:00
Sat 2016-12-24 00:00:00
Sun 2016-12-25 00:00:00
Mon 2016-12-26 00:00:00

So it does look like the zeros are weekends and holidays.

UPS

Here's what just the UPS trading volumes look like.

figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Lag Plot", weight="bold")
subplot = lag_plot(ups.volume, ax=axe)

nil

I don't know why but that makes it look better. I guess the market as a whole doesn't move quite so well together day by day as a single stock does.

Autcorrelation Plot

UPS

figure, axe = pyplot.subplots()
figure.suptitle("UPS Trading Volume Daily Autocorrelation", weight="bold")
subplot = autocorrelation_plot(ups.volume, ax=axe)

nil

This plot shows the lag in relationship to correlation over different lag intervals. It looks like up to about 500 days of lag the correlation is positive but it starts to become more negative after that. The horizontal lines are the confidence intervals - the solid grey lines are the 95 % interval and the dashed grey lines are the 99% interval. The points that fall outside of these intervals are statistically significant.

Cryptocurrency

Lag Plot

crypto_daily = currency.volume.resample("D")
figure, axe = pyplot.subplots()
figure.suptitle("Cryptocurrency Volume Lag Plot", weight="bold")
subplot = lag_plot(crypto_daily.sum(), ax=axe)

nil

Unlike the stock-exchange, the cryptocurrencies seem to move together and don't take days off.

Autocorrelation Plot

figure, axe = pyplot.subplots()
figure.suptitle("Dogecoin Auto Correlation", weight="bold")
dogecoin = currency[currency.name=="Dogecoin"]
subplot = autocorrelation_plot(dogecoin.volume, ax=axe)

nil

If my understanding of how this plot works is correct, there is some kind of significance to lags of 125 and 250 days. Is this really true? Possibly.