Testing the Altair Shortcode

The Cloistered Monkey

2024-03-15 20:25

Introduction

The way I've been using altair is by saving the charts as HTML documents and then embedding them into the posts as pages within the page. This works, so I don't know if this is a good idea or not, but I'm going to try embedding them as HTML elements only, the way I do with p5.js. This is a check to make sure I've got the parts working so nikola will compile everything the way I'm expecting it to.

Setup

Imports

These initial imports are for the plotting and not really what this post is testing.

# python
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import be_true, expect
from tabulate import tabulate

import altair
import pandas

Next is the function I'm testing - a new function I created in graeae, called save_vega_embed. Which maybe isn't the best name but it shouldn't clash with my other functions, hopefully. Its purpose is to create the JavaScript function that Altair's Chart.save was creating for me in the earlier posts and then save it where nikola can find it.

from graeae.visualization.altair_helpers import output_path, save_vega_embed

Note to self: Since I usually run the code to create the plot on a different machine from the one where the nikola code is being run to create this site, the graeae function has to be installed on the remote machine where the plots are created, then the file(s) created need to be synced back to the nikola machine to compile the site.

The Output Path

The output_path function takes the slug and checks if there's a folder matching it in the files/posts folder and creates it if it doesn't exist. Its output is a Path to that folder which we can use to tell the code where to save files for the chart.

SLUG = "testing-the-altair-shortcode"
OUTPUT_PATH = output_path(SLUG)
expect(OUTPUT_PATH.is_dir()).to(be_true)

Constants

These are some constants so I don't have to remember them later on.

class Chart:
    year = "year"
    height = 600
    width = "container"

Note: The width = "container" attribute will tell the chart to use the entire width of the container that the chart is in. In this case it's being put into a <div> tag. To make the div fill the column the CSS is giving its HTML class (altair-vega) a width of 100%. By default the chart will sit flush left but this way it's centered (and big). I'll have to figure out some other solution to get smaller charts centered.

See the vega-lite "Responsive Width and Height" documentation

The Data

To have something to plot I'll use the United Nation's World Happiness data, which I downloaded (as excel spreadsheets) and converted to a CSV.

load_dotenv(override=True)
table_path = Path(os.getenv("WORLD_HAPPINESS_TABLE"))

expect(table_path.is_file()).to(be_true)

table = pandas.read_csv(table_path)
print(table.shape)

(2199, 11)

This is overkill, but I had the idea to do this while in the middle of another post that uses this dataset so it was convenient to cut-and-paste the block here.

Counting the Years

The test-plot will be a bar-graph showing the number of entries (countries) for each year in the dataset. The data uses "year" as the name of the column with the years so I'll make a little name holder and then create a data-frame with the counts.

year_counts = (table.year.value_counts()
               .reset_index()
               .sort_values(Chart.year))

The value_counts output has the years as the index of the Series so I'm calling reset_index to move it into a column, giving us a two-column data-frame (with "year" and "counts" as columns).

A Table To Show the Counts

This is to show the values that we're going to plot.

table_counts = year_counts.T

table_counts.columns = table_counts.iloc[0]
table_counts = table_counts.drop(table_counts.index[0])
print(tabulate(table_counts, headers="keys", tablefmt="orgtbl"))

	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
count	27	89	102	110	114	124	146	141	136	144	142	141	147	141	143	116	122	114

Now the Bar-Chart

So, here's the section where we get to what we're testing.

First, we need an identifier for the HTML div tag so that we can tell Vega where to stick the chart.

DIV_ID = "value-counts-0d454587"

Now we'll create the altair Chart.

value_counts_chart = altair.Chart(year_counts).mark_bar().encode(
    x="{}:N".format(Chart.year),
    y="count").properties(height=Chart.height, width=Chart.width)

Now to save it. Previously I was using Altair's save method to save it as an HTML document and embedding the entire document within the post using an <object> tag, which seems to work kind of like a frame. I think. Anyway, this version saves the chart as a JavaScript function instead (using the name we pass in for the file-name) which then gets called by vega-embed to create the chart. The graeae.visualization.altair_helpers.save_vega_embed function is extracting the JSON schema from the chart (using Altair's Chart.to_json()) and adding it to a JavaScript function that I copied from the file created by Altair's Chart.save. The graeae function prints the shortcode to tell nikola to embed the chart here and then returns a Path object pointing to the saved file.

Note: I haven't figured out how to get jupyter-emacs to dump python output without formatting it as a #+RESULTS block so it will need a little clean up after the function is called. If not, the shortcode will work, but it will also create an empty box underneath it.

chart_path = save_vega_embed(chart=value_counts_chart,
                             name="value-counts-bar-chart",
                             div_id=DIV_ID,
                             output_path=OUTPUT_PATH)

print(chart_path.name)

value-counts-bar-chart.js

And a Line Plot

I'm going to re-plot the data as a line chart to make sure there's nothing I created that causes them to mess each other up.

line_chart = value_counts_chart.mark_line()

save_vega_embed(line_chart,
                name="value-counts-line-chart",
                div_id="line-chart-0d454587",
                output_path=OUTPUT_PATH)

Note: The altairdiv shortcode sets the HTML class for the chart's div to "altair-vega" in case it needs styling later.

The End

Well, that seems to work. I was originally going to throw this away once things seemed to be all right, but I'll keep it as a future reference in case I forget how to use this stuff later.

Here's the files that were updated to make this work.

shortcodes/altairdiv.tmpl
themes/custom-jinja/templates/
- altair.tmpl
- altair_helper.tmpl
- index.tmpl

As well as graeae.visualization.altair_helpers.save_vega_embed.

To get the post working it needs .. template: altair.tmpl in the meta-data and the output of the save_vega_embed function cleaned up (to get the shortcode to include the chart).

Links

Saving Altair Charts — Vega-Altair 5.2.0 documentation [Internet]. [cited 2024 Mar 16]. Available from: https://altair-viz.github.io/user_guide/saving_charts.html
vega/vega-embed [Internet]. Vega; 2024 [cited 2024 Mar 16]. Available from: https://github.com/vega/vega-embed
Vega-Lite [Internet]. [cited 2024 Mar 20]. Customizing Size. Available from: https://vega.github.io/vega-lite/docs/size.html
World Happiness Report [Internet]. 2023 [cited 2023 Dec 11]. Available from: https://worldhappiness.report/

Introduction To Altair: Countries Per Year

The Cloistered Monkey

2024-03-06 16:57

Setup

Imports

These initial imports are supports to make creating this post easier and aren't necessarily needed for the altair plots.

# python
from functools import partial
from pathlib import Path
from pprint import pprint

import json
import os
import re

# pypi
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from expects import be, be_true, equal, expect
from tabulate import tabulate

# monkey
from graeae.visualization.altair_helpers import output_path, save_chart

These are the ones that are really needed for the plotting. I installed both of them through pypi.

import altair
import pandas

Some Setting Up

These are some convenience objects to save a little bit of coding when saving the chart.

SLUG = "introduction-to-altair-countries-per-year"
OUTPUT_PATH = output_path(SLUG)

HEIGHT, WIDTH = 600, 800
SAVE_IT = partial(save_chart, output_path=OUTPUT_PATH, height=HEIGHT + 100)

SOUPER = partial(BeautifulSoup, features="lxml")

This is to make printing out a pandas dataframe as a table a little nicer.

TABLE = partial(tabulate,
                headers="keys",
                tablefmt="orgtbl",
                showindex=False)

The Data

load_dotenv(override=True)
table_path = Path(os.getenv("WORLD_HAPPINESS_TABLE"))

expect(table_path.is_file()).to(be_true)

table = pandas.read_csv(table_path)
print(table.shape)

(2199, 11)

The Data Columns

def column_printer(table, headers=("Column", "Type")):
    print(TABLE(
        ((column, str(table[column].dtype))
         for column in table.columns),
        headers=headers))
    return

column_printer(table)

Column	Type
Country name	object
year	int64
Life Ladder	float64
Log GDP per capita	float64
Social support	float64
Healthy life expectancy at birth	float64
Freedom to make life choices	float64
Generosity	float64
Perceptions of corruption	float64
Positive affect	float64
Negative affect	float64

For this initial post I'll only use the year, but

class Column:
    year = "year"

Counting the Years

Using Pandas' `value_counts` Method

year_counts = table.year.value_counts().reset_index().sort_values("year")

table_counts = year_counts.T
table_counts.columns = table_counts.iloc[0]
table_counts = table_counts.drop(table_counts.index[0])
print(TABLE(table_counts, showindex=True))

	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
count	27	89	102	110	114	124	146	141	136	144	142	141	147	141	143	116	122	114

Now as a bar-chart.

value_counts_chart = altair.Chart(year_counts).mark_bar().encode(
    x="{}:N".format(Column.year),
    y="count").properties(height=HEIGHT, width=WIDTH)

VALUE_COUNTS_NAME = "value-counts-bar-chart"
VALUE_COUNTS_HTML = VALUE_COUNTS_NAME + ".html"
SAVE_IT(value_counts_chart, VALUE_COUNTS_NAME)

Using Altair's "count"

altair_counts_chart = altair.Chart(table).mark_bar().encode(
    x="{}:N".format(Column.year),
    y="count()").properties(height=HEIGHT, width=WIDTH)

ALTAIR_COUNTS_NAME = "altair-counts-bar-chart"
ALTAIR_COUNTS_HTML = ALTAIR_COUNTS_NAME + ".html"
SAVE_IT(altair_counts_chart, ALTAIR_COUNTS_NAME)

Comparing the File Sizes

The Files In Bytes

altair_counts_html = OUTPUT_PATH/(ALTAIR_COUNTS_HTML)
pandas_counts_html = OUTPUT_PATH/(VALUE_COUNTS_HTML)
print("Altair counts(): {:,} bytes".format(altair_counts_html.stat().st_size))
print("Pandas value_counts: {:,} bytes".format(pandas_counts_html.stat().st_size))

Altair counts(): 685,111 bytes
Pandas value_counts: 2,067 bytes

Here's one of the problems with altair - it passes along the entire dataset and then tells vega to work with it in the browser. So, in this case it's passing all our happiness data, even though the chart doesn't use any of the columns.

with altair_counts_html.open() as reader:
    altair_soup = SOUPER(reader)

with pandas_counts_html.open() as reader:
    pandas_soup = SOUPER(reader)

def data_printer(soup: BeautifulSoup, index:int=0) -> None:
    """Gets the data from the soup and prints the entry

    Params:

     - soup: BeautifulSoup with the HTML for the chart
     - index: which data row to show
    """
    EVERYTHING = ".*"
    EXTRA_BRACE = "(?=})"

    DATASETS_EXPRESSION = "datasets" + EVERYTHING + "}}"
    DATASET_EXPRESSION = "{" + EVERYTHING + "}" + EXTRA_BRACE

    script = soup.find_all("script")[-1].string
    dataset = re.search(DATASETS_EXPRESSION, script).group()
    dataset = re.search(DATASET_EXPRESSION, dataset).group()
    json_dataset = json.loads(dataset)
    data_key = list(json_dataset.keys())[0]
    data = json_dataset[data_key]

    print("'dataset' has {:,} data entries\n".format(len(data)))
    print("Entry {}:\n".format(index))
    pprint(data[index])
    return

data_printer(pandas_soup)

'dataset' has 18 data entries

Entry 0:

{'count': 27, 'year': 2005}

def frame_print(frame: pandas.DataFrame, index: int=0) -> None:
    """print length and one row of frame

    Params:

     - frame: data-frame to query
     - index: index of row to print
    """
    print("Frame has {:,} rows.".format(len(frame)))
    print("\nRow {}:\n".format(index))
    print(frame.iloc[0])
    return

frame_print(year_counts)

Frame has 18 rows.

Row 0:

year     2005
count      27
Name: 17, dtype: int64

data_printer(altair_soup)

'dataset' has 2,199 data entries

Entry 0:

{'Country name': 'Afghanistan',
 'Freedom to make life choices': 0.718,
 'Generosity': 0.168,
 'Healthy life expectancy at birth': 50.5,
 'Life Ladder': 3.724,
 'Log GDP per capita': 7.35,
 'Negative affect': 0.258,
 'Perceptions of corruption': 0.882,
 'Positive affect': 0.414,
 'Social support': 0.451,
 'year': 2008}

frame_print(table)

Frame has 2,199 rows.

Row 0:

Country name                        Afghanistan
year                                       2008
Life Ladder                               3.724
Log GDP per capita                         7.35
Social support                            0.451
Healthy life expectancy at birth           50.5
Freedom to make life choices              0.718
Generosity                                0.168
Perceptions of corruption                 0.882
Positive affect                           0.414
Negative affect                           0.258
Name: 0, dtype: object

There's a project called vegafusion that is supposed to help with reducing the size but it requires that you use a jupyter notebook for interactivity (it uses python to make a jupyter widget or some such) so it won't work for a static site like this one. So when using altair we have to think about what we're doing if the size of the files is going to be a problem. In most cases it probably makes sense to do the transformations in pandas first and then only pass the data to plot to altair.

See the altair documentation on Large Datasets for more information.

A Chart, Part By Part

Altair's Chart

chart = altair.Chart(year_counts)
print(type(chart))
expect(chart.data).to(be(year_counts))

<class 'altair.vegalite.v5.api.Chart'>

The Chart class is defined in altair.vegalite.v5.api. This is its docstring description:

Create a basic Altair/Vega-Lite chart.

Although it is possible to set all Chart properties as constructor attributes, it is more idiomatic to use methods such as mark_point(), encode(), transform_filter(), properties(), etc. See Altair's documentation for details and examples: http://altair-viz.github.io/.

The attributes set by the Chart class' constructor (it also accepets other keyword parameters that are passed to its parent classes) are:

data
encoding
mark
width
height

By default they're set to Undefined which is an altair-defined object (see altair.utils.schemapi), and as noted, you don't normally set the attributes using the constructor (other than data which isn't mentioned in the docstring but appears to be passed to the Chart constructor by convention).

Here's a diagram of the Chart (defined in altair.vegalite.v5.api).

nil

A Bar Chart

Once we have a chart object we tell altair that we want it to be a bar chart using the mark_bar method.

bar_chart = chart.mark_bar()
print(type(bar_chart))

<class 'altair.vegalite.v5.api.Chart'>

The mark_ methods are defined in the MarkMethodMixin class (a parent of Chart) which is defined in altair.vegalite.v5.schema.mixins module.

MarkMixin Class

Looking in the mark_bar method, there's a lot of arguments you could pass to it, but fundamentally all it's really doing is making a copy of itself, setting the mark attribute to bar and then retu+rning the copy.

print("Original Chart mark: '{}'".format(chart.mark))
print("Bar Chart mark: '{}'".format(bar_chart.mark))

expect(bar_chart).to_not(be(chart))

Original Chart mark: 'Undefined'
Bar Chart mark: 'bar'

SchemaBase

altair.utils.schemapi.

nil

There are many more methods in altair.utils.schemapi.SchemaBase but I'm highlighting copy here because it gets used quite a bit by the other classes but is defined in this somewhat obscure place. The behavior is what you'd expect so I don't see a need to go over it, but it's one of those mystery methods that just pops up when you use deep inheritance like this that makes you wonder what's going on so I'll document it here, for now.

TopLevelUnitSpec

If you look at the parents of the Chart you might notice that it doesn't have the SchemaBase as one of its parents. So how does it end up with the copy method? Well, it does have the core.TopLevelUnitSpec as one of its parents and that in turn (eventually) inherits from the SchemaBase.

nil

I didn't put in the modules for the core classes since they are fairly deep.

Encoded

The encode method is where we tell altair which columns match which parts of the chart. In this case we're only setting the x and y axes.

encoded = bar_chart.encode(
    x="{}:N".format(Column.year),
    y="count")

print(type(encoded))

<class 'altair.vegalite.v5.api.Chart'>

_EncodingMixin

The encode method is defined in the _EncodingMixin class, one of the Chart's parents.

nil

The encoding method takes in whatever combination of positional and keyword arguments you pass into it and then:

copies the Chart
updates the chart's encoding attribute
sets the copy's encoding attribute to an instance of the altair.vegalite.v5.schema.FacetedEncoding class.
returns the copy

print(encoded.encoding)

FacetedEncoding({
  x: X({
    shorthand: 'year:N'
  }),
  y: Y({
    shorthand: 'count'
  })
})

Properties

propertied = encoded.properties(height=HEIGHT, width=WIDTH)
print(type(propertied))

<class 'altair.vegalite.v5.api.Chart'>

nil

Note: This is a huge class with more methods than I'm showing here. The only ones we've encountered so far are to_dict, save and properties. I used to_dict to show that the chart has all the data from the pandas DataFrame and save is buried in the code that saves the chart to display it in this post - properties is the only one we're really interested in here.

The first thing to note about the properties method is that it doesn't define any arguments, it takes in any keyword arguments (and only keyword arguments, no positional arguments) and values for the arguments. Then:

it makes a copy of the chart
validates the arguments (unless the argument is the data)
sets the arguments as attributes of the copy.
returns the copy

Since we passed in height and width to the properties method, we get back a copy of our bar chart with the height and width set on the copy (as well as the "mark" which we set earlier with mark_bar).

print(propertied.mark)
print(propertied.width)
print(propertied.height)
expect(propertied.mark).to(equal("bar"))
expect(propertied.width).to(equal(WIDTH))
expect(propertied.height).to(equal(HEIGHT))

bar
800
600

HVPlot

Links

The Posts In This Series

Starting Post

Tutorial Sources

Fundamentals of Visualization with Dr. Danielle Albers Szafir: Coursera course where the original tutorial came from.
InfoVis Course from Marian Dork: Iniversity of Applied Sciences Potsdam that the Coursera course cites as the source for their tutorial.

The Data

World Happiness Report Dataset [Internet]. [cited 2023 Dec 11]. Available from: https://www.kaggle.com/datasets/unsdsn/world-happiness
Inc G. Gallup.com. 2009 [cited 2023 Dec 11]. Understanding How Gallup Uses the Cantril Scale. Available from: https://news.gallup.com/poll/122453/Understanding-Gallup-Uses-Cantril-Scale.aspx
World Happiness Report [Internet]. 2023 [cited 2023 Dec 11]. Available from: https://worldhappiness.report/

Altair

Layered and Multi-View Charts — Vega-Altair 5.2.0 documentation [Internet]. [cited 2023 Dec 12]. Available from: https://altair-viz.github.io/user_guide/compound_charts.html
Large Datasets — Vega-Altair 5.2.0 documentation [Internet]. [cited 2024 Mar 8]. Available from: https://altair-viz.github.io/user_guide/large_datasets.html

Introduction To Altair

The Cloistered Monkey

2024-01-29 19:52

Posts

Countries Per Year

Setup

# python
from functools import partial
from pathlib import Path
import os

# pypi
from dotenv import load_dotenv
from expects import be_true, expect
from tabulate import tabulate

import altair
import pandas

# monkey
from graeae.visualization.altair_helpers import output_path, save_chart

SLUG = "introduction-to-altair"
OUTPUT_PATH = output_path(SLUG)

HEIGHT, WIDTH = 600, 800
SAVE_IT = partial(save_chart, output_path=OUTPUT_PATH, height=HEIGHT + 100)

TABLE = partial(partial(tabulate,
                        headers="keys",
                        tablefmt="orgtbl",
                        showindex=False))

The Data

load_dotenv(override=True)
table_path = Path(os.getenv("WORLD_HAPPINESS_TABLE"))

expect(table_path.is_file()).to(be_true)

table = pandas.read_csv(table_path)
print(table.shape)

(2199, 11)

def column_printer(table, headers=("Column", "Type")):
    print(TABLE(
        ((column, str(table[column].dtype))
         for column in table.columns),
        headers=headers))
    return

column_printer(table)

Column	Type
Country name	object
year	int64
Life Ladder	float64
Log GDP per capita	float64
Social support	float64
Healthy life expectancy at birth	float64
Freedom to make life choices	float64
Generosity	float64
Perceptions of corruption	float64
Positive affect	float64
Negative affect	float64

class Column:
    __slots__ = ()
    country = "Country name"
    year = "year"
    happiness = "Life Ladder"
    gdp = "Log GDP per capita"
    generosity = "Generosity"

Creating a Chart

chart = altair.Chart(table)
print(type(chart))

<class 'altair.vegalite.v5.api.Chart'>

The Chart class is defined in altair.vegalite.v5.api. This is its docstring description:

Create a basic Altair/Vega-Lite chart.

Although it is possible to set all Chart properties as constructor attributes, it is more idiomatic to use methods such as mark_point(), encode(), transform_filter(), properties(), etc. See Altair's documentation for details and examples: http://altair-viz.github.io/.

The attributes set by the Chart class' constructor (as opposed to being passed to its parent classes) are:

data
encoding
mark
width
height

By default they're set to Undefined which is an altair-defined object (see altair.utils.schemapi), and as noted, you don't normally set the attributes using the constructor (other than data which isn't mentioned in the docstring but appears to be passed to the Chart constructor by convention).

Chart Class

The methods take arguments, but I believe interactive (which I haven't seen called with arguments) is the only method that I'll be using - all the other methods you use belong to parent classes.

TopLevelUnitSpec

nil

Only the top-most parent (SchemaBase) has a method of interest here.

SchemaBase

nil

There are many more methods in altair.utils.schemapi.SchemaBase but I'm highlighting copy here because it gets used quite a bit by the other classes but is defined in this somewhat obscure place. The behavior is what you'd expect so I don't see a need to go over it, but it's one of those mystery methods that just pops up when you use deep inheritance like this that makes you wonder what's going on so I'll document it here, for now.

note: The details of the code probably shouldn't go into the introduction-introduction. Maybe put this stuff further down or in another post.

_EncodingMixin

nil

The encoding method takes in whatever combination of positional and keyword arguments you pass into it and uses them along with the values that are already set in the Chart's encoding attribute to update the encoding before returning the chart (a copy so the original chart isn't changed).

The encoding attribute is an instance of the altair.vegalite.v5.schema.FacetedEncoding class.

Making It a Bar Chart

bar_chart = chart.mark_bar()
print(type(bar_chart))

<class 'altair.vegalite.v5.api.Chart'>

MarkMixin Class

Looking in the mark_bar method, there's a lot of arguments you could pass to it, but fundamentally all it's really doing is setting the mark attribute to bar.

print("Chart Mark: '{}'".format(chart.mark))
print("Bar Chart Mark: '{}'".format(bar_chart.mark))

Chart Mark: 'Undefined'
Bart Chart Mark: 'bar'

Year

year_counts = table.year.value_counts().reset_index().sort_values("year").T
year_counts.columns = year_counts.iloc[0]
year_counts = year_counts.drop(year_counts.index[0])
print(TABLE(year_counts, showindex=True))

	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
count	27	89	102	110	114	124	146	141	136	144	142	141	147	141	143	116	122	114

Now as a bar-chart.

year_chart = bar_chart.encode(
    x="year:N",
    y="count()")

year_chart = year_chart.properties(height=HEIGHT, width=WIDTH)

SAVE_IT(year_chart, "year-counts-bar-chart")

nil

This method makes a copy of the class, validates the arguments and then sets the properties on itself. The arguments are based on the JSON Schema passed to vega-lite.

Life Ladder

By Year

boxplot = chart.mark_boxplot(extent="min-max").encode(
    x="{}:O".format(Column.year),
    y=Column.happiness
).properties(height=HEIGHT, width=WIDTH)

SAVE_IT(boxplot, "happiness-year-boxplots")

The Most Recent Year (2022)

print(table.year.dtype)

int64

data_2022 = table[table.year==2022]
chart_2022 = altair.Chart(data_2022).properties(
    height=HEIGHT, width=WIDTH)

bar_chart_2022 = chart_2022.mark_bar()

ladder_chart = bar_chart_2022.encode(
    x=altair.X(Column.happiness, bin=True),
    y="count()"
)

SAVE_IT(ladder_chart, "ladder-histogram")

GDP

scatter = chart_2022.mark_circle()
print(scatter.mark)

circle

gdp_scatter = scatter.encode(
    x=Column.gdp,
    y=Column.happiness
)

SAVE_IT(gdp_scatter, "gdp-vs-happiness")

With Generosity

gdp_generosity = scatter.encode(
    x=Column.happiness,
    y=Column.generosity,
    color=Column.gdp,
    tooltip=[Column.country, Column.happiness, Column.gdp,
             Column.generosity]
)

SAVE_IT(gdp_generosity, "gdp-and-generosity")

Links

Tutorial Sources

Fundamentals of Visualization with Dr. Danielle Albers Szafir: Coursera course where the original tutorial came from.
InfoVis Course from Marian Dork: Iniversity of Applied Sciences Potsdam that the Coursera course cites as the source for their tutorial.

The Data

World Happiness Report Dataset [Internet]. [cited 2023 Dec 11]. Available from: https://www.kaggle.com/datasets/unsdsn/world-happiness
Inc G. Gallup.com. 2009 [cited 2023 Dec 11]. Understanding How Gallup Uses the Cantril Scale. Available from: https://news.gallup.com/poll/122453/Understanding-Gallup-Uses-Cantril-Scale.aspx
World Happiness Report [Internet]. 2023 [cited 2023 Dec 11]. Available from: https://worldhappiness.report/

Altair

Layered and Multi-View Charts — Vega-Altair 5.2.0 documentation [Internet]. [cited 2023 Dec 12]. Available from: https://altair-viz.github.io/user_guide/compound_charts.html

World Happiness Data

The Cloistered Monkey

2023-12-14 14:58

Imports

# python
from functools import partial
from pathlib import Path
import os

# pypi
from dotenv import load_dotenv
from expects import be_true, expect
from tabulate import tabulate

import pandas

TABLE = partial(partial(tabulate,
                        headers="keys",
                        tablefmt="orgtbl",
                        showindex=False))

load_dotenv(override=True)
kaggle_path = Path(os.getenv("KAGGLE_WORLD_HAPPINESS"))
figure_path = Path(os.getenv("WORLD_HAPPINESS_FIGURE"))
table_path = Path(os.getenv("WORLD_HAPPINESS_TABLE"))
print(kaggle_path)
print(figure_path)
print(table_path)
expect(kaggle_path.is_file()).to(be_true)
expect(table_path.is_file()).to(be_true)
expect(figure_path.is_file()).to(be_true)

/home/bravo/data/datasets/kaggle/world-happiness-report/WHR_2016.csv
/home/bravo/data/datasets/world-happiness-data/world-happiness-report-2023-data-for-figure-2.1.csv
/home/bravo/data/datasets/world-happiness-data/world-happiness-report-2023-data-for-table-2.1.csv

kaggle = pandas.read_csv(kaggle_path)
figure = pandas.read_csv(figure_path)
table = pandas.read_csv(table_path)

print(kaggle.shape)
print(figure.shape)
print(table.shape)

(157, 13)
(137, 19)
(2199, 11)

def column_printer(table, headers=("Column", "Type")):
    print(TABLE(
        ((column, str(table[column].dtype))
         for column in table.columns),
        headers=headers))
    return

column_printer(kaggle)

Column	Type
Country	object
Region	object
Happiness Rank	int64
Happiness Score	float64
Lower Confidence Interval	float64
Upper Confidence Interval	float64
Economy (GDP per Capita)	float64
Family	float64
Health (Life Expectancy)	float64
Freedom	float64
Trust (Government Corruption)	float64
Generosity	float64
Dystopia Residual	float64

column_printer(figure)

Column	Type
Country name	object
Ladder score	float64
Standard error of ladder score	float64
upperwhisker	float64
lowerwhisker	float64
Logged GDP per capita	float64
Social support	float64
Healthy life expectancy	float64
Freedom to make life choices	float64
Generosity	float64
Perceptions of corruption	float64
Ladder score in Dystopia	float64
Explained by: Log GDP per capita	float64
Explained by: Social support	float64
Explained by: Healthy life expectancy	float64
Explained by: Freedom to make life choices	float64
Explained by: Generosity	float64
Explained by: Perceptions of corruption	float64
Dystopia + residual	float64

column_printer(table)

Column	Type
Country name	object
year	int64
Life Ladder	float64
Log GDP per capita	float64
Social support	float64
Healthy life expectancy at birth	float64
Freedom to make life choices	float64
Generosity	float64
Perceptions of corruption	float64
Positive affect	float64
Negative affect	float64

It's hard to say exactly but it looks like Region and Happiness Rank were added by whoever created the kaggle dataset and it isn't clear what the Family column ties into. The only column in the UN data not matched is Social Support but that doesn't seem to have the right value range:

print(kaggle.Family.max())
print(figure["Social support"].max())

1.18326
0.983

We're not going to use Family anyway, so I'll just ignore it.

Country

COUNTRY = "Country name"

print(kaggle.Country.min())
print(table[COUNTRY].min())
print(figure[COUNTRY].min())

Afghanistan
Afghanistan
Afghanistan

print(len(kaggle.Country.unique()))
print(len(table[COUNTRY].unique()))
print(len(figure[COUNTRY].unique()))

157
165
137

The figure data has quite a bit fewer entries than the table data. I'll have to look into that as I was planning to merge them, but I'll have to figure out why those twenty-eight countries are missing.

The Table vs Kaggle

k_countries = set(kaggle.Country)
t_countries = set(table[COUNTRY])

print(sorted(t_countries - k_countries))
print()
print(sorted(k_countries - t_countries))

['Central African Republic', 'Cuba', 'Czechia', 'Djibouti', 'Eswatini', 'Gambia', 'Guyana', 'Hong Kong S.A.R. of China', 'Lesotho', 'Maldives', 'Mozambique', 'North Macedonia', 'Oman', 'Somaliland region', 'State of Palestine', 'Taiwan Province of China', 'Turkiye']

['Czech Republic', 'Hong Kong', 'Macedonia', 'North Cyprus', 'Palestinian Territories', 'Puerto Rico', 'Somaliland Region', 'Taiwan', 'Turkey']

Kaggle	World Happiness Report	Wikipedia
- Missing -	Central African Republic	Central African Republic
- Missing -	Cuba	Cuba
Czech Republic	Czechia	Czech Republic
- Missing -	Djibouti	Djibouti
- Missing -	Eswatini	Eswatini
- Missing -	Gambia	The Gambia
- Missing -	Guyana	Guyana
Hong Kong	Hong Kong S.A.R. of China	Hong Kong
- Missing -	Lesotho	Lesotho
Macedonia	North Macedonia	North Macedonia
- Missing -	Maldives	Maldives
- Missing -	Mozambique	Mozambique
North Cyprus	- Only recognized by Turkey -	Northern Cyprus
- Missing -	Oman	Oman
Palestinian Territories	State of Palestine	State of Palestine, Palestinian Territories
Puerto Rico	- Territory of U.S. -	Puerto Rico
Somaliland Region	Somaliland region
Taiwan	Taiwan Province of China	Taiwan
Turkey	Turkiye	Turkey

I'm more interested in the World Happiness Report so I'll conform Kaggle's country names to match that and ignore the countries that it's missing.

def rename_country(names: dict, data: pandas.DataFrame,
                   country_column: str=COUNTRY) -> pandas.DataFrame:
    """Rename the countries in the kaggle data

    Args:

     - names: dict mapping kaggle names to names you want
     - kaggle_data: the kaggle happiness data to rename countries
     - country_column: name to use for the country column

    Returns:
     kaggle_data with countries renamed
    """
    data = kaggle_data.rename(columns=dict(Country=country_column))
    data[country_column] = data[country_column].replace(names)
    return data

kaggle_to_world = {
    "Czech Republic": "Czechia",
    "Macedonia": "North Macedonia",
    "Palestinian Territories": "State of Palestine",
    "Turkey": "Turkiye"
}

kaggled = kaggle.copy()
kaggled["Country"] = kaggled.Country.replace(kaggle_to_world)
print(set(kaggled.Country) - set(table[COUNTRY]))
print(set(table[COUNTRY]) - set(kaggled.Country))

{'Somaliland Region', 'North Cyprus', 'Taiwan', 'Hong Kong', 'Puerto Rico'}
{'Hong Kong S.A.R. of China', 'Central African Republic', 'Djibouti', 'Oman', 'Lesotho', 'Mozambique', 'Somaliland region', 'Gambia', 'Taiwan Province of China', 'Eswatini', 'Guyana', 'Cuba', 'Maldives'}

world_to_kaggle = {"Hong Kong S.A.R. of China": "Hong Kong",
                   "Somaliland region": "Somaliland Region",
                   "Taiwan Province of China": "Taiwan"}

tabled = table.rename(columns={COUNTRY: "Country"})
tabled["Country"] = tabled.Country.replace(world_to_kaggle)

print(set(kaggled.Country) - set(tabled.Country))
print(set(tabled.Country) - set(kaggled.Country))

{'North Cyprus', 'Puerto Rico'}
{'Central African Republic', 'Djibouti', 'Oman', 'Lesotho', 'Mozambique', 'Gambia', 'Eswatini', 'Guyana', 'Cuba', 'Maldives'}

Figure Data

figured = figure.rename(columns={COUNTRY: "Country"})
figured["Country"] = figured.Country.replace(world_to_kaggle)

print(set(figured.Country) - set(kaggled.Country))
print()
print(set(figured.Country) - set(tabled.Country))

{'Mozambique', 'Gambia'}

set()

The Figure Countries

f_countries = set(figure[COUNTRY])
f_only = f_countries - t_countries
kd_countries = set(kaggled[COUNTRY])

print(sorted(f_only - kd_countries))
print()
print(sorted(kd_countries - f_countries))

[]

['Angola', 'Azerbaijan', 'Belarus', 'Belize', 'Bhutan', 'Burundi', 'Haiti', 'Kuwait', 'Libya', 'North Cyprus', 'Puerto Rico', 'Qatar', 'Rwanda', 'Somalia', 'Somaliland region', 'South Sudan', 'Sudan', 'Suriname', 'Syria', 'Trinidad and Tobago', 'Turkmenistan', 'Yemen']

The figure data has twenty fewer countries than the Kaggle data so it's not surprising that there's a lot left over. It doesn't look like there's any in the figure data that Kaggle doesn't have, though, which is good.

print(sorted(t_countries - f_countries))

['Angola', 'Azerbaijan', 'Belarus', 'Belize', 'Bhutan', 'Burundi', 'Central African Republic', 'Cuba', 'Djibouti', 'Eswatini', 'Guyana', 'Haiti', 'Kuwait', 'Lesotho', 'Libya', 'Maldives', 'Oman', 'Qatar', 'Rwanda', 'Somalia', 'Somaliland region', 'South Sudan', 'Sudan', 'Suriname', 'Syria', 'Trinidad and Tobago', 'Turkmenistan', 'Yemen']

Links

Kaggle World Happiness Data: The World Happiness Data. There are four separate CSV files available for 2015 through 2019.
World Happiness Report: The website for the World Happiness Report.

A Ray

The Cloistered Monkey

2023-03-24 16:57

Panel 4

Panel 5

Panel 6

Previous Bubba

Table of Contents

NYC Shootings (Python Version)

The Cloistered Monkey

2023-03-22 13:18

NYPD Shooting Incident Data Version 2 (with python)

This is a replication (more or less) of the NYPD Shooting Incident Data post which we did using R but this time we'll use python and pandas instead. Once again we'll be looking at the NYPD Shooting Incident Data (Historic) from the DATA.gov catalog of datasets, which lists every shooting from 2006 through the last updated incident (as of March 22, 2023 it shows that it was last updated June 9, 2022 but we'll have to look at it to see what the date of the last incident was).

Imports and Setup

# python
from datetime import datetime
from functools import partial
import sys

# pypi
from expects import equal, expect
from tabulate import tabulate

import altair
import geopandas
import pandas

from graeae.visualization.altair_helpers import output_path, save_chart

First, let's double-check which version of python we're using.

print(sys.version)

3.9.9 (main, Dec  3 2021, 01:15:49) 
[GCC 10.2.1 20210110]

We're running python 3.9.9. I was running it in pypy but for some reason the pandas' groupby method would hang so this is done in cPython.

print(altair.__version__)

5.0.0rc1

I found out after I'd been struggling with altair for a little while that they're about to make a major release that breaks some of the ways you use it (in the prior versions). Since it's been a while since I've used altair I decided to go with the newer, upcoming version rather than re-learn the current one and then learn the new version of altair once it's released.

This next bit is a little setup to make nicer tables.

TABLE = partial(tabulate, headers="keys", showindex=False, tablefmt="orgtbl")

This is to help with the plotting.

SLUG = "nyc-shootings-python-version"
OUTPUT_PATH = output_path(SLUG)
save_it = partial(save_chart, output_path=OUTPUT_PATH, height=650)

The Data

Loading It

The data is available to download as a CSV so we'll pass the URL for the CSV to pandas and make a dataframe.

URL = "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
data = pandas.read_csv(URL)

rows, columns = data.shape
print(f"Rows: {rows:,}\tColumns: {columns}")

Rows: 25,596    Columns: 19

The Columns

We'll start looking at the data by examining what's in the columns. We saw from the data frame's shape that we have 19 columns, how many of them does pandas think are numeric?

print(TABLE(data.describe(), showindex=True))

	INCIDENT_KEY	PRECINCT	JURISDICTION_CODE	X_COORD_CD	Y_COORD_CD	Latitude	Longitude
count	25596	25596	25594	25596	25596	25596	25596
mean	1.12383e+08	65.8694	0.331601	1.00945e+06	207894	40.7372	-73.909
std	6.78612e+07	27.2019	0.742266	18421.4	31857.4	0.087447	0.0664265
min	9.95324e+06	1	0	914928	125757	40.5116	-74.2493
25%	6.15936e+07	44	0	1.00001e+06	182782	40.6683	-73.9431
50%	8.64373e+07	69	0	1.00772e+06	194038	40.6991	-73.9153
75%	1.66661e+08	81	0	1.01684e+06	239429	40.8238	-73.8824
max	2.3849e+08	123	2	1.06682e+06	271128	40.9108	-73.702

Just seven of them, and four are coordinates and the other three are categorical. This isn't really surprising, since the data isn't measuring anything but is rather a listing of shooting incidents reported (each row represents a single incident).

There's a table on the NYPD-Shooting-Incident-Data page that describes the columns.

Column Name	Description	Type
INCIDENT_KEY	Randomly generated persistent ID for each arrest	Plain Text
OCCUR_DATE	Exact date of the shooting incident	Date & Time
OCCUR_TIME	Exact time of the shooting incident	Plain Text
BORO	Borough where the shooting incident occurred	Plain Text
PRECINCT	Precinct where the shooting incident occurred	Plain Text
JURISDICTIONAL_CODE	Jurisdiction where it occurred	Number
LOCATION_DESC	Location of the incident	Plain Text
STATISTICAL_MURDER_FLAG	Victim died	Checkbox
PERP_AGE_GROUP	Perpetrator's age within a category	Plain Text
PERP_SEX	Pepetrator's sex.	Plain Text
PERP_RACE	Perpetrator's race.	Plain Text
VIC_AGE_GROUP	Victim's age with a category.	Plain Text
VIC_SEX	Victim's sex.	Plain Text
VIC_RACE	Victim's Race	Plain Text
X_COORD_CD	Midblock X-coordinate for New York State Plane Coordinate System	Plain Text
Y_COORD_CD	Midblock Y-coordinate	Plain Text
Latitude	Latitude coordinate	Number
Longitude	Longitude	Number
Lon_Lat	Longitude and Latitude Coordinate for mapping	Point

Missing Data

missing = len(data) - data.count()
missing = missing.reset_index().rename(
    columns={"index": "Column",
             0: "Missing"})

some_missing = missing[missing.Missing > 0].copy()
some_missing.loc[:, "Fraction"] = some_missing.Missing/len(data)
print(TABLE(some_missing))

Column	Missing	Fraction
JURISDICTION_CODE	2	7.81372e-05
LOCATION_DESC	14977	0.58513
PERP_AGE_GROUP	9344	0.365057
PERP_SEX	9310	0.363729
PERP_RACE	9310	0.363729

Most entries are missing location descriptions for some reason, and quite a lot of perpertator data is missing, possibly because they didn't catch whoever did the shooting in those cases.

Incident Key

The incident key is an identifier for a specific incident so it's only really useful if you need to look up or refer to one or more of them, but we'll be looking at things in aggregate making them less useful for us, except maybe for looking at anomalies. Let's just make sure that the identifiers are unique as I'm asserting that they are.

id_count = len(data.INCIDENT_KEY.unique())
incidents = len(data)
print(f"Identifiers: {id_count:,}\tIncidents: {incidents:,}")
print(f"There are {incidents - id_count:,} more rows than incident IDs.")

Identifiers: 20,126     Incidents: 25,596
There are 5,470 more rows than incident IDs.

It appears that I wasn't correct in my assumption… let's take a look at one of the incidents.

counts = data.INCIDENT_KEY.value_counts()
up_counts = counts[counts > 1]
top = counts.head(1)
top_id = top.index[0]
print(f"Incident: {top_id}\tCount: {top.iloc[0]}")

Incident: 173354054     Count: 18

Inspecting the dataframe it looks like in some cases more than one person was shot per incident, so there's multiple rows (one per person shot) for a single incident. Kind of scary that eighteen people got shot at one incident, if my interpretation is correct, but that's life in the big city, I guess. Reading the Footnotes (link is to a PDF) it says:

A shooting incident can have multiple victims involved and as a result duplicate INCIDENT_KEYs are produced.

So it appears each row represents the victim of a shooting and each INCIDENT_KEY represents a shooting where one or more person was shot. The footnotes also note that only incidents where a victim was shot are included. If someone fired a gun but didn't hit anyone then it isn't represented in the data set.

Note: There's actually a slight discrepancy between the descriptions of the INCIDENT_KEY between the web-page and the PDF footnotes. According to the Web-Page the ID is for each arrest while the footnotes make it sound like they represent cases where there was at least one victim, whether or not someone was arrested. For our purposes this won't matter, since we're only using the data as a source for data visualization, but if one were really trying to understand what was happening in NYC knowing exactly what the data represents might be important (assuming not all cases with a shooting victim leads to an arrest).

use_counts = up_counts.reset_index()
chart = altair.Chart(use_counts).mark_bar().encode(
    x = altair.X("INCIDENT_KEY",
                 type="nominal",
                 sort="-y",
                 axis=altair.Axis(labels=False)),
    y="count",
    tooltip=[altair.Tooltip("INCIDENT_KEY", type="nominal"),
             altair.Tooltip("count")]
).interactive().properties(
   title="Incidents with Multiple Shots",
   width=800,
   height=525
)

save_it(chart, "multiple_shot_incidents")

It looks like a lot of entries have more than one row. Does this mean many incidents have more than one victim? More than one shooter?

fractions = 100 * counts.value_counts()/len(data)
fractions = fractions.reset_index(name="Percent of Rows").rename(columns={
    "count": "Rows"})

chart = altair.Chart(fractions).mark_bar().encode(
    x=altair.X("Rows", sort=fractions["Percent of Rows"].values),
    y=altair.Y("Percent of Rows", scale=altair.Scale(domain=(-1, 70))),
    tooltip=[altair.Tooltip("Rows"),
             altair.Tooltip("Percent of Rows")]).properties(
                 title="Percent Of Incidents with Multiple Rows",
                 width=800,
                 height=525)

save_it(chart, "fraction_row_incidents")

The majority of the incidents do have only one row in the dataset. Perhaps it's not as unusual as I think it is to have multiple people involved in a shooting.

OCCUR_DATE and OCCUR_TIME

There are two columns that tell us when the shooting is supposed to have happened.

example = data[data.INCIDENT_KEY==top_id].iloc[0]
print(f"OCCUR_DATE: {example.OCCUR_DATE} ({data.OCCUR_DATE.dtype})")
print(f"OCCUR_TIME: {example.OCCUR_TIME} ({data.OCCUR_TIME.dtype})")

OCCUR_DATE: 01/06/2018 (object)
OCCUR_TIME: 21:05:00 (object)

Pandas interpreted both of these as strings, but it'd probably be more useful for us if they were datetime objects.

MONTH, DAY, YEAR = "%m", "%d", "%Y"
HOUR, MINUTE, SECOND = "%H", "%M", "%S"
DATE_FORMAT = "/".join((MONTH, DAY, YEAR))
TIME_FORMAT = ":".join((HOUR, MINUTE, SECOND))
FORMAT = f"{DATE_FORMAT} {TIME_FORMAT}"
DATE_COLUMN = "date_time"
data[DATE_COLUMN] = pandas.to_datetime(data.OCCUR_DATE + " " + data.OCCUR_TIME, format=FORMAT)

check_date = data[data.INCIDENT_KEY==top_id].iloc[0]
print(f"OCCUR_DATE: {check_date.OCCUR_DATE}")
print(f"New Date: {check_date.date_time.date()}")
print(f"OCCUR_TIME: {check_date.OCCUR_TIME}")
print(f"New Time: {check_date.date_time.time()}")

OCCUR_DATE: 01/06/2018
New Date: 2018-01-06
OCCUR_TIME: 21:05:00
New Time: 21:05:00

print(data.OCCUR_DATE.min())
print(data.OCCUR_DATE.max())

01/01/2006
12/31/2021

Our dataset covers the years from 2006 throught 20021. Let's see how many there are from month to month.

Shootings By Month

indexed = data.set_index(DATE_COLUMN)
monthly = indexed.groupby(pandas.Grouper(freq="M"))
monthly_counts = monthly.count()["INCIDENT_KEY"].reset_index().rename(
    columns={"INCIDENT_KEY": "Shootings",
             "date_time": "Month"}
)
expect(monthly_counts["Shootings"].sum()).to(equal(len(data)))

MONTH_YEAR = "%B %Y"
chart = altair.Chart(monthly_counts).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
    x=altair.X("Month", type="temporal"),
    y=altair.Y("Shootings"),
    tooltip=[altair.Tooltip("Month", format=MONTH_YEAR),
             altair.Tooltip("Shootings")]
).properties(
    width=800,
    height=525,
    title="NYC Shootings By Month"
)

save_it(chart, "monthly_incidents")

It looks like shootings went down in 2013 then shot back up again in the Summer of 2020.

90 Day Rolling Window

monthly_counts["Rolling Mean"] = monthly_counts["Shootings"].ewm(
    halflife="90 days", times=monthly_counts.Month).mean()
pre_melt = monthly_counts.rename(columns={"Shootings": "Sum"})
melted = pre_melt.melt("Month", var_name="Aggregation", value_name="Aggregated Value")
chart = altair.Chart(melted).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
               x=altair.X("Month"),
               y=altair.Y("Aggregated Value"),
               color="Aggregation",
               tooltip=[altair.Tooltip("Month", format=MONTH_YEAR),
                        altair.Tooltip("Aggregated Value")]
).properties(
    width=800,
    height=525,
    title="NYC Shootings By 90 Day Exponential Weighted Mean"
)

save_it(chart, "monthly_rolling_incidents")

Using a ninety-day window gives a little better sense of the overall trend downwards until 2020 reversed it.

By Year

yearly = indexed.groupby(pandas.Grouper(freq="Y"))
yearly_counts = yearly.count()["INCIDENT_KEY"].reset_index().rename(
    columns={"INCIDENT_KEY": "Shootings"}
)

yearly_counts["Year"] = yearly_counts.date_time.dt.year
expect(yearly_counts["Shootings"].sum()).to(equal(len(data)))

chart = altair.Chart(yearly_counts).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
    x=altair.X("Year", type="ordinal"),
    y=altair.Y("Shootings"),
    tooltip=[altair.Tooltip("Year"),
             altair.Tooltip("Shootings", format=",")]
).properties(
    width=800,
    height=525,
    title="NYC Shootings By year"
)

save_it(chart, "yearly_incidents")

Although 2020 had that crazy summer, 2021 still exceeded it overall.

Monthly By Year

monthly_counts = monthly.count()["INCIDENT_KEY"].reset_index().rename(
    columns={"INCIDENT_KEY": "Shootings"}
)
# monthly_counts = monthly_counts.rename(columns={"Month": "date-time"})

month_map = dict(zip(range(1, 13), "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()))
monthly_counts["Year"] = monthly_counts["date_time"].dt.year
monthly_counts["Month"] = monthly_counts["date_time"].dt.month.apply(lambda month: month_map[month])

selection = altair.selection_point(fields=["Year"],
                                   bind="legend")

chart = altair.Chart(monthly_counts).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
    x=altair.X("Month",
               sort=monthly_counts["date_time"].values),
    y=altair.Y("Shootings"),
    color=altair.Color("Year", type="nominal"),
    tooltip=[altair.Tooltip("date_time", title="Month", format=MONTH_YEAR),
             altair.Tooltip("Shootings")],
    opacity=altair.condition(selection,
                             altair.value(0.8),
                             altair.value(0.2))
).properties(
    width=800,
    height=525,
    title="NYC Monthly Shootings By year"
).add_params(selection)

save_it(chart, "month_year_shootings")

Click on a year in the legend to isolate it from the other lines in the plot and on a space somewhere on the legend to highlight all the years again.

It looks like there was an unusually large surge in the Summer of 2020 after years of declining shootings, but regardless of the year there's at least a slight increase in shootings during the Summer months.

Median By Month

by_month_median = monthly_counts.groupby("Month").median().reset_index()

chart = altair.Chart(by_month_median).mark_bar().encode(
    x=altair.X("Month", sort=list(month_map.values())),
    y="Shootings",
    tooltip=[altair.Tooltip("Month"),
             altair.Tooltip("Shootings"),]
).properties(
    width=800,
    height=525,
    title="NYC Median Shootings By Month (2006 through 2012)"
)

save_it(chart, "monthly_shootings")

It looks like there's a definite seasonal effect on shootings. Must be all those farmer's kids off from school to work the fields and shoot each other.

Location

Descriptions

BORO

The BORO column identifies which of the five boroughs of New York City the victim was shot in.

Five Boroughs

Manhattan
Brooklyn
Queens
Bronx
Staten Island

boroughs = data.BORO.value_counts().reset_index().rename(
    columns=dict(BORO="Borough", count="Shootings"))


chart = altair.Chart(boroughs).mark_bar().encode(
    x="Borough",
    y="Shootings",
    tooltip=[altair.Tooltip("Borough"),
             altair.Tooltip("Shootings", format=",")]
).properties(
    width=800,
    height=520,
    title="Shootings by Borough"
)

save_it(chart, "shootings-by-borough")

borough_monthly = monthly.agg({"BORO": "first",
                               "INCIDENT_KEY": "count"}).reset_index().rename(
                                   columns=dict(BORO="Borough", INCIDENT_KEY="Shootings", date_time="Month"))

selection = altair.selection_point(fields=["Borough"], bind="legend")
chart = altair.Chart(borough_monthly).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
               x=altair.X("Month"),
               y=altair.Y("Shootings"),
               color="Borough",
               tooltip=[
                   altair.Tooltip("Borough"),
                   altair.Tooltip("Month", format=MONTH_YEAR),
                   altair.Tooltip("Shootings")
               ],
               opacity=altair.condition(selection,
                                        altair.value(0.8),
                                        altair.value(0.2))
).properties(
    width=900,
    height=525,
    title="Monthly Borough Shootings"
).add_params(selection)

save_it(chart, "borough-monthly-shootings")

The Bronx and Brooklyn look like they have the most shootings, although I didn't weight it by population so I don't know how they compare as a fraction of the population.

borough_bar = data[["BORO", "date_time"]].rename(
    columns=dict(BORO="Borough"))
borough_bar["Month"] = pandas.to_datetime({
    "year": borough_bar.date_time.dt.year,
    "month": borough_bar.date_time.dt.month,
    "day": 1
})

chart = altair.Chart(borough_bar).mark_bar().encode(
    x="Month",
    y="count()",
    color="Borough").properties(
        width=800,
        height=525,
        title="Monthly Shootings With Borough Breakdowns")

save_it(chart, "borough-monthly-bar-chart")

By clicking on the legend you can isolate plot-lines by borough and it makes it clearer that the surge in 2020 happened primarily in the Bronx and Manhattan, although the plot is a little deceptive since Manhattan had some months where there were no shootings, but I didn't infill them with zeros so it doesn't show those months dropping back down.

PRECINCT & JURISDICTION_CODE

There are seventy-seven police precincts in New York City, subdividing the boroughs. This might be useful for isolating where the shootings are occuring even further, but I'll skip it. Same with the JURISDICTION_CODE.

LOCATION_DESC

missing_locations = data[data.LOCATION_DESC.isna()].rename(columns=dict(LOCATION_DESC="Location Description"))
missing_locations = missing_locations[["Location Description", "date_time", "BORO"]]
missing_locations["Year"] = missing_locations.date_time.dt.year

chart = altair.Chart(missing_locations).mark_bar().encode(
    x=altair.X("Year", type="ordinal"),
    y="count()",
    tooltip=[altair.Tooltip("count()"), altair.Tooltip("BORO"), altair.Tooltip("Year")],
    color="BORO"
).properties(
    width=800,
    height=525,
    title="Missing Location Descriptions by Year"
)

save_it(chart, "missing-locations-descriptions")

I was hoping the missing locations might be because they only started recording it after some point, but it looks like it's an ongoing problem not local to any one borough, so the location descriptions are interesting, but the number of missing entries makes it hard to say that they're meaningful.

locations = data.LOCATION_DESC.value_counts().reset_index().rename(columns=dict(count="Count", LOCATION_DESC="Location"))

chart = altair.Chart(locations).mark_bar().encode(
    x=altair.X("Location", sort=locations.Location.values,
               axis=altair.Axis(labelAngle=30)),
    y=altair.Y("Count"),
    tooltip=[altair.Tooltip("Location"), altair.Tooltip("Count", format=",")]
).properties(
    width=900,
    height=525,
    title="Shooting Location Counts"
)

save_it(chart, "shooting-locations-count")

It looks like most shootings happened at homes, and to avoid being shot you should move to a storage facility.

Coordinates

X_COORD_CD & Y_COORD_CD

These coordinates use the State Plane Coordinate System (see What is the State Plane Coordinate System?) Long Island Zone. The coordinates are mid-block locations with the units in feet.

chart = altair.Chart(
    data[["X_COORD_CD", "Y_COORD_CD", "BORO"]]).mark_point(opacity=0.2).encode(
        x=altair.X("X_COORD_CD").scale(domain=(data.X_COORD_CD.min(), data.X_COORD_CD.max())),
        y=altair.Y("Y_COORD_CD").scale(domain=(data.Y_COORD_CD.min(), data.Y_COORD_CD.max())),
        color="BORO"
).properties(
    width=800,
    height=525,
    title="Shootings By Coordinates",
)

save_it(chart, "shootings-by-coordinates")

coordinates = data[["X_COORD_CD", "Y_COORD_CD",
                    "BORO", "date_time",
                    "Longitude", "Latitude"]].rename(columns=dict(
    BORO="Borough",
    X_COORD_CD="X-Coordinate",
    Y_COORD_CD="Y-Coordinate"
))

coordinates["Year"] = coordinates.date_time.dt.year

select_year = altair.selection_point(
    name="Year",
    fields=["Year"],
    bind=altair.binding_range(min=2006, max=2021, step=1, name="Year"),
    value=[{"Year": 2006}],
)

chart = altair.Chart(coordinates).mark_point(opacity=0.2).encode(
    x=altair.X("X-Coordinate").scale(
        domain=(coordinates["X-Coordinate"].min(),
                coordinates["X-Coordinate"].max())),
    y=altair.Y("Y-Coordinate").scale(
        domain=(coordinates["Y-Coordinate"].min(),
                coordinates["Y-Coordinate"].max())),
    color="Borough",
    tooltip=[altair.Tooltip("Borough"),
             altair.Tooltip("date_time")],
).properties(
    width=800,
    height=525,
    title="Shootings By Coordinates and Year",
).add_params(select_year).transform_filter(select_year)

save_it(chart, "shootings-by-coordinates-and-year")

Latitude & Longitude

With GeoPandas

It turns out than the boroughs of New York City are one of the three datasets that come with GeoPandas so we can bring it in to make a map.

path_to_data = geopandas.datasets.get_path("nybb")
nyc_geopandas = geopandas.read_file(path_to_data)

print(nyc_geopandas.head(1))

   BoroCode       BoroName     Shape_Leng    Shape_Area   
0         5  Staten Island  330470.010332  1.623820e+09  \

                                            geometry  
0  MULTIPOLYGON (((970217.022 145643.332, 970227....

Let's look at a map colored with the 2021 shooting counts.

BOROUGH_COLUMN = "BoroName"
SHOOTINGS_COLUMN = "Shootings"
latest = data[data.date_time.dt.year==2021]
latest_count = latest.groupby("BORO").count().reset_index()
expect(latest_count.INCIDENT_KEY.sum()).to(equal(len(latest)))
latest_count = latest_count.rename(columns=dict(BORO=BOROUGH_COLUMN))
latest_count[BOROUGH_COLUMN] = latest_count[BOROUGH_COLUMN].str.title()
latest_count[SHOOTINGS_COLUMN] = latest_count["INCIDENT_KEY"].copy()
latest_count = latest_count[[BOROUGH_COLUMN, SHOOTINGS_COLUMN]]
merged = nyc_geopandas.merge(latest_count, on=BOROUGH_COLUMN)

merged = merged[[BOROUGH_COLUMN, SHOOTINGS_COLUMN, "geometry"]]
explorer = merged.explore(column=SHOOTINGS_COLUMN,
                          tooltip=BOROUGH_COLUMN,
                          popup=True)
explorer.save(f"{OUTPUT_PATH}/nyc-shootings-explore.html")

This is probably most interesting to someone like me who doesn't know much about New York City. I wonder why Queens has so much fewer shootings than Brooklyn or the Bronx when it sits between them? Demographics?

Murder or Not

By Year

MURDER_FLAG = "STATISTICAL_MURDER_FLAG"
MURDERED = "Murdered"
BOROUGH = "Borough"
YEAR = "Year"
data[YEAR] = data.date_time.dt.to_period("Y").dt.to_timestamp()
yearly_murders = data.groupby([YEAR, "BORO"]).sum(
    MURDER_FLAG).reset_index().rename(
        columns={
            "BORO": BOROUGH,
            MURDER_FLAG: MURDERED})
yearly_murders = yearly_murders[[YEAR, BOROUGH, MURDERED]]

SELECTION = altair.selection_point(fields=[BOROUGH], bind="legend")

chart = altair.Chart(yearly_murders).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
               x=altair.X(YEAR),
               y=altair.Y(MURDERED),
               color=BOROUGH,
               tooltip=[altair.Tooltip(YEAR),
                        altair.Tooltip(MURDERED),
                        altair.Tooltip(BOROUGH)],
               opacity=altair.condition(SELECTION,
                                        altair.value(0.8),
                                        altair.value(0.2))
           ).add_params(SELECTION).properties(
               width=800,
               height=525,
               title="NYC Murders By Year"
           )

save_it(chart, "yearly_borough_murders")
#+end_src

By Month

MONTH = "Month"
data[MONTH] = data.date_time.dt.to_period("M").dt.to_timestamp()
murdered = data.groupby([MONTH, "BORO"]).sum(MURDER_FLAG).reset_index()
murdered = murdered[[MONTH, "BORO", MURDER_FLAG]].rename(columns={
    "BORO": BOROUGH,
    MURDER_FLAG: MURDERED,    
})

SELECTION = altair.selection_point(fields=[BOROUGH], bind="legend")

chart = altair.Chart(murdered).mark_line(
    point={"filled": False,
           "fill": "white"}).encode(
               x=altair.X(MONTH),
               y=altair.Y(MURDERED),
               color=BOROUGH,
               tooltip=[altair.Tooltip(MONTH, format=MONTH_YEAR),
                        altair.Tooltip(MURDERED),
                        altair.Tooltip(BOROUGH)],
               opacity=altair.condition(SELECTION,
                                        altair.value(0.8),
                                        altair.value(0.2))
           ).add_params(SELECTION).properties(
               width=800,
               height=525,
               title="NYC Murders By Month"
           )

save_it(chart, "monthly_borough_murders")

It looks like Brooklyn lead in murders in the Summer of 2020,

By Coordinate

murdered = data[data[MURDER_FLAG]]

murder_coordinates = murdered[["BORO", "date_time",
                               "Longitude", "Latitude"
                               ]].rename(columns=dict(
                                   BORO="Borough",
                               ))
murder_coordinates["Year"] = murder_coordinates.date_time.dt.year

chart = (altair.Chart(murder_coordinates)
         .mark_point(opacity=0.2)
         .encode(
             x=altair.X("Longitude").scale(
                 domain=(murder_coordinates["Longitude"].min(),
                         murder_coordinates["Longitude"].max())),
             y=altair.Y("Latitude").scale(
                 domain=(murder_coordinates["Latitude"].min(),
                         murder_coordinates["Latitude"].max())),
             color="Borough",
             tooltip=[altair.Tooltip("Borough"),
                      altair.Tooltip("date_time")],
         ).properties(
             width=800,
             height=525,
             title="Murders By Longitude, Latitude and Year",
    ).add_params(select_year)
         .transform_filter(select_year))

save_it(chart, "murders-by-coordinates-and-year")

Victims and Perpetrators

Perpetrators

PERP_AGE_GROUP

print(TABLE(data.PERP_AGE_GROUP.value_counts(dropna=False).reset_index()))

PERP_AGE_GROUP	count
nan	9344
18-24	5844
25-44	5202
UNKNOWN	3148
<18	1463
45-64	535
65+	57
1020	1
940	1
224	1

PEPR_SEX

print(TABLE(data.PERP_SEX.value_counts(dropna=False).reset_index()))

PERP_SEX	count
M	14416
nan	9310
U	1499
F	371

PERP_RACE

print(TABLE(data.PERP_RACE.value_counts(dropna=False).reset_index()))

PERP_RACE	count
BLACK	10668
nan	9310
WHITE HISPANIC	2164
UNKNOWN	1836
BLACK HISPANIC	1203
WHITE	272
ASIAN / PACIFIC ISLANDER	141
AMERICAN INDIAN/ALASKAN NATIVE	2

It looks like besides the NaN values, missing data is also classified as U or UNKNOWN.

Handling the Unknowns

Let's try and unify the missing values.

data[data.PERP_AGE_GROUP.isna()] = "UNKNOWN"
data[data.PERP_AGE_GROUP == "U"] = "UNKNOWN"

print(TABLE(data.PERP_AGE_GROUP.value_counts(dropna=False).reset_index()))

PERP_AGE_GROUP	count
UNKNOWN	12492
18-24	5844
25-44	5202
<18	1463
45-64	535
65+	57
1020	1
940	1
224	1

data[data.PERP_SEX.isna()] = "UNKNOWN"
data[data.PERP_SEX == "U"] = "UNKNOWN"

print(TABLE(data.PERP_SEX.value_counts(dropna=False).reset_index()))

PERP_SEX	count
M	14416
UNKNOWN	10809
F	371

data[data.PERP_RACE.isna()] = "UNKNOWN"
data[data.PERP_RACE == "U"] = "UNKNOWN"

print(TABLE(data.PERP_RACE.value_counts(dropna=False).reset_index()))

PERP_RACE	count
UNKNOWN	11166
BLACK	10650
WHITE HISPANIC	2164
BLACK HISPANIC	1201
WHITE	272
ASIAN / PACIFIC ISLANDER	141
AMERICAN INDIAN/ALASKAN NATIVE	2

So, the first thing we can notice is that the UNKNOWN counts don't match so sometims some attributes were known but not others, so they don't just represent cases where the perpetrator wasn't caught like I thought it might be. The second thing is that for age and race the unknowns have the highest counts. For all three categories the numbers of unknowns are so high that it's hard to say how they might affect the cases where the values are known.

Victims

VIC_AGE_GROUP

print(TABLE(data.VIC_AGE_GROUP.value_counts(dropna=False).reset_index()))

VIC_AGE_GROUP	count
UNKNOWN	10863
25-44	6500
18-24	5389
<18	1657
45-64	1070
65+	117

VIC_SEX

print(TABLE(data.VIC_SEX.value_counts(dropna=False).reset_index()))

VIC_SEX	count
M	13182
UNKNOWN	10809
F	1598
U	7

VIC_RACE

print(TABLE(data.VIC_RACE.value_counts(dropna=False).reset_index()))

VIC_RACE	count
UNKNOWN	10851
BLACK	9962
WHITE HISPANIC	2515
BLACK HISPANIC	1528
WHITE	485
ASIAN / PACIFIC ISLANDER	249
AMERICAN INDIAN/ALASKAN NATIVE	6

Somewhat unexpectedly, the unknowns are also alarmingly high for the victims, suggesting that they represent poor case records and perhaps the demographic information isn't reliable enough to be useful.

Sources

New York City Map via WikiMedia Commons

I, Bubba

The Cloistered Monkey

2023-03-16 19:54

Bubba Panel One

Bubba Panel Two

Panel Three

Next Bubba

Table of Contents

Activity Selector Implementations

The Cloistered Monkey

2022-09-12 16:11

Example Data

PADDING = 0
start = [PADDING, 1, 3, 0, 5, 3, 5, 6, 8, 8, 2, 12]
finish = [PADDING, 4, 5, 6, 7, 9, 9, 10, 11, 12, 14, 16]
expected_activities = [1, 4, 8, 11]
expect(len(start)).to(equal(len(finish)))

A Recursive Selector

def recursive_activity_selector(start, finish, k: int=0) -> list:
    """Selects the optimal activities

    Note:
     This assumes that the start and finish lists are pre-padded with 
    zeros (1 each).


    Args:
     start: list of start times for each activity
     finish: list of end times for each activity
     k: the current activity (index)

    Returns:
     optimal list of activities
    """
    n = len(start)
    m = k + 1
    while m < n and start[m] < finish[k]:
        m += 1
    if m < n:
        return [m] + recursive_activity_selector(start, finish, m)
    return []

activities = recursive_activity_selector(start, finish)
expect(activities).to(contain_exactly(*expected_activities))
print(activities)

[1, 4, 8, 11]

An Iterative Selector

def iterative_activity_selector(start: list, finish: list) -> list:
    """An iterative activity selector

    Note:
     This assumes that the start and finish lists are pre-padded with 
    dummy values.

    Args:
     start: list of starting times
     finish: list of end times
    """
    n = len(start)
    activities = [1]
    k = 1
    for m in range(2, n):
        if start[m] >= finish[k]:
            activities += [m]
            k = m
    return activities

activities = iterative_activity_selector(start, finish)
expect(activities).to(contain_exactly(*expected_activities))
print(activities)

The Activity Selection Problem

The Cloistered Monkey

2022-08-04 17:25

What is the Activity Selection Problem?

In this problem we have a set of n activities \(S = \{a_1, a_2, \ldots, a_n\}\).
Assume the activities are sorted by increasing finish time
Each activity has a start time \(s_i\) and a finish time \(f_i\)
Only one activity can happen at a time.
Two activities are compatible if the earlier event finishes before the start time of the other event.

The problem we want to solve is finding the most compatible activities we can from our set of activities. This differs from a more common scheduling problem in that the times are fixed - we can't re-arrange activities to make them fit so they either can happen or they can't.

The Greedy Choice

CLRS goes into the more formal aspects of finding the optimal substructure and proving that a greedy strategy works, but in this case the reasoning is fairly intuitive (and I didn't find the proof so easy to follow) so I'll skip the formal stuff. The thing to recognize is just that the sooner the first activity you schedule finishes the more time there is for other activities. So if after each activity you pick you the next pick you make is the remaining activity that doesn't collide with the current pick and finishes first you will be maximizing the number of activities that you can schedule.

The Recursive Activity Selector

\begin{algorithm}
\caption{Recursive Activity Selector}
\begin{algorithmic}
\REQUIRE The activities are sorted in non-decreasing order of finish time
\INPUT $s$: The array of start times for the activities.
\INPUT $f$: The array of finish times for the activities.
\INPUT $k$: The last activity selected so far.
\INPUT $n$: The number of activities.
\OUTPUT The activities that maximize the number of compatible activities.
\PROCEDURE{RecursiveActivitySelector}{$s, f, k, n$}

\STATE \(m \gets k + 1\)

\WHILE {\(m \le n\) and \(s[m] < f[k]\)}
  \STATE \(m \gets m + 1\)
\ENDWHILE

\IF {\(m \le n\)}
  \RETURN \(\{a_m\} \cup \) \textsc{RecursiveActivitySelector}(\(s, f, m, n\))
\ELSE
  \RETURN \{\}
\ENDIF

\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Note: The activity set (\(\{a_1, a_2, \ldots, a_n\}\)) is treated as a global collection so it is used in the function but not passed in as an argument.

Padding The Start

Because the recursive algorithm is looking for the next earliest finish time after the first event we need to pad the activities with a dummy activity that has a finish time of 0 (\(f_0 = 0\)) and then when the function is first called the k value is set to 0:

\[ RecursiveActivitySelector(s, f, 0, n) \]

The Iterative Activity Selector

\begin{algorithm}
\caption{Greedy Activity Selector}
\begin{algorithmic}
\REQUIRE The activities are sorted in non-decreasing order of finish time
\INPUT $s$: The array of start times for the activities.
\INPUT $f$: The array of finish times for the activities.
\OUTPUT The activities that maximize the number of compatible activities.
\PROCEDURE{GreedyActivitySelector}{\(s, f\)}

\STATE \(n \gets s.length \)
\STATE \(A \gets \{a_1\} \)
\STATE \(k \gets 1 \)

\FOR {\(m  \in \{2, \ldots, n \}\)}
  \IF {\( s[m] \ge f[k] \)}
    \STATE \(A \gets A \cup \{a_m\}\)
    \STATE \(k \gets m \)
  \ENDIF
\ENDFOR

\RETURN \(A\)
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Sources

Introduction To Algorithms (CLRS)

Weighted Completion Times

The Cloistered Monkey

2022-07-20 16:33

The Weighted Scheduling Problem

This is a look at the problem of arranging a schedule of jobs so that their completion times are lowest, while accounting for different priorities for each job.

Input: \(n\) jobs with positive lengths \(l_1, l_2, \ldots, l_n\) and positive weights \(w_1, w_2,\ldots,w_n\). The larger the weight, the higher the priority.

Output: A job sequence (the schedule) that minimizes the sum of the weight completion times \(w_1l_1 + w_2 l_2 + \cdots + w_n l_n\).

Completion Time

The output above is a little confusing (I'll have to re-write that later). What we're minimizing is the sum of completion times. A job's completion time is the total of all the times for the jobs that precede it as well as the time to finish the job. This means the completion time for the first job is the time to run that job, the completion time for the second job is the time for the first job plus the time for the second job, and so on.

Greedy Metrics

The greedy method for this problem involves finding a metric such that when the jobs are sorted by the metric they form the optimal schedule. One possible metric is the Greedy Difference which is the weight for a job minus the time the job takes (\(w_i - l_i\)).

Another possible metric is the Greedy Ratio which is the ratio of job-weight to job-length (\(\frac{w_i}{l_i}\)).

Source

Algorithms Illuminated - Part III: Greedy Algorithms and Dynamic Programming

Table of Contents

Introduction

Setup

Imports

The Output Path

Constants

The Data

Counting the Years

A Table To Show the Counts

Now the Bar-Chart

And a Line Plot

The End

Links

Table of Contents

Setup

Imports

Some Setting Up

The Data

The Data Columns

Counting the Years

Using Pandas' value_counts Method

Using Altair's "count"

Comparing the File Sizes

The Files In Bytes

A Chart, Part By Part

Altair's Chart

A Bar Chart

SchemaBase

TopLevelUnitSpec

Encoded

_EncodingMixin

Properties

HVPlot

Links

The Posts In This Series

Tutorial Sources

The Data

Altair

Table of Contents

Posts

Setup

The Data

Creating a Chart

TopLevelUnitSpec

SchemaBase

_EncodingMixin

Making It a Bar Chart

Year

Life Ladder

By Year

The Most Recent Year (2022)

GDP

With Generosity

Links

Tutorial Sources

The Data

Altair

Table of Contents

Imports

Country

The Table vs Kaggle

Figure Data

The Figure Countries

Links

Table of Contents

NYPD Shooting Incident Data Version 2 (with python)

Imports and Setup

The Data

Loading It

The Columns

Missing Data

Incident Key

OCCUR_DATE and OCCUR_TIME

Shootings By Month

90 Day Rolling Window

By Year

Monthly By Year

Median By Month

Location

Descriptions

Using Pandas' `value_counts` Method