A First Look at dplyr

The Cloistered Monkey

2021-12-16 16:59

In the Beginning

This is a look at dplyr based on Chapter Three of R For Data Science. It focuses on what the book identifies as the six key dplyr functions.

Function	What It Does	By Another Name
`filter`	Pick out rows based on their values.	Boolean Slice
`arrange`	Re-order the rows.	Sort
`select`	Pick out variables by their names.	Column subset
`mutate`	Create new variables using existing ones.	Feature Engineering
`summarize`	Summarize many values as a single value.	Statistic
`group_by`	Changes scope of other functions to operate on group.

Imports

library(ascii)
library(assertthat)
library(dplyr)
library(ggplot2)
library(knitr)
library(nycflights13)
library(tidyverse)

Note: The stuff in the tidyverse tends to clobber built-in functions, which can make it hard to understand sometimes when things go wrong. If you forget to load dplyer, for instance, there's already a filter function, but it doesn't do the same thing as the one you load with dplyr so you'll get weird error messages if you try to use it. Notably I was getting this message:

Error in filter(airlines, carrier == "US") : object 'carrier' not found

Anyway, now some options for the packages.

options(asciiType="org")
theme_set(theme_minimal(base_family="Palatino"))

And a couple of helpers to print (the tibble objects dump weird characters into the org-mode output).

khead <- function(data) {
  #' Prints a table of the first lines of the dataframe
  #'
  #' @param data The data to print
  table <- kable(head(data))
  table[2] <- gsub(":", "-", table[2])
  table
}

okable <- function(data) {
  #' Prints data as an org-mode table
  #'
  #' @param data The data to print
  table <- kable(data)
  table[2] <- gsub(":", "-", table[2])
  table
}

got_all_your_data <- function(data) {
  #' Checks if there's any missing values
  #'
  #' @param data The vector of data to check
  see_if(noNA(data))
}

The Data

We're going to look at the flights dataframe loaded by the nycflights13 package (which, curiously, R loads using the library function).

str(flights)

tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

The dataset lists all the flights that left New York City airports in 2013 heading to destinations in the United States, Puerto Rico, and the American Virgin Islands.

Filter

The Basic Filter

filter takes the dataframe as the first argument and every argument after that has to be a boolean expression based on the dataframe. If you pass in more than one condition dplyr will and them. Let's see the flights on Christmas.

christmas <- filter(flights, month==12, day==25)
plot = ggplot(data=christmas) +
  geom_bar(
    mapping=aes(x=origin, fill=carrier)
  ) +
  labs(title="Christmas Flights", x="Origin", y="Count")

ggsave("christmas_flights_counts.png")

nil

Or, Not And

Since the default is for the filter to and everything together, if you want to use something else (like OR) you need to create a single argument built with the OR operator (|). If you're using a single column for the OR, there's actually a (possibly) easier way that works like python's in where you check if the value is in a collection. Let's see the flights on Christmas and on Christmas Eve .

both_days <- filter(flights, month==12, day %in% c(24, 25))

plot = ggplot(data=both_days) +
  geom_bar(
    mapping=aes(x=day, fill=origin)
  ) +
  labs(title="Christmas Eve and Day Flights", x="Day", y="Count")

ggsave("christmas_day_and_eve.png")

nil

And and Or

What if we want New-Years Day and Christmas? In this case we need to keep the month with the day so we can't just use the in operator, but we can use the AND and OR operators to make it work.

and_new_years = filter(flights, (month==12 & day==25) | (month==1 & day==1))

plot = ggplot(data=and_new_years) +
  geom_bar(
    mapping = aes(x=day, fill=origin)
  ) +
  labs(title="Christmas and New Years Day", x="Date", y="Count")

ggsave("christmas_new_years.png")

nil

plot = ggplot(data=and_new_years) +
  geom_boxplot(
    mapping = aes(x=origin, y=dep_delay, color=origin)
  ) +
  labs(title="Christmas and New Years Day Delays", x="Departure Airport", y="Delay (Minutes)")

ggsave("christmas_new_years_delay.png")

nil

Arrange

arrange is a function to sort the rows. By default it sorts then in non-decreasing (ascending) order using a column you give it. You can givie it multiple columns in which case whenever there's a tie it will look in the next column you gave it to see if it can break it (the tie, I mean).

Ascending Order

What are the shortest flights?

khead(arrange(flights, distance))

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	destination	air_time	distance	hour	minute	time_hour
2013	7	27	NA	106	NA	NA	245	NA	US	1632	NA	EWR	LGA	NA	17	1	6	2013-07-27 01:00:00
2013	1	3	2127	2129	-2	2222	2224	-2	EV	3833	N13989	EWR	PHL	30	80	21	29	2013-01-03 21:00:00
2013	1	4	1240	1200	40	1333	1306	27	EV	4193	N14972	EWR	PHL	30	80	12	0	2013-01-04 12:00:00
2013	1	4	1829	1615	134	1937	1721	136	EV	4502	N15983	EWR	PHL	28	80	16	15	2013-01-04 16:00:00
2013	1	4	2128	2129	-1	2218	2224	-6	EV	4645	N27962	EWR	PHL	32	80	21	29	2013-01-04 21:00:00
2013	1	5	1155	1200	-5	1241	1306	-25	EV	4193	N14902	EWR	PHL	29	80	12	0	2013-01-05 12:00:00

The data uses the three-letter airport codes, but there's another dataframe loaded with ggplot2 named airports that we can use to translate them to airport names.

okable(filter(airports, faa %in% c("PHL", "LGA", "EWR")))

faa	name	lat	lon	alt	tz	dst	tzone
EWR	Newark Liberty Intl	40.69250	-74.16867	18	-5	A	America/New_York
LGA	La Guardia	40.77725	-73.87261	22	-5	A	America/New_York
PHL	Philadelphia Intl	39.87194	-75.24114	36	-5	A	America/New_York

So the shortest flights are from Newark Liberty International to La Guardia airport and the Philadelphia International Airport. I wonder why someone would take a plane to go 17 miles? I guess the traffic must be really bad, or some people just have money (and jet fuel) to burn.

Descending Order

To get the data to sort in descending order you have to pass in another function call (desc). Weird.

What are the longest flights?

khead(arrange(flights, desc(distance)))

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	destination	air_time	distance	hour	time_hour
2013	1	1	857	900	-3	1516	1530	-14	HA	51	N380HA	JFK	HNL	659	4983	9	2013-01-01 09:00:00
2013	1	2	909	900	9	1525	1530	-5	HA	51	N380HA	JFK	HNL	638	4983	9	2013-01-02 09:00:00
2013	1	3	914	900	14	1504	1530	-26	HA	51	N380HA	JFK	HNL	616	4983	9	2013-01-03 09:00:00
2013	1	4	900	900	0	1516	1530	-14	HA	51	N384HA	JFK	HNL	639	4983	9	2013-01-04 09:00:00
2013	1	5	858	900	-2	1519	1530	-11	HA	51	N381HA	JFK	HNL	635	4983	9	2013-01-05 09:00:00
2013	1	6	1019	900	79	1558	1530	28	HA	51	N385HA	JFK	HNL	611	4983	9	2013-01-06 09:00:00

You can probably guess the airports from the codes, but we'll look them up anyway.

okable(filter(airports, faa %in% c("JFK", "HNL")))

faa	name	lat	lon	alt	tz	dst	tzone
HNL	Honolulu Intl	21.31868	-157.92243	13	-10	N	Pacific/Honolulu
JFK	John F Kennedy Intl	40.63975	-73.77893	13	-5	A	America/New_York

And the longest flights are from John F. Kennedy International airport to Honolulu International airport.

Select

select helps with getting subsets of columns. At first glance this seems like something that should be easy to do without needing a separate function, but it adds more ways to grab the columns than just passing in the names.

Grab Some Columns

The basic way to get a subset of columns is to pass them in as arguments.

khead(select(flights, month, flight))

month	flight
1	1545
1	1714
1	1141
1	725
1	461
1	1696

A Range of Columns

If you want to grab a section of contiguous columns you can pass in the first and last ones separated by a colon (the `:` character, not the vital organ).

khead(select(flights, dep_time:arr_time))

dep_time	sched_dep_time	dep_delay	arr_time
517	515	2	830
533	529	4	850
542	540	2	923
544	545	-1	1004
554	600	-6	812
554	558	-4	740

Helper Functions

Like arrange select uses functions passed in to alter the behavior. Most of them are string-methods-ish.

Function	What it does
`starts_with("<prefix>")`	Pick columns by a prefix.
`ends_with("<suffix>")`	Pick columns with a suffix.
`contains("<sub-string>")`	Pick columns with a sub-string of the name.
`matches("<regular-expression>")`	Pick out stuff using a regular expression.
`num_range("<prefix>", <number range>)`	Grab columns with the prefix followed by each of the numbers
`last_col()`	Get the last column (negative index).
`all_of(<vector>)`	Grab the columns in the vector.
`any_of(<vector>)`	Same as `all_of` but ignores non-existent columns in the vector
`where(<function>)`	Applies the function to the columns and picks the ones that return true

You can also use logical AND and OR operators or a vector to combine the selections, but we'll just do a basic one-function call. Let's grab all the columns with an underscore in their name.

khead(select(flights, contains("_")))

dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	air_time	time_hour
517	515	2	830	819	11	227	2013-01-01 05:00:00
533	529	4	850	830	20	227	2013-01-01 05:00:00
542	540	2	923	850	33	160	2013-01-01 05:00:00
544	545	-1	1004	1022	-18	183	2013-01-01 05:00:00
554	600	-6	812	837	-25	116	2013-01-01 06:00:00
554	558	-4	740	728	12	150	2013-01-01 05:00:00

You can also negate the selections using an exclamation point (!), so we can grab all the columns that don't have an underscore in their name like this.

khead(select(flights, !contains("_")))

year	month	day	carrier	flight	tailnum	origin	dest	distance	hour	minute
2013	1	1	UA	1545	N14228	EWR	IAH	1400	5	15
2013	1	1	UA	1714	N24211	LGA	IAH	1416	5	29
2013	1	1	AA	1141	N619AA	JFK	MIA	1089	5	40
2013	1	1	B6	725	N804JB	JFK	BQN	1576	5	45
2013	1	1	DL	461	N668DN	LGA	ATL	762	6	0
2013	1	1	UA	1696	N39463	EWR	ORD	719	5	58

Renaming

The book mentions renaming under the select section because it says that it is a special case of selection, but it has a dedicated function to do that named rename that you should use instead. Let's rename the first column (year).

khead(rename(flights, Jahr=year))

Jahr	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

Note: Pay attention to the funky syntax - the new name comes first and the original name comes after the equal sign.

Some More Useful Renaming

That wasn't really helpful, so how about we rename some other columns?

flights <- rename(flights, destination=dest)

Mutate

The mutate function adds a column (or columns) to the end of the dataframe based on things you tell it to do to one or more other columns.

First, we'll grabs a subset of columns to make this easier to see.

subset = select(flights, ends_with("delay"), air_time, distance)
khead(subset)

dep_delay	arr_delay	air_time	distance
2	11	227	1400
4	20	227	1416
2	33	160	1089
-1	-18	183	1576
-6	-25	116	762
-4	12	150	719

Now to add a couple of columns giving the difference between how late they arrived and how late they left and the average speed.

khead(mutate(subset,
             Time_Made_Up=dep_delay - arr_delay,
             Miles_Per_Minute = distance/air_time
             ))

dep_delay	arr_delay	air_time	distance	Time_Made_Up	Miles_Per_Minute
2	11	227	1400	-9	6.167401
4	20	227	1416	-16	6.237885
2	33	160	1089	-31	6.806250
-1	-18	183	1576	17	8.612022
-6	-25	116	762	19	6.568966
-4	12	150	719	-16	4.793333

Transmute

This is like the mutate function but it only keeps the new columns you created, not the original columns.

khead(transmute(subset,
       Time_Made_Up=dep_delay - arr_delay,
       Miles_Per_Hour = distance/(air_time/60)
       ))

Time_Made_Up	Miles_Per_Hour
-9	370.0441
-16	374.2731
-31	408.3750
17	516.7213
19	394.1379
-16	287.6000

Summarize

summarize reduces a collection of numbers to a single summary statistic. As an example we can look at the median departure delay. First, let's check if it has any missing values.

got_all_your_data(flights$dep_delay)

[1] FALSE
attr(,"msg")
[1] "data contains 8255 missing values"

Okay, so we'll just tell median to ignore it.

okable(summarize(flights, median_departure_delay=median(dep_delay, na.rm=TRUE)))

median_departure_delay
-2

What if we hadn't ignored the missing?

okable(summarize(flights, median_departure_delay=median(dep_delay)))

median_departure_delay
NA

The R summary functions like median and mean consider any missing data invalid unless you tell it to ignore the missing. How is this use of summarize different from just calling median?

median(flights$dep_delay, na.rm=TRUE)

[1] -2

Well, it doesn't look quite as nice. So, you might think, well, all we did was add a columns name, was that useful?

Grouping

The reason for summarize is that it works in tandem with groups. Instead of finding the median for the whole column, we can find the median by group.

by_month = group_by(flights, month)

okable(summarize(by_month, median_monthly_delay=median(dep_delay, na.rm=TRUE)))

month	median_monthly_delay
1	-2
2	-2
3	-1
4	-2
5	-1
6	0
7	0
8	-1
9	-3
10	-3
11	-3
12	0

Is that right? Let's look at a boxplot of the dataset and check.

plot = ggplot(data=by_month) +
  geom_boxplot(
    mapping=aes(x=month, y=dep_delay, fill=factor(month)), 
    outlier.shape=NA,
    na.rm=TRUE
  ) +
  scale_y_continuous(limits = quantile(by_month$dep_delay, c(0.1, 0.9),
                                       na.rm=TRUE)) +
  labs(title="Monthly Delays", x="Month", y="Delay (Minutes)")


ggsave("monthly_delay_boxplot.png")

nil

So, that isn't so exciting, but it does show that the median values tend to be negative (meaning they left early) like our summary said.

Hitting the Pipe

dplyr adds an operator called the "pipe" (%>%) that works like a unix pipe does, sending the output of one function to another, allowing you to save some keystrokes adding temporary variables or nesting function calls.

So, here's what we're going to do.

We're going to pass the flights dataset to the group_by function and group the data by destination
Then we're going to pass that output to the summarize function and summarize by:
- the counts
- mean distance of the flight
- and the mean arrival delay of each group
Then we're going to pass the summarized data to filter to get the top 20 that didn't go to Honolulu.

First let's do a check for missing values.

got_all_your_data(flights$arr_delay)
got_all_your_data(flights$distance)

[1] FALSE
attr(,"msg")
[1] "data contains 9430 missing values"
[1] TRUE

There's no missing distances but there are missing arrival delays.

delays_vs_distance <- flights %>%
  group_by(destination) %>%
  summarize(
    count = n(),
    distance = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, destination != "HNL")

khead(delays_vs_distance)

destination	count	distance	delay
ABQ	254	1826.0000	4.381890
ACK	265	199.0000	4.852273
ALB	439	143.0000	14.397129
ATL	17215	757.1082	11.300113
AUS	2439	1514.2530	6.019909
AVL	275	583.5818	8.003831

Note that we're using the mean here, not the median, so there's going to be more variation because of all the outliers. Also, because there are some missing values we had to tell the mean function to ignore them - otherwise the default is to return NA if any values are missing. There weren't any missing distance rows, but passing that argument in doesn't do any harm, I don't think. Now let's plot it.

plot = ggplot(data=delays_vs_distance, mapping=aes(x=distance, y=delay)) +
  geom_point(aes(size=count), alpha=0.4) +
  geom_smooth(se=FALSE) +
  labs(title="Distance Traveled vs Arrival Delay", x="Distance (Miles)",
       y="Delay (Minutes)")

ggsave("distance_vs_delay.png")

nil

It looks like up to a point the further you travel the worse the delay, and then it starts going in the opposite direction, with longer flights having shorter delays in their arrival times.

Link Collection

ascii: The R package to help change R-output to work better with org-mode.
assertthat: R assertions.
Bureau of Transportation Statistics: This is where the flights dataset originally comes from.
dplyr: the tidyverse page for this package
rmarkdown-cookbook: This page is about knitr's kable command.
nycflights13: A description of the datasets in this package.

R For Data Science

The Cloistered Monkey

2021-12-14 16:10

Preface
Explore
- Data Visualization with GGPlot2
- Workflow: Basics
- Data Transformation With DPlyr
- Workflow: Scripts
- Exploratory Data Analysis
- Workflow: Projects
Wrangle
- Tibbles With Tibble
- Data Import With Readr
- Tidy Data With Tidyr
- Relational Data With Dplyr
- Strings With Stringr
- Factors With Forcats
- Dates and Times With Lubridate
Program
- Pipes With Magrittr
- Functions
- Vectors
- Iterations With Purrr
Model
- Model Basics With Modelr
- Model Building
- Many Models With Purrr and Broom
Communicate
- R Markdown
- Graphics For Communication With GGPlot2
- R Markdown Formats
- R Markdown Workflow

Citation

[RFDS] Wickham H, Grolemund G. R for data science: import, tidy, transform, visualize, and model data. First edition. Sebastopol, CA: O’Reilly; 2016. 492 p.

A Look At ggplot

The Cloistered Monkey

2021-12-13 14:34

Introduction

This is a quick walk-through of some aspects of ggplot2 based on the first chapter of {% doc %}r-for-data-science{% /doc %}.

Set Up Stuff

Imports

Note: When you run the first code block to start the session, remember to give it the path to the posts-output folder so you don't need the path when saving the plots.

library("assertthat")
library("ggplot2")
library("knitr")
library("tidyverse")

Constants

These are stuff that get re-used in the plotting.

DISPLACEMENT <- "Engine Displacement (Liters)"
EFFICIENCY <-"Gallons Per Mile (Combined)"
DISPLACEMENT_VS_EFFICIENCY = partial(labs, x=DISPLACEMENT, y=EFFICIENCY)

BLUE <- "#4687b7"
RED <- "#ce7b6d"

The GGPlot2 Theme

theme_set(theme_minimal(base_family="Palatino"))

The MPG Dataset

For some reason loading ggplot2 also loads this dataset. Its short description is "Fuel economy data from 1999 to 2008 for 38 popular models of cars". The long description:

This dataset contains a subset of the fuel economy data that the EPA makes available on <URL: https://fueleconomy.gov/>. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

The Variables

It's a data frame with 234 rows and 11 variables:

Variable	Description
manufacturer	manufacturer name
model	model name
displ	engine displacement, in litres
year	year of manufacture
cyl	number of cylinders
trans	type of transmission
drv	drive train (listed below)
cty	city miles per gallon
hwy	highway miles per gallon
fl	fuel type
class	"type" of car

I'll do some column renaming here since I keep forgetting what abbreviations they used.

Renaming Some Columns

mpg <- mpg %>%
  rename(
    displacement=displ,
    cylinders=cyl,
    transmission=trans,
    drive_train=drv,
    city_mpg=cty,
    highway_mpg=hwy,
    fuel_type=fl,
    )

str(mpg)

tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displacement: num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cylinders   : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ transmission: chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drive_train : chr [1:234] "f" "f" "f" "f" ...
 $ city_mpg    : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ highway_mpg : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fuel_type   : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Looking at the Data

kable(head(mpg), caption="MPG Dataset (EPA Automobile Data)")

Table: MPG Dataset (EPA Automobile Data)

manufacturer	model	displacement	year	cylinders	transmission	drive_train	city_mpg	highway_mpg	fuel_type	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

Some Renaming

The Drive Trains

Abbreviation	Meaning
f	Front-Wheel Drive
r	Rear-Wheel Drive
4	Four-Wheel Drive

These aren't too obscure, but I think I'll rename them anyway.

mpg$drive_train <- recode(mpg$drive_train,
                          f="Front-Wheel", r="Rear-Wheel", `4`="Four-Wheel" )
kable(unique(mpg$drive_train), col.names=c("Drives"))

Drives
Front-Wheel
Four-Wheel
Rear-Wheel

assert_that(noNA(mpg$drive_train))

Fuel Types

Although this is about ggplot2, and the data is sort of a side effect, since I had to look some of this data up to figure out what it meant (at fueleconomy.gov) I'll use this opportunity to document it by updating the data.

mpg$fuel_type <- recode(mpg$fuel_type,
                        p="Premium Unleaded",
                        r="Regular Unleaded",
                        e="Ethanol",
                        d="Diesel Fuel",
                        c="Compressed Natural Gas")

kable(unique(mpg$fuel_type), col.names = c("Fuel Type"))

Fuel Type
Premium Unleaded
Regular Unleaded
Ethanol
Diesel Fuel
Compressed Natural Gas

For some reason I couldn't find a table that directly mapped to the data set so I'm sort of guessing that this is what they stood for based on the 1999 guide on their download page. There's also a L for Liquified Petroleum Gas (Propane) in the guide but it's not in the data set.

assert_that(noNA(mpg$fuel_type))

Transmission

mpg$transmission <- recode(mpg$transmission,
                           `auto(av)`="Continuously Variable",
                           `auto(s4)`="Automatic 4-Speed",
                           `auto(s5)`="Automatic 5-Speed",
                           `auto(s6)`="Automatic 6-Speed",
                           `auto(l3)`="Automatic Lockup 3-Speed",
                           `auto(l4)`="Automatic Lockup 4-Speed",
                           `auto(l5)`="Automatic Lockup 5-Speed",
                           `auto(l6)`="Automatic Lockup 6-Speed",
                           `manual(m5)`="Manual 5-Speed",
                           `manual(m6)`="Manual 6-Speed"
                           )

kable(unique(mpg$transmission), col.names = c("Transmission"))

Transmission
Automatic Lockup 5-Speed
Manual 5-Speed
Manual 6-Speed
Continuously Variable
Automatic 6-Speed
Automatic Lockup 4-Speed
Automatic Lockup 3-Speed
Automatic Lockup 6-Speed
Automatic 5-Speed
Automatic 4-Speed

Strangely, there's no listing for 4-speed manual transmissions.

assert_that(noNA(mpg$transmission))

Displacement Vs Highway Mileage

Now we come to the basic thing that the chapter in R For Data Science looks at (mostly) - what affects mileage for automobiles. The first thing we'll do is look at how the size of the engine affects the mileage.

plot = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displacement, y=highway_mpg, color=highway_mpg)) +
  labs(x=DISPLACEMENT, y="Highway MPG",
       title="Displacement vs Mileage")

filename = "displacement_vs_mileage.png"
ggsave(filename)

nil

It has a roughly linear relationship, as you might expect, but the rest of this post is about how to get more information out of the data using ggplot2.

Ditching the Highway (Mileage)

Combined Mileage

The EPA uses a weighted average to create a combined metric that uses both Highway and City mileage.

\[ \textrm{Combined Fuel Economy} = 0.55 \times \textit{city} + 0.45 \times \textit{highway} \]

I've also seen it done as the harmonic mean (on Illustrative Mathematics), but if you round to the nearest whole number it comes out about the same.

mpg$combined_fuel_economy <- 0.55 * mpg$city_mpg + 0.45 * mpg$highway_mpg

plot = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displacement, y=combined_fuel_economy, color=combined_fuel_economy)) +
  labs(x=DISPLACEMENT, y="Combined MPG",
       title="Displacement vs Mileage")

filename = "displacement_vs_combined_mileage.png"
ggsave(filename)

nil

Comparing the Two

We can plot them together to see how much the combined MPG differs from the highway MPG.

plot = ggplot(data=mpg) +
  geom_smooth(mapping=aes(x=displacement, y=combined_fuel_economy), color="red") +
  geom_smooth(mapping=aes(x=displacement, y=highway_mpg), color="blue") +
  labs(x=DISPLACEMENT, y="MPG",
       title="Displacement vs Mileage (Highway and Combined)")

filename = "highway_vs_combined_mileage.png"
ggsave(filename)

nil

Adding the city mileage drops the mileage somewhat, although the line looks to be about the same shape.

Gallons Per Mile

The EPA's explanation of their vehicle labels mentions that the reason why they put a Fuel Consumption Rate is that the Miles Per Gallon metric is actually not a good way to compare mileage - it is subject to a "Miles Per Gallon Illusion" (ReaClearScience has an article about it). Here's how the number of gallons used per 1,000 miles driven changes with miles-per-gallon.

mileage <- seq(1, 100)
miles = 1000
gallons_per_1000_miles <- miles/mileage

improvement <- data.frame(MPG=mileage,
                          gallons=gallons_per_1000_miles)
plot = ggplot(data=improvement) +
  geom_point(mapping=aes(x=mileage, y=gallons), alpha=0.5, color="blue") +
  geom_line(mapping=aes(x=mileage, y=gallons), color="blue") +
  labs(title="Fuel Consumption vs MPG",
       x="Miles Per Gallon",
       y="Gallons Used Per 1,000 Miles")

ggsave("gallons_vs_mpg.png")

nil

As you can see the improvement in the number of gallons used goes down pretty fast as the miles-per-gallon goes up. Let's see about adding a fuel-efficency column. What happens when you switch to Gallons Per Mile?

gallons_per_mile <- mileage
gallons_per_1000_miles <- miles * gallons_per_mile

improvement <- data.frame(gpm=gallons_per_mile,
                          gallons=gallons_per_1000_miles)
plot = ggplot(data=improvement) +
  geom_point(mapping=aes(x=gpm, y=gallons), alpha=0.5, color="blue") +
  geom_line(mapping=aes(x=gpm, y=gallons), color="blue") +
  labs(title="Fuel Consumption vs Gallons Per Mile",
       x="Gallons Per Mile",
       y="Gallons Used Per 1,000 Miles")

ggsave("gallons_vs_efficiency.png")

nil

I suppose you could just figure that that'd be the case, but this is about plotting.

mpg$gallons_per_mile <- 1/mpg$combined_fuel_economy

plot = ggplot(data=mpg, mapping=aes(x=displacement, y=gallons_per_mile)) +
  geom_point() +
  DISPLACEMENT_VS_EFFICIENCY(title="Displacement vs Fuel Efficiency")

filename = "displacement_vs_fuel_efficiency.png"
ggsave(filename)

nil

Even though this is just the inverse of the MPG plot it seems more intuitive to me. I didn't really notice that big outlier at the top when I plotted it using MPG, so I think I'll stick with this.

Displacement With Class

The displacement seems to mostly have a linear relationship with mileage, but there's exceptions, let's see if showing the class of the car reveals any interesting patterns.

unique(mpg$class)

[1] "compact"    "midsize"    "suv"        "2seater"    "minivan"   
[6] "pickup"     "subcompact"

Efficiency

plot = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displacement, y=gallons_per_mile, color=class),
             position="jitter") +
  DISPLACEMENT_VS_EFFICIENCY(title="Displacement vs Fuel Efficiency")

filename = "displacement_vs_mileage_with_class.png"
ggsave(filename)

nil

I added a little jitter because it looked like the top outlier was just one point. It looks like 2-seaters and one of the mid-sized cars had unusually good mileage given their engine displacement. What's a two-seater?

kable(filter(mpg, class=="2seater"))

manufacturer	model	displacement	year	cylinders	transmission	drive_train	city_mpg	highway_mpg	fuel_type	class	combined_fuel_economy	gallons_per_mile
chevrolet	corvette	5.7	1999	8	Manual 6-Speed	Rear-Wheel	16	26	Premium Unleaded	2seater	20.50	0.0487805
chevrolet	corvette	5.7	1999	8	Automatic Lockup 4-Speed	Rear-Wheel	15	23	Premium Unleaded	2seater	18.60	0.0537634
chevrolet	corvette	6.2	2008	8	Manual 6-Speed	Rear-Wheel	16	26	Premium Unleaded	2seater	20.50	0.0487805
chevrolet	corvette	6.2	2008	8	Automatic 6-Speed	Rear-Wheel	15	25	Premium Unleaded	2seater	19.50	0.0512821
chevrolet	corvette	7.0	2008	8	Manual 6-Speed	Rear-Wheel	15	24	Premium Unleaded	2seater	19.05	0.0524934

Just Corvettes here.

Faceting The Classes

While the use of color is useful for comparing the classes on the same plot, sometimes it can be easier to interpret plots that have the categorical classes separated into sub-plots (called facets).

By default the faceting functions use aphabetical order, I'm going to use a slightly different ordering to see if it makes it clearer how the classes compare (this is based on the medians that I plotted below).

mpg$class_order <- factor(mpg$class,
                          levels=c("compact", "subcompact", "midsize",
                                   "2seater",
                                   "minivan", "suv", "pickup"))

plot = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displacement, y=gallons_per_mile, color=gallons_per_mile)) +
  facet_grid(. ~ class_order) + 
  DISPLACEMENT_VS_EFFICIENCY(title="Displacement vs Efficiency")

filename = "displacement_vs_mileage_faceted_classes.png"
ggsave(filename)

I'm using facet_grid here, which is actually meant to use two variables (creating row and column separation). By putting the . before the ~ we're telling it to only facet using columns. There's also a facet_wrap function that only does one variable but adds the ability to specify the number of rows you want to use, which might be helpful if you have a lot of classes.

nil

I wouldn't have thought it, but the Corvette falls between the minivan and the sub-compact for fuel efficiency. I guess when you consider that it can only be used to transport two people and not anything relatively large it's still inefficient, just not as much if you're only using it to commute.

Smoothing the Line

Instead of plotting a scatterpoint, you can use geom_smooth to fit a single line representing the trend for the data (using a linear model).

plot = ggplot(data=mpg, mapping=aes(x=displacement, y=gallons_per_mile)) +
  geom_smooth() +
  geom_point() +
  facet_grid(class_order ~ drive_train) +
  DISPLACEMENT_VS_EFFICIENCY(title="Displacement vs Efficiency By Type")

ggsave("displacement_vs_mileage_smooth.png")

nil

One interesting thing here is that the four-wheel-drive compact and sub-compact cars did better than some of the front-wheel drive cars in the same class.

If you look at the front-wheel-drive sub-compact cars you'll see that the curve-fitting can get a little funky if you're not careful.

Note the syntax for the plot. Before we were passing the aes mapping to the geometry object. You can do that here too, but if you pass it to the ggplot call then it will automatically apply it to all the geometries. If you wanted to then change one of them you could pass in the aes mapping to it and override the base mapping you gave to ggplot.

Different Datasets to the Geometries

Besides passing in different aes mappings, you can also pass in different sub-sets of the data to each geometry (using dplyr's filter to isolate the SUVs here).

plot = ggplot(data=mpg, mapping=aes(x=displacement, y=gallons_per_mile)) +
  geom_smooth(data=filter(mpg, class=="suv")) +
  geom_point(mapping=aes(color=class), position="jitter") +
  DISPLACEMENT_VS_EFFICIENCY(title="Displacement vs Fuel Efficiency")

filename = "displacement_vs_mileage_subsets.png"
ggsave(filename)

nil

Looking at the trend-line for SUVs you can see that it encompasses a surprising swath of the displacements, and it is mostly upwardly linear to begin with but then flattens out.

plot = ggplot(
  data = mpg,
  mapping = aes(x = displacement, y = gallons_per_mile, color = drive_train)
) +
  geom_point() +
  geom_smooth(se = FALSE) +
  DISPLACEMENT_VS_EFFICIENCY(title="Effect of Drive (Four, Front, or Rear) on Mileage")

ggsave("displacement_exercise_2.png")

nil

It kind of looks like an emerging pattern that up to a point gasoline consumption goes up with engine size but then once a certain point is passed consumption flattens out, even as the engines get bigger.

Small Four-Wheel Drive Automobiles

Just out of curiosity, I'd like to see what those small four-wheel drive cars are.

small <- filter(mpg, displacement<3 & drive_train=="Four-Wheel")
plot = ggplot(
  data=small,
  mapping=aes(x=displacement, y=gallons_per_mile, color=model)
) +
  geom_point() +
  facet_grid(class_order ~ manufacturer) +
  DISPLACEMENT_VS_EFFICIENCY(title="4WD Less Than 3 Liters")

ggsave("small_4wd.png")

nil

There are only three companies making these small four-wheel-drive automobiles, with Audi being the one focused on sedans.

Bar Plots

By Count

The Bar Plot is an example of what the Grammar of Graphics calls a Statistical Transformation (I guess the scatter plots are so called "Identity" transformations, but, whatever). It doesn't display the values in the data (unless you tell it to) but instead shows how often categorical values appear in the data. We can also have it set the color to create a stacked bar plot.

plot = ggplot(data=mpg) +
  geom_bar(
    mapping = aes(x=class_order, fill=drive_train)
  ) +
  labs(title="Count of Cars by Class", x="Class Ordered by Median MPG", y="Count")

ggsave("class_count.png")

nil

So most models of car are SUVs, how disheartening. This is only a sub-set of cars, though, and it doesn't reflect sales, just model counts.

By Proportion

Instead of using counts we can use proportions to show how the sub-category portions compare across categories.

plot = ggplot(data=mpg) +
  geom_bar(
    mapping = aes(x=class_order, fill=drive_train),
    position="fill"
  ) +
  labs(title="Proportion of Drives by Class", x="Class Ordered by Median MPG", y="Proportion")

ggsave("class_count_filled.png")

nil

This shows what proportion if each class of car has what type of drive.

Un-Stacked

To make it easier to see how sub-categories compare within the class, we can change the positioning so that the bars are put side-by-side.

plot = ggplot(data=mpg) +
  geom_bar(
    mapping = aes(x=class_order, fill=drive_train),
    position="dodge"
  ) +
  labs(title="Count of Model Drives by Class", x="Class Ordered by Median MPG", y="Count")

ggsave("class_count_dodged.png")

nil

Proportions Not Counts

Our original bar graph showed the counts, but we can show the same plot using the proportion that each categories' count is of the total count as the y-axis.

plot = ggplot(data=mpg, mapping=aes(color=class)) +
  geom_bar(
    mapping = aes(x=class_order, y=..prop.., group=1)
  ) +
  labs(title="Proportion of Cars by Class", x="Class Ordered by Median MPG", y="Proportion")

ggsave("class_proportion.png")

nil

Two things:

That funky ..prop.. is really how R specifies that argument
The group=1 argument tells ggplot that there's only one group for the whole dataset, otherwise each category is taken as a separate group and they are all end up as 100% of their group.
I couldn't get ggplot to set the colors based on the class

Stat Summary

This is a way to plot different kinds of summary statistics for the dataset.

plot = ggplot(data=mpg) +
  stat_summary(
    mapping = aes(x=class_order, y=gallons_per_mile, color=class_order),
    fun.min = min,
    fun.max = max,
    fun = median
  ) +
  labs(title="Spread of Efficiency by Class", y=EFFICIENCY, x="Class")

ggsave("mpg_class_spread.png")

nil

Boxplot

That was a nice clean way to show it, and it's flexible, but maybe a good old-fashioned box-plot is the best, in the end.

plot = ggplot(data=mpg) +
  geom_boxplot(
    mapping = aes(x=class_order, y=gallons_per_mile, color=class_order),
  ) +
  labs(title="Spread of Efficiency by Class", y=EFFICIENCY, x="Class")

ggsave("mpg_class_boxplot.png")

nil

The End

Well, that's more than enough of that. The Chapter in R For Data Science has other stuff in there, in particular there's a section on changing the coordinate system, but I think it's time to move on.

Link Collection

fueleconomy.gov: the federal government's site on fuel efficiency for cars sold in the United States
ggplot2 main page
RealClear Science on the MPG Illusion
The Grammar of Graphics: This is a page on the University of Chicago's Computing For Social Sciences site that has an explanation of what makes up the Layered Grammar of Graphics that ggplot2 is based on.

NYPD Shooting Incident Data

The Cloistered Monkey

2021-12-11 14:34

NYPD Shooting Incident Data

NYPD Shooting Incident Data

This is a project that looks at the NYPD Shooting Incident Data (Historic) from the DATA.gov catalog of datasets. This dataset lists every shooting incident in New York City from 2006 through the end of the previous calendar year. It is compiled by NYC Open Data and was last updated on December 11, 2021.

Each row in the data is a "shooting incident" and there are 19 columns.

Loading the Data

library(assertthat)
library(knitr)
library(lubridate)
library(tidyverse)

Lubridate is part of the tidyverse but it has to be loaded for some reason.

DATA_URL <- "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD" 

data = read_csv(DATA_URL)


[1mindexed[0m [32m0B[0m in [36m 0s[0m, [32m0B/s[0m
[1mindexed[0m [32m393.21kB[0m in [36m 0s[0m, [32m983.15kB/s[0m
                                                                              
[1mindexed[0m [32m524.28kB[0m in [36m 1s[0m, [32m1.03MB/s[0m
[1mindexed[0m [32m655.36kB[0m in [36m 1s[0m, [32m1.22MB/s[0m
[1mindexed[0m [32m786.43kB[0m in [36m 1s[0m, [32m1.24MB/s[0m
[1mindexed[0m [32m917.50kB[0m in [36m 1s[0m, [32m1.23MB/s[0m
                                                                              
[1mindexed[0m [32m1.05MB[0m in [36m 1s[0m, [32m1.18MB/s[0m
[1mindexed[0m [32m1.18MB[0m in [36m 1s[0m, [32m1.18MB/s[0m
[1mindexed[0m [32m1.31MB[0m in [36m 1s[0m, [32m1.06MB/s[0m
[1mindexed[0m [32m1.44MB[0m in [36m 2s[0m, [32m947.54kB/s[0m
                                                                              
[1mindexed[0m [32m1.57MB[0m in [36m 2s[0m, [32m1.00MB/s[0m
[1mindexed[0m [32m1.70MB[0m in [36m 2s[0m, [32m1.04MB/s[0m
[1mindexed[0m [32m1.83MB[0m in [36m 2s[0m, [32m1.04MB/s[0m
[1mindexed[0m [32m1.97MB[0m in [36m 2s[0m, [32m1.02MB/s[0m
[1mindexed[0m [32m2.10MB[0m in [36m 2s[0m, [32m1.03MB/s[0m
[1mindexed[0m [32m2.23MB[0m in [36m 2s[0m, [32m994.69kB/s[0m
[1mindexed[0m [32m2.36MB[0m in [36m 2s[0m, [32m965.18kB/s[0m
[1mindexed[0m [32m2.49MB[0m in [36m 3s[0m, [32m939.81kB/s[0m
[1mindexed[0m [32m2.62MB[0m in [36m 3s[0m, [32m951.26kB/s[0m
[1mindexed[0m [32m2.75MB[0m in [36m 3s[0m, [32m919.65kB/s[0m
[1mindexed[0m [32m2.88MB[0m in [36m 3s[0m, [32m855.66kB/s[0m
[1mindexed[0m [32m3.01MB[0m in [36m 3s[0m, [32m892.99kB/s[0m
[1mindexed[0m [32m3.15MB[0m in [36m 3s[0m, [32m904.11kB/s[0m
[1mindexed[0m [32m3.28MB[0m in [36m 4s[0m, [32m867.67kB/s[0m
[1mindexed[0m [32m3.41MB[0m in [36m 4s[0m, [32m878.71kB/s[0m
[1mindexed[0m [32m3.54MB[0m in [36m 4s[0m, [32m888.47kB/s[0m
[1mindexed[0m [32m3.67MB[0m in [36m 4s[0m, [32m899.43kB/s[0m
[1mindexed[0m [32m3.80MB[0m in [36m 4s[0m, [32m846.12kB/s[0m
[1mindexed[0m [32m3.93MB[0m in [36m 5s[0m, [32m861.49kB/s[0m
[1mindexed[0m [32m4.06MB[0m in [36m 5s[0m, [32m876.08kB/s[0m
[1mindexed[0m [32m4.19MB[0m in [36m 5s[0m, [32m854.40kB/s[0m
[1mindexed[0m [32m4.33MB[0m in [36m 5s[0m, [32m877.77kB/s[0m
[1mindexed[0m [32m4.46MB[0m in [36m 5s[0m, [32m855.56kB/s[0m
[1mindexed[0m [32m4.59MB[0m in [36m 5s[0m, [32m865.02kB/s[0m
[1mindexed[0m [32m4.72MB[0m in [36m 6s[0m, [32m855.32kB/s[0m
[1mindexed[0m [32m4.85MB[0m in [36m 6s[0m, [32m847.74kB/s[0m
[1mindexed[0m [32m4.98MB[0m in [36m 6s[0m, [32m870.08kB/s[0m
[1mindexed[0m [32m5.11MB[0m in [36m 6s[0m, [32m892.61kB/s[0m
[1mindexed[0m [32m5.11MB[0m in [36m 6s[0m, [32m892.93kB/s[0m
[1mindexed[0m [32m1.00TB[0m in [36m 6s[0m, [32m174.56GB/s[0m
                                                                              
[1mRows: [22m[34m23585[39m [1mColumns: [22m[34m19[39m
[36m──[39m [1mColumn specification[22m [36m──────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (10): OCCUR_DATE, BORO, LOCATION_DESC, PERP_AGE_GROUP, PERP_SEX, PERP_R...
[32mdbl[39m   (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
[33mlgl[39m   (1): STATISTICAL_MURDER_FLAG
[34mtime[39m  (1): OCCUR_TIME

[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.

Tidying and Transforming the Data

spec(data)

cols(
  INCIDENT_KEY = [32mcol_double()[39m,
  OCCUR_DATE = [31mcol_character()[39m,
  OCCUR_TIME = [34mcol_time(format = "")[39m,
  BORO = [31mcol_character()[39m,
  PRECINCT = [32mcol_double()[39m,
  JURISDICTION_CODE = [32mcol_double()[39m,
  LOCATION_DESC = [31mcol_character()[39m,
  STATISTICAL_MURDER_FLAG = [33mcol_logical()[39m,
  PERP_AGE_GROUP = [31mcol_character()[39m,
  PERP_SEX = [31mcol_character()[39m,
  PERP_RACE = [31mcol_character()[39m,
  VIC_AGE_GROUP = [31mcol_character()[39m,
  VIC_SEX = [31mcol_character()[39m,
  VIC_RACE = [31mcol_character()[39m,
  X_COORD_CD = [32mcol_double()[39m,
  Y_COORD_CD = [32mcol_double()[39m,
  Latitude = [32mcol_double()[39m,
  Longitude = [32mcol_double()[39m,
  Lon_Lat = [31mcol_character()[39m
)

Incident Key

According to the Data Description the "Incident Key" is a randomly generated identifier for the incident. This is a likely candidate to remove unless we look at specific incidents.

Occur Date

This is the date that the shooting happened. It was loaded as a string so we should convert it to a date.

data$OCCUR_DATE[1]

[1] "08/27/2006"

data$DATE <- as.Date(data$OCCUR_DATE, format="%m/%d/%Y")
str(data$DATE)
data$DATE[1]
assert_that(noNA(data$DATE))

 Date[1:23585], format: "2006-08-27" "2011-03-11" "2019-10-06" "2011-09-04" "2013-05-27" ...
[1] "2006-08-27"
[1] TRUE

Occur Time

This is the time of the shooting. It appears to have already been read in as a time.

str(data$OCCUR_TIME)
assert_that(noNA(data$OCCUR_TIME))

 'hms' num [1:23585] 05:35:00 12:03:00 01:09:00 03:35:00 ...
 - attr(*, "units")= chr "secs"
[1] TRUE

Boro

This is the NYC Borough where the shooting happened.

data$Borough <- as.factor(data$BORO)
assert_that(is.factor(data$Borough))
assert_that(noNA(data$Borough))

[1] TRUE
[1] TRUE

knitr.kable reformats the dataframe to a more readable table. Kind of like tabulate but less flexible.

big_mark <- function(vector, columns){
    kable(table(vector), col.names=columns, format.args=list(big.mark=","))
}
big_mark(data$Borough, c("Borough", "Shootings"))

Borough	Shootings
BRONX	6,701
BROOKLYN	9,734
MANHATTAN	2,922
QUEENS	3,532
STATEN ISLAND	696

Precinct

This is the precinct number where the shooting happened.

n_distinct(data$PRECINCT)
str(data$PRECINCT)
data$PRECINCT_FACTOR <- as.factor(data$PRECINCT)
see_if(noNA(data$PRECINCT))

[1] 77
 num [1:23585] 52 106 77 40 100 67 77 81 101 106 ...
[1] TRUE

Jurisdiction Code

The jurisdiction where the shooting occurred. There are three jurisdictions that are NYC jurisdictions:

Code	Jurisdiction
0	Patrol
1	Transit
2	Housing

Any numbers higher than this are non-NYC jurisdictions.

data$JURISDICTION <- as.factor(data$JURISDICTION_CODE)
see_if(noNA(data$JURISDICTION))

[1] FALSE
attr(,"msg")
[1] "data$JURISDICTION contains 2 missing values"

big_mark(data$JURISDICTION, c("Jurisdiction", "Shootings"))

Jurisdiction	Shootings
0	19,629
1	54
2	3,900

Location Description

data$LOCATION_DESC[1]
see_if(noNA(data$LOCATION_DESC))

[1] NA
[1] FALSE
attr(,"msg")
[1] "data$LOCATION_DESC contains 13581 missing values"

n_distinct(data$LOCATION_DESC)

[1] 40

Here's what the first filled-in entry looks like.

first(na.omit(data$LOCATION_DESC))

[1] "MULTI DWELL - PUBLIC HOUS"

Too bad there's so many missing values.

big_mark(data$LOCATION_DESC, c("Location", "Shootings"))

Location	Shootings
ATM	1
BANK	1
BAR/NIGHT CLUB	562
BEAUTY/NAIL SALON	100
CANDY STORE	6
CHAIN STORE	5
CHECK CASH	1
CLOTHING BOUTIQUE	14
COMMERCIAL BLDG	234
DEPT STORE	5
DOCTOR/DENTIST	1
DRUG STORE	11
DRY CLEANER/LAUNDRY	30
FACTORY/WAREHOUSE	6
FAST FOOD	98
GAS STATION	53
GROCERY/BODEGA	574
GYM/FITNESS FACILITY	3
HOSPITAL	38
HOTEL/MOTEL	24
JEWELRY STORE	12
LIQUOR STORE	36
LOAN COMPANY	1
MULTI DWELL - APT BUILD	2,553
MULTI DWELL - PUBLIC HOUS	4,240
NONE	175
PHOTO/COPY STORE	1
PVT HOUSE	857
RESTAURANT/DINER	188
SCHOOL	1
SHOE STORE	9
SMALL MERCHANT	25
SOCIAL CLUB/POLICY LOCATI	66
STORAGE FACILITY	1
STORE UNCLASSIFIED	35
SUPERMARKET	19
TELECOMM. STORE	5
VARIETY STORE	11
VIDEO STORE	2

Statistical Murder Flag

This is a checkbox indicating that the victim died as a result of the shooting.

see_if(noNA(data$STATISTICAL_MURDER_FLAG))

[1] TRUE

big_mark(data$STATISTICAL_MURDER_FLAG, c("Victim Died", "Count"))

Victim Died	Count
FALSE	19,085
TRUE	4,500

Perpetrator's Age Group

see_if(noNA(data$PERP_AGE_GROUP))
data$PERP_AGE <- as.factor(data$PERP_AGE_GROUP)

[1] FALSE
attr(,"msg")
[1] "data$PERP_AGE_GROUP contains 8295 missing values"

big_mark(data$PERP_AGE, c("Perpetrator's Age Group", "Shootings"))

Perpetrator's Age Group	Shootings
<18	1,368
1020	1
18-24	5,508
224	1
25-44	4,714
45-64	495
65+	54
940	1
UNKNOWN	3,148

Peprpetrator's Sex

There are three values for "Sex" - Female (F), Male (M), or Unknown (U).

see_if(noNA(data$PERP_SEX))
data$PERP_SEX <- as.factor(data$PERP_SEX)

[1] FALSE
attr(,"msg")
[1] "data$PERP_SEX contains 8261 missing values"

big_mark(data$PERP_SEX, c("Perpetrator's Sex", "Shootings"))

Perpetartor's Sex	Shootings
F	335
M	13,490
U	1,499

Perpetrator's Race

see_if(noNA(data$PERP_RACE))
data$PERP_RACE <- as.factor(data$PERP_RACE)

[1] FALSE
attr(,"msg")
[1] "data$PERP_RACE contains 8261 missing values"

big_mark(data$PERP_RACE, c("Perpetrator's Race", "Shootings"))

Perpetrator's Race	Shootings
AMERICAN INDIAN/ALASKAN NATIVE	2
ASIAN / PACIFIC ISLANDER	122
BLACK	10,025
BLACK HISPANIC	1,096
UNKNOWN	1,836
WHITE	255
WHITE HISPANIC	1,988

Victim's Age Group

see_if(noNA(data$VIC_AGE_GROUP))
data$VIC_AGE_GROUP <- as.factor(data$VIC_AGE_GROUP)

[1] TRUE

big_mark(data$VIC_AGE_GROUP, c("Victim's Age Group", "Shootings"))

Victim's Age Group	Shootings
<18	2,525
18-24	9,003
25-44	10,303
45-64	1,541
65+	154
UNKNOWN	59

Victim's Sex

see_if(noNA(data$VIC_SEX))
data$VIC_SEX <- as.factor(data$VIC_SEX)

[1] TRUE

big_mark(data$VIC_SEX, c("Victims' Sex", "Shootings"))

Victims' Sex	Shootings
F	2,204
M	21,370
U	11

Victim's Race

see_if(noNA(data$VIC_RACE))
data$VIC_RACE <- as.factor(data$VIC_RACE)

[1] TRUE

big_mark(data$VIC_RACE, c("Victim's Race", "Shootings"))

Victim's Race	Shootings
AMERICAN INDIAN/ALASKAN NATIVE	9
ASIAN / PACIFIC ISLANDER	327
BLACK	16,869
BLACK HISPANIC	2,245
UNKNOWN	65
WHITE	620
WHITE HISPANIC	3,450

X-Coordinate

"Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)."

see_if(noNA(data$X_COORD_CD))
n_distinct(data$X_COORD_CD)

[1] TRUE
[1] 9911

Y-Coordinate

see_if(noNA(data$Y_COORD_CD))
n_distinct(data$Y_COORD_CD)

[1] TRUE
[1] 9986

Latitude

Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).

see_if(noNA(data$Latitude))
n_distinct(data$Latitude)

[1] TRUE
[1] 10055

Longitude

Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326).

see_if(noNA(data$Longitude))
n_distinct(data$Longitude)
str(data$Longitude)

[1] TRUE
[1] 10055
 num [1:23585] -73.9 -73.8 -74 -73.9 -73.8 ...

Longitude and Latitude

Longitude and Latitude Coordinates for mapping

see_if(noNA(data$Lon_Lat))
n_distinct(data$Lon_Lat)
str(data$Lon_Lat)

[1] TRUE
[1] 10055
 chr [1:23585] "POINT (-73.87963173099996 40.86905819000003)" ...

Visualization

floor_date is a function from lubridate that comes as part of the tidyverse. It's called floor because it acts like the floor function in math (round down).

Annually

dates <- data
dates$Year <- floor_date(dates$DATE, "year")

annual_shootings <- dates %>% group_by(Year) %>% summarise(Shootings=n())
kable(annual_shootings,
      format.args=list(big.mark=","))

Year	Shootings
2006-01-01	2,055
2007-01-01	1,887
2008-01-01	1,959
2009-01-01	1,828
2010-01-01	1,912
2011-01-01	1,939
2012-01-01	1,717
2013-01-01	1,339
2014-01-01	1,464
2015-01-01	1,434
2016-01-01	1,208
2017-01-01	970
2018-01-01	958
2019-01-01	967
2020-01-01	1,948

theme_set(theme_minimal())
PATH = "../files/posts/nypd-shooting-incident-data/"
join_path = partial(paste, PATH, sep="")

plot = ggplot(data=annual_shootings, aes(x=Year, y=Shootings)) +
  geom_line() + geom_point() +
  scale_x_date(date_labels="%Y", name="Year", date_breaks="1 year") +
  labs(title="NYPD Annual Shootings Incidents Reported")
ggsave(join_path("annual_shootings.png"),
       plot=plot)

nil

By Borough

borough_shootings <- dates %>% group_by(Year, Borough) %>% summarise(Shootings=n())

file_name = "annual_shootings_boroughs.png"
plot = ggplot(data=borough_shootings, aes(x=Year, y=Shootings, group=Borough, colour=Borough)) +
  geom_line() + geom_point() +
  scale_x_date(date_labels="%Y", name="Year", date_breaks="1 year") +
  labs(title="NYPD Annual Shootings by Borough")
ggsave(join_path(file_name),
       plot=plot)

nil

Monthly

dates$year_month <- floor_date(dates$DATE, "month")
monthly_shootings <- dates %>% group_by(year_month) %>% summarise(Shootings=n())

monthly_shootings$month_number <- as.numeric(
  format(as.Date(monthly_shootings$year_month), "%m"))
monthly_shootings$Month <- months(as.Date(monthly_shootings$year_month),
                                  abbreviate=TRUE)
monthly_shootings$Year <- year(ymd(monthly_shootings$year_month))
monthly_shootings

plot = ggplot(data=monthly_shootings,
              aes(x=month_number,
                  y=Shootings,
                  group=Year, colour=Year)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=monthly_shootings$month_number,
                     labels=monthly_shootings$Month, name="Month") +
  labs(title="NYPD Shootings by Month (Year-Over-Year)")

ggsave("../files/posts/nypd-shooting-incident-data/monthly_shootings.png",
       plot=plot)

nil

three_years = monthly_shootings[which(monthly_shootings$Year %in% c(2006, 2018, 2020)),]

plot = ggplot(data=three_years,
              aes(x=month_number,
                  y=Shootings,
                  group=Year, colour=Year)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=three_years$month_number,
                     labels=three_years$Month, name="Month") +
  labs(title="NYPD Shootings by Month (2006, 2018, 2020)")

ggsave("../files/posts/nypd-shooting-incident-data/monthly_shootings_three-years.png",
       plot=plot)

nil

By Borough

borough_monthly <- dates %>% group_by(year_month, BORO) %>% summarise(Shootings=n())

borough_monthly$monthly_number <- as.numeric(
  format(as.Date(borough_monthly$year_month), "%m"))
borough_monthly$Month <- months(as.Date(borough_monthly$year_month),
                                abbreviate=TRUE)
borough_monthly$Year <- year(ymd(borough_monthly$year_month))
borough_monthly

boroughs_2020 = borough_monthly[which(borough_monthly$Year %in% c(2020)),]

plot = ggplot(data=boroughs_2020,
              aes(x=monthly_number,
                  y=Shootings,
                  group=BORO, colour=BORO)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks=boroughs_2020$monthly_number,
                     labels=boroughs_2020$Month, name="Month") +
  labs(title="NYPD Shootings by Borough (2020)")

ggsave(join_path("monthly_shootings_by_borough.png"),
       plot=plot)

nil

By Map

dates$Year <- year(ymd(dates$year_month))
last_year <- dates[dates$Year==2020,]
last_year

minimum_latitude = min(last_year$Latitude)
maximum_latitude = max(last_year$Latitude)
minimum_longitude = min(last_year$Longitude)
maximum_longitude = max(last_year$Longitude)


plot = ggplot(data=last_year,
              aes(x=Longitude,
                  y=Latitude, color=BORO)) +
  geom_point(size=1) +
  scale_x_continuous(limits=c(minimum_longitude, maximum_longitude)) +
  scale_y_continuous(limits=c(minimum_latitude, maximum_latitude)) +
  labs(title="NYPD Shootings by Latitude and Longitude (2020)")

ggsave(join_path("monthly_shootings_latitude_longitude.png"),
       plot=plot)

nil

R Babel Test

The Cloistered Monkey

2021-12-09 17:08

x <- rnorm(100)
summary(x)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.93398 -0.79415  0.01453 -0.01418  0.67102  2.20682

Does it keep the variables alive between blocks?

summary(x)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.93398 -0.79415  0.01453 -0.01418  0.67102  2.20682

Looks like it.

Insertion Sort

The Cloistered Monkey

2021-11-30 17:15

\(\def\sc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}}\)

The Insertion Sort

The insertion sort is an instance of a Decrease And Conquer algorithm and is another case where you maintain a sorted sub-list at the beginning of the list of elements to sort (similar to {selection sort) but instead of scanning the entire unsorted section for the next smallest item to pick (as in "selection sort") you pick the next item in the unsorted and scan down through the sorted section until you find the right place to insert it. This has the slight advantage over selection sort in that you might not have to scan the entire sorted section to find where to insert the next item, while you always have to scan the entire unsorted section to find the next item to select in the selection sort. On the other hand, instead of swapping just two items at the end of each loop, as in the selection sort, with the insertion sort you have to shift every item above where the next item is going to be inserted in order to make a space for it.

The Algorithm Parts

This is taken pretty straight from How To Think About Algorithms.

Specification: This is a Sorting Problem.
Basic Steps: Repeatedly insert an item where it belongs.
Measure of Progress: The number of elements we've inserted so far.
The Loop Invariant: The inserted elements are in sorted order.
Main Steps: Take an unselected item and insert it into the correct place in the sorted list.
Make Progress: We always insert an item.
Maintain the Loop Invariant:
- \( \langle \textit{Loop Invariant} \rangle\): Insertion List is sorted.
- \(\lnot \langle \textit{Exit-Condition} \rangle \): There's more to sort.
- \(\sc{CodeLoop}\): The item that was chosen was inserted in the place that maintains the loop invariant.
Establish the Loop Invariant: Pick the first item from the unsorted list giving us a sorted list with one item.
Exit Condition: We've inserted all the items.
Ending:
- \(\langle \textit{Exit-Condition} \rangle\): All elements were inserted.
- \(\langle \textit{Loop Invariant} \rangle\): All inserted elements are in the right order.
- \(\rightarrow \langle \textit{Post-Condition} \rangle\): All the elements are in sorted order.

The Pseudocode

\begin{algorithm}
\caption{InsertionSort}
\begin{algorithmic}
\INPUT An array of orderable items
\OUTPUT The array sorted in ascending order
\PROCEDURE{InsertionSort}{$A$}
  \FOR{$i \gets 1$ to $n - 1$}
    \STATE $v \gets A[i]$
    \STATE $j \gets i - 1$
    \WHILE{$j \geq 0$ and $A[j] > v$}
       \STATE $A[j + 1] \gets A[j]$
       \STATE $j \gets j - 1$
    \ENDWHILE
    \STATE $A[j + 1] \gets v$
  \ENDFOR
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

The outer loop tracks the boundary between the sorted section of the list (elements whose index is less than i) and the unsorted section (i to the end of the list). Every time through the outer loop we pick the next item to insert (and assign it to v) then in the while loop we traverse the list backwards, moving the items up one in the list until we find a place where the value to insert isn't less than the next item in the array (or we haven't reached the start of the list). Once we end up at the place to insert the value we exit the while loop and insert the value where we stopped. Then we move our outer loop up one and start again until everything's sorted.

Runtime

Given the two loops you can probably guess that it's going to be \(\Theta(n^2)\), but I'll go through it so I don't forget later on.

\begin{align} C_{worst} &= \sum_{i=1}^{n - 1} \sum_{j=0}^{i-1} 1\\ &= \sum_{i=1}^{n - 1} ((i - 1) - 0 + 1) \\ &= \sum_{i=1}^{n - 1} i \end{align}

This is pretty similar to what I had in the Bubble Sort: Runtime Explained post so I won't go over the reasons for the steps, but at this point we can use the summing backwards and forwards of the series trick to see what our worst-case runtime is.

\begin{array}{ccccccccc} & 1 & + & 2 & + & \cdots & + & (n - 2) & + & (n - 1) \\ + & (n - 1) & + & (n - 2) & + & \cdots & + & 2 & + & 1 \\ \hline & n & + & n & + & \cdots & + & n & + & n\\ \end{array}

So we have n - 1 terms of n in our doubled runtime which we can then halve to get:

\[ C_{worst} = \frac{n (n - 1)}{2} \in \Theta(n^2) \]

Implement It

I'll try and copy the way the selection and bubble sorts looked to make it easier to compare them later.

Imports

# python
from collections import namedtuple
from collections.abc import MutableSequence

Some Types

Note: For my future self: using the dataclass is the python way to make a data object, but numba doesn't like it, so we can't use it.

@dataclass
class InsertionOutput:
    element_count: int
    comparisons: int
    swaps: int
    elements: MutableSequence

I'll use the regular namedtuple instead.

InsertionOutput = namedtuple("InsertionOutput", ["element_count",
                                                 "comparisons",
                                                 "swaps",
                                                 "elements"])

The Counter

def insertion_sort(elements: MutableSequence) -> InsertionOutput:
    """Sorts elements using iterative insertion-sort

    Args:
     elements: sortable collection of elements

    Returns:
     count of elements, comparisons made, swaps made, sorted elements
    """
    comparisons = swaps = 0
    for next_unsorted_cell in range(1, len(elements)):
        thing_to_insert = elements[next_unsorted_cell]

        in_front_of_me, to_the_right = (next_unsorted_cell - 1,
                                        next_unsorted_cell)

        while not (in_front_of_me < 0 or
                   elements[in_front_of_me] <= thing_to_insert):
            comparisons += 1
            swaps += 1
            elements[to_the_right] = elements[in_front_of_me]
            in_front_of_me, to_the_right = (in_front_of_me - 1,
                                            in_front_of_me)

        elements[to_the_right] = thing_to_insert
        swaps += 1

    return InsertionOutput(len(elements), comparisons, swaps, elements)

I negated the while-condition and re-stated the body to make more sense to me. Hopefully it's still clear what's going on.

Some Simple Testing

Importing

# python
from functools import partial
import random

# pypi
from expects import contain_exactly, equal, expect
from joblib import Parallel, delayed
from numba import njit
from numpy.random import default_rng

import altair
import numpy
import pandas

# this project
from bowling.sort.insertion import insertion_sort

# my stuff
from graeae import Timer
from graeae.visualization.altair_helpers import output_path, save_chart

Set Up

numba_random = default_rng(2022)
TIMER = Timer()

SLUG = "insertion-sort"
OUTPUT_PATH = output_path(SLUG)
save_it = partial(save_chart, output_path=OUTPUT_PATH)

Worst Case

n = 100
inputs = list(reversed(range(n)))
expected = list(sorted(inputs.copy()))

output = insertion_sort(inputs)

expect(output.elements).to(contain_exactly(*expected))

expect(output.comparisons).to(equal((n * (n - 1))/2))

Best Case

inputs = expected.copy()
output = insertion_sort(inputs)
expect(output.elements).to(contain_exactly(*expected))

expect(output.comparisons).to(equal(0))
expect(output.swaps).to(equal(n - 1))

Maybe comparisons is the wrong term since it's really counting the number of times we get past the sentinel in the while statement, but I don't think there's a good way to count how many times the sentinel gets checked, so the swaps has to act as a proxy for this best-case scenario where we never drop into the while loop.

Random

inputs = random.choices(inputs, k=n)
expected = list(sorted(inputs.copy()))

output = insertion_sort(inputs)

expect(output.elements).to(contain_exactly(*expected))

print((n * (n - 1))/2)
print(output.comparisons)
print(output.swaps)

4950.0
2585
2684

Comparisons and Swaps

numba_sort = njit(insertion_sort)
things_to_sort = [numba_random.integers(low=0, high=count, size=count)
                  for count in range(1, 10**5 + 1, 1000)]

with TIMER:
    sort_output = Parallel(n_jobs=-1)(
        delayed(numba_sort)(thing_to_sort)
        for thing_to_sort in things_to_sort)

Started: 2022-01-16 22:55:12.336777
Ended: 2022-01-16 22:55:23.534382
Elapsed: 0:00:11.197605

SIZE, COMPARISONS, SWAPS = 0, 1, 2
unzipped = list(zip(*sort_output))
count_frame = pandas.DataFrame({"Elements": unzipped[SIZE],
                                "Insertion": unzipped[SWAPS]})

count_frame["n^2"] = count_frame.Elements**2

melted = count_frame.melt(id_vars=["Elements"], var_name="Source", value_name="Runtime")

tooltip = [altair.Tooltip("Elements", format=","),
           altair.Tooltip("Runtime", format=","), 
           altair.Tooltip("Source:N")]

chart = altair.Chart(melted).mark_point().encode(
    x = "Elements",
    y = "Runtime",
    color="Source:N",
    tooltip=tooltip
).properties(
    width=800, height=525, title="Insertion Sort"
)

save_it(chart, "insertion-sort-comparisons")

With our random input the Insertion Sort does better than the theoretical worst case.

Best and Worst Cases

Let's get rid of the \(n^2\) and add the best-case and worst-case data.

frame = count_frame.rename(columns={"Insertion": "Random"})
del frame["n^2"]

best_things = [numpy.arange(count, dtype=int) for count in range(1, 10**5+ 1, 1000)]
worst_things = [numpy.flip(things) for things in best_things]

with TIMER:
    worst_output = Parallel(n_jobs=-1)(
        delayed(numba_sort)(thing) for thing in worst_things
)

Started: 2022-01-16 22:56:34.282090
Ended: 2022-01-16 22:56:56.373266
Elapsed: 0:00:22.091176

with TIMER:
    best_output = Parallel(n_jobs=-1)(
        delayed(numba_sort)(thing) for thing in best_things
)

Started: 2022-01-16 22:57:04.941850
Ended: 2022-01-16 22:57:05.067092
Elapsed: 0:00:00.125242

best_unzipped = list(zip(*best_output))
worst_unzipped = list(zip(*worst_output))

frame["Best"] = best_unzipped[SWAPS]
frame["Worst"] = worst_unzipped[SWAPS]

melted = frame.melt(id_vars=["Elements"],
                    var_name="Input",
                    value_name="Comparisons")

chart = altair.Chart(melted).mark_point().encode(
    x="Elements",
    y="Comparisons",
    color="Input",
    tooltip=[altair.Tooltip("Elements", format=","),
             altair.Tooltip("Comparisons", format=","),
             "Input"],
).properties(
    title="Insertion Sort Best, Worst, Random",
    width=800,
    height=525
).interactive()

save_it(chart, "best-worst-random")

Here we can see why Insertion Sort is considered better than Bubble and Selection Sort. In those cases they have to traverse the entire unsorted collection every time but Insertion Sort only picks whatever is the next unsorted item and searches the sorted until it finds the right place to insert it. In the worst case they're all the same, but whenever it isn't the worst case insertion sort offers a little improvement.

The Swaps

Let's look at what plotting the location of the elements as they are swapped looks like.

def insertion_swaps(elements) -> dict:
    """Keeps track of the element indexes as they are swapped

    Args:
     elements: list of orderable elements

    Returns:
     dict mapping element to list of indices where it was in the elements list
    """
    swaps = {element: [index] for index, element in enumerate(elements)}

    number_of_elements = len(elements)

    for next_unsorted_cell in range(1, len(elements)):
        thing_to_insert = elements[next_unsorted_cell]

        in_front_of_me, to_the_right = (next_unsorted_cell - 1,
                                        next_unsorted_cell)

        while not (in_front_of_me < 0 or
               elements[in_front_of_me] <= thing_to_insert):
            elements[to_the_right] = elements[in_front_of_me]
            in_front_of_me, to_the_right = (in_front_of_me - 1,
                                            in_front_of_me)

        elements[to_the_right] = thing_to_insert

        for index, element in enumerate(elements):
            swaps[element].append(index)

    return swaps

A little sanity check.

inputs = random.choices(inputs, k=n)
expected = list(sorted(inputs.copy()))

output = insertion_swaps(inputs)

expect(inputs).to(contain_exactly(*expected))

Random Case

COUNT = 50

inputs = list(range(COUNT))
random.shuffle(inputs)
swaps = insertion_swaps(inputs)

track_frame = pandas.DataFrame(swaps)
re_indexed = track_frame.reset_index().rename(columns={"index": "Swap"})
melted = re_indexed.melt(var_name="Value To Sort",
                         value_name="Location In Array", id_vars="Swap")

chart = altair.Chart(melted).mark_line().encode(
    x="Swap",
    y="Location In Array",
    color="Value To Sort:O",
    tooltip=["Swap", "Location In Array", "Value To Sort"]
).properties(
    title="Insertion Sort Insertions",
    width=800,
    height=525,
).interactive()

save_it(chart, "insertion-sort-swaps")

To interpret the chart, the y-axis values are the indices of the input list and as we move left to right along the x-axis we are traversing the outer loop. Everytime a line goes from horizontal to a downward slope that means that the element at the y location was plucked from the unsorted section and inserted into the previously sorted section. The longer the downward line, the further back it had to go in the sorted section before being inserted (and so the more items had to be moved aside for it to find its place).

Worst Case Swaps

Now we'll take a look at what the swaps look like when the collection to be sorted is in exactly the reversed order.

COUNT = 50

inputs = list(reversed(range(COUNT)))
swaps = insertion_swaps(inputs)

track_frame = pandas.DataFrame(swaps)
re_indexed = track_frame.reset_index().rename(columns={"index": "Swap"})
melted = re_indexed.melt(var_name="Value To Sort",
                         value_name="Location In Array", id_vars="Swap")

chart = altair.Chart(melted).mark_line().encode(
    x="Swap",
    y="Location In Array",
    color="Value To Sort:O",
    tooltip=["Swap", "Location In Array", "Value To Sort"]
).properties(
    title="Insertion Sort Insertions (Worst-Case)",
    width=800,
    height=525,
).interactive()

save_it(chart, "worst-case-insertion-sort-swaps")

So, here we don't have the mostly short lines of the previous chart because every element had to be inserted at the beginning of the previously sorted section of the list.

Compare The Three Amigos

We've had three brute-force sorters so far, let's see if there's a noticeable difference in their comparisons.

from bowling.sort.bubble.bubble import bubba
from bowling.sort.selection import selection_counter

numba_bubble = njit(bubba)
with TIMER:
    bubble_output = Parallel(n_jobs=-1)(
        delayed(numba_bubble)(thing_to_sort)
        for thing_to_sort in things_to_sort)

Started: 2022-01-11 23:38:40.676138
Ended: 2022-01-11 23:40:07.211075
Elapsed: 0:01:26.534937

numba_selection = njit(selection_counter)
with TIMER:
    selection_output = Parallel(n_jobs=-1)(
        delayed(numba_selection)(thing_to_sort)
        for thing_to_sort in things_to_sort)

Started: 2022-01-11 23:00:03.404894
Ended: 2022-01-11 23:00:43.390039
Elapsed: 0:00:39.985145

unzipped = list(zip(*bubble_output))

count_frame["Bubble"] = unzipped[COMPARISONS]

unzipped = list(zip(*selection_output))

count_frame["Selection"] = unzipped[COMPARISONS]

count_frame = count_frame.rename(columns={"Insertion Comparisons": "Insertion"})

frame = count_frame.melt(id_vars=["Elements"], var_name="Sorter", value_name="Comparisons")

tooltip = [altair.Tooltip("Elements", format=","),
           altair.Tooltip("Comparisons", format=","), 
           altair.Tooltip("Sorter")]

chart = altair.Chart(frame).mark_point().encode(
    x="Elements",
    y="Comparisons",
    color="Sorter",
    tooltip=tooltip,
).properties(
    width=800, height=550, title="The Three Amigos"
).interactive()

save_it(chart, "three-amigos-comparisons")

You might be looking at the chart and wondering - "Where's the Bubble Sort?". Well, since Selection Sort always checks all the unsorted for the next item, it has to do just as many comparisons as Bubble Sort does so they are (almost) overlapping - since I used the short-circuiting Bubble Sort if you zoom way, way in you'll see that the Bubble Sort's points are actually usually slighly lower than the Selection Sort's line.

I guess the main takeaway is that because the Insertion Sort has an out that lets it short-circuit each inner while-loop it generally will perform better than the other two sorts, although it's still a quadratic sort.

Sources

Bubble Sort: An Archeological Algorithmic Analysis

The Cloistered Monkey

2021-11-25 16:38

Source

Citation

Astrachan O. Bubble sort: an archaeological algorithmic analysis. SIGCSE Bull. 2003 Jan 11;35(1):1–5.

Iterative Algorithms

The Cloistered Monkey

2021-11-23 16:16

Iterative algorithms are characterized by loops - the word iteration has at at its roots the notion of repetition.

The Metaphor

HTTAA describes the iterative algorithms as being like a long journey where you can't see the entire route and could possibly take a misstep and fall into a ditch, so you need a way to know that you're making progress as well as a way to know you haven't veered off the road. Since we're dealing with repetitions I kind of think a racetrack might make more sense, or at least something that loops, but anyway…

There are some basic questions we have to be able to answer to make a successful trip:

How do I get onto the road?
How do I take a step without falling off the road?
How do I know when I've gone far enough?
Once I've gone far enough, how do I get off the road and to my ultimate destination?

Too answer these questions we'll be using some basic parts:

Our starting point is our Preconditions and to get onto the road we need some pre-loop code to set up the Loop-Invariant.
We take a step with our loop-code and keep from falling off the road by maintaining the Loop-Invariant.
We know we've gone far enough when our Exit-Condition is met.
Once we've gone far enough we use some post-loop code to clean up and end up at our Postconditions.

Translating that to mathishness shorthand:

\( \langle \textit{Pre-Conditions} \rangle \land \text{Code}_\textit{Pre-Loop} \Rightarrow \langle \textit{Loop-Invariant} \rangle \)
\( \langle \textit{Loop-Invariant} \rangle \land \lnot \langle \textit{Exit-Condition} \rangle \land \text{Code}_\textit{Loop} \Rightarrow \langle \textit{Loop-Invariant'} \rangle \)
\( \langle \textit{Loop-Invariant} \rangle \land \langle \textit{Exit-Condition} \rangle \land \text{Code}_\textit{Post-Loop} \Rightarrow \langle \textit{Post-Conditions} \rangle \)

Source

How To Think About Algorithms

More of the Output

The Cloistered Monkey

2021-11-23 16:10

The More of the Output algorithms are Iterative Algorithms that are characterized by the fact that after each pass through the loop the output has been added to - so the Measure of Progress is based on the size of the output.

Source

How To Think About Algorithms

The Sorting Problem

The Cloistered Monkey

2021-11-23 15:47

English(ish)

The sorting problem involves taking a list of things that can be compared to each other to decide their order and then sorting them into ascending order (actually non-decreasing, since you can have multiple items with the same value, but, this was supposed to be the plain-English version).

Specification

Preconditions: The input is a list of n orderable items, possibly with repetitions.
Postconditions: The output is a list of the same n items in non-decreasing order.

Table of Contents

In the Beginning

Imports

The Data

Filter

The Basic Filter

Or, Not And

And and Or

Arrange

Ascending Order

Descending Order

Select

Grab Some Columns

A Range of Columns

Helper Functions

Renaming

Some More Useful Renaming

Mutate

Transmute

Summarize

Grouping

Hitting the Pipe

Link Collection

Table Of Contents

Links

Citation

Table of Contents

Introduction

Set Up Stuff

Imports

Constants

The GGPlot2 Theme

The MPG Dataset

The Variables

Renaming Some Columns

Looking at the Data

Some Renaming

The Drive Trains

Fuel Types

Transmission

Displacement Vs Highway Mileage

Ditching the Highway (Mileage)

Combined Mileage

Comparing the Two

Gallons Per Mile

Displacement With Class

Efficiency

Faceting The Classes

Smoothing the Line

Different Datasets to the Geometries

Small Four-Wheel Drive Automobiles

Bar Plots

By Count

By Proportion

Un-Stacked

Proportions Not Counts

Stat Summary

Boxplot

The End

Link Collection

Table of Contents

NYPD Shooting Incident Data

Loading the Data

Tidying and Transforming the Data

Incident Key

Occur Date

Occur Time

Boro

Precinct

Jurisdiction Code

Location Description

Statistical Murder Flag

Perpetrator's Age Group

Peprpetrator's Sex

Perpetrator's Race

Victim's Age Group

Victim's Sex

Victim's Race

X-Coordinate

Y-Coordinate