NYC Shootings (Python Version)
NYPD Shooting Incident Data Version 2 (with python)
This is a replication (more or less) of the NYPD Shooting Incident Data post, which we did using R, but this time we'll use python and pandas instead. Once again we'll be looking at the NYPD Shooting Incident Data (Historic) from the DATA.gov catalog of datasets, which lists every shooting from 2006 through the most recent update (as of March 22, 2023 the catalog says it was last updated June 9, 2022, but we'll have to look at the data itself to see the date of the last incident).
Imports and Setup
# python
from datetime import datetime
from functools import partial
import sys
# pypi
from expects import equal, expect
from tabulate import tabulate
import altair
import geopandas
import pandas
from graeae.visualization.altair_helpers import output_path, save_chart
First, let's double-check which version of python we're using.
print(sys.version)
3.9.9 (main, Dec 3 2021, 01:15:49) [GCC 10.2.1 20210110]
We're running python 3.9.9. I originally tried running this in PyPy, but for some reason pandas' groupby method would hang, so this is done in CPython.
print(altair.__version__)
5.0.0rc1
I found out, after struggling with altair for a little while, that they were about to make a major release that breaks some of the ways the prior versions are used. Since it's been a while since I've used altair, I decided to go with the newer, upcoming version rather than re-learn the current one and then learn the new version once it's released.
This next bit is a little setup to make nicer tables.
TABLE = partial(tabulate, headers="keys", showindex=False, tablefmt="orgtbl")
This is to help with the plotting.
SLUG = "nyc-shootings-python-version"
OUTPUT_PATH = output_path(SLUG)
save_it = partial(save_chart, output_path=OUTPUT_PATH, height=650)
The Data
Loading It
The data is available to download as a CSV so we'll pass the URL for the CSV to pandas and make a dataframe.
URL = "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
data = pandas.read_csv(URL)
rows, columns = data.shape
print(f"Rows: {rows:,}\tColumns: {columns}")
Rows: 25,596 Columns: 19
The Columns
We'll start looking at the data by examining what's in the columns. We saw from the data frame's shape that we have 19 columns; how many of them does pandas think are numeric?
print(TABLE(data.describe(), showindex=True))
Statistic | INCIDENT_KEY | PRECINCT | JURISDICTION_CODE | X_COORD_CD | Y_COORD_CD | Latitude | Longitude |
---|---|---|---|---|---|---|---|
count | 25596 | 25596 | 25594 | 25596 | 25596 | 25596 | 25596 |
mean | 1.12383e+08 | 65.8694 | 0.331601 | 1.00945e+06 | 207894 | 40.7372 | -73.909 |
std | 6.78612e+07 | 27.2019 | 0.742266 | 18421.4 | 31857.4 | 0.087447 | 0.0664265 |
min | 9.95324e+06 | 1 | 0 | 914928 | 125757 | 40.5116 | -74.2493 |
25% | 6.15936e+07 | 44 | 0 | 1.00001e+06 | 182782 | 40.6683 | -73.9431 |
50% | 8.64373e+07 | 69 | 0 | 1.00772e+06 | 194038 | 40.6991 | -73.9153 |
75% | 1.66661e+08 | 81 | 0 | 1.01684e+06 | 239429 | 40.8238 | -73.8824 |
max | 2.3849e+08 | 123 | 2 | 1.06682e+06 | 271128 | 40.9108 | -73.702 |
Just seven of them: four are coordinates and the other three (the incident key, precinct, and jurisdiction code) are really categorical. This isn't really surprising, since the data isn't measuring anything but is rather a listing of reported shooting incidents (each row represents a single incident).
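As a quick check (using nothing beyond the data frame we already loaded), we can also ask pandas directly what dtype it assigned to each column:
print(data.dtypes.value_counts())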
There's a table on the NYPD-Shooting-Incident-Data page that describes the columns.
Column Name | Description | Type |
---|---|---|
INCIDENT_KEY | Randomly generated persistent ID for each arrest | Plain Text |
OCCUR_DATE | Exact date of the shooting incident | Date & Time |
OCCUR_TIME | Exact time of the shooting incident | Plain Text |
BORO | Borough where the shooting incident occurred | Plain Text |
PRECINCT | Precinct where the shooting incident occurred | Plain Text |
JURISDICTION_CODE | Jurisdiction where the shooting incident occurred | Number |
LOCATION_DESC | Location of the incident | Plain Text |
STATISTICAL_MURDER_FLAG | Victim died | Checkbox |
PERP_AGE_GROUP | Perpetrator's age within a category | Plain Text |
PERP_SEX | Perpetrator's sex. | Plain Text |
PERP_RACE | Perpetrator's race. | Plain Text |
VIC_AGE_GROUP | Victim's age within a category. | Plain Text |
VIC_SEX | Victim's sex. | Plain Text |
VIC_RACE | Victim's Race | Plain Text |
X_COORD_CD | Midblock X-coordinate for New York State Plane Coordinate System | Plain Text |
Y_COORD_CD | Midblock Y-coordinate | Plain Text |
Latitude | Latitude coordinate | Number |
Longitude | Longitude | Number |
Lon_Lat | Longitude and Latitude Coordinate for mapping | Point |
Missing Data
missing = len(data) - data.count()
missing = missing.reset_index().rename(
columns={"index": "Column",
0: "Missing"})
some_missing = missing[missing.Missing > 0].copy()
some_missing.loc[:, "Fraction"] = some_missing.Missing/len(data)
print(TABLE(some_missing))
Column | Missing | Fraction |
---|---|---|
JURISDICTION_CODE | 2 | 7.81372e-05 |
LOCATION_DESC | 14977 | 0.58513 |
PERP_AGE_GROUP | 9344 | 0.365057 |
PERP_SEX | 9310 | 0.363729 |
PERP_RACE | 9310 | 0.363729 |
Most entries are missing location descriptions for some reason, and quite a lot of perpetrator data is missing, possibly because they didn't catch whoever did the shooting in those cases.
Incident Key
The incident key is an identifier for a specific incident, so it's only really useful if you need to look up or refer to particular incidents. Since we'll be looking at things in aggregate it's less useful for us, except maybe for looking at anomalies. Let's just make sure that the identifiers are unique, as I'm asserting they are.
id_count = len(data.INCIDENT_KEY.unique())
incidents = len(data)
print(f"Identifiers: {id_count:,}\tIncidents: {incidents:,}")
print(f"There are {incidents - id_count:,} more rows than incident IDs.")
Identifiers: 20,126    Incidents: 25,596
There are 5,470 more rows than incident IDs.
It appears that I wasn't correct in my assumption… let's take a look at one of the incidents.
counts = data.INCIDENT_KEY.value_counts()
up_counts = counts[counts > 1]
top = counts.head(1)
top_id = top.index[0]
print(f"Incident: {top_id}\tCount: {top.iloc[0]}")
Incident: 173354054 Count: 18
Inspecting the dataframe, it looks like in some cases more than one person was shot per incident, so there are multiple rows (one per person shot) for a single incident. Kind of scary that eighteen people got shot in one incident, if my interpretation is correct, but that's life in the big city, I guess. Reading the Footnotes (link is to a PDF) it says:
A shooting incident can have multiple victims involved and as a result duplicate INCIDENT_KEYs are produced.
So it appears each row represents the victim of a shooting and each INCIDENT_KEY represents a shooting in which one or more people were shot. The footnotes also note that only incidents where a victim was shot are included: if someone fired a gun but didn't hit anyone, it isn't represented in the data set.
Note: There's actually a slight discrepancy between the descriptions of the INCIDENT_KEY on the web-page and in the PDF footnotes. According to the web-page the ID is for each arrest, while the footnotes make it sound like it represents cases where there was at least one victim, whether or not anyone was arrested. For our purposes this won't matter, since we're only using the data as a source for data visualization, but if one were really trying to understand what was happening in NYC, knowing exactly what the data represents might be important (assuming not every case with a shooting victim leads to an arrest).
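As a sanity check on that reading, here's a small sketch (it only uses the INCIDENT_KEY column we've already been working with) that collapses the rows down to one per incident; the three numbers it prints should match the 20,126 identifiers, the 25,596 rows, and the eighteen-victim incident we saw above.
victims_per_incident = data.groupby("INCIDENT_KEY").size()
print(f"Incidents: {victims_per_incident.size:,}")
print(f"Victims (rows): {victims_per_incident.sum():,}")
print(f"Most victims in one incident: {victims_per_incident.max()}")
Now let's plot the incidents that have more than one row.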
use_counts = up_counts.reset_index()
chart = altair.Chart(use_counts).mark_bar().encode(
x = altair.X("INCIDENT_KEY",
type="nominal",
sort="-y",
axis=altair.Axis(labels=False)),
y="count",
tooltip=[altair.Tooltip("INCIDENT_KEY", type="nominal"),
altair.Tooltip("count")]
).interactive().properties(
title="Incidents with Multiple Shots",
width=800,
height=525
)
save_it(chart, "multiple_shot_incidents")
It looks like a lot of incidents have more than one row. Does this mean many incidents have more than one victim? More than one shooter?
fractions = 100 * counts.value_counts()/len(data)
fractions = fractions.reset_index(name="Percent of Rows").rename(columns={
"count": "Rows"})
chart = altair.Chart(fractions).mark_bar().encode(
x=altair.X("Rows", sort=fractions["Percent of Rows"].values),
y=altair.Y("Percent of Rows", scale=altair.Scale(domain=(-1, 70))),
tooltip=[altair.Tooltip("Rows"),
altair.Tooltip("Percent of Rows")]).properties(
title="Percent Of Incidents with Multiple Rows",
width=800,
height=525)
save_it(chart, "fraction_row_incidents")
The majority of the incidents do have only one row in the dataset. Perhaps it's not as unusual as I think it is to have multiple people involved in a shooting.
OCCUR_DATE and OCCUR_TIME
There are two columns that tell us when the shooting is supposed to have happened.
example = data[data.INCIDENT_KEY==top_id].iloc[0]
print(f"OCCUR_DATE: {example.OCCUR_DATE} ({data.OCCUR_DATE.dtype})")
print(f"OCCUR_TIME: {example.OCCUR_TIME} ({data.OCCUR_TIME.dtype})")
OCCUR_DATE: 01/06/2018 (object)
OCCUR_TIME: 21:05:00 (object)
Pandas interpreted both of these as strings, but it'd probably be more useful for us if they were datetime objects.
MONTH, DAY, YEAR = "%m", "%d", "%Y"
HOUR, MINUTE, SECOND = "%H", "%M", "%S"
DATE_FORMAT = "/".join((MONTH, DAY, YEAR))
TIME_FORMAT = ":".join((HOUR, MINUTE, SECOND))
FORMAT = f"{DATE_FORMAT} {TIME_FORMAT}"
DATE_COLUMN = "date_time"
data[DATE_COLUMN] = pandas.to_datetime(data.OCCUR_DATE + " " + data.OCCUR_TIME, format=FORMAT)
check_date = data[data.INCIDENT_KEY==top_id].iloc[0]
print(f"OCCUR_DATE: {check_date.OCCUR_DATE}")
print(f"New Date: {check_date.date_time.date()}")
print(f"OCCUR_TIME: {check_date.OCCUR_TIME}")
print(f"New Time: {check_date.date_time.time()}")
OCCUR_DATE: 01/06/2018
New Date: 2018-01-06
OCCUR_TIME: 21:05:00
New Time: 21:05:00
print(data.OCCUR_DATE.min())
print(data.OCCUR_DATE.max())
01/01/2006
12/31/2021
Our dataset covers the years from 2006 through 2021. Let's see how many shootings there are from month to month.
Shootings By Month
indexed = data.set_index(DATE_COLUMN)
monthly = indexed.groupby(pandas.Grouper(freq="M"))
monthly_counts = monthly.count()["INCIDENT_KEY"].reset_index().rename(
columns={"INCIDENT_KEY": "Shootings",
"date_time": "Month"}
)
expect(monthly_counts["Shootings"].sum()).to(equal(len(data)))
MONTH_YEAR = "%B %Y"
chart = altair.Chart(monthly_counts).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X("Month", type="temporal"),
y=altair.Y("Shootings"),
tooltip=[altair.Tooltip("Month", format=MONTH_YEAR),
altair.Tooltip("Shootings")]
).properties(
width=800,
height=525,
title="NYC Shootings By Month"
)
save_it(chart, "monthly_incidents")
It looks like shootings went down in 2013 then shot back up again in the Summer of 2020.
90 Day Exponentially Weighted Mean
monthly_counts["Rolling Mean"] = monthly_counts["Shootings"].ewm(
halflife="90 days", times=monthly_counts.Month).mean()
pre_melt = monthly_counts.rename(columns={"Shootings": "Sum"})
melted = pre_melt.melt("Month", var_name="Aggregation", value_name="Aggregated Value")
chart = altair.Chart(melted).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X("Month"),
y=altair.Y("Aggregated Value"),
color="Aggregation",
tooltip=[altair.Tooltip("Month", format=MONTH_YEAR),
altair.Tooltip("Aggregated Value")]
).properties(
width=800,
height=525,
title="NYC Shootings By 90 Day Exponential Weighted Mean"
)
save_it(chart, "monthly_rolling_incidents")
Smoothing with an exponentially weighted mean (ninety-day half-life) gives a little better sense of the overall downward trend until 2020 reversed it.
By Year
yearly = indexed.groupby(pandas.Grouper(freq="Y"))
yearly_counts = yearly.count()["INCIDENT_KEY"].reset_index().rename(
columns={"INCIDENT_KEY": "Shootings"}
)
yearly_counts["Year"] = yearly_counts.date_time.dt.year
expect(yearly_counts["Shootings"].sum()).to(equal(len(data)))
chart = altair.Chart(yearly_counts).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X("Year", type="ordinal"),
y=altair.Y("Shootings"),
tooltip=[altair.Tooltip("Year"),
altair.Tooltip("Shootings", format=",")]
).properties(
width=800,
height=525,
title="NYC Shootings By year"
)
save_it(chart, "yearly_incidents")
Although 2020 had that crazy summer, 2021 still exceeded it overall.
Monthly By Year
monthly_counts = monthly.count()["INCIDENT_KEY"].reset_index().rename(
columns={"INCIDENT_KEY": "Shootings"}
)
# monthly_counts = monthly_counts.rename(columns={"Month": "date-time"})
month_map = dict(zip(range(1, 13), "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()))
monthly_counts["Year"] = monthly_counts["date_time"].dt.year
monthly_counts["Month"] = monthly_counts["date_time"].dt.month.apply(lambda month: month_map[month])
selection = altair.selection_point(fields=["Year"],
bind="legend")
chart = altair.Chart(monthly_counts).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X("Month",
sort=monthly_counts["date_time"].values),
y=altair.Y("Shootings"),
color=altair.Color("Year", type="nominal"),
tooltip=[altair.Tooltip("date_time", title="Month", format=MONTH_YEAR),
altair.Tooltip("Shootings")],
opacity=altair.condition(selection,
altair.value(0.8),
altair.value(0.2))
).properties(
width=800,
height=525,
title="NYC Monthly Shootings By year"
).add_params(selection)
save_it(chart, "month_year_shootings")
Click on a year in the legend to isolate its line in the plot, and click on an empty space in the legend to highlight all the years again.
It looks like there was an unusually large surge in the Summer of 2020 after years of declining shootings, but regardless of the year there's at least a slight increase in shootings during the Summer months.
Median By Month
by_month_median = monthly_counts.groupby("Month").median().reset_index()
chart = altair.Chart(by_month_median).mark_bar().encode(
x=altair.X("Month", sort=list(month_map.values())),
y="Shootings",
tooltip=[altair.Tooltip("Month"),
altair.Tooltip("Shootings"),]
).properties(
width=800,
height=525,
title="NYC Median Shootings By Month (2006 through 2012)"
)
save_it(chart, "monthly_shootings")
It looks like there's a definite seasonal effect on shootings. Must be all those farmers' kids off from school to work the fields and shoot each other.
Location
Descriptions
BORO
The BORO column identifies which of the five boroughs of New York City the victim was shot in.
- Manhattan
- Brooklyn
- Queens
- Bronx
- Staten Island
boroughs = data.BORO.value_counts().reset_index().rename(
columns=dict(BORO="Borough", count="Shootings"))
chart = altair.Chart(boroughs).mark_bar().encode(
x="Borough",
y="Shootings",
tooltip=[altair.Tooltip("Borough"),
altair.Tooltip("Shootings", format=",")]
).properties(
width=800,
height=520,
title="Shootings by Borough"
)
save_it(chart, "shootings-by-borough")
# group by month and borough so there's one count per borough per month
borough_monthly = indexed.groupby(
    [pandas.Grouper(freq="M"), "BORO"]).count()["INCIDENT_KEY"].reset_index().rename(
        columns=dict(BORO="Borough", INCIDENT_KEY="Shootings", date_time="Month"))
selection = altair.selection_point(fields=["Borough"], bind="legend")
chart = altair.Chart(borough_monthly).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X("Month"),
y=altair.Y("Shootings"),
color="Borough",
tooltip=[
altair.Tooltip("Borough"),
altair.Tooltip("Month", format=MONTH_YEAR),
altair.Tooltip("Shootings")
],
opacity=altair.condition(selection,
altair.value(0.8),
altair.value(0.2))
).properties(
width=900,
height=525,
title="Monthly Borough Shootings"
).add_params(selection)
save_it(chart, "borough-monthly-shootings")
The Bronx and Brooklyn look like they have the most shootings, although I didn't weight it by population so I don't know how they compare as a fraction of the population.
borough_bar = data[["BORO", "date_time"]].rename(
columns=dict(BORO="Borough"))
borough_bar["Month"] = pandas.to_datetime({
"year": borough_bar.date_time.dt.year,
"month": borough_bar.date_time.dt.month,
"day": 1
})
chart = altair.Chart(borough_bar).mark_bar().encode(
x="Month",
y="count()",
color="Borough").properties(
width=800,
height=525,
title="Monthly Shootings With Borough Breakdowns")
save_it(chart, "borough-monthly-bar-chart")
By clicking on the legend of the line chart you can isolate plot-lines by borough, which makes it clearer that the surge in 2020 happened primarily in the Bronx and Manhattan. The line chart is a little deceptive, though: Manhattan had some months with no shootings, and since I didn't infill those months with zeros the lines don't show them dropping back down.
PRECINCT & JURISDICTION_CODE
There are seventy-seven police precincts in New York City, subdividing the boroughs. This might be useful for isolating where the shootings are occurring even further, but I'll skip it. Same with the JURISDICTION_CODE.
LOCATION_DESC
missing_locations = data[data.LOCATION_DESC.isna()].rename(columns=dict(LOCATION_DESC="Location Description"))
missing_locations = missing_locations[["Location Description", "date_time", "BORO"]]
missing_locations["Year"] = missing_locations.date_time.dt.year
chart = altair.Chart(missing_locations).mark_bar().encode(
x=altair.X("Year", type="ordinal"),
y="count()",
tooltip=[altair.Tooltip("count()"), altair.Tooltip("BORO"), altair.Tooltip("Year")],
color="BORO"
).properties(
width=800,
height=525,
title="Missing Location Descriptions by Year"
)
save_it(chart, "missing-locations-descriptions")
I was hoping the missing locations could be explained by the descriptions only being recorded after some point in time, but it looks like it's an ongoing problem, not local to any one borough. The location descriptions are interesting, but the number of missing entries makes it hard to say that they're meaningful.
locations = data.LOCATION_DESC.value_counts().reset_index().rename(columns=dict(count="Count", LOCATION_DESC="Location"))
chart = altair.Chart(locations).mark_bar().encode(
x=altair.X("Location", sort=locations.Location.values,
axis=altair.Axis(labelAngle=30)),
y=altair.Y("Count"),
tooltip=[altair.Tooltip("Location"), altair.Tooltip("Count", format=",")]
).properties(
width=900,
height=525,
title="Shooting Location Counts"
)
save_it(chart, "shooting-locations-count")
It looks like most shootings happened at homes, and to avoid being shot you should move to a storage facility.
Coordinates
X_COORD_CD & Y_COORD_CD
These coordinates use the State Plane Coordinate System (see What is the State Plane Coordinate System?) Long Island Zone. The coordinates are mid-block locations with the units in feet.
chart = altair.Chart(
data[["X_COORD_CD", "Y_COORD_CD", "BORO"]]).mark_point(opacity=0.2).encode(
x=altair.X("X_COORD_CD").scale(domain=(data.X_COORD_CD.min(), data.X_COORD_CD.max())),
y=altair.Y("Y_COORD_CD").scale(domain=(data.Y_COORD_CD.min(), data.Y_COORD_CD.max())),
color="BORO"
).properties(
width=800,
height=525,
title="Shootings By Coordinates",
)
save_it(chart, "shootings-by-coordinates")
coordinates = data[["X_COORD_CD", "Y_COORD_CD",
"BORO", "date_time",
"Longitude", "Latitude"]].rename(columns=dict(
BORO="Borough",
X_COORD_CD="X-Coordinate",
Y_COORD_CD="Y-Coordinate"
))
coordinates["Year"] = coordinates.date_time.dt.year
select_year = altair.selection_point(
name="Year",
fields=["Year"],
bind=altair.binding_range(min=2006, max=2021, step=1, name="Year"),
value=[{"Year": 2006}],
)
chart = altair.Chart(coordinates).mark_point(opacity=0.2).encode(
x=altair.X("X-Coordinate").scale(
domain=(coordinates["X-Coordinate"].min(),
coordinates["X-Coordinate"].max())),
y=altair.Y("Y-Coordinate").scale(
domain=(coordinates["Y-Coordinate"].min(),
coordinates["Y-Coordinate"].max())),
color="Borough",
tooltip=[altair.Tooltip("Borough"),
altair.Tooltip("date_time")],
).properties(
width=800,
height=525,
title="Shootings By Coordinates and Year",
).add_params(select_year).transform_filter(select_year)
save_it(chart, "shootings-by-coordinates-and-year")
Latitude & Longitude
With GeoPandas
It turns out that the boroughs of New York City are one of the three datasets that come with GeoPandas, so we can bring it in to make a map.
path_to_data = geopandas.datasets.get_path("nybb")
nyc_geopandas = geopandas.read_file(path_to_data)
print(nyc_geopandas.head(1))
  BoroCode       BoroName     Shape_Leng    Shape_Area  \
0        5  Staten Island  330470.010332  1.623820e+09

                                           geometry
0  MULTIPOLYGON (((970217.022 145643.332, 970227....
Let's look at a map colored with the 2021 shooting counts.
BOROUGH_COLUMN = "BoroName"
SHOOTINGS_COLUMN = "Shootings"
latest = data[data.date_time.dt.year==2021]
latest_count = latest.groupby("BORO").count().reset_index()
expect(latest_count.INCIDENT_KEY.sum()).to(equal(len(latest)))
latest_count = latest_count.rename(columns=dict(BORO=BOROUGH_COLUMN))
latest_count[BOROUGH_COLUMN] = latest_count[BOROUGH_COLUMN].str.title()
latest_count[SHOOTINGS_COLUMN] = latest_count["INCIDENT_KEY"].copy()
latest_count = latest_count[[BOROUGH_COLUMN, SHOOTINGS_COLUMN]]
merged = nyc_geopandas.merge(latest_count, on=BOROUGH_COLUMN)
merged = merged[[BOROUGH_COLUMN, SHOOTINGS_COLUMN, "geometry"]]
explorer = merged.explore(column=SHOOTINGS_COLUMN,
tooltip=BOROUGH_COLUMN,
popup=True)
explorer.save(f"{OUTPUT_PATH}/nyc-shootings-explore.html")
This is probably most interesting to someone like me who doesn't know much about New York City. I wonder why Queens has so many fewer shootings than Brooklyn or the Bronx when it sits between them? Demographics?
Murder or Not
By Year
MURDER_FLAG = "STATISTICAL_MURDER_FLAG"
MURDERED = "Murdered"
BOROUGH = "Borough"
YEAR = "Year"
data[YEAR] = data.date_time.dt.to_period("Y").dt.to_timestamp()
yearly_murders = data.groupby([YEAR, "BORO"]).sum(
MURDER_FLAG).reset_index().rename(
columns={
"BORO": BOROUGH,
MURDER_FLAG: MURDERED})
yearly_murders = yearly_murders[[YEAR, BOROUGH, MURDERED]]
SELECTION = altair.selection_point(fields=[BOROUGH], bind="legend")
chart = altair.Chart(yearly_murders).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X(YEAR),
y=altair.Y(MURDERED),
color=BOROUGH,
tooltip=[altair.Tooltip(YEAR),
altair.Tooltip(MURDERED),
altair.Tooltip(BOROUGH)],
opacity=altair.condition(SELECTION,
altair.value(0.8),
altair.value(0.2))
).add_params(SELECTION).properties(
width=800,
height=525,
title="NYC Murders By Year"
)
save_it(chart, "yearly_borough_murders")
By Month
MONTH = "Month"
data[MONTH] = data.date_time.dt.to_period("M").dt.to_timestamp()
murdered = data.groupby([MONTH, "BORO"]).sum(MURDER_FLAG).reset_index()
murdered = murdered[[MONTH, "BORO", MURDER_FLAG]].rename(columns={
"BORO": BOROUGH,
MURDER_FLAG: MURDERED,
})
SELECTION = altair.selection_point(fields=[BOROUGH], bind="legend")
chart = altair.Chart(murdered).mark_line(
point={"filled": False,
"fill": "white"}).encode(
x=altair.X(MONTH),
y=altair.Y(MURDERED),
color=BOROUGH,
tooltip=[altair.Tooltip(MONTH, format=MONTH_YEAR),
altair.Tooltip(MURDERED),
altair.Tooltip(BOROUGH)],
opacity=altair.condition(SELECTION,
altair.value(0.8),
altair.value(0.2))
).add_params(SELECTION).properties(
width=800,
height=525,
title="NYC Murders By Month"
)
save_it(chart, "monthly_borough_murders")
It looks like Brooklyn led in murders in the Summer of 2020.
By Coordinate
murdered = data[data[MURDER_FLAG]]
murder_coordinates = murdered[["BORO", "date_time",
"Longitude", "Latitude"
]].rename(columns=dict(
BORO="Borough",
))
murder_coordinates["Year"] = murder_coordinates.date_time.dt.year
chart = (altair.Chart(murder_coordinates)
.mark_point(opacity=0.2)
.encode(
x=altair.X("Longitude").scale(
domain=(murder_coordinates["Longitude"].min(),
murder_coordinates["Longitude"].max())),
y=altair.Y("Latitude").scale(
domain=(murder_coordinates["Latitude"].min(),
murder_coordinates["Latitude"].max())),
color="Borough",
tooltip=[altair.Tooltip("Borough"),
altair.Tooltip("date_time")],
).properties(
width=800,
height=525,
title="Murders By Longitude, Latitude and Year",
).add_params(select_year)
.transform_filter(select_year))
save_it(chart, "murders-by-coordinates-and-year")
Victims and Perpetrators
Perpetrators
PERP_AGE_GROUP
print(TABLE(data.PERP_AGE_GROUP.value_counts(dropna=False).reset_index()))
PERP_AGE_GROUP | count |
---|---|
nan | 9344 |
18-24 | 5844 |
25-44 | 5202 |
UNKNOWN | 3148 |
<18 | 1463 |
45-64 | 535 |
65+ | 57 |
1020 | 1 |
940 | 1 |
224 | 1 |
PERP_SEX
print(TABLE(data.PERP_SEX.value_counts(dropna=False).reset_index()))
PERP_SEX | count |
---|---|
M | 14416 |
nan | 9310 |
U | 1499 |
F | 371 |
PERP_RACE
print(TABLE(data.PERP_RACE.value_counts(dropna=False).reset_index()))
PERP_RACE | count |
---|---|
BLACK | 10668 |
nan | 9310 |
WHITE HISPANIC | 2164 |
UNKNOWN | 1836 |
BLACK HISPANIC | 1203 |
WHITE | 272 |
ASIAN / PACIFIC ISLANDER | 141 |
AMERICAN INDIAN/ALASKAN NATIVE | 2 |
It looks like, besides the NaN values, missing data is also classified as U or UNKNOWN.
Handling the Unknowns
Let's try and unify the missing values.
data.loc[data.PERP_AGE_GROUP.isna(), "PERP_AGE_GROUP"] = "UNKNOWN"
data.loc[data.PERP_AGE_GROUP == "U", "PERP_AGE_GROUP"] = "UNKNOWN"
print(TABLE(data.PERP_AGE_GROUP.value_counts(dropna=False).reset_index()))
PERP_AGE_GROUP | count |
---|---|
UNKNOWN | 12492 |
18-24 | 5844 |
25-44 | 5202 |
<18 | 1463 |
45-64 | 535 |
65+ | 57 |
1020 | 1 |
940 | 1 |
224 | 1 |
data.loc[data.PERP_SEX.isna(), "PERP_SEX"] = "UNKNOWN"
data.loc[data.PERP_SEX == "U", "PERP_SEX"] = "UNKNOWN"
print(TABLE(data.PERP_SEX.value_counts(dropna=False).reset_index()))
PERP_SEX | count |
---|---|
M | 14416 |
UNKNOWN | 10809 |
F | 371 |
data.loc[data.PERP_RACE.isna(), "PERP_RACE"] = "UNKNOWN"
data.loc[data.PERP_RACE == "U", "PERP_RACE"] = "UNKNOWN"
print(TABLE(data.PERP_RACE.value_counts(dropna=False).reset_index()))
PERP_RACE | count |
---|---|
UNKNOWN | 11166 |
BLACK | 10650 |
WHITE HISPANIC | 2164 |
BLACK HISPANIC | 1201 |
WHITE | 272 |
ASIAN / PACIFIC ISLANDER | 141 |
AMERICAN INDIAN/ALASKAN NATIVE | 2 |
So, the first thing we can notice is that the UNKNOWN counts don't match, so sometimes some attributes were known but not others; they don't just represent cases where the perpetrator wasn't caught, as I thought they might. The second thing is that for age and race the unknowns have the highest counts. For all three categories the numbers of unknowns are so high that it's hard to say how they might affect the cases where the values are known.
Victims
VIC_AGE_GROUP
print(TABLE(data.VIC_AGE_GROUP.value_counts(dropna=False).reset_index()))
VIC_AGE_GROUP | count |
---|---|
UNKNOWN | 10863 |
25-44 | 6500 |
18-24 | 5389 |
<18 | 1657 |
45-64 | 1070 |
65+ | 117 |
VIC_SEX
print(TABLE(data.VIC_SEX.value_counts(dropna=False).reset_index()))
VIC_SEX | count |
---|---|
M | 13182 |
UNKNOWN | 10809 |
F | 1598 |
U | 7 |
VIC_RACE
print(TABLE(data.VIC_RACE.value_counts(dropna=False).reset_index()))
VIC_RACE | count |
---|---|
UNKNOWN | 10851 |
BLACK | 9962 |
WHITE HISPANIC | 2515 |
BLACK HISPANIC | 1528 |
WHITE | 485 |
ASIAN / PACIFIC ISLANDER | 249 |
AMERICAN INDIAN/ALASKAN NATIVE | 6 |
Somewhat unexpectedly, the unknowns are also alarmingly high for the victims, suggesting that they represent poor case records and perhaps the demographic information isn't reliable enough to be useful.
Sources
- New York City Map by I, Bubba, via WikiMedia Commons
Activity Selector Implementations
# from pypi
from expects import contain_exactly, equal, expect
Example Data
PADDING = 0
start = [PADDING, 1, 3, 0, 5, 3, 5, 6, 8, 8, 2, 12]
finish = [PADDING, 4, 5, 6, 7, 9, 9, 10, 11, 12, 14, 16]
expected_activities = [1, 4, 8, 11]
expect(len(start)).to(equal(len(finish)))
A Recursive Selector
def recursive_activity_selector(start, finish, k: int=0) -> list:
"""Selects the optimal activities
Note:
This assumes that the start and finish lists are pre-padded with
zeros (1 each).
Args:
start: list of start times for each activity
finish: list of end times for each activity
k: the current activity (index)
Returns:
optimal list of activities
"""
n = len(start)
m = k + 1
while m < n and start[m] < finish[k]:
m += 1
if m < n:
return [m] + recursive_activity_selector(start, finish, m)
return []
activities = recursive_activity_selector(start, finish)
expect(activities).to(contain_exactly(*expected_activities))
print(activities)
[1, 4, 8, 11]
An Iterative Selector
def iterative_activity_selector(start: list, finish: list) -> list:
"""An iterative activity selector
Note:
This assumes that the start and finish lists are pre-padded with
dummy values.
Args:
start: list of starting times
finish: list of end times
"""
n = len(start)
activities = [1]
k = 1
for m in range(2, n):
if start[m] >= finish[k]:
activities += [m]
k = m
return activities
activities = iterative_activity_selector(start, finish)
expect(activities).to(contain_exactly(*expected_activities))
print(activities)
The Activity Selection Problem
What is the Activity Selection Problem?
- In this problem we have a set of n activities \(S = \{a_1, a_2, \ldots, a_n\}\).
- Assume the activities are sorted by increasing finish time
- Each activity has a start time \(s_i\) and a finish time \(f_i\)
- Only one activity can happen at a time.
- Two activities are compatible if the earlier one finishes at or before the time the other one starts.
The problem we want to solve is finding the largest set of mutually compatible activities from our set of activities. This differs from a more common scheduling problem in that the times are fixed - we can't re-arrange activities to make them fit, so they either can happen or they can't.
The Greedy Choice
CLRS goes into the more formal aspects of finding the optimal substructure and proving that a greedy strategy works, but in this case the reasoning is fairly intuitive (and I didn't find the proof so easy to follow) so I'll skip the formal stuff. The thing to recognize is that the sooner the first activity you schedule finishes, the more time there is for other activities. So if, after each pick, the next activity you choose is the remaining one that doesn't conflict with your current pick and finishes first, you will maximize the number of activities you can schedule.
The Recursive Activity Selector
\begin{algorithm} \caption{Recursive Activity Selector} \begin{algorithmic} \REQUIRE The activities are sorted in non-decreasing order of finish time \INPUT $s$: The array of start times for the activities. \INPUT $f$: The array of finish times for the activities. \INPUT $k$: The last activity selected so far. \INPUT $n$: The number of activities. \OUTPUT The activities that maximize the number of compatible activities. \PROCEDURE{RecursiveActivitySelector}{$s, f, k, n$} \STATE \(m \gets k + 1\) \WHILE {\(m \le n\) and \(s[m] < f[k]\)} \STATE \(m \gets m + 1\) \ENDWHILE \IF {\(m \le n\)} \RETURN \(\{a_m\} \cup \) \textsc{RecursiveActivitySelector}(\(s, f, m, n\)) \ELSE \RETURN \{\} \ENDIF \ENDPROCEDURE \end{algorithmic} \end{algorithm}
Note: The activity set (\(\{a_1, a_2, \ldots, a_n\}\)) is treated as a global collection so it is used in the function but not passed in as an argument.
Padding The Start
Because the recursive algorithm is looking for the next earliest finish time after the first event we need to pad the activities with a dummy activity that has a finish time of 0 (\(f_0 = 0\)) and then when the function is first called the k value is set to 0:
\[ RecursiveActivitySelector(s, f, 0, n) \]
The Iterative Activity Selector
\begin{algorithm} \caption{Greedy Activity Selector} \begin{algorithmic} \REQUIRE The activities are sorted in non-decreasing order of finish time \INPUT $s$: The array of start times for the activities. \INPUT $f$: The array of finish times for the activities. \OUTPUT The activities that maximize the number of compatible activities. \PROCEDURE{GreedyActivitySelector}{\(s, f\)} \STATE \(n \gets s.length \) \STATE \(A \gets \{a_1\} \) \STATE \(k \gets 1 \) \FOR {\(m \in \{2, \ldots, n \}\)} \IF {\( s[m] \ge f[k] \)} \STATE \(A \gets A \cup \{a_m\}\) \STATE \(k \gets m \) \ENDIF \ENDFOR \RETURN \(A\) \ENDPROCEDURE \end{algorithmic} \end{algorithm}
Weighted Completion Times
The Weighted Scheduling Problem
This is a look at the problem of arranging a schedule of jobs so that their completion times are lowest, while accounting for different priorities for each job.
Input: \(n\) jobs with positive lengths \(l_1, l_2, \ldots, l_n\) and positive weights \(w_1, w_2,\ldots,w_n\). The larger the weight, the higher the priority.
Output: A job sequence (the schedule) that minimizes the sum of the weighted completion times \(w_1 C_1 + w_2 C_2 + \cdots + w_n C_n\), where \(C_j\) is the completion time of job \(j\) (defined below).
Completion Time
What we're minimizing is the sum of the weighted completion times. A job's completion time is the total of the lengths of all the jobs that precede it in the schedule plus the length of the job itself. This means the completion time of the first job is just that job's length, the completion time of the second job is the length of the first job plus the length of the second, and so on.
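To make that concrete, here's a small sketch (with made-up lengths and weights, not anything from the book) that computes the completion times and the weighted objective for jobs run in a given order.
def weighted_completion_time(lengths: list, weights: list) -> int:
    """Sum of weight * completion-time for jobs run in the given order."""
    total, elapsed = 0, 0
    for length, weight in zip(lengths, weights):
        elapsed += length        # this job's completion time
        total += weight * elapsed
    return total

# two made-up jobs run in the order given: completion times are 5 and 7,
# so the weighted sum is 3 * 5 + 1 * 7 = 22
print(weighted_completion_time(lengths=[5, 2], weights=[3, 1]))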
Greedy Metrics
The greedy method for this problem involves finding a metric such that when the jobs are sorted by the metric they form the optimal schedule. One possible metric is the Greedy Difference which is the weight for a job minus the time the job takes (\(w_i - l_i\)).
Another possible metric is the Greedy Ratio which is the ratio of job-weight to job-length (\(\frac{w_i}{l_i}\)).
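Here's a sketch (re-using the weighted_completion_time function and the made-up jobs from above) that schedules the jobs by each metric; for this example the ratio ordering gives a smaller total, which matches the book's result that the Greedy Ratio ordering is always optimal while the Greedy Difference is not.
lengths, weights = [5, 2], [3, 1]
jobs = list(zip(lengths, weights))

# sort the (length, weight) pairs by each metric, largest metric value first
by_difference = sorted(jobs, key=lambda job: job[1] - job[0], reverse=True)
by_ratio = sorted(jobs, key=lambda job: job[1] / job[0], reverse=True)

for name, schedule in (("Greedy Difference", by_difference),
                       ("Greedy Ratio", by_ratio)):
    ordered_lengths, ordered_weights = zip(*schedule)
    print(f"{name}: {weighted_completion_time(ordered_lengths, ordered_weights)}")
For these two jobs the difference metric runs the short job first and totals 23, while the ratio metric totals 22.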
Longest Common Subsequence
The Longest Common Subsequence Problem
A subsequence of a sequence is the original sequence with zero or more elements removed from it. A common subsequence of two sequences is a subsequence that belongs to both sequences. With the Longest Common Subsequence (LCS) problem you are given two sequences - \(X = \langle x_1, x_2,\ldots,x_m\rangle\) and \(Y = \langle y_1, y_2,\ldots,y_n\rangle\) and you want to find a common subsequence of \(X\) and \(Y\) that has the maximum length (there might be more than one common subsequence that have the maximum length).
Note: The elements of the subsequence have to be in both sequences in the same relative order, but don't need to be consecutive. "AK" is a common subsequence of "ARK" and "BACK" because "A" and "K" appear in both sequences and the "A" comes before the "K," even though there are letters between them in both of the original sequences.
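Here's a tiny sketch of that definition (the helper name is just made up for illustration): it scans the longer sequence once, checking that the candidate's elements appear in order.
def is_subsequence(candidate: str, sequence: str) -> bool:
    """True if candidate's elements appear in sequence, in the same order."""
    remaining = iter(sequence)
    # "element in remaining" consumes the iterator up to the first match,
    # so later elements have to appear after earlier ones
    return all(element in remaining for element in candidate)

# "AK" is a subsequence of both "ARK" and "BACK", so it's a common subsequence
print(is_subsequence("AK", "ARK"), is_subsequence("AK", "BACK"))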
Finding The LCS
We basically have to look at three cases.
\[ c[i,j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0\\ c[i - 1, j - 1] + 1 & \text{if } i,j > 0 \text{ and } x_i = y_j\\ \text{Max}(c[i, j-1], c[i - 1, j]) & \text{if } i, j > 0 \text{ and } x_i \neq y_j \end{cases} \]
- \(c\) is the lookup table that we will build
- \(i\) is the index of an element in \(X\) and a row in \(c\)
- \(j\) is the index of an element in \(Y\) and a column in \(c\)
- \(X\) and \(Y\) use 1-based indices, not 0-based
The Length-Finder
As with the other Dynamic Programming algorithms, our function is going to find the length of the longest sequence, not the actual sequence (this is left to another function).
\begin{algorithm} \caption{Longest Common Subsequence Length Finder} \begin{algorithmic} \INPUT $X$: Sequence of elements \INPUT $Y$: Sequence of elements \OUTPUT Table of common subsequence lengths, table to reconstruct the best subsequence. \PROCEDURE{LCS-Length}{$X, Y$} \STATE \(m \gets X.length\) \STATE \(n \gets Y.length\) \STATE \(b[1\ldots m, 1 \ldots n]\) and \(c[0 \ldots m, 0 \ldots n]\) are new tables. \FOR {\(i \in \{1 \ldots m\}\)} \STATE \(c[i, 0] \gets 0\) \ENDFOR \FOR {\(j \in \{0 \ldots n\}\)} \STATE \(c[0, j] \gets 0\) \ENDFOR \FOR {\(i \in \{1 \ldots m\}\)} \FOR {\(j \in \{1 \ldots n\}\)} \IF {\(X_i = Y_j\)} \STATE \(c[i, j] \gets c[i - 1, j - 1] + 1\) \STATE \(b[i, j] \gets "\nwarrow"\) \ELSEIF {\(c[i - 1, j] \geq c[i, j-1]\)} \STATE \(c[i, j] \gets c[i - 1, j]\) \STATE \(b[i, j] \gets "\uparrow" \) \ELSE \STATE \(c[i, j] \gets c[i, j - 1]\) \STATE \(b[i,j] \gets "\leftarrow"\) \ENDIF \ENDFOR \ENDFOR \RETURN $c, b$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}
Implementation
# python
from collections import namedtuple
# pypi
from attrs import define
from expects import contain_only, equal, expect
Direction = namedtuple("Direction", ["up_left", "up", "left"],
defaults=["\\", "|", "-"])()
@define
class LongestCommonSubsequence:
"""Finds the longest common subsequence of two sequences
Args:
x: first sequence
y: second sequence
"""
x: list
y: list
_table: list=None
_trace_table: list=None
_length: int=None
@property
def table(self) -> list:
"""The memoization table"""
if self._table is None:
self._table = [[0] + [None] * len(self.y)
for row in range(len(self.x) + 1)]
for column in range(len(self.y) + 1):
self._table[0][column] = 0
return self._table
@property
def trace_table(self) -> list:
"""Table to reconstruct longest common subsequence"""
if self._trace_table is None:
self._trace_table = [[None] * len(self.y)
for row in range(len(self.x))]
return self._trace_table
@property
def length(self) -> int:
"""Number of characters in longest common subsequence"""
if self._length is None:
for row in range(1, len(self.x) + 1):
for column in range(1, len(self.y) + 1):
previous_row, previous_column = row - 1, column - 1
unpadded_row, unpadded_column = row - 1, column - 1
if self.x[previous_row] == self.y[previous_column]:
self.table[row][column] = self.table[previous_row][previous_column] + 1
self.trace_table[unpadded_row][unpadded_column] = Direction.up_left
elif self.table[previous_row][column] >= self.table[row][previous_column]:
self.table[row][column] = self.table[previous_row][column]
self.trace_table[unpadded_row][unpadded_column] = Direction.up
else:
self.table[row][column] = self.table[row][previous_column]
self.trace_table[unpadded_row][unpadded_column] = Direction.left
self._length = self.table[-1][-1]
return self._length
def print_lcs(self, row: int, column: int) -> None:
"""Prints the elements of the longest common subsequence
Note:
to start row and column should match the last cell in the trace table.
Args:
row: row in the trace_table to start with
column: column in the trace_table to start with
"""
if row < 0 or column < 0:
return
if self.trace_table[row][column] == Direction.up_left:
self.print_lcs(row - 1, column - 1)
print(self.x[row])
elif self.trace_table[row][column] == Direction.up:
self.print_lcs(row - 1, column)
else:
self.print_lcs(row, column - 1)
return
def print_longest(self):
self.print_lcs(len(self.trace_table) - 1, len(self.trace_table[0]) - 1)
return
x = "A B C B D A B".split()
y = "B D C A B A".split()
lcs = LongestCommonSubsequence(x, y)
expect(len(lcs.table)).to(equal(len(x) + 1))
expect(len(lcs.table[0])).to(equal(len(y) + 1))
for row in lcs.table:
expect(row[0]).to(equal(0))
expect(lcs.table[0]).to(contain_only(*([0] * (len(y) + 1))))
expect(len(lcs.trace_table)).to(equal(len(x)))
expect(len(lcs.trace_table[0])).to(equal(len(y)))
expect(lcs.length).to(equal(4))
Printing The Sequence
\begin{algorithm} \caption{Longest Common Subsequence Printer} \begin{algorithmic} \INPUT $b$: Table of directions from \textsc{LCS-Length} \INPUT $X$: Sequence of elements \INPUT $Y$: Sequence of elements \INPUT $i$: Row number in $b$ \INPUT $j$: Column number in $b$ \PROCEDURE{Print-LCS}{$b, X, i, j$} \IF {\(i = 0 \text{ or } j = 0\)} \RETURN \ENDIF \IF {\(b[i, j] = "\nwarrow" \)} \STATE \textsc{PrintLCS}(\(b, X, i - 1, j - 1\)) \STATE \textsc{Print}(\(X_i\)) \ELSEIF {\(b[i, j] = "\uparrow" \)} \STATE \textsc{PrintLCS}(\(b, X, i- 1, j \)) \ELSE \STATE \textsc{PrintLCS}(\(b, X, i, j - 1\)) \ENDIF \ENDPROCEDURE \end{algorithmic} \end{algorithm}
lcs.print_longest()
B C B A
Comparing Energy Prices
Table of Contents
- How Many Megajoules For a Dollar of Gas?
- How Many MegaJoules For a Dollar of Electricity?
- How Many Megajoules For a Dollar Of Natural Gas?
- How Many Megajoules For a Dollar of Coal?
- How Many Megajoules Per Dollar In Corn (Biodiesel)?
- What Is the Ratio of the Cost of the Most Expensive to the Cheapest?
- Source
# from python
from fractions import Fraction
def print_energy(megajoules_per_dollar: float, source: str) -> None:
"""Print the energy provided by the source.
Args:
megajoules_per_dollar: how many megajoules provided by the source for each dollar
source: where the energy is coming from
"""
print(f"{source} provides {megajoules_per_dollar:0.2f} megajoules per dollar.")
print(f"{source} costs ${1/megajoules_per_dollar:0.4f} per megajoule.")
return
COSTS = []
How Many Megajoules For a Dollar of Gas?
- A gallon of gasoline carries with it \(\approx 1.3E8\) joules of energy.
- You pay \(5 \frac{\text{dollars}}{\text{gallon}}\).
- How many megajoules (\(10^6\) joules) can you get for a dollar?
Chain-Link Conversion
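Writing the conversion out as a chain of units, using just the two numbers from the bullets above:
\begin{align} \left(\frac{1\ \cancel{gallon}}{5\ dollars}\right)\left(\frac{1.3 \times 10^{8}\ joules}{1\ \cancel{gallon}}\right) &= \frac{2.6 \times 10^{7}\ joules}{dollar}\\ \left(\frac{2.6 \times 10^{7}\ \cancel{joules}}{dollar}\right)\left(\frac{1\ megajoule}{10^{6}\ \cancel{joules}}\right) &= 26\ \frac{megajoules}{dollar} \end{align}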
In Python
gallons_per_dollar = Fraction(1, 5)
joules_per_gallon = 13 * 10**7
joules_per_dollar = joules_per_gallon * gallons_per_dollar
MEGAJOULES_PER_JOULE = Fraction(1, 10**6)
gas_megajoules_per_dollar = joules_per_dollar * MEGAJOULES_PER_JOULE
gasoline = 1/gas_megajoules_per_dollar
print_energy(float(gas_megajoules_per_dollar), "Gasoline")
COSTS.append((gasoline, "Gasoline"))
Gasoline provides 26.00 megajoules per dollar. Gasoline costs $0.0385 per megajoule.
How Many MegaJoules For a Dollar of Electricity?
- Electricity is $0.05 per kilowatt-hour.
- \(1\text{ watt} = 1 \frac{joule}{second}\)
- \(1\text{ kilowatt-hour} = 1,000\text{ watts} \times 3,600\text{ seconds}\)
Kilowatt-Hour To Joules
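Putting the watt and kilowatt-hour definitions from the bullets together:
\begin{align} 1\ \text{kilowatt-hour} &= \left(1000\ \frac{\text{joules}}{\cancel{\text{second}}}\right)\left(3600\ \cancel{\text{seconds}}\right)\\ &= 3.6 \times 10^{6}\ \text{joules} = 3.6\ \text{megajoules} \end{align}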
A Dollar's Worth Of Megajoules
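And at five cents per kilowatt-hour, a dollar buys twenty of them, or seventy-two megajoules:
\begin{align} \frac{1\ \text{dollar}}{0.05\ \frac{\text{dollars}}{\text{kilowatt-hour}}} &= 20\ \text{kilowatt-hours}\\ 20\ \cancel{\text{kilowatt-hours}} \times 3.6\ \frac{\text{megajoules}}{\cancel{\text{kilowatt-hour}}} &= 72\ \text{megajoules} \end{align}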
Check The Math
kilowatt_hours_per_dollar = 20 # 1/0.05 = 20
megajoules = 3.6
electricity_megajoules_per_dollar = kilowatt_hours_per_dollar * megajoules
electricity = 1/electricity_megajoules_per_dollar
print_energy(electricity_megajoules_per_dollar, "Electricity")
COSTS.append((electricity, "Electricity"))
Electricity provides 72.00 megajoules per dollar. Electricity costs $0.0139 per megajoule.
How Many Megajoules For a Dollar Of Natural Gas?
- A standard cubic foot of natural gas has about \(1.1 \times 10^6\) joules of energy.
- You can get about \(5\times 10^5\) BTUs of gas for a dollar.
- There are about 1,030 BTUs in a cubic foot.
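Chaining the three bullets together gives the number the python below re-checks:
\begin{align} \left(\frac{1.1 \times 10^{6}\ \text{joules}}{\cancel{\text{cubic foot}}}\right)\left(\frac{1\ \cancel{\text{cubic foot}}}{1030\ \text{BTU}}\right) &\approx 1068\ \frac{\text{joules}}{\text{BTU}}\\ \left(1068\ \frac{\text{joules}}{\cancel{\text{BTU}}}\right)\left(\frac{5 \times 10^{5}\ \cancel{\text{BTU}}}{\text{dollar}}\right) &\approx 5.34 \times 10^{8}\ \frac{\text{joules}}{\text{dollar}} \approx 534\ \frac{\text{megajoules}}{\text{dollar}} \end{align}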
Check The Math
joules_per_cubic_foot = 1.1E6
cubic_feet_per_btu = 1/1030
btus_per_dollar = 5e5
joules_per_btu = joules_per_cubic_foot * cubic_feet_per_btu
joules_per_dollar = joules_per_btu * btus_per_dollar
natural_gas_megajoules_per_dollar = joules_per_dollar * MEGAJOULES_PER_JOULE
natural_gas = 1/natural_gas_megajoules_per_dollar
print_energy(natural_gas_megajoules_per_dollar, "Natural Gas")
COSTS.append((natural_gas, "Natural Gas"))
Natural Gas provides 533.98 megajoules per dollar. Natural Gas costs $0.0019 per megajoule.
How Many Megajoules For a Dollar of Coal?
- A Ton of coal holds about \(3.2 × 10^{10}\) joules of energy.
- A Ton of coal costs $40.
How many Megajoules of energy in the form of coal can you get for a dollar?
\begin{align} \left(\frac{3.2 \times 10^{10}\ joules}{40\ dollars}\right) &= \left(\frac{8 \times 10^{8}\ joules}{dollar}\right)\\ &= \left(\frac{8 \times 10^{8}\ \cancel{joules}}{dollar}\right)\left(\frac{1\ Megajoule}{1 \times 10^6\ \cancel{joules}}\right)\\ &=8 \times 10^2 \frac{Megajoules}{dollar} \end{align}
Checking the Math
coal_megajoules_per_dollar = MEGAJOULES_PER_JOULE * 3.2e10/40
coal = 1/coal_megajoules_per_dollar
print_energy(coal_megajoules_per_dollar, "Coal")
COSTS.append((coal, "Coal"))
Coal provides 800.00 megajoules per dollar. Coal costs $0.0013 per megajoule.
How Many Megajoules Per Dollar In Corn (Biodiesel)?
- Corn oil costs about $0.10 per fluid ounce (wholesale).
- A fluid ounce carries about 240 dietary calories (kilo-calories).
- A calorie is about 4.2 joules.
How many Megajoules of energy in the form of corn oil can you get for a dollar?
\begin{align} \left(\frac{1\ \cancel{fluid-ounce}}{0.10\ dollar}\right)\left(\frac{240\ kilocalories}{\cancel{fluid-ounce}}\right) &= \left(\frac{240\ \cancel{kilocalories}}{0.10\ dollar}\right)\left(\frac{10^3\ calories}{1\ \cancel{kilocalorie}}\right)\\ &=\left(\frac{24\times 10^4\ \cancel{calories}}{0.10\ dollar}\right)\left(\frac{4.2\ joules}{1\ \cancel{calorie}}\right)\\ &=\left(\frac{100.8\times 10^4\ joules}{1 \times 10^{-1}\ dollar}\right)\\ &=\left(\frac{1.008 \times 10^7\ joules}{dollar}\right)\left(\frac{1\ megajoule}{10^6\ joules}\right)\\ &= 1.008 \times 10^{1}\ \frac{megajoules}{dollar}\\ &= 10.08 \end{align}
Checking the Math
ounces_per_dollar = 1/0.10
kilocalories_per_ounce = 240
joules_per_calorie = 4.2
calories_per_kilocalorie = 10**3
corn_megajoules_per_dollar = (ounces_per_dollar
* kilocalories_per_ounce
* calories_per_kilocalorie
* joules_per_calorie
* MEGAJOULES_PER_JOULE)
corn = 1/corn_megajoules_per_dollar
print_energy(corn_megajoules_per_dollar, "Corn")
COSTS.append((corn, "Corn"))
Corn provides 10.08 megajoules per dollar. Corn costs $0.0992 per megajoule.
What Is the Ratio of the Cost of the Most Expensive to the Cheapest?
most_expensive, cheapest, cost, name = max(COSTS), min(COSTS), 0, 1
print(f"The most expensive form of energy is {most_expensive[name]} "
f"at ${most_expensive[cost]:0.2f} per megajoule.")
print(f"The cheapest form of energy is {cheapest[name]} at "
f"${cheapest[cost]:0.4f} per megajoule.")
The most expensive form of energy is Corn at $0.10 per megajoule. The cheapest form of energy is Coal at $0.0013 per megajoule.
print(f"{most_expensive[name]} is {most_expensive[cost]/cheapest[cost]:0.2f} "
f"times more expensive than {cheapest[name]}.")
Corn is 79.37 times more expensive than Coal.
Number of Subsets In a Set
Short Answer
The number of subsets for a set of n elements is \(2^n\).
One Way To Remember It
Let's say we have a set of n things and we want to know how many different ways we can pick subsets of those things (including all of them or none of them). If we have three things, for instance, we might pick only the first item, or the first and the last item, or only the second item, etc.
We can represent the items as boxes in a row: if an item is in the subset we put a one in its box, and if it isn't we put a zero there. So for the three-item example we can represent the subset that has the first and third items as 101 (the boxes are invisible and elastic). Writing it this way, we can think of each element of the set as a binary digit, and the number of subsets we can create is the same as the number of binary numbers with as many digits as there are elements in our set.
To see all the possible subsets you can write out the binary numbers. For the case of three elements we represent them as three digits so finding the subsets will be the equivalent of counting from 0 to 7 in binary.
Decimal | Item 1 | Item 2 | Item 3 |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 |
3 | 0 | 1 | 1 |
4 | 1 | 0 | 0 |
5 | 1 | 0 | 1 |
6 | 1 | 1 | 0 |
7 | 1 | 1 | 1 |
In this case we get 8 possibilities. More generally, for n digits we can count from \(0\) to \(2^n - 1\) for a total of \(2^n\) numbers so the total number of subsets of a set with n elements is \(2^n\).
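A short sketch of the same counting argument in python (the item names are just placeholders): it counts from 0 to \(2^n - 1\) and uses the bits of each number to pick the items in that subset.
items = ["a", "b", "c"]
n = len(items)

subsets = []
for number in range(2**n):  # count from 0 to 2^n - 1 in binary
    # bit i of the number decides whether item i is in this subset
    subsets.append([item for i, item in enumerate(items) if (number >> i) & 1])

print(len(subsets))  # 8 subsets, which is 2^3
print(subsets)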
Algorithms Illuminated - Part III: Greedy Algorithms and Dynamic Programming
Citation
- [AI3] Roughgarden T. Algorithms illuminated. Part 3: Greedy algorithms and dynamic programming. First edition. San Francisco, CA: Soundlikeyourself Publishing, LLC; 2019. 217 p.
On The Web
- Algorithms Illuminated website: Has links to all four books, videos, test cases, etc.
- Tim Roughgarden's (the author) Website