Link Collection

Table of Contents

Data Visualization

Books

Tutorials

NYC With GeoPandas

This is a look at making a map of the New York City boroughs using GeoPandas. The GeoPandas examples come from the Introduction To GeoPandas.

Setup

Imports

Note: Using hvplot requires geoviews and scipy on top of geopandas.

# python
from functools import partial 

import sys

# pypi
from tabulate import tabulate

import altair
import geopandas
import hvplot.pandas

# my stuff
from graeae import EmbedHoloviews
from graeae.visualization.altair_helpers import output_path, save_chart
TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False)

SLUG = "nyc-with-geopandas"
OUTPUT_FOLDER = f"files/posts/{SLUG}"
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_FOLDER)

OUTPUT_PATH = output_path(SLUG)
save_altair = partial(save_chart, output_path=OUTPUT_PATH, height=650)

Versions

print(sys.version)
print(altair.__version__)
3.9.9 (main, Dec 21 2021, 10:03:34) 
[GCC 10.2.1 20210110]
5.0.0rc1

I'm using the pre-release version of altair since its interface is different from the 4.x versions.

The Data

GeoPandas comes with some built-in datasets (only three, actually) including one for the Boroughs of New York City, which is what we'll load here.

Here are the available datasets.

for dataset in geopandas.datasets.available:
    print(f" - ~{dataset}~")
  • naturalearth_cities
  • naturalearth_lowres
  • nybb

We want the last one, nybb. To load the map data into GeoPandas we give it the path to the source file; since this is one of the built-in datasets we can get the path with the geopandas.datasets.get_path function.

path_to_data = geopandas.datasets.get_path("nybb")
nyc_data = geopandas.read_file(path_to_data)

print(nyc_data.head(1))
   BoroCode       BoroName     Shape_Leng    Shape_Area  \
0         5  Staten Island  330470.010332  1.623820e+09   

                                            geometry  
0  MULTIPOLYGON (((970217.022 145643.332, 970227....  

The GeoPandas methods will automatically use the geometry column when we do any mapping, so we don't have to do anything special to tell it what to use. Note that although printing the first row makes it look like we have a pandas DataFrame, it's actually a GeoDataFrame, so it has both the pandas methods and the geopandas methods.
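
Here's a quick check of that (a minimal sketch using the frame we just loaded): the type is a GeoDataFrame, the geometry accessor reports which column will be used for mapping, and plain pandas column access still works on the same object.

print(type(nyc_data))              # <class 'geopandas.geodataframe.GeoDataFrame'>
print(nyc_data.geometry.name)      # the active geometry column ("geometry")
print(nyc_data.BoroName.tolist())  # ordinary pandas column access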

Plotting The Boroughs With HVPlot

plot = nyc_data.hvplot(hover_cols=["BoroName"], legend=False, tools=["hover", "wheel_zoom"],).opts(
    title="New York City Boroughs",
    width=800,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None
)
outcome = Embed(plot=plot, file_name="nyc_boroughs")()
print(outcome)

Figure Missing

Folium Map

The GeoPandas explore method creates a Folium map that plots the boroughs using the geometry data and puts them on top of a street map. To give it something extra we'll add a column for the area of each borough and color the output based on it.

nyc_data["Area"] = nyc_data.area
print(nyc_data.Area)
0    1.623822e+09
1    3.045214e+09
2    1.937478e+09
3    6.364712e+08
4    1.186926e+09
Name: Area, dtype: float64
explorer = nyc_data.explore(column="Area", legend=False,
                            tooltip="BoroName", popup=True)
explorer.save(f"{OUTPUT_FOLDER}/nyc-explore.html")

Figure Missing

  • The column argument tells geopandas which column to use to color the choropleth map
  • popup=True makes it so that clicking on a borough shows the data from all of the columns

An Altair Map

Using altair isn't too far from using hvplot, although it does have (a little) more documentation for mapping. For some reason it doesn't recognize our data as geographic data so we'll have to use the project method to tell altair to treat the geometry data as x-y data.

chart = altair.Chart(nyc_data).mark_geoshape().encode(
    color="Area",
    tooltip=[altair.Tooltip("BoroName", type="nominal"),
             altair.Tooltip("Area", type="quantitative", format=".2e")]
).project(type="identity", reflectY=True).properties(
    width=800,
    height=525,
    title="NYC Borough Areas",
)

save_altair(chart, "nyc-borough-areas-altair")

Figure Missing

Sources

GeoPandas and Portland Neighborhoods

Prelude: The Latest GeoPandas Debacle

As I noted in a previous post, you can't just install geopandas and geoviews; you also need to install cartopy, but cartopy currently requires version 8 of proj while the stable versions of debian and ubuntu both have version 7.2. I originally tried to download the (unstable) .deb files and install them, but libproj-dev requires libproj22, which requires a newer version of the C++ libraries than is currently in debian-stable… at that point I gave up and ran everything on debian bookworm (currently in testing), which has the newer versions of the things cartopy needs.

Beginning

This is going to be a simple plot of Portland neighborhoods using GeoPandas and GeoJSON. I started this post to get Portland Crime Data mapped to the neighborhoods, which is what the Portland Police Bureau uses, but it wasn't as straightforward as I thought it'd be, so I'm pulling it out here to see if I can keep it simpler for future reference.

Imports

# python
from functools import partial
from pathlib import Path

# pypi
import altair
import hvplot.pandas
import geopandas
import geoviews
import yaml

# my stuff
from graeae import EmbedHoloviews

Set Up

Plotting

SLUG = "geopandas-and-portland-neighborhoods"
OUTPUT = Path(f"files/posts/{SLUG}")
Embed = partial(EmbedHoloviews, folder_path=OUTPUT)

Configurations

This holds where the data files are (or their URLs).

with open("configurations/data.yaml") as reader:
    configuration = yaml.safe_load(reader)["portland"]["neighborhoods"]

The Neighborhoods

Now we can load the data.

path = Path(configuration["folder"]).expanduser()
geojson_file = path/configuration["geojson"]
if not path.is_dir():
    path.mkdir()

if geojson_file.is_file():
    neighborhoods = geopandas.read_file(geojson_file)
else:
    neighborhoods = geopandas.read_file(configuration["geojson-url"])
    neighborhoods.to_file(geojson_file, driver="GeoJSON")

Since I'm going to use the names of the neighborhoods I'll rename the column that holds them to something clearer (to me).

neighborhoods = neighborhoods.rename(columns=dict(MAPLABEL="Neighborhood"))

Plot

Now we'll do the actual plot.

plot = neighborhoods.hvplot(hover_cols=["Neighborhood"], legend=False).opts(
    width=800,
    height=600,
    title="Portland Neighborhoods",
    tools=["hover"])

output = Embed(plot=plot, file_name="neighborhood_plot")()

Figure Missing

Try Altair

This was taken from this Stack Overflow post.

# an alternative data source: altair can also read the GeoJSON features
# directly from the URL (not used below, where the GeoDataFrame is passed in)
data = altair.Data(url=configuration["geojson-url"],
                   format=altair.DataFormat(property="features", type="json"))

chart = altair.Chart(neighborhoods).mark_geoshape().encode(
    color="Neighborhood:N"
 ).project(type="identity", reflectY=True).properties(
     width=800,
     height=500,
     title="Portland Neighborhoods (Altair)",
 )

chart.save(str(OUTPUT/"portland_neighborhoods_altair.html"))

Figure Missing

Looking at this bug report for altair, it appears that vega-lite doesn't support an interactive map so it seems like it'd be better to stick with GeoPandas for now.

Some Stuff To Check Out

I ran into these while looking for the neighborhood data, maybe I'll check them out later.

Portland Open Data

To get the boundary information, click on the Boundaries Icon and find the dataset you want in the list that comes up. Once you're on the map, to get the dataset you need to click on a little bar at the bottom-left that says "I want to use this" which will change the user-interface. Click on the "View API Resources" tab and it will bring up two URLs, one of which is a GeoJSON link.

  • Portland Open Data: has the links to different map data.
  • Portland Equity Index: Has census population, income, and ethnicity information. See the configurations.yaml file for some of the URLs I picked out.

Other Interesting Sites

Looking at Portland Crime Data

Beginning

Imports

# python
from functools import partial
from pathlib import Path

# pypi
from tabulate import tabulate

import altair
import geopandas
import hvplot.pandas
import pandas
import yaml

# my stuff
from graeae import EmbedHoloviews

Set Up

The Plot Path

SLUG = "looking-at-portland-crime-data"
PATH = Path(f"files/posts/{SLUG}")
if not PATH.is_dir():
    PATH.mkdir()

The Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

def print_table(table):
    print(TABLE(table))
    return

The Configuration

with open("configurations/data.yaml") as reader:
    configuration = yaml.safe_load(reader)

portland = configuration["portland"]

Set Up

COUNT_COLUMN = {"CaseNumber": "Count"}

The Metadata

metadata_path = Path(portland["crime-metadata"]).expanduser()

if not metadata_path.is_file():
    table = pandas.read_html(portland["crime-metadata-url"], skiprows=1)[0]
    table.columns = "Column Description".split()
    table.loc[:, "Description"] = table.Description.str.strip()
    table.to_csv(metadata_path, sep="\t", index=False)
else:
    table = pandas.read_csv(metadata_path, sep="\t")
print_table(table)
| Column | Description |
|--------+-------------|
| Case Number | The case year and number for the reported incident (YY-######). Sensitive cases have been randomly assigned a case number and are denoted by an X following the case year (YY-X######). |
| Occur Month Year | The Month and Year that the incident occured. |
| Occur Date | Date the incident occurred. The exact occur date is sometimes unknown. In most situations, the first possible date the crime could have occurred is used as the occur date. (For example, victims return home from a week-long vacation to find their home burglarized. The burglary could have occurred at any point during the week. The first date of their vacation would be listed as the occur date.) |
| Occur Time | Time the incident occured. The exact occur time is sometimes unknown. In most situations, the first possible time the crime could have occured is used as the occur time. The time is reported in the 24-hour clock format, with the first two digits representing hour (ranges from 00 to 23) and the second two digits representing minutes (ranges from 00 to 59). Note: By default, Microsoft Excel removes leading zeroes when importing data. For more help with this issue, refer to Microsoft's help page. |
| Address | Address of reported incident at the 100 block level (e.g.: 1111 SW 2nd Ave would be 1100 Block SW 2nd Ave). To protect the identity of victims and other privacy concerns, the address location of certain case types are not released. |
| Open Data X / Y | Generalized XY point of the reported incident. For offenses that occurred at a specific address, the point is mapped to the block's midpoint. Offenses that occurred at an intersection is mapped to the intersection centroid. To protect the identity of victims and other privacy concerns, the points of certain case types are not released. XY points use the Oregon State Plane North (3601), NAD83 HARN, US International Feet coordinate system. |
| Open Data Lat / Lon | Generalized Latitude / Longitude of the reported incident. For offenses that occurred at a specific address, the point is mapped to the block's midpoint. Offenses that occurred at an intersection is mapped to the intersection centroid. To protect the identity of victims and other privacy concerns, the points of certain case types are not released. |
| Neighborhood | Neighborhood where incident occurred. If the neighborhood name is missing, the incident occurred outside of the boundaries of the Portland neighborhoods or at a location that could not be assigned to a specific address in the system (e.g., Portland, near Washington Park, on the streetcar, etc.). Note: Neighborhood boundaries and designations vary slightly from those found on the Office of Community & Civic Life website. |
| Crime Against | Crime against category (Person, Property, or Society) |
| Offense Category | Category of offense (for example, Assault Offenses) |
| Offense Type | Type of offense (for example, Aggravated Assault). Note: The statistic for Homicide Offenses has been updated in the Group A Crimes report to align with the 2019 FBI NIBRS definitions. The statistic for Homicide Offenses includes (09A) Murder & Non-negligent Manslaughter and (09B) Negligent Manslaughter. As of January 1, 2019, the FBI expanded the definition of negligent manslaughter to include traffic fatalities that result in an arrest for driving under the influence, distracted driving, or reckless driving. The change in definition impacts the 2019 homicide offenses statistic and the comparability of 2019 homicide statistics to prior year. |
| Offense Count | Number of offenses per incident. Offenses (i.e. this field) are summed for counting purposes. |

The Data

data_path = Path(portland["crime"]).expanduser()
data = pandas.concat(
    pandas.read_csv(file_path)
    for file_path in data_path.glob(portland["crime-glob"]))
print_table(data.head())
| Address | CaseNumber | CrimeAgainst | Neighborhood | OccurDate | OccurTime | OffenseCategory | OffenseType | OpenDataLat | OpenDataLon | ReportDate | OffenseCount | OpenDataX | OpenDataY |
|---------+-------------+--------------+---------------------+------------+-----------+------------------+--------------------+-------------+-------------+------------+--------------+-----------+-----------|
| nan | 21-X5543818 | Person | Concordia | 12/31/2020 | 1230 | Assault Offenses | Aggravated Assault | nan | nan | 1/1/2021 | 1 | nan | nan |
| nan | 21-X5543818 | Property | Concordia | 12/31/2020 | 1230 | Larceny Offenses | All Other Larceny | nan | nan | 1/1/2021 | 1 | nan | nan |
| nan | 21-X5543827 | Person | Pearl | 1/1/2021 | 715 | Assault Offenses | Simple Assault | nan | nan | 1/1/2021 | 1 | nan | nan |
| nan | 21-X5543859 | Person | Centennial | 1/1/2021 | 2013 | Assault Offenses | Aggravated Assault | nan | nan | 1/1/2021 | 1 | nan | nan |
| nan | 21-X5543864 | Person | Powellhurst-Gilbert | 1/1/2021 | 816 | Assault Offenses | Simple Assault | nan | nan | 1/1/2021 | 1 | nan | nan |
print(f"{len(data):,}")
390,824

Dates and Times

Although these aren't the first columns, they might be useful when looking at the other columns, so I thought I'd parse them first.

Times

Our first problem is that there are a bunch of times labeled "0", which doesn't match the format of the rest of the times. This might represent missing data, unknown data, or 0 o'clock (midnight). The documentation doesn't really say, although it does say that when the time isn't known the earliest possible time is used, which would seem to be midnight, so that's what I'll assume it is.

TIME = "%H%M"
data.loc[:, "OccurTime"] = data.OccurTime.apply(lambda time: "0000" if time==0 else f"{time:04}")
data["Occured At"] = pandas.to_datetime(data.OccurTime, format=TIME).dt.hour
crime_time = data.groupby(["Occured At"]).count().reset_index()
crime_time = crime_time.rename(columns=COUNT_COLUMN)
chart = altair.Chart(crime_time[["Occured At", "Count"]]).mark_bar().encode(
    x="Occured At",
    y="Count",
    tooltip=["Occured At", "Count"]).properties(
        title="Time of Crime",
        width=800,
        height=600,).interactive()

chart.save(str(PATH/"crime_time.html"))

Figure Missing

Since the times are mostly self-reported I'd assume that the spikes at midnight and noon are just convenient pegs for guessing the time.

DATE = "%m/%d/%Y"

data["Occurred"] = pandas.to_datetime(data.OccurDate, format=DATE)
data["Reported"] = pandas.to_datetime(data.ReportDate, format=DATE)

data["Year Occured"] = data.Occurred.dt.year
data["Year Reported"] = data.Reported.dt.year
crime_year = data.groupby(["Year Reported"]).count().reset_index()
crime_year = crime_year.rename(columns=COUNT_COLUMN)
chart = altair.Chart(crime_year[["Year Reported", "Count"]]).mark_bar().encode(
    x="Year Reported",
    y="Count",
    tooltip=["Year Reported", "Count"]).properties(
        title="Crime by Year",
        width=800,
        height=600,).interactive()

chart.save(str(PATH/"crime_year.html"))

Figure Missing

Address

Not all addresses are given; some are omitted if the case is considered sensitive. I'll look at those in the next section on "Case Number", but since we're here, let's look at which types of crime, based on who the victim is, are considered too sensitive to report the address.

The method to force a side-by-side bar plot came from this github bug report.

data.loc[:, "Addressed"] = data.Address.isnull()
data.loc[data["Addressed"], "Addressed"] = "Missing"
data.loc[data["Addressed"] != "Missing", "Addressed"] = "Has Address"
counts = data.groupby(["CrimeAgainst", "Addressed"]).count().reset_index()

counts = counts.rename(columns=COUNT_COLUMN)
counts = counts[["CrimeAgainst", "Addressed", "Count"]]
addressed_chart = altair.Chart(counts)

crime_against = addressed_chart.mark_bar().encode(
    column=altair.Column("CrimeAgainst",
                         spacing=5,
                         header=altair.Header(labelOrient="bottom")),
    x=altair.X("Addressed:N", sort=["Missing", "Has Address"], axis=None),
    y="Count",
    color="Addressed:N",
    tooltip=["CrimeAgainst", "Count"]).properties(
        title="Victim of a Crime with an Address",
        width=275,
        height=600,).interactive()

crime_against.save(str(PATH/"addressed_crime_against.html"))

Figure Missing

Case Number

This is a unique identifier for each case. It doesn't seem like this would be interesting unless you wanted to look up a specific incident, but one possibly useful bit is that the case numbers are given a prefix of "X" if they are sensitive (by case number I mean just the number section - i.e. <year>-X<case number>), which might be useful in figuring out the reason for missing location data.

data["CaseNumber"] = data.CaseNumber.astype("string")
data["Sensitive"] = data.CaseNumber.str.contains("X")
grouped = data.groupby(["Sensitive", "Year Reported"]).count().reset_index()
grouped = grouped.rename(columns=COUNT_COLUMN)
chart = altair.Chart(grouped[["Count", "Sensitive", "Year Reported"]]).mark_bar().encode(
    color="Sensitive:O", y="Count", x="Year Reported:O",
    tooltip=["Sensitive", "Count", "Year Reported"]).properties(
    title="Count of Sensitive Cases",
    width=800,
    height=600
).interactive()

chart.save(str(PATH/"sensitive_case_count.html"))

Figure Missing

By Category

grouped = data.groupby(["CrimeAgainst", "OffenseCategory"]).count().reset_index()
grouped = grouped.rename(columns=COUNT_COLUMN)

chart = altair.Chart(grouped)

categories = chart.mark_bar().encode(
    x="CrimeAgainst:O", y="Count", color="OffenseCategory",
    tooltip=["CrimeAgainst", "Count", "OffenseCategory"]).properties(
        title="Crime Category by Victim Type",
        width=800,
        height=600,).interactive()

categories.save(str(PATH/"category_crime_against.html"))

Figure Missing

Offense Type

columns = ["CrimeAgainst", "OffenseCategory", "OffenseType"]
grouped = data.groupby(columns).count().reset_index()
grouped = grouped.rename(columns={"CaseNumber": "Count"})
chart = altair.Chart(grouped[columns + ["Count"]])

categories = chart.mark_bar().encode(
    x="OffenseCategory:O", y="Count", color="OffenseType",
    tooltip=["CrimeAgainst", "Count", "OffenseCategory", "OffenseType"]).properties(
        title="Crime Type by Offense Category",
        width=800,
        height=600,).interactive()

categories.save(str(PATH/"category_type_crime_against.html"))

Figure Missing

Neighborhood

https://www.usgs.gov/faqs/what-state-plane-coordinate-system-can-gps-provide-coordinates-these-values

columns = ["CrimeAgainst", "Neighborhood", "OffenseType"]
grouped = data.groupby(columns).count().reset_index()
grouped = grouped.rename(columns={"CaseNumber": "Count"})
chart = altair.Chart(grouped[columns + ["Count"]])

categories = chart.mark_bar().encode(
    x="Neighborhood:O", y="Count", color="OffenseType",
    tooltip=["Neighborhood",
             "CrimeAgainst",
             "Count",
             "OffenseType"]).properties(
        title="Crime Type by Neighborhood",
        width=800,
        height=600,).interactive()

categories.save(str(PATH/"neighborhood_type_crime_against.html"))

Figure Missing

GEOJSON_URL = "https://opendata.arcgis.com/datasets/1ef75e34b8504ab9b14bef0c26cade2c_3.geojson"
neighborhoods = geopandas.read_file(GEOJSON_URL)

background = altair.Chart(neighborhoods).mark_geoshape().encode(
    color="MAPLABEL").properties(width=800, height=600).transform_aggregate(groupby=["type", "geometry"])

background.save(str(PATH/"neighborhood_crime.html"))

Figure Missing

By Date

data.loc[:, "when"] = pandas.to_datetime(data.OccurDate)
print(data.when.min())
1971-01-01 00:00:00

What?

print(data[data.when==data.when.min()])
     Address   CaseNumber CrimeAgainst Neighborhood OccurDate  OccurTime  \
3770     NaN  20-X5515843       Person    Concordia  1/1/1971          0   

     OffenseCategory OffenseType  OpenDataLat  OpenDataLon  ReportDate  \
3770    Sex Offenses    Fondling          NaN          NaN  10/14/2020   

      OffenseCount  OpenDataX  OpenDataY       when  
3770             2        NaN        NaN 1971-01-01  

So, there might be some mistakes in there… or maybe some people wait a long time to report a crime?
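
One way to get a feel for that (a quick sketch using the Occurred and Reported columns created earlier) is to look at the gap, in days, between when a crime occurred and when it was reported.

# distribution of the delay between occurrence and report
delay = (data.Reported - data.Occurred).dt.days
print(delay.describe())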

By Year

data.loc[:, "year"] = data.when.apply(lambda date: date.year)
columns = ["year", "OffenseType"]
grouped = data.groupby(columns).count().reset_index()
grouped = grouped.rename(columns={"CaseNumber": "Count"})

chart = altair.Chart(grouped[columns + ["Count"]])

categories = chart.mark_bar().encode(
    x="year:N", y="Count", color="OffenseType",
    tooltip=["year",
             "Count",
             "OffenseType"]).properties(
        title="Crime Type by Year",
        width=800,
        height=600,).interactive()

categories.save(str(PATH/"year_type.html"))

Figure Missing

# group by month using the parsed datetime column ("when") rather than the
# raw OccurDate strings, which pandas.Grouper can't resample
monthly = data.groupby(pandas.Grouper(key="when", freq="M")).count()

It's only supposed to go back to 2015, what's with the older rows?

data.loc[:, "reported"] = pandas.to_datetime(data.ReportDate)
data.loc[:, "report_year"] = data.reported.apply(lambda row: row.year)

older = data[data.year< 2015]

columns = ["report_year", "year", "OffenseType"]
grouped = older.groupby(columns).count().reset_index()
grouped = grouped.rename(columns={"CaseNumber": "Count"})

chart = altair.Chart(grouped[columns + ["Count"]])
categories = chart.mark_bar().encode(
    x="report_year", y="Count", color="OffenseType",
    tooltip=["year", "report_year",
             "Count",
             "OffenseType"]).properties(
        title="Crime Type by Year/Reporting",
        width=800,
        height=600,).interactive()

categories.save(str(PATH/"year_reported_type.html"))

Figure Missing

Sources

Altair: Testing, Testing

UPDATE: I thought that the plotting wasn't working because I couldn't get the tooltips to show up, but I installed the Brave Browser, loaded the plots, and the tooltips showed up. There's something up with my Firefox setup that's killing the interactivity. Firefox!

UPDATE 2: After I did a refresh on Firefox the tooltips work. But now all my extensions and settings are gone. Was getting altair working worth it?

Update 3: After wiping Firefox and starting over again I found out that enabling the privacy.resistFingerprinting option is what is breaking altair's interactivity. Strange.

What This Is

This is a (partial) replication of the Basic Statistical Visualization section of the altair documentation - mostly to see if it works.

Set Up

Imports

# pypi
from tabulate import tabulate
from vega_datasets import data

import altair
import pandas

The Data

To keep it simple we're going to set up a very small pandas DataFrame as our data.

fake_data = pandas.DataFrame({"a": list("CCCDDDEEE"),
                              "b": [2, 7, 4, 1, 2, 6, 8, 4, 7]})

print(tabulate(fake_data, tablefmt="orgtbl", headers="keys", showindex=False))
| a | b |
|---+---|
| C | 2 |
| C | 7 |
| C | 4 |
| D | 1 |
| D | 2 |
| D | 6 |
| E | 8 |
| E | 4 |
| E | 7 |

Exciting.

A Bar Plot

To plot with altair you pass the dataframe to the Chart constructor and call some chained methods. For a bar plot you call the chart's mark_bar method, and on the object it returns you call the encode method, passing in the information about what to plot from the data. It uses a slightly funky string argument to tell altair to apply a function (we're going to use average). That's only a shortcut, though; you can spell the encoding out explicitly as well (there's a sketch of the explicit form after the plot). But that's getting ahead of things. Let's just make the plot.

bar_chart = altair.Chart(fake_data).mark_bar(color="firebrick").encode(
    y="a",
    x="average(b)",
    tooltip=["a", "average(b)"],
).properties(
    width=800,
    height=600,
).interactive()

Note: Setting y to the categorical value will make altair rotate the plot. Also the default plot is tiny so I went back and set the size to something bigger.
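
As an aside, here's a sketch of the same encoding written out with the encoding classes instead of the shorthand strings (same data and aggregate as above, just more explicit):

bar_chart_explicit = altair.Chart(fake_data).mark_bar(color="firebrick").encode(
    y=altair.Y("a", type="nominal"),
    x=altair.X("b", aggregate="average", type="quantitative"),
).properties(
    width=800,
    height=600,
)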

Save the Plot

SLUG = "altair-testing-testing"
PATH = f"files/posts/{SLUG}/"
bar_chart.save(f"{PATH}test-bar-plot.html")

Figure Missing

A Scatter Plot

source = data.cars()

chart = altair.Chart(source).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).properties(
    width=800,
    height=600,
).interactive()

chart.save(f"{PATH}cars_scatter.html")

Figure Missing

The End

Well, I'll leave it there and say it works, but I'm not sure I'll be switching from HoloViews just yet. I couldn't figure out how to get the tooltips working (which is kind of the point of not doing this as a static figure) and trying to navigate the documentation didn't leave me with a good feeling about trying to find stuff on the site. The examples I saw looked promising, though, so I'll keep it in mind and maybe play around with it more later.

Bureau of Labor Statistics Employment Data (March 2021)

Beginning

Imports

# python
from collections import namedtuple
from datetime import datetime
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import be_true, expect
from tabulate import tabulate

import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews

Some Setup

A Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

The Environment

env_path = Path("~/.env").expanduser()
expect(env_path.is_file()).to(be_true)
load_dotenv(env_path, override=True)

Plotting

SLUG = "bureau-of-labor-statistics-employment-data-march-2021"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")
Plot = namedtuple("Plot", ["width", "height", "fontscale",
                           "tan", "blue",
                           "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Middle

The Data

The data comes from the U.S. Bureau of Labor Statistics Current Employment Statistics (CES) page.

Employment

The first set is the seasonally adjusted employment for all employees in the United States.

employment_path = Path(os.environ["SEASONALLY_ADJUSTED_EMPLOYMENT"]).expanduser()
expect(employment_path.is_file()).to(be_true)
employment = pandas.read_csv(employment_path, delim_whitespace=True)
rows, columns = employment.shape
print(f"Rows: {rows:,} Columns: {columns}")
Rows: 360,008 Columns: 5
print(TABLE(employment.iloc[0: 2]))
| series_id     | year | period | value | footnote_codes |
|---------------+------+--------+-------+----------------|
| CES0000000001 | 1939 | M01    | 29923 | nan            |
| CES0000000001 | 1939 | M02    | 30100 | nan            |
print(TABLE(employment.iloc[-2:]))
| series_id     | year | period | value | footnote_codes |
|---------------+------+--------+-------+----------------|
| CES9093299901 | 2021 | M01    | 943.3 | nan            |
| CES9093299901 | 2021 | M02    | 945.3 | P              |
print(employment.footnote_codes.value_counts())
P    1047
Name: footnote_codes, dtype: int64

Our data set runs from January 1939 through February 2021. Note, though, that the first and last rows shown above come from different series (the series_id values differ), so their values aren't directly comparable. The "series" referred to in the series_id column is a time series. The period is the month and the "P" footnote code means preliminary (which seems odd given that there are 1,047 of them).

They provide some files to map the more obscure values to names so I guess I'll take advantage of that.

notes_path = Path(os.environ["CES_FOOTNOTES"]).expanduser()
expect(notes_path.is_file()).to(be_true)
with notes_path.open() as reader:
    # get rid of the header
    reader.readline()

    # get the map
    CODE, TEXT = 0, 1
    lines = (line.split() for line in reader)
    notes_map = {line[CODE].strip(): line[TEXT].strip() for line in lines}

notes_map[numpy.nan] = "none"
employment["notes"] = employment.footnote_codes.map(notes_map)
print(TABLE(employment.iloc[-2:]))
| series_id     | year | period | value | footnote_codes | notes       |
|---------------+------+--------+-------+----------------+-------------|
| CES9093299901 | 2021 | M01    | 943.3 | nan            | none        |
| CES9093299901 | 2021 | M02    | 945.3 | P              | preliminary |
del(employment["footnote_codes"])
period_path = Path(os.environ["CES_PERIODS"]).expanduser()
expect(period_path.is_file()).to(be_true)
with period_path.open() as reader:
    reader.readline()
    PERIOD, SHORT_MONTH, LONG_MONTH = 0, 1, 2
    lines = (line.split() for line in reader)
    period_map = {line[PERIOD].strip(): line[LONG_MONTH].strip() for line in lines}
employment["month"] = employment.period.map(period_map)
print(TABLE(employment[-2:]))
| series_id     | year | period | value | notes       | month    |
|---------------+------+--------+-------+-------------+----------|
| CES9093299901 | 2021 | M01    | 943.3 | none        | January  |
| CES9093299901 | 2021 | M02    | 945.3 | preliminary | February |

Actually, now that I think about it, I want a datetime object, not a string, anyway.

month_map = {period: index + 1 for index, period in enumerate(sorted(period_map))}
print(month_map)
{'M01': 1, 'M02': 2, 'M03': 3, 'M04': 4, 'M05': 5, 'M06': 6, 'M07': 7, 'M08': 8, 'M09': 9, 'M10': 10, 'M11': 11, 'M12': 12, 'M13': 13}

What's M13?

According to the data element dictionary it means "Annual Average", but this particular data set doesn't have it.

print(employment.period.unique())
['M01' 'M02' 'M03' 'M04' 'M05' 'M06' 'M07' 'M08' 'M09' 'M10' 'M11' 'M12']
APPLY_TO_ROWS = "columns"

employment["date"] = employment.apply(
    lambda row: datetime(row.year, month_map[row.period], 1), axis=APPLY_TO_ROWS)
print(TABLE(employment.iloc[-2:]))
| series_id     | year | period | value | notes       | month    | date                |
|---------------+------+--------+-------+-------------+----------+---------------------|
| CES9093299901 | 2021 | M01    | 943.3 | none        | January  | 2021-01-01 00:00:00 |
| CES9093299901 | 2021 | M02    | 945.3 | preliminary | February | 2021-02-01 00:00:00 |

There's a lot of data there, maybe we can start by aggregating by year.

employment = employment.set_index("date")
yearly = employment.resample("AS")
sums = yearly.sum()
print(TABLE(sums.iloc[-2:], showindex=True))
| date                | year        | value       |
|---------------------+-------------+-------------|
| 2020-01-01 00:00:00 | 2.1113e+07  | 1.37505e+07 |
| 2021-01-01 00:00:00 | 3.87628e+06 | 3.28994e+06 |

So the year column becomes nonsense when it's summed, and since I'm using the seasonally adjusted series the summed values might not make sense either, but this is just an exploration so let's see what we have.
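
If the nonsense year sums bothered us, a small tweak (just a sketch) would be to select only the value column before aggregating:

yearly_values = employment.resample("AS")["value"].sum()
print(yearly_values.tail(2))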

limit = datetime(2021, 1, 1)
sums = sums[sums.index < limit]
plot = sums.hvplot(y="value", color=PLOT.blue).opts(
    width=PLOT.width,
    height=PLOT.height,
    fontscale=PLOT.fontscale,
    title="Seasonally Adjusted Yearly Employment Totals")
output = Embed(plot=plot, file_name="yearly_employment_sums")()
print(output)

Figure Missing

Earnings

path = Path(os.environ["BLS_EARNINGS"]).expanduser()
expect(path.is_file()).to(be_true)
earnings = pandas.read_csv(path, delim_whitespace=True)

End

  • Triggered by listening to a "The Indicator" episode from December 8, 2017: Where's My Raise?

Geopandas Revisited

Beginning

This is a look at the U.S. Census zip-code mapping data.

Dependencies

Besides geopandas this also requires geoviews, which in turn requires cartopy, but if you use pypi you have to install proj first. You also need the python header files (so with debian it's sudo apt install python3-dev; note that python-dev will install the python-2 headers for some reason)… Oy.

Take two.

sudo apt install python3-dev
sudo apt install proj-bin
sudo apt install libgeos-dev
sudo apt install libproj-dev

I'm not 100% sure about proj-bin, but, the rest are needed to build cartopy.

Note: I tested it without proj-bin and the cartopy installation fails. Also, as of December 29, 2021 cartopy requires a newer version of proj than is in the stable releases for ubuntu and debian.

pip install cartopy
pip install geopandas
pip install geoviews

The apt stuff has to come before the pip installs, but the actual ordering within each set doesn't matter. You can install geopandas by itself; it will just complain that geoviews isn't installed.

Another Note: Geoviews will install cartopy so you don't need to install it separately (and maybe you shouldn't since this will keep the versions matched).

Additionally, because the plot file is so huge I'm using datashader (which I don't recall explicitly installing; I think holoviews pulled it in), so we'll also need Spatial Pandas. Hokey smokes.

Imports

# python
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import equal, expect
from tabulate import tabulate

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas
import requests
import seaborn

# my stuff
from graeae import EmbedHoloviews

Set Up

load_dotenv("posts/.env")
SLUG = "geopandas-revisited"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Loading the Data

data_path = Path(os.environ["US_ZIP_CODE_SHAPEFILE"]).expanduser()
frame = geopandas.read_file(data_path)

First, let's find the Coordinate Reference System being used.

print(frame.crs)
epsg:4269

According to map tiler it's a North America specific reference system. I don't know if that's good or bad so let's try it as is first.
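
If we did need to change it, geopandas can reproject with to_crs (a sketch, not something this post ends up needing), for example to plain longitude/latitude:

# reproject from EPSG:4269 (NAD83) to EPSG:4326 (WGS84 longitude/latitude)
reprojected = frame.to_crs(epsg=4326)
print(reprojected.crs)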

print(frame.head())
  ZCTA5CE10 GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10    ALAND10  AWATER10  \
0     43451   43451        B5   G6350          S   63484189    157689   
1     43452   43452        B5   G6350          S  121525288  13721740   
2     43456   43456        B5   G6350          S    9320978   1003775   
3     43457   43457        B5   G6350          S   48004678         0   
4     43458   43458        B5   G6350          S    2573816     39915   

    INTPTLAT10    INTPTLON10  \
0  +41.3183193  -083.6150363   
1  +41.5202297  -082.9770334   
2  +41.6318300  -082.8393923   
3  +41.2673301  -083.4274872   
4  +41.5304461  -083.2133648   

                                            geometry  
0  POLYGON ((-83.70873 41.32733, -83.70815 41.327...  
1  POLYGON ((-83.08698 41.53780, -83.08256 41.537...  
2  MULTIPOLYGON (((-82.83558 41.71082, -82.83515 ...  
3  POLYGON ((-83.49650 41.25371, -83.48382 41.253...  
4  POLYGON ((-83.22229 41.53102, -83.22228 41.532...  
ZIP_COLUMN = frame.columns[0]

Unfortunately this makes a huge plot so I'll have to use datashade with it (which isn't necessarily a bad thing, it just isn't what I prefer).

plot = frame.hvplot(hover_cols=["ZCTA5CE10"], legend=False, datashade=True).opts(
    title="United States Census Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)

outcome = Embed(plot=plot, file_name="us_zip_codes")()

Well, after all that there seems to be a bug in there somewhere. Datashader is doing something with numpy that raises an exception.

/usr/local/lib/python3.8/dist-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    386     """
    387     if not issubclass_(arg1, generic):
--> 388         arg1 = dtype(arg1).type
    389     if not issubclass_(arg2, generic):
    390         arg2 = dtype(arg2).type

TypeError: Cannot interpret 'MultiPolygonDtype(float64)' as a data type
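
One thing that might be worth trying (a sketch only, I haven't verified it against this particular error) is converting the frame to a spatialpandas GeoDataFrame, since spatialpandas geometries are the format datashader is built to handle, and then repeating the hvplot call with that frame.

# hypothetical workaround: convert the geopandas frame to spatialpandas
import spatialpandas

spatial_frame = spatialpandas.GeoDataFrame(frame)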

Maybe Just Portland

There's a site called zipdatamaps that has listings of zip codes (among other things) which I'll use to get the zip codes for Portland, Oregon. I'm going to use pandas' read_html function which also requires you to install lxml.

According to the pandas documentation you can't use https, but that seems to give me a 403 (Forbidden) error so I'll pull the HTML with requests first instead of having pandas pull it directly. The table also has a title above the column headers so we have to skip the first row to avoid a MultiIndex (or fix it later).

URL = "https://www.zipdatamaps.com/zipcodes-portland-or"
response = requests.get(URL)
tables = pandas.read_html(response.text, skiprows=1)

read_html returns a list so I'll pull out the first frame and do a little clean up (dropping the empty rows with dropna).

zips = tables[0]
zips = zips.dropna()
print(zips.head())
  ZIP Code ZIP Code Name Population        Type
     97034   Lake Oswego      18905  Non-Unique
0  97035.0   Lake Oswego    23912.0  Non-Unique
1  97080.0       Gresham    40888.0  Non-Unique
2  97086.0  Happy Valley    26010.0  Non-Unique
3  97201.0      Portland    15484.0  Non-Unique
4  97202.0      Portland    38762.0  Non-Unique

So we still have a problem in that it used the first zip-code as part of the header… I'll just pull the row out and add it back in. One thing to note is that the header values are all strings so to be able to append the row we'll have to do some conversion.

columns = zips.columns.map(lambda column: column[0])
first_row = list(zips.columns.map(lambda column: column[1]))
POPULATION_COLUMN = 2
first_row[POPULATION_COLUMN] = int(first_row[POPULATION_COLUMN])

zips.columns = columns

zips.loc[:, "ZIP Code"] = zips["ZIP Code"].astype(int).astype(str)
zips.loc[:, "Population"] = zips["Population"].astype(int)
zips = zips.append(pandas.DataFrame([first_row], columns=columns), ignore_index=True)
print(zips.head())
  ZIP Code ZIP Code Name  Population        Type
0    97035   Lake Oswego       23912  Non-Unique
1    97080       Gresham       40888  Non-Unique
2    97086  Happy Valley       26010  Non-Unique
3    97201      Portland       15484  Non-Unique
4    97202      Portland       38762  Non-Unique

I converted the zip codes into strings instead of integers because there are zip-codes with leading zeros, although not in Portland so I guess it could go either way.
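
A tiny illustration of why the leading zeros matter (a sketch with a made-up New England zip code, nothing from this dataset): going through an integer silently drops the zero, and zfill puts it back.

# "02134" loses its leading zero as an integer
as_integer = int("02134")
print(as_integer)                # 2134
print(str(as_integer).zfill(5))  # "02134" again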

Now we'll pare down the original data set.

ZIPS_COLUMN = "ZIP Code"
expression = "|".join(zips[ZIPS_COLUMN])
sub = frame[frame[ZIP_COLUMN].str.contains(expression, regex=True)]
print(sub.shape)
expect(len(sub)).to(equal(len(zips)))
(35, 10)

Once again, with feeling. I'll add the population too, for no real good reason, by merging in the zips frame.

sub = sub.rename(columns={ZIP_COLUMN: ZIPS_COLUMN})
plotter = pandas.merge(sub, zips, on=ZIPS_COLUMN, how="left")
expect(len(plotter)).to(equal(len(zips)))

I don't know why, but when setting the hover columns for the plot, if you put the zip-code column first it doesn't show up, though it does if you put it second. Mysterious.

plot = plotter.hvplot(hover_cols=["Population", ZIPS_COLUMN], legend=False).opts(
    title="Portland by Zip Code",
    width=700,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None,
    colorbar=False,
)    
outcome = Embed(plot=plot, file_name="portland_zip_codes")()
print(outcome)

Figure Missing

What About All of Oregon

URL = "https://www.zip-codes.com/state/or.asp"
response = requests.get(URL)
tables = pandas.read_html(response.text)
table = tables[2]
table.columns = table.iloc[0]
table = table.drop(0)
print(table.head(2))
0        ZIP Code      City  County      Type
1  ZIP Code 97001  Antelope   Wasco  P.O. Box
2  ZIP Code 97002    Aurora  Marion  Standard
frame = frame.rename(columns={ZIP_COLUMN: ZIPS_COLUMN})
oregon = pandas.merge(frame, table, on=ZIPS_COLUMN)
print(oregon.iloc[0])
ZIP Code                                                  97824
GEOID10                                                   97824
CLASSFP10                                                    B5
MTFCC10                                                   G6350
FUNCSTAT10                                                    S
ALAND10                                               565896585
AWATER10                                                  20851
INTPTLAT10                                          +45.3543012
INTPTLON10                                         -117.7564700
geometry      POLYGON ((-117.993812 45.369603, -117.993632 4...
City                                                       Cove
County                                                    Union
Type                                                   Standard
Name: 0, dtype: object

This turns out to work, but the file it creates is 29 Megabytes, so maybe not a great idea to use it with holoviews. I'll just do a regular PNG with no annotations.

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set_style("whitegrid", rc={"axes.grid": False})
FIGURE_SIZE = (12, 10)
figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
plot = oregon.plot(ax=axe)
figure.savefig("images/oregon_zip_codes.png")

Oregon Zip Codes

Portland Again

counties = "Clackamas|Multnomah|Washington"
portland = oregon[[ZIPS_COLUMN, "City", "County"]]
portland = portland[portland.County.str.contains(counties, regex=True)]
portland = pandas.merge(portland, frame, on=ZIPS_COLUMN, how="left")
portland = pandas.merge(portland, zips, on=ZIPS_COLUMN, how="left")

portland = geopandas.GeoDataFrame(portland)

plot = portland.hvplot(hover_cols=["Population", "City", "County", ZIPS_COLUMN], legend=False).opts(
    title="Clackamas, Multnomah, and Washington Counties",
    width=800,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None,
    colorbar=False,
)    
outcome = Embed(plot=plot, file_name="portland_with_city")()
print(outcome)

Figure Missing

This one looks a little better. The gray areas are the cities that weren't in the first zip-code set. I guess they only count Portland as Portland, not the Portland Metropolitan area altogether.

It's kind of surprising that the zip code with the highest population is on the West side (the darkest blue area). I guess that's because it encompasses a larger area than the ones further east (Rock Creek, Cedar Mill, and Bethany, according to Google). It's odd, though, that some of the cities that come up (like Molalla, the furthest south) are listed on the Portland Metro list of cities; maybe now there are too many cities.

Okay, I just checked out the metro's maps page and it looks like the metro area does cut through the outer counties instead of just taking them all in. To get just the metro area would take more work.

End

Well, that was a little harder than I thought it would be. The main thing to remember, I suppose, is that the maps quickly grow too big for holoviews, so if you want to do an overview it's better to do it in matplotlib and save the interactivity for a smaller section.

Source

t Confidence Intervals

Imports

# from python
from collections import namedtuple
from math import sqrt

# pypi
from scipy.stats import t

Whale Mercury

Beginning

| n  | \(\overline{x}\) | s   | min | max |
|----+------------------+-----+-----+-----|
| 19 | 4.4              | 2.3 | 1.7 | 9.2 |
WhaleMercury = namedtuple("WhaleMercury",
                          ["n",
                           "mean",
                           "standard_deviation",
                           "minimum",
                           "maximum"])

WHALES = WhaleMercury(n=19,
                      mean=4.4,
                      standard_deviation=2.3,
                      minimum=1.7,
                      maximum=9.2)

Checking the Skew

scalar = 2.5 * WHALES.standard_deviation
print(f"({WHALES.mean - scalar: 0.2f}, {WHALES.mean + scalar})")
(-1.35, 10.15)

Our minimum and maximum are within two and a half standard deviations of the mean, so we'll judge that there isn't a skew.

Middle

\[ \textrm{Confidence Interval} = \overline{x} \pm t^*_{df} SE \]

Standard Error

\[ SE = \frac{s}{\sqrt{n}} \]

standard_error = WHALES.standard_deviation/sqrt(WHALES.n)
print(f"Standard Error: {standard_error: 0.2f}")
Standard Error:  0.53

t Star

To get the \(t^*_{18}\) value we can use the scipy t module, which has a ppf (percent point function) that will get it for us. We want a 95% confidence interval, so the total area left for the tails is 1 - 0.95; since that area is split between the two tails we pass in half of it, \(\frac{1 - 0.95}{2}\), and take the absolute value of the lower-tail quantile that ppf returns.

confidence = 0.95
two_tailed = (1 - confidence)/2
degrees_of_freedom = WHALES.n - 1

t_star_18 = abs(t.ppf(two_tailed, degrees_of_freedom))
print(f"t: {t_star_18:0.3f}")
t: 2.101

Margin Of Error

\[ \textrm{Margin of Error} = t^*_{18} SE \]

margin_of_error = t_star_18 * standard_error
print(f"Margin of Error: {margin_of_error: 0.2f}")
Margin of Error:  1.11

The Confidence Interval

lower = WHALES.mean - margin_of_error
upper = WHALES.mean + margin_of_error

print(f"Confidence Interval = {WHALES.mean} +- {t_star_18: 0.2f} x {standard_error: 0.2f}")
print(f"({lower:0.3f}, {upper: 0.3f})")
Confidence Interval = 4.4 +-  2.10 x  0.53
(3.291,  5.509)
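
As a check (a sketch using scipy's interval method, which should agree with the hand calculation above up to rounding):

# the same 95% interval straight from the t distribution
low, high = t.interval(confidence, degrees_of_freedom,
                       loc=WHALES.mean, scale=standard_error)
print(f"({low:0.3f}, {high:0.3f})")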

End

We can say with 95% confidence that the mean Methyl-Mercury level for Pilot Whale red meat in Taiji is between 3.291 and 5.509 \(\frac{\mu g}{\textrm{wet g}}\) (the FDA suggested limit is \(1.0 \frac{\mu g}{\textrm{wet g}}\)).

There are a couple of things to note. One is that although the original paper is behind a paywall so I can't read it, the abstract is available through Research Gate (and Elsevier), and in it the mean is given as 5.9, not 4.4. In addition, the book that I got the values from (OpenIntro Statistics) had the wrong values (3.87, 4.93) when they calculated the confidence interval, so I don't know how accurate the starting numbers are. Also, if you poke around the web, other researchers have found that there's a lot of variation in the mercury levels reported by different studies.

Sources

  • OpenIntro Statistics
  • Endo T, Haraguchi K. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin. 2010 May 1;60(5):743-7.

Galton, Pearson, and the Peas

Abstract Summary

Most textbooks explain correlation before linear regression, but the publications of Francis Galton and Karl Pearson indicate that Galton's work studying the inheritance of sweet pea characteristics led to his initial concept of linear regression first, and that correlation and multiple regression came from later work by Galton and Pearson. So this paper gives a history of how Galton came up with linear regression, which instructors can use to introduce the topic.

Bullets

  • Galton came up with the concepts of linear regression and correlation (he wasn't the first, but this is about Galton and Pearson's discoveries) but Pearson developed it mathematically so it is commonly known as the Pearson Correlation Coefficient (or some variant of that), leading many students to think that Pearson came up with it (see also Auguste Bravais).
  • Galton's Question: How much influence does a generation have on the characteristics of the next?

The Sweet Peas

  • 1875: Galton gave seven friends 700 sweet pea seeds. The friends planted the seeds, harvested the peas, and returned them to Galton.
  • Galton chose sweet peas because they can self-fertilize so he didn't have to deal with the influence of two parents.
  • Galton found that the median size of daughter seeds for each mother's seed size plotted a straight(ish) line with a positive slope less than one (0.33) - the first regression line
  • Galton: Slope < 1 shows "Regression Towards Mediocrity" - daughter's sizes are closer to the mean than parent's sizes are to the mean
  • If slope had been 1, then parent and child means were the same
  • If slope had been horizontal, then child didn't inherit size from parent
  • Slope between 0 and 1 meant that there was some influence from parent
  • Since he had 700 seeds and no calculator (and maybe for other reasons too) Galton used the median and semi-interquartile range \(\left( \frac{\textrm{Inter Quartile Range}}{2} \right)\) instead of mean and deviation
  • Galton estimated the regression line using the scatterplot, not by calculating the slope

Galton and Correlation

  • If the correlation for the characteristic between the parent and child are the same for different data but the slope is different then the variance is what is causing the difference
  • The more difference there is in the variance, the steeper the slope of the line
  • Believed he had found that correlation between generations was a constant even for different characteristics (not something that is currently believed)
  • Although he was wrong about the correlation being constant, thinking about it led him to develop his ideas about regression and correlation

The equation he was working toward that Pearson eventually came up with was:

\begin{align} y &= mx \\ &= r \left(\frac{S_y}{S_x} \right) x \end{align}

Where r is the correlation between the parent and daughter seed sizes, \(S_y\) is the sample standard deviation of the daughter seed-sizes and \(S_x\) is the sample standard deviation of the parent seed-sizes. There's an intercept term too, but the point here is to show how the spread of the two variables affects the slope. The less variance there is for the daughter seeds, the smaller the slope would be (of course the same is true of the correlation, but Galton thought that was a fixed value anyway).
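
A quick numeric sanity check of that relationship (a sketch with made-up parent and daughter seed sizes, not Galton's actual measurements): the least-squares slope comes out equal to r times the ratio of the standard deviations.

import numpy

# hypothetical parent and daughter seed sizes
parent = numpy.array([15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0])
daughter = numpy.array([15.4, 15.8, 16.2, 16.4, 17.0, 17.0, 17.4])

r = numpy.corrcoef(parent, daughter)[0, 1]
slope_from_r = r * daughter.std(ddof=1) / parent.std(ddof=1)
slope_from_fit = numpy.polyfit(parent, daughter, 1)[0]

# the two slopes match (up to floating point)
print(slope_from_r, slope_from_fit)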

Other Stuff

It also touches on the fact that Galton recognized that prior generations also affected the size of the daughter's seeds, giving rise to the idea of multiple regression. And there is a table of the original measurements that Galton gathered for the seeds.

So, what again?

Galton's Regression Towards Mediocrity - or Regression to the Mean, as it's more commonly referred to nowadays - shows why humans don't split up into giants and little people. A person who is much larger than the mean might have a child that's also larger than the mean, but that child is likely to be closer to the mean than the parent was, just as parents that are smaller than the mean will tend to have children that are larger than they are.

Source

  • Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)