Galton's Sweet Pea Data

Beginning

Imports

# python
from functools import partial
from collections import namedtuple

# pypi
from tabulate import tabulate

import holoviews
import hvplot.pandas
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "galtons-sweet-pea-data"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Plot = namedtuple("Plot", ["width", "height", "fontscale",
                           "tan", "blue",
                           "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
)

Middle

The Data

URL = "http://jse.amstat.org/v9n3/stanton.html#appendixa"

frames = pandas.read_html(URL, header=0)

There are two tables on the page and we want the second one.

data = frames[1]
print(tabulate(data, tablefmt="orgtbl", headers="keys", showindex=False))
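| Diameter of Parent Seed(0.01 inch) | Diameter of Daughter Seed(0.01 inch) | Frequency |
|------------------------------------+--------------------------------------+-----------|
| 21 | 14.67 | 22 |
| 21 | 15.67 | 8 |
| 21 | 16.67 | 10 |
| 21 | 17.67 | 18 |
| 21 | 18.67 | 21 |
| 21 | 19.67 | 13 |
| 21 | 20.67 | 6 |
| 21 | 22.67 | 2 |
| 20 | 14.66 | 23 |
| 20 | 15.66 | 10 |
| 20 | 16.66 | 12 |
| 20 | 17.66 | 17 |
| 20 | 18.66 | 20 |
| 20 | 19.66 | 13 |
| 20 | 20.66 | 3 |
| 20 | 22.66 | 2 |
| 19 | 14.07 | 35 |
| 19 | 15.07 | 16 |
| 19 | 16.07 | 12 |
| 19 | 17.07 | 13 |
| 19 | 18.07 | 11 |
| 19 | 19.07 | 10 |
| 19 | 20.07 | 2 |
| 19 | 22.07 | 1 |
| 18 | 14.35 | 34 |
| 18 | 15.35 | 12 |
| 18 | 16.35 | 13 |
| 18 | 17.35 | 17 |
| 18 | 18.35 | 16 |
| 18 | 19.35 | 6 |
| 18 | 20.35 | 2 |
| 17 | 13.92 | 37 |
| 17 | 14.92 | 16 |
| 17 | 15.92 | 13 |
| 17 | 16.92 | 16 |
| 17 | 17.92 | 13 |
| 17 | 18.92 | 4 |
| 17 | 19.92 | 1 |
| 16 | 14.28 | 34 |
| 16 | 15.28 | 15 |
| 16 | 16.28 | 18 |
| 16 | 17.28 | 16 |
| 16 | 18.28 | 13 |
| 16 | 19.28 | 3 |
| 16 | 20.28 | 1 |
| 15 | 13.77 | 46 |
| 15 | 14.77 | 14 |
| 15 | 15.77 | 9 |
| 15 | 16.77 | 11 |
| 15 | 17.77 | 14 |
| 15 | 18.77 | 4 |
| 15 | 19.77 | 2 |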

Note that the data is somewhat aggregated: the last column is the number of times that each parent/daughter diameter pair occurred in the original data.
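To make the meaning of the frequency column concrete, here is a minimal sketch that takes only the rows for the 21 (0.01 inch) parents from the table and expands them back to one row per seed. The short column names and the variable names (sample, expanded) are my own choices for the illustration, not something from the source page.

```python
import pandas

# The eight rows for parent diameter 21, copied from the table.
sample = pandas.DataFrame({
    "Parent": [21] * 8,
    "Daughter": [14.67, 15.67, 16.67, 17.67, 18.67, 19.67, 20.67, 22.67],
    "Frequency": [22, 8, 10, 18, 21, 13, 6, 2],
})

# Repeat each index label "Frequency" times, then pull out those rows so
# that every observed seed gets a row of its own.
expanded = sample.loc[sample.index.repeat(sample["Frequency"]),
                      ["Parent", "Daughter"]].reset_index(drop=True)

# The mean of the expanded rows should match the frequency-weighted mean
# of the aggregated rows.
weighted_mean = ((sample["Daughter"] * sample["Frequency"]).sum()
                 / sample["Frequency"].sum())

print(len(expanded))
print(expanded["Daughter"].mean(), weighted_mean)
```

The expanded frame has as many rows as the frequencies add up to, so either form can be used for summary statistics, but plots that count rows (like a histogram) need the expanded form or explicit weights.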

data.columns = ["Parent", "Daughter", "Frequency"]
plot = data.hvplot.scatter(x="Parent", y="Daughter", s=data.Frequency,
                           c="Frequency",
                           color=PLOT.red,
                           title="Sweet Pea Diameter (0.01 Inch)").opts(
                               height=PLOT.height,
                               width=PLOT.width,
                               fontscale=PLOT.fontscale,
                           )
embedder = Embed(plot=plot, file_name="scatter_plot")
output = embedder()
print(output)

Figure Missing

rows = []
for row in data.itertuples():
    rows.extend([[row.Parent, row.Daughter] for _ in range(row.Frequency)])

frame = pandas.DataFrame(rows, columns="Parent Daughter".split())
plot = frame.hvplot.hist(stacked=True, title="Diameter Distribution", alpha=0.5).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
    xlabel="Diameter",
    ylabel="Count",
)
output = Embed(plot=plot, file_name="histogram")()
print(output)

Figure Missing

parent_min = frame.Parent.min()
parent_max = frame.Parent.max()
y_1 = frame[frame.Parent==parent_min].Daughter.mean()
y_2 = frame[frame.Parent==parent_max].Daughter.mean()

dots = frame.hvplot.scatter(x="Parent", y="Daughter",
                           color=PLOT.red,
                           title="Sweet Pea Diameter (0.01 Inch)")
line = holoviews.Curve([(parent_min, y_1), (parent_max, y_2)])

plot = (dots * line).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
)

output = Embed(plot=plot, file_name="scatter_plot_with_means")()
print(output)

Figure Missing
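The line in this plot just joins the mean daughter diameter for the smallest parents to the mean for the largest parents. As a rough check on what that implies, this sketch computes the two means and the slope between them straight from the aggregated rows for parents of 15 and 21. The weighted_mean helper and the variable names are hypothetical, made up for this example, and this is the slope between two endpoints, not a least-squares fit.

```python
import pandas

# Rows for the smallest (15) and largest (21) parent diameters, copied
# from the table.
table = pandas.DataFrame({
    "Parent": [15] * 7 + [21] * 8,
    "Daughter": [13.77, 14.77, 15.77, 16.77, 17.77, 18.77, 19.77,
                 14.67, 15.67, 16.67, 17.67, 18.67, 19.67, 20.67, 22.67],
    "Frequency": [46, 14, 9, 11, 14, 4, 2,
                  22, 8, 10, 18, 21, 13, 6, 2],
})


def weighted_mean(group):
    """Mean daughter diameter, weighting each row by its frequency."""
    return ((group["Daughter"] * group["Frequency"]).sum()
            / group["Frequency"].sum())


# One weighted mean per parent diameter.
means = {int(parent): weighted_mean(group)
         for parent, group in table.groupby("Parent")}

# Slope of the straight line joining the two endpoint means.
slope = (means[21] - means[15]) / (21 - 15)

print(means)
print(slope)
```

The slope comes out well below one: the daughters of the largest parents are bigger on average than the daughters of the smallest parents, but by much less than the six-unit difference between the parents themselves.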

End

Source

  • Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). http://jse.amstat.org/v9n3/stanton.html

Zip Code Plotting

Beginning

This is a quick test to see if I can plot some GeoJSON that defines a zip code map for the state of Oregon. I got the file from GitHub, but the README indicates that the original source (see below) was the U.S. Census Bureau. The files are from 2010, so they're a little old, but I don't imagine that zip codes update that often anyway. Maybe as a future exercise I'll try to replicate what was done with the 2019 update.

Imports

# python
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "zip-code-plotting"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Loading the Data

This should be pretty straightforward. The documentation for GeoPandas indicates that it can handle GeoJSON directly.

load_dotenv("posts/.env")
path = Path(os.environ["OREGON_ZIP_CODES"]).expanduser()
geoframe = geopandas.read_file(path)

Let's see what's there.

print(geoframe.head())
  STATEFP10 ZCTA5CE10  GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10    ALAND10  \
0        41     97833  4197833        B5   G6350          S  228152974   
1        41     97840  4197840        B5   G6350          S  295777905   
2        41     97330  4197330        B5   G6350          S  199697439   
3        41     97004  4197004        B5   G6350          S  113398767   
4        41     97023  4197023        B5   G6350          S  330220870   

   AWATER10   INTPTLAT10    INTPTLON10 PARTFLG10  \
0         0  +44.9288886  -118.0148791         N   
1  10777783  +44.8847111  -116.9184395         N   
2    814864  +44.6424890  -123.2562655         N   
3     71994  +45.2549625  -122.4493774         N   
4   2345079  +45.2784758  -122.3231876         N   

                                            geometry  
0  MULTIPOLYGON (((-118.15748 44.99903, -118.1575...  
1  POLYGON ((-116.98971 44.88256, -116.98957 44.8...  
2  POLYGON ((-123.18294 44.64477, -123.18293 44.6...  
3  POLYGON ((-122.48691 45.22227, -122.48713 45.2...  
4  POLYGON ((-122.07580 45.10889, -122.07594 45.1...  
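For anyone who hasn't looked inside a GeoJSON file, this is a rough sketch of how a row like the ones above relates to the file: each feature's properties become ordinary columns and its geometry ends up in the geometry column. It only uses the standard library json module, not GeoPandas. The two property values are taken from the third row printed above, but the coordinates are made up for illustration and the real file has many more properties per feature.

```python
import json

# A hand-written FeatureCollection with a single feature. The coordinates
# are invented for this example and are not the real boundary.
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"STATEFP10": "41", "ZCTA5CE10": "97330"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[
                    [-123.3, 44.6], [-123.2, 44.6], [-123.2, 44.7],
                    [-123.3, 44.7], [-123.3, 44.6],
                ]],
            },
        },
    ],
})

collection = json.loads(geojson_text)

# Flatten each feature into a dict: one key per property, plus the
# geometry type, which is roughly what a row of the table holds.
rows = [
    {**feature["properties"], "geometry_type": feature["geometry"]["type"]}
    for feature in collection["features"]
]

print(rows)
```

Note that the polygon ring repeats its first point at the end, which is how GeoJSON closes a ring.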

This next bit is a little slow.

plot = geoframe.hvplot(hover_cols=["ZCTA5CE10"], legend=False, tools=["hover", "wheel_zoom"]).opts(
    title="Oregon Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)
outcome = Embed(plot=plot, file_name="oregon_zip_codes")()
print(outcome)

Figure Missing

Well, it kind of works, but the map has a lot of weird holes and I can't get the tools to appear so you can zoom in. I also don't really want the zip codes to be interpreted as a heat map. But I guess it's a start.

End

Sources