Galton's Sweet Pea Data

Cloistered Monkey

2021-02-24 13:47

Beginning

Imports

# python
from functools import partial
from collections import namedtuple

# pypi
from tabulate import tabulate

import holoviews
import hvplot.pandas
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "galtons-sweet-pea-data"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Plot = namedtuple("Plot", ["width", "height", "fontscale",
                           "tan", "blue",
                           "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
)

Middle

The Data

URL = "http://jse.amstat.org/v9n3/stanton.html#appendixa"

frames = pandas.read_html(URL, header=0)

There are two tables on the page; we want the second one.

data = frames[1]
print(tabulate(data, tablefmt="orgtbl", headers="keys", showindex=False))

Diameter of Parent Seed(0.01 inch)	Diameter of Daughter Seed(0.01 inch)	Frequency
21	14.67	22
21	15.67	8
21	16.67	10
21	17.67	18
21	18.67	21
21	19.67	13
21	20.67	6
21	22.67	2
20	14.66	23
20	15.66	10
20	16.66	12
20	17.66	17
20	18.66	20
20	19.66	13
20	20.66	3
20	22.66	2
19	14.07	35
19	15.07	16
19	16.07	12
19	17.07	13
19	18.07	11
19	19.07	10
19	20.07	2
19	22.07	1
18	14.35	34
18	15.35	12
18	16.35	13
18	17.35	17
18	18.35	16
18	19.35	6
18	20.35	2
17	13.92	37
17	14.92	16
17	15.92	13
17	16.92	16
17	17.92	13
17	18.92	4
17	19.92	1
16	14.28	34
16	15.28	15
16	16.28	18
16	17.28	16
16	18.28	13
16	19.28	3
16	20.28	1
15	13.77	46
15	14.77	14
15	15.77	9
15	16.77	11
15	17.77	14
15	18.77	4
15	19.77	2
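
Picking the table by index works because `pandas.read_html` returns a list of DataFrames, one per `<table>` it finds, in document order. Here is a minimal sketch of that behavior. The HTML string is a made-up stand-in for the real page, and it assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas

# Hypothetical page with two tables; only the second holds the values we want.
HTML = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>not the data</td></tr>
</table>
<table>
  <tr><th>Parent</th><th>Daughter</th></tr>
  <tr><td>21</td><td>14.67</td></tr>
  <tr><td>21</td><td>15.67</td></tr>
</table>
"""

# read_html parses every <table> and returns them as a list of DataFrames.
tables = pandas.read_html(StringIO(HTML), header=0)

# Index 1 is the second table on the page.
second = tables[1]
print(len(tables))
print(second)
```

If a page is later reorganized, the index can silently point at a different table, so checking the column names after loading is a cheap safeguard.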

Note that the data is somewhat aggregated: the last column is the number of times that each parent/daughter diameter pair occurred in the original data.
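
To make the aggregation concrete, here is a small sketch using three rows copied from the table above (the Parent = 21 group only). It expands the frequency counts back into one row per observation with `Index.repeat`, and checks that the plain mean of the expanded rows matches the frequency-weighted mean of the aggregated rows. The names `sample`, `expanded`, and `weighted_mean` are just for illustration.

```python
import pandas

# Three rows from the table above, using short column names.
sample = pandas.DataFrame({
    "Parent": [21, 21, 21],
    "Daughter": [14.67, 15.67, 16.67],
    "Frequency": [22, 8, 10],
})

# Repeat each row label Frequency times, then select those rows,
# giving one row per original observation.
repeated_labels = sample.index.repeat(sample["Frequency"])
expanded = sample.loc[repeated_labels, ["Parent", "Daughter"]].reset_index(drop=True)

# The same mean can be computed without expanding, using Frequency as weights.
weighted_mean = (
    (sample["Daughter"] * sample["Frequency"]).sum() / sample["Frequency"].sum()
)

print(len(expanded))
print(expanded["Daughter"].mean(), weighted_mean)
```

Expanding is convenient for plotting functions that expect one row per observation, while the weighted form avoids building the larger frame.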

data.columns = ["Parent", "Daughter", "Frequency"]
plot = data.hvplot.scatter(x="Parent", y="Daughter", s=data.Frequency,
                           c="Frequency",
                           color=PLOT.red,
                           title="Sweet Pea Diameter (0.01 Inch)").opts(
                               height=PLOT.height,
                               width=PLOT.width,
                               fontscale=PLOT.fontscale,
                           )
embedder = Embed(plot=plot, file_name="scatter_plot")
output = embedder()

print(output)

rows = []
for row in data.itertuples():
    rows.extend([[row.Parent, row.Daughter] for _ in range(row.Frequency)])

frame = pandas.DataFrame(rows, columns="Parent Daughter".split())

plot = frame.hvplot.hist(stacked=True, title="Diameter Distribution", alpha=0.5).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
    xlabel="Diameter",
    ylabel="Count",
)
output = Embed(plot=plot, file_name="histogram")()

print(output)

parent_min = frame.Parent.min()
parent_max = frame.Parent.max()
y_1 = frame[frame.Parent==parent_min].Daughter.mean()
y_2 = frame[frame.Parent==parent_max].Daughter.mean()

dots = frame.hvplot.scatter(x="Parent", y="Daughter",
                            color=PLOT.red,
                            title="Sweet Pea Diameter (0.01 Inch)")
line = holoviews.Curve([(parent_min, y_1), (parent_max, y_2)])

plot = (dots * line).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
)

output = Embed(plot=plot, file_name="scatter_plot_with_means")()

print(output)

End

Source

Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)

Zip Code Plotting

Cloistered Monkey

2020-11-24 22:49

Beginning

This is a quick test to see if I can plot some GeoJSON that defines a zip code map for the state of Oregon. I got the file from GitHub, but the README indicates that the original source (see below) was the U.S. Census Bureau. The files are from 2010, so they're a little old, but I don't imagine that zip codes update that often anyway. Maybe as a future exercise I'll try to replicate what was done with the 2019 update.

Imports

# python
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "zip-code-plotting"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Loading the Data

This should be pretty straightforward. The documentation for GeoPandas indicates that it can handle GeoJSON directly.

load_dotenv("posts/.env")
path = Path(os.environ["OREGON_ZIP_CODES"]).expanduser()
geoframe = geopandas.read_file(path)
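
For a sense of what GeoPandas is doing with GeoJSON, here is a minimal sketch that builds a GeoDataFrame from an in-memory FeatureCollection instead of a file. The two square "zip codes" and their `ZCTA5CE10` values are made up for illustration, and the EPSG:4326 coordinate reference system is an assumption (it is the one GeoJSON normally uses).

```python
import geopandas

# Hypothetical hand-written GeoJSON, not the Oregon data.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"ZCTA5CE10": "00001"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"ZCTA5CE10": "00002"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]],
            },
        },
    ],
}

# Each feature becomes a row: properties turn into columns and the
# geometry is parsed into a shapely object in the "geometry" column.
toy = geopandas.GeoDataFrame.from_features(collection["features"], crs="EPSG:4326")

print(toy)
```

`geopandas.read_file` produces the same kind of frame when pointed at a GeoJSON file, which is why the columns in the real data mirror the feature properties.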

Let's see what's there.

print(geoframe.head())

  STATEFP10 ZCTA5CE10  GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10    ALAND10  \
0        41     97833  4197833        B5   G6350          S  228152974   
1        41     97840  4197840        B5   G6350          S  295777905   
2        41     97330  4197330        B5   G6350          S  199697439   
3        41     97004  4197004        B5   G6350          S  113398767   
4        41     97023  4197023        B5   G6350          S  330220870   

   AWATER10   INTPTLAT10    INTPTLON10 PARTFLG10  \
0         0  +44.9288886  -118.0148791         N   
1  10777783  +44.8847111  -116.9184395         N   
2    814864  +44.6424890  -123.2562655         N   
3     71994  +45.2549625  -122.4493774         N   
4   2345079  +45.2784758  -122.3231876         N   

                                            geometry  
0  MULTIPOLYGON (((-118.15748 44.99903, -118.1575...  
1  POLYGON ((-116.98971 44.88256, -116.98957 44.8...  
2  POLYGON ((-123.18294 44.64477, -123.18293 44.6...  
3  POLYGON ((-122.48691 45.22227, -122.48713 45.2...  
4  POLYGON ((-122.07580 45.10889, -122.07594 45.1...

This next bit is a little slow.

plot = geoframe.hvplot(
    hover_cols=["ZCTA5CE10"],
    legend=False,
    tools=["hover", "wheel_zoom"],
).opts(
    title="Oregon Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)
outcome = Embed(plot=plot, file_name="oregon_zip_codes")()

print(outcome)

Well, it kind of works, but it has a lot of weird holes, and I can't get the tools to appear so you can zoom in. I also don't really want the zip codes to be interpreted as a heat map. But I guess it's a start.

End

Sources

The Zip Code GeoJSON comes from a GitHub repository created by OpenDataDE.
The Official GeoJSON page (I think), not much there.
Wikipedia on GeoJSON
GeoPandas Documentation
