Geopandas Revisited

Beginning

This is a look at the U.S. Census zip-code mapping data.

Dependencies

Besides geopandas this also requires geoviews, which in turn requires cartopy, and if you install from pypi you have to install proj first. You also need the Python header files (on Debian that's sudo apt install python3-dev - note that python-dev will install the Python 2 headers for some reason)… Oy.

Take two.

sudo apt install python3-dev
sudo apt install proj-bin
sudo apt install libgeos-dev
sudo apt install libproj-dev

I'm not 100% sure about proj-bin, but the rest are needed to build cartopy.

pip install cartopy
pip install geopandas
pip install geoviews

The apt packages have to be installed before the pip packages, but the ordering within each set doesn't matter. You can install geopandas by itself; it will just complain that geoviews isn't installed.

Additionally, because the plot file is so huge I'm using datashader (which I don't recall explicitly installing - I think holoviews pulled it in), which means we'll also need Spatial Pandas. Hokey smokes.
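Neither of those shows up in the imports below, since hvplot uses them behind the scenes, but if they aren't already in your environment something like this (installing from pypi like the rest) should cover it.

pip install datashader
pip install spatialpandas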

Imports

# python
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from expects import equal, expect
from tabulate import tabulate

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas
import requests
import seaborn

# my stuff
from graeae import EmbedHoloviews

Set Up

load_dotenv("posts/.env")
SLUG = "geopandas-revisited"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Loading the Data

data_path = Path(os.environ["US_ZIP_CODE_SHAPEFILE"]).expanduser()
frame = geopandas.read_file(data_path)

First, let's find the Coordinate Reference System being used.

print(frame.crs)
epsg:4269

According to MapTiler it's a North America-specific reference system. I don't know if that's good or bad, so let's try it as-is first.

print(frame.head())
  ZCTA5CE10 GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10    ALAND10  AWATER10  \
0     43451   43451        B5   G6350          S   63484189    157689   
1     43452   43452        B5   G6350          S  121525288  13721740   
2     43456   43456        B5   G6350          S    9320978   1003775   
3     43457   43457        B5   G6350          S   48004678         0   
4     43458   43458        B5   G6350          S    2573816     39915   

    INTPTLAT10    INTPTLON10  \
0  +41.3183193  -083.6150363   
1  +41.5202297  -082.9770334   
2  +41.6318300  -082.8393923   
3  +41.2673301  -083.4274872   
4  +41.5304461  -083.2133648   

                                            geometry  
0  POLYGON ((-83.70873 41.32733, -83.70815 41.327...  
1  POLYGON ((-83.08698 41.53780, -83.08256 41.537...  
2  MULTIPOLYGON (((-82.83558 41.71082, -82.83515 ...  
3  POLYGON ((-83.49650 41.25371, -83.48382 41.253...  
4  POLYGON ((-83.22229 41.53102, -83.22228 41.532...  
# the zip codes are in the first column (ZCTA5CE10)
ZIP_COLUMN = frame.columns[0]

Unfortunately this makes a huge plot so I'll have to use datashade with it (which isn't necessarily a bad thing, it just isn't what I prefer).

plot = frame.hvplot(hover_cols=["ZCTA5CE10"], legend=False, datashade=True).opts(
    title="United States Census Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)

outcome = Embed(plot=plot, file_name="us_zip_codes")()

Well, after all that there seems to be a bug in there somewhere. Datashader is doing something with numpy that raises an exception.

/usr/local/lib/python3.8/dist-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    386     """
    387     if not issubclass_(arg1, generic):
--> 388         arg1 = dtype(arg1).type
    389     if not issubclass_(arg2, generic):
    390         arg2 = dtype(arg2).type

TypeError: Cannot interpret 'MultiPolygonDtype(float64)' as a data type

Maybe Just Portland

There's a site called zipdatamaps that has listings of zip codes (among other things) which I'll use to get the zip codes for Portland, Oregon. I'm going to use pandas' read_html function which also requires you to install lxml.

According to the pandas documentation you can't use https, and the request gives me a 403 (Forbidden) error anyway, so I'll pull the HTML with requests first instead of having pandas fetch it directly. The table also has a title above the column headers, so we have to skip the first row to avoid ending up with a MultiIndex (or fix it later).

URL = "https://www.zipdatamaps.com/zipcodes-portland-or"
response = requests.get(URL)
tables = pandas.read_html(response.text, skiprows=1)

read_html returns a list so I'll pull out the first frame and do a little clean up (dropping the empty rows with dropna).

zips = tables[0]
zips = zips.dropna()
print(zips.head())
  ZIP Code ZIP Code Name Population        Type
     97034   Lake Oswego      18905  Non-Unique
0  97035.0   Lake Oswego    23912.0  Non-Unique
1  97080.0       Gresham    40888.0  Non-Unique
2  97086.0  Happy Valley    26010.0  Non-Unique
3  97201.0      Portland    15484.0  Non-Unique
4  97202.0      Portland    38762.0  Non-Unique

So we still have a problem: the first zip-code row got absorbed into the header… I'll just pull the row out of the header and add it back in as data. One thing to note is that the header values are all strings, so we'll have to do some conversion before appending the row.

# the real column names are the first level of the header MultiIndex;
# the second level is the first data row that got swallowed into it
columns = zips.columns.map(lambda column: column[0])
first_row = list(zips.columns.map(lambda column: column[1]))

# the header values are strings, so convert the population back to an integer
POPULATION_COLUMN = 2
first_row[POPULATION_COLUMN] = int(first_row[POPULATION_COLUMN])

zips.columns = columns

# zip codes as strings (to preserve leading zeros), populations as integers
zips.loc[:, "ZIP Code"] = zips["ZIP Code"].astype(int).astype(str)
zips.loc[:, "Population"] = zips["Population"].astype(int)

# put the swallowed first row back in as data
zips = zips.append(pandas.DataFrame([first_row], columns=columns), ignore_index=True)
print(zips.head())
  ZIP Code ZIP Code Name  Population        Type
0    97035   Lake Oswego       23912  Non-Unique
1    97080       Gresham       40888  Non-Unique
2    97086  Happy Valley       26010  Non-Unique
3    97201      Portland       15484  Non-Unique
4    97202      Portland       38762  Non-Unique

I converted the zip codes into strings instead of integers because there are zip codes with leading zeros, although there aren't any in Portland, so I suppose it could go either way.
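Just to illustrate the leading-zero problem (with a made-up Massachusetts example, since it doesn't come up in Portland):

# as an integer the leading zero is lost
print(int("02134"))                # 2134
# zfill pads it back out to five digits if you need to recover it
print(str(int("02134")).zfill(5))  # 02134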

Now we'll pare down the original data set.

ZIPS_COLUMN = "ZIP Code"
expression = "|".join(zips[ZIPS_COLUMN])
sub = frame[frame[ZIP_COLUMN].str.contains(expression, regex=True)]
print(sub.shape)
expect(len(sub)).to(equal(len(zips)))
(35, 10)

Once again, with feeling. I'll also add the population - for no particularly good reason - using a left merge.

sub = sub.rename(columns={ZIP_COLUMN: ZIPS_COLUMN})
plotter = pandas.merge(sub, zips, on=ZIPS_COLUMN, how="left")
expect(len(plotter)).to(equal(len(zips)))

I don't know why, but when setting the hover columns for the plot, the zip-code column doesn't show up if you put it first, yet it does if you put it second. Mysterious.

plot = plotter.hvplot(hover_cols=["Population", ZIPS_COLUMN], legend=False).opts(
    title="Portland by Zip Code",
    width=700,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None,
    colorbar=False,
)    
outcome = Embed(plot=plot, file_name="portland_zip_codes")()
print(outcome)

Figure Missing

What About All of Oregon

URL = "https://www.zip-codes.com/state/or.asp"
response = requests.get(URL)
tables = pandas.read_html(response.text)
table = tables[2]
table.columns = table.iloc[0]
table = table.drop(0)
print(table.head(2))
0        ZIP Code      City  County      Type
1  ZIP Code 97001  Antelope   Wasco  P.O. Box
2  ZIP Code 97002    Aurora  Marion  Standard
# the cells have a "ZIP Code " prefix, so strip it off to make the
# values match the zip codes in the shapefile before merging
table[ZIPS_COLUMN] = table[ZIPS_COLUMN].str.replace("ZIP Code ", "")

frame = frame.rename(columns={ZIP_COLUMN: ZIPS_COLUMN})
oregon = pandas.merge(frame, table, on=ZIPS_COLUMN)
print(oregon.iloc[0])
ZIP Code                                                  97824
GEOID10                                                   97824
CLASSFP10                                                    B5
MTFCC10                                                   G6350
FUNCSTAT10                                                    S
ALAND10                                               565896585
AWATER10                                                  20851
INTPTLAT10                                          +45.3543012
INTPTLON10                                         -117.7564700
geometry      POLYGON ((-117.993812 45.369603, -117.993632 4...
City                                                       Cove
County                                                    Union
Type                                                   Standard
Name: 0, dtype: object

This turns out to work, but the file it creates is 29 Megabytes, so maybe not a great idea to use it with holoviews. I'll just do a regular PNG with no annotations.

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set_style("whitegrid", rc={"axes.grid": False})
FIGURE_SIZE = (12, 10)
figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
plot = oregon.plot(ax=axe)
figure.savefig("images/oregon_zip_codes.png")

Oregon Zip Codes

Portland Again

counties = "Clackamas|Multnomah|Washington"
portland = oregon[[ZIPS_COLUMN, "City", "County"]]
portland = portland[portland.County.str.contains(counties, regex=True)]
portland = pandas.merge(portland, frame, on=ZIPS_COLUMN, how="left")
portland = pandas.merge(portland, zips, on=ZIPS_COLUMN, how="left")

portland = geopandas.GeoDataFrame(portland)

plot = portland.hvplot(hover_cols=["Population", "City", "County", ZIPS_COLUMN], legend=False).opts(
    title="Clackamas, Multnomah, and Washington Counties",
    width=800,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None,
    colorbar=False,
)    
outcome = Embed(plot=plot, file_name="portland_with_city")()
print(outcome)

Figure Missing

This one looks a little better. The gray areas are the cities that weren't in the first zip-code set - I guess that listing only counts Portland proper, not the whole Portland metropolitan area.

It's kind of surprising that the zip code with the highest population is on the west side (the darkest blue area) - I guess because it covers a larger area than the ones further east (Rock Creek, Cedar Mill, and Bethany, according to Google). It's also odd that some of the cities that come up (like Molalla, the furthest south) are listed on the Portland Metro list of cities - maybe there are just too many cities now.

Okay, I just checked out the metro's maps page and it looks like the metro area does cut through the outer counties instead of just taking them all in. To get just the metro area would take more work.
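If I ever do, here's a sketch of one way it might go, assuming there were a GeoJSON of the metro boundary available (the path below is hypothetical) - geopandas can clip one set of geometries to another.

# hypothetical boundary file for the metro area
metro_boundary = geopandas.read_file("data/metro_boundary.geojson")
# keep only the parts of the zip-code polygons inside the boundary
metro = geopandas.clip(portland, metro_boundary)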

End

Well, that was a little harder than I thought it would be. The main thing to remember, I suppose, is that the maps quickly grow too big for holoviews, so if you want to do an overview it's better to do it in matplotlib and save the interactivity for a smaller section.

Source

t Confidence Intervals

Imports

# from python
from collections import namedtuple
from math import sqrt

# pypi
from scipy.stats import t

Whale Mercury

Beginning

These are the summary statistics for the methyl-mercury levels (micrograms per wet gram) measured in pilot whale red meat, taken from OpenIntro Statistics.

| n  | \(\overline{x}\) | s   | min | max |
|----+------------------+-----+-----+-----|
| 19 | 4.4              | 2.3 | 1.7 | 9.2 |

WhaleMercury = namedtuple("WhaleMercury",
                          ["n",
                           "mean",
                           "standard_deviation",
                           "minimum",
                           "maximum"])

WHALES = WhaleMercury(n=19,
                      mean=4.4,
                      standard_deviation=2.3,
                      minimum=1.7,
                      maximum=9.2)

Checking the Skew

scalar = 2.5 * WHALES.standard_deviation
print(f"({WHALES.mean - scalar: 0.2f}, {WHALES.mean + scalar})")
(-1.35, 10.15)

Our minimum and maximum are within two and a half standard deviations of the mean, so we'll judge that there isn't a skew.

Middle

\[ \textrm{Confidence Interval} = \overline{x} \pm t^*_{df} SE \]

Standard Error

\[ SE = \frac{s}{\sqrt{n}} \]

standard_error = WHALES.standard_deviation/sqrt(WHALES.n)
print(f"Standard Error: {standard_error: 0.2f}")
Standard Error:  0.53

t Star

To get the \(t^*_{18}\) value we can use scipy's t distribution, which has a ppf (percent point function) method that will get it for us. We're calculating a 95% confidence interval, so the total area outside it is 1 - 0.95, but since ppf works with a single (lower) tail we have to split that area between the two tails and pass in \(\frac{1 - 0.95}{2}\) (and then take the absolute value of the result).

confidence = 0.95
two_tailed = (1 - confidence)/2
degrees_of_freedom = WHALES.n - 1

t_star_18 = abs(t.ppf(two_tailed, degrees_of_freedom))
print(f"t: {t_star_18:0.3f}")
t: 2.101

Margin Of Error

\[ \textrm{Margin of Error} = t^*_{18} SE \]

margin_of_error = t_star_18 * standard_error
print(f"Margin of Error: {margin_of_error: 0.2f}")
Margin of Error:  1.11

The Confidence Interval

lower = WHALES.mean - margin_of_error
upper = WHALES.mean + margin_of_error

print(f"Confidence Interval = {WHALES.mean} +- {t_star_18: 0.2f} x {standard_error: 0.2f}")
print(f"({lower:0.3f}, {upper: 0.3f})")
Confidence Interval = 4.4 +-  2.10 x  0.53
(3.291,  5.509)
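As a check (mine, not part of the original calculation), scipy's t distribution can also build the interval in one call with its interval method, which should reproduce the same bounds. The confidence level is passed positionally here because the parameter was renamed from alpha to confidence across scipy versions.

lower_check, upper_check = t.interval(confidence, degrees_of_freedom,
                                      loc=WHALES.mean, scale=standard_error)
print(f"({lower_check:0.3f}, {upper_check:0.3f})")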

End

We can say with 95% confidence that the mean methyl-mercury level for pilot whale red meat in Taiji is between 3.291 and 5.509 \(\frac{\mu g}{\textrm{wet g}}\) (the FDA suggested limit is \(1.0 \frac{\mu g}{\textrm{wet g}}\)).

There are a couple of things to note. One is that, although the original paper is behind a paywall so I can't read it, the abstract is available through ResearchGate (and Elsevier), and in it the mean is given as 5.9, not 4.4. In addition, the book I got the values from (OpenIntro Statistics) has the wrong values (3.87, 4.93) for the confidence interval it calculates, so I don't know how accurate the starting numbers are. Also, if you poke around the web, different researchers have reported a lot of variation in the mercury levels they found.

Sources

  • OpenIntro Statistics
  • Endo T, Haraguchi K. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin. 2010 May 1;60(5):743-7.

Galton, Pearson, and the Peas

Abstract Summary

Most textbooks explain correlation before linear regression, but the publications of Francis Galton and Karl Pearson indicate that Galton's work studying the inheritance of sweet pea characteristics led first to his initial concept of linear regression, with correlation and multiple regression developing out of later work by Galton and Pearson. So this paper gives a history of how Galton came up with linear regression, which instructors can use to introduce the topic.

Bullets

  • Galton came up with the concepts of linear regression and correlation (he wasn't the first, but this is about Galton and Pearson's discoveries), but Pearson developed the mathematics, so it is commonly known as the Pearson Correlation Coefficient (or some variant of that), leading many students to think that Pearson came up with it (see also Auguste Bravais).
  • Galton's Question: How much influence does a generation have on the characteristics of the next?

The Sweet Peas

  • 1875: Galton gave seven friends 700 sweet pea seeds. The friends planted the seeds, harvested the peas, and returned them to Galton.
  • Galton chose sweet peas because they can self-fertilize so he didn't have to deal with the influence of two parents.
  • Galton found that the median size of daughter seeds for each mother's seed size plotted a straight(ish) line with a positive slope less than one (0.33) - the first regression line
  • Galton: Slope < 1 shows "Regression Towards Mediocrity" - daughter's sizes are closer to the mean than parent's sizes are to the mean
  • If the slope had been 1, then the parent and child means would have been the same
  • If the slope had been horizontal (zero), then the child didn't inherit size from the parent
  • A slope between 0 and 1 meant that there was some influence from the parent
  • Since he had 700 seeds and no calculator (and maybe for other reasons too) Galton used the median and semi-interquartile range \(\left( \frac{\textrm{Inter Quartile Range}}{2} \right)\) instead of mean and deviation
  • Galton estimated the regression line using the scatterplot, not by calculating the slope

Galton and Correlation

  • If the correlation between the parent and child characteristic is the same for different data sets but the slopes differ, then the difference in variance is what's causing the difference in slope
  • The more difference there is in the variance, the steeper the slope of the line
  • Believed he had found that correlation between generations was a constant even for different characteristics (not something that is currently believed)
  • Although he was wrong about the correlation being constant, thinking about it led him to develop his ideas about regression and correlation

The equation he was working toward, and that Pearson eventually arrived at, was:

\begin{align} y &= mx \\ &= r \left(\frac{S_y}{S_x} \right) x \end{align}

Where r is the correlation between the parent and child sizes, \(S_y\) is the sample standard deviation of the daughter seed-sizes, and \(S_x\) is the sample standard deviation of the parent seed-sizes. There's an intercept term too, but the point here is to show how the spread of the two variables affects the slope: the less variance there is in the daughter seeds, the smaller the slope (the same is true of the correlation, of course, but Galton thought that was a fixed value anyway).
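As a quick numerical sanity check of that relationship (my own sketch with synthetic data, not anything from the paper), the least-squares slope of a simple regression does come out equal to r times the ratio of the standard deviations.

# synthetic "parent" and "daughter" values
import numpy

generator = numpy.random.default_rng(0)
parent = generator.normal(size=1000)
daughter = 0.33 * parent + generator.normal(scale=0.5, size=1000)

# correlation and least-squares slope
r = numpy.corrcoef(parent, daughter)[0, 1]
slope = numpy.polyfit(parent, daughter, 1)[0]

# the slope matches r * (S_y / S_x)
print(numpy.isclose(slope, r * (daughter.std() / parent.std())))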

Other Stuff

It also touches on the fact that Galton recognized that prior generations also affected the size of the daughter's seeds, giving rise to the idea of multiple regression. And there is a table of the original measurements that Galton gathered for the seeds.

So, what again?

Galton's Regression Towards Mediocrity - or Regression to the Mean, as it's more commonly called nowadays - shows why humans don't split up into giants and little people. A person who is much larger than the mean might have a child who is also larger than the mean, but that child is likely to be closer to the mean than the parent was, just as parents who are smaller than the mean will tend to have children who are larger than they are (closer to the mean).

Source

  • Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)

Galton's Sweet Pea Data

Beginning

Imports

# python
from functools import partial
from collections import namedtuple

# pypi
from tabulate import tabulate

import holoviews
import hvplot.pandas
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "galtons-sweet-pea-data"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Plot = namedtuple("Plot", ["width", "height", "fontscale",
                           "tan", "blue",
                           "red"])
PLOT = Plot(
    width=900,
    height=750,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Middle

The Data

URL = "http://jse.amstat.org/v9n3/stanton.html#appendixa"

frames = pandas.read_html(URL, header=0)

There are two tables on the page; we want the second one.

data = frames[1]
print(tabulate(data, tablefmt="orgtbl", headers="keys", showindex=False))
| Diameter of Parent Seed (0.01 inch) | Diameter of Daughter Seed (0.01 inch) | Frequency |
|-------------------------------------+---------------------------------------+-----------|
| 21 | 14.67 | 22 |
| 21 | 15.67 |  8 |
| 21 | 16.67 | 10 |
| 21 | 17.67 | 18 |
| 21 | 18.67 | 21 |
| 21 | 19.67 | 13 |
| 21 | 20.67 |  6 |
| 21 | 22.67 |  2 |
| 20 | 14.66 | 23 |
| 20 | 15.66 | 10 |
| 20 | 16.66 | 12 |
| 20 | 17.66 | 17 |
| 20 | 18.66 | 20 |
| 20 | 19.66 | 13 |
| 20 | 20.66 |  3 |
| 20 | 22.66 |  2 |
| 19 | 14.07 | 35 |
| 19 | 15.07 | 16 |
| 19 | 16.07 | 12 |
| 19 | 17.07 | 13 |
| 19 | 18.07 | 11 |
| 19 | 19.07 | 10 |
| 19 | 20.07 |  2 |
| 19 | 22.07 |  1 |
| 18 | 14.35 | 34 |
| 18 | 15.35 | 12 |
| 18 | 16.35 | 13 |
| 18 | 17.35 | 17 |
| 18 | 18.35 | 16 |
| 18 | 19.35 |  6 |
| 18 | 20.35 |  2 |
| 17 | 13.92 | 37 |
| 17 | 14.92 | 16 |
| 17 | 15.92 | 13 |
| 17 | 16.92 | 16 |
| 17 | 17.92 | 13 |
| 17 | 18.92 |  4 |
| 17 | 19.92 |  1 |
| 16 | 14.28 | 34 |
| 16 | 15.28 | 15 |
| 16 | 16.28 | 18 |
| 16 | 17.28 | 16 |
| 16 | 18.28 | 13 |
| 16 | 19.28 |  3 |
| 16 | 20.28 |  1 |
| 15 | 13.77 | 46 |
| 15 | 14.77 | 14 |
| 15 | 15.77 |  9 |
| 15 | 16.77 | 11 |
| 15 | 17.77 | 14 |
| 15 | 18.77 |  4 |
| 15 | 19.77 |  2 |

Note that the data is somewhat aggregated - the last column is the number of times that each parent/daughter diameter pair occurred in the original data.

data.columns = ["Parent", "Daughter", "Frequency"]
plot = data.hvplot.scatter(x="Parent", y="Daughter", s=data.Frequency,
                           c="Frequency",
                           color=PLOT.red,
                           title="Sweet Pea Diameter (0.01 Inch)").opts(
                               height=PLOT.height,
                               width=PLOT.width,
                               fontscale=PLOT.fontscale,
                           )
embedder = Embed(plot=plot, file_name="scatter_plot")
output = embedder()
print(output)

Figure Missing

# expand the aggregated table back out to one row per parent/daughter observation
rows = []
for row in data.itertuples():
    rows.extend([[row.Parent, row.Daughter] for _ in range(row.Frequency)])

frame = pandas.DataFrame(rows, columns="Parent Daughter".split())
plot = frame.hvplot.hist(stacked=True, title="Diameter Distribution", alpha=0.5).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
    xlabel="Diameter",
    ylabel="Count",
)
output = Embed(plot=plot, file_name="histogram")()
print(output)

Figure Missing

# a rough version of Galton's line: the mean daughter diameter at the
# smallest and largest parent diameters gives two points to draw through
parent_min = frame.Parent.min()
parent_max = frame.Parent.max()
y_1 = frame[frame.Parent==parent_min].Daughter.mean()
y_2 = frame[frame.Parent==parent_max].Daughter.mean()

dots = frame.hvplot.scatter(x="Parent", y="Daughter",
                           color=PLOT.red,
                           title="Sweet Pea Diameter (0.01 Inch)")
line = holoviews.Curve([(parent_min, y_1), (parent_max, y_2)])

plot = (dots * line).opts(
    height=PLOT.height,
    width=PLOT.width,
    fontscale=PLOT.fontscale,
)

output = Embed(plot=plot, file_name="scatter_plot_with_means")()
print(output)

Figure Missing

End

Source

  • Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)

Zip Code Plotting

Beginning

This is a quick test to see if I can plot some GeoJSON that defines a zip-code map for the state of Oregon. I got the file from GitHub, but the README indicates that the original source (see below) was the U.S. Census Bureau. The files are from 2010, so they're a little old, but I don't imagine that zip codes change that often anyway. Maybe as a future exercise I'll try to replicate what was done with the 2019 update.

Imports

# python
from functools import partial
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

SLUG = "zip-code-plotting"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/{SLUG}")

Middle

Loading the Data

This should be pretty straightforward. The documentation for GeoPandas indicates that it can handle GeoJSON directly.

load_dotenv("posts/.env")
path = Path(os.environ["OREGON_ZIP_CODES"]).expanduser()
geoframe = geopandas.read_file(path)

Let's see what's there.

print(geoframe.head())
  STATEFP10 ZCTA5CE10  GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10    ALAND10  \
0        41     97833  4197833        B5   G6350          S  228152974   
1        41     97840  4197840        B5   G6350          S  295777905   
2        41     97330  4197330        B5   G6350          S  199697439   
3        41     97004  4197004        B5   G6350          S  113398767   
4        41     97023  4197023        B5   G6350          S  330220870   

   AWATER10   INTPTLAT10    INTPTLON10 PARTFLG10  \
0         0  +44.9288886  -118.0148791         N   
1  10777783  +44.8847111  -116.9184395         N   
2    814864  +44.6424890  -123.2562655         N   
3     71994  +45.2549625  -122.4493774         N   
4   2345079  +45.2784758  -122.3231876         N   

                                            geometry  
0  MULTIPOLYGON (((-118.15748 44.99903, -118.1575...  
1  POLYGON ((-116.98971 44.88256, -116.98957 44.8...  
2  POLYGON ((-123.18294 44.64477, -123.18293 44.6...  
3  POLYGON ((-122.48691 45.22227, -122.48713 45.2...  
4  POLYGON ((-122.07580 45.10889, -122.07594 45.1...  

This next bit is a little slow.

plot = geoframe.hvplot(hover_cols=["ZCTA5CE10"], legend=False, tools=["hover", "wheel_zoom"],).opts(
    title="Oregon Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)
outcome = Embed(plot=plot, file_name="oregon_zip_codes")()
print(outcome)

Figure Missing

Well, on the one hand it kind of works, but it has a lot of weird holes and I can't get the tools to appear so you can zoom in. I also don't really want the zip-codes to be interpreted as a heat map. But I guess it's a start.

End

Sources

The Geopandas DataFrame Coordinates Example

Beginning

Imports

# python
from functools import partial

# pypi
import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

Embed = partial(EmbedHoloviews, folder_path="files/posts/the-geopandas-dataframe-coordinates-example")

Middle

Longitudes and Latitudes

data = pandas.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
geoframe = geopandas.GeoDataFrame(data,
                                  geometry=geopandas.points_from_xy(
                                      data.Longitude, data.Latitude))
print(geoframe.head())
           City    Country  Latitude  Longitude                     geometry
0  Buenos Aires  Argentina    -34.58     -58.66  POINT (-58.66000 -34.58000)
1      Brasilia     Brazil    -15.78     -47.91  POINT (-47.91000 -15.78000)
2      Santiago      Chile    -33.45     -70.66  POINT (-70.66000 -33.45000)
3        Bogota   Colombia      4.60     -74.08    POINT (-74.08000 4.60000)
4       Caracas  Venezuela     10.48     -66.86   POINT (-66.86000 10.48000)

Plot It

# read in the plotting data
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# restrict it to South America.
ax = world[world.continent == 'South America'].plot(
    color='white', edgecolor='black')

# We can now plot our ``GeoDataFrame``.
geoframe.plot(ax=ax, color='red')

pyplot.savefig("files/posts/the-geopandas-dataframe-coordinates-example/south_america.png")

south_america.png

With Holoviews

Since this is going to be interactive we can add more to the plot.

print(world.head())
     pop_est      continent                      name iso_a3  gdp_md_est  \
0     920938        Oceania                      Fiji    FJI      8374.0   
1   53950935         Africa                  Tanzania    TZA    150600.0   
2     603253         Africa                 W. Sahara    ESH       906.5   
3   35623680  North America                    Canada    CAN   1674000.0   
4  326625791  North America  United States of America    USA  18560000.0   

                                            geometry  
0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...  
1  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...  
2  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...  
3  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...  
4  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...  

Let's add the population estimate to the countries and the city to the points.

world_plot = world[world.continent == "South America"].hvplot(
    color="white",
    x="Longitude",
    y="Latitude",
    legend=False,
    hover_cols=["pop_est", "name"])

# NOTE: the overlay uses the pandas.DataFrame, not the GeoDataFrame
points = data.hvplot.points("Longitude", "Latitude", hover_cols=["City"])
plot = (world_plot * points).opts(
    title="South American Cities",
    width=700,
    height=890,
)
outcome = Embed(plot=plot, file_name="south_america", height_in_pixels=1000)()
print(outcome)

Figure Missing

This was mostly a check to see if I could get it working so I'll have to think about adding more explanations later.

Notes

  • To get matplotlib to work I had to install descartes (pip install descartes)
  • To get hvplot to work I had to install geoviews (pip install geoviews)

Links

Secrets of Mental Math

Citation

  • Benjamin A, Shermer M, Benjamin A, 3M Cloud Library. Secrets of mental math: the mathemagician’s guide to lightning calculation and amazing math tricks [Internet]. Place of publication not identified: Crown Publishing Group; 2006 [cited 2020 Nov 9].

Bibliography for Think Complexity

Citation

  1. Downey A. Think complexity: complexity science and computational modeling. Second edition. Beijing ; Boston: O’Reilly; 2018. 185 p.

Comments

This is a Creative Commons book that is for sale but is also available to read online or as a free PDF from Green Tea Press. There is also a GitHub repository for it.