Dörk, M. (2023). Data Visualization: Hands-on tutorials for basic visualization techniques and the necessary data processing. Retrieved from https://infovis.fh-potsdam.de/tutorials/
This is a look at getting a map of the New York City boroughs done using GeoPandas. The GeoPandas examples come from the Introduction To GeoPandas.
Setup
Imports
Note: Using hvplot requires geoviews and scipy on top of geopandas.
# python
from functools import partial
import sys

# pypi
from tabulate import tabulate

import altair
import geopandas
import hvplot.pandas

# my stuff
from graeae import EmbedHoloviews
from graeae.visualization.altair_helpers import output_path, save_chart
We want the last one, the nybb. To load the map data into GeoPandas we give it the path to the source file. In this case, since it comes with the built-in data sets we can use their get_path method.
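In code that looks something like this (a minimal sketch - the name nyc_data is just what the plotting code below expects, and the head call is only there to peek at the first row):

path_to_data = geopandas.datasets.get_path("nybb")
nyc_data = geopandas.read_file(path_to_data)
print(nyc_data.head(1))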
The GeoPandas methods will automatically use the geometry column when we do any mapping so we don't have to do anything special to tell it what to use. Note that although printing the first row makes it look like we have a pandas DataFrame, it's actually a GeoPandas object, so it has both pandas methods and geopandas methods.
Plotting The Boroughs With HVPlot
plot = nyc_data.hvplot(
    hover_cols=["BoroName"],
    legend=False,
    tools=["hover", "wheel_zoom"],
).opts(
    title="New York City Boroughs",
    width=800,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None)

outcome = Embed(plot=plot, file_name="nyc_boroughs")()
print(outcome)
Folium Map
The GeoPandas explore method creates a Folium map that plots the boroughs using the geometry data and puts them on top of a street map. To give it something extra we'll add a column for the area of each borough and color the output based on it (there's a sketch of this after the list below).
The column argument tells geopandas which column to use to color the Choropleth Map.
popup makes it so that clicking on a borough will show data from all the columns
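Put together, a sketch of the Folium version might look like this (the "area" column name and the output file name are my choices here, not the original code):

# add an area column so explore has something to color by
nyc_data["area"] = nyc_data.area

# column colors the choropleth, popup=True shows all the columns on click
folium_map = nyc_data.explore(column="area", popup=True)
folium_map.save("nyc_boroughs_folium.html")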
An Altair Map
Using altair isn't too far from using hvplot, although it does have (a little) more documentation for mapping. For some reason it doesn't recognize our data as geographic data so we'll have to use the project method to tell altair to treat the geometry data as x-y data.
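As a rough sketch (the encoding choices here are my assumptions, not the original code), the altair version looks something like this:

chart = altair.Chart(nyc_data).mark_geoshape().encode(
    color="BoroName:N",
    tooltip=["BoroName:N"],
).project(
    # identity + reflectY treats the geometry as plain x-y data
    type="identity", reflectY=True
).properties(
    title="New York City Boroughs",
    width=800,
    height=700,
)
chart.save("nyc_boroughs_altair.html")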
As I noted in a previous post, you can't just install geopandas and geoviews, you also need to install cartopy, but cartopy currently requires version 8 of proj while the stable versions of debian and ubuntu both have version 7.2. I originally tried to download the (unstable) .deb files and install them, but libproj-dev requires libproj22 which requires a newer version of the C++ libraries than is currently in debian-stable… and at that point I decided to give up on it and run everything on debian bookworm (currently in testing) which has the newer versions of the stuff you need for cartopy.
Beginning
This is going to be a simple plot of Portland neighborhoods using GeoPandas/GeoJSON. I started doing this post to get Portland Crime Data mapped to the neighborhoods, which is what they (the Portland Police Bureau) use, but it wasn't as straightforward as I thought it'd be, so I'm pulling it out here to see if I can keep it simpler for future reference.
Imports
# python
from functools import partial
from pathlib import Path

# pypi
import altair
import hvplot.pandas
import geopandas
import geoviews
import yaml

# my stuff
from graeae import EmbedHoloviews
Looking at this bug report for altair, it appears that vega-lite doesn't support an interactive map so it seems like it'd be better to stick with GeoPandas for now.
Some Stuff To Check Out
I ran into these while looking for the neighborhood data, maybe I'll check them out later.
Portland Open Data
To get the boundary information, click on the Boundaries Icon and find the dataset you want in the list that comes up. Once you're on the map, to get the dataset you need to click on a little bar at the bottom-left that says "I want to use this" which will change the user-interface. Click on the "View API Resources" tab and it will bring up two URLs, one of which is a GeoJSON link.
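Once you have the GeoJSON URL you can hand it straight to geopandas (the URL here is just a placeholder for whatever the "View API Resources" tab gives you):

NEIGHBORHOODS_URL = "https://..."  # placeholder for the GeoJSON link
neighborhoods = geopandas.read_file(NEIGHBORHOODS_URL)
print(neighborhoods.head(1))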
# python
from functools import partial
from pathlib import Path

# pypi
from tabulate import tabulate

import altair
import geopandas
import hvplot.pandas
import pandas
import yaml

# my stuff
from graeae import EmbedHoloviews
Case Number
The case year and number for the reported incident (YY-######). Sensitive cases have been randomly assigned a case number and are denoted by an X following the case year (YY-X######).
Occur Month Year
The Month and Year that the incident occurred.
Occur Date
Date the incident occurred. The exact occur date is sometimes unknown. In most situations, the first possible date the crime could have occurred is used as the occur date. (For example, victims return home from a week-long vacation to find their home burglarized. The burglary could have occurred at any point during the week. The first date of their vacation would be listed as the occur date.)
Occur Time
Time the incident occurred. The exact occur time is sometimes unknown. In most situations, the first possible time the crime could have occurred is used as the occur time. The time is reported in the 24-hour clock format, with the first two digits representing hour (ranges from 00 to 23) and the second two digits representing minutes (ranges from 00 to 59). Note: By default, Microsoft Excel removes leading zeroes when importing data. For more help with this issue, refer to Microsoft's help page.
Address
Address of reported incident at the 100 block level (e.g.: 1111 SW 2nd Ave would be 1100 Block SW 2nd Ave). To protect the identity of victims and other privacy concerns, the address location of certain case types are not released.
Open Data X / Y
Generalized XY point of the reported incident. For offenses that occurred at a specific address, the point is mapped to the block's midpoint. Offenses that occurred at an intersection are mapped to the intersection centroid. To protect the identity of victims and other privacy concerns, the points of certain case types are not released. XY points use the Oregon State Plane North (3601), NAD83 HARN, US International Feet coordinate system.
Open Data Lat / Lon
Generalized Latitude / Longitude of the reported incident. For offenses that occurred at a specific address, the point is mapped to the block's midpoint. Offenses that occurred at an intersection is mapped to the intersection centroid. To protect the identity of victims and other privacy concerns, the points of certain case types are not released.
Neighborhood
Neighborhood where incident occurred. If the neighborhood name is missing, the incident occurred outside of the boundaries of the Portland neighborhoods or at a location that could not be assigned to a specific address in the system (e.g., Portland, near Washington Park, on the streetcar, etc.). Note: Neighborhood boundaries and designations vary slightly from those found on the Office of Community & Civic Life website.
Crime Against
Crime against category (Person, Property, or Society)
Offense Category
Category of offense (for example, Assault Offenses)
Offense Type
Type of offense (for example, Aggravated Assault). Note: The statistic for Homicide Offenses has been updated in the Group A Crimes report to align with the 2019 FBI NIBRS definitions. The statistic for Homicide Offenses includes (09A) Murder & Non-negligent Manslaughter and (09B) Negligent Manslaughter. As of January 1, 2019, the FBI expanded the definition of negligent manslaughter to include traffic fatalities that result in an arrest for driving under the influence, distracted driving, or reckless driving. The change in definition impacts the 2019 homicide offenses statistic and the comparability of 2019 homicide statistics to prior years.
Offense Count
Number of offenses per incident. Offenses (i.e. this field) are summed for counting purposes.
Our first problem is that there's a bunch of times labeled "0" which doesn't match the formatting of the rest of the times. This might represent missing data, unknown data, or 0-o'clock (midnight). The documentation doesn't really say, although it does say that when the time isn't known it uses the earliest possible time, which would seem to be midnight, so that's what I'll assume it is.
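As a sketch of what that cleanup might look like (the column names match the data frame shown further down in this post, but the exact code here is an assumption):

# pad the times out to four digits so "0" becomes "0000" (midnight)
data["OccurTime"] = data.OccurTime.astype(str).str.zfill(4)

# combine the date and time into a single timestamp column
data["when"] = pandas.to_datetime(
    data.OccurDate + " " + data.OccurTime,
    format="%m/%d/%Y %H%M",
    errors="coerce")
data["year"] = data.when.dt.year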
crime_year = data.groupby(["Year Reported"]).count().reset_index()
crime_year = crime_year.rename(columns=COUNT_COLUMN)

chart = altair.Chart(crime_year[["Year Reported", "Count"]]).mark_bar().encode(
    x="Year Reported",
    y="Count",
    tooltip=["Year Reported", "Count"]
).properties(
    title="Crime by Year",
    width=800,
    height=600,
).interactive()

chart.save(str(PATH/"crime_year.html"))
Address
Not all addresses are given; some are omitted if the case is considered sensitive. I'll look at those in the next section on "Case Number", but since we're here, let's look at which types of crime, based on who the victim is, are considered too sensitive to report the address.
The method to force a side-by-side bar plot came from this github bug report.
data.loc[:,"Addressed"]=data.Address.isnull()data.loc[data["Addressed"],"Addressed"]="Missing"data.loc[data["Addressed"]!="Missing","Addressed"]="Has Address"counts=data.groupby(["CrimeAgainst","Addressed"]).count().reset_index()counts=counts.rename(columns=COUNT_COLUMN)counts=counts[["CrimeAgainst","Addressed","Count"]]addressed_chart=altair.Chart(counts)crime_against=addressed_chart.mark_bar().encode(column=altair.Column("CrimeAgainst",spacing=5,header=altair.Header(labelOrient="bottom")),x=altair.X("Addressed:N",sort=["Missing","Has Address"],axis=None),y="Count",color="Addressed:N",tooltip=["CrimeAgainst","Count"]).properties(title="Victim of a Crime with an Address",width=275,height=600,).interactive()crime_against.save(str(PATH/"addressed_crime_against.html"))
Case Number
This is a unique identifier for each case. It doesn't seem like this would be interesting unless you wanted to look up a specific incident, but one possibly useful bit is that the case numbers are given a prefix of "X" if they are sensitive (by case number I mean just the number section - i.e. <year>-X<case number>), which might be useful in figuring out the reason for missing location data.
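The next block uses a grouped frame with a "Sensitive" column; here's a sketch of how that flag might be derived from the case number prefix (the exact string matching is my assumption):

# flag cases whose number has the X prefix after the year (e.g. 20-X5515843)
data["Sensitive"] = data.CaseNumber.str.contains("-X")

grouped = data.groupby(["Sensitive", "Year Reported"]).count().reset_index()
grouped = grouped.rename(columns=COUNT_COLUMN)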
chart = altair.Chart(grouped[["Count", "Sensitive", "Year Reported"]]).mark_bar().encode(
    color="Sensitive:O",
    y="Count",
    x="Year Reported:O",
    tooltip=["Sensitive", "Count", "Year Reported"]
).properties(
    title="Count of Sensitive Cases",
    width=800,
    height=600
).interactive()

chart.save(str(PATH/"sensitive_case_count.html"))
By Category
grouped=data.groupby(["CrimeAgainst","OffenseCategory"]).count().reset_index()grouped=grouped.rename(columns=COUNT_COLUMN)chart=altair.Chart(grouped)categories=chart.mark_bar().encode(x="CrimeAgainst:O",y="Count",color="OffenseCategory",tooltip=["CrimeAgainst","Count","OffenseCategory"]).properties(title="Crime Category by Victim Type",width=800,height=600,).interactive()categories.save(str(PATH/"category_crime_against.html"))
Offense Type
columns=["CrimeAgainst","OffenseCategory","OffenseType"]grouped=data.groupby(columns).count().reset_index()grouped=grouped.rename(columns={"CaseNumber":"Count"})chart=altair.Chart(grouped[columns+["Count"]])categories=chart.mark_bar().encode(x="OffenseCategory:O",y="Count",color="OffenseType",tooltip=["CrimeAgainst","Count","OffenseCategory","OffenseType"]).properties(title="Crime Type by Offense Category",width=800,height=600,).interactive()categories.save(str(PATH/"category_type_crime_against.html"))
columns=["CrimeAgainst","Neighborhood","OffenseType"]grouped=data.groupby(columns).count().reset_index()grouped=grouped.rename(columns={"CaseNumber":"Count"})chart=altair.Chart(grouped[columns+["Count"]])categories=chart.mark_bar().encode(x="Neighborhood:O",y="Count",color="OffenseType",tooltip=["Neighborhood","CrimeAgainst","Count","OffenseType"]).properties(title="Crime Type by Neighborhood",width=800,height=600,).interactive()categories.save(str(PATH/"neighborhood_type_crime_against.html"))
Row 3770:

Address            NaN
CaseNumber         20-X5515843
CrimeAgainst       Person
Neighborhood       Concordia
OccurDate          1/1/1971
OccurTime          0
OffenseCategory    Sex Offenses
OffenseType        Fondling
OpenDataLat        NaN
OpenDataLon        NaN
ReportDate         10/14/2020
OffenseCount       2
OpenDataX          NaN
OpenDataY          NaN
when               1971-01-01
So, there might be some mistakes in there… or maybe some people wait a long time to report a crime?
columns=["year","OffenseType"]grouped=data.groupby(columns).count().reset_index()grouped=grouped.rename(columns={"CaseNumber":"Count"})chart=altair.Chart(grouped[columns+["Count"]])categories=chart.mark_bar().encode(x="year:N",y="Count",color="OffenseType",tooltip=["year","Count","OffenseType"]).properties(title="Crime Type by Year",width=800,height=600,).interactive()categories.save(str(PATH/"year_type.html"))
It's only supposed to go back to 2015, what's with the older rows?
data.loc[:,"reported"]=pandas.to_datetime(data.ReportDate)data.loc[:,"report_year"]=data.reported.apply(lambdarow:row.year)older=data[data.year<2015]columns=["report_year","year","OffenseType"]grouped=older.groupby(columns).count().reset_index()grouped=grouped.rename(columns={"CaseNumber":"Count"})chart=altair.Chart(grouped[columns+["Count"]])categories=chart.mark_bar().encode(x="report_year",y="Count",color="OffenseType",tooltip=["year","report_year","Count","OffenseType"]).properties(title="Crime Type by Year/Reporting",width=800,height=600,).interactive()categories.save(str(PATH/"year_reported_type.html"))
UPDATE: I thought the plotting wasn't working because I couldn't get the tooltips to show up, but I installed the Brave Browser, loaded the plots, and the tooltips showed up. There's something up with my Firefox setup that's killing the interactivity. Firefox!
UPDATE 2: After I did a refresh on Firefox the tooltips work. But now all my extensions and settings are gone. Was getting altair working worth it?
UPDATE 3: After wiping Firefox and starting over again I found out that enabling the privacy.resistFingerprinting option is what is breaking altair's interactivity. Strange.
What This Is
This is a (partial) replication of the Basic Statistical Visualization section of the altair documentation - mostly to see if it works.
To plot with altair you pass the dataframe to the Chart constructor and call some chained methods. To make a bar plot you call the chart's mark_bar method, and on the object that it returns you call the encode method, where you pass in the information on what to plot from the data. It uses a slightly funky string argument to tell altair to call a function (we're going to call average). This is only a shortcut, though; you can call the functions too. But that's getting ahead of things. Let's just make the plot.
Note: Setting y to the categorical value will make altair rotate the plot. Also the default plot is tiny so I went back and set the size to something bigger.
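Here's a minimal sketch of what that looks like (the data roughly follows the documentation's example frame with a categorical column "a" and a numeric column "b"; the output file name is mine):

import altair
import pandas

data_frame = pandas.DataFrame({
    "a": ["A", "B", "C", "D", "E", "F", "G", "H", "I"],
    "b": [28, 55, 43, 91, 81, 53, 19, 87, 52],
})

chart = altair.Chart(data_frame).mark_bar().encode(
    x="average(b)",  # the string shorthand tells altair to aggregate with average
    y="a",           # putting the categorical on y rotates the bars
).properties(
    width=800,
    height=300,
)
chart.save("altair_average_bar.html")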
Well, I'll leave it there and say it works, but I'm not sure I'll be switching from HoloViews just yet. I couldn't figure out how to get the tooltips working (which is kind of the point of not doing this as a static figure), and trying to navigate the documentation didn't leave me with a good feeling about trying to find stuff on the site. The examples I saw looked promising, though, so I'll keep it in mind and maybe play around with it more later.
# python
from collections import namedtuple
from datetime import datetime
from functools import partial
from pathlib import Path
import os

# pypi
from dotenv import load_dotenv
from expects import be_true, expect
from tabulate import tabulate

import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews
Our data set runs from January 1939 through February 2021, but looking at the values it seems they must have changed the way that they calculate it, or else there was a lot more employment in 1939. The "series" referred to in the series_id column is a time series. The period is the month and the "P" footnote code means preliminary (which seems odd given that there are 1,047).
They provide some files to map the more obscure values to names so I guess I'll take advantage of that.
notes_path = Path(os.environ["CES_FOOTNOTES"]).expanduser()
expect(notes_path.is_file()).to(be_true)

with notes_path.open() as reader:
    # get rid of the header
    reader.readline()

    # get the map
    CODE, TEXT = 0, 1
    lines = (line.split() for line in reader)
    notes_map = {line[CODE].strip(): line[TEXT].strip() for line in lines}

notes_map[numpy.nan] = "none"
So the year becomes nonsense, and since I'm using the seasonally adjusted the values might not make sense either, but this is just an exploration so let's see what we have.
This is a look at the U.S. Census zip-code mapping data.
Dependencies
Besides geopandas this also requires geoviews, which in turn requires cartopy, but if you use pypi you have to install proj first. You also need the python header files (so with debian it's sudo apt install python3-dev; note that python-dev will install the python-2 headers for some reason)… Oy.
I'm not 100% sure about proj-bin, but, the rest are needed to build cartopy.
Note: I tested it without proj-bin and the cartopy installation fails. Also, as of December 29, 2021 cartopy requires a newer version of proj than is in the stable releases for ubuntu and debian.
The apt stuff has to come before the pip installs, but the actual ordering within each set doesn't matter. You can install geopandas by itself, it will just complain that geoviews isn't installed.
Another Note: Geoviews will install cartopy so you don't need to install it separately (and maybe you shouldn't since this will keep the versions matched).
Additionally, because the plot file is so huge I'm using datashader (which I don't recall explicitly installing; I think holoviews pulled it in), so we'll also need Spatial Pandas. Hokey smokes.
Imports
# python
from functools import partial
from pathlib import Path
import os

# pypi
from dotenv import load_dotenv
from expects import equal, expect
from tabulate import tabulate

import geopandas
import hvplot.pandas
import matplotlib.pyplot as pyplot
import pandas
import requests
import seaborn

# my stuff
from graeae import EmbedHoloviews
Unfortunately this makes a huge plot so I'll have to use datashade with it (which isn't necessarily a bad thing, it just isn't what I prefer).
plot = frame.hvplot(
    hover_cols=["ZCTA5CE10"],
    legend=False,
    datashade=True
).opts(
    title="United States Census Zip Codes",
    width=800,
    height=700,
    fontscale=2,
)

outcome = Embed(plot=plot, file_name="us_zip_codes")()
Well, after all that there seems to be a bug in there somewhere. Datashader is doing something with numpy that raises an exception.
/usr/local/lib/python3.8/dist-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
386 """
387 if not issubclass_(arg1, generic):
--> 388 arg1 = dtype(arg1).type
389 if not issubclass_(arg2, generic):
390 arg2 = dtype(arg2).type
TypeError: Cannot interpret 'MultiPolygonDtype(float64)' as a data type
Maybe Just Portland
There's a site called zipdatamaps that has listings of zip codes (among other things) which I'll use to get the zip codes for Portland, Oregon. I'm going to use pandas' read_html function which also requires you to install lxml.
According to the pandas documentation you can't use https, but that seems to give me a 403 (Forbidden) error so I'll pull the HTML with requests first instead of having pandas pull it directly. The table also has a title above the column headers so we have to skip the first row to avoid a MultiIndex (or fix it later).
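A sketch of that workaround (the URL is a placeholder; requests and pandas are already in the imports above):

URL = "https://www.zipdatamaps.com/..."  # placeholder for the Portland listing page
response = requests.get(URL)

# skip the title row so the column headers come through correctly
tables = pandas.read_html(response.text, skiprows=1)
zips = tables[0]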
ZIP Code ZIP Code Name Population Type
97034 Lake Oswego 18905 Non-Unique
0 97035.0 Lake Oswego 23912.0 Non-Unique
1 97080.0 Gresham 40888.0 Non-Unique
2 97086.0 Happy Valley 26010.0 Non-Unique
3 97201.0 Portland 15484.0 Non-Unique
4 97202.0 Portland 38762.0 Non-Unique
So we still have a problem in that it used the first zip-code as part of the header… I'll just pull the row out and add it back in. One thing to note is that the header values are all strings so to be able to append the row we'll have to do some conversion.
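A sketch of that fix (the values are read off the table above, and the dtype juggling is an assumption):

# rebuild the row that got absorbed into the header and tack it back on
first_row = pandas.DataFrame(
    [[97034, "Lake Oswego", 18905, "Non-Unique"]],
    columns=zips.columns)
zips = pandas.concat([zips, first_row], ignore_index=True)

# zip codes as strings (to keep any leading zeros), population back to integers
zips["ZIP Code"] = zips["ZIP Code"].astype(int).astype(str)
zips["Population"] = zips["Population"].astype(int)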
ZIP Code ZIP Code Name Population Type
0 97035 Lake Oswego 23912 Non-Unique
1 97080 Gresham 40888 Non-Unique
2 97086 Happy Valley 26010 Non-Unique
3 97201 Portland 15484 Non-Unique
4 97202 Portland 38762 Non-Unique
I converted the zip codes into strings instead of integers because there are zip-codes with leading zeros, although not in Portland so I guess it could go either way.
I don't know why but when setting the hover columns for the plot if you put the zip-code column first it doesn't show up but it does if you put it second. Mysterious.
plot = plotter.hvplot(
    hover_cols=["Population", ZIPS_COLUMN],
    legend=False
).opts(
    title="Portland by Zip Code",
    width=700,
    height=700,
    fontscale=2,
    xaxis=None,
    yaxis=None,
    colorbar=False,
)

outcome = Embed(plot=plot, file_name="portland_zip_codes")()
ZIP Code 97824
GEOID10 97824
CLASSFP10 B5
MTFCC10 G6350
FUNCSTAT10 S
ALAND10 565896585
AWATER10 20851
INTPTLAT10 +45.3543012
INTPTLON10 -117.7564700
geometry POLYGON ((-117.993812 45.369603, -117.993632 4...
City Cove
County Union
Type Standard
Name: 0, dtype: object
This turns out to work, but the file it creates is 29 Megabytes, so maybe not a great idea to use it with holoviews. I'll just do a regular PNG with no annotations.
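Here's a sketch of the matplotlib fallback (the variable holding the merged state-wide frame isn't shown above, so oregon_zips, the figure size, and the file name are all assumptions):

# a plain static map of the state-wide zip-code geometries
figure, axe = pyplot.subplots(figsize=(12, 12))
oregon_zips.plot(ax=axe)
axe.set_axis_off()
figure.savefig("oregon_zip_codes.png")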
counties="Clackamas|Multnomah|Washington"portland=oregon[[ZIPS_COLUMN,"City","County"]]portland=portland[portland.County.str.contains(counties,regex=True)]portland=pandas.merge(portland,frame,on=ZIPS_COLUMN,how="left")portland=pandas.merge(portland,zips,on=ZIPS_COLUMN,how="left")portland=geopandas.GeoDataFrame(portland)plot=portland.hvplot(hover_cols=["Population","City","County",ZIPS_COLUMN],legend=False).opts(title="Clackamas, Multnomah, and Washington Counties",width=800,height=700,fontscale=2,xaxis=None,yaxis=None,colorbar=False,)outcome=Embed(plot=plot,file_name="portland_with_city")()
print(outcome)
This one looks a little better. The gray areas are the cities that weren't in the first zip-code set. I guess they only count Portland as Portland, not the Portland Metropolitan area altogether.
It's kind of surprising that the zip code with the highest population is on the West side (the darkest blue area). I guess it's because it encompasses a larger area than the ones further east (Rock Creek, Cedar Mill, and Bethany, according to Google). It's odd, though, that some of the cities that come up (like Molalla, the furthest south) are listed on the Portland Metro list of cities; maybe now there are too many cities.
Okay, I just checked out the metro's maps page and it looks like the metro area does cut through the outer counties instead of just taking them all in. To get just the metro area would take more work.
End
Well, that was a little harder than I thought it would be. The main thing to remember, I suppose, is that the maps quickly grow too big for holoviews, so if you want to do an overview it's better to do it in matplotlib and save the interactivity for a smaller section.
Map of All ZIP Codes in Portland, Oregon - Updated March 2021 [Internet]. Zipdatamaps.com. [cited 2021 Mar 7]. Available from: https://www.zipdatamaps.com/
To get the \(t^*_{18}\) value we can use scipy's t module, which has a ppf (percent point function) that will get it for us. We're calculating a 95% confidence interval, so the total area outside the interval is 1 - 0.95; since ppf works with a single tail we split that area across the two tails, \(\frac{1 - 0.95}{2}\), and use it (or rather its complement) when calling ppf.
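Here's a sketch of the calculation, with the mean (4.4), standard error (0.53), and degrees of freedom (18) pulled from the surrounding text, so the rounding may differ slightly from the output below:

from scipy import stats

CONFIDENCE = 0.95
DEGREES_OF_FREEDOM = 18

# ppf wants the cumulative probability up to t*, so pass the complement of the
# single-tail area (1 - 0.95)/2
t_star = stats.t.ppf(1 - (1 - CONFIDENCE)/2, DEGREES_OF_FREEDOM)

mean, standard_error = 4.4, 0.53
lower = mean - t_star * standard_error
upper = mean + t_star * standard_error
print(f"Confidence Interval = {mean} +- {t_star:.2f} x {standard_error}")
print(f"({lower:.3f}, {upper:.3f})")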
Confidence Interval = 4.4 +- 2.10 x 0.53
(3.291, 5.509)
End
We can say with 95% confidence that the mean Methyl-Mercury level for Pilot Whale red meat in Taiji is between 3.291 and 5.509 \(\frac{\mu g}{\textrm{wet g}}\) (the FDA suggested limit is \(1.0 \frac{\mu g}{\textrm{wet g}}\)).
There are a couple of things to note. One is that although the original paper is behind a paywall (so I can't read it), the abstract is available through Research Gate (and Elsevier), and in it the mean is given as 5.9, not 4.4. In addition, the book that I got the values from (OpenIntro Statistics) had the wrong values (3.87, 4.93) when they calculated the confidence interval, so I don't know how accurate the starting numbers are. Also, if you poke around the web, you'll find that the mercury levels reported by different researchers vary quite a lot.
Endo T, Haraguchi K. High mercury levels in hair samples from residents of Taiji, a Japanese whaling town. Marine Pollution Bulletin. 2010 May 1;60(5):743-7.
Most textbooks explain correlation before linear regression, but the publications of Francis Galton and Karl Pearson indicate that Galton's work studying the inheritance of sweet pea characteristics led to his initial concept of linear regression first, and that the development of correlation and multiple regression came from later work by Galton and Pearson. So this paper gives a history of how Galton came up with linear regression, which instructors can use when introducing it.
Bullets
Galton came up with the concepts of linear regression and correlation (he wasn't the first, but this is about Galton and Pearson's discoveries) but Pearson developed it mathematically so it is commonly known as the Pearson Correlation Coefficient (or some variant of that), leading many students to think that Pearson came up with it (see also Auguste Bravais).
Galton's Question: How much influence does a generation have on the characteristics of the next?
The Sweet Peas
1875: Galton gave seven friends 700 sweet pea seeds. The friends planted the seeds, harvested the peas, and returned them to Galton.
Galton chose sweet peas because they can self-fertilize so he didn't have to deal with the influence of two parents.
Galton found that the median size of daughter seeds for each mother's seed size plotted a straight(ish) line with a positive slope less than one (0.33) - the first regression line
Galton: Slope < 1 shows "Regression Towards Mediocrity" - daughter's sizes are closer to the mean than parent's sizes are to the mean
If slope had been 1, then parent and child means were the same
If slope had been horizontal, then child didn't inherit size from parent
Slope between 0 and 1 meant that there was some influence from parent
Since he had 700 seeds and no calculator (and maybe for other reasons too) Galton used the median and semi-interquartile range \(\left( \frac{\textrm{Inter Quartile Range}}{2} \right)\) instead of mean and deviation
Galton estimated the regression line using the scatterplot, not by calculating the slope
Galton and Correlation
If the correlation for the characteristic between the parent and child is the same for different data sets but the slope is different, then the variance is what is causing the difference
The more difference there is in the variance, the steeper the slope of the line
Believed he had found that correlation between generations was a constant even for different characteristics (not something that is currently believed)
Although he was wrong about the correlation being constant, thinking about it led him to develop his ideas about regression and correlation
The equation he was working toward that Pearson eventually came up with was:
\begin{align} y &= mx \\ &= \left(\frac{S_y}{S_x} \right) x \end{align}
Where r is the correlation between the parent's and child's sizes, \(S_y\) is the sample standard deviation of the Daughter seed-sizes and \(S_x\) is the sample standard deviation of the Parent seed-sizes. There's an intercept term too, but the point of this form was to show how the spread of the two variables affects the slope. The less variance there is for the daughter seeds, the smaller the slope would be (of course the same is true of the correlation, but Galton thought this was a fixed value anyway).
Other Stuff
It also touches on the fact that Galton recognized that prior generations also affected the size of the daughter's seeds, giving rise to the idea of multiple regression. And there is a table of the original measurements that Galton gathered for the seeds.
So, what again?
Galton's Regression Towards Mediocrity - or Regression to the Mean as it's more commonly referred to nowadays - shows why humans don't split up into giants and little people. A person who is much larger than the mean might have a child that's also larger than the mean, but that child is likely to be closer to the mean than the parent was, just as parents that are smaller than the mean will tend to have children that are larger than they are.
Source
Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)