Galton, Pearson, and the Peas
Table of Contents
Abstract Summary
Most textbooks explain correlation before linear regression, but the publications of Francis Galton and Karl Pearson indicate that Galton's work studying the inheritance of sweet pea characteristics lead to his initial concept of linear regression first and the development of correlation and multiple regression came from later work by Galton and Pearson. So this paper gives a history of how Galton came up with linear regression which instructors can use to introduce linear regression.
Bullets
- Galton came up with the concepts of linear regression and correlation (he wasn't the first, but this is about Galton and Pearson's discoveries) but Pearson developed it mathematically so it is commonly known as the Pearson Correlation Coefficient (or some variant of that), leading many students to think that Pearson came up with it (see also Auguste Bravais).
- Galton's Question: How much influence does a generation have on the characteristics of the next?
The Sweet Peas
- 1875: Galton gave seven friends 700 sweet pea seeds. The friends planted the seeds, harvested the peas, and returned them to Galton.
- Galton chose sweet peas because they can self-fertilize so he didn't have to deal with the influence of two parents.
- Galton found that the median size of daughter seeds for each mother's seed size plotted a straight(ish) line with a positive slope less than one (0.33) - the first regression line
- Galton: Slope < 1 shows "Regression Towards Mediocrity" - daughter's sizes are closer to the mean than parent's sizes are to the mean
- If slope had been 1, then parent and child means were the same
- If slope had been horizontal, then child didn't inherit size from parent
- Slope between 0 and 1 meant that there was some influence from parent
- Since he had 700 seeds and no calculator (and maybe for other reasons too) Galton used the median and semi-interquartile range \(\left( \frac{\textrm{Inter Quartile Range}}{2} \right)\) instead of mean and deviation
- Galton estimated the regression line using the scatterplot, not by calculating the slope
Galton and Correlation
- If the correlation for the characteristic between the parent and child are the same for different data but the slope is different then the variance is what is causing the difference
- The more difference there is in the variance, the steeper the slope of the line
- Believed he had found that correlation between generations was a constant even for different characteristics (not something that is currently believed)
- Although he was wrong about the correlation being constant, thinking about it led him to develop his ideas about regression and correlation
The equation he was working toward that Pearson eventually came up with was:
\begin{align} y &= mx \\ &= \left(\frac{S_y}{S_x} \right) x \end{align}Where r is the correlation between the parent and child's size, \(S_y\) is the sample standard deviation of the Daughter seed-sizes and \(S_x\) is the sample standard deviation of the Parent seed-sizes. There's a slope intercept too, but the point of it was to show how the spread for the two data variables affects the slope. The less variance there is for the daughter seeds, the smaller the slope would be (of course the same is true of the correlation, but Galton thought this was a fixed value anyway).
Other Stuff
It also touches on the fact that Galton recognized that prior generations also affected the size of the daughter's seeds, giving rise to the idea of multiple regression. And there is a table of the original measurements that Galton gathered for the seeds.
So, what again?
Galton's Regression Towards Mediocrity - or Regression to the Mean as it's more commonly referred to nowadays shows why humans don't split up into giants and little people. A person who is much larger than the mean might have a child that's also larger than the mean, but that child is likely to be closer to the mean than the parent was, just as parents that are smaller than the mean will tend to have children that are larger than they are.
Source
- Stanton JM. Galton, Pearson, and the peas: A brief history of linear regression for statistics instructors. Journal of Statistics Education. 2001 Jan 1;9(3). (Link)