PCA Exploration

Beginning

In this post I'm going to walk through the lab from Coursera's NLP Specialization that looks at Principal Component Analysis (PCA), which we'll use later on for Dimensionality Reduction. While PCA can be used as a black box, it's useful to have an intuitive understanding of what it's doing, so we'll work through a couple of simplified examples and pick apart a little of what's going on.

Imports

Just the usual suspects.

# python
from argparse import Namespace
from functools import partial

import math
import random

# pypi
from numpy.random import default_rng
from sklearn.decomposition import PCA

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews

Set Up

Plotting

This is a little bit of convenience code for the HoloViews plotting.

SLUG = "pca-exploration"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}")

Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    color_cycle=holoviews.Cycle(["#4687b7", "#ce7b6d"])
)

Randomness

numpy's default_rng creates a random-number generator. The only argument it takes is the seed, which I'll set to 0.

numpy_random = default_rng(0)

Middle

Example One: Random Uniform Data

This first example uses data that falls on a straight line, so we can see in the most basic case what PCA does to it.

The Dataset

To start I'll create a dataset generated by numpy's uniform function, which takes three arguments - low, high, and size.

correlation = 1
x = numpy_random.uniform(1, 2, 1000)
y = x.copy()

Since \(x=y\) we're going to end up with a line segment at a \(45^\circ\) angle.
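As a quick sanity check (a small sketch, not part of the original lab), the correlation between x and y should be perfect:

# x and y are copies of each other, so the correlation should be 1
# (or numerically indistinguishable from it)
print(numpy.corrcoef(x, y)[0, 1])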

Center It

For PCA it's generally recommended that you center the data by subtracting the mean.

x -= numpy.mean(x)
y -= numpy.mean(y)

Note: according to sklearn's PCA documentation it centers the data for you, so this is probably an unnecessary step.
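To check that claim (a hedged sketch that uses its own generator so it doesn't disturb the seeded one used elsewhere in this post), fitting PCA to the same data with and without centering should give essentially the same components:

# hypothetical sanity check: sklearn should report (essentially) the same
# components whether or not we center the data first
check_rng = default_rng(42)
check_x = check_rng.uniform(1, 2, 100)
check = pandas.DataFrame(dict(x=check_x, y=check_x))
print(PCA(n_components=2).fit(check).components_)
print(PCA(n_components=2).fit(check - check.mean()).components_)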

data = pandas.DataFrame(dict(x=x, y=y))

The PCA Transformation

We're going to use sklearn's PCA for Principal Component Analysis. The n_components argument is the number of components it will keep - we'll keep 2.

pca = PCA(n_components=2)

Now fit it to the data.

transformation_model = pca.fit(data)

And then transform it.

pca_data = pandas.DataFrame(
    transformation_model.transform(data),
    columns=["Principal Component 1", "Principal Component 2"])

Plot the Transformation

original = data.hvplot.scatter(x="x", y="y", color=Plot.blue)
transformed = pca_data.hvplot.scatter(x="Principal Component 1", y="Principal Component 2", color=Plot.red)
plot = (original * transformed).opts(
    title="Correlated and PCA Data",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
)

outcome = Embed(plot=plot, file_name="correlated_data")()
print(outcome)

Figure Missing

Our blue line is the original data and the red line is the transformed data. So it looks like the PCA transform rotates the line to a horizontal one.
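One way to confirm that (a quick check, not from the lab): after the transform all of the variation should lie along the first component, so the second column should be essentially zero.

# the largest absolute value in the second principal component
# should be vanishingly small
print(pca_data["Principal Component 2"].abs().max())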

Understanding the Model

Now that we have the model we can look at the Eigenvalues and Eigenvectors that it created to do the transformation.

  • The Eigenvectors (the principal components).
    print(transformation_model.components_)
    
    [[ 0.70710678  0.70710678]
     [-0.70710678  0.70710678]]
    

    The numbers look a little inscrutable at first, but what you need to know is that it's a rotation matrix.

    \[ R = \begin{bmatrix} \cos(45^\circ) & \sin(45^\circ)\\ -\sin(45^\circ) & \cos(45^\circ)\\ \end{bmatrix} \]

    And since our line is at a \(45^\circ\) angle, the values in the Eigenvectors are the sin and cos of \(45^\circ\) that are used to rotate the line flat.

    print(math.cos(math.radians(45)))
    print(math.sin(math.radians(45)))
    
    0.7071067811865476
    0.7071067811865475
    
  • The Eigenvalues (explained variance).

    Also part of the model are the eigenvalues which give the amount of variance explained by each of the components.

    print(transformation_model.explained_variance_)
    
    [1.59912782e-01 7.31437644e-33]
    

    So, what does that mean? Start with the fact that the equation for variance of a uniform distribution is:

    \[ Var = \frac{(b - a)^2}{12} \]

    And remember that when we called the uniform function we set b to 2 and a to 1, so we get:

    print((2 - 1)**2/12)
    
    0.08333333333333333
    

    If you look at the Eigenvalues we got, the second term is \(7 \times 10^{-33}\), which is pretty much zero, and the first term is about 0.16, so what we have here is:

    \begin{align} Var &= \langle Var(x) + Var(y), 0\rangle\\ &= \langle 0.083 + 0.083, 0 \rangle\\ &= \langle 0.16, 0 \rangle\\ \end{align}

    The sum actually rounds to 0.167, but it's close enough - the point is that the first component contributes essentially all of the variance and the second contributes none (the explained-variance ratio shown below makes the same point).
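sklearn also exposes this as a ratio, which makes the same point a little more directly (a small check, not part of the original lab):

# the first component should account for (essentially) all of the variance
print(transformation_model.explained_variance_ratio_)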

Example Two: Normal Random Data

Now we'll move onto normally-distributed data so we can see something a little more interesting.

Generate the Data

Now we'll use numpy's normal function to generate the data. The three arguments it takes are loc (the mean), scale (the standard deviation), and size (the number of values to generate).

standard_deviation_1 = 1
standard_deviation_2 = 0.333
points = 10**3

x = numpy_random.normal(0, standard_deviation_1, points)
y = numpy_random.normal(0, standard_deviation_2, points)

Even though we specify that the mean is 0, because the data is generated randomly the sample mean isn't exactly zero, so we'll center it.

print(f"x mean start: {x.mean()}")
print(f"y mean start: {y.mean()}")
x = x - x.mean()
y = y - y.mean()

print(f"\nx mean: {x.mean()}")
print(f"y mean: {y.mean()}")
x mean start: -0.012000607736595292
y mean start: -0.01409218413437418

x mean: 3.552713678800501e-18
y mean: 2.6645352591003758e-18

Plot It

And now a plot to show the data.

data = pandas.DataFrame(dict(x=x, y=y))
plot = data.hvplot.scatter(x="x", y="y").opts(
    title="Random Normal Data",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.blue,
)
outcome = Embed(plot=plot, file_name="random_normal_data")()
print(outcome)

Figure Missing

As you can see, the data is pretty uncorrelated so we're going to rotate it to make it a little less of a blob.
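We can also look at the correlation directly (a quick check, not from the lab); the off-diagonal entries should be near zero since x and y were generated independently:

# the off-diagonal correlation should be close to zero
print(data.corr())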

Rotate The Data

Now we're going to put the x and y data into a matrix and rotate it.

covariance = 1
in_degrees = 45
angle = math.radians(in_degrees)
print(f"angle: {math.degrees(angle)}\n")

rotation_matrix = numpy.array([[numpy.cos(angle), numpy.sin(angle)],
                               [-numpy.sin(angle), numpy.cos(angle)]])
print(rotation_matrix)
angle: 45.0

[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]

You might notice that this is the same rotation matrix that we had before with the sklearn eigenvectors, so we could have used that, but this is how you would roll your own.
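If you did want to roll your own for other angles, a small helper (a hypothetical sketch, not part of the lab) might look like this:

def rotation_matrix_for(degrees: float) -> numpy.ndarray:
    """Build the 2D rotation matrix for an angle given in degrees."""
    theta = math.radians(degrees)
    return numpy.array([[numpy.cos(theta), numpy.sin(theta)],
                        [-numpy.sin(theta), numpy.cos(theta)]])

# it should reproduce the matrix we built by hand above
assert numpy.allclose(rotation_matrix_for(45), rotation_matrix)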

Now we can apply the rotation by taking the dot-product between the data array and the rotation-matrix.

rotated = data.dot(rotation_matrix)
rotated.columns = ["x", "y"]

Plot The Rotated Data

To get a sense of what our transformation did we can plot it, along with the axes created by the rotation matrix, so we can see how they're related. The first thing to do is unpack the axes contained within the rotation matrix. We'll also scale the axes by the standard deviation we used along each of the original axes to see how that relates to the shape of the data.

FIRST_ROW, SECOND_ROW = 0, 1
FIRST_COLUMN, SECOND_COLUMN = 0, 1
ORIGIN = [0, 0]
SCALAR = 3
FIRST_SPREAD, SECOND_SPREAD = (standard_deviation_1 * SCALAR,
                               standard_deviation_2 * SCALAR)
COLUMNS = "x y".split()

first_axis = pandas.DataFrame([
    ORIGIN,
    rotation_matrix[FIRST_ROW][:]],
                              columns=COLUMNS)
first_axis *= FIRST_SPREAD


second_axis = pandas.DataFrame([
    ORIGIN,
    rotation_matrix[SECOND_ROW][:]],
                               columns=COLUMNS)
second_axis *= SECOND_SPREAD
first_axis_plot = first_axis.hvplot(x="x", y="y", color="red")
second_axis_plot = second_axis.hvplot(x="x", y="y", color="orange")
rotated_plot = rotated.hvplot.scatter(x="x", y="y", color=Plot.blue)

plot = (rotated_plot * first_axis_plot * second_axis_plot).opts(
    title="Rotated Normal Data",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="rotated_normal_data")()
print(outcome)

Figure Missing

So our data is now grouped around a 45-degree angle and spread further along the axis that had more variance.
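We can see the same thing numerically (a hedged check, not from the lab): the rotation introduces covariance between x and y that the original data didn't have.

# the off-diagonal covariance should now be clearly non-zero
print(rotated.cov())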

Apply the PCA

pca = PCA(n_components=2)
fitted = pca.fit(rotated)

Once again, the Eigenvectors (the transformation matrix).

print(fitted.components_)
[[-0.70844626 -0.70576476]
 [-0.70576476  0.70844626]]

And then the Eigenvalues (the variance).

variance = fitted.explained_variance_
print(variance)
[1.05270169 0.10604603]

Now we apply the PCA transformation.

pca_data = fitted.transform(rotated)
pca_data = pandas.DataFrame(pca_data, columns="x y".split())

Plot the PCA Transformed Data

We're going to plot the rotated and the transformed data along with the axes for the rotated data so we can see how they relate.

transformed = pca_data.hvplot.scatter(x="x", y="y", color=Plot.red, fill_alpha=0)
rotated_plot = rotated.hvplot.scatter(x="x", y="y", color=Plot.blue, fill_alpha=0)
first_axis_plot = first_axis.hvplot(x="x", y="y", color="red")
second_axis_plot = second_axis.hvplot(x="x", y="y", color="orange")

plot = (transformed * rotated_plot * first_axis_plot * second_axis_plot).opts(
    title="PCA of Random Normal Data",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale
)

outcome = Embed(plot=plot, file_name="pca_random_normal")()
print(outcome)

Figure Missing

Looking at the model

  • The rotation matrix took the original uncorrelated variables and transformed them into correlated variables (the blue circles).
  • Fitting the PCA to our correlated data finds the rotation matrix that was used to create the blue points.
  • Applying the PCA transformation undoes the rotation (but the spread doesn't return).

Our original standard deviations were 1 and 0.333, and the Explained Variance is roughly those standard deviations squared - taking its square root gets us back to approximately the values we started with.

print(numpy.sqrt(variance))
[0.99140088 0.32958007]
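To back up the point that the PCA fit recovered the rotation matrix (a rough check, since the components are only determined up to sign and ordering), the magnitudes of the fitted components should be close to the magnitudes of the rotation matrix we used:

# up to sign, the fitted components approximate the rotation matrix
print(numpy.abs(fitted.components_))
print(numpy.abs(rotation_matrix))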

Dimensionality Reduction

The previous sections were meant to build an understanding of what PCA is doing, but to use PCA for visualization we reduce the number of dimensions of a data set so that it can be plotted. We can get a sense of how that works here by looking at our rotated data set with either the entire y column set to 0 or the entire x column set to 0.

first_component = rotated.copy()
first_component["y"] = 0
second_component = rotated.copy()
second_component["x"] = 0

original = rotated.hvplot.scatter(x="x", y="y", color=Plot.tan,
                                  fill_alpha=0)
first = first_component.hvplot.scatter(x="x", y="y",
                                       color=Plot.blue, fill_alpha=0)
second = second_component.hvplot.scatter(x="x", y="y",
                                         color=Plot.red, fill_alpha=0)

plot = (original * first * second).opts(
    title="Data Decomposition",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale
)
outcome = Embed(plot=plot, file_name="data_decomposed")()
print(outcome)

Figure Missing

This is only a teaser for an actual dimensionality reduction.
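For completeness, an actual reduction (a minimal sketch, not part of the lab) would just ask PCA for a single component, collapsing the two-dimensional data down to one:

# keep only the first principal component of the rotated data
reducer = PCA(n_components=1)
reduced = reducer.fit_transform(rotated)
print(reduced.shape)
print(reducer.explained_variance_ratio_)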

End

This is a walk-through of a lab for Coursera's NLP Specialization.