Word Embeddings: Visualizing the Embeddings

Extracting and Visualizing the Embeddings

In the previous post we built a Continuous Bag of Words (CBOW) model that predicts a word from its context vector, where each entry is the fraction of the surrounding window that a vocabulary word makes up (e.g. with a window of the four nearest words, a word that appears once in the window contributes 0.25). Now we're going to use the weights of the trained model as word embeddings and see if we can visualize them.
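
As a quick refresher, here's a minimal sketch of how such a context vector is built - this is a toy example with a made-up vocabulary, not the project's Batches class: the one-hot vectors of the window words are averaged, so each entry is the fraction of the window that a word accounts for.

import numpy

# hypothetical toy vocabulary and window, for illustration only
word_to_index = {"the": 0, "dog": 1, "chased": 2, "a": 3, "cat": 4}
window = ["the", "dog", "a", "cat"]  # the four words around the center word "chased"

# average the one-hot vectors of the window words
context = numpy.zeros(len(word_to_index))
for word in window:
    context[word_to_index[word]] += 1
context /= len(window)

print(context)
# expected: 0.25 for each window word, 0 for the center word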

Imports

# python
from argparse import Namespace
from functools import partial

# pypi
from sklearn.decomposition import PCA

import holoviews
import hvplot.pandas
import pandas

# this project
from neurotic.nlp.word_embeddings import (
    Batches,
    CBOW,
    DataCleaner,
    MetaData,
    TheTrainer,
    )
# my other stuff
from graeae import EmbedHoloviews, Timer

Set Up

cleaner = DataCleaner()
meta = MetaData(cleaner.processed)
TIMER = Timer(speak=False)
SLUG = "word-embeddings-visualizing-the-embeddings"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{SLUG}")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
)
hidden_layer = 50
half_window = 2
batch_size = 128
repetitions = 250
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layer, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

trainer = TheTrainer(model, batches, emit_point=50, verbose=True)
with TIMER:
    trainer()
2020-12-16 16:32:17,189 graeae.timers.timer start: Started: 2020-12-16 16:32:17.189213
50: loss=9.88889093658385
new learning rate: 0.0198
100: loss=9.138356897918037
150: loss=9.149555378031549
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
2020-12-16 16:32:37,403 graeae.timers.timer end: Ended: 2020-12-16 16:32:37.403860
2020-12-16 16:32:37,405 graeae.timers.timer end: Elapsed: 0:00:20.214647
250: loss=8.607763835003631
print(trainer.best_loss)
8.186490214727549

Middle

Set It Up

We're going to form the embeddings by averaging the two weight matrices, transposing the input weights first so that both matrices have one row per vocabulary word and one column per hidden unit.

embeddings = (trainer.best_weights.input_weights.T
              + trainer.best_weights.hidden_weights)/2
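
Since the result should end up with one row per vocabulary word and one column per hidden unit, a quick sanity check (not in the original code) would be:

# sanity check (assumption: embeddings is vocabulary_size x hidden_layer)
assert embeddings.shape == (vocabulary_size, hidden_layer)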

And now our words.

words = ["king", "queen","lord","man", "woman","dog","wolf",
         "rich","happy","sad"]

Now we need to translate the words into their indices so we can grab the matching rows from the embedding matrix.

indices = [meta.word_to_index[word] for word in words]
X = embeddings[indices, :]
print(X.shape, indices) 
(10, 50) [2745, 3951, 2961, 3023, 5675, 1452, 5674, 4191, 2316, 4278]

There are ten rows to match our ten words and fifty columns to match the size chosen for the hidden layer.
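
Each row is one word's fifty-dimensional vector, so a single word can be looked up the same way (a hypothetical example, not in the original post):

# hypothetical single-word lookup
king_vector = embeddings[meta.word_to_index["king"]]
print(king_vector.shape)  # (50,)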

Visualizing

We're going to use sklearn's PCA for Principal Component Analysis. The n_components argument is the number of components it will keep - we'll keep two so we can plot the words in a plane.

pca = PCA(n_components=2)
reduced = pca.fit(X).transform(X)
pca_data = pandas.DataFrame(
    reduced,
    columns=["X", "Y"])

pca_data["Word"] = words
points = pca_data.hvplot.scatter(x="X",
                                 y="Y", color=Plot.red)
labels = pca_data.hvplot.labels(x="X", y="Y", text="Word", text_baseline="top")
plot = (points * labels).opts(
    title="PCA Embeddings",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="embeddings_pca")()
print(outcome)

Figure Missing: "PCA Embeddings" - a scatter plot of the ten words projected onto their first two principal components.

Well, that's pretty horrible. Might need work.
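
One possible diagnostic (not run in the original post) is to check how much of the variance in the fifty-dimensional embeddings the two kept components actually capture - if the fraction is small, the flat scatter is throwing away most of the structure.

# how much variance do the two kept components explain? (diagnostic, not run in the original)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())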

End

This is the final post in the series looking at using a Continuous Bag of Words model to create word embeddings. Here are the other posts.