Word Embeddings
Beginning
This is a walk-through of a lab from week 3 of Coursera's Natural Language Processing course. It uses some pre-trained word embeddings to develop a sense of how to work with them.
Set Up
Imports
# python
from functools import partial
from pathlib import Path
import os
import pickle
# pypi
from dotenv import load_dotenv
from expects import (
equal,
expect,
)
import hvplot.pandas
import numpy
import pandas
# my stuff
from graeae import EmbedHoloviews
Plotting
load_dotenv("posts/nlp/.env")
SLUG = "word-embeddings"
Embed = partial(EmbedHoloviews, folder_path=f"files/posts/nlp/{SLUG}")
plot_path = Path(os.environ["TWITTER_PLOT"])
assert plot_path.is_file()
with plot_path.open("rb") as reader:
    Plot = pickle.load(reader)
The Embeddings
As I mentioned above, I'm going to use pre-trained word embeddings that have been pickled, so I'll load them here.
path = Path(os.environ["WORD_EMBEDDINGS"])
with path.open("rb") as reader:
    embeddings = pickle.load(reader)
expect(len(embeddings)).to(equal(243))
Middle
Inspecting the Embeddings
The embeddings object is a dictionary mapping each word to the word-vector that represents it. Here are the first five words.
print(type(embeddings))
print(list(embeddings.keys())[:5])
<class 'dict'>
['country', 'city', 'China', 'Iraq', 'oil']
vector = embeddings["country"]
print(type(vector))
print(vector.shape)
<class 'numpy.ndarray'>
(300,)
Each word-embedding vector has 300 entries.
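Here's a quick sketch (using the expects helpers imported earlier) checking that every word in the dictionary has a vector with the same shape.
expect(all(vector.shape == (300,) for vector in embeddings.values())).to(equal(True))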
Plotting
Since there are 300 columns you can't easily visualize the embeddings without PCA or some other dimensionality-reduction method. This exercise is more about getting an intuition for how the linear algebra works, though, so instead we're going to take a subset of the words and keep only two of the columns so that we can plot them.
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
plot_data = pandas.DataFrame([embeddings[word] for word in words])
plot_columns = [3, 2]
plot_data = plot_data[plot_columns]
plot_data.columns = ["x", "y"]
plot_data["Word"] = words
origins = plot_data * 0
origins["Word"] = words
combined_plot_data = pandas.concat([origins, plot_data])
segment_plot = combined_plot_data.hvplot(x="x", y="y", by="Word")
scatter_plot = plot_data.hvplot.scatter(x="x", y="y", by="Word")
plot = (segment_plot * scatter_plot).opts(
title="Embeddings Columns 3 and 2",
width=Plot.width,
height=Plot.height,
fontscale=Plot.font_scale
)
outcome = Embed(plot=plot, file_name="embeddings_segments")()
print(outcome)
You can see that words like "village" and "town" are similar, while "city" and "oil" point in opposite directions, for whatever reason. Oddly, "joyful" and "country" are also very similar (although I'm only looking at two of the three hundred columns, so that might not hold once the other columns are taken into account).
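Out of curiosity, here's a sketch that checks a couple of those pairs across all 300 columns using the Euclidean distance (which comes up again below) rather than just the two plotted columns.
# a sketch: distances over all 300 columns, not just columns 3 and 2
for left, right in (("village", "town"), ("joyful", "country")):
    distance = numpy.linalg.norm(embeddings[left] - embeddings[right])
    print(f"{left} to {right}: {distance:0.3f}")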
Word Distance
This is supposed to be a visualization of the difference vectors between "sad" and "happy" and between "town" and "village". As far as I can tell, though, holoviews doesn't have an equivalent of matplotlib's arrow, which lets you draw an arrow from a base coordinate using the distance in each dimension, so this is a somewhat fake version where I plot the points directly. Oh, well.
words = ['sad', 'happy', 'town', 'village']
plot_data = pandas.DataFrame([embeddings[word] for word in words])
plot_data = plot_data[plot_columns]
plot_data.columns = ["x", "y"]
plot_data.index = words
This is the fake part: when you take the difference between two "points" you get a vector with its base at the origin, so you have to add the base point back in to move it away from the origin. But then all you're doing is undoing the subtraction, which gives you back the point you started with.
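Here's a quick sketch confirming that claim, using numpy.allclose since the entries are floats.
# (difference) + (base) just gets you back to the original point
shifted = (plot_data.loc["happy"] - plot_data.loc["sad"]) + plot_data.loc["sad"]
expect(bool(numpy.allclose(shifted, plot_data.loc["happy"]))).to(equal(True))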
difference = pandas.DataFrame([
plot_data.loc["happy"] - plot_data.loc["sad"] + plot_data.loc["sad"],
plot_data.loc["town"] - plot_data.loc["village"] + plot_data.loc["village"]
])
difference["Word"] = ["sad", "village"]
plot_data = plot_data.reset_index().rename(columns=dict(index="Word"))
difference = pandas.concat([difference,
plot_data[plot_data.Word=="sad"],
plot_data[plot_data.Word=="village"]])
with_origin = pandas.concat([origins[origins.Word.isin(words)], plot_data])
scatter = plot_data.hvplot.scatter(x="x", y="y", by="Word")
segments = with_origin.hvplot(x="x", y="y", by="Word")
distances = difference.hvplot(x="x", y="y", by="Word")
plot = (distances * segments * scatter).opts(
title="Vector Differences",
height=Plot.height,
width=Plot.width,
fontscale=Plot.font_scale,
)
outcome = Embed(plot=plot, file_name="vector_differences")()
print(outcome)
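For reference, here's a rough sketch of what the arrow version might look like using matplotlib's pyplot.arrow, which takes a base coordinate plus the distance in each dimension. It assumes matplotlib is installed (it isn't among the imports above), and the arrow styling and output file name are just placeholders.
# a sketch of the matplotlib version mentioned above, not part of the lab
from matplotlib import pyplot

figure, axis = pyplot.subplots()
for start, end in (("sad", "happy"), ("village", "town")):
    base = plot_data[plot_data.Word==start].iloc[0]
    tip = plot_data[plot_data.Word==end].iloc[0]
    # arrow takes the base coordinates and the per-dimension distances
    axis.arrow(base.x, base.y, tip.x - base.x, tip.y - base.y,
               head_width=0.005, length_includes_head=True)
axis.set_title("Vector Differences (matplotlib sketch)")
figure.savefig("vector_differences_matplotlib.png")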
Linear Algebra on Word Embeddings
The Norm
First I'll check the norm of some word vectors using numpy.linalg.norm. This calculates the Euclidean length of a vector (its distance from the origin), although, oddly, we won't end up using it here.
print(numpy.linalg.norm(embeddings["town"]))
print(numpy.linalg.norm(embeddings["sad"]))
2.3858097
2.9004838
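To tie those numbers back to the definition, here's a quick sketch checking that the norm matches the square root of the sum of the squared entries.
# the norm is the square root of the sum of the squared entries
by_hand = numpy.sqrt((embeddings["town"]**2).sum())
expect(bool(numpy.isclose(by_hand, numpy.linalg.norm(embeddings["town"])))).to(equal(True))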
Predicting Capitals
Here we'll see how to use the embeddings to predict what country a city is the capital of. To encode the concept of "capital" into a vector we'll use the difference between a specific country and its real capital (in this case France and Paris).
capital = embeddings["France"] - embeddings["Paris"]
Now that we have the concept of a capital encoded as a word embedding we can add it to the embedding of "Madrid" to get a vector near where "Spain" would be. Note that although there is a "Spain" in the embeddings we're going to use this to see if we can find it without knowing that Madrid is the capital of Spain.
country = embeddings["Madrid"] + capital
To make a prediction we have to find the embeddings that are closest to a country. We're going to convert the embeddings to a pandas DataFrame and since our embeddings are a dictionary of arrays we'll have to do a little unpacking first.
keys = embeddings.keys()
embeddings = pandas.DataFrame([embeddings[key] for key in keys], index=keys)
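As a quick sanity check, the DataFrame should have one row for each of the 243 words and one column for each of the 300 entries.
expect(embeddings.shape).to(equal((243, 300)))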
Now we'll make a function to find the closest embedding for a given word vector. It compares squared Euclidean distances, which are minimized by the same word as the actual distances, so there's no need for the square root (or the norm).
def closest_word(vector: numpy.ndarray) -> str:
    """Find the word closest to a given vector

    Args:
     vector: the vector to match

    Returns:
     name of the closest embedding
    """
    differences = embeddings - vector
    expect(differences.shape).to(equal(embeddings.shape))
    distances = (differences**2).sum(axis="columns")
    expect(distances.shape).to(equal((len(differences),)))
    return embeddings.iloc[numpy.argmin(distances)].name
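As a quick sanity check, the closest word to a vector that's already in the table should be the word itself.
expect(closest_word(embeddings.loc["oil"])).to(equal("oil"))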
Now we can check what word most closely matches Madrid + (France - Paris).
print(closest_word(country))
Spain
Like magic.
More Countries
What happens if we use a different known country and its capital instead of France and Paris?
print(closest_word(embeddings.loc['Italy'] - embeddings.loc['Rome']
+ embeddings.loc['Madrid']))
Spain
So swapping the capital derivation didn't change the prediction. Now we'll go back to using France - Paris but try different cities.
for word in "Tokyo Moscow".split():
    print(f"{word} is the capital of {closest_word(embeddings.loc[word] + capital)}")
Tokyo is the capital of Japan
Moscow is the capital of Russia
That seems to be working, but here's a case where our search fails.
print(closest_word(embeddings.loc['Lisbon'] + capital))
Lisbon
For some reason "Lisbon" is closer to itself than to "Portugal". I tried it with Germany and Italy instead of France as the template for the capital, but it still didn't work. If you try random cities from the embeddings you'll see that a fair number of them fail.
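One way to dig into failures like this is to look at the several nearest words instead of just the single closest one. Here's a sketch of a hypothetical closest_words helper along those lines (it isn't part of the lab).
def closest_words(vector: numpy.ndarray, count: int=5) -> list:
    """Find the words nearest to a given vector

    Args:
     vector: the vector to match
     count: how many nearby words to return

    Returns:
     list of the nearest words, closest first
    """
    distances = ((embeddings - vector)**2).sum(axis="columns")
    return distances.nsmallest(count).index.tolist()

print(closest_words(embeddings.loc["Lisbon"] + capital))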
Sentence Vectors
To use this for sentences you stack the vectors for the words into a matrix and then sum each column, which gets you back to a single 300-entry vector.
sentence = "Canada oil city town".split()
vectors = [embeddings.loc[token] for token in sentence]
summed = numpy.sum(vectors, axis=0)
print(closest_word(summed))
city
Not exciting, but that's how you do it.
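As an aside, since the embeddings are now in a DataFrame you can skip the list comprehension and let pandas do the summing. This is just a sketch and assumes every token is in the vocabulary (a missing token would raise a KeyError); it should print the same word as above.
# sum down the rows (one row per word) to get a single 300-entry vector
print(closest_word(embeddings.loc[sentence].sum(axis="index")))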