Implementing k-Nearest Neighbors for Machine Translation

Beginning

This continues from the post where we found the transformation matrix. It's part of a series of posts whose links are gathered in the Machine Translation post.

Imports

# pypi
import numpy

# my stuff
from graeae import Timer
from neurotic.nlp.word_embeddings.english_french import TrainingData
from neurotic.nlp.word_embeddings.training import TheTrainer
TIMER = Timer()

Middle

Testing the translation

k-Nearest neighbors algorithm

  • The k-Nearest neighbors algorithm is a method which takes a vector as input and finds the other vectors in the dataset that are closest to it.
  • The 'k' is the number of "nearest neighbors" to find (e.g. k=2 finds the closest two neighbors).
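
To make the idea concrete, here is a minimal, made-up sketch of brute-force k-NN using plain Euclidean distance (the implementation later in this post uses cosine similarity instead); the points and query are invented for illustration.

# toy example: find the k=2 points closest to a query vector
points = numpy.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.2, 0.8]])
query = numpy.array([1.0, 0.9])

# Euclidean distance from the query to every point
distances = numpy.linalg.norm(points - query, axis=1)

# argsort orders indices from smallest to largest distance,
# so the first k indices are the k nearest neighbors
k = 2
nearest = numpy.argsort(distances)[:k]
print(points[nearest])  # the two closest points: [1., 1.] and [1.2, 0.8]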

Searching for the translation embedding

Since we're approximating the translation function from English to French embeddings with a linear transformation matrix \(\mathbf{R}\), most of the time we won't get the exact embedding of a French word when we transform embedding \(\mathbf{e}\) of some particular English word into the French embedding space.

This is where k-NN becomes really useful! By using 1-NN with \(\mathbf{eR}\) as input, we can search for an embedding \(\mathbf{f}\) (as a row) in the matrix \(\mathbf{Y}\) which is the closest to the transformed vector \(\mathbf{eR}\).
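
As a rough sketch (not the final implementation), translating a single English word might look like the snippet below; e, R, and Y are assumed to already exist as one English embedding, the learned transform, and the French embedding matrix, and the scoring uses the cosine similarity defined in the next section.

# assumed to exist: e (one English embedding), R (transform), Y (French embeddings)
transformed = numpy.dot(e, R)

# score every French embedding (row of Y) by its cosine similarity to eR
scores = numpy.dot(Y, transformed) / (numpy.linalg.norm(Y, axis=1)
                                      * numpy.linalg.norm(transformed))

# 1-NN: the predicted French word is the row with the highest score
predicted_index = numpy.argmax(scores)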

Cosine similarity

The cosine similarity between vectors u and v is calculated as the cosine of the angle between them.

The formula is:

\[ \cos(u,v)=\frac{u\cdot v}{\left\|u\right\|\left\|v\right\|} \]

  • \(\cos(u,v) = 1\) when u and v lie on the same line and have the same direction.
  • \(\cos(u,v) = -1\) when they have exactly opposite directions.
  • \(\cos(u,v) = 0\) when the vectors are orthogonal (perpendicular) to each other.

Note: Distance and similarity are pretty much opposite things.

  • We can obtain a distance metric from cosine similarity, but cosine similarity can't be used directly as a distance metric.
  • When the cosine similarity increases (towards 1), the "distance" between the two vectors decreases (towards 0).
  • We can define the cosine distance between u and v as

\[ d_{\text{cos}}(u,v)=1-\cos(u,v) \]
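
As a quick worked example with two made-up vectors \(u = (1, 0, 1)\) and \(v = (2, 0, 1)\):

\[ \cos(u,v) = \frac{(1)(2) + (0)(0) + (1)(1)}{\sqrt{2}\,\sqrt{5}} = \frac{3}{\sqrt{10}} \approx 0.949, \qquad d_{\text{cos}}(u,v) \approx 1 - 0.949 = 0.051 \]

so these nearly parallel vectors have a high similarity and a small cosine distance.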

The Cosine Similarity

def cosine_similarity(vector_1: numpy.ndarray, vector_2: numpy.ndarray) -> float:
    """Calculates the similarity between two vectors

    Args:
     vector_1: array to compare
     vector_2: array to compare to vector_1

    Returns:
     cosine similarity between the two vectors
    """
    return numpy.dot(vector_1, vector_2)/(numpy.linalg.norm(vector_1) *
                                          numpy.linalg.norm(vector_2))
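
A quick sanity check of the three cases listed above (the vectors here are made up for illustration):

same_direction = cosine_similarity(numpy.array([1.0, 2.0]), numpy.array([2.0, 4.0]))
opposite = cosine_similarity(numpy.array([1.0, 2.0]), numpy.array([-1.0, -2.0]))
orthogonal = cosine_similarity(numpy.array([1.0, 0.0]), numpy.array([0.0, 1.0]))

print(same_direction)  # 1.0 (up to floating-point error)
print(opposite)        # -1.0
print(orthogonal)      # 0.0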

The nearest_neighbor() Function

Inputs:

  • Vector v
  • A set of candidate vectors (the possible nearest neighbors)
  • k, the number of nearest neighbors to find.
  • The distance metric should be based on cosine similarity.
  • The cosine_similarity function is already implemented and imported for you. Its arguments are two vectors and it returns the cosine of the angle between them.
  • Iterate over rows in candidates, and save the result of similarities between current row and vector v in a python list. Take care that similarities are in the same order as row vectors of candidates.
  • Now you can use numpy argsort to sort the indices for the rows of candidates.
  • Hints
    • numpy.argsort sorts values from most negative to most positive (smallest to largest)
    • The candidates that are nearest to v should have the highest cosine similarity
    • To get the last element of a list tmp, the notation is tmp[-1:]; more generally, tmp[-k:] gives the last k elements.
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1):
    """
    Input:
      - v, the vector you are going to find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # cosine similarity between v and each candidate vector,
    # kept in the same order as the rows of candidates
    similarity_l = [cosine_similarity(v, row) for row in candidates]

    # argsort orders the indices from least to most similar,
    # so the last k indices belong to the k nearest neighbors
    k_idx = numpy.argsort(similarity_l)[-k:]
    ### END CODE HERE ###
    return k_idx


Test your implementation:

v = numpy.array([1, 0, 1], dtype="float64")
candidates = numpy.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
expected = numpy.array([
    [9, 9, 9],
    [1, 0, 5],
    [2, 0, 1]])
actual = candidates[nearest_neighbor(v, candidates, 3)]
print(actual)
assert (actual == expected).all()
[[9 9 9]
 [1 0 5]
 [2 0 1]]

Test your translation and compute its accuracy

Complete the function test_vocabulary, which takes in the English embedding matrix X, the French embedding matrix Y, and the matrix R, and returns the accuracy of translating from X to Y using R.

  • Iterate over the transformed English word embeddings and check whether the closest French word vector belongs to the French word that is the actual translation.
  • Obtain an index of the closest French embedding by using nearest_neighbor (with argument k=1), and compare it to the index of the English embedding you have just transformed.
  • Keep track of the number of times you get the correct translation.
  • Calculate accuracy as

    \[ \text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})} \]

# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R):
    '''
    Input:
       X: a matrix where each row is an English word embedding.
       Y: a matrix where each row is the corresponding French word embedding.
       R: the transform matrix which translates word embeddings from
       English to French word vector space.
    Output:
       accuracy: fraction of English words whose nearest French embedding
       is the correct translation
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # The prediction is X times R
    pred = numpy.dot(X, R)

    # a translation is correct when the index of the nearest neighbor in Y
    # matches the row index of the transformed English embedding
    num_correct = sum(nearest_neighbor(row, Y, k=1)[0] == index
                      for index, row in enumerate(pred))

    # accuracy is the number correct divided by the number of rows in 'pred'
    # (also the number of rows in X)
    accuracy = num_correct/len(pred)
    ### END CODE HERE ###

    return accuracy
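
Before running on the real embeddings (which takes several minutes below), here's a quick, made-up sanity check: with an identity transform and identical English and French matrices, every word should be its own nearest neighbor, so the accuracy should come out to 1.0.

toy_X = numpy.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_Y = toy_X.copy()
identity = numpy.identity(2)

# every transformed row should find itself in toy_Y
print(test_vocabulary(toy_X, toy_Y, identity))  # expected: 1.0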

Let's see how your translation mechanism works on unseen data:

# X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)
data = TrainingData()
trainer = TheTrainer(data.x_train, data.y_train)
r = trainer.fit()

You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything.

with TIMER:
    acc = test_vocabulary(data.x_train, data.y_train, trainer.transformation)  # this might take a minute or two
2020-10-24 19:57:36,633 graeae.timers.timer start: Started: 2020-10-24 19:57:36.632998
2020-10-24 20:05:48,225 graeae.timers.timer end: Ended: 2020-10-24 20:05:48.225526
2020-10-24 20:05:48,226 graeae.timers.timer end: Elapsed: 0:08:11.592528
print(f"accuracy on test set is {acc:.3f}")
accuracy on test set is 0.552

Expected Output:

0.557

You managed to translate words from one language to another with almost 56% accuracy, without ever seeing them, by using some basic linear algebra and learning a mapping of words from one language to another!

Bundling It Up

<<imports>>

<<nearest-neighbor>>

    <<nearest-cosine-similarity>>

    <<nearest-nearest-neighbor>>

    <<nearest-call>>

Imports

# pypi
import attr
import numpy

Nearest Neighbor

@attr.s(auto_attribs=True)
class NearestNeighbors:
    """Finds the nearest neighbor(s) to a vector

    Args:
     candidates: set of vectors that are potential neighbors
     k: number of neighbors to find
    """
    candidates: numpy.ndarray    
    k: int=1

Cosine Similarity Method

def cosine_similarity(self, vector_1: numpy.ndarray, vector_2: numpy.ndarray) -> float:
    """Calculates the similarity between two vectors

    Args:
     vector_1: array to compare
     vector_2: array to compare to vector_1

    Returns:
     cosine similarity between the two vectors
    """
    return numpy.dot(vector_1, vector_2)/(numpy.linalg.norm(vector_1) *
                                          numpy.linalg.norm(vector_2))

Nearest Neighbor Method

def nearest_neighbors(self, vector: numpy.ndarray) -> numpy.ndarray:
    """Find the nearest neighbor(s) to a vector

    Args:
      - vector: the vector you are going to find the nearest neighbor(s) for

    Returns:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    return numpy.argsort([self.cosine_similarity(vector, row)
                          for row in self.candidates])[-self.k:]

Nearest Neighbor Call

def __call__(self, vector: numpy.ndarray) -> numpy.ndarray:
    """Alias for the `nearest_neighbors` method

    Args:
      - vector: the vector you are going to find the nearest neighbor(s) for

    Returns:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    return self.nearest_neighbors(vector)

Testing It

from neurotic.nlp.word_embeddings.nearest_neighbors import NearestNeighbors


v = numpy.array([1, 0, 1], dtype="float64")
candidates = numpy.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])

testing = NearestNeighbors(candidates=candidates, k=3)

expected = numpy.array([
    [9, 9, 9],
    [1, 0, 5],
    [2, 0, 1]])
actual = candidates[testing.nearest_neighbors(v)]
print(actual)
assert (actual == expected).all()
[[9 9 9]
 [1 0 5]
 [2 0 1]]
with TIMER:
    data = TrainingData()
    trainer = TheTrainer(data.x_train, data.y_train)
    r = trainer.fit()
    predictions = numpy.dot(data.x_train, trainer.transformation)
with TIMER:
    tester = NearestNeighbors(k=1, candidates=data.y_train)
    accuracy = sum(tester(row)[0] == index
                   for index, row in enumerate(predictions))/len(predictions)
2020-10-31 19:35:14,133 graeae.timers.timer start: Started: 2020-10-31 19:35:14.133884
2020-10-31 19:43:29,695 graeae.timers.timer end: Ended: 2020-10-31 19:43:29.695745
2020-10-31 19:43:29,697 graeae.timers.timer end: Elapsed: 0:08:15.561861
print(f"Accuracy: {100 * accuracy: 0.2f} %")
Accuracy:  55.23 %

End