Modified Triplet Loss
Beginning
We'll be looking at how to calculate the full triplet loss as well as a matrix of similarity scores.
Background
This is the original triplet loss function:
\[ \mathcal{L_\mathrm{Original}} = \max{(\mathrm{s}(A,N) -\mathrm{s}(A,P) +\alpha, 0)} \]
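As a quick sanity check with made-up numbers: if \(\mathrm{s}(A,P) = 0.9\), \(\mathrm{s}(A,N) = 0.3\), and \(\alpha = 0.25\), then

\[ \mathcal{L_\mathrm{Original}} = \max{(0.3 - 0.9 + 0.25, 0)} = \max{(-0.35, 0)} = 0 \]

so a triplet where the positive already beats the negative by more than the margin contributes no loss.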
It can be improved by including the mean negative and the closest negative to create a new, full loss function. The inputs are the Anchor \(\mathrm{A}\), the Positive \(\mathrm{P}\), and the Negative \(\mathrm{N}\).
\begin{align}
\mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\
\end{align}

Imports
# from pypi
import numpy
Middle
Similarity Scores
The first step is to calculate the matrix of similarity scores using cosine similarity so that you can look up \(\mathrm{s}(A,P)\), \(\mathrm{s}(A,N)\) as needed for the loss formulas.
Two Vectors
First, this is how to calculate the similarity score, using cosine similarity, for 2 vectors.
\[ \mathrm{s}(v_1,v_2) = \mathrm{cosine \ similarity}(v_1,v_2) = \frac{v_1 \cdot v_2}{||v_1||~||v_2||} \]
Similarity score
def cosine_similarity(v1: numpy.ndarray, v2: numpy.ndarray) -> float:
"""Calculates the cosine similarity between two vectors
Args:
v1: first vector
v2: vector to compare to v1
Returns:
the cosine similarity between v1 and v2
"""
numerator = numpy.dot(v1, v2)
denominator = numpy.sqrt(numpy.dot(v1, v1)) * numpy.sqrt(numpy.dot(v2, v2))
return numerator / denominator
- Similar vectors
v1 = numpy.array([1, 2, 3], dtype=float)
v2 = numpy.array([1, 2, 3.5])
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
cosine similarity : 0.9974
- Identical Vectors
v2 = v1
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
cosine similarity : 1.0000
- Opposite Vectors
v2 = -v1
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
cosine similarity : -1.0000
- Dissimilar Vectors
v2 = numpy.array([0, -42, 1])
print(f"cosine similarity : {cosine_similarity(v1, v2):0.4f}")
cosine similarity : -0.5153
Two Batches of Vectors
Now let's look at how to calculate the similarity scores, using cosine similarity, for 2 batches of vectors. These are individual vectors, just like in the example above, but stacked vertically into a matrix: one row per vector, one column per embedding dimension. Here we'll use a batch size (row count) of 4 and an embedding size (column count) of 3.
The data is set up so that \(v_{1\_1}\) and \(v_{2\_1}\) represent duplicate inputs, but neither is a duplicate of any other row in either batch. This means \(v_{1\_1}\) and \(v_{2\_1}\) have more similar vectors than, say, \(v_{1\_1}\) and \(v_{2\_2}\).
We'll use two different methods for calculating the matrix of similarities from 2 batches of vectors.
The Input data.
v1_1 = numpy.array([1, 2, 3])
v1_2 = numpy.array([9, 8, 7])
v1_3 = numpy.array([-1, -4, -2])
v1_4 = numpy.array([1, -7, 2])
v1 = numpy.vstack([v1_1, v1_2, v1_3, v1_4])
print("v1 :")
print(v1, "\n")
v2_1 = v1_1 + numpy.random.normal(0, 2, 3) # add some noise to create approximate duplicate
v2_2 = v1_2 + numpy.random.normal(0, 2, 3)
v2_3 = v1_3 + numpy.random.normal(0, 2, 3)
v2_4 = v1_4 + numpy.random.normal(0, 2, 3)
v2 = numpy.vstack([v2_1, v2_2, v2_3, v2_4])
print("v2 :")
print(v2, "\n")
v1 :
[[ 1  2  3]
 [ 9  8  7]
 [-1 -4 -2]
 [ 1 -7  2]]

v2 :
[[ 1.34263076  1.18510671  1.04373534]
 [ 8.96692933  6.50763316  7.03243982]
 [-3.4497247  -6.08808183 -4.54327564]
 [-0.77144774 -9.08449817  4.4633513 ]]
For this to work, the batch sizes must match.
assert len(v1) == len(v2)
Now let's look at the similarity scores.
- Option 1 : nested loops and the cosine similarity function
batch_size, columns = v1.shape
scores_1 = numpy.zeros([batch_size, batch_size])
rows, columns = scores_1.shape
for row in range(rows):
    for column in range(columns):
        scores_1[row, column] = cosine_similarity(v1[row], v2[column])
print("Option 1 : Loop")
print(scores_1)
Option 1 : Loop
[[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
 [ 0.99999485  0.99567656 -0.95998199 -0.34214656]
 [-0.86016573 -0.81584759  0.96484391  0.60584372]
 [-0.31943701 -0.23354642  0.49063636  0.96181686]]
- Option 2 : Vector Normalization and the Dot Product
def norm(x: numpy.ndarray) -> numpy.ndarray:
    """Normalize each row of x to unit length"""
    return x / numpy.sqrt(numpy.sum(x * x, axis=1, keepdims=True))
scores_2 = numpy.dot(norm(v1), norm(v2).T)
print("Option 2 : Vector Norm & dot product")
print(scores_2)
Option 2 : Vector Norm & dot product
[[ 0.88245143  0.87735873 -0.93717609 -0.14613242]
 [ 0.99999485  0.99567656 -0.95998199 -0.34214656]
 [-0.86016573 -0.81584759  0.96484391  0.60584372]
 [-0.31943701 -0.23354642  0.49063636  0.96181686]]
Check
Let's make sure we get the same answer in both cases.
assert numpy.allclose(scores_1, scores_2)
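As an aside, the vectorized version isn't just more compact, it's also much faster than the Python loop. Here's a rough timing sketch (the option_1 and option_2 wrappers are made up for this comparison, and the numbers will vary from machine to machine):

from timeit import timeit

def option_1() -> numpy.ndarray:
    """The nested-loop version from above"""
    scores = numpy.zeros([len(v1), len(v1)])
    for row in range(len(v1)):
        for column in range(len(v2)):
            scores[row, column] = cosine_similarity(v1[row], v2[column])
    return scores

def option_2() -> numpy.ndarray:
    """The normalize-and-dot-product version from above"""
    return numpy.dot(norm(v1), norm(v2).T)

# run each option many times so the difference is visible
print(f"Option 1 (loop)       : {timeit(option_1, number=10000):0.3f} seconds")
print(f"Option 2 (vectorized) : {timeit(option_2, number=10000):0.3f} seconds")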
Hard Negative Mining
Now we'll calculate the mean negative \(mean\_neg\) and the closest negative \(closest\_neg\) used in calculating \(\mathcal{L_\mathrm{1}}\) and \(\mathcal{L_\mathrm{2}}\).
\begin{align}
\mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\end{align}

We'll do this using the matrix of similarity scores for a batch size of 4. The diagonal of the matrix contains all the \(\mathrm{s}(A,P)\) values, the similarities from duplicate question pairs (aka Positives). This is an important attribute for the calculations that follow.
Mean Negative
mean_neg is the average of the off-diagonal \(\mathrm{s}(A,N)\) values for each row.
Closest Negative
closest_neg is the largest off-diagonal value \(\mathrm{s}(A,N)\) that is smaller than the diagonal \(\mathrm{s}(A,P)\) for each row.
We'll start with some hand-made similarity scores.
similarity_scores = numpy.array(
[
[0.9, -0.8, 0.3, -0.5],
[-0.4, 0.5, 0.1, -0.1],
[0.3, 0.1, -0.4, -0.8],
[-0.5, -0.2, -0.7, 0.5],
]
)
Positives
All the s(A,P) values are similarities from duplicate question pairs (aka Positives). These are along the diagonal.
sim_ap = numpy.diag(similarity_scores)
print("s(A, P) :\n")
print(numpy.diag(sim_ap))
s(A, P) :

[[ 0.9  0.   0.   0. ]
 [ 0.   0.5  0.   0. ]
 [ 0.   0.  -0.4  0. ]
 [ 0.   0.   0.   0.5]]
Negatives
All the s(A,N) values are similarities of the non-duplicate question pairs (aka Negatives). These are in the cells not on the diagonal.
sim_an = similarity_scores - numpy.diag(sim_ap)
print("s(A, N) :\n")
print(sim_an)
s(A, N) :

[[ 0.  -0.8  0.3 -0.5]
 [-0.4  0.   0.1 -0.1]
 [ 0.3  0.1  0.  -0.8]
 [-0.5 -0.2 -0.7  0. ]]
Mean negative
This is the average of the s(A,N) values for each row.
batch_size = similarity_scores.shape[0]
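# the diagonal of sim_an is all zeros, so dividing each row-sum by
# (batch_size - 1) averages only the off-diagonal s(A, N) values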
mean_neg = numpy.sum(sim_an, axis=1, keepdims=True) / (batch_size - 1)
print("mean_neg :\n")
print(mean_neg)
mean_neg :

[[-0.33333333]
 [-0.13333333]
 [-0.13333333]
 [-0.46666667]]
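As a hand check, the first row averages its off-diagonal values: \((-0.8 + 0.3 - 0.5) / 3 = -1/3 \approx -0.3333\), which matches the first entry of mean_neg.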
Closest negative
This is the largest \(\mathrm{s}(A,N)\) that is less than or equal to \(\mathrm{s}(A,P)\) for each row.
mask_1 = numpy.identity(batch_size) == 1 # mask to exclude the diagonal
mask_2 = sim_an > sim_ap.reshape(batch_size, 1) # mask to exclude sim_an > sim_ap
mask = mask_1 | mask_2
sim_an_masked = numpy.copy(sim_an) # create a copy to preserve sim_an
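# -2 is below the minimum possible cosine similarity of -1, so the
# masked cells can never win the row-wise max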
sim_an_masked[mask] = -2
closest_neg = numpy.max(sim_an_masked, axis=1, keepdims=True)
print("Closest Negative :\n")
print(closest_neg)
Closest Negative :

[[ 0.3]
 [ 0.1]
 [-0.8]
 [-0.2]]
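As a hand check, the third row has a diagonal value of \(-0.4\) and off-diagonal values \(0.3\), \(0.1\), and \(-0.8\); only \(-0.8\) is smaller than the diagonal, so it's the closest negative for that row.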
The Loss Functions
The last step is to calculate the loss functions.
\begin{align}
\mathcal{L_\mathrm{1}} &= \max{(mean\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\mathcal{L_\mathrm{2}} &= \max{(closest\_neg -\mathrm{s}(A,P) +\alpha, 0)}\\
\mathcal{L_\mathrm{Full}} &= \mathcal{L_\mathrm{1}} + \mathcal{L_\mathrm{2}}\\
\end{align}

The Alpha margin.
alpha = 0.25
Modified triplet loss
loss_1 = numpy.maximum(mean_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_2 = numpy.maximum(closest_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
loss_full = loss_1 + loss_2
Cost
cost = numpy.sum(loss_full)
print("Loss Full :\n")
print(loss_full)
print(f"\ncost : {cost:.3f}")
Loss Full :

[[0.        ]
 [0.        ]
 [0.51666667]
 [0.        ]]

cost : 0.517
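To tie it all together, here's a minimal sketch that collects the steps above into one function operating on two batches of embeddings (the name full_triplet_loss is made up for this post, and it assumes the norm helper defined earlier):

def full_triplet_loss(v1: numpy.ndarray, v2: numpy.ndarray,
                      alpha: float = 0.25) -> float:
    """Calculates the full triplet loss for two batches of vectors

    Args:
     v1: batch of vectors (one embedding per row)
     v2: batch of vectors, row-aligned with v1 (duplicates on matching rows)
     alpha: the margin

    Returns:
     the full triplet loss summed over the batch
    """
    batch_size = len(v1)
    scores = numpy.dot(norm(v1), norm(v2).T)  # similarity matrix
    sim_ap = numpy.diag(scores)               # diagonal: s(A, P)
    sim_an = scores - numpy.diag(sim_ap)      # off-diagonal: s(A, N)
    mean_neg = numpy.sum(sim_an, axis=1, keepdims=True) / (batch_size - 1)
    # mask out the diagonal and anything larger than s(A, P)
    mask = (numpy.identity(batch_size) == 1) | (
        sim_an > sim_ap.reshape(batch_size, 1))
    closest_neg = numpy.max(numpy.where(mask, -2, sim_an),
                            axis=1, keepdims=True)
    loss_1 = numpy.maximum(mean_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
    loss_2 = numpy.maximum(closest_neg - sim_ap.reshape(batch_size, 1) + alpha, 0)
    return numpy.sum(loss_1 + loss_2)

print(f"cost : {full_triplet_loss(v1, v2):0.3f}")

Since v2 was built by adding random noise to v1, the printed cost will change from run to run.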