Deep N-Grams: Evaluating the Model
Evaluating the Model
Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models we usually use perplexity, a measure of how well a probability model predicts a sample. Perplexity is defined as:
\[ P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}} \]
As an implementation hack, you usually take the log of that formula. This lets us work directly with the log probabilities our RNN outputs, and it converts the \(N\)-th root (an exponent of \(\frac{1}{N}\)) into a multiplication and the product into a sum, which is simpler and computationally more efficient. You also have to take care of the padding: you do not want to include the padded positions when calculating the perplexity, because that would make the measure artificially good.
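Concretely, taking the log of both sides of the definition above gives the quantity the code below actually computes, the log perplexity:
\[ \log P(W) = -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \]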
Instructions: Write a function that will help evaluate your model. It takes in preds, a tensor of log probabilities, and target, the tensor of actual character ids. You can use tl.one_hot to expand the target to the same shape as preds, then multiply the two element-wise and sum over the last axis to pick out the log probability of each target character. You also have to create a mask so that only the non-padded positions are counted (a toy sketch of the one-hot and mask steps follows the hints below). Good luck!
Hints
- To convert the target into the same dimension as the predictions tensor, use tl.one_hot with target and preds.shape[-1].
- You will also need the np.equal function in order to unpad the data and properly compute perplexity.
- Keep in mind while implementing the formula above that \(w_i\) represents a letter from our 256 letter alphabet.
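Here is the toy sketch mentioned above (made-up numbers, not part of the assignment) of what the one-hot expansion and the padding mask look like for a single five-character line drawn from a pretend four-character alphabet:
import trax.fastmath.numpy as numpy
from trax import layers as tl

# one line of five characters, where 0 is the padding value
target = numpy.array([[3, 1, 2, 0, 0]])

# one_hot turns each character id into a vector of length n_categories,
# printing a (1, 5, 4) array with a single 1 in each row
print(tl.one_hot(x=target, n_categories=4))

# the mask is 1 for real characters and 0 for padding: [[1. 1. 1. 0. 0.]]
print(1.0 - numpy.equal(target, 0))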
Imports
# python
from collections import namedtuple
from pathlib import Path
import os
# pypi
from dotenv import load_dotenv
from trax import layers
import trax.fastmath.numpy as numpy
import jax
# this project
from neurotic.nlp.deep_rnn import DataLoader, DataGenerator, GRUModel
Set Up
load_dotenv("posts/nlp/.env")
DataSettings = namedtuple(
"DataSettings",
"batch_size max_length output".split())
SETTINGS = DataSettings(batch_size=32,
max_length=64,
output="~/models/gru-shakespeare-model/")
loader = DataLoader()
training_generator = DataGenerator(data=loader.training, data_loader=loader,
batch_size=SETTINGS.batch_size,
max_length=SETTINGS.max_length,
shuffle=False)
Middle
def test_model(preds: jax.interpreters.xla.DeviceArray,
               target: jax.interpreters.xla.DeviceArray) -> float:
    """Calculates the log perplexity of the model on one batch.

    Args:
     preds: log probabilities for a batch of lines of text (batch, length, vocabulary)
     target: the actual character ids for the batch (batch, length)

    Returns:
     float: log perplexity of the model
    """
    # pick out the log probability assigned to each target character
    total_log_ppx = numpy.sum(
        layers.one_hot(x=target, n_categories=preds.shape[-1]) * preds, axis=-1)

    # the mask is 0 wherever the target is padding (character id 0)
    non_pad = 1.0 - numpy.equal(target, 0)

    # zero out the padded positions
    ppx = total_log_ppx * non_pad

    # average over the real (non-padded) positions only
    log_ppx = numpy.sum(ppx) / numpy.sum(non_pad)
    return -log_ppx
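As a quick sanity check (a hand-made batch, not from the data): a model that puts essentially all of its probability mass on the correct character should score a log perplexity of about zero, i.e. a perplexity of one.
# one line: three real characters followed by two padding characters
toy_target = numpy.array([[65, 66, 67, 0, 0]])

# "perfect" predictions: log probability ~0 (probability ~1) for each target
# character, with a tiny epsilon so the log of the other entries stays finite
toy_preds = numpy.log(layers.one_hot(x=toy_target, n_categories=256) + 1e-12)

print(test_model(toy_preds, toy_target))  # ~0.0, so a perplexity of exp(0) = 1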
Testing
Pre-Built Model
We're going to start with a pre-built model file and see how it does relative to our own model.
gru = GRUModel()
model = gru.model
pre_built = Path(os.environ["PRE_BUILT_MODEL"]).expanduser()
model.init_from_file(pre_built)
print(model)
Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_256_512
  GRU_512
  GRU_512
  Dense_256
  LogSoftmax
]
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 2.0370717 7.6681223
Our Model
gru = GRUModel()
model = gru.model
ours = Path("~/models/gru-shakespeare-model/model.pkl.gz").expanduser()
model.init_from_file(ours)
print(model)
Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_256_512
  GRU_512
  GRU_512
  Dense_256
  LogSoftmax
]
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 0.93021315 2.5350494
On the one hand I over-trained my model, on the other hand… why such a big difference?