Deep N-Grams: Evaluating the Model

Evaluating the Model

Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models we usually use perplexity, a measure of how well a probability model predicts a sample. Perplexity is defined as:

\[ P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1,\ldots,w_{i-1})}} \]

As an implementation hack, you usually take the log of that formula: it lets you work directly with the log probabilities that the RNN outputs, and it turns the root into a factor of \(\frac{1}{N}\) and the product into a sum, which is both simpler and more numerically stable. You also have to take care of the padding: you do not want to include the padded positions when calculating the perplexity, because that would make the measure artificially good.

\begin{align} \log P(W) &= \log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1,\ldots,w_{i-1})}}\right) \\ &= \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1,\ldots,w_{i-1})}\right)^{\frac{1}{N}}\right) \\ &= \log\left(\left(\prod_{i=1}^{N} P(w_i \mid w_1,\ldots,w_{i-1})\right)^{-\frac{1}{N}}\right) \\ &= -\frac{1}{N}\log\left(\prod_{i=1}^{N} P(w_i \mid w_1,\ldots,w_{i-1})\right) \\ &= -\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1,\ldots,w_{i-1}) \end{align}
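
To make the formula concrete, here is a tiny worked example (toy numbers, not from the notebook): if a model assigns probabilities 0.5, 0.25, and 0.125 to the three characters of a sequence, the perplexity is the geometric mean of the inverse probabilities, \((2 \cdot 4 \cdot 8)^{1/3} = 4\), and the log perplexity is \(\log 4 \approx 1.386\). A quick check with plain numpy:

import numpy as np

# per-character probabilities assigned by some toy model
probabilities = np.array([0.5, 0.25, 0.125])

# log perplexity is the negative mean of the log probabilities
log_perplexity = -np.mean(np.log(probabilities))
print(log_perplexity)           # ~1.386
print(np.exp(log_perplexity))   # 4.0, the perplexity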

Instructions: Write a function to help evaluate your model. It takes in preds and target: preds is a tensor of log probabilities and target holds the actual character IDs. You can use tl.one_hot to transform the target into the same shape as preds, then multiply the two element-wise and sum over the last axis to pick out the log probability of each target character.

You also have to create a mask so that only the non-padded positions are counted (a small sketch of this follows the hints below). Good luck!

Hints

  • To convert the target into the same shape as the predictions tensor, use tl.one_hot with target and preds.shape[-1].
  • You will also need the np.equal function in order to unpad the data and properly compute perplexity.
  • Keep in mind while implementing the formula above that \(w_i\) represents a letter from our 256 letter alphabet.
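
To make the multiply-and-sum trick and the padding mask concrete, here is a minimal sketch with plain numpy and toy values (hypothetical shapes and names; the real implementation below uses trax layers and trax.fastmath.numpy):

import numpy as np

# toy setup: one line of four characters over a five-character "alphabet", 0 is padding
target = np.array([[3, 1, 4, 0]])            # the last position is padding
preds = np.log(np.full((1, 4, 5), 0.2))      # uniform log probabilities

# one-hot the targets to match preds' shape, multiply, and sum over the last
# axis to pick out the log probability assigned to each target character
one_hot = np.eye(preds.shape[-1])[target]
log_probabilities = np.sum(one_hot * preds, axis=-1)

# mask out the padding so it doesn't make the perplexity look artificially good
mask = 1.0 - np.equal(target, 0)
log_perplexity = -np.sum(log_probabilities * mask) / np.sum(mask)
print(log_perplexity)  # log(5) ~ 1.609, since the toy model is uniform over five characters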

Imports

# python
from collections import namedtuple
from pathlib import Path

import os

# pypi
from dotenv import load_dotenv
from trax import layers

import trax.fastmath.numpy as numpy
import jax
# this project
from neurotic.nlp.deep_rnn import DataLoader, DataGenerator, GRUModel

Set Up

load_dotenv("posts/nlp/.env")
DataSettings = namedtuple(
    "DataSettings",
    "batch_size max_length output".split())
SETTINGS = DataSettings(batch_size=32,
                        max_length=64,
                        output="~/models/gru-shakespeare-model/")
loader = DataLoader()
training_generator = DataGenerator(data=loader.training, data_loader=loader,
                                   batch_size=SETTINGS.batch_size,
                                   max_length=SETTINGS.max_length,
                                   shuffle=False)
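
Before evaluating anything, it can help to peek at what the generator yields. The shapes printed here are an assumption based on how batch[0] and batch[1] are used below (inputs and targets, both (batch_size, max_length)), not output from the original run:

# peek at one batch -- note that this consumes a batch from the generator
peek = next(training_generator)
print(len(peek))
print(peek[0].shape, peek[1].shape)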

Middle

def test_model(preds: jax.interpreters.xla.DeviceArray,
               target: jax.interpreters.xla.DeviceArray) -> float:
    """Calculates the log perplexity of the model's predictions.

    Args:
       preds: log probabilities for a batch of lines of text (batch, length, vocabulary).
       target: the actual character IDs for the batch (batch, length).

    Returns:
       float: log perplexity of the model.
    """
    # one-hot the targets to match the shape of the predictions, then multiply and
    # sum over the last axis to pick out the log probability of each target character
    log_probabilities = numpy.sum(
        layers.one_hot(x=target, n_categories=preds.shape[-1]) * preds, axis=-1)

    # mask out the padding (character ID 0) so it doesn't inflate the score
    non_pad = 1.0 - numpy.equal(target, 0)
    log_probabilities = log_probabilities * non_pad

    # mean log probability over the non-padded characters only
    log_ppx = numpy.sum(log_probabilities) / numpy.sum(non_pad)

    return -log_ppx
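
As a quick sanity check with toy values (not part of the original notebook): if the model assigns probability 0.9 to every non-padded target character, the log perplexity should come out to \(-\log 0.9 \approx 0.105\), and the padded position should not affect the result.

# one line of three characters over a four-character "alphabet"; the last position is padding (ID 0)
toy_target = numpy.array([[1, 2, 0]])
toy_probabilities = numpy.array([[[0.05, 0.9, 0.025, 0.025],
                                  [0.05, 0.025, 0.9, 0.025],
                                  [0.9, 0.05, 0.025, 0.025]]])
toy_predictions = numpy.log(toy_probabilities)  # the model outputs log probabilities
print(test_model(toy_predictions, toy_target))  # ~0.105 = -log(0.9)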

Testing

Pre-Built Model

We're going to start with a pre-built model file and see how it does relative to our model.

gru = GRUModel()
model = gru.model
pre_built = Path(os.environ["PRE_BUILT_MODEL"]).expanduser()
model.init_from_file(pre_built)
print(model)
Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_256_512
  GRU_512
  GRU_512
  Dense_256
  LogSoftmax
]
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 2.0370717 7.6681223

Our Model

gru = GRUModel()
model = gru.model
ours = Path("~/models/gru-shakespeare-model/model.pkl.gz").expanduser()
model.init_from_file(ours)
print(model)
Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_256_512
  GRU_512
  GRU_512
  Dense_256
  LogSoftmax
]
batch = next(training_generator)
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, numpy.exp(log_ppx))
The log perplexity and perplexity of your model are respectively 0.93021315 2.5350494

On the one hand, I over-trained my model; on the other hand… why such a big difference?