Embeddings from Scratch

Beginning

This is a walk-through of the TensorFlow Word Embeddings tutorial, just to make sure I can do it.

Imports

Python

from argparse import Namespace
from functools import partial

PyPi

from tensorflow import keras
from tensorflow.keras import layers
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import EmbedHoloviews, Timer

Set Up

Plotting

prefix = "../../files/posts/keras/"
slug = "embeddings-from-scratch"

Embed = partial(EmbedHoloviews, folder_path=f"{prefix}{slug}")

The Timer

TIMER = Timer()

Middle

Some Constants

Text = Namespace(
    vocabulary_size=1000,
    embeddings_size=16,
    max_length=500,
    padding="post",
)

Tokens = Namespace(
    padding = "<PAD>",
    start = "<START>",
    unknown = "<UNKNOWN>",
    unused = "<UNUSED>",
)

The Embeddings Layer

print(layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

  This layer can only be used as the first layer in a model.

  Example:

  ```python
  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch,
  # input_length).
  # the largest integer (i.e. word index) in the input should be no larger
  # than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch
  # dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
  ```

  Arguments:
    input_dim: int > 0. Size of the vocabulary,
      i.e. maximum integer index + 1.
    output_dim: int >= 0. Dimension of the dense embedding.
    embeddings_initializer: Initializer for the `embeddings` matrix.
    embeddings_regularizer: Regularizer function applied to
      the `embeddings` matrix.
    embeddings_constraint: Constraint function applied to
      the `embeddings` matrix.
    mask_zero: Whether or not the input value 0 is a special "padding"
      value that should be masked out.
      This is useful when using recurrent layers
      which may take variable length input.
      If this is `True` then all subsequent layers
      in the model need to support masking or an exception will be raised.
      If mask_zero is set to True, as a consequence, index 0 cannot be
      used in the vocabulary (input_dim should equal size of
      vocabulary + 1).
    input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

  Input shape:
    2D tensor with shape: `(batch_size, input_length)`.

  Output shape:
    3D tensor with shape: `(batch_size, input_length, output_dim)`.
  
embedding_layer = layers.Embedding(Text.vocabulary_size, Text.embeddings_size)

The first argument is the number of possible words in the vocabulary and the second is the number of embedding dimensions. The Embedding layer is a sort of lookup table that maps an integer representing a word to a vector. In this case we're going to build a vocabulary of 1,000 words, each represented by a vector of length 16. The weights in the vectors are learned when we train the model and will encode the distance between words.

The input to the embeddings layer is a 2D tensor of integers with the shape (number of samples, sequence_length). The sequences are integer-encoded sentences of the same length - so you have to pad the shorter sentences to match the longest one (the sequence_length).

The output of the embeddings layer is a 3D tensor with the shape (number of samples, sequence_length, embedding_dimensionality).
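
As a quick sanity check, here's a minimal sketch (the token indices are made up, not taken from the dataset) that pushes a batch of two padded three-token sequences through the embedding_layer defined above and confirms the shapes described:

sample_batch = tensorflow.constant([[4, 20, 7], [11, 2, 0]])  # (2 samples, sequence_length of 3)
vectors = embedding_layer(sample_batch)
print(sample_batch.shape)  # expected: (2, 3)
print(vectors.shape)       # expected: (2, 3, 16)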

The Dataset

(train_data, test_data), info = tensorflow_datasets.load(
    "imdb_reviews/subwords8k",
    split=(tensorflow_datasets.Split.TRAIN,
           tensorflow_datasets.Split.TEST),
    with_info=True, as_supervised=True)
encoder = info.features["text"].encoder
print(encoder.subwords[:10])
['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']
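
Since this is a sub-word encoder, any string can be encoded into indices of pieces like the ones shown above and then decoded back again. A minimal round-trip sketch (the sample sentence is made up):

sample_text = "The movie was delightfully terrible."
encoded = encoder.encode(sample_text)
print(encoded)
print(encoder.decode(encoded))  # this should give back the original sentence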

Add Padding

padded_shapes = ([None], ())
train_batches = train_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes)
test_batches = test_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes
)

Check Out a Sample

batch, labels = next(iter(train_batches))
print(batch.numpy())
[[  62    9    4 ...    0    0    0]
 [  19 2428    6 ...    0    0    0]
 [ 691    2  594 ... 7961 1457 7975]
 ...
 [6072 5644 8043 ...    0    0    0]
 [ 977   15   57 ...    0    0    0]
 [5646    2    1 ...    0    0    0]]
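
The trailing zeros are the padding that padded_batch added so every review matches the longest one in its batch. A quick check of the shapes (a sketch - the padded length will vary from batch to batch):

print(batch.shape)   # (10, length of the longest review in this batch)
print(labels.shape)  # (10,)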

Build a Model

model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, Text.embeddings_size),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid")
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
None
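
The GlobalAveragePooling1D layer averages the embedding vectors over the sequence dimension, collapsing each (possibly padded) review down to a single fixed-length vector that the final Dense layer can score. A minimal sketch of the shape change (the token indices are made up):

sample = tensorflow.constant([[1, 2, 3, 0, 0]])  # one padded review with five tokens
embedded = layers.Embedding(encoder.vocab_size, Text.embeddings_size)(sample)
pooled = layers.GlobalAveragePooling1D()(embedded)
print(embedded.shape)  # (1, 5, 16)
print(pooled.shape)    # (1, 16)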

Compile and Train

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
ONCE_PER_EPOCH = 2
with TIMER:
    history = model.fit(train_batches, epochs=10,
                        validation_data=test_batches,
                        verbose=ONCE_PER_EPOCH,
                        validation_steps=20)
2019-09-28 17:14:52,764 graeae.timers.timer start: Started: 2019-09-28 17:14:52.764725
I0928 17:14:52.764965 140515023214400 timer.py:70] Started: 2019-09-28 17:14:52.764725
W0928 17:14:52.806057 140515023214400 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
 val_loss: 0.3015 - val_accuracy: 0.8900
2019-09-28 17:17:36,036 graeae.timers.timer end: Ended: 2019-09-28 17:17:36.036090
I0928 17:17:36.036139 140515023214400 timer.py:77] Ended: 2019-09-28 17:17:36.036090
2019-09-28 17:17:36,037 graeae.timers.timer end: Elapsed: 0:02:43.271365
I0928 17:17:36.037808 140515023214400 timer.py:78] Elapsed: 0:02:43.271365

End

data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="Training/Validation Performance",
                          width=1000,
                          height=800)
Embed(plot=plot, file_name="training")()

Figure Missing (Training/Validation Performance plot)

Amazingly, even with such a simple model, it managed a validation accuracy of about 92%.