Embeddings from Scratch
Table of Contents
Beginning
This is a walk-through of the TensorFlow Word Embeddings tutorial, just to make sure I can do it.
Imports
Python
from argparse import Namespace
from functools import partial
PyPi
from tensorflow import keras
from tensorflow.keras import layers
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets
Others
from graeae import EmbedHoloviews, Timer
Set Up
Plotting
prefix = "../../files/posts/keras/"
slug = "embeddings-from-scratch"
Embed = partial(EmbedHoloviews, folder_path=f"{prefix}{slug}")
The Timer
TIMER = Timer()
Middle
Some Constants
Text = Namespace(
vocabulary_size=1000,
embeddings_size=16,
max_length=500,
padding="post",
)
Tokens = Namespace(
padding = "<PAD>",
start = "<START>",
unknown = "<UNKNOWN>",
unused = "<UNUSED>",
)
The Embeddings Layer
print(layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

This layer can only be used as the first layer in a model.

Example:

```python
model = Sequential()
model.add(Embedding(1000, 64, input_length=10))
# the model will take as input an integer matrix of size (batch,
# input_length).
# the largest integer (i.e. word index) in the input should be no larger
# than 999 (vocabulary size).
# now model.output_shape == (None, 10, 64), where None is the batch
# dimension.

input_array = np.random.randint(1000, size=(32, 10))

model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
assert output_array.shape == (32, 10, 64)
```

Arguments:
  input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
  output_dim: int >= 0. Dimension of the dense embedding.
  embeddings_initializer: Initializer for the `embeddings` matrix.
  embeddings_regularizer: Regularizer function applied to the `embeddings` matrix.
  embeddings_constraint: Constraint function applied to the `embeddings` matrix.
  mask_zero: Whether or not the input value 0 is a special "padding" value that
    should be masked out. This is useful when using recurrent layers which may
    take variable length input. If this is `True` then all subsequent layers in
    the model need to support masking or an exception will be raised. If
    mask_zero is set to True, as a consequence, index 0 cannot be used in the
    vocabulary (input_dim should equal size of vocabulary + 1).
  input_length: Length of input sequences, when it is constant. This argument is
    required if you are going to connect `Flatten` then `Dense` layers upstream
    (without it, the shape of the dense outputs cannot be computed).

Input shape:
  2D tensor with shape: `(batch_size, input_length)`.

Output shape:
  3D tensor with shape: `(batch_size, input_length, output_dim)`.
embedding_layer = layers.Embedding(Text.vocabulary_size, Text.embeddings_size)
The first argument is the number of possible words in the vocabulary and the second is the number of dimensions. The Embedding layer is a sort of lookup table that maps an integer representing a word to a vector. In this case we're going to build a vocabulary of 1,000 words represented by vectors with a length of 16. The weights in the vectors are learned when we train the model and will encode the distance between words.
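To see what that lookup means in practice (a quick sketch, not part of the original tutorial), you can call the layer on a small batch of integer indices; each index comes back as a sixteen-dimensional vector, so a (2, 3) input produces a (2, 3, 16) output.

# made-up integer indices, two "sentences" of three "words" each
sample_indices = tensorflow.constant([[0, 1, 2], [3, 4, 5]])
# each integer is looked up and replaced by its (randomly initialized) vector
sample_vectors = embedding_layer(sample_indices)
print(sample_vectors.shape)

The vectors are random at this point - they only become meaningful once the model is trained.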
The input to the embeddings layer is a 2D tensor of integers with the shape (number_of_samples, sequence_length). The sequences are integer-encoded sentences of the same length, so you have to pad the shorter sentences to match the longest one (the sequence_length).

The output of the embeddings layer is a 3D tensor with the shape (number_of_samples, sequence_length, embedding_dimensionality).
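Here's a small sketch of that padding using the Text constants defined earlier (the rest of this walk-through doesn't actually need it, since padded_batch handles the padding below, and the two example sentences are made up).

# two integer-encoded "sentences" of different lengths (made-up indices)
sentences = [[11, 7, 42], [11, 7, 42, 93, 5]]
padded = keras.preprocessing.sequence.pad_sequences(
    sentences,
    maxlen=Text.max_length,
    padding=Text.padding,
)
print(padded.shape)

Because the padding is "post", the zeros get appended after each sentence and both rows end up with a length of 500.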
The Dataset
(train_data, test_data), info = tensorflow_datasets.load(
"imdb_reviews/subwords8k",
split=(tensorflow_datasets.Split.TRAIN,
tensorflow_datasets.Split.TEST),
with_info=True, as_supervised=True)
encoder = info.features["text"].encoder
print(encoder.subwords[:10])
['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']
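To get a feel for what the subword encoder does (a minimal sketch with a made-up sentence), you can round-trip some text through it - words that aren't in its vocabulary get broken into smaller subword pieces.

sample = "The movie was not what I expected."
encoded = encoder.encode(sample)
print(encoded)
print(encoder.decode(encoded))

Decoding the encoded integers should give back the original sentence.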
Add Padding
padded_shapes = ([None], ())
train_batches = train_data.shuffle(Text.vocabulary_size).padded_batch(
10, padded_shapes=padded_shapes)
test_batches = test_data.shuffle(Text.vocabulary_size).padded_batch(
10, padded_shapes=padded_shapes
)
Check Out a Sample
batch, labels = next(iter(train_batches))
print(batch.numpy())
[[  62    9    4 ...    0    0    0]
 [  19 2428    6 ...    0    0    0]
 [ 691    2  594 ... 7961 1457 7975]
 ...
 [6072 5644 8043 ...    0    0    0]
 [ 977   15   57 ...    0    0    0]
 [5646    2    1 ...    0    0    0]]
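Since each row is just a padded sequence of subword indices, you can turn one back into text with the encoder. This is a sketch (not in the original tutorial) that strips the trailing zero-padding before decoding.

# drop the padding zeros, then decode the first review back to subword text
first_review = [index for index in batch[0].numpy() if index != 0]
print(encoder.decode(first_review))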
Build a Model
model = keras.Sequential([
layers.Embedding(encoder.vocab_size, Text.embeddings_size),
layers.GlobalAveragePooling1D(),
layers.Dense(1, activation="sigmoid")
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, None, 16)          130960
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 17
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
None
Compile and Train
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
ONCE_PER_EPOCH = 2
with TIMER:
history = model.fit(train_batches, epochs=10,
validation_data=test_batches,
verbose=ONCE_PER_EPOCH,
validation_steps=20)
2019-09-28 17:14:52,764 graeae.timers.timer start: Started: 2019-09-28 17:14:52.764725
I0928 17:14:52.764965 140515023214400 timer.py:70] Started: 2019-09-28 17:14:52.764725
W0928 17:14:52.806057 140515023214400 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
val_loss: 0.3015 - val_accuracy: 0.8900
2019-09-28 17:17:36,036 graeae.timers.timer end: Ended: 2019-09-28 17:17:36.036090
I0928 17:17:36.036139 140515023214400 timer.py:77] Ended: 2019-09-28 17:17:36.036090
2019-09-28 17:17:36,037 graeae.timers.timer end: Elapsed: 0:02:43.271365
I0928 17:17:36.037808 140515023214400 timer.py:78] Elapsed: 0:02:43.271365
End
data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="Training/Validation Performance",
width=1000,
height=800)
Embed(plot=plot, file_name="training")()
Amazingly, even with such a simple model, it managed a 92% validation accuracy.
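If you want the exact number rather than reading it off the plot, you can pull it from the history DataFrame built above (a one-line sketch, assuming the metric key is "val_accuracy" as shown in the training output).

print(f"Final validation accuracy: {data['val_accuracy'].iloc[-1]:.3f}")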