Neural Machine Translation: The Data
The Data
This is the first post in a series that will look at creating a Long Short-Term Memory (LSTM) model with attention for machine translation. The previous post was an overview that holds the links to all the posts in the series.
Imports
# python
from pathlib import Path
import random
# pypi
from termcolor import colored
import numpy
import trax
Middle
Loading the Data
Next, we will import the dataset we will use to train the model. If you are short on disk space, you can use a small dataset from Opus, a growing collection of translated texts from the web. In particular, we will get an English-to-German translation subset specified as opus/medical, which has medical-related texts. If storage is not an issue, you can opt for a larger corpus such as the English-to-German translation dataset from ParaCrawl, a large multi-lingual translation dataset created by the European Union. Both of these datasets are available via TensorFlow Datasets (TFDS), and you can browse through the other available datasets here. As you'll see below, you can easily access a dataset from TFDS with trax.data.TFDS. The result is a Python generator function yielding tuples. Use the keys argument to select what appears at which position in the tuple. For example, keys=('en', 'de') below will return pairs as (English sentence, German sentence).
The para_crawl/ende dataset is 4.04 GiB while the opus/medical dataset is 188.85 MiB.
Note: Trying to download the ParaCrawl dataset using trax raises an out-of-resource error. You can try downloading the source directly from
https://s3.amazonaws.com/web-language-models/paracrawl/release4/en-de.bicleaner07.txt.gz
but I haven't figured out how to feed it into the trax data pipeline yet, so I'm sticking with the smaller dataset.
The Training Data
The first time you run this it will download the dataset; after that it will just load it from the downloaded files.
path = Path("~/data/tensorflow/translation/").expanduser()
data_set = "opus/medical"
# data_set = "para_crawl/ende"
train_stream_fn = trax.data.TFDS(data_set,
                                 data_dir=path,
                                 keys=('en', 'de'),
                                 eval_holdout_size=0.01,
                                 train=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-fb62d04026f5> in <module>
      4 # data_set = "para_crawl/ende"
      5
----> 6 train_stream_fn = trax.data.TFDS(data_set,
      7                                  data_dir=path,
      8                                  keys=('en', 'de'),

/usr/local/lib/python3.8/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1067       scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
   1068       err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1069       utils.augment_exception_message_and_reraise(e, err_str)

/usr/local/lib/python3.8/dist-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
     39   proxy = ExceptionProxy()
     40   ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 41   raise proxy.with_traceback(exception.__traceback__) from None

~/trax/trax/data/tf_inputs.py in TFDS(dataset_name, data_dir, tfds_preprocess_fn, keys, train, shuffle_train, host_id, n_hosts, eval_holdout_size)
    279   else:
    280     subsplit = None
--> 281   (train_data, eval_data, _) = _train_and_eval_dataset(
    282       dataset_name, data_dir, eval_holdout_size,
    283       train_shuffle_files=shuffle_train, subsplit=subsplit)

~/trax/trax/data/tf_inputs.py in _train_and_eval_dataset(dataset_name, data_dir, eval_holdout_size, train_shuffle_files, eval_shuffle_files, subsplit)
    224   if eval_holdout_examples > 0 or subsplit is not None:
    225     n_train = train_examples - eval_holdout_examples
--> 226     train_start = int(n_train * subsplit[0])
    227     train_end = int(n_train * subsplit[1])
    228     if train_end - train_start < 1:

TypeError: 'NoneType' object is not subscriptable
  In call to configurable 'TFDS' (<function TFDS at 0x7f960c527280>)
  In call to configurable 'TFDS' (<function TFDS at 0x7f960c526f70>)
The Evaluation Data
Since we already downloaded the data in the previous code-block, this will just load the evaluation set from the downloaded data.
eval_stream_fn = trax.data.TFDS('opus/medical',
                                data_dir=path,
                                keys=('en', 'de'),
                                eval_holdout_size=0.01,
                                train=False)
Notice that TFDS returns a generator function, not a generator. This is because in Python you cannot reset a generator, so you cannot go back to a previously yielded value. During deep-learning training you use Stochastic Gradient Descent and don't actually need to go back, but it is sometimes good to be able to do that, and that's where the functions come in. Let's print a sample pair from our train and eval data. Notice that the raw output is represented in bytes (denoted by the b' prefix); these will be converted to strings internally in the next steps.
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()
train data (en, de) tuple: (b'Tel: +421 2 57 103 777\n', b'Tel: +421 2 57 103 777\n')
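Because TFDS handed us a function rather than a generator, we can always ask for a fresh stream by calling the function again. A quick illustration (note that the training stream may be shuffled, so the first example you see can differ between calls):

# Calling the function again gives a brand-new generator over the data.
fresh_stream = train_stream_fn()
print(next(fresh_stream))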
eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))
eval data (en, de) tuple: (b'Lutropin alfa Subcutaneous use.\n', b'Pulver zur Injektion Lutropin alfa Subkutane Anwendung\n')
Tokenization and Formatting
Now that we have imported our corpus, we will preprocess the sentences into a format that our model can accept. This involves several steps:
Tokenizing the sentences using subword representations: We want to represent each sentence as an array of integers instead of strings. For our application, we will use subword representations to tokenize the sentences. This is a common technique for avoiding out-of-vocabulary words by allowing parts of words to be represented separately. For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less" and let your tokenizer combine these subwords when needed. This makes the vocabulary more flexible, so you won't have to save uncommon words explicitly (e.g. stylebender, nonce, etc.). Tokenizing is done with the `trax.data.Tokenize()` command, using the combined subword vocabulary for English and German (i.e. `ende_32k.subword`) retrieved from https://storage.googleapis.com/trax-ml/vocabs/ende_32k.subword (I'm using the web interface, but you could also just download it and put it in a local directory).
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = "gs://trax-ml/vocabs/" # google storage
# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)
Append an end-of-sentence token to each sentence: We will assign a token (in this case 1) to mark the end of a sentence. This will be useful in inference/prediction so we'll know that the model has completed the translation.
Integer assigned as end-of-sentence (EOS)
EOS = 1
def append_eos(stream):
    """helper to add the end of sentence token to sentences in the stream

    Args:
     stream: generator of (inputs, targets) token-array tuples

    Yields:
     next tuple of numpy arrays with EOS token added (inputs, targets)
    """
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield numpy.array(inputs_with_eos), numpy.array(targets_with_eos)
    return
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)
Filter long sentences
We will place a limit on the number of tokens per sentence to ensure we won't run out of memory. This is done with the trax.data.FilterByLength() method and you can see its syntax below.
Filter out sentences that are too long so we don't run out of memory. length_keys=[0, 1] means we filter both the English and German sentences, so both must not be longer than 256 tokens for training and 512 tokens for evaluation.
filtered_train_stream = trax.data.FilterByLength(
max_length=256, length_keys=[0, 1])(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
max_length=512, length_keys=[0, 1])(tokenized_eval_stream)
train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red' ), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)
Single tokenized example input: [ 2538 2248 30 12114 23184 16889 5 2 20852 6456 20592 5812 3932 96 5178 3851 30 7891 3550 30650 4729 992 1]
Single tokenized example target: [ 1872 11 3544 39 7019 17877 30432 23 6845 10 14222 47 4004 18 21674 5 27467 9513 920 188 10630 18 3550 30650 4729 992 1]
tokenize & detokenize helper functions
Given any data set, you have to be able to map words to their indices and indices back to their words. The inputs and outputs of your trax models are usually tensors of numbers where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following (a minimal sketch follows the list):
- word2Ind: a dictionary mapping the word to its index.
- ind2Word: a dictionary mapping the index to its word.
- word2Count: a dictionary mapping the word to the number of times it appears.
- num_words: total number of words that have appeared.
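We don't have to build these by hand here since the subword vocabulary takes care of it, but a minimal sketch of what those mappings might look like for a toy corpus (the corpus and names below are purely illustrative) could be:

# Sketch only: building word/index mappings by hand for a toy corpus.
corpus = ["the cat sat", "the dog sat"]

word2Ind, ind2Word, word2Count = {}, {}, {}
for sentence in corpus:
    for word in sentence.split():
        word2Count[word] = word2Count.get(word, 0) + 1
        if word not in word2Ind:
            index = len(word2Ind)
            word2Ind[word] = index
            ind2Word[index] = word

num_words = len(word2Ind)
print(word2Ind)   # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(num_words)  # 4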
def tokenize(input_str: str,
             vocab_file: str=None, vocab_dir: str=None, EOS: int=EOS) -> numpy.ndarray:
    """Encodes a string to an array of integers

    Args:
     input_str: human-readable string to encode
     vocab_file: filename of the vocabulary text file
     vocab_dir: path to the vocabulary file
     EOS: token to mark the end of the sentence

    Returns:
     tokenized version of the input string
    """
    # Use the trax.data.tokenize method. It takes streams and returns streams;
    # we get around that by making a 1-element stream with `iter`.
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_file=vocab_file,
                                     vocab_dir=vocab_dir))

    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [EOS]

    # Add the batch dimension to the front of the shape
    batch_inputs = numpy.reshape(numpy.array(inputs), [1, -1])
    return batch_inputs
def detokenize(integers: numpy.ndarray,
               vocab_file: str=None,
               vocab_dir: str=None,
               EOS: int=EOS) -> str:
    """Decodes an array of integers to a human readable string

    Args:
     integers: array of integers to decode
     vocab_file: filename of the vocabulary text file
     vocab_dir: path to the vocabulary file
     EOS: token that marks the end of the sentence

    Returns:
     str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(numpy.squeeze(integers))

    # Remove the EOS to decode only the original tokens
    if EOS in integers:
        integers = integers[:integers.index(EOS)]

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)
Let's see how we might use these functions:
Detokenize an input-target pair of tokenized sentences
print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()
Single detokenized example input: During treatment with olanzapine, adolescents gained significantly more weight compared with adults.
Single detokenized example target: Während der Behandlung mit Olanzapin nahmen die Jugendlichen im Vergleich zu Erwachsenen signifikant mehr Gewicht zu.
Tokenize and detokenize a word that is not explicitly saved in the vocabulary file. See how it combines the subwords 'hell' and 'o' to form the word 'hello'.
print(colored("tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored("detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
tokenize('hello'):  [[17332 140 1]]
detokenize([17332, 140, 1]):  hello
Bucketing
Bucketing the tokenized sentences is an important technique used to speed up training in NLP. Here is a nice article describing it in detail, but the gist is very simple. Our inputs have variable lengths and we want to make them the same length when batching groups of sentences together. One way to do that is to pad each sentence to the length of the longest sentence in the dataset, but this can lead to a lot of wasted computation. For example, if there are multiple short sentences with just two tokens, do we want to pad them all when the longest sentence is composed of 100 tokens? Instead of padding with 0s to the maximum sentence length every time, we can group our tokenized sentences by length into buckets.
We batch the sentences with similar lengths together and only add minimal padding to make them equal length (usually up to the nearest power of two). This lets us waste less computation when processing padded sequences.
In Trax, it is implemented in the bucket_by_length function.
Bucketing to create streams of batches.
Buckets are defined in terms of boundaries and batch sizes. batch_sizes[i] determines the batch size for items with length < boundaries[i]. So below, we'll take a batch of 256 sentences of length < 8, 128 sentences if the length is between 8 and 16, and so on, down to only 2 sentences per batch if the length is over 512. We'll do the bucketing using bucket_by_length.
boundaries = [2**power_of_two for power_of_two in range(3, 10)]
batch_sizes = [2**power_of_two for power_of_two in range(8, 0, -1)]
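To make the mapping concrete, here is a small illustration of which batch size a sentence with a given number of tokens would get under the rule above (the batch_size_for helper is just for this post, not part of trax):

print(boundaries)   # [8, 16, 32, 64, 128, 256, 512]
print(batch_sizes)  # [256, 128, 64, 32, 16, 8, 4, 2]

def batch_size_for(length: int) -> int:
    """Illustration only: batch size for a sentence with this many tokens."""
    for boundary, batch_size in zip(boundaries, batch_sizes):
        if length < boundary:
            return batch_size
    # anything at or beyond the last boundary gets the smallest batch size
    return batch_sizes[-1]

# a 10-token sentence lands in the 16-token bucket, so it goes in batches of 128
print(batch_size_for(10))

The sentences in each bucket are then padded up to that bucket's length rather than to the longest sentence in the whole dataset, which is the "nearest power of two" padding mentioned earlier.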
Create the generators.
train_batch_stream = trax.data.BucketByLength(
boundaries, batch_sizes,
length_keys=[0, 1] # As before: count inputs and targets to length.
)(filtered_train_stream)
eval_batch_stream = trax.data.BucketByLength(
boundaries, batch_sizes,
length_keys=[0, 1]
)(filtered_eval_stream)
Add masking for the padding (0s) using add_loss_weights (we're using AddLossWeights, but the documentation for that just says "see add_loss_weights"). I can't find any documentation for it, but I think the 0s are what BucketByLength uses for padding.
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)
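There isn't much documentation, but my understanding is that the loss weights simply zero out the padded positions in the targets so the padding doesn't contribute to the loss. A rough by-hand equivalent, assuming the padding id really is 0, might look like this:

def add_loss_weights_by_hand(stream, padding_id=0):
    """Sketch only: turn (inputs, targets) pairs into (inputs, targets, weights)."""
    for inputs, targets in stream:
        # weight of 1.0 for real tokens, 0.0 for padding positions
        weights = (targets != padding_id).astype(numpy.float32)
        yield inputs, targets, weights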
Exploring the data
We will now display some of our data. The tokenize() and detokenize() functions defined above handle the conversion between text and token ids so that we can focus on building the model itself. Let's first get the data generator and pull one batch of the data.
input_batch, target_batch, mask_batch = next(train_batch_stream)
Let's see the data type of a batch.
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))
input_batch data type:  <class 'numpy.ndarray'>
target_batch data type:  <class 'numpy.ndarray'>
Let's see the shape of this particular batch (batch size, sentence length).
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)
input_batch shape:  (32, 64)
target_batch shape:  (32, 64)
The input_batch and target_batch are NumPy arrays consisting of tokenized English sentences and German sentences respectively. These tokens will later be used to produce embedding vectors for each word in the sentence (so the embedding for a sentence will be a matrix). The number of sentences in each batch is usually a power of 2 for optimal computer memory usage.
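As a preview of how those token ids become vectors (the model post will do this properly), here is a rough sketch using trax's Embedding layer. The vocab_size and d_feature values below are assumptions for illustration, not something established in this post:

from trax import layers as tl

# Sketch: embed one batch of token ids just to see the shapes.
# vocab_size=33300 and d_feature=512 are assumed values for illustration.
embedding = tl.Embedding(vocab_size=33300, d_feature=512)
embedding.init(trax.shapes.signature(input_batch))
embedded = embedding(input_batch)
print(embedded.shape)  # (32, 64, 512): one vector per token in the batch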
We can now visually inspect some of the data. You can run the cell below several times to shuffle through the sentences. Just to note, while this is a standard data set that is used widely, it does have some known wrong translations. With that, let's pick a random sentence and print its tokenized representation.
Pick a random index less than the batch size.
index = random.randrange(len(input_batch))
Use the index to grab an entry from the input and target batch.
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')
THIS IS THE ENGLISH SENTENCE:
 Kidneys and urinary tract (no effects were found to be common); uncommon: blood in the urine, proteins in the urine, sugar in the urine; rare: urge to pass urine, kidney pain, passing urine frequently.
THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE:
 [ 5381 17607 3093 8 8670 6086 105 19166 5 50 154 1743 152 1103 9 32 568 8076 19124 6847 64 6196 6 4 8670 510 2 13355 823 6 4 8670 510 2 4968 6 4 8670 510 115 7227 64 7628 9 2685 8670 510 2 12220 5509 12095 2 19632 8670 510 7326 3550 30650 4729 992 1 0 0 0]
THIS IS THE GERMAN TRANSLATION:
 Harndrang, Nierenschmerzen, häufiges Wasserlassen.
THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION:
 [ 5135 14970 2920 2 6262 4594 27552 28 2 20052 33 3736 530 3550 30650 4729 992 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Bundle it Up
Imports
# python
from collections import namedtuple
from pathlib import Path
# pypi
import attr
import numpy
import trax
Constants
DataDefaults = namedtuple("DataDefaults",
["path",
"dataset",
"keys",
"evaluation_size",
"end_of_sentence",
"vocabulary_file",
"vocabulary_path",
"length_keys",
"boundaries",
"batch_sizes",
"padding_token"])
DEFAULTS = DataDefaults(
path=Path("~/data/tensorflow/translation/").expanduser(),
dataset="opus/medical",
keys=("en", "de"),
evaluation_size=0.01,
end_of_sentence=1,
vocabulary_file="ende_32k.subword",
vocabulary_path="gs://trax-ml/vocabs/",
length_keys=[0, 1],
boundaries=[2**power_of_two for power_of_two in range(3, 10)],
batch_sizes=[2**power_of_two for power_of_two in range(8, 0, -1)],
padding_token=0,
)
MaxLength = namedtuple("MaxLength", "train evaluate".split())
MAX_LENGTH = MaxLength(train=256, evaluate=512)
END_OF_SENTENCE = 1
Tokenizer/Detokenizer
Tokenizer
def tokenize(input_str: str,
             vocab_file: str=None, vocab_dir: str=None,
             end_of_sentence: int=DEFAULTS.end_of_sentence) -> numpy.ndarray:
    """Encodes a string to an array of integers

    Args:
     input_str: human-readable string to encode
     vocab_file: filename of the vocabulary text file
     vocab_dir: path to the vocabulary file
     end_of_sentence: token for the end of sentence

    Returns:
     tokenized version of the input string
    """
    # The trax.data.tokenize method takes streams and returns streams;
    # we get around it by making a 1-element stream with `iter`.
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_file=vocab_file,
                                     vocab_dir=vocab_dir))

    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [end_of_sentence]

    # Add the batch dimension to the front of the shape
    batch_inputs = numpy.reshape(numpy.array(inputs), [1, -1])
    return batch_inputs
Detokenizer
def detokenize(integers: numpy.ndarray,
               vocab_file: str=None,
               vocab_dir: str=None,
               end_of_sentence: int=DEFAULTS.end_of_sentence) -> str:
    """Decodes an array of integers to a human readable string

    Args:
     integers: array of integers to decode
     vocab_file: filename of the vocabulary text file
     vocab_dir: path to the vocabulary file
     end_of_sentence: token to mark the end of a sentence

    Returns:
     str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(numpy.squeeze(integers))

    # Remove the EOS to decode only the original tokens
    if end_of_sentence in integers:
        integers = integers[:integers.index(end_of_sentence)]

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)
Data Generator
@attr.s(auto_attribs=True)
class DataGenerator:
    """Generates the streams of data

    Args:
     training: whether this generates training data or not
     path: path to the data set
     data_set: name of the data set (from tensorflow datasets)
     keys: the names of the data
     max_length: longest allowed set of tokens
     evaluation_fraction: how much of the data is saved for evaluation
     length_keys: keys (indexes) to use when setting length
     boundaries: upper limits for batch sizes
     batch_sizes: batch_size for each boundary
     padding_token: which token is used for padding
     vocabulary_file: name of the sub-words vocabulary file
     vocabulary_path: where to find the vocabulary file
     end_of_sentence: token to indicate the end of a sentence
    """
    training: bool=True
    path: Path=DEFAULTS.path
    data_set: str=DEFAULTS.dataset
    keys: tuple=DEFAULTS.keys
    max_length: int=MAX_LENGTH.train
    length_keys: list=DEFAULTS.length_keys
    boundaries: list=DEFAULTS.boundaries
    batch_sizes: list=DEFAULTS.batch_sizes
    evaluation_fraction: float=DEFAULTS.evaluation_size
    vocabulary_file: str=DEFAULTS.vocabulary_file
    vocabulary_path: str=DEFAULTS.vocabulary_path
    padding_token: int=DEFAULTS.padding_token
    end_of_sentence: int=DEFAULTS.end_of_sentence
    _generator_function: type=None
    _batch_generator: type=None
Append End of Sentence
def end_of_sentence_generator(self, original):
    """Generator that adds end of sentence tokens

    Args:
     original: generator to add the end of sentence tokens to

    Yields:
     next tuple of arrays with EOS token added
    """
    for inputs, targets in original:
        inputs = list(inputs) + [self.end_of_sentence]
        targets = list(targets) + [self.end_of_sentence]
        yield numpy.array(inputs), numpy.array(targets)
    return
Generator Function
@property
def generator_function(self):
    """Function to create the data generator"""
    if self._generator_function is None:
        self._generator_function = trax.data.TFDS(self.data_set,
                                                  data_dir=self.path,
                                                  keys=self.keys,
                                                  eval_holdout_size=self.evaluation_fraction,
                                                  train=self.training)
    return self._generator_function
Batch Stream
@property
def batch_generator(self):
    """batch data generator"""
    if self._batch_generator is None:
        generator = self.generator_function()
        generator = trax.data.Tokenize(
            vocab_file=self.vocabulary_file,
            vocab_dir=self.vocabulary_path)(generator)
        generator = self.end_of_sentence_generator(generator)
        generator = trax.data.FilterByLength(
            max_length=self.max_length,
            length_keys=self.length_keys)(generator)
        generator = trax.data.BucketByLength(
            self.boundaries, self.batch_sizes,
            length_keys=self.length_keys
        )(generator)
        self._batch_generator = trax.data.AddLossWeights(
            id_to_mask=self.padding_token)(generator)
    return self._batch_generator
Try It Out
from neurotic.nlp.machine_translation import DataGenerator, detokenize
generator = DataGenerator().batch_generator
input_batch, target_batch, mask_batch = next(generator)
index = random.randrange(len(input_batch))
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')
THIS IS THE ENGLISH SENTENCE:
 Signs of hypersensitivity reactions include hives, generalised urticaria, tightness of the chest, wheezing, hypotension and anaphylaxis.
THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE:
 [10495 14 7 10224 19366 10991 1020 3481 2486 2 9547 7417 103 4572 11927 9371 2 13197 1496 7 4 24489 62 2 16402 24010 211 2 4814 23010 12122 22 8 4867 19606 6457 5175 14 3550 30650 4729 992 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
THIS IS THE GERMAN TRANSLATION:
 Überempfindlichkeitsreaktionen können sich durch Anzeichen wie Nesselausschlag, generalisierte Urtikaria, Engegefühl im Brustkorb, Pfeifatmung, Blutdruckabfall und Anaphylaxie äußern.
THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION:
 [ 3916 29551 13504 5020 4094 13522 119 51 121 8602 93 31508 6050 30327 6978 2 9547 7417 2446 5618 4581 5530 1384 2 26006 7831 13651 5 47 8584 4076 5262 868 2 25389 8898 28268 2 9208 29697 17944 83 12 9925 19606 6457 16384 5 11790 3550 30650 4729 992 1 0 0 0 0 0 0 0 0 0 0]
End
Now that we have our data prepared, it's time to move on to defining the Attention Model.