N-Gram: Building the Language Model
Table of Contents
Building the Language Model
Imports
# python
from collections import defaultdict
from functools import partial
import random
# pypi
from tabulate import tabulate
import numpy
import pandas
Set Up
TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")
Middle
The Count Matrix
To calculate the n-gram probability, you will need to count frequencies of n-grams and n-gram prefixes in the training dataset. In some of the code assignment exercises, you will store the n-gram frequencies in a dictionary.
In other parts of the assignment, you will build a count matrix that keeps counts of (n-1)-gram prefix followed by all possible last words in the vocabulary.
The following code shows how to check, retrieve and update counts of n-grams in the word count dictionary.
n_gram_counts = {
    ('i', 'am', 'happy'): 2,
    ('am', 'happy', 'because'): 1}
# get count for an n-gram tuple
print(f"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}")
count of n-gram ('i', 'am', 'happy'): 2
# check if n-gram is present in the dictionary
n_gram = ('i', 'am', 'learning')
status = "found" if n_gram in n_gram_counts else "missing"
print(f"n-gram {n_gram} {status}")
n-gram ('i', 'am', 'learning') missing
# update the count in the word count dictionary
n_gram_counts[n_gram] = 1
status = "found" if ('i', 'am', 'learning') in n_gram_counts else "missing"
print(f"n-gram {n_gram} {status}")
n-gram ('i', 'am', 'learning') found
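The hand-built dictionary above is enough to demonstrate the lookups, but in practice the counts come from a corpus. Here is a minimal sketch of how such a dictionary could be filled in with a sliding window over a tokenized corpus (the helper name count_n_grams and its exact signature are just for illustration, not the assignment's implementation).

def count_n_grams(corpus: list, n: int) -> dict:
    """Sketch: count every n-gram of length n in a tokenized corpus."""
    counts = defaultdict(int)
    # slide a window of n words over the corpus and bump the count for each n-gram seen
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i: i + n])] += 1
    return counts

print(count_n_grams(['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.'], 3))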
The next code snippet shows how to concatenate two tuples in Python. That will be handy when creating the n-gram from the prefix and the last word.
# concatenate the prefix tuple and a tuple holding the last word to create the n-gram
prefix = ('i', 'am', 'happy')
word = 'because'
# note here the syntax for creating a tuple for a single word
n_gram = prefix + (word,)
print(n_gram)
('i', 'am', 'happy', 'because')
Single-Pass Trigram Count Matrix
In the lecture, you've seen that the count matrix can be built in a single pass through the corpus. Here is one approach to doing that.
def single_pass_trigram_count_matrix(corpus: list) -> tuple:
    """
    Creates the trigram count matrix from the input corpus in a single pass through the corpus.

    Args:
        corpus: Pre-processed and tokenized corpus.

    Returns:
        bigrams: list of all bigram prefixes, the row index
        vocabulary: list of all found words, the column index
        count_matrix: pandas dataframe with bigram prefixes as rows,
                      vocabulary words as columns,
                      and the counts of the bigram/word combinations (i.e. trigrams) as values
    """
    bigrams = []
    vocabulary = []
    count_matrix_dict = defaultdict(dict)

    # go through the corpus once with a sliding window
    for i in range(len(corpus) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tuple(corpus[i : i + 3])

        bigram = trigram[0:-1]
        if bigram not in bigrams:
            bigrams.append(bigram)

        last_word = trigram[-1]
        if last_word not in vocabulary:
            vocabulary.append(last_word)

        if (bigram, last_word) not in count_matrix_dict:
            count_matrix_dict[bigram, last_word] = 0
        count_matrix_dict[bigram, last_word] += 1

    # convert the count_matrix to numpy.array to fill in the blanks
    count_matrix = numpy.zeros((len(bigrams), len(vocabulary)))
    for trigram_key, trigram_count in count_matrix_dict.items():
        count_matrix[bigrams.index(trigram_key[0]),
                     vocabulary.index(trigram_key[1])] = trigram_count

    # numpy.array to pandas dataframe conversion
    count_matrix = pandas.DataFrame(count_matrix, index=bigrams, columns=vocabulary)
    return bigrams, vocabulary, count_matrix
corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
bigrams, vocabulary, count_matrix = single_pass_trigram_count_matrix(corpus)
print(TABLE(count_matrix))
|                      |   happy |   because |   i |   am |   learning |   . |
|----------------------|---------|-----------|-----|------|------------|-----|
| ('i', 'am')          |       1 |         0 |   0 |    0 |          1 |   0 |
| ('am', 'happy')      |       0 |         1 |   0 |    0 |          0 |   0 |
| ('happy', 'because') |       0 |         0 |   1 |    0 |          0 |   0 |
| ('because', 'i')     |       0 |         0 |   0 |    1 |          0 |   0 |
| ('am', 'learning')   |       0 |         0 |   0 |    0 |          0 |   1 |
The Probability Matrix
The next step is to build a probability matrix from the count matrix. You can use a pandas DataFrame and its sum and div methods to normalize each cell count by the sum of its row.
Create the Probability Matrix from the Count Matrix
row_sums = count_matrix.sum(axis="columns")
Divide Each Row By Its Sum
prob_matrix = count_matrix.div(row_sums, axis="rows")
print(TABLE(prob_matrix))
|                      |   happy |   because |   i |   am |   learning |   . |
|----------------------|---------|-----------|-----|------|------------|-----|
| ('i', 'am')          |     0.5 |         0 |   0 |    0 |        0.5 |   0 |
| ('am', 'happy')      |       0 |         1 |   0 |    0 |          0 |   0 |
| ('happy', 'because') |       0 |         0 |   1 |    0 |          0 |   0 |
| ('because', 'i')     |       0 |         0 |   0 |    1 |          0 |   0 |
| ('am', 'learning')   |       0 |         0 |   0 |    0 |          0 |   1 |
Find the Probability of a Trigram
Since the columns of the probability matrix are the suffix words and the index is made up of the bigram prefixes, we'll need to split the trigram into those two parts to look up its probability.
trigram = ('i', 'am', 'happy')
bigram = trigram[:-1]
print(f'prefix-bigram: {bigram}')
prefix-bigram: ('i', 'am')
word = trigram[-1]
print(f"last word of the trigram: {word}")
last word of the trigram: happy
print(f"trigram probability: {prob_matrix[word][bigram]}")
trigram probability: 0.5
If you look at our corpus or the count matrix, you can see that this is the correct value ("i am" appears twice, and one of those times it is followed by "happy").
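As a quick sanity check (not part of the assignment), the same value can be recovered directly from the counts: the trigram count divided by the count of its bigram prefix, reusing the row_sums computed earlier.

# count of the full trigram ('i', 'am', 'happy')
trigram_count = count_matrix[word][bigram]
# count of the bigram prefix ('i', 'am') is the sum of its row in the count matrix
prefix_count = row_sums[bigram]
print(f"direct estimate: {trigram_count / prefix_count}")

This should again print 0.5.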
List all the words in the vocabulary starting with a given prefix
This is just a demonstration of checking the prefix of a string in Python.
vocabulary = ['i', 'am', 'happy', 'because', 'learning', '.', 'have', 'you', 'seen','it', '?']
starts_with = 'ha'
print(f'words in vocabulary starting with prefix: {starts_with}\n')
for word in (candidate for candidate in vocabulary
             if candidate.startswith(starts_with)):
    print(word)
words in vocabulary starting with prefix: ha

happy
have
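One place a prefix check like this comes in handy is auto-complete style suggestions. Here is a rough sketch (the helper suggest_word is hypothetical, not part of the assignment) that combines the prefix filter with the probability matrix built earlier to pick the most likely next word starting with a given prefix.

def suggest_word(prob_matrix, bigram, starts_with):
    """Sketch: most probable next word for a bigram prefix that starts with the given letters."""
    # keep only the columns (candidate next words) that match the prefix
    candidates = {word: prob_matrix[word][bigram]
                  for word in prob_matrix.columns
                  if word.startswith(starts_with)}
    return max(candidates, key=candidates.get) if candidates else None

print(suggest_word(prob_matrix, ('i', 'am'), 'le'))

For the bigram ('i', 'am') and the prefix 'le' this picks 'learning', since that column has probability 0.5 in the matrix above.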
Building Training, Validation, and Testing Sets
def train_validation_test_split(data, train_percent, validation_percent):
    """
    Splits the input data into train/validation/test sets according to the percentages provided.

    Args:
        data: Pre-processed and tokenized corpus, i.e. list of sentences.
        train_percent: integer 0-100, defines the portion of the input corpus allocated for training
        validation_percent: integer 0-100, defines the portion of the input corpus allocated for validation

        Note: train_percent + validation_percent need to be <= 100
              the remainder to 100 is allocated for the test set

    Returns:
        train_data: list of sentences, the training part of the corpus
        validation_data: list of sentences, the validation part of the corpus
        test_data: list of sentences, the test part of the corpus
    """
    # fixed seed here for reproducibility
    random.seed(87)

    # reshuffle all input sentences
    random.shuffle(data)

    train_size = int(len(data) * train_percent / 100)
    train_data = data[0:train_size]

    validation_size = int(len(data) * validation_percent / 100)
    validation_data = data[train_size:train_size + validation_size]

    test_data = data[train_size + validation_size:]

    return train_data, validation_data, test_data
Check the Sets
data = list(range(0, 100))
train_data, validation_data, test_data = train_validation_test_split(data, 80, 10)
print("split 80/10/10:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n",
f"test data:{test_data}\n")
train_data, validation_data, test_data = train_validation_test_split(data, 98, 1)
print("split 98/1/1:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n",
f"test data:{test_data}\n")
split 80/10/10:
 train data:[28, 76, 5, 0, 62, 29, 54, 95, 88, 58, 4, 22, 92, 14, 50, 77, 47, 33, 75, 68, 56, 74, 43, 80, 83, 84, 73, 93, 66, 87, 9, 91, 64, 79, 20, 51, 17, 27, 12, 31, 67, 81, 7, 34, 45, 72, 38, 30, 16, 60, 40, 86, 48, 21, 70, 59, 6, 19, 2, 99, 37, 36, 52, 61, 97, 44, 26, 57, 89, 55, 53, 85, 3, 39, 10, 71, 23, 32, 25, 8]
 validation data:[78, 65, 63, 11, 49, 98, 1, 46, 15, 41]
 test data:[90, 96, 82, 42, 35, 13, 69, 24, 94, 18]

split 98/1/1:
 train data:[66, 23, 29, 28, 52, 87, 70, 13, 15, 2, 62, 43, 82, 50, 40, 32, 30, 79, 71, 89, 6, 10, 34, 78, 11, 49, 39, 42, 26, 46, 58, 96, 97, 8, 56, 86, 33, 93, 92, 91, 57, 65, 95, 20, 72, 3, 12, 9, 47, 37, 67, 1, 16, 74, 53, 99, 54, 68, 5, 18, 27, 17, 48, 36, 24, 45, 73, 19, 41, 59, 21, 98, 0, 31, 4, 85, 80, 64, 84, 88, 25, 44, 61, 22, 60, 94, 76, 38, 77, 81, 90, 69, 63, 7, 51, 14, 55, 83]
 validation data:[35]
 test data:[75]
Perplexity
In order to implement the perplexity formula, you'll need to know how to compute the M-th root of a variable.

\begin{equation*}
PP(W) = \sqrt[M]{\prod_{i=1}^{M}{\frac{1}{P(w_i|w_{i-1})}}}
\end{equation*}

Remember from calculus:

\begin{equation*}
\sqrt[M]{\frac{1}{x}} = x^{-\frac{1}{M}}
\end{equation*}

Here is some code that will help you with the formula.
p = 10 ** (-250)
M = 100
print(f"perplexity = {p ** (-1 / M):0.3f}")
perplexity = 316.228
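To tie this back to the trigram model from earlier, here is a rough sketch of computing the perplexity of a tokenized sentence from the trigram probabilities in prob_matrix, using the same formula with trigram conditionals. It assumes every trigram of the sentence actually appears in the matrix (no smoothing or unknown-word handling), and the helper name sentence_perplexity is just for illustration.

def sentence_perplexity(sentence: list, prob_matrix) -> float:
    """Sketch: perplexity of a sentence under the trigram probability matrix."""
    product = 1.0
    M = 0
    # multiply the reciprocal probability of each trigram in the sentence
    for i in range(len(sentence) - 3 + 1):
        bigram = tuple(sentence[i: i + 2])
        last_word = sentence[i + 2]
        product *= 1 / prob_matrix[last_word][bigram]
        M += 1
    # take the M-th root of the product
    return product ** (1 / M)

print(sentence_perplexity(['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.'], prob_matrix))

On the training corpus itself this should work out to the sixth root of 4, roughly 1.26, since only the two ('i', 'am') trigrams have probability 0.5 and the remaining four have probability 1.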