Auto-Complete: the N-Gram Model
Develop an N-Gram Based Language Model
We'll continue from the previous post, in which we finished pre-processing the data to build our Auto-Complete system.
In this section, you will develop the n-gram language model.
- Assume the probability of the next word depends only on the previous n-gram.
- The previous n-gram is the series of the previous 'n' words.
The conditional probability for the word at position 't' in the sentence, given that the words preceding it are \(w_{t-1}, w_{t-2}, \dots, w_{t-n}\), is:
\[ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1} \]
You can estimate this probability by counting occurrences of these sequences of words in the training data.
- The probability can be estimated as a ratio, where
- The numerator is the number of times the word at position 't' appears right after words t-1 through t-n in the training data.
- The denominator is the number of times words t-1 through t-n appear in the training data.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]
- The function \(C(\cdots)\) denotes the number of occurrences of the given sequence.
- \(\hat{P}\) means the estimate of \(P\).
- Notice that the denominator of equation (2) is the number of occurrences of the previous n words, and the numerator is the same sequence followed by the word \(w_t\).
Later, you will modify equation (2) by adding k-smoothing, which avoids errors when any counts are zero.
Equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for the denominator) and (n+1)-grams (for the numerator).
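Before the general implementation, here is a minimal unsmoothed sanity check of equation (2), using the toy corpus from the tests below (it counts within each sentence, so no boundary markers are needed yet):
# unsmoothed estimate of P(a | like) on the toy corpus used in the tests below
corpus = [["i", "like", "a", "cat"],
          ["this", "dog", "is", "like", "a", "cat"]]
# count C(like, a) and C(like) sentence by sentence so no spurious
# bigrams are created across sentence boundaries
bigram_count = sum(
    1 for sentence in corpus
    for pair in zip(sentence, sentence[1:]) if pair == ("like", "a"))
unigram_count = sum(sentence.count("like") for sentence in corpus)
print(bigram_count / unigram_count)  # C(like, a)/C(like) = 2/2 = 1.0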
Imports
# python
from functools import partial
from pprint import pprint
import math
# pypi
from expects import be_true, expect, have_keys
from tabulate import tabulate
import numpy
import pandas
Set Up
TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")
Middle
Count N-Grams
Next, you will implement a function that computes the counts of n-grams for an arbitrary number \(n\).
When computing the counts for n-grams, prepare each sentence beforehand by prepending n start markers "<s>" to indicate the beginning of the sentence (the code below prepends n of them, matching the bi-gram example that follows).
- For example, in the bi-gram model (n=2), a sequence with two start tokens "<s><s>" should predict the first word of a sentence.
- So, if the sentence is "I like food", modify it to be "<s><s> I like food".
- Also prepare the sentence for counting by appending an end token "<e>" so that the model can predict when to finish a sentence. The snippet after this list shows the prepared result.
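A minimal sketch of that preparation step:
# prepare "i like food" for bi-gram counting (n=2)
n = 2
sentence = ["i", "like", "food"]
prepared = ["<s>"] * n + sentence + ["<e>"]
print(prepared)  # ['<s>', '<s>', 'i', 'like', 'food', '<e>']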
Technical note: In this implementation, you will store the counts as a dictionary.
- The key of each key-value pair in the dictionary is a tuple of n words (and not a list)
- The value in the key-value pair is the number of occurrences.
- The reason for using a tuple as a key instead of a list is that a list in Python is mutable (it can be changed after it is created), while a tuple is immutable and therefore hashable. This makes a tuple suitable as a dictionary key, as the snippet below demonstrates.
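A quick demonstration of the difference:
counts = {}
counts[("like", "a")] = 1       # a tuple of words is hashable, so this works
try:
    counts[["like", "a"]] = 1   # a list is mutable and unhashable
except TypeError as error:
    print(error)                # unhashable type: 'list'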
Hints
- To prepend or append, you can create lists and concatenate them using the + operator.
- To create a list of a repeated value, you can use this syntax: `['a'] * 3` produces `['a', 'a', 'a']`.
- To set the range for index 'i', think of this example: an n-gram where n=2 (bigram) and a sentence of length N=5 (including two start tokens and one end token), so the index positions are `[0, 1, 2, 3, 4]`. The largest index 'i' where a bigram can start is i=3, because the tokens at positions 3 and 4 form the last bigram.
- Remember that the `range()` function excludes the value used as its maximum: `range(3)` produces 0, 1, 2 but excludes 3.
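A minimal sketch of that indexing on a five-token sentence:
# enumerate every bigram start index in a length-5 sentence
sentence = ("<s>", "<s>", "i", "like", "<e>")
n = 2
# valid start positions are 0 through 3, i.e. range(len(sentence) - n + 1)
print([sentence[i:i + n] for i in range(len(sentence) - n + 1)])
# [('<s>', '<s>'), ('<s>', 'i'), ('i', 'like'), ('like', '<e>')]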
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: count_n_grams ###
def count_n_grams(data: list, n: int, start_token: str='<s>', end_token: str='<e>') -> dict:
"""
Count all n-grams in the data
Args:
data: List of lists of words
n: number of words in a sequence
start_token: marker prepended n times to each sentence
end_token: marker appended once to each sentence
Returns:
A dictionary that maps a tuple of n-words to its frequency
"""
# Initialize dictionary of n-grams and their counts
n_grams = {}
### START CODE HERE (Replace instances of 'None' with your code) ###
# Go through each sentence in the data
for sentence in data: # complete this line
# prepend start token n times, and append <e> one time
sentence = [start_token] * n + sentence + [end_token]
# convert list to tuple
# So that the sequence of words can be used as
# a key in the dictionary
sentence = tuple(sentence)
# Use 'i' to indicate the start of the n-gram
# from index 0
# to the last index where the end of the n-gram
# is within the sentence.
for i in range(0, len(sentence) - (n - 1)): # complete this line
# Get the n-gram from i to i+n
n_gram = sentence[i: i + n]
# check if the n-gram is in the dictionary
if n_gram in n_grams: # complete this line
# Increment the count for this n-gram
n_grams[n_gram] += 1
else:
# Initialize this n-gram count to 1
n_grams[n_gram] = 1
### END CODE HERE ###
return n_grams
Test It
# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# **** Unigram ****
print("Uni-gram:")
expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
actual = count_n_grams(sentences, 1)
print(actual)
expect(actual).to(have_keys(expected))
# **** Bi-Gram ****
print("Bi-gram:")
expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
actual = count_n_grams(sentences, 2)
print(actual)
expect(actual).to(have_keys(expected))
Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
Probability Estimates
Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]
This formula doesn't work when a count of an n-gram is zero.
- Suppose we encounter an n-gram that did not occur in the training data.
- Then equation (2) cannot be evaluated (it becomes zero divided by zero).
A way to handle zero counts is to add k-smoothing.
- K-smoothing adds a positive constant \(k\) to the numerator and \(k \times |V|\) to the denominator, where \(|V|\) is the number of words in the vocabulary.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_n) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} \]
For n-grams that have a zero count, equation (3) becomes \(\frac{0 + k}{0 + k|V|} = \frac{1}{|V|}\).
- This means that any n-gram with zero count has the same probability of \(\frac{1}{|V|}\).
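As a concrete check, anticipating the test further down (previous word 'a', next word 'cat', \(C(\text{a, cat}) = 2\), \(C(\text{a}) = 2\), \(|V| = 7\), \(k = 1\)):
\[ \hat{P}(\text{cat} | \text{a}) = \frac{2 + 1}{2 + 1 \times 7} = \frac{3}{9} \approx 0.3333 \]
which matches the value that the test expects.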
Define a function that computes the probability estimate (3) from n-gram counts and a constant k.
- The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.
- The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word.
Hints
- To define a tuple containing a single value, add a comma after that value. For example, `('apple',)` is a tuple containing the single string 'apple'.
- To concatenate two tuples, use the '+' operator.
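Both hints in two lines:
previous_n_gram = ("like",)      # a single-element tuple (note the trailing comma)
print(previous_n_gram + ("a",))  # concatenating tuples: ('like', 'a')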
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: estimate_probability ###
def estimate_probability(word: str,
previous_n_gram: tuple,
n_gram_counts: dict,
n_plus1_gram_counts: dict,
vocabulary_size: int,
k: float=1.0) -> float:
"""
Estimate the probability of a next word using the n-gram counts with k-smoothing
Args:
word: next word
previous_n_gram: A sequence of words of length n
n_gram_counts: Dictionary of counts of n-grams
n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
vocabulary_size: number of words in the vocabulary
k: positive constant, smoothing parameter
Returns:
A probability
"""
# convert list to tuple to use it as a dictionary key
previous_n_gram = tuple(previous_n_gram)
### START CODE HERE (Replace instances of 'None' with your code) ###
# Set the denominator
# If the previous n-gram exists in the dictionary of n-gram counts,
# Get its count. Otherwise set the count to zero
# Use the dictionary that has counts for n-grams
previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
# Calculate the denominator using the count of the previous n gram
# and apply k-smoothing
denominator = previous_n_gram_count + k * vocabulary_size
# Define n plus 1 gram as the previous n-gram plus the current word as a tuple
n_plus1_gram = previous_n_gram + (word,)
# Set the count to the count in the dictionary,
# otherwise 0 if not in the dictionary
# use the dictionary that has counts for the n-gram plus current word
n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
# Define the numerator use the count of the n-gram plus current word,
# and apply smoothing
numerator = n_plus1_gram_count + k
# Calculate the probability as the numerator divided by denominator
probability = numerator/denominator
### END CODE HERE ###
return probability
Test Code
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
# pass the previous n-gram as a list; a bare multi-character string would be split into characters by tuple()
actual = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333
Estimate probabilities for all words
The function defined below loops over all words in the vocabulary to calculate probabilities for all possible words.
- This function is provided for you.
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
"""
Estimate the probabilities of next words using the n-gram counts with k-smoothing
Args:
previous_n_gram: A sequence of words of length n
n_gram_counts: Dictionary of counts of n-grams
n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
vocabulary: List of words
k: positive constant, smoothing parameter
Returns:
A dictionary mapping from next words to the probability.
"""
# convert list to tuple to use it as a dictionary key
previous_n_gram = tuple(previous_n_gram)
# add <e> <unk> to the vocabulary
# <s> is not needed since it should not appear as the next word
vocabulary = vocabulary + ["<e>", "<unk>"]
vocabulary_size = len(vocabulary)
probabilities = {}
for word in vocabulary:
probability = estimate_probability(word, previous_n_gram,
n_gram_counts, n_plus1_gram_counts,
vocabulary_size, k=k)
probabilities[word] = probability
return probabilities
Test It
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
actual = estimate_probabilities(["a"], unigram_counts, bigram_counts, unique_words, k=1)
expected = {'cat': 0.2727272727272727,
'i': 0.09090909090909091,
'this': 0.09090909090909091,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.2727272727272727,
 'dog': 0.09090909090909091,
 'i': 0.09090909090909091,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.09090909090909091}
trigram_counts = count_n_grams(sentences, 3)
actual = estimate_probabilities(["<s>", "<s>"], bigram_counts, trigram_counts, unique_words, k=1)
expected = {'cat': 0.09090909090909091,
'i': 0.18181818181818182,
'this': 0.18181818181818182,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.09090909090909091,
 'dog': 0.09090909090909091,
 'i': 0.18181818181818182,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.18181818181818182}
Count and probability matrices
As we have seen so far, the n-gram counts computed above are sufficient for computing the probabilities of the next word.
- It can be more intuitive to present them as count or probability matrices.
- The functions defined in the next cells return count or probability matrices.
- This function is provided for you.
def make_count_matrix(n_plus1_gram_counts, vocabulary):
# add <e> <unk> to the vocabulary
# <s> is omitted since it should not appear as the next word
vocabulary = vocabulary + ["<e>", "<unk>"]
# obtain unique n-grams
n_grams = []
for n_plus1_gram in n_plus1_gram_counts.keys():
n_gram = n_plus1_gram[0:-1]
n_grams.append(n_gram)
n_grams = list(set(n_grams))
# mapping from n-gram to row
row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
# mapping from next word to column
col_index = {word:j for j, word in enumerate(vocabulary)}
nrow = len(n_grams)
ncol = len(vocabulary)
count_matrix = numpy.zeros((nrow, ncol))
for n_plus1_gram, count in n_plus1_gram_counts.items():
n_gram = n_plus1_gram[0:-1]
word = n_plus1_gram[-1]
if word not in vocabulary:
continue
i = row_index[n_gram]
j = col_index[word]
count_matrix[i, j] = count
count_matrix = pandas.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
return count_matrix
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print('bigram counts')
print(TABLE(make_count_matrix(bigram_counts, unique_words)))
|           |   cat |   a |   like |   this |   dog |   i |   is |   <e> |   <unk> |
|-----------+-------+-----+--------+--------+-------+-----+------+-------+---------|
| ('dog',)  |     0 |   0 |      0 |      0 |     0 |   0 |    1 |     0 |       0 |
| ('<s>',)  |     0 |   0 |      0 |      1 |     0 |   1 |    0 |     0 |       0 |
| ('cat',)  |     0 |   0 |      0 |      0 |     0 |   0 |    0 |     2 |       0 |
| ('is',)   |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('a',)    |     2 |   0 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('i',)    |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('like',) |     0 |   2 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('this',) |     0 |   0 |      0 |      0 |     1 |   0 |    0 |     0 |       0 |
Show trigram counts
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_count_matrix(trigram_counts, unique_words)))
trigram counts
|                 |   cat |   a |   like |   this |   dog |   i |   is |   <e> |   <unk> |
|-----------------+-------+-----+--------+--------+-------+-----+------+-------+---------|
| ('<s>', 'i')    |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('i', 'like')   |     0 |   1 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('<s>', 'this') |     0 |   0 |      0 |      0 |     1 |   0 |    0 |     0 |       0 |
| ('like', 'a')   |     2 |   0 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('<s>', '<s>')  |     0 |   0 |      0 |      1 |     0 |   1 |    0 |     0 |       0 |
| ('is', 'like')  |     0 |   1 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('dog', 'is')   |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('this', 'dog') |     0 |   0 |      0 |      0 |     0 |   0 |    1 |     0 |       0 |
| ('a', 'cat')    |     0 |   0 |      0 |      0 |     0 |   0 |    0 |     2 |       0 |
Probability Matrix
The following function calculates the probability of each word given the previous n-gram and stores the results in matrix form. Note that dividing each row of the (count + k) matrix by its row sum reproduces equation (3), since each row sums to \(C(w_{t-1}\dots w_{t-n}) + k|V|\).
def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    """Build a matrix of k-smoothed probabilities from the (n+1)-gram counts"""
    # use the vocabulary argument (not the global unique_words)
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis="columns"), axis="index")
    return prob_matrix
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
print(TABLE(make_probability_matrix(bigram_counts, unique_words, k=1)))
bigram probabilities
|           | cat       | a         | like      | this      | dog       | i         | is        | <e>       | <unk>     |
|-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('dog',)  | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('<s>',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('cat',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |
| ('is',)   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('a',)    | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('i',)    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like',) | 0.0909091 | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('this',) | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_probability_matrix(trigram_counts, unique_words, k=1)))
trigram probabilities
|                 | cat       | a         | like      | this      | dog       | i         | is        | <e>       | <unk>     |
|-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('<s>', 'i')    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('i', 'like')   | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('<s>', 'this') | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like', 'a')   | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('<s>', '<s>')  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('is', 'like')  | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('dog', 'is')   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('this', 'dog') | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('a', 'cat')    | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |
Confirm that you obtain the same results as for the `estimate_probabilities` function that you implemented.
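One way to run that confirmation, sketched with the objects from the cells above (`to_dict` avoids having to index the DataFrame by a tuple directly):
# compare the ('a',) row of the bigram probability matrix with
# the output of estimate_probabilities for the same previous word
probabilities = estimate_probabilities(["a"], unigram_counts, bigram_counts,
                                        unique_words, k=1)
matrix = make_probability_matrix(bigram_counts, unique_words, k=1)
matrix_row = matrix.to_dict(orient="index")[("a",)]
for word, probability in probabilities.items():
    expect(math.isclose(matrix_row[word], probability)).to(be_true)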
Bundle It Up
<<imports>>
<<n-gram>>
<<start-tokens>>
<<end-tokens>>
<<sentences>>
<<n-grams>>
<<counts>>
<<n-gram-probability>>
<<n-grams-model>>
<<n-plus-one>>
<<vocabulary>>
<<vocabulary-size>>
<<probability>>
<<probabilities>>
Imports
# python
from collections import Counter
from itertools import chain
# pypi
import attr
The N-Gram
@attr.s(auto_attribs=True)
class NGrams:
"""The N-Gram Language Model
Args:
data: the training data
n: the size of the n-grams
start_token: string to represent the start of a sentence
end_token: string to represent the end of a sentence
"""
data: list
n: int
start_token: str="<s>"
end_token: str="<e>"
_start_tokens: list=None
_end_tokens: list=None
_sentences: list=None
_n_grams: list=None
_counts: dict=None
- Start Tokens
@property
def start_tokens(self) -> list:
    """List of 'n' start tokens"""
    if self._start_tokens is None:
        self._start_tokens = [self.start_token] * self.n
    return self._start_tokens
- End Tokens
@property
def end_tokens(self) -> list:
    """List of one end token"""
    if self._end_tokens is None:
        self._end_tokens = [self.end_token]
    return self._end_tokens
- Sentences
@property
def sentences(self) -> list:
    """The data augmented with tags and converted to tuples"""
    if self._sentences is None:
        self._sentences = [tuple(self.start_tokens + sentence + self.end_tokens)
                           for sentence in self.data]
    return self._sentences
- N-Grams
@property
def n_grams(self) -> list:
    """The n-grams from the data

    Warning: this flattens the n-grams so there isn't any sentence structure
    """
    if self._n_grams is None:
        self._n_grams = chain.from_iterable([
            [sentence[cut: cut + self.n]
             for cut in range(0, len(sentence) - (self.n - 1))]
            for sentence in self.sentences])
    return self._n_grams
- Count
@property
def counts(self) -> Counter:
    """A count of all n-grams in the data

    Returns:
     A dictionary that maps a tuple of n-words to its frequency
    """
    if self._counts is None:
        self._counts = Counter(self.n_grams)
    return self._counts
N-Gram Probability
@attr.s(auto_attribs=True)
class NGramProbability:
"""Probability model for n-grams
Args:
data: the source for the n-grams
n: the size of the n-grams
k: smoothing factor
augment_vocabulary: hack because the two probability functions use different vocabularies
"""
data: list
n: int
k: float=1.0
augment_vocabulary: bool=True
_n_grams: NGrams=None
_n_plus_one: NGrams=None
_vocabulary: set=None
_vocabulary_size: int=None
_probabilities: dict=None
- N-Grams
@property
def n_grams(self) -> NGrams:
    """The n-grams built from the data"""
    if self._n_grams is None:
        self._n_grams = NGrams(data=self.data, n=self.n)
    return self._n_grams
- N-Plus-One Grams
@property
def n_plus_one(self) -> NGrams:
    """N+1 Grams"""
    if self._n_plus_one is None:
        self._n_plus_one = NGrams(data=self.data, n=self.n + 1)
    return self._n_plus_one
- The Vocabulary
@property
def vocabulary(self) -> set:
    """Unique words in the dictionary"""
    if self._vocabulary is None:
        data = list(chain.from_iterable(self.data)).copy()
        if self.augment_vocabulary:
            data += ["<e>", "<unk>"]
        self._vocabulary = set(data)
    return self._vocabulary
- Vocabulary Size
@property
def vocabulary_size(self) -> int:
    """Number of unique tokens in the data"""
    if self._vocabulary_size is None:
        self._vocabulary_size = len(self.vocabulary)
    return self._vocabulary_size
- Probability
def probability(self, word: str, previous_n_gram: tuple) -> float:
    """Calculates the probability of the word given the previous n-gram"""
    # just in case it's a list
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = self.n_grams.counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + self.k * self.vocabulary_size
    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = self.n_plus_one.counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self.k
    return numerator/denominator
- Probabilities
def probabilities(self, previous_n_gram: tuple) -> dict:
    """Finds the probability of each word in the vocabulary

    Args:
     previous_n_gram: the preceding tuple to calculate probabilities

    Returns:
     word:<probability word follows previous n-gram> for the vocabulary
    """
    previous_n_gram = tuple(previous_n_gram)
    return {word: self.probability(word=word, previous_n_gram=previous_n_gram)
            for word in self.vocabulary}
Test It Out
from neurotic.nlp.autocomplete import NGrams
# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# **** Unigram ****
expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2,
('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
uni_grams = NGrams(sentences, 1)
actual = uni_grams.counts
expect(actual).to(have_keys(expected))
# **** Bi-Gram ****
expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1,
('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2,
('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1,
('is', 'like'): 1}
bi_grams = NGrams(sentences, 2)
actual = bi_grams.counts
expect(actual).to(have_keys(expected))
from neurotic.nlp.autocomplete import NGramProbability
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# the examples for the two probability functions don't behave the same
# so for this case don't augment the vocabulary with empty and unknown tokens
model = NGramProbability(sentences, n=1, augment_vocabulary=False)
actual = model.probability("cat", ["a"])
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
# the probabilities test examples assume that you did augment the vocabulary
model = NGramProbability(sentences, n=1)
actual = model.probabilities(["a"])
expected = {'cat': 0.2727272727272727,
'i': 0.09090909090909091,
'this': 0.09090909090909091,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
model = NGramProbability(sentences, n=2)
actual = model.probabilities(["<s>", "<s>"])
expected = {'cat': 0.09090909090909091,
'i': 0.18181818181818182,
'this': 0.18181818181818182,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
End
Now that we have the N-Gram model, we'll move on to checking its Perplexity.