Auto-Complete: the N-Gram Model
Develop an N-Gram Based Language Model
We'll continue from the previous post, in which we finished pre-processing the data to build our Auto-Complete system.
In this section, you will develop the n-gram language model.
- Assume the probability of the next word depends only on the previous n-gram.
- The previous n-gram is the series of the previous 'n' words.
The conditional probability for the word at position 't' in the sentence, given that the words preceding it are \(w_{t-1}, w_{t-2}, \dots, w_{t-n}\), is:
\[ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1} \]
You can estimate this probability by counting occurrences of these sequences of words in the training data.
- The probability can be estimated as a ratio, where
- The numerator is the number of times the word at position 't' appears right after words t-1 through t-n in the training data.
- The denominator is the number of times words t-1 through t-n appear in the training data.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]
- The function \(C(\cdots)\) denotes the number of occurrences of the given sequence.
- \(\hat{P}\) means the estimate of \(P\).
- Notice that the denominator of equation (2) is the number of occurrences of the previous n words, and the numerator is the same sequence followed by the word \(w_t\).
Later, you will modify equation (2) by adding k-smoothing, which avoids errors when any counts are zero.
Equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for the denominator) and (n+1)-grams (for the numerator).
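Before the general implementation, here is a minimal unsmoothed sanity check of equation (2), using the toy corpus from the tests below (it counts within each sentence, so no boundary markers are needed yet):
# unsmoothed estimate of P(a | like) on the toy corpus used in the tests below
corpus = [["i", "like", "a", "cat"],
          ["this", "dog", "is", "like", "a", "cat"]]
# count C(like, a) and C(like) sentence by sentence so no spurious
# bigrams are created across sentence boundaries
bigram_count = sum(
    1 for sentence in corpus
    for pair in zip(sentence, sentence[1:]) if pair == ("like", "a"))
unigram_count = sum(sentence.count("like") for sentence in corpus)
print(bigram_count / unigram_count)  # C(like, a)/C(like) = 2/2 = 1.0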
Imports
# python
from functools import partial
from pprint import pprint
import math
# pypi
from expects import be_true, expect, have_keys
from tabulate import tabulate
import numpy
import pandas
Set Up
TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")
Middle
Count N-Grams
Next, you will implement a function that computes the counts of n-grams for an arbitrary number \(n\).
When computing the counts for n-grams, prepare each sentence beforehand by prepending n start markers "<s>" to indicate the beginning of the sentence (the code below prepends n of them, matching the bi-gram example that follows).
- For example, in the bi-gram model (n=2), a sequence with two start tokens "<s><s>" should predict the first word of a sentence.
- So, if the sentence is "I like food", modify it to be "<s><s> I like food".
- Also prepare the sentence for counting by appending an end token "<e>" so that the model can predict when to finish a sentence. The snippet after this list shows the prepared result.
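A minimal sketch of that preparation step:
# prepare "i like food" for bi-gram counting (n=2)
n = 2
sentence = ["i", "like", "food"]
prepared = ["<s>"] * n + sentence + ["<e>"]
print(prepared)  # ['<s>', '<s>', 'i', 'like', 'food', '<e>']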
Technical note: In this implementation, you will store the counts as a dictionary.
- The key of each key-value pair in the dictionary is a tuple of n words (and not a list)
- The value in the key-value pair is the number of occurrences.
- The reason for using a tuple as a key instead of a list is that a list in Python is mutable (it can be changed after it is created), while a tuple is immutable and therefore hashable. This makes a tuple suitable as a dictionary key, as the snippet below demonstrates.
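A quick demonstration of the difference:
counts = {}
counts[("like", "a")] = 1       # a tuple of words is hashable, so this works
try:
    counts[["like", "a"]] = 1   # a list is mutable and unhashable
except TypeError as error:
    print(error)                # unhashable type: 'list'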
Hints
- To prepend or append, you can create lists and concatenate them using the + operator.
- To create a list of a repeated value, you can use this syntax: `['a'] * 3` produces `['a', 'a', 'a']`.
- To set the range for index 'i', think of this example: an n-gram where n=2 (bigram) and a sentence of length N=5 (including two start tokens and one end token), so the index positions are `[0, 1, 2, 3, 4]`. The largest index 'i' where a bigram can start is i=3, because the tokens at positions 3 and 4 form the last bigram.
- Remember that the `range()` function excludes the value used as its maximum: `range(3)` produces 0, 1, 2 but excludes 3.
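A minimal sketch of that indexing on a five-token sentence:
# enumerate every bigram start index in a length-5 sentence
sentence = ("<s>", "<s>", "i", "like", "<e>")
n = 2
# valid start positions are 0 through 3, i.e. range(len(sentence) - n + 1)
print([sentence[i:i + n] for i in range(len(sentence) - n + 1)])
# [('<s>', '<s>'), ('<s>', 'i'), ('i', 'like'), ('like', '<e>')]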
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: count_n_grams ###
def count_n_grams(data: list, n: int, start_token: str='<s>', end_token: str='<e>') -> dict:
"""
Count all n-grams in the data
Args:
data: List of lists of words
n: number of words in a sequence
start_token: marker prepended n times to each sentence
end_token: marker appended once to each sentence
Returns:
A dictionary that maps a tuple of n-words to its frequency
"""
# Initialize dictionary of n-grams and their counts
n_grams = {}
### START CODE HERE (Replace instances of 'None' with your code) ###
# Go through each sentence in the data
for sentence in data: # complete this line
# prepend start token n times, and append <e> one time
sentence = [start_token] * n + sentence + [end_token]
# convert list to tuple
# So that the sequence of words can be used as
# a key in the dictionary
sentence = tuple(sentence)
# Use 'i' to indicate the start of the n-gram
# from index 0
# to the last index where the end of the n-gram
# is within the sentence.
for i in range(0, len(sentence) - (n - 1)): # complete this line
# Get the n-gram from i to i+n
n_gram = sentence[i: i + n]
# check if the n-gram is in the dictionary
if n_gram in n_grams: # complete this line
# Increment the count for this n-gram
n_grams[n_gram] += 1
else:
# Initialize this n-gram count to 1
n_grams[n_gram] = 1
### END CODE HERE ###
return n_grams
Test It
# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# **** Unigram ****
print("Uni-gram:")
expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
actual = count_n_grams(sentences, 1)
print(actual)
expect(actual).to(have_keys(expected))
# **** Bi-Gram ****
print("Bi-gram:")
expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
actual = count_n_grams(sentences, 2)
print(actual)
expect(actual).to(have_keys(expected))
Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
Probability Estimates
Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]
This formula doesn't work when a count of an n-gram is zero.
- Suppose we encounter an n-gram that did not occur in the training data.
- Then equation (2) cannot be evaluated (it becomes zero divided by zero).
A way to handle zero counts is to add k-smoothing.
- K-smoothing adds a positive constant \(k\) to the numerator and \(k \times |V|\) to the denominator, where \(|V|\) is the number of words in the vocabulary.
\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_n) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} \]
For n-grams that have a zero count, equation (3) becomes \(\frac{0 + k}{0 + k|V|} = \frac{1}{|V|}\).
- This means that any n-gram with zero count has the same probability of \(\frac{1}{|V|}\).
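As a concrete check, anticipating the test further down (previous word 'a', next word 'cat', \(C(\text{a, cat}) = 2\), \(C(\text{a}) = 2\), \(|V| = 7\), \(k = 1\)):
\[ \hat{P}(\text{cat} | \text{a}) = \frac{2 + 1}{2 + 1 \times 7} = \frac{3}{9} \approx 0.3333 \]
which matches the value that the test expects.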
Define a function that computes the probability estimate (3) from n-gram counts and a constant k.
- The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.
- The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word.
Hints
- To define a tuple containing a single value, add a comma after that value. For example, `('apple',)` is a tuple containing the single string 'apple'.
- To concatenate two tuples, use the '+' operator.
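Both hints in two lines:
previous_n_gram = ("like",)      # a single-element tuple (note the trailing comma)
print(previous_n_gram + ("a",))  # concatenating tuples: ('like', 'a')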
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: estimate_probability ###
def estimate_probability(word: str,
previous_n_gram: tuple,
n_gram_counts: dict,
n_plus1_gram_counts: dict,
vocabulary_size: int,
k: float=1.0) -> float:
"""
Estimate the probability of a next word using the n-gram counts with k-smoothing
Args:
word: next word
previous_n_gram: A sequence of words of length n
n_gram_counts: Dictionary of counts of n-grams
n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
vocabulary_size: number of words in the vocabulary
k: positive constant, smoothing parameter
Returns:
A probability
"""
# convert list to tuple to use it as a dictionary key
previous_n_gram = tuple(previous_n_gram)
### START CODE HERE (Replace instances of 'None' with your code) ###
# Set the denominator
# If the previous n-gram exists in the dictionary of n-gram counts,
# Get its count. Otherwise set the count to zero
# Use the dictionary that has counts for n-grams
previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
# Calculate the denominator using the count of the previous n gram
# and apply k-smoothing
denominator = previous_n_gram_count + k * vocabulary_size
# Define n plus 1 gram as the previous n-gram plus the current word as a tuple
n_plus1_gram = previous_n_gram + (word,)
# Set the count to the count in the dictionary,
# otherwise 0 if not in the dictionary
# use the dictionary that has counts for the n-gram plus current word
n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
# Define the numerator use the count of the n-gram plus current word,
# and apply smoothing
numerator = n_plus1_gram_count + k
# Calculate the probability as the numerator divided by denominator
probability = numerator/denominator
### END CODE HERE ###
return probability
Test Code
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
# pass the previous n-gram as a list; a bare multi-character string would be split into characters by tuple()
actual = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333
Estimate probabilities for all words
The function defined below loops over all words in the vocabulary to calculate probabilities for all possible words.
- This function is provided for you.
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
"""
Estimate the probabilities of next words using the n-gram counts with k-smoothing
Args:
previous_n_gram: A sequence of words of length n
n_gram_counts: Dictionary of counts of n-grams
n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
vocabulary: List of words
k: positive constant, smoothing parameter
Returns:
A dictionary mapping from next words to the probability.
"""
# convert list to tuple to use it as a dictionary key
previous_n_gram = tuple(previous_n_gram)
# add <e> <unk> to the vocabulary
# <s> is not needed since it should not appear as the next word
vocabulary = vocabulary + ["<e>", "<unk>"]
vocabulary_size = len(vocabulary)
probabilities = {}
for word in vocabulary:
probability = estimate_probability(word, previous_n_gram,
n_gram_counts, n_plus1_gram_counts,
vocabulary_size, k=k)
probabilities[word] = probability
return probabilities
Test It
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
actual = estimate_probabilities(["a"], unigram_counts, bigram_counts, unique_words, k=1)
expected = {'cat': 0.2727272727272727,
'i': 0.09090909090909091,
'this': 0.09090909090909091,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.2727272727272727,
 'dog': 0.09090909090909091,
 'i': 0.09090909090909091,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.09090909090909091}
trigram_counts = count_n_grams(sentences, 3)
actual = estimate_probabilities(["<s>", "<s>"], bigram_counts, trigram_counts, unique_words, k=1)
expected = {'cat': 0.09090909090909091,
'i': 0.18181818181818182,
'this': 0.18181818181818182,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.09090909090909091,
 'dog': 0.09090909090909091,
 'i': 0.18181818181818182,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.18181818181818182}
Count and probability matrices
As we have seen so far, the n-gram counts computed above are sufficient for computing the probabilities of the next word.
- It can be more intuitive to present them as count or probability matrices.
- The functions defined in the next cells return count or probability matrices.
- This function is provided for you.
def make_count_matrix(n_plus1_gram_counts, vocabulary):
# add <e> <unk> to the vocabulary
# <s> is omitted since it should not appear as the next word
vocabulary = vocabulary + ["<e>", "<unk>"]
# obtain unique n-grams
n_grams = []
for n_plus1_gram in n_plus1_gram_counts.keys():
n_gram = n_plus1_gram[0:-1]
n_grams.append(n_gram)
n_grams = list(set(n_grams))
# mapping from n-gram to row
row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
# mapping from next word to column
col_index = {word:j for j, word in enumerate(vocabulary)}
nrow = len(n_grams)
ncol = len(vocabulary)
count_matrix = numpy.zeros((nrow, ncol))
for n_plus1_gram, count in n_plus1_gram_counts.items():
n_gram = n_plus1_gram[0:-1]
word = n_plus1_gram[-1]
if word not in vocabulary:
continue
i = row_index[n_gram]
j = col_index[word]
count_matrix[i, j] = count
count_matrix = pandas.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
return count_matrix
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print('bigram counts')
print(TABLE(make_count_matrix(bigram_counts, unique_words)))
|           |   cat |   a |   like |   this |   dog |   i |   is |   <e> |   <unk> |
|-----------+-------+-----+--------+--------+-------+-----+------+-------+---------|
| ('dog',)  |     0 |   0 |      0 |      0 |     0 |   0 |    1 |     0 |       0 |
| ('<s>',)  |     0 |   0 |      0 |      1 |     0 |   1 |    0 |     0 |       0 |
| ('cat',)  |     0 |   0 |      0 |      0 |     0 |   0 |    0 |     2 |       0 |
| ('is',)   |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('a',)    |     2 |   0 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('i',)    |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('like',) |     0 |   2 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('this',) |     0 |   0 |      0 |      0 |     1 |   0 |    0 |     0 |       0 |
Show trigram counts
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_count_matrix(trigram_counts, unique_words)))
trigram counts
|                 |   cat |   a |   like |   this |   dog |   i |   is |   <e> |   <unk> |
|-----------------+-------+-----+--------+--------+-------+-----+------+-------+---------|
| ('<s>', 'i')    |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('i', 'like')   |     0 |   1 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('<s>', 'this') |     0 |   0 |      0 |      0 |     1 |   0 |    0 |     0 |       0 |
| ('like', 'a')   |     2 |   0 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('<s>', '<s>')  |     0 |   0 |      0 |      1 |     0 |   1 |    0 |     0 |       0 |
| ('is', 'like')  |     0 |   1 |      0 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('dog', 'is')   |     0 |   0 |      1 |      0 |     0 |   0 |    0 |     0 |       0 |
| ('this', 'dog') |     0 |   0 |      0 |      0 |     0 |   0 |    1 |     0 |       0 |
| ('a', 'cat')    |     0 |   0 |      0 |      0 |     0 |   0 |    0 |     2 |       0 |
Probability Matrix
The following function calculates the probability of each word given the previous n-gram and stores the results in matrix form. Note that dividing each row of the (count + k) matrix by its row sum reproduces equation (3), since each row sums to \(C(w_{t-1}\dots w_{t-n}) + k|V|\).
def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    """Build a matrix of k-smoothed probabilities from the (n+1)-gram counts"""
    # use the vocabulary argument (not the global unique_words)
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis="columns"), axis="index")
    return prob_matrix
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
print(TABLE(make_probability_matrix(bigram_counts, unique_words, k=1)))
bigram probabilities
|           | cat       | a         | like      | this      | dog       | i         | is        | <e>       | <unk>     |
|-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('dog',)  | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('<s>',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('cat',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |
| ('is',)   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('a',)    | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('i',)    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like',) | 0.0909091 | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('this',) | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_probability_matrix(trigram_counts, unique_words, k=1)))
trigram probabilities
|                 | cat       | a         | like      | this      | dog       | i         | is        | <e>       | <unk>     |
|-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('<s>', 'i')    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('i', 'like')   | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('<s>', 'this') | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like', 'a')   | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('<s>', '<s>')  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('is', 'like')  | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('dog', 'is')   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('this', 'dog') | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('a', 'cat')    | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |
Confirm that you obtain the same results as for the `estimate_probabilities` function that you implemented.
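One way to run that confirmation, sketched with the objects from the cells above (`to_dict` avoids having to index the DataFrame by a tuple directly):
# compare the ('a',) row of the bigram probability matrix with
# the output of estimate_probabilities for the same previous word
probabilities = estimate_probabilities(["a"], unigram_counts, bigram_counts,
                                        unique_words, k=1)
matrix = make_probability_matrix(bigram_counts, unique_words, k=1)
matrix_row = matrix.to_dict(orient="index")[("a",)]
for word, probability in probabilities.items():
    expect(math.isclose(matrix_row[word], probability)).to(be_true)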
Bundle It Up
<<imports>>
<<n-gram>>
<<start-tokens>>
<<end-tokens>>
<<sentences>>
<<n-grams>>
<<counts>>
<<n-gram-probability>>
<<n-grams-model>>
<<n-plus-one>>
<<vocabulary>>
<<vocabulary-size>>
<<probability>>
<<probabilities>>
Imports
# python
from collections import Counter
from itertools import chain
# pypi
import attr
The N-Gram
@attr.s(auto_attribs=True)
class NGrams:
"""The N-Gram Language Model
Args:
data: the training data
n: the size of the n-grams
start_token: string to represent the start of a sentence
end_token: string to represent the end of a sentence
"""
data: list
n: int
start_token: str="<s>"
end_token: str="<e>"
_start_tokens: list=None
_end_tokens: list=None
_sentences: list=None
_n_grams: list=None
_counts: dict=None
- Start Tokens
@property
def start_tokens(self) -> list:
    """List of 'n' start tokens"""
    if self._start_tokens is None:
        self._start_tokens = [self.start_token] * self.n
    return self._start_tokens
- End Tokens
@property
def end_tokens(self) -> list:
    """List of one end token"""
    if self._end_tokens is None:
        self._end_tokens = [self.end_token]
    return self._end_tokens
- Sentences
@property
def sentences(self) -> list:
    """The data augmented with tags and converted to tuples"""
    if self._sentences is None:
        self._sentences = [tuple(self.start_tokens + sentence + self.end_tokens)
                           for sentence in self.data]
    return self._sentences
- N-Grams
@property
def n_grams(self) -> list:
    """The n-grams from the data

    Warning: this flattens the n-grams so there isn't any sentence structure
    """
    if self._n_grams is None:
        self._n_grams = chain.from_iterable([
            [sentence[cut: cut + self.n]
             for cut in range(0, len(sentence) - (self.n - 1))]
            for sentence in self.sentences])
    return self._n_grams
- Count
@property
def counts(self) -> Counter:
    """A count of all n-grams in the data

    Returns:
     A dictionary that maps a tuple of n-words to its frequency
    """
    if self._counts is None:
        self._counts = Counter(self.n_grams)
    return self._counts
N-Gram Probability
@attr.s(auto_attribs=True)
class NGramProbability:
"""Probability model for n-grams
Args:
data: the source for the n-grams
n: the size of the n-grams
k: smoothing factor
augment_vocabulary: hack because the two probability functions use different vocabularies
"""
data: list
n: int
k: float=1.0
augment_vocabulary: bool=True
_n_grams: NGrams=None
_n_plus_one: NGrams=None
_vocabulary: set=None
_vocabulary_size: int=None
_probabilities: dict=None
- N-Grams
@property
def n_grams(self) -> NGrams:
    """The n-grams built from the data"""
    if self._n_grams is None:
        self._n_grams = NGrams(data=self.data, n=self.n)
    return self._n_grams
- N-Plus-One Grams
@property
def n_plus_one(self) -> NGrams:
    """N+1 Grams"""
    if self._n_plus_one is None:
        self._n_plus_one = NGrams(data=self.data, n=self.n + 1)
    return self._n_plus_one
- The Vocabulary
@property
def vocabulary(self) -> set:
    """Unique words in the dictionary"""
    if self._vocabulary is None:
        data = list(chain.from_iterable(self.data)).copy()
        if self.augment_vocabulary:
            data += ["<e>", "<unk>"]
        self._vocabulary = set(data)
    return self._vocabulary
- Vocabulary Size
@property
def vocabulary_size(self) -> int:
    """Number of unique tokens in the data"""
    if self._vocabulary_size is None:
        self._vocabulary_size = len(self.vocabulary)
    return self._vocabulary_size
- Probability
def probability(self, word: str, previous_n_gram: tuple) -> float:
    """Calculates the probability of the word given the previous n-gram"""
    # just in case it's a list
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = self.n_grams.counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + self.k * self.vocabulary_size
    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = self.n_plus_one.counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + self.k
    return numerator/denominator
- Probabilities
def probabilities(self, previous_n_gram: tuple) -> dict:
    """Finds the probability of each word in the vocabulary

    Args:
     previous_n_gram: the preceding tuple to calculate probabilities

    Returns:
     word:<probability word follows previous n-gram> for the vocabulary
    """
    previous_n_gram = tuple(previous_n_gram)
    return {word: self.probability(word=word, previous_n_gram=previous_n_gram)
            for word in self.vocabulary}
Test It Out
from neurotic.nlp.autocomplete import NGrams
# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# **** Unigram ****
expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2,
('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
uni_grams = NGrams(sentences, 1)
actual = uni_grams.counts
expect(actual).to(have_keys(expected))
# **** Bi-Gram ****
expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1,
('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2,
('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1,
('is', 'like'): 1}
bi_grams = NGrams(sentences, 2)
actual = bi_grams.counts
expect(actual).to(have_keys(expected))
from neurotic.nlp.autocomplete import NGramProbability
sentences = [['i', 'like', 'a', 'cat'],
['this', 'dog', 'is', 'like', 'a', 'cat']]
# the examples for the two probability functions don't behave the same
# so for this case don't augment the vocabulary with empty and unknown tokens
model = NGramProbability(sentences, n=1, augment_vocabulary=False)
actual = model.probability("cat", ["a"])
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
# the probabilities test examples assume that you did augment the vocabulary
model = NGramProbability(sentences, n=1)
actual = model.probabilities(["a"])
expected = {'cat': 0.2727272727272727,
'i': 0.09090909090909091,
'this': 0.09090909090909091,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
model = NGramProbability(sentences, n=2)
actual = model.probabilities(["<s>", "<s>"])
expected = {'cat': 0.09090909090909091,
'i': 0.18181818181818182,
'this': 0.18181818181818182,
'a': 0.09090909090909091,
'is': 0.09090909090909091,
'like': 0.09090909090909091,
'dog': 0.09090909090909091,
'<e>': 0.09090909090909091,
'<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
End
Now that we have the N-Gram model, we'll move on to checking its Perplexity.