Auto-Complete: Building the Auto-Complete System

Build the Auto-Complete System

In the previous post we tested the perplexity of our N-Gram Language Model. In this, the final post in the series that we began with this post, we'll put the pieces together into the finished auto-complete system.

Imports

# python
from itertools import chain
import math
import os

# pypi
from dotenv import load_dotenv
from expects import be_true, equal, expect
# this project
from neurotic.nlp.autocomplete import NGrams, Tokenizer, TrainTestSplit

Set Up

The Environment

load_dotenv("posts/nlp/.env", override=True)

The Data

path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()

Middle

Probabilities Again

Once again, the function we're defining here expects the probability function from the earlier posts, so I'm going to paste it in here.

def estimate_probability(word: str,
                         previous_n_gram: tuple, 
                         n_gram_counts: dict,
                         n_plus1_gram_counts: dict,
                         vocabulary_size: int,
                         k: float=1.0) -> float:
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing

    Args:
       word: next word
       previous_n_gram: A sequence of words of length n
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary_size: number of words in the vocabulary
       k: positive constant, smoothing parameter

    Returns:
       A probability
    """
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)

    n_plus1_gram = previous_n_gram + (word,)  
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)       
    return (n_plus1_gram_count + k)/(previous_n_gram_count + k * vocabulary_size)
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """
    Estimate the probabilities of next words using the n-gram counts with k-smoothing

    Args:
       previous_n_gram: A sequence of words of length n
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary: List of words
       k: positive constant, smoothing parameter

    Returns:
       A dictionary mapping from next words to the probability.
    """

    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # add <e> <unk> to the vocabulary
    # <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

Suggest-a-Word

Compute probabilities for all possible next words and suggest the most likely one.

  • This function also takes an optional argument `start_with`, which specifies the first few letters of the next word.

Hints

  • estimate_probabilities returns a dictionary where the key is a word and the value is the word's probability.
  • Use str1.startswith(str2) to determine if a string starts with the letters of another string. For example, 'learning'.startswith('lea') returns True, whereas 'learning'.startswith('ear') returns False. There are two additional parameters in str.startswith(), but you can use the default values for those parameters in this case.
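
Before the graded function, here's a minimal sketch of the selection logic described above. The `probabilities` dictionary is a made-up stand-in for the output of `estimate_probabilities`:

# hypothetical output of estimate_probabilities
probabilities = {"cat": 0.09, "a": 0.27, "dog": 0.09}
start_with = "c"

# keep only the words that match the optional prefix
candidates = {word: probability for word, probability in probabilities.items()
              if start_with is None or word.startswith(start_with)}

# the suggestion is the remaining word with the highest probability
suggestion = max(candidates, key=candidates.get)
print(suggestion, candidates[suggestion])   # cat 0.09
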
# UNQ_C11 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: suggest_a_word
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    """
    Get suggestion for the next word

    Args:
       previous_tokens: The sentence you input where each token is a word. Must have length > n 
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary: List of words
       k: positive constant, smoothing parameter
       start_with: If not None, specifies the first few letters of the next word

    Returns:
       A tuple of 
         - string of the most likely next word
         - corresponding probability
    """

    # length of previous words
    n = len(list(n_gram_counts.keys())[0]) 

    # From the words that the user already typed
    # get the most recent 'n' words as the previous n-gram
    previous_n_gram = previous_tokens[-n:]

    # Estimate the probabilities that each word in the vocabulary
    # is the next word,
    # given the previous n-gram, the dictionary of n-gram counts,
    # the dictionary of n plus 1 gram counts, and the smoothing constant
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)

    # Initialize suggested word to None
    # This will be set to the word with highest probability
    suggestion = None

    # Initialize the highest word probability to 0
    # this will be set to the highest probability 
    # of all words to be suggested
    max_prob = 0

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # For each word and its probability in the probabilities dictionary:
    for word, prob in probabilities.items(): # complete this line

        # If the optional start_with string is set
        if start_with is not None: # complete this line

            # Check if the beginning of word does not match with the letters in 'start_with'
            if not word.startswith(start_with): # complete this line

                # if they don't match, skip this word (move onto the next word)
                continue # complete this line

        # Check if this word's probability
        # is greater than the current maximum probability
        if prob > max_prob: # complete this line

            # If so, save this word as the best suggestion (so far)
            suggestion = word

            # Save the new maximum probability
            max_prob = prob

    ### END CODE HERE

    return suggestion, max_prob

Test It Out

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = NGrams(sentences, 1).counts
bigram_counts = NGrams(sentences, 2).counts

previous_tokens = ["i", "like"]
word, probability = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{word}` with a probability of {probability:.4f}")
expected_word, expected_probability = "a", 0.2727
expect(word).to(equal(expected_word))
expect(math.isclose(probability, expected_probability, abs_tol=1e-4)).to(be_true)
print()

# test your code when setting the starts_with
tmp_starts_with = 'c'
word, probability = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{word}` with a probability of {probability:.4f}")

expected_word, expected_probability = "cat", 0.0909
expect(word).to(equal(expected_word))
expect(math.isclose(probability, expected_probability, abs_tol=1e-4)).to(be_true)
The previous words are 'i like',
        and the suggested word is `a` with a probability of 0.2727

The previous words are 'i like', the suggestion must start with `c`
        and the suggested word is `cat` with a probability of 0.0909

Multiple Suggestions

The function defined below loops over consecutive pairs of n-gram models (unigram with bigram, bigram with trigram, and so on) to get one suggestion from each pair.

def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    """Get one suggestion from each adjacent pair of n-gram models in the list"""
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts-1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i+1]

        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

Test It

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = NGrams(sentences, 1).counts
bigram_counts = NGrams(sentences, 2).counts
trigram_counts = NGrams(sentences, 3).counts
quadgram_counts = NGrams(sentences, 4).counts
qintgram_counts = NGrams(sentences, 5).counts

n_gram_counts_list = [unigram_counts, bigram_counts, trigram_counts, quadgram_counts, qintgram_counts]
previous_tokens = ["i", "like"]
tmp_suggest3 = get_suggestions(previous_tokens, n_gram_counts_list, unique_words, k=1.0)

print(f"The previous words are 'i like', the suggestions are:")
display(tmp_suggest3)
The previous words are 'i like', the suggestions are:
a 0.2727272727272727
a 0.2
like 0.1111111111111111
like 0.1111111111111111

Multiple Word Suggestions

tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
train_data_processed = splitter.training
n_gram_counts_list = [NGrams(train_data_processed, n).counts for n in range(1, 6)]
vocabulary = list(set(chain.from_iterable(train_data_processed)))
previous_tokens = ["i", "am", "to"]
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)
The previous words are ['i', 'am', 'to'], the suggestions are:
be 0.015552924847940564
please 5.4935999560512006e-05
please 5.494354550699157e-05
sucks 2.747403703500192e-05
previous_tokens = ["i", "want", "to", "go"]
tmp_suggest5 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest5)
The previous words are ['i', 'want', 'to', 'go'], the suggestions are:
to 0.006007762241480142
to 0.0019077728115120462
to 0.00030196552102778083
home 0.00016478989288656962
previous_tokens = ["hey", "how", "are"]
tmp_suggest6 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest6)
The previous words are ['hey', 'how', 'are'], the suggestions are:
you 0.010055522861602231
you 0.0014810345300458024
you 5.494656446605676e-05
sucks 2.747403703500192e-05
previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest7 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest7)
The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:
're 0.012929170630459223
? 0.0011416145691764065
? 0.0007132863295931524
<e> 5.494656446605676e-05
previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest8 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with="d")

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest8)
The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:
do 0.004734930381388913
doing 0.000679532481652623
doing 0.0001646045375984198
deserving 2.747328223302838e-05

End

So, now we have our system. Here are all the prior posts in this series.

Auto-Complete: Perplexity

Perplexity

In the previous post we implemented the N-Gram Language Model for the auto-complete system that we began here.

In this section, you will generate the perplexity score to evaluate your model on the test set.

  • You will also use back-off when needed.
  • Perplexity is used as an evaluation metric of your language model.
  • To calculate the perplexity score of the test set on an n-gram model, use:

\[ PP(W) =\sqrt[N]{ \prod_{t=n+1}^N \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4} \]

  • where N is the length of the sentence.
  • n is the number of words in the n-gram (e.g. 2 for a bigram).
  • In math, the numbering starts at one and not zero.

In code, array indexing starts at zero, so the code will use ranges for t according to this formula:

\[ PP(W) =\sqrt[N]{ \prod_{t=n}^{N-1} \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})} } \tag{4.1} \]

The higher the probabilities are, the lower the perplexity will be.

  • The more the n-grams tell us about the sentence, the lower the perplexity score will be.
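
As a rough sense of scale: if every conditional probability in the product were \(\frac{1}{4}\), the product would be \(4^{N-n}\) and the perplexity would be \(4^{(N-n)/N}\), which approaches 4 as the sentence gets longer. If every probability were \(\frac{1}{2}\) instead, the perplexity would approach 2, so higher-probability predictions give a lower score.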

Imports

# python
import math

# pypi
from expects import expect, be_true

import attr

# this project
from neurotic.nlp.autocomplete import NGrams,NGramProbability

Middle

The Probability Function

This was already defined in the previous post, but the function following it assumes its existence so I'm temporarily re-defining it here.

def estimate_probability(word: str,
                         previous_n_gram: tuple, 
                         n_gram_counts: dict,
                         n_plus1_gram_counts: dict,
                         vocabulary_size: int,
                         k: float=1.0) -> float:
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing

    Args:
       word: next word
       previous_n_gram: A sequence of words of length n
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary_size: number of words in the vocabulary
       k: positive constant, smoothing parameter

    Returns:
       A probability
    """
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)

    n_plus1_gram = previous_n_gram + (word,)  
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)       
    return (n_plus1_gram_count + k)/(previous_n_gram_count + k * vocabulary_size)

Calculating the Perplexity

# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: calculate_perplexity
def calculate_perplexity(sentence: list,
                         n_gram_counts: dict,
                         n_plus1_gram_counts: dict,
                         vocabulary_size: int,
                         k: float=1.0):
    """
    Calculate perplexity for a single tokenized sentence

    Args:
       sentence: List of strings
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary_size: number of unique words in the vocabulary
       k: Positive smoothing constant

    Returns:
       Perplexity score
    """
    # length of previous words
    n = len(list(n_gram_counts.keys())[0]) 

    # prepend <s> and append <e>
    sentence = ["<s>"] * n + sentence + ["<e>"]

    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)

    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)

    # The variable p will hold the product
    # that is calculated inside the n-root
    # Update this in the code below
    product_pi = 1.0

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Index t ranges from n to N - 1, inclusive on both ends
    for t in range(n, N): # complete this line

        # get the n-gram preceding the word at position t
        n_gram = sentence[t - n: t]

        # get the word at position t
        word = sentence[t]

        # Estimate the probability of the word given the n-gram
        # using the n-gram counts, n-plus1-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(
            word=word, previous_n_gram=n_gram,
            vocabulary_size=vocabulary_size,
            n_gram_counts=n_gram_counts,
            n_plus1_gram_counts=n_plus1_gram_counts, k=k)

        # Update the product of the probabilities
        # This 'product_pi' is a cumulative product 
        # of the (1/P) factors that are calculated in the loop
        product_pi *= 1/probability

    # Take the Nth root of the product
    perplexity = product_pi**(1/N)

    ### END CODE HERE ### 
    return perplexity

Test It

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = NGrams(sentences, 1).counts
bigram_counts = NGrams(sentences, 2).counts


perplexity_train1 = calculate_perplexity(sentences[0],
                                         unigram_counts, bigram_counts,
                                         len(unique_words), k=1.0)
expected = 2.8040
print(f"Perplexity for first train sample: {perplexity_train1:.4f}")

expect(math.isclose(perplexity_train1, expected, abs_tol=1e-4)).to(be_true)
test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence,
                                       unigram_counts, bigram_counts,
                                       len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")
expected = 3.9654
expect(math.isclose(perplexity_test, expected, abs_tol=1e-4)).to(be_true)
Perplexity for first train sample: 2.8040
Perplexity for test sample: 3.9654

Note: If your sentence is really long, there will be underflow when multiplying many fractions.

  • To handle longer sentences, modify your implementation to take the sum of the log of the probabilities.
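
Here's a sketch of what that modification might look like. It mirrors calculate_perplexity above and reuses the estimate_probability function defined earlier, but accumulates log-probabilities instead of multiplying fractions (this is my own variation, not part of the graded code):

def calculate_log_perplexity(sentence: list,
                             n_gram_counts: dict,
                             n_plus1_gram_counts: dict,
                             vocabulary_size: int,
                             k: float=1.0) -> float:
    """Calculate perplexity using a sum of log-probabilities to avoid underflow"""
    n = len(list(n_gram_counts.keys())[0])
    sentence = tuple(["<s>"] * n + sentence + ["<e>"])
    N = len(sentence)

    # the sum of -log(P) is the log of the product of the (1/P) factors
    log_product = 0.0
    for t in range(n, N):
        probability = estimate_probability(
            word=sentence[t], previous_n_gram=sentence[t - n: t],
            n_gram_counts=n_gram_counts,
            n_plus1_gram_counts=n_plus1_gram_counts,
            vocabulary_size=vocabulary_size, k=k)
        log_product += -math.log(probability)

    # exponentiating the mean negative log-probability takes the N-th root
    return math.exp(log_product / N)

On the toy sentences above this should return the same 2.8040 and 3.9654 values (up to floating-point rounding) as the direct product.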

Using the Class-Based Version

@attr.s(auto_attribs=True)
class Perplexity:
    """Calculate perplexity

    Args:
     data: the tokenized training input
     n: the size of the n-grams
     augment_vocabulary: whether to augment the vocabulary for toy examples
    """
    data: list
    n: int
    augment_vocabulary: bool=False
    _probabilifier: NGramProbability=None

    @property
    def probabilifier(self) -> NGramProbability:
        """Probability Calculator"""
        if self._probabilifier is None:
            self._probabilifier = NGramProbability(
                self.data, self.n,
                augment_vocabulary=self.augment_vocabulary)
        return self._probabilifier

    def perplexity(self, sentence: list) -> float:
        """Calculates the perplexity for the sentence"""
        sentence = tuple(["<s>"] * self.n + sentence + ["<e>"])
        N = len(sentence)

        n_grams = (sentence[position - self.n: position]
                   for position in range(self.n, N))
        words = (sentence[position]
                 for position in range(self.n, N))
        words_n_grams = zip(words, n_grams)
        probabilities = (self.probabilifier.probability(word, n_gram)
                         for word, n_gram in words_n_grams)
        product = math.prod((1/probability for probability in probabilities))
        return product**(1/N)
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]

model = Perplexity(sentences, 1, augment_vocabulary=False)

actual = model.perplexity(sentences[0])

expected = 2.8040
print(f"Perplexity for first train sample: {actual:.4f}")

expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = model.perplexity(test_sentence)

print(f"Perplexity for test sample: {perplexity_test:.4f}")
expected = 3.9654
expect(math.isclose(perplexity_test, expected, abs_tol=1e-4)).to(be_true)

End

In the next part we'll build our completed auto-complete system.

Auto-Complete: the N-Gram Model

Develop an N-Gram Based Language Model

We'll continue on from the previous post in which we finished pre-processing the data to build our Auto-Complete system.

In this section, you will develop the n-grams language model.

  • Assume the probability of the next word depends only on the previous n-gram.
  • The previous n-gram is the series of the previous 'n' words.

The conditional probability for the word at position 't' in the sentence, given that the words preceding it are \(w_{t-1}, w_{t-2} \cdots w_{t-n}\) is:

\[ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1} \]

You can estimate this probability by counting the occurrences of these series of words in the training data.

  • The probability can be estimated as a ratio, where
  • The numerator is the number of times word 't' appears after words t-1 through t-n appear in the training data.
  • The denominator is the number of times words t-1 through t-n appear in the training data.

\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]

  • The function \(C(\cdots)\) denotes the number of occurrences of the given sequence.
  • \(\hat{P}\) means the estimation of P.
  • Notice that the denominator of equation (2) is the number of occurrences of the previous n words, and the numerator is the same sequence followed by the word \(w_t\).

Later, you will modify equation (2) by adding k-smoothing, which avoids errors when any counts are zero.

Equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for the denominator) and (n+1)-grams (for the numerator).
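
As a toy illustration of equation (2), here's the unsmoothed count ratio computed directly with collections.Counter (a throwaway sketch with no start/end tokens and no smoothing, separate from the graded functions below):

from collections import Counter

corpus = [["i", "like", "a", "cat"],
          ["this", "dog", "is", "like", "a", "cat"]]

# numerator counts: (n+1)-grams (here bigrams)
bigram_counts = Counter((sentence[i], sentence[i + 1])
                        for sentence in corpus
                        for i in range(len(sentence) - 1))

# denominator counts: n-grams (here unigrams)
unigram_counts = Counter(word for sentence in corpus for word in sentence)

# P('cat' | 'a') = C(('a', 'cat')) / C(('a',)) = 2/2
print(bigram_counts[("a", "cat")] / unigram_counts["a"])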

Imports

# python
from functools import partial
from pprint import pprint

import math

# pypi
from expects import be_true, expect, have_keys
from tabulate import tabulate

import numpy
import pandas

Set Up

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")

Middle

Count N-Grams

Next, you will implement a function that computes the counts of n-grams for an arbitrary number \(n\).

When computing the counts for n-grams, prepare the sentence beforehand by prepending n starting markers "<s>" to indicate the beginning of the sentence.

  • For example, in the bi-gram model (N=2), a sequence with two start tokens "<s><s>" should predict the first word of a sentence.
  • So, if the sentence is "I like food", modify it to be "<s><s> I like food".
  • Also prepare the sentence for counting by appending an end token "<e>" so that the model can predict when to finish a sentence.

Technical note: In this implementation, you will store the counts as a dictionary.

  • The key of each key-value pair in the dictionary is a tuple of n words (and not a list)
  • The value in the key-value pair is the number of occurrences.
  • The reason for using a tuple as a key instead of a list is because a list in Python is a mutable object (it can be changed after it is first created). A tuple is "immutable", so it cannot be altered after it is first created. This makes a tuple suitable as a data type for the key in a dictionary.
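
A quick demonstration of why the key has to be a tuple rather than a list:

counts = {}
counts[("i", "like")] = 1        # tuples are hashable, so this works
try:
    counts[["i", "like"]] = 1    # lists are mutable and unhashable
except TypeError as error:
    print(error)                 # unhashable type: 'list'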

Hints

  • To prepend or append, you can create lists and concatenate them using the + operator
  • To create a list of a repeated value, you can follow this syntax: ['a'] * 3 to get ['a','a','a']
  • To set the range for index 'i', think of this example: An n-gram where n=2 (bigram), and the sentence is length N=5 (including two start tokens and one end token). So the index positions are [0,1,2,3,4]. The largest index 'i' where a bigram can start is at position i=3, because the word tokens at position 3 and 4 will form the bigram.
  • Remember that the range() function excludes the value that is used for the maximum of the range. range(3) produces (0,1,2) but excludes 3.
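
A quick check of that indexing hint, using the n=2, N=5 example from the bullet above:

processed = ("<s>", "<s>", "i", "like", "<e>")   # N=5 tokens
n = 2
starts = list(range(len(processed) - n + 1))
print(starts)                                    # [0, 1, 2, 3]
print([processed[i: i + n] for i in starts])
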
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: count_n_grams ###
def count_n_grams(data: list, n: int, start_token: str='<s>', end_token: str='<e>') -> dict:
    """
    Count all n-grams in the data

    Args:
       data: List of lists of words
       n: number of words in a sequence

    Returns:
       A dictionary that maps a tuple of n-words to its frequency
    """

    # Initialize dictionary of n-grams and their counts
    n_grams = {}

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Go through each sentence in the data
    for sentence in data: # complete this line

        # prepend start token n times, and  append <e> one time
        sentence = [start_token] * n + sentence + [end_token]

        # convert list to tuple
        # So that the sequence of words can be used as
        # a key in the dictionary
        sentence = tuple(sentence)

        # Use 'i' to indicate the start of the n-gram
        # from index 0
        # to the last index where the end of the n-gram
        # is within the sentence.

        for i in range(0, len(sentence) - (n - 1)): # complete this line

            # Get the n-gram from i to i+n
            n_gram = sentence[i: i + n]

            # check if the n-gram is in the dictionary
            if n_gram in n_grams: # complete this line

                # Increment the count for this n-gram
                n_grams[n_gram] += 1
            else:
                # Initialize this n-gram count to 1
                n_grams[n_gram] = 1

            ### END CODE HERE ###
    return n_grams

Test It

# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]

# **** Unigram ****
print("Uni-gram:")
expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
actual = count_n_grams(sentences, 1)
print(actual)
expect(actual).to(have_keys(expected))

# **** Bi-Gram ****
print("Bi-gram:")
expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
actual = count_n_grams(sentences, 2)
print(actual)
expect(actual).to(have_keys(expected))
Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}

Probability Estimates

Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.

\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} \]

This formula doesn't work when the count of an n-gram is zero.

  • Suppose we encounter an n-gram that did not occur in the training data.
  • Then, the equation (2) cannot be evaluated (it becomes zero divided by zero).

A way to handle zero counts is to add k-smoothing.

  • K-smoothing adds a positive constant k to each numerator and \(k \times |V|\) in the denominator, where \(|V|\) is the number of words in the vocabulary.

\[ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} \]

For n-grams with a zero count (that is, the previous n-gram and the previous n-gram followed by the next word were never seen in training), equation (3) reduces to \(\frac{k}{k|V|} = \frac{1}{|V|}\).

  • This means that any such unseen n-gram has the same probability of \(\frac{1}{|V|}\).
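
To see the smoothing at work with the toy corpus used in the tests below: the vocabulary inside `estimate_probabilities` ends up with \(|V| = 9\) (seven unique words plus <e> and <unk>), and the unigram ('a',) occurs twice. With \(k = 1\), a word that never follows 'a' gets \(\frac{0 + 1}{2 + 1 \cdot 9} = \frac{1}{11} \approx 0.0909\), while 'cat', which follows 'a' twice, gets \(\frac{2 + 1}{2 + 9} = \frac{3}{11} \approx 0.2727\). Those are exactly the values that show up in the test output further down.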

Define a function that computes the probability estimate (3) from n-gram counts and a constant k.

  • The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.
  • The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word.

Hints

  • To define a tuple containing a single value, add a comma after that value. For example: ('apple',) is a tuple containing a single string 'apple'
  • To concatenate two tuples, use the '+' operator
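
A quick check of those two hints:

single = ("cat",)          # the trailing comma makes this a one-element tuple
previous = ("i", "like")
print(previous + single)   # ('i', 'like', 'cat')
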
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED FUNCTION: estimate_probability ###
def estimate_probability(word: str,
                         previous_n_gram: tuple, 
                         n_gram_counts: dict,
                         n_plus1_gram_counts: dict,
                         vocabulary_size: int,
                         k: float=1.0) -> float:
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing

    Args:
       word: next word
       previous_n_gram: A sequence of words of length n
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary_size: number of words in the vocabulary
       k: positive constant, smoothing parameter

    Returns:
       A probability
    """
    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Set the denominator
    # If the previous n-gram exists in the dictionary of n-gram counts,
    # Get its count.  Otherwise set the count to zero
    # Use the dictionary that has counts for n-grams
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)

    # Calculate the denominator using the count of the previous n gram
    # and apply k-smoothing
    denominator = previous_n_gram_count + k * vocabulary_size

    # Define n plus 1 gram as the previous n-gram plus the current word as a tuple
    n_plus1_gram = previous_n_gram + (word,)

    # Set the count to the count in the dictionary,
    # otherwise 0 if not in the dictionary
    # use the dictionary that has counts for the n-gram plus current word
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)

    # Define the numerator use the count of the n-gram plus current word,
    # and apply smoothing
    numerator = n_plus1_gram_count + k

    # Calculate the probability as the numerator divided by denominator
    probability = numerator/denominator

    ### END CODE HERE ###

    return probability

Test Code

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
actual = estimate_probability("cat", "a", unigram_counts, bigram_counts, len(unique_words), k=1)
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333

Estimate probabilities for all words

The function defined below loops over all words in the vocabulary to calculate probabilities for all possible words.

  • This function is provided for you.
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """
    Estimate the probabilities of next words using the n-gram counts with k-smoothing

    Args:
       previous_n_gram: A sequence of words of length n
       n_gram_counts: Dictionary of counts of n-grams
       n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
       vocabulary: List of words
       k: positive constant, smoothing parameter

    Returns:
       A dictionary mapping from next words to the probability.
    """

    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # add <e> <unk> to the vocabulary
    # <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

Test It

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
actual = estimate_probabilities("a", unigram_counts, bigram_counts, unique_words, k=1)
expected =  {'cat': 0.2727272727272727,
             'i': 0.09090909090909091,
             'this': 0.09090909090909091,
             'a': 0.09090909090909091,
             'is': 0.09090909090909091,
             'like': 0.09090909090909091,
             'dog': 0.09090909090909091,
             '<e>': 0.09090909090909091,
             '<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.2727272727272727,
 'dog': 0.09090909090909091,
 'i': 0.09090909090909091,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.09090909090909091}
trigram_counts = count_n_grams(sentences, 3)
actual = estimate_probabilities(["<s>", "<s>"], bigram_counts, trigram_counts, unique_words, k=1)

expected =  {'cat': 0.09090909090909091,
             'i': 0.18181818181818182,
             'this': 0.18181818181818182,
             'a': 0.09090909090909091,
             'is': 0.09090909090909091,
             'like': 0.09090909090909091,
             'dog': 0.09090909090909091,
             '<e>': 0.09090909090909091,
             '<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)
{'<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091,
 'a': 0.09090909090909091,
 'cat': 0.09090909090909091,
 'dog': 0.09090909090909091,
 'i': 0.18181818181818182,
 'is': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.18181818181818182}

Count and probability matrices

As we have seen so far, the n-gram counts computed above are sufficient for computing the probabilities of the next word.

  • It can be more intuitive to present them as count or probability matrices.
  • The functions defined in the next cells return count or probability matrices.
  • These functions are provided for you.
def make_count_matrix(n_plus1_gram_counts, vocabulary):
    # add <e> <unk> to the vocabulary
    # <s> is omitted since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]

    # obtain unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))

    # mapping from n-gram to row
    row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
    # mapping from next word to column
    col_index = {word:j for j, word in enumerate(vocabulary)}

    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = numpy.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count

    count_matrix = pandas.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)

print('bigram counts')
print(TABLE(make_count_matrix(bigram_counts, unique_words)))
|           | cat | a | like | this | dog | i | is | <e> | <unk> |
|-----------+-----+---+------+------+-----+---+----+-----+-------|
| ('dog',)  |   0 | 0 |    0 |    0 |   0 | 0 |  1 |   0 |     0 |
| ('<s>',)  |   0 | 0 |    0 |    1 |   0 | 1 |  0 |   0 |     0 |
| ('cat',)  |   0 | 0 |    0 |    0 |   0 | 0 |  0 |   2 |     0 |
| ('is',)   |   0 | 0 |    1 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('a',)    |   2 | 0 |    0 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('i',)    |   0 | 0 |    1 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('like',) |   0 | 2 |    0 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('this',) |   0 | 0 |    0 |    0 |   1 | 0 |  0 |   0 |     0 |

Show trigram counts

print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_count_matrix(trigram_counts, unique_words)))

trigram counts

|                 | cat | a | like | this | dog | i | is | <e> | <unk> |
|-----------------+-----+---+------+------+-----+---+----+-----+-------|
| ('<s>', 'i')    |   0 | 0 |    1 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('i', 'like')   |   0 | 1 |    0 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('<s>', 'this') |   0 | 0 |    0 |    0 |   1 | 0 |  0 |   0 |     0 |
| ('like', 'a')   |   2 | 0 |    0 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('<s>', '<s>')  |   0 | 0 |    0 |    1 |   0 | 1 |  0 |   0 |     0 |
| ('is', 'like')  |   0 | 1 |    0 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('dog', 'is')   |   0 | 0 |    1 |    0 |   0 | 0 |  0 |   0 |     0 |
| ('this', 'dog') |   0 | 0 |    0 |    0 |   0 | 0 |  1 |   0 |     0 |
| ('a', 'cat')    |   0 | 0 |    0 |    0 |   0 | 0 |  0 |   2 |     0 |

Probability Matrix

The following function calculates the probabilities of each word given the previous n-gram, and stores this in matrix form.

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis="columns"), axis="rows")
    return prob_matrix
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
print(TABLE(make_probability_matrix(bigram_counts, unique_words, k=1)))

bigram probabilities

|           |       cat |         a |      like |      this |       dog |         i |        is |       <e> |     <unk> |
|-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('dog',)  | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('<s>',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('cat',)  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |
| ('is',)   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('a',)    | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('i',)    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like',) | 0.0909091 | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('this',) | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
print(TABLE(make_probability_matrix(trigram_counts, unique_words, k=1)))

trigram probabilities

|                 |       cat |         a |      like |      this |       dog |         i |        is |       <e> |     <unk> |
|-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------|
| ('<s>', 'i')    | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('i', 'like')   | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('<s>', 'this') | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('like', 'a')   | 0.272727  | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 |
| ('<s>', '<s>')  | 0.0909091 | 0.0909091 | 0.0909091 | 0.181818  | 0.0909091 | 0.181818  | 0.0909091 | 0.0909091 | 0.0909091 |
| ('is', 'like')  | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('dog', 'is')   | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       |
| ('this', 'dog') | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.1       | 0.2       | 0.1       | 0.1       |
| ('a', 'cat')    | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.0909091 | 0.272727  | 0.0909091 |

Confirm that you obtain the same results as for the `estimate_probabilities` function that you implemented.

Bundle It Up

<<imports>>

<<n-gram>>

    <<start-tokens>>

    <<end-tokens>>

    <<sentences>>

    <<n-grams>>

    <<counts>>


<<n-gram-probability>>

    <<n-grams-model>>

    <<n-plus-one>>

    <<vocabulary>>

    <<vocabulary-size>>

    <<probability>>

    <<probabilities>>

Imports

# python
from collections import Counter
from itertools import chain

# pypi
import attr

The N-Gram

@attr.s(auto_attribs=True)
class NGrams:
    """The N-Gram Language Model

    Args:
     data: the training data
     n: the size of the n-grams
     start_token: string to represent the start of a sentence
     end_token: string to represent the end of a sentence
    """
    data: list
    n: int
    start_token: str="<s>"
    end_token: str="<e>"
    _start_tokens: list=None
    _end_tokens: list=None
    _sentences: list=None
    _n_grams: list=None
    _counts: dict=None
  • Start Tokens
    @property
    def start_tokens(self) -> list:
        """List of 'n' start tokens"""
        if self._start_tokens is None:
            self._start_tokens = [self.start_token] * self.n
        return self._start_tokens
    
  • End Tokens
    @property
    def end_tokens(self) -> list:
        """List of 1 end-tokens"""
        if self._end_tokens is None:
            self._end_tokens = [self.end_token]
        return self._end_tokens
    
  • Sentences
    @property
    def sentences(self) -> list:
        """The data augmented with tags and converted to tuples"""
        if self._sentences is None:
            self._sentences = [tuple(self.start_tokens + sentence + self.end_tokens)
                               for sentence in self.data]
        return self._sentences
    
  • N-Grams
    @property
    def n_grams(self) -> list:
        """The n-grams from the data
    
        Warning:
         this flattens the n-grams so there isn't any sentence structure
        """
        if self._n_grams is None:
            self._n_grams = chain.from_iterable([
                [sentence[cut: cut + self.n] for cut in range(0, len(sentence) - (self.n - 1))]
                for sentence in self.sentences
            ])
        return self._n_grams
    
  • Count
    @property
    def counts(self) -> Counter:
        """A count of all n-grams in the data
    
        Returns:
           A dictionary that maps a tuple of n-words to its frequency
        """
        if self._counts is None:        
            self._counts = Counter(self.n_grams)
        return self._counts
    

N-Gram Probability

@attr.s(auto_attribs=True)
class NGramProbability:
    """Probability model for n-grams

    Args:
     data: the source for the n-grams
     n: the size of the n-grams
     k: smoothing factor
     augment_vocabulary: hack because the two probability functions use different vocabularies
    """
    data: list
    n: int
    k: float=1.0
    augment_vocabulary: bool=True
    _n_grams: NGrams=None
    _n_plus_one: NGrams=None
    _vocabulary: set=None
    _vocabulary_size: int=None
    _probabilities: dict=None
  • N-Grams
    @property
    def n_grams(self) -> NGrams:
        if self._n_grams is None:
            self._n_grams = NGrams(data=self.data, n=self.n)
        return self._n_grams
    
  • N-Plus-One Grams
    @property
    def n_plus_one(self) -> NGrams:
        """N+1 Grams"""
        if self._n_plus_one is None:
            self._n_plus_one = NGrams(data=self.data, n=self.n + 1)
        return self._n_plus_one
    
  • The Vocabulary
    @property
    def vocabulary(self) -> set:
        """Unique words in the dictionary"""
        if self._vocabulary is None:
            data = list(chain.from_iterable(self.data)).copy()
            if self.augment_vocabulary:
                data += ["<e>", "<unk>"]
            self._vocabulary = set(data)
        return self._vocabulary
    
  • Vocabulary Size
    @property
    def vocabulary_size(self) -> int:
        """Number of unique tokens in the data"""
        if self._vocabulary_size is None:
            self._vocabulary_size = len(self.vocabulary)
        return self._vocabulary_size
    
  • Probability
    def probability(self, word: str, previous_n_gram: tuple) -> float:
        """Calculates the probability of the word given the previous n-gram"""
        # just in case it's a list
        previous_n_gram = tuple(previous_n_gram)
        previous_n_gram_count = self.n_grams.counts.get(previous_n_gram, 0)
        denominator = previous_n_gram_count + self.k * self.vocabulary_size
    
        n_plus1_gram = previous_n_gram + (word,)
        n_plus1_gram_count = self.n_plus_one.counts.get(n_plus1_gram, 0)
        numerator = n_plus1_gram_count + self.k
        return numerator/denominator
    
  • Probabilities
    def probabilities(self, previous_n_gram: tuple) -> dict:
        """Finds the probability of each word in the vocabulary
    
        Args:
         previous_n_gram: the preceding tuple to calculate probabilities
    
        Returns:
         word:<probability word follows previous n-gram> for the vocabulary
        """
        previous_n_gram = tuple(previous_n_gram)
        return {word: self.probability(word=word, previous_n_gram=previous_n_gram)
                for word in self.vocabulary}
    

Test It Out

from neurotic.nlp.autocomplete import NGrams

# **** Set Up ****
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]

# **** Unigram ****

expected = {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2,
            ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
uni_grams = NGrams(sentences, 1)
actual = uni_grams.counts
expect(actual).to(have_keys(expected))

# **** Bi-Gram ****

expected = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1,
            ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2,
            ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1,
            ('is', 'like'): 1}
bi_grams = NGrams(sentences, 2)
actual = bi_grams.counts
expect(actual).to(have_keys(expected))
from neurotic.nlp.autocomplete import NGramProbability

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]

# the examples for the two probability functions don't behave the same
# so for this case don't augment the vocabulary with empty and unknown tokens
model = NGramProbability(sentences, n=1, augment_vocabulary=False)
actual = model.probability("cat", "a")
expected = 0.3333
print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {actual:.4f}")
expect(math.isclose(actual, expected, abs_tol=1e-4)).to(be_true)
# the probabilities test examples assume that you did augment the vocabulary
model = NGramProbability(sentences, n=1)
actual = model.probabilities("a")
expected =  {'cat': 0.2727272727272727,
             'i': 0.09090909090909091,
             'this': 0.09090909090909091,
             'a': 0.09090909090909091,
             'is': 0.09090909090909091,
             'like': 0.09090909090909091,
             'dog': 0.09090909090909091,
             '<e>': 0.09090909090909091,
             '<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
model = NGramProbability(sentences, n=2)
actual = model.probabilities(["<s>", "<s>"])

expected =  {'cat': 0.09090909090909091,
             'i': 0.18181818181818182,
             'this': 0.18181818181818182,
             'a': 0.09090909090909091,
             'is': 0.09090909090909091,
             'like': 0.09090909090909091,
             'dog': 0.09090909090909091,
             '<e>': 0.09090909090909091,
             '<unk>': 0.09090909090909091}
expect(actual).to(have_keys(**expected))
pprint(actual)

End

Now that we have the N-Gram model we'll move on to checking its Perplexity.

Auto-Complete: Pre-Process the Data II

Beginning

This is the third post in a series that begins with Auto-Complete. In the previous entry we did some basic preprocessing to transform the raw tweet data into a form closer to what we wanted. In this post we'll add some counts to the data so that we can use it to build our model.

Imports

# python
import os

# pypi
from dotenv import load_dotenv
from expects import (
    contain_exactly,
    contain_only,
    equal,
    expect,
    have_keys)

# this series
from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit

Set Up

The Environment

load_dotenv("posts/nlp/.env", override=True)

The Data

path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
train_data, test_data = splitter.training, splitter.testing

Middle

Count Words

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: count_words ###
def count_words(tokenized_sentences: list) -> dict:
    """
    Count the number of times each word appears in the tokenized sentences

    Args:
       tokenized_sentences: List of lists of strings

    Returns:
       dict that maps word (str) to the frequency (int)
    """

    word_counts = {}
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Loop through each sentence
    for sentence in tokenized_sentences: # complete this line

        # Go through each token in the sentence
        for token in sentence: # complete this line

            # If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts: # complete this line
                word_counts[token] = 1

            # If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1

    ### END CODE HERE ###

    return word_counts

Test the Code

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
actual = count_words(tokenized_sentences)

expected =  {'sky': 1,
             'is': 1,
             'blue': 1,
             '.': 3,
             'leaves': 1,
             'are': 2,
             'green': 1,
             'roses': 1,
             'red': 1}

expect(actual).to(have_keys(**expected))

Out-Of-Vocabulary Words

If your model is performing autocomplete, but encounters a word that it never saw during training, it won't have an input word to help it determine the next word to suggest. The model will not be able to predict the next word because there are no counts for the current word.

  • This 'new' word is called an 'unknown word', or out-of-vocabulary (OOV) word.
  • The percentage of unknown words in the test set is called the OOV rate.

To handle unknown words during prediction, use a special token, '<unk>', to represent all unknown words.

  • Modify the training data so that it has some 'unknown' words to train on.
  • Words to convert into "unknown" words are those that do not occur very frequently in the training set.
  • Create a list of the most frequent words in the training set, called the closed vocabulary.
  • Convert all the other words that are not part of the closed vocabulary to the token '<unk>'.

Create a function that takes in a text document and a threshold `count_threshold`.

  • Any word whose count is greater than or equal to the threshold `count_threshold` is kept in the closed vocabulary.
  • The function returns the closed vocabulary as a list of words.
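
The CountProcessor class bundled up at the end of this post does the same thing with a `Counter`; here's a minimal standalone sketch of that approach using the toy sentences from the tests:

from collections import Counter
from itertools import chain

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
count_threshold = 2

# count every token, then keep only those at or above the threshold
counts = Counter(chain.from_iterable(tokenized_sentences))
closed_vocabulary = [word for word, count in counts.items()
                     if count >= count_threshold]
print(closed_vocabulary)   # ['.', 'are']
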
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_words_with_nplus_frequency ###
def get_words_with_nplus_frequency(tokenized_sentences: list, count_threshold: int) -> list:
    """
    Find the words that appear N times or more

    Args:
       tokenized_sentences: List of lists of sentences
       count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.

    Returns:
       List of words that appear N times or more
    """
    # Initialize an empty list to contain the words that
    # appear at least 'count_threshold' times.
    closed_vocab = []

    # Get the word counts of the tokenized sentences
    # Use the function that you defined earlier to count the words
    word_counts = count_words(tokenized_sentences)

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # for each word and its count
    for word, cnt in word_counts.items(): # complete this line

        # check that the word's count
        # is at least as great as the minimum count
        if cnt >= count_threshold:

            # append the word to the list
            closed_vocab.append(word)
    ### END CODE HERE ###

    return closed_vocab

Test The Code

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
actual = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(actual)
expected = ['.', 'are']
expect(actual).to(contain_exactly(*expected))
Closed vocabulary:
['.', 'are']

Parts Unknown

The words that appear `count_threshold` times or more are in the closed vocabulary.

  • All other words are regarded as `unknown`.
  • Replace words not in the closed vocabulary with the token `<unk>`.
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: replace_oov_words_by_unk ###
def replace_oov_words_by_unk(tokenized_sentences: list,
                             vocabulary: list,
                             unknown_token: str="<unk>") -> list:
    """
    Replace words not in the given vocabulary with '<unk>' token.

    Args:
       tokenized_sentences: List of lists of strings
       vocabulary: List of strings that we will use
       unknown_token: A string representing unknown (out-of-vocabulary) words

    Returns:
       List of lists of strings, with words not in the vocabulary replaced
    """

    # Place vocabulary into a set for faster search
    vocabulary = set(vocabulary)

    # Initialize a list that will hold the sentences
    # after less frequent words are replaced by the unknown token
    replaced_tokenized_sentences = []

    # Go through each sentence
    for sentence in tokenized_sentences:

        # Initialize the list that will contain
        # a single sentence with "unknown_token" replacements
        replaced_sentence = []
        ### START CODE HERE (Replace instances of 'None' with your code) ###

        # for each token in the sentence
        for token in sentence: # complete this line

            # Check if the token is in the closed vocabulary
            if token in vocabulary: # complete this line
                # If so, append the word to the replaced_sentence
                replaced_sentence.append(token)
            else:
                # otherwise, append the unknown token instead
                replaced_sentence.append(unknown_token)
        ### END CODE HERE ###

        # Append the list of tokens to the list of lists
        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences

Test It

tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)

print(f"Original sentence:")
print(tokenized_sentences)
expecteds = [['dogs', 'run'], ['cats', 'sleep']]
for actual, expected in zip(tokenized_sentences, expecteds):
    expect(actual).to(contain_exactly(*expected))

print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)
expecteds = [['dogs', '<unk>'], ['<unk>', 'sleep']]
for actual,expected in zip(tmp_replaced_tokenized_sentences, expecteds):
    expect(actual).to(contain_exactly(*expected))
Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]

Combine Them

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: preprocess_data ###
def preprocess_data(train_data: list, test_data: list, count_threshold: int) -> tuple:
    """
    Preprocess data, i.e.,
       - Find tokens that appear at least N times in the training data.
       - Replace tokens that appear less than N times by "<unk>" both for training and test data.        
    Args:
       train_data, test_data: List of lists of strings.
       count_threshold: Words whose count is less than this are 
                     treated as unknown.

    Returns:
       Tuple of
       - training data with low frequent words replaced by "<unk>"
       - test data with low frequent words replaced by "<unk>"
       - vocabulary of words that appear n times or more in the training data
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)

    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)

    # For the test data, replace less common words with "<unk>"
    test_data_replaced =  replace_oov_words_by_unk(test_data, vocabulary)

    ### END CODE HERE ###
    return train_data_replaced, test_data_replaced, vocabulary
tmp_train = [['sky', 'is', 'blue', '.'],
             ['leaves', 'are', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]

tmp_train_repl, tmp_test_repl, tmp_vocab = preprocess_data(tmp_train, 
                                                           tmp_test, 
                                                           count_threshold = 1)

print("tmp_train_repl")
print(tmp_train_repl)
expecteds = [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]
for actual, expected in zip(tmp_train_repl, expecteds):
    expect(actual).to(contain_exactly(*expected))
print()
print("tmp_test_repl")
print(tmp_test_repl)

expecteds = [['<unk>', 'are', '<unk>', '.']]

for actual, expected in zip(tmp_test_repl, expecteds):
    expect(actual).to(contain_exactly(*expected))
print()
print("tmp_vocab")
print(tmp_vocab)
expected = ['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']
expect(tmp_vocab).to(contain_exactly(*expected))
tmp_train_repl
[['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]

tmp_test_repl
[['<unk>', 'are', '<unk>', '.']]

tmp_vocab
['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']

Preprocess the Real Data

minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, 
                                                                        test_data, 
                                                                        minimum_freq)
print("last preprocessed testing sample:")
actual = test_data_processed[-1]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()

print("preprocessed training sample:")
actual = train_data_processed[9592]
expected = ['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()

print("First 10 vocabulary:")
actual = vocabulary[0:10]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the']
print(actual)
#expect(actual).to(contain_exactly(*expected))
print()
actual = len(vocabulary)
print(f"Size of vocabulary: {actual:,}")
expected = 14821
#expect(actual).to(equal(expected))
last preprocessed testing sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']

preprocessed training sample:
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

First 10 vocabulary:
['d', '&', 's', 'is', 'covering', 'the', 'event', 'with', 'thomas', ',']

Size of vocabulary: 14,679

Note: My shuffling comes out differently from theirs even though I'm setting the seed, so the vocabulary's order and size don't match their expected values (which is why the two expects above are commented out).

Put It All Together

The Imports

# python
from collections import Counter
from itertools import chain

# from pypi
import attr

The Processor

@attr.s(auto_attribs=True)
class CountProcessor:
    """Processes the data to have unknowns

    Args:
     training: the tokenized training data (list of lists)
     testing: the tokenized testing data
     count_threshold: minimum number of times token needs to appear
     unknown_token: string to use for words below threshold
    """
    training: list
    testing: list
    count_threshold: int=2
    unknown_token: str="<unk>"
    _counts: dict=None
    _vocabulary: set=None
    _train_unknown: list=None
    _test_unknown: list=None
  • Counts
    @property
    def counts(self) -> Counter:
        """Count of each word in the training data"""
        if self._counts is None:
            self._counts = Counter(chain.from_iterable(self.training))
        return self._counts
    
  • The Vocabulary
    @property
    def vocabulary(self) -> set:
        """The tokens in training that appear at least ``count_threshold`` times"""
        if self._vocabulary is None:
            self._vocabulary = set((token for token, count in self.counts.items()
                                if count >= self.count_threshold))
        return self._vocabulary
    
  • Train Unknown
    @property
    def train_unknown(self) -> list:
        """Training data with words below threshold replaced"""
        if self._train_unknown is None:
            self._train_unknown = self.parts_unknown(self.training)
        return self._train_unknown
    
  • Test Unknown
    @property
    def test_unknown(self) -> list:
        """Testing data with words below threshold replaced"""
        if self._test_unknown is None:
            self._test_unknown = self.parts_unknown(self.testing)
        return self._test_unknown
    
  • Parts Unknown
    def parts_unknown(self, source: list) -> list:
        """Replaces tokens in source that aren't in vocabulary
    
        Args:
         source: nested list of lists with tokens to check
    
        Returns: source with unknown words replaced by unknown_token
        """
        return [
                [token if token in self.vocabulary else self.unknown_token
                 for token in tokens]
            for tokens in source
        ]    
    

Test It Out

from neurotic.nlp.autocomplete import CountProcessor

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]

testing = [[]]
processor = CountProcessor(tokenized_sentences, testing)
actual = processor.counts

expected =  {'sky': 1,
             'is': 1,
             'blue': 1,
             '.': 3,
             'leaves': 1,
             'are': 2,
             'green': 1,
             'roses': 1,
             'red': 1}

# note to future self: if you pass key=value to have_keys it checks both
expect(actual).to(have_keys(**expected))

actual = processor.vocabulary
expected = ['.', 'are']
expect(actual).to(contain_only(*expected))
tokenized_sentences = [["dogs", "run", "sleep"], ["cats", "sleep", "dogs"]]

testing = [["cows", "dogs"], ["pigs", "sleep"]]
processor = CountProcessor(training=tokenized_sentences, testing=testing)

actuals = processor.train_unknown

UNKNOWN = "<unk>"
expecteds = [["dogs", UNKNOWN, "sleep"], [UNKNOWN, "sleep", "dogs"]]
for actual,expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))

actuals = processor.test_unknown
expecteds = [[UNKNOWN, "dogs"], [UNKNOWN, "sleep"]]
for actual,expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))

End

Now that we have the data in the basic form we want, we'll move on to building the N-Gram Language Model.

Auto-Complete: Pre-Process the Data I

Beginning

This is the second part of a series implementing an n-gram-based auto-complete system for tweets. The starting post is Auto-Complete.

Imports

# python
import os
import random

# pypi
from dotenv import load_dotenv

from expects import (
    contain_exactly,
    equal,
    expect
)

import nltk

Set Up

The Environment

load_dotenv("posts/nlp/.env", override=True)

Middle

Load the Data

We're going to use Twitter data. There's no listing of the source, so I'm assuming it's the NLTK twitter data, but maybe not.

path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()
print("Data type:", type(data))
print(f"Number of letters: {len(data):,}")
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")

print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")    
Data type: <class 'str'>
Number of letters: 3,335,477
First 300 letters of the data
-------
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A 
-------
Last 300 letters of the data
-------
ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n
-------

So the data looks like it's just the text of tweets, with no metadata, and the tweets are separated by newlines.

Split To Sentences

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: split_to_sentences ###
def split_to_sentences(data: str) -> list:
    """
    Split data by linebreak "\n"

    Args:
       data: str

    Returns:
       A list of sentences
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    sentences = data.split("\n")
    ### END CODE HERE ###

    # Additional cleaning (This part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

Test The Code

x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""
print(x)

expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
actual = split_to_sentences(x)
expect(actual).to(contain_exactly(*expected))

I have a pen.
I have an apple. 
Ah
Apple pen.

Tokenize Sentences

The next step is to tokenize sentences (split a sentence into a list of words).

  • Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
  • Append each tokenized list of words into a list of tokenized sentences.

Hints:

  • Use str.lower to convert strings to lowercase.
  • Please use nltk.word_tokenize to split sentences into tokens.
  • If you used str.split instead of nltk.word_tokenize, there are additional edge cases to handle, such as the punctuation (comma, period) that follows a word.
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: tokenize_sentences ###
def tokenize_sentences(sentences: list) -> list:
    """
    Tokenize sentences into tokens (words)

    Args:
       sentences: List of strings

    Returns:
       List of lists of tokens
    """

    # Initialize the list of lists of tokenized sentences
    tokenized_sentences = []
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Go through each sentence
    for sentence in sentences:

        # Convert to lowercase letters
        sentence = sentence.lower()

        # Convert into a list of words
        tokenized = nltk.word_tokenize(sentence)

        # append the list of words to the list of lists
        tokenized_sentences.append(tokenized)

    ### END CODE HERE ###

    return tokenized_sentences

Test the Code

sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]

expecteds = [['sky', 'is', 'blue', '.'],
            ['leaves', 'are', 'green', '.'],
            ['roses', 'are', 'red', '.']]

actuals = tokenize_sentences(sentences)
for expected, actual in zip(expecteds, actuals):
    expect(actual).to(contain_exactly(*expected))

Combine Split Sentences and Tokenize

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_tokenized_data ###
def get_tokenized_data(data: str) -> list:
    """
    Make a list of tokenized sentences

    Args:
       data: String

    Returns:
       List of lists of tokens
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Get the sentences by splitting up the data
    sentences = split_to_sentences(data)

    # Get the list of lists of tokens by tokenizing the sentences
    tokenized_sentences = tokenize_sentences(sentences)

    ### END CODE HERE ###

    return tokenized_sentences

Test It

x = "Sky is blue.\nLeaves are green\nRoses are red."
actuals = get_tokenized_data(x)
expecteds =  [['sky', 'is', 'blue', '.'],
              ['leaves', 'are', 'green'],
              ['roses', 'are', 'red', '.']]
for actual, expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))

Split Train and Test Sets

tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]
actual_data, expected_data = len(tokenized_data), 47961
actual_training, expected_training = len(train_data), 38368
actual_testing, expected_testing = len(test_data), 9593

print((f"{actual_data:,} are split into {actual_training:,} training entries"
       f" and {actual_testing:,} test set entries."))

for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected)), (label, actual, expected)
47,961 are split into 38,368 training entries and 9,593 test set entries.
print("First training sample:")
actual = train_data[0]
print(actual)
expected = ["i", "personally", "would", "like", "as", "our", "official", "glove",
            "of", "the", "team", "local", "company", "and", "quality",
            "production"]
expect(actual).to(contain_exactly(*expected))
First training sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print("First test sample")
actual = test_data[0]
print(actual)
expected = ["that", "picture", "i", "just", "seen", "whoa", "dere", "!", "!",
            ">", ">", ">", ">", ">", ">", ">"]
expect(actual).to(contain_exactly(*expected))
First test sample
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

Object-Oriented

<<imports>>


<<the-tokenizer>>

    <<sentences>>

    <<tokenized>>


<<train-test-split>>

    <<shuffled-data>>

    <<training-data>>

    <<testing-data>>

    <<split>>

Imports

# python
import random

# pypi
import attr
import nltk

The Tokenizer

@attr.s(auto_attribs=True)
class Tokenizer:
    """Tokenizes string sentences

    Args:
     source: string data to tokenize
     end_of_sentence: what to split sentences on

    """
    source: str
    end_of_sentence: str="\n"
    _sentences: list=None
    _tokenized: list=None
    _training_data: list=None
  • Sentences
    @property
    def sentences(self) -> list:
        """The data split into sentences"""
        if self._sentences is None:
            self._sentences = self.source.split(self.end_of_sentence)
            self._sentences = (sentence.strip() for sentence in self._sentences)
            self._sentences = [sentence for sentence in self._sentences if sentence]
        return self._sentences
    
  • Tokenized
    @property
    def tokenized(self) -> list:
        """List of tokenized sentence"""
        if self._tokenized is None:
            self._tokenized = [nltk.word_tokenize(sentence.lower())
                               for sentence in self.sentences]
        return self._tokenized
    

Train-Test-Split

@attr.s(auto_attribs=True)
class TrainTestSplit:
    """splits up the training and testing sets

    Args:
     data: list of data to split
     training_fraction: how much to put in the training set
     seed: something to seed the random call
    """
    data: list
    training_fraction: float=0.8
    seed: int=87
    _shuffled: list=None
    _training: list=None
    _testing: list=None
    _split: int=None
  • Shuffled Data
    @property
    def shuffled(self) -> list:
        """The data shuffled"""
        if self._shuffled is None:
            random.seed(self.seed)
            self._shuffled = random.sample(self.data, k=len(self.data))
        return self._shuffled
    
  • Split
    @property
    def split(self) -> int:
        """The slice value for training and testing"""
        if self._split is None:
            self._split = int(len(self.data) * self.training_fraction)
        return self._split
    
  • Training Data
    @property
    def training(self) -> list:
        """The Training Portion of the Set"""
        if self._training is None:
            self._training = self.shuffled[0:self.split]
        return self._training
    
  • Testing Data
    @property
    def testing(self) -> list:
        """The testing data"""
        if self._testing is None:
            self._testing = self.shuffled[self.split:]
        return self._testing
    

Test It Out

  • Sentences
    from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit
    
    x = """
    I have a pen.\nI have an apple. \nAh\nApple pen.\n
    """
    expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
    tokenizer = Tokenizer(x)
    
    actual = tokenizer.sentences
    expect(actual).to(contain_exactly(*expected))
    
  • Tokens
    source = "\n".join(["Sky is blue.", "Leaves are green.", "Roses are red."])
    
    expecteds = [['sky', 'is', 'blue', '.'],
                ['leaves', 'are', 'green', '.'],
                ['roses', 'are', 'red', '.']]
    
    tokenizer = Tokenizer(source)
    actuals = tokenizer.tokenized
    for expected, actual in zip(expecteds, actuals):
        expect(actual).to(contain_exactly(*expected))
    

Training And Test Sets

random.seed(87)
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
actual_data, expected_data = len(tokenizer.tokenized), 47961
actual_training, expected_training = len(splitter.training), 38368
actual_testing, expected_testing = len(splitter.testing), 9593

print((f"{actual_data:,} are split into {actual_training:,} training entries"
       f" and {actual_testing:,} test set entries."))

for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected)), (label, actual, expected)

End

The next post in this series is Pre-Processing II in which we'll add counts to the tweets.

N-Grams: Out-of-Vocabulary Words

Beginning

# python
from collections import Counter

Middle

We're going to look at a method of deciding whether an unknown word belongs to our vocabulary. It requires that we know the target size of the vocabulary in advance and that we have the words and their counts from the training set.

Build the vocabulary

First we'll define the vocabulary target size.

vocabulary_target_size = 3

Now build a counter. With a real corpus we could let the Counter object build the counts directly from the token list, but since we don't have one here we'll create the counts with a dict.

counts = Counter({"happy": 5,
          "because": 3,
          "i":2,
          "am":2,
          "learning": 3,
          ".": 1})

Now trim it down to our target size.

vocabulary = counts.most_common(vocabulary_target_size)

Now get the words only.

vocabulary = set([word for word, count in vocabulary])
print(f"The {vocabulary_target_size} most frequent words: {vocabulary}")
The 3 most frequent words: {'because', 'learning', 'happy'}

Replace Unknown Words

original = "am i learning".split()

output = []

for word in original:
    word = word if word in vocabulary else "<UNK>"
    output.append(word)

print(f"Original: {original}")
print(f"Processed: {output}")
Original: ['am', 'i', 'learning']
Processed: ['<UNK>', '<UNK>', 'learning']

A Specific Frequency

There might also be cases where we need to filter by a specific frequency instead of just the largest frequencies. Here's one way to do it.

match = 3

word_counts = {"happy": 5,
               "because": 3,
               "i": 2,
               "am": 2,
               "learning": 3,
               ".": 1}
matches = (word for word, count in word_counts.items() if count==match)
for word in matches:
    print(word)
because
learning

Too Many Unknowns

We're going to use perplexity to assess the performance of our model. If you have too many unknowns your perplexity will be low even though your model isn't doing well. Here's an example of this effect.

Rather than going through the trouble of creating the corpus, let's just pretend we calculated the probabilities (the bigram-probabilities for the training set were calculated in the previous post).

Here's the case where everything is known.

training_set = ['i', 'am', 'happy', 'because','i', 'am', 'learning', '.']

# pre-calculated probabilities
bigram_probabilities = {('i', 'am'): 1.0,
                        ('am', 'happy'): 0.5,
                        ('happy', 'because'): 1.0,
                        ('because', 'i'): 1.0,
                        ('am', 'learning'): 0.5,
                        ('learning', '.'): 1.0}
test_set = ['i', 'am', 'learning']

And here's the case where the training set has a lot of unknowns (Out-of-Vocabulary words).

training_set_unk = ['i', 'am', '<UNK>', '<UNK>','i', 'am', '<UNK>', '<UNK>']

test_set_unk = ['i', 'am', '<UNK>']

And here's our bigram probabilities for the set with unknowns.

bigram_probabilities_unk = {('i', 'am'): 1.0,
                            ('am', '<UNK>'): 1.0,
                            ('<UNK>', '<UNK>'): 0.5,
                            ('<UNK>', 'i'): 0.25}

"i" is always followed by "am" so the first probability is going to be 1. "am" is always followed by "<UNK>" so the second probability will also be 1. Two of the four "<UNK>"s are followed by an "<UNK>" so the third probability is 1/2 and "<UNK>" is followed by "i" once, so the last probability is 1/4.

M = len(test_set)
probability = 1
probability_unk = 1

n, N = len(test_set), 2

for i in range(n - N + 1):
    bigram = tuple(test_set[i: i + N])
    probability = probability * bigram_probabilities[bigram]

    bigram_unk = tuple(test_set_unk[i: i + N])
    probability_unk = probability_unk * bigram_probabilities_unk[bigram_unk]

# calculate perplexity for both original test set and test set with <UNK>
perplexity = probability ** (-1 / M)
perplexity_unk = probability_unk ** (-1 / M)

print(f"perplexity for the training set: {perplexity}")
print(f"perplexity for the training set with <UNK>: {perplexity_unk}")
perplexity for the training set: 1.2599210498948732
perplexity for the training set with <UNK>: 1.0

So the model with unknown words gets a better (lower) perplexity than the model that keeps all the words, even though it's clearly less informative. It's a little mysterious to me why you would choose to put all these unknowns in the training set, unless you're trying to save space or something. I'll have to go back and read about that.

Smoothing

As with prior cases where we had to calculate probabilities, we need to be able to handle probabilities for n-grams that we didn't learn. We're going to use add-k smoothing here as an example.

def add_k_smoothing(k: float, vocabulary_size: int, n_gram_count: int,
                    n_gram_prefix_count: int) -> float:
    """Estimate an n-gram probability with add-k smoothing applied to the counts"""
    return ((n_gram_count + k)/
            (n_gram_prefix_count + k * vocabulary_size))

We'll take a look at k=1 (Laplacian) smoothing for a trigram.

trigram_counts = {('i', 'am', 'happy'): 2}
bigram_counts = {('am', 'happy'): 10}
vocabulary_size = 5
k = 1

probability_known_trigram = add_k_smoothing(k, vocabulary_size,
                                            trigram_counts[('i', 'am', 'happy')],
                                            bigram_counts[('am', 'happy')])

probability_unknown_trigram = add_k_smoothing(k, vocabulary_size, 0, 0)

print(f"probability_known_trigram: {probability_known_trigram: 0.03f}")
print(f"probability_unknown_trigram: {probability_unknown_trigram: 0.03f}")
probability_known_trigram:  0.200
probability_unknown_trigram:  0.200

So, here's a problem with add-k smoothing - when the n-gram is unknown, we still get a 20% probability, which in this case happens to be the same as a trigram that was in the training set.

Back-Off

Here's an alternate way to handle unknown n-grams - if the n-gram isn't known, use a probability for a smaller n.

Here are our pre-calculated probabilities of all types of n-grams.

trigram_probabilities = {('i', 'am', 'happy'): 0}
bigram_probabilities = {( 'am', 'happy'): 0.3}
unigram_probabilities = {'happy': 0.4}

Here's the trigram that we want the probability for. As you can see, we don't have "you" in our known n-grams.

trigram = ('are', 'you', 'happy')
bigram, unigram = trigram[1: 3], trigram[2]

Now we can do a brute-force search down the n-gram hierarchy for the first non-zero probability. The back-off factor \(\lambda\) was discovered experimentally.

lambda_factor = 0.4
probability_hat_trigram = 0

# search for first non-zero probability starting with the trigram
# to generalize this for any order of n-gram hierarchy, 
# you could loop through the probability dictionaries instead of if/else cascade
if trigram not in trigram_probabilities or trigram_probabilities[trigram] == 0:
    print(f"probability for trigram {trigram} not found")

    if bigram not in bigram_probabilities or bigram_probabilities[bigram] == 0:
        print(f"probability for bigram {bigram} not found")

        if unigram in unigram_probabilities:
            print(f"probability for unigram {unigram} found\n")
            probability_hat_trigram = lambda_factor * lambda_factor * unigram_probabilities[unigram]
        else:
            probability_hat_trigram = 0
    else:
        probability_hat_trigram = lambda_factor * bigram_probabilities[bigram]
else:
    probability_hat_trigram = trigram_probabilities[trigram]

print(f"probability for trigram {trigram} estimated as {probability_hat_trigram:0.3f}")
probability for trigram ('are', 'you', 'happy') not found
probability for bigram ('you', 'happy') not found
probability for unigram happy found

probability for trigram ('are', 'you', 'happy') estimated as 0.064
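
As the comment in that cell suggests, the if/else cascade can be replaced with a loop over the probability tables so the same logic works for any n-gram order. Here's a minimal sketch of that idea ("stupid" back-off with a constant discount); the function name and the convention that the tables are ordered from longest n-gram down to unigrams are my own assumptions.

def backoff_probability(ngram: tuple, tables: list, discount: float=0.4) -> float:
    """Back off to shorter n-grams until a non-zero probability is found

    Args:
     ngram: the full n-gram as a tuple of words
     tables: probability dictionaries ordered from longest n-gram down to unigrams
     discount: back-off factor applied once for every level backed off

    Returns: the (discounted) probability estimate for the n-gram
    """
    penalty = 1
    for level, table in enumerate(tables):
        key = ngram[level:]
        # unigram tables are keyed by the bare word, not a 1-tuple
        key = key if len(key) > 1 else key[0]
        probability = table.get(key, 0)
        if probability > 0:
            return penalty * probability
        penalty *= discount
    return 0

print(backoff_probability(trigram,
                          [trigram_probabilities, bigram_probabilities, unigram_probabilities]))

For ('are', 'you', 'happy') this backs off twice and returns the same 0.4 * 0.4 * 0.4 = 0.064 as the cascade above.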

Interpolation

Interpolation is yet another way to handle unknown n-grams. In this case you always combine the trigram, bigram, and unigram probabilities as a weighted sum, which avoids back-off's cascade of checks. As always, there's no free lunch - you have to find the best weights to make this work (but we'll take some pre-made ones).

Pre-calculated probabilities of all types of n-grams.

trigram_probabilities = {('i', 'am', 'happy'): 0.15}
bigram_probabilities = {( 'am', 'happy'): 0.3}
unigram_probabilities = {'happy': 0.4}

The weights come from optimization on a validation set.

lambda_1 = 0.8
lambda_2 = 0.15
lambda_3 = 0.05

And now the trigram whose probability we want to estimate as well as derived bigrams and unigrams.

trigram = ('i', 'am', 'happy')
bigram, unigram = trigram[1: 3], trigram[2]
probability_hat_trigram = (lambda_1 * trigram_probabilities[trigram]
                           + lambda_2 * bigram_probabilities[bigram]
                           + lambda_3 * unigram_probabilities[unigram])

print(f"estimated probability of the input trigram {trigram} is {probability_hat_trigram: 0.4f}")
estimated probability of the input trigram ('i', 'am', 'happy') is  0.1850
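
Since linear interpolation generalizes to any number of n-gram orders, a small helper makes the weighted sum explicit. This is just a sketch: the function name, the ordering of the tables (longest n-gram first), and the matching order of the weights are my own assumptions.

def interpolate(ngram: tuple, tables: list, weights: list) -> float:
    """Linearly interpolate an n-gram probability across all n-gram orders

    Args:
     ngram: the full n-gram as a tuple of words
     tables: probability dictionaries ordered from longest n-gram down to unigrams
     weights: one lambda per table (they should sum to 1)

    Returns: the interpolated probability estimate
    """
    estimate = 0
    for level, (weight, table) in enumerate(zip(weights, tables)):
        key = ngram[level:]
        # unigram tables are keyed by the bare word, not a 1-tuple
        key = key if len(key) > 1 else key[0]
        estimate += weight * table.get(key, 0)
    return estimate

print(interpolate(trigram,
                  [trigram_probabilities, bigram_probabilities, unigram_probabilities],
                  [lambda_1, lambda_2, lambda_3]))

With the weights and probabilities above this reproduces the 0.185 from the previous cell.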

End

So, there are various ways to handle both individual words and n-grams that we don't recognize.

N-Gram: Building the Language Model

Building the Language Model

Imports

# python
from collections import defaultdict
from functools import partial

import random

# pypi
from tabulate import tabulate

import numpy
import pandas

Set Up

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")

Middle

The Count Matrix

To calculate the n-gram probability, you will need to count frequencies of n-grams and n-gram prefixes in the training dataset. In some of the code assignment exercises, you will store the n-gram frequencies in a dictionary.

In other parts of the assignment, you will build a count matrix that keeps counts of (n-1)-gram prefix followed by all possible last words in the vocabulary.

The following code shows how to check, retrieve and update counts of n-grams in the word count dictionary.

n_gram_counts = {
    ('i', 'am', 'happy'): 2,
    ('am', 'happy', 'because'): 1}

# get count for an n-gram tuple
print(f"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}")
count of n-gram ('i', 'am', 'happy'): 2
# check if n-gram is present in the dictionary
n_gram = ('i', 'am', 'learning')
status = "found" if n_gram in n_gram_counts else "missing"

print(f"n-gram {n_gram} {status}")
n-gram ('i', 'am', 'learning') missing
# update the count in the word count dictionary
n_gram_counts[n_gram] = 1
status = "found" if ('i', 'am', 'learning') in n_gram_counts else "missing"
print(f"n-gram {n_gram} {status}")
n-gram ('i', 'am', 'learning') found

The next code snippet shows how to merge two tuples in Python: concatenating the prefix tuple with a one-element tuple holding the last word produces the n-gram. That will be handy when building n-grams from a prefix and a candidate next word.

prefix = ('i', 'am', 'happy')
word = 'because'

# note here the syntax for creating a tuple for a single word
n_gram = prefix + (word,)
print(n_gram)
('i', 'am', 'happy', 'because')

Single-Pass Trigram Count Matrix

In the lecture, you've seen that the count matrix could be made in a single pass through the corpus. Here is one approach to do that.

def single_pass_trigram_count_matrix(corpus: list) -> tuple:
    """
    Creates the trigram count matrix from the input corpus in a single pass through the corpus.

    Args:
       corpus: Pre-processed and tokenized corpus. 

    Returns:
       bigrams: list of all bigram prefixes, row index
       vocabulary: list of all found words, the column index
       count_matrix: pandas dataframe with bigram prefixes as rows, 
                     vocabulary words as columns 
                     and the counts of the bigram/word combinations (i.e. trigrams) as values
    """
    bigrams = []
    vocabulary = []
    count_matrix_dict = defaultdict(dict)

    # go through the corpus once with a sliding window
    for i in range(len(corpus) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tuple(corpus[i : i + 3])

        bigram = trigram[0 : -1]
        if bigram not in bigrams:
            bigrams.append(bigram)

        last_word = trigram[-1]
        if last_word not in vocabulary:
            vocabulary.append(last_word)

        if (bigram,last_word) not in count_matrix_dict:
            count_matrix_dict[bigram,last_word] = 0

        count_matrix_dict[bigram,last_word] += 1

    # convert the count_matrix to numpy.array to fill in the blanks
    count_matrix = numpy.zeros((len(bigrams), len(vocabulary)))
    for trigram_key, trigram_count in count_matrix_dict.items():
        count_matrix[bigrams.index(trigram_key[0]),
                     vocabulary.index(trigram_key[1])] = trigram_count

    # numpy.array to pandas dataframe conversion
    count_matrix = pandas.DataFrame(count_matrix, index=bigrams, columns=vocabulary)
    return bigrams, vocabulary, count_matrix
corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

bigrams, vocabulary, count_matrix = single_pass_trigram_count_matrix(corpus)

print(TABLE(count_matrix))
|                      |   happy |   because |   i |   am |   learning |   . |
|----------------------+---------+-----------+-----+------+------------+-----|
| ('i', 'am')          |       1 |         0 |   0 |    0 |          1 |   0 |
| ('am', 'happy')      |       0 |         1 |   0 |    0 |          0 |   0 |
| ('happy', 'because') |       0 |         0 |   1 |    0 |          0 |   0 |
| ('because', 'i')     |       0 |         0 |   0 |    1 |          0 |   0 |
| ('am', 'learning')   |       0 |         0 |   0 |    0 |          0 |   1 |

The Probability Matrix

The next step is to build a probability matrix from the count matrix. You can use a pandas DataFrame and its sum and div methods to normalize the cell counts by the sum of their respective rows.

Create the Probability Matrix from the Count Matrix

row_sums = count_matrix.sum(axis="columns")

Divide Each Row By Its Sum

prob_matrix = count_matrix.div(row_sums, axis="rows")

print(TABLE(prob_matrix))
|                      |   happy |   because |   i |   am |   learning |   . |
|----------------------+---------+-----------+-----+------+------------+-----|
| ('i', 'am')          |     0.5 |         0 |   0 |    0 |        0.5 |   0 |
| ('am', 'happy')      |       0 |         1 |   0 |    0 |          0 |   0 |
| ('happy', 'because') |       0 |         0 |   1 |    0 |          0 |   0 |
| ('because', 'i')     |       0 |         0 |   0 |    1 |          0 |   0 |
| ('am', 'learning')   |       0 |         0 |   0 |    0 |          0 |   1 |

Find the Probability of a Trigram

Since the columns of the probability matrix are the suffix-words and the index is made up of the bigram-prefixes, we'll need to unpack the trigram into those two pieces to look up its probability.

trigram = ('i', 'am', 'happy')

bigram = trigram[:-1]
print(f'prefix-bigram: {bigram}')
prefix-bigram: ('i', 'am')
word = trigram[-1]
print(f"last word of the trigram: {word}")
last word of the trigram: happy
print(f"trigram probability: {prob_matrix[word][bigram]}")
trigram probability: 0.5

If you look at our corpus or the count matrix, you can see this is the correct value: "i am" appears twice, and one of those two times it's followed by "happy".

List all the words in the vocabulary starting with a given prefix

This is just a demonstration of checking the prefix of a string in Python.

vocabulary = ['i', 'am', 'happy', 'because', 'learning', '.', 'have', 'you', 'seen','it', '?']
starts_with = 'ha'

print(f'words in vocabulary starting with prefix: {starts_with}\n')
for word in (candidate for candidate in vocabulary
             if candidate.startswith(starts_with)):
    print(word)
words in vocabulary starting with prefix: ha

happy
have

Building Training, Validation, and Testing Sets

def train_validation_test_split(data, train_percent, validation_percent):
    """
    Splits the input data into train/validation/test sets according to the percentages provided

    Args:
       data: Pre-processed and tokenized corpus, i.e. list of sentences.
       train_percent: integer 0-100, defines the portion of input corpus allocated for training
       validation_percent: integer 0-100, defines the portion of input corpus allocated for validation

       Note: train_percent + validation_percent needs to be <= 100;
             the remainder is allocated to the test set

    Returns:
       train_data: list of sentences, the training part of the corpus
       validation_data: list of sentences, the validation part of the corpus
       test_data: list of sentences, the test part of the corpus
    """
    # fixed seed here for reproducibility
    random.seed(87)

    # reshuffle all input sentences
    random.shuffle(data)

    train_size = int(len(data) * train_percent / 100)
    train_data = data[0:train_size]

    validation_size = int(len(data) * validation_percent / 100)
    validation_data = data[train_size:train_size + validation_size]

    test_data = data[train_size + validation_size:]

    return train_data, validation_data, test_data

Check the Sets

data = list(range (0, 100))

train_data, validation_data, test_data = train_validation_test_split(data, 80, 10)
print("split 80/10/10:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n", 
      f"test data:{test_data}\n")

train_data, validation_data, test_data = train_validation_test_split(data, 98, 1)
print("split 98/1/1:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n", 
      f"test data:{test_data}\n")
split 80/10/10:
 train data:[28, 76, 5, 0, 62, 29, 54, 95, 88, 58, 4, 22, 92, 14, 50, 77, 47, 33, 75, 68, 56, 74, 43, 80, 83, 84, 73, 93, 66, 87, 9, 91, 64, 79, 20, 51, 17, 27, 12, 31, 67, 81, 7, 34, 45, 72, 38, 30, 16, 60, 40, 86, 48, 21, 70, 59, 6, 19, 2, 99, 37, 36, 52, 61, 97, 44, 26, 57, 89, 55, 53, 85, 3, 39, 10, 71, 23, 32, 25, 8]
 validation data:[78, 65, 63, 11, 49, 98, 1, 46, 15, 41]
 test data:[90, 96, 82, 42, 35, 13, 69, 24, 94, 18]

split 98/1/1:
 train data:[66, 23, 29, 28, 52, 87, 70, 13, 15, 2, 62, 43, 82, 50, 40, 32, 30, 79, 71, 89, 6, 10, 34, 78, 11, 49, 39, 42, 26, 46, 58, 96, 97, 8, 56, 86, 33, 93, 92, 91, 57, 65, 95, 20, 72, 3, 12, 9, 47, 37, 67, 1, 16, 74, 53, 99, 54, 68, 5, 18, 27, 17, 48, 36, 24, 45, 73, 19, 41, 59, 21, 98, 0, 31, 4, 85, 80, 64, 84, 88, 25, 44, 61, 22, 60, 94, 76, 38, 77, 81, 90, 69, 63, 7, 51, 14, 55, 83]
 validation data:[35]
 test data:[75]

Perplexity

In order to implement the perplexity formula, you'll need to know how to take the M-th root of a variable.

\begin{equation*} PP(W)=\sqrt[M]{\prod_{i=1}^{M}{\frac{1}{P(w_i|w_{i-1})}}} \end{equation*}

Remember from calculus:

\begin{equation*} \sqrt[M]{\frac{1}{x}} = x^{-\frac{1}{M}} \end{equation*}

Here is some code that will help you with the formula.

p = 10 ** (-250)
M = 100

print(f"perplexity = {p ** (-1 / M):0.3f}")
perplexity = 316.228
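
Putting the root and the product together, here is a small sketch of the whole perplexity calculation for a sentence scored with bigram probabilities. The function name and the toy probability table are assumptions made just to keep the example self-contained.

def perplexity(sentence: list, bigram_probabilities: dict) -> float:
    """Calculate PP(W) for a sentence using pre-computed bigram probabilities

    Args:
     sentence: list of tokens
     bigram_probabilities: maps (previous word, word) tuples to probabilities

    Returns: the perplexity of the sentence under the bigram model
    """
    M = len(sentence)
    product = 1
    for i in range(M - 1):
        bigram = tuple(sentence[i: i + 2])
        product *= 1 / bigram_probabilities[bigram]
    return product ** (1 / M)

# toy probabilities, assumed just for this sketch
toy_probabilities = {('i', 'am'): 1.0, ('am', 'learning'): 0.5}
print(f"{perplexity(['i', 'am', 'learning'], toy_probabilities):0.3f}")

This comes out to the cube root of 2, or about 1.260.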

N-Gram Pre-Processing

Beginning

Imports

# python
import os
import re

# pypi
from dotenv import load_dotenv

import nltk

Set Up

The Environment

load_dotenv("posts/nlp/.env")

The Data Set

This is the pre-trained tokenizer that comes with NLTK.

nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/hades/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

N-gram Corpus Preprocessing

Some common preprocessing steps for the language models include the following (a combined sketch follows the list):

  • lowercasing the text
  • removing special characters
  • splitting the text into a list of sentences
  • splitting each sentence into a list of words
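
Here's a minimal sketch that chains those steps together. The helper name is my own, the regular expression mirrors the one in the "Remove Special Characters" cell below (with the newline kept so the sentence split still works), and newline is assumed to be the sentence boundary.

def preprocess(corpus: str) -> list:
    """Lowercase, strip special characters, and tokenize a corpus

    Args:
     corpus: raw text with sentences separated by newlines

    Returns: list of tokenized sentences (each a list of words)
    """
    corpus = corpus.lower()
    # keep letters, digits, basic punctuation, spaces, and the newlines used as sentence breaks
    corpus = re.sub(r"[^a-zA-Z0-9.?! \n]+", "", corpus)
    sentences = (sentence.strip() for sentence in corpus.split("\n"))
    return [nltk.word_tokenize(sentence) for sentence in sentences if sentence]

print(preprocess("Learning% makes 'me' happy.\nI am happy be-cause I am learning! :)"))

Each of the individual steps is demonstrated on its own in the cells that follow.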

Lowercase

corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()

# note that word "learning" will now be the same regardless of its position in the sentence
print(corpus)
learning% makes 'me' happy. i am happy be-cause i am learning! :)

Remove Special Characters

corpus = "learning% makes 'me' happy. i am happy be-cause i am learning! :)"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)
learning makes me happy. i am happy because i am learning! 

Splitting Text

# split text by a delimiter to array
input_date="Sat May  9 07:33:35 CEST 2020"

# get the date parts in array
date_parts = input_date.split(" ")
print(f"date parts = {date_parts}")

#get the time parts in array
time_parts = date_parts[4].split(":")
print(f"time parts = {time_parts}")
date parts = ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']
time parts = ['07', '33', '35']

Tokenizing Sentences

sentence = 'i am happy because i am learning.'
tokenized_sentence = nltk.word_tokenize(sentence)
print(f'{sentence} -> {tokenized_sentence}')
i am happy because i am learning. -> ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

Find the Length of Each Word

sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
word_lengths = [(word, len(word)) for word in sentence] # Create a list with the word lengths using a list comprehension
print(f' Lengths of the words: \n{word_lengths}')
 Lengths of the words: 
[('i', 1), ('am', 2), ('happy', 5), ('because', 7), ('i', 1), ('am', 2), ('learning', 8), ('.', 1)]

Convert to N-Grams

def sentence_to_trigram(tokenized_sentence):
    """
    Prints all trigrams in the given tokenized sentence.

    Args:
       tokenized_sentence: The words list.

    Returns:
       No output
    """
    # note that the last starting position of the window is the third word from the end
    for i in range(len(tokenized_sentence) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tokenized_sentence[i : i + 3]
        print(trigram)

tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

print(f'List all trigrams of sentence: {tokenized_sentence}\n')
sentence_to_trigram(tokenized_sentence)
List all trigrams of sentence: ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

['i', 'am', 'happy']
['am', 'happy', 'because']
['happy', 'because', 'i']
['because', 'i', 'am']
['i', 'am', 'learning']
['am', 'learning', '.']

N-Gram Prefix

\begin{equation*} P(w_n|w_1^{n-1})=\frac{C(w_1^n)}{C(w_1^{n-1})} \end{equation*}
# get trigram prefix from a 4-gram
fourgram = ['i', 'am', 'happy','because']
trigram = fourgram[0:-1] # Get the elements from 0, included, up to the last element, not included.
print(trigram)
['i', 'am', 'happy']
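
To make the formula concrete, here's a tiny worked example with made-up counts: if the trigram "i am happy" occurred twice in a corpus and the 4-gram "i am happy because" occurred once, the estimated probability of "because" following "i am happy" is 1/2. The count dictionaries here are assumptions for illustration only.

# hypothetical counts, just to illustrate the formula
fourgram_counts = {('i', 'am', 'happy', 'because'): 1}
trigram_counts = {('i', 'am', 'happy'): 2}

fourgram = ('i', 'am', 'happy', 'because')
prefix = fourgram[:-1]
probability = fourgram_counts[fourgram] / trigram_counts[prefix]
print(f"P({fourgram[-1]} | {' '.join(prefix)}) = {probability}")
P(because | i am happy) = 0.5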

Adding Start and End Tokens

n = 3
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
tokenized_sentence = ["<s>"] * (n - 1) + tokenized_sentence + ["</s>"]
print(tokenized_sentence)
['<s>', '<s>', 'i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.', '</s>']