Auto-Complete: Pre-Process the Data II

Beginning

This is the third post in a series that begins with Auto-Complete. In the previous entry we did some basic preprocessing to transform the raw tweet data into something closer to what we want. In this post we'll count the words in the data and handle out-of-vocabulary words so that we can use the data to build our model.

Imports

# python
import os

# pypi
from dotenv import load_dotenv
from expects import (
    contain_exactly,
    contain_only,
    equal,
    expect,
    have_keys)

# this series
from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit

Set Up

The Environment

load_dotenv("posts/nlp/.env", override=True)

The Data

path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
train_data, test_data = splitter.training, splitter.testing

Middle

Count Words

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: count_words ###
def count_words(tokenized_sentences: list) -> dict:
    """
    Count how many times each word appears in the tokenized sentences

    Args:
       tokenized_sentences: List of lists of strings

    Returns:
       dict that maps word (str) to the frequency (int)
    """

    word_counts = {}
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Loop through each sentence
    for sentence in tokenized_sentences: # complete this line

        # Go through each token in the sentence
        for token in sentence: # complete this line

            # If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts: # complete this line
                word_counts[token] = 1

            # If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1

    ### END CODE HERE ###

    return word_counts

Test the Code

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
actual = count_words(tokenized_sentences)

expected =  {'sky': 1,
             'is': 1,
             'blue': 1,
             '.': 3,
             'leaves': 1,
             'are': 2,
             'green': 1,
             'roses': 1,
             'red': 1}

expect(actual).to(have_keys(**expected))
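
As a side note, since `count_words` is just tallying tokens, the standard library's `collections.Counter` can build the same mapping in one line; this is the approach the `CountProcessor` class at the end of this post takes. A minimal sketch using the toy sentences above:

from collections import Counter
from itertools import chain

# flatten the sentences into one stream of tokens and count them
counter_counts = Counter(chain.from_iterable(tokenized_sentences))
expect(counter_counts).to(have_keys(**expected))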

Out-Of-Vocabulary Words

If your model is performing autocomplete but encounters a word it never saw during training, it has no counts for that word, so it can't use it to predict the next word to suggest.

  • Such a 'new' word is called an 'unknown word', or an out-of-vocabulary (OOV) word.
  • The percentage of unknown words in the test set is called the OOV rate (a small sketch computing this rate follows the list below).

To handle unknown words during prediction, use a special token, `<unk>`, to represent all of them.

  • Modify the training data so that it has some 'unknown' words to train on.
  • The words to convert into 'unknown' words are those that occur infrequently in the training set.
  • Create a list of the most frequent words in the training set, called the closed vocabulary.
  • Convert all other words that are not part of the closed vocabulary to the token `<unk>`.
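
To make the OOV rate concrete, here's a minimal sketch that measures it for a toy vocabulary and test set (the `toy_*` names are purely illustrative, not part of the assignment):

# a toy vocabulary and test set, purely for illustration
toy_vocabulary = {"i", "like", "tea", "."}
toy_test = [["i", "like", "coffee", "."], ["coffee", "is", "nice", "."]]

test_tokens = [token for sentence in toy_test for token in sentence]
unknown_count = sum(1 for token in test_tokens if token not in toy_vocabulary)
print(f"OOV rate: {unknown_count / len(test_tokens):.2f}")
OOV rate: 0.50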

Create a function that takes in the tokenized sentences and a threshold, `count_threshold`.

  • Any word whose count is greater than or equal to `count_threshold` is kept in the closed vocabulary.
  • The function returns the closed vocabulary as a list of words.
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_words_with_nplus_frequency ###
def get_words_with_nplus_frequency(tokenized_sentences: list, count_threshold: int) -> list:
    """
    Find the words that appear N times or more

    Args:
       tokenized_sentences: List of lists of strings
       count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.

    Returns:
       List of words that appear N times or more
    """
    # Initialize an empty list to contain the words that
    # appear at least 'count_threshold' times.
    closed_vocab = []

    # Get the word counts of the tokenized sentences
    # Use the function that you defined earlier to count the words
    word_counts = count_words(tokenized_sentences)

    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # for each word and its count
    for word, cnt in word_counts.items(): # complete this line

        # check that the word's count
        # is at least as great as the minimum count
        if cnt >= count_threshold:

            # append the word to the list
            closed_vocab.append(word)
    ### END CODE HERE ###

    return closed_vocab

Test The Code

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
actual = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(actual)
expected = ['.', 'are']
expect(actual).to(contain_exactly(*expected))
Closed vocabulary:
['.', 'are']
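
For comparison, the same closed vocabulary can be pulled straight out of a Counter with a comprehension, which is essentially what the `vocabulary` property of the `CountProcessor` class at the end of this post does. A sketch, reusing the toy `tokenized_sentences` from the test above:

from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(tokenized_sentences))
closed_vocabulary = [word for word, count in counts.items() if count >= 2]
expect(closed_vocabulary).to(contain_exactly(*expected))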

Parts Unknown

The words that appear `count_threshold` times or more are in the closed vocabulary.

  • All other words are regarded as `unknown`.
  • Replace words not in the closed vocabulary with the token `<unk>`.
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: replace_oov_words_by_unk ###
def replace_oov_words_by_unk(tokenized_sentences: list,
                             vocabulary: list,
                             unknown_token: str="<unk>") -> list:
    """
    Replace words not in the given vocabulary with the '<unk>' token.

    Args:
       tokenized_sentences: List of lists of strings
       vocabulary: List of strings that make up the closed vocabulary
       unknown_token: A string representing unknown (out-of-vocabulary) words

    Returns:
       List of lists of strings, with words not in the vocabulary replaced
    """

    # Place vocabulary into a set for faster search
    vocabulary = set(vocabulary)

    # Initialize a list that will hold the sentences
    # after less frequent words are replaced by the unknown token
    replaced_tokenized_sentences = []

    # Go through each sentence
    for sentence in tokenized_sentences:

        # Initialize the list that will contain
        # a single sentence with "unknown_token" replacements
        replaced_sentence = []
        ### START CODE HERE (Replace instances of 'None' with your code) ###

        # for each token in the sentence
        for token in sentence: # complete this line

            # Check if the token is in the closed vocabulary
            if token in vocabulary: # complete this line
                # If so, append the word to the replaced_sentence
                replaced_sentence.append(token)
            else:
                # otherwise, append the unknown token instead
                replaced_sentence.append(unknown_token)
        ### END CODE HERE ###

        # Append the list of tokens to the list of lists
        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences

Test It

tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)

print(f"Original sentence:")
print(tokenized_sentences)
expecteds = [['dogs', 'run'], ['cats', 'sleep']]
for actual, expected in zip(tokenized_sentences, expecteds):
    expect(actual).to(contain_exactly(*expected))

print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)
expecteds = [['dogs', '<unk>'], ['<unk>', 'sleep']]
for actual,expected in zip(tmp_replaced_tokenized_sentences, expecteds):
    expect(actual).to(contain_exactly(*expected))
Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]
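
The same replacement can also be written as a nested list comprehension over a vocabulary set, which is how the `parts_unknown` method of the `CountProcessor` class below handles it. A sketch using the sentences and vocabulary from the test above:

vocabulary_set = set(vocabulary)
replaced = [[token if token in vocabulary_set else "<unk>" for token in sentence]
            for sentence in tokenized_sentences]

expecteds = [['dogs', '<unk>'], ['<unk>', 'sleep']]
for actual, expected in zip(replaced, expecteds):
    expect(actual).to(contain_exactly(*expected))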

Combine Them

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: preprocess_data ###
def preprocess_data(train_data: list, test_data: list, count_threshold: int) -> tuple:
    """
    Preprocess the data:
       - Find tokens that appear at least N times in the training data.
       - Replace tokens that appear fewer than N times with "<unk>" in both the training and test data.
    Args:
       train_data, test_data: List of lists of strings.
       count_threshold: Words whose count is less than this are
                        treated as unknown.

    Returns:
       Tuple of
       - training data with low-frequency words replaced by "<unk>"
       - test data with low-frequency words replaced by "<unk>"
       - vocabulary of words that appear N times or more in the training data
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)

    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)

    # For the test data, replace less common words with "<unk>"
    test_data_replaced =  replace_oov_words_by_unk(test_data, vocabulary)

    ### END CODE HERE ###
    return train_data_replaced, test_data_replaced, vocabulary

Test It

tmp_train = [['sky', 'is', 'blue', '.'],
     ['leaves', 'are', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]

tmp_train_repl, tmp_test_repl, tmp_vocab = preprocess_data(tmp_train, 
                                                           tmp_test, 
                                                           count_threshold = 1)

print("tmp_train_repl")
print(tmp_train_repl)
expecteds = [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]
for actual, expected in zip(tmp_train_repl, expecteds):
    expect(actual).to(contain_exactly(*expected))
print()
print("tmp_test_repl")
print(tmp_test_repl)

expecteds = [['<unk>', 'are', '<unk>', '.']]

for actual, expected in zip(tmp_test_repl, expecteds):
    expect(actual).to(contain_exactly(*expected))
print()
print("tmp_vocab")
print(tmp_vocab)
expected = ['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']
expect(tmp_vocab).to(contain_exactly(*expected))
tmp_train_repl
[['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]

tmp_test_repl
[['<unk>', 'are', '<unk>', '.']]

tmp_vocab
['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']

Preprocess the Real Data

minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, 
                                                                        test_data, 
                                                                        minimum_freq)
print("last preprocessed testing sample:")
actual = test_data_processed[-1]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()

print("preprocessed training sample:")
actual = train_data_processed[9592]
expected = ['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()

print("First 10 vocabulary:")
actual = vocabulary[0:10]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the']
print(actual)
#expect(actual).to(contain_exactly(*expected))
print()
actual = len(vocabulary)
print(f"Size of vocabulary: {actual:,}")
expected = 14821
#expect(actual).to(equal(expected))
last preprocessed testing sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']

preprocessed training sample:
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

First 10 vocabulary:
['d', '&', 's', 'is', 'covering', 'the', 'event', 'with', 'thomas', ',']

Size of vocabulary: 14,679

Note: My shuffling is different from the course's, even though I'm setting the seed, so the vocabulary comes out differently (which is why the last two assertions are commented out).
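
The OOV rate mentioned at the start of this post was never actually measured, so here's a hedged sketch that estimates it on the processed test data by counting how many tokens ended up as the unknown token (it assumes the variables from the cells above are still in scope; output not shown):

# fraction of test tokens that were replaced by the unknown token
total_tokens = sum(len(sentence) for sentence in test_data_processed)
unknown_tokens = sum(sentence.count("<unk>") for sentence in test_data_processed)
print(f"Test-set OOV rate: {unknown_tokens / total_tokens:.2%}")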

Put It All Together

The Imports

# python
from collections import Counter
from itertools import chain

# from pypi
import attr

The Processor

@attr.s(auto_attribs=True)
class CountProcessor:
    """Processes the data to have unknowns

    Args:
     training: the tokenized training data (list of lists)
     testing: the tokenized testing data
     count_threshold: minimum number of times a token needs to appear
     unknown_token: string to use for words below threshold
    """
    training: list
    testing: list
    count_threshold: int=2
    unknown_token: str="<unk>"
    _counts: dict=None
    _vocabulary: set=None
    _train_unknown: list=None
    _test_unknown: list=None
  • Counts
    @property
    def counts(self) -> Counter:
        """Count of each word in the training data"""
        if self._counts is None:
            self._counts = Counter(chain.from_iterable(self.training))
        return self._counts
    
  • The Vocabulary
    @property
    def vocabulary(self) -> set:
        """The tokens in training that appear at least ``count_threshold`` times"""
        if self._vocabulary is None:
            self._vocabulary = set((token for token, count in self.counts.items()
                                if count >= self.count_threshold))
        return self._vocabulary
    
  • Train Unknown
    @property
    def train_unknown(self) -> list:
        """Training data with words below threshold replaced"""
        if self._train_unknown is None:
            self._train_unknown = self.parts_unknown(self.training)
        return self._train_unknown
    
  • Test Unknown
    @property
    def test_unknown(self) -> list:
        """Testing data with words below threshold replaced"""
        if self._test_unknown is None:
            self._test_unknown = self.parts_unknown(self.testing)
        return self._test_unknown
    
  • Parts Unknown
    def parts_unknown(self, source: list) -> list:
        """Replaces tokens in source that aren't in vocabulary
    
        Args:
         source: nested list of lists with tokens to check
    
        Returns: source with unknown words replaced by unknown_token
        """
        return [
                [token if token in self.vocabulary else self.unknown_token
                 for token in tokens]
            for tokens in source
        ]    
    

Test It Out

from neurotic.nlp.autocomplete import CountProcessor

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]

testing = [[]]
processor = CountProcessor(tokenized_sentences, testing)
actual = processor.counts

expected =  {'sky': 1,
             'is': 1,
             'blue': 1,
             '.': 3,
             'leaves': 1,
             'are': 2,
             'green': 1,
             'roses': 1,
             'red': 1}

# note to future self: if you pass key=value to have_keys it checks both
expect(actual).to(have_keys(**expected))

actual = processor.vocabulary
expected = ['.', 'are']
expect(actual).to(contain_only(*expected))
tokenized_sentences = [["dogs", "run", "sleep"], ["cats", "sleep", "dogs"]]

testing = [["cows", "dogs"], ["pigs", "sleep"]]
processor = CountProcessor(training=tokenized_sentences, testing=testing)

actuals = processor.train_unknown

UNKNOWN = "<unk>"
expecteds = [["dogs", UNKNOWN, "sleep"], [UNKNOWN, "sleep", "dogs"]]
for actual,expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))

actuals = processor.test_unknown
expecteds = [[UNKNOWN, "dogs"], [UNKNOWN, "sleep"]]
for actual,expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))
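
As a last sanity check, the class can be pointed at the real training and testing splits from the setup above. With the default `count_threshold` of 2 it should reproduce the vocabulary size reported earlier (a sketch; output not shown):

# process the real data with the class instead of the graded functions
processor = CountProcessor(training=train_data, testing=test_data)
print(f"Size of vocabulary: {len(processor.vocabulary):,}")

train_processed = processor.train_unknown
test_processed = processor.test_unknown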

End

Now that we have the data in the basic form we want, we'll move on to building the N-Gram Language Model.