Auto-Complete: Pre-Process the Data II
Beginning
This is the third post in a series that begins with Auto-Complete. In the previous entry we did some basic preprocessing to transform the raw tweet data into a form closer to what we wanted. In this post we'll add some counts to the data so that we can use it to build our model.
Imports
# python
import os
# pypi
from dotenv import load_dotenv
from expects import (
contain_exactly,
contain_only,
equal,
expect,
have_keys)
# this series
from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit
Set Up
The Environment
load_dotenv("posts/nlp/.env", override=True)
The Data
path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
data = reader.read()
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
train_data, test_data = splitter.training, splitter.testing
Middle
Count Words
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: count_words ###
def count_words(tokenized_sentences: list) -> dict:
"""
Count the number of times each word appears in the tokenized sentences
Args:
tokenized_sentences: List of lists of strings
Returns:
dict that maps word (str) to the frequency (int)
"""
word_counts = {}
### START CODE HERE (Replace instances of 'None' with your code) ###
# Loop through each sentence
for sentence in tokenized_sentences: # complete this line
# Go through each token in the sentence
for token in sentence: # complete this line
# If the token is not in the dictionary yet, set the count to 1
if token not in word_counts: # complete this line
word_counts[token] = 1
# If the token is already in the dictionary, increment the count by 1
else:
word_counts[token] += 1
### END CODE HERE ###
return word_counts
Test the Code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green', '.'],
['roses', 'are', 'red', '.']]
actual = count_words(tokenized_sentences)
expected = {'sky': 1,
'is': 1,
'blue': 1,
'.': 3,
'leaves': 1,
'are': 2,
'green': 1,
'roses': 1,
'red': 1}
expect(actual).to(have_keys(**expected))
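Outside of the graded cell, the standard library can produce the same counts in a single expression; the CountProcessor class at the end of this post uses this approach. A minimal sketch, reusing the tokenized_sentences and expected values from the test above:
from collections import Counter
from itertools import chain

# Flatten the list of sentences into one stream of tokens and count each token.
word_counts = Counter(chain.from_iterable(tokenized_sentences))
expect(dict(word_counts)).to(have_keys(**expected))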
Out-Of-Vocabulary Words
If your model is performing autocomplete but encounters a word that it never saw during training, it won't have any counts for that word, so it can't use the current word to help it predict the next word to suggest.
- This 'new' word is called an unknown word, or an out-of-vocabulary (OOV) word.
- The percentage of unknown words in the test set is called the OOV rate (sketched just below).
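As a quick illustration of the OOV rate (a toy sketch, not part of the assignment; the vocabulary and test tokens here are made up):
# Toy example: the fraction of test tokens that are not in the closed vocabulary.
closed_vocabulary = {"the", "cat", "sat", "."}
test_tokens = ["the", "dog", "sat", "."]
oov_rate = sum(token not in closed_vocabulary for token in test_tokens)/len(test_tokens)
print(f"OOV rate: {oov_rate:.0%}")  # 25% - only 'dog' is unseen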
To handle unknown words during prediction, use a special token (here '<unk>') to represent all unknown words.
- Modify the training data so that it has some 'unknown' words to train on.
- Words to convert into 'unknown' words are those that do not occur very frequently in the training set.
- Create a list of the most frequent words in the training set, called the closed vocabulary.
- Convert all the other words that are not part of the closed vocabulary to the token '<unk>'.
Create a function that takes in the tokenized sentences and a threshold `count_threshold`.
- Any word whose count is greater than or equal to `count_threshold` is kept in the closed vocabulary.
- Return the closed vocabulary as a list of words.
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_words_with_nplus_frequency ###
def get_words_with_nplus_frequency(tokenized_sentences: list, count_threshold: int) -> list:
"""
Find the words that appear N times or more
Args:
tokenized_sentences: List of lists of strings
count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.
Returns:
List of words that appear N times or more
"""
# Initialize an empty list to contain the words that
# appear at least 'count_threshold' times.
closed_vocab = []
# Get the word counts of the tokenized sentences
# Use the function that you defined earlier to count the words
word_counts = count_words(tokenized_sentences)
### START CODE HERE (Replace instances of 'None' with your code) ###
# for each word and its count
for word, cnt in word_counts.items(): # complete this line
# check that the word's count
# is at least as great as the minimum count
if cnt >= count_threshold:
# append the word to the list
closed_vocab.append(word)
### END CODE HERE ###
return closed_vocab
Test The Code
tokenized_sentences = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green', '.'],
['roses', 'are', 'red', '.']]
actual = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(actual)
expected = ['.', 'are']
expect(actual).to(contain_exactly(*expected))
Closed vocabulary:
['.', 'are']
Parts Unknown
The words that appear `count_threshold` times or more are in the closed vocabulary.
- All other words are regarded as `unknown`.
- Replace words not in the closed vocabulary with the token `<unk>`.
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: replace_oov_words_by_unk ###
def replace_oov_words_by_unk(tokenized_sentences: list,
vocabulary: list,
unknown_token: str="<unk>") -> list:
"""
Replace words not in the given vocabulary with the '<unk>' token.
Args:
tokenized_sentences: List of lists of strings
vocabulary: List of strings that make up the closed vocabulary (words to keep)
unknown_token: A string representing unknown (out-of-vocabulary) words
Returns:
List of lists of strings, with words not in the vocabulary replaced
"""
# Place vocabulary into a set for faster search
vocabulary = set(vocabulary)
# Initialize a list that will hold the sentences
# after less frequent words are replaced by the unknown token
replaced_tokenized_sentences = []
# Go through each sentence
for sentence in tokenized_sentences:
# Initialize the list that will contain
# a single sentence with "unknown_token" replacements
replaced_sentence = []
### START CODE HERE (Replace instances of 'None' with your code) ###
# for each token in the sentence
for token in sentence: # complete this line
# Check if the token is in the closed vocabulary
if token in vocabulary: # complete this line
# If so, append the word to the replaced_sentence
replaced_sentence.append(token)
else:
# otherwise, append the unknown token instead
replaced_sentence.append(unknown_token)
### END CODE HERE ###
# Append the list of tokens to the list of lists
replaced_tokenized_sentences.append(replaced_sentence)
return replaced_tokenized_sentences
Test It
tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, vocabulary)
print(f"Original sentence:")
print(tokenized_sentences)
expecteds = [['dogs', 'run'], ['cats', 'sleep']]
for actual, expected in zip(tokenized_sentences, expecteds):
expect(actual).to(contain_exactly(*expected))
print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)
expecteds = [['dogs', '<unk>'], ['<unk>', 'sleep']]
for actual,expected in zip(tmp_replaced_tokenized_sentences, expecteds):
expect(actual).to(contain_exactly(*expected))
Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]
Combine Them
# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: preprocess_data ###
def preprocess_data(train_data: list, test_data: list, count_threshold: int) -> tuple:
"""
Preprocess data, i.e.,
- Find tokens that appear at least N times in the training data.
- Replace tokens that appear less than N times by "<unk>" both for training and test data.
Args:
train_data, test_data: List of lists of strings.
count_threshold: Words whose count is less than this are
treated as unknown.
Returns:
Tuple of
- training data with low-frequency words replaced by "<unk>"
- test data with low-frequency words replaced by "<unk>"
- vocabulary of words that appear n times or more in the training data
"""
### START CODE HERE (Replace instances of 'None' with your code) ###
# Get the closed vocabulary using the train data
vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)
# For the train data, replace less common words with "<unk>"
train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)
# For the test data, replace less common words with "<unk>"
test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)
### END CODE HERE ###
return train_data_replaced, test_data_replaced, vocabulary
tmp_train = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]
tmp_train_repl, tmp_test_repl, tmp_vocab = preprocess_data(tmp_train,
tmp_test,
count_threshold = 1)
print("tmp_train_repl")
print(tmp_train_repl)
expecteds = [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]
for actual, expected in zip(tmp_train_repl, expecteds):
expect(actual).to(contain_exactly(*expected))
print()
print("tmp_test_repl")
print(tmp_test_repl)
expecteds = [['<unk>', 'are', '<unk>', '.']]
for actual, expected in zip(tmp_test_repl, expecteds):
expect(actual).to(contain_exactly(*expected))
print()
print("tmp_vocab")
print(tmp_vocab)
expected = ['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']
expect(tmp_vocab).to(contain_exactly(*expected))
tmp_train_repl
[['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green']]

tmp_test_repl
[['<unk>', 'are', '<unk>', '.']]

tmp_vocab
['sky', 'is', 'blue', '.', 'leaves', 'are', 'green']
Preprocess the Real Data
minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data,
test_data,
minimum_freq)
print("last preprocessed testing sample:")
actual = test_data_processed[-1]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()
print("preprocessed training sample:")
actual = train_data_processed[9592]
expected = ['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']
print(actual)
expect(actual).to(contain_exactly(*expected))
print()
print("First 10 vocabulary:")
actual = vocabulary[0:10]
expected = ['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the']
print(actual)
#expect(actual).to(contain_exactly(*expected))
print()
actual = len(vocabulary)
print(f"Size of vocabulary: {actual:,}")
expected = 14821
#expect(actual).to(equal(expected))
last preprocessed testing sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']

preprocessed training sample:
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

First 10 vocabulary:
['d', '&', 's', 'is', 'covering', 'the', 'event', 'with', 'thomas', ',']

Size of vocabulary: 14,679
Note: My shuffling comes out differently from theirs even though I'm setting the same seed, so the vocabulary order and size don't match their expected values (which is why those two assertions are commented out).
Put It All Together
The Imports
# python
from collections import Counter
from itertools import chain
# from pypi
import attr
The Processor
@attr.s(auto_attribs=True)
class CountProcessor:
"""Processes the data to have unknowns
Args:
training: the tokenized training data (list of lists)
testing: the tokenized testing data
count_threshold: minimum number of times a token needs to appear to stay in the vocabulary
unknown_token: string to use for words below threshold
"""
training: list
testing: list
count_threshold: int=2
unknown_token: str="<unk>"
_counts: dict=None
_vocabulary: set=None
_train_unknown: list=None
_test_unknown: list=None
- Counts
@property
def counts(self) -> Counter:
    """Count of each word in the training data"""
    if self._counts is None:
        self._counts = Counter(chain.from_iterable(self.training))
    return self._counts
- The Vocabulary
@property
def vocabulary(self) -> set:
    """The tokens in training that appear at least ``count_threshold`` times"""
    if self._vocabulary is None:
        self._vocabulary = set((token for token, count in self.counts.items()
                                if count >= self.count_threshold))
    return self._vocabulary
- Train Unknown
@property
def train_unknown(self) -> list:
    """Training data with words below threshold replaced"""
    if self._train_unknown is None:
        self._train_unknown = self.parts_unknown(self.training)
    return self._train_unknown
- Test Unknown
@property
def test_unknown(self) -> list:
    """Testing data with words below threshold replaced"""
    if self._test_unknown is None:
        self._test_unknown = self.parts_unknown(self.testing)
    return self._test_unknown
- Parts Unknown
def parts_unknown(self, source: list) -> list:
    """Replaces tokens in source that aren't in vocabulary

    Args:
     source: nested list of lists with tokens to check

    Returns:
     source with unknown words replaced by unknown_token
    """
    return [
        [token if token in self.vocabulary else self.unknown_token
         for token in tokens]
        for tokens in source
    ]
Test It Out
from neurotic.nlp.autocomplete import CountProcessor
tokenized_sentences = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green', '.'],
['roses', 'are', 'red', '.']]
testing = [[]]
processor = CountProcessor(tokenized_sentences, testing)
actual = processor.counts
expected = {'sky': 1,
'is': 1,
'blue': 1,
'.': 3,
'leaves': 1,
'are': 2,
'green': 1,
'roses': 1,
'red': 1}
# note to future self: if you pass key=value to have_keys it checks both the keys and the values
expect(actual).to(have_keys(**expected))
actual = processor.vocabulary
expected = ['.', 'are']
expect(actual).to(contain_only(*expected))
tokenized_sentences = [["dogs", "run", "sleep"], ["cats", "sleep", "dogs"]]
testing = [["cows", "dogs"], ["pigs", "sleep"]]
processor = CountProcessor(training=tokenized_sentences, testing=testing)
actuals = processor.train_unknown
UNKNOWN = "<unk>"
expecteds = [["dogs", UNKNOWN, "sleep"], [UNKNOWN, "sleep", "dogs"]]
for actual,expected in zip(actuals, expecteds):
expect(actual).to(contain_exactly(*expected))
actuals = processor.test_unknown
expecteds = [[UNKNOWN, "dogs"], [UNKNOWN, "sleep"]]
for actual,expected in zip(actuals, expecteds):
expect(actual).to(contain_exactly(*expected))
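To mirror the earlier preprocess_data call on the real tweets, the class version can be applied the same way. A sketch, reusing the train_data and test_data from the setup (the default count_threshold of 2 matches minimum_freq above):
# Build the processor on the real training and testing splits.
processor = CountProcessor(training=train_data, testing=test_data)
train_processed = processor.train_unknown
test_processed = processor.test_unknown
print(f"Size of vocabulary: {len(processor.vocabulary):,}")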
End
Now that we have the data in the basic form we want, we'll move on to building the N-Gram Language Model.