Auto-Complete: Pre-Process the Data I

Beginning

This is the second part of a series implementing an n-gram-based auto-complete system for tweets. The starting post is Auto-Complete.

Imports

# python
import os
import random

# pypi
from dotenv import load_dotenv

from expects import (
    contain_exactly,
    equal,
    expect
)

import nltk

Set Up

The Environment

load_dotenv("posts/nlp/.env", override=True)
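
The TWITTER_AUTOCOMPLETE variable used below is defined in that .env file. The real path isn't part of this post, so this is just a hypothetical example of what the entry looks like:

# posts/nlp/.env (the path on the right is made up for illustration)
TWITTER_AUTOCOMPLETE=/path/to/the/tweet/data.txt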

Middle

Load the Data

We're going to use Twitter data. The source isn't listed anywhere, so I'm assuming it's the NLTK Twitter data, but maybe not.

path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()
print("Data type:", type(data))
print(f"Number of letters: {len(data):,}")
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")

print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")    
Data type: <class 'str'>
Number of letters: 3,335,477
First 300 letters of the data
-------
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A 
-------
Last 300 letters of the data
-------
ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n
-------

So the data looks like it's just the tweets themselves, with no metadata, separated by newlines.
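
Since the NLTK guess above is only a guess, here's one rough way you could check it against NLTK's twitter_samples corpus. This is just a sketch, it isn't used anywhere else in the series, and it assumes the corpus has been downloaded (nltk.download("twitter_samples")).

# rough check: how many of our newline-separated tweets appear verbatim
# in NLTK's twitter_samples corpus?
from nltk.corpus import twitter_samples

nltk_tweets = set()
for name in twitter_samples.fileids():
    nltk_tweets.update(twitter_samples.strings(name))

ours = set(line.strip() for line in data.split("\n") if line.strip())
print(f"Tweets also in twitter_samples: {len(ours & nltk_tweets):,} of {len(ours):,}")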

Split To Sentences

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: split_to_sentences ###
def split_to_sentences(data: str) -> list:
    """
    Split data by linebreak "\n"

    Args:
       data: str

    Returns:
       A list of sentences
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    sentences = data.split("\n")
    ### END CODE HERE ###

    # Additional cleaning (This part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

Test The Code

x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""
print(x)

expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
actual = split_to_sentences(x)
expect(actual).to(contain_exactly(*expected))

I have a pen.
I have an apple. 
Ah
Apple pen.

Tokenize Sentences

The next step is to tokenize sentences (split a sentence into a list of words).

  • Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
  • Append each tokenized list of words to a list of tokenized sentences.

Hints:

  • Use str.lower to convert strings to lowercase.
  • Please use nltk.word_tokenize to split sentences into tokens.
  • If you use str.split instead of nltk.word_tokenize, there are additional edge cases to handle, such as punctuation (commas, periods) that follows a word (there's a quick comparison of the two right after this list).
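
Here's a quick look at the difference that last hint is pointing at. It's just an illustration, not part of the graded function (nltk.word_tokenize needs the punkt tokenizer data, so you may need nltk.download("punkt") first).

sentence = "Sky is blue."
print(sentence.split())              # ['Sky', 'is', 'blue.'] - the period stays glued to the word
print(nltk.word_tokenize(sentence))  # ['Sky', 'is', 'blue', '.'] - punctuation gets its own token
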
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: tokenize_sentences ###
def tokenize_sentences(sentences: list) -> list:
    """
    Tokenize sentences into tokens (words)

    Args:
       sentences: List of strings

    Returns:
       List of lists of tokens
    """

    # Initialize the list of lists of tokenized sentences
    tokenized_sentences = []
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Go through each sentence
    for sentence in sentences:

        # Convert to lowercase letters
        sentence = sentence.lower()

        # Convert into a list of words
        tokenized = nltk.word_tokenize(sentence)

        # append the list of words to the list of lists
        tokenized_sentences.append(tokenized)

    ### END CODE HERE ###

    return tokenized_sentences

Test the Code

sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]

expecteds = [['sky', 'is', 'blue', '.'],
            ['leaves', 'are', 'green', '.'],
            ['roses', 'are', 'red', '.']]

actuals = tokenize_sentences(sentences)
for expected, actual in zip(expecteds, actuals):
    expect(actual).to(contain_exactly(*expected))

Combine Split Sentences and Tokenize

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_tokenized_data ###
def get_tokenized_data(data: str) -> list:
    """
    Make a list of tokenized sentences

    Args:
       data: String

    Returns:
       List of lists of tokens
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Get the sentences by splitting up the data
    sentences = split_to_sentences(data)

    # Get the list of lists of tokens by tokenizing the sentences
    tokenized_sentences = tokenize_sentences(sentences)

    ### END CODE HERE ###

    return tokenized_sentences

Test It

x = "Sky is blue.\nLeaves are green\nRoses are red."
actuals = get_tokenized_data(x)
expecteds =  [['sky', 'is', 'blue', '.'],
              ['leaves', 'are', 'green'],
              ['roses', 'are', 'red', '.']]
for actual, expected in zip(actuals, expecteds):
    expect(actual).to(contain_exactly(*expected))

Split Train and Test Sets

tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]
actual_data, expected_data = len(tokenized_data), 47961
actual_training, expected_training = len(train_data), 38368
actual_testing, expected_testing = len(test_data), 9593

print((f"{actual_data:,} are split into {actual_training:,} training entries"
       f" and {actual_testing:,} test set entries."))

for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected))
47,961 are split into 38,368 training entries and 9,593 test set entries.
print("First training sample:")
actual = train_data[0]
print(actual)
expected = ["i", "personally", "would", "like", "as", "our", "official", "glove",
            "of", "the", "team", "local", "company", "and", "quality",
            "production"]
expect(actual).to(contain_exactly(*expected))
First training sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print("First test sample")
actual = test_data[0]
print(actual)
expected = ["that", "picture", "i", "just", "seen", "whoa", "dere", "!", "!",
            ">", ">", ">", ">", ">", ">", ">"]
expect(actual).to(contain_exactly(*expected))
First test sample
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']

Object-Oriented

<<imports>>


<<the-tokenizer>>

    <<sentences>>

    <<tokenized>>


<<train-test-split>>

    <<shuffled-data>>

    <<training-data>>

    <<testing-data>>

    <<split>>

Imports

# python
import random

# pypi
import attr
import nltk

The Tokenizer

@attr.s(auto_attribs=True)
class Tokenizer:
    """Tokenizes string sentences

    Args:
     source: string data to tokenize
     end_of_sentence: what to split sentences on

    """
    source: str
    end_of_sentence: str="\n"
    _sentences: list=None
    _tokenized: list=None
    _training_data: list=None
  • Sentences
    @property
    def sentences(self) -> list:
        """The data split into sentences"""
        if self._sentences is None:
            self._sentences = self.source.split(self.end_of_sentence)
            self._sentences = (sentence.strip() for sentence in self._sentences)
            self._sentences = [sentence for sentence in self._sentences if sentence]
        return self._sentences
    
  • Tokenized
    @property
    def tokenized(self) -> list:
        """List of tokenized sentence"""
        if self._tokenized is None:
            self._tokenized = [nltk.word_tokenize(sentence.lower())
                               for sentence in self.sentences]
        return self._tokenized
    

Train-Test-Split

@attr.s(auto_attribs=True)
class TrainTestSplit:
    """splits up the training and testing sets

    Args:
     data: list of data to split
     training_fraction: how much to put in the training set
     seed: something to seed the random call
    """
    data: list
    training_fraction: float=0.8
    seed: int=87
    _shuffled: list=None
    _training: list=None
    _testing: list=None
    _split: int=None
  • Shuffled Data
    @property
    def shuffled(self) -> list:
        """The data shuffled"""
        if self._shuffled is None:
            random.seed(self.seed)
            self._shuffled = random.sample(self.data, k=len(self.data))
        return self._shuffled
    
  • Split
    @property
    def split(self) -> int:
        """The slice value for training and testing"""
        if self._split is None:
            self._split = int(len(self.data) * self.training_fraction)
        return self._split
    
  • Training Data
    @property
    def training(self) -> list:
        """The Training Portion of the Set"""
        if self._training is None:
            self._training = self.shuffled[0:self.split]
        return self._training
    
  • Testing Data
    @property
    def testing(self) -> list:
        """The testing data"""
        if self._testing is None:
            self._testing = self.shuffled[self.split:]
        return self._testing
    

Test It Out

  • Sentences
    from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit
    
    x = """
    I have a pen.\nI have an apple. \nAh\nApple pen.\n
    """
    expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
    tokenizer = Tokenizer(x)
    
    actual = tokenizer.sentences
    expect(actual).to(contain_exactly(*expected))
    
  • Tokens
    source = "\n".join(["Sky is blue.", "Leaves are green.", "Roses are red."])
    
    expecteds = [['sky', 'is', 'blue', '.'],
                ['leaves', 'are', 'green', '.'],
                ['roses', 'are', 'red', '.']]
    
    tokenizer = Tokenizer(source)
    actuals = tokenizer.tokenized
    for expected, actual in zip(expecteds, actuals):
        expect(actual).to(contain_exactly(*expected))
    

Training And Test Sets

random.seed(87)
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
actual_data, expected_data = len(tokenizer.tokenized), 47961
actual_training, expected_training = len(splitter.training), 38368
actual_testing, expected_testing = len(splitter.testing), 9593

print((f"{actual_data:,} are split into {actual_training:,} training entries"
       f" and {actual_testing:,} test set entries."))

for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected))
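
Two small things worth noting about the classes. First, the underscore attributes (_sentences, _tokenized, _shuffled, and so on) are lazy caches: the first read of a property does the work and stores the result, and later reads return the stored object. Second, unlike the procedural version, which shuffled the tokenized data in place with random.shuffle, TrainTestSplit uses random.sample, so the list you pass in keeps its original order and only the internal shuffled copy gets sliced. A small sketch of both behaviors on toy data (illustrative only):

tokenizer = Tokenizer("Sky is blue.\nLeaves are green.")
print(tokenizer.tokenized is tokenizer.tokenized)  # True: the second read re-uses the cached list

splitter = TrainTestSplit(list(range(10)), seed=87)
print(len(splitter.training), len(splitter.testing))  # 8 2
print(splitter.data)  # still in its original order: random.sample doesn't shuffle in place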

End

The next post in this series is Pre-Processing II in which we'll add counts to the tweets.