Auto-Complete: Pre-Process the Data I
Beginning
This is the second part of a series implementing an n-gram-based auto-complete system for tweets. The starting post is Auto-Complete.
Imports
# python
import os
import random
# pypi
from dotenv import load_dotenv
from expects import (
contain_exactly,
equal,
expect
)
import nltk
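One setup note: nltk.word_tokenize (used below) relies on NLTK's punkt tokenizer models, so if they aren't already installed a one-time download along these lines is needed first.

# one-time download of the tokenizer models that nltk.word_tokenize uses
# (skip this if they're already installed)
nltk.download("punkt")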
Set Up
The Environment
load_dotenv("posts/nlp/.env", override=True)
Middle
Load the Data
We're going to use Twitter data. The source isn't documented anywhere, so I'm assuming it's the NLTK Twitter data, but it might not be.
path = os.environ["TWITTER_AUTOCOMPLETE"]
with open(path) as reader:
    data = reader.read()
print("Data type:", type(data))
print(f"Number of letters: {len(data):,}")
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")
print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")
Data type: <class 'str'>
Number of letters: 3,335,477
First 300 letters of the data
-------
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A
-------
Last 300 letters of the data
-------
ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n
-------
So the data looks like it's just the tweets themselves, with no metadata, separated by newlines.
Split To Sentences
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: split_to_sentences ###
def split_to_sentences(data: str) -> list:
    """
    Split data by linebreak "\n"

    Args:
        data: str

    Returns:
        A list of sentences
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    sentences = data.split("\n")
    ### END CODE HERE ###

    # Additional cleaning (This part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences
Test The Code
x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""
print(x)
expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
actual = split_to_sentences(x)
expect(actual).to(contain_exactly(*expected))
I have a pen.
I have an apple. 
Ah
Apple pen.
Tokenize Sentences
The next step is to tokenize sentences (split a sentence into a list of words).
- Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
- Append each tokenized list of words into a list of tokenized sentences.
Hints:
- Use str.lower to convert strings to lowercase.
- Please use nltk.word_tokenize to split sentences into tokens.
- If you use str.split instead of nltk.word_tokenize, there are additional edge cases to handle, such as the punctuation (comma, period) that follows a word (see the comparison after these hints).
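To see why nltk.word_tokenize is the safer choice, here's a quick, purely illustrative comparison of the two approaches on a single sentence.

sentence = "Sky is blue."

# str.split only breaks on whitespace, so the period stays glued to the last word
print(sentence.lower().split())
# ['sky', 'is', 'blue.']

# nltk.word_tokenize makes the punctuation its own token
print(nltk.word_tokenize(sentence.lower()))
# ['sky', 'is', 'blue', '.']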
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: tokenize_sentences ###
def tokenize_sentences(sentences: list) -> list:
    """
    Tokenize sentences into tokens (words)

    Args:
        sentences: List of strings

    Returns:
        List of lists of tokens
    """
    # Initialize the list of lists of tokenized sentences
    tokenized_sentences = []
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # Go through each sentence
    for sentence in sentences:
        # Convert to lowercase letters
        sentence = sentence.lower()
        # Convert into a list of words
        tokenized = nltk.word_tokenize(sentence)
        # append the list of words to the list of lists
        tokenized_sentences.append(tokenized)
    ### END CODE HERE ###
    return tokenized_sentences
Test the Code
sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
expecteds = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green', '.'],
['roses', 'are', 'red', '.']]
actuals = tokenize_sentences(sentences)
for expected, actual in zip(expecteds, actuals):
expect(actual).to(contain_exactly(*expected))
Combine Split Sentences and Tokenize
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
### GRADED_FUNCTION: get_tokenized_data ###
def get_tokenized_data(data: str) -> list:
    """
    Make a list of tokenized sentences

    Args:
        data: String

    Returns:
        List of lists of tokens
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # Get the sentences by splitting up the data
    sentences = split_to_sentences(data)

    # Get the list of lists of tokens by tokenizing the sentences
    tokenized_sentences = tokenize_sentences(sentences)
    ### END CODE HERE ###
    return tokenized_sentences
Test It
x = "Sky is blue.\nLeaves are green\nRoses are red."
actuals = get_tokenized_data(x)
expecteds = [['sky', 'is', 'blue', '.'],
['leaves', 'are', 'green'],
['roses', 'are', 'red', '.']]
for actual, expected in zip(actuals, expecteds):
expect(actual).to(contain_exactly(*expected))
Split Train and Test Sets
tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)
train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]
actual_data, expected_data = len(tokenized_data), 47961
actual_training, expected_training = len(train_data), 38368
actual_testing, expected_testing = len(test_data), 9593
print((f"{actual_data:,} are split into {actual_training:,} training entries"
f" and {actual_testing:,} test set entries."))
for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected)), (label, actual, expected)
47,961 are split into 38,368 training entries and 9,593 test set entries.
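Since the shuffle is seeded, the split is reproducible. As a purely illustrative sanity check (redo is just a throwaway name for this snippet), re-tokenizing and re-shuffling with the same seed recreates the same split:

# re-building the tokenized data and re-seeding the shuffle gives the same
# ordering, so the train/test split is deterministic across runs
redo = get_tokenized_data(data)
random.seed(87)
random.shuffle(redo)
assert redo[:train_size] == train_data
assert redo[train_size:] == test_data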
print("First training sample:")
actual = train_data[0]
print(actual)
expected = ["i", "personally", "would", "like", "as", "our", "official", "glove",
"of", "the", "team", "local", "company", "and", "quality",
"production"]
expect(actual).to(contain_exactly(*expected))
First training sample:
['i', 'personally', 'would', 'like', 'as', 'our', 'official', 'glove', 'of', 'the', 'team', 'local', 'company', 'and', 'quality', 'production']
print("First test sample")
actual = test_data[0]
print(actual)
expected = ["that", "picture", "i", "just", "seen", "whoa", "dere", "!", "!",
">", ">", ">", ">", ">", ">", ">"]
expect(actual).to(contain_exactly(*expected))
First test sample
['that', 'picture', 'i', 'just', 'seen', 'whoa', 'dere', '!', '!', '>', '>', '>', '>', '>', '>', '>']
Object-Oriented
Here's the same pre-processing bundled into classes for re-use later in the series. The code gets tangled out to the neurotic.nlp.autocomplete module, and the <<...>> markers below are noweb references to the blocks that follow.
<<imports>>
<<the-tokenizer>>
<<sentences>>
<<tokenized>>
<<train-test-split>>
<<shuffled-data>>
<<training-data>>
<<testing-data>>
<<split>>
Imports
# python
import random
# pypi
import attr
import nltk
The Tokenizer
@attr.s(auto_attribs=True)
class Tokenizer:
    """Tokenizes string sentences

    Args:
     source: string data to tokenize
     end_of_sentence: what to split sentences on
    """
    source: str
    end_of_sentence: str="\n"
    _sentences: list=None
    _tokenized: list=None
    _training_data: list=None
- Sentences
@property
def sentences(self) -> list:
    """The data split into sentences"""
    if self._sentences is None:
        self._sentences = self.source.split(self.end_of_sentence)
        self._sentences = (sentence.strip() for sentence in self._sentences)
        self._sentences = [sentence for sentence in self._sentences if sentence]
    return self._sentences
- Tokenized
@property
def tokenized(self) -> list:
    """List of tokenized sentences"""
    if self._tokenized is None:
        self._tokenized = [nltk.word_tokenize(sentence.lower())
                           for sentence in self.sentences]
    return self._tokenized
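If attrs is new to you: attr.s(auto_attribs=True) generates the __init__ from the annotated class attributes, and the properties build their values lazily on first access and cache them afterwards. A small usage sketch (the sample text is made up for illustration):

# attrs builds the constructor from the annotations; only `source` is required
tokenizer = Tokenizer(source="How are you?\nBeen way, way too long.")
print(tokenizer.sentences)
# ['How are you?', 'Been way, way too long.']
print(tokenizer.tokenized)
# [['how', 'are', 'you', '?'], ['been', 'way', ',', 'way', 'too', 'long', '.']]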
Train-Test-Split
@attr.s(auto_attribs=True)
class TrainTestSplit:
    """splits up the training and testing sets

    Args:
     data: list of data to split
     training_fraction: how much to put in the training set
     seed: something to seed the random call
    """
    data: list
    training_fraction: float=0.8
    seed: int=87
    _shuffled: list=None
    _training: list=None
    _testing: list=None
    _split: int=None
- Shuffled Data
@property
def shuffled(self) -> list:
    """The data shuffled"""
    if self._shuffled is None:
        random.seed(self.seed)
        self._shuffled = random.sample(self.data, k=len(self.data))
    return self._shuffled
- Split
@property
def split(self) -> int:
    """The slice value for training and testing"""
    if self._split is None:
        self._split = int(len(self.data) * self.training_fraction)
    return self._split
- Training Data
@property
def training(self) -> list:
    """The Training Portion of the Set"""
    if self._training is None:
        self._training = self.shuffled[0:self.split]
    return self._training
- Testing Data
@property
def testing(self) -> list:
    """The testing data"""
    if self._testing is None:
        self._testing = self.shuffled[self.split:]
    return self._testing
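One small design note: unlike the procedural version above, which shuffles the list in place with random.shuffle, the shuffled property uses random.sample with k set to the full length of the data, which returns a new shuffled list and leaves the original untouched. A tiny sketch (with made-up data) of the difference:

import random

items = [1, 2, 3, 4, 5]
random.seed(87)
shuffled_copy = random.sample(items, k=len(items))
print(items)          # [1, 2, 3, 4, 5] - the original order is preserved
print(shuffled_copy)  # a permutation of items

random.shuffle(items)
print(items)          # now the original list itself has been shuffled in place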
Test It Out
- Sentences
from neurotic.nlp.autocomplete import Tokenizer, TrainTestSplit

x = """
I have a pen.\nI have an apple. \nAh\nApple pen.\n
"""

expected = ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']

tokenizer = Tokenizer(x)
actual = tokenizer.sentences
expect(actual).to(contain_exactly(*expected))
- Tokens
source = "\n".join(["Sky is blue.", "Leaves are green.", "Roses are red."])
expecteds = [['sky', 'is', 'blue', '.'],
             ['leaves', 'are', 'green', '.'],
             ['roses', 'are', 'red', '.']]

tokenizer = Tokenizer(source)
actuals = tokenizer.tokenized

for expected, actual in zip(expecteds, actuals):
    expect(actual).to(contain_exactly(*expected))
Training And Test Sets
random.seed(87)
tokenizer = Tokenizer(data)
splitter = TrainTestSplit(tokenizer.tokenized)
actual_data, expected_data = len(tokenizer.tokenized), 47961
actual_training, expected_training = len(splitter.training), 38368
actual_testing, expected_testing = len(splitter.testing), 9593
print((f"{actual_data:,} are split into {actual_training:,} training entries"
f" and {actual_testing:,} test set entries."))
for label, actual, expected in zip(
        "data training testing".split(),
        (actual_data, actual_training, actual_testing),
        (expected_data, expected_training, expected_testing)):
    expect(actual).to(equal(expected)), (label, actual, expected)
End
The next post in this series is Pre-Processing II in which we'll add counts to the tweets.