Sentiment Analysis: Pre-processing the Data
Beginning
This is the next in a series about building a Deep Learning model for sentiment analysis. The first post was this one.
Imports
# from python
from argparse import Namespace
import random
# from pypi
from expects import contain_exactly, equal, expect
from nltk.corpus import twitter_samples
import nltk
import numpy
# this project
from neurotic.nlp.twitter.processor import TwitterProcessor
Set Up
The NLTK data has to be downloaded at least once.
nltk.download("twitter_samples", download_dir="~/data/datasets/nltk_data/")
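Since the download_dir above isn't one of NLTK's default search locations, NLTK also has to be told where to look before the corpus will load. A minimal sketch, assuming the same path as the download and that the NLTK_DATA environment variable isn't already set:

from pathlib import Path
import nltk

# add the custom download directory to NLTK's search path
nltk.data.path.append(str(Path("~/data/datasets/nltk_data").expanduser()))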
Middle
The NLTK Data
positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
print(f"Positive Tweets: {len(positive):,}")
print(f"Negative Tweets: {len(negative):,}")
Positive Tweets: 5,000
Negative Tweets: 5,000
Split It Up
Instead of randomly splitting the data, we're going to do a straight slice (a sketch of what a shuffled split might look like follows the slicing code).
SPLIT = 4000
Split positive set into validation and training
positive_validation = positive[SPLIT:]
positive_training = positive[:SPLIT]
Split negative set into validation and training
negative_validation = negative[SPLIT:]
negative_training = negative[:SPLIT]
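For comparison, a shuffled split would only take a couple of extra lines. A sketch (not run here, and the seed is arbitrary):

# shuffle a copy before slicing so the validation set
# isn't just the tail of the file
random.seed(2021)  # arbitrary seed, for reproducibility only
shuffled = random.sample(positive, k=len(positive))
shuffled_training, shuffled_validation = shuffled[:SPLIT], shuffled[SPLIT:]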
Combine the Data Sets
The X data.
train_x = positive_training + negative_training
validation_x = positive_validation + negative_validation
The labels (1 for positive, 0 for negative).
train_y = numpy.append(numpy.ones(len(positive_training)),
numpy.zeros(len(negative_training)))
validation_y = numpy.append(numpy.ones(len(positive_validation)),
numpy.zeros(len(negative_validation)))
print(f"length of train_x {len(train_x):,}")
print(f"length of validation_x {len(validation_x):,}")
length of train_x 8,000
length of validation_x 2,000
Building the vocabulary
Now build the vocabulary.
- Map each word in each tweet to an integer (an "index").
- The following code does this for you, but please read it and understand what it's doing.
- Note that you will build the vocabulary based on the training data.
- To do so, you will assign an index to every word by iterating over your training set.
The vocabulary will also include some special tokens:

- __PAD__: padding
- __</e>__: end of line
- __UNK__: a token representing any word that is not in the vocabulary
Tokens = Namespace(padding="__PAD__", ending="__</e>__", unknown="__UNK__")
process = TwitterProcessor()
vocabulary = {Tokens.padding: 0, Tokens.ending: 1, Tokens.unknown: 2}
for tweet in train_x:
for token in process(tweet):
if token not in vocabulary:
vocabulary[token] = len(vocabulary)
print(f"Words in the vocabulary: {len(vocabulary):,}")
count = 0
for token in vocabulary:
print(f"{count}: {token}: {vocabulary[token]}")
count += 1
if count == 5:
break
Words in the vocabulary: 9,164
0: __PAD__: 0
1: __</e>__: 1
2: __UNK__: 2
3: followfriday: 3
4: top: 4
Converting a tweet to a tensor
Now we'll write a function that will convert each tweet to a tensor (a list of unique integer IDs representing the processed tweet).
- Note, the returned data type will be a regular Python `list()`
- You won't use TensorFlow in this function
- You also won't use a numpy array
- You also won't use a trax.fastmath.numpy array
For words in the tweet that are not in the vocabulary, set them to the unique ID for the token `__UNK__`.
For example, given this string:
'@happypuppy, is Maria happy?'
You first process it into tokens (the handle and stop words are removed and the remaining words stemmed).
['maria', 'happi']
Then convert each word into the index for it.
[2, 56]
Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the __UNK__
token, because it is considered "unknown."
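Since the vocabulary is a plain dictionary, the fallback to __UNK__ is just dict.get with a default, something like this quick check (assuming "happi" made it into the vocabulary with ID 56, as in the example above):

unknown = vocabulary[Tokens.unknown]
print(vocabulary.get("happi", unknown))  # in the vocabulary, so its own ID
print(vocabulary.get("maria", unknown))  # not in the vocabulary, so 2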
def tweet_to_tensor(tweet: str, vocab_dict: dict,
unk_token: str='__UNK__', verbose: bool=False):
"""Convert a tweet to a list of indices
Args:
tweet - A string containing a tweet
vocab_dict - The words dictionary
unk_token - The special string for unknown tokens
verbose - Print info during runtime
Returns:
tensor_l - A python list with indices for the tweet tokens
"""
# Process the tweet into a list of words
# where only important words are kept (stop words removed)
    word_l = process(tweet)
if verbose:
print("List of words from the processed tweet:")
print(word_l)
# Initialize the list that will contain the unique integer IDs of each word
tensor_l = []
# Get the unique integer ID of the __UNK__ token
unk_ID = vocab_dict[unk_token]
if verbose:
print(f"The unique integer ID for the unk_token is {unk_ID}")
# for each word in the list:
for word in word_l:
# Get the unique integer ID.
# If the word doesn't exist in the vocab dictionary,
# use the unique ID for __UNK__ instead.
word_ID = vocab_dict.get(word, unk_ID)
# Append the unique integer ID to the tensor list.
tensor_l.append(word_ID)
return tensor_l
print("Actual tweet is\n", positive_validation[0])
print("\nTensor of tweet:\n", tweet_to_tensor(positive_validation[0], vocab_dict=vocabulary))
Actual tweet is
 Bro:U wan cut hair anot,ur hair long Liao bo Me:since ord liao,take it easy lor treat as save $ leave it longer :) Bro:LOL Sibei xialan

Tensor of tweet:
 [1072, 96, 484, 2376, 750, 8220, 1132, 750, 53, 2, 2701, 796, 2, 2, 354, 606, 2, 3523, 1025, 602, 4599, 9, 1072, 158, 2, 2]
def test_tweet_to_tensor():
test_cases = [
{
"name":"simple_test_check",
"input": [positive_validation[1], vocabulary],
"expected":[444, 2, 304, 567, 56, 9],
"error":"The function gives bad output for val_pos[1]. Test failed"
},
{
"name":"datatype_check",
"input":[positive_validation[1], vocabulary],
"expected":type([]),
"error":"Datatype mismatch. Need only list not np.array"
},
{
"name":"without_unk_check",
"input":[positive_validation[1], vocabulary],
"expected":6,
"error":"Unk word check not done- Please check if you included mapping for unknown word"
}
]
count = 0
for test_case in test_cases:
try:
if test_case['name'] == "simple_test_check":
assert test_case["expected"] == tweet_to_tensor(*test_case['input'])
count += 1
if test_case['name'] == "datatype_check":
assert isinstance(tweet_to_tensor(*test_case['input']), test_case["expected"])
count += 1
if test_case['name'] == "without_unk_check":
assert None not in tweet_to_tensor(*test_case['input'])
count += 1
        except Exception:
print(test_case['error'])
if count == 3:
print("\033[92m All tests passed")
else:
print(count," Tests passed out of 3")
test_tweet_to_tensor()
The function gives bad output for val_pos[1]. Test failed
2  Tests passed out of 3
Their tweet processor wipes out everything after the start of a URL, even when it isn't part of the URL, so their version produces fewer tokens and the indices won't match exactly.
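I don't have their exact pattern, but the difference looks something like this hypothetical comparison between a greedy URL regex and one that stops at whitespace:

import re

tweet = "nice goal https://t.co/abc123 what a game :)"
# a greedy pattern wipes out everything after the URL starts
print(re.sub(r"https?://.*", "", tweet))   # 'nice goal '
# stopping at whitespace keeps the rest of the tweet
print(re.sub(r"https?://\S+", "", tweet))  # 'nice goal  what a game :)'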
Creating a batch generator
Most of the time in Natural Language Processing, and AI in general, we use batches of examples when training our models.
- If you were to train the model with one example at a time instead of batches, it would take a very long time to train.
- You will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels), and the weight for each target (e.g. this allows us to treat some examples as more important to get right than others, but commonly these will all be 1.0).
Once you create the generator, you could include it in a for loop:
for batch_inputs, batch_targets, batch_example_weights in data_generator:
You can also get a single batch like this:
batch_inputs, batch_targets, batch_example_weights = next(data_generator)
The generator returns the next batch each time it's called.
- This generator returns the data in a format (tensors) that you could directly use in your model.
- It returns a triple: the inputs, targets, and loss weights:
  - Inputs is a tensor that contains the batch of tweets we put into the model.
  - Targets is the corresponding batch of labels that we train to generate.
  - Loss weights here are just 1s with the same shape as the targets. Next week, you will use them to mask input padding.
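Before getting to the full function, here's the padding step in isolation, a minimal sketch with made-up token IDs (the real batches below use IDs from the vocabulary):

# pad each tensor with zeros (the __PAD__ ID) out to the longest in the batch
batch = [[3, 4, 5], [10, 11], [7]]
max_len = max(len(tensor) for tensor in batch)
padded = [tensor + [0] * (max_len - len(tensor)) for tensor in batch]
print(padded)  # [[3, 4, 5], [10, 11, 0], [7, 0, 0]]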
data_generator
A batch of spaghetti.
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED: Data generator
def data_generator(data_pos: list, data_neg: list, batch_size: int,
loop: bool, vocab_dict: dict, shuffle: bool=False):
"""Generates batches of data
Args:
data_pos - Set of positive examples
data_neg - Set of negative examples
batch_size - number of samples per batch. Must be even
loop - True or False
vocab_dict - The words dictionary
shuffle - Shuffle the data order
Yield:
inputs - Subset of positive and negative examples
targets - The corresponding labels for the subset
example_weights - An array specifying the importance of each example
"""
# make sure the batch size is an even number
# to allow an equal number of positive and negative samples
assert batch_size % 2 == 0
# Number of positive examples in each batch is half of the batch size
# same with number of negative examples in each batch
n_to_take = batch_size // 2
# Use pos_index to walk through the data_pos array
# same with neg_index and data_neg
pos_index = 0
neg_index = 0
len_data_pos = len(data_pos)
len_data_neg = len(data_neg)
    # Get an array with the data indexes
pos_index_lines = list(range(len_data_pos))
neg_index_lines = list(range(len_data_neg))
# shuffle lines if shuffle is set to True
if shuffle:
rnd.shuffle(pos_index_lines)
rnd.shuffle(neg_index_lines)
stop = False
# Loop indefinitely
while not stop:
# create a batch with positive and negative examples
batch = []
# First part: Pack n_to_take positive examples
# Start from pos_index and increment i up to n_to_take
for i in range(n_to_take):
# If the positive index goes past the positive dataset length,
if pos_index >= len_data_pos:
# If loop is set to False, break once we reach the end of the dataset
if not loop:
                    stop = True
                    break
# If user wants to keep re-using the data, reset the index
pos_index = 0
if shuffle:
# Shuffle the index of the positive sample
rnd.shuffle(pos_index_lines)
            # get the tweet at pos_index
tweet = data_pos[pos_index_lines[pos_index]]
# convert the tweet into tensors of integers representing the processed words
tensor = tweet_to_tensor(tweet, vocab_dict)
# append the tensor to the batch list
batch.append(tensor)
# Increment pos_index by one
pos_index = pos_index + 1
# Second part: Pack n_to_take negative examples
# Using the same batch list, start from neg_index and increment i up to n_to_take
for i in range(neg_index, n_to_take):
# If the negative index goes past the negative dataset length,
            if neg_index >= len_data_neg:
# If loop is set to False, break once we reach the end of the dataset
if not loop:
                    stop = True
                    break
# If user wants to keep re-using the data, reset the index
neg_index = 0
if shuffle:
# Shuffle the index of the negative sample
rnd.shuffle(neg_index_lines)
# get the tweet at neg_index
tweet = data_neg[neg_index_lines[neg_index]]
# convert the tweet into tensors of integers representing the processed words
tensor = tweet_to_tensor(tweet, vocab_dict)
# append the tensor to the batch list
batch.append(tensor)
# Increment neg_index by one
neg_index += 1
if stop:
            break
        # Note: pos_index and neg_index were already advanced inside the
        # loops above, so these increments skip past the next n_to_take
        # positive and negative examples
        pos_index += n_to_take
        neg_index += n_to_take
# Get the max tweet length (the length of the longest tweet)
# (you will pad all shorter tweets to have this length)
max_len = max([len(t) for t in batch])
# Initialize the input_l, which will
# store the padded versions of the tensors
tensor_pad_l = []
# Pad shorter tweets with zeros
for tensor in batch:
# Get the number of positions to pad for this tensor so that it will be max_len long
n_pad = max_len - len(tensor)
# Generate a list of zeros, with length n_pad
pad_l = [0] * n_pad
# concatenate the tensor and the list of padded zeros
tensor_pad = tensor + pad_l
# append the padded tensor to the list of padded tensors
tensor_pad_l.append(tensor_pad)
# convert the list of padded tensors to a numpy array
# and store this as the model inputs
inputs = numpy.array(tensor_pad_l)
# Generate the list of targets for the positive examples (a list of ones)
# The length is the number of positive examples in the batch
target_pos = [1] * len(batch[:n_to_take])
# Generate the list of targets for the negative examples (a list of zeros)
# The length is the number of negative examples in the batch
target_neg = [0] * len(batch[n_to_take:])
# Concatenate the positve and negative targets
target_l = target_pos + target_neg
# Convert the target list into a numpy array
targets = numpy.array(target_l)
        # Example weights: treat all examples as equally important
        # (an array of ones with the same shape as the targets)
example_weights = numpy.ones_like(targets)
yield inputs, targets, example_weights
Now you can use the data_generator function to create one generator for the training data and another for the validation data.
We will create a third data generator that does not loop, for testing the final accuracy of the model.
# Set the random number generator for the shuffle procedure
rnd = random
rnd.seed(30)
# Create the training data generator
def train_generator(batch_size, shuffle = False):
return data_generator(positive_training, negative_training,
batch_size, True, vocabulary, shuffle)
# Create the validation data generator
def val_generator(batch_size, shuffle = False):
return data_generator(positive_validation, negative_validation,
batch_size, True, vocabulary, shuffle)
# Create the test data generator
def test_generator(batch_size, shuffle = False):
return data_generator(positive_validation, negative_validation, batch_size,
False, vocabulary, shuffle)
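As a quick illustration of the loop flag, here's a sketch (the tiny two-tweet slices are made up for illustration) showing that a non-looping generator is exhausted once the data runs out:

tiny = data_generator(positive_validation[:2], negative_validation[:2],
                      4, False, vocabulary)
batch = next(tiny)  # works: one batch of 2 positive + 2 negative tweets
# next(tiny)        # a second call raises StopIteration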
# Get a batch from the train_generator and inspect.
inputs, targets, example_weights = next(train_generator(4, shuffle=True))
# this will print a list of 4 tensors padded with zeros
print(f'Inputs: {inputs}')
print(f'Targets: {targets}')
print(f'Example Weights: {example_weights}')
Inputs: [[2030 4492 3231    9    0    0    0    0    0    0    0]
 [5009  571 2025 1475 5233 3532  142 3532  132  464    9]
 [3798  111   96  587 2960 4007    0    0    0    0    0]
 [ 256 3798    0    0    0    0    0    0    0    0    0]]
Targets: [1 1 0 0]
Example Weights: [1 1 1 1]
Test the train_generator
Create a data generator for training data which produces batches of size 4 (for tensors and their respective targets).
tmp_data_gen = train_generator(batch_size = 4)
Call the data generator to get one batch and its targets.
tmp_inputs, tmp_targets, tmp_example_weights = next(tmp_data_gen)
print(f"The inputs shape is {tmp_inputs.shape}")
print(f"The targets shape is {tmp_targets.shape}")
print(f"The example weights shape is {tmp_example_weights.shape}")
for i,t in enumerate(tmp_inputs):
print(f"input tensor: {t}; target {tmp_targets[i]}; example weights {tmp_example_weights[i]}")
The inputs shape is (4, 14)
The targets shape is (4,)
The example weights shape is (4,)
input tensor: [3 4 5 6 7 8 9 0 0 0 0 0 0 0]; target 1; example weights 1
input tensor: [10 11 12 13 14 15 16 17 18 19 20  9 21 22]; target 1; example weights 1
input tensor: [5807 2931 3798    0    0    0    0    0    0    0    0    0    0    0]; target 0; example weights 1
input tensor: [ 865  261 3689 5808  313 4499  571 1248 2795  333 1220 3798    0    0]; target 0; example weights 1
Bundle It Up
Imports
# python
from argparse import Namespace
from itertools import cycle
import random
# pypi
from nltk.corpus import twitter_samples
import attr
import numpy
# this project
from .processor import TwitterProcessor
Defaults
Defaults = Namespace(
split = 4000,
)
NLTK Settings
NLTK = Namespace(
corpus="twitter_samples",
negative = "negative_tweets.json",
positive="positive_tweets.json",
)
Special Tokens
SpecialTokens = Namespace(padding="__PAD__",
ending="__</e>__",
unknown="__UNK__")
SpecialIDs = Namespace(
padding=0,
ending=1,
unknown=2,
)
The Builder
@attr.s(auto_attribs=True)
class TensorBuilder:
"""converts tweets to tensors
Args:
- split: where to split the training and validation data
"""
    split: int=Defaults.split
_positive: list=None
_negative: list=None
_positive_training: list=None
_negative_training: list=None
_positive_validation: list=None
_negative_validation: list=None
_process: TwitterProcessor=None
_vocabulary: dict=None
_x_train: list=None
- Positive Tweets
@property
def positive(self) -> list:
    """The raw positive NLTK tweets"""
    if self._positive is None:
        self._positive = twitter_samples.strings(NLTK.positive)
    return self._positive
- Negative Tweets
@property
def negative(self) -> list:
    """The raw negative NLTK tweets"""
    if self._negative is None:
        self._negative = twitter_samples.strings(NLTK.negative)
    return self._negative
- Positive Training
@property
def positive_training(self) -> list:
    """The positive training data"""
    if self._positive_training is None:
        self._positive_training = self.positive[:self.split]
    return self._positive_training
- Negative Training
@property
def negative_training(self) -> list:
    """The negative training data"""
    if self._negative_training is None:
        self._negative_training = self.negative[:self.split]
    return self._negative_training
- Positive Validation
@property
def positive_validation(self) -> list:
    """The positive validation data"""
    if self._positive_validation is None:
        self._positive_validation = self.positive[self.split:]
    return self._positive_validation
- Negative Validation
@property
def negative_validation(self) -> list:
    """The negative validation data"""
    if self._negative_validation is None:
        self._negative_validation = self.negative[self.split:]
    return self._negative_validation
- Twitter Processor
@property
def process(self) -> TwitterProcessor:
    """processor for tweets"""
    if self._process is None:
        self._process = TwitterProcessor()
    return self._process
- X Train
@property
def x_train(self) -> list:
    """The unprocessed training data"""
    if self._x_train is None:
        self._x_train = self.positive_training + self.negative_training
    return self._x_train
- The Vocabulary
@property
def vocabulary(self) -> dict:
    """A map of token to numeric id"""
    if self._vocabulary is None:
        self._vocabulary = {SpecialTokens.padding: SpecialIDs.padding,
                            SpecialTokens.ending: SpecialIDs.ending,
                            SpecialTokens.unknown: SpecialIDs.unknown}
        for tweet in self.x_train:
            for token in self.process(tweet):
                if token not in self._vocabulary:
                    self._vocabulary[token] = len(self._vocabulary)
    return self._vocabulary
- To Tensor
def to_tensor(self, tweet: str) -> list:
    """Converts tweet to list of numeric identifiers

    Args:
     tweet: the string to convert

    Returns:
     list of IDs for the tweet
    """
    tensor = [self.vocabulary.get(token, SpecialIDs.unknown)
              for token in self.process(tweet)]
    return tensor
The Generator
@attr.s(auto_attribs=True)
class TensorGenerator:
"""Generates batches of vectorized-tweets
Args:
converter: TensorBuilder object
positive_data: list of positive data
negative_data: list of negative data
batch_size: the size for each generated batch
shuffle: whether to shuffle the generated data
infinite: whether to generate data forever
"""
converter: TensorBuilder
positive_data: list
negative_data: list
batch_size: int
shuffle: bool=True
infinite: bool = True
_positive_indices: list=None
_negative_indices: list=None
_positives: iter=None
_negatives: iter=None
- Positive Indices
@property
def positive_indices(self) -> list:
    """The indices to use to grab the positive tweets"""
    if self._positive_indices is None:
        k = len(self.positive_data)
        if self.shuffle:
            self._positive_indices = random.sample(range(k), k=k)
        else:
            self._positive_indices = list(range(k))
    return self._positive_indices
- Negative Indices
@property
def negative_indices(self) -> list:
    """Indices for the negative tweets"""
    if self._negative_indices is None:
        k = len(self.negative_data)
        if self.shuffle:
            self._negative_indices = random.sample(range(k), k=k)
        else:
            self._negative_indices = list(range(k))
    return self._negative_indices
- Positives
@property
def positives(self):
    """The positive index generator"""
    if self._positives is None:
        self._positives = self.positive_generator()
    return self._positives
- Negatives
@property
def negatives(self):
    """The negative index generator"""
    if self._negatives is None:
        self._negatives = self.negative_generator()
    return self._negatives
- Positive Generator
def positive_generator(self):
    """Generator of indices for positive tweets"""
    stop = len(self.positive_indices)
    index = 0
    while True:
        yield self.positive_indices[index]
        index += 1
        if index == stop:
            if not self.infinite:
                break
            if self.shuffle:
                self._positive_indices = None
            index = 0
    return
- Negative Generator
def negative_generator(self):
    """generator of indices for negative tweets"""
    stop = len(self.negative_indices)
    index = 0
    while True:
        yield self.negative_indices[index]
        index += 1
        if index == stop:
            if not self.infinite:
                break
            if self.shuffle:
                self._negative_indices = None
            index = 0
    return
- The Iterator
def __iter__(self):
    return self
- The Next Method
def __next__(self):
    assert self.batch_size % 2 == 0
    half_batch = self.batch_size // 2

    # get the indices
    positives = (next(self.positives) for index in range(half_batch))
    negatives = (next(self.negatives) for index in range(half_batch))

    # get the tweets
    positives = (self.positive_data[index] for index in positives)
    negatives = (self.negative_data[index] for index in negatives)

    # get the token ids
    try:
        positives = [self.converter.to_tensor(tweet) for tweet in positives]
        negatives = [self.converter.to_tensor(tweet) for tweet in negatives]
    except RuntimeError:
        # the next(self.positives) in the first generator will raise a
        # RuntimeError if we're not running this infinitely
        raise StopIteration

    batch = positives + negatives
    longest = max((len(tweet) for tweet in batch))
    paddings = (longest - len(tensor) for tensor in batch)
    paddings = ([0] * padding for padding in paddings)
    padded = [tensor + padding for tensor, padding in zip(batch, paddings)]
    inputs = numpy.array(padded)

    # the labels for the inputs
    targets = numpy.array([1] * half_batch + [0] * half_batch)
    assert len(targets) == len(batch)

    # default the weights to ones
    weights = numpy.ones_like(targets)
    return inputs, targets, weights
Test It Out
from neurotic.nlp.twitter.tensor_generator import TensorBuilder, TensorGenerator
converter = TensorBuilder()
expect(len(converter.vocabulary)).to(equal(len(vocabulary)))
tweet = positive_validation[0]
expected = [1072, 96, 484, 2376, 750, 8220, 1132, 750, 53, 2, 2701, 796, 2, 2,
354, 606, 2, 3523, 1025, 602, 4599, 9, 1072, 158, 2, 2]
actual = converter.to_tensor(tweet)
expect(actual).to(contain_exactly(*expected))
generator = TensorGenerator(converter, batch_size=4)
print(next(generator))
(array([[ 749, 1019,  313, 1020,   75],
       [1009,    9,    0,    0,    0],
       [3540, 6030, 6031, 3798,    0],
       [  50,   96, 3798,    0,    0]]), array([1, 1, 0, 0]), array([1, 1, 1, 1]))
for count, batch in enumerate(generator):
print(batch[0])
print()
if count == 5:
break
print(next(generator))
[[  22 1228  434  354  227 2371    9]
 [ 267  160   89    0    0    0    0]
 [ 315 1008 8480 3798 2108  371 3233]
 [8232 8233  791 3798    0    0    0]]

[[1173 1061  586    9  896  729 1264  345 1062 1063]
 [3387  558  991 2166 3388 3231  558  238  120    0]
 [ 198 5997 3798    0    0    0    0    0    0    0]
 [ 223  310 3798    0    0    0    0    0    0    0]]

[[4015 4015 4015 4016  231 2117   57  422    9 4017 4018 4019   86   86]
 [2554   57  102  358   75    0    0    0    0    0    0    0    0    0]
 [  50   38  881 3798    0    0    0    0    0    0    0    0    0    0]
 [6729 6730 6731  382 3798    0    0    0    0    0    0    0    0    0]]

[[3479   75    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0]
 [4636 4637  233 4299  111  237 2626    9    0    0    0    0    0    0    0    0    0]
 [  73  381  463 4321  142   96 7390 7391   92   85 1394 7392 5895 7393   45 3798 7394]
 [8863 2844  991  127 5818    0    0    0    0    0    0    0    0    0    0    0    0]]

[[ 226  615   22   75    0    0]
 [2135  703  237  435 3124    9]
 [2379 6264 3798    0    0    0]
 [6504 1912 2380 3798    0    0]]

[[5623  120    0    0    0    0    0    0    0    0]
 [ 133   54  102   63 1300   56    9   50   92 3181]
 [2094  383   73  464 3798    0    0    0    0    0]
 [ 223  101 8754  383 2085 5818 8755    0    0    0]]

(array([[ 374,   44, 2981,  435,  132,  111, 1040, 1382,    9,    0,    0,    0],
       [ 369,  398,  283,    9, 2671, 1411,  136,  184,  769, 1262, 2061, 3460],
       [1094, 9024,  315,  381, 3798,    0,    0,    0,    0,    0,    0,    0],
       [9036, 3798,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]]), array([1, 1, 0, 0]), array([1, 1, 1, 1]))
Ladies and gentlemen, we have ourselves a generator.
End
Now that we have our data, the next step will be to define the model.