Implementing a Naive Bayes Twitter Sentiment Classifier
Beginning
In the previous post I went through some of the background of how Naive Bayes works. In this post I'll implement a Naive Bayes Classifier to classify tweets by whether their sentiment is positive or negative. The Naive Bayes model uses Bayes' rule to make its predictions, and it's called "naive" because it assumes that the words in a document are independent of each other given the document's class (in the probability sense), which allows us to use the multiplication rule to calculate our probabilities. It also uses the \(\textit{Bag of Words}\) assumption that word ordering isn't important.
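As a rough illustration of those two assumptions, here is a minimal sketch with made-up per-word probabilities (the numbers and the `toy_probabilities` name are hypothetical, not taken from the tweet data). It checks that two orderings of the same words give the same bag, and that the product of the word probabilities matches the sum of their logs, which is the form the model uses later.

```python
from collections import Counter
from math import isclose, log, prod

# hypothetical P(word | positive) values, made up for illustration
toy_probabilities = {"happy": 0.05, "great": 0.04, "day": 0.01}

first = "great happy day".split()
second = "day great happy".split()

# Bag of Words: only the word counts matter, not the order
same_bag = Counter(first) == Counter(second)

# independence: the probability of the document is the product of
# the probabilities of its words
product = prod(toy_probabilities[word] for word in first)

# in log space the product becomes a sum
log_sum = sum(log(toy_probabilities[word]) for word in second)

print(same_bag)
print(isclose(log(product), log_sum))
```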
Set Up
This first bit imports the needed dependencies followed by setting up the data and some helpers.
Imports
# python
from collections import Counter, defaultdict
from functools import partial
from pathlib import Path
import os
import pickle
# pypi
from dotenv import load_dotenv
from tabulate import tabulate
import numpy
import pandas
# my stuff
from neurotic.nlp.twitter.counter import WordCounter
Tabulate
This sets up tabulate to make it a little simpler to display pandas DataFrames in org.
TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)
The Dotenv
I put the path to the data files in a .env file so this loads it into the environment.
env_path = Path("posts/nlp/.env")
assert env_path.is_file()
load_dotenv(env_path)
Load the Twitter Data
I split the data previously for the Logistic Regression twitter sentiment classifier so I'll load it here and skip building the sets.
train_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser())
test_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TEST_RAW"]).expanduser()
)
print(f"Training: {len(train_raw):,}")
print(f"Testing: {len(test_raw):,}")
Training: 8,000
Testing: 2,000
I'll also re-use the WordCounter from the Logistic Regression. Despite the name it also does tokenizing and cleaning.
counter = WordCounter(train_raw.tweet, train_raw.label)
Constants
This was an object I created to store a few constant values.
with open(os.environ["TWITTER_SENTIMENT"], "rb") as reader:
    Sentiment = pickle.load(reader)
print(Sentiment)
Namespace(decode={1: 'positive', 0: 'negative'}, encode={'positive': 1, 'negative': 0}, negative=0, positive=1)
Middle
Implementing the Model
In an earlier post I wrote up a little of the background behind what we're doing and now I'm going to translate the math in that post into code.
Implementing The Training Function
The first part of the problem - training the model by building up the probabilities.
def train_naive_bayes(counts: Counter,
                      train_x: pandas.Series,
                      train_y: pandas.Series) -> tuple:
    """
    Args:
     counts: Counter from (word, label) to how often the word appears
     train_x: a list of tweets
     train_y: a list of labels corresponding to the tweets (0,1)

    Returns:
     logprior: the log odds ratio of the prior class probabilities
     loglikelihood: dictionary mapping each word to its log likelihood ratio
    """
    loglikelihood = defaultdict(lambda: 0)
    logprior = 0

    vocabulary = set(pair[0] for pair in counts)
    V = len(vocabulary)

    # number of positive and negative words in the training set
    N_pos = sum((counts[(token, sentiment)] for token, sentiment in counts
                 if sentiment == Sentiment.positive))
    N_neg = sum((counts[(token, sentiment)] for token, sentiment in counts
                 if sentiment == Sentiment.negative))

    D = len(train_x)

    # D_pos is the number of positive documents
    D_pos = train_y.sum()

    # D_neg is the number of negative documents
    D_neg = D - D_pos

    # the log odds ratio
    logprior = numpy.log(D_pos) - numpy.log(D_neg)

    for word in vocabulary:
        freq_pos = counts[(word, Sentiment.positive)]
        freq_neg = counts[(word, Sentiment.negative)]

        # the smoothed probability of the word given each class
        p_w_pos = (freq_pos + 1)/(N_pos + V)
        p_w_neg = (freq_neg + 1)/(N_neg + V)

        loglikelihood[word] = numpy.log(p_w_pos) - numpy.log(p_w_neg)
    return logprior, loglikelihood
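To make the smoothing step easier to follow, here is a minimal sketch of the same calculation on a tiny made-up Counter, using only the standard library. The `toy_counts` values and the `toy_loglikelihood` name are hypothetical. The point is that adding one to every count keeps a word that never appears with one label ("happi" with the negative label, for instance) from producing a log of zero.

```python
from collections import Counter
from math import log

# hypothetical (word, label) counts; 1 is positive, 0 is negative
toy_counts = Counter({("happi", 1): 3, ("sad", 0): 2,
                      ("day", 1): 1, ("day", 0): 1})

vocabulary = {word for word, _ in toy_counts}
V = len(vocabulary)

# total word occurrences for each label
N_pos = sum(count for (_, label), count in toy_counts.items() if label == 1)
N_neg = sum(count for (_, label), count in toy_counts.items() if label == 0)

toy_loglikelihood = {}
for word in sorted(vocabulary):
    # add-one smoothing: a missing (word, label) pair counts as 0 + 1
    p_w_pos = (toy_counts[(word, 1)] + 1) / (N_pos + V)
    p_w_neg = (toy_counts[(word, 0)] + 1) / (N_neg + V)
    toy_loglikelihood[word] = log(p_w_pos) - log(p_w_neg)

for word, score in toy_loglikelihood.items():
    print(f"{word}: {score:0.2f}")
```

Words seen mostly with the positive label end up with a score above zero, words seen mostly with the negative label end up below zero, and a word split evenly lands near zero.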
Now we can see what we get when we train our model.
logprior, loglikelihood = train_naive_bayes(counter.counts, train_raw.tweet, train_raw.label)
print(f"Log Prior: {logprior}")
print(f"Words in Log Likelihood: {len(loglikelihood):,}")
Log Prior: -0.006500022885560952
Words in Log Likelihood: 9,172
print(f"Positive Tweets: {len(train_raw[train_raw.label==Sentiment.positive]):,}")
print(f"Negative Tweets: {len(train_raw[train_raw.label==Sentiment.negative]):,}")
Positive Tweets: 3,987
Negative Tweets: 4,013
We get a negative value for the logprior because we have more negative tweets than positive tweets in the training set and the negative count is the second term when we calculate the difference for the logprior. If we evened it out it would drop to 0.
all_raw = pandas.concat([train_raw, test_raw])
check = pandas.concat([
    all_raw[all_raw.label==1].iloc[:4000], all_raw[all_raw.label==0].iloc[:4000]])
logprior, loglikelihood = train_naive_bayes(counter.counts, check.tweet, check.label)
print(f"Log Prior: {logprior}")
print(f"Log Likelihood: {len(loglikelihood)}")
Log Prior: 0.0
Log Likelihood: 9172
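The arithmetic behind both log priors can be checked directly from the tweet counts printed earlier. This is only a sketch using the standard library's `math.log` in place of `numpy.log`; the `toy_logprior` and `balanced_logprior` names are made up for the example.

```python
from math import log

# document counts printed for the training set
D_pos, D_neg = 3987, 4013

# log(D_pos / D_neg) written as a difference of logs
toy_logprior = log(D_pos) - log(D_neg)
print(f"{toy_logprior:0.4f}")

# with equal class counts the ratio is 1, so the log prior is 0
balanced_logprior = log(4000) - log(4000)
print(balanced_logprior)
```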
Making Predictions
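Now that we have the model we can use it to make some predictions. The score for a tweet is the log prior plus the sum of the log likelihood ratios of its \(N\) words.
\[ p = \textit{logprior} + \sum_{i=1}^{N} \textit{loglikelihood}_i \]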
def naive_bayes_predict(tweet: str, logprior: float, loglikelihood: dict) -> float:
    """
    Args:
     tweet: a tweet to classify
     logprior: the log odds ratio of prior probabilities
     loglikelihood: a dictionary of words mapped to their log likelihood ratios

    Returns:
     p: sum of the log-odds ratio for the tweet
    """
    # process the tweet to get a list of words
    words = counter.process(tweet)
    return logprior + sum(loglikelihood[word] for word in words)
Now test it with a tweet.
my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(f'The positive to negative ratio is {p:0.2f}.')
The positive to negative ratio is 1.44.
The value is the log of the positive to negative ratio, so anything greater than 0 favors positive. Since it's greater than 0 here, we're predicting that the tweet has a positive sentiment.
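If a probability is easier to read than a log-odds score, the score can be pushed through the logistic function. This is a minimal sketch: `log_odds_to_probability` is a hypothetical helper that isn't part of the post's code, and the result is only the probability implied by the model's own assumptions, which may not be well calibrated.

```python
from math import exp


def log_odds_to_probability(log_odds: float) -> float:
    """Convert a log-odds score to the implied probability of the positive class"""
    return 1 / (1 + exp(-log_odds))


# the score for 'She smiled.' was 1.44
print(f"{log_odds_to_probability(1.44):0.2f}")

# a score of 0 means the two classes are equally likely
print(log_odds_to_probability(0.0))
```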
Test The Model
Now we'll calculate the accuracy of the model against the test set.
def test_naive_bayes(test_x: pandas.Series, test_y: pandas.Series,
                     logprior: float, loglikelihood: dict) -> float:
    """
    Args:
     test_x: tweets to classify
     test_y: labels for test_x
     logprior: the logprior for the training set
     loglikelihood: a dictionary with the loglikelihoods for each word

    Returns:
     accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0
    y_hats = numpy.array([int(naive_bayes_predict(tweet, logprior, loglikelihood) > 0)
                          for tweet in test_x])

    # error is the average of the absolute values of the differences between y_hats and test_y
    # error = number wrong/number of tweets
    error = numpy.abs(y_hats - test_y).mean()

    # Accuracy is 1 minus the error
    accuracy = 1 - error
    return accuracy

print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_raw.tweet, test_raw.label, logprior, loglikelihood)))
Naive Bayes accuracy = 0.9955
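Because both the predictions and the labels are 0 or 1, the mean absolute difference is the fraction of wrong answers, so one minus that error is the same as the fraction of exact matches. Here is a small sketch with made-up values (plain lists rather than numpy arrays, and hypothetical names) showing the two calculations agree.

```python
# hypothetical predictions and labels, made up for illustration
toy_y_hats = [1, 0, 1, 1, 0]
toy_labels = [1, 0, 0, 1, 1]
pairs = list(zip(toy_y_hats, toy_labels))

# error as the mean absolute difference, as in test_naive_bayes
toy_error = sum(abs(y_hat - label) for y_hat, label in pairs) / len(pairs)

# accuracy as the fraction of exact matches
toy_matches = sum(y_hat == label for y_hat, label in pairs) / len(pairs)

print(f"{1 - toy_error:0.2f} {toy_matches:0.2f}")
```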
That looks good, maybe too good: the model might actually be overfitting. Here are some example tweets to check.
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
              'great', 'great great', 'great great great', 'great great great great']:
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')
I am happy -> 1.89
I am bad -> -1.63
this movie should have been great. -> 2.05
great -> 2.06
great great -> 4.13
great great great -> 6.19
great great great great -> 8.25
It looks like the word "great" throws off the third sentence, which hints at a negative sentiment. What if we pass in a neutral (nonsensical) tweet?
my_tweet = "the answer is nicht in the umwelt"
print(naive_bayes_predict(my_tweet, logprior, loglikelihood))
-0.41441957689474407
I don't know which of those words triggered the negative value…
for word in "the answer is nicht in the umwelt".split():
    print(f"{word}:\t{naive_bayes_predict(word, logprior, loglikelihood):0.2f}")
the: 0.00
answer: -0.41
is: 0.00
nicht: 0.00
in: 0.00
the: 0.00
umwelt: 0.00
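It only scored one word, answer, and that's negative for some reason. Go figure. The other words come out as 0.00 because they were either dropped when the tweet was processed or never appeared in the training vocabulary, and the log prior at this point is 0.0.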
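The zeros come from the way the loglikelihood dictionary was built: it's a `defaultdict` whose default value is 0, so an unseen word adds nothing to the sum. The sketch below uses a made-up dictionary with a single entry (the `toy_loglikelihood` name and its contents are hypothetical) to show the lookup behavior.

```python
from collections import defaultdict

# hypothetical scores: only "answer" is in the toy vocabulary
toy_loglikelihood = defaultdict(lambda: 0, {"answer": -0.41})

tokens = ["answer", "nicht", "umwelt"]
scores = [toy_loglikelihood[token] for token in tokens]
print(scores)

# looking up a missing key also stores it with the default value
print(len(toy_loglikelihood))
```

One side effect worth knowing about is that each lookup of an unseen word adds that word to the dictionary, so its length can grow as new tweets are scored.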
Filtering Words
This is sort of an aside, but one way to quickly filter tweets based on how positive or negative they are is to use the ratio of positive to negative counts and setting a threshold that has to be met to be included in the output.
\[ ratio = \frac{\text{pos_words} + 1}{\text{neg_words} + 1} \]
| Word | Positive Word Count | Negative Word Count |
|---|---|---|
| glad | 41 | 2 |
| arriv | 57 | 4 |
| :( | 1 | 3663 |
| :-( | 0 | 378 |
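As a quick worked example, the ratio formula can be applied to the four rows of the table. This is just a sketch in plain Python; the `table` and `ratios` names are made up for the example.

```python
# (positive count, negative count) for each word in the table
table = {"glad": (41, 2), "arriv": (57, 4), ":(": (1, 3663), ":-(": (0, 378)}

# add one to both counts so a zero count can't cause a division by zero
ratios = {word: (positive + 1) / (negative + 1)
          for word, (positive, negative) in table.items()}

for word, ratio in ratios.items():
    print(f"{word}: {ratio:0.4f}")
```

The two positive words come out well above 1 and the two sad emoticons come out close to 0.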
Get The Ratio
As an intermediate step we'll create a function named get_ratio that looks up a word and calculates the positive to negative ratio.
def get_ratio(freqs: Counter, word: str) -> dict:
    """
    Args:
     freqs: Counter with (word, sentiment) : count
     word: string to lookup

    Returns:
     dictionary with keys 'positive', 'negative', and 'ratio'.
     Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    """
    pos_neg_ratio = dict(
        positive = freqs[(word, Sentiment.positive)],
        negative = freqs[(word, Sentiment.negative)],
    )

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio["positive"] + 1)/(
        pos_neg_ratio["negative"] + 1)
    return pos_neg_ratio
print(get_ratio(counter.counts, 'happi'))
{'positive': 160, 'negative': 23, 'ratio': 6.708333333333333}
Get Words By Threshold
Now we'll create the filter function. To make it simpler we'll assume that if we're filtering on the positive label then the ratio for a word has to be greater than or equal to the given threshold to be included, while if the label is negative then the ratio has to be less than or equal to the threshold. Doing this means we're filtering to get words that are further toward the extremes of positive or negative (further from the neutral ratio of 1).
An example key-value pair would have this structure:
{'happi':
{'positive': 10, 'negative': 20, 'ratio': 0.5}
}
def get_words_by_threshold(freqs: Counter, label: int, threshold: float) -> dict:
    """
    Args:
     freqs: Counter of (word, sentiment): word count
     label: 1 for positive, 0 for negative
     threshold: ratio that will be used as the cutoff for including a word in the returned dictionary

    Returns:
     words: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
     example of a key value pair:
     {'happi':
         {'positive': 10, 'negative': 20, 'ratio': 0.5}
     }
    """
    words = {}

    for word, _ in freqs:
        pos_neg_ratio = get_ratio(freqs, word)

        if ((label == Sentiment.positive and pos_neg_ratio["ratio"] >= threshold) or
            (label == Sentiment.negative and pos_neg_ratio["ratio"] <= threshold)):
            words[word] = pos_neg_ratio
    return words
Here's an example where we'll filter on negative sentiment, so all the words should be negative and have a positive to negative ratio less than or equal to the threshold.
passed = get_words_by_threshold(counter.counts, label=Sentiment.negative, threshold=0.05)
count = 1
for word, info in passed.items():
    print(f"{count}\tword: {word}\t{info}")
    count += 1
1 word: :( {'positive': 1, 'negative': 3705, 'ratio': 0.0005396654074473826}
2 word: :-( {'positive': 0, 'negative': 407, 'ratio': 0.0024509803921568627}
3 word: ♛ {'positive': 0, 'negative': 162, 'ratio': 0.006134969325153374}
4 word: 》 {'positive': 0, 'negative': 162, 'ratio': 0.006134969325153374}
5 word: beli̇ev {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
6 word: wi̇ll {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
7 word: justi̇n {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
8 word: see {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
9 word: me {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
10 word: sad {'positive': 3, 'negative': 100, 'ratio': 0.039603960396039604}
11 word: >:( {'positive': 0, 'negative': 36, 'ratio': 0.02702702702702703}
So our threshold gives us the eleven most negative words.
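The words come back in the order the Counter yields them rather than ranked by ratio (`>:(` is listed after `sad` even though its ratio is lower). If a ranking is wanted, one option is to sort the returned dictionary by its ratio values. This is a minimal sketch using three entries copied from the output; the `sample` and `ranked` names are made up for the example.

```python
# a few entries copied from the filtered output
sample = {
    "sad": {"positive": 3, "negative": 100, "ratio": 0.039603960396039604},
    ":(": {"positive": 1, "negative": 3705, "ratio": 0.0005396654074473826},
    ">:(": {"positive": 0, "negative": 36, "ratio": 0.02702702702702703},
}

# sort from the lowest ratio (most negative) to the highest
ranked = sorted(sample.items(), key=lambda item: item[1]["ratio"])

for rank, (word, info) in enumerate(ranked, start=1):
    print(f"{rank}\tword: {word}\t{info['ratio']:0.4f}")
```

For the positive filter the same sort with `reverse=True` would put the highest ratios first.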
Now, what about filtering on the most positive words?
passed = get_words_by_threshold(counter.counts, label=Sentiment.positive, threshold=10)
count = 1
for word, info in passed.items():
    print(f"{count}\tword: {word}\t{info}")
    count += 1
1 word: :) {'positive': 2967, 'negative': 1, 'ratio': 1484.0}
2 word: :-) {'positive': 547, 'negative': 0, 'ratio': 548.0}
3 word: :D {'positive': 537, 'negative': 0, 'ratio': 538.0}
4 word: :p {'positive': 113, 'negative': 0, 'ratio': 114.0}
5 word: fback {'positive': 22, 'negative': 0, 'ratio': 23.0}
6 word: blog {'positive': 29, 'negative': 2, 'ratio': 10.0}
7 word: followfriday {'positive': 19, 'negative': 0, 'ratio': 20.0}
8 word: recent {'positive': 9, 'negative': 0, 'ratio': 10.0}
9 word: stat {'positive': 52, 'negative': 0, 'ratio': 53.0}
10 word: arriv {'positive': 57, 'negative': 4, 'ratio': 11.6}
11 word: thx {'positive': 11, 'negative': 0, 'ratio': 12.0}
12 word: here' {'positive': 19, 'negative': 0, 'ratio': 20.0}
13 word: influenc {'positive': 16, 'negative': 0, 'ratio': 17.0}
14 word: bam {'positive': 34, 'negative': 0, 'ratio': 35.0}
15 word: warsaw {'positive': 34, 'negative': 0, 'ratio': 35.0}
16 word: welcom {'positive': 58, 'negative': 4, 'ratio': 11.8}
17 word: vid {'positive': 9, 'negative': 0, 'ratio': 10.0}
18 word: ceo {'positive': 9, 'negative': 0, 'ratio': 10.0}
19 word: 1month {'positive': 9, 'negative': 0, 'ratio': 10.0}
20 word: flipkartfashionfriday {'positive': 14, 'negative': 0, 'ratio': 15.0}
21 word: inde {'positive': 10, 'negative': 0, 'ratio': 11.0}
22 word: glad {'positive': 35, 'negative': 2, 'ratio': 12.0}
23 word: braindot {'positive': 9, 'negative': 0, 'ratio': 10.0}
24 word: ;) {'positive': 21, 'negative': 0, 'ratio': 22.0}
25 word: goodnight {'positive': 19, 'negative': 1, 'ratio': 10.0}
26 word: youth {'positive': 10, 'negative': 0, 'ratio': 11.0}
27 word: shout {'positive': 9, 'negative': 0, 'ratio': 10.0}
28 word: fantast {'positive': 10, 'negative': 0, 'ratio': 11.0}
The first four make sense, but after that maybe not so much. "fback"?
Error Analysis
Now let's look at some tweets that we got wrong. We're going to use numpy.sign which reduces numbers to -1, 0, or 1.
print('Truth Predicted Tweet')
for row in test_raw.itertuples():
    y_hat = naive_bayes_predict(row.tweet, logprior, loglikelihood)
    if row.label != (numpy.sign(y_hat) > 0):
        print(
            f"{row.label}\t{numpy.sign(y_hat) > 0:d}\t"
            f"{' '.join(counter.process(row.tweet)).encode('ascii', 'ignore')}")
Truth Predicted Tweet
0 1 b'whatev stil l young >:-('
1 0 b'look fun kik va 642 kik kikgirl french model orgasm hannib phonesex :)'
0 1 b'great news thank let us know :( hope good weekend'
0 1 b"amb pleas harry' jean :) ): ): ):"
0 1 b'srsli fuck u unfollow hope ur futur child unpar u >:-('
1 0 b'ate last cooki shir 0 >:d'
1 0 b'snapchat jennyjean 22 snapchat kikmeboy model french kikchat sabadodeganarseguidor sexysasunday :)'
1 0 b'add kik ughtm 545 kik kikmeguy kissm nude likeforfollow musicbiz sexysasunday :)'
0 1 b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'
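For some reason it misses the >:-( emoticon and the :) emoticon. Maybe >:-( didn't occur in the training set, but :) did (it's the most positive token in the filtered output above), so in those tweets the other tokens presumably outweighed it. I think these would be hard for a human to get too, unless you were well versed in tweets and emoticons, and maybe even then it would be hard…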
Predict Your Own Tweet
Let's try a random tweet not in the given training or test sets.
my_tweet = 'my balls itch'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(f"{my_tweet} is a positive tweet: {numpy.sign(p) > 0}")
my balls itch is a positive tweet: True
Hmmm. Maybe…
End
I want to do more work with the Naive Bayes Classifier but this post is getting too long so I'm going to move on to other posts, the next being a class-based implementation of the model.