Implementing a Naive Bayes Twitter Sentiment Classifier

Beginning

In the previous post I went through some of the background of how Naive Bayes works. In this post I'll implement a Naive Bayes Classifier to classify tweets by whether they are positive or negative in sentiment. The Naive Bayes model uses Bayes' rule to make its predictions. It's called "naive" because it assumes that the words in a document are independent of each other given the class (in the probability event sense), which allows us to use the multiplication rule to calculate our probabilities. It also uses the \(\textit{Bag of Words}\) assumption that word ordering isn't important.
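
As a rough sketch of where those two assumptions lead (the symbols here are my own shorthand, not necessarily the notation from the earlier post): for a tweet made up of the words \(w_1, \ldots, w_n\), taking the ratio of the two posteriors from Bayes' rule cancels the shared denominator, and the independence assumption turns the likelihood into a product, which becomes a sum once we take logs.

\[ \log \frac{P(pos \mid tweet)}{P(neg \mid tweet)} = \log \frac{P(pos)}{P(neg)} + \sum_{i=1}^{n} \log \frac{P(w_i \mid pos)}{P(w_i \mid neg)} \]

The first term on the right is what the code below calls the logprior and each term in the sum is a word's entry in the loglikelihood dictionary.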

Set Up

This first bit imports the needed dependencies followed by setting up the data and some helpers.

Imports

# python
from collections import Counter, defaultdict
from functools import partial
from pathlib import Path

import os
import pickle

# pypi
from dotenv import load_dotenv
from tabulate import tabulate

import numpy
import pandas

# my stuff
from neurotic.nlp.twitter.counter import WordCounter

Tabulate

This sets up tabulate to make it a little simpler to display pandas DataFrames in org.

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

The Dotenv

I put the path to the data files in a .env file so this loads it into the environment.

env_path = Path("posts/nlp/.env")
assert env_path.is_file()
load_dotenv(env_path)

Load the Twitter Data

I split the data previously for the Logistic Regression twitter sentiment classifier so I'll load it here and skip building the sets.

train_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser())

test_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TEST_RAW"]).expanduser()
)

print(f"Training: {len(train_raw):,}")
print(f"Testing: {len(test_raw):,}")
Training: 8,000
Testing: 2,000

I'll also re-use the WordCounter from the Logistic Regression. Despite the name it also does tokenizing and cleaning.

counter = WordCounter(train_raw.tweet, train_raw.label)

Constants

This was an object I created to store a few constant values.

with open(os.environ["TWITTER_SENTIMENT"], "rb") as reader:
    Sentiment = pickle.load(reader)
print(Sentiment)
Namespace(decode={1: 'positive', 0: 'negative'}, encode={'positive': 1, 'negative': 0}, negative=0, positive=1)

Middle

Implementing the Model

In an earlier post I wrote up a little of the background behind what we're doing and now I'm going to translate the math in that post into code.

Implementing The Training Function

The first part of the problem is training the model by building up the probabilities.

def train_naive_bayes(counts: Counter,
                      train_x: pandas.Series,
                      train_y: pandas.Series) -> tuple:
    """
    Args:
       counts: Counter from (word, label) to how often the word appears
       train_x: a list of tweets
       train_y: a list of labels corresponding to the tweets (0,1)

    Returns:
       logprior: the log odds ratio
       loglikelihood: log likelihood dictionary for the Naive bayes equation
    """
    loglikelihood = defaultdict(float)

    vocabulary = {word for word, _ in counts}
    V = len(vocabulary)

    # number of positive and negative words in the training set
    N_pos = sum((counts[(token, sentiment)] for token, sentiment in counts
                 if sentiment == Sentiment.positive))
    N_neg = sum((counts[(token, sentiment)] for token, sentiment in counts
                 if sentiment == Sentiment.negative))

    D = len(train_x)

    # D_pos is number of positive documents
    D_pos = train_y.sum()

    # D_neg is the number of negative documents
    D_neg = D - D_pos

    # the log odds ratio
    logprior = numpy.log(D_pos) - numpy.log(D_neg)

    for word in vocabulary:
        freq_pos = counts[(word, Sentiment.positive)]
        freq_neg = counts[(word, Sentiment.negative)]

        # the smoothed probability of the word given each class (positive, negative)
        p_w_pos = (freq_pos + 1)/(N_pos + V)
        p_w_neg = (freq_neg + 1)/(N_neg + V)

        loglikelihood[word] = numpy.log(p_w_pos) - numpy.log(p_w_neg)
    return logprior, loglikelihood
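
To make the add-one smoothing in the loop a little more concrete, here's a minimal sketch with made-up counts. The `toy_counts` values and the `smoothed_log_ratio` helper are hypothetical, they're only here to show the arithmetic on something small enough to check by hand.

```python
from collections import Counter
import math

# Hypothetical toy counts, (word, label) -> count; 1 = positive, 0 = negative.
toy_counts = Counter({("happi", 1): 3, ("happi", 0): 1, ("sad", 0): 2})

vocabulary = {word for word, _ in toy_counts}
V = len(vocabulary)
N_pos = sum(count for (_, label), count in toy_counts.items() if label == 1)
N_neg = sum(count for (_, label), count in toy_counts.items() if label == 0)


def smoothed_log_ratio(word: str) -> float:
    """Add-one smoothed log(P(word|positive) / P(word|negative))."""
    p_w_pos = (toy_counts[(word, 1)] + 1) / (N_pos + V)
    p_w_neg = (toy_counts[(word, 0)] + 1) / (N_neg + V)
    return math.log(p_w_pos) - math.log(p_w_neg)


for word in sorted(vocabulary):
    print(f"{word}: {smoothed_log_ratio(word):0.3f}")
```

"sad" never shows up in a positive tweet here, so without the `+ 1` its positive probability would be 0 and the log would blow up. With the smoothing it just gets a finite negative value.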

Now we can see what we get when we train our model.

logprior, loglikelihood = train_naive_bayes(counter.counts, train_raw.tweet, train_raw.label)
print(f"Log Prior: {logprior}")
print(f"Words in Log Likelihood: {len(loglikelihood):,}")
Log Prior: -0.006500022885560952
Words in Log Likelihood: 9,172
print(f"Positive Tweets: {len(train_raw[train_raw.label==Sentiment.positive]):,}")
print(f"Negative Tweets: {len(train_raw[train_raw.label==Sentiment.negative]):,}")
Positive Tweets: 3,987
Negative Tweets: 4,013

We get a negative value for the logprior because we have more negative tweets than positive tweets in the training set, and the negative count is the term being subtracted when we calculate the logprior. If the classes were balanced it would come out to 0.

all_raw = pandas.concat([train_raw, test_raw])
check = pandas.concat([
    all_raw[all_raw.label==1].iloc[:4000], all_raw[all_raw.label==0].iloc[:4000]])
logprior, loglikelihood = train_naive_bayes(counter.counts, check.tweet, check.label)
print(f"Log Prior: {logprior}")
print(f"Log Likelihood: {len(loglikelihood)}")
Log Prior: 0.0
Log Likelihood: 9172
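
As a quick sanity check, the two log priors can be reproduced by hand from nothing but the document counts. This is just a sketch using the counts printed above, not a call into the training function.

```python
import math

# Document counts copied from the training-set output above.
D_pos, D_neg = 3_987, 4_013

logprior = math.log(D_pos) - math.log(D_neg)
print(f"Unbalanced: {logprior:0.4f}")

# With equal class counts the two terms cancel.
balanced = math.log(4_000) - math.log(4_000)
print(f"Balanced: {balanced}")
```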

Making Predictions

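Now that we have the model we can use it to make some predictions. The score for a tweet is the log prior plus the sum of the log-likelihood ratios of its \(N\) words.

\[ p = logprior + \sum_{i=1}^{N} loglikelihood_i \]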

def naive_bayes_predict(tweet: str, logprior: float, loglikelihood: dict) -> float:
    """
    Args:
       tweet: a tweet to classify
       logprior: the log odds ratio of prior probabilities
       loglikelihood: a dictionary of words mapped to their log likelihood ratios

    Returns:
       p: sum of the log-odds ratio for the tweet
    """
    # process the tweet to get a list of words
    words = counter.process(tweet)
    return logprior + sum(loglikelihood[word] for word in words)

Now test it with a tweet.

my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(f'The positive to negative ratio is {p:0.2f}.')
The positive to negative ratio is 1.44.

Since the log ratio is greater than 0, we're predicting that the tweet has a positive sentiment.
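
Since the score is a log of the odds, it can be turned back into the model's estimate of the probability that the tweet is positive with the logistic function. This is a small sketch and `log_odds_to_probability` is a name I made up for it. Given the naive independence assumption I wouldn't read the result as a well-calibrated probability.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Convert log(P(pos)/P(neg)) to P(pos), assuming P(pos) + P(neg) = 1."""
    return 1 / (1 + math.exp(-log_odds))


# 1.44 is the score for 'She smiled.' from above
probability = log_odds_to_probability(1.44)
print(f"Model's estimate that the tweet is positive: {probability:0.2f}")
```

A score of 0 maps to a probability of 0.5, which is why 0 works as the decision threshold.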

Test The Model

Now we'll calculate the accuracy of the model against the test set.

def test_naive_bayes(test_x: pandas.Series, test_y: pandas.Series,
                     logprior: float, loglikelihood: dict) -> float:
    """
    Args:
       test_x: tweets to classify
       test_y: labels for test_x
       logprior: the logprior for the training set
       loglikelihood: a dictionary with the loglikelihoods for each word

    Returns:
       accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0

    y_hats = numpy.array([int(naive_bayes_predict(tweet, logprior, loglikelihood) > 0)
              for tweet in test_x])

    # error is the average of the absolute values of the differences between y_hats and test_y
    # error = number wrong/number of tweets
    error = numpy.abs(y_hats - test_y).mean()

    # Accuracy is 1 minus the error
    accuracy = 1 - error
    return accuracy
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_raw.tweet, test_raw.label, logprior, loglikelihood)))
Naive Bayes accuracy = 0.9955

That looks good, maybe too good, so it might actually be overfitting. Now here are some example tweets to check.

for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
              'great', 'great great', 'great great great', 'great great great great']:
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')
I am happy -> 1.89
I am bad -> -1.63
this movie should have been great. -> 2.05
great -> 2.06
great great -> 4.13
great great great -> 6.19
great great great great -> 8.25

It looks like the word "great" throws off the third sentence, which hints at being negative. What if we pass in a neutral (nonsensical) tweet?

my_tweet = "the answer is nicht in the umwelt"
print(naive_bayes_predict(my_tweet, logprior, loglikelihood))
-0.41441957689474407

I don't know which of those words triggered the negative value…

for word in "the answer is nicht in the umwelt".split():
    print(f"{word}:\t{naive_bayes_predict(word, logprior, loglikelihood):0.2f}")
the:    0.00
answer: -0.41
is:     0.00
nicht:  0.00
in:     0.00
the:    0.00
umwelt: 0.00

It only picked up one word, "answer", and that's negative for some reason. Go figure.
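
Part of why so many of those words score 0.00 is how the lookup works: any token that isn't in the loglikelihood dictionary gets the default value of 0, so it can't move the score either way. Here's a small sketch with a made-up two-word model (the numbers in `toy_loglikelihood` are hypothetical, and I'm using `defaultdict(float)`, which hands back the same zero default as `lambda: 0`). It also shows a side effect worth knowing about: indexing a defaultdict with an unseen key adds that key to it.

```python
from collections import defaultdict

# Hypothetical two-word model; the values are made up for illustration.
toy_loglikelihood = defaultdict(float, {"happi": 1.9, "answer": -0.41})

tokens = ["answer", "nicht", "umwelt"]
score = sum(toy_loglikelihood[token] for token in tokens)
print(f"score: {score:0.2f}")

# Indexing with the two unseen tokens inserted them with the default value.
size_after_indexing = len(toy_loglikelihood)
print(f"entries after indexing: {size_after_indexing}")

# dict.get returns a fallback without adding the key.
fallback = toy_loglikelihood.get("welt", 0.0)
size_after_get = len(toy_loglikelihood)
print(f"entries after get: {size_after_get}")
```

So if the size of the dictionary matters after you start predicting, `get` with a default of 0 is the safer lookup.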

Filtering Words

This is sort of an aside, but one way to quickly filter words based on how positive or negative they are is to use the ratio of positive to negative counts and set a threshold that a word has to meet to be included in the output.

\[ ratio = \frac{\text{pos_words} + 1}{\text{neg_words} + 1} \]

| Words | Positive Word Count | Negative Word Count |
|-------+---------------------+---------------------|
| glad  |                  41 |                   2 |
| arriv |                  57 |                   4 |
| :(    |                   1 |                3663 |
| :-(   |                   0 |                 378 |

Get The Ratio

As an intermediate step we'll create a function named get_ratio that looks up a word and calculates the positive to negative ratio.

def get_ratio(freqs: Counter, word: str) -> dict:
    """
    Args:
       freqs: Counter with (word, sentiment) : count
       word: string to lookup

    Returns: 
     dictionary with keys 'positive', 'negative', and 'ratio'.
       Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    """
    pos_neg_ratio = dict(
        positive = freqs[(word, Sentiment.positive)],
        negative = freqs[(word, Sentiment.negative)],
    )

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio["positive"] + 1)/(
        pos_neg_ratio["negative"] + 1)
    return pos_neg_ratio
print(get_ratio(counter.counts, 'happi'))
{'positive': 160, 'negative': 23, 'ratio': 6.708333333333333}

Get Words By Threshold

Now we'll create the filter function. To make it simpler we'll assume that if we're filtering on the positive label then the ratio for a word has to be greater than or equal to the given threshold to be included, while if the label is negative then the ratio has to be less than or equal to the threshold. Doing this means we're filtering to get words that are further toward the extremes of positive or negative (further from a ratio of 1, where the smoothed counts are equal).

An example key-value pair would have this structure:

{'happi':
     {'positive': 10, 'negative': 20, 'ratio': 0.5}
 }
def get_words_by_threshold(freqs: Counter, label: int, threshold: float) -> dict:
    """
    Args:
       freqs: Counter of (word, sentiment): word count
       label: 1 for positive, 0 for negative
       threshold: ratio that will be used as the cutoff for including a word in the returned dictionary

    Returns:
       words: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
       example of a key value pair:
       {'happi':
           {'positive': 10, 'negative': 20, 'ratio': 0.5}
       }
    """
    words = {}

    for word, _ in freqs:
        pos_neg_ratio = get_ratio(freqs, word)

        if ((label == Sentiment.positive and pos_neg_ratio["ratio"] >= threshold) or
            (label == Sentiment.negative and pos_neg_ratio["ratio"] <= threshold)):
            words[word] = pos_neg_ratio

    return words

Here's an example where we'll filter on negative sentiment, so all the words should be negative and have a positive to negative ratio less than or equal to the threshold.

passed = get_words_by_threshold(counter.counts, label=Sentiment.negative, threshold=0.05)
count = 1
for word, info in passed.items():
    print(f"{count}\tword: {word}\t{info}")
    count += 1
1       word: :(        {'positive': 1, 'negative': 3705, 'ratio': 0.0005396654074473826}
2       word: :-(       {'positive': 0, 'negative': 407, 'ratio': 0.0024509803921568627}
3       word: ♛ {'positive': 0, 'negative': 162, 'ratio': 0.006134969325153374}
4       word: 》 {'positive': 0, 'negative': 162, 'ratio': 0.006134969325153374}
5       word: beli̇ev   {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
6       word: wi̇ll     {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
7       word: justi̇n   {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
8       word: see       {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
9       word: me        {'positive': 0, 'negative': 27, 'ratio': 0.03571428571428571}
10      word: sad       {'positive': 3, 'negative': 100, 'ratio': 0.039603960396039604}
11      word: >:(    {'positive': 0, 'negative': 36, 'ratio': 0.02702702702702703}

So our threshold gives us the eleven most negative words.

Now, what about filtering on the most positive words?

passed = get_words_by_threshold(counter.counts, label=Sentiment.positive, threshold=10)
count = 1
for word, info in passed.items():
    print(f"{count}\tword: {word}\t{info}")
    count += 1
1       word: :)        {'positive': 2967, 'negative': 1, 'ratio': 1484.0}
2       word: :-)       {'positive': 547, 'negative': 0, 'ratio': 548.0}
3       word: :D        {'positive': 537, 'negative': 0, 'ratio': 538.0}
4       word: :p        {'positive': 113, 'negative': 0, 'ratio': 114.0}
5       word: fback     {'positive': 22, 'negative': 0, 'ratio': 23.0}
6       word: blog      {'positive': 29, 'negative': 2, 'ratio': 10.0}
7       word: followfriday      {'positive': 19, 'negative': 0, 'ratio': 20.0}
8       word: recent    {'positive': 9, 'negative': 0, 'ratio': 10.0}
9       word: stat      {'positive': 52, 'negative': 0, 'ratio': 53.0}
10      word: arriv     {'positive': 57, 'negative': 4, 'ratio': 11.6}
11      word: thx       {'positive': 11, 'negative': 0, 'ratio': 12.0}
12      word: here'     {'positive': 19, 'negative': 0, 'ratio': 20.0}
13      word: influenc  {'positive': 16, 'negative': 0, 'ratio': 17.0}
14      word: bam       {'positive': 34, 'negative': 0, 'ratio': 35.0}
15      word: warsaw    {'positive': 34, 'negative': 0, 'ratio': 35.0}
16      word: welcom    {'positive': 58, 'negative': 4, 'ratio': 11.8}
17      word: vid       {'positive': 9, 'negative': 0, 'ratio': 10.0}
18      word: ceo       {'positive': 9, 'negative': 0, 'ratio': 10.0}
19      word: 1month    {'positive': 9, 'negative': 0, 'ratio': 10.0}
20      word: flipkartfashionfriday     {'positive': 14, 'negative': 0, 'ratio': 15.0}
21      word: inde      {'positive': 10, 'negative': 0, 'ratio': 11.0}
22      word: glad      {'positive': 35, 'negative': 2, 'ratio': 12.0}
23      word: braindot  {'positive': 9, 'negative': 0, 'ratio': 10.0}
24      word: ;)        {'positive': 21, 'negative': 0, 'ratio': 22.0}
25      word: goodnight {'positive': 19, 'negative': 1, 'ratio': 10.0}
26      word: youth     {'positive': 10, 'negative': 0, 'ratio': 11.0}
27      word: shout     {'positive': 9, 'negative': 0, 'ratio': 10.0}
28      word: fantast   {'positive': 10, 'negative': 0, 'ratio': 11.0}

The first four make sense, but after that maybe not so much. "fback"?
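
One thing that makes the list harder to read is that it comes out in dictionary order (the order the words were added), not ranked by ratio. Here's a sketch of sorting it, using a handful of entries copied from the output above as a stand-in for the full `passed` dictionary.

```python
# A few entries copied from the output above.
passed_sample = {
    ":)": {"positive": 2967, "negative": 1, "ratio": 1484.0},
    "fback": {"positive": 22, "negative": 0, "ratio": 23.0},
    "blog": {"positive": 29, "negative": 2, "ratio": 10.0},
    "stat": {"positive": 52, "negative": 0, "ratio": 53.0},
}

# Sort by ratio, most positive first.
ranked = sorted(passed_sample.items(),
                key=lambda item: item[1]["ratio"],
                reverse=True)

for word, info in ranked:
    print(f"{word}\t{info['ratio']}")
```

For the negative words you'd drop the `reverse=True` so the smallest ratios come first.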

Error Analysis

Now let's look at some tweets that we got wrong. We're going to use numpy.sign which reduces numbers to -1, 0, or 1.

print('Truth Predicted Tweet')
for row in test_raw.itertuples():
    y_hat = naive_bayes_predict(row.tweet, logprior, loglikelihood)
    if row.label != (numpy.sign(y_hat) > 0):
        print(
            f"{row.label}\t{numpy.sign(y_hat) > 0:d}\t"
            f"{' '.join(counter.process(row.tweet)).encode('ascii', 'ignore')}")
Truth Predicted Tweet
0       1       b'whatev stil l young >:-('
1       0       b'look fun kik va 642 kik kikgirl french model orgasm hannib phonesex :)'
0       1       b'great news thank let us know :( hope good weekend'
0       1       b"amb pleas harry' jean :) ): ): ):"
0       1       b'srsli fuck u unfollow hope ur futur child unpar u >:-('
1       0       b'ate last cooki shir 0 >:d'
1       0       b'snapchat jennyjean 22 snapchat kikmeboy model french kikchat sabadodeganarseguidor sexysasunday :)'
1       0       b'add kik ughtm 545 kik kikmeguy kissm nude likeforfollow musicbiz sexysasunday :)'
0       1       b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'

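For some reason it misses the >:-( emoji and the :). Maybe >:-( didn't occur in the training set, but :) had the highest ratio in the filtered list above, so in those tweets the other tokens presumably outweighed it. I think these would be hard for a human to get too, unless you were well versed in tweets and emojis, and maybe even then it would be hard…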
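
One way to dig into a miss like these would be to break the score down token by token. This is only a sketch: `token_contributions` is a helper I'm making up here, and apart from "great" (which scored 2.06 on its own earlier) the values in `toy_loglikelihood` are invented, so the real model's numbers will differ.

```python
def token_contributions(tokens: list, loglikelihood: dict) -> list:
    """Pair each token with its log-likelihood ratio (0.0 if unseen),
    sorted so the tokens that move the score the most come first."""
    pairs = [(token, loglikelihood.get(token, 0.0)) for token in tokens]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical values; note that '>:-(' is deliberately missing.
toy_loglikelihood = {"great": 2.06, "news": 0.4}
tokens = ["great", "news", ">:-("]

contributions = token_contributions(tokens, toy_loglikelihood)
for token, value in contributions:
    print(f"{token}\t{value:0.2f}")

total = sum(value for _, value in contributions)
print(f"score (with a log prior of 0): {total:0.2f}")
```

In this made-up case the unseen emoticon adds nothing, so the positive words carry the tweet even though the emoticon is the only real clue to the sentiment.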

Predict Your Own Tweet

Let's try a random tweet not in the given training or test sets.

my_tweet = 'my balls itch'

p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(f"{my_tweet} is a positive tweet: {numpy.sign(p) > 0}")
my balls itch is a positive tweet: True

Hmmm. Maybe…

End

I want to do more work with the Naive Bayes Classifier but this post is getting too long so I'm going to move on to other posts, the next being a class-based implementation of the model.