Using Naive Bayes to Classify Tweets by Sentiment

Beginning

In a previous post I implemented a Logistic Regression model to classify Twitter tweets as having either a positive or a negative sentiment. This time I'll be using the same data set (from NLTK) but classifying the tweets with a Naive Bayes model instead. This post will look at some of the math behind the model and the next one will translate the math into code.

Middle

Bayesian Inference

What we want is to take a document (D) - which is a tweet in this case - and guess its classification \(\hat{c}\). We do this by calculating the probability for both of our classifications (positive and negative) using Bayes' Rule and then choosing the classification with the higher probability.

\begin{align} \hat{c} &= \underset{c \in C}{\mathrm{argmax}} P(c|D)\\ &= \underset{c \in C}{\mathrm{argmax}} \frac{P(D|c)P(c)}{P(D)}\\ &= \underset{c \in C}{\mathrm{argmax}} P(D|c)P(c)\\ \end{align}

So our guess as to which class the document belongs to is the classification with the highest probability given the document. "The probability of the classification given the document", when translated using Bayes' Rule, becomes the probability of the document given the classification (the likelihood of the document) times the prior probability of any document belonging to the class - the denominator \(P(D)\) gets dropped because it's the same for every class and so doesn't change which class wins the argmax. But then you might wonder - if there's only one of each document, won't the probability of a particular document given a class always just be \(\frac{1}{D_c}\) (one over the number of documents in that class)? It would, so instead we use the words within the document to calculate the probability for the document. How? The Naive Bayes model makes two assumptions - that the documents can be represented as a bag of words (only the word counts matter, not their order) and that the words are independent of each other given the class. The independence assumption allows us to figure out the probability of the whole document using the Multiplication Rule:

\[ P(A \cap B) = P(A)P(B) \]

The probability of A and B occurring together is the product of their individual probabilities - which only holds when A and B are independent, hence the assumption. In this case we are calculating the probability of the document as the product of the conditional probabilities of its words given the class:

\[ P(D|c) = \prod_{i=1}^{n}P(w_i | c) \]

Where n refers to the number of words in the document. Given this, we can re-write the previous equation like this:

\begin{align} \hat{c} &= \underset{c \in C}{\textrm{argmax}} P(c) \prod_{i=1}^{n} P(w_i | c)\\ \end{align}

But it turns out this form isn't really ideal. Among other things you're multiplying values that range from 0 to 1, with most of them much less than 1, so the more words there are in the document, the smaller the product gets, and with enough words you end up with numbers small enough to cause floating-point underflow. So we're going to do a log transform of the equation, which also simplifies the computation a little since sums are cheaper than products (although nowadays I don't know that that's so much of a consideration).
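To see why the raw product is a problem, here's a quick illustration in python with made-up word probabilities (nothing from the actual data set):

from math import log

# made-up per-word probabilities - individually small, as word probabilities tend to be
probabilities = [0.001] * 200

# multiplying them underflows to exactly 0.0...
product = 1.0
for probability in probabilities:
    product *= probability
print(product)  # 0.0

# ...but the sum of their logs is a perfectly representable number
print(sum(log(probability) for probability in probabilities))  # ≈ -1381.6

With that in mind, the log-transformed version of the classifier looks like this: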

\[ \hat{c} = \underset{c \in C}{\textrm{argmax}} \log{P(c)} + \sum_{i=1}^n \log{P(w_i|c)} \]

This is what we'll use to classify tweets after training the model by building up the probabilities.
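To make that concrete, here's a minimal sketch of what the classification step might look like in python. The log_priors and log_likelihoods arguments are hypothetical placeholders for the values we'll estimate during training (nothing here comes from NLTK or the data set), and words that weren't seen during training are simply skipped.

from math import inf

def classify(document_words, log_priors, log_likelihoods):
    """Return the class with the highest log P(c) + sum of log P(w|c).

    log_priors maps each class to log(P(c)) and log_likelihoods maps
    each class to a dictionary of log(P(w|c)) values - both are assumed
    to have been built during training.
    """
    best_class, best_score = None, -inf
    for label, log_prior in log_priors.items():
        # working in log-space turns the product of probabilities into a sum
        score = log_prior + sum(log_likelihoods[label].get(word, 0)
                                for word in document_words)
        if score > best_score:
            best_class, best_score = label, score
    return best_class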

Ratios

While I wrote out the general case where you take the class with the highest probability, in this case we only have two classes, positive and negative, so we can take advantage of this and make our classification using the ratio of the conditional probabilities for the two classes (the log odds ratio). We're going to use the ratio of positive to negative.

\[ \log{\frac{P(positive|D)}{P(negative | D)}} = \log{\frac{P(positive)}{P(negative)}} + \sum_{i=1}^n \log{\frac{P(w_i|positive)}{P(w_i|negative)}} \]

Since the positive class is in the numerator, this value will be positive when the tweet is more likely to be positive (the ratio inside the log is greater than one) and negative when the tweet is more likely to be negative (the ratio is less than one, and the log of a value less than one is negative), so we can use the sign of this log odds ratio to classify tweets.
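As a rough sketch of the decision rule in python (the logprior value and loglikelihood dictionary are placeholders for quantities we haven't built yet - the next sections work out what goes into them):

def predict_sentiment(tweet_words, logprior, loglikelihood):
    """Classify a tokenized tweet using the sign of the log odds ratio.

    logprior is log(D_pos/D_neg) and loglikelihood maps each word in the
    vocabulary to log(P(w|positive)/P(w|negative)).
    """
    # words that never appeared in training contribute nothing to the score
    score = logprior + sum(loglikelihood.get(word, 0) for word in tweet_words)
    return "positive" if score > 0 else "negative"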

Priors and Log Priors

Now we can start picking apart our ratio. The prior probabilities are just the fractions of the training set that belong to each class, so the prior probabilities of the document classifications can be described like this:

\begin{align} P(D_{positive}) &= \frac{\textit{number of positive tweets}}{\textit{total number of tweets}}\\ &= \frac{D_{pos}}{D}\\ \end{align} \begin{align} P(D_{negative}) &= \frac{\textit{number of negative tweets}}{\textit{total number of tweets}}\\ &= \frac{D_{neg}}{D}\\ \end{align}

But as I noted above, we are going to use the ratio of the prior probabilities \(\frac{P(D_{pos})}{P(D_{neg})}\), and if you look at them they have the same denominator (D), so taking the ratio of the probabilities means the denominator cancels out and we end up with the ratio of positive to negative documents.

\begin{align} \frac{P(D_{pos})}{P(D_{neg})} &= \frac{\frac{D_{pos}}{D}}{\frac{D_{neg}}{D}}\\ &= \left( \frac{D_{pos}}{\cancel{D}}\right) \left(\frac{\cancel{D}}{D_{neg}}\right)\\ &= \frac{D_{pos}}{D_{neg}}\\ \end{align}

And as I noted above, we'll be using a log transform so our ratio (which will be called logprior) needs to be transformed as well.

\begin{align} \text{logprior} &= \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) \\ &= \log \left( \frac{D_{pos}}{D_{neg}} \right)\\ \end{align}

Note that \(\log(\frac{A}{B})\) is the same as \(\log(A) - \log(B)\). So the logprior can also be calculated as the difference between two logs:

\begin{align} \text{logprior} &= \log (P(D_{pos})) - \log (P(D_{neg})) \\ &= \log (D_{pos}) - \log (D_{neg})\\ \end{align}

I don't know that this helps any with computation, but it makes it clearer (to me) that the logprior will be positive when the training set has more positive tweets than negative ones and negative when it has more negative tweets - it captures which sentiment is more likely before we look at any words.
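As a minimal sketch in python (with made-up tweet counts rather than the NLTK data), computing the logprior is just a matter of counting the documents in each class:

from math import log

# hypothetical counts - in practice these come from the labeled training set
d_pos = 3000  # number of positive tweets
d_neg = 2000  # number of negative tweets

# log(D_pos/D_neg), or equivalently log(D_pos) - log(D_neg)
logprior = log(d_pos) - log(d_neg)
print(logprior)  # log(3000/2000) ≈ 0.405 - positive tweets are more common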

Positive and Negative Word Probabilities

Now for the second part of our equation. To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs (there's a quick sketch of how to compute them right after the list):

  • \(freq_{pos} =\) the number of times the word appears in documents with a label of 1 (positive)
  • \(freq_{neg} =\) the number of times the word appears in documents with a label of 0 (negative)
  • \(N_{pos} =\) the total number of words in all the positive documents
  • \(N_{neg} =\) the total number of words in all the negative documents
  • \(V =\) the number of unique words in the entire set of documents (the vocabulary)
  • \(W =\) a word in a document
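These are all simple counts. As a minimal sketch (using a couple of made-up tokenized tweets rather than the actual NLTK tweets), they might be computed like this:

from collections import Counter

# hypothetical tokenized training tweets - label 1 is positive, label 0 is negative
positive_docs = [["great", "day"], ["love", "this", "day"]]
negative_docs = [["horrible", "day"], ["this", "is", "awful"]]

# freq_pos and freq_neg: per-word counts within each class
freq_pos = Counter(word for doc in positive_docs for word in doc)
freq_neg = Counter(word for doc in negative_docs for word in doc)

# N_pos and N_neg: total number of words in each class
n_pos = sum(freq_pos.values())
n_neg = sum(freq_neg.values())

# V: the number of unique words across all the documents
vocabulary = set(freq_pos) | set(freq_neg)
v = len(vocabulary)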

So now we can re-write our numerator and denominator for the second term.

\begin{align} P(W|positive) &= P(W_{pos})\\ &= \frac{freq_{pos}}{N_{pos}}\\ \end{align} \begin{align} P(W | negative ) &= P(W_{neg})\\ &= \frac{freq_{neg}}{N_{neg}}\\ \end{align}

Meaning that the likelihood of a word given the class is the number of times the word shows up in documents of that class divided by the total number of words in documents of that class. One thing to notice, though, is that our numerators count a word within documents labeled with one classification, but it's not guaranteed that every word will show up in both classes (the word "horrible" might only show up in the negative tweets, for instance). If a word shows up in one class but not the other, one of its probabilities will be zero, and since we're going to take the ratio of the two probabilities and then a logarithm, that's a problem - division by zero isn't defined and neither is the logarithm of zero. The solution is add-one (Laplace) smoothing: add 1 to each numerator and the size of the vocabulary to each denominator (since we're adding 1 to the count of every word in the vocabulary). Besides fixing our arithmetic problem there are some other, more mathy, reasons for doing this that are explained in this wikipedia article.

With those changes we now have:

\begin{align} P(W_{pos}) &= \frac{freq_{pos} + 1}{N_{pos} + V}\\ \end{align} \begin{align} P(W_{neg}) &= \frac{freq_{neg} + 1}{N_{neg} + V}\\ \end{align}

And the log-likelihood term becomes:

\begin{align} \text{loglikelihood} &= \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\\ &= \log P(W_{pos}) - \log P(W_{neg})\\ &= \log \frac{freq_{pos} + 1}{N_{pos} + V} - \log \frac{freq_{neg} + 1}{N_{neg} + V} \end{align}
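Putting the smoothing and the log transform together, here's a sketch of the loglikelihood calculation for every word in the vocabulary. The counts are freshly made-up values so the snippet stands on its own, but they play the same roles as \(freq_{pos}\), \(freq_{neg}\), \(N_{pos}\), \(N_{neg}\), and V above:

from math import log

# made-up word counts within each class for a tiny vocabulary
freq_pos = {"great": 2, "day": 3, "horrible": 0}
freq_neg = {"great": 0, "day": 1, "horrible": 2}
vocabulary = set(freq_pos) | set(freq_neg)

n_pos = sum(freq_pos.values())  # total words in the positive documents
n_neg = sum(freq_neg.values())  # total words in the negative documents
v = len(vocabulary)             # size of the vocabulary

loglikelihood = {}
for word in vocabulary:
    # add-one (Laplace) smoothing keeps every probability away from zero
    p_w_pos = (freq_pos.get(word, 0) + 1) / (n_pos + v)
    p_w_neg = (freq_neg.get(word, 0) + 1) / (n_neg + v)
    loglikelihood[word] = log(p_w_pos) - log(p_w_neg)

print(loglikelihood["horrible"])  # negative - the word leans toward the negative class

The logprior and this loglikelihood dictionary are all the trained model needs - the classification itself is just the sign check from the Ratios section.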

End

Now that we have the math worked out, I'm going to implement the model using python in this post.