Implementing Logistic Regression for Tweet Sentiment Analysis

Beginning

In the previous post in this series (The Tweet Vectorizer) I transformed the tweet data into vectors based on the sums of the positive and negative counts for the tokens in each tweet. This post trains a Logistic Regression model on those vectors to classify tweets by sentiment.

Set Up

Imports

# from python
from argparse import Namespace
from functools import partial
from pathlib import Path
from typing import Union

import math
import os
import pickle

# from pypi
from bokeh.models.tools import HoverTool
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
    equal
)
from nltk.corpus import twitter_samples
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegressionCV

import holoviews
import hvplot.pandas
import nltk
import numpy
import pandas

# this package
from neurotic.nlp.twitter.counter import WordCounter
from neurotic.nlp.twitter.sentiment import TweetSentiment
from neurotic.nlp.twitter.vectorizer import TweetVectorizer

# for plotting
from graeae import EmbedHoloviews, Timer

The Timer

TIMER = Timer()

The Dotenv

This loads the locations of previous data and object saves I made.

load_dotenv("posts/nlp/.env")

The Data

I made the vectors earlier, but since I need the TweetVectorizer to process new tweets anyway, I'm going to reprocess everything here.

train_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser())

test_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TEST_RAW"]).expanduser()
)

print(f"Training: {len(train_raw):,}")
print(f"Testing: {len(test_raw):,}")
Training: 8,000
Testing: 2,000
columns = "bias positive negative".split()
counter = WordCounter(train_raw.tweet, train_raw.label)
train_vectorizer = TweetVectorizer(train_raw.tweet, counter.counts, processed=False)
test_vectorizer = TweetVectorizer(test_raw.tweet, counter.counts, processed=False)

It's easier to explore the data as a DataFrame, though, and since I've been going back and fiddling with different parts of the pipeline, not all of the saved data files are up to date, so it's safer to rebuild everything from the raw files.

training = pandas.DataFrame(train_vectorizer.vectors, columns=columns)
testing = pandas.DataFrame(test_vectorizer.vectors, columns=columns)

training["sentiment"] = train_raw.label
testing["sentiment"] = test_raw.label

print(f"Training: {len(training):,}")
print(f"Testing: {len(testing):,}")
Training: 8,000
Testing: 2,000

For Plotting

SLUG = "implementing-twitter-logistic-regression"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}")

with Path(os.environ["TWITTER_PLOT"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)

Types

Some stuff for type hinting.

Tweet = Union[numpy.ndarray, float]
PositiveProbability = Tweet

Middle

Logistic Regression

Now that we have the data it's time to implement the Logistic Regression model to classify tweets as positive or negative.

The Sigmoid Function

Logistic Regression uses a version of the Sigmoid Function called the Standard Logistic Function to turn an input into a probability that it belongs to the positive class. This is the mathematical definition:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Here \(z = x \cdot \theta\), the dot product of a tweet's feature vector and the model's weights. The numerator (1) sets the maximum value for the function, so the range is from 0 to 1 and we can interpret \(\sigma(z)\) as the probability that the tweet is classified as 1 (having a positive sentiment). So we could re-write this as:

\[ P(Y=1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \]

Where \(x_1\) is the sum of the positive counts for the tokens in \(x\) and \(x_2\) is the sum of the negative counts. \(\beta_0\) is our bias and \(\beta_1\) and \(\beta_2\) are the weights that we're going to find by training our model.
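To make this concrete, here's a minimal sketch of how a tweet vector and the weights combine into \(z\) before the logistic function squashes it into a probability. The feature values and weights here are made up for illustration, not taken from the trained model.

x = numpy.array([1.0, 3020.0, 61.0])           # [bias, positive-count sum, negative-count sum]
theta = numpy.array([0.0, 0.00054, -0.00056])  # made-up weights: [beta_0, beta_1, beta_2]
z = x.dot(theta)                               # beta_0 + beta_1 * x_1 + beta_2 * x_2
print(1/(1 + numpy.exp(-z)))                   # P(Y=1 | x), well above 0.5 for this example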

def sigmoid(z: Tweet) -> PositiveProbability:
    """Calculates the logistic function value

    Args:
     z: input to the logistic function (float or array)

    Returns:
     calculated sigmoid for z
    """
    return 1/(1 + numpy.exp(-z))
  • A Little Test

    We have a couple of given values to test that our sigmoid is correct.

    expect(sigmoid(0)).to(equal(0.5))
    
    expect(math.isclose(sigmoid(4.92), 0.9927537604041685)).to(be_true)
    
    expected = numpy.array([0.5, 0.9927537604041685])
    actual = sigmoid(numpy.array([0, 4.92]))
    
    expect(all(actual==expected)).to(be_true)
    
  • Plotting It

    Let's see what the output looks like.

    min_x = -6
    max_x = 6
    
    x = numpy.linspace(min_x, max_x)
    y = sigmoid(x)
    halfway = sigmoid(0)
    
    plot_data = pandas.DataFrame.from_dict(dict(x=x, y=y))
    curve = plot_data.hvplot(x="x", y="y", color=Plot.color_cycle)
    
    line = holoviews.Curve([(min_x, halfway), (max_x, halfway)], color=Plot.tan)
    
    plot = (curve * line).opts(
        width=Plot.width,
        height=Plot.height,
        fontscale=Plot.font_scale,
        title="Sigmoid",
        show_grid=True,
    )
    
    embedded = Embed(plot=plot, file_name="sigmoid_function")
    output = embedded()
    
    print(output)
    

    Figure Missing

    Looking at the plot you can see that the probability that a tweet is positive is 0.5 when the input is 0, becomes more likely the more positive the input is, and is less likely the more negative an input is. Next we'll need to look at how to train our model.

The Loss Function

To train our model we need a way to measure how well (or in this case how poorly) it's doing. For this we'll use the Log Loss function, which is the negative logarithm of the probability the model assigns to the correct label - so for each tweet we'll calculate \(\sigma\) (the probability that it's positive) and take the negative logarithm of either \(\sigma\) or \(1 - \sigma\), depending on the label, to get the log-loss.

The formula for loss:

\[ Loss = - \left( y\log (p) + (1-y)\log (1-p) \right) \]

\(y\) is the label for the tweet (1 or 0), so when the tweet is labeled 1 (positive) the right term becomes 0, and when it's labeled 0 (negative) the left term becomes 0, making this the equivalent of:

if y == 1:
    loss = -log(p)
else:
    loss = -log(1 - p)

Where \(p\) is the probability that the tweet is positive and \(1 - p\) is the probability that it isn't (so it's negative since that's the only alternative). We take the negative of the logarithm because \(log(p)\) is negative (all the values of \(p\) are between 0 and 1) so negating it makes the output positive.
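As a quick numeric check of the two branches (the probability here is just a made-up example):

p = 0.9                   # the model says the tweet is 90% likely to be positive
print(-numpy.log(p))      # y=1: about 0.105, small because the model is confident and right
print(-numpy.log(1 - p))  # y=0: about 2.303, large because the model is confident and wrong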

We can fill it in to make it match what we're actually going to calculate - for the \(i^{th}\) item in our dataset \(p = \sigma(x^{(i)} \cdot \theta)\) and the equation becomes:

\[ Loss = - \left( y^{(i)}\log (\sigma(x^{(i)} \cdot \theta)) + (1-y^{(i)})\log (1-\sigma(x^{(i)} \cdot \theta)) \right) \]

epsilon = 1e-3
steps = 10**3
probabilities = numpy.linspace(epsilon, 1, num=steps)
losses = -1 * numpy.log(probabilities)
data = pandas.DataFrame.from_dict({
    "p": probabilities,
    "Log-Loss": losses 
})

plot = data.hvplot(x="p", y="Log-Loss", color=Plot.blue).opts(
    title="Log-Loss (Y=1)",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    ylim=(0, losses.max())
)

output = Embed(plot=plot, file_name="log_loss_example")()
print(output)

Figure Missing

So what is this telling us? This is for the case where a tweet is labeled positive. At the far left, near 0 (\(\log(0)\) is undefined, so you can use a really small probability but not 0), the model is saying the tweet probably isn't positive, so the log-loss is fairly high. As we move along the x-axis the model gives more and more probability to the tweet being positive, so the log-loss goes down, until we reach the point where the model says it's 100% guaranteed to be a positive tweet, at which point the log-loss drops to zero. Fairly intuitive.

Let's look at the case where the tweet is actually negative (y=0). Since p is the probability that it's positive, when the label is 0 we need to take the log of 1-p to see what the model thinks the probability is that it's negative.

epsilon = 1e-3
steps = 10**3
probabilities = numpy.linspace(epsilon, 1-epsilon, num=steps)
losses = -1 * (numpy.log(1 - probabilities))
data = pandas.DataFrame.from_dict({
    "p": probabilities,
    "Log-Loss": losses 
})

plot = data.hvplot(x="p", y="Log-Loss", color=Plot.blue).opts(
    title="Log-Loss (Y=0)",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    ylim=(0, losses.max())
)

output = Embed(plot=plot, file_name="log_loss_y_0_example")()
print(output)

Figure Missing

So now we have basically the opposite loss. In this case the tweet is not positive so when the model puts a low likelihood that the tweet is positive the log-loss is small, but as you move along the x-axis the model is giving more probability to the notion that the tweet is positive so the log-loss gets larger.

Training the Model

To train the model we're going to use Gradient Descent. What this means is that we're going to use the gradient of our loss function to figure out how to update our weights. The gradient is just the slope of the loss-function (but generalized to multiple dimensions).

How do we do this? First we calculate our model's estimate that the input is positive, then we calculate the gradient of its loss. If you remember from calculus, the slope of a curve is the derivative of its function, so rather than the loss itself we need the partial derivative of the loss function with respect to each weight, which is given as:

\[ \nabla_{\theta_j}L_{\theta} = \left [ \sigma(x \cdot \theta) - y \right] x_j \]

The rightmost term \(x_j\) represents one entry in the input vector, the one that matches the weight being updated - this has to be repeated for each \(\beta\) in \(\theta\), so in our case three times, with \(x_j\) being 1 for the bias term.

When the inputs are fed in one at a time, chosen at random from the training set, this is called stochastic gradient descent. That turns out not to give a smooth descent, so we're going to do batch training instead, which changes our gradient a little.

\[ \nabla_{\theta_j}L_{\theta} = \frac{1}{m} \sum_{i=1}^m \left(\sigma(x^{(i)} \cdot \theta)-y^{(i)}\right)x_j^{(i)} \]

Our gradient is now the average of the gradients over all the inputs in our training set. We update the weights by subtracting a fraction of the gradient from the current weights. The fraction \(\eta\) is called the learning rate and it controls how much the weights change, representing how fast our model will learn. If it is too large we can overshoot the minimum and if it's too small it will take too long to train the model, so we need to choose a value that reaches the minimum within a feasible time.
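Here's a minimal sketch of a single batch update using made-up numbers, just to show the shapes and the update rule (the actual implementation below arranges the same calculation a little differently).

# two fake tweet vectors of [bias, positive, negative] and their labels
X = numpy.array([[1.0, 10.0, 2.0],
                 [1.0, 1.0, 12.0]])
y = numpy.array([[1.0], [0.0]])
theta = numpy.zeros((3, 1))
eta = 0.01
m = len(X)

y_hat = 1/(1 + numpy.exp(-X.dot(theta)))  # sigma(x . theta) for every row, shape (m, 1)
gradient = X.T.dot(y_hat - y)/m           # average gradient, shape (3, 1)
theta = theta - eta * gradient            # step in the opposite direction of the gradient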

Here's the algorithm in the rough.

  • L: Loss Function
  • \(\sigma\): probability function parameterized by \(\theta\)
  • x: set of training inputs
  • y: set of training labels
\begin{algorithm}
\caption{Gradient Descent}
\begin{algorithmic}
\STATE $\theta \gets 0$
\WHILE{not done}

 \FOR{each $(x^{(i)},y^{(i)})$ in training data}
  \STATE $\hat{y}^{(i)} \gets \sigma(x^{(i)}; \theta)$
  \STATE $loss \gets L(\hat{y}^{(i)}, y^{(i)})$
  \STATE $g \gets \nabla_{\theta} L(\hat{y}^{(i)}, y^{(i)})$
  \STATE $\theta \gets \theta - \eta g$
 \ENDFOR

\ENDWHILE
\end{algorithmic}
\end{algorithm}

We can translate this a little more.

\begin{algorithm}
\caption{Gradient Descent}
\begin{algorithmic}
\STATE Initialize the weights
\WHILE{the loss is still too high}

 \FOR{each $(x^{(i)},y^{(i)})$ in training data}
  \STATE What is our probability that the input is positive?
  \STATE How far off are we?
  \STATE What direction would we need to head to maximize the error?
  \STATE Let's go in the opposite direction.
 \ENDFOR

\ENDWHILE
\end{algorithmic}
\end{algorithm}

Note that the losses aren't needed for the algorithm to train the model, just for assessing how well the model did.

Implement It

  • The Function
    def gradient_descent(x: numpy.ndarray, y: numpy.ndarray,
                         weights: numpy.ndarray, learning_rate: float,
                         iterations: int=1):
        """Finds the weights for the model
    
        Args:
         x: the tweet vectors
         y: the positive/negative labels
         weights: the regression weights
         learning_rate: (eta) how much to update the weights
         iterations: the number of times to repeat training

        Returns:
         (final loss, trained weights, list of the loss from each iteration)
        """
        assert len(x) == len(y)
        rows = len(x)
        losses = []
        learning_rate /= rows
        for iteration in range(iterations):
            # probability that each tweet is positive: sigma(x . theta)
            y_hat = sigmoid(x.dot(weights))
            # average loss over the batch (tracked for plotting, not needed for the update)
            loss = numpy.squeeze(-((y.T.dot(numpy.log(y_hat))) +
                                   (1 - y.T).dot(numpy.log(1 - y_hat))))/rows
            losses.append(loss)
            # batch gradient, transposed to match the shape of the weights
            gradient = ((y_hat - y).T.dot(x)).sum(axis=0, keepdims=True)
            weights -= learning_rate * gradient.T
        return loss, weights, losses
    

    If you look at the implementation you can see that it differs a bit from what I wrote earlier. That's because the pseudocode came from a book, while the implementation came from a Coursera assignment. The main differences are that we train for a set number of iterations rather than until the loss is low enough, and that the learning rate gets divided by the number of training examples (you could just divide the learning rate before passing it in, so this doesn't change much). I also had to account for the fact that you can't take the dot product of two matrices unless their shapes are compatible - the number of columns of the left-hand matrix has to match the number of rows of the right-hand matrix - so there's some transposing of matrices going on (there's a quick shape check after the test run below). Our actual implementation might be more like this.

    \begin{algorithm}
    \caption{Gradient Descent Implemented}
    \begin{algorithmic}
    \STATE $\theta \gets 0$
    \STATE $m \gets rows(X)$
    \FOR{$iteration \in$ \{0 $\ldots iterations-1$ \}}
      \STATE $\hat{Y} \gets \sigma(X \cdot \theta)$
      \STATE $loss \gets -\frac{1}{m}\left[ (Y^T \cdot \ln \hat{Y}) + (1 - Y)^T \cdot \ln (1 - \hat{Y}) \right]$
      \STATE $\nabla \gets \sum (\hat{Y} - Y)^T \cdot X$
      \STATE $\theta \gets \theta - \frac{\eta}{m} \nabla^T$
     \ENDFOR
    \end{algorithmic}
    \end{algorithm}
    
  • Test It

    First we'll make a fake (random) input set to make it easier to check the gradient descent.

    numpy.random.seed(1)
    bias = numpy.ones((10, 1))
    fake = numpy.random.rand(10, 2) * 2000
    fake_tweet_vectors = numpy.append(bias, fake, axis=1)
    

    Now, the fake labels - we'll make around 35% of them negative and the rest positive.

    fake_labels = (numpy.random.rand(10, 1) > 0.35).astype(float)
    
  • Do the Descent

    So now we can pass our test data into the gradient descent function and see what happens.

    fake_weights = numpy.zeros((3, 1))
    fake_loss, fake_weights, losses = gradient_descent(x=fake_tweet_vectors,
                                               y=fake_labels, 
                                               weights=fake_weights,
                                               learning_rate=1e-8,
                                               iterations=700)
    expect(math.isclose(fake_loss, 0.67094970, rel_tol=1e-8)).to(be_true)
    print(f"The log-loss after training is {fake_loss:.8f}.")
    print(f"The trained weights are {[round(t, 8) for t in numpy.squeeze(fake_weights)]}")
    
    The log-loss after training is 0.67094970.
    The trained weights are [4.1e-07, 0.00035658, 7.309e-05]
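
    Before moving on to the real data, here's the quick shape check mentioned earlier, using the fake data from above, to make the transposing in the implementation concrete.

    print(fake_tweet_vectors.shape)   # (10, 3): one row per fake tweet, one column per weight
    print(fake_labels.shape)          # (10, 1)
    print(fake_weights.shape)         # (3, 1)
    y_hat = sigmoid(fake_tweet_vectors.dot(fake_weights))       # (10, 3) . (3, 1) -> (10, 1)
    gradient = (y_hat - fake_labels).T.dot(fake_tweet_vectors)  # (1, 10) . (10, 3) -> (1, 3)
    print(gradient.T.shape)           # (3, 1), matching the weights after the transpose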
    

Train the Model

Now that we have our parts, let's actually train the model on the real training data. I originally wrote this expecting numpy arrays (just as in earlier steps I was expecting python lists instead of numpy arrays - things change), so I'll pass in the vectorizer's array directly and pull the labels back out of the pandas DataFrame as an array.

weights = numpy.zeros((3, 1))
eta = 1e-9
iterations = 1500
with TIMER:
    final_loss, weights, losses = gradient_descent(
        x=train_vectorizer.vectors,
        y=training.sentiment.values.reshape((-1, 1)), weights=weights,
        learning_rate=eta, iterations=iterations)

print(f"The log-loss after training is {final_loss:.8f}.")
print(f"The resulting vector of weights is "
      f"{[round(t, 8) for t in numpy.squeeze(weights)]}")

model = TweetSentiment(train_vectorizer, weights)
predictions = model()

correct = sum(predictions.T[0] == training.sentiment)
print(f"Training Accuracy: {correct/len(training)}")
2020-07-27 17:54:58,357 graeae.timers.timer start: Started: 2020-07-27 17:54:58.357765
2020-07-27 17:54:58,776 graeae.timers.timer end: Ended: 2020-07-27 17:54:58.776834
2020-07-27 17:54:58,777 graeae.timers.timer end: Elapsed: 0:00:00.419069
The log-loss after training is 0.22043072.
The resulting vector of weights is [6e-08, 0.00053899, -0.0005613]
Training Accuracy: 0.997625
plot_losses = pandas.DataFrame.from_dict({"Log-Loss": losses})
plot = plot_losses.hvplot().opts(title="Training Losses",
                            width=Plot.width,
                            height=Plot.height,
                            fontscale=Plot.font_scale,
                            color=Plot.blue
                            )

output = Embed(plot=plot, file_name="training_loss")()
print(output)

Figure Missing

As you can see, the losses are still on the decline, but we'll stop here to see how it's doing.

Test the Model

This is the class (used above to get the training accuracy) that predicts the sentiment of a tweet using our model's weights.

# pypi
import attr
import numpy

# this project
from .vectorizer import TweetVectorizer


@attr.s(auto_attribs=True)
class TweetSentiment:
    """Predicts the sentiment of a tweet

    Args:
     vectorizer: something to vectorize tweets
     theta: vector of weights for the logistic regression model
    """
    vectorizer: TweetVectorizer
    theta: numpy.ndarray

    def sigmoid(self, vectors: numpy.ndarray) -> numpy.ndarray:
        """the logistic function

       Args:
        vectors: a matrix of bias, positive, negative counts

       Returns:
        array of probabilities that the tweets are positive
       """
        return 1/(1 + numpy.exp(-vectors))

    def probability_positive(self, tweet: str) -> float:
        """Calculates the probability of the tweet being positive

       Args:
        tweet: a tweet to classify

       Returns:
        the probability that the tweet is a positive one
       """
        x = self.vectorizer.extract_features(tweet, as_array=True)
        return numpy.squeeze(self.sigmoid(x.dot(self.theta)))

    def classify(self, tweet: str) -> int:
        """Decides if the tweet was positive or not

       Args:
        tweet: the tweet message to classify.
       """
        return int(numpy.round(self.probability_positive(tweet)))

    def __call__(self) -> numpy.ndarray:
        """Get the sentiments of the vectorized tweets

       Note:
        this assumes that the vectorizer passed in has the tweets

       Returns:
        array of predicted sentiments (1 for positive 0 for negative)
       """
        return numpy.round(self.sigmoid(self.vectorizer.vectors.dot(self.theta)))
sentiment = TweetSentiment(test_vectorizer, weights)
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print(f'{tweet} -> {sentiment.probability_positive(tweet)}')
I am happy -> 0.5183237992258976
I am bad -> 0.4924963884222927
this movie should have been great. -> 0.5156997144475827
great -> 0.5158056039006712
great great -> 0.5315796358935646
great great great -> 0.5472908064541816
great great great great -> 0.5629083094155534

Strangely, these are all very near the center (0.5). Probably because the words weren't that commonly used in our training set.

totals = sum(counter.counts.values())
print(f"Great positive percentage: {100 * counter.counts[('great', 1)]/totals:.2f} %")
print(f"Great negative percentage: {100 * counter.counts[('great', 0)]/totals:.2f} % ")
Great positive percentage: 0.24 %
Great negative percentage: 0.03 % 

Now we can see how it did overall.

predictions = sentiment()
correct = sum(predictions.T[0] == testing.sentiment)
print(f"Accuracy: {correct/len(testing)}")
Accuracy: 0.996

Almost suspiciously good.
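
One quick sanity check is the class balance of the test set - if it were heavily skewed toward one label, a high accuracy would be less impressive. This just prints the proportions.

print(testing.sentiment.value_counts(normalize=True))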

The Wrong Stuff

wrong_places = predictions.T[0] != testing.sentiment
wrong = testing[wrong_places]
print(len(wrong))
8
for row in wrong.itertuples():
    print("*" * 10)
    print(f"Tweet number {row.Index}")
    raw = test_raw.iloc[row.Index]
    print(f"Tweet: {raw.tweet}")
    tokens = train_vectorizer.process(raw.tweet)
    print(f"Tokens: {tokens}")
    print(f"Probability Positive: {sentiment.probability_positive(raw.tweet)}")
    print(f"Actual Classification: {row.sentiment}")
    print()
    for token in tokens:
        print(f"{token} \tPositive: {counter.counts[(token, 1)]} "
              f"Negative: {counter.counts[(token, 0)]}")
    print()
**********
Tweet number 64
Tweet: @_sarah_mae omg you can't just tell this and don't say more :p can't wait to know !!!! ❤️
Tokens: ['omg', "can't", 'tell', 'say', ':p', "can't", 'wait', 'know', '❤', '️']
Probability Positive: 0.48137283482824483
Actual Classification: 1

omg     Positive: 11 Negative: 51
can't   Positive: 36 Negative: 145
tell    Positive: 20 Negative: 19
say     Positive: 48 Negative: 52
:p      Positive: 113 Negative: 0
can't   Positive: 36 Negative: 145
wait    Positive: 59 Negative: 37
know    Positive: 123 Negative: 100
❤       Positive: 18 Negative: 20
️       Positive: 9 Negative: 18

**********
Tweet number 118
Tweet: @bae_ts WHATEVER STIL L YOUNG >:-(
Tokens: ['whatev', 'stil', 'l', 'young', '>:-(']
Probability Positive: 0.5006402767570053
Actual Classification: 0

whatev  Positive: 5 Negative: 0
stil    Positive: 0 Negative: 0
l       Positive: 4 Negative: 1
young   Positive: 2 Negative: 3
>:-(         Positive: 0 Negative: 2

**********
Tweet number 435
Tweet: @wtfxmbs AMBS please it's harry's jeans :)):):):(
Tokens: ['amb', 'pleas', "harry'", 'jean', ':)', '):', '):', '):']
Probability Positive: 0.821626817973081
Actual Classification: 0

amb     Positive: 0 Negative: 0
pleas   Positive: 76 Negative: 215
harry'  Positive: 0 Negative: 1
jean    Positive: 0 Negative: 1
:)      Positive: 2967 Negative: 1
):      Positive: 7 Negative: 1
):      Positive: 7 Negative: 1
):      Positive: 7 Negative: 1

**********
Tweet number 458
Tweet: @GODDAMMlT SRSLY FUCK U UNFOLLOWER HOPE UR FUTURE CHILD UNPARENTS U >:-(
Tokens: ['srsli', 'fuck', 'u', 'unfollow', 'hope', 'ur', 'futur', 'child', 'unpar', 'u', '>:-(']
Probability Positive: 0.5157383070453547
Actual Classification: 0

srsli   Positive: 1 Negative: 4
fuck    Positive: 19 Negative: 48
u       Positive: 193 Negative: 162
unfollow        Positive: 55 Negative: 8
hope    Positive: 119 Negative: 77
ur      Positive: 28 Negative: 20
futur   Positive: 13 Negative: 1
child   Positive: 3 Negative: 3
unpar   Positive: 0 Negative: 0
u       Positive: 193 Negative: 162
>:-(         Positive: 0 Negative: 2

**********
Tweet number 493
Tweet: 5h + kids makes all ://:(\\\
Tokens: ['5h', 'kid', 'make', ':/']
Probability Positive: 0.5003797971971914
Actual Classification: 0

5h      Positive: 0 Negative: 0
kid     Positive: 17 Negative: 16
make    Positive: 87 Negative: 77
:/      Positive: 4 Negative: 8

**********
Tweet number 788
Tweet: i love got7's outfit for just right >:( its so fun
Tokens: ['love', 'got', '7', 'outfit', 'right', '>:(', 'fun']
Probability Positive: 0.5197464496373044
Actual Classification: 0

love    Positive: 306 Negative: 114
got     Positive: 55 Negative: 70
7       Positive: 5 Negative: 11
outfit  Positive: 3 Negative: 3
right   Positive: 41 Negative: 39
>:(  Positive: 0 Negative: 36
fun     Positive: 48 Negative: 26

**********
Tweet number 995
Tweet: I ATE YOUR LAST COOKIE SHIR0 >:D
Tokens: ['ate', 'last', 'cooki', 'shir', '0', '>:d']
Probability Positive: 0.4961173289819544
Actual Classification: 1

ate     Positive: 3 Negative: 8
last    Positive: 35 Negative: 58
cooki   Positive: 0 Negative: 2
shir    Positive: 0 Negative: 0
0       Positive: 1 Negative: 0
>:d  Positive: 3 Negative: 0

**********
Tweet number 1662
Tweet: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
Tokens: ['sr', 'financi', 'analyst', 'expedia', 'inc', 'bellevu', 'wa', 'financ', 'expediajob', 'job', 'job', 'hire']
Probability Positive: 0.5038917149486426
Actual Classification: 0

sr      Positive: 0 Negative: 1
financi         Positive: 0 Negative: 0
analyst         Positive: 0 Negative: 0
expedia         Positive: 0 Negative: 0
inc     Positive: 1 Negative: 2
bellevu         Positive: 0 Negative: 0
wa      Positive: 0 Negative: 0
financ  Positive: 0 Negative: 0
expediajob      Positive: 0 Negative: 0
job     Positive: 28 Negative: 12
job     Positive: 28 Negative: 12
hire    Positive: 0 Negative: 0

It looks like these were tweets with uncommon tokens. I'm not sure what to make of some of them myself, and I'm not sure about some of the labels either - why is a job posting considered a negative tweet?

Some Fresh Tweets

First, someone reacting to a post about the Clown Motel in Tonopah, Nevada (the previous link goes to Atlas Obscura, but the tweet was a reaction to a thrillist post).

tweet = "Nah dude. I drove by that at night and it was the creepiest thing ever. The whole town gave me bad vibes. I still shudder when I think about it."
print(f"Classified as {sentiments[sentiment.classify(tweet)]}")
Classified as negative

Seems reasonable.

tweet = "This is just dope. Quaint! I’d love to have an ironic drive-in wedding in Las Vegas and then stay in a clown motel as newly weds for one night. I bet they have Big Clown Suits for newly weds, haha."

print(f"Classified as {sentiments[sentiment.classify(tweet)]}")
Classified as positive

Compare to SKLearn

columns = "bias positive negative".split()
classifier = LogisticRegressionCV(
    random_state=2020,
    max_iter=1500,
    scoring="neg_log_loss").fit(training[columns], training.sentiment)

predictions = classifier.predict(testing[columns]).reshape((-1, 1))
correct = sum(predictions == testing.sentiment.values.reshape((-1, 1)))
print(f"Accuracy: {correct[0]/len(testing)}")
Accuracy: 0.995

So it did pretty much the same using just the default parameters. We could probably do a parameter search, but this is okay for now.
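
Since GridSearchCV is already imported, here's a rough sketch of what such a search might look like - the grid of C values is arbitrary, just to show the mechanics.

# sketch of a parameter search over the regularization strength (C values chosen arbitrarily)
from sklearn.linear_model import LogisticRegression

search = GridSearchCV(
    LogisticRegression(max_iter=1500),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="accuracy",
    cv=5,
)
search.fit(training[columns], training.sentiment)
print(search.best_params_)
print(search.score(testing[columns], testing.sentiment))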

Visualizing the Model

Since we have the model's weights we can plot the boundary it uses to separate the data. The model predicts positive when \(\sigma(\theta \cdot x) > 0.5\), which happens when \(\theta \cdot x > 0\), so the separation line is where \(\theta \cdot x = 0\) (with x being our vector \(\langle bias, positive, negative \rangle\)). To get the equation for the line we solve for the negative term.

Get ready for some algebra.

\begin{align} \theta \cdot x &= 0\\ \theta \cdot \langle bias, positive, negative \rangle &= 0\\ \theta \cdot \langle 1, positive, negative \rangle &= 0\\ \theta_0 + \theta_1 \cdot positive + \theta_2 \cdot negative &= 0\\ \theta_2 \cdot negative &= -\theta_0 - \theta_1 \cdot positive\\ negative &= \frac{-\theta_0 - \theta_1 \cdot positive}{\theta_2}\\ \end{align}

This is the equation for our separation line (on our plot positive is the x-axis and negative is the y-axis), which we can translate into a function to apply to our data.

def negative(theta: list, positive: float) -> float:
    """Calculate the negative value

    This calculates the value for the separation line

    Args:
     theta: list of weights for the logistic regression
     positive: count of positive tweets matching tweet

    Returns:
     the calculated negative value for the separation line
    """
    return (-theta.bias
            - positive * theta.positive)/theta.negative

theta = pandas.DataFrame(weights.T, columns = columns)
negative_ = partial(negative, theta=theta)

We plotted the vectorized data before; now we can add our regression line.

hover = HoverTool(
    tooltips = [
        ("Positive", "@positive{0,0}"),
        ("Negative", "@negative{0,0}"),
        ("Sentiment", "@Sentiment"),
    ]
)


training["regression negative"] = training.positive.apply(
    lambda positive: negative_(positive=positive))

line = training.hvplot(x="positive", y="regression negative", color=Plot.tan)
scatter = training.hvplot.scatter(x="positive", y="negative", by="sentiment",
                                  fill_alpha=0,
                                  color=Plot.color_cycle, tools=[hover]).opts(
                                      height=Plot.height,
                                      width=Plot.width,
                                      fontscale=Plot.font_scale,
                                      title="Positive vs Negative Tweet Sentiment",
                                  )

plot = scatter * line
output = Embed(plot=plot, file_name="positive_negative_scatter_with_model")()
print(output)

Figure Missing

Let's see if a log-log scale helps.

line = training.hvplot(x="positive", y="regression negative", color=Plot.tan)
scatter = training.hvplot.scatter(x="positive", y="negative", by="sentiment",
                                  fill_alpha=0,
                                  color=Plot.color_cycle, tools=[hover])

plot = (scatter * line).opts(
    height=Plot.height,
    width=Plot.width,
    xrotation=45,
    fontscale=Plot.font_scale,
    title="Positive vs Negative Tweet Sentiment",
    logx=True,
    logy=True,
)
output = Embed(plot=plot, file_name="positive_negative_scatter_log")()
print(output)

Figure Missing

The log scale seems to break the auto-scaling of the plot, so you'll have to zoom out a little (with the Wheel Zoom tool on the toolbar). Once you do, you can see that the model did a pretty good job of separating the positive from the negative tweets. Some of the points aren't really linearly separable using our vectors, though, so this is probably about as good as it can get.

End

This concludes the series begun with the post on pre-processing tweets.

I should mention that I used Speech and Language Processing to help understand the math.