Implementing Logistic Regression for Tweet Sentiment Analysis
Table of Contents
Beginning
In the previous post in this series (The Tweet Vectorizer) I transformed some tweet data to vectors based on the sums of the positive and negative tokens in each tweet. This post will implement a Logistic Regression model to train on those vectors to classify tweets by sentiment.
Set Up
Imports
# from python
from argparse import Namespace
from functools import partial
from pathlib import Path
from typing import Union
import math
import os
import pickle
# from pypi
from bokeh.models.tools import HoverTool
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
    equal
)
from nltk.corpus import twitter_samples
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
import holoviews
import hvplot.pandas
import nltk
import numpy
import pandas
# this package
from neurotic.nlp.twitter.counter import WordCounter
from neurotic.nlp.twitter.sentiment import TweetSentiment
from neurotic.nlp.twitter.vectorizer import TweetVectorizer
# for plotting
from graeae import EmbedHoloviews, Timer
The Timer
TIMER = Timer()
The Dotenv
This loads the locations of previous data and object saves I made.
load_dotenv("posts/nlp/.env")
The Data
I made vectors earlier but to process new tweets I need the Twitter Vectorizer anyway, so I'm going to reprocess everything here.
train_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser())
test_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TEST_RAW"]).expanduser())
print(f"Training: {len(train_raw):,}")
print(f"Testing: {len(test_raw):,}")
Training: 8,000
Testing: 2,000
columns = "bias positive negative".split()
counter = WordCounter(train_raw.tweet, train_raw.label)
train_vectorizer = TweetVectorizer(train_raw.tweet, counter.counts, processed=False)
test_vectorizer = TweetVectorizer(test_raw.tweet, counter.counts, processed=False)
It's easier to work with a DataFrame when exploring, though, and since I've been going back and fiddling with different parts of the pipeline, not all of the data files are up to date, so it's safer to start from the raw files again.
training = pandas.DataFrame(train_vectorizer.vectors, columns=columns)
testing = pandas.DataFrame(test_vectorizer.vectors, columns=columns)
training["sentiment"] = train_raw.label
testing["sentiment"] = test_raw.label
print(f"Training: {len(training):,}")
print(f"Testing: {len(testing):,}")
Training: 8,000
Testing: 2,000
For Plotting
SLUG = "implementing-twitter-logistic-regression"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}")
with Path(os.environ["TWITTER_PLOT"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)
Types
Some stuff for type hinting.
Tweet = Union[numpy.ndarray, float]
PositiveProbability = Tweet
Middle
Logistic Regression
Now that we have the data it's time to implement the Logistic Regression model to classify tweets as positive or negative.
The Sigmoid Function
Logistic Regression uses a version of the Sigmoid Function called the Standard Logistic Function to measure whether an entry has passed the threshold for classification. This is the mathematical definition:
\[ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = x \cdot \theta \]
The numerator (1) determines the maximum value for the function, so in this case the range is from 0 to 1 and we can interpret \(\sigma(z)\) as the probability that z (a vector representation of a tweet times the weights) is classified as 1 (having a positive sentiment). So we could re-write this as:
\[ P(Y=1 | z) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \]
Where \(x_1\) is the sum of the positive tweet counts for the tokens in \(x\) and \(x_2\) is the sum of the negative tweet counts for the tokens. \(\beta_0\) is our bias and \(\beta_1\) and \(\beta_2\) are the weights that we're going to find by training our model.
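Just to make the formula concrete, here's a quick back-of-the-envelope calculation with made-up weights and counts (these numbers are arbitrary assumptions, not values from our model):

# an illustrative calculation with made-up weights and count-sums
beta = (0.0, 0.0005, -0.0005)    # assumed bias, positive, and negative weights
x_1, x_2 = 100, 20               # assumed positive and negative count sums
z = beta[0] + beta[1] * x_1 + beta[2] * x_2
print(f"P(Y=1|z) = {1/(1 + math.exp(-z)):.3f}")

With a slightly larger positive sum than negative sum, z is slightly positive and the probability lands just above 0.5.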
def sigmoid(z: Tweet) -> PositiveProbability:
    """Calculates the logistic function value

    Args:
     z: input to the logistic function (float or array)

    Returns:
     calculated sigmoid for z
    """
    return 1/(1 + numpy.exp(-z))
- A Little Test
We have a couple of given values to test that our sigmoid is correct.
expect(sigmoid(0)).to(equal(0.5))
expect(math.isclose(sigmoid(4.92), 0.9927537604041685)).to(be_true)
expected = numpy.array([0.5, 0.9927537604041685])
actual = sigmoid(numpy.array([0, 4.92]))
expect(all(actual==expected)).to(be_true)
- Plotting It
Let's see what the output looks like.
min_x = -6
max_x = 6
x = numpy.linspace(min_x, max_x)
y = sigmoid(x)
halfway = sigmoid(0)
plot_data = pandas.DataFrame.from_dict(dict(x=x, y=y))
curve = plot_data.hvplot(x="x", y="y", color=Plot.color_cycle)
line = holoviews.Curve([(min_x, halfway), (max_x, halfway)], color=Plot.tan)
plot = (curve * line).opts(
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    title="Sigmoid",
    show_grid=True,
)
embedded = Embed(plot=plot, file_name="sigmoid_function")
output = embedded()
print(output)
Looking at the plot you can see that the probability that a tweet is positive is 0.5 when the input is 0, becomes more likely the more positive the input is, and is less likely the more negative an input is. Next we'll need to look at how to train our model.
The Loss Function
To train our model we need a way to measure how well (or in this case poorly) it's doing. For this we'll use the Log Loss function which is the negative logarithm of our probability - so for each tweet, we'll calculate \(\sigma\) (which is the probability that it's positive) and take the negative logarithm of it to get the log-loss.
The formula for loss:
\[ Loss = - \left( y\log (p) + (1-y)\log (1-p) \right) \]
\(y\) is the classification of the tweet (1 or 0) so when the tweet is classified 1 (positive) the right term becomes 0 and when the tweet is classified 0 (negative) the left term becomes 0 so this is the equivalent of:
if y == 1:
    loss = -log(p)
else:
    loss = -log(1 - p)
Where \(p\) is the probability that the tweet is positive and \(1 - p\) is the probability that it isn't (so it's negative since that's the only alternative). We take the negative of the logarithm because \(log(p)\) is negative (all the values of \(p\) are between 0 and 1) so negating it makes the output positive.
We can fill it in to make it match what we're going to actually calculate - for the \(i^{th}\) item in our dataset \(p = \sigma(x^{(i)} \cdot \theta)\) and the equation becomes:

\[ Loss = - \left( y^{(i)}\log (\sigma(x^{(i)} \cdot \theta)) + (1-y^{(i)})\log (1-\sigma(x^{(i)} \cdot \theta)) \right) \]
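To check that the loss behaves the way we expect, here's a quick look at both cases with arbitrary probabilities (just a sketch, the numbers are made up):

# checking the two loss cases with made-up probabilities
for y, p in ((1, 0.99), (1, 0.01), (0, 0.99), (0, 0.01)):
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    print(f"y={y}, p={p}: loss={loss:.3f}")

A confident prediction that matches the label gives a loss near zero, while a confident prediction that misses the label gives a large one.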
epsilon = 1e-3
steps = 10**3
probabilities = numpy.linspace(epsilon, 1, num=steps)
losses = -1 * numpy.log(probabilities)
data = pandas.DataFrame.from_dict({
    "p": probabilities,
    "Log-Loss": losses
})
plot = data.hvplot(x="p", y="Log-Loss", color=Plot.blue).opts(
    title="Log-Loss (Y=1)",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    ylim=(0, losses.max())
)
output = Embed(plot=plot, file_name="log_loss_example")()
print(output)
So what is this telling us? This is the case where a tweet is labeled positive. At the far left, near 0 (\(\log(0)\) is undefined, so we use a really small probability rather than 0), our model is saying that it probably isn't a positive tweet, so the log-loss is fairly high. As we move along the x-axis the model says it is more and more likely that the tweet is positive, so the log-loss goes down, until we reach the point where the model says it's 100% guaranteed to be a positive tweet, at which point the log-loss drops to zero. Fairly intuitive.
Let's look at the case where the tweet is actually negative (y=0). Since p is the probability that it's positive, when the label is 0 we need to take the log of 1-p to see what the model thinks the probability is that it's negative.
epsilon = 1e-3
steps = 10**3
probabilities = numpy.linspace(epsilon, 1-epsilon, num=steps)
losses = -1 * (numpy.log(1 - probabilities))
data = pandas.DataFrame.from_dict({
    "p": probabilities,
    "Log-Loss": losses
})
plot = data.hvplot(x="p", y="Log-Loss", color=Plot.blue).opts(
    title="Log-Loss (Y=0)",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    ylim=(0, losses.max())
)
output = Embed(plot=plot, file_name="log_loss_y_0_example")()
print(output)
So now we have basically the opposite loss. In this case the tweet is not positive so when the model puts a low likelihood that the tweet is positive the log-loss is small, but as you move along the x-axis the model is giving more probability to the notion that the tweet is positive so the log-loss gets larger.
Training the Model
To train the model we're going to use Gradient Descent. What this means is that we're going to use the gradient of our loss function to figure out how to update our weights. The gradient is just the slope of the loss-function (but generalized to multiple dimensions).
How do we do this? First we calculate our model's estimate of the probability that the input is positive, then we calculate the gradient of its loss. If you remember from calculus, the slope of a curve is the derivative of its function, so instead of the loss itself we'll calculate the derivative of the loss function, which is given as:
\[ \nabla_{\theta_j}L_{\theta} = \left [ \sigma(x \cdot \theta) - y \right] x_j \]

The rightmost term \(x_j\) represents one entry in the input vector, the one that matches the weight \(\theta_j\) - this has to be repeated for each \(\beta\) in \(\theta\), so in our case it will be repeated three times, with \(x_j\) being 1 for the bias term.
When the inputs are chosen randomly from the training set, one at a time, this is called stochastic gradient descent. That turns out not to give you a smooth descent, so we're going to do batch training instead, which changes our gradient a little.
\[ \nabla_{\theta_j}L_{\theta} = \frac{1}{m} \sum_{i=1}^m(\sigma(x \cdot \theta)-y)x_j \]
Our gradient is now the average of the gradients for each of the inputs in our training set. We update the weights by subtracting a fraction of the gradient from the current weights. The fraction \(\eta\) is called the learning rate and it controls how much the weights change, representing how fast our model will learn. If it is too large we can overshoot the minimum, and if it's too small it will take too long to train the model, so we need to choose the right value to reach the minimum within a feasible time.
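Here's what a single batch update looks like on a tiny made-up dataset, just as a sketch (all the values are arbitrary assumptions):

# one batch gradient-descent step on a tiny made-up dataset
x_toy = numpy.array([[1.0, 3.0, 1.0],
                     [1.0, 0.0, 4.0]])    # rows of (bias, positive, negative)
y_toy = numpy.array([[1.0], [0.0]])       # one positive, one negative tweet
theta_toy = numpy.zeros((3, 1))
eta_toy = 0.1
y_hat = sigmoid(x_toy.dot(theta_toy))               # starts at 0.5 for both
gradient = x_toy.T.dot(y_hat - y_toy)/len(x_toy)    # average gradient
theta_toy -= eta_toy * gradient                     # step against the gradient
print(theta_toy.T)

After one step the positive weight moves up and the negative weight moves down, which matches the intuition that positive counts should push the probability up.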
Here's the algorithm in the rough.
- L: Loss Function
- \(\sigma\): probability function parameterized by \(\theta\)
- x: set of training inputs
- y: set of training labels
\begin{algorithm}
\caption{Gradient Descent}
\begin{algorithmic}
\STATE $\theta \gets 0$
\WHILE{not done}
\FOR{each $(x^{(i)},y^{(i)})$ in training data}
\STATE $\hat{y} \gets \sigma(x^{(i)}; \theta)$
\STATE $loss \gets L(\hat{y}^{(i)}, y^{(i)})$
\STATE $g \gets \nabla_{\theta} L(\hat{y}^{(i)}, y^{(i)})$
\STATE $\theta \gets \theta - \eta g$
\ENDFOR
\ENDWHILE
\end{algorithmic}
\end{algorithm}
We can translate this a little more.
\begin{algorithm}
\caption{Gradient Descent}
\begin{algorithmic}
\STATE Initialize the weights
\WHILE{the loss is still too high}
\FOR{each $(x^{(i)},y^{(i)})$ in training data}
\STATE What is our probability that the input is positive?
\STATE How far off are we?
\STATE What direction would we need to head to maximize the error?
\STATE Let's go in the opposite direction.
\ENDFOR
\ENDWHILE
\end{algorithmic}
\end{algorithm}
Note that the losses aren't needed for the algorithm to train the model, just for assessing how well the model did.
Implement It
- The Function
def gradient_descent(x: numpy.ndarray, y: numpy.ndarray,
                     weights: numpy.ndarray, learning_rate: float,
                     iterations: int=1):
    """Finds the weights for the model

    Args:
     x: the tweet vectors
     y: the positive/negative labels
     weights: the regression weights
     learning_rate: (eta) how much to update the weights
     iterations: the number of times to repeat training

    Returns:
     final loss, trained weights, list of losses (one per iteration)
    """
    assert len(x) == len(y)
    rows = len(x)
    losses = []
    learning_rate /= rows
    for iteration in range(iterations):
        y_hat = sigmoid(x.dot(weights))
        # average loss
        loss = numpy.squeeze(-((y.T.dot(numpy.log(y_hat))) +
                               (1 - y.T).dot(numpy.log(1 - y_hat))))/rows
        losses.append(loss)
        gradient = ((y_hat - y).T.dot(x)).sum(axis=0, keepdims=True)
        weights -= learning_rate * gradient.T
    return loss, weights, losses
If you look at the implementation you can see that there are some changes from what I wrote earlier. This is because the pseudocode came from a book while the implementation came from a Coursera assignment. The main differences are that we use a set number of iterations to train the model and that the learning rate is divided by the number of training examples. Of course, you could just divide the learning rate before passing it in to the function, so it doesn't really change much. I also had to take into account the fact that you can't take the dot product of two matrices whose shapes aren't compatible - the number of columns in the left-hand matrix has to match the number of rows in the right-hand matrix - so there's some transposing of matrices being done. Our actual implementation might be more like this.
\begin{algorithm}
\caption{Gradient Descent Implemented}
\begin{algorithmic}
\STATE $\theta \gets 0$
\STATE $m \gets rows(X)$
\FOR{$iteration \in \{0 \ldots iterations - 1\}$}
\STATE $\hat{Y} \gets \sigma(X \cdot \theta)$
\STATE $loss \gets -\frac{1}{m}\left[Y^T \cdot \ln \hat{Y} + (1 - Y)^T \cdot \ln(1 - \hat{Y})\right]$
\STATE $\nabla \gets \sum \left[(\hat{Y} - Y)^T \cdot X\right]$
\STATE $\theta \gets \theta - \frac{\eta}{m} \nabla^T$
\ENDFOR
\end{algorithmic}
\end{algorithm}
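Since the transposing is easy to get lost in, here's a quick trace of the shapes with random stand-in data (a sketch, not our real vectors):

# tracing the matrix shapes through one update with stand-in data
m = 5
x_demo = numpy.random.rand(m, 3)          # (m, 3) tweet vectors
y_demo = numpy.random.rand(m, 1).round()  # (m, 1) labels
w_demo = numpy.zeros((3, 1))              # (3, 1) weights
y_hat = sigmoid(x_demo.dot(w_demo))            # (m, 3).(3, 1) -> (m, 1)
gradient = (y_hat - y_demo).T.dot(x_demo)      # (1, m).(m, 3) -> (1, 3)
print(y_hat.shape, gradient.shape, gradient.T.shape)

The final transpose is what gets the gradient back to the (3, 1) shape of the weights.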
- Test It
First we'll make a fake (random) input set to make it easier to check the gradient descent.
numpy.random.seed(1)
bias = numpy.ones((10, 1))
fake = numpy.random.rand(10, 2) * 2000
fake_tweet_vectors = numpy.append(bias, fake, axis=1)
Now, the fake labels - we'll make around 35% of them negative and the rest positive.
fake_labels = (numpy.random.rand(10, 1) > 0.35).astype(float)
- Do the Descent
So now we can pass our test data into the gradient descent function and see what happens.
fake_weights = numpy.zeros((3, 1))
fake_loss, fake_weights, losses = gradient_descent(x=fake_tweet_vectors,
                                                   y=fake_labels,
                                                   weights=fake_weights,
                                                   learning_rate=1e-8,
                                                   iterations=700)
expect(math.isclose(fake_loss, 0.67094970, rel_tol=1e-8)).to(be_true)
print(f"The log-loss after training is {fake_loss:.8f}.")
print(f"The trained weights are {[round(t, 8) for t in numpy.squeeze(fake_weights)]}")
The log-loss after training is 0.67094970.
The trained weights are [4.1e-07, 0.00035658, 7.309e-05]
Train the Model
Now that we have our parts, let's train the model on the real training data. I originally wrote this expecting numpy arrays (just as in earlier steps I was expecting python lists instead of numpy arrays - stuff changes), so I'll be extracting the relevant columns from the pandas DataFrame and converting them back to arrays.
weights = numpy.zeros((3, 1))
eta = 1e-9
iterations = 1500
with TIMER:
    final_loss, weights, losses = gradient_descent(
        x=train_vectorizer.vectors,
        y=training.sentiment.values.reshape((-1, 1)),
        weights=weights,
        learning_rate=eta,
        iterations=iterations)
print(f"The log-loss after training is {final_loss:.8f}.")
print(f"The resulting vector of weights is "
      f"{[round(t, 8) for t in numpy.squeeze(weights)]}")
model = TweetSentiment(train_vectorizer, weights)
predictions = model()
correct = sum(predictions.T[0] == training.sentiment)
print(f"Training Accuracy: {correct/len(training)}")
2020-07-27 17:54:58,357 graeae.timers.timer start: Started: 2020-07-27 17:54:58.357765
2020-07-27 17:54:58,776 graeae.timers.timer end: Ended: 2020-07-27 17:54:58.776834
2020-07-27 17:54:58,777 graeae.timers.timer end: Elapsed: 0:00:00.419069
The log-loss after training is 0.22043072.
The resulting vector of weights is [6e-08, 0.00053899, -0.0005613]
Training Accuracy: 0.997625
plot_losses = pandas.DataFrame.from_dict({"Log-Loss": losses})
plot = plot_losses.hvplot().opts(title="Training Losses",
                                 width=Plot.width,
                                 height=Plot.height,
                                 fontscale=Plot.font_scale,
                                 color=Plot.blue)
output = Embed(plot=plot, file_name="training_loss")()
print(output)
As you can see, the losses are still on the decline, but we'll stop here to see how it's doing.
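If I wanted the loop to decide for itself when to stop, one option would be to watch how much the loss changes between iterations (a sketch using the losses we already collected; the tolerance is an arbitrary assumption):

# a simple convergence check on the collected losses
# (the tolerance value is an arbitrary assumption)
TOLERANCE = 1e-6
deltas = numpy.abs(numpy.diff(losses))
print(f"Last change in loss: {deltas[-1]:.2e}")
print(f"Converged (< {TOLERANCE}): {deltas[-1] < TOLERANCE}")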
Test the Model
This will be a class to predict the sentiment of a tweet using our model.
# pypi
import attr
import numpy
# this project
from .vectorizer import TweetVectorizer
@attr.s(auto_attribs=True)
class TweetSentiment:
    """Predicts the sentiment of a tweet

    Args:
     vectorizer: something to vectorize tweets
     theta: vector of weights for the logistic regression model
    """
    vectorizer: TweetVectorizer
    theta: numpy.ndarray

    def sigmoid(self, vectors: numpy.ndarray) -> float:
        """the logistic function

        Args:
         vectors: a matrix of bias, positive, negative counts

        Returns:
         array of probabilities that the tweets are positive
        """
        return 1/(1 + numpy.exp(-vectors))

    def probability_positive(self, tweet: str) -> float:
        """Calculates the probability of the tweet being positive

        Args:
         tweet: a tweet to classify

        Returns:
         the probability that the tweet is a positive one
        """
        x = self.vectorizer.extract_features(tweet, as_array=True)
        return numpy.squeeze(self.sigmoid(x.dot(self.theta)))

    def classify(self, tweet: str) -> int:
        """Decides if the tweet was positive or not

        Args:
         tweet: the tweet message to classify.

        Returns:
         the classification (1 for positive, 0 for negative)
        """
        return int(numpy.round(self.probability_positive(tweet)))

    def __call__(self) -> numpy.ndarray:
        """Get the sentiments of the vectorized tweets

        Note:
         this assumes that the vectorizer passed in has the tweets

        Returns:
         array of predicted sentiments (1 for positive 0 for negative)
        """
        return numpy.round(self.sigmoid(self.vectorizer.vectors.dot(self.theta)))
sentiment = TweetSentiment(test_vectorizer, weights)
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
              'great', 'great great', 'great great great',
              'great great great great']:
    print(f'{tweet} -> {sentiment.probability_positive(tweet)}')
I am happy -> 0.5183237992258976
I am bad -> 0.4924963884222927
this movie should have been great. -> 0.5156997144475827
great -> 0.5158056039006712
great great -> 0.5315796358935646
great great great -> 0.5472908064541816
great great great great -> 0.5629083094155534
Strangely, these are all very near the center (0.5). Probably because the words weren't that commonly used in our training set.
totals = sum(counter.counts.values())
print(f"Great positive percentage: {100 * counter.counts[('great', 1)]/totals:.2f} %")
print(f"Great negative percentage: {100 * counter.counts[('great', 0)]/totals:.2f} % ")
Great positive percentage: 0.24 %
Great negative percentage: 0.03 %
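We can also look at the feature vector for "great" directly to see why the probability barely moves off of 0.5 (a quick sketch using the same extract_features method the TweetSentiment class uses):

# inspecting the feature vector and z-value for a single word
x = test_vectorizer.extract_features("great", as_array=True)
print(f"Features (bias, positive, negative): {x}")
print(f"z = x . theta = {float(numpy.squeeze(x.dot(weights))):.4f}")

Since z stays small, the sigmoid can't move far from 0.5.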
Now we can see how it did overall.
predictions = sentiment()
correct = sum(predictions.T[0] == testing.sentiment)
print(f"Accuracy: {correct/len(testing)}")
Accuracy: 0.996
Almost suspiciously good.
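Before digging into the individual misses, here's a quick breakdown of the errors by true class (a sketch built from the predictions we already have):

# counting the misclassifications for each true class
predicted = predictions.T[0]
actual = testing.sentiment
for label, name in ((1, "positive"), (0, "negative")):
    mask = actual == label
    misses = sum(predicted[mask.values] != actual[mask].values)
    print(f"Misclassified {name} tweets: {misses} of {mask.sum()}")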
The Wrong Stuff
wrong_places = predictions.T[0] != testing.sentiment
wrong = testing[wrong_places]
print(len(wrong))
8
for row in wrong.itertuples():
    print("*" * 10)
    print(f"Tweet number {row.Index}")
    raw = test_raw.iloc[row.Index]
    print(f"Tweet: {raw.tweet}")
    tokens = train_vectorizer.process(raw.tweet)
    print(f"Tokens: {tokens}")
    print(f"Probability Positive: {sentiment.probability_positive(raw.tweet)}")
    print(f"Actual Classification: {row.sentiment}")
    print()
    for token in tokens:
        print(f"{token} \tPositive: {counter.counts[(token, 1)]} "
              f"Negative: {counter.counts[(token, 0)]}")
    print()
**********
Tweet number 64
Tweet: @_sarah_mae omg you can't just tell this and don't say more :p can't wait to know !!!! ❤️
Tokens: ['omg', "can't", 'tell', 'say', ':p', "can't", 'wait', 'know', '❤', '️']
Probability Positive: 0.48137283482824483
Actual Classification: 1

omg 	Positive: 11 Negative: 51
can't 	Positive: 36 Negative: 145
tell 	Positive: 20 Negative: 19
say 	Positive: 48 Negative: 52
:p 	Positive: 113 Negative: 0
can't 	Positive: 36 Negative: 145
wait 	Positive: 59 Negative: 37
know 	Positive: 123 Negative: 100
❤ 	Positive: 18 Negative: 20
️ 	Positive: 9 Negative: 18

**********
Tweet number 118
Tweet: @bae_ts WHATEVER STIL L YOUNG >:-(
Tokens: ['whatev', 'stil', 'l', 'young', '>:-(']
Probability Positive: 0.5006402767570053
Actual Classification: 0

whatev 	Positive: 5 Negative: 0
stil 	Positive: 0 Negative: 0
l 	Positive: 4 Negative: 1
young 	Positive: 2 Negative: 3
>:-( 	Positive: 0 Negative: 2

**********
Tweet number 435
Tweet: @wtfxmbs AMBS please it's harry's jeans :)):):):(
Tokens: ['amb', 'pleas', "harry'", 'jean', ':)', '):', '):', '):']
Probability Positive: 0.821626817973081
Actual Classification: 0

amb 	Positive: 0 Negative: 0
pleas 	Positive: 76 Negative: 215
harry' 	Positive: 0 Negative: 1
jean 	Positive: 0 Negative: 1
:) 	Positive: 2967 Negative: 1
): 	Positive: 7 Negative: 1
): 	Positive: 7 Negative: 1
): 	Positive: 7 Negative: 1

**********
Tweet number 458
Tweet: @GODDAMMlT SRSLY FUCK U UNFOLLOWER HOPE UR FUTURE CHILD UNPARENTS U >:-(
Tokens: ['srsli', 'fuck', 'u', 'unfollow', 'hope', 'ur', 'futur', 'child', 'unpar', 'u', '>:-(']
Probability Positive: 0.5157383070453547
Actual Classification: 0

srsli 	Positive: 1 Negative: 4
fuck 	Positive: 19 Negative: 48
u 	Positive: 193 Negative: 162
unfollow 	Positive: 55 Negative: 8
hope 	Positive: 119 Negative: 77
ur 	Positive: 28 Negative: 20
futur 	Positive: 13 Negative: 1
child 	Positive: 3 Negative: 3
unpar 	Positive: 0 Negative: 0
u 	Positive: 193 Negative: 162
>:-( 	Positive: 0 Negative: 2

**********
Tweet number 493
Tweet: 5h + kids makes all ://:(\\\
Tokens: ['5h', 'kid', 'make', ':/']
Probability Positive: 0.5003797971971914
Actual Classification: 0

5h 	Positive: 0 Negative: 0
kid 	Positive: 17 Negative: 16
make 	Positive: 87 Negative: 77
:/ 	Positive: 4 Negative: 8

**********
Tweet number 788
Tweet: i love got7's outfit for just right >:( its so fun
Tokens: ['love', 'got', '7', 'outfit', 'right', '>:(', 'fun']
Probability Positive: 0.5197464496373044
Actual Classification: 0

love 	Positive: 306 Negative: 114
got 	Positive: 55 Negative: 70
7 	Positive: 5 Negative: 11
outfit 	Positive: 3 Negative: 3
right 	Positive: 41 Negative: 39
>:( 	Positive: 0 Negative: 36
fun 	Positive: 48 Negative: 26

**********
Tweet number 995
Tweet: I ATE YOUR LAST COOKIE SHIR0 >:D
Tokens: ['ate', 'last', 'cooki', 'shir', '0', '>:d']
Probability Positive: 0.4961173289819544
Actual Classification: 1

ate 	Positive: 3 Negative: 8
last 	Positive: 35 Negative: 58
cooki 	Positive: 0 Negative: 2
shir 	Positive: 0 Negative: 0
0 	Positive: 1 Negative: 0
>:d 	Positive: 3 Negative: 0

**********
Tweet number 1662
Tweet: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
Tokens: ['sr', 'financi', 'analyst', 'expedia', 'inc', 'bellevu', 'wa', 'financ', 'expediajob', 'job', 'job', 'hire']
Probability Positive: 0.5038917149486426
Actual Classification: 0

sr 	Positive: 0 Negative: 1
financi 	Positive: 0 Negative: 0
analyst 	Positive: 0 Negative: 0
expedia 	Positive: 0 Negative: 0
inc 	Positive: 1 Negative: 2
bellevu 	Positive: 0 Negative: 0
wa 	Positive: 0 Negative: 0
financ 	Positive: 0 Negative: 0
expediajob 	Positive: 0 Negative: 0
job 	Positive: 28 Negative: 12
job 	Positive: 28 Negative: 12
hire 	Positive: 0 Negative: 0
It looks like these were tweets with uncommon tokens. Personally, I'm not sure what to make of some of them myself. And I'm not sure about some of the labels - why is a job posting considered a negative tweet?
Some Fresh Tweets
First someone reacting to a post about the Clown Motel in Tonopah, Nevada. The previous link was to Atlas Obscura, but the tweet came from thrillist.
sentiments = {0: "negative", 1: "positive"}
tweet = "Nah dude. I drove by that at night and it was the creepiest thing ever. The whole town gave me bad vibes. I still shudder when I think about it."
print(f"Classified as {sentiments[sentiment.classify(tweet)]}")
Classified as negative
Seems reasonable.
tweet = "This is just dope. Quaint! I’d love to have an ironic drive-in wedding in Las Vegas and then stay in a clown motel as newly weds for one night. I bet they have Big Clown Suits for newly weds, haha."
print(f"Classified as {sentiments[sentiment.classify(tweet)]}")
Classified as positive
Compare to SKLearn
columns = "bias positive negative".split()
classifier = LogisticRegressionCV(
    random_state=2020,
    max_iter=1500,
    scoring="neg_log_loss").fit(training[columns], training.sentiment)
predictions = classifier.predict(testing[columns]).reshape((-1, 1))
correct = sum(predictions == testing.sentiment.values.reshape((-1, 1)))
print(f"Accuracy: {correct[0]/len(testing)}")
Accuracy: 0.995
So it did pretty much the same just using the default parameters. We could probably do a parameter search but that's okay for now.
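Since GridSearchCV was imported at the start, here's a sketch of what that parameter search might look like (the grid values are arbitrary assumptions, not tuned choices):

# a possible parameter search (the grid values are assumptions)
from sklearn.linear_model import LogisticRegression

search = GridSearchCV(
    LogisticRegression(random_state=2020, max_iter=1500),
    param_grid={
        "C": [0.01, 0.1, 1, 10, 100],
        "penalty": ["l1", "l2"],
        "solver": ["liblinear"],    # liblinear supports both penalties
    },
    scoring="neg_log_loss",
    cv=5,
).fit(training[columns], training.sentiment)
print(search.best_params_)
print(f"Accuracy: {search.best_estimator_.score(testing[columns], testing.sentiment)}")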
Visualizing the Model
Since we've been given the model's weights we can plot its output when fed the vectors to see how it separates the data. To get the equation for the separation line we need to solve for the negative term when the dot product of the weights and the vector is 0 (\(\theta \cdot x = 0\), where x is our vector \(\langle bias, positive, negative \rangle\)).
Get ready for some algebra.
\begin{align}
\theta \cdot x &= 0\\
\theta \cdot \langle bias, positive, negative \rangle &= 0\\
\theta \cdot \langle 1, positive, negative \rangle &= 0\\
\theta_0 + \theta_1 \times positive + \theta_2 \times negative &= 0\\
\theta_2 \times negative &= -\theta_0 - \theta_1 \times positive\\
negative &= \frac{-\theta_0 - \theta_1 \times positive}{\theta_2}\\
\end{align}

This is the equation for our separation line (on our plot, positive is the x-axis and negative is the y-axis), which we can translate into a function to apply to our data.
def negative(theta: pandas.DataFrame, positive: float) -> float:
    """Calculate the negative value

    This calculates the value for the separation line

    Args:
     theta: the weights for the logistic regression (bias, positive, negative)
     positive: the positive-count (x-axis) value for the line

    Returns:
     the calculated negative value for the separation line
    """
    return (-theta.bias
            - positive * theta.positive)/theta.negative
theta = pandas.DataFrame(weights.T, columns = columns)
negative_ = partial(negative, theta=theta)
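As a quick sanity check, any point that sits exactly on this line should come out of the sigmoid at 0.5 (a sketch; the positive count of 100 is an arbitrary choice):

# a point on the separation line should score exactly 0.5
point = numpy.array([1.0, 100.0, float(negative_(positive=100))])
print(sigmoid(float(point.dot(numpy.squeeze(weights)))))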
We plotted the vectorized data before, now we can add our regression line.
hover = HoverTool(
    tooltips=[
        ("Positive", "@positive{0,0}"),
        ("Negative", "@negative{0,0}"),
        ("Sentiment", "@Sentiment"),
    ]
)
training["regression negative"] = training.positive.apply(
lambda positive: negative_(positive=positive))
line = training.hvplot(x="positive", y="regression negative", color=Plot.tan)
scatter = training.hvplot.scatter(x="positive", y="negative", by="sentiment", fill_alpha=0,
color=Plot.color_cycle, tools=[hover]).opts(
height=Plot.height,
width=Plot.width,
fontscale=Plot.font_scale,
title="Positive vs Negative Tweet Sentiment",
)
plot = scatter * line
output = Embed(plot=plot, file_name="positive_negative_scatter_with_model")()
print(output)
Let's see if a log-log scale helps.
line = training.hvplot(x="positive", y="regression negative", color=Plot.tan)
scatter = training.hvplot.scatter(x="positive", y="negative", by="sentiment",
                                  fill_alpha=0, color=Plot.color_cycle,
                                  tools=[hover])
plot = (scatter * line).opts(
    height=Plot.height,
    width=Plot.width,
    xrotation=45,
    fontscale=Plot.font_scale,
    title="Positive vs Negative Tweet Sentiment",
    logx=True,
    logy=True,
)
output = Embed(plot=plot, file_name="positive_negative_scatter_log")()
print(output)
The log-scale seems to break the auto-scaling of the plot, so you'll have to zoom out a little bit (with the Wheel Zoom tool on the toolbar), which will show that the model did a pretty good job of separating the positive from the negative tweets. You can also see that some of the points aren't really linearly separable using our vectors, so this is probably as good as it can get.
End
This concludes the series begun with the post on pre-processing tweets.
I should mention that I used Speech and Language Processing to help me understand the math.