The Tweet Vectorizer

Beginning

In the previous post (Twitter Word Frequencies) I built up a word-counter; now we're going to use it to create count vectors for our tweets.

We are going to be classifying tweets as having positive or negative sentiment, but tweets are free-form text (and images, which we're ignoring) while our model wants numbers in tabular form, so before we can work with the tweets we'll have to convert them somehow. That's what we'll be doing here.

Set Up

This is some preliminary stuff so we have Python ready to go.

Imports

# python
from argparse import Namespace
from functools import partial
from pathlib import Path

import os
import pickle

# pypi
from bokeh.models.tools import HoverTool
from dotenv import load_dotenv
from nltk.corpus import twitter_samples
import holoviews
import hvplot.pandas
import pandas

# the vectorizer
from neurotic.nlp.twitter.vectorizer import TweetVectorizer

# some helper stuff
from graeae import EmbedHoloviews

The Environment

I'm using environment variables (well, in this case a .env file) to keep track of where I save files, so this loads the paths into the environment.

load_dotenv("posts/nlp/.env", override=True)

The Data

training = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser())

train_raw = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser())

with Path(os.environ["TWITTER_SENTIMENT"]).expanduser().open("rb") as reader:
    Sentiment = pickle.load(reader)

The training frame has the cleaned, stemmed, and tokenized version of the tweets.

print(training.iloc[0])
tweet    [park, get, sunlight, :)]
label                            1
Name: 0, dtype: object

This is the form we need when everything is working. The train_raw frame has the tweets as they come from NLTK.

print(train_raw.iloc[0])
tweet    off to the park to get some sunlight : )
label                                           1
Name: 0, dtype: object

This is just for double-checking if things aren't working the way we expect.
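
As an aside, the Sentiment object gets used later with pandas' map to translate between the numeric labels and their names, so a rough stand-in for it might look like the following sketch. This is an assumption inferred from how it's used, not its actual definition.

# a hypothetical stand-in for the pickled Sentiment object
# (inferred from how it's used later on; pandas' map accepts dicts)
from argparse import Namespace

sentiment_sketch = Namespace(
    decode={1: "positive", 0: "negative"},
    encode={"positive": 1, "negative": 0},
)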

For Plotting

These are some helpers for the plotting that I'll do later on.

SLUG = "the-tweet-vectorizer"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}")

with Path(os.environ["TWITTER_PLOT"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)

The Token Counter

I made the counts in a previous post (Twitter Word Frequencies) so I'll just load it here.

with Path(os.environ["TWITTER_COUNTER"]).expanduser().open("rb") as reader:
    counter = pickle.load(reader)
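
The counter's counts attribute maps (token, sentiment-label) pairs to counts, with 1 as the positive label and 0 as the negative label, so you can look up a single token like this.

# the counts attribute maps (token, label) pairs to counts
print(counter.counts[("park", 1)])  # times "park" appeared in positive tweets
print(counter.counts[("park", 0)])  # times "park" appeared in negative tweets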

Middle

The Tweet Vectors

In an earlier post we built a dictionary-like set of counts recording the number of times each token appeared in a positive tweet and the number of times it appeared in a negative tweet. To represent a tweet as a vector we're going to sum the counts for its tokens twice: once using the positive counts and once using the negative counts.

Come again?

Let's say you have a tweet "a b c" which tokenizes to a, b, c. You look up the positive and negative tweet counts for each token and add them up, getting this:

| Token | Positive | Negative |
|-------|----------|----------|
| a     | 1        | 4        |
| b     | 2        | 5        |
| c     | 3        | 6        |
| Total | 6        | 15       |

The bottom row (Total) holds the values for our vector for a tweet made up of the tokens a, b, and c. So to represent this tweet you would create a vector of the form:

\begin{align}
\hat{v} &= \langle \textit{bias}, \textit{positive}, \textit{negative} \rangle\\
&= \langle 1, 6, 15 \rangle\\
\end{align}

Note: The bias is always one; it's the constant intercept term for the model we'll fit later.
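
To make the arithmetic concrete, here's a minimal sketch of the sum using a plain Counter keyed by (token, label) pairs (1 for positive, 0 for negative), with the counts from the table above.

from collections import Counter

# the counts from the table, keyed by (token, label) pairs
counts = Counter({("a", 1): 1, ("a", 0): 4,
                  ("b", 1): 2, ("b", 0): 5,
                  ("c", 1): 3, ("c", 0): 6})
tokens = ["a", "b", "c"]
bias = 1
vector = [bias,
          sum(counts[(token, 1)] for token in tokens),  # positive total
          sum(counts[(token, 0)] for token in tokens)]  # negative total
print(vector)  # [1, 6, 15]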

The Tweet Vectorizer

Here's where I'll define the class that builds the vectors.

The Testing

We'll start with some vaguely BDD-ish testing. First the tangles.

Feature: A Tweet Count Vectorizer

<<extract-features-feature>>

<<get-vectors-feature>>

<<reset-vectors-feature>>

<<check-rep-vectorizer-tweets-feature>>

<<check-rep-vectorizer-counter-feature>>

# from python
from collections import Counter

import random

# from pypi
from expects import (
    be,
    be_true,
    contain_exactly,
    expect,
    raise_error,
)
from pytest_bdd import (
    given,
    scenarios,
    when,
    then
)

import numpy

# this testing
from fixtures import katamari

# software under test
from neurotic.nlp.twitter.vectorizer import Columns, TweetVectorizer
from neurotic.nlp.twitter.counter import WordCounter

and_also = then
scenarios("twitter/tweet_vectorizer.feature")

<<test-extract-features>>

<<test-vectors>>

<<test-reset-vectors>>

<<test-vectorizer-tweets-check-rep>>

<<test-vectorizer-counter-check-rep>>

And now we can move on to the tests.

  • Extract Features

    For training and testing I'm going to want to convert the tweets in bulk, but first I'll create a method so that a single tweet can be vectorized.

    Scenario: A user converts a tweet to a feature-vector
    
    Given a Tweet Vectorizer
    When the user converts a tweet to a feature-vector
    Then it's the expected feature-vector
    
    # Scenario: A user converts a tweet to a feature-vector
    
    
    @given("a Tweet Vectorizer")
    def setup_tweet_vectorizer(katamari, mocker):
        katamari.bias = random.randrange(100) * random.random()
        TWEETS = 1
    
        TOKENS = "A B C".split()
        katamari.tweets = [TOKENS for tweet in range(TWEETS)]
        katamari.counts = Counter({('A', 0):1,
                                   ('B', 1):2,
                                   ('C', 0):3})
        katamari.counter = mocker.MagicMock(spec=WordCounter)
        katamari.counter.processed = katamari.tweets
        katamari.vectorizer = TweetVectorizer(tweets=katamari.tweets,
                                              counts=katamari.counts,
                                              bias=katamari.bias)
        katamari.vectorizer._process = mocker.MagicMock()
        katamari.vectorizer._process.return_value = "A B C".split()
        return
    
    
    @when("the user converts a tweet to a feature-vector")
    def extract_features(katamari):
        katamari.actual = katamari.vectorizer.extract_features("A B C")
        katamari.actual_array = katamari.vectorizer.extract_features("A B C", as_array=True)
        katamari.expected = [katamari.bias, 2, 4]
        katamari.expected_array = numpy.array(katamari.expected)
        return
    
    
    @then("it's the expected feature-vector")
    def check_feature_vectors(katamari):
        expect(numpy.allclose(katamari.actual_array, katamari.expected_array)).to(be_true)
        expect(katamari.actual).to(contain_exactly(*katamari.expected))
    
        expect(katamari.actual_array.shape).to(contain_exactly(1, 3))
        return
    
  • Get the Vectors
    Scenario: A user retrieves the count vectors
    Given a user sets up the Count Vectorizer with tweets
    When the user checks the count vectors
    Then the first column is the bias column
    And the positive counts are correct
    And the negative counts are correct
    
    # Feature: A Tweet Count Vectorizer
    
    # Scenario: A user retrieves the count vectors
    
    @given("a user sets up the Count Vectorizer with tweets")
    def setup_vectorizer(katamari, faker, mocker):
        katamari.bias = random.randrange(100) * random.random()
        TWEETS = 3
    
        TOKENS = "A B C"
        katamari.tweets = [TOKENS for tweet in range(TWEETS)]
        katamari.counter = mocker.MagicMock(spec=WordCounter)
        katamari.counter.counts = Counter({('A', 0):1,
                                           ('B', 1):2,
                                           ('C', 0):3})
        katamari.vectorizer = TweetVectorizer(tweets=katamari.tweets,
                                              counts=katamari.counter.counts,
                                              bias=katamari.bias)
    
        katamari.vectorizer._process = mocker.MagicMock()
        katamari.vectorizer._process.return_value = TOKENS.split()
        katamari.negative = numpy.array([sum([katamari.counter.counts[(token, 0)]
                                          for token in TOKENS])
                                          for row in range(TWEETS)])
        katamari.positive = numpy.array([sum([katamari.counter.counts[(token, 1)]
                                          for token in TOKENS])
                                         for row in range(TWEETS)])
        return
    
    
    @when("the user checks the count vectors")
    def check_count_vectors(katamari):
        # kind of silly, but useful for troubleshooting
        katamari.actual_vectors = katamari.vectorizer.vectors
        return
    
    
    @then("the first column is the bias colum")
    def check_bias(katamari):
        expect(all(katamari.actual_vectors[:, Columns.bias]==katamari.bias)).to(be_true)
        return
    
    
    @and_also("the positive counts are correct")
    def check_positive_counts(katamari):
        positive = katamari.actual_vectors[:, Columns.positive]
        expect(numpy.allclose(positive, katamari.positive)).to(be_true)
        return
    
    
    @and_also("the negative counts are correct")
    def check_negative_counts(katamari):
        negative = katamari.actual_vectors[:, Columns.negative]
        expect(numpy.allclose(negative, katamari.negative)).to(be_true)
        return
    
  • Reset the Vectors
    Scenario: The vectors are reset
    Given a Tweet Vectorizer with the vectors set
    When the user calls the reset method
    Then the vectors are gone
    
    # Scenario: The vectors are reset
    
    
    @given("a Tweet Vectorizer with the vectors set")
    def setup_vectors(katamari, faker, mocker):
        katamari.vectors = mocker.MagicMock()
        katamari.vectorizer = TweetVectorizer(tweets = [faker.sentence()], counts=None)
        katamari.vectorizer._vectors = katamari.vectors
        return
    
    
    @when("the user calls the reset method")
    def call_reset(katamari):
        expect(katamari.vectorizer.vectors).to(be(katamari.vectors))
        katamari.vectorizer.reset()
        return
    
    
    @then("the vectors are gone")
    def check_vectors_gone(katamari):
        expect(katamari.vectorizer._vectors).to(be(None))
        return
    
  • Check Rep
    Scenario: the check-rep is called with bad tweets
    Given a Tweet Vectorizer with bad tweets
    When check-rep is called
    Then it raises an AssertionError
    
    # Scenario: the check-rep is called with bad tweets
    
    
    @given("a Tweet Vectorizer with bad tweets")
    def setup_bad_tweets(katamari):
        katamari.vectorizer = TweetVectorizer(tweets=[5],
                                              counts=Counter())
        return
    
    
    @when("check-rep is called")
    def call_check_rep(katamari):
        def bad_call():
            katamari.vectorizer.check_rep()
        katamari.bad_call = bad_call
        return
    
    
    @then("it raises an AssertionError")
    def check_assertion_error(katamari):
        expect(katamari.bad_call).to(raise_error(AssertionError))
        return
    
    Scenario: the check-rep is called with a bad word-counter
    Given a Tweet Vectorizer with the wrong counter object
    When check-rep is called
    Then it raises an AssertionError
    
    # Scenario: the check-rep is called with a bad word-counter
    
    
    @given("a Tweet Vectorizer with the wrong counter object")
    def setup_bad_counter(katamari, mocker):
        katamari.vectorizer = TweetVectorizer(tweets=["apple"], counts=mocker.MagicMock())
        return
    
    # When check-rep is called
    # Then it raises an AssertionError
    

The Implementation

Okay, so now for the actual class.

# python
from argparse import Namespace
from collections import Counter
from typing import List, Union

# pypi
import numpy
import attr


# this package
from neurotic.nlp.twitter.processor import TwitterProcessor
from neurotic.nlp.twitter.counter import WordCounter

Columns = Namespace(
    bias=0,
    positive=1,
    negative=2
)

TweetClass = Namespace(
    positive=1,
    negative=0
)

# some types
Tweets = List[List[str]]
Vector = Union[numpy.ndarray, list]


@attr.s(auto_attribs=True)
class TweetVectorizer:
    """A tweet vectorizer

    Args:
     tweets: the pre-processed/tokenized tweets to vectorize
     counts: the counter with the tweet token counts
     processed: whether the tweets are already tokenized (skips re-processing)
     bias: constant to use for the bias
    """
    tweets: Tweets
    counts: Counter
    processed: bool=True
    bias: float=1
    _process: TwitterProcessor=None
    _vectors: numpy.ndarray=None

    @property
    def process(self) -> TwitterProcessor:
        """Processes tweet strings to tokens"""
        if self._process is None:
            self._process = TwitterProcessor()
        return self._process

    @property
    def vectors(self) -> numpy.ndarray:
        """The vectorized tweet counts"""
        if self._vectors is None:
            rows = [self.extract_features(tweet) for tweet in self.tweets]
            self._vectors = numpy.array(rows)
        return self._vectors

    def extract_features(self, tweet: str, as_array: bool=False) -> Vector:
        """converts a single tweet to an array of counts

        Args:
         tweet: a string tweet to count up
         as_array: whether to return an array instead of a list

        Returns:
         either a list of floats or a 1 x 3 array
        """
        # this is a hack to make this work both in bulk and one tweet at a time
        tokens = tweet if self.processed else self.process(tweet)
        vector = [
            self.bias,
            sum(self.counts[(token, TweetClass.positive)]
                for token in tokens),
            sum(self.counts[(token, TweetClass.negative)]
                for token in tokens),
        ]
        vector = numpy.array([vector]) if as_array else vector
        return vector

    def reset(self) -> None:
        """Removes the vectors"""
        self._vectors = None
        return

    def check_rep(self) -> None:
        """Checks that the tweets and word-counter are set

        Raises:
         AssertionError: if one of them isn't right
        """
        for tweet in self.tweets:
            assert type(tweet) is str
        assert type(self.counts) is Counter
        return
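
Before running it against the real data, here's a quick smoke test with a tiny hand-made Counter; the park and :) counts are borrowed from the output further down in this post, so the numbers are just for illustration.

# a quick smoke test with hand-made counts
from collections import Counter

toy_counts = Counter({("park", 1): 6, ("park", 0): 7,
                      (":)", 1): 2967, (":)", 0): 1})
toy = TweetVectorizer(tweets=[["park", ":)"]], counts=toy_counts)
print(toy.extract_features(["park", ":)"]))  # [1, 2973, 8]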

Plotting The Vectors

Now that we have a vectorizer definition, let's see what it looks like when we plot the training set. First, we'll have to convert the training set tweets to the vectors.

vectorizer = TweetVectorizer(tweets=training.tweet.values, counts=counter)
data = pandas.DataFrame(vectorizer.vectors, columns=
                        "bias positive negative".split())

data["Sentiment"] = training.label.map(Sentiment.decode)
print(training.tweet.iloc[0])
print(data.iloc[0])
['park' 'get' 'sunlight' ':)']
bias                1
positive         3139
negative          208
Sentiment    positive
Name: 0, dtype: object
print(train_raw.iloc[0].tweet)
for token in training.iloc[0].tweet:
    # positive count first, then negative
    print(f"{token}\t{counter.counts[(token, 1)]}")
    print(f"{token}\t{counter.counts[(token, 0)]}")
off to the park to get some sunlight : )
park    6
park    7
get     165
get     200
sunlight        1
sunlight        0
:)      2967
:)      1

So a smiley face seems to overwhelm other tokens.

print(data.Sentiment.value_counts())
negative    4013
positive    3987
Name: Sentiment, dtype: int64

If you followed the previous post you can probably figure out that this is the training set. Weirdly, I hadn't noticed before that the classes aren't exactly balanced… Anyway, now the plot.

hover = HoverTool(
    tooltips = [
        ("Positive", "@positive{0,0}"),
        ("Negative", "@negative{0,0}"),
        ("Sentiment", "@Sentiment"),
    ]
)

plot = data.hvplot.scatter(x="positive", y="negative", by="Sentiment", fill_alpha=0,
                           color=Plot.color_cycle, tools=[hover]).opts(
                               height=Plot.height,
                               width=Plot.width,
                               fontscale=Plot.font_scale,
                               title="Positive vs Negative Tweet Sentiment",
                           )

output = Embed(plot=plot, file_name="positive_negative_scatter")()
print(output)

Figure Missing

So, each point is a tweet and the color is what the tweet was classified as. I don't know why they seem to group in bunches, but you can sort of see that, by using the token counts, we've made the classes separable. This becomes even more obvious if we change the scale to a logarithmic one.

plot = data.hvplot.scatter(x="positive", y="negative", by="Sentiment",
                           loglog=True,
                           fill_alpha=0,
                           color=Plot.color_cycle, tools=[hover]).opts(
                               height=Plot.height,
                               width=Plot.width,
                               fontscale=Plot.font_scale,
                               xlim=(0, None),
                               ylim=(0, None),
                               apply_ranges=True,
                               title="Positive vs Negative Tweet Sentiment (log-log)",
                           )

output = Embed(plot=plot, file_name="positive_negative_scatter_log")()
print(output)

Figure Missing

I don't know why but the xlim and ylim arguments don't seem to work when you use a logarithmic scale, but if you zoom out using the wheel zoom tool (third icon from the top of the toolbar on the right) you'll see that there's a pretty good separation between the sentiment classifications.
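
One possible workaround (an assumption on my part, not something verified here) is to take the logarithms yourself and plot them on linear axes, where the limits behave normally.

# an untested workaround: log-transform the counts by hand and plot on
# linear axes (adding 1 to avoid log(0))
import numpy

logged = data.assign(
    positive=numpy.log10(data.positive + 1),
    negative=numpy.log10(data.negative + 1),
)
plot = logged.hvplot.scatter(
    x="positive", y="negative", by="Sentiment", fill_alpha=0,
    color=Plot.color_cycle, tools=[hover]).opts(
        height=Plot.height,
        width=Plot.width,
        fontscale=Plot.font_scale,
        title="Positive vs Negative Tweet Sentiment (manual log)",
    )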

End

So, that's it for vectorizing tweets. I'll save the values so I don't have to re-compute them when I actually fit the model. Since I changed some values to make plotting nicer, I'll change them back first.

data = data.rename(columns={"Sentiment": "sentiment"})
data["sentiment"] = data.sentiment.map(Sentiment.encode)
data.to_feather(Path(os.environ["TWITTER_TRAIN_VECTORS"]).expanduser())

To make it consistent I'm going to convert the test set too.

test = pandas.read_feather(Path(os.environ["TWITTER_TEST_PROCESSED"]).expanduser())
test_vectorizer = TweetVectorizer(tweets=test.tweet, counts=counter)
test_data = pandas.DataFrame(test_vectorizer.vectors,
                             columns="bias positive negative".split())
test_data["sentiment"] = test.label

test_data.to_feather(Path(os.environ["TWITTER_TEST_VECTORS"]).expanduser())

We'll also need the vectorizer to vectorize future tweets, so I'll pickle it too.

with Path(os.environ["TWITTER_VECTORIZER"]).expanduser().open("wb") as writer:
    pickle.dump(vectorizer, writer)
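
Later, un-pickling the vectorizer and converting a new, raw tweet might look something like this sketch; since this instance was created with processed left as True, the tweet has to be tokenized first.

# a sketch of re-using the pickled vectorizer on a new, raw tweet
from neurotic.nlp.twitter.processor import TwitterProcessor

with Path(os.environ["TWITTER_VECTORIZER"]).expanduser().open("rb") as reader:
    future_vectorizer = pickle.load(reader)

process = TwitterProcessor()
tokens = process("off to the park to get some sunlight : )")
vector = future_vectorizer.extract_features(tokens, as_array=True)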

Next up in the series: Implementing Logistic Regression for Tweet Sentiment Analysis.