Twitter Preprocessing With NLTK

Beginning

This is the first in a series of posts that will take a collection of tweets and build a Logistic Regression model to classify them as having either positive or negative sentiment. The next post in the series is linked at the end.

This first post looks at taking a corpus of Twitter data from the Natural Language Toolkit's (NLTK) collection of datasets and building a preprocessor for a Sentiment Analysis pipeline. The dataset's entries were categorized by hand as positive or negative, which makes it a convenient source for training models.

The NLTK Corpus How To has a brief description of the Twitter dataset, along with some documentation on how to gather new data yourself using the Twitter API.

Set Up

Imports

# from python
from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint

import os
import pickle
import random
import re
import string

# from pypi
from dotenv import load_dotenv
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split

import holoviews
import hvplot.pandas
import nltk
import pandas

# this is created further down in the post
from neurotic.nlp.twitter.processor import TwitterProcessor

# my stuff
from graeae import CountPercentage, EmbedHoloviews

The Environment

This is where I keep the paths to the files I save.

load_dotenv("posts/nlp/.env")

Data

The first thing to do is download the dataset using the download function. If you call it without an argument, a dialog opens that lets you choose to download any or all of NLTK's datasets, but for this exercise we'll just download the Twitter samples. If the samples have already been downloaded, the function won't re-download them, so it's safe to call it in any case.

nltk.download('twitter_samples')

The data is contained in three files. You can see the file names using the twitter_samples.fileids function.

for name in twitter_samples.fileids():
    print(f" - {name}")
- negative_tweets.json
- positive_tweets.json
- tweets.20150430-223406.json

As you can see (or maybe guess) two of the files contain tweets that have been categorized as negative or positive. The third file has another 20,000 tweets that aren't classified.

The dataset contains the JSON for each tweet, including some metadata, which you can access through the twitter_samples.docs function. Here's a sample.

pprint(twitter_samples.docs()[0])
{'contributors': None,
 'coordinates': None,
 'created_at': 'Fri Jul 24 10:42:49 +0000 2015',
 'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 624530164626534400,
 'id_str': '624530164626534400',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'place': None,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web '
           '(M2)</a>',
 'text': 'hopeless for tmr :(',
 'truncated': False,
 'user': {'contributors_enabled': False,
          'created_at': 'Sun Mar 08 05:43:40 +0000 2015',
          'default_profile': False,
          'default_profile_image': False,
          'description': '⇨ [V] TravelGency █ 2/4 Goddest from Girls Day █ 92L '
                         '█ sucrp',
          'entities': {'description': {'urls': []}},
          'favourites_count': 196,
          'follow_request_sent': False,
          'followers_count': 1281,
          'following': False,
          'friends_count': 1264,
          'geo_enabled': True,
          'has_extended_profile': False,
          'id': 3078803375,
          'id_str': '3078803375',
          'is_translation_enabled': False,
          'is_translator': False,
          'lang': 'id',
          'listed_count': 3,
          'location': 'wearegsd;favor;pucukfams;barbx',
          'name': 'yuwra ✈ ',
          'notifications': False,
          'profile_background_color': '000000',
          'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png',
          'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png',
          'profile_background_tile': True,
          'profile_banner_url': 'https://pbs.twimg.com/profile_banners/3078803375/1433287528',
          'profile_image_url': 'http://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg',
          'profile_image_url_https': 'https://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg',
          'profile_link_color': '000000',
          'profile_sidebar_border_color': '000000',
          'profile_sidebar_fill_color': '000000',
          'profile_text_color': '000000',
          'profile_use_background_image': True,
          'protected': False,
          'screen_name': 'yuwraxkim',
          'statuses_count': 19710,
          'time_zone': 'Jakarta',
          'url': None,
          'utc_offset': 25200,
          'verified': False}}

There's some potentially useful data here - like whether the tweet was re-tweeted - but for what we're doing we'll just use the text of the tweet itself.

To get just the text of the tweets you use the twitter_samples.strings function.

help(twitter_samples.strings)
Help on method strings in module nltk.corpus.reader.twitter:

strings(fileids=None) method of nltk.corpus.reader.twitter.TwitterCorpusReader instance
    Returns only the text content of Tweets in the file(s)
    
    :return: the given file(s) as a list of Tweets.
    :rtype: list(str)

Note that the docstring says it returns the given file(s) as a list of tweets, but the fileids argument is optional: if you don't pass anything in, you get the tweets from all the files in one list, which you probably don't want.

positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
all_tweets = twitter_samples.strings("tweets.20150430-223406.json")
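
Just to confirm that note about the optional fileids argument, a quick sanity check: calling strings with no argument should give back all three files' tweets in a single list.

everything = twitter_samples.strings()
assert len(everything) == len(positive) + len(negative) + len(all_tweets)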

Now I'll download the stopwords for our pre-processing and set up the English stopwords for use later.

nltk.download('stopwords')
english_stopwords = stopwords.words("english")

Rather than working with the whole dataset, I'm going to split it up here so that we only work with the training set. The first thing to do is create labels for the positive and negative tweets.

Sentiment = Namespace(
    positive = 1,
    negative = 0,
    decode = {
        1: "positive",
        0: "negative"
    },
    encode = {
        "positive": 1,
        "negative": 0,
    }
)
positive_labels = [Sentiment.positive] * len(positive)
negative_labels = [Sentiment.negative] * len(negative)

Now I'll combine the positive and negative tweets.

labels = positive_labels + negative_labels
tweets = positive + negative

print(f"Labels: {len(labels):,}")
print(f"tweets: {len(tweets):,}")
Labels: 10,000
tweets: 10,000

Now we can do the train-test splitting. The train_test_split function shuffles and splits up the dataset, so combining the positive and negative sets first before the splitting seemed like a good idea.

TRAINING_SIZE = 0.8
SEED = 20200724
x_train, x_test, y_train, y_test = train_test_split(
    tweets, labels, train_size=TRAINING_SIZE, random_state=SEED)

print(f"Training: {len(x_train):,}\tTesting: {len(x_test):,}")
Training: 8,000 Testing: 2,000

The Random Seed

This just sets the random seed so that we get the same values if we re-run this later on (although this is a little tricky with the notebook, since you can call the same code multiple times).

random.seed(SEED)

Plotting

I won't be doing a lot of plotting here, but this is a setup for the little that I do.

SLUG = "01-twitter-preprocessing-with-nltk"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}",
                create_folder=False)
Plot = Namespace(
    width=990,
    height=780,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    font_scale=2,
    color_cycle = holoviews.Cycle(["#4687b7", "#ce7b6d"])
)

Middle

It can be more convenient to use a Pandas Series for some checks of the tweets so I'll convert the all-tweets list to one.

all_tweets = pandas.Series(all_tweets)

Explore the Data

Let's start by looking at the number of tweets we got and confirming that the strings function gave us back a list of strings like the docstring said it would.

print(f"Number of tweets: {len(all_tweets):,}")
print(f'Number of positive tweets: {len(positive):,}')
print(f'Number of negative tweets: {len(negative):,}')

for thing in (positive, negative):
    assert type(thing) is list
    assert type(random.choice(thing)) is str
Number of tweets: 20,000
Number of positive tweets: 5,000
Number of negative tweets: 5,000

We can see that the data for each file is a list of strings, and that the uncategorized file has 20,000 tweets while only half as many (5,000 positive and 5,000 negative) were categorized.

Looking At Some Examples

First I'll put the shuffled training set into a pandas DataFrame for later use, and then pull a random example from each of the original positive and negative lists.

training = pandas.DataFrame.from_dict(dict(tweet=x_train, label=y_train))
print(f"Random Positive Tweet: {random.choice(positive)}")
print(f"\nRandom Negative Tweet: {random.choice(negative)}")
Random Positive Tweet: Hi.. Please say"happybirthday" to me :) thanksss :) —  http://t.co/HPXV43LK5L

Random Negative Tweet: I think I should stop getting so angry over stupid shit :(

The First Token

Later on we're going to remove the "RT" (re-tweet) token at the start of the strings. Let's look at how significant this is.

first_tokens = all_tweets.str.split(expand=True)[0]
top_ten = CountPercentage(first_tokens, stop=10, value_label="First Token")
top_ten()
First Token    Count    Percent (%)
RT             13287          92.92
I                160           1.12
Farage           141           0.99
The              134           0.94
VIDEO:           132           0.92
Nigel            117           0.82
Ed               116           0.81
Miliband          77           0.54
SNP               69           0.48
@UKIP             67           0.47

That gives you some sense of how much there is, but plotting it might make it a little clearer.

plot = top_ten.table.hvplot.bar(y="Percent (%)", x="First Token").opts(
    title="Top Ten Tweet First Tokens", 
    width=Plot.width,
    height=Plot.height)
output = Embed(plot=plot, file_name="top_ten", create_folder=False)
print(output())

Figure Missing

So about 93% of the unclassified tweets start with RT, making it perhaps not so informative a token. Or maybe it is… what does a re-tweet tell us? Let's look at whether the re-tweeted tweets show up as duplicates and, if so, how many times they appear.

retweeted = all_tweets[all_tweets.str.startswith("RT")].value_counts().iloc[:10]
for item in retweeted.values:
    print(f" - {item}")
 - 491
 - 430
 - 131
 - 131
 - 117
 - 103
 - 82
 - 73
 - 69
 - 68

Some of the entries are the same tweet repeated hundreds of times. Does each one count as an additional entry? I don't show it here because the tweets are kind of long, but the top five are all about British politics, so there might have been some kind of bias in the way the tweets were gathered.
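
Something like this quick check would put a rough number on the duplication, comparing the total count of re-tweets to the number of unique ones.

retweets = all_tweets[all_tweets.str.startswith("RT")]
print(f"Re-tweets: {len(retweets):,}")
print(f"Unique re-tweets: {retweets.nunique():,}")
print(f"Duplicated re-tweets: {retweets.duplicated().sum():,}")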

Processing the Data

There are four basic steps in our NLP pre-processing:

  • Tokenizing the string
  • Lowercasing
  • Removing stop words and punctuation
  • Stemming

Let's start by pulling up a tweet that has most of the stuff we're cleaning up.

THE_CHOSEN = training[(training.tweet.str.contains("beautiful")) &
                      (training.tweet.str.contains("http")) &
                      (training.tweet.str.contains("#"))].iloc[0].tweet
print(THE_CHOSEN)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

Cleaning Up Twitter-Specific Markup

Although I listed four steps in the beginning, there's often another step where we remove things that are common or not useful but known in advance. In this case we want to remove re-tweet tags (RT), hyperlinks, and hashtags. We're going to do that with Python's built-in regular expression module, one tweet at a time, although you could do it more efficiently in bulk using pandas (there's a quick sketch of the pandas approach at the end of this section).

START_OF_LINE = r"^"
OPTIONAL = "?"
ANYTHING = "."
ZERO_OR_MORE = "*"
ONE_OR_MORE = "+"

SPACE = "\s"
SPACES = SPACE + ONE_OR_MORE
NOT_SPACE = "[^\s]" + ONE_OR_MORE
EVERYTHING_OR_NOTHING = ANYTHING + ZERO_OR_MORE

ERASE = ""
FORWARD_SLASH = "\/"
NEWLINES = r"[\r\n]"
  • Re-Tweets

    None of the positive or negative samples have this tag so I'm going to pull an example from the complete set just to show it working.

    RE_TWEET = START_OF_LINE + "RT" + SPACES
    
    tweet = all_tweets[0]
    print(tweet)
    tweet = re.sub(RE_TWEET, ERASE, tweet)
    print(tweet)
    
    RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
    @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
    
  • Hyperlinks
    HYPERLINKS = ("http" + "s" + OPTIONAL + ":" + FORWARD_SLASH + FORWARD_SLASH
                  + NOT_SPACE + NEWLINES + ZERO_OR_MORE)
    
    print(THE_CHOSEN)
    re_chosen = re.sub(HYPERLINKS, ERASE, THE_CHOSEN)
    print(re_chosen)
    
    My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
    My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… 
    
  • HashTags

    We aren't removing the actual hash-tags, just the hash-marks (#).

    HASH = "#"
    re_chosen = re.sub(HASH, ERASE, re_chosen)
    print(re_chosen)
    
    My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
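
As an aside, the bulk pandas approach I mentioned would look something like this sketch, chaining str.replace with the patterns we just defined (I won't use it later, it's just to show the idea).

bulk_cleaned = (all_tweets
                .str.replace(RE_TWEET, ERASE, regex=True)
                .str.replace(HYPERLINKS, ERASE, regex=True)
                .str.replace(HASH, ERASE, regex=True))
print(bulk_cleaned[0])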
    

Tokenize

NLTK has a tokenizer specially built for tweets. The twitter_samples reader actually has a tokenized method that returns the tweets already broken up, but since we're using regular expressions to clean up the strings a little first, it makes more sense to tokenize them ourselves afterwards. Also note that one of the steps in the pipeline is lower-casing the letters, which the TweetTokenizer will do for us if we set the preserve_case argument to False.

print(help(TweetTokenizer))
Help on class TweetTokenizer in module nltk.tokenize.casual:

class TweetTokenizer(builtins.object)
 |  TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
 |  
 |  Tokenizer for tweets.
 |  
 |      >>> from nltk.tokenize import TweetTokenizer
 |      >>> tknzr = TweetTokenizer()
 |      >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
 |      >>> tknzr.tokenize(s0)
 |      ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
 |  
 |  Examples using `strip_handles` and `reduce_len parameters`:
 |  
 |      >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
 |      >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
 |      >>> tknzr.tokenize(s1)
 |      [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
 |  
 |  Methods defined here:
 |  
 |  __init__(self, preserve_case=True, reduce_len=False, strip_handles=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  tokenize(self, text)
 |      :param text: str
 |      :rtype: list(str)
 |      :return: a tokenized list of strings; concatenating this list returns        the original string if `preserve_case=False`
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

None
tokenizer = TweetTokenizer(
    preserve_case=False,
    strip_handles=True,
    reduce_len=True)

As I mentioned, setting preserve_case to False lower-cases the letters. The other two arguments are strip_handles, which removes the twitter handles, and reduce_len, which limits the number of times a character can be repeated to three - so zzzzz will be changed to zzz. Now we can tokenize our partly cleaned tweet.

print(re_chosen)
tokens = tokenizer.tokenize(re_chosen)
print(tokens)
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
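
To see the other two arguments in action, a quick check like this should show the handle being stripped and the run of repeated characters cut down to three (zzzzzzz becomes zzz).

print(tokenizer.tokenize("@somebody that talk was loooooong zzzzzzz"))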

Remove Stop Words and Punctuation

print(english_stopwords)
print(string.punctuation)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Not as many stopwords as I would have thought.

cleaned = [word for word in tokens if (word not in english_stopwords and
                                       word not in string.punctuation)]
print(cleaned)
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
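
Notice that the smiley (:)) and the ellipsis (…) survived. Since string.punctuation is just a string of ASCII punctuation characters, the word not in string.punctuation check is a substring test, and neither of those tokens appears in it - a quick check makes that clear.

# both of these should be False, which is why the tokens survive the filter
print(":)" in string.punctuation)
print("…" in string.punctuation)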

Stemming

We're going to use the Porter Stemmer from NLTK to stem the words (the official Porter Stemmer algorithm page has more background on how it works).

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in cleaned]
print(stemmed)
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

End

So now we've seen the basic steps that we're going to need to preprocess our tweets for Sentiment Analysis.

Things to check out:

  • The book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze has some useful information about tokenizing, stop words, and stemming, among other things (and is available to read online).
  • preprocessor (called tweet-preprocessor on PyPI) has some of this baked in. Its hashtag cleaning removes both the word and the pound sign, and it doesn't use the NLTK twitter tokenizer, but it looks like it might be useful (unfortunately not everything is documented, so you have to look at the code to figure some things out).

Finally, I'm going to re-write what we did as a class so that it can be re-used later, and save the testing and training data.

Tests

I'm going to use pytest-bdd to run the tests for the pre-processor but I'm also going to take advantage of org-babel and keep the scenario definitions and the test functions grouped by what they do, even though they will exist in two different files (tweet_preprocessing.feature and test_preprocessing.py) when tangled out of this file.

The Tangles

Feature: Tweet pre-processor

<<stock-processing>>

<<re-tweet-processing>>

<<hyperlink-processing>>

<<hash-processing>>

<<tokenization-preprocessing>>

<<stop-word-preprocessing>>

<<stem-preprocessing>>

<<whole-shebang-preprocessing>>
# from pypi
import pytest

# software under test
from neurotic.nlp.twitter.processor import TwitterProcessor

class Katamari:
    """Something to stick values into"""

@pytest.fixture
def katamari():
    return Katamari()


@pytest.fixture
def processor():
    return TwitterProcessor()
# from python
import random
import string

# from pypi
from expects import (
    contain_exactly,
    equal,
    expect
)
from pytest_bdd import (
    given,
    scenarios,
    then,
    when,
)

And = when


# fixtures
from fixtures import katamari, processor

scenarios("twitter/tweet_preprocessing.feature")


<<test-stock-symbol>>


<<test-re-tweet>>


<<test-hyperlinks>>


<<test-hashtags>>


<<test-tokenization>>


<<test-unstopping>>


<<test-stem>>


<<test-call>>

Now on to the sections that go into the tangles.

Stock Symbols

Twitter has a special symbol for stocks which is a dollar sign followed by the stock ticker name (e.g. $HOG for Harley Davidson) that I'll remove. This is going to assume anything with a dollar sign immediately followed by a letter, number, or underscore is a stock symbol.
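
The pattern I build later (WheatBran.STOCK_SYMBOL) boils down to a dollar sign followed by zero or more word characters, so a minimal sketch of the idea looks something like this (for illustration only, it isn't part of the tangled tests).

print(re.sub(r"\$\w*", "", "I would buy $HOG but I am broke"))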

Scenario: A tweet with a stock symbol is cleaned
  Given a tweet with a stock symbol in it
  When the tweet is cleaned
  Then it has the text removed
#Scenario: A tweet with a stock symbol is cleaned


@given("a tweet with a stock symbol in it")
def setup_stock_symbol(katamari, faker):
    symbol = "".join(random.choices(string.ascii_uppercase, k=4))
    head, tail = faker.sentence(), faker.sentence()
    katamari.to_clean = (f"{head} ${symbol} "
                         f"{tail}")

    # the cleaner ignores spaces so there's going to be two spaces between
    # the head and tail after the symbol is removed
    katamari.expected = f"{head}  {tail}"
    return

#   When the tweet is cleaned
#   Then it has the text removed

The Re-tweets

This tests that we can remove the RT tag.

Scenario: A re-tweet is cleaned.

  Given a tweet that has been re-tweeted
  When the tweet is cleaned
  Then it has the text removed
# Scenario: A re-tweet is cleaned.

@given("a tweet that has been re-tweeted")
def setup_re_tweet(katamari, faker):
    katamari.expected = faker.sentence()
    spaces = " " * random.randrange(1, 10)
    katamari.to_clean = f"RT{spaces}{katamari.expected}"
    return


@when("the tweet is cleaned")
def process_tweet(katamari, processor):
    katamari.actual = processor.clean(katamari.to_clean)
    return


@then("it has the text removed")
def check_cleaned_text(katamari):
    expect(katamari.expected).to(equal(katamari.actual))
    return

Hyperlinks

Now test that we can remove hyperlinks.

Scenario: The tweet has a hyperlink
  Given a tweet with a hyperlink
  When the tweet is cleaned
  Then it has the text removed
# Scenario: The tweet has a hyperlink

@given("a tweet with a hyperlink")
def setup_hyperlink(katamari, faker):
    base = faker.sentence()
    katamari.expected = base + " :)"
    katamari.to_clean = base + faker.uri() + " :)"
    return

Hash Symbols

Test that we can remove the pound symbol.

Scenario: A tweet has hash symbols in it.
  Given a tweet with hash symbols
  When the tweet is cleaned
  Then it has the text removed
@given("a tweet with hash symbols")
def setup_hash_symbols(katamari, faker):
    expected = faker.sentence()
    tokens = expected.split()
    expected_tokens = expected.split()

    for count in range(random.randrange(1, 10)):
        index = random.randrange(len(tokens))
        word = faker.word()
        tokens = tokens[:index] + [f"#{word}"] + tokens[index:]
        expected_tokens = expected_tokens[:index] + [word] + expected_tokens[index:]
    katamari.to_clean = " ".join(tokens)
    katamari.expected = " ".join(expected_tokens)
    return

Tokenization

This is being done by NLTK, so it might not really make sense to test it, but I figured adding a test would make it more likely that I'd slow down enough to understand what it's doing.

Scenario: The text is tokenized
  Given a string of text
  When the text is tokenized
  Then it is the expected list of strings
# Scenario: The text is tokenized


@given("a string of text")
def setup_text(katamari):
    katamari.text = "Time flies like an Arrow, fruit flies like a BANANAAAA!"
    katamari.expected = ("time flies like an arrow , "
                         "fruit flies like a bananaaa !").split()
    return


@when("the text is tokenized")
def tokenize(katamari, processor):
    katamari.actual = processor.tokenizer.tokenize(katamari.text)
    return


@then("it is the expected list of strings")
def check_tokens(katamari):
    expect(katamari.actual).to(contain_exactly(*katamari.expected))
    return

Stop Word Removal

Check that we're removing stop-words and punctuation.

Scenario: The user removes stop words and punctuation
  Given a tokenized string
  When the string is un-stopped
  Then it is the expected list of strings
#Scenario: The user removes stop words and punctuation


@given("a tokenized string")
def setup_tokenized_string(katamari):
    katamari.source = ("now is the winter of our discontent , "
                       "made glorious summer by this son of york ;").split()
    katamari.expected = ("winter discontent made glorious "
                         "summer son york".split())
    return


@when("the string is un-stopped")
def un_stop(katamari, processor):
    katamari.actual = processor.remove_useless_tokens(katamari.source)
    return
#  Then it is the expected list of strings

Stemming

This is kind of a fake test. I guessed incorrectly what the stemming would do the first time so I had to go back and match the test values to what it output. I don't think I'll take the time to learn how the stemming is working, though, so it'll have to do.

Scenario: The user stems the tokens
  Given a tokenized string
  When the string is un-stopped
  And tokens are stemmed
  Then it is the expected list of strings
# Scenario: The user stems the tokens
#  Given a tokenized string
#  When the string is un-stopped


@And("tokens are stemmed")
def stem_tokens(katamari, processor):
    katamari.actual = processor.stem(katamari.actual)
    katamari.expected = "winter discont made gloriou summer son york".split()
    return


#  Then it is the expected list of strings

The Whole Shebang

I made some of the steps separate just for illustration and testing, but I'll make the processor callable so they don't have to be done separately.

Scenario: The user calls the processor
  Given a tweet
  When the processor is called with the tweet
  Then it returns the cleaned, tokenized, and stemmed list
# Scenario: The user calls the processor


@given("a tweet")
def setup_tweet(katamari, faker):
    katamari.words = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
    katamari.tweet = f"RT {katamari.words} {faker.uri()}"
    katamari.expected =  ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
    return


@when("the processor is called with the tweet")
def process_tweet(katamari, processor):
    katamari.actual = processor(katamari.tweet)
    return


@then("it returns the cleaned, tokenized, and stemmed list")
def check_processed_tweet(katamari):
    expect(katamari.actual).to(contain_exactly(*katamari.expected))
    return

Implementation

A Regular Expression Helper

class WheatBran:
    """This is a holder for the regular expressions"""
    START_OF_LINE = r"^"
    END_OF_LINE = r"$"
    OPTIONAL = "{}?"
    ANYTHING = "."
    ZERO_OR_MORE = "{}*"
    ONE_OR_MORE = "{}+"
    ONE_OF_THESE = "[{}]"
    FOLLOWED_BY = r"(?={})"
    PRECEDED_BY = r"(?<={})"
    OR = "|"

    NOT = "^"
    SPACE = r"\s"
    SPACES = ONE_OR_MORE.format(SPACE)
    PART_OF_A_WORD = r"\w"
    EVERYTHING_OR_NOTHING = ZERO_OR_MORE.format(ANYTHING)
    EVERYTHING_BUT_SPACES = ZERO_OR_MORE.format(
        ONE_OF_THESE.format(NOT + SPACE))

    ERASE = ""
    FORWARD_SLASHES = r"\/\/"
    NEWLINES = ONE_OF_THESE.format(r"\r\n")
    # a dollar is a special regular expression character meaning end of line
    # so escape it
    DOLLAR_SIGN = r"\$"

    # to remove
    STOCK_SYMBOL = DOLLAR_SIGN + ZERO_OR_MORE.format(PART_OF_A_WORD)
    RE_TWEET = START_OF_LINE + "RT" + SPACES
    HYPERLINKS = ("http" + OPTIONAL.format("s") + ":" + FORWARD_SLASHES
                  + EVERYTHING_BUT_SPACES + ZERO_OR_MORE.format(NEWLINES))
    HASH = "#"

    EYES = ":"
    FROWN = r"\("
    SMILE = r"\)"

    SPACEY_EMOTICON = (FOLLOWED_BY.format(START_OF_LINE + OR + PRECEDED_BY.format(SPACE))
                       + EYES + SPACE + "{}" +
                       FOLLOWED_BY.format(SPACE + OR + END_OF_LINE))
    SPACEY_FROWN = SPACEY_EMOTICON.format(FROWN)
    SPACEY_SMILE = SPACEY_EMOTICON.format(SMILE)

    spacey_fixed_emoticons = [":(", ":)"]
    spacey_emoticons = [SPACEY_FROWN, SPACEY_SMILE]

    remove = [STOCK_SYMBOL, RE_TWEET, HYPERLINKS, HASH]

The Processor

Here's the class-based implementation to pre-process tweets.

# python
import re
import string

# pypi
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

import attr
import nltk

<<regular-expressions>>


@attr.s
class TwitterProcessor:
    """processor for tweets"""
    _tokenizer = attr.ib(default=None)
    _stopwords = attr.ib(default=None)
    _stemmer = attr.ib(default=None)

    <<processor-clean>>

    <<processor-tokenizer>>

    <<processor-un-stop>>

    <<processor-stopwords>>

    <<processor-stemmer>>

    <<processor-stem>>

    <<processor-call>>

    <<processor-emoticon>>

The Clean Method

def clean(self, tweet: str) -> str:
    """Removes sub-strings from the tweet

    Args:
     tweet: string tweet

    Returns:
     tweet with certain sub-strings removed
    """
    for expression in WheatBran.remove:
        tweet = re.sub(expression, WheatBran.ERASE, tweet)
    return tweet

Emoticon Fixer

This tries to handle emoticons with spaces in them.

def unspace_emoticons(self, tweet: str) ->  str:
    """Tries to  remove spaces from emoticons

    Args:
     tweet: message to check

    Returns:
     tweet with things that looks like emoticons with spaces un-spaced
    """
    for expression, fix in zip(
            WheatBran.spacey_emoticons, WheatBran.spacey_fixed_emoticons):
        tweet = re.sub(expression, fix, tweet)
    return tweet
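
A quick check of the idea: a smiley that got split by a space should be collapsed back together, so something like this should print the text with ":)" restored.

processor = TwitterProcessor()
print(processor.unspace_emoticons("happy sunflowers : ) on a friday"))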

The Tokenizer

@property
def tokenizer(self) -> TweetTokenizer:
    """The NLTK Tweet Tokenizer

    It will:
     - tokenize a string
     - remove twitter handles
     - remove repeated characters after the first three
    """
    if self._tokenizer is None:
        self._tokenizer = TweetTokenizer(preserve_case=False,
                                         strip_handles=True,
                                         reduce_len=True)
    return self._tokenizer

Stopwords

This might make more sense to be done at the module level, but I'll see how it goes.

@property
def stopwords(self) -> list:
    """NLTK English stopwords

    Warning:
     if the stopwords haven't been downloaded this also tries to download them
    """
    if self._stopwords is None:
        nltk.download('stopwords', quiet=True)
        self._stopwords =  stopwords.words("english")
    return self._stopwords

Un-Stop the Tokens

def remove_useless_tokens(self, tokens: list) -> list:
    """Remove stopwords and punctuation

    Args:
     tokens: list of strings

    Returns:
     tokens with the stopwords and punctuation removed
    """    
    return [word for word in tokens if (word not in self.stopwords and
                                        word not in string.punctuation)]

Stem the Tokens

@property
def stemmer(self) -> PorterStemmer:
    """Porter Stemmer for the tokens"""
    if self._stemmer is None:
        self._stemmer = PorterStemmer()
    return self._stemmer

def stem(self, tokens: list) -> list:
    """stem the tokens"""
    return [self.stemmer.stem(word) for word in tokens]

Call Me

def __call__(self, tweet: str) -> list:
    """does all the processing in one step

    Args:
     tweet: string to process

    Returns:
     the tweet as a pre-processed list of strings
    """
    cleaned = self.unspace_emoticons(tweet)
    cleaned = self.clean(cleaned)
    cleaned = self.tokenizer.tokenize(cleaned.strip())
    # the stopwords are un-stemmed so this has to come before stemming
    cleaned = self.remove_useless_tokens(cleaned)
    cleaned = self.stem(cleaned)
    return cleaned
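
Trying the whole thing out on the example tweet from earlier should give back the same list we built up step by step above.

processor = TwitterProcessor()
print(processor(THE_CHOSEN))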

Save Things

Rather than create the training and test sets over and over, I'll save them as feather files. I tried saving them as CSVs, but I think the commas in the tweets messed things up (or something did, anyway). I don't use the un-processed tweets later, but it seems like a good idea to keep them around.

It occurred to me that I could just pickle the DataFrames, but I've never used feather before, so this gives me a chance to try it out. According to the post I linked to about feather, it's meant to be fast rather than stable (the format might change), so this is both overkill and impractical, but, oh well.

Data

First I'll process the tweets so I won't have to do this later.

process = TwitterProcessor()
training_processed = training.copy()
training_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_train]
print(training.head(1))
print(training_processed.head(1))
                                      tweet  label
0  off to the park to get some sunlight : )      1
                       tweet  label
0  [park, get, sunlight, :)]      1

Now to save it

processed_path = Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser()

training_processed.to_feather(processed_path)
training.to_feather(raw_path)
processed_path = Path(os.environ["TWITTER_TEST_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TEST_RAW"]).expanduser()

testing = pandas.DataFrame.from_dict(dict(tweet=x_test, label=y_test))
testing_processed = testing.copy()
testing_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_test]

testing_processed.to_feather(processed_path)
testing.to_feather(raw_path)
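
In the later posts, getting these back should just be a matter of pandas.read_feather, something like this.

reloaded = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser())
print(reloaded.head(1))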

Pickles

I'm spreading this over several posts, so I'm going to pickle some Python objects that hold constant values the posts share.

path = Path(os.environ["TWITTER_SENTIMENT"]).expanduser()
with path.open("wb") as writer:
    pickle.dump(Sentiment, writer)
path = Path(os.environ["TWITTER_PLOT"]).expanduser()
with path.open("wb") as writer:
    pickle.dump(Plot, writer)
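
And loading the pickled objects back in the later posts should look something like this.

with Path(os.environ["TWITTER_SENTIMENT"]).expanduser().open("rb") as reader:
    Sentiment = pickle.load(reader)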

Next in the series: Twitter Word Frequencies

Note: This series is a re-write of an exercise taken from Coursera's Natural Language Processing specialization. I changed some of the way it works, though, so it won't match their solution 100% (but it's close).