Twitter Preprocessing With NLTK
Table of Contents
Beginning
This is the first in a series that will look at taking a group of tweets and building a Logistic Regression model to classify tweets as being either positive or negative in their sentiment. This post is followed by:
- Twitter Word Frequencies
- The Tweet Vectorizer
- Implementing Logistic Regression for Tweet Sentiment Analysis
This first post takes a corpus of Twitter data that comes with the Natural Language Toolkit's (NLTK) collection of data and builds a preprocessor for a Sentiment Analysis pipeline. The dataset's entries were categorized by hand as positive or negative, so it's a convenient source for training models.
The NLTK Corpus How To has a brief description of the Twitter dataset, as well as some documentation on how to gather new data yourself using the Twitter API.
Set Up
Imports
# from python
from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint
import os
import pickle
import random
import re
import string
# from pypi
from dotenv import load_dotenv
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split
import holoviews
import hvplot.pandas
import nltk
import pandas
# this is created further down in the post
from neurotic.nlp.twitter.processor import TwitterProcessor
# my stuff
from graeae import CountPercentage, EmbedHoloviews
The Environment
This is where I keep the paths to the files I save.
load_dotenv("posts/nlp/.env")
Data
The first thing to do is download the dataset using the download function. If you don't pass it an argument, a dialog will open and you can choose to download any or all of their datasets, but for this exercise we'll just download the Twitter samples. Note that if the samples were already downloaded, running this function won't re-download them, so it's safe to call it either way.
nltk.download('twitter_samples')
The data is contained in three files. You can see the file names using the twitter_samples.fileids function.
for name in twitter_samples.fileids():
print(f" - {name}")
- negative_tweets.json
- positive_tweets.json
- tweets.20150430-223406.json
As you can see (or maybe guess) two of the files contain tweets that have been categorized as negative or positive. The third file has another 20,000 tweets that aren't classified.
The dataset contains the JSON for each tweet, including some metadata, which you can access through the twitter_samples.docs function. Here's a sample.
pprint(twitter_samples.docs()[0])
{'contributors': None, 'coordinates': None, 'created_at': 'Fri Jul 24 10:42:49 +0000 2015', 'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []}, 'favorite_count': 0, 'favorited': False, 'geo': None, 'id': 624530164626534400, 'id_str': '624530164626534400', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'is_quote_status': False, 'lang': 'en', 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'place': None, 'retweet_count': 0, 'retweeted': False, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web ' '(M2)</a>', 'text': 'hopeless for tmr :(', 'truncated': False, 'user': {'contributors_enabled': False, 'created_at': 'Sun Mar 08 05:43:40 +0000 2015', 'default_profile': False, 'default_profile_image': False, 'description': '⇨ [V] TravelGency █ 2/4 Goddest from Girls Day █ 92L ' '█ sucrp', 'entities': {'description': {'urls': []}}, 'favourites_count': 196, 'follow_request_sent': False, 'followers_count': 1281, 'following': False, 'friends_count': 1264, 'geo_enabled': True, 'has_extended_profile': False, 'id': 3078803375, 'id_str': '3078803375', 'is_translation_enabled': False, 'is_translator': False, 'lang': 'id', 'listed_count': 3, 'location': 'wearegsd;favor;pucukfams;barbx', 'name': 'yuwra ✈ ', 'notifications': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png', 'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png', 'profile_background_tile': True, 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/3078803375/1433287528', 'profile_image_url': 'http://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg', 'profile_link_color': '000000', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': True, 'protected': False, 'screen_name': 'yuwraxkim', 'statuses_count': 19710, 'time_zone': 'Jakarta', 'url': None, 'utc_offset': 25200, 'verified': False}}
There's some potentially useful data here - like whether the tweet was re-tweeted - but for what we're doing we'll just use the tweet itself.
To get just the text of the tweets you use the twitter_samples.strings function.
help(twitter_samples.strings)
Help on method strings in module nltk.corpus.reader.twitter:

strings(fileids=None) method of nltk.corpus.reader.twitter.TwitterCorpusReader instance
    Returns only the text content of Tweets in the file(s)

    :return: the given file(s) as a list of Tweets.
    :rtype: list(str)
Note that it says it returns only the text content of the given file(s) as a list of tweets, but the fileids argument is optional. If you don't pass in any argument you end up with the tweets from all the files in the same list, which you probably don't want.
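Just to confirm that, here's a quick check (a sketch; given the file sizes we'll see below, the combined list should hold 30,000 tweets):
# calling strings() with no fileids argument lumps all three files together
print(f"{len(twitter_samples.strings()):,}")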
positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
all_tweets = twitter_samples.strings("tweets.20150430-223406.json")
Now I'll download the stopwords corpus for our pre-processing and set up the English stopwords for use later.
nltk.download('stopwords')
english_stopwords = stopwords.words("english")
Rather than working with the whole dataset, I'm going to split it up here so we'll only work with the training set. The first thing to do is create a set of labels for the positive and negative tweets.
Sentiment = Namespace(
positive = 1,
negative = 0,
decode = {
1: "positive",
0: "negative"
},
encode = {
"positive": 1,
"negative": 0,
}
)
positive_labels = [Sentiment.positive] * len(positive)
negative_labels = [Sentiment.negative] * len(negative)
Now I'll combine the positive and negative tweets.
labels = positive_labels + negative_labels
tweets = positive + negative
print(f"Labels: {len(labels):,}")
print(f"tweets: {len(tweets):,}")
Labels: 10,000
tweets: 10,000
Now we can do the train-test splitting. The train_test_split function shuffles and splits up the dataset, so combining the positive and negative sets first before the splitting seemed like a good idea.
TRAINING_SIZE = 0.8
SEED = 20200724
x_train, x_test, y_train, y_test = train_test_split(
tweets, labels, train_size=TRAINING_SIZE, random_state=SEED)
print(f"Training: {len(x_train):,}\tTesting: {len(x_test):,}")
Training: 8,000 Testing: 2,000
The Random Seed
This just sets the random seed so that we get the same values if we re-run this later on (although this is a little tricky with the notebook, since you can call the same code multiple times).
random.seed(SEED)
Plotting
I won't be doing a lot of plotting here, but this is a setup for the little that I do.
SLUG = "01-twitter-preprocessing-with-nltk"
Embed = partial(EmbedHoloviews,
folder_path=f"files/posts/nlp/{SLUG}",
create_folder=False)
Plot = Namespace(
width=990,
height=780,
tan="#ddb377",
blue="#4687b7",
red="#ce7b6d",
font_scale=2,
color_cycle = holoviews.Cycle(["#4687b7", "#ce7b6d"])
)
Middle
It can be more convenient to use a Pandas Series for some checks of the tweets so I'll convert the all-tweets list to one.
all_tweets = pandas.Series(all_tweets)
Explore the Data
Let's start by looking at the number of tweets we got and confirming that the strings function gave us back a list of strings like the docstring said it would.
print(f"Number of tweets: {len(all_tweets):,}")
print(f'Number of positive tweets: {len(positive):,}')
print(f'Number of negative tweets: {len(negative):,}')
for thing in (positive, negative):
assert type(thing) is list
assert type(random.choice(thing)) is str
Number of tweets: 20,000
Number of positive tweets: 5,000
Number of negative tweets: 5,000
We can see that the data for each file is made up of strings stored in a list, and that there were 20,000 uncategorized tweets but only half as many categorized ones.
Looking At Some Examples
First, since our data sets are shuffled, I'll convert them into a pandas DataFrame to make it a little easier to get positive vs negative tweets.
training = pandas.DataFrame.from_dict(dict(tweet=x_train, label=y_train))
print(f"Random Positive Tweet: {random.choice(positive)}")
print(f"\nRandom Negative Tweet: {random.choice(negative)}")
Random Positive Tweet: Hi.. Please say"happybirthday" to me :) thanksss :) — http://t.co/HPXV43LK5L

Random Negative Tweet: I think I should stop getting so angry over stupid shit :(
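Since the training frame keeps the labels next to the shuffled tweets, the same kind of peek works on it too; a sketch:
# hypothetical peek: sample one tweet of each sentiment from the training frame
print(training[training.label == Sentiment.positive].tweet.sample().iloc[0])
print(training[training.label == Sentiment.negative].tweet.sample().iloc[0])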
The First Token
Later on we're going to remove the "RT" (re-tweet) token at the start of the strings. Let's look at how significant this is.
first_tokens = all_tweets.str.split(expand=True)[0]
top_ten = CountPercentage(first_tokens, stop=10, value_label="First Token")
top_ten()
First Token | Count | Percent (%) |
---|---|---|
RT | 13287 | 92.92 |
I | 160 | 1.12 |
Farage | 141 | 0.99 |
The | 134 | 0.94 |
VIDEO: | 132 | 0.92 |
Nigel | 117 | 0.82 |
Ed | 116 | 0.81 |
Miliband | 77 | 0.54 |
SNP | 69 | 0.48 |
@UKIP | 67 | 0.47 |
That gives you some sense of how much there is, but plotting it might make it a little clearer.
plot = top_ten.table.hvplot.bar(y="Percent (%)", x="First Token").opts(
title="Top Ten Tweet First Tokens",
width=Plot.width,
height=Plot.height)
output = Embed(plot=plot, file_name="top_ten", create_folder=False)
print(output())
So, about 93% of the unclassified tweets start with RT, making it perhaps not so informative a token. Or maybe it is… what does a re-tweet tell us? Let's look at whether the re-tweeted tweets show up as duplicates and, if so, how many times they show up.
retweeted = all_tweets[all_tweets.str.startswith("RT")].value_counts().iloc[:10]
for item in retweeted.values:
print(f" - {item}")
- 491
- 430
- 131
- 131
- 117
- 103
- 82
- 73
- 69
- 68
Some of the entries are the same tweet repeated hundreds of times. Does each one count as an additional entry? I don't show it here because the tweets are kind of long, but the top five are all about British politics, so there might have been some kind of bias in the way the tweets were gathered.
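I said the tweets are kind of long, but if you do want a peek at the most re-tweeted ones, truncating them keeps the lines manageable. A sketch (output not shown):
# print the count and the first sixty characters of each top re-tweet
for text, count in retweeted.items():
    print(f" - ({count:,}) {text[:60]}")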
Processing the Data
There are four basic steps in our NLP pre-processing:
- Tokenization
- Lower-casing
- Removing stop words and punctuation
- Stemming
Let's start by pulling up a tweet that has most of the stuff we're cleaning up.
THE_CHOSEN = training[(training.tweet.str.contains("beautiful")) &
(training.tweet.str.contains("http")) &
(training.tweet.str.contains("#"))].iloc[0].tweet
print(THE_CHOSEN)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
Cleaning Up Twitter-Specific Markup
Although I listed four steps in the beginning, there's often another step where we remove things that are common or not useful but known in advance. In this case we want to remove re-tweet tags (RT), hyperlinks, and hashtags. We're going to do that with Python's built-in regular expression module, and we're going to do it one tweet at a time, although you could perhaps do it more efficiently in bulk using pandas (there's a sketch of the bulk approach after the re-tweet example below).
START_OF_LINE = r"^"
OPTIONAL = "?"
ANYTHING = "."
ZERO_OR_MORE = "*"
ONE_OR_MORE = "+"
SPACE = r"\s"
SPACES = SPACE + ONE_OR_MORE
NOT_SPACE = r"[^\s]" + ONE_OR_MORE
EVERYTHING_OR_NOTHING = ANYTHING + ZERO_OR_MORE
ERASE = ""
FORWARD_SLASH = r"\/"
NEWLINES = r"[\r\n]"
- Re-Tweets
None of the positive or negative samples have this tag so I'm going to pull an example from the complete set just to show it working.
RE_TWEET = START_OF_LINE + "RT" + SPACES
tweet = all_tweets[0]
print(tweet)
tweet = re.sub(RE_TWEET, ERASE, tweet)
print(tweet)
RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
@KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
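As an aside, the bulk pandas approach I mentioned would look something like this sketch, applying the same pattern to the whole Series at once rather than tweet-by-tweet:
# remove the re-tweet tag from every tweet in the Series in one call
bulk_cleaned = all_tweets.str.replace(RE_TWEET, ERASE, regex=True)
print(bulk_cleaned[0])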
- Hyperlinks
HYPERLINKS = ("http" + "s" + OPTIONAL + ":" + FORWARD_SLASH + FORWARD_SLASH + NOT_SPACE + NEWLINES + ZERO_OR_MORE)
print(THE_CHOSEN)
re_chosen = re.sub(HYPERLINKS, ERASE, THE_CHOSEN)
print(re_chosen)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off…
- HashTags
We aren't removing the actual hash-tags, just the hash-marks (#).
HASH = "#"
re_chosen = re.sub(HASH, ERASE, re_chosen)
print(re_chosen)
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
Tokenize
NLTK has a tokenizer specially built for tweets. The twitter_samples module actually has a tokenized function that breaks the tweets up, but since we are using regular expressions to clean up the strings a little first, it makes more sense to tokenize the strings afterwards. Also note that one of the steps in the pipeline is to lower-case the letters, which the TweetTokenizer will do for us if we set the preserve_case argument to False.
print(help(TweetTokenizer))
Help on class TweetTokenizer in module nltk.tokenize.casual:

class TweetTokenizer(builtins.object)
 |  TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
 |
 |  Tokenizer for tweets.
 |
 |      >>> from nltk.tokenize import TweetTokenizer
 |      >>> tknzr = TweetTokenizer()
 |      >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
 |      >>> tknzr.tokenize(s0)
 |      ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
 |
 |  Examples using `strip_handles` and `reduce_len parameters`:
 |
 |      >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
 |      >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
 |      >>> tknzr.tokenize(s1)
 |      [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
 |
 |  Methods defined here:
 |
 |  __init__(self, preserve_case=True, reduce_len=False, strip_handles=False)
 |      Initialize self. See help(type(self)) for accurate signature.
 |
 |  tokenize(self, text)
 |      :param text: str
 |      :rtype: list(str)
 |      :return: a tokenized list of strings; concatenating this list returns the original string if `preserve_case=False`
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)

None
tokenizer = TweetTokenizer(
preserve_case=False,
strip_handles=True,
reduce_len=True)
As I mentioned, setting preserve_case to False lower-cases the letters. The other two arguments are strip_handles, which removes the twitter handles, and reduce_len, which limits the number of times a character can be repeated to three, so zzzzz will be changed to zzz. Now we can tokenize our partly cleaned tweet.
print(re_chosen)
tokens = tokenizer.tokenize(re_chosen)
print(tokens)
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
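To see strip_handles and reduce_len on their own, we can borrow the example string from the docstring above; with our lower-casing tokenizer the output should look something like the comment:
sample = '@remy: This is waaaaayyyy too much for you!!!!!!'
print(tokenizer.tokenize(sample))
# [':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']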
Remove Stop Words and Punctuation
print(english_stopwords)
print(string.punctuation)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Not as many stopwords as I would have thought.
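For the record, a quick count of what we're filtering against (the stopword total depends on the NLTK data version, while string.punctuation is a fixed set of 32 ASCII characters):
print(f"stopwords: {len(english_stopwords)}")
print(f"punctuation: {len(string.punctuation)}")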
cleaned = [word for word in tokens if (word not in english_stopwords and
word not in string.punctuation)]
print(cleaned)
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
Stemming
We're going to use the Porter Stemmer from NLTK to stem the words (this is the official Porter Stemmer algorithm page).
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in cleaned]
print(stemmed)
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
End
So now we've seen the basic steps that we're going to need to preprocess our tweets for Sentiment Analysis.
Things to check out:
- The book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze has some useful information about tokenizing, stop words, and stemming, among other things (and is available to read online).
- preprocessor (called tweet-preprocessor on PyPI) has some of this baked in. Its hashtag cleaning removes both the word and the pound sign, and it doesn't use the NLTK twitter tokenizer, but it looks like it might be useful (unfortunately not everything is documented, so you have to look at the code to figure some things out).
Finally, I'm going to re-write what we did as a class so I can re-use it later, and save the testing and training data.
Tests
I'm going to use pytest-bdd to run the tests for the pre-processor, but I'm also going to take advantage of org-babel and keep the scenario definitions and the test functions grouped by what they do, even though they will exist in two different files (tweet_preprocessing.feature and test_preprocessing.py) when tangled out of this file.
The Tangles
Feature: Tweet pre-processor
<<stock-processing>>
<<re-tweet-processing>>
<<hyperlink-processing>>
<<hash-processing>>
<<tokenization-preprocessing>>
<<stop-word-preprocessing>>
<<stem-preprocessing>>
<<whole-shebang-preprocessing>>
# from pypi
import pytest
# software under test
from neurotic.nlp.twitter.processor import TwitterProcessor
class Katamari:
"""Something to stick values into"""
@pytest.fixture
def katamari():
return Katamari()
@pytest.fixture
def processor():
return TwitterProcessor()
# from python
import random
import string
# from pypi
from expects import (
contain_exactly,
equal,
expect
)
from pytest_bdd import (
given,
scenarios,
then,
when,
)
And = when
# fixtures
from fixtures import katamari, processor
scenarios("twitter/tweet_preprocessing.feature")
<<test-stock-symbol>>
<<test-re-tweet>>
<<test-hyperlinks>>
<<test-hashtags>>
<<test-tokenization>>
<<test-unstopping>>
<<test-stem>>
<<test-call>>
Now on to the sections that go into the tangles.
Stock Symbols
Twitter has a special symbol for stocks which is a dollar sign followed by the stock ticker name (e.g. $HOG for Harley Davidson) that I'll remove. This is going to assume anything with a dollar sign immediately followed by a letter, number, or underscore is a stock symbol.
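As a regular expression that idea works out to something like r"\$\w*", which is also what the implementation below builds up from named pieces; a quick hypothetical check (the tickers are made up):
# erase anything that looks like a dollar sign followed by word characters
print(re.sub(r"\$\w*", "", "long $HOG, short $TSLA"))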
Scenario: A tweet with a stock symbol is cleaned
Given a tweet with a stock symbol in it
When the tweet is cleaned
Then it has the text removed
#Scenario: A tweet with a stock symbol is cleaned
@given("a tweet with a stock symbol in it")
def setup_stock_symbol(katamari, faker):
symbol = "".join(random.choices(string.ascii_uppercase, k=4))
head, tail = faker.sentence(), faker.sentence()
katamari.to_clean = (f"{head} ${symbol} "
f"{tail}")
# the cleaner ignores spaces so there's going to be two spaces between
# the head and tail after the symbol is removed
katamari.expected = f"{head}  {tail}"
return
# When the tweet is cleaned
# Then it has the text removed
The Re-tweets
This tests that we can remove the RT tag.
Scenario: A re-tweet is cleaned.
Given a tweet that has been re-tweeted
When the tweet is cleaned
Then it has the text removed
# Scenario: A re-tweet is cleaned.
@given("a tweet that has been re-tweeted")
def setup_re_tweet(katamari, faker):
katamari.expected = faker.sentence()
spaces = " " * random.randrange(1, 10)
katamari.to_clean = f"RT{spaces}{katamari.expected}"
return
@when("the tweet is cleaned")
def process_tweet(katamari, processor):
katamari.actual = processor.clean(katamari.to_clean)
return
@then("it has the text removed")
def check_cleaned_text(katamari):
expect(katamari.expected).to(equal(katamari.actual))
return
Hyperlinks
Now test that we can remove hyperlinks.
Scenario: The tweet has a hyperlink
Given a tweet with a hyperlink
When the tweet is cleaned
Then it has the text removed
# Scenario: The tweet has a hyperlink
@given("a tweet with a hyperlink")
def setup_hyperlink(katamari, faker):
base = faker.sentence()
katamari.expected = base + " :)"
katamari.to_clean = base + faker.uri() + " :)"
return
Hash Symbols
Test that we can remove the pound symbol.
Scenario: A tweet has hash symbols in it.
Given a tweet with hash symbols
When the tweet is cleaned
Then it has the text removed
@given("a tweet with hash symbols")
def setup_hash_symbols(katamari, faker):
expected = faker.sentence()
tokens = expected.split()
expected_tokens = expected.split()
for count in range(random.randrange(1, 10)):
index = random.randrange(len(tokens))
word = faker.word()
tokens = tokens[:index] + [f"#{word}"] + tokens[index:]
expected_tokens = expected_tokens[:index] + [word] + expected_tokens[index:]
katamari.to_clean = " ".join(tokens)
katamari.expected = " ".join(expected_tokens)
return
Tokenization
This is being done by NLTK, so it might not really make sense to test it, but I figured adding a test would make it more likely that I'd slow down enough to understand what it's doing.
Scenario: The text is tokenized
Given a string of text
When the text is tokenized
Then it is the expected list of strings
# Scenario: The text is tokenized
@given("a string of text")
def setup_text(katamari):
katamari.text = "Time flies like an Arrow, fruit flies like a BANANAAAA!"
katamari.expected = ("time flies like an arrow , "
"fruit flies like a bananaaa !").split()
return
@when("the text is tokenized")
def tokenize(katamari, processor):
katamari.actual = processor.tokenizer.tokenize(katamari.text)
return
@then("it is the expected list of strings")
def check_tokens(katamari):
expect(katamari.actual).to(contain_exactly(*katamari.expected))
return
Stop Word Removal
Check that we're removing stop-words and punctuation.
Scenario: The user removes stop words and punctuation
Given a tokenized string
When the string is un-stopped
Then it is the expected list of strings
#Scenario: The user removes stop words and punctuation
@given("a tokenized string")
def setup_tokenized_string(katamari):
katamari.source = ("now is the winter of our discontent , "
"made glorious summer by this son of york ;").split()
katamari.expected = ("winter discontent made glorious "
"summer son york".split())
return
@when("the string is un-stopped")
def un_stop(katamari, processor):
katamari.actual = processor.remove_useless_tokens(katamari.source)
return
# Then it is the expected list of strings
Stemming
This is kind of a fake test. I guessed incorrectly what the stemming would do the first time so I had to go back and match the test values to what it output. I don't think I'll take the time to learn how the stemming is working, though, so it'll have to do.
Scenario: The user stems the tokens
Given a tokenized string
When the string is un-stopped
And tokens are stemmed
Then it is the expected list of strings
# Scenario: The user stems the tokens
# Given a tokenized string
# When the string is un-stopped
@And("tokens are stemmed")
def stem_tokens(katamari, processor):
katamari.actual = processor.stem(katamari.actual)
katamari.expected = "winter discont made gloriou summer son york".split()
return
# Then it is the expected list of strings
The Whole Shebang
I made some of the steps separate just for illustration and testing, but I'll make the processor callable so they don't have to be done separately.
Scenario: The user calls the processor
Given a tweet
When the processor is called with the tweet
Then it returns the cleaned, tokenized, and stemmed list
# Scenario: The user calls the processor
@given("a tweet")
def setup_tweet(katamari, faker):
katamari.words = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
katamari.tweet = f"RT {katamari.words} {faker.uri()}"
katamari.expected = ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
return
@when("the processor is called with the tweet")
def process_tweet(katamari, processor):
katamari.actual = processor(katamari.tweet)
return
@then("it returns the cleaned, tokenized, and stemmed list")
def check_processed_tweet(katamari):
expect(katamari.actual).to(contain_exactly(*katamari.expected))
return
Implementation
A Regular Expression Helper
class WheatBran:
"""This is a holder for the regular expressions"""
START_OF_LINE = r"^"
END_OF_LINE = r"$"
OPTIONAL = "{}?"
ANYTHING = "."
ZERO_OR_MORE = "{}*"
ONE_OR_MORE = "{}+"
ONE_OF_THESE = "[{}]"
FOLLOWED_BY = r"(?={})"
PRECEDED_BY = r"(?<={})"
OR = "|"
NOT = "^"
SPACE = r"\s"
SPACES = ONE_OR_MORE.format(SPACE)
PART_OF_A_WORD = r"\w"
EVERYTHING_OR_NOTHING = ZERO_OR_MORE.format(ANYTHING)
EVERYTHING_BUT_SPACES = ZERO_OR_MORE.format(
ONE_OF_THESE.format(NOT + SPACE))
ERASE = ""
FORWARD_SLASHES = r"\/\/"
NEWLINES = ONE_OF_THESE.format(r"\r\n")
# a dollar is a special regular expression character meaning end of line
# so escape it
DOLLAR_SIGN = r"\$"
# to remove
STOCK_SYMBOL = DOLLAR_SIGN + ZERO_OR_MORE.format(PART_OF_A_WORD)
RE_TWEET = START_OF_LINE + "RT" + SPACES
HYPERLINKS = ("http" + OPTIONAL.format("s") + ":" + FORWARD_SLASHES
+ EVERYTHING_BUT_SPACES + ZERO_OR_MORE.format(NEWLINES))
HASH = "#"
EYES = ":"
FROWN = r"\("
SMILE = r"\)"
SPACEY_EMOTICON = (FOLLOWED_BY.format(START_OF_LINE + OR + PRECEDED_BY.format(SPACE))
+ EYES + SPACE + "{}" +
FOLLOWED_BY.format(SPACE + OR + END_OF_LINE))
SPACEY_FROWN = SPACEY_EMOTICON.format(FROWN)
SPACEY_SMILE = SPACEY_EMOTICON.format(SMILE)
spacey_fixed_emoticons = [":(", ":)"]
spacey_emoticons = [SPACEY_FROWN, SPACEY_SMILE]
remove = [STOCK_SYMBOL, RE_TWEET, HYPERLINKS, HASH]
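Since the patterns are composed from named pieces, it can be reassuring to print a few of them and check that they expand the way you expect; this is just a sketch, with the comments showing what each one should come out to:
print(WheatBran.STOCK_SYMBOL)  # \$\w*
print(WheatBran.RE_TWEET)      # ^RT\s+
print(WheatBran.HYPERLINKS)    # https?:\/\/[^\s]*[\r\n]*
print(WheatBran.SPACEY_SMILE)  # (?=^|(?<=\s)):\s\)(?=\s|$)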
The Processor
Here's the class-based implementation to pre-process tweets.
# python
import re
import string
# pypi
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import attr
import nltk
<<regular-expressions>>
@attr.s
class TwitterProcessor:
"""processor for tweets"""
_tokenizer = attr.ib(default=None)
_stopwords = attr.ib(default=None)
_stemmer = attr.ib(default=None)
<<processor-clean>>
<<processor-tokenizer>>
<<processor-un-stop>>
<<processor-stopwords>>
<<processor-stemmer>>
<<processor-stem>>
<<processor-call>>
<<processor-emoticon>>
The Clean Method
def clean(self, tweet: str) -> str:
"""Removes sub-strings from the tweet
Args:
tweet: string tweet
Returns:
tweet with certain sub-strings removed
"""
for expression in WheatBran.remove:
tweet = re.sub(expression, WheatBran.ERASE, tweet)
return tweet
Emoticon Fixer
This tries to handle emoticons with spaces in them.
def unspace_emoticons(self, tweet: str) -> str:
"""Tries to remove spaces from emoticons
Args:
tweet: message to check
Returns:
tweet with things that looks like emoticons with spaces un-spaced
"""
for expression, fix in zip(
WheatBran.spacey_emoticons, WheatBran.spacey_fixed_emoticons):
tweet = re.sub(expression, fix, tweet)
return tweet
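A quick sketch of what this is for, using a spaced-out version of the frown from the sample tweet earlier (the instance here is just for illustration):
processor = TwitterProcessor()
print(processor.unspace_emoticons("hopeless for tmr : ("))  # hopeless for tmr :(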
The Tokenizer
@property
def tokenizer(self) -> TweetTokenizer:
"""The NLTK Tweet Tokenizer
It will:
- tokenize a string
- remove twitter handles
- remove repeated characters after the first three
"""
if self._tokenizer is None:
self._tokenizer = TweetTokenizer(preserve_case=False,
strip_handles=True,
reduce_len=True)
return self._tokenizer
Stopwords
This might make more sense to be done at the module level, but I'll see how it goes.
@property
def stopwords(self) -> list:
"""NLTK English stopwords
Warning:
if the stopwords haven't been downloaded this also tries to download them
"""
if self._stopwords is None:
nltk.download('stopwords', quiet=True)
self._stopwords = stopwords.words("english")
return self._stopwords
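For comparison, the module-level alternative I mentioned would be something like this sketch, paying the download and load cost once at import time instead of lazily:
# hypothetical module-level version: load the stopwords when the module is imported
nltk.download("stopwords", quiet=True)
STOPWORDS = stopwords.words("english")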
Un-Stop the Tokens
def remove_useless_tokens(self, tokens: list) -> list:
"""Remove stopwords and punctuation
Args:
tokens: list of strings
Returns:
the tokens with stopwords and punctuation removed
"""
return [word for word in tokens if (word not in self.stopwords and
word not in string.punctuation)]
Stem the Tokens
@property
def stemmer(self) -> PorterStemmer:
"""Porter Stemmer for the tokens"""
if self._stemmer is None:
self._stemmer = PorterStemmer()
return self._stemmer
def stem(self, tokens: list) -> list:
"""stem the tokens"""
return [self.stemmer.stem(word) for word in tokens]
Call Me
def __call__(self, tweet: str) -> list:
"""does all the processing in one step
Args:
tweet: string to process
Returns:
the tweet as a pre-processed list of strings
"""
cleaned = self.unspace_emoticons(tweet)
cleaned = self.clean(cleaned)
cleaned = self.tokenizer.tokenize(cleaned.strip())
# the stopwords are un-stemmed so this has to come before stemming
cleaned = self.remove_useless_tokens(cleaned)
cleaned = self.stem(cleaned)
return cleaned
Save Things
Rather than create the training and test sets over and over I'll save them as feather files. I tried saving them as CSVs but I think since the tweets have commas in them it messes it up (or something does anyway). I don't use the un-processed tweets later, but maybe it's a good idea to keep things around.
It occurred to me that I could just pickle the data-frame, but I've never used feather before so it'll give me a chance to try it out. According to that post I linked to about feather it's meant to be fast rather than stable (the format might change) so this is both overkill and impractical, but, oh, well.
Data
First I'll process the tweets so I won't have to do this later.
process = TwitterProcessor()
training_processed = training.copy()
training_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_train]
print(training.head(1))
print(training_processed.head(1))
                                       tweet  label
0  off to the park to get some sunlight : )      1
                       tweet  label
0  [park, get, sunlight, :)]      1
Now to save it
processed_path = Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser()
training_processed.to_feather(processed_path)
training.to_feather(raw_path)
processed_path = Path(os.environ["TWITTER_TEST_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TEST_RAW"]).expanduser()
testing = pandas.DataFrame.from_dict(dict(tweet=x_test, label=y_test))
testing_processed = testing.copy()
testing_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_test]
testing_processed.to_feather(processed_path)
testing.to_feather(raw_path)
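Reading them back in a later post should be just as simple; a sketch using the same environment variables (the variable names here are just for illustration):
train_processed = pandas.read_feather(
    Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser())
test_processed = pandas.read_feather(
    Path(os.environ["TWITTER_TEST_PROCESSED"]).expanduser())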
Pickles
I'm spreading this over several posts so I'm going to save some python objects to hold constant values that they share.
path = Path(os.environ["TWITTER_SENTIMENT"]).expanduser()
with path.open("wb") as writer:
pickle.dump(Sentiment, writer)
path = Path(os.environ["TWITTER_PLOT"]).expanduser()
with path.open("wb") as writer:
pickle.dump(Plot, writer)
Next in the series: Twitter Word Frequencies
Note: This series is a re-write of an exercise taken from Coursera's Natural Language Processing specialization. I changed some of the way it works, though, so it won't match their solution 100% (but it's close).