Twitter Preprocessing With NLTK

Beginning

This is the first in a series that will look at taking a group of tweets and building a Logistic Regression model to classify tweets as having either positive or negative sentiment (the next post in the series is linked at the end).

This first post is a look at taking a corpus of Twitter data which comes from the Natural Language Toolkit's (NLTK) collection of data and creating a preprocessor for a Sentiment Analysis pipeline. This dataset has entries whose sentiment was categorized by hand so it's a convenient source for training models.

The NLTK Corpus How To has a brief description of the Twitter dataset and they also have some documentation about how to gather new data using the Twitter API yourself.

Set Up

Imports

# from python
from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint

import os
import pickle
import random
import re
import string

# from pypi
from dotenv import load_dotenv
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split

import holoviews
import hvplot.pandas
import nltk
import pandas

# this is created further down in the post
from neurotic.nlp.twitter.processor import TwitterProcessor

# my stuff
from graeae import CountPercentage, EmbedHoloviews

The Environment

This is where I keep the paths to the files I save.

load_dotenv("posts/nlp/.env")

Data

The first thing to do is download the dataset using the download function. If you don't pass an argument to it a dialog will open and you can choose to download any or all of their datasets, but for this exercise we'll just download the Twitter samples. Note that if you run this function and the samples were already downloaded then it won't re-download them so it's safe to call it in any case.

nltk.download('twitter_samples')

The data is contained in three files. You can see the file names using the twitter_samples.fileids function.

for name in twitter_samples.fileids():
    print(f" - {name}")
- negative_tweets.json
- positive_tweets.json
- tweets.20150430-223406.json

As you can see (or maybe guess) two of the files contain tweets that have been categorized as negative or positive. The third file has another 20,000 tweets that aren't classified.

The dataset contains the JSON for each tweet, including some metadata, which you can access through the twitter_samples.docs function. Here's a sample.

pprint(twitter_samples.docs()[0])
{'contributors': None,
 'coordinates': None,
 'created_at': 'Fri Jul 24 10:42:49 +0000 2015',
 'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 624530164626534400,
 'id_str': '624530164626534400',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'place': None,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web '
           '(M2)</a>',
 'text': 'hopeless for tmr :(',
 'truncated': False,
 'user': {'contributors_enabled': False,
          'created_at': 'Sun Mar 08 05:43:40 +0000 2015',
          'default_profile': False,
          'default_profile_image': False,
          'description': '⇨ [V] TravelGency █ 2/4 Goddest from Girls Day █ 92L '
                         '█ sucrp',
          'entities': {'description': {'urls': []}},
          'favourites_count': 196,
          'follow_request_sent': False,
          'followers_count': 1281,
          'following': False,
          'friends_count': 1264,
          'geo_enabled': True,
          'has_extended_profile': False,
          'id': 3078803375,
          'id_str': '3078803375',
          'is_translation_enabled': False,
          'is_translator': False,
          'lang': 'id',
          'listed_count': 3,
          'location': 'wearegsd;favor;pucukfams;barbx',
          'name': 'yuwra ✈ ',
          'notifications': False,
          'profile_background_color': '000000',
          'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png',
          'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/585476378365014016/j1mvQu3c.png',
          'profile_background_tile': True,
          'profile_banner_url': 'https://pbs.twimg.com/profile_banners/3078803375/1433287528',
          'profile_image_url': 'http://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg',
          'profile_image_url_https': 'https://pbs.twimg.com/profile_images/622631732399898624/kmYsX_k1_normal.jpg',
          'profile_link_color': '000000',
          'profile_sidebar_border_color': '000000',
          'profile_sidebar_fill_color': '000000',
          'profile_text_color': '000000',
          'profile_use_background_image': True,
          'protected': False,
          'screen_name': 'yuwraxkim',
          'statuses_count': 19710,
          'time_zone': 'Jakarta',
          'url': None,
          'utc_offset': 25200,
          'verified': False}}

There's some potentially useful data here - like if the tweet was re-tweeted, but for what we're doing we'll just use the tweet itself.

To get just the text of the tweets you use the twitter_samples.strings function.

help(twitter_samples.strings)
Help on method strings in module nltk.corpus.reader.twitter:

strings(fileids=None) method of nltk.corpus.reader.twitter.TwitterCorpusReader instance
    Returns only the text content of Tweets in the file(s)
    
    :return: the given file(s) as a list of Tweets.
    :rtype: list(str)

Note that the docstring says it returns only the text of the given file(s) as a list of tweets, but the fileids argument is optional. If you don't pass in any argument you end up with the tweets from all the files in the same list, which you probably don't want.
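
As a quick check, leaving the argument out should give you one list with the tweets from all three files combined.

print(f"{len(twitter_samples.strings()):,}")
# should be 30,000 - the positive, negative, and unclassified tweets all in one list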

positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
all_tweets = twitter_samples.strings("tweets.20150430-223406.json")

Now I'll download the stopwords for our pre-processing and set up the English stopwords for use later.

nltk.download('stopwords')
english_stopwords = stopwords.words("english")

Rather than working with the whole data-set I'm going to split it up here so that we only work with the training set. The first thing to do is create a set of labels for the positive and negative tweets.

Sentiment = Namespace(
    positive = 1,
    negative = 0,
    decode = {
        1: "positive",
        0: "negative"
    },
    encode = {
        "positive": 1,
        "negative": 0,
    }
)
positive_labels = [Sentiment.positive] * len(positive)
negative_labels = [Sentiment.negative] * len(negative)

Now I'll combine the positive and negative tweets.

labels = positive_labels + negative_labels
tweets = positive + negative

print(f"Labels: {len(labels):,}")
print(f"tweets: {len(tweets):,}")
Labels: 10,000
tweets: 10,000

Now we can do the train-test splitting. The train_test_split function shuffles and splits up the dataset, so combining the positive and negative sets first before the splitting seemed like a good idea.

TRAINING_SIZE = 0.8
SEED = 20200724
x_train, x_test, y_train, y_test = train_test_split(
    tweets, labels, train_size=TRAINING_SIZE, random_state=SEED)

print(f"Training: {len(x_train):,}\tTesting: {len(x_test):,}")
Training: 8,000 Testing: 2,000

The Random Seed

This just sets the random seed so that we get the same values if we re-run this later on (although this is a little tricky with the notebook, since you can call the same code multiple times).

random.seed(SEED)

Plotting

I won't be doing a lot of plotting here, but this is a setup for the little that I do.

SLUG = "01-twitter-preprocessing-with-nltk"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/nlp/{SLUG}",
                create_folder=False)
Plot = Namespace(
    width=990,
    height=780,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    font_scale=2,
    color_cycle = holoviews.Cycle(["#4687b7", "#ce7b6d"])
)

Middle

It can be more convenient to use a Pandas Series for some checks of the tweets so I'll convert the all-tweets list to one.

all_tweets = pandas.Series(all_tweets)

Explore the Data

Let's start by looking at the number of tweets we got and confirming that the strings function gave us back a list of strings like the docstring said it would.

print(f"Number of tweets: {len(all_tweets):,}")
print(f'Number of positive tweets: {len(positive):,}')
print(f'Number of negative tweets: {len(negative):,}')

for thing in (positive, negative):
    assert type(thing) is list
    assert type(random.choice(thing)) is str
Number of tweets: 20,000
Number of positive tweets: 5,000
Number of negative tweets: 5,000

We can see that the data for each file is made up of strings stored in a list, and that the unclassified file has 20,000 tweets while the hand-categorized files have only half as many (10,000) between them.

Looking At Some Examples

First, since our data sets are shuffled, I'll convert them into a pandas DataFrame to make it a little easier to get positive vs negative tweets.

training = pandas.DataFrame.from_dict(dict(tweet=x_train, label=y_train))
print(f"Random Positive Tweet: {random.choice(positive)}")
print(f"\nRandom Negative Tweet: {random.choice(negative)}")
Random Positive Tweet: Hi.. Please say"happybirthday" to me :) thanksss :) —  http://t.co/HPXV43LK5L

Random Negative Tweet: I think I should stop getting so angry over stupid shit :(

The First Token

Later on we're going to remove the "RT" (re-tweet) token at the start of the strings. Let's look at how significant this is.

first_tokens = all_tweets.str.split(expand=True)[0]
top_ten = CountPercentage(first_tokens, stop=10, value_label="First Token")
top_ten()
First Token Count Percent (%)
RT 13287 92.92
I 160 1.12
Farage 141 0.99
The 134 0.94
VIDEO: 132 0.92
Nigel 117 0.82
Ed 116 0.81
Miliband 77 0.54
SNP 69 0.48
@UKIP 67 0.47

That gives you some sense of how much there is, but plotting it might make it a little clearer.

plot = top_ten.table.hvplot.bar(y="Percent (%)", x="First Token").opts(
    title="Top Ten Tweet First Tokens", 
    width=Plot.width,
    height=Plot.height)
output = Embed(plot=plot, file_name="top_ten", create_folder=False)
print(output())

Figure Missing

So, about 93% of the unclassified tweets start with RT, making it perhaps not so informative a token. Or maybe it is… what does a re-tweet tell us? Let's look at whether the re-tweeted tweets show up as duplicates and, if so, how many times they appear.

retweeted = all_tweets[all_tweets.str.startswith("RT")].value_counts().iloc[:10]
for item in retweeted.values:
    print(f" - {item}")
 - 491
 - 430
 - 131
 - 131
 - 117
 - 103
 - 82
 - 73
 - 69
 - 68

Some of the entries are the same tweet repeated hundreds of times. Does each one count as an additional entry? I don't show it here because the tweets are kind of long, but the top five are all about British politics, so there might have been some kind of bias in the way the tweets were gathered.
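
If you want to peek at the repeated tweets yourself without flooding the page, truncating the text works well enough. This is just a sketch (the sixty-character cutoff is arbitrary).

for tweet, count in retweeted.iloc[:5].items():
    print(f" - ({count:,}) {tweet[:60]}...")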

Processing the Data

There are four basic steps in our NLP pre-processing:

  • Tokenizing the string
  • Lowercasing
  • Removing stop words and punctuation
  • Stemming

Let's start by pulling up a tweet that has most of the stuff we're cleaning up.

THE_CHOSEN = training[(training.tweet.str.contains("beautiful")) &
                      (training.tweet.str.contains("http")) &
                      (training.tweet.str.contains("#"))].iloc[0].tweet
print(THE_CHOSEN)
My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

Cleaning Up Twitter-Specific Markup

Although I listed four steps in the beginning, there's often another step where we remove things that are common or not useful but known in advance. In this case we want to remove re-tweet tags (RT), hyperlinks, and hash marks (#). We're going to do that with Python's built-in regular expression module. We're also going to do it one tweet at a time, although you could perhaps do it more efficiently in bulk using pandas.
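
As an aside, a bulk version of (say) the re-tweet removal might look something like this sketch using pandas' string methods; we won't actually use it.

bulk_cleaned = all_tweets.str.replace(r"^RT\s+", "", regex=True)
print(bulk_cleaned.head(1))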

START_OF_LINE = r"^"
OPTIONAL = "?"
ANYTHING = "."
ZERO_OR_MORE = "*"
ONE_OR_MORE = "+"

SPACE = r"\s"
SPACES = SPACE + ONE_OR_MORE
NOT_SPACE = r"[^\s]" + ONE_OR_MORE
EVERYTHING_OR_NOTHING = ANYTHING + ZERO_OR_MORE

ERASE = ""
FORWARD_SLASH = r"\/"
NEWLINES = r"[\r\n]"
  • Re-Tweets

    None of the positive or negative samples have this tag so I'm going to pull an example from the complete set just to show it working.

    RE_TWEET = START_OF_LINE + "RT" + SPACES
    
    tweet = all_tweets[0]
    print(tweet)
    tweet = re.sub(RE_TWEET, ERASE, tweet)
    print(tweet)
    
    RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
    @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP
    
  • Hyperlinks
    HYPERLINKS = ("http" + "s" + OPTIONAL + ":" + FORWARD_SLASH + FORWARD_SLASH
                  + NOT_SPACE + NEWLINES + ZERO_OR_MORE)
    
    print(THE_CHOSEN)
    re_chosen = re.sub(HYPERLINKS, ERASE, THE_CHOSEN)
    print(re_chosen)
    
    My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
    My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… 
    
  • HashTags

    We aren't removing the actual hash-tags, just the hash-marks (#).

    HASH = "#"
    re_chosen = re.sub(HASH, ERASE, re_chosen)
    print(re_chosen)
    
    My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
    

Tokenize

NLTK has a tokenizer specially built for tweets. The twitter_samples corpus reader actually has a tokenized method that breaks the tweets up for you, but since we are using regular expressions to clean up the strings a little first, it makes more sense to tokenize the strings afterwards. Also note that one of the steps in the pipeline is to lower-case the letters, which the TweetTokenizer will do for us if we set the preserve_case argument to False.
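
As a quick aside, here's what the corpus reader's tokenized method gives you for the first positive tweet (shown only for comparison, since we won't be using it).

print(twitter_samples.tokenized("positive_tweets.json")[0])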

print(help(TweetTokenizer))
Help on class TweetTokenizer in module nltk.tokenize.casual:

class TweetTokenizer(builtins.object)
 |  TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
 |  
 |  Tokenizer for tweets.
 |  
 |      >>> from nltk.tokenize import TweetTokenizer
 |      >>> tknzr = TweetTokenizer()
 |      >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
 |      >>> tknzr.tokenize(s0)
 |      ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
 |  
 |  Examples using `strip_handles` and `reduce_len parameters`:
 |  
 |      >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
 |      >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
 |      >>> tknzr.tokenize(s1)
 |      [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
 |  
 |  Methods defined here:
 |  
 |  __init__(self, preserve_case=True, reduce_len=False, strip_handles=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  tokenize(self, text)
 |      :param text: str
 |      :rtype: list(str)
 |      :return: a tokenized list of strings; concatenating this list returns        the original string if `preserve_case=False`
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

None
tokenizer = TweetTokenizer(
    preserve_case=False,
    strip_handles=True,
    reduce_len=True)

As I mentioned, setting preserve_case to False lower-cases the letters. The other two arguments are strip_handles, which removes the twitter handles, and reduce_len, which limits the number of times a character can be repeated to three, so zzzzz will be changed to zzz. Now we can tokenize our partly cleaned tweet.
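
Here's a quick, made-up example of the other two arguments in action (the expected output is my approximation of what the tokenizer produces, so treat it as a rough guide).

print(tokenizer.tokenize("@some_user that was craaaaazy!!!"))
# roughly: ['that', 'was', 'craaazy', '!', '!', '!']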

print(re_chosen)
tokens = tokenizer.tokenize(re_chosen)
print(tokens)
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']

Remove Stop Words and Punctuation

print(english_stopwords)
print(string.punctuation)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Not as many stopwords as I would have thought.

cleaned = [word for word in tokens if (word not in english_stopwords and
                                       word not in string.punctuation)]
print(cleaned)
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']

Stemming

We're going to use the Porter Stemmer from NLTK to stem the words (this is the official Porter Stemmer algorithm page).

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in cleaned]
print(stemmed)
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

End

So now we've seen the basic steps that we're going to need to preprocess our tweets for Sentiment Analysis.

Things to check out:

  • The book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze has some useful information about tokenizing, stop words, and stemming, among other things (and is available to read online).
  • preprocessor (called tweet-preprocessor on pypi) has some of this baked in. Its hashtag cleaning removes both the word and the pound sign, and it doesn't use the NLTK twitter tokenizer, but it looks like it might be useful (unfortunately not everything is documented, so you have to look at the code to figure some things out).

Finally, I'm going to re-write what we did as a class so it can be re-used later, and save the testing and training data.

Tests

I'm going to use pytest-bdd to run the tests for the pre-processor but I'm also going to take advantage of org-babel and keep the scenario definitions and the test functions grouped by what they do, even though they will exist in two different files (tweet_preprocessing.feature and test_preprocessing.py) when tangled out of this file.

The Tangles

Feature: Tweet pre-processor

<<stock-processing>>

<<re-tweet-processing>>

<<hyperlink-processing>>

<<hash-processing>>

<<tokenization-preprocessing>>

<<stop-word-preprocessing>>

<<stem-preprocessing>>

<<whole-shebang-preprocessing>>
# from pypi
import pytest

# software under test
from neurotic.nlp.twitter.processor import TwitterProcessor

class Katamari:
    """Something to stick values into"""

@pytest.fixture
def katamari():
    return Katamari()


@pytest.fixture
def processor():
    return TwitterProcessor()
# from python
import random
import string

# from pypi
from expects import (
    contain_exactly,
    equal,
    expect
)
from pytest_bdd import (
    given,
    scenarios,
    then,
    when,
)

And = when


# fixtures
from fixtures import katamari, processor

scenarios("twitter/tweet_preprocessing.feature")


<<test-stock-symbol>>


<<test-re-tweet>>


<<test-hyperlinks>>


<<test-hashtags>>


<<test-tokenization>>


<<test-unstopping>>


<<test-stem>>


<<test-call>>

Now on to the sections that go into the tangles.

Stock Symbols

Twitter has a special symbol for stocks which is a dollar sign followed by the stock ticker name (e.g. $HOG for Harley Davidson) that I'll remove. This is going to assume anything with a dollar sign immediately followed by a letter, number, or underscore is a stock symbol.
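
As a rough sketch, the pattern boils down to a dollar sign followed by word characters - the same thing the STOCK_SYMBOL expression in the implementation further down matches.

# a dollar sign followed by zero or more word characters (letters, digits, the underscore)
STOCK = r"\$\w*"
print(re.sub(STOCK, "", "I'm selling my $HOG shares :("))
# "I'm selling my  shares :(" - note the leftover double space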

Scenario: A tweet with a stock symbol is cleaned
  Given a tweet with a stock symbol in it
  When the tweet is cleaned
  Then it has the text removed
#Scenario: A tweet with a stock symbol is cleaned


@given("a tweet with a stock symbol in it")
def setup_stock_symbol(katamari, faker):
    symbol = "".join(random.choices(string.ascii_uppercase, k=4))
    head, tail = faker.sentence(), faker.sentence()
    katamari.to_clean = (f"{head} ${symbol} "
                         f"{tail}")

    # the cleaner ignores spaces so there's going to be two spaces between
    # the head and tail after the symbol is removed
    katamari.expected = f"{head}  {tail}"
    return

#   When the tweet is cleaned
#   Then it has the text removed

The Re-tweets

This tests that we can remove the RT tag.

Scenario: A re-tweet is cleaned.

  Given a tweet that has been re-tweeted
  When the tweet is cleaned
  Then it has the text removed
# Scenario: A re-tweet is cleaned.

@given("a tweet that has been re-tweeted")
def setup_re_tweet(katamari, faker):
    katamari.expected = faker.sentence()
    spaces = " " * random.randrange(1, 10)
    katamari.to_clean = f"RT{spaces}{katamari.expected}"
    return


@when("the tweet is cleaned")
def process_tweet(katamari, processor):
    katamari.actual = processor.clean(katamari.to_clean)
    return


@then("it has the text removed")
def check_cleaned_text(katamari):
    expect(katamari.expected).to(equal(katamari.actual))
    return

Hyperlinks

Now test that we can remove hyperlinks.

Scenario: The tweet has a hyperlink
  Given a tweet with a hyperlink
  When the tweet is cleaned
  Then it has the text removed
# Scenario: The tweet has a hyperlink

@given("a tweet with a hyperlink")
def setup_hyperlink(katamari, faker):
    base = faker.sentence()
    katamari.expected = base + " :)"
    katamari.to_clean = base + faker.uri() + " :)"
    return

Hash Symbols

Test that we can remove the pound symbol.

Scenario: A tweet has hash symbols in it.
  Given a tweet with hash symbols
  When the tweet is cleaned
  Then it has the text removed
@given("a tweet with hash symbols")
def setup_hash_symbols(katamari, faker):
    expected = faker.sentence()
    tokens = expected.split()
    expected_tokens = expected.split()

    for count in range(random.randrange(1, 10)):
        index = random.randrange(len(tokens))
        word = faker.word()
        tokens = tokens[:index] + [f"#{word}"] + tokens[index:]
        expected_tokens = expected_tokens[:index] + [word] + expected_tokens[index:]
    katamari.to_clean = " ".join(tokens)
    katamari.expected = " ".join(expected_tokens)
    return

Tokenization

This is being done by NLTK, so it might not really make sense to test it, but I figured adding a test would make it more likely that I'd slow down enough to understand what it's doing.

Scenario: The text is tokenized
  Given a string of text
  When the text is tokenized
  Then it is the expected list of strings
# Scenario: The text is tokenized


@given("a string of text")
def setup_text(katamari):
    katamari.text = "Time flies like an Arrow, fruit flies like a BANANAAAA!"
    katamari.expected = ("time flies like an arrow , "
                         "fruit flies like a bananaaa !").split()
    return


@when("the text is tokenized")
def tokenize(katamari, processor):
    katamari.actual = processor.tokenizer.tokenize(katamari.text)
    return


@then("it is the expected list of strings")
def check_tokens(katamari):
    expect(katamari.actual).to(contain_exactly(*katamari.expected))
    return

Stop Word Removal

Check that we're removing stop-words and punctuation.

Scenario: The user removes stop words and punctuation
  Given a tokenized string
  When the string is un-stopped
  Then it is the expected list of strings
#Scenario: The user removes stop words and punctuation


@given("a tokenized string")
def setup_tokenized_string(katamari):
    katamari.source = ("now is the winter of our discontent , "
                       "made glorious summer by this son of york ;").split()
    katamari.expected = ("winter discontent made glorious "
                         "summer son york".split())
    return


@when("the string is un-stopped")
def un_stop(katamari, processor):
    katamari.actual = processor.remove_useless_tokens(katamari.source)
    return
#  Then it is the expected list of strings

Stemming

This is kind of a fake test. I guessed incorrectly what the stemming would do the first time so I had to go back and match the test values to what it output. I don't think I'll take the time to learn how the stemming is working, though, so it'll have to do.

Scenario: The user stems the tokens
  Given a tokenized string
  When the string is un-stopped
  And tokens are stemmed
  Then it is the expected list of strings
# Scenario: The user stems the tokens
#  Given a tokenized string
#  When the string is un-stopped


@And("tokens are stemmed")
def stem_tokens(katamari, processor):
    katamari.actual = processor.stem(katamari.actual)
    katamari.expected = "winter discont made gloriou summer son york".split()
    return


#  Then it is the expected list of strings

The Whole Shebang

I made some of the steps separate just for illustration and testing, but I'll make the processor callable so they don't have to be done separately.

Scenario: The user calls the processor
  Given a tweet
  When the processor is called with the tweet
  Then it returns the cleaned, tokenized, and stemmed list
# Scenario: The user calls the processor


@given("a tweet")
def setup_tweet(katamari, faker):
    katamari.words = "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)"
    katamari.tweet = f"RT {katamari.words} {faker.uri()}"
    katamari.expected =  ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
    return


@when("the processor is called with the tweet")
def process_tweet(katamari, processor):
    katamari.actual = processor(katamari.tweet)
    return


@then("it returns the cleaned, tokenized, and stemmed list")
def check_processed_tweet(katamari):
    expect(katamari.actual).to(contain_exactly(*katamari.expected))
    return

Implementation

A Regular Expression Helper

class WheatBran:
    """This is a holder for the regular expressions"""
    START_OF_LINE = r"^"
    END_OF_LINE = r"$"
    OPTIONAL = "{}?"
    ANYTHING = "."
    ZERO_OR_MORE = "{}*"
    ONE_OR_MORE = "{}+"
    ONE_OF_THESE = "[{}]"
    FOLLOWED_BY = r"(?={})"
    PRECEDED_BY = r"(?<={})"
    OR = "|"

    NOT = "^"
    SPACE = r"\s"
    SPACES = ONE_OR_MORE.format(SPACE)
    PART_OF_A_WORD = r"\w"
    EVERYTHING_OR_NOTHING = ZERO_OR_MORE.format(ANYTHING)
    EVERYTHING_BUT_SPACES = ZERO_OR_MORE.format(
        ONE_OF_THESE.format(NOT + SPACE))

    ERASE = ""
    FORWARD_SLASHES = r"\/\/"
    NEWLINES = ONE_OF_THESE.format(r"\r\n")
    # a dollar is a special regular expression character meaning end of line
    # so escape it
    DOLLAR_SIGN = r"\$"

    # to remove
    STOCK_SYMBOL = DOLLAR_SIGN + ZERO_OR_MORE.format(PART_OF_A_WORD)
    RE_TWEET = START_OF_LINE + "RT" + SPACES
    HYPERLINKS = ("http" + OPTIONAL.format("s") + ":" + FORWARD_SLASHES
                  + EVERYTHING_BUT_SPACES + ZERO_OR_MORE.format(NEWLINES))
    HASH = "#"

    EYES = ":"
    FROWN = r"\("
    SMILE = r"\)"

    SPACEY_EMOTICON = (FOLLOWED_BY.format(START_OF_LINE + OR + PRECEDED_BY.format(SPACE))
                       + EYES + SPACE + "{}" +
                       FOLLOWED_BY.format(SPACE + OR + END_OF_LINE))
    SPACEY_FROWN = SPACEY_EMOTICON.format(FROWN)
    SPACEY_SMILE = SPACEY_EMOTICON.format(SMILE)

    spacey_fixed_emoticons = [":(", ":)"]
    spacey_emoticons = [SPACEY_FROWN, SPACEY_SMILE]

    remove = [STOCK_SYMBOL, RE_TWEET, HYPERLINKS, HASH]

The Processor

Here's the class-based implementation to pre-process tweets.

# python
import re
import string

# pypi
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

import attr
import nltk

<<regular-expressions>>


@attr.s
class TwitterProcessor:
    """processor for tweets"""
    _tokenizer = attr.ib(default=None)
    _stopwords = attr.ib(default=None)
    _stemmer = attr.ib(default=None)

    <<processor-clean>>

    <<processor-tokenizer>>

    <<processor-un-stop>>

    <<processor-stopwords>>

    <<processor-stemmer>>

    <<processor-stem>>

    <<processor-call>>

    <<processor-emoticon>>

The Clean Method

def clean(self, tweet: str) -> str:
    """Removes sub-strings from the tweet

    Args:
     tweet: string tweet

    Returns:
     tweet with certain sub-strings removed
    """
    for expression in WheatBran.remove:
        tweet = re.sub(expression, WheatBran.ERASE, tweet)
    return tweet

Emoticon Fixer

This tries to handle emoticons with spaces in them.

def unspace_emoticons(self, tweet: str) ->  str:
    """Tries to  remove spaces from emoticons

    Args:
     tweet: message to check

    Returns:
     tweet with things that looks like emoticons with spaces un-spaced
    """
    for expression, fix in zip(
            WheatBran.spacey_emoticons, WheatBran.spacey_fixed_emoticons):
        tweet = re.sub(expression, fix, tweet)
    return tweet

The Tokenizer

@property
def tokenizer(self) -> TweetTokenizer:
    """The NLTK Tweet Tokenizer

    It will:
     - tokenize a string
     - remove twitter handles
     - remove repeated characters after the first three
    """
    if self._tokenizer is None:
        self._tokenizer = TweetTokenizer(preserve_case=False,
                                         strip_handles=True,
                                         reduce_len=True)
    return self._tokenizer

Stopwords

This might make more sense to be done at the module level, but I'll see how it goes.

@property
def stopwords(self) -> list:
    """NLTK English stopwords

    Warning:
     if the stopwords haven't been downloaded this also tries to download them
    """
    if self._stopwords is None:
        nltk.download('stopwords', quiet=True)
        self._stopwords =  stopwords.words("english")
    return self._stopwords

Un-Stop the Tokens

def remove_useless_tokens(self, tokens: list) -> list:
    """Remove stopwords and punctuation

    Args:
     tokens: list of strings

    Returns:
     the tokens with stopwords and punctuation removed
    """
    return [word for word in tokens if (word not in self.stopwords and
                                        word not in string.punctuation)]

Stem the Tokens

@property
def stemmer(self) -> PorterStemmer:
    """Porter Stemmer for the tokens"""
    if self._stemmer is None:
        self._stemmer = PorterStemmer()
    return self._stemmer
def stem(self, tokens: list) -> list:
    """stem the tokens"""
    return [self.stemmer.stem(word) for word in tokens]

Call Me

def __call__(self, tweet: str) -> list:
    """does all the processing in one step

    Args:
     tweet: string to process

    Returns:
     the tweet as a pre-processed list of strings
    """
    cleaned = self.unspace_emoticons(tweet)
    cleaned = self.clean(cleaned)
    cleaned = self.tokenizer.tokenize(cleaned.strip())
    # the stopwords are un-stemmed so this has to come before stemming
    cleaned = self.remove_useless_tokens(cleaned)
    cleaned = self.stem(cleaned)
    return cleaned
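
As a quick sanity check, calling the processor on the sunflower tweet from earlier should reproduce the list we built up step by step (the expected output is copied from the manual steps above, so treat it as approximate).

processor = TwitterProcessor()
print(processor(THE_CHOSEN))
# roughly: ['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']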

Save Things

Rather than create the training and test sets over and over, I'll save them as feather files. I tried saving them as CSVs, but I think the commas in the tweets mess it up (or something does, anyway). I don't use the un-processed tweets later, but it seems like a good idea to keep them around.

It occurred to me that I could just pickle the data-frame, but I've never used feather before so it'll give me a chance to try it out. According to the post I linked to about feather, it's meant to be fast rather than stable (the format might change), so this is both overkill and impractical, but, oh well. I'll read one of the files back after saving as a quick check.

Data

First I'll process the tweets so I won't have to do this later.

process = TwitterProcessor()
training_processed = training.copy()
training_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_train]
print(training.head(1))
print(training_processed.head(1))
                                      tweet  label
0  off to the park to get some sunlight : )      1
                       tweet  label
0  [park, get, sunlight, :)]      1

Now to save it

processed_path = Path(os.environ["TWITTER_TRAINING_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TRAINING_RAW"]).expanduser()

training_processed.to_feather(processed_path)
training.to_feather(raw_path)
processed_path = Path(os.environ["TWITTER_TEST_PROCESSED"]).expanduser()
raw_path = Path(os.environ["TWITTER_TEST_RAW"]).expanduser()

testing = pandas.DataFrame.from_dict(dict(tweet=x_test, label=y_test))
testing_processed = testing.copy()
testing_processed.loc[:, "tweet"] = [process(tweet) for tweet in x_test]

testing_processed.to_feather(processed_path)
testing.to_feather(raw_path)
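
As a sanity check that the feather round-trip works (a sketch, assuming the environment variables above point at readable paths), the frames can be read back with pandas.read_feather.

reloaded = pandas.read_feather(processed_path)
print(reloaded.head(1))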

Pickles

I'm spreading this over several posts so I'm going to save some python objects to hold constant values that they share.

path = Path(os.environ["TWITTER_SENTIMENT"]).expanduser()
with path.open("wb") as writer:
    pickle.dump(Sentiment, writer)
path = Path(os.environ["TWITTER_PLOT"]).expanduser()
with path.open("wb") as writer:
    pickle.dump(Plot, writer)

Next in the series: Twitter Word Frequencies

Note: This series is a re-write of an exercise taken from Coursera's Natural Language Processing specialization. I changed some of the way it works, though, so it won't match their solution 100% (but it's close).

Flask, TensorFlow, Streamlit and the MNIST Dataset

Beginning

This is a re-working of Coursera's Neural Network Visualizer Web App With Python course. What we'll do is use tensorflow to build a model to classify images of handwritten digits from the MNIST Database of Handwritten Digits, which tensorflow provides as one of its pre-built datasets. MNIST (according to Wikipedia) stands for Modified National Institute of Standards and Technology (so we're using the Modified NIST Database).

Once we have the model we'll use Flask to serve up the model and Streamlit to build a web page to view the results.

Set Up

Parts

These are the libraries that we will use.

  • Python
    from functools import partial
    from pathlib import Path
    
    import os
    
  • PyPi
    from bokeh.models import HoverTool
    from dotenv import load_dotenv
    
    import matplotlib.pyplot as pyplot
    import numpy
    import pandas
    import hvplot.pandas
    import seaborn
    import tensorflow
    
  • My Stuff
    from graeae import EmbedHoloviews
    

The Environment

load_dotenv(".env", override=True)

Plotting

There won't be a lot of plotting, but we'll use matplotlib with seaborn to look at some images to see what they look like and HVplot to do other visualizations.

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "figure.figsize": (8, 6)},
            font_scale=1)

This is for the nikola posts. If you run the jupyter kernel on a remote machine there are going to be two behaviors for the plot files. If you create the file in the code block (like I do for the HVPlot plots) then the file will show up on the remote machine. If you use the :file argument in the org-mode header (like I do for matplotlib) it will create the file on the machine where you're running emacs. Given this behavior it might make more sense to edit the file on the remote machine so all the files are created there… next time.

SLUG = "flask-tensorflow-and-mnist"
OUTPUT = Path("../../files/posts/keras/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT)

The Random Seed

Since I'm commenting on the outcomes I'll set the random seed to try and make things more consistent.

tensorflow.random.set_seed(2020)

The Data

Like I mentioned, tensorflow includes the MNIST data set that we can grab with the load_data function. It returns two tuples of numpy arrays.

(x_train, y_train), (x_test, y_test) = tensorflow.keras.datasets.mnist.load_data()

Let's see how much data we have.

rows, width, height = x_train.shape
print(f"Training:\t{rows:,} images\timage = {width} x {height}")
rows, width, height = x_test.shape
print(f"Testing:\t{rows:,} images\timage = {width} x {height}")
Training:       60,000 images   image = 28 x 28
Testing:        10,000 images   image = 28 x 28

A Note On the Tangling

I'm going to do this as a literate programming document with the tangle going into a temporary folder. I was creating the temporary folder using python, but I'm running the code on a different machine from the one where I'm editing this document, so running python executes on the remote machine while tangling out the files happens on my local machine. Maybe next time it will make more sense to edit the document on the remote machine (note to future self). Although that also introduces problems, because then I'd have to run the tests headless… every solution has a problem.

Middle

The Data

The Distribution

First, we can look at the distribution of the digits to see if they are equally represented.

labels = (pandas.Series(y_train).value_counts(sort=False)
          .reset_index()
          .rename(columns={"index": "Digit",
                           0: "Count"}))
hover = HoverTool(
    tooltips=[
        ("Digit", "@Digit"),
        ("Count", "@Count{0,0}"),
    ]
)
plot = labels.hvplot.bar(x="Digit", y="Count").opts(
    height=800,
    width=1000,
    title="Digit Counts",
    tools=[hover],
)

output = Embed(plot=plot, file_name="digit_distribution")
output()

Figure Missing

If you look at the values for the counts you can see that there is a pretty significant difference between the counts for the digits 1 and 5.

print(f"{int(labels.iloc[1].Count - labels.iloc[5].Count):,}")
1,321

But we're doing this more as an exercise to get a web-page up than to build a real model, so let's not worry about that for now.

Some Example Digits

We'll make a 4 x 4 grid of the first 16 images to see what they look like. Note that our array uses 0-based indexing but matplotlib uses 1-based indexing so we have to make sure that the reference to the cell in the subplot is one ahead of the index for the array.

IMAGES = 16
ROWS = COLUMNS = 4

for index in range(IMAGES):
    pyplot.subplot(ROWS, COLUMNS, index + 1)
    pyplot.imshow(x_train[index], cmap='binary')
    pyplot.xlabel(str(y_train[index]))
    pyplot.xticks([])
    pyplot.yticks([])
pyplot.show()

sample_digits.png

So the digits (at least the first 16) seem to be pretty clear.

Normalizing the Data

One problem we have, though, is that images use values from 0 to 255 to indicate the brightness of a pixel, but neural networks tend to work better with values from 0 to 1, so we'll have to scale the values down. The images are also 28 x 28 squares, but we need to transform them into flat vectors. We can change the shape of the input data using the numpy.reshape function, which takes the original data and the shape you want to change it to. In our case we want the same number of rows as the original, and we want to flatten each 2-dimensional image into a 1-dimensional vector, which we can do by passing in the total number of pixels in each image as a single number instead of the width and height.

Since we have to do this for both the training and testing data I'll make a helper function.

def normalize(data: numpy.array) -> numpy.array:
    """reshapes the data and scales the values"""
    rows, width, height = data.shape
    pixels = width * height
    data = numpy.reshape(data, (rows, pixels))

    assert data.shape == (rows, pixels)

    MAX_BRIGHTNESS = 255
    data = data / MAX_BRIGHTNESS

    assert data.max() == 1
    assert data.min() == 0
    return data
x_train = normalize(x_train)
x_test = normalize(x_test)

The Neural Network Model

Build and Train It

Now we'll build the model. It's going to be a simple fully-connected network with three Dense layers (two hidden layers and an output layer). To make the visualization simpler we'll use the sigmoid activation function.

Besides being shallow, the model is also going to be relatively simple, with only 32 nodes in each hidden layer.

First we'll build it as a Sequential (linear stack) model.

rows, pixels = x_train.shape
HIDDEN_NODES = 32
CATEGORIES = len(labels)
ACTIVATION = "sigmoid"
OUTPUT_ACTIVATION = "softmax"

model = tensorflow.keras.models.Sequential([
    tensorflow.keras.layers.Dense(HIDDEN_NODES,
                                  activation=ACTIVATION,
                                  input_shape=(pixels,)),
    tensorflow.keras.layers.Dense(HIDDEN_NODES,
                                  activation=ACTIVATION),
    tensorflow.keras.layers.Dense(CATEGORIES,
                                  activation=OUTPUT_ACTIVATION)
])

Now we can compile the model using a sparse categorical cross-entropy loss function, which is for the case where you have more than one category (non-binary) and the Adam optimizer.

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

And next we'll train the model by calling its fit method.

NO_OUTPUT = 0
EPOCHS = 40
BATCH_SIZE = 2048

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=EPOCHS, batch_size=BATCH_SIZE,
    verbose=NO_OUTPUT
)

Plot the Training History

history = pandas.DataFrame.from_dict(history.history)
history = history.rename(
    columns={
        "loss": "Training Loss",
        "accuracy": "Training Accuracy",
        "val_loss": "Validation Loss",
        "val_accuracy": "Validation Accuracy",
    })
hover = HoverTool(
    tooltips=[
        ("Metric", "$name"),
        ("Epoch", "$x"),
        ("Value", "$y")
    ]
)

plot = history.hvplot().opts(
    height=800,
    width=1000,
    title="Training History",
    tools=[hover]
)
output = Embed(plot=plot, file_name="training_history")
output()

Figure Missing

for column in history.columns:
    lowest = history[column].min()
    highest = history[column].max()
    print(f"({column}) Min={lowest:0.2f} Max={highest: 0.2f}")
(Training Loss) Min=0.20 Max= 2.26
(Training Accuracy) Min=0.22 Max= 0.95
(Validation Loss) Min=0.21 Max= 2.14
(Validation Accuracy) Min=0.38 Max= 0.94

So our validation accuracy goes from 38% to 94%, which isn't bad, especially when you consider what a simple model we have.
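
If you'd rather have a single number for the final test performance than read it off the history, Keras' evaluate method gives the loss and accuracy directly (a quick check; the output isn't shown here).

loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test loss: {loss:0.2f}\tTest accuracy: {accuracy:0.2f}")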

Save It

Now we can save the model to use in our flask application.

Note To Self: Since this is being run on a remote machine, both the .env file and the directory to save the models refer to the remote machine, not the local machine where this file is being edited, so you have to copy the model to the local machine later on to use it with flask.

Also note that you can't see the name since I put it in a .env file, but it has .h5 as the extension. According to the TensorFlow page on saving and loading a model, H5 is the older format and they've since switched to the SavedModel format. With H5 you lose some information that would help you resume training, but we're not going to do that anyway, and the H5 file should be a little smaller.

Most of the next blob is to make sure the folder for the model exists. I put it in the environment variable mostly because I keep changing my mind as to where to put it and what to call it.

base = "flask_tensorflow_mnist"
MODELS = Path(os.environ[base]).expanduser()
MODEL_NAME = os.environ[f"{base}_model"]
if not MODELS.is_dir():
    MODELS.mkdir(parents=True)
assert MODELS.is_dir()
MODEL_PATH = MODELS/MODEL_NAME
model.save(MODEL_PATH)
assert MODEL_PATH.is_file()

The Web Page

  • Back-End (The Model Server)
    • Tests
      • Fixtures

        These are the pytest fixtures to make it easier to create objects.

         # python
         from argparse import Namespace
        
         # from pypi
        
         import pytest
         import tensorflow
        
         # software under test
         from ml_server import app
        
        
         class Katamari:
             """Something to stick things into"""
        
        
         @pytest.fixture
         def katamari() -> Katamari:
             return Katamari()
        
        
         @pytest.fixture
         def client():
             """generates the flask client for testing"""
             app.config["TESTING"] = True
             with app.test_client() as client:
                 yield client
             return
        
        
         @pytest.fixture
         def mnist():
             """Gets the test labels"""
             MAX_BRIGHTNESS = 255
             _, (x_test, y_test) = tensorflow.keras.datasets.mnist.load_data()
             return Namespace(
                 x_test=x_test/MAX_BRIGHTNESS,
                 y_test=y_test,
             )
        
      • Features

        These are the feature files.

         Feature: A Prediction Getter
        
         Scenario: The root page is retrieved
           Given a connection to the flask client
           When the root page is retrieved
           Then it has the expected text
        
         Scenario: A prediction is retrieved
           Given the get_prediction function
           When a prediction is retrieved
           Then it has the correct tuple
        
         Scenario: The API end-point is retrieved
           Given a connection to the flask client
           When the API end-point is retrieved
           Then the response has the expected JSON
        
      • The Tests

        These are the actual test functions.

         # python
         from http import HTTPStatus
        
         import random
        
         # pypi
         from expects import (
             be,
             be_true,
             contain,
             equal,
             expect,
         )
        
         from pytest_bdd import (
             given,
             when,
             then,
             scenario,
             scenarios,
         )
        
         import numpy
        
         # for testing
         from fixtures import client, katamari, mnist
        
         # software under test
         from ml_server import get_prediction, PATHS
        
         scenarios("get_predictions.feature")
        
         # ***** Get Root Page ***** #
         # Scenario: The root page is retrieved
        
        
         @given("a connection to the flask client")
         def setup_client(katamari, client):
             # this is a no-op since I made a fixture to build the client instead
             return
        
        
         @when("the root page is retrieved")
         def get_root_page(katamari, client):
             katamari.response = client.get(PATHS.root)
             expect(katamari.response.status_code).to(equal(HTTPStatus.OK))
             return
        
         @then("it has the expected text")
         def check_root_text(katamari):
             expect(katamari.response.data).to(
                 contain(b"This is the Neural Network Visualizer"))
             return
        
         # ***** get predictions ***** #
         # *** Call the function *** #
         # Scenario: A prediction is retrieved
        
         @given("the get_prediction function")
         def check_get_prediction():
             """Another no-op"""
             return
        
         @when("a prediction is retrieved")
         def call_get_prediction(katamari, mocker):
             choice_mock = mocker.MagicMock()
             katamari.index = 6
             choice_mock.return_value = katamari.index
             mocker.patch("ml_server.numpy.random.choice", choice_mock)
             katamari.output = get_prediction()
             return
        
        
         @then("it has the correct tuple")
         def check_predictions(katamari, mnist):
             # Our model emits a list with one array for each layer of the model
             expect(type(katamari.output[0])).to(be(list))
             expect(len(katamari.output[0])).to(equal(3))
        
             # the last layer is the prediction layer
             predictions = katamari.output[0][-1]
        
             predicted = predictions.argmax()
             expected = mnist.y_test[katamari.index]
             expect(predicted).to(equal(expected))
        
             # now check the image
             expected = mnist.x_test[katamari.index]
             # expect(katamari.output[1].shape).to(equal((28, 28)))
             expect(numpy.array_equal(katamari.output[1], expected)).to(be_true)
             return
        
         # *** API Call *** #
         #Scenario: the API end-point is retrieved
         #  Given a connection to the flask client
        
        
         @when("the API end-point is retrieved")
         def get_predictions(katamari, client, mocker):
             # set up the mock so we can control which of the images it tries to predict
             choice_mock = mocker.MagicMock()
        
             mocker.patch("ml_server.numpy.random.choice", choice_mock)
        
             katamari.index = random.randrange(100)
             choice_mock.return_value = katamari.index
        
             katamari.response = client.get(PATHS.api)
             expect(katamari.response.status_code).to(equal(HTTPStatus.OK))
             return
        
        
         @then("the response has the expected JSON")
         def check_response(katamari, mnist):
             expect(katamari.response.is_json).to(be_true)
             data = katamari.response.json
             layers = data["prediction"]
        
             # the prediction should be the three outputs of our model
             # except with lists instead of numpy arrays
             expect(type(layers)).to(be(list))
             expect(len(layers)).to(equal(3))
             prediction = numpy.array(layers[-1])
        
             # now check that it made the expected prediction
             predicted = prediction.argmax()
             expected = mnist.y_test[katamari.index]
             expect(predicted).to(equal(expected))
        
             # and that it gave us the right image
             expected = mnist.x_test[katamari.index]
             expect(numpy.array_equal(numpy.array(data["image"]), expected)).to(be_true)
             return
        
    • The Implementation

      This is where we tangle out a file to run a flask server that will serve up our model's predictions.

       <<ml-server-imports>>
      
       <<ml-server-flask-app>>
      
      
       <<ml-server-load-model>>
      
       <<ml-server-feature-model>>
      
       <<ml-server-load-data>>
      
      
       <<ml-server-get-prediction>>
      
      
       <<ml-server-index>>
      
       <<ml-server-api>>
      
       <<ml-server-main>>
      

      First up is our imports. Other than Flask there really isn't anything new here.

       # python
       from argparse import Namespace
       import json
       import os
       import random
       import string
      
       from pathlib import Path
      
       # pypi
       import numpy
       import tensorflow
      
       from dotenv import load_dotenv
       from flask import Flask, request
      

      Now we create the flask app and something to hold the paths.

       app = Flask(__name__)
      
       PATHS = Namespace(
           root = "/",
           api = "/api",
       )
      

      Next we'll load the saved model. I'm going to break this up a little bit just because I wasn't clear about what was going on originally.

       load_dotenv(override=True)
      
       base = "flask_tensorflow_mnist"
       MODELS = Path(os.environ[base]).expanduser()
       MODEL_NAME = os.environ[f"{base}_model"]
       assert MODELS.is_dir()
       MODEL_PATH = MODELS/MODEL_NAME
       assert MODEL_PATH.is_file()
      
       model = tensorflow.keras.models.load_model(MODEL_PATH)
      

      At this point we should have a re-loaded version of our trained model (minus some information, as noted above, because it was saved using the H5 format). Our model has one output layer - the softmax prediction layer - which gives the probabilities that an input image is each of the ten digits, but since we want to see what each layer is doing, we'll create a new model with the output from each layer added to the outputs. Since we have three layers in the model, we'll now have three outputs.

       feature_model = tensorflow.keras.models.Model(
           inputs=model.inputs,
           outputs=[layer.output for layer in model.layers])
      

      Next let's load and normalize the data. We don't use the training data or the labels here.

       MAX_BRIGHTNESS = 255
      
       _, (x_test, _) = tensorflow.keras.datasets.mnist.load_data()
       x_test = x_test/MAX_BRIGHTNESS
      

      Now we create the function to get the prediction for an image. It also returns the image so that we can see what it was.

       ROWS, HEIGHT, WIDTH = x_test.shape
       PIXELS = HEIGHT * WIDTH
      
       def get_prediction() -> (list, numpy.array):
           """Gets a random image and prediction
      
           The 'prediction' isn't so much the value (e.g. it's a 5) but rather the
           outputs of each layer so that they can be visualised. So the first value
           of the tuple will be a list of arrays whose length will be the number of 
           layers in the model. Each array will be the outputs for that layer.
      
           This always pulls the image from =x_test=.
      
           Returns:
            What our model predicts for a random image and the image
           """
           index = numpy.random.choice(ROWS)
           image = x_test[index,:,:]
           image_array = numpy.reshape(image, (1, PIXELS))
           return feature_model.predict(image_array), image
      

      Next we create the handler for the REST calls. If you make a GET request from the root you'll get an HTML page back.

       @app.route(PATHS.root, methods=['GET'])
       def index():
           """The home page view"""
           return "This is the Neural Network Visualizer (use /api for the API)"
      

      If you return a dict, Flask will automatically turn it into a JSON response.

       @app.route(PATHS.api, methods=["GET"])
       def api():
           """the JSON view
      
           Returns:
             JSON with prediction layers and image
           """
           predictions, image = get_prediction()
      
           # JSON needs lists, not numpy arrays
           final_predictions = [prediction.tolist() for prediction in predictions]
           return {"prediction": final_predictions,
                   'image': image.tolist()}
      

      And now we make the "main" entry point.

       if __name__ == "__main__":
           app.run()
      

      To run this you would enter the same directory as the ml_server.py file and execute:

       python ml_server.py
      

      Or better, use the development server.

      set -x FLASK_APP ml_server
      set -x FLASK_ENV development
      
      flask run
      

      This will automatically re-load the server if you make changes to the code. The first two lines in the code block above tell flask which module has the flask-app and that it should run in development mode. I'm using the Fish Shell, so if you are using bash or a similar shell the lines would look like this instead.

      export FLASK_APP=ml_server
      export FLASK_ENV=development
      
      flask run
      
  • Front-End
    • Tests
      <<front-end-feature-title>>
      
      <<front-end-click>>
      
      # python
      from argparse import Namespace
      
      # pypi
      from selenium import webdriver
      
      import pytest
      
      
      @pytest.fixture
      def browser():
          """Creates the selenium webdriver session"""
          browser = webdriver.Firefox()
          yield browser
          browser.close()
          return
      
      
      CSSSelectors = Namespace(
          main_title = ".main h1",
          main_button = ".main button",
          sidebar_title = ".sidebar h1",
          sidebar_image = ".sidebar-content img",
          )
      
      class HomePage:
          """A page-class for testing
      
          Args:
           address: the address of the streamlit server
           wait: seconds to implicitly wait for page-objects
          """
          def __init__(self, address: str="http://localhost:8501",
                       wait: int=1) -> None:
              self.address = address
              self.wait = wait
              self._browser = None
              return
      
          @property
          def browser(self) -> webdriver.Firefox:
              """The browser opened to the home page"""
              if self._browser is None:
                  self._browser = webdriver.Firefox()
                  self._browser.implicitly_wait(self.wait)
                  self._browser.get(self.address)
              return self._browser
      
          @property
          def main_title(self) -> webdriver.firefox.webelement.FirefoxWebElement:
              """The object with the main title"""
              return self.browser.find_element_by_css_selector(
                      CSSSelectors.main_title
                  )
      
          @property
          def main_button(self) -> webdriver.firefox.webelement.FirefoxWebElement:
              """The man button"""
              return self.browser.find_element_by_css_selector(
                      CSSSelectors.main_button
                  )
      
      
          @property
          def sidebar_title(self) -> webdriver.firefox.webelement.FirefoxWebElement:
              """The sidebar title element"""
              return self.browser.find_element_by_css_selector(
                      CSSSelectors.sidebar_title
                  )
      
          @property
          def sidebar_image(self) -> webdriver.firefox.webelement.FirefoxWebElement:
              """This tries to get the sidebar image element
             """
              return self.browser.find_element_by_css_selector(
                  CSSSelectors.sidebar_image)
      
          def __del__(self):
              """Finalizer that closes the browser"""
              if self._browser is not None:
                  self.browser.close()
              return
      
      
      @pytest.fixture
      def home_page():
          return HomePage()
      
       <<test-front-imports>>
      
       <<test-front-text>>
      
       <<test-front-click>>
      
    • The Features

      We can start with the imports and basic set up.

       # pypi
       from expects import (
           be_true,
           equal,
           expect
       )
      
       from pytest_bdd import (
           given,
           scenarios,
           then,
           when,
       )
      
       # fixtures
       from fixtures import katamari
      
       from front_end_fixtures import home_page
      
       and_also = then
       scenarios("front_end.feature")
      
      • The Initial Text
         Feature: The GUI web page to view the model
        
         Scenario: The user goes to the home page and checks it out
           Given a browser on the home page
           When the user checks out the titles and button
           Then they have the expected text
        
        # ***** The Text ***** #
        # Scenario: The user goes to the home page and checks it out
        
        
        @given("a browser on the home page")
        def setup_browser(katamari, home_page):
            # katamari.home_page = home_page
            return
        
        
        @when("the user checks out the titles and button")
        def get_text(katamari, home_page):
            katamari.main_title = home_page.main_title.text
            katamari.button_text = home_page.main_button.text
            katamari.sidebar_title = home_page.sidebar_title.text
            return
        
        
        @then("they have the expected text")
        def check_text(katamari):
            expect(katamari.main_title).to(equal("Neural Network Visualizer"))
            expect(katamari.button_text).to(equal("Get Random Prediction"))
            expect(katamari.sidebar_title).to(equal("Input Image"))
            return
        
      • Click the Button
        Scenario: The user gets a random prediction
          Given a browser on the home page
          When the user clicks on the button
          Then the sidebar displays the input image
        
        # ***** The button click ****** #
        # Scenario: The user gets a random prediction
        #  Given a browser on the home page
        
        
        @when("the user clicks on the button")
        def click_get_image_button(home_page):
            home_page.main_button.click()
            return
        
        
        @then("the sidebar displays the input image")
        def check_sidebar_sections(home_page):
            expect(home_page.sidebar_image.is_displayed()).to(be_true)
            return
        
    • Streamlit

      For the front-end we'll use Streamlit, a python library that makes it easier to create web pages for certain types of applications (I think - I'll need to check it out more later).

       <<streamlit-imports>>
      
       <<streamlit-url>>
      
       <<streamlit-title>>
      
       <<streamlit-sidebar>>
      
       <<streamlit-control>>
      

      First the imports.

       # python
       import json
       import os
       from urllib.parse import urljoin
      
       # pypi
       import requests
       import numpy
       import streamlit
       import matplotlib.pyplot as pyplot
      
       # this code
       from ml_server import PATHS
      

      Now we'll set up the URL for our flask backend - as you can see, we're expecting to run this on the localhost address, so you'd have to change this to make it available outside of the host PC.

       URI = urljoin("http://127.0.0.1:5000/", PATHS.api)
      

      Next we'll set the title for the page. This can be a little confusing: although it's called the title, it isn't the HTML title but rather the main heading for the page.

       streamlit.title('Neural Network Visualizer')
      

      Now we'll add a collapsible sidebar where we'll eventually put our image output and add a headline for it (Input Image).

       streamlit.sidebar.markdown('# Input Image')
      

      Now we'll add some logic. I think this would be the control portion of a more traditional web-server. It's basically where we react to a button press by getting a random image and visualizing how it makes a prediction.

       # create a button and wait for someone to press it
       if streamlit.button("Get Random Prediction"):
           # Someone pressed the button, make an API call to our flask server
           response = requests.get(URI)
      
           # convert the response to a dict
           response = response.json()
      
           # get the prediction array
           predictions = response.get('prediction')
      
           # get the image we were making the prediction for
           image = response.get('image')
      
           # the image 
           # streamlit expects a numpy array or string-like object, not lists
           image = numpy.array(image)
      
           # show the image in the sidebar
           streamlit.sidebar.image(image, width=150)
      
           # iterate over the prediction for each layer in the model
           for layer, prediction in enumerate(predictions):
               # convert the prediction list to an array
               # and flatten it to a vector
               numbers = numpy.squeeze(numpy.array(prediction))
               pyplot.figure(figsize=(32, 4))
               rows = 1
               if layer == 2:
                   # this is the output layer so we only want one row
                   # and we want 10 columns (one for each digit)
                   columns = 10
               else:
                   # this is the input or hidden layer
                   # since our model had 32 hidden nodes it has 32 columns
                   # the original version had 2 rows and 16 columns, but
                   # while that looked nicer, I think it makes more sense for 
                   # there to be one row per layer
                   columns = 32
               for index, number in enumerate(numbers):
                   # add a subplot to the figure
                   pyplot.subplot(rows, columns, index + 1)
                   pyplot.imshow((number * numpy.ones((8, 8, 3)))
                                 .astype('float32'), cmap='binary')
                   pyplot.xticks([])
                   pyplot.yticks([])
                   if layer == 2:
                       pyplot.xlabel(str(index), fontsize=40)
                   pyplot.subplots_adjust(wspace=0.05, hspace=0.05)
                   pyplot.tight_layout()
               streamlit.text('Layer {}'.format(layer + 1), )
               streamlit.pyplot()
      

End

Hand-rolling a CountVectorizer

Beginning

This is part of lesson 3 from the fastai NLP course.

Imports

Python

from collections import Counter
from functools import partial

PyPi

from fastai.text import (
    URLs,
    untar_data,
    TextList,
    )
import hvplot.pandas
import pandas

Others

from graeae import CountPercentage, EmbedHoloviews

Setup

Plotting

Embed = partial(
    EmbedHoloviews,
    folder_path="../../files/posts/fastai/hand-rolling-a-countvectorizer/")

The Data Set

The data-set is a collection of 50,000 IMDB reviews hosted on AWS Open Datasets as part of the fastai datasets collection. We're going to try and create a classifier that can predict the "sentiment" of reviews. The original dataset comes from Stanford University.

To make it easier to experiment, we'll initially load a sub-set of the dataset that fastai prepared. The URLs class contains the URLs for the datasets that fastai has uploaded and the untar_data function downloads the data from a URL to a specified (or in this case default) location.

path = untar_data(URLs.IMDB_SAMPLE)
print(path)
/home/athena/.fastai/data/imdb_sample

The untar_data function doesn't actually load the data for us, so we'll use pandas to do that.

sample_frame = pandas.read_csv(path/"texts.csv")
print(sample_frame.head())
      label                                               text  is_valid
0  negative  Un-bleeping-believable! Meg Ryan doesn't even ...     False
1  positive  This is a extremely well-made film. The acting...     False
2  negative  Every once in a long while a movie will come a...     False
3  positive  Name just says it all. I watched this movie wi...     False
4  negative  This movie succeeds at being one of the most u...     False

The is_valid column is kind of interesting here especially since the first examples are all false… but I couldn't find an explanation for it on the data-download page.

CountPercentage(sample_frame.label)()
Value Count Percent (%)
negative 524 52.40
positive 476 47.60

So it is nearly balanced but with a slight bias toward negative comments.

CountPercentage(sample_frame.is_valid)()
Value Count Percent (%)
False 800 80.00
True 200 20.00

Well, exactly 20% - given that we later split off the validation set using this column, is_valid presumably marks the rows that belong to the validation split rather than flagging "invalid" data, which would explain the clean 80/20 split.

The Text List

To actually work with the dataset we'll use fastai's TextList instead of pandas' dataframe.

sample_list = TextList.from_csv(path, "texts.csv", cols="text")
sample_split = sample_list.split_from_df(col=2)
sample = (sample_split
          .label_from_df(cols=0))

The original notebook builds the TextList in a single train-wreck of chained method calls, but if you try to find out what those methods do from the fastai documentation… well, it's easier (although still obscure) to inspect the intermediate objects and muddle through what's going on. The ultimate outcome seems to be that sample is an object holding the somewhat pre-processed text. It looks like the text is lower-cased and tokenized, and there are a lot of strange tokens inserted (xxmaj, xxunk) which, according to the tokenization documentation, indicate special tokens - although there are more unknown tokens than I would have expected.

print(sample.train.x[0])
xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !
print(sample_frame.text.iloc[0])
Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!

Here's the category for that review.

print(sample.train.y[0])
negative

Note that the output looks like a string, but it's actually a fastai "type".

print(type(sample.train.y[0]))
<class 'fastai.core.Category'>

Creating a Term-Document Matrix

Here we'll create a matrix that counts the number of times each token appears in each document.
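
As a rough sketch of where this is headed (a hand-rolled version, not the fastai one), we can build a term-document matrix from the raw pandas text column using Counter. Note that this uses a naive lower-case-and-split-on-whitespace tokenization instead of fastai's tokenizer, so the counts won't exactly match what fastai would produce.

# one list of tokens per review (naive whitespace tokenization)
tokenized = sample_frame.text.str.lower().str.split()

# one Counter per document, then one row per document and one column per token
term_document = pandas.DataFrame(
    [Counter(tokens) for tokens in tokenized]).fillna(0).astype(int)

# rows = documents, columns = vocabulary
print(term_document.shape)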

End

Reference

The Dataset

  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT ’11). Association for Computational Linguistics, USA, 142–150

Topic Modeling With Matrix Decomposition

Beginning

This is part of a walk-through of the fastai Code-First Introduction to NLP. In this post I'll be using Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF) to group newsgroup posts. Both of these methods are statistical approaches that use the word-counts within documents to decide how similar they are (while ignoring things like word order).

Imports

Python

from functools import partial
import random

PyPi

from scipy import linalg
from sklearn import decomposition
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import hvplot.pandas
import matplotlib.pyplot as pyplot
import numpy
import pandas

Others

from graeae import EmbedHoloviews, Timer

Set Up

The Timer

TIMER = Timer()

Plotting

Embed = partial(
    EmbedHoloviews,
    folder_path="../../files/posts/fastai/topic-modeling-with-matrix-decomposition")

Middle

The Dataset

The dataset consists of ~18,000 newsgroup posts with 20 topics. To keep the computation down I'll only use a subset of the categories. I'm also going to only use the body of the posts.

keep = ("alt.atheism", "comp.graphics", "misc.forsale", "sci.crypt", "talk.politics.guns")
remove = ("headers", "footers", "quotes")
training = fetch_20newsgroups(subset="train", categories=keep, remove=remove)
testing = fetch_20newsgroups(subset="test", categories=keep, remove=remove)

I've run this more than once so there's no output, but the first time you run the fetch_20newsgroups function it downloads the dataset and you'll see some output mentioning this fact.

print(f"{training.filenames.shape[0]:,}")
print(f"{training.target.shape[0]:,}")
2,790
2,790

So, although the entire dataset has over 18,000 entries, our sub-set has fewer than 3,000.

print(numpy.unique(training.target))
[0 1 2 3 4]

So the original category numbers aren't preserved - the targets get re-indexed to 0 through 4 for the five categories we kept - so you have to check the mapping any time you pull a subset out of the data.
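
A quick way to check the mapping (this little check isn't in the original notebook) is target_names, which lists the categories we kept in the order that the integer targets index them.

# the five categories we kept, in the order the integer targets refer to them
print(training.target_names)

# map the first post's target back to its newsgroup name
print(training.target_names[training.target[0]])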

Let's see what one of the posts looks like.

print(random.choice(training.data))
      Just a question. 
      As a provider of a public BBS service - aren't you bound by law to gurantee
      intelligble access to the data of the users on the BBS, if police comes
      with sufficent authorisation ? I guessed this would be  a basic condition
      for such systems. (I did run a bbs some time ago, but that was in Switzerland)

The US doesn't yet have many laws covering BBSs - they're not common carriers,
they're not phone companies, they're just private machines or services
operated by businesses.  There's no obligation to keep records.
As Perry Metzger points out, if the police come with a search warrant,
you have to let them see what the warrant demands, if it exists,
and they generally can confiscate the equipment as "evidence"
(which is not Constitutionally valid, but we're only beginning to
develop court cases supporting us).  A court MAY be able to compel
you to tell them information you know, such as the encryption password
for the disk - there aren't any definitive cases yet, since it's a new
situation, and there probably aren't laws specifically covering it.
But the court can't force you to *know* the keys, and there are no
laws preventing you from allowing your users to have their own keys
for their own files without giving them to you.

Even in areas that do have established law, there is uncertainty.
There was a guy in Idaho a few years ago who had his business records
subpoenaed as evidence for taxes or some other business-restriction law,
so he gave the court the records.  Which were in Hebrew.
The US doesn't have laws forcing you to keep your records in English,
and these were the originals of the records.  HE didn't speak Hebrew,
and neither did anybody in the court organization.  Don't think they
were able to do much about it.

It might be illegal for your BBS to deny access to potential customers
based on race, religion, national origin, gender, or sexual preference;
it probably hasn't been tested in court, but it seems like a plausible
extension of anti-discrimination laws affecting other businesses.

Vectorizing

Here we'll convert the text to a matrix using sklearn's CountVectorizer. Interestingly, the Introduction to Information Retrieval book says that the trend has been towards not removing the most common words (stop words), but we'll be dropping them. There's a paper called Stop Word Lists in Free Open-source Software Packages which points out some problems with stop-word lists in general, and sklearn's list in particular. I don't know if sklearn has done anything to address their concerns since the paper came out, but the sklearn documentation includes a link to the paper so I would assume the problems are still there. Nonetheless, the fastai examples use them so I will too.

vectorizer = CountVectorizer(stop_words="english")

The function we're going to use doesn't accept the sparse matrix that the vectorizer outputs by default, so we'll convert it to a dense matrix after it's fit.

with TIMER:
    vectors = vectorizer.fit_transform(training.data).todense()
2020-01-01 16:26:48,048 graeae.timers.timer start: Started: 2020-01-01 16:26:48.047927
2020-01-01 16:26:48,466 graeae.timers.timer end: Ended: 2020-01-01 16:26:48.466285
2020-01-01 16:26:48,466 graeae.timers.timer end: Elapsed: 0:00:00.418358

That was much quicker than I thought it would be, probably because our dataset is so small.

vocabulary = vectorizer.get_feature_names()
print(f"{len(vocabulary):,}")
34,632

So our "vocabulary" is around 35,000 tokens.

Singular Value Decomposition (SVD)

Singular Value Decomposition is a linear algebra method to factor a matrix. The math is beyond me at this point, so I'll just try using it as a black box.

with TIMER:
    U, s, V = linalg.svd(vectors, full_matrices=False)
2020-01-01 16:26:50,508 graeae.timers.timer start: Started: 2020-01-01 16:26:50.508003
2020-01-01 16:27:23,979 graeae.timers.timer end: Ended: 2020-01-01 16:27:23.978988
2020-01-01 16:27:23,980 graeae.timers.timer end: Elapsed: 0:00:33.470985
s_frame = pandas.Series(s)
plot = s_frame.hvplot().opts(title="Diagonal Matrix S", width=1000, height=800)
Embed(plot=plot, file_name="s_values")()

Figure Missing
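
Even treating the SVD as a black box, we can sanity-check the factorization by multiplying the pieces back together for a few rows (a quick check I'm adding here, not part of the original fastai walk-through).

# broadcasting U * s scales each column of U by the singular values,
# so (U * s) @ V is the same as U @ diag(s) @ V, just cheaper to compute
reconstructed = (U[:5] * s) @ V

# this should print True - the factors reproduce the original term-document counts
print(numpy.allclose(reconstructed, vectors[:5]))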

Looking At Some Topics

top_words_count = 8

def top_words(token):
    return [vocabulary[index] for index in numpy.argsort(token)[: -top_words_count - 1: -1]]

def show_topics(array):
    topic_words = ([top_words(topic) for topic in array])
    return [' '.join(topic) for topic in topic_words]
topics = show_topics(V[:10])
for index, topic in enumerate(topics):
    print(f"{index}: {topic}")
0: propagandist heliocentric galacticentric surname sandvik 400included wovy imaginative
1: file jpeg image edu pub ftp use graphics
2: file gun congress firearms control mr states rkba
3: privacy internet anonymous pub email information eff mail
4: graphics edu 128 3d ray pub data ftp
5: 00 50 40 appears dos 10 art 25
6: privacy internet 00 jpeg eff pub email electronic
7: key data image encryption des chip available law
8: pub key jesus jpeg eff graphics encryption ripem
9: key encryption edu des anonymous posting chip graphics

So what we're showing is the most significant words for the top-ten most strongly grouped "topics". It takes a little bit of interpretation to figure out how to map them to the newsgroups we used, and there probably could have been some clean-up of the texts (entry 5 looks suspect) but it's interesting that this linear algebra decomposition method could find these similar groups without any kind of prompting as to what groups might even exist in the first place (this is an unsupervised method, not a supervised method).

Non-negative Matrix Factorization (NMF)

number_of_topics = 5
classifier = decomposition.NMF(n_components=number_of_topics, random_state=1)
weights = classifier.fit_transform(vectors)
classified = classifier.components_
for index, topic in enumerate(show_topics(classified)):
    print(f"{index}: {topic}")
0: db mov bh si cs byte al bl
1: privacy internet anonymous information email eff use pub
2: file gun congress control firearms states mr united
3: jpeg image gif file color images format quality
4: edu graphics pub image data ftp mail available

Term-Frequency/Inverse Document Frequency

tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_vectors = tfidf_vectorizer.fit_transform(training.data)
weights = classifier.fit_transform(tfidf_vectors)
classified = classifier.components_

for index, topic in enumerate(show_topics(classified)):
    print(f"{index}: {topic}")
0: people gun don think just guns right government
1: 00 sale offer shipping new drive price condition
2: key chip encryption clipper keys escrow government algorithm
3: graphics thanks file files image program know windows
4: god atheism believe does atheists belief said exist

NLP Classification Exercise

Beginning

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import hvplot.pandas
import numpy
import pandas
import tensorflow

Others

from graeae import (CountPercentage,
                    EmbedHoloviews,
                    SubPathLoader,
                    Timer,
                    ZipDownloader)

Set Up

The Timer

TIMER = Timer()

The Plotting

slug = "nlp-classification-exercise"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{slug}")

The Dataset

It isn't mentioned in the notebook where the data originally came from, but it looks like it's the Sentiment140 dataset, which consists of tweets whose sentiment was inferred by emoticons in each tweet.

url = "http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"
path = Path("~/data/datasets/texts/sentiment140/").expanduser()
download = ZipDownloader(url, path)
download()
Files exist, not downloading
columns = ["polarity", "tweet_id", "datetime", "query", "user", "text"]
training = pandas.read_csv(path/"training.1600000.processed.noemoticon.csv", 
                           encoding="latin-1", names=columns, header=None)
testing = pandas.read_csv(path/"testdata.manual.2009.06.14.csv", 
                           encoding="latin-1", names=columns, header=None)

Some Constants

Text = Namespace(
    embedding_dim = 100,
    max_length = 16,
    trunc_type='post',
    padding_type='post',
    oov_tok = "<OOV>",
    training_size=16000,
)
Data = Namespace(
    batch_size = 64,
    shuffle_buffer_size=100,
)

Middle

The Data

print(training.sample().iloc[0])
polarity                                                    4
tweet_id                                           1468852290
datetime                         Tue Apr 07 04:04:10 PDT 2009
query                                                NO_QUERY
user                                              leawoodward
text        Def off now...unexpected day out tomorrow so s...
Name: 806643, dtype: object
CountPercentage(training.polarity)()
Value Count Percent (%)
4 800,000 50.00
0 800,000 50.00

The polarity is what might also be called the "sentiment" of the tweet - 0 means a negative tweet and 4 means a positive tweet.

But, for our purposes, we would be better off if the positive polarity was 1, not 4, so let's convert it.

training.loc[training.polarity==4, "polarity"] = 1
counts = CountPercentage(training.polarity)()
Value Count Percent (%)
1 800,000 50.00
0 800,000 50.00

The Tokenizer

As you can see from the sample, the data is still in text form so we need to convert it to a numeric form with a Tokenizer.

First I'll lower-case it.

training.loc[:, "text"] = training.text.str.lower()

Next we'll fit it to our text.

tokenizer = Tokenizer()
with TIMER:
    tokenizer.fit_on_texts(training.text.values)
2019-10-10 07:25:09,065 graeae.timers.timer start: Started: 2019-10-10 07:25:09.065039
WARNING: Logging before flag parsing goes to stderr.
I1010 07:25:09.065394 140436771002176 timer.py:70] Started: 2019-10-10 07:25:09.065039
2019-10-10 07:25:45,389 graeae.timers.timer end: Ended: 2019-10-10 07:25:45.389540
I1010 07:25:45.389598 140436771002176 timer.py:77] Ended: 2019-10-10 07:25:45.389540
2019-10-10 07:25:45,391 graeae.timers.timer end: Elapsed: 0:00:36.324501
I1010 07:25:45.391984 140436771002176 timer.py:78] Elapsed: 0:00:36.324501

Now we can store some of its values in variables for convenience.

word_index = tokenizer.word_index
vocabulary_size = len(tokenizer.word_index)

Now, we'll convert the texts to sequences and pad them so they are all the same length.

with TIMER:
    sequences = tokenizer.texts_to_sequences(training.text.values)
    padded = pad_sequences(sequences, maxlen=Text.max_length,
                           truncating=Text.trunc_type)

    splits = train_test_split(
        padded, training.polarity, test_size=.2)

    training_sequences, test_sequences, training_labels, test_labels = splits
2019-10-10 07:25:51,057 graeae.timers.timer start: Started: 2019-10-10 07:25:51.057684
I1010 07:25:51.057712 140436771002176 timer.py:70] Started: 2019-10-10 07:25:51.057684
2019-10-10 07:26:33,530 graeae.timers.timer end: Ended: 2019-10-10 07:26:33.530338
I1010 07:26:33.530381 140436771002176 timer.py:77] Ended: 2019-10-10 07:26:33.530338
2019-10-10 07:26:33,531 graeae.timers.timer end: Elapsed: 0:00:42.472654
I1010 07:26:33.531477 140436771002176 timer.py:78] Elapsed: 0:00:42.472654

Now convert them to datasets.

training_dataset = tensorflow.data.Dataset.from_tensor_slices(
    (training_sequences, training_labels)
)

testing_dataset = tensorflow.data.Dataset.from_tensor_slices(
    (test_sequences, test_labels)
)

training_dataset = training_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)
testing_dataset = testing_dataset.shuffle(Data.shuffle_buffer_size).batch(Data.batch_size)

GloVe

GloVe is short for Global Vectors for Word Representation. It is an unsupervised algorithm that creates vector representations for words. They have a site where you can download pre-trained models or get the code and train one yourself. We're going to use one of their pre-trained models.

path = Path("~/models/glove/").expanduser()
url = "http://nlp.stanford.edu/data/glove.6B.zip"
ZipDownloader(url, path)()
Files exist, not downloading

The GloVe data is stored as a series of space separated lines with the first column being the word that's encoded and the rest of the columns being the values for the vector. To make this work we're going to split the word off from the vector and put each into a dictionary.

embeddings = {}
with TIMER:
    with open(path/"glove.6B.100d.txt") as lines:
        for line in lines:
            tokens = line.split()
            embeddings[tokens[0]] = numpy.array(tokens[1:])
2019-10-06 18:55:11,592 graeae.timers.timer start: Started: 2019-10-06 18:55:11.592880
I1006 18:55:11.592908 140055379531584 timer.py:70] Started: 2019-10-06 18:55:11.592880
2019-10-06 18:55:21,542 graeae.timers.timer end: Ended: 2019-10-06 18:55:21.542689
I1006 18:55:21.542738 140055379531584 timer.py:77] Ended: 2019-10-06 18:55:21.542689
2019-10-06 18:55:21,544 graeae.timers.timer end: Elapsed: 0:00:09.949809
I1006 18:55:21.544939 140055379531584 timer.py:78] Elapsed: 0:00:09.949809
print(f"{len(embeddings):,}")
400,000

So, the pre-trained GloVe vocabulary consists of 400,000 "words" (tokens is more accurate, since they also include punctuation). The problem we have to deal with next is that our data set wasn't part of the data used to train the embeddings, so there will probably be some tokens in our data set that aren't in the GloVe embeddings. To handle this we need to add zeroed embeddings for the extra tokens.

Rather than adding to the dict, we'll create a matrix of zeros with a row for each word in our dataset's vocabulary, then we'll iterate over the words in our dataset and, if there's a match in the GloVe embeddings, insert it into the matrix.

with TIMER:
    embeddings_matrix = numpy.zeros((vocabulary_size+1, Text.embedding_dim))
    for word, index in word_index.items():
        embedding_vector = embeddings.get(word)
        if embedding_vector is not None:
            embeddings_matrix[index] = embedding_vector
2019-10-06 18:55:46,577 graeae.timers.timer start: Started: 2019-10-06 18:55:46.577855
I1006 18:55:46.577886 140055379531584 timer.py:70] Started: 2019-10-06 18:55:46.577855
2019-10-06 18:55:51,374 graeae.timers.timer end: Ended: 2019-10-06 18:55:51.374706
I1006 18:55:51.374763 140055379531584 timer.py:77] Ended: 2019-10-06 18:55:51.374706
2019-10-06 18:55:51,377 graeae.timers.timer end: Elapsed: 0:00:04.796851
I1006 18:55:51.377207 140055379531584 timer.py:78] Elapsed: 0:00:04.796851
print(f"{len(embeddings_matrix):,}")
690,961

The Models

A CNN

  • Build
    convoluted_model = tensorflow.keras.Sequential([
        tensorflow.keras.layers.Embedding(
            vocabulary_size + 1,
            Text.embedding_dim,
            input_length=Text.max_length,
            weights=[embeddings_matrix],
            trainable=False),
        tensorflow.keras.layers.Conv1D(filters=128,
                                       kernel_size=5,
                                       activation='relu'),
        tensorflow.keras.layers.GlobalMaxPooling1D(),
        tensorflow.keras.layers.Dense(24, activation='relu'),
        tensorflow.keras.layers.Dense(1, activation='sigmoid')
    ])
    convoluted_model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                             metrics=["accuracy"])
    
    print(convoluted_model.summary())
    
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding (Embedding)        (None, 16, 100)           69096100  
    _________________________________________________________________
    conv1d (Conv1D)              (None, 12, 128)           64128     
    _________________________________________________________________
    global_max_pooling1d (Global (None, 128)               0         
    _________________________________________________________________
    dense (Dense)                (None, 24)                3096      
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 25        
    =================================================================
    Total params: 69,163,349
    Trainable params: 67,249
    Non-trainable params: 69,096,100
    _________________________________________________________________
    None
    
  • Train
    Training = Namespace(
        size = 0.75,
        epochs = 2,
        verbosity = 2,
        batch_size=128,
        )
    
    with TIMER:
        cnn_history = convoluted_model.fit(training_dataset,
                                           epochs=Training.epochs,
                                           validation_data=testing_dataset,
                                           verbose=Training.verbosity)
    
    2019-10-10 07:27:04,921 graeae.timers.timer start: Started: 2019-10-10 07:27:04.921617
    I1010 07:27:04.921657 140436771002176 timer.py:70] Started: 2019-10-10 07:27:04.921617
    Epoch 1/2
    W1010 07:27:05.154920 140436771002176 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where
    20000/20000 - 4964s - loss: 0.5091 - accuracy: 0.7454 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
    Epoch 2/2
    20000/20000 - 4935s - loss: 0.4790 - accuracy: 0.7671 - val_loss: 0.4782 - val_accuracy: 0.7677
    2019-10-10 10:12:04,382 graeae.timers.timer end: Ended: 2019-10-10 10:12:04.382359
    I1010 10:12:04.382491 140436771002176 timer.py:77] Ended: 2019-10-10 10:12:04.382359
    2019-10-10 10:12:04,384 graeae.timers.timer end: Elapsed: 2:44:59.460742
    I1010 10:12:04.384716 140436771002176 timer.py:78] Elapsed: 2:44:59.460742
    
  • Some Plotting
    performance = pandas.DataFrame(cnn_history.history)
    plot = performance.hvplot().opts(title="CNN Twitter Sentiment Training Performance",
                                     width=1000,
                                     height=800)
    Embed(plot=plot, file_name="cnn_training")()
    

End

Citations

  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.


Embeddings from Scratch

Beginning

This is a walk-through of the tensorflow Word Embeddings tutorial, just to make sure I can do it.

Imports

Python

from argparse import Namespace
from functools import partial

PyPi

from tensorflow import keras
from tensorflow.keras import layers
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import EmbedHoloviews, Timer

Set Up

Plotting

prefix = "../../files/posts/keras/"
slug = "embeddings-from-scratch"

Embed = partial(EmbedHoloviews, folder_path=f"{prefix}{slug}")

The Timer

TIMER = Timer()

Middle

Some Constants

Text = Namespace(
    vocabulary_size=1000,
    embeddings_size=16,
    max_length=500,
    padding="post",
)

Tokens = Namespace(
    padding = "<PAD>",
    start = "<START>",
    unknown = "<UNKNOWN>",
    unused = "<UNUSED>",
)

The Embeddings Layer

print(layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

  This layer can only be used as the first layer in a model.

  Example:

  ```python
  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch,
  # input_length).
  # the largest integer (i.e. word index) in the input should be no larger
  # than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch
  # dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
  ```

  Arguments:
    input_dim: int > 0. Size of the vocabulary,
      i.e. maximum integer index + 1.
    output_dim: int >= 0. Dimension of the dense embedding.
    embeddings_initializer: Initializer for the `embeddings` matrix.
    embeddings_regularizer: Regularizer function applied to
      the `embeddings` matrix.
    embeddings_constraint: Constraint function applied to
      the `embeddings` matrix.
    mask_zero: Whether or not the input value 0 is a special "padding"
      value that should be masked out.
      This is useful when using recurrent layers
      which may take variable length input.
      If this is `True` then all subsequent layers
      in the model need to support masking or an exception will be raised.
      If mask_zero is set to True, as a consequence, index 0 cannot be
      used in the vocabulary (input_dim should equal size of
      vocabulary + 1).
    input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

  Input shape:
    2D tensor with shape: `(batch_size, input_length)`.

  Output shape:
    3D tensor with shape: `(batch_size, input_length, output_dim)`.
  
embedding_layer = layers.Embedding(Text.vocabulary_size, Text.embeddings_size)

The first argument is the number of possible words in the vocabulary and the second is the number of dimensions. The Embedding is a sort of lookup table that maps an integer that represents a word to a vector. In this case we're going to build a vocabulary of 1,000 words represented by vectors with a length of 16. The weights in the vectors are learned when we train the model and will encode the distance between words.

The input to the embeddings layer is a 2D tensor of integers with the shape (number of samples, sequence_length). The sequences are integer-encoded sentences of the same length - so you have to pad the shorter sentences to match the longest one (the sequence_length).

The output of the embeddings layer is a 3D tensor with the shape (number of samples, sequence_length, embedding_dimensionality).
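
As a quick check of those shapes, we can push a fake batch of integer-encoded "sentences" through the layer (this little experiment isn't part of the tensorflow tutorial).

import numpy

# two "sentences", each five tokens long, encoded as integers below the vocabulary size
fake_batch = numpy.random.randint(0, Text.vocabulary_size, size=(2, 5))
vectors = embedding_layer(fake_batch)

# (number of samples, sequence_length, embedding_dimensionality) -> (2, 5, 16)
print(vectors.shape)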

The Dataset

(train_data, test_data), info = tensorflow_datasets.load(
    "imdb_reviews/subwords8k",
    split=(tensorflow_datasets.Split.TRAIN,
           tensorflow_datasets.Split.TEST),
    with_info=True, as_supervised=True)
encoder = info.features["text"].encoder
print(encoder.subwords[:10])
['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']

Add Padding

padded_shapes = ([None], ())
train_batches = train_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes)
test_batches = test_data.shuffle(Text.vocabulary_size).padded_batch(
    10, padded_shapes=padded_shapes
)

Checkout a Sample

batch, labels = next(iter(train_batches))
print(batch.numpy())
[[  62    9    4 ...    0    0    0]
 [  19 2428    6 ...    0    0    0]
 [ 691    2  594 ... 7961 1457 7975]
 ...
 [6072 5644 8043 ...    0    0    0]
 [ 977   15   57 ...    0    0    0]
 [5646    2    1 ...    0    0    0]]

Build a Model

model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, Text.embeddings_size),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid")
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
None

Compile and Train

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
ONCE_PER_EPOCH = 2
with TIMER:
    history = model.fit(train_batches, epochs=10,
                        validation_data=test_batches,
                        verbose=ONCE_PER_EPOCH,
                        validation_steps=20)
2019-09-28 17:14:52,764 graeae.timers.timer start: Started: 2019-09-28 17:14:52.764725
I0928 17:14:52.764965 140515023214400 timer.py:70] Started: 2019-09-28 17:14:52.764725
W0928 17:14:52.806057 140515023214400 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
 val_loss: 0.3015 - val_accuracy: 0.8900
2019-09-28 17:17:36,036 graeae.timers.timer end: Ended: 2019-09-28 17:17:36.036090
I0928 17:17:36.036139 140515023214400 timer.py:77] Ended: 2019-09-28 17:17:36.036090
2019-09-28 17:17:36,037 graeae.timers.timer end: Elapsed: 0:02:43.271365
I0928 17:17:36.037808 140515023214400 timer.py:78] Elapsed: 0:02:43.271365

End

data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="Training/Validation Performance",
                          width=1000,
                          height=800)
Embed(plot=plot, file_name="training")()

Figure Missing

Amazingly, even with such a simple model, it managed a 92 % validation accuracy.

IMDB GRU With Tokenization

Beginning

This is another version of the RNN model to classify the IMDB reviews, but this time we're going to tokenize the reviews ourselves and use a GRU, instead of using the pre-tokenized tensorflow-datasets version.

Imports

Python

from argparse import Namespace

PyPi

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import hvplot.pandas
import numpy
import pandas
import tensorflow
import tensorflow_datasets

Other

from graeae import Timer, EmbedHoloviews

Set Up

The Timer

TIMER = Timer()

Plotting
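
The setup for the Embed helper seems to have gone missing from this post, so here's a sketch of what it presumably looks like, mirroring the other posts in this series (the folder name is a guess based on the post title).

# assumed setup - the slug is a guess, not taken from the original post
from functools import partial

SLUG = "imdb-gru-with-tokenization"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")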

Middle

Set Up the Data

imdb, info = tensorflow_datasets.load("imdb_reviews",
                                      with_info=True,
                                      as_supervised=True)
WARNING: Logging before flag parsing goes to stderr.
W0924 21:52:10.158111 139862640383808 dataset_builder.py:439] Warning: Setting shuffle_files=True because split=TRAIN and shuffle_files=None. This behavior will be deprecated on 2019-08-06, at which point shuffle_files=False will be the default for all splits.
training, testing = imdb["train"], imdb["test"]

Building Up the Tokenizer

Since we didn't pass in a specifier for the configuration we wanted (e.g. imdb/subwords8k) it defaulted to giving us the plain text reviews (and their labels) so we have to build the tokenizer ourselves.

Split Up the Sentences and Their Labels

As you might recall, the data set consists of 50,000 IMDB movie reviews categorized as positive or negative. To build the tokenizer we first have to split the sentences from their labels.

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []
with TIMER:
    for sentence, label in training:
        training_sentences.append(str(sentence.numpy()))
        training_labels.append(label.numpy())

    for sentence, label in testing:
        testing_sentences.append(str(sentence.numpy()))
        testing_labels.append(label.numpy())
2019-09-24 21:52:11,396 graeae.timers.timer start: Started: 2019-09-24 21:52:11.395126
I0924 21:52:11.396310 139862640383808 timer.py:70] Started: 2019-09-24 21:52:11.395126
2019-09-24 21:52:18,667 graeae.timers.timer end: Ended: 2019-09-24 21:52:18.667789
I0924 21:52:18.667830 139862640383808 timer.py:77] Ended: 2019-09-24 21:52:18.667789
2019-09-24 21:52:18,670 graeae.timers.timer end: Elapsed: 0:00:07.272663
I0924 21:52:18.670069 139862640383808 timer.py:78] Elapsed: 0:00:07.272663
training_labels_final = numpy.array(training_labels)
testing_labels_final = numpy.array(testing_labels)

Some Constants

Text = Namespace(
    vocab_size = 10000,
    embedding_dim = 16,
    max_length = 120,
    trunc_type='post',
    oov_token = "<OOV>",
)

Build the Tokenizer

tokenizer = Tokenizer(num_words=Text.vocab_size, oov_token=Text.oov_token)
with TIMER:
    tokenizer.fit_on_texts(training_sentences)

    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(training_sentences)
    padded = pad_sequences(sequences, maxlen=Text.max_length, truncating=Text.trunc_type)

    testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
    testing_padded = pad_sequences(testing_sequences, maxlen=Text.max_length)
2019-09-24 21:52:21,705 graeae.timers.timer start: Started: 2019-09-24 21:52:21.705287
I0924 21:52:21.705317 139862640383808 timer.py:70] Started: 2019-09-24 21:52:21.705287
2019-09-24 21:52:32,152 graeae.timers.timer end: Ended: 2019-09-24 21:52:32.152267
I0924 21:52:32.152314 139862640383808 timer.py:77] Ended: 2019-09-24 21:52:32.152267
2019-09-24 21:52:32,154 graeae.timers.timer end: Elapsed: 0:00:10.446980
I0924 21:52:32.154620 139862640383808 timer.py:78] Elapsed: 0:00:10.446980

Decoder Ring

index_to_word = {value: key for key, value in word_index.items()}

def decode_review(text: numpy.array) -> str:
    return " ".join([index_to_word.get(item, "<?>") for item in text])

Build the Model

This time we're going to build a four-layer model with one Bidirectional layer that uses a GRU (Gated Recurrent Unit) instead of an LSTM.

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(Text.vocab_size, Text.embedding_dim, input_length=Text.max_length),
    tensorflow.keras.layers.Bidirectional(tensorflow.compat.v2.keras.layers.GRU(32)),
    tensorflow.keras.layers.Dense(6, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                9600      
_________________________________________________________________
dense (Dense)                (None, 6)                 390       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
=================================================================
Total params: 169,997
Trainable params: 169,997
Non-trainable params: 0
_________________________________________________________________
None

Train it

EPOCHS = 50
ONCE_PER_EPOCH = 2
batch_size = 8
history = model.fit(padded, training_labels_final,
                    epochs=EPOCHS,
                    batch_size=batch_size,
                    validation_data=(testing_padded, testing_labels_final),
                    verbose=ONCE_PER_EPOCH)

Plot It

data = pandas.DataFrame(history.history)
plot = data.hvplot().opts(title="GRU Training Performance", width=1000, height=800)
Embed(plot=plot, file_name="gru_training")()


He Used Sarcasm

Beginning

This is a look at fitting a model to detect sarcasm using a json blob from Laurence Moroney (https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json).

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path
from pprint import pprint
from urllib.parse import urlparse
import json

PyPi

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import hvplot.pandas
import pandas
import tensorflow

Other

from graeae import (
    CountPercentage,
    EmbedHoloviews,
    TextDownloader,
    Timer
)

Set Up

The Timer

TIMER = Timer()

The Plotting

SLUG = "he-used-sarcasm"
Embed = partial(EmbedHoloviews, folder_path=f"../../files/posts/keras/{SLUG}")

The Data

URL = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json"
path = Path("~/data/datasets/text/sarcasm/sarcasm.json").expanduser()
downloader = TextDownloader(URL, path)
with TIMER:
    data = json.loads(downloader.download)
2019-09-22 15:05:27,302 graeae.timers.timer start: Started: 2019-09-22 15:05:27.301225
WARNING: Logging before flag parsing goes to stderr.
I0922 15:05:27.302001 139873020925760 timer.py:70] Started: 2019-09-22 15:05:27.301225
2019-09-22 15:05:27,306 TextDownloader download: /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
I0922 15:05:27.306186 139873020925760 downloader.py:51] /home/hades/data/datasets/text/sarcasm/sarcasm.json exists, opening it
2019-09-22 15:05:27,367 graeae.timers.timer end: Ended: 2019-09-22 15:05:27.367036
I0922 15:05:27.367099 139873020925760 timer.py:77] Ended: 2019-09-22 15:05:27.367036
2019-09-22 15:05:27,369 graeae.timers.timer end: Elapsed: 0:00:00.065811
I0922 15:05:27.369417 139873020925760 timer.py:78] Elapsed: 0:00:00.065811

Middle

Looking At the Data

pprint(data[0])
{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for "
             'minority shoppers',
 'is_sarcastic': 0}

So each entry in our data is a dictionary with three keys - the source of the article, the headline of the article, and whether or not it's a sarcastic headline. There's no citation in the original notebook, but it looks like it might be this one on GitHub.

data = pandas.DataFrame(data)
data.loc[:, "site"] = data.article_link.apply(lambda link: urlparse(link).netloc)
CountPercentage(data.site, value_label="Site")()
Site Count Percent (%)
www.huffingtonpost.com 14403 53.93
www.theonion.com 5811 21.76
local.theonion.com 2852 10.68
politics.theonion.com 1767 6.62
entertainment.theonion.com 1194 4.47
www.huffingtonpost.comhttp: 503 1.88
sports.theonion.com 100 0.37
www.huffingtonpost.comhttps: 79 0.30

So, it looks like there are some problems with the URLs. I don't think that's important, but maybe I can clean them up a little anyway.

print(
    data[
        data.site.str.contains("www.huffingtonpost.comhttp"
                               )].article_link.value_counts())
https://www.huffingtonpost.comhttp://nymag.com/daily/intelligencer/2016/05/hillary-clinton-candidacy.html                                                                                              2
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostQueerVoices/videos/1153919084666530/                                                                                                    1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/chris-meloni-star-underground/56d8584f99ec6dca3d00000a                                                                          1
https://www.huffingtonpost.comhttp://www.thestreet.com/story/13223501/1/post-retirement-work-may-not-save-your-golden-years.html                                                                       1
https://www.huffingtonpost.comhttps://www.facebook.com/HuffPostEntertainment/                                                                                                                          1
                                                                                                                                                                                                      ..
https://www.huffingtonpost.comhttp://nymag.com/thecut/2015/10/first-legal-abortionists-tell-their-stories.html?mid=twitter_nymag                                                                       1
https://www.huffingtonpost.comhttp://www.tampabay.com/blogs/the-buzz-florida-politics/marco-rubio-warming-up-to-donald-trump/2275308                                                                   1
https://www.huffingtonpost.comhttp://live.huffingtonpost.com/r/segment/porn-to-pay-for-college/55aeadf62b8c2a2f6f000193                                                                                1
https://www.huffingtonpost.comhttps://www.thedodo.com/dog-mouth-taped-shut-facebook-1481874724.html                                                                                                    1
https://www.huffingtonpost.comhttps://www.washingtonpost.com/politics/ben-carson-to-tell-supporters-he-sees-no-path-forward-for-campaign/2016/03/02/d6bef352-d9b3-11e5-891a-4ed04f4213e8_story.html    1
Name: article_link, Length: 581, dtype: int64

That's kind of odd and I don't know what it means - maybe the Huffington Post was citing other sites? I went to check the GitHub dataset I mentioned, but it's actually much larger than this one so I don't know if it's really the source or not.

prefixes = ("www.huffingtonpost.comhttp:", "www.huffingtonpost.comhttps:")
for prefix in prefixes:
    data.loc[:, "site"] = data.site.str.replace(
        prefix,
        "www.huffingtonpost.com")

prefixes = ("local.theonion.com",
            "politics.theonion.com",
            "entertainment.theonion.com",
            "sports.theonion.com")

for prefix in prefixes:
    data.loc[:, "site"] = data.site.str.replace(prefix,
                                                "www.theonion.com")
counter = CountPercentage(data.site, value_label="Site")
counter()
Site Count Percent (%)
www.huffingtonpost.com 14985 56.10
www.theonion.com 11724 43.90
plot = counter.table.hvplot.bar(x="Site", y="Count").opts(
    title="Distribution by Site",
    width=1000,
    height=800)
Embed(plot=plot, file_name="site_distribution")()

Figure Missing

counter = CountPercentage(data.is_sarcastic, value_label="Is Sarcastic")
counter()
Is Sarcastic Count Percent (%)
0 14,985 56.10
1 11,724 43.90

Given that the counts match, I'm assuming anything from the Huffington Post is labeled as not sarcastic and anything from The Onion is sarcastic.

assert all(data[data.site=="www.theonion.com"].is_sarcastic)
assert not any(data[data.site=="www.huffingtonpost.com"].is_sarcastic)

Set Up the Tokenizing and Training Data

print(f"{len(data):,}")
26,709
Text = Namespace(
    vocabulary_size = 1000,
    embedding_dim = 16,
    max_length = 120,
    truncating_type='post',
    padding_type='post',
    out_of_vocabulary_tok = "<OOV>",
)

# this is actually the default for train_test_split
Training = Namespace(
    size = 0.75,
    epochs = 50,
    verbosity = 2,
    )

The Training and Testing Data

x_train, x_test, y_train, y_test = train_test_split(
    data.headline, data.is_sarcastic, train_size=Training.size,
)

The Tokenizer

tokenizer = Tokenizer(num_words=Text.vocabulary_size,
                      oov_token=Text.out_of_vocabulary_tok)
print(tokenizer.__doc__)
Text tokenization utility class.

    This class allows to vectorize a text corpus, by turning each
    text into either a sequence of integers (each integer being the index
    of a token in a dictionary) or into a vector where the coefficient
    for each token could be binary, based on word count, based on tf-idf...

    # Arguments
        num_words: the maximum number of words to keep, based
            on word frequency. Only the most common `num_words-1` words will
            be kept.
        filters: a string where each element is a character that will be
            filtered from the texts. The default is all punctuation, plus
            tabs and line breaks, minus the `'` character.
        lower: boolean. Whether to convert the texts to lowercase.
        split: str. Separator for word splitting.
        char_level: if True, every character will be treated as a token.
        oov_token: if given, it will be added to word_index and used to
            replace out-of-vocabulary words during text_to_sequence calls

    By default, all punctuation is removed, turning the texts into
    space-separated sequences of words
    (words maybe include the `'` character). These sequences are then
    split into lists of tokens. They will then be indexed or vectorized.

    `0` is a reserved index that won't be assigned to any word.

Now that we have a tokenizer we can tokenize our training headlines.

help(tokenizer.fit_on_texts)
Help on method fit_on_texts in module keras_preprocessing.text:

fit_on_texts(texts) method of keras_preprocessing.text.Tokenizer instance
    Updates internal vocabulary based on a list of texts.
    
    In the case where texts contains lists,
    we assume each entry of the lists to be a token.
    
    Required before using `texts_to_sequences` or `texts_to_matrix`.
    
    # Arguments
        texts: can be a list of strings,
            a generator of strings (for memory-efficiency),
            or a list of list of strings.

tokenizer.fit_on_texts(x_train)

Now that we've fit the tokenizer on the headlines we can get the word index, a dict mapping each word to its integer index.

word_index = tokenizer.word_index

Note that the tokenizer doesn't remove stop-words.

print("the" in word_index)
True
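As a quick illustration of the out-of-vocabulary handling (a throwaway example with made-up sentences, not part of the pipeline), a word the tokenizer never saw during fitting gets mapped to the <OOV> token, which typically takes index 1:

# a throwaway tokenizer, just to show what happens to unseen words
toy = Tokenizer(num_words=100, oov_token=Text.out_of_vocabulary_tok)
toy.fit_on_texts(["the cat sat on the mat", "the dog sat"])
print(toy.word_index)
# e.g. {'<OOV>': 1, 'the': 2, 'sat': 3, 'cat': 4, 'on': 5, 'mat': 6, 'dog': 7}
print(toy.texts_to_sequences(["the platypus sat"]))
# 'platypus' was never seen, so it becomes the <OOV> index - e.g. [[2, 1, 3]]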

Now we'll convert the training headlines to sequences of numbers.

help(tokenizer.texts_to_sequences)
Help on method texts_to_sequences in module keras_preprocessing.text:

texts_to_sequences(texts) method of keras_preprocessing.text.Tokenizer instance
    Transforms each text in texts to a sequence of integers.
    
    Only top `num_words-1` most frequent words will be taken into account.
    Only words known by the tokenizer will be taken into account.
    
    # Arguments
        texts: A list of texts (strings).
    
    # Returns
        A list of sequences.

We're also going to have to pad them to make them the same length.

help(pad_sequences)
Help on function pad_sequences in module keras_preprocessing.sequence:

pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)
    Pads sequences to the same length.
    
    This function transforms a list of
    `num_samples` sequences (lists of integers)
    into a 2D Numpy array of shape `(num_samples, num_timesteps)`.
    `num_timesteps` is either the `maxlen` argument if provided,
    or the length of the longest sequence otherwise.
    
    Sequences that are shorter than `num_timesteps`
    are padded with `value` at the end.
    
    Sequences longer than `num_timesteps` are truncated
    so that they fit the desired length.
    The position where padding or truncation happens is determined by
    the arguments `padding` and `truncating`, respectively.
    
    Pre-padding is the default.
    
    # Arguments
        sequences: List of lists, where each element is a sequence.
        maxlen: Int, maximum length of all sequences.
        dtype: Type of the output sequences.
            To pad sequences with variable length strings, you can use `object`.
        padding: String, 'pre' or 'post':
            pad either before or after each sequence.
        truncating: String, 'pre' or 'post':
            remove values from sequences larger than
            `maxlen`, either at the beginning or at the end of the sequences.
        value: Float or String, padding value.
    
    # Returns
        x: Numpy array with shape `(len(sequences), maxlen)`
    
    # Raises
        ValueError: In case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.

training_sequences = tokenizer.texts_to_sequences(x_train)
training_padded = pad_sequences(training_sequences, maxlen=Text.max_length, padding=Text.padding_type, truncating=Text.truncating_type)

testing_sequences = tokenizer.texts_to_sequences(x_test)
testing_padded = pad_sequences(testing_sequences, maxlen=Text.max_length, padding=Text.padding_type, truncating=Text.truncating_type)
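As a quick sanity check (not in the original post), every headline should now be a fixed-length row of Text.max_length integers:

print(training_padded.shape)
print(testing_padded.shape)
assert training_padded.shape == (len(x_train), Text.max_length)
assert testing_padded.shape == (len(x_test), Text.max_length)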

Build The Model

We're going to use a convolutional neural network to try and classify our headlines as sarcastic or not-sarcastic.

It's a Sequence of Layers

print(tensorflow.keras.Sequential.__doc__)
Linear stack of layers.

  Arguments:
      layers: list of layers to add to the model.

  Example:

  ```python
  # Optionally, the first layer can receive an `input_shape` argument:
  model = Sequential()
  model.add(Dense(32, input_shape=(500,)))
  # Afterwards, we do automatic shape inference:
  model.add(Dense(32))

  # This is identical to the following:
  model = Sequential()
  model.add(Dense(32, input_dim=500))

  # And to the following:
  model = Sequential()
  model.add(Dense(32, batch_input_shape=(None, 500)))

  # Note that you can also omit the `input_shape` argument:
  # In that case the model gets built the first time you call `fit` (or other
  # training and evaluation methods).
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.compile(optimizer=optimizer, loss=loss)
  # This builds the model for the first time:
  model.fit(x, y, batch_size=32, epochs=10)

  # Note that when using this delayed-build pattern (no input shape specified),
  # the model doesn't have any weights until the first call
  # to a training/evaluation method (since it isn't yet built):
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.weights  # returns []

  # Whereas if you specify the input shape, the model gets built continuously
  # as you are adding layers:
  model = Sequential()
  model.add(Dense(32, input_shape=(500,)))
  model.add(Dense(32))
  model.weights  # returns list of length 4

  # When using the delayed-build pattern (no input shape specified), you can
  # choose to manually build your model by calling `build(batch_input_shape)`:
  model = Sequential()
  model.add(Dense(32))
  model.add(Dense(32))
  model.build((None, 500))
  model.weights  # returns list of length 4
  ```
  

Start With An Embedding Layer

print(tensorflow.keras.layers.Embedding.__doc__)
Turns positive integers (indexes) into dense vectors of fixed size.

  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`

  This layer can only be used as the first layer in a model.

  Example:

  ```python
  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch,
  # input_length).
  # the largest integer (i.e. word index) in the input should be no larger
  # than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch
  # dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
  ```

  Arguments:
    input_dim: int > 0. Size of the vocabulary,
      i.e. maximum integer index + 1.
    output_dim: int >= 0. Dimension of the dense embedding.
    embeddings_initializer: Initializer for the `embeddings` matrix.
    embeddings_regularizer: Regularizer function applied to
      the `embeddings` matrix.
    embeddings_constraint: Constraint function applied to
      the `embeddings` matrix.
    mask_zero: Whether or not the input value 0 is a special "padding"
      value that should be masked out.
      This is useful when using recurrent layers
      which may take variable length input.
      If this is `True` then all subsequent layers
      in the model need to support masking or an exception will be raised.
      If mask_zero is set to True, as a consequence, index 0 cannot be
      used in the vocabulary (input_dim should equal size of
      vocabulary + 1).
    input_length: Length of input sequences, when it is constant.
      This argument is required if you are going to connect
      `Flatten` then `Dense` layers upstream
      (without it, the shape of the dense outputs cannot be computed).

  Input shape:
    2D tensor with shape: `(batch_size, input_length)`.

  Output shape:
    3D tensor with shape: `(batch_size, input_length, output_dim)`.
  

The Convolutional Layer

print(tensorflow.keras.layers.Conv1D.__doc__)
1D convolution layer (e.g. temporal convolution).

  This layer creates a convolution kernel that is convolved
  with the layer input over a single spatial (or temporal) dimension
  to produce a tensor of outputs.
  If `use_bias` is True, a bias vector is created and added to the outputs.
  Finally, if `activation` is not `None`,
  it is applied to the outputs as well.

  When using this layer as the first layer in a model,
  provide an `input_shape` argument
  (tuple of integers or `None`, e.g.
  `(10, 128)` for sequences of 10 vectors of 128-dimensional vectors,
  or `(None, 128)` for variable-length sequences of 128-dimensional vectors.

  Arguments:
    filters: Integer, the dimensionality of the output space
      (i.e. the number of output filters in the convolution).
    kernel_size: An integer or tuple/list of a single integer,
      specifying the length of the 1D convolution window.
    strides: An integer or tuple/list of a single integer,
      specifying the stride length of the convolution.
      Specifying any stride value != 1 is incompatible with specifying
      any `dilation_rate` value != 1.
    padding: One of `"valid"`, `"causal"` or `"same"` (case-insensitive).
      `"causal"` results in causal (dilated) convolutions, e.g. output[t]
      does not depend on input[t+1:]. Useful when modeling temporal data
      where the model should not violate the temporal order.
      See [WaveNet: A Generative Model for Raw Audio, section
        2.1](https://arxiv.org/abs/1609.03499).
    data_format: A string,
      one of `channels_last` (default) or `channels_first`.
    dilation_rate: an integer or tuple/list of a single integer, specifying
      the dilation rate to use for dilated convolution.
      Currently, specifying any `dilation_rate` value != 1 is
      incompatible with specifying any `strides` value != 1.
    activation: Activation function to use.
      If you don't specify anything, no activation is applied
      (ie. "linear" activation: `a(x) = x`).
    use_bias: Boolean, whether the layer uses a bias vector.
    kernel_initializer: Initializer for the `kernel` weights matrix.
    bias_initializer: Initializer for the bias vector.
    kernel_regularizer: Regularizer function applied to
      the `kernel` weights matrix.
    bias_regularizer: Regularizer function applied to the bias vector.
    activity_regularizer: Regularizer function applied to
      the output of the layer (its "activation")..
    kernel_constraint: Constraint function applied to the kernel matrix.
    bias_constraint: Constraint function applied to the bias vector.

  Examples:
    ```python
    # Small convolutional model for 128-length vectors with 6 timesteps
    # model.input_shape == (None, 6, 128)
    
    model = Sequential()
    model.add(Conv1D(32, 3, 
              activation='relu', 
              input_shape=(6, 128)))
    
    # now: model.output_shape == (None, 4, 32)
    ```

  Input shape:
    3D tensor with shape: `(batch_size, steps, input_dim)`

  Output shape:
    3D tensor with shape: `(batch_size, new_steps, filters)`
      `steps` value might have changed due to padding or strides.
  

A Pooling Layer

print(tensorflow.keras.layers.GlobalMaxPooling1D.__doc__)
Global max pooling operation for temporal data.

  Arguments:
    data_format: A string,
      one of `channels_last` (default) or `channels_first`.
      The ordering of the dimensions in the inputs.
      `channels_last` corresponds to inputs with shape
      `(batch, steps, features)` while `channels_first`
      corresponds to inputs with shape
      `(batch, features, steps)`.

  Input shape:
    - If `data_format='channels_last'`:
      3D tensor with shape:
      `(batch_size, steps, features)`
    - If `data_format='channels_first'`:
      3D tensor with shape:
      `(batch_size, features, steps)`

  Output shape:
    2D tensor with shape `(batch_size, features)`.
  

The Fully-Connected Layers

Finally our output layers.

print(tensorflow.keras.layers.Dense.__doc__)
Just your regular densely-connected NN layer.

  `Dense` implements the operation:
  `output = activation(dot(input, kernel) + bias)`
  where `activation` is the element-wise activation function
  passed as the `activation` argument, `kernel` is a weights matrix
  created by the layer, and `bias` is a bias vector created by the layer
  (only applicable if `use_bias` is `True`).

  Note: If the input to the layer has a rank greater than 2, then
  it is flattened prior to the initial dot product with `kernel`.

  Example:

  ```python
  # as first layer in a sequential model:
  model = Sequential()
  model.add(Dense(32, input_shape=(16,)))
  # now the model will take as input arrays of shape (*, 16)
  # and output arrays of shape (*, 32)

  # after the first layer, you don't need to specify
  # the size of the input anymore:
  model.add(Dense(32))
  ```

  Arguments:
    units: Positive integer, dimensionality of the output space.
    activation: Activation function to use.
      If you don't specify anything, no activation is applied
      (ie. "linear" activation: `a(x) = x`).
    use_bias: Boolean, whether the layer uses a bias vector.
    kernel_initializer: Initializer for the `kernel` weights matrix.
    bias_initializer: Initializer for the bias vector.
    kernel_regularizer: Regularizer function applied to
      the `kernel` weights matrix.
    bias_regularizer: Regularizer function applied to the bias vector.
    activity_regularizer: Regularizer function applied to
      the output of the layer (its "activation")..
    kernel_constraint: Constraint function applied to
      the `kernel` weights matrix.
    bias_constraint: Constraint function applied to the bias vector.

  Input shape:
    N-D tensor with shape: `(batch_size, ..., input_dim)`.
    The most common situation would be
    a 2D input with shape `(batch_size, input_dim)`.

  Output shape:
    N-D tensor with shape: `(batch_size, ..., units)`.
    For instance, for a 2D input with shape `(batch_size, input_dim)`,
    the output would have shape `(batch_size, units)`.
  

Build It

I originally added the layers using the model.add method, but when I tried to train it the output said the layers didn't have gradients and it never improved… I'll have to look into that, but in the meantime, passing them all in to the Sequential constructor seems to work.

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(
        input_dim=Text.vocabulary_size,
        output_dim=Text.embedding_dim,
        input_length=Text.max_length),
    tensorflow.keras.layers.Conv1D(filters=128,
                                   kernel_size=5,
                                   activation='relu'),
    tensorflow.keras.layers.GlobalMaxPooling1D(),
    tensorflow.keras.layers.Dense(24, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])

Compile It

print(model.compile.__doc__)
Configures the model for training.

    Arguments:
        optimizer: String (name of optimizer) or optimizer instance.
            See `tf.keras.optimizers`.
        loss: String (name of objective function), objective function or
            `tf.losses.Loss` instance. See `tf.losses`. If the model has
            multiple outputs, you can use a different loss on each output by
            passing a dictionary or a list of losses. The loss value that will
            be minimized by the model will then be the sum of all individual
            losses.
        metrics: List of metrics to be evaluated by the model during training
            and testing. Typically you will use `metrics=['accuracy']`.
            To specify different metrics for different outputs of a
            multi-output model, you could also pass a dictionary, such as
            `metrics={'output_a': 'accuracy', 'output_b': ['accuracy', 'mse']}`.
            You can also pass a list (len = len(outputs)) of lists of metrics
            such as `metrics=[['accuracy'], ['accuracy', 'mse']]` or
            `metrics=['accuracy', ['accuracy', 'mse']]`.
        loss_weights: Optional list or dictionary specifying scalar
            coefficients (Python floats) to weight the loss contributions
            of different model outputs.
            The loss value that will be minimized by the model
            will then be the *weighted sum* of all individual losses,
            weighted by the `loss_weights` coefficients.
            If a list, it is expected to have a 1:1 mapping
            to the model's outputs. If a tensor, it is expected to map
            output names (strings) to scalar coefficients.
        sample_weight_mode: If you need to do timestep-wise
            sample weighting (2D weights), set this to `"temporal"`.
            `None` defaults to sample-wise weights (1D).
            If the model has multiple outputs, you can use a different
            `sample_weight_mode` on each output by passing a
            dictionary or a list of modes.
        weighted_metrics: List of metrics to be evaluated and weighted
            by sample_weight or class_weight during training and testing.
        target_tensors: By default, Keras will create placeholders for the
            model's target, which will be fed with the target data during
            training. If instead you would like to use your own
            target tensors (in turn, Keras will not expect external
            Numpy data for these targets at training time), you
            can specify them via the `target_tensors` argument. It can be
            a single tensor (for a single-output model), a list of tensors,
            or a dict mapping output names to target tensors.
        distribute: NOT SUPPORTED IN TF 2.0, please create and compile the
            model under distribution strategy scope instead of passing it to
            compile.
        **kwargs: Any additional arguments.

    Raises:
        ValueError: In case of invalid arguments for
            `optimizer`, `loss`, `metrics` or `sample_weight_mode`.
    
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 120, 16)           16000     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 116, 128)          10368     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 24)                3096      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 25        
=================================================================
Total params: 29,489
Trainable params: 29,489
Non-trainable params: 0
_________________________________________________________________
None
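The parameter counts line up with the hyperparameters; as a quick check of the arithmetic (just a sketch re-using the same numbers as the summary above):

# embedding: one 16-dimensional vector per vocabulary entry
embedding_parameters = Text.vocabulary_size * Text.embedding_dim  # 1000 * 16 = 16,000
# conv1d: kernel_size * input_channels * filters, plus one bias per filter
convolution_parameters = 5 * Text.embedding_dim * 128 + 128       # 10,368
# dense: inputs * units, plus one bias per unit
hidden_parameters = 128 * 24 + 24                                 # 3,096
output_parameters = 24 * 1 + 1                                    # 25
print(f"{embedding_parameters + convolution_parameters + hidden_parameters + output_parameters:,}")
# 29,489 - matching the summary's total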

Train It

print(model.fit.__doc__)
Trains the model for a fixed number of epochs (iterations on a dataset).

    Arguments:
        x: Input data. It could be:
          - A Numpy array (or array-like), or a list of arrays
            (in case the model has multiple inputs).
          - A TensorFlow tensor, or a list of tensors
            (in case the model has multiple inputs).
          - A dict mapping input names to the corresponding array/tensors,
            if the model has named inputs.
          - A `tf.data` dataset. Should return a tuple
            of either `(inputs, targets)` or
            `(inputs, targets, sample_weights)`.
          - A generator or `keras.utils.Sequence` returning `(inputs, targets)`
            or `(inputs, targets, sample weights)`.
        y: Target data. Like the input data `x`,
          it could be either Numpy array(s) or TensorFlow tensor(s).
          It should be consistent with `x` (you cannot have Numpy inputs and
          tensor targets, or inversely). If `x` is a dataset, generator,
          or `keras.utils.Sequence` instance, `y` should
          not be specified (since targets will be obtained from `x`).
        batch_size: Integer or `None`.
            Number of samples per gradient update.
            If unspecified, `batch_size` will default to 32.
            Do not specify the `batch_size` if your data is in the
            form of symbolic tensors, datasets,
            generators, or `keras.utils.Sequence` instances (since they generate
            batches).
        epochs: Integer. Number of epochs to train the model.
            An epoch is an iteration over the entire `x` and `y`
            data provided.
            Note that in conjunction with `initial_epoch`,
            `epochs` is to be understood as "final epoch".
            The model is not trained for a number of iterations
            given by `epochs`, but merely until the epoch
            of index `epochs` is reached.
        verbose: 0, 1, or 2. Verbosity mode.
            0 = silent, 1 = progress bar, 2 = one line per epoch.
            Note that the progress bar is not particularly useful when
            logged to a file, so verbose=2 is recommended when not running
            interactively (eg, in a production environment).
        callbacks: List of `keras.callbacks.Callback` instances.
            List of callbacks to apply during training.
            See `tf.keras.callbacks`.
        validation_split: Float between 0 and 1.
            Fraction of the training data to be used as validation data.
            The model will set apart this fraction of the training data,
            will not train on it, and will evaluate
            the loss and any model metrics
            on this data at the end of each epoch.
            The validation data is selected from the last samples
            in the `x` and `y` data provided, before shuffling. This argument is
            not supported when `x` is a dataset, generator or
           `keras.utils.Sequence` instance.
        validation_data: Data on which to evaluate
            the loss and any model metrics at the end of each epoch.
            The model will not be trained on this data.
            `validation_data` will override `validation_split`.
            `validation_data` could be:
              - tuple `(x_val, y_val)` of Numpy arrays or tensors
              - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays
              - dataset
            For the first two cases, `batch_size` must be provided.
            For the last case, `validation_steps` must be provided.
        shuffle: Boolean (whether to shuffle the training data
            before each epoch) or str (for 'batch').
            'batch' is a special option for dealing with the
            limitations of HDF5 data; it shuffles in batch-sized chunks.
            Has no effect when `steps_per_epoch` is not `None`.
        class_weight: Optional dictionary mapping class indices (integers)
            to a weight (float) value, used for weighting the loss function
            (during training only).
            This can be useful to tell the model to
            "pay more attention" to samples from
            an under-represented class.
        sample_weight: Optional Numpy array of weights for
            the training samples, used for weighting the loss function
            (during training only). You can either pass a flat (1D)
            Numpy array with the same length as the input samples
            (1:1 mapping between weights and samples),
            or in the case of temporal data,
            you can pass a 2D array with shape
            `(samples, sequence_length)`,
            to apply a different weight to every timestep of every sample.
            In this case you should make sure to specify
            `sample_weight_mode="temporal"` in `compile()`. This argument is not
            supported when `x` is a dataset, generator, or
           `keras.utils.Sequence` instance, instead provide the sample_weights
            as the third element of `x`.
        initial_epoch: Integer.
            Epoch at which to start training
            (useful for resuming a previous training run).
        steps_per_epoch: Integer or `None`.
            Total number of steps (batches of samples)
            before declaring one epoch finished and starting the
            next epoch. When training with input tensors such as
            TensorFlow data tensors, the default `None` is equal to
            the number of samples in your dataset divided by
            the batch size, or 1 if that cannot be determined. If x is a
            `tf.data` dataset, and 'steps_per_epoch'
            is None, the epoch will run until the input dataset is exhausted.
            This argument is not supported with array inputs.
        validation_steps: Only relevant if `validation_data` is provided and
            is a `tf.data` dataset. Total number of steps (batches of
            samples) to draw before stopping when performing validation
            at the end of every epoch. If validation_data is a `tf.data` dataset
            and 'validation_steps' is None, validation
            will run until the `validation_data` dataset is exhausted.
        validation_freq: Only relevant if validation data is provided. Integer
            or `collections_abc.Container` instance (e.g. list, tuple, etc.).
            If an integer, specifies how many training epochs to run before a
            new validation run is performed, e.g. `validation_freq=2` runs
            validation every 2 epochs. If a Container, specifies the epochs on
            which to run validation, e.g. `validation_freq=[1, 2, 10]` runs
            validation at the end of the 1st, 2nd, and 10th epochs.
        max_queue_size: Integer. Used for generator or `keras.utils.Sequence`
            input only. Maximum size for the generator queue.
            If unspecified, `max_queue_size` will default to 10.
        workers: Integer. Used for generator or `keras.utils.Sequence` input
            only. Maximum number of processes to spin up
            when using process-based threading. If unspecified, `workers`
            will default to 1. If 0, will execute the generator on the main
            thread.
        use_multiprocessing: Boolean. Used for generator or
            `keras.utils.Sequence` input only. If `True`, use process-based
            threading. If unspecified, `use_multiprocessing` will default to
            `False`. Note that because this implementation relies on
            multiprocessing, you should not pass non-picklable arguments to
            the generator as they can't be passed easily to children processes.
        **kwargs: Used for backwards compatibility.

    Returns:
        A `History` object. Its `History.history` attribute is
        a record of training loss values and metrics values
        at successive epochs, as well as validation loss values
        and validation metrics values (if applicable).

    Raises:
        RuntimeError: If the model was never compiled.
        ValueError: In case of mismatch between the provided input data
            and what the model expects.
    
with TIMER:
    history = model.fit(training_padded,
                        y_train.values,
                        epochs=Training.epochs,
                        validation_data=(testing_padded, y_test.values),
                        verbose=Training.verbosity)
2019-09-22 16:30:20,369 graeae.timers.timer start: Started: 2019-09-22 16:30:20.369886
I0922 16:30:20.369913 139873020925760 timer.py:70] Started: 2019-09-22 16:30:20.369886
Train on 20031 samples, validate on 6678 samples
Epoch 1/50
20031/20031 - 5s - loss: 0.4741 - accuracy: 0.7623 - val_loss: 0.4033 - val_accuracy: 0.8146
Epoch 2/50
20031/20031 - 4s - loss: 0.3663 - accuracy: 0.8366 - val_loss: 0.3980 - val_accuracy: 0.8196
Epoch 3/50
20031/20031 - 4s - loss: 0.3306 - accuracy: 0.8554 - val_loss: 0.3909 - val_accuracy: 0.8240
Epoch 4/50
20031/20031 - 4s - loss: 0.2990 - accuracy: 0.8721 - val_loss: 0.4148 - val_accuracy: 0.8179
Epoch 5/50
20031/20031 - 4s - loss: 0.2697 - accuracy: 0.8867 - val_loss: 0.4050 - val_accuracy: 0.8282
Epoch 6/50
20031/20031 - 4s - loss: 0.2406 - accuracy: 0.9003 - val_loss: 0.4291 - val_accuracy: 0.8212
Epoch 7/50
20031/20031 - 4s - loss: 0.2080 - accuracy: 0.9165 - val_loss: 0.4650 - val_accuracy: 0.8181
Epoch 8/50
20031/20031 - 4s - loss: 0.1824 - accuracy: 0.9272 - val_loss: 0.5053 - val_accuracy: 0.8130
Epoch 9/50
20031/20031 - 4s - loss: 0.1559 - accuracy: 0.9393 - val_loss: 0.5389 - val_accuracy: 0.8065
Epoch 10/50
20031/20031 - 4s - loss: 0.1325 - accuracy: 0.9498 - val_loss: 0.6213 - val_accuracy: 0.8044
Epoch 11/50
20031/20031 - 4s - loss: 0.1104 - accuracy: 0.9599 - val_loss: 0.6902 - val_accuracy: 0.8034
Epoch 12/50
20031/20031 - 4s - loss: 0.0966 - accuracy: 0.9646 - val_loss: 0.7437 - val_accuracy: 0.8035
Epoch 13/50
20031/20031 - 4s - loss: 0.0848 - accuracy: 0.9689 - val_loss: 0.8285 - val_accuracy: 0.7954
Epoch 14/50
20031/20031 - 4s - loss: 0.0693 - accuracy: 0.9753 - val_loss: 0.9121 - val_accuracy: 0.7934
Epoch 15/50
20031/20031 - 4s - loss: 0.0608 - accuracy: 0.9777 - val_loss: 1.0783 - val_accuracy: 0.7931
Epoch 16/50
20031/20031 - 4s - loss: 0.0529 - accuracy: 0.9810 - val_loss: 1.0620 - val_accuracy: 0.7889
Epoch 17/50
20031/20031 - 4s - loss: 0.0506 - accuracy: 0.9819 - val_loss: 1.2497 - val_accuracy: 0.7889
Epoch 18/50
20031/20031 - 4s - loss: 0.0471 - accuracy: 0.9821 - val_loss: 1.2518 - val_accuracy: 0.7963
Epoch 19/50
20031/20031 - 4s - loss: 0.0457 - accuracy: 0.9819 - val_loss: 1.3492 - val_accuracy: 0.7917
Epoch 20/50
20031/20031 - 4s - loss: 0.0392 - accuracy: 0.9851 - val_loss: 1.3702 - val_accuracy: 0.7948
Epoch 21/50
20031/20031 - 4s - loss: 0.0357 - accuracy: 0.9860 - val_loss: 1.4300 - val_accuracy: 0.7948
Epoch 22/50
20031/20031 - 4s - loss: 0.0341 - accuracy: 0.9864 - val_loss: 1.5654 - val_accuracy: 0.7889
Epoch 23/50
20031/20031 - 4s - loss: 0.0360 - accuracy: 0.9860 - val_loss: 1.5615 - val_accuracy: 0.7951
Epoch 24/50
20031/20031 - 4s - loss: 0.0307 - accuracy: 0.9872 - val_loss: 1.6964 - val_accuracy: 0.7953
Epoch 25/50
20031/20031 - 4s - loss: 0.0283 - accuracy: 0.9893 - val_loss: 1.6917 - val_accuracy: 0.7920
Epoch 26/50
20031/20031 - 4s - loss: 0.0365 - accuracy: 0.9850 - val_loss: 1.6935 - val_accuracy: 0.7944
Epoch 27/50
20031/20031 - 4s - loss: 0.0342 - accuracy: 0.9851 - val_loss: 1.7912 - val_accuracy: 0.7853
Epoch 28/50
20031/20031 - 4s - loss: 0.0301 - accuracy: 0.9879 - val_loss: 1.8194 - val_accuracy: 0.7887
Epoch 29/50
20031/20031 - 4s - loss: 0.0254 - accuracy: 0.9887 - val_loss: 1.9231 - val_accuracy: 0.7922
Epoch 30/50
20031/20031 - 4s - loss: 0.0216 - accuracy: 0.9910 - val_loss: 1.9480 - val_accuracy: 0.7914
Epoch 31/50
20031/20031 - 4s - loss: 0.0243 - accuracy: 0.9895 - val_loss: 1.9487 - val_accuracy: 0.7847
Epoch 32/50
20031/20031 - 4s - loss: 0.0241 - accuracy: 0.9891 - val_loss: 2.0333 - val_accuracy: 0.7893
Epoch 33/50
20031/20031 - 4s - loss: 0.0334 - accuracy: 0.9863 - val_loss: 1.9498 - val_accuracy: 0.7937
Epoch 34/50
20031/20031 - 4s - loss: 0.0318 - accuracy: 0.9873 - val_loss: 2.0181 - val_accuracy: 0.7942
Epoch 35/50
20031/20031 - 4s - loss: 0.0273 - accuracy: 0.9882 - val_loss: 2.0254 - val_accuracy: 0.7913
Epoch 36/50
20031/20031 - 4s - loss: 0.0236 - accuracy: 0.9897 - val_loss: 2.1159 - val_accuracy: 0.7937
Epoch 37/50
20031/20031 - 4s - loss: 0.0204 - accuracy: 0.9905 - val_loss: 2.1018 - val_accuracy: 0.7950
Epoch 38/50
20031/20031 - 4s - loss: 0.0187 - accuracy: 0.9916 - val_loss: 2.1939 - val_accuracy: 0.7947
Epoch 39/50
20031/20031 - 4s - loss: 0.0253 - accuracy: 0.9888 - val_loss: 2.2090 - val_accuracy: 0.7920
Epoch 40/50
20031/20031 - 4s - loss: 0.0270 - accuracy: 0.9889 - val_loss: 2.2737 - val_accuracy: 0.7862
Epoch 41/50
20031/20031 - 4s - loss: 0.0234 - accuracy: 0.9893 - val_loss: 2.2559 - val_accuracy: 0.7926
Epoch 42/50
20031/20031 - 4s - loss: 0.0223 - accuracy: 0.9902 - val_loss: 2.3223 - val_accuracy: 0.7884
Epoch 43/50
20031/20031 - 4s - loss: 0.0251 - accuracy: 0.9897 - val_loss: 2.2547 - val_accuracy: 0.7863
Epoch 44/50
20031/20031 - 4s - loss: 0.0209 - accuracy: 0.9900 - val_loss: 2.3917 - val_accuracy: 0.7823
Epoch 45/50
20031/20031 - 4s - loss: 0.0245 - accuracy: 0.9889 - val_loss: 2.4222 - val_accuracy: 0.7881
Epoch 46/50
20031/20031 - 4s - loss: 0.0215 - accuracy: 0.9901 - val_loss: 2.4135 - val_accuracy: 0.7869
Epoch 47/50
20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9896 - val_loss: 2.3287 - val_accuracy: 0.7823
Epoch 48/50
20031/20031 - 4s - loss: 0.0191 - accuracy: 0.9918 - val_loss: 2.4639 - val_accuracy: 0.7845
Epoch 49/50
20031/20031 - 4s - loss: 0.0183 - accuracy: 0.9911 - val_loss: 2.6068 - val_accuracy: 0.7811
Epoch 50/50
20031/20031 - 4s - loss: 0.0229 - accuracy: 0.9897 - val_loss: 2.5152 - val_accuracy: 0.7928
2019-09-22 16:33:41,089 graeae.timers.timer end: Ended: 2019-09-22 16:33:41.089405
I0922 16:33:41.089459 139873020925760 timer.py:77] Ended: 2019-09-22 16:33:41.089405
2019-09-22 16:33:41,091 graeae.timers.timer end: Elapsed: 0:03:20.719519
I0922 16:33:41.091247 139873020925760 timer.py:78] Elapsed: 0:03:20.719519

Once again it looks like the model is overfitting; I should add a checkpoint, early stopping, or something similar.
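As a sketch of what that could look like (not part of the run above - the callback settings and the checkpoint file name are my own choices), Keras has an EarlyStopping callback that halts training once the validation loss stops improving and a ModelCheckpoint callback that keeps the best weights seen so far:

# a sketch of overfitting counter-measures, not what was run above
callbacks = [
    tensorflow.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True),
    tensorflow.keras.callbacks.ModelCheckpoint(
        "cnn_sarcasm.h5",  # hypothetical output path
        monitor="val_loss", save_best_only=True),
]
history = model.fit(training_padded,
                    y_train.values,
                    epochs=Training.epochs,
                    validation_data=(testing_padded, y_test.values),
                    verbose=Training.verbosity,
                    callbacks=callbacks)

In practice you'd want to rebuild and recompile the model before re-fitting so the early-stopped run starts from scratch rather than from the already overfit weights.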

Plot the Performance

performance = pandas.DataFrame(history.history)
plot = performance.hvplot().opts(title="CNN Sarcasm Training Performance",
                                 width=1000,
                                 height=800)
Embed(plot=plot, file_name="cnn_training")()

Figure Missing

There's something very wrong with the validation. I'll have to look into that.


End

Multi-Layer LSTM

Beginning

Imports

Python

from functools import partial
from pathlib import Path
import pickle

PyPi

import holoviews
import hvplot.pandas
import pandas
import tensorflow
import tensorflow_datasets

Others

from graeae import Timer, EmbedHoloviews

Set Up

The Timer

TIMER = Timer()

Plotting

Embed = partial(EmbedHoloviews,
                folder_path="../../files/posts/keras/multi-layer-lstm/")

The Dataset

This once again uses the IMDB dataset with 50,000 reviews. It has already been converted from strings to integers - each word (or subword) is encoded as its own integer. Adding with_info=True also returns an info object that holds the encoder with the token-to-integer mapping. Passing in imdb_reviews/subwords8k limits the vocabulary to roughly 8,000 subword tokens.

Note: The first time you run this it will download a fairly large dataset so it might appear to hang, but after the first time it is fairly quick.

dataset, info = tensorflow_datasets.load("imdb_reviews/subwords8k",
                                         with_info=True,
                                         as_supervised=True)

Middle

Set Up the Datasets

train_dataset, test_dataset = dataset["train"], dataset["test"]
tokenizer = info.features['text'].encoder
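Since the encoder is a subword encoder, it's worth a quick round trip to see what it does (the sample sentence here is made up, not from the dataset):

sample = "This movie was surprisingly good."  # hypothetical sentence
encoded = tokenizer.encode(sample)
print(encoded)                    # a list of integer subword ids
print(tokenizer.decode(encoded))  # should reproduce the original sentence
print(f"Vocabulary size: {tokenizer.vocab_size:,}")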

Now we're going to shuffle and pad the data. The BUFFER_SIZE argument sets the size of the pool to sample from: 10,000 entries from the training set are loaded into a buffer, and the "shuffle" is created by randomly drawing items from that buffer, refilling it as items are drawn, until all the data has passed through. The padded_batch method creates batches of consecutive entries and pads them so that everything in a batch has the same shape.

The BATCH_SIZE needs a little tuning. If it's too big, the memory required might keep the GPU from being able to hold it (and the model might not generalize as well); if it's too small, training will take a long time. If you train and the GPU utilization stays at 0, try reducing the batch size.

Also note that if you change the batch size you have to go back to the previous step and re-define train_dataset and test_dataset, because the next cell alters them in place and altering them a second time breaks their shapes.

BUFFER_SIZE = 10000
# if the batch size is too big it will run out of memory on the GPU 
# so you might have to experiment with this
BATCH_SIZE = 32

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
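One note in case you're on a newer TensorFlow (an assumption about your setup - the cell above is what actually ran here): Dataset.output_shapes was removed in later releases, and padded_batch can infer the padded shapes on its own (roughly 2.2 onwards), so the equivalent of the cell above would be something like:

# equivalent set-up for newer TensorFlow versions where output_shapes is gone
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)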

The Model

The previous model had one Bidirectional layer, this will add a second one.

Embedding

The Embedding layer converts our integer inputs to dense vectors of real numbers, which are a better input for a neural network.

Bidirectional

The Bidirectional layer is a wrapper that runs a recurrent layer over the input both forwards and backwards and concatenates the outputs.

LSTM

The LSTM layer implements Long Short-Term Memory. The first argument is the number of units, which sets the size of the output. This is similar to the model that we ran previously on the same data, but it has an extra Bidirectional LSTM layer (so it uses more memory).

model = tensorflow.keras.Sequential([
    tensorflow.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(64, return_sequences=True)),
    tensorflow.keras.layers.Bidirectional(
        tensorflow.keras.layers.LSTM(32)),
    tensorflow.keras.layers.Dense(64, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         66048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
_________________________________________________________________
None

Compile It

model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

Train the Model

ONCE_PER_EPOCH = 2
NUM_EPOCHS = 10
with TIMER:
    history = model.fit(train_dataset,
                        epochs=NUM_EPOCHS,
                        validation_data=test_dataset,
                        verbose=ONCE_PER_EPOCH)
2019-09-21 17:26:50,395 graeae.timers.timer start: Started: 2019-09-21 17:26:50.394797
I0921 17:26:50.395130 140275698915136 timer.py:70] Started: 2019-09-21 17:26:50.394797
Epoch 1/10
W0921 17:26:51.400280 140275698915136 deprecation.py:323] From /home/hades/.virtualenvs/In-Too-Deep/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
782/782 - 224s - loss: 0.6486 - accuracy: 0.6039 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
782/782 - 214s - loss: 0.4941 - accuracy: 0.7661 - val_loss: 0.6706 - val_accuracy: 0.6744
Epoch 3/10
782/782 - 216s - loss: 0.4087 - accuracy: 0.8266 - val_loss: 0.4024 - val_accuracy: 0.8222
Epoch 4/10
782/782 - 217s - loss: 0.2855 - accuracy: 0.8865 - val_loss: 0.3343 - val_accuracy: 0.8645
Epoch 5/10
782/782 - 216s - loss: 0.2097 - accuracy: 0.9217 - val_loss: 0.2936 - val_accuracy: 0.8837
Epoch 6/10
782/782 - 217s - loss: 0.1526 - accuracy: 0.9467 - val_loss: 0.3188 - val_accuracy: 0.8771
Epoch 7/10
782/782 - 215s - loss: 0.1048 - accuracy: 0.9657 - val_loss: 0.3750 - val_accuracy: 0.8710
Epoch 8/10
782/782 - 216s - loss: 0.0764 - accuracy: 0.9757 - val_loss: 0.3821 - val_accuracy: 0.8762
Epoch 9/10
782/782 - 216s - loss: 0.0585 - accuracy: 0.9832 - val_loss: 0.4747 - val_accuracy: 0.8683
Epoch 10/10
782/782 - 216s - loss: 0.0438 - accuracy: 0.9883 - val_loss: 0.4441 - val_accuracy: 0.8704
2019-09-21 18:02:56,353 graeae.timers.timer end: Ended: 2019-09-21 18:02:56.353722
I0921 18:02:56.353781 140275698915136 timer.py:77] Ended: 2019-09-21 18:02:56.353722
2019-09-21 18:02:56,356 graeae.timers.timer end: Elapsed: 0:36:05.958925
I0921 18:02:56.356238 140275698915136 timer.py:78] Elapsed: 0:36:05.958925

Looking at the Performance

To get the history I had to pickle it on the training machine and then copy it over to the machine with this org-notebook, so you can't just run this notebook end-to-end unless everything runs on the same machine (which it didn't here).

# on the training machine: save the history dict
path = Path("~/history.pkl").expanduser()
with path.open("wb") as writer:
    pickle.dump(history.history, writer)

# on the machine building this post: load it back for plotting
path = Path("~/history.pkl").expanduser()
with path.open("rb") as reader:
    history = pickle.load(reader)
data = pandas.DataFrame(history)
best = data.val_loss.idxmin()
best_line = holoviews.VLine(best)
plot = (data.hvplot() * best_line).opts(
    title="Two-Layer LSTM Model",
    width=1000,
    height=800)
Embed(plot=plot, file_name="lstm_training")()

Figure Missing

It looks like the best epoch was the fifth one, with a validation loss of 0.29 and a validation accuracy of 0.88; after that the model starts to overfit. It seems that text might be a harder problem than images.