Hand-rolling a CountVectorizer

Beginning

This is part of lesson 3 from the fastai NLP course.

Imports

Python

from collections import Counter
from functools import partial

PyPi

from fastai.text import (
    URLs,
    untar_data,
    TextList,
    )
import hvplot.pandas
import pandas

Others

from graeae import CountPercentage, EmbedHoloviews

Setup

Plotting

Embed = partial(
    EmbedHoloviews,
    folder_path="../../files/posts/fastai/hand-rolling-a-countvectorizer/")

The Data Set

The data-set is a collection of 50,000 IMDB reviews hosted on AWS Open Datasets as part of the fastai datasets collection. We're going to try and create a classifier that can predict the "sentiment" of reviews. The original dataset comes from Stanford University.

To make it easier to experiment, we'll initially load a sub-set of the dataset that fastai prepared. The URLs class contains the URLs for the datasets that fastai has uploaded and the untar_data function downloads data from the URL given to a given (or in this case default) location.

path = untar_data(URLs.IMDB_SAMPLE)
print(path)
/home/athena/.fastai/data/imdb_sample

The untar_data function doesn't actually load the data for us, so we'll use pandas to do that.

sample_frame = pandas.read_csv(path/"texts.csv")
print(sample_frame.head())
      label                                               text  is_valid
0  negative  Un-bleeping-believable! Meg Ryan doesn't even ...     False
1  positive  This is a extremely well-made film. The acting...     False
2  negative  Every once in a long while a movie will come a...     False
3  positive  Name just says it all. I watched this movie wi...     False
4  negative  This movie succeeds at being one of the most u...     False

The is_valid column is kind of interesting here especially since the first examples are all false… but I couldn't find an explanation for it on the data-download page.

CountPercentage(sample.label)()
Value Count Percent (%)
negative 524 52.40
positive 476 47.60

So it is nearly balanced but with a slight bias toward negative comments.

CountPercentage(sample.is_valid)()
Value Count Percent (%)
False 800 80.00
True 200 20.00

Well, so exactly 20% are invalid? Curious.

The Text List

To actually work with the dataset we'll use fastai's TextList instead of pandas' dataframe.

sample_list = TextList.from_csv(path, "texts.csv", cols="text")
sample_split = sample_list.split_from_df(col=2)
sample = (sample_split
          .label_from_df(cols=0))

The original notebook builds the TextList in a single train-wreck, but if you try and find out what those methods do from the fastai documentation… well, it's easier (although still obscure) to inspect the intermediate objects to try and muddle through what's going on. The ultimate outcome seems to be that sample is an object with the somewhat pre-processed text. It looks like the text is lower-cased and somewhat tokenized. There's also a lot of strange tokens inserted (xxmaj, xxunk) which, according to the tokenization documentation indicate special tokens - although there's more unknown tokens than I would have expected.

print(sample.train.x[0])
xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !
print(sample_frame.text.iloc[0])
Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!

Here's the category for that review.

print(sample.train.y[0])
negative

Note that the output looks like a string, but it's actually a fastai "type".

print(type(sample.train.y[0]))
<class 'fastai.core.Category'>

Creating a Term-Document Matrix

Here we'll create a matrix that counts the number of times each token appears in each document.

End

Reference

The Dataset

  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT ’11). Association for Computational Linguistics, USA, 142–150