Exploring the Reviews Dataset

Set Up

Imports

Python

from collections import Counter
import pickle
import textwrap

PyPi

import numpy

This Project

from neurotic.tangles.data_paths import DataPath

Lesson 1: Curate a Dataset

The goal of this section is to become familiar with the data and perform any preprocessing that might be needed.

A Helper To Print

def pretty_print_review_and_label(index: int, up_to: int=80) -> None:
    """Prints the label and review

    Args:
     index: the index of the review in the data set
     up_to: number of characters in the review to show
    """
    print("|{}|{}|".format(labels[index], reviews[index][:up_to] + "..."))
    return

The Reviews

It's not really clear what he's doing here. I think he's stripping the newlines off of the reviews, so each review must be one line.

path = DataPath("reviews.txt")
with open(path.from_folder,'r') as reader:
    reviews = [line.rstrip() for line in reader]

The Labels

A similar deal except casting the labels to upper case.

path = DataPath("labels.txt")
with open(path.from_folder,'r') as reader:
    labels = (line.rstrip() for line in reader)
    labels = [line.upper() for line in labels]

Note: The data in reviews.txt we're using has already been preprocessed a bit and contains only lower case characters. If we were working from raw data, where we didn't know it was all lower case, we would want to add a step here to convert it. That's so we treat different variations of the same word, like `The`, `the`, and `THE`, all the same way.

How many reviews do we have?

print("{:,}".format(len(reviews)))
25,000

What does a review look like?

print("\n".join(textwrap.wrap(reviews[0], width=80)))
bromwell high is a cartoon comedy . it ran at the same time as some other
programs about school life  such as  teachers  . my   years in the teaching
profession lead me to believe that bromwell high  s satire is much closer to
reality than is  teachers  . the scramble to survive financially  the insightful
students who can see right through their pathetic teachers  pomp  the pettiness
of the whole situation  all remind me of the schools i knew and their students .
when i saw the episode in which a student repeatedly tried to burn down the
school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a
classic line inspector i  m here to sack one of your teachers . student welcome
to bromwell high . i expect that many adults of my age think that bromwell high
is far fetched . what a pity that it isn  t

Kind of odd looking. It looks like the pre-processor did some bad things.

What does the label for that review look like?

print(labels[0])
POSITIVE

What are the labels available?

At this point we don't have pandas loaded, so I'll just use a set to look at the labels.

print(set(labels))
{'NEGATIVE', 'POSITIVE'}

So there are two labels - "NEGATIVE" and "POSITIVE".

Develop a Predictive Theory

The previous section gave us a rough idea of what's in the data set. Now we want to make a guess as to what the labels mean - why is a review labled POSITIVE or NEGATIVE?

print("|labels.txt| reviews.txt|")
print("|-+-|")
indices = (2137, 12816, 6267, 21934, 5297, 4998)
for index in indices:
    pretty_print_review_and_label(index)
labels.txt reviews.txt
NEGATIVE this movie is terrible but it has some good effects ….
POSITIVE adrian pasdar is excellent is this film . he makes a fascinating woman ….
NEGATIVE comment this movie is impossible . is terrible very improbable bad interpretat…
POSITIVE excellent episode movie ala pulp fiction . days suicides . it doesnt get more…
NEGATIVE if you haven t seen this it s terrible . it is pure trash . i saw this about …
POSITIVE this schiffer guy is a real genius the movie is of excellent quality and both e…

If you look at the negative reviews, they all have the work 'terrible' in them, and the positives all have the workd 'excellent' in them. The theory then, is that the labels are based on whether a review has a key-word in it that makes it either positive or negative.

Quick Theory Validation

In this section we're going to test our theory that key-words identify whether a review is positive or negative using the Counter class and the numpy library.

Word Counter

We'll create three Counter objects, one for words from postive reviews, one for words from negative reviews, and one for all the words.

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

Examine all the reviews. For each word in a positive review, increase the count for that word in both your positive counter and the total words counter; likewise, for each word in a negative review, increase the count for that word in both your negative counter and the total words counter.

Note: Throughout these projects, you should use `split(' ')` to divide a piece of text (such as a review) into individual words. If you use `split()` instead, you'll get slightly different results than what the videos and solutions show.

The classifications in the labels list.

class Classification:
    positive = "POSITIVE"
    negative = "NEGATIVE"

What we are splitting on.

class Tokens:
    splitter = " "
with DataPath("labels.pkl").from_folder.open("rb") as reader:
    labels = pickle.load(reader)
for label, review in zip(labels, reviews):
    tokens = review.split(Tokens.splitter)
    total_counts.update(tokens)

    if label == Classification.positive:
        positive_counts.update(tokens)        
    else:
        negative_counts.update(tokens)

Most Common Words

Run the following two cells to list the words used in positive reviews and negative reviews, respectively, ordered from most to least commonly used.

Examine the counts of the most common words in positive reviews

print("|Token| Count|")
print("|-+-|")
for token, count in positive_counts.most_common(10):
    print("|{}|{:,}|".format(token, count))
Token Count
  518,327
the 173,324
. 159,654
and 89,722
a 83,688
of 76,855
to 66,746
is 57,245
in 50,215
br 49,235

So, we probably don't want most of the most common tokens.

Examine the counts of the most common words in negative reviews

print("|Token| Count|")
print("|-+-|")
for token, count in negative_counts.most_common(10):
    print("|{}|{:,}|".format(token, count))
Token Count
  531,016
. 167,538
the 163,389
a 79,321
and 74,385
of 69,009
to 68,974
br 52,637
is 50,083
it 48,327

As you can see, common words like "the" appear very often in both positive and negative reviews. Instead of finding the most common words in positive or negative reviews, what you really want are the words found in positive reviews more often than in negative reviews, and vice versa. To accomplish this, you'll need to calculate the ratios of word usage between positive and negative reviews.

Check all the words you've seen and calculate the ratio of postive to negative uses and store that ratio in pos_neg_ratios.

Hint: the positive-to-negative ratio for a given word can be calculated with `positive_counts[word] / float(negative_counts[word]+1)`. Notice the `+1` in the denominator – that ensures we don't divide by zero for words that are only seen in positive reviews.

Create a Counter object to store positive/negative ratios

pos_neg_ratios = Counter()

Positive to negative ratios

Calculate the ratios of positive and negative uses of the most common words

ratios = {element: positive_counts[element]/(negative_counts[element] + 1)
          for element in total_counts}
pos_neg_ratios.update(ratios)
with DataPath("pos_neg_ratios.pkl",
              check_exists=False).from_folder.open("wb") as writer:
    pickle.dump(pos_neg_ratios, writer)

Examine the ratios you've calculated for a few words:

print("Pos-to-neg ratio for 'the' = {:.2f}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {:.2f}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {:.2f}".format(pos_neg_ratios["terrible"]))
Pos-to-neg ratio for 'the' = 1.06
Pos-to-neg ratio for 'amazing' = 4.02
Pos-to-neg ratio for 'terrible' = 0.18

Looking closely at the values you just calculated, we see the following:

  • Words that you would expect to see more often in positive reviews - like "amazing" - have a ratio greater than 1. The more skewed a word is toward postive, the farther from 1 its positive-to-negative ratio will be.
  • Words that you would expect to see more often in negative reviews - like "terrible" - have positive values that are less than 1. The more skewed a word is toward negative, the closer to zero its positive-to-negative ratio will be.
  • Neutral words, which don't really convey any sentiment because you would expect to see them in all sorts of reviews – like "the" – have values very close to 1. A perfectly neutral word – one that was used in exactly the same number of positive reviews as negative reviews – would be almost exactly 1. The `+1` we suggested you add to the denominator slightly biases words toward negative, but it won't matter because it will be a tiny bias and later we'll be ignoring words that are too close to neutral anyway.

Ok, the ratios tell us which words are used more often in postive or negative reviews, but the specific values we've calculated are a bit difficult to work with. A very positive word like "amazing" has a value above 4, whereas a very negative word like "terrible" has a value around 0.18. Those values aren't easy to compare for a couple of reasons:

  • Right now, 1 is considered neutral, but the absolute value of the postive-to-negative ratios of very postive words is larger than the absolute value of the ratios for the very negative words. So there is no way to directly compare two numbers and see if one word conveys the same magnitude of positive sentiment as another word conveys negative sentiment. So we should center all the values around netural so the absolute value from neutral of the postive-to-negative ratio for a word would indicate how much sentiment (positive or negative) that word conveys.

When comparing absolute values it's easier to do that around zero than one.

To fix these issues, we'll convert all of our ratios to new values using logarithms.

Go through all the ratios you calculated and convert them to logarithms. (i.e. use `np.log(ratio)`)

In the end, extremely positive and extremely negative words will have positive-to-negative ratios with similar magnitudes but opposite signs. Note that you have to create a new counter - the update method adds the new value to the previous values.

log_ratios = {}
for token, ratio in pos_neg_ratios.items():
    if ratio > 1:
        log_ratios[token] = numpy.log(ratio)
    else:
        log_ratios[token] = -numpy.log(1/(ratio + 0.01))
positive_negative_log_ratios = Counter()
positive_negative_log_ratios.update(log_ratios)

Examine the new ratios you've calculated for the same words from before:

print("Pos-to-neg ratio for 'the' = {:.2f}".format(positive_negative_log_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {:.2f}".format(positive_negative_log_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {:.2f}".format(positive_negative_log_ratios["terrible"]))
Pos-to-neg ratio for 'the' = 0.06
Pos-to-neg ratio for 'amazing' = 1.39
Pos-to-neg ratio for 'terrible' = -1.67
with DataPath("pos_neg_log_ratios.pkl",
              check_exists=False).from_folder.open("wb") as writer:
    pickle.dump(positive_negative_log_ratios, writer)

If everything worked, now you should see neutral words with values close to zero. In this case, "the" is near zero but slightly positive, so it was probably used in more positive reviews than negative reviews. But look at "amazing"'s ratio - it's above 1, showing it is clearly a word with positive sentiment. And "terrible" has a similar score, but in the opposite direction, so it's below -1. It's now clear that both of these words are associated with specific, opposing sentiments.

Now run the following cells to see more ratios.

The first cell displays all the words, ordered by how associated they are with postive reviews. (Your notebook will most likely truncate the output so you won't actually see all the words in the list.)

The second cell displays the 30 words most associated with negative reviews by reversing the order of the first list and then looking at the first 30 words. (If you want the second cell to display all the words, ordered by how associated they are with negative reviews, you could just write `reversed(pos_neg_ratios.most_common())`.)

You should continue to see values similar to the earlier ones we checked – neutral words will be close to `0`, words will get more positive as their ratios approach and go above `1`, and words will get more negative as their ratios approach and go below `-1`. That's why we decided to use the logs instead of the raw ratios.

Here are the words most frequently seen in a review with a "POSITIVE" label.

print("|Token|Log Ratio|")
print("|-+-|")
for token, ratio in positive_negative_log_ratios.most_common(10):
    print("|{}|{:.2f}|".format(token, ratio))
Token Log Ratio
edie 4.69
antwone 4.48
din 4.41
gunga 4.19
goldsworthy 4.17
gypo 4.09
yokai 4.09
paulie 4.08
visconti 3.93
flavia 3.93

Ummm… okay.

print(positive_counts["edie"])
print(negative_counts["edie"])
109
0

So the ones that are most positive appeared in the positive but not in the negative.

Here are the words most frequently seen in a review with a "NEGATIVE" label. The python slice notation is list-name[first to include: first to exclude: step ].

print("|Token|Log Ratio|")
print("|-+-|")
for token, ratio in positive_negative_log_ratios.most_common()[:-11:-1]:
    print("|{}|{:.2f}|".format(token, ratio))
Token Log Ratio
whelk -4.61
pressurized -4.61
bellwood -4.61
mwuhahahaa -4.61
insulation -4.61
hoodies -4.61
yaks -4.61
raksha -4.61
deamon -4.61
ziller -4.61
print(positive_counts["whelk"])
print(negative_counts["whelk"])
0
1

And the most negative counts just didn't appear in the positive counts, even if they only appeared once in the negative counts.

As with the positive reviews, it's actually hard to figure out exactly what the most common tokens for negative reviews are.

Did our theory work?

Our theory was that key-words identify whether a review is positive or negative. There is some evidence for this, but really, it's not obvious that this is the case in general.

Pickling

Since the other posts in this section re-use some of this stuff it might make sense to pickle them.

with DataPath("total_counts.pkl", check_exists=False).from_folder.open("wb") as writer:
    pickle.dump(total_counts, writer)