The Network Parts
Table of Contents
This is an initial exploration of some of the parts that are going to make up the Neural Network as well as a little inspection of the data and how we're going to use it.
Set Up
Imports
The Tangle
from collections import Counter
import numpy
from neurotic.tangles.data_paths import DataPath
Python
from collections import Counter
import pickle
PyPi
from graphviz import Graph
import numpy
This Project
from neurotic.tangles.data_paths import DataPath
Loading The Pickles
path = DataPath("total_count.pkl")
with path.from_folder.open("rb") as reader:
    total_counts = pickle.load(reader)
Some Constants
SPLIT_ON_THIS = " "
The Data
The Reviews.
path = DataPath("reviews.txt")
output_path = DataPath("reviews.pkl", check_exists=False)
if not output_path.from_folder.is_file():
    with path.from_folder.open("r") as reader:
        reviews = [line.rstrip() for line in reader]
    with output_path.from_folder.open("wb") as writer:
        pickle.dump(reviews, writer)
else:
    with output_path.from_folder.open("rb") as reader:
        reviews = pickle.load(reader)
The Labels.
path = DataPath("labels.txt")
output_path = DataPath("labels.pkl", check_exists=False)
if not output_path.from_folder.is_file():
    with path.from_folder.open() as reader:
        labels = [line.rstrip().upper() for line in reader]
    with output_path.from_folder.open("wb") as writer:
        pickle.dump(labels, writer)
else:
    with output_path.from_folder.open("rb") as reader:
        labels = pickle.load(reader)
Transforming Text into Numbers
def plot_network():
    """
    Creates a simplified plot of our network (simple_network.dot.png)
    """
    graph = Graph(format="png")
    graph.attr(rankdir="LR")
    graph.node("a", "horrible")
    graph.node("b", "excellent")
    graph.node("c", "terrible")
    graph.node("d", "")
    graph.node("e", "")
    graph.node("f", "")
    graph.node("g", "")
    graph.node("h", "positive")
    graph.edges(["ad", "ae", "af", "ag",
                 "bd", "be", "bf", "bg",
                 "cd", "ce", "cf", "cg"])
    graph.edges(["dh", "eh", "fh", "gh"])
    graph.render("graphs/simple_network.dot")
    return graph
This is one potential way to classify the sentiment of a review using a neural network. Each of the terms (horrible, excellent, and terrible) gets an input of one if it appears in the review and zero otherwise, and the output is the sum of each input multiplied by its weight.
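The weighted-sum idea can be sketched directly. The three words are from the diagram above, but the weights and the score function here are invented for illustration; the real network will learn its weights.

```python
import numpy

# Hypothetical weights: negative words pull the score down,
# positive words push it up (values are made up for illustration).
WORDS = ["horrible", "excellent", "terrible"]
WEIGHTS = numpy.array([-1.0, 1.0, -1.0])

def score(review: str) -> float:
    """Sum of weight * input, where an input is 1 if the word appears."""
    inputs = numpy.array([1.0 if word in review.split() else 0.0
                          for word in WORDS])
    return float((WEIGHTS * inputs).sum())

print(score("an excellent film"))            # 1.0
print(score("a terrible , horrible movie"))  # -2.0
```

A positive score suggests a positive review, a negative score the opposite; the network replaces these hand-picked weights with learned ones.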
Creating the Input/Output Data
The Vocabulary
We're going to create a "vocabulary" which is just a list of all the words in our reviews.
vocab = total_counts.keys()
Here's our vocabulary size.
vocab_size = len(vocab)
print("{:,}".format(vocab_size))
assert vocab_size == 74074
74,074
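Since total_counts arrived pre-pickled, here's one plausible way it could have been built (an assumption on my part, with invented sample reviews): a Counter accumulated over the tokens of every review.

```python
from collections import Counter

# Hypothetical reconstruction of total_counts: word frequencies
# across all reviews (the sample reviews are made up).
sample_reviews = ["bromwell high is a cartoon comedy",
                  "a comedy it is"]

total_counts_sketch = Counter()
for review in sample_reviews:
    total_counts_sketch.update(review.split(" "))

print(total_counts_sketch["a"])        # 2
print(total_counts_sketch["bromwell"]) # 1
```

The keys of such a Counter are exactly the vocabulary, which is why `total_counts.keys()` works above.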
Layer 0
Now we're going to create a numpy array called layer_0 and initialize it to all zeros. This will represent our input layer, so it will be a 2-dimensional matrix with 1 row and vocab_size columns.
layer_0 = numpy.zeros((1, vocab_size))
Now we can double-check the shape to make sure it matches what we're expecting.
shape = layer_0.shape
print("{}, {:,}".format(*shape))
assert shape == (1, 74074)
1, 74,074
Word 2 Index
layer_0 contains one entry for every word in the vocabulary. We need to make sure we know the index of each word, so we're going to create a lookup table that stores the index of every word.
word2index = {word: index for index, word in enumerate(vocab)}
Here are the first ten entries in the lookup table.
print("|Term| Index|")
print("|-+-|")
keys = list(word2index.keys())[:10]
for key in keys:
    print("|{}|{}|".format(key, word2index[key]))
Term | Index |
---|---|
bromwell | 0 |
high | 1 |
is | 2 |
a | 3 |
cartoon | 4 |
comedy | 5 |
. | 6 |
it | 7 |
ran | 8 |
at | 9 |
Update Input Layer
The update_input_layer function will count how many times each word is used in the review and then store those counts at the appropriate indices inside layer_0. To make it reusable in other posts you have to pass in the word2index table, but in the actual Neural Network we're going to use a class, so it will look a little different.
def update_input_layer(review: str, layer_0: numpy.ndarray, word2index: dict) -> Counter:
    """Modify layer_0 to represent the vector form of review.

    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.

    Args:
     review: the string of the review
     layer_0: array representing layer 0
     word2index: dict mapping word to index in layer_0

    Returns:
     counter for the tokens (used for troubleshooting)
    """
    # clear out previous state by resetting the layer to be all 0s
    layer_0 *= 0
    tokens = review.split(SPLIT_ON_THIS)
    counter = Counter()
    counter.update(tokens)
    for key, value in counter.items():
        layer_0[:, word2index[key]] = value
    return counter
Here's what happens when you update layer_0 with the first review.
update_input_layer(reviews[0], layer_0, word2index)
print(layer_0)
[[4. 5. 4. ... 0. 0. 0.]]
It doesn't look exciting, but if we remember that we initialized the values as all zeros, then we can see that something is changing.
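To make the counts visible, here's the same function run against a tiny made-up vocabulary (the three words and the review are invented for illustration; the function body is repeated so the snippet stands alone).

```python
from collections import Counter
import numpy

def update_input_layer(review, layer_0, word2index):
    """Store each word's count at its index in layer_0."""
    layer_0 *= 0
    counter = Counter()
    counter.update(review.split(" "))
    for key, value in counter.items():
        layer_0[:, word2index[key]] = value
    return counter

# A toy three-word vocabulary so the whole vector fits on one line.
vocab = ["good", "bad", "movie"]
word2index = {word: index for index, word in enumerate(vocab)}
layer_0 = numpy.zeros((1, len(vocab)))

counts = update_input_layer("good good movie", layer_0, word2index)
print(layer_0)  # [[2. 0. 1.]]
```

With only three columns you can see exactly what the full 74,074-column version is doing: "good" appears twice, "bad" never, and "movie" once.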
Get Target For Label
get_target_for_label returns 0 or 1, depending on whether the given label is NEGATIVE or POSITIVE, respectively. This will allow us to use the labels as we were given them and map them to numbers inside the neural net. An alternative might be to pre-process the labels or make this a dictionary.
def get_target_for_label(label: str) -> int:
    """Convert a label to `0` or `1`.

    Args:
     label(string) - Either "POSITIVE" or "NEGATIVE".

    Returns:
     `0` or `1`.
    """
    return 1 if label == "POSITIVE" else 0
So, here's the first label.
print(labels[0])
POSITIVE
And here's what we mapped it to.
output = get_target_for_label(labels[0])
assert output == 1
print(output)
1
And here we go with the second label.
print(labels[1])
NEGATIVE
output = get_target_for_label(labels[1])
assert output == 0
print(output)
0
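The dictionary alternative mentioned above can be sketched like this (LABEL_TO_TARGET and the sample labels are made-up names for illustration):

```python
def get_target_for_label(label: str) -> int:
    """Convert a label to 0 or 1 (same as the function above)."""
    return 1 if label == "POSITIVE" else 0

# Hypothetical dictionary version: one lookup instead of a comparison.
LABEL_TO_TARGET = {"POSITIVE": 1, "NEGATIVE": 0}

sample_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]
targets = [LABEL_TO_TARGET[label] for label in sample_labels]
print(targets)  # [1, 0, 1]

# Both mappings agree on well-formed labels.
assert targets == [get_target_for_label(label) for label in sample_labels]
```

One difference worth noting: the dictionary raises a KeyError on an unexpected label, while the if-expression silently maps anything that isn't "POSITIVE" to 0.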