The Network Parts
This is an initial exploration of some of the parts that are going to make up the Neural Network as well as a little inspection of the data and how we're going to use it.
Set Up
Imports
Python
from collections import Counter
import pickle
PyPi
from graphviz import Graph
import numpy
This Project
from neurotic.tangles.data_paths import DataPath
Loading The Pickles
path = DataPath("total_count.pkl")
with path.from_folder.open("rb") as reader:
    total_counts = pickle.load(reader)
Some Constants
SPLIT_ON_THIS = " "
The Data
The Reviews.
path = DataPath("reviews.txt")
output_path = DataPath("reviews.pkl", check_exists=False)
if not output_path.from_folder.is_file():
    with path.from_folder.open("r") as reader:
        reviews = [line.rstrip() for line in reader]
    with output_path.from_folder.open("wb") as writer:
        pickle.dump(reviews, writer)
else:
    with output_path.from_folder.open("rb") as reader:
        reviews = pickle.load(reader)
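The caching pattern here (parse the text file only if the pickle hasn't been built yet, otherwise load the pickle) can be sketched with just the standard library; the temporary folder and file contents below are stand-ins, not the project's real DataPath layout.

```python
import pickle
import tempfile
from pathlib import Path

# toy stand-ins for reviews.txt and reviews.pkl
folder = Path(tempfile.mkdtemp())
source = folder / "reviews.txt"
source.write_text("great movie\nawful movie\n")
cache = folder / "reviews.pkl"

# only parse the text file if the pickle hasn't been built yet
if not cache.is_file():
    with source.open("r") as reader:
        reviews = [line.rstrip() for line in reader]
    with cache.open("wb") as writer:
        pickle.dump(reviews, writer)

# later runs can load the cached pickle directly
with cache.open("rb") as reader:
    reviews = pickle.load(reader)
```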
The labels.
path = DataPath("labels.txt")
output_path = DataPath("labels.pkl", check_exists=False)
if not output_path.from_folder.is_file():
    with path.from_folder.open() as reader:
        labels = (line.rstrip() for line in reader)
        labels = [line.upper() for line in labels]
    with output_path.from_folder.open("wb") as writer:
        pickle.dump(labels, writer)
else:
    with output_path.from_folder.open("rb") as reader:
        labels = pickle.load(reader)
Transforming Text into Numbers
def plot_network():
    """
    Creates a simplified plot of our network (simple_network.dot.png)
    """
    graph = Graph(format="png")
    graph.attr(rankdir="LR")
    # the input nodes, one per term
    graph.node("a", "horrible")
    graph.node("b", "excellent")
    graph.node("c", "terrible")
    # the hidden-layer nodes
    graph.node("d", "")
    graph.node("e", "")
    graph.node("f", "")
    graph.node("g", "")
    # the output node
    graph.node("h", "positive")
    # connect every input to every hidden node
    graph.edges(["ad", "ae", "af", "ag",
                 "bd", "be", "bf", "bg",
                 "cd", "ce", "cf", "cg"])
    # connect every hidden node to the output
    graph.edges(["dh", "eh", "fh", "gh"])
    graph.render("graphs/simple_network.dot")
    return graph
This is one potential way to classify the sentiment of a review using a neural network. In this case the input for each term (horrible, excellent, or terrible) is one if the term appears in the review and zero otherwise, and the output is the weighted sum of the inputs (each input multiplied by its weight, then summed).
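As a minimal sketch of that weighted sum (the weights below are made up for illustration, not learned):

```python
import numpy

# 1 if the term appears in the review, 0 otherwise
# order: horrible, excellent, terrible
inputs = numpy.array([1, 0, 1])

# made-up weights: negative terms pull the output down
weights = numpy.array([-1.0, 1.0, -1.0])

# the output is the sum of each input times its weight
output = (inputs * weights).sum()
```

With both "horrible" and "terrible" present, the output comes out negative, which is the behavior we'd want from a sentiment classifier.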
Creating the Input/Output Data
The Vocabulary
We're going to create a "vocabulary" which is just a list of all the words in our reviews.
vocab = total_counts.keys()
Here's our vocabulary size.
vocab_size = len(vocab)
print("{:,}".format(vocab_size))
assert vocab_size==74074
74,074
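The total_counts Counter itself was built in an earlier post; a minimal sketch of the idea, using made-up reviews in place of the real dataset, looks something like this:

```python
from collections import Counter

# made-up reviews standing in for the real dataset
toy_reviews = ["this movie was terrible", "this movie was excellent"]

# count every token across all the reviews
total_counts = Counter()
for review in toy_reviews:
    total_counts.update(review.split(" "))

# the vocabulary is just the unique tokens
vocab = total_counts.keys()
```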
Layer 0
Now we're going to create a numpy array called layer_0 and initialize it to all zeros. This will represent our input layer, so it will be a 2-dimensional matrix with 1 row and vocab_size columns.
layer_0 = numpy.zeros((1, vocab_size))
Now we can double-check the shape to make sure it matches what we're expecting.
shape = layer_0.shape
print("{}, {:,}".format(*shape))
assert shape == (1,74074)
1, 74,074
Word 2 Index
layer_0 contains one entry for every word in the vocabulary. We need to know the index of each word, so we're going to create a lookup table that maps every word to its index.
word2index = {word: index for index, word in enumerate(vocab)}
Here's the first ten entries in the lookup table.
print("|Term| Index|")
print("|-+-|")
keys = list(word2index.keys())[:10]
for key in keys:
    print("|{}|{}|".format(key, word2index[key]))
| Term | Index | 
|---|---|
| bromwell | 0 | 
| high | 1 | 
| is | 2 | 
| a | 3 | 
| cartoon | 4 | 
| comedy | 5 | 
| . | 6 | 
| it | 7 | 
| ran | 8 | 
| at | 9 | 
Update Input Layer
The update_input_layer function counts how many times each word is used in the review and then stores those counts at the appropriate indices inside layer_0. To make this usable in other posts you have to pass in layer_0 and the word2index table, but in the actual Neural Network we're going to use a class, so it will look a little different.
def update_input_layer(review:str, layer_0: numpy.ndarray, word2index: dict) -> Counter:
    """ Modify layer_0 in-place to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
       review: the string of the review
       layer_0: array representing layer 0
       word2index: dict mapping word to index in layer_0
    Returns:
        counter for the tokens (used for troubleshooting)
    """
    # clear out previous state by resetting the layer to be all 0s
    layer_0 *= 0
    tokens = review.split(SPLIT_ON_THIS)
    counter = Counter()
    counter.update(tokens)
    for key, value in counter.items():
        layer_0[:, word2index[key]] = value
    return counter
Here's what happens when you update layer_0 with the first review.
update_input_layer(reviews[0], layer_0, word2index)
print(layer_0)
[[4. 5. 4. ... 0. 0. 0.]]
It doesn't look exciting, but if we remember that we initialized the values as all zeros, then we can see that something is changing.
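Since the actual network is going to wrap this in a class, here's a hypothetical sketch of what that might look like (the names are mine, not the final implementation):

```python
from collections import Counter

import numpy

class SentimentInput:
    """Hypothetical class-based version of the input layer."""
    def __init__(self, reviews: list):
        # build the vocabulary and lookup table from the reviews
        vocabulary = set()
        for review in reviews:
            vocabulary.update(review.split(" "))
        self.word2index = {word: index
                           for index, word in enumerate(sorted(vocabulary))}
        self.layer_0 = numpy.zeros((1, len(self.word2index)))

    def update_input_layer(self, review: str) -> Counter:
        """Reset layer_0 and fill it with this review's word-counts."""
        self.layer_0 *= 0
        counter = Counter(review.split(" "))
        for word, count in counter.items():
            self.layer_0[:, self.word2index[word]] = count
        return counter
```

Because the class owns layer_0 and word2index, the method only needs the review itself, which is why the class-based version ends up looking a little different.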
Get Target For Labels
get_target_for_label returns 0 or 1, depending on whether the given label is NEGATIVE or POSITIVE, respectively. This will allow us to use the labels as we were given them and map them to numbers inside the neural net. An alternative might be to pre-process the labels or make this a dictionary.
def get_target_for_label(label: str) -> int:
    """Convert a label to `0` or `1`.
    Args:
       label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
       `0` or `1`.
    """
    return 1 if label=="POSITIVE" else 0
So, here's the first label.
print(labels[0])
POSITIVE
And here's what we mapped it to.
output = get_target_for_label(labels[0])
assert output == 1
print(output)
1
And here we go with the second label.
print(labels[1])
NEGATIVE
output = get_target_for_label(labels[1])
assert output == 0
print(output)
0
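The dictionary alternative mentioned earlier might look like this (a sketch, not necessarily what the network will use):

```python
# a lookup table instead of a conditional
LABEL_TO_TARGET = {"POSITIVE": 1, "NEGATIVE": 0}

def get_target_for_label(label: str) -> int:
    """Convert a label to 0 or 1 using the lookup table."""
    return LABEL_TO_TARGET[label]
```

One difference worth noting: unlike the conditional version, which silently maps anything that isn't "POSITIVE" to 0, the dictionary raises a KeyError on an unexpected label, which can make bad data easier to catch.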