Locality-Sensitive Hashing (LSH) for Machine Translation
Beginning
This is a continuation of the post in which we implemented k-Nearest Neighbors. It's part of a series of posts on building an English to French translator whose links are gathered in this post.
Imports
# python
import math
# pypi
from dotenv import load_dotenv
from nltk.corpus import twitter_samples
import numpy
# my code
from graeae import Timer
from neurotic.nlp.word_embeddings.embeddings import EmbeddingsLoader
from neurotic.nlp.twitter.processor import TwitterProcessor
Set Up
The Timer
TIMER = Timer()
The Environment
load_dotenv("posts/nlp/.env")
The Tweets
positive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")
tweets = positive_tweets + negative_tweets
The Twitter Processor
process_tweet = TwitterProcessor()
The Embeddings Loader
embeddings = EmbeddingsLoader()
Middle
Locality-Sensitive Hashing (LSH)
In this part of the assignment, you will implement a more efficient version of k-nearest neighbors using locality sensitive hashing. You will then apply this to document search.
- Process the tweets and represent each tweet as a vector (represent a document with a vector embedding).
- Use locality sensitive hashing and k nearest neighbors to find tweets that are similar to a given tweet.
3.1 Getting the document embeddings
- Bag-of-words (BOW) document models
Text documents are sequences of words.
- The ordering of words makes a difference. For example, sentences "Apple pie is better than pepperoni pizza." and "Pepperoni pizza is better than apple pie" have opposite meanings due to the word ordering.
- However, for some applications, ignoring the order of words can allow us to train an efficient and still effective model.
- This approach is called the Bag-of-Words document model.
Document embeddings
- A document embedding is created by summing up the embeddings of all the words in the document.
- If we don't know the embedding of some word, we can ignore that word.
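As a quick illustration of this idea, here is a toy example with made-up 3-dimensional embeddings (the real vectors used below have 300 dimensions); the unknown word contributes nothing to the sum:

import numpy

# hypothetical toy embeddings, just for illustration
toy_embeddings = {
    "apple": numpy.array([0.1, 0.3, -0.2]),
    "pie": numpy.array([0.4, -0.1, 0.0]),
}

# "delicious" is not in the toy embeddings, so it adds nothing
document = ["apple", "pie", "delicious"]
doc_embedding = sum(toy_embeddings.get(word, 0) for word in document)
print(doc_embedding)  # approximately [0.5  0.2 -0.2]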
Exercise 07: Complete the get_document_embedding() function.
- The function get_document_embedding() encodes an entire document as a "document" embedding.
- It takes in a document (as a string) and a dictionary, en_embeddings.
- It processes the document and looks up the corresponding embedding of each word.
- It then sums them up and returns the sum of all the word vectors of the processed tweet.
- Hints
- You can handle missing words more easily by using the get() method of the python dictionary instead of the bracket notation (i.e. "[ ]"). The default value for a missing word should be the zero vector; Numpy will broadcast the scalar 0 into a vector of zeros during the summation.
- Alternatively, skip the addition if a word is not in the dictionary.
- You can use your process_tweet() function to process the tweet. It takes in a tweet and returns a list of words.
# UNQ_C12 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_embedding(tweet, en_embeddings):
    '''
    Input:
      - tweet: a string
      - en_embeddings: a dictionary of word embeddings
    Output:
      - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = numpy.zeros(300)

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding = doc_embedding + en_embeddings.get(word, 0)
    ### END CODE HERE ###
    return doc_embedding
You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
# testing your function
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# tweet_embedding = get_document_embedding(custom_tweet, en_embeddings_subset)
tweet_embedding = get_document_embedding(custom_tweet, embeddings.english_subset)

actual = tweet_embedding[-5:]
expected = [-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184]
assert numpy.allclose(actual, expected)
print(actual)
[-0.00268555 -0.15378189 -0.55761719 -0.07216644 -0.32263184]
Exercise 08
- Store all document vectors into a dictionary
Now, let's store all the tweet embeddings in a dictionary. Implement get_document_vecs().

def get_document_vecs(all_docs, en_embeddings):
    '''
    Input:
      - all_docs: list of strings - all tweets in our dataset.
      - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
      - document_vec_matrix: matrix of tweet embeddings.
      - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''
    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is the list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)
        ### END CODE HERE ###

    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = numpy.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict
document_vecs, ind2Tweet = get_document_vecs(tweets, embeddings.english_subset)
dict_length = len(ind2Tweet)
expected = len(tweets)
assert dict_length == expected
print(f"length of dictionary {dict_length:,}")

rows, columns = document_vecs.shape
print(f"shape of document_vecs ({rows:,}, {columns})")
assert rows == expected
assert columns == 300
3.2 Looking up the tweets
Now you have a matrix of dimension \((m, d)\), where \(m\) is the number of tweets (10,000) and \(d\) is the dimension of the embeddings (300). Next you will input a tweet and use cosine similarity to see which tweet in our corpus is most similar to your tweet.
my_tweet = 'i am sad'
process_tweet(my_tweet)
tweet_embedding = get_document_embedding(my_tweet, embeddings.english_subset)
The following cells will give you a tweet similar to your input.
def cosine_similarity(vector_1: numpy.ndarray, vector_2: numpy.ndarray) -> float:
    """Calculates the similarity between two vectors

    Args:
     vector_1: array to compare
     vector_2: array to compare to vector_1

    Returns:
     cosine similarity between the two vectors
    """
    return numpy.dot(vector_1, vector_2)/(numpy.linalg.norm(vector_1) *
                                          numpy.linalg.norm(vector_2))
idx = numpy.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(tweets[idx])
3.3 Finding the most similar tweets with LSH
You will now implement locality sensitive hashing (LSH) to identify the most similar tweet. Instead of looking at all 10,000 vectors, you can just search a subset to find its nearest neighbors.
Let's say you have a set of data points. You can divide the vector space into regions and search within one region for the nearest neighbors of a given vector.
N_VECS = len(tweets) # This many vectors.
N_DIMS = document_vecs.shape[1] # Vector dimensionality.
print(f"There are {N_VECS:,} vectors and each has {N_DIMS} dimensions.")
Choosing the number of planes
- Each plane divides the space into 2 parts.
- So n planes divide the space into \(2^{n}\) hash buckets.
- We want to organize the 10,000 document vectors into buckets so that every bucket has about 16 vectors.
- For that we need \(\frac{10000}{16}=625\) buckets.
- We're interested in the number of planes n such that \(2^{n} \geq 625\), so \(n=\log_{2}625 \approx 9.29\), which we round up to 10.
We use \(\lceil\log_2(625)\rceil = 10\) planes to get ~16 vectors per bucket.
buckets = 10000/16
print(buckets)
planes = math.ceil(numpy.log2(buckets))
print(planes)
625.0
10
N_PLANES = planes
Number of times to repeat the hashing to improve the search.
N_UNIVERSES = 25
3.4 Getting the hash number for a vector
For each vector, we need to get a number associated with that vector in order to assign it to a "hash bucket".
Hyperplanes in vector spaces
- In a 3-dimensional vector space, a hyperplane is a regular plane. In a 2-dimensional vector space, a hyperplane is a line.
- Generally, a hyperplane is a subspace whose dimension is 1 lower than that of the original vector space.
- A hyperplane is uniquely defined by its normal vector.
- The normal vector \(n\) of the plane \(\pi\) is the vector to which all vectors in the plane \(\pi\) are orthogonal (perpendicular in the 3-dimensional case).
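For example, in a 2-dimensional space, if the normal vector is \(n = (1, 2)\), then the hyperplane (a line through the origin) consists of all vectors \(v\) with \(n \cdot v = 1 \cdot v_1 + 2 \cdot v_2 = 0\), such as \(v = (2, -1)\).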
Using Hyperplanes to split the vector space
We can use a hyperplane to split the vector space into 2 parts.
- All vectors whose dot product with a plane's normal vector is positive are on one side of the plane.
- All vectors whose dot product with the plane's normal vector is negative are on the other side of the plane.
Encoding hash buckets
- For a vector, we can take its dot product with all the planes, then encode this information to assign the vector to a single hash bucket.
- When the vector points to the opposite side of the hyperplane from the normal vector, encode it as 0.
- Otherwise, if the vector is on the same side as the normal vector, encode it as 1.
- If you calculate the dot product with each plane in the same order for every vector, you've encoded each vector's hash ID as a binary number, like [0, 1, 1, … 0] (see the small sketch below).
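Here is a small sketch of that encoding using toy 2-dimensional vectors and three made-up planes (the real setup below uses 300-dimensional vectors and 10 planes):

import numpy

# three hypothetical planes in 2D; each column is a plane's normal vector
planes = numpy.array([[1.0, 0.0, 1.0],
                      [0.0, 1.0, -1.0]])  # shape (2, 3)

vector = numpy.array([[2.0, -1.0]])       # shape (1, 2)

# the sign of each dot product says which side of each plane the vector is on
dot_products = numpy.dot(vector, planes)  # shape (1, 3)
bits = numpy.squeeze(dot_products >= 0)   # e.g. [ True False  True]

# combine the bits into a single bucket number: sum of 2^i * bit_i
bucket = sum(2**i * int(bit) for i, bit in enumerate(bits))
print(bits, bucket)  # [ True False  True] 5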
Exercise 09: Implementing hash buckets
We've initialized the hash table hashes for you. It is a list of N_UNIVERSES matrices, each of which describes its own hash table. Each matrix has N_DIMS rows and N_PLANES columns. Every column of a matrix is an N_DIMS-dimensional normal vector for one of the N_PLANES hyperplanes used to create the buckets of that particular hash table.
Exercise: Your task is to complete the function hash_value_of_vector, which places vector v in the correct hash bucket.
- First multiply your vector v with the matrix of planes. This will give you a vector of dimension \((1,\text{N\_PLANES})\).
- You will then convert every element in that vector to 0 or 1.
- You create a hash vector by doing the following: if the element is negative, it becomes a 0, otherwise you change it to a 1.
- You then compute the number for the vector by iterating over the N_PLANES bits: for each bit position i, you multiply \(2^i\) by the corresponding bit (0 or 1) and add it to a running sum.
- You will then store that sum in the variable hash_value.
Instructions: Create a hash for the vector in the function below. Use this formula:
\[ hash = \sum_{i=0}^{N-1} \left( 2^{i} \times h_{i} \right) \]
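For example, if the bits are \(h = [1, 0, 1, 1]\) (for \(i = 0, 1, 2, 3\)), the hash value is \(1 \cdot 2^{0} + 0 \cdot 2^{1} + 1 \cdot 2^{2} + 1 \cdot 2^{3} = 13\).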
- Create the sets of planes
- Create multiple (25) sets of planes (the planes that divide up the region).
- You can think of these as 25 separate ways of dividing up the vector space with a different set of planes.
- Each element of this list contains a matrix with 300 rows (the word vectors have 300 dimensions), and 10 columns (there are 10 planes in each "universe").
numpy.random.seed(0)
planes_l = [numpy.random.normal(size=(N_DIMS, N_PLANES))
            for _ in range(N_UNIVERSES)]
- Hints
- numpy.squeeze() removes unused dimensions from an array; for instance, it converts a (10, 1) 2D array into a (10,) 1D array.
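For instance, a quick check of the behavior described in the hint:

import numpy

array_2d = numpy.zeros((10, 1))
print(array_2d.shape)                 # (10, 1)
print(numpy.squeeze(array_2d).shape)  # (10,)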
def hash_value_of_vector(v, planes):
    """Create a hash for a vector; hash_id says which random hash to use.
    Input:
      - v: vector of tweet. Its dimension is (1, N_DIMS)
      - planes: matrix of dimension (N_DIMS, N_PLANES) - the set of planes that divide up the region
    Output:
      - res: a number which is used as a hash for your vector
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # for the set of planes,
    # calculate the dot product between the vector and the matrix containing the planes
    # remember that planes has shape (300, 10)
    # the dot product will have the shape (1, 10)
    assert planes.shape == (300, 10)
    assert v.shape == (1, 300)
    dot_product = numpy.dot(v, planes)
    assert dot_product.shape == (1, 10), dot_product.shape

    # get the sign of the dot product (1, 10) shaped vector
    sign_of_dot_product = numpy.sign(dot_product)

    # set h to be false (equivalent to 0 when used in operations) if the sign is negative,
    # and true (equivalent to 1) if the sign is positive (1, 10) shaped vector
    h = sign_of_dot_product >= 0
    assert h.shape == (1, 10)

    # remove extra un-used dimensions (convert this from a 2D to a 1D array)
    h = numpy.squeeze(h)

    # initialize the hash value to 0
    hash_value = 0

    n_planes = planes.shape[1]
    for i in range(n_planes):
        # increment the hash value by 2^i * h_i
        hash_value += 2**i * h[i]
    ### END CODE HERE ###

    # cast hash_value as an integer
    hash_value = int(hash_value)

    return hash_value
numpy.random.seed(0)
idx = 0
planes = planes_l[idx]  # get one 'universe' of planes to test the function
vec = numpy.random.rand(1, 300)

expected = 768
actual = hash_value_of_vector(vec, planes)
assert expected == actual
print(f" The hash value for this vector,",
      f"and the set of planes at index {idx},",
      f"is {actual}")
3.5 Creating a hash table
Exercise 10
Given that you have a hash value for each vector (or tweet), you now want to create a hash table. With a hash table, given a hash value you can quickly look up the corresponding vectors, which reduces your search time by a significant amount.
We have given you the make_hash_table function, which maps the tweet vectors to a bucket and stores the vectors there. It returns the hash_table and the id_table. The id_table tells you which tweet each vector in a given bucket corresponds to.
- Hints
- a dictionary comprehension, similar to a list comprehension, looks like this: `{i:0 for i in range(10)}`, where the key is 'i' and the value is zero for all key-value pairs.
def make_hash_table(vecs, planes):
    """
    Input:
      - vecs: list of vectors to be hashed.
      - planes: the matrix of planes in a single "universe", with shape (embedding dimensions, number of planes).
    Output:
      - hash_table: dictionary - keys are hashes, values are lists of vectors (hash buckets)
      - id_table: dictionary - keys are hashes, values are list of vectors id's
        (it's used to know which tweet corresponds to the hashed vector)
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # number of planes is the number of columns in the planes matrix
    num_of_planes = planes.shape[1]

    # number of buckets is 2^(number of planes)
    num_buckets = 2**num_of_planes

    # create the hash table as a dictionary.
    # Keys are integers (0, 1, 2... number of buckets)
    # Values are empty lists
    hash_table = {index: [] for index in range(num_buckets)}

    # create the id table as a dictionary.
    # Keys are integers (0, 1, 2... number of buckets)
    # Values are empty lists
    id_table = {index: [] for index in range(num_buckets)}

    # for each vector in 'vecs'
    for i, v in enumerate(vecs):
        # calculate the hash value for the vector
        h = hash_value_of_vector(v.reshape(1, 300), planes)

        # store the vector into hash_table at key h,
        # by appending the vector v to the list at key h
        hash_table[h].append(v)

        # store the vector's index 'i' (each document is given a unique integer 0, 1, 2...)
        # the key is h, and 'i' is appended to the list at key h
        id_table[h].append(i)
    ### END CODE HERE ###

    return hash_table, id_table
numpy.random.seed(0)
planes = planes_l[0]  # get one 'universe' of planes to test the function
vec = numpy.random.rand(1, 300)
tmp_hash_table, tmp_id_table = make_hash_table(document_vecs, planes)
# tmp_hash_table, tmp_id_table = make_hash_table(vec.reshape(1, 300), planes)

index = 2
print(f"The hash table at key {index} has {len(tmp_hash_table[index])} document vectors")
print(f"The id table at key {index} has {len(tmp_id_table[index])}")
print(f"The first 5 document indices stored at key {index} of are {tmp_id_table[index][0:5]}")
Expected output
The hash table at key 0 has 3 document vectors
The id table at key 0 has 3
The first 5 document indices stored at key 0 of are [3276, 3281, 3282]
I get a hash of 2 for document 3276, not 1…
3.6 Creating all hash tables
You can now hash your vectors and store them in a hash table that lets you quickly look up and search for similar vectors. Run the cell below to create the hashes. By doing so, you end up with several tables, each of which contains all the vectors. Given a vector, you then identify the buckets in all the tables. You can then iterate over the buckets and consider many fewer vectors. The more buckets you use, the more accurate your lookup will be, but also the longer it will take.
- Creating the hashtables
hash_tables = []
id_tables = []
with TIMER:
    for universe_id in range(N_UNIVERSES):  # there are 25 hashes
        print('working on hash universe #:', universe_id)
        planes = planes_l[universe_id]
        hash_table, id_table = make_hash_table(document_vecs, planes)
        hash_tables.append(hash_table)
        id_tables.append(id_table)
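As a rough sketch of how these tables could be used (an illustration only, not the assignment's implementation — the real approximate k-Nearest Neighbors search comes in the next post), you could gather the documents that share a bucket with a query vector across a few universes and restrict the search to that much smaller candidate set. The helper name candidate_ids and the choice of 5 universes here are hypothetical:

# a hypothetical helper: collect candidate document indices that land in
# the same bucket as the query vector in a few universes
def candidate_ids(query_vector, planes_list, id_tables, universes_to_use=5):
    candidates = set()
    for universe_id in range(universes_to_use):
        planes = planes_list[universe_id]
        bucket = hash_value_of_vector(query_vector, planes)
        # every document hashed into this bucket in this universe is a candidate
        candidates.update(id_tables[universe_id][bucket])
    return candidates

# e.g. candidates = candidate_ids(tweet_embedding.reshape(1, 300), planes_l, id_tables)
# print(f"Searching {len(candidates)} candidates instead of {len(tweets)} tweets.")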
Bundling It Up
<<imports>>
<<planes-universe>>
<<planes-plane-count>>
<<planes-planes>>
<<documents-embeddings-builder>>
<<document-embedding>>
<<documents-embeddings>>
<<document-index-to-embedding>>
<<hash-table>>
<<hash-value-of-vector>>
<<hash-hashes>>
<<hash-table-table>>
<<hash-index-table>>
<<hash-tables>>
<<hash-tables-hash>>
<<hash-tables-index>>
Imports
# python
import math
# pypi
import attr
import numpy
Planes Universe
@attr.s(auto_attribs=True)
class PlanesUniverse:
"""Creates set of planes with a random mormal distribution of points
Args:
vector_count: number of vectors that will be hashed
dimensions: number of columns per vector
universes: number of universes to create
vectors_per_bucket: how many vectors we want in each
random_seed: value to seed the random number generator
"""
vector_count: int
dimensions: int
universes: int
vectors_per_bucket: int
random_seed: int=0
_plane_count: int=None
_planes: list=None
- Plane Count
@property
def plane_count(self) -> int:
    """The number of planes to create

    Uses the number of vectors and desired vectors per bucket
    """
    if self._plane_count is None:
        buckets = self.vector_count/self.vectors_per_bucket
        self._plane_count = math.ceil(numpy.log2(buckets))
    return self._plane_count
- Planes
The list of planes.
@property
def planes(self) -> list:
    """The list of planes"""
    if self._planes is None:
        numpy.random.seed(self.random_seed)
        self._planes = [numpy.random.normal(size=(self.dimensions, self.plane_count))
                        for _ in range(self.universes)]
    return self._planes
Documents Embeddings Builder
@attr.s(auto_attribs=True)
class DocumentsEmbeddings:
"""Builds embeddings for documents from their words
Args:
embeddings: word-embeddings for the documents
process: callable to pre-process documents
documents: documents (strings) to hash
"""
embeddings: dict
process: object
documents: list
_documents_embeddings: numpy.ndarray=None
_document_index_to_embedding: dict=None
- Getting the Document Embeddings
def document_embedding(self, document: str) -> numpy.ndarray:
    """sums the embeddings for words in the document

    Args:
     - document: string to tokenize and build embedding for

    Returns:
     - embedding: sum of all word embeddings in the document
    """
    rows = len(next(iter(self.embeddings.values())))
    embedding = numpy.zeros(rows)
    words = self.process(document)
    # adding the zeros means you always return an array, not just the number 0
    # if none of the words in the document are in the embeddings
    return embedding + sum((self.embeddings.get(word, 0) for word in words))
- Documents Embeddings
@property
def documents_embeddings(self) -> numpy.ndarray:
    """array of embeddings for each document in documents"""
    if self._documents_embeddings is None:
        self._documents_embeddings = numpy.vstack(
            [self.document_embedding(document) for document in self.documents])
    return self._documents_embeddings
- Document Index to Embeddings
@property
def document_index_to_embedding(self) -> dict:
    """maps document index (from self.documents) to embedding"""
    if self._document_index_to_embedding is None:
        self._document_index_to_embedding = {
            index: embedding for index, embedding in enumerate(
                self.documents_embeddings)}
    return self._document_index_to_embedding
Hash Table Builder
@attr.s(auto_attribs=True)
class HashTable:
"""Builds the hash-table for embeddings
Args:
planes: matrix of planes to divide into hash table
vectors: vectors to be hashed
"""
planes: numpy.ndarray
vectors: numpy.ndarray
_hashes: list=None
_hash_table: dict=None
_index_table: dict=None
- Vector Hash Value
def hash_value(self, vector: numpy.ndarray) -> int:
    """Create a hash for a vector

    Args:
     - vector: vector of tweet. Its dimension is (1, N_DIMS)

    Returns:
     - res: a number which is used as a hash for your vector
    """
    rows, columns = self.planes.shape
    # assert vector.shape == (1, rows), vector.shape

    dot_product = numpy.dot(vector, self.planes)
    # assert dot_product.shape == (1, columns), dot_product.shape

    sign_of_dot_product = numpy.sign(dot_product)
    hashes = sign_of_dot_product >= 0
    assert hashes.shape == dot_product.shape

    # remove extra un-used dimensions (convert this from a 2D to a 1D array)
    hashes = numpy.squeeze(hashes)

    hash_value = 0
    for column in range(columns):
        hash_value += 2**column * hashes[column]
    return int(hash_value)
- Hashes
@property
def hashes(self) -> list:
    """Vector hashes"""
    if self._hashes is None:
        self._hashes = [self.hash_value(vector) for vector in self.vectors]
    return self._hashes
- Hash Table Build
@property
def hash_table(self) -> dict:
    """Hash table of vectors

    Returns:
     hash_table: dictionary - keys are hashes, values are lists of vectors (hash buckets)
    """
    if self._hash_table is None:
        number_of_planes = self.planes.shape[1]
        number_of_buckets = 2**number_of_planes
        self._hash_table = {index: [] for index in range(number_of_buckets)}
        for index, hash_ in enumerate(self.hashes):
            self._hash_table[hash_].append(self.vectors[index])
    return self._hash_table
- Index Table
@property
def index_table(self) -> dict:
    """Table of document hash to index"""
    if self._index_table is None:
        number_of_planes = self.planes.shape[1]
        number_of_buckets = 2**number_of_planes
        self._index_table = {index: [] for index in range(number_of_buckets)}
        for index, hash_ in enumerate(self.hashes):
            self._index_table[hash_].append(index)
    return self._index_table
- Build The Tables
The code that uses the tables doesn't actually pull them at the same time, so I'm going to keep them separate.
def build_tables(self) -> None:
    """Builds the hash and index table properties"""
    number_of_planes = self.planes.shape[1]
    number_of_buckets = 2**number_of_planes
    self._hash_table = {index: [] for index in range(number_of_buckets)}
    self._index_table = {index: [] for index in range(number_of_buckets)}
    for index, hash_ in enumerate(self.hashes):
        self._hash_table[hash_].append(self.vectors[index])
        self._index_table[hash_].append(index)
    return
Hash Tables
@attr.s(auto_attribs=True)
class HashTables:
"""Builds the universes of hash tables
Args:
universes: how many universes
planes: planes to hash vectors into
vectors: vectors to hash
"""
universes: int
planes: list
vectors: numpy.ndarray
_hash_tables: list=None
_id_tables: list=None
- Hash Tables
@property
def hash_tables(self) -> list:
    """Builds the list of hash tables"""
    if self._hash_tables is None:
        self._hash_tables = [
            HashTable(vectors=self.vectors,
                      planes=self.planes[universe]).hash_table
            for universe in range(self.universes)
        ]
    return self._hash_tables
- ID Tables
@property
def id_tables(self) -> list:
    """Builds the list of id tables"""
    if self._id_tables is None:
        self._id_tables = [
            HashTable(vectors=self.vectors,
                      planes=self.planes[universe]).index_table
            for universe in range(self.universes)
        ]
    return self._id_tables
Testing the Classes
PlanesUniverse
from neurotic.nlp.word_embeddings.hashing import PlanesUniverse
universes = PlanesUniverse(vector_count=len(tweets),
                           dimensions=N_DIMS,
                           universes=N_UNIVERSES,
                           vectors_per_bucket=16)
assert universes.plane_count==10
Documents Embeddings Builder
from neurotic.nlp.word_embeddings.hashing import DocumentsEmbeddings
table = DocumentsEmbeddings(embeddings=embeddings.english_subset,
                            process=process_tweet, documents=tweets)
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
tweet_embedding = table.document_embedding(custom_tweet)
actual = tweet_embedding[-5:]
expected = [-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184]
assert numpy.allclose(actual, expected)
print(actual)
dict_length = len(table.document_index_to_embedding)
expected = len(tweets)
assert dict_length == expected
print(f"length of dictionary {dict_length:,}")
rows, columns = table.documents_embeddings.shape
print(f"shape of document_vecs ({rows:,}, {columns})")
assert rows == expected
assert columns == 300
my_tweet = 'i am sad'
tweet_embedding = table.document_embedding(my_tweet)
idx = numpy.argmax(cosine_similarity(table.documents_embeddings, tweet_embedding))
print(tweets[idx])
[-0.00268555 -0.15378189 -0.55761719 -0.07216644 -0.32263184]
length of dictionary 10,000
shape of document_vecs (10,000, 300)
@zoeeylim sad sad sad kid :( it's ok I help you watch the match HAHAHAHAHA
Hash Table Builder
from neurotic.nlp.word_embeddings.hashing import HashTable
numpy.random.seed(0)
idx = 0
planes = universes.planes[idx] # get one 'universe' of planes to test the function
vec = numpy.random.rand(1, 300)
expected = 768
hasher = HashTable(planes=planes, vectors=None)
actual = hasher.hash_value(vec)
assert expected == actual, f"expected: {expected}, Actual: {actual}"
print(f" The hash value for this vector,",
f"and the set of planes at index {idx},",
f"is {actual}")
The hash value for this vector, and the set of planes at index 0, is 768
numpy.random.seed(0)
planes = universes.planes[0] # get one 'universe' of planes to test the function
vec = numpy.random.rand(1, 300)
hasher = HashTable(planes=planes, vectors=document_vecs)
tmp_hash_table = hasher.hash_table
tmp_id_table = hasher.index_table
index = 2
print(f"The hash table at key {index} has {len(tmp_hash_table[index])} document vectors")
print(f"The id table at key {index} has {len(tmp_id_table[index])}")
print(f"The first 5 document indices stored at key {index} of are {tmp_id_table[index][0:5]}")
The hash table at key 2 has 21 document vectors
The id table at key 2 has 21
The first 5 document indices stored at key 2 of are [356, 529, 976, 1754, 1779]
Hash Tables
from neurotic.nlp.word_embeddings.hashing import HashTables
tables = HashTables(universes=25, planes=universes.planes, vectors=table.documents_embeddings)
with TIMER:
    hash_tables_2 = tables.hash_tables
    id_tables_2 = tables.id_tables
2020-10-29 19:06:32,191 graeae.timers.timer start: Started: 2020-10-29 19:06:32.191271
2020-10-29 19:06:56,635 graeae.timers.timer end: Ended: 2020-10-29 19:06:56.635738
2020-10-29 19:06:56,637 graeae.timers.timer end: Elapsed: 0:00:24.444467
assert len(hash_tables_2) == universes.universes
assert len(id_tables_2) == universes.universes
id_tables_ = zip(id_tables, id_tables_2)
for table, table_2 in id_tables_:
    assert len(table_2) == 2**universes.plane_count
    for bucket, ids in table.items():
        assert ids == table_2[bucket], f"[{bucket}]: {ids}, {table_2[bucket]}"
Testing Objects
| Variable | Class |
|---|---|
| embeddings | EmbeddingsLoader |
| process_tweet | TwitterProcessor |
| table | DocumentsEmbeddings |
| hasher | HashTable |
| tables | HashTables |
The Data
This is a summary of the data that was loaded since this was such a long post and I can't remember what's what without looking around.
| Variable | Type | Description |
|---|---|---|
| tweets | list of strings | All the tweets (10,000) |
| document_vecs | numpy.ndarray | Document embeddings for the tweets (10,000, 300) |
| ind2Tweet | dict | Maps the index of a tweet (in tweets or document_vecs) to its document embedding |
| planes_l | list of arrays | The random planes for hashing (each is 300 x 10, 25 in total) |
| hash_tables | list | Bucket-index to document-embedding maps (one per set of planes, each with 1,024 buckets (2^number of planes)) |
| id_tables | list | Bucket-index to document-index maps (one per set of planes, each with 1,024 buckets) |
End
The next step is to use this to implement approximate k-Nearest Neighbors to complete our application.