This continues from a prior post where we built the EmbeddingsLoader to gather the word-embeddings that match our French and English dictionaries.
Imports
# pypi
from dotenv import load_dotenv

import attr
import numpy

# my stuff
from neurotic.nlp.word_embeddings.embeddings import EmbeddingsLoader
Set Up
The Dotenv
This loads the paths to the files.
load_dotenv("posts/nlp/.env", override=True)
The Embeddings
Instead of using the subset word-embeddings that the course created, I'm going to try to load the whole set of word embeddings from scratch. I defined the EmbeddingsLoader in {% lancelot title="this post" %}english-to-french-data{% /lancelot %} so I'll just load it here.
loader = EmbeddingsLoader()
Middle
Generate Embedding and Transform Matrices
Our English and French Embeddings are stored as word:vector dictionaries. To work with the embeddings we're going to need to convert them to matrices. At the same time we need to filter out words that are in one set but not the other so we're going to brute force it a little.
def get_matrices(en_fr: dict, french_vecs: dict, english_vecs: dict):
    """Build the training matrices from the embeddings

    Args:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings
        english_vecs: English words to their corresponding word embeddings

    Return:
        X: a matrix where the rows are the English embeddings
        Y: a matrix where the rows are the corresponding French embeddings
    """
    # get the english words (the keys in the dictionary) and store them in a set
    english_set = set(english_vecs)

    # get the french words (keys in the dictionary) and store them in a set
    french_set = set(french_vecs)

    # keep only the pairs where both the english and the french word have an embedding
    filtered = {english_word: french_word
                for english_word, french_word in en_fr.items()
                if english_word in english_set and french_word in french_set}

    # look up the embeddings for the remaining pairs
    X = [english_vecs[english_word] for english_word in filtered]
    Y = [french_vecs[french_word] for french_word in filtered.values()]

    # stack the lists of vectors into matrices
    return numpy.vstack(X), numpy.vstack(Y)
Getting the Training Sets
@attr.s(auto_attribs=True)
class TrainingData:
    """Converts the embeddings into a training set

    Args:
        loader: EmbeddingsLoader instance
    """
    _loader: EmbeddingsLoader = None
    _english_vocabulary: set = None
    _french_vocabulary: set = None
    _filtered: dict = None
    _x_train: numpy.ndarray = None
    _y_train: numpy.ndarray = None

    @property
    def loader(self) -> EmbeddingsLoader:
        """A loader for the embeddings subsets"""
        if self._loader is None:
            self._loader = EmbeddingsLoader()
        return self._loader

    @loader.setter
    def loader(self, new_loader: EmbeddingsLoader) -> None:
        """Sets the embeddings loader"""
        self._loader = new_loader
        return

    @property
    def english_vocabulary(self) -> set:
        """The english embeddings subset words"""
        if self._english_vocabulary is None:
            self._english_vocabulary = set(self.loader.english_subset)
        return self._english_vocabulary

    @property
    def french_vocabulary(self) -> set:
        """The french embeddings subset words"""
        if self._french_vocabulary is None:
            self._french_vocabulary = set(self.loader.french_subset)
        return self._french_vocabulary

    @property
    def filtered(self) -> dict:
        """A {english: french} dict filtered down

        This is a dict made of the original english-french dictionary created
        by the embeddings loader but filtered down so that the key is in the
        ``english_vocabulary`` and the value is in the ``french_vocabulary``

        This ensures that the training set will only contain terms that have
        entries in both embeddings subsets
        """
        if self._filtered is None:
            self._filtered = {
                english_word: french_word
                for english_word, french_word in self.loader.training.items()
                if (english_word in self.english_vocabulary
                    and french_word in self.french_vocabulary)}
        return self._filtered

    @property
    def x_train(self) -> numpy.ndarray:
        """The english-language embeddings as row-vectors"""
        if self._x_train is None:
            self._x_train = numpy.vstack(
                [self.loader.english_subset[english_word]
                 for english_word in self.filtered])
        return self._x_train

    @property
    def y_train(self) -> numpy.ndarray:
        """The french-language embeddings as row-vectors"""
        if self._y_train is None:
            self._y_train = numpy.vstack(
                [self.loader.french_subset[french_word]
                 for french_word in self.filtered.values()])
        return self._y_train

    def check_rep(self) -> None:
        """Checks the shape of the training data

        Note:
            since this checks those attributes they will be built if they
            don't already exist

        Raises:
            AssertionError: there's something unexpected about the shape of the data
        """
        rows, columns = self.x_train.shape
        assert rows == len(self.filtered)
        assert columns == len(next(iter(self.loader.english_subset.values())))

        rows, columns = self.y_train.shape
        assert rows == len(self.filtered)
        assert columns == len(next(iter(self.loader.french_subset.values())))
        return
End
The post that collects all the posts for this project is Machine Translation.
This is the first post in a series - the document with links to all the posts in the series is this post.
The Machine Translation exercise uses word embeddings that are subsets of prebuilt Word2Vec (English) embeddings (GoogleNews-vectors-negative300.bin.gz) and prebuilt French Embeddings (wiki.multi.fr.vec). Coursera provides them but I thought it would be a good exercise to look at how they're built.
To make loading files more or less portable I'm using a .env file with the paths to the data sets. This loads it into the environment so the values are accessible using os.environ.
load_dotenv("posts/nlp/.env", override=True)
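As an illustration, the .env file is just a set of KEY=path lines, and after load_dotenv runs the paths can be read back with os.environ. The key name here is made up; the real names are whatever you put in the .env file.

# the .env file holds lines like (hypothetical name and path):
#   GOOGLE_NEWS_EMBEDDINGS=~/data/GoogleNews-vectors-negative300.bin
import os

print(os.environ["GOOGLE_NEWS_EMBEDDINGS"])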
Middle
The Embeddings
As I noted, the English and French embeddings are available from the web. I thought about having the code download them if the files don't exist, but the Google News embeddings file is pretty big and the download takes a while on my internet connection, so it's easier to just grab it with a browser. I'm going to assume the files are downloaded and that the Google News embeddings are un-zipped (probably using gunzip or pigz, both of which are installed by default on Ubuntu 20.04).
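Once the files are in place, loading them with gensim looks something like this. The file names are placeholders - in the actual code the paths come from the environment variables set up above.

from gensim.models import KeyedVectors

# the Google News embeddings are in the binary word2vec format
english = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# the MUSE French embeddings are a plain-text .vec file
french = KeyedVectors.load_word2vec_format("wiki.multi.fr.vec", binary=False)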
Notes
"""This is a module for word embeddings loaders."""
Imports
# python
from argparse import Namespace
from pathlib import Path

import os
import pickle

# from pypi
from gensim.models.keyedvectors import BaseKeyedVectors, KeyedVectors

import attr
import pandas
@attr.s(auto_attribs=True)
class SubsetBuilder:
    """Create subset of embeddings that matches sets

    Args:
        embeddings_1: word embeddings
        embeddings_2: word embeddings
        subset_dict: dict whose keys and values to pull out of the embeddings
        output_1: path to save the first subset to
        output_2: path to save the second subset to
    """
    embeddings_1: KeyedVectors
    embeddings_2: KeyedVectors
    subset_dict: dict
    output_1: Path
    output_2: Path

    _vocabulary_1: set = None
    _vocabulary_2: set = None
    _subset_1: dict = None
    _subset_2: dict = None
Subset 1
@property
def subset_1(self) -> dict:
    """Subset of embeddings 1"""
    if self._subset_1 is None and self.output_1.is_file():
        with self.output_1.open("rb") as reader:
            self._subset_1 = pickle.load(reader)
    return self._subset_1
Subset 2
@property
def subset_2(self) -> dict:
    """Subset of embeddings 2"""
    if self._subset_2 is None and self.output_2.is_file():
        with self.output_2.open("rb") as reader:
            self._subset_2 = pickle.load(reader)
    return self._subset_2
Save
def pickle_it(self):
    """Save the subsets"""
    if self.subset_1 is not None:
        with self.output_1.open("wb") as writer:
            pickle.dump(self.subset_1, writer)
    if self.subset_2 is not None:
        with self.output_2.open("wb") as writer:
            pickle.dump(self.subset_2, writer)
    return
Clean it
def clean(self) -> None:
    """Remove any pickled subsets

    Also removes any subset dictionaries
    """
    for path in (self.output_1, self.output_2):
        if path.is_file():
            path.unlink()
    self._subset_1 = self._subset_2 = None
    return
Call the Subset Builder
def __call__(self, pickle_it: bool=True) -> None:
    """Builds or loads the subsets and saves them as pickles

    Args:
        pickle_it: whether to save the subsets
    """
    if self.subset_1 is None or self.subset_2 is None:
        self.clean()
        self._subset_1, self._subset_2 = {}, {}
        for key, value in self.subset_dict.items():
            if key in self.embeddings_1 and value in self.embeddings_2:
                self._subset_1[key] = self.embeddings_1[key]
                self._subset_2[value] = self.embeddings_2[value]
        if pickle_it:
            self.pickle_it()
    return
Dict Loader
@attr.s(auto_attribs=True)
class DictLoader:
    """Loader for the english and french dictionaries

    This is specifically for the training and testing files

    - CSV-ish (separated by spaces instead of commas)
    - No header: column 1 = English, column 2 = French

    Args:
        path: path to the file
        columns: list of strings
        delimiter: separator for the columns in the source file
    """
    path: str
    columns: list = ["English", "French"]
    delimiter: str = " "

    _dataframe: pandas.DataFrame = None
    _dictionary: dict = None
Data Frame
@property
def dataframe(self) -> pandas.DataFrame:
    """Loads the space-separated file as a dataframe"""
    if self._dataframe is None:
        self._dataframe = pandas.read_csv(self.path,
                                          names=self.columns,
                                          delimiter=self.delimiter)
    return self._dataframe
Dictionary
@property
def dictionary(self) -> dict:
    """english to french dictionary"""
    if self._dictionary is None:
        self._dictionary = dict(zip(self.dataframe[self.columns[0]],
                                    self.dataframe[self.columns[1]]))
    return self._dictionary
After I made the subset builder it occurred to me that if there was overlap between the testing and training sets, but the shared terms mapped to different definitions, then the way I was going to build them would require two separate dictionaries - but as you can see, the training and testing sets don't have any English terms in common.
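Here's a minimal sketch of that check, assuming `training` and `testing` are the {english: french} dictionaries produced by DictLoader for the two files.

# the dict keys are the English terms, so overlap is just a set intersection
overlap = set(training) & set(testing)
print(f"English terms in both the training and testing sets: {len(overlap)}")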
After I tried using the EmbeddingsLoader on a different computer I realized that I didn't really simplify the creation of the embeddings all that much so I'm going to make an overall builder that maybe hides it from the end-user (although not entirely since I use environment variables that have to be set).
@attr.s(auto_attribs=True)
class SourcePaths:
    """Paths to the source files

    These are files provided from other sources
    """
    keys: Namespace = Keys
    _english: Path = None
    _french: Path = None
    _training: Path = None
    _testing: Path = None

    @property
    def english(self) -> Path:
        """Path to the english word-embeddings"""
        if self._english is None:
            self._english = Path(os.environ[self.keys.source.english])
        return self._english

    @property
    def french(self) -> Path:
        """Path to the french word-embeddings"""
        if self._french is None:
            self._french = Path(os.environ[self.keys.source.french])
        return self._french

    @property
    def training(self) -> Path:
        """Path to the training dictionary"""
        if self._training is None:
            self._training = Path(os.environ[self.keys.source.training])
        return self._training

    @property
    def testing(self) -> Path:
        """Path to the testing dictionary"""
        if self._testing is None:
            self._testing = Path(os.environ[self.keys.source.testing])
        return self._testing
Target Paths
@attr.s(auto_attribs=True)
class TargetPaths:
    """Paths to save derived files"""
    keys: Namespace = Keys
    _english: Path = None
    _french: Path = None

    @property
    def english(self) -> Path:
        """Path to derived subset of english embeddings"""
        if self._english is None:
            self._english = Path(os.environ[self.keys.target.english])
        return self._english

    @property
    def french(self) -> Path:
        """Path to derived subset of french embeddings"""
        if self._french is None:
            self._french = Path(os.environ[self.keys.target.french])
        return self._french
Paths
@attr.s(auto_attribs=True)
class Paths:
    """Class to build and hold the source and target file paths"""
    _target: Path = None
    _source: Path = None

    @property
    def target(self) -> TargetPaths:
        """Holds the object with paths to the created embeddings subsets"""
        if self._target is None:
            self._target = TargetPaths()
        return self._target

    @property
    def source(self) -> SourcePaths:
        """Holds the object with paths to the original source files"""
        if self._source is None:
            self._source = SourcePaths()
        return self._source
Load And Build
@attr.s(auto_attribs=True)
class LoadAndBuild:
    """Loads embeddings and dictionaries and builds subsets"""
    _paths: Paths = None
    _english_embeddings: BaseKeyedVectors = None
    _french_embeddings: BaseKeyedVectors = None
    _training: dict = None
    _testing: dict = None
    _merged_dicts: dict = None
    _subset_builder: SubsetBuilder = None

    @property
    def paths(self) -> Paths:
        """Object with paths to files"""
        if self._paths is None:
            self._paths = Paths()
        return self._paths

    @property
    def english_embeddings(self) -> BaseKeyedVectors:
        """Word embeddings for English"""
        if self._english_embeddings is None:
            self._english_embeddings = Embeddings(self.paths.source.english,
                                                  binary=True).embeddings
        return self._english_embeddings

    @property
    def french_embeddings(self) -> BaseKeyedVectors:
        """Word embeddings for French"""
        if self._french_embeddings is None:
            self._french_embeddings = Embeddings(self.paths.source.french,
                                                 binary=False).embeddings
        return self._french_embeddings

    @property
    def training(self) -> dict:
        """Training dictionary"""
        if self._training is None:
            self._training = DictLoader(self.paths.source.training).dictionary
        return self._training

    @property
    def testing(self) -> dict:
        """Testing dictionary"""
        if self._testing is None:
            self._testing = DictLoader(self.paths.source.testing).dictionary
        return self._testing

    @property
    def merged_dicts(self) -> dict:
        """Testing and training merged"""
        if self._merged_dicts is None:
            self._merged_dicts = self.training.copy()
            self._merged_dicts.update(self.testing)
            assert len(self._merged_dicts) == (len(self.training)
                                               + len(self.testing))
        return self._merged_dicts

    @property
    def subset_builder(self) -> SubsetBuilder:
        """Builder of the subset dictionaries"""
        if self._subset_builder is None:
            self._subset_builder = SubsetBuilder(self.english_embeddings,
                                                 self.french_embeddings,
                                                 self.merged_dicts,
                                                 self.paths.target.english,
                                                 self.paths.target.french)
        return self._subset_builder

    def __call__(self) -> None:
        """Calls the subset builder"""
        self.subset_builder()
        return
A Loader
As a convenience I'm going to make a loader for all the parts.
@attr.s(auto_attribs=True)
class EmbeddingsLoader:
    """Loads the embeddings and dictionaries

    Warning:
        this assumes that you've loaded the proper environment variables to
        find the files - it doesn't call ``load_dotenv``
    """
    _loader_builder: LoadAndBuild = None
    _english_subset: dict = None
    _french_subset: dict = None
    _training: dict = None
    _testing: dict = None
@property
def loader_builder(self) -> LoadAndBuild:
    """Object to load sources and build subsets"""
    if self._loader_builder is None:
        self._loader_builder = LoadAndBuild()
    return self._loader_builder
@property
def english_subset(self) -> dict:
    """The english embeddings subset

    This is a subset of the Google News embeddings that matches the keys in
    the english to french dictionaries
    """
    if self._english_subset is None:
        if not self.loader_builder.paths.target.english.is_file():
            self.loader_builder()
            self._english_subset = self.loader_builder.subset_builder.subset_1
        else:
            with self.loader_builder.paths.target.english.open("rb") as reader:
                self._english_subset = pickle.load(reader)
    return self._english_subset
@property
def french_subset(self) -> dict:
    """Subset of the MUSE French embeddings"""
    if self._french_subset is None:
        if self.loader_builder.paths.target.french.is_file():
            with self.loader_builder.paths.target.french.open("rb") as reader:
                self._french_subset = pickle.load(reader)
        else:
            self.loader_builder()
            self._french_subset = self.loader_builder.subset_builder.subset_2
    return self._french_subset
@property
def training(self) -> dict:
    """The english to french dictionary training set"""
    if self._training is None:
        self._training = DictLoader(
            self.loader_builder.paths.source.training).dictionary
    return self._training
@property
def testing(self) -> dict:
    """The english to french dictionary testing set"""
    if self._testing is None:
        self._testing = DictLoader(
            self.loader_builder.paths.source.testing).dictionary
    return self._testing
In the previous post we implemented Locality Sensitive Hashing. It's part of a series of posts building an English to French translator whose links are gathered in this post.
Imports
# python
from argparse import Namespace

# pypi
from dotenv import load_dotenv
from nltk.corpus import twitter_samples

import numpy

# this repo
from neurotic.nlp.word_embeddings.embeddings import EmbeddingsLoader
from neurotic.nlp.word_embeddings.english_french import TrainingData
from neurotic.nlp.word_embeddings.hashing import (DocumentsEmbeddings,
                                                  PlanesUniverse,
                                                  HashTable,
                                                  HashTables)
from neurotic.nlp.word_embeddings.nearest_neighbors import NearestNeighbors
from neurotic.nlp.twitter.processor import TwitterProcessor
from neurotic.nlp.word_embeddings.training import TheTrainer
We're going to implement approximate k-nearest neighbors using locality sensitive hashing to search for documents that are similar to a given document at the index doc_id.
Arguments
doc_id: index into the document list all_tweets
v: document vector for the tweet in all_tweets at index doc_id
planes_l: list of planes (the global variable created earlier)
k: number of nearest neighbors to search for
num_universes_to_use: number of available universes to use (25 by default)
The approximate_knn function finds a subset of candidate vectors that are in the same "hash bucket" as the input vector 'v'. Then it performs the usual k-nearest neighbors search on this subset (instead of searching through all 10,000 tweets).
Hints
There are many dictionaries used in this function. Try to print out planes_l, hash_tables, id_tables to understand how they are structured, what the keys represent, and what the values contain.
To remove an item from a list, use .remove()
To append to a list, use .append()
To add to a set, use .add()
# UNQ_C21 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# This is the code used to do the fast nearest neighbor search. Feel free to go over it
def approximate_knn(document_id: int,
                    document_embedding: numpy.ndarray,
                    multiverse_planes: list,
                    k: int=1,
                    universes: int=TWEET.universes):
    """Search for k-NN using hashes

    Args:
        document_id: index for the document in the lists
        document_embedding: vector representing a document's word embeddings
        multiverse_planes: dictionary of planes for the hash-tables
        k: number of neighbors to find
        universes: number of times to repeat the search

    Returns:
        list of indexes for neighbor documents
    """
    assert universes <= TWEET.universes

    # Vectors that will be checked as possible nearest neighbors
    possible_neighbors = list()

    # list of document IDs
    ids_of_possible_neighbors = list()

    # create a set of ids to consider, for faster checking if a document ID already exists in the set
    set_of_ids_of_possible_neighbors = set()

    # loop through the universes of planes
    for universe in range(universes):
        # get the set of planes from the multiverse_planes list, for this particular universe
        planes = multiverse_planes[universe]

        # get the hash value of the vector for this set of planes
        hash_value = HashTable(planes=planes, vectors=None).hash_value(document_embedding)

        # get the hash table for this particular universe
        hash_table = hash_tables[universe]

        # get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors = hash_table[hash_value]

        # get the id_table for this particular universe
        id_table = id_tables[universe]

        # get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]

        # remove the id of the document that we're searching for
        if document_id in new_ids_to_consider:
            new_ids_to_consider.remove(document_id)
            print(f"removed document_id {document_id} of input vector from new_ids_to_search")

        # loop through the subset of document vectors to consider
        for index, new_id in enumerate(new_ids_to_consider):
            # if the document ID is not yet in the set of ids to consider...
            if new_id not in set_of_ids_of_possible_neighbors:
                # get the embedding for this document and add it to the
                # list of vectors to consider as possible nearest neighbors
                document_vector = document_vectors[index]
                possible_neighbors.append(document_vector)

                # append the new_id (the index for the document) to the list of ids to consider
                ids_of_possible_neighbors.append(new_id)

                # also add the new_id to the set of ids to consider
                # (used to check whether new_id is already in the IDs to consider)
                set_of_ids_of_possible_neighbors.add(new_id)

    # Now run k-NN on the smaller set of vectors to consider
    print("Fast considering %d vecs" % len(possible_neighbors))

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbors = NearestNeighbors(candidates=possible_neighbors, k=k)
    nearest_neighbor_ids = nearest_neighbors(document_embedding)

    # use the nearest neighbor index list as indices into the ids to consider
    # to create the list of nearest neighbor document ids
    nearest_neighbor_ids = [ids_of_possible_neighbors[index]
                            for index in nearest_neighbor_ids]
    return nearest_neighbor_ids
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
nearest_neighbor_ids = approximate_knn(document_id=doc_id,
                                       document_embedding=vec_to_search,
                                       multiverse_planes=universes.planes,
                                       k=3,
                                       universes=5)

print(f"Nearest neighbors for document {doc_id}")
print(f"Document contents: {doc_to_search}")
print("")

for neighbor_id in nearest_neighbor_ids:
    print(f"Nearest neighbor at document id {neighbor_id}")
    print(f"document contents: {tweets[neighbor_id]}")
Fast considering 79 vecs
Nearest neighbors for document 0
Document contents: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Nearest neighbor at document id 254
document contents: Something to get your #Friday off to a great start :) Have a great day all! #Mclaren #FridayFeeling #TGIF http://t.co/LshgwcXsSv
Nearest neighbor at document id 2714
document contents: Current playlist :D http://t.co/PYKQLD4KHr
Nearest neighbor at document id 51
document contents: #FollowFriday @France_Espana @reglisse_menthe @CCI_inter for being top engaged members in my community this week :)
The first and third neighbors seem reasonable, although the third looks like it's just a re-working of our source tweet.
End
The post that collects all the posts in this project is Machine Translation.
def basic_hash_table(things_to_hash: list, buckets: int) -> dict:
    """Create a basic hash table

    Args:
        things_to_hash: list of integers to hash
        buckets: number of buckets in the table

    Returns:
        hash_table: the things to hash sorted into their buckets
    """
    def hash_function(value: int, buckets: int) -> int:
        """Maps the value to an integer

        Args:
            value: what to hash
            buckets: number of buckets in the hash table

        Returns:
            remainder of value // buckets
        """
        return int(value) % buckets

    # Initialize all the buckets in the hash table as empty lists
    hash_table = {bucket: [] for bucket in range(buckets)}

    for value in things_to_hash:
        # Get the hash key for the given value
        hash_value = hash_function(value, buckets)

        # Add the element to the corresponding bucket
        hash_table[hash_value].append(value)

    return hash_table
The basic_hash_table maps values that can be cast to integers to a dictionary of lists. Let's see what it does.
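For instance, a minimal run with ten buckets (the input values here are just for illustration):

# hash a few integers into ten buckets
examples = [100, 10, 14, 17, 97]
print(basic_hash_table(examples, buckets=10))
# {0: [100, 10], 1: [], 2: [], 3: [], 4: [14], 5: [], 6: [], 7: [17, 97], 8: [], 9: []}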
This Basic Hash Table maps the values based on their remainder after dividing the value by the number of buckets. In this case there are ten buckets, so each value gets mapped to the bucket matching its ones digit.
Multiplane Hash Functions
To visualize it we'll start with a single plane and color some points based on which side of the plane they fall.
I'll start by defining the vector that we'll use to decide which side of the plane a vector is on (by taking the dot product and checking the sign of the result).
decider = pandas.DataFrame([[1, 2]])
This isn't the separating plane but rather a vector perpendicular to it. You don't need the separating plane to categorize the vectors, but for the sake of visualization it's useful to see it. We can get it by building a rotation matrix that rotates our original vector 90 degrees.
Now we can plot them along with some categorized points.
First we plot the vector we use to decide which side of the plane the points are on.
# so to plot it I'll add a starting point
COLUMNS = "X Y".split()
start = pandas.DataFrame([[0, 0]])

decider_plotter = pandas.concat([start, plane])
decider_plotter.columns = COLUMNS

plot = decider_plotter.hvplot(x="X", y="Y")
Now we plot the plane that separates the categories. I'll scale it a little so it spans more of the plot. Also, the rotation only gives us the line segment rotated by 90 degrees, so I'm going to negate it to get the -90 degree segment as well and complete the rendering of the plane.
Now we get to the points. The main lines to pay attention to are the calculation of the side_of_plane value and the conditional. The side_of_plane is an array but you can do boolean equality checks with integers as shown.
## Get a pair of random numbers between -4 and 4
POINTS = 20
LIMIT = 4

for _ in range(0, POINTS):
    vector = pandas.DataFrame([numpy_random.uniform(-LIMIT, LIMIT, 2)],
                              columns=["x", "y"])
    side_of_plane = numpy.sign(numpy.dot(plane, vector.T))
    if side_of_plane == 1:
        plot *= vector.hvplot.scatter(x="x", y="y", color=Plot.blue)
    else:
        plot *= vector.hvplot.scatter(x="x", y="y", color=Plot.red)

plot = plot.opts(
    title="Plane Hash Table",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
    xlim=(-LIMIT, LIMIT),
    ylim=(-LIMIT, LIMIT),
)
outcome = Embed(plot=plot, file_name="multiplane_hash")()
print(outcome)
So the dashed tan line is our separation plane and the blue line segment is the vector we use to decide which side of the plane the dots are on. The blue dots have a positive dot product with the blue vector and the red dots have a negative dot product with the blue vector.
def side_of_plane(plane: numpy.ndarray, vector: numpy.ndarray) -> int:
    """Finds the side of the plane that the vector is on

    Args:
        plane: separating plane
        vector: location to check

    Returns:
        sign of the dot product between the plane and the vector
    """
    return numpy.sign(numpy.dot(plane, vector.T)).item()
def hash_multi_plane(planes: list, vector: numpy.ndarray) -> int:
    """Creates hash value for a set of planes

    Args:
        planes: list of arrays to hash
        vector: array to determine which side of the planes are positive

    Returns:
        hash_value: the hash for the plane-bucket matching the vector
    """
    hash_value = 0
    for index, plane in enumerate(planes):
        sign = side_of_plane(plane, vector)

        # set the bit for this plane if the sign is non-negative
        hash_i = 0 if sign < 0 else 1
        hash_value += 2**index * hash_i
    return hash_value
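As a small sketch of how the bits combine (the planes and the vector here are made up for illustration):

# three planes -> a 3-bit hash, so buckets 0 through 7
planes = [numpy.array([[1, 1]]),
          numpy.array([[-1, 2]]),
          numpy.array([[2, -1]])]
vector = numpy.array([[2, 2]])

# all three dot products are positive, so every bit is set: 1 + 2 + 4 = 7
print(hash_multi_plane(planes, vector))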
def side_of_plane_matrix(planes: numpy.ndarray, vector: numpy.ndarray) -> numpy.ndarray:
    """Decides which side of the planes the vector is on

    Returns:
        side-of-plane value for the vector with respect to each plane
    """
    return numpy.sign(numpy.dot(planes, vector.T))
def hash_multi_plane_matrix(planes: numpy.ndarray,
                            vector: numpy.ndarray,
                            num_planes: int):
    """Calculates the hash for the vector with respect to the planes"""
    sides_matrix = side_of_plane_matrix(planes, vector)
    hash_value = 0
    for i in range(num_planes):
        # get the value inside the matrix cell
        sign = sides_matrix[i].item()
        hash_i = 1 if sign >= 0 else 0

        # sum 2^i * hash_i
        hash_value += 2**i * hash_i
    return hash_value
This is another lab from Coursera's NLP Specialization. This time it's about using numpy to perform vector operations.
Imports
# python
from argparse import Namespace
from functools import partial

import math

# from pypi
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews
Let's start with a simple matrix. We'll call it R because when we do our machine translation we'll need a rotation matrix which is named R.
R = numpy.array([[2, 0],
                 [0, -2]])
Now we'll create another matrix.
x = numpy.array([[1, 1]])
print(x.shape)
(1, 2)
Note the nested square brackets, this makes it a matrix and not a vector.
The Dot Product
y = numpy.dot(x, R)
print(y)
[[ 2 -2]]
The matrix R rotates and scales the row vector x. To see the effect we can plot the original vector x and the rotated version y.
X = pandas.DataFrame(dict(X=[0, x[0][0]], Y=[0, x[0][1]]))
Y = pandas.DataFrame(dict(X=[0, y[0][0]], Y=[0, y[0][1]]))

x_plot = X.hvplot(x="X", y="Y", color=Plot.blue)
y_plot = Y.hvplot(x="X", y="Y", color=Plot.red)

plot = (x_plot * y_plot).opts(
    title="Original and Rotated Vectors",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
    xlim=(-2, 2),
    ylim=(-2, 2),
)
outcome = Embed(plot=plot, file_name="original_and_rotate_vectors")()
print(outcome)
The blue segment is the original vector and the red is the rotated and scaled vector.
More Rotations
In the previous section we rotated the vector using integer values, but if we want to rotate the vector by a specific number of degrees, the way to do that is with a rotation matrix.
\[
Ro = \begin{bmatrix}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta
\end{bmatrix}
\]

Let's start with a vector and rotate it \(100^\circ\).
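The code for that step isn't reproduced here, but a minimal sketch looks something like the following. The starting vector x2 = [[2, 2]] is an assumption; the rotation matrix is the Ro from the formula above, applied with the same row-vector dot-product convention as before.

theta = numpy.radians(100)

# the rotation matrix Ro from the formula above
Ro = numpy.array([[numpy.cos(theta), -numpy.sin(theta)],
                  [numpy.sin(theta), numpy.cos(theta)]])

x2 = numpy.array([[2, 2]])   # assumed starting row vector
y2 = numpy.dot(x2, Ro)       # same convention as numpy.dot(x, R) above

print(y2)
print(f"Norm of x2: {numpy.linalg.norm(x2):0.2f}")
print(f"Norm of y2: {numpy.linalg.norm(y2):0.2f}")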
You can see that in this case our transformed vector (y2) didn't change in length the way it did in the previous example. Let's plot it and see what it looks like.
This is an extension of the previous two posts about Word Embeddings and Principal Component Analysis. Once again we're going to start with pre-trained word embeddings rather than train our own and then take the embeddings and explore them to better understand them.
Imports
# from python
from argparse import Namespace
from functools import partial
from pathlib import Path

import math
import os
import pickle

# from pypi
from dotenv import load_dotenv
from expects import (
    be_true,
    equal,
    expect,
)
from numpy.random import default_rng
from sklearn.decomposition import PCA

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews, Timer
Set Up
The Timer
Just something to tell how long some processes take.
These are the same embeddings as in the Word Embeddings exploration. They're loaded as a dictionary of arrays (vectors). The original source is the Google News pre-trained data set available from the Word2Vec archive, but it is 3.64 gigabytes, so Coursera extracted a subset of it to work with.
The instructors also provide some code to show you how to create a different subset, and I'm assuming that's essentially the way they built this dataset.
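Here's a minimal sketch of that kind of subsetting - not the course's actual snippet - assuming `embeddings` holds the full Google News KeyedVectors already loaded with gensim and `words_to_keep` is a placeholder for whatever vocabulary you want in the subset (for example, the city and country names below).

import pickle

# keep only the words that are both in the wanted vocabulary and the embeddings
subset = {word: embeddings[word] for word in words_to_keep if word in embeddings}

# pickle the much smaller dictionary for later use
with open("word_embeddings_subset.p", "wb") as writer:
    pickle.dump(subset, writer)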
City 1 Country 1 City 2 Country 2
0 Athens Greece Baghdad Iraq
1 Athens Greece Bangkok Thailand
2 Athens Greece Beijing China
3 Athens Greece Berlin Germany
4 Athens Greece Bern Switzerland
It looks odd because this is actually an evaluation set. The first three columns are used to predict the fourth (e.g. Athens, Greece, and Baghdad are used to predict that Baghdad is the capital of Iraq).
Middle
Predicting Relationships Among Words
This part is about writing a function that will use the word embeddings to predict relationships among words.
Requirements
The arguments will be three words
The first two will be considered related to each other somehow
The function will then predict a fourth word that is related to the third word in a way that is similar to the relationship between the first two words.
Another way to look at it is that if you are given three words - Athens, Greece, and Bangkok - then the function will fill in the blank for "Athens is to Greece as Bangkok is to __".
Because of our input data set what the function will end up doing is finding the capital of a country. But first we need a distance function.
A and B are the word vectors and \(A_i\) or \(B_i\) is the ith item of that vector
If the output is 1 then the vectors point in the same direction and if it is -1 they point in opposite directions
If the number is between 0 and 1 then it is a similarity score
If the number is between -1 and 0 then it is a dissimilarity score
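For reference, the cosine similarity itself is the dot product of the two vectors divided by the product of their norms:

\[
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}
\]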
def cosine_similarity(A: numpy.ndarray, B: numpy.ndarray) -> float:
    '''Calculates the cosine similarity between two arrays

    Args:
        A: a numpy array which corresponds to a word vector
        B: a numpy array which corresponds to a word vector

    Return:
        cos: numerical number representing the cosine similarity between A and B
    '''
    dot_product = A.dot(B)
    norm_of_A = numpy.linalg.norm(A)
    norm_of_B = numpy.linalg.norm(B)
    cos = dot_product / (norm_of_A * norm_of_B)
    return cos
king=embeddings["king"]queen=embeddings["queen"]similarity=cosine_similarity(king,queen)print(f"The Cosine Similarity between 'king' and 'queen': {similarity:0.2f}.")expected=0.6510956expect(math.isclose(similarity,expected,rel_tol=1e-6)).to(be_true)
The Cosine Similarity between 'king' and 'queen': 0.65.
Euclidean Distance
In addition to the Cosine Similarity we can use the (probably better known) Euclidean Distance.
The more similar the words, the more likely the Euclidean distance will be close to 0 (and zero means they are the same).
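As a formula, it's just the norm of the difference between the two vectors:

\[
d(A, B) = \lVert A - B \rVert = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
\]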
def euclidean(A: numpy.ndarray, B: numpy.ndarray) -> float:
    """Calculate the euclidean distance between two vectors

    Args:
        A: a numpy array which corresponds to a word vector
        B: a numpy array which corresponds to a word vector

    Return:
        d: numerical number representing the Euclidean distance between A and B
    """
    d = numpy.sqrt(((A - B)**2).sum())
    return d
actual = euclidean(king, queen)
expected = 2.4796925

print(f"The Euclidean Distance between 'king' and 'queen' is {actual:0.2f}.")
expect(math.isclose(actual, expected, rel_tol=1e-6)).to(be_true)
The Euclidean Distance between 'king' and 'queen' is 2.48.
The Predictor
Here's where we make the function that tries to predict the country for a given capital city. This will use the cosine similarity. This first version will use brute force.
def get_country(city1: str, country1: str, city2: str, embeddings: dict) -> tuple:
    """Find the country that has a particular capital city

    Args:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings

    Return:
        country: most likely country, similarity score
    """
    # don't consider the input words as candidate answers
    group = set((city1, country1, city2))

    city1_emb = embeddings[city1]
    country1_emb = embeddings[country1]
    city2_emb = embeddings[city2]

    # country2 should be close to country1 - city1 + city2
    vec = country1_emb - city1_emb + city2_emb

    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    for word in embeddings:
        if word not in group:
            word_emb = embeddings[word]

            # calculate the cosine similarity between the target vector and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec, word_emb)

            # if the cosine similarity is more similar than the previous best similarity...
            if cur_similarity > similarity:
                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)
    return country
actual_country, actual_similarity = get_country("Athens", "Greece", "Cairo", embeddings)
print(f"Cairo is the capital of {actual_country}.")

expected_country, expected_similarity = "Egypt", 0.7626821
expect(actual_country).to(equal(expected_country))
expect(math.isclose(actual_similarity, expected_similarity, rel_tol=1e-6)).to(be_true)
Cairo is the capital of Egypt.
Checking the Model Accuracy
\[
\text{Accuracy} = \frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}
\]
country_getter = partial(get_country, embeddings=embeddings)

def get_accuracy(data: pandas.DataFrame) -> float:
    '''Calculate the fraction of correct capitals

    Args:
        data: dataframe with the city and country columns

    Return:
        accuracy: the accuracy of the model
    '''
    num_correct = 0

    # loop through the rows of the dataframe
    for index, row in data.iterrows():
        city1 = row["City 1"]
        country1 = row["Country 1"]
        city2 = row["City 2"]
        country2 = row["Country 2"]

        # use get_country to find the predicted country2
        predicted_country2, _ = country_getter(city1=city1, country1=country1, city2=city2)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct predictions by 1
            num_correct += 1

    # divide the number correct by the number of rows in the dataframe
    accuracy = num_correct / len(data)
    return accuracy
Now we'll write a function to do the Principal Component Analysis for our embeddings.
The word vectors are of dimension 300.
Use PCA to change the 300 dimensions to n_components dimensions.
The new matrix should be of dimension m, n_components (m being the number of rows).
First de-mean the data
Get the eigenvalues using `linalg.eigh`. Use `eigh` rather than `eig` since the covariance matrix is symmetric. The performance gain when using `eigh` instead of `eig` is substantial.
Sort the eigenvectors and eigenvalues by decreasing order of the eigenvalues.
Get a subset of the eigenvectors (choose how many principal components you want to use using `n_components`).
Return the new transformation of the data by multiplying the eigenvectors with the original data.
def compute_pca(X: numpy.ndarray, n_components: int=2) -> numpy.ndarray:
    """Calculate the principal components for X

    Args:
        X: of dimension (m, n) where each row corresponds to a word vector
        n_components: Number of components you want to keep

    Return:
        X_reduced: the data transformed into n_components dims/columns
    """
    # you need to set axis to 0 or it will calculate the mean of the entire matrix instead of one per column
    X_demeaned = X - X.mean(axis=0)

    # calculate the covariance matrix
    # the default numpy.cov assumes the rows are variables, not columns, so set rowvar to False
    covariance_matrix = numpy.cov(X_demeaned, rowvar=False)

    # calculate the eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = numpy.linalg.eigh(covariance_matrix)

    # sort the eigenvalues in increasing order (get the indices from the sort)
    idx_sorted = numpy.argsort(eigen_vals)

    # reverse the order so that it's from highest to lowest
    idx_sorted_decreasing = list(reversed(idx_sorted))

    # sort the eigenvalues by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort the eigenvectors using the idx_sorted_decreasing indices
    # we're only sorting the columns so remember to get all the rows in the slice
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]

    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array)
    # once again, make sure to get all the rows and only slice the columns
    eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]

    # transform the data by multiplying the transpose of the eigenvectors
    # with the transpose of the de-meaned data,
    # then take the transpose of that product
    X_reduced = numpy.dot(eigen_vecs_subset.T, X_demeaned.T).T
    return X_reduced
I was getting the wrong values for some reason, so I decided to take out the call to random (since the seed was being set, the values were always the same anyway) and just declare the test input array.
X_reduced = compute_pca(X, n_components=2)
# eigen_vecs, eigen_subset, X_demeaned = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)

expected = numpy.array([
    [0.43437323, 0.49820384],
    [0.42077249, -0.50351448],
    [-0.85514571, 0.00531064],
])
numpy.testing.assert_almost_equal(X_reduced, expected)
Your original matrix was (3, 10) and it became:
[[ 0.43437323 0.49820384]
[ 0.42077249 -0.50351448]
[-0.85514571 0.00531064]]
Plot It
We'll use most of the non-country words to create a plot to see how well the PCA does.
labels = reduced.hvplot.labels(x="X", y="Y", text="Word", text_baseline="top")
points = reduced.hvplot.scatter(x="X", y="Y", color=Plot.blue, padding=0.5)

plot = (points * labels).opts(
    title="PCA of Words",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="pca_words")()
print(outcome)
It appears to have worked fairly well.
Sklearn Comparison
As a comparison here's what SKlearn's PCA does.
model = PCA(n_components=2)
reduced = model.fit(subset).transform(subset)
reduced = pandas.DataFrame(reduced, columns="X Y".split())
reduced["Word"] = words

labels = reduced.hvplot.labels(x="X", y="Y", text="Word", text_baseline="top")
points = reduced.hvplot.scatter(x="X", y="Y", color=Plot.blue, padding=0.5)

plot = (points * labels).opts(
    title="PCA of Words (SKLearn)",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="sklearn_pca_words")()
print(outcome)
They look fairly comparable, so I'll conclude that they're close (or close enough).
In this post I'm going to walk through the Lab for Coursera's NLP Specialization in which we take a look at Principal Component Analysis which we're going to use for Dimensionality Reduction later on. While PCA can be used as a black box it's useful to get an intuitive understanding of what it's doing so we'll take a look at a couple of simplified examples and pick apart a little bit of what's going on.
Imports
Just the usual suspects.
# python
from argparse import Namespace
from functools import partial

import math
import random

# pypi
from numpy.random import default_rng
from sklearn.decomposition import PCA

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews
Set Up
Plotting
This is a little bit of convenience code for the HoloViews plotting.
And since our line is at a \(45^\circ\) angle, the values in the Eigenvectors are the sin and cos of \(45^\circ\) that are used to rotate the line flat.
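For reference, the variance of a continuous uniform distribution over \([a, b]\) is:

\[
Var(X) = \frac{(b - a)^2}{12}
\]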
And remember that when we called the uniform function we set b to 2 and a to 1, so we get:
print((2-1)**2/12)
0.08333333333333333
If you look at the Eigenvalues we got, the second term is \(7 \times 10^{-33}\) which is pretty much zero, and the first term is about 0.16, so what we have here is.
It's actually closer to 0.167, but close enough. The point is that the first component contributed all of the variance and the second didn't contribute any.
Example Two: Normal Random Data
Now we'll move onto normally-distributed data so we can see something a little more interesting.
Generate the Data
Now we'll use numpy's random normal function to generate the data. The three arguments it takes are loc (the mean), scale (the standard deviation), and size (the number of numbers to generate).
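A minimal sketch of that step follows; the sample size and the two standard deviations are assumptions based on what shows up later in the post.

numpy_random = default_rng()

# x gets a standard deviation of 1, y a much smaller one (0.333),
# so the data will be stretched along one axis once it's rotated
x = numpy_random.normal(loc=0, scale=1, size=1000)
y = numpy_random.normal(loc=0, scale=0.333, size=1000)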
Even though we specify that the mean is 0, the data is generated randomly so the actual mean isn't exactly zero, and we'll center it.
print(f"x mean start: {x.mean()}")print(f"y mean start: {y.mean()}")x=x-x.mean()y=y-y.mean()print(f"\nx mean: {x.mean()}")print(f"y mean: {y.mean()}")
x mean start: -0.012000607736595292
y mean start: -0.01409218413437418
x mean: 3.552713678800501e-18
y mean: 2.6645352591003758e-18
Plot It
And now a plot to show the data.
data = pandas.DataFrame(dict(x=x, y=y))

plot = data.hvplot.scatter(x="x", y="y").opts(
    title="Random Normal Data",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.blue,
)
outcome = Embed(plot=plot, file_name="random_normal_data")()
print(outcome)
As you can see, the data is pretty uncorrelated so we're going to rotate it to make it a little less of a blob.
Rotate The Data
Now we're going to put the x and y data into a matrix and rotate it.
You might notice that this is the same rotation matrix that we had before with the sklearn eigenvectors, so we could have used that, but this is how you would roll your own.
Now we can apply the rotation by taking the dot-product between the data array and the rotation-matrix.
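Here's a sketch of those two steps, reusing the `data` frame built above. The exact sign layout of the rotation matrix is a convention choice that only controls the direction of the rotation.

# a hand-rolled 45-degree rotation matrix
angle = numpy.radians(45)
rotation_matrix = numpy.array([[numpy.cos(angle), numpy.sin(angle)],
                               [-numpy.sin(angle), numpy.cos(angle)]])

# rotate the centered data with a dot product
rotated = data.dot(rotation_matrix)
rotated.columns = ["x", "y"]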
To get a sense of what our transformation did we can plot it. In addition we'll plot the axes created by the rotation matrix so we can see how they're related. So first thing is to unpack the axes contained within the rotation matrix. In addition we'll scale the axes by the standard deviation we used along each of the original axes to see how that relates to the shape of the data.
first_axis_plot = first_axis.hvplot(x="x", y="y", color="red")
second_axis_plot = second_axis.hvplot(x="x", y="y", color="orange")
rotated_plot = rotated.hvplot.scatter(x="x", y="y", color=Plot.blue)

plot = (rotated_plot * first_axis_plot * second_axis_plot).opts(
    title="Rotated Normal Data",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="rotated_normal_data")()
print(outcome)
So our data is now grouped around a 45-degree angle and spread further along the axis that had more variance.
Apply the PCA
pca = PCA(n_components=2)
fitted = pca.fit(rotated)
Once again, the Eigenvectors (the transformation matrix).
We're going to plot the rotated and the transformed data along with the axes for the rotated data, so first we'll build up the individual plots and then overlay them.
transformed = pca_data.hvplot.scatter(x="x", y="y", color=Plot.red, fill_alpha=0)
rotated_plot = rotated.hvplot.scatter(x="x", y="y", color=Plot.blue, fill_alpha=0)
first_axis_plot = first_axis.hvplot(x="x", y="y", color="red")
second_axis_plot = second_axis.hvplot(x="x", y="y", color="orange")

plot = (transformed * rotated_plot * first_axis_plot * second_axis_plot).opts(
    title="PCA of Random Normal Data",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="pca_random_normal")()
print(outcome)
Looking at the model
The rotation matrix took the original uncorrelated variables and transformed them into correlated variables (the blue circles).
Fitting the PCA to our correlated data finds the rotation matrix that was used to create the blue points.
Applying the PCA transformation undoes the rotation (but the spread doesn't return).
Our original standard deviations were 1 and 0.333, and if we look at the Explained Variance it is roughly our original standard deviations squared.
print(numpy.sqrt(variance))
[0.99140088 0.32958007]
Dimensionality Reduction
The previous sections were meant to understand what PCA is doing, but to use the PCA for visualization we will use it to reduce the number of dimensions of a data set so that it can be plotted. We can get a sense of how that works here by looking at our rotated data set with either the entire x-axis set to 0 or the entire y-axis set to 0.
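As a sketch of what that looks like with the `rotated` data from before, it's just zeroing out one column at a time.

# collapse everything onto the x-axis (drop the y information)
flattened_y = rotated.copy()
flattened_y["y"] = 0

# collapse everything onto the y-axis (drop the x information)
flattened_x = rotated.copy()
flattened_x["x"] = 0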
This is a walk through a lab for week 3 of Coursera's Natural Language Processing course. It's going to use some pretrained word embeddings to develop some sense of how to use them.
Set Up
Imports
# python
from functools import partial
from pathlib import Path

import os
import pickle

# pypi
from dotenv import load_dotenv
from expects import (
    equal,
    expect,
)

import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews
Since there are 300 columns you can't easily visualize them without using PCA or some other method, but this is more about getting an intuition as to how the linear-algebra works, so instead we're going to reduce a subset of words to only two columns so that we can plot them.
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']

plot_data = pandas.DataFrame([embeddings[word] for word in words])
plot_columns = [3, 2]
plot_data = plot_data[plot_columns]
plot_data.columns = ["x", "y"]
plot_data["Word"] = words

origins = plot_data * 0
origins["Word"] = words

combined_plot_data = pandas.concat([origins, plot_data])

segment_plot = combined_plot_data.hvplot(x="x", y="y", by="Word")
scatter_plot = plot_data.hvplot.scatter(x="x", y="y", by="Word")

plot = (segment_plot * scatter_plot).opts(
    title="Embeddings Columns 3 and 2",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
)
outcome = Embed(plot=plot, file_name="embeddings_segments")()
print(outcome)
You can see that words like "village" and "town" are similar, while "city" and "oil" are opposites for whatever reason. Oddly, "joyful" and "country" are also very similar (although I'm only looking at two out of three hundred columns, so that might not hold once the other columns come into play).
Word Distance
This is supposed to be a visualization of the difference vectors between "sad" and "happy" and between "town" and "village", but as far as I can see holoviews doesn't have an equivalent of matplotlib's arrow, which lets you draw an arrow from a base coordinate and a distance in each dimension, so this is kind of a fake version where I use the points directly. Oh, well.
This is the fake part - when you take the difference between two "points" it gives you a vector with the base at the origin so you have to add the base point back in to move it from the origin, but then all you're doing is undoing the subtraction, giving you what you started with.
First I'll check out the norm of some word vectors using numpy.linalg.norm. This calculates the length (Euclidean norm) of a vector (but oddly we won't use it here).
Here we'll see how to use the embeddings to predict what country a city is the capital of. To encode the concept of "capital" into a vector we'll use the difference between a specific country and its real capital (in this case France and Paris).
capital=embeddings["France"]-embeddings["Paris"]
Now that we have the concept of a capital encoded as a word embedding we can add it to the embedding of "Madrid" to get a vector near where "Spain" would be. Note that although there is a "Spain" in the embeddings we're going to use this to see if we can find it without knowing that Madrid is the capital of Spain.
country=embeddings["Madrid"]+capital
To make a prediction we have to find the embeddings that are closest to a country. We're going to convert the embeddings to a pandas DataFrame and since our embeddings are a dictionary of arrays we'll have to do a little unpacking first.
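The unpacking code isn't shown here, but one way to do it, assuming `embeddings` is still the {word: vector} dictionary, is to let each word become a row and each of the 300 embedding values a column.

# rebuild the embeddings as a dataframe indexed by the words
embeddings = pandas.DataFrame.from_dict(embeddings, orient="index")
print(embeddings.shape)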
Now we'll make a function to find the closest embeddings for a word vector.
def closest_word(vector: numpy.ndarray) -> str:
    """Find the word closest to a given vector

    Args:
        vector: the vector to match

    Returns:
        name of the closest embedding
    """
    differences = embeddings - vector
    expect(differences.shape).to(equal(embeddings.shape))

    distances = (differences**2).sum(axis="columns")
    expect(distances.shape).to(equal((len(differences),)))

    return embeddings.iloc[numpy.argmin(distances)].name
Now we can check which word most closely matches Madrid + (France - Paris).
print(closest_word(country))
Spain
Like magic.
More Countries
What happens if we use a different known country and its capital instead of France and Paris?
For some reason "Lisbon" is closer to itself than portugal. I tried it with Germany and Italy instead of France as the template capital but it still didn't work. If you try random cities from the embeddings you'll see that a fair amount of them fail.
Sentence Vectors
To use this for sentences you construct a vector with all the vectors for each word and then sum up all the columns to get back to a single vector.
sentence="Canada oil city town".split()vectors=[embeddings.loc[token]fortokeninsentence]summed=numpy.sum(vectors,axis=0)print(closest_word(summed))
I implemented the Logistic Regression Tweet Sentiment Analysis classifier in this post but I'm going to re-use it later so this just gathers everything together. There's already a class called TweetSentiment but I'm going to add the training to this one as well as the tweet pre-processing and vectorization.
Middle
We'll start with the imports.
# from pypi
import attr
import numpy

# this project
from .counter import WordCounter
from .sentiment import TweetSentiment
from .vectorizer import TweetVectorizer
The Logistic Regression Class
@attr.s(auto_attribs=True)
class LogisticRegression:
    """train and predict tweet sentiment

    Args:
        iterations: number of times to run gradient descent
        learning_rate: how fast to change the weights during training
    """
    iterations: int
    learning_rate: float

    _weights: numpy.array = None
    loss: float = None
Weights
These are the weights for the regression function (\(\theta\)).
@property
def weights(self) -> numpy.array:
    """The weights for the regression

    Initially this will be an array of zeros.
    """
    if self._weights is None:
        self._weights = numpy.zeros((3, 1))
    return self._weights
The Weights Setter
@weights.setter
def weights(self, new_weights: numpy.array) -> None:
    """Set the weights to a new value"""
    self._weights = new_weights
    return
Sigmoid
def sigmoid(self, vectors: numpy.ndarray) -> float:
    """Calculates the logistic function value

    Args:
        vectors: a matrix of bias, positive, and negative word counts

    Returns:
        array of probabilities that the tweets are positive
    """
    return 1/(1 + numpy.exp(-vectors))
This is the training function; it fits the weights using gradient descent.
def gradient_descent(self, x: numpy.ndarray, y: numpy.ndarray):
    """Finds the weights for the model

    Args:
        x: the tweet vectors
        y: the positive/negative labels
    """
    assert len(x) == len(y)
    rows = len(x)
    self.learning_rate /= rows
    for iteration in range(self.iterations):
        y_hat = self.sigmoid(x.dot(self.weights))

        # average loss
        loss = numpy.squeeze(-((y.T.dot(numpy.log(y_hat)))
                               + (1 - y.T).dot(numpy.log(1 - y_hat))))/rows

        gradient = ((y_hat - y).T.dot(x)).sum(axis=0, keepdims=True)
        self.weights -= self.learning_rate * gradient.T
    return loss
Fit
This is mostly an alias to make it match (somewhat) sklearn's methods.
def fit(self, x_train: numpy.ndarray, y_train: numpy.ndarray) -> float:
    """Fits the weights for the logistic regression

    Note:
        as a side effect this also sets the ``counter`` and ``loss`` attributes

    Args:
        x_train: the training tweets
        y_train: the training labels

    Returns:
        The final mean loss (which is also saved as the ``loss`` attribute)
    """
    self.counter = WordCounter(x_train, y_train)
    vectorizer = TweetVectorizer(x_train, self.counter.counts, processed=False)
    y = y_train.values.reshape((-1, 1))
    self.loss = self.gradient_descent(vectorizer.vectors, y)
    return self.loss
Predict
def predict(self, x: numpy.ndarray) -> numpy.ndarray:
    """Predict the labels for the inputs

    Args:
        x: a list or array of tweets

    Returns:
        array of predicted labels for the tweets
    """
    vectorizer = TweetVectorizer(x, self.counter.counts, processed=False)
    sentimenter = TweetSentiment(vectorizer, self.weights)
    return sentimenter()
Score
def score(self, x: numpy.ndarray, y: numpy.ndarray) -> float:
    """Get the mean accuracy

    Args:
        x: array of tweets
        y: labels for the tweets

    Returns:
        mean accuracy
    """
    predictions = self.predict(x)
    correct = sum(predictions.T[0] == y)
    return correct / len(x)
End
Testing it out.
# python
from argparse import Namespace
from pathlib import Path

import math
import os

# pypi
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
)

import pandas

# this project
from neurotic.nlp.twitter.logistic_regression import LogisticRegression