N-Gram Pre-Processing
Beginning
Imports
# python
import os
import re
# pypi
from dotenv import load_dotenv
import nltk
Set Up
The Environment
load_dotenv("posts/nlp/.env")
The Data Set
This downloads punkt, the pre-trained tokenizer model that comes with NLTK.
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/hades/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
N-gram Corpus Preprocessing
Some common preprocessing steps for language models include:
- lowercasing the text
- removing special characters
- splitting the text into a list of sentences
- splitting each sentence into a list of words
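Taken together, these steps form a small pipeline. Here's a minimal sketch that chains them, reusing the re and nltk imports above (the helper name preprocess_corpus is just for illustration, and it assumes the punkt model downloaded earlier); each step is then shown individually in the sections that follow.

def preprocess_corpus(corpus: str) -> list:
    """Lowercase, strip special characters, and tokenize a corpus.

    Returns a list of tokenized sentences (each a list of words).
    """
    corpus = corpus.lower()
    # keep only letters, digits, sentence punctuation, and spaces
    corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
    sentences = nltk.sent_tokenize(corpus)
    return [nltk.word_tokenize(sentence) for sentence in sentences]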
Lowercase
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()
# note that the word "learning" is now the same regardless of its position in the sentence
print(corpus)
learning% makes 'me' happy. i am happy be-cause i am learning! :)
Remove Special Characters
corpus = "learning% makes 'me' happy. i am happy be-cause i am learning! :)"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)
learning makes me happy. i am happy because i am learning!
Splitting Text
# split the text on a delimiter into a list
input_date = "Sat May  9 07:33:35 CEST 2020"  # note the two spaces before the 9, as in `date` output
# get the date parts as a list
date_parts = input_date.split(" ")
print(f"date parts = {date_parts}")
# get the time parts by splitting the time field on ":"
time_parts = date_parts[4].split(":")
print(f"time parts = {time_parts}")
date parts = ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']
time parts = ['07', '33', '35']
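The same mechanics apply to splitting a corpus into sentences, although for natural-language text NLTK's sent_tokenize (backed by the punkt model downloaded earlier) handles punctuation better than a plain split. A small sketch:

text = "learning makes me happy. i am happy because i am learning!"
sentences = nltk.sent_tokenize(text)
print(sentences)

This should print something like ['learning makes me happy.', 'i am happy because i am learning!'].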
Tokenizing Sentences
sentence = 'i am happy because i am learning.'
tokenized_sentence = nltk.word_tokenize(sentence)
print(f'{sentence} -> {tokenized_sentence}')
i am happy because i am learning. -> ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
Find the Length of Each Word
sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
word_lengths = [(word, len(word)) for word in sentence] # Create a list with the word lengths using a list comprehension
print(f' Lengths of the words: \n{word_lengths}')
Lengths of the words:
[('i', 1), ('am', 2), ('happy', 5), ('because', 7), ('i', 1), ('am', 2), ('learning', 8), ('.', 1)]
Convert to N-Grams
def sentence_to_trigram(tokenized_sentence):
    """
    Prints all trigrams in the given tokenized sentence.

    Args:
        tokenized_sentence: The list of words.

    Returns:
        None
    """
    # the last window starts three words from the end of the sentence
    for i in range(len(tokenized_sentence) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tokenized_sentence[i : i + 3]
        print(trigram)
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
print(f'List all trigrams of sentence: {tokenized_sentence}\n')
sentence_to_trigram(tokenized_sentence)
List all trigrams of sentence: ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

['i', 'am', 'happy']
['am', 'happy', 'because']
['happy', 'because', 'i']
['because', 'i', 'am']
['i', 'am', 'learning']
['am', 'learning', '.']
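The trigram version generalizes directly to any n. Here's a sketch of an n-gram variant (sentence_to_ngrams is a name chosen for illustration, not from the original):

def sentence_to_ngrams(tokenized_sentence, n):
    """Yields every n-gram in the sentence as a tuple."""
    for i in range(len(tokenized_sentence) - n + 1):
        yield tuple(tokenized_sentence[i: i + n])

for bigram in sentence_to_ngrams(tokenized_sentence, 2):
    print(bigram)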
N-Gram Prefix
The probability of a word given its (n-1)-word prefix is estimated from counts of the n-gram and its prefix:

\begin{equation*} P(w_n|w_1^{n-1})=\frac{C(w_1^n)}{C(w_1^{n-1})} \end{equation*}
# get trigram prefix from a 4-gram
fourgram = ['i', 'am', 'happy', 'because']
trigram = fourgram[0:-1] # Get the elements from 0, included, up to the last element, not included.
print(trigram)
['i', 'am', 'happy']
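To tie the prefix back to the formula above, here's a minimal sketch of estimating a trigram probability from counts with collections.Counter (the example sentence and counts are illustrative):

from collections import Counter

tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

# count every trigram and every bigram prefix
trigram_counts = Counter(tuple(tokens[i: i + 3]) for i in range(len(tokens) - 2))
bigram_counts = Counter(tuple(tokens[i: i + 2]) for i in range(len(tokens) - 1))

# P(happy | i am) = C(i am happy) / C(i am) = 1/2, since 'i am' appears
# twice but is followed by 'happy' only once
print(trigram_counts[('i', 'am', 'happy')] / bigram_counts[('i', 'am')])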
Adding Start and End Tokens
n = 3
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
tokenized_sentence = ["<s>"] * (n - 1) + tokenized_sentence + ["</s>"]
print(tokenized_sentence)
['<s>', '<s>', 'i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.', '</s>']