Autocorrect: The System

Edit Distance

In this series of posts we'll implement models that correct words that are 1 and 2 edit distances away.

  • We say two words are n edit distance away from each other when we need n edits to change one word into another.

An edit could consist of one of the following options:

  • Delete (remove a letter): ‘hat’ => ‘at, ha, ht’
  • Switch (swap 2 adjacent letters): ‘eta’ => ‘eat, tea,…’
  • Replace (change 1 letter to another): ‘jat’ => ‘hat, rat, cat, mat, …’
  • Insert (add a letter): ‘te’ => ‘the, ten, ate, …’

We'll be using the four methods above to implement an Auto-correct system by computing the probabilities that a certain word is correct given an input word.

This auto-correct system was first created by Peter Norvig in 2007 in this article.

The goal of our spell check model is to compute the following probability:

\[ P(c|w) = \frac{P(w|c)\times P(c)}{P(w)} \tag{Equation 1} \]

The equation above is Bayes Rule, and it is saying that the probability of a word being correct \(P(c|w)\) is equal to the probability of having a certain word w, given that it is correct \(P(w|c)\), multiplied by the probability of being correct in general \(P(C)\) divided by the probability of that word w appearing \(P(w)\) in general.