Autocorrect: Finding Candidates Using Edits
Beginning
Middle
Our Data
This is the word we are going to autocorrect.
word = "dearz"
Splits
This finds all the splits in our word.
splits = [(word[:index], word[index:]) for index in range(len(word) + 1)]
for split in splits:
print(split)
('', 'dearz') ('d', 'earz') ('de', 'arz') ('dea', 'rz') ('dear', 'z') ('dearz', '')
Delete
Starting with the splits, delete the first letter of the second element of each tuple. This means that each letter gets deleted exactly once.
deletes = [left + right[1:] for left, right in splits if right]
for deleted in deletes:
print(f" - {deleted}")
- earz - darz - derz - deaz - dear
These are now the candidates to use for the correction.
Filter Out
Since not all the candidates are real words you need to filter out all but the real candidates using a pre-defined vocabulary.
vocabulary = ['dean','deer','dear','fries','and','coke']
candidates = set(vocabulary).intersection(set(deletes))
print(candidates)
{'dear'}
End
This doesn't demonstrate all the edits (nor the use of edit distance) but the remaining types of edits are done in a similar manner to these two.