Hands-on Assignment: Implementing Spell Check from Scratch

April 17, 2018

This hands-on assignment guides you through implementing spell checking (and correction) from scratch in Python. We are given about 25,000 words of text, of which approximately 2,500 are spelling errors. We will create a system to (a) detect which words are misspelled, and (b) suggest correct spellings for those words.

Overview

The model we will implement is the first one described in the Spell Checking and Correction tutorial. That is, $P_{c|w} \propto P_c \times P_{w|c}$, where $P_c$ is estimated using unigram probabilities from text, $P_{w|c} = k^{d(w,c)}$, and $d(w, c)$ denotes the edit distance between $w$ and $c$.
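To make the scoring rule concrete, here is a toy illustration in Python. The probabilities below are placeholders (only `prob_c['the']` matches the value computed later in this assignment; `'thaw'` and its probability are made up for the example), and `k = 0.001` is the value we will use later:

```python
# Toy illustration of P_{c|w} being proportional to P_c * k^{d(w, c)}.
k = 0.001
prob_c = {'the': 0.0616, 'thaw': 0.0001}  # hypothetical unigram probabilities

def score(word, candidate, dist):
    # proportional to P_{c|w}; dist is the edit distance d(word, candidate)
    return prob_c[candidate] * k ** dist

# For the misspelling 'thew', both 'the' and 'thaw' are one edit away,
# but 'the' wins because it is far more frequent.
print(score('thew', 'the', 1) > score('thew', 'thaw', 1))  # => True
```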

The code blocks included below can be run one after the other. The parts you need to write are marked with triple dots (...).

Dataset

The dataset for this task can be downloaded from the following link: spell_check. It is a slightly cleaned-up version of Corpora of misspellings for download. Some sample lines from the file are included below:

I have four in my Family Dad Mum and <ERR targ=sister> siter </ERR> .
...
I <ERR targ=watch> wach </ERR> it each night.
...
The body was found <ERR targ=at> a </ERR> 7 <ERR targ=o'clock> oclock </ERR> in the morning.

It’s just text, with incorrect spellings enclosed in ERR tags. The syntax is the following: <ERR targ=[CORRECTION]> [WORD] </ERR>.

###############################################################################
## normalize (data holds the raw text of the downloaded file)
import re
data = ' '.join(data.split())
data = data.lower()
###############################################################################
## tokenize
corrected = re.sub("<err targ=(.*?)> (.*?) </err>", '\\1', data)
original = re.sub("<err targ=(.*?)> (.*?) </err>", '\\2', data)
corrected = re.findall("[a-z0-9'-]+", corrected)
original = re.findall("[a-z0-9'-]+", original)
print(original[:25])
print(corrected[:25])
###############################################################################

After loading the data and tokenizing it, we are left with two lists of tokens, original and corrected. Here are the first 25 tokens from each list:

original: ['1', 'nigel', 'thrush', 'page', '48', 'i', 'have',
'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'siter',
'my', 'dad', 'works', 'at', 'melton', 'my', 'siter', 'go',
'to', 'tonbury']
corrected: ['1', 'nigel', 'thrush', 'page', '48', 'i', 'have',
'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister',
'my', 'dad', 'works', 'at', 'melton', 'my', 'sister', 'goes',
'to', 'tonbury']

Implementing Spell Check

Step 1: Global constants and vocabulary

###############################################################################
## create vocab
from collections import Counter
alphabets = "abcdefghijklmnopqrstuvwxyz0123456789'-"
vocab = Counter(corrected)
print(len(vocab))
print(vocab.most_common(10))
###############################################################################

alphabets is the string of characters that a token / word may consist of. vocab is a dictionary, with vocab[word] being the number of times word appeared. In general, vocab should be estimated from a large, clean text corpus such as Wikipedia.
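If you have not used Counter before, it behaves like a dictionary that maps each item to its count, with a convenient most_common method. A quick toy example:

```python
from collections import Counter

# Counting a tiny token list; the real vocab is built from `corrected`.
toy_vocab = Counter(['the', 'cat', 'sat', 'on', 'the', 'mat'])
print(toy_vocab['the'])          # => 2
print(toy_vocab.most_common(1))  # => [('the', 2)]
```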

Step 2: Estimating $P_c$

###############################################################################
## estimate prob_c using vocab
prob_c = dict()
...
print(prob_c['the']) # => 0.0616
print(prob_c['and']) # => 0.0423
print(prob_c['a'])   # => 0.0293
###############################################################################

Calculate the values of $P_c$ and store them in prob_c.
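If you want something to check your approach against after attempting it yourself: one straightforward completion normalizes each count by the total number of tokens. This sketch uses a toy Counter in place of the real vocab:

```python
from collections import Counter

# Toy stand-in for the vocab built in Step 1.
toy_vocab = Counter(['the', 'the', 'the', 'cat'])

# One possible way to estimate P_c: count / total token count.
total = sum(toy_vocab.values())
toy_prob_c = {word: count / total for word, count in toy_vocab.items()}
print(toy_prob_c['the'])  # => 0.75
print(toy_prob_c['cat'])  # => 0.25
```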

Step 3: Estimating $P_{w|c}$

Calculating the values of $P_{w|c}$ requires creating a list of all words within an edit distance of 1 from words in our vocabulary. You need to complete the function all_edit_distance_one. You can turn the DEBUG flag on and off to help with debugging.
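If you get stuck, or want something to compare against after attempting it yourself, here is one compact way to generate all strings within edit distance 1, written with list comprehensions rather than the explicit loops used in the skeleton:

```python
alphabets = "abcdefghijklmnopqrstuvwxyz0123456789'-"

def edits_one(word):
    # insert one character at every position: 38 * (len(word) + 1) candidates
    inserts = [word[:i] + ch + word[i:] for i in range(len(word) + 1) for ch in alphabets]
    # delete one character at every position
    deletes = [word[:i] + word[i+1:] for i in range(len(word))]
    # replace one character at every position
    replaces = [word[:i] + ch + word[i+1:] for i in range(len(word)) for ch in alphabets]
    # deduplicate and drop the original word itself
    return list(set(inserts + deletes + replaces) - {word})

edits = set(edits_one('the'))
print('thew' in edits, 'th' in edits, 'tha' in edits)  # => True True True
print('the' in edits)                                  # => False
```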

###############################################################################
## estimate prob_w_given_c
DEBUG = False
def all_edit_distance_one(word):
    # inserts
    inserts = list()
    for position in range(len(word)+1):
        for alphabet in alphabets:
            new_word = ...
            inserts.append(new_word)
    if DEBUG: print(inserts)
    # deletes
    deletes = list()
    for position in range(len(word)):
        new_word = ...
        deletes.append(new_word)
    if DEBUG: print(deletes)
    # replacements
    replacements = list()
    ... # calculate replacements
    ... # append words to list
    if DEBUG: print(replacements)
    return list(set(inserts + deletes + replacements) - set([word]))
print(sorted(all_edit_distance_one('the')))
prob_w_given_c = dict()
for word in vocab:
    prob_w_given_c[(word, word)] = 0.001 ** 0
    for error in all_edit_distance_one(word):
        prob_w_given_c[(error, word)] = 0.001 ** 1
print(prob_w_given_c[('thew', 'the')]) # => 0.001
print(prob_w_given_c[('thw', 'the')])  # => 0.001
print(prob_w_given_c[('the', 'the')])  # => 1.000
print(prob_w_given_c[('a', 'an')])     # => 0.001
print(prob_w_given_c[('an', 'an')])    # => 1.000
print(prob_w_given_c[('and', 'an')])   # => 0.001
###############################################################################

Step 4: Predicting

###############################################################################
## predict
def predict(word):
    candidates = [word] + all_edit_distance_one(word)
    scores = dict()
    for candidate in candidates:
        if candidate not in prob_c: continue
        if (word, candidate) not in prob_w_given_c: continue
        scores[candidate] = ... # score the candidate
    if not scores: return word
    return ... # candidate with highest score
print(predict('thew')) # => 'the'
print(predict('the'))  # => 'the'
###############################################################################

In the above code, we generate and score a list of possible candidates for each word in our data. You need to write the code for scoring a particular candidate using the already computed probabilities, and for returning the candidate with the highest score.
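As a sanity check, here is one possible way to finish the scoring and selection logic, shown as a self-contained sketch. The probability tables are toy stand-ins for the real ones, and the function takes its candidate list as an argument so the sketch runs on its own (the assignment's predict generates candidates itself):

```python
# Toy stand-ins for the tables computed in Steps 2 and 3.
prob_c = {'the': 0.0616, 'thaw': 0.0001}
prob_w_given_c = {('thew', 'the'): 0.001, ('thew', 'thaw'): 0.001,
                  ('the', 'the'): 1.0}

def predict_sketch(word, candidates):
    scores = dict()
    for candidate in candidates:
        if candidate not in prob_c: continue
        if (word, candidate) not in prob_w_given_c: continue
        # score proportional to P_{c|w} = P_c * P_{w|c}
        scores[candidate] = prob_c[candidate] * prob_w_given_c[(word, candidate)]
    if not scores: return word
    return max(scores, key=scores.get)  # candidate with the highest score

print(predict_sketch('thew', ['thew', 'the', 'thaw']))  # => 'the'
print(predict_sketch('zzz', ['zzz']))                   # => 'zzz' (nothing known)
```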

Evaluation

You can use the following code to evaluate the system.

###############################################################################
## evaluate
def count_matches(lst1, lst2):
    return sum((xx == yy) for xx, yy in zip(lst1, lst2))
predictions = [predict(xx) for xx in original]
print('Total words:', len(corrected))
print('Total already correct:', count_matches(corrected, original))
print('Total correct after spell correction:', count_matches(corrected, predictions))
print('')
very_bad_count = bad_count = good_count = 0
for xx, yy, pp in zip(original, corrected, predictions):
    if xx == yy and pp != yy:
        very_bad_count += 1
        if very_bad_count <= 10: print('Very bad:', xx, yy, pp)
    if pp == yy and xx != yy:
        good_count += 1
        if good_count <= 10: print('Good:', xx, yy, pp)
    if xx != yy and pp != yy:
        bad_count += 1
print('')
print('Words that were correct but made incorrect by the system (very bad):', very_bad_count)
print('Words that were incorrect and were not corrected by the system (bad):', bad_count)
print('Words that were incorrect and were corrected by the system (good!):', good_count)
###############################################################################

It produced the following output for me (and if you did everything right, it should produce the exact same output for you).

Total words: 25641
Total correct after spell correction: 23850
Good: siter sister sister
Good: siter sister sister
Good: clob club club
Good: frount front front
Good: squar square square
Good: killd killed killed
Good: rings ring ring
Good: wiskey whiskey whiskey
Words that were correct but made incorrect by the system (very bad): 3
Words that were incorrect and were not corrected by the system (bad): 1788
Words that were incorrect and were corrected by the system (good!): 600

We were able to correct 600 out of ~2,400 total errors. We also introduced 3 errors that were not there originally. Some samples for each type of outcome are displayed. We see the following patterns.

1. Our language model isn’t great, i.e. relying on the most frequent word within an edit distance of 1 leads to errors. For example, wach is corrected to each instead of watch.
2. Similarly, the system makes errors when a misspelling happens to be a legal word, but not the intended one. For example, colbe is a legal word as per our dictionary, but should have been corrected to club.
3. Very popular words like and and the attract their neighbors: rare words within an edit distance of 1 of them will always be treated as spelling mistakes.

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your own system.

Bonus: Ideas for further improvement

Based on the discussion so far, here are some ideas for further improvements.

1. We can expect a lot of improvement from taking context into account. As such, $P_c$ should not be estimated using unigram frequencies (i.e. raw frequency of word occurrence) alone. If we instead used a bigram language model, which estimates the probability of a word given the previous word, $P_c$ would be much more accurate.
2. Analyze how many times we simply had no candidates because the misspelling was an edit distance of 2 away. If this happened often, extend the solution to words an edit distance of 2 away. (Challenge: since we have already calculated everything an edit distance of 1 away from every word in the vocabulary, as well as from every word in the text, this can be done without significantly increasing the computational requirements of the code. This strategy is called Meet in the Middle.)
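The Meet in the Middle idea from point 2 can be sketched as follows: a word $w$ is within edit distance 2 of a vocabulary word $c$ exactly when their edit-distance-1 neighborhoods share at least one string (treating each word as a trivial neighbor of itself). The helper below is a simplified, hypothetical illustration using only lowercase letters, not part of the assignment code:

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def neighbors(word):
    # All strings reachable by one insert, delete, or replace.
    # Self-replacements keep the word itself in the set, which is what
    # lets this check also cover distances 0 and 1.
    inserts = [word[:i] + ch + word[i:] for i in range(len(word) + 1) for ch in LETTERS]
    deletes = [word[:i] + word[i+1:] for i in range(len(word))]
    replaces = [word[:i] + ch + word[i+1:] for i in range(len(word)) for ch in LETTERS]
    return set(inserts + deletes + replaces)

def within_distance_two(w, c):
    # Meet in the middle: intersect the two distance-1 neighborhoods.
    return w == c or bool(neighbors(w) & neighbors(c))

print(within_distance_two('wach', 'watch'))   # distance 1 => True
print(within_distance_two('wachh', 'watch'))  # distance 2 => True
print(within_distance_two('xyz', 'watch'))    # => False
```

In the assignment's setting you would precompute the neighborhoods of all vocabulary words once, then intersect each misspelling's neighborhood against them, rather than recomputing per pair as this sketch does.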

Hope you enjoyed the assignment!