Hands-on Assignment: Implementing Spell Check from Scratch

April 17, 2018

This hands-on assignment guides your through implementing spell checking (and correction) from scratch in Python. We are given about 25,000 words of text, out of which approximately 2500 are spelling errors. We will create a system to (a) detect which words are misspelled, and (b) suggest correct spellings for those words.

Overview

The model we will implement will is the first one described in the Spell Checking and Correction tutorial. That is, $P_{c|w} \propto P_{c} \times P_{w|c}$ where $P_c$ will be estimated using unigram probabilities from text, and $P_{(w|c)} = k^{d(w,c)}$, and $d(w, c)$ denotes the edit distance.

The code included below can be run one after the other. The parts you need to write are marked with triple dots (...).

Dataset

The dataset for this task can be downloaded from the following link: spell_check. It is a slightly cleaned up version of Corpora of misspellings for download. Some sample lines from the the file are included below,

I have four in my Family Dad Mum and <ERR targ=sister> siter </ERR> .
... 
I <ERR targ=watch> wach </ERR> it each night.
... 
The body was found <ERR targ=at> a </ERR> 7 <ERR targ=o'clock> oclock </ERR> in the morning.

It’s just text, with incorrect spellings enclosed in ERR tags. The syntax is the following: <ERR targ=[CORRECTION]> [WORD] </ERR>.

After downloading the dataset, you can load using the following code.

###############################################################################
## load data
data = open('spell_check').read()
data = ' '.join(data.split())
data = data.lower()
###############################################################################
## tokenize
corrected = re.sub("<err targ=(.*?)> (.*?) </err>", '\\1', data)
original = re.sub("<err targ=(.*?)> (.*?) </err>", '\\2', data)
corrected = re.findall("[a-z0-9'-]+", corrected)
original = re.findall("[a-z0-9'-]+", original)
print(original[:25])
print(corrected[:25])
###############################################################################

After loading the data and tokenizing it, we are left with two lists, original and corrected. Each list is a list of tokens. Here are the first 25 tokens from each lists:

original: ['1', 'nigel', 'thrush', 'page', '48', 'i', 'have', 
    'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'siter', 
    'my', 'dad', 'works', 'at', 'melton', 'my', 'siter', 'go', 
    'to', 'tonbury']
corrected: ['1', 'nigel', 'thrush', 'page', '48', 'i', 'have', 
    'four', 'in', 'my', 'family', 'dad', 'mum', 'and', 'sister', 
    'my', 'dad', 'works', 'at', 'melton', 'my', 'sister', 'goes', 
    'to', 'tonbury']

Implementing Spell Check

Step 1: Global constants and vocabulary

###############################################################################
## create vocab
alphabets = "abcdefghijklmnopqrstuvwxyz0123456789'-"
vocab = Counter(corrected)
print(len(vocab))
print(vocab.most_common(10))
###############################################################################

alphabets is the list of characters that a token / word consists of. vocab is a dictionary, with vocab[word] being the number of times a word appeared. In general, vocab should be estimated from a large clean text corpus such as Wikipedia.

Step 2: Estimating $P_c$

###############################################################################
## estimate prob_c using vocab 
prob_c = dict()
... 
print(prob_c['the']) # => 0.0616
print(prob_c['and']) # => 0.0423
print(prob_c['a'])   # => 0.0293
###############################################################################

Calculate the values of $P_c$ and store it in prob_c.

Step 3: Estimating $P_{(w|c)}$

Calculating the values of $P_{(w|c)}$ requires creating a list of all words within an edit distance of 1 from words in our vocabulary. You need to complete the function all_edit_distance_one. You can turn the DEBUG flag on and off to help with debugging.

###############################################################################
## estimate prob_w_given_c
DEBUG = False
def all_edit_distance_one(word):
    # inserts
    inserts = list()
    for position in range(len(word)+1):
        for alphabet in alphabets:
            new_word = ... 
            inserts.append(new_word)
    if DEBUG: print(inserts)
    # deletes
    deletes = list()
    for position in range(len(word)):
        new_word = ... 
        deletes.append(new_word)
    if DEBUG: print(deletes)
    # replacements
    replacements = list()
    ... # calculate replacements 
    ... # append words to list 
    if DEBUG: print(replacements)
    return list(set(inserts + deletes + replacements) - set([word]) )
print(sorted(all_edit_distance_one('the')))
prob_w_given_c = dict()
for word in vocab:
    prob_w_given_c[(word, word)] = 0.001 ** 0
    for error in all_edit_distance_one(word):
        prob_w_given_c[(error, word)] = 0.001 ** 1
print(prob_w_given_c[('thew', 'the')]) # => 0.001
print(prob_w_given_c[('thw', 'the')])  # => 0.001
print(prob_w_given_c[('the', 'the')])  # => 1.000
print(prob_w_given_c[('a', 'an')])     # => 0.001
print(prob_w_given_c[('an', 'an')])    # => 1.000
print(prob_w_given_c[('and', 'an')])   # => 0.001
###############################################################################

Step 4: Predicting

###############################################################################
## predict
def predict(word):
    candidates = [word] + all_edit_distance_one(word)
    scores = dict()
    for candidate in candidates:
        if candidate not in prob_c: continue
        if (word, candidate) not in prob_w_given_c: continue
        scores[candidate] = ... # score the candidate 
    if not scores: return word
    return ... # candidate with highest score 
print predict('thew') # => 'the'
print predict('the')  # => 'the'
###############################################################################

In the above code, we generate and score a list of possible candidates for each word in our data. You need to write the core for scoring a particular candidate using the already computed probabilities, and returning the candidate with the highest score.

Evaluation

You can use the following code to evaluate the system.

###############################################################################
## evaluate
def count_matches(lst1, lst2):
    return sum((xx == yy) for xx, yy in zip(lst1, lst2))
predictions = [predict(xx) for xx in original]
print 'Total words:', len(corrected)
print 'Total already correct:', count_matches(corrected, original)
print 'Total correct after spell correction:', count_matches(corrected, predictions)
print ''
very_bad_count = 0
bad_count = 0
good_count = 0
for xx, yy, pp in zip(original, corrected, predictions):
    if xx == yy and pp != yy:
        very_bad_count += 1
        if very_bad_count <= 10: print 'Very Bad:', xx, yy, pp
    if pp == yy and xx != yy:
        good_count += 1
        if good_count <= 10: print 'Good:', xx, yy, pp
    if xx != yy and pp != yy:
        bad_count += 1
        if bad_count <= 10: print 'Bad:', xx, yy, pp
print ''
print 'Words that were correct but made incorrect by the system (very bad):', very_bad_count
print 'Words that were incorrect and were not corrected by the system (bad):', bad_count
print 'Words that were incorrect and were corrected by the system (good!):', good_count
###############################################################################

It produced the following output for me (and if you did everything right, it should produce the exact same output for you).

Total words: 25641
Total already correct: 23253
Total correct after spell correction: 23850
Good: siter sister sister
Good: siter sister sister
Bad: go goes go
Good: clob club club
Bad: wakh watch wash
Good: frount front front
Bad: sexeon second sexton
Bad: wach watch each
Bad: wach watch each
Bad: colbe club colbe
Bad: wach watch each
Bad: wach watch each
Bad: thing think thing
Good: squar square square
Bad: iyes eyes yes
Good: killd killed killed
Good: rings ring ring
Good: marsks masks masks
Good: marsks masks masks
Good: wiskey whiskey whiskey
Very Bad: band band and
Very Bad: land land and
Very Bad: tie tie the
Words that were correct but made incorrect by the system (very bad): 3
Words that were incorrect and were not corrected by the system (bad): 1788
Words that were incorrect and were corrected by the system (good!): 600

We were able to correct 600 out of ~2400 total errors. We also added 3 errors that were not there originally. Same samples for each type of error are displayed. We see the following patterns.

Our language model isn’t great, i.e. relying on the most frequently used word within edit distance of 1 leads to errors. For example, wach is being corrected to each instead of watch.
Similarly, it causes errors when the word is legal, but not the intended word. For example, colbe is a legal word as per our dictionary, but should have been corrected to club.
For very popular words like and and the, neighboring words will always be treated as spelling mistakes.

Solution on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your own system.

Bonus: Ideas for further improvement

Based on the discussion so far, here are some ideas for further improvements.

We can expect a lot of improvement from taking context into account. As such, $P_c$ should not be estimated using unigram (i.e. frequency of word occurrence). If instead we used a bigram language model, then $P_c$ would be much more accurate, i.e. estimate probability of next word given previous word.
Analyze how many times did we simply not have candidates, since the misspelling was edit distance 2 away. If this happened often, extend the solution for words edit distance 2 away. (Challenge: Since we calculated everything edit distance 1 away for every word in the vocabulary, as well as for every word in the text, this can be done without increasing the computational requirements of the code. This strategy is called Meet in the Middle).

Hope you enjoyed the assignment!