Automatic spelling correction in an important and interesting problem in NLP. This article walks you through the approaches and algorithms that can be used to understand automatic spell checking and correction.

Companies like Microsoft and Google have achieved high accuracy in spelling correction, which forms an integral part of their editors, email clients, word processors and search engines. In addition, spell checking is also often used as a pre-processing step in NLP to clean up text data before feeding it to machine learning algorithms for topic classification, search engine ranking, etc.

# Spell Checking Illustration in Python

Here is a small python code that does spell checking and suggests the closest correction.

from autocorrect import spell

spell('somthing')

'Something'

Ensure that you have the python package installed.

pip install autocorrect

# Spell Check Implementation

The concept behind a spell checker is fairly straight-forward. We start with the Bayes' theorem. Probability of correction **c** out of all possible candidate corrections, given the original word **w** is:

P_{c|w} = \frac{P_{c} \times P_{w|c}}{P_w}

Since P_w is the same for every possible candidate **c**, we can factor it out. Thus, we get

P_{c|w} \propto P_{c} \times P_{w|c}

where

- P_c, the probability that c appears as a word of a language and can be found from the language model. For example, occurrences of "the" make up about 7% of English text, so we should have P
_{(the)} = 0.07. - P_{w|c}, the error model is the probability that w would be typed in a text when the author meant c. For example, P
_{(teh|the)} is relatively high, but P_{(theeexyz|the)} would be very low.

Intuitively, P_{c|w} is effected by two factors. How often is the word *c* overall, and closely are the spelling of *w* and *c*.

**Example:** Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". The P_{(c|w) }have to consider both the probability of c and the probability of the change from c to w. For example,

**P**_{(the|thew)} = P_{(the)} * P_{(thew|the)}

and

**P**_{(thaw|thew)} = P_{(thaw)} * P_{(thew|thaw)}

For the correction "thaw", the only change is "a" to "e", therefore the error model P_{(thew|thaw) }would be high. But on the other hand, since "the" is a very common word, P_{(the) }would be very high. Therefore eventually, the combination of would would decide the closest spell correction.

Next, we'll see how to estimate each of these values.

# Estimating likelihood of a word P_{(c)}

P_{c} can be estimated by simply counting the number of times the each word appears in a text dataset. For example, occurrences of "the" make up about 7% of English text, so we should have P_{(the)} = 0.07.

We need to only store words that appear at-least somewhat commonly. For example, say at-least 5 times in the dataset.

# Estimating likelihood of a typo P_{(w|c)}

## Approach 1: Edit Distance

To understand this part, it is important to first understand what is **edit distance** between two strings. The edit distance between any two strings is the minimum number of edit operations required to transform string1 to string2. The edit operations could be the insertion, deletion or replacement of a character in the string. For example, the edit distance between boy and bat is 2 and between cats and cat is 1.

The best suggested correction candidates would have the minimal edit distance from the misspelled word. However, there could be various alternative corrections with the same edit distance, for example the correction for "lates" could be "late", "later" or "latest", each have an edit distance of 1. There is no way to know for sure the actual replacement candidate and therefore the probabilities P_{c} would help us decide the best ones having maximum probabilities.

Mathematically, we say P_{(w|c)} = k^{d(w, c)}, where k is a constant (say 0.001), and d(w, c) is the edit distance between w and c. This means that if there are two corrections c_{1} and c_{2} at edit distance of 1 and 2 from w, then P_{(w|c1)} is 1000 times more than P_{(w|c2)}.

To calculate edit distance efficiently, we generate all possible terms with an edit distance <=2 for each query term and search them in a hash-map of all words in the dictionary. This way it is not required to find edit distance with each word in the dictionary.

## Approach 2: N-Gram Analysis

Another way find the closest correction for the spelling error is N-gram analysis. In this method, instead of comparing the entire word in a text to a dictionary, just n-grams (characters) are used to find the similarity between strings. An n-gram is a set of consecutive characters taken from a string with a length of whatever n is set to. Let’s understand how to find the similarity coefficient between 2 strings using n-grams with a simple example.

**Example: **Consider 2 strings “statistics” and “statistical”. If n is set to 2 (bi-grams are being extracted), then the similarity of the two strings is calculated as follows:

Initially, the two strings are split into bi-grams:

Statistics - st ta at ti is st ti ic cs 9 bigrams

Statistical - st ta at ti is st ti ic ca al 10 bigrams

Then find the unique bi-grams in each string

Statistics - st ta at is ti ic cs (7 unique bigrams)

Statistical - st ta at ti is ic ca al (8 unique bigrams)

Next, find the unique bi-grams that are shared with both the terms.

There are 6 such bi-grams: st ta at ic is ti.

The similarity measure is calculated using similarity coefficient with the following formula:

Similarity coefficient = 2*C/A+B

- A - unique n-grams in term 1.
- B - unique n-grams in term 2.
- C - unique n-grams appearing in term 1 and term 2.

The above example would produce the result (2*6) / (7+8) = 0.80. Higher the similarity measure is, more relevant is the word for correction.

Again, we have P_{(w|c)} = k^{s(w, c)}, where k is a (larger) constant (say 100,000), and s(w, c) is the n-gram similarity between w and c.

Implementing N-grams would need scouting for the word in a large list of word dictionary. To implement this efficiently, we would maintain a hash-map from combinations of n-grams to words in the dictionary.

## Rule based

In addition to the above, we can also use common rules such as grammatical rules, homophones, etc. to detect spell errors and increase the probability of typing one word for another. Examples of such rules would be, (a) the wrong use of indefinite article "an" instead of "a" before a word beginning with a vowel sound, or (b) "write" and "right" used in a wrong way.

# Conclusion

In this tutorial, we saw some approaches for spell checking and correction using concepts of edit distance and n-gram similarity score. A more detailed explanation and implementation of the spell check algorithm is available at Peter Norvig’s blog.