PCPlus 280: Writing a spellchecker : Algorithms for the masses

PCPlus 280: Writing a spellchecker

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there’s an Xmas issue as well — so it’s a bit more than monthly). The column is called Theory Workshop and appears in the Make It section of the magazine. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. What I’ll do is publish the article from a year ago or so here when I purchase the current issue.

Since I picked up April 2010’s issue when I was in England just over a week ago, I’m publishing this article from April 2009 slightly earlier than I usually do (which means May 2009’s article will be posted in June or something).

The topic is pretty interesting: how does a spell-checker give you alternatives to a badly-spelled word? The first step — finding out if a word is misspelled — is simple enough as a basic algorithm: just look it up in a big ol’ list o’ words. The next step can seem daunting though: how can you generate a valid list of possible corrections? The article talks about a few possibilities: the Levenshtein distance, the Damerau algorithm, and the Soundex method.

The Levenshtein distance is a bit impractical for a large list of words, although there are certain techniques you can use to speed things up. It essentially calculates the edit distance (the number of character insertions and deletions to go from one word to another) of the misspelled word and every other word in the list. The words with the smallest edit distance are your correction candidates. As I said, possibly too slow for a large dictionary, although it’s a great algorithm for creating diff tools.

The Damerau algorithm essentially calculates all the words that can be formed form the misspelled word by a single mistyped letter and checks those in the dictionary. Those that are present are valid correction candidates. Although single mistakes are common, two or more mistyped letters in a single word is fairly rare, so the algorithm does pretty well (as the article shows).

The Soundex algorithm is a “phonetic” algorithm: it finds candidate words that “sound like” the misspelled word.

Of course the article doesn’t go into deeper, harder algorithms for space reasons. For example, Bloom filters are great for compressing the dictionary at the expense of a few false positives and can be used to replace the article’s hash table for the word list. Different dictionaries could be employed that just store root words and then describe possible suffixes, prefixes, and circumfixes that can go with each word (for example, Hunspell). Better phonetic algorithms now exist than Soundex (for example, Metaphone). Nevertheless, I think the article hangs together quite well and I’m pretty pleased with it.

This article first appeared in issue 280, April 2009.

You can download the PDF here.

Now playing:
Armstrong, Craig - Revelations
(from Plunkett & Macleane [Original Score])

Tue 16-Mar-2010 9:50 PM PCPlus / tags: spellchecker damerau diff soundex levenshtein

Loading links to posts on similar topics...

previous post next post

4 Responses

#1 Ruud said...

17-Mar-10 2:44 AM

In the past I worked with a dictionary (for a song book) in Dutch.

In Dutch (especially in poetry) an apostrophe can be used for ommision of one or more letters. This can even be the case at the beginning or the end of a word.

But the same computer character that is used for the apostrophe is also used for the single quotation mark (at least in that ASCII time). So an apostrophe at the end of a word (with no whitespace in between) can be part of the word or part of the quotation...

I have never found a satifying solution for this problem but manual intervention. Are you aware of solutions ot this problem?

If you have a dictionary, looking it up is a possibility, but if you want to extract a dictionary (which was the case I was dealing with), what then?

(Side note: Dutch also uses compound words, like German does.)

#2 Mark Gohara said...

17-Mar-10 5:51 AM

A better solution is the metaphone algorithm than soundex it seems to be more precise.

#3 julian m bucknall said...

17-Mar-10 7:35 PM

Ruud: the problem with a spellchecker is that you tend to 'embed' knowledge about the language you're spellchecking into the algorithm. So, although a lot of what I described would work in Dutch, quite a bit wouldn't. You've brought up a good issue: essentially it's an area which would require more language expertise.

Cheers, Julian

#4 julian m bucknall said...

17-Mar-10 7:42 PM

Mark: Quite. Which is why I mentioned it in passing in this blog post. Metaphone is one of those algorithms which is just horrendous in terms of 'special cases' and one that would really not be suitable for a sidebar in an article for the layman.

For instance, I can't remember off-hand how it treats -ough endings: rough, bough, hiccough, cough, dough, and so on. Soundex is crap at this, but the discussion of how Metaphone caters for them would be tedious.

Cheers, Julian

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

Emphasize with italics: surround word with underscores _emphasis_
Emphasize strongly: surround word with double-asterisks **strong**
Link: surround text with square brackets, url with parentheses [text](url)
Inline code: surround text with backticks `IEnumerable`
Unordered list: start each line with an asterisk, space * an item
Ordered list: start each line with a digit, period, space 1. an item
Insert code block: start each line with four spaces
Insert blockquote: start each line with right-angle-bracket, space > Now is the time...

by Julian M Bucknall