I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months. When I buy the current issue, I'll publish the article from the issue a year ago. Since I've now got July's issue, here's July 2008's article.
Although the article is short, the usual two pages, the topic interests me greatly and I wish I could do it more justice. It's about calculating the longest common subsequence (LCS) of two strings and thereby calculate their edit distance and from that, the diffs or the differences between the two strings. Although the diffs between two strings is not that interesting (to a programmer, at least, although I can imagine it's highly meaningful in genetics), the same process can be applied to two text files and the results of that are of great importance. Every version control system out there stores revisions to a file as some kind of diff, for compression, if nothing else.
The standard algorithm I described in the article is pretty simple to understand, but astonishing in that it's so counter-intuitive. I can't imagine how it came about; although, in reality, it's just a dynamic programming problem, when all's said and done. The algorithm is pretty efficient — I even incorporated it into a demo diff program I wrote at Microsoft to show off the new color support in the Console class during the Whidbey cycle (Visual Studio 2005) — but with large files/strings the amount of memory needed can get unwieldy and things slow down. (The algorithm is quadratic in both time and space: O(n*m) essentially.)
There's a version of the algorithm that requires linear space instead, from Dan Hirschberg's 1975 paper A linear space algorithm for computing maximal common subsequences. There are other improvements that deal with special cases, for example in the fields of genetics.
This article first appeared in issue 270, July 2008.
You can download the PDF here.
Now playing:
Somerville, Jimmy - Was That All It Was
(from Suddenly Last Summer)
3 Responses
#1 Craig Stuntz said...
13-Jul-09 6:18 AMSome VCSs (StarTeam for sure, perhaps others) store compressed, full versions instead of diffs. StarTeam compresses, hashes the compressed file, and uses the hash as the filename. This significantly increases the performance of working with branches and older revisions. Diffs are still useful even in this case, though, for merging and the like. Storage requirement is higher than with reverse deltas, though less than you'd think, as duplicate revisions are never stored more than once.
#2 julian m bucknall said...
13-Jul-09 8:39 AMCraig: Interesting information. It reminds me of a blog post that Eric Sink did fairly recently, where he did an investigation of space vs. speed as he varied the number of diffs being stored for a file history against the whole file.
Time and Space Tradeoffs in Version Control Storage
Cheers, Julian
#3 PCPlus 280: Writing a spellchecker said...
16-Mar-10 9:50 PMI write a monthly column for PCPlus , a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there’s an Xmas issue as well — so it’s a bit more than monthly). The column is called Theory Workshop and appears in the Make
Leave a response
Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.
_emphasis_
**strong**
[text](url)
`IEnumerable`
* an item
1. an item
> Now is the time...
Preview of response