A couple of weeks ago, Hacker News pointed to an article about how bad hash tables could be in a worse case scenario.
In essence, the thrust of the argument was that, in a worse case scenario, hash tables degenerate from a very nice O(1) for insertion and deletion of items to O(N). The reason is that hash tables’ quoted run-time efficiency is an amortized value and not an absolute result. If you like, on average over very many insertions and deletions, hash tables behave in a O(1) manner, but there can be cases where a single insertion can be O(N), generally when the table is grown because its load factor grows above a certain value.
The author of the piece then goes on to say that because of this he always uses a balanced binary tree for a dictionary or associative array, because its run-time efficiency is O(log N) for insertions and deletions, even in the worse case. That is, a balanced binary tree, such as a red-black tree, offers a worse-case guarantee that insertions and deletions will not take longer than O(log N).
This is all quite true, but I would venture to guess that the worse case scenario does not occur for the vast majority of applications. For example, let us take the spell checker scenario where we store a wordlist in a hash table. (Ignore for the moment the fact that a trie or a ternary tree might be a better structure than either a hash table or a balanced binary tree for this.) One would assume that the wordlist size would not change drastically during the run of the application with violent swings in the number of words. In which case, the developer would optimize the dictionary, for example, by tweaking the hash function or by initializing the number of elements in the table correctly. In such an example, there would be no such worse case. (I shall ignore the worse case of the crappy developer, as one does in all discussions of run-time efficiency.)
Another example is associative arrays in dynamic languages like JavaScript or Ruby. There you are pretty much forced to use hash tables because they’re, well, part of the infrastructure. Presumably performance testing will reveal problems in your code (again, ignoring the issue of the crap developer who doesn’t test) that you can then fix, if possible.
My viewpoint here that, unless you are aware of wildly varying numbers of name-value pairs in your application, it’s very likely that a hash table will suffice for your dictionary needs. Instead of faster efficiency the vast majority of time, by using a balanced binary tree you are slower all the time.
After all, every implementation of quicksort will have a sequence of input items that will force an O(n^2) run-time. Using random selection of the pivot can alleviate this to a certain extent (that is, make it extremely unlikely to happen). Yet, how many people worry about that? Indeed, how many people know where quicksort is being used in their run-time library?
Mind you, all this is very academic. The real rule is: measure your own code’s performance. Don’t take big-Oh numbers as gospel: for your application and your data, the real world may be very different.
No Responses
Feel free to add a comment...
Leave a response
Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.
_emphasis_
**strong**
[text](url)
`IEnumerable`
* an item
1. an item
> Now is the time...
Preview of response