More on the lock-free stack/queue: memory fences : Algorithms for the masses

More on the lock-free stack/queue: memory fences

Last time I talked about the lock-free stack and queue, I was more concerned about proving that my code was free from the ABA problem than anything else. In making my argument I naturally glossed over such niceties as the .NET memory model and so assumed (pretty much) a sequential memory model, because, to be honest, that's how we think as programmers.

Spring path with picket fence Consider the following set of statements in C#:

      int foo;
      bool fooHasBeenSet = false;
      // ...
      foo = 42;
      fooHasBeenSet = true;

Now, you and I when we read this will naturally assume that the assignment statements occur in the order shown. In fact, that assumption is more along the lines of a tautology, a duh! statement. It can't be otherwise, surely? The foo variable will be set to 42 and then the boolean variable will be set to true. This assumption is known as the sequential memory model, and it is the strictest memory model we can use.

As it happens, on modern machines, this assumption could be false. .NET allows the compiler or the JITter or the hardware to rearrange statements as they see fit, so long as the rearrangement has the same overall semantics as the code as written. The reason for allowing this is to improve the overall efficiency of the code given the presence of pipelined CPUs, caches, and the like. So in fact the boolean variable could be set before the int variable. The boolean variable may not be set until we actually want to read it, ditto foo. We couldn't tell because as soon as either variable was read after the relevant assignment, we'd see the correct value.

Now in a single-threaded program, there's absolutely no way to find out if the order was changed. We can't throw in a Console.Write(value); immediately after the assignment to the int variable, since we would just see the correct value of the variable (but note, the assignment to the boolean could still occur prior to that). In essence, in a single-threaded application, all this talk of rearranging control flow to suit the compiler or JITter or hardware is just navel-gazing. We can't tell, and so it doesn't matter.

In a multi-threaded application though, it does make a difference. A big one.

In a multithreaded environment, one thread could set the boolean before the int variable due to this allowed rearrangement in control flow. Another thread could then see fooHasBeenSet as true, but still see the old value of foo, with presumably dire results. But, wait, there's more! In a modern PC, main memory is too slow and so there's one, generally two, levels of memory cache, known by the helpful names of L1 and L2. L1 is hyper fast, very small and is close to the CPU. L2 is much faster than RAM but slower than L1, and is a few megabytes in size. (For example, on my desktop, L1 cache is 32KB for data and 32KB for code, and the L2 cache is 4MB. I have 8GB of RAM.) The issue here is each core has separate caches. When you read data, you may be reading it from the cache for that CPU, not the shared memory for the machine. One thread may update a variable (that is, some memory slot), and another thread may not even see that it has been changed since the old value is in the cache of the core that's executing that thread.

How the heck do multithreaded applications work then? it sounds like they should all completely fail with invalid unsynchronized data.

In particular, going back to the lock-free stack and its Pop() method,

    public bool Pop(out T item) {
      SingleLinkNode<T> node;
      do {
        node = head.Next;
        if (node == null) {
          item = default(T);
          return false;
        }
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, node, node.Next));
      item = node.Item;
      return true;
    }

head.Next could be in the L1/L2 cache for a particular thread, but in fact be altered by another thread. The test to see whether head.Next to be equal to node in the CAS method could just be true for that particular thread because both variables happen to be in the same cache. We could pop the same node off the stack twice! Man, this is even worse than ABA, we are so hosed.

Luckily for us, all is not lost. Without going into too much detail (see this MSDN article if you want more), the hardware contains primitives to ensure data coordination and synchronization. These primitives go under the name of memory fences (or barriers) and since .NET only supports full memory fences, that's all I'll talk about here. To put it in layman's terms, a memory fence resets the caches and ensures that all pending memory writes and reads have completed. Deferred reads can't cross the memory fence (the fence forces them to happen), and memory writes must happen before leaving the fence. That is, a memory fence will ensure that, in our original code example, the second thread will "see" the assignments of the data in the correct order, and that in the Pop() code that head.Next is read from RAM.

A memory fence is therefore a slow operation. By slow, I mean in comparison with the normal running of code and fetching data that a core CPU does. It's certainly not slow like reading from a disk. A few hundred cycles, perhaps; nothing you'd notice. It forces everything, at least for that moment, to be synchronized across all threads.

How do we impose a memory fence to ensure synchronization? By far the most traditional way is to use a lock. This is the granddaddy of all memory fences in a way, since a lock is all about the explicit synchronization of threads. I'm sure, just like me, you never even thought about a lock including data synchronization — you just assumed it was so and never thought about any other possibility. But my lock-free stack and queue don't use locks.

Another memory fence is provided by the interlocked primitives, including the .NET methods that call them. And, since it uses the Interlocked.CompareExchange method, the CAS method I wrote includes a memory fence. The global value of head.Next is forced to be updated by the completion of the CAS method, and the CAS method is forced to read its value anew from shared RAM. Phew. The lock-free stack code still works.

Now playing:
Vangelis - Blade Runner Blues
(from Blade Runner)

Sun 19-Apr-2009 1:15 PM Blog / tags: lock-free stack queue memory-fence

Loading links to posts on similar topics...

previous post next post

1 Response

#1 Dew Drop – April 20, 2009 | Alvin Ashcraft's Morning Dew said...

20-Apr-09 6:00 AM

Pingback from Dew Drop – April 20, 2009 | Alvin Ashcraft's Morning Dew

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

Emphasize with italics: surround word with underscores _emphasis_
Emphasize strongly: surround word with double-asterisks **strong**
Link: surround text with square brackets, url with parentheses [text](url)
Inline code: surround text with backticks `IEnumerable`
Unordered list: start each line with an asterisk, space * an item
Ordered list: start each line with a digit, period, space 1. an item
Insert code block: start each line with four spaces
Insert blockquote: start each line with right-angle-bracket, space > Now is the time...

by Julian M Bucknall