Someone is wrong on the internet — the lock-free stack edition : Algorithms for the masses

Someone is wrong on the internet — the lock-free stack edition

Getting on for 4 years ago, I wrote a series of posts on lock-free containers in .NET, like the lock-free stack and the lock-free queue. They were fun to write, mainly because of the difficulty of the topic and trying to rationalize why early versions didn't work, and so on so forth. Along the way I learned about the .NET memory model, volatile, memory fences, and other arcana, but in the end I had something that worked and worked well. Over the years since then I've had many emails from readers about lock-free containers and I've been happy to have been linked from many different places, so much so that currently I'm number 1 for "lock free stack" and queue on Google.

Dunce's Chair And then yesterday, I suffered from "Someone is wrong on the internet" syndrome. Big time.

Someone, let's call him GC after the garbage collector in .NET (since as you'll see he demonstrated that he doesn't understand it) sent me an email. It seems GC had been reading up on the ABA problem in wikipedia as part of learning about lock-free containers and then found my lock-free stack and queue implementations. Without taking any time to understand my implementation or even consider the differences between the environment for the code in the wikipedia article and that for my code, someone was wrong on the internet, and that person was me. An email was duly written to me saying my implementation was broken. You see, the wikipedia article had a broken lock-free stack, therefore all lock-free stacks were broken.

Huh? Let's see: we have an anonymous author on wikipedia demonstrating some code that was broken and a named author (that would be me) showing off some similar code that purported to be working, and therefore the named author was automatically wrong? Wow.

Sure the person who's willing to put his name to some technical article could be wrong, but isn't it likely that, you know, he could be right (by naming himself he's staking his reputation on what he writes after all — that stake isn't so visible on wikipedia) and that you were missing some information? So, wouldn't you be, like, "I think I'm misunderstanding something. I read this on wikipedia and your code doesn't seem to do anything about this problem. Could you help me understand why?" And I'd be all, sure, let me explain why the wikipedia problem, although a valid problem, doesn't apply to the code I wrote, and we'd have this illuminating conversation. It's happened in the past, and it'll happen in the future, and I'm happy to do it.

But no. Someone was wrong on the internet and it was obviously me.

GC sends me this email that said things like "[Your implementations of the lock-free stack and queue] are busted. You have not solved the ABA problem." and then rehashed the wikipedia article to prove I was wrong, only to terminate with a snide "You might consider a disclaimer on your web site . People trying to use your algorithms may believe they are getting atomic lock-free algorithms, only to discover their applications crash (in the field) in ways that are almost impossible to reproduce deterministically." (No, I'm not sure what the word atomic means in there, either.)

To put it mildly I was shocked that someone who obviously hadn't understood any of the issues with lock-free programming (and I don't pretend to know them all) or the ABA problem was willing to spend five minutes reading up on it and then ten minutes writing an email to a stranger saying that said stranger was wrong despite the fact he could have downloaded and run the code that showed said stranger was right. It blew me away.

Take a look at the wikipedia article on the ABA problem, especially the example. It's a not very well written lock-free stack (there's hardly any encapsulation) written in C++ and exemplifies a standard ABA problem. To recap, ABA is all about the invalid reuse of memory before all threads have finished with that memory. The thing about the lock-free stack code on wikipedia is that, in order to show off ABA easily, it exposes the nodes that go to make up the stack to the callers. Obviously the callers can then do silly things to said nodes, including deleting them (that is, deallocating them) and reusing them inappropriately. And by doing so, it's then very easy to show the ABA problem. If you like, the code was a pedagogical device, not an example of how to write a lock-free stack or that all lock-free stacks are flawed. Unfortunately for my friend GC he wasn't experienced or equipped enough to understand the difference.

Here's the full code for the lock-free stack that I wrote.

  public class LockFreeStack<T> {

    private SingleLinkNode<T> head;

    public LockFreeStack() {
      head = new SingleLinkNode<T>();
    }

    public void Push(T item) {
      SingleLinkNode<T> newNode = new SingleLinkNode<T>();
      newNode.Item = item;
      do {
        newNode.Next = head.Next;
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, newNode.Next, newNode));
    }

    public bool Pop(out T item) {
      SingleLinkNode<T> node;
      do {
        node = head.Next;
        if (node == null) {
          item = default(T);
          return false;
        }
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, node, node.Next));
      item = node.Item;
      return true;
    }

    public T Pop() {
      T result;
      Pop(out result);
      return result;
    }
  }

It's slightly different from the code in my original post since it implements two different Pop() methods, one that returns true/false plus the item, the other that just returns the item or default if the pop failed. Also the code shown in my article simplifies the CAS call syntax. Nevertheless all the main features are there.

Notice that it is the stack that is in charge of allocating nodes (this is in Push()). There is no "deallocation" of nodes because .NET doesn't do deallocation; instead it's the garbage collector (the real "GC") that comes along every now and then and compacts the heap, removing all objects that are no longer referenced. (A point that my friend GC was unable to appreciate given that he was fixated on the correctness of the wikipedia article.) The main cause of ABA in the wikipedia code is simply not there since the nodes are completely encapsulated inside the stack code. Nevertheless, there could be an ABA problem solely inside the stack. Let's destroy that notion.

Imagine the following scenario. The stack looks like this: head -> A -> B -> C -> etc, and thread 1 is in the middle of a pop operation. Unfortunately, it's a really slow thread. It takes a copy of the value of head.Next; in this case a reference to the node A. Slowly, but surely, it enters the CAS. The code evaluates the address of the head.Next field (it's a ref parameter), the local variable node (which is a reference to the node A), and whatever node.Next evaluates to, which is a reference to the node B, we hope. Leisurely, it executes the underlying operation to the CAS method (which happens to be cmpxchg, another point that old GC was sure to point out to me to show off his knowledge).

Meanwhile, other threads are whipping through their execution, pushing and popping items (and therefore nodes) off the stack. The node A gets removed, long before thread 1's execution actually does the cmpxchg. Or, even. horror, before thread 1 evaluates A.Next.

We're obviously well into the doom scenario from the wikipedia article, surely? Nope, not even close. This was bubba GC's argument too: the memory for node A could get reused! It could be added back onto the stack! Thread 1's view of what its reference to A points to could be completely different! It could point to another node on the stack! If thread 1 continued you could lose many other items on the stack! YOU HAVE THE ABA PROBLEM, DOOFUS!

OK, perhaps he didn't quite go so wild, but that was the thrust of his argument (and indeed the wikipedia argument too). Unfortunately, in the C# code I presented (and present above) the basis of the argument is specious and fallacious. Why? Because node A never gets deallocated and therefore cannot be allocated again. The reason for that is that thread 1 still contains a reference to A and the object is live. The GC will therefore never collect it, even if I put code like if (CurrentThread == thread1) Sleep(ABloodyLongTime); in the middle of the Pop() method. The reference to A only goes away if we go round the loop and read head.Next again, or if we succeed in the CAS and the reference is discarded at the final brace for the method (that is, goes out of scope). It's only when all references to A in all threads have gone that the GC will collect its memory.

However, to be fair, there could be other ABA problems. Even though A won't get freed/reused while thread 1 has it, the data within it (Next and Item) may change with similar ABA results. Let's consider Next. Apart from head.Next, the only place in the stack code where a node's Next field gets set is in Push(), and that only ever happens on a brand new, just-newed-up node. Since I've shown that the memory for A cannot get freed while thread 1 has a reference to it, it can't be reallocated, and so this scenario can't happen either. (In fact, there's an even stronger conclusion: A.Next will always reference B.) A similar argument can be made for a node's Item field. The final case is that the outside world gets hold of nodes and does something silly with them. As I pointed out above, that doesn't happen here either.

All in all, there is no ABA problem with the code as written above. The main reason for this avoidance of such a large issue is .NET's garbage collection. It ensures that we completely avoid the problems of reusing memory that one or more threads could still be using.

But, sadly, my friend GC refused to see this reasoning. He ignored it completely, because it didn't fit into his wikipedia worldview. In fact he helpfully wrote back a final time to say "Wow, what a fountain of crap.", and "Perhaps it’s difficult for you to accept that all of that work [that is, my articles and code] is so fundamentally flawed. But, I assure you, it is.", and "Your simple stack class is the poster child for the ABA problem." (Yes, he discovered bold.) And then, for good measure, he went on to mention his "co-workers at Microsoft", especially those in the "Many Core" team who seem to be uproariously laughing away at me whenever they take a break. (I'm guessing he thought I'd be intimidated by this or something.)

Sad to say, GC does indeed work at Microsoft. Happy to say, though, he's not an actual employee just a contractor in the Project Management Excellence (PMX) group, although he's obviously falling short on the excellence part.

Now playing:
Vegas - Wise guy
(from Vegas)

Sat 18-Apr-2009 4:45 PM Blog / tags: lock-free stack queue aba-problem

Loading links to posts on similar topics...

previous post next post

11 Responses

#1 Ben Vanik said...

18-Apr-09 6:22 PM

Whoa, nice post. You're a bigger man than I, as I would have posted his name ;)

#2 Dustin Campbell said...

18-Apr-09 11:34 PM

This really fires me up. Regardless of whether the person in question is a contractor or a full-time employee, they are a representative of Microsoft and should act as such. In reality, this person might well put their foot in their mouth if they learned that you are a former Microsoft employee and a valued partner.

I think we know who was really wrong on the Internet this time.

#3 Jamie said...

19-Apr-09 8:34 AM

I enjoyed the "intimidating" part at the end, desperate people always mention their friends as a last grasp at an argument.

It's not worth getting riled up about though.. there is always someone wrong on the internet!

#4 julian m bucknall said...

19-Apr-09 9:15 AM

Jamie. To be honest, although I was pretty riled up about it on Friday (notice I didn't publish this post until Saturday :)), I'm more concerned about GC, and the attitude he evinced during our conversation.

Yes, we all make mistakes; yes, we can all unload on someone, justifiably or not; yes, we've all felt that moment of dread when we find out that our argument is completely wrong. I would hope that we would all have the cojones to apologize in those situations.

The concern I feel is that GC wasn't willing to read what I'd sent him and entertain the notion that he might have been wrong. I certainly was willing to entertain the notion that I was wrong, I've been doing it all my programming life (in essence I've been exposing my code to thousands of other programmers for a long time -- I've been working for third-party component vendors for a while). I even reran the app to make sure that the tests still passed (back when I wrote it, I only had a dual CPU machine, now I have a four-core one). I even reread the relevant sections in Joe Duffy's book to make doubly sure.

But GC, once he'd latched onto his worldview, was not going to change it. My unease is that he does this at work too, in which case, we all suffer.

Cheers, Julian

#5 James Brock said...

15-Mar-10 3:33 PM

Hi Julian,

GC is right, and you are wrong (on the internet). Your stack has the ABA problem.

Best Regards,

James Brock

#6 julian m bucknall said...

15-Mar-10 4:17 PM

James: Ah, if only you knew what you were talking about. But since you don't -- otherwise you'd have perhaps put your money where your mouth is and written a blog post to show the error of my ways and referenced it here for all to see -- I'll assume that you're clueless in these matters. Thanks for passing by.

Cheers, Julian

#7 Bla said...

25-Apr-10 5:55 PM

No response from James ..cccc

#8 meh said...

06-Jul-11 12:09 PM

I've admittedly stopped reading after the 5th paragraph, and I have not read the original communication between you and that person whom you call GC. Insofar I wouldn't know whether that person is maybe really a jackass.

Nevertheless, from the little that I've seen: Whether or not you are accidentially right, the amount of hybris that you expose is outright scary.

You posted a code snippet that is - in principle - fundamentally flawed. Incidentially, however, it works fine under .NET because of some implementation detail. And now someone points you to it. Oh what a bastard. Obviously, that person is a total asshole, he doesn't understand anything.

You just could not admit "Thank you, I didn't think of this, but luckily the .NET garbage collector makes sure that it works anyway, phew...". No, of course you did it right all the time, you explicitly took the GC's properties into consideration (just didn't mention), and the other person is only too stupid to see it. You can't be wrong anyway, because you have worked at Microsoft for an entire 6 months, so you know everything. How long is the probation time at MS again?

Seriously, man...

#9 julian m bucknall said...

06-Jul-11 1:41 PM

Meh: tl;dr eh? But you'll pontificate anonymously anyway. Nice.

Let's see: I write a bunch of articles about lock-free structures in .NET and using C#. I make no representations at all about how it would work using another language or run-time. None. It works in .NET just fine (gosh, you think I'd hadn't tested it or something) and I understand why it works in .NET: there's no accidently about it.

GC wrote me a rather nasty email saying that my code didn't work. Fine. I've been wrong before and will be again. Only he didn't argue from, you know, actually trying it. He was just talking about it theoretically, possibly from from a C/C++ viewpoint, but asserted that I was an idiot for even believing that it worked under the preconditions I'd assumed. The only hubris here (I assume you know what it means, despite the misspelling) is his, because despite his fiercely argued position, he was utterly wrong to say my C# code didn't work.

Oh, and I do like the ad hominem attack at the end. Keepin' it real classy. Seriously, man...

Cheers, Julian

#10 Dmitry Bogdanov said...

18-Aug-11 9:02 AM

I would like to ask one question.

node = head.Next;

Do you mean that read-and-increment-reference-count is atomic operation in C#? Can a thread be preempted after read and before incrementing reference counter?

#11 julian m bucknall said...

18-Aug-11 1:27 PM

Dmitry: .NET does not use reference counting as a means to identify whether objects are being used, so the issue doesn't exist.

Cheers, Julian

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

Emphasize with italics: surround word with underscores _emphasis_
Emphasize strongly: surround word with double-asterisks **strong**
Link: surround text with square brackets, url with parentheses [text](url)
Inline code: surround text with backticks `IEnumerable`
Unordered list: start each line with an asterisk, space * an item
Ordered list: start each line with a digit, period, space 1. an item
Insert code block: start each line with four spaces
Insert blockquote: start each line with right-angle-bracket, space > Now is the time...

by Julian M Bucknall