Introduction to lock-free/wait-free and the ABA problem

I recently read a paper on Hazard Pointers[2] and thought I would share what I learned and some general information on lock-free/wait-free programming. I hope it is useful to others getting started.

I'm not very clever, so keep in mind this is mostly regurgitating knowledge I've found in various places. If you find anything wrong or that could use more/better explaination leave a comment and we can cooperatively improve this.

Traditional problems in Mutli-Processor programming

As we reached the speed-limit of serialized CPU instructions we started to add more cores. Obviously, adding more cores means we have to change how we process information. Instead of optimizing for the fastest serial processing of data, we look for ways to split the work into multiple tasks. That way, we can put tasks on different cores and efficiently use the resources available.

So now we have lots of individual tasks. It would be most excellent if they could communicate. Often times we end up using a Mutex or other synchronization primitive to lock access to some shared state. Unfortunately, this doesn't scale well as you add more concurrent workers. This ends up creating more lock-contention and reduces the overall usefulness of splitting the problem into tasks.

What if we could communicate to manage shared state without introducing locks? That is precisely what some very smart folks have been designing for years. Now the rest of us are catching up.

What are lock-free algorithms

Lock-free algorithms provide a way to perform operations on shared state without the need to perform costly synchronization between threads. Think about the times you've used a mutex to protect access to shared state such as a queue. In highly-threaded scenarios you probably noticed CPU time wasted in lock-contention (often "time in kernel"). There are many types of synchronization, but many of these result in the same perception by users: inefficient use of their computing resources. That is not very green. The polar bears are crying.

Lets think for a moment about why there is so much CPU time wasted under traditional locking scenarios. For the sake of simplicity, lets build a very simple spinlock.

  1. typedef struct {
  2. int s;
  3. } spinlock_t;
  4.  
  5. void
  6. spinlock_lock(spinlock_t *l)
  7. {
  8. while (!g_atomic_int_compare_and_exchange(&l->s, FALSE, TRUE));
  9. }
  10.  
  11. void
  12. spinlock_unlock(spinlock_t *l)
  13. {
  14. l->s = FALSE;
  15. }

Experienced readers will forgive me for the naivete. What essentially happens is that we try to set the lock bit of the structure until we succeed; potentially blocking for long periods of time (in terms of CPU cycles). Not to mention that the CPU will spin up at full speed and waste power in the meantime.

Why is this generally a bad approach? Well, to begin with, this is just a synchronization primitive to acquire our lock. Once acquired, the critical section is performed. Meanwhile, all those other threads are still spinning! If only they were doing something constructive! In addition, each Compare and Swap (CAS here-forward) requires a Memory Barrier. This means that each call to the method requires that the cache line containing the structure be flushed from the CPU cache and fetched from main-memory. This alone can take hundreds of CPU cycles and now each thread is doing it repeatedly.

Now, that doesn't mean that CAS is a scary or bad thing! We just need to use it more appropriately.

For good measure I should mention that there are times when spin-locks and variants there-of are the right data structure. For example, many locks now days are complex and contain multiple steps such as spinning for a short period of time followed by a sleep back-off upon failure.

This non-deterministic behavior is precisely what lock-free algorithms are designed to avoid. Lock-free algorithms are carefully designed data-structures and functions to allow for multiple threads to attempt to make progress independently of one-another. This means that you do not try to acquire a lock before performing your critical region. Instead, you independently update a local copy of a portion of the data-structure and then apply it atomically to the shared structure with a CAS.

Wait-Free, a better Lock-Free

Those experienced with lock-free programming will likely know of a superset called wait-free programming. Wait-free is the same as lock-free except it provides a stellar guarantee. Each thread is guaranteed to be progressing itself or a cooperative thread. This is an important phenomenon because wait-free uses concurrent access to a data-structure to cooperatively progress the data-structure as a whole.

Constructing Lock-Free/Wait-Free algorithms

Lets take a look a lock-free/wait-free implementation of a Queue. This is a fairly strait-forward implementation of the design by M.M. Michael and M.L. Scott[1].

  1. typedef struct _Node Node;
  2. typedef struct _Queue Queue;
  3.  
  4. struct _Node {
  5. void *data;
  6. Node *next;
  7. };
  8.  
  9. struct _Queue {
  10. Node *head;
  11. Node *tail;
  12. };
  13.  
  14. Queue*
  15. queue_new(void)
  16. {
  17. Queue *q = g_slice_new(sizeof(Queue));
  18. q->head = q->tail = g_slice_new0(sizeof(Node));
  19. return q;
  20. }
  21.  
  22. void
  23. queue_enqueue(Queue *q,
  24. gpointer data)
  25. {
  26. Node *node, *tail, *next;
  27.  
  28. node = g_slice_new(Node);
  29. node->data = data;
  30. node->next = NULL;
  31. while (TRUE) {
  32. tail = q->tail;
  33. next = tail->next;
  34. if (tail != q->tail)
  35. continue;
  36. if (next != NULL) {
  37. CAS(&q->tail, tail, next);
  38. continue;
  39. }
  40. if (CAS(&tail->next, null, node)
  41. break;
  42. }
  43. CAS(&q->tail, tail, node);
  44. }
  45.  
  46. gpointer
  47. queue_dequeue(Queue *q)
  48. {
  49. Node *node, *tail, *next;
  50.  
  51. while (TRUE) {
  52. head = q->head;
  53. tail = q->tail;
  54. next = head->next;
  55. if (head != q->head)
  56. continue;
  57. if (next == NULL)
  58. return NULL; // Empty
  59. if (head == tail) {
  60. CAS(&q->tail, tail, next);
  61. continue;
  62. }
  63. data = next->data;
  64. if (CAS(&q->head, head, next))
  65. break;
  66. }
  67. g_slice_free(Node, head); // This isn't safe, we'll discuss why.
  68. return data;
  69. }

The FIFO queue is constructed using a singly-linked list. We keep track of the head and tail of the queue. Because the structure uses a linked-list, we can CAS the node's next pointer to atomically perform updates.

I won't go into too much detail on the implementation of the queue. The gist is this. To enqueue a new node, we work locally to prepare the new node and try to apply it atomically to the tail. If we detect an incomplete operation we help cleanup after it. To dequeue, we try to take the first item off the queue atomically. Again, we help proceed an inconsistent state.

The ABA Problem

There is an interesting problem that arises in this algorithm. What happens if a node is removed, de-allocated (through free() or similar), re-allocated (through malloc() or similar) and added back; all within the time in which another thread is paused? If that other thread is about to perform a CAS using that pointer it could still succeed even though the state has changed! This is what is known as the ABA problem in lock-free programming.

In garbage collected languages this isn't a problem. Why? Because the node's memory cannot be reclaimed for a new object until observing threads containing pointers to the structure have released them.

In C, however, we don't typically have the luxury of a garbage collector. Historically, there have been a few approaches to work around this. The original approach, by IBM, was to use a tag next to the pointer to be swapped. They would increment that tag counter and use a double-word CAS. While many 32-bit machines provide a 64-bit CAS, most 64-bit machines do not provide a 128-bit CAS. This prevents it from being a truly useful solution for general use.

My concurrency library, libiris[4], currently does something similar to this. By aligning the pointers properly you can use portions of the pointer itself for tagging. This isn't scalable as you add workers, but it was a start.

But, ever the pragmatist that I am, I thought it was time to research the proper way to handle the situation. Turns out there is a fantastic paper written in 2004 on a methodology called Hazard Pointers[2]. Hazard pointers are a way of notifying cooperative threads that you are removing some potentially unsafe structure. Later, a reclamation step is performed which can free the memory once it is safe to do so.

To make these lock-free/wait-free algorithms ABA safe we can notify cooperative threads using these hazard pointers. I've implemented the hazard pointer methodology from the paper here. We can include that and alter the algorithm slightly.

  1. void
  2. queue_enqueue(Queue *q,
  3. gpointer data)
  4. {
  5. Node *node, *tail, *next;
  6.  
  7. node = g_slice_new(Node);
  8. node->data = data;
  9. node->next = NULL;
  10. while (TRUE) {
  11. tail = q->tail;
  12. HAZARD_SET(0, tail); // Mark tail has hazardous
  13. if (tail != q->tail) // Check tail hasn't changed
  14. continue;
  15. next = tail->next;
  16. if (tail != q->tail)
  17. continue;
  18. if (next != NULL) {
  19. CAS(&q->tail, tail, next);
  20. continue;
  21. }
  22. if (CAS(&tail->next, null, node)
  23. break;
  24. }
  25. CAS(&q->tail, tail, node);
  26. }
  27.  
  28. gpointer
  29. queue_dequeue(Queue *q)
  30. {
  31. Node *tail, *next, *head;
  32.  
  33. while (TRUE) {
  34. head = q->head;
  35. LF_HAZARD_SET(0, head); // Mark head as hazardous
  36. if (head != q->head) // Check head hasn't changed
  37. continue;
  38. tail = q->tail;
  39. next = head->next;
  40. LF_HAZARD_SET(1, next); // Mark next has hazardous
  41. if (head != q->head)
  42. continue;
  43. if (next == NULL)
  44. return NULL; // Empty
  45. if (head == tail) {
  46. CAS(&q->tail, tail, next);
  47. continue;
  48. }
  49. data = next->data;
  50. if (CAS(&q->head, head, next))
  51. break;
  52. }
  53. LF_HAZARD_UNSET(head); // Retire head, and perform
  54. // reclamation if needed.
  55. return data;
  56. }

The paper discusses a few other algorithms too which I'd like to add to the repository. Including a Set and LIFO Stack. After which, I hope to do some merging and refactoring of this into libiris[5]. I want to make it easy for gtk+ applications to be fully asynchronous from within the main-loop. Doing synchronous IO, or other blocking operations from the main-loop is unacceptable as far as I'm concerned.

--

[1] M.M. Michael and M.L. Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,” Proc. 15th Ann. ACM Symp. Principles of Distributed Computing, pp. 267-275, May 1996.
[2] IEEE Transactions on Parallel and Distributed Systems, VOL. 15, NO. 6, June 2004 Page 491
[3] http://github.com/chergert/dukes_of_hazard
[4] http://git.dronelabs.com/iris

Comments (12)

  1. Very interesting! Does libiris also document which structures are lock/wait free and which aren’t?

    Friday, December 25, 2009 at 11:55 am #
  2. Ok (took a quick look at the git repo that you mention at the end of your post), I guess that you changed IrisQueue into an abstract thing, and now made IrisLfQueue and stuff (I assume LF stands for lock-free)? And apparently there’s indeed some documentation about it.

    Good, good. Nice.

    Friday, December 25, 2009 at 11:57 am #
  3. Jonathan Turner wrote:

    A good book on this topic is Art of Multiprocessor Programming: http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916

    Friday, December 25, 2009 at 4:42 pm #
  4. chergert wrote:

    @Phillip

    Yeah I did some abstraction on it to make it easier to wrap into Python and other languages. I’m not sure whether I really like that.

    What I’d like to do is take what I’ve learned recently and put together an idea for cleaning up the library and get some comments from people.

    Friday, December 25, 2009 at 6:44 pm #
  5. chergert wrote:

    @Jonathon

    Yes, that book is fantastic. So is The Art of Concurrency.

    http://www.amazon.com/Art-Concurrency-Monkeys-Parallel-Applications/dp/0596521537/ref=pd_bxgy_b_img_b

    Friday, December 25, 2009 at 6:45 pm #
  6. stieg wrote:

    @Chergert

    Good post. Perhaps elaborating a bit more about how Hazard pointers work and keep the ABA problem from occurring in this particular situation would help clarify why this solution is ideal. Currently I am going through the paper myself to see how it works. Otherwise well done.

    Friday, December 25, 2009 at 9:02 pm #
  7. chergert wrote:

    @stieg

    Yeah makes sense. I need to spend some time thinking on it more so I can explain it accurately. I barely understand it myself meaning its pretty tough to regurgitate. :-)

    For those interested, I just pushed support for Mac OS X.

    Friday, December 25, 2009 at 10:57 pm #
  8. Steven wrote:

    Nice! Another possible solution is reference counting “Correction of a Memory Management method for Lock-Free data structures” http://eprints.kfupm.edu.sa/32809/1/32809.pdf

    Saturday, December 26, 2009 at 6:04 pm #
  9. thedude42 wrote:

    Thanks for putting this together. I enjoyed learning about this, and that locking isn’t really the end all in synchronisation.

    Sunday, December 27, 2009 at 12:30 am #
  10. Throw that silly M-S stuff away, you can do enqueue() much simplier, faster and at the same time in a way that does not require any safe memory management (Hazard Pointers):
    void
    queue_enqueue(Queue *q,
    gpointer data)
    {
    Node *node, *prev;
    node = g_slice_new(Node);
    node->data = data;
    node->next = NULL;
    prev = XCHG(&tail, node);
    prev->next = node;
    }

    See the complete MPSC queue implementation here:
    http://groups.google.com/group/lock-free/browse_frm/thread/55df71b87acb8201

    If you can afford double-word atomic operations, then you can get rid of Safe Memory Reclamation on consumer side too. And the real trick is how to do lock-free queue in an intrusive way, because if you are about performance then non-intrusiviness is not quite what you want. You can see all that here:
    http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300

    Note that if you need to provide blocking dequeue() in lock-free queue you can’t use condition variable anymore, because it requires all the data to be protected by a mutex. In order to provide blocking dequeue() you can use an eventcount algorithm (which is basically “condvar for lock-free algos”):
    http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/62364

    Also note that Hazard Pointers is actually not the best possible Safe Memory Reclamation scheme. Aside for a moment that they are covered by US patent, the problem is in expensive #StoreLoad style memory barrier required in each and every LF_HAZARD_SET(). There are some schemes that tries to workaround that: RCU, SMR+RCU, VZOOM, Amortized Proxy-Collector, etc. The best solution is to get away with algorithm that just does not require Safe Memory Reclamation at all, though.

    There is a small typo in your code:
    next = head->next;
    LF_HAZARD_SET(1, next); // Mark next has hazardous
    if (head != q->head)
    continue;
    I guess it must be:
    next = head->next;
    LF_HAZARD_SET(1, next); // Mark next has hazardous
    if (next != head->next)
    continue;

    In order to avoid such problems you may consider providing LF_HAZARD_DEREFERENCE() which handles looping internally. Then the code will be just:
    next = LF_HAZARD_DEREFERENCE(1, head->next);


    Best regards,
    Dmitriy V’jukov

    Sunday, December 27, 2009 at 2:21 pm #
  11. chergert wrote:

    @Dmitriy

    Thank you so much for all that information! Looks like I have some quality reading for the week.

    Sunday, December 27, 2009 at 7:16 pm #
  12. Nice, really informative!
    And thank you for the book recommendation as well :)

    Wednesday, December 30, 2009 at 12:20 am #