Random Updates

It's been a while, so I figured I'd just jot a few things down that I've been doing in my spare time.

  • Went to Yosemite again in January.   We usually go about 3 times a year.  We have a bigger trip coming up in March.  If you are in the area and want to join, contact me.  My email is available in my source code.
  • EggBuffer - A simple reader/writer for the protocol-buffer wire format.  This does not include a message generator or anything.  I simply wanted to be able to crack open message blobs.
  • generators - A small set of basic code generators I use to bootstrap new objects/structs/licenses/etc.  I'd like to convert these to that fancy vim snippet plugin.
  • EggFmt - A crappy little formatter helper for doing mysql style tables in console.
  • EggLine - A simplified, glib-ish, interface to libreadline.

As soon as it's ready, I'll share the project I'm using these tools to build.  Feel free to speculate.

Also, I'll be back in LA, my favorite home, for Scale8x February 19-22nd.  Your rockstar Gnome Sys-Admin, Jeff Schroeder, and I will be hosting the Gnome booth along with Jordan.

http://www.flickr.com/photos/jordanlarrigan/ / CC BY-NC-SA 2.0

Introduction to lock-free/wait-free and the ABA problem

I recently read a paper on Hazard Pointers[2] and thought I would share what I learned and some general information on lock-free/wait-free programming. I hope it is useful to others getting started.

I'm not very clever, so keep in mind this is mostly regurgitating knowledge I've found in various places. If you find anything wrong or that could use more/better explaination leave a comment and we can cooperatively improve this.

Traditional problems in Mutli-Processor programming

As we reached the speed-limit of serialized CPU instructions we started to add more cores. Obviously, adding more cores means we have to change how we process information. Instead of optimizing for the fastest serial processing of data, we look for ways to split the work into multiple tasks. That way, we can put tasks on different cores and efficiently use the resources available.

So now we have lots of individual tasks. It would be most excellent if they could communicate. Often times we end up using a Mutex or other synchronization primitive to lock access to some shared state. Unfortunately, this doesn't scale well as you add more concurrent workers. This ends up creating more lock-contention and reduces the overall usefulness of splitting the problem into tasks.

What if we could communicate to manage shared state without introducing locks? That is precisely what some very smart folks have been designing for years. Now the rest of us are catching up.

What are lock-free algorithms

Lock-free algorithms provide a way to perform operations on shared state without the need to perform costly synchronization between threads. Think about the times you've used a mutex to protect access to shared state such as a queue. In highly-threaded scenarios you probably noticed CPU time wasted in lock-contention (often "time in kernel"). There are many types of synchronization, but many of these result in the same perception by users: inefficient use of their computing resources. That is not very green. The polar bears are crying.

Lets think for a moment about why there is so much CPU time wasted under traditional locking scenarios. For the sake of simplicity, lets build a very simple spinlock.

  1. typedef struct {
  2. int s;
  3. } spinlock_t;
  4.  
  5. void
  6. spinlock_lock(spinlock_t *l)
  7. {
  8. while (!g_atomic_int_compare_and_exchange(&l->s, FALSE, TRUE));
  9. }
  10.  
  11. void
  12. spinlock_unlock(spinlock_t *l)
  13. {
  14. l->s = FALSE;
  15. }

Experienced readers will forgive me for the naivete. What essentially happens is that we try to set the lock bit of the structure until we succeed; potentially blocking for long periods of time (in terms of CPU cycles). Not to mention that the CPU will spin up at full speed and waste power in the meantime.

Why is this generally a bad approach? Well, to begin with, this is just a synchronization primitive to acquire our lock. Once acquired, the critical section is performed. Meanwhile, all those other threads are still spinning! If only they were doing something constructive! In addition, each Compare and Swap (CAS here-forward) requires a Memory Barrier. This means that each call to the method requires that the cache line containing the structure be flushed from the CPU cache and fetched from main-memory. This alone can take hundreds of CPU cycles and now each thread is doing it repeatedly.

Now, that doesn't mean that CAS is a scary or bad thing! We just need to use it more appropriately.

For good measure I should mention that there are times when spin-locks and variants there-of are the right data structure. For example, many locks now days are complex and contain multiple steps such as spinning for a short period of time followed by a sleep back-off upon failure.

This non-deterministic behavior is precisely what lock-free algorithms are designed to avoid. Lock-free algorithms are carefully designed data-structures and functions to allow for multiple threads to attempt to make progress independently of one-another. This means that you do not try to acquire a lock before performing your critical region. Instead, you independently update a local copy of a portion of the data-structure and then apply it atomically to the shared structure with a CAS.

Wait-Free, a better Lock-Free

Those experienced with lock-free programming will likely know of a superset called wait-free programming. Wait-free is the same as lock-free except it provides a stellar guarantee. Each thread is guaranteed to be progressing itself or a cooperative thread. This is an important phenomenon because wait-free uses concurrent access to a data-structure to cooperatively progress the data-structure as a whole.

Constructing Lock-Free/Wait-Free algorithms

Lets take a look a lock-free/wait-free implementation of a Queue. This is a fairly strait-forward implementation of the design by M.M. Michael and M.L. Scott[1].

  1. typedef struct _Node Node;
  2. typedef struct _Queue Queue;
  3.  
  4. struct _Node {
  5. void *data;
  6. Node *next;
  7. };
  8.  
  9. struct _Queue {
  10. Node *head;
  11. Node *tail;
  12. };
  13.  
  14. Queue*
  15. queue_new(void)
  16. {
  17. Queue *q = g_slice_new(sizeof(Queue));
  18. q->head = q->tail = g_slice_new0(sizeof(Node));
  19. return q;
  20. }
  21.  
  22. void
  23. queue_enqueue(Queue *q,
  24. gpointer data)
  25. {
  26. Node *node, *tail, *next;
  27.  
  28. node = g_slice_new(Node);
  29. node->data = data;
  30. node->next = NULL;
  31. while (TRUE) {
  32. tail = q->tail;
  33. next = tail->next;
  34. if (tail != q->tail)
  35. continue;
  36. if (next != NULL) {
  37. CAS(&q->tail, tail, next);
  38. continue;
  39. }
  40. if (CAS(&tail->next, null, node)
  41. break;
  42. }
  43. CAS(&q->tail, tail, node);
  44. }
  45.  
  46. gpointer
  47. queue_dequeue(Queue *q)
  48. {
  49. Node *node, *tail, *next;
  50.  
  51. while (TRUE) {
  52. head = q->head;
  53. tail = q->tail;
  54. next = head->next;
  55. if (head != q->head)
  56. continue;
  57. if (next == NULL)
  58. return NULL; // Empty
  59. if (head == tail) {
  60. CAS(&q->tail, tail, next);
  61. continue;
  62. }
  63. data = next->data;
  64. if (CAS(&q->head, head, next))
  65. break;
  66. }
  67. g_slice_free(Node, head); // This isn't safe, we'll discuss why.
  68. return data;
  69. }

The FIFO queue is constructed using a singly-linked list. We keep track of the head and tail of the queue. Because the structure uses a linked-list, we can CAS the node's next pointer to atomically perform updates.

I won't go into too much detail on the implementation of the queue. The gist is this. To enqueue a new node, we work locally to prepare the new node and try to apply it atomically to the tail. If we detect an incomplete operation we help cleanup after it. To dequeue, we try to take the first item off the queue atomically. Again, we help proceed an inconsistent state.

The ABA Problem

There is an interesting problem that arises in this algorithm. What happens if a node is removed, de-allocated (through free() or similar), re-allocated (through malloc() or similar) and added back; all within the time in which another thread is paused? If that other thread is about to perform a CAS using that pointer it could still succeed even though the state has changed! This is what is known as the ABA problem in lock-free programming.

In garbage collected languages this isn't a problem. Why? Because the node's memory cannot be reclaimed for a new object until observing threads containing pointers to the structure have released them.

In C, however, we don't typically have the luxury of a garbage collector. Historically, there have been a few approaches to work around this. The original approach, by IBM, was to use a tag next to the pointer to be swapped. They would increment that tag counter and use a double-word CAS. While many 32-bit machines provide a 64-bit CAS, most 64-bit machines do not provide a 128-bit CAS. This prevents it from being a truly useful solution for general use.

My concurrency library, libiris[4], currently does something similar to this. By aligning the pointers properly you can use portions of the pointer itself for tagging. This isn't scalable as you add workers, but it was a start.

But, ever the pragmatist that I am, I thought it was time to research the proper way to handle the situation. Turns out there is a fantastic paper written in 2004 on a methodology called Hazard Pointers[2]. Hazard pointers are a way of notifying cooperative threads that you are removing some potentially unsafe structure. Later, a reclamation step is performed which can free the memory once it is safe to do so.

To make these lock-free/wait-free algorithms ABA safe we can notify cooperative threads using these hazard pointers. I've implemented the hazard pointer methodology from the paper here. We can include that and alter the algorithm slightly.

  1. void
  2. queue_enqueue(Queue *q,
  3. gpointer data)
  4. {
  5. Node *node, *tail, *next;
  6.  
  7. node = g_slice_new(Node);
  8. node->data = data;
  9. node->next = NULL;
  10. while (TRUE) {
  11. tail = q->tail;
  12. HAZARD_SET(0, tail); // Mark tail has hazardous
  13. if (tail != q->tail) // Check tail hasn't changed
  14. continue;
  15. next = tail->next;
  16. if (tail != q->tail)
  17. continue;
  18. if (next != NULL) {
  19. CAS(&q->tail, tail, next);
  20. continue;
  21. }
  22. if (CAS(&tail->next, null, node)
  23. break;
  24. }
  25. CAS(&q->tail, tail, node);
  26. }
  27.  
  28. gpointer
  29. queue_dequeue(Queue *q)
  30. {
  31. Node *tail, *next, *head;
  32.  
  33. while (TRUE) {
  34. head = q->head;
  35. LF_HAZARD_SET(0, head); // Mark head as hazardous
  36. if (head != q->head) // Check head hasn't changed
  37. continue;
  38. tail = q->tail;
  39. next = head->next;
  40. LF_HAZARD_SET(1, next); // Mark next has hazardous
  41. if (head != q->head)
  42. continue;
  43. if (next == NULL)
  44. return NULL; // Empty
  45. if (head == tail) {
  46. CAS(&q->tail, tail, next);
  47. continue;
  48. }
  49. data = next->data;
  50. if (CAS(&q->head, head, next))
  51. break;
  52. }
  53. LF_HAZARD_UNSET(head); // Retire head, and perform
  54. // reclamation if needed.
  55. return data;
  56. }

The paper discusses a few other algorithms too which I'd like to add to the repository. Including a Set and LIFO Stack. After which, I hope to do some merging and refactoring of this into libiris[5]. I want to make it easy for gtk+ applications to be fully asynchronous from within the main-loop. Doing synchronous IO, or other blocking operations from the main-loop is unacceptable as far as I'm concerned.

--

[1] M.M. Michael and M.L. Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,” Proc. 15th Ann. ACM Symp. Principles of Distributed Computing, pp. 267-275, May 1996.
[2] IEEE Transactions on Parallel and Distributed Systems, VOL. 15, NO. 6, June 2004 Page 491
[3] http://github.com/chergert/dukes_of_hazard
[4] http://git.dronelabs.com/iris

readline

I made a crappy little abstraction on top of readline to make things simpler for myself. Someone might find it useful.

  1. #include "egg-line.h"
  2. // ...
  3. EggLine *line = egg_line_new ();
  4. egg_line_set_prompt (line, "Linux-Router> ");
  5. egg_line_set_entries (line, completion_tree);
  6. egg_line_run (line);
  7. // ...
  8.  

Code: http://github.com/chergert/egg-line/
Example: http://github.com/chergert/egg-line/blob/master/main.c

Rapid Prototyping Tools

It's nice to see people working on meta recently. For example, Quickly. After some long discussions with an architect at my last job, I hacked up something drastically different for myself. It's a graphical tool to combine lots of meta pieces together from visual designs.

However, my available time keeps shrinking so I thought I'd share it just to keep the meme going. The idea was to write a meta-toolbox for myself that I can continually extend as I go and eventually write entire programs using it.

The gist is this. Create nodes in the system. Tag them with information. Run the analyzer+generator against them. The analyzer infers lots of things based on patterns I know and see in my head. This allows for minimal time between cognition and implementation.

The input is simple as you'll see, the output is a ready to go project, runnable code in C, dbus support, threading or message passing, classes, properties, signals, methods, documentation, etc, etc.

Video is best watched fullscreen.

My code isn't really in a shape to share as it isn't really useful unless your brain works exactly as mine, but you should be able to hack up your own in minimal time.

GDateTime

I got fed up with working around all the limitations with date and time in glib and C (time_t, struct timeval, struct tm, GTimeVal, GDate, etc) so I decided to write something new.

The result is GDateTime. It handles dates and times from 1/1/1 to 12/31/9999 on 100-nanosecond intervals. As requested by a few individuals, it is an opaque/boxed type. String parsing is incomplete but the rest should be in good shape.

The glib branch is available here. Or for those with short attention spans, the header gdatetime.h.

You can even use GDateTimes as keys within a GHashTable using g_date_time_hash, g_date_time_equal.

GHashTable *hash = g_hash_table_new_full (g_date_time_hash, g_date_time_equal, g_date_time_free, NULL);

I'm interested in feedback and patches.

New Job

I'm now working at VMWare. I'm half way in the process of moving to San Francisco and should be up in the SoMa area full-time by next week.

Yosemite – August 2009

I go to Yosemite a few times a year because its possibly the best hiking in California. Drew and Jordan went with me and took some amazing photos. There was a large forest fire at the time which made hiking Half Dome impossible due to smoke inhalation.

Regardless, they got some really good photos during our time there. Photos are here.




My take on Tornado and "modern" web servers

I am happy to see Facebook release Tornado, their Python based web-server that was developed to run the FriendFeed service. They have done their best to optimize the stack within the Python process. However, I'm inclined to believe that it is more of the same problem from which many HTTP toolkits suffer.

Since Tornado is written in Python it is typically restricted[1] by the Global Interpreter Lock. The GIL, for short, is a technique often used by dynamic languages such as Python and Ruby 1.9 to simplify development of the languages runtime as well as providing certain features to be implemented safely in the language. For example, it is the GIL that makes multi-swap features atomic in Python (a, b = b, a). The question is "How do we scale Tornado"? The answer is the same as most GIL based systems. You scale using processes, typically, equal to the number of concurrent threads your system can execute.

Update: The previous paragraph reads differently than I intended. Tornado doesn't really suffer from the GIL as its single threaded. What I meant to get across that is if you needed to scale to all the cores of your system you need either threads or processes. Future paragraphs explain this in better detail.

Scaling in this model, however, presents a few challenges. Each process ends up requiring connections to your database and cache tiers. So now instead of N connections you will ultimately grow to N*N_PROCESSES connections. Scaling the number of cores will inversely effect your data tier performance which is typically your bottleneck to begin with.

I think the need to design the system in this direction is ultimately due to the real web-servers letting us down. Tornado handling data in an event-based fashion should not be news in general. Most web-servers have been doing this for a while. Cherokee[2] does it quite well, Apache's mpm-event does it, and Python Twisted's reactor has done it for years. However, all of these systems let your infrastructure developers down in a critical component, the module API.

Once you have a socket accepted you need to do something with it. Web-servers tend to differentiate themselves in this area but I don't think they have gotten it right. Apache provides different models such as MPM which is a combination of worker processes and threads. This was done primarily to work around inefficiencies in massive numbers of threads per process. "Why would you need a massive number of threads" you might ask? It's due to the design-flaw that the module API's are designed around executing synchronously. After the bottom-half (the socket handler) accepts a new request, it is passed to a handler. The handler calls the configured modules to process the request and blocks while this happens. Ultimately, this means you are wasting a thread regardless of your application having the resources to immediately process the request. When your system cannot keep up with requests you create a massive number of threads/processes up to your configured bounds which slows down the threads that actually are able to process a request. Obviously this exacerbates the problem.

What needs to happen is a module API that supports a fully asynchronous and event-based processing of requests from the socket handling bottom-half through the application-based upper-half. A primary example of where this can be useful is exemplified by "partition-based" data such as most user-content systems. If you accept too many requests on a single server for a data partition you will slow down the average response time per request. This is due to the contention you create on your database and/or cache connections. So while you may want to accept the request, you may not want to start processing it until a number of other requests in that partition range are completed.

Once we have switched to a server that can work in this model we will run into the problem of how to handle all those cores efficiently. Thankfully, Apache styled MPM models wont be necessary anymore. If applications can be processed in an asynchronous model then they will be able to yield the thread on IO and other shared resources that are not yet available. Doing so will mean the optimum number of threads will be equal to the number of cores you have to process.

For this to work optimally with language runtimes dependent on a GIL we need to use a little-known technique to have multiple runtimes within a single process. If you have multiple copies of the runtimes shared library each subsequent call to dlopen() will result in new memory zones for the loaded library. In such cases, you can even have different versions of the same runtime loaded within one process (python2.3, 2.4, 2.5, etc). We can load a runtime per core so that each worker has it's own runtime with the GIL disabled (since its single threaded). As I'm sure you noticed this is an implementation detail.

A few things are preventing this from becoming an immediate reality. There needs to be a standard for languages to be able to integrate into the web-server so that shared resources can be stored outside of each runtime. For example, one of our python workers might yield on getting a database connection from the servers connection pool. Now the worker can pause that request until the connection is available and process other requests. This should feel natural to those who have worked in co-routine based systems.

One of the problems that interests me is getting the request from the bottom-half to workers in the upper-half as efficient as possible. Typically this is done with a thread pool. Most thread pools are implemented with a single Queue and various worker threads that pop their future work item off the queue. What research has shown is that this creates a considerable amount of lock-contention as you scale the number of cores. For each additional worker you create a new contender for the lock to retrieve an item from the queue. Nir Shavit et al pioneered a concept called work-stealing which greatly reduces this problem. It consists of a queue per worker and distributing work items in a round-robin fashion among the workers. To prevent worker-starvation, when a thread runs out of work-items, it will look to its neighbors to try to steal an item off their queue. Some fancy footwork is performed to reduce the contention by stealing from the opposite side of the queue and implementing the fast path as lock-free. Additionally, you can choose neighbors that share your die on NUMA. I was so interested in this and the challenges of implementing lock-free algorithms in C (they are easier to implement with garbage collection), that I implemented it in Iris[3], my concurrency toolkit for glib.

The primary thing we've learned from web-servers over the recent years is that to handle high-load scenarios you need to do everything you can to reduce resource usage when forward-motion cannot be performed. By continuing to build web and application frameworks within the container of the synchronous module API we cannot hope to get the massive improvements in performance that we all desire. We need to consider the high-performance web-server as part of our application which means writing HTTP container standards.

[1] I imagine that someone will make this work on IronPython or Google's LLVM based Python.
[2] http://www.cherokee-project.org
[3] http://git.dronelabs.com/iris

Connecting people and Mentoring

Is there anything out there to help connect newer programmers with experienced programmers in GNOME or Linux in general? A big-brother/big-sister type thing. It's quite rewarding to both those learning and those with experience to help each other when they have common goals and passions. It allows for broader dissemination of information, more testers, more eyes to find bugs and a sense of shared vision.

Maybe a yellowpages of people and their interests?

Faster code completion

The python code completion engine in MonoDevelop is now much faster. It's was a simple fix and one that I think gets overlooked too often in these "dynamic" days.

Thankfully, python makes it easy to index information based on a fully-qualified name. Everything can basically be represented in package format such as "xml.etree.ElementTree.ElementTree.isinstance". Of course, the simplest way to find anything matching a prefix is to just use the SQL LIKE operator. It's fast enough, and gets the job done.

Now, where it gets tricky. I need to be able to limit the result set to only show items that do not have a '.' after the prefix of my LIKE query. SQLite does not have a built in function to help us do this, and if it did, it would be the wrong way.

Since the data in each row will never change (other than a complete row update) we know that we have the ability to safely pre-compute information about the row. Remember how the data is in a package format? Each of those '.' indicate a new depth which we can use for indexing.

So the simple fix is to pre-compute the number of '.' in the fully qualified name and simply add your desired depth to the query. Less data returned in your data reader means less memory allocated by the VM, less memory to free and less memory fragmentation. Good times.

The new field will mean your completion database needs to be upgraded. No fear though, that's done for you automatically.