GDateTime

Some readers may remember that I spent some time working on a DateTime structure for GLib last year. After much refinement by others better than myself, it has been included in GLib master. Much thanks to Thiago Sousa Santos, Emmanuele Bassi, and the countless reviewers.

You can find some quick examples here.

GDateTime is both immutable and reference counted. See gdatetime.h for relevant API.

  1. {
  2. GDateTime *dt;
  3. char *str;
  4.  
  5. dt = g_date_time_new_now();
  6. str = g_date_time_printf(dt, "It is currently %x %X %z");
  7. g_print("%s\n", str);
  8.  
  9. g_date_time_unref(dt);
  10. g_free(str);
  11. }

Realtime Graphing (again)

As suggested by Benjamin, I switched UberGraph to use a single GdkPixmap instead of two. Previously I copied the content back and forth between Pixmaps as I added the new second's worth of data. That wasted memory in X as well as being unnecessary. Now it uses a single GdkPixmap as a circular buffer and blits in two steps (no fear, as we get double buffering for free from gtk).

Realtime Graphing

I recently had the need for a realtime graph and thought it would be a fun hack. I spent some time thinking about a few ways to optimize the drawing to reduce both the client and server side overhead. Having a way to visualize data without a significant effect on system performance is a must.

The result is UberGraph. For those that care less about the how and why and more about seeing the code, you can find it here. I'm working on a few different ideas in there. But please, keep in mind that the data collectors are ugly hacks and only used to test the widget.

For the impatient, here is a comparison of the drawing overhead to System Monitor.

Requirements

In it's most simplist form, the graph must be able to present a grid with two scales; the x and y axis. Presented upon these scales are the samples which will animate right-to-left. In the most unoptimized form, this seems like a simple problem and we might as well start to attack it like such and optimize from there.

  1. Create a new widget inheriting from GtkDrawingArea.
  2. On the expose event, draw the entire graph (clipping to exposure area). The graph contents are retrieved from an array of nodes in the graph. The value is scaled to the pixel range and drawing using Bézier curves with cairo_curve_to().
  3. Invalide the widget contents after (1 / Frames-Per-Second) seconds have passed; forcing another exposure.

Implementation

This is essentially how the graphs are done Gnome System Monitor. I don't mean to pick on G-S-M, but it was an interesting problem to me (being that the CPU usage every time I ran G-S-M skewed my results) and it was a good learning experience.

There are many places for which this can be optimized. To begin with, it is important to consider the design of the X11 protocol and how to take advantage of it. X11 provides pixmaps, which are a "server-side" texture. This means you push your content to the server over the X11 protocol, and then reference by a unique-id going forward. If designed correctly, we can exploit this to do the sliding animation efficiently without sending a new image for every frame. This results in less work for X, less work for our process, and less bandwidth to X (resulting in better performance over ssh -X). When you draw onto a GdkWindow, you are creating the pixmap on the client, and sending the contents over the wire to the X server via the X11 protocol and exposing it to the window. Now, if you are drawing a graph on every exposure (up to say 20 frames per second), you can imagine how this can take a hit on your systems performance. You are rendering the pixmap, encoding the request, delivering it to the server (via UNIX/TCP socket), decoding the request, loading the pixmap, and then copying it into the framebuffer (through the driver infrastructure). Thats a lot of work to do 20x per second. Even more so over ssh forwarding.

If you ever want to get a quick and dirty idea of the amount of data you are sending to the X server, use strace. "strace -ewritev" seems to do the trick for me. The X socket typically ends up as FD 3 in my experience. Just grab the last value in the line for number of bytes sent. Sadly, I haven't found an easy way to determine this with xtrace, yet.

To allow us to get some of the desired speedups, we must first split the foreground and background into separate pixmaps. This way, the only content we ever need to continually ship to the X server is the foreground. The background is an opaque pixmap and the foreground will have mostly transparent pixels (which should compress well too!). Thanks to cairo, we can easily composite the transparent pixmap onto the background using gdk_cairo_set_source_pixmap(). Alternatively, on older systems you can blend using the GDK_XOR operation (which will only blit a portion of the foreground based on its comparison to the background). When using XOR, you need to pre XOR your drawing colors so they result in the proper color.

So now that we have our foreground and background separated, we can move on to getting some decent CPU savings. If we make one small sacrifice, we can keep the CPU from doing a significant amount of work repeatedly. This sacrifice is allowing ourself up to 1 second of latency between the time of a sample and presenting it on the screen. This is useful because it allows us to render the new data outside the visible range and slide the pixmap right-to-left on the server. Doing this means we only draw once per second, rather than 20 times.

Lets take a quick look at what I mean.

Using this model, we can render the new version of the graph and send the pixmap to the server. Then, in our frame-expired callback every (1 / Frames-Per-Second) seconds, we simply adjust the position at which the pixmap is rendered by the horizontal distance between frames.

At 20 frames-per-second, this saves us 19 iterations of rendering, encoding, sending, decoding, and copying to the framebuffer per second. Instead, we simply tell the server that we wish to render the pixmap at a given offset and it will copy it into the framebuffer for us.

However, why should we send the entire content once per second? Surely the historical data hasn't changed. In fact, if we use two pixmaps on the server-side to flip between, we can get another large X bandwidth savings. We start by copying the first pixmap into the second pixmap at its new offset. Then, on the client, we render the new second worth of content (properly clipped with cairo_clip()) and send it to the X server. The content is moved to the "New Data" region of the pixmap and then animated into view over the following second. This saves us ((Width - WidthPerSecond) * Height) pixels to send across the wire every second. This is good since that will often exceed the size of a single packet (again, helping over ssh -X). Locally, the speedup is modest. However, remotely it allows us to provide almost the same experience even on low-bandwidth links (ssh -X from my n900 via AT&T Edge for example).

Autoscaling the graph to handle values outside the visible range is done by checking the value bounds when retrieving the value. Scaling down is done during a callback to see if we can shrink (currently every 5 seconds). No sense in having this in a hot path, it's simply not that important.

It is worth adding, that while I mention X throughout this post, this should work similarly on backends other than X11 since GdkPixmap abstracts that.

Fork me on Github.

Thanks for reading.

Conquering DBus

I haven't written in a while because I've been pretty heads down on a project I'm working on. Hopefully, it will be in a shape where it makes sense to share with everyone before too long.

In this project, I needed an RPC system that could have multiple transport and be fully asynchronous (with GAsyncResult, etc). So I took a little time and wrote a crappy little python program that generates the following peices in GObject/C from a simple input format.

  • Transport agnostic (currently has DBus backend).
  • A client library with lowlevel RPC calls and an object model built on top of the lowlevel calls. Consumer can choose which best fits their use case. (Object model not finished).
  • A server implementation (also transport agnostic).
  • Fully Asynchronous. Synchronous implementation is built on top of the asynchronous implementation.
  • Unit tests (for testing RPC round trips, etc).
  • Generated code should be clean to read and debug.
  • Tracing support for debugging. (Use -DDISABLE_TRACING to turn off).
  • Generate DBus Introspection XML.

Anyway, code is here[1]. There is a sample video below where I generate a mockup service for getting information on "Books" and "Authors". See the Makefile in the outdir directory for how to build things. Keep in mind I have absolutely NO INTENTION of maintaining this. But at minimum, I figured it might serve useful for someone.

So a quick example of the imput format would be like the following. I've removed the header lines for succinctness.

  1. object Book {
  2. # books are identified by a unique int (string also supported).
  3. id int
  4.  
  5. """
  6. Document your rpc's with python style docs. Comments persist into
  7. the generated code's gtkdoc.
  8. """
  9. rpc get_name (out string name)
  10.  
  11. """
  12. Retrieves the book's author.
  13. """
  14. rpc get_author (out Author author)
  15. }
  16.  
  17. object Author {
  18. id int
  19.  
  20. """
  21. Retrieves the books by this author.
  22. """
  23. rpc get_books (out Book[] books)
  24. }
  25.  
  26. object Manager {
  27. # only one of these per application (/org/blah/Project/Manager).
  28. # might consider this later to simply be the default servce such as
  29. # (/org/blah/Project).
  30. singleton
  31.  
  32. rpc get_books (out Book[] books)
  33. rpc get_authors (out Author[] authors)
  34. }

Thats pretty much it. The code is uglier than anything you've ever seen, but at least the generated C is pretty. The code parser/generator is not super resilient or anything. I pretty much hack it to have what I need as I go.

[1] http://github.com/chergert/yosemite

Random Updates

It's been a while, so I figured I'd just jot a few things down that I've been doing in my spare time.

  • Went to Yosemite again in January.   We usually go about 3 times a year.  We have a bigger trip coming up in March.  If you are in the area and want to join, contact me.  My email is available in my source code.
  • EggBuffer - A simple reader/writer for the protocol-buffer wire format.  This does not include a message generator or anything.  I simply wanted to be able to crack open message blobs.
  • generators - A small set of basic code generators I use to bootstrap new objects/structs/licenses/etc.  I'd like to convert these to that fancy vim snippet plugin.
  • EggFmt - A crappy little formatter helper for doing mysql style tables in console.
  • EggLine - A simplified, glib-ish, interface to libreadline.

As soon as it's ready, I'll share the project I'm using these tools to build.  Feel free to speculate.

Also, I'll be back in LA, my favorite home, for Scale8x February 19-22nd.  Your rockstar Gnome Sys-Admin, Jeff Schroeder, and I will be hosting the Gnome booth along with Jordan.

http://www.flickr.com/photos/jordanlarrigan/ / CC BY-NC-SA 2.0

Introduction to lock-free/wait-free and the ABA problem

I recently read a paper on Hazard Pointers[2] and thought I would share what I learned and some general information on lock-free/wait-free programming. I hope it is useful to others getting started.

I'm not very clever, so keep in mind this is mostly regurgitating knowledge I've found in various places. If you find anything wrong or that could use more/better explaination leave a comment and we can cooperatively improve this.

Traditional problems in Mutli-Processor programming

As we reached the speed-limit of serialized CPU instructions we started to add more cores. Obviously, adding more cores means we have to change how we process information. Instead of optimizing for the fastest serial processing of data, we look for ways to split the work into multiple tasks. That way, we can put tasks on different cores and efficiently use the resources available.

So now we have lots of individual tasks. It would be most excellent if they could communicate. Often times we end up using a Mutex or other synchronization primitive to lock access to some shared state. Unfortunately, this doesn't scale well as you add more concurrent workers. This ends up creating more lock-contention and reduces the overall usefulness of splitting the problem into tasks.

What if we could communicate to manage shared state without introducing locks? That is precisely what some very smart folks have been designing for years. Now the rest of us are catching up.

What are lock-free algorithms

Lock-free algorithms provide a way to perform operations on shared state without the need to perform costly synchronization between threads. Think about the times you've used a mutex to protect access to shared state such as a queue. In highly-threaded scenarios you probably noticed CPU time wasted in lock-contention (often "time in kernel"). There are many types of synchronization, but many of these result in the same perception by users: inefficient use of their computing resources. That is not very green. The polar bears are crying.

Lets think for a moment about why there is so much CPU time wasted under traditional locking scenarios. For the sake of simplicity, lets build a very simple spinlock.

  1. typedef struct {
  2. int s;
  3. } spinlock_t;
  4.  
  5. void
  6. spinlock_lock(spinlock_t *l)
  7. {
  8. while (!g_atomic_int_compare_and_exchange(&l->s, FALSE, TRUE));
  9. }
  10.  
  11. void
  12. spinlock_unlock(spinlock_t *l)
  13. {
  14. l->s = FALSE;
  15. }

Experienced readers will forgive me for the naivete. What essentially happens is that we try to set the lock bit of the structure until we succeed; potentially blocking for long periods of time (in terms of CPU cycles). Not to mention that the CPU will spin up at full speed and waste power in the meantime.

Why is this generally a bad approach? Well, to begin with, this is just a synchronization primitive to acquire our lock. Once acquired, the critical section is performed. Meanwhile, all those other threads are still spinning! If only they were doing something constructive! In addition, each Compare and Swap (CAS here-forward) requires a Memory Barrier. This means that each call to the method requires that the cache line containing the structure be flushed from the CPU cache and fetched from main-memory. This alone can take hundreds of CPU cycles and now each thread is doing it repeatedly.

Now, that doesn't mean that CAS is a scary or bad thing! We just need to use it more appropriately.

For good measure I should mention that there are times when spin-locks and variants there-of are the right data structure. For example, many locks now days are complex and contain multiple steps such as spinning for a short period of time followed by a sleep back-off upon failure.

This non-deterministic behavior is precisely what lock-free algorithms are designed to avoid. Lock-free algorithms are carefully designed data-structures and functions to allow for multiple threads to attempt to make progress independently of one-another. This means that you do not try to acquire a lock before performing your critical region. Instead, you independently update a local copy of a portion of the data-structure and then apply it atomically to the shared structure with a CAS.

Wait-Free, a better Lock-Free

Those experienced with lock-free programming will likely know of a superset called wait-free programming. Wait-free is the same as lock-free except it provides a stellar guarantee. Each thread is guaranteed to be progressing itself or a cooperative thread. This is an important phenomenon because wait-free uses concurrent access to a data-structure to cooperatively progress the data-structure as a whole.

Constructing Lock-Free/Wait-Free algorithms

Lets take a look a lock-free/wait-free implementation of a Queue. This is a fairly strait-forward implementation of the design by M.M. Michael and M.L. Scott[1].

  1. typedef struct _Node Node;
  2. typedef struct _Queue Queue;
  3.  
  4. struct _Node {
  5. void *data;
  6. Node *next;
  7. };
  8.  
  9. struct _Queue {
  10. Node *head;
  11. Node *tail;
  12. };
  13.  
  14. Queue*
  15. queue_new(void)
  16. {
  17. Queue *q = g_slice_new(sizeof(Queue));
  18. q->head = q->tail = g_slice_new0(sizeof(Node));
  19. return q;
  20. }
  21.  
  22. void
  23. queue_enqueue(Queue *q,
  24. gpointer data)
  25. {
  26. Node *node, *tail, *next;
  27.  
  28. node = g_slice_new(Node);
  29. node->data = data;
  30. node->next = NULL;
  31. while (TRUE) {
  32. tail = q->tail;
  33. next = tail->next;
  34. if (tail != q->tail)
  35. continue;
  36. if (next != NULL) {
  37. CAS(&q->tail, tail, next);
  38. continue;
  39. }
  40. if (CAS(&tail->next, null, node)
  41. break;
  42. }
  43. CAS(&q->tail, tail, node);
  44. }
  45.  
  46. gpointer
  47. queue_dequeue(Queue *q)
  48. {
  49. Node *node, *tail, *next;
  50.  
  51. while (TRUE) {
  52. head = q->head;
  53. tail = q->tail;
  54. next = head->next;
  55. if (head != q->head)
  56. continue;
  57. if (next == NULL)
  58. return NULL; // Empty
  59. if (head == tail) {
  60. CAS(&q->tail, tail, next);
  61. continue;
  62. }
  63. data = next->data;
  64. if (CAS(&q->head, head, next))
  65. break;
  66. }
  67. g_slice_free(Node, head); // This isn't safe, we'll discuss why.
  68. return data;
  69. }

The FIFO queue is constructed using a singly-linked list. We keep track of the head and tail of the queue. Because the structure uses a linked-list, we can CAS the node's next pointer to atomically perform updates.

I won't go into too much detail on the implementation of the queue. The gist is this. To enqueue a new node, we work locally to prepare the new node and try to apply it atomically to the tail. If we detect an incomplete operation we help cleanup after it. To dequeue, we try to take the first item off the queue atomically. Again, we help proceed an inconsistent state.

The ABA Problem

There is an interesting problem that arises in this algorithm. What happens if a node is removed, de-allocated (through free() or similar), re-allocated (through malloc() or similar) and added back; all within the time in which another thread is paused? If that other thread is about to perform a CAS using that pointer it could still succeed even though the state has changed! This is what is known as the ABA problem in lock-free programming.

In garbage collected languages this isn't a problem. Why? Because the node's memory cannot be reclaimed for a new object until observing threads containing pointers to the structure have released them.

In C, however, we don't typically have the luxury of a garbage collector. Historically, there have been a few approaches to work around this. The original approach, by IBM, was to use a tag next to the pointer to be swapped. They would increment that tag counter and use a double-word CAS. While many 32-bit machines provide a 64-bit CAS, most 64-bit machines do not provide a 128-bit CAS. This prevents it from being a truly useful solution for general use.

My concurrency library, libiris[4], currently does something similar to this. By aligning the pointers properly you can use portions of the pointer itself for tagging. This isn't scalable as you add workers, but it was a start.

But, ever the pragmatist that I am, I thought it was time to research the proper way to handle the situation. Turns out there is a fantastic paper written in 2004 on a methodology called Hazard Pointers[2]. Hazard pointers are a way of notifying cooperative threads that you are removing some potentially unsafe structure. Later, a reclamation step is performed which can free the memory once it is safe to do so.

To make these lock-free/wait-free algorithms ABA safe we can notify cooperative threads using these hazard pointers. I've implemented the hazard pointer methodology from the paper here. We can include that and alter the algorithm slightly.

  1. void
  2. queue_enqueue(Queue *q,
  3. gpointer data)
  4. {
  5. Node *node, *tail, *next;
  6.  
  7. node = g_slice_new(Node);
  8. node->data = data;
  9. node->next = NULL;
  10. while (TRUE) {
  11. tail = q->tail;
  12. HAZARD_SET(0, tail); // Mark tail has hazardous
  13. if (tail != q->tail) // Check tail hasn't changed
  14. continue;
  15. next = tail->next;
  16. if (tail != q->tail)
  17. continue;
  18. if (next != NULL) {
  19. CAS(&q->tail, tail, next);
  20. continue;
  21. }
  22. if (CAS(&tail->next, null, node)
  23. break;
  24. }
  25. CAS(&q->tail, tail, node);
  26. }
  27.  
  28. gpointer
  29. queue_dequeue(Queue *q)
  30. {
  31. Node *tail, *next, *head;
  32.  
  33. while (TRUE) {
  34. head = q->head;
  35. LF_HAZARD_SET(0, head); // Mark head as hazardous
  36. if (head != q->head) // Check head hasn't changed
  37. continue;
  38. tail = q->tail;
  39. next = head->next;
  40. LF_HAZARD_SET(1, next); // Mark next has hazardous
  41. if (head != q->head)
  42. continue;
  43. if (next == NULL)
  44. return NULL; // Empty
  45. if (head == tail) {
  46. CAS(&q->tail, tail, next);
  47. continue;
  48. }
  49. data = next->data;
  50. if (CAS(&q->head, head, next))
  51. break;
  52. }
  53. LF_HAZARD_UNSET(head); // Retire head, and perform
  54. // reclamation if needed.
  55. return data;
  56. }

The paper discusses a few other algorithms too which I'd like to add to the repository. Including a Set and LIFO Stack. After which, I hope to do some merging and refactoring of this into libiris[5]. I want to make it easy for gtk+ applications to be fully asynchronous from within the main-loop. Doing synchronous IO, or other blocking operations from the main-loop is unacceptable as far as I'm concerned.

--

[1] M.M. Michael and M.L. Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,” Proc. 15th Ann. ACM Symp. Principles of Distributed Computing, pp. 267-275, May 1996.
[2] IEEE Transactions on Parallel and Distributed Systems, VOL. 15, NO. 6, June 2004 Page 491
[3] http://github.com/chergert/dukes_of_hazard
[4] http://git.dronelabs.com/iris

readline

I made a crappy little abstraction on top of readline to make things simpler for myself. Someone might find it useful.

  1. #include "egg-line.h"
  2. // ...
  3. EggLine *line = egg_line_new ();
  4. egg_line_set_prompt (line, "Linux-Router> ");
  5. egg_line_set_entries (line, completion_tree);
  6. egg_line_run (line);
  7. // ...
  8.  

Code: http://github.com/chergert/egg-line/
Example: http://github.com/chergert/egg-line/blob/master/main.c

Rapid Prototyping Tools

It's nice to see people working on meta recently. For example, Quickly. After some long discussions with an architect at my last job, I hacked up something drastically different for myself. It's a graphical tool to combine lots of meta pieces together from visual designs.

However, my available time keeps shrinking so I thought I'd share it just to keep the meme going. The idea was to write a meta-toolbox for myself that I can continually extend as I go and eventually write entire programs using it.

The gist is this. Create nodes in the system. Tag them with information. Run the analyzer+generator against them. The analyzer infers lots of things based on patterns I know and see in my head. This allows for minimal time between cognition and implementation.

The input is simple as you'll see, the output is a ready to go project, runnable code in C, dbus support, threading or message passing, classes, properties, signals, methods, documentation, etc, etc.

Video is best watched fullscreen.

My code isn't really in a shape to share as it isn't really useful unless your brain works exactly as mine, but you should be able to hack up your own in minimal time.

GDateTime

I got fed up with working around all the limitations with date and time in glib and C (time_t, struct timeval, struct tm, GTimeVal, GDate, etc) so I decided to write something new.

The result is GDateTime. It handles dates and times from 1/1/1 to 12/31/9999 on 100-nanosecond intervals. As requested by a few individuals, it is an opaque/boxed type. String parsing is incomplete but the rest should be in good shape.

The glib branch is available here. Or for those with short attention spans, the header gdatetime.h.

You can even use GDateTimes as keys within a GHashTable using g_date_time_hash, g_date_time_equal.

GHashTable *hash = g_hash_table_new_full (g_date_time_hash, g_date_time_equal, g_date_time_free, NULL);

I'm interested in feedback and patches.

New Job

I'm now working at VMWare. I'm half way in the process of moving to San Francisco and should be up in the SoMa area full-time by next week.