Practical Concurrency Patterns: Lock-Free Operations

July 30, 2008


In the previous installments we reviewed multiple strategies for caching and storing calculated key-value data so that accesses to it are optimized to the highest applicable degree.  For simpler storage types, such as a work item queue, there are even cheaper alternatives than what we've seen so far.

Consider the following scenario:

[Diagram: multiple order sources feeding a single queue, which is drained by multiple processing components]

  • A stock trade application receives trade orders at the alarming rate of thousands of operations per second from multiple sources
  • Orders are placed in a queue for further processing in order to free the front-end component(s)
  • Orders are removed from the queue by multiple processing components which proceed to perform very brief update operations on in-memory data

This scenario is easily dominated by CPU work, as most of the operations are performed in memory.  The single bottleneck in the above design is the queue.  A reader-writer lock strategy will be useless here because every operation on the queue (insert and remove alike) is a write, so we need something else to minimize contention.
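For concreteness, the baseline to beat looks something like the following minimal sketch: a plain Queue<T> guarded by a single monitor, on which every producer and every consumer must serialize:

using System.Collections.Generic;

class LockedQueue<T>
{
    readonly Queue<T> queue = new Queue<T>();
    readonly object syncRoot = new object();

    public void Insert(T data)
    {
        lock (syncRoot)    // every producer serializes here...
        {
            queue.Enqueue(data);
        }
    }

    public bool TryRemove(out T data)
    {
        lock (syncRoot)    // ...and every consumer serializes here, too
        {
            if (queue.Count == 0)
            {
                data = default(T);
                return false;
            }
            data = queue.Dequeue();
            return true;
        }
    }
}

At thousands of operations per second from multiple sources, every one of those lock acquisitions is a potential point of contention.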

We observe that the only two operations actually requiring synchronization are inserting items into the queue and removing items from it.  Creating, filtering or otherwise manipulating work items can proceed without any synchronization.

This is the classic use case for a lock-free queue.  If the queue is maintained as a singly linked list, an insert operation can be implemented by performing one atomic write – updating the list tail.  Similarly, a remove operation can be implemented by performing one atomic write – updating the list head.  (Or so it might appear.)

Since reference writes under the CLR memory model are guaranteed to be atomic, it might appear that we don’t need synchronization around the queue at all.  This assumption quickly breaks, however, if we consider the following naive implementation:

class QueueNode<T>
{
    public T Data;
    public volatile QueueNode<T> Next;
    public QueueNode(T data) { Data = data; }
}

class NaiveLockFreeQueue<T>
{
    volatile QueueNode<T> head;    // dummy node; head.Next is the first real item
    volatile QueueNode<T> tail;    // the last node in the chain

    public NaiveLockFreeQueue()
    {
        tail = head = new QueueNode<T>(default(T));
    }

    public void Insert(T data)
    {
        QueueNode<T> node = new QueueNode<T>(data);
        tail.Next = node;    // link the new node after the current last node...
        tail = node;         // ...and advance the tail
    }

    public T Remove()
    {
        // No empty-queue check, for brevity: removing from an empty queue
        // will dereference null.
        QueueNode<T> oldHead = head;    // the current dummy node
        head = oldHead.Next;            // the first real node becomes the new dummy...
        return oldHead.Next.Data;       // ...and carries the value to return
    }
}
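Before dissecting why this implementation is broken, it is worth seeing how deceptive it can be.  Here is a minimal stress-test sketch: two producers hammer the queue concurrently, and then a single thread drains it and counts the survivors (recall that Remove dereferences null when it runs off the end of the chain):

using System;
using System.Threading;

class StressTest
{
    const int ItemsPerThread = 1000000;

    static void Main()
    {
        var queue = new NaiveLockFreeQueue<int>();

        // Two producers insert concurrently.
        ThreadStart producer = delegate
        {
            for (int i = 0; i < ItemsPerThread; i++)
                queue.Insert(i);
        };
        Thread t1 = new Thread(producer), t2 = new Thread(producer);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();

        // Drain on a single thread. If an Insert race split the list, we run
        // off the end of the chain before recovering all the items.
        int count = 0;
        try
        {
            for (int i = 0; i < 2 * ItemsPerThread; i++)
            {
                queue.Remove();
                count++;
            }
            Console.WriteLine("All {0} items accounted for", count);
        }
        catch (NullReferenceException)
        {
            Console.WriteLine("Split list: only {0} of {1} items survived",
                              count, 2 * ItemsPerThread);
        }
    }
}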

The funny thing is that a test like this may well pass many, many runs, especially on a single-CPU machine.  The reason is that it is rather unlikely for the race condition to actually strike and invalidate the implementation: it takes a context switch between two specific instructions in Insert or Remove.  When it does strike, though, the results are disastrous.  Consider the following sequence of events in Insert:

  1. Thread 1 sets tail.Next to X1
  2. Thread 2 sets tail.Next to X2, overwriting the link to X1
  3. Thread 2 sets tail to X2
  4. Thread 1 sets tail to X1

As a result, we have a split list!

[Diagram: the split list; the chain from head now ends at X2, while tail points to the orphaned X1]

The way Remove breaks is different and less spectacular:

  1. Thread 1 reads oldHead = head (the current dummy node)
  2. Thread 2 reads oldHead = head (the very same node)
  3. Thread 2 sets head to oldHead.Next (X) and returns X.Data
  4. Thread 1 sets head to oldHead.Next (X, again) and returns X.Data

As a result, both threads observe the same removed item!

As you can see, reasoning about lock-free data structures is far from trivial.  To make the insert and remove logic valid, we must introduce some form of synchronization that addresses both problems:

  • While inserting an element into the queue, we must ensure that the tail doesn't lag behind
  • While removing an element from the queue, we must ensure that the head hasn't already been updated by another thread before we update it

Fortunately, this is possible through the use of Interlocked.CompareExchange and very cautious reasoning about everything that can possibly break.  An analysis of a full implementation is beyond the scope of this post, but there are multiple sources on the web, including Julian M. Bucknall's excellent series.  To examine a practical implementation, you might want to take a look under the hood of System.Threading.dll in the Parallel Extensions CTP: the ConcurrentQueue<T> class is your friend.
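To give a taste of what this looks like, here is a minimal sketch in the spirit of the classic Michael-Scott algorithm (an illustration of the technique, not the actual ConcurrentQueue<T> code; the node fields are deliberately non-volatile here, since the interlocked operations provide the required fences):

using System.Threading;

class LockFreeQueue<T>
{
    class Node
    {
        public T Data;
        public Node Next;
        public Node(T data) { Data = data; }
    }

    Node head;    // dummy node; head.Next is the first real item
    Node tail;    // at or near the last node in the chain

    public LockFreeQueue()
    {
        tail = head = new Node(default(T));
    }

    public void Insert(T data)
    {
        Node node = new Node(data);
        while (true)
        {
            Node oldTail = tail;
            Node next = oldTail.Next;
            if (next == null)
            {
                // Try to link the new node after the current last node.
                if (Interlocked.CompareExchange(ref oldTail.Next, node, null) == null)
                {
                    // Linked. Now try to swing the tail forward; if this fails,
                    // another thread has already helped us, which is fine.
                    Interlocked.CompareExchange(ref tail, node, oldTail);
                    return;
                }
            }
            else
            {
                // The tail is lagging behind; help advance it, then retry.
                Interlocked.CompareExchange(ref tail, next, oldTail);
            }
        }
    }

    public bool TryRemove(out T data)
    {
        while (true)
        {
            Node oldHead = head;
            Node next = oldHead.Next;
            if (next == null)
            {
                data = default(T);
                return false;    // the queue is empty
            }
            // Try to make the first real node the new dummy; if another
            // thread got there first, retry against the new head.
            if (Interlocked.CompareExchange(ref head, next, oldHead) == oldHead)
            {
                data = next.Data;
                return true;
            }
        }
    }
}

Note how both operations loop until their CompareExchange succeeds: no thread ever blocks, but a thread may have to retry when it loses a race, which is precisely what distinguishes this structure from a locked one.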

Replacing the stock lock-based implementation with a lock-free data structure removes the single biggest bottleneck in the system.  Interlocked operations are not free (cache invalidation and memory fences are expensive), but they are orders of magnitude cheaper than a full kernel wait.  By the way, if you think a lock is expensive only because a kernel mechanism is involved, you're not seeing the full picture.  The dire consequence of contending for a frequently taken lock is the lock convoy!

The following diagram depicts a happy situation, in which threads make forward progress, each taking the lock for a very short time but doing so frequently.  Note that there is no stalling at all, because the timing happens to work out so that no two threads ever contend for the lock.

[Diagram: thread timelines with short, neatly interleaved lock acquisitions and no stalls]

On the other hand, a minor aberration in the system can lead to a thread being switched out while holding a lock (in the above diagram, the third and fourth threads are pretty damn close to it happening).  This results in the following:

[Diagram: the same threads after a convoy forms; most of each timeline is spent stalled, waiting for the lock]

Once the convoy is in effect, our threads are stalling more than half of the time.  Because of a single context switch that occurred while the lock was held, we now witness multiple threads repeatedly stalling on that lock.  With lock-free structures, this can never happen: a thread that gets switched out never prevents the others from making progress.  QED.
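To put very rough numbers on the costs involved, here is a crude micro-benchmark sketch comparing an uncontended monitor acquisition against an interlocked increment.  Keep in mind that an uncontended CLR monitor stays in user mode, so the single-threaded gap is modest; it is contention, with its kernel waits and convoys, that opens up the orders-of-magnitude difference:

using System;
using System.Diagnostics;
using System.Threading;

class CostSketch
{
    const int Iterations = 10000000;
    static readonly object syncRoot = new object();
    static long counter;

    static void Main()
    {
        // Uncontended lock acquisition around a trivial operation.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            lock (syncRoot) { counter++; }
        }
        Console.WriteLine("lock:        {0} ms", sw.ElapsedMilliseconds);

        // The same operation performed with an interlocked instruction.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            Interlocked.Increment(ref counter);
        }
        Console.WriteLine("interlocked: {0} ms", sw.ElapsedMilliseconds);
    }
}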


3 comments

  1. Dmitriy, August 3, 2008 at 3:25 AM

    I think there is a little mistake here:

        public T Remove()
        {
            QueueNode<T> oldHead = head.Next;
            head = oldHead.Next;
            return oldHead.Data;
        }

    should be:

        QueueNode<T> oldHead = head;

  2. Shital Shah, August 3, 2008 at 8:26 PM

    I fail to see how this is “lock free”. You ARE indeed using Interlocked.CompareExchange and it internally WILL lock and block other threads for a fixed amount of time. It is irrelevant whether the lock is kernel based or runtime based. You ARE locking!

  3. Sasha Goldshtein, August 9, 2008 at 6:07 AM

    @Shital Shah: The formal categorization here is “lock-free”: the algorithm never blocks a thread, although a thread may have to retry a failed CompareExchange.  (The individual interlocked operations are, by themselves, wait-free.)

    The “locking” used by an interlocked operation is implemented by the CPU and the memory bus, and from the thread’s or operating system’s perspective there is no wait at all. It is significantly cheaper than a kernel wait (which induces a context switch), which is why this is a compelling alternative. It is highly relevant whether the lock is kernel-based.
