Practical Concurrency Patterns: Lock-Free Operations
In the previous installments we have reviewed multiple strategies for caching or storing calculated key-value data so that accesses to it are optimized to the highest applicable degree. For simpler storage types, such as a work item queue, there are even cheaper alternatives that what we've seen so far.
Consider the following scenario:
- A stock trade application receives trade orders at the alarming rate of thousands of operations per second from multiple sources
- Orders are placed in a queue for further processing in order to free the front-end component(s)
- Orders are removed from the queue by multiple processing components which proceed to perform very brief update operations on in-memory data
This scenario can be easily dominated by CPU work, as most of the operations are performed in memory. The single bottleneck in the above design is the queue. A read-write strategy will be useless because all operations on the queue are writes (insert and remove operations), so we need something else to minimize contention.
We observe that the only two operations actually requiring a lock are the inserting items to and removing items from the queue. Creating, filtering or otherwise manipulating work items can be performed without synchronization.
This is the classic use case for a lock-free queue. If the queue is maintained as a singly linked list, an insert operation can be implemented by performing one atomic write - updating the list tail. Similarly, a remove operation can be implemented by performing one atomic write - updating the list head. (Or so it might appear.)
Since reference writes under the CLR memory model are guaranteed to be atomic, it might appear that we don't need synchronization around the queue at all. This assumption quickly breaks, however, if we consider the following naive implementation:
class NaiveLockFreeQueue<T>
{
volatile QueueNode<T> head;
volatile QueueNode<T> tail;
public NaiveLockFreeQueue()
{
tail = head = new QueueNode<T>(default(T));
}
public void Insert(T data)
{
QueueNode<T> node = new QueueNode<T>(data);
tail.Next = node;
tail = node;
}
public T Remove()
{
QueueNode<T> oldHead = head.Next;
head = oldHead.Next;
return oldHead.Data;
}
}
The funny thing around here is that you might test this code a long many times and it will work properly across multiple threads, especially if you're testing it on a single CPU. The reason is that it is rather unlikely for a race condition to occur and invalidate the above implementation - it would require a context switch between two specific instructions in the Insert or Remove implementation. But the results would be disastrous. Consider the following sequence:
| Thread 1 | Thread 2 |
| Sets tail.Next to X1 | |
| | Sets tail.Next to X2 |
| | Sets tail to X2 |
| Sets tail to X1 | |
As a result, we have a split list!
Different and boring is the way Remove can break:
| Thread 1 | Thread 2 |
| Sets oldHead to X | |
| | Sets oldHead to X |
| | Sets head to oldHead.Next (Y) |
| Sets head to oldHead.Next (Y) | |
| Returns X.Data | |
| | Returns X.Data |
As a result, both threads observe the same removed item!
As you see, reasoning about lock-free data structures is already not trivial. To make the insert and remove logic valid, we must introduce some sort of synchronization to address both problems:
- While inserting an element to the queue, we must ensure that the tail doesn't lag behind
- While removing an element from the queue, we must ensure that the head was not updated before we update it
Fortunately, it is possible through the use of Interlocked.CompareExchange and very cautious reasoning about what can possibly break. An analysis of a full implementation is beyond the scope of this post, but there are multiple sources on the web, including Julian M. Bucknail's excellent series. To examine a practical implementation, you might want to take a look under the hood of System.Threading.dll in the Parallel Extensions CTP - the ConcurrentQueue<T> class is your friend.
Replacing the stock lock implementation with a lock-free data structure removes the single biggest bottleneck in the system. Interlocked operations are not free of cost (cache invalidation and memory fences are expensive), but they are orders of magnitude cheaper than a full kernel lock. By the way, if you think a lock is expensive only because a kernel mechanism is involved, you're not seeing the full picture. The dire consequence of contending for a frequently taken lock is lock convoys!
The following diagram depicts a happy situation, where threads are making forward progress taking a lock for a very short time but doing so frequently. Note that there is no stalling at all because everything appears to be arranged and timed so that threads never contend for the lock.
On the other hand, a minor aberration in the system can lead to a thread being switched out while holding a lock (in the above diagram, the third and fourth threads are pretty damn close to it happening). This results in the following:
Our threads are stalling more than half of the time now, once the convoy is in effect. Due to the initial context switch while a lock was held, we now witness the effect of multiple threads stalling for the lock. With lock-free structures, this would never happen. QED.