Last time, we minimized contention by using lock-free operations instead of acquiring a lock on a work item queue (neither a standard lock nor a reader-writer lock offers anywhere near linear scaling). A few weeks ago, we also saw a technique for eliminating contention altogether by keeping state in thread-local storage.
However, there’s another trick in the bag for using shared state across threads without any locks. It applies to the scenario where multiple consumers “subscribe” to a stream of work items, such that every work item must be processed by every consumer. This is the classic publish/subscribe pattern, confined to a single process and subject to very high performance demands.
This pattern uses a cyclic buffer with a single shared counter indicating the last item produced. Producers increment the counter and then write the new item into the slot they have claimed; each consumer keeps a private counter which specifies the next item it has yet to read.
class CyclicQueue<T>
{
    int _currentIndex = -1;      // index of the last item produced
    readonly int _size;
    readonly T[] _items;
    public CyclicQueue(int size) { _items = new T[_size = size]; }
    public void Enqueue(T item)
    {
        // Claim the next slot, then write the item into it:
        int index = Interlocked.Increment(ref _currentIndex);
        _items[index % _size] = item;
    }
    public bool Dequeue(ref int threadPrivateIndex, out T value)
    {
        if (threadPrivateIndex == _currentIndex + 1)   // nothing new to read
        {
            value = default(T);
            return false;
        }
        value = _items[threadPrivateIndex++ % _size];
        return true;
    }
}
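Put together into a compilable sketch (repeating the class so the snippet stands alone), usage might look as follows. The `Demo` wrapper, the buffer size of 16, and the item count of 10 are mine, chosen for illustration; the example runs single-threaded so that the output is deterministic, but each consumer still uses its own private index, which is what lets every consumer see every item:

```csharp
using System;
using System.Threading;

class CyclicQueue<T>
{
    int _currentIndex = -1;      // index of the last item produced
    readonly int _size;
    readonly T[] _items;
    public CyclicQueue(int size) { _items = new T[_size = size]; }
    public void Enqueue(T item)
    {
        int index = Interlocked.Increment(ref _currentIndex);
        _items[index % _size] = item;
    }
    public bool Dequeue(ref int threadPrivateIndex, out T value)
    {
        if (threadPrivateIndex == _currentIndex + 1)   // nothing new to read
        {
            value = default(T);
            return false;
        }
        value = _items[threadPrivateIndex++ % _size];
        return true;
    }
}

class Demo
{
    static void Main()
    {
        var queue = new CyclicQueue<int>(16);
        for (int i = 0; i < 10; ++i)
            queue.Enqueue(i);                  // the producer side

        // Two independent consumers, each with its own private index,
        // both observe all ten items in order:
        for (int consumer = 0; consumer < 2; ++consumer)
        {
            int privateIndex = 0, item;
            while (queue.Dequeue(ref privateIndex, out item))
                Console.Write(item + " ");
            Console.WriteLine();
        }
    }
}
```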
As long as consumers don’t lag so far behind that producers wrap around the buffer and overwrite unread data, this pattern provides lock-less operations. The only synchronization required is around the single counter indicating the current index into the buffer, and this can get a bit tricky in the face of an overflow. To keep things simple, we compute counter % size whenever accessing the data instead of wrapping the counter itself when incrementing it.
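One wrinkle the modulo trick does not cover: in C#, the % operator takes the sign of its dividend, so once Interlocked.Increment wraps the shared counter past int.MaxValue into negative territory, counter % size turns negative and the array access would throw. A small illustration — the unsigned-mask workaround at the end is my own suggestion, and it assumes a power-of-two buffer size:

```csharp
using System;

class RemainderDemo
{
    static void Main()
    {
        // Two increments past int.MaxValue leave the counter here:
        int wrapped = int.MinValue + 1;
        // In C#, % keeps the dividend's sign, so this is negative
        // and unusable as an array index:
        Console.WriteLine(wrapped % 16);            // -15
        // Reinterpreting the counter as unsigned and masking with
        // (size - 1) stays a valid index when size is a power of two:
        uint uwrapped = unchecked((uint)wrapped);
        Console.WriteLine(uwrapped & 15);           // 1
    }
}
```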
Note that consumers in this scheme are required to busy-wait around the Dequeue operation as long as there is no new data in the queue. If this busy-wait is known to be expensive, we can either resort to kernel synchronization, or spin first and fall back to kernel synchronization. Using spinlocks to minimize contention and context switches will be the subject of our next installment.