Previously in the series we examined the performance differences between concurrency patterns based on kernel synchronization (critical sections, events, mutexes, etc.) and concurrency patterns based on user-mode spinning, built on the interlocked family of operations. Kernel synchronization tends to be expensive when locks are frequently acquired and released, because of the associated context switches, and it invites anti-patterns such as lock convoys. Spin-based synchronization tends to be expensive when locks are held for a long time, because the spinning burns CPU without making progress.
Oftentimes neither approach is flexible enough to implement a practical scenario, and an intermediate mechanism is required. Such an intermediate mechanism is available in the Win32 API in the form of a critical section with a spin count.
The InitializeCriticalSectionAndSpinCount API initializes a critical section and associates it with a spin count. When a thread attempts to acquire the critical section and finds it locked, it will first spin the CPU for the specified number of iterations (i.e. the spin count) before transitioning to a kernel wait. If the critical section is released while the thread is spinning, it will acquire the critical section without context switching and without a kernel wait, producing a significant performance boost.
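The two-phase acquire can be sketched in portable C11. This is not the real CRITICAL_SECTION implementation; all names are invented for illustration, and a pthread mutex/condition variable merely stands in for the kernel wait object:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch only; the real CRITICAL_SECTION layout and its
   kernel wait object are internal to Windows. */
typedef struct {
    atomic_int locked;       /* 0 = free, 1 = taken */
    unsigned spin_count;     /* iterations to spin before blocking */
    pthread_mutex_t mtx;     /* stand-in for the kernel wait object */
    pthread_cond_t cv;
} spin_then_block_lock;

static void stb_init(spin_then_block_lock *l, unsigned spin_count) {
    atomic_init(&l->locked, 0);
    l->spin_count = spin_count;
    pthread_mutex_init(&l->mtx, NULL);
    pthread_cond_init(&l->cv, NULL);
}

static bool stb_try_acquire(spin_then_block_lock *l) {
    int expected = 0;        /* succeed only if currently free */
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

static void stb_acquire(spin_then_block_lock *l) {
    /* Phase 1: spin for up to spin_count iterations, hoping the owner
       releases the lock quickly. */
    for (unsigned i = 0; i < l->spin_count; i++)
        if (stb_try_acquire(l))
            return;          /* acquired without any blocking wait */
    /* Phase 2: give up spinning and block (the "kernel wait"). */
    pthread_mutex_lock(&l->mtx);
    while (!stb_try_acquire(l))
        pthread_cond_wait(&l->cv, &l->mtx);
    pthread_mutex_unlock(&l->mtx);
}

static void stb_release(spin_then_block_lock *l) {
    atomic_store(&l->locked, 0);
    pthread_mutex_lock(&l->mtx);  /* wake one blocked waiter, if any */
    pthread_cond_signal(&l->cv);
    pthread_mutex_unlock(&l->mtx);
}
```

On Windows itself the equivalent would simply be InitializeCriticalSectionAndSpinCount(&cs, spinCount) followed by the usual EnterCriticalSection/LeaveCriticalSection calls; note that the spin count is ignored on single-processor machines, where spinning cannot help.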
Critical sections with spin counts are a specific example of the general spinlock pattern used in the Windows kernel and in user-mode frameworks such as the Parallel Extensions for .NET.
A typical kernel implementation of a spinlock is a single bit. The spinlock has the following semantics:
- When the spinlock is taken, the bit is set to 1.
- When the spinlock is free, the bit is set to 0.
- When a thread attempts to acquire the spinlock, it spins until it succeeds in atomically changing the bit from 0 to 1 (using an atomic instruction such as x86 LOCK BTS, or an API such as InterlockedCompareExchange).
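A minimal sketch of this one-bit spinlock, using a C11 compare-and-exchange in place of InterlockedCompareExchange / LOCK BTS (all names are illustrative):

```c
#include <stdatomic.h>

typedef struct { atomic_int bit; } spinlock;  /* 0 = free, 1 = taken */

static void spin_lock_init(spinlock *s) { atomic_init(&s->bit, 0); }

static void spin_lock_acquire(spinlock *s) {
    for (;;) {
        int expected = 0;    /* succeed only if the bit is 0 */
        if (atomic_compare_exchange_weak(&s->bit, &expected, 1))
            return;          /* atomically set 0 -> 1: the lock is ours */
        /* otherwise keep spinning; a real implementation would issue a
           pause/yield hint inside this loop */
    }
}

static void spin_lock_release(spinlock *s) { atomic_store(&s->bit, 0); }
```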
A refinement of the spinlock pattern that is better suited to multi-processor machines is the queued spinlock. A queued spinlock is typically implemented as a queue of the CPUs waiting for the lock, where each waiting CPU spins on a private bit of its own. When the spinlock is released, the releasing CPU hands it over to the next waiting CPU by flipping that CPU's private bit and removing it from the queue.
This pattern minimizes inter-processor contention for memory because each CPU spins on a private bit which serves as its spinlock indicator. It also introduces first-in-first-out order for acquiring the spinlock to guarantee fairness.
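One well-known user-mode realization of this idea is the MCS lock, in which each waiter brings its own queue node and spins on that node's private flag. A hedged C11 sketch (illustrative names, not the kernel's actual queued-spinlock implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS-style queued lock: waiters form a FIFO queue of nodes, and each
   waiter spins only on its own node's flag, minimizing cross-CPU
   cache-line contention. */
typedef struct qnode {
    _Atomic(struct qnode *) next;  /* next waiter in the queue */
    atomic_bool waiting;           /* the private spin bit */
} qnode;

typedef struct { _Atomic(qnode *) tail; } queued_lock;

static void ql_init(queued_lock *l) { atomic_init(&l->tail, NULL); }

static void ql_acquire(queued_lock *l, qnode *me) {
    atomic_init(&me->next, NULL);
    atomic_init(&me->waiting, true);
    qnode *prev = atomic_exchange(&l->tail, me);  /* join at the tail */
    if (prev != NULL) {
        atomic_store(&prev->next, me);            /* link behind prev */
        while (atomic_load(&me->waiting))
            ;                                     /* spin on private bit */
    }
    /* prev == NULL: the queue was empty, the lock is ours immediately */
}

static void ql_release(queued_lock *l, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: if we are still the tail, empty the queue. */
        qnode *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* A successor is mid-enqueue; wait for it to link itself in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->waiting, false);  /* hand off: clear its spin bit */
}
```

The hand-off in ql_release is exactly the FIFO transfer described above: the releaser clears the successor's private bit instead of freeing a shared flag for everyone to fight over.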
Side note: in the Windows kernel, queued spinlocks are used on multi-processor systems. On a single-processor system spinning is unnecessary, because spinlock synchronization is required only at high IRQLs, and at high IRQLs (DISPATCH_LEVEL and above) a context switch cannot occur. Instead of spinning, the acquiring thread can simply request an interrupt at the relevant IRQL and return; the interrupt remains masked until the releasing thread lowers the IRQL below the requested level.