Here’s a quick brainteaser for you. Suppose you really want to find all the prime numbers in a certain range, and store them in a List<uint>. And also suppose that you want to parallelize that calculation to make it as quick as possible. You then need to synchronize access to the list so that it’s not corrupted by add operations performed in multiple threads. Would it be better to use a C# lock (CLR Monitor) or a Windows mutex to protect the list of primes?
Parallel.For(2, 400000, n =>
mutex.WaitOne(); //Or maybe lock(something) would be better?
It turns out the answer depends on the number of processors you have. If you limit the parallelization to only a single processor, the mutex version takes “only” 2x longer to run on my machine. However, when I let the loop parallelize across 2 processors, the mutex version becomes 90x slower (!!!). Finally, when parallelized across 4 processors, the mutex version grinds to a complete halt and runs 200x slower than the other version. In fact, the more I parallelize, the worse the performance of the mutex-based version becomes.
This is not very surprising considering that the mutex introduces a strong bottleneck and requires a system call to acquire and release. In fact, even if the mutex is free, it takes a system call to acquire and release it. But this seems also true of the lock keyword; how can it be so fast?
In fact, if you use the Visual Studio Concurrency Visualizer, you can see an interesting phenomenon. Here are the application’s threads in the mutex-based version – the reddish pink is time spent doing synchronization, the green is when the threads are actually running:
Here’s what it looks like zoomed in to a tiny segment of less than 10 milliseconds:
And here are the threads in the lock-based version (note that the bottom two threads complete their part of the work very quickly, so the synchronization you see towards the end is simply thread pool worker threads waiting for work, and not synchronization induced by our algorithm):
It is clearly evident that the lock-based version introduces almost no synchronization. How can that be? After all, there must be at least some contention over the lock, even though it might be faster to acquire?
It turns out (and this is partially documented elsewhere) that the CLR Monitor tries spinning for several hundred cycles before issuing a true wait on a kernel event handle. Because the time we spend inside the lock is very short, it’s always the case that the lock is released before the waiter gives up on the spinning (busy waiting). This is why we don’t see any synchronization, and, as a corollary, don’t incur any kernel transitions and expensive context switches.
Although this example is overly pessimistic, it illustrates well that you should take care to choose the synchronization mechanism that does exactly what you need and not more; specifically, if you don’t need the extra abilities of the mutex (such as cross-process synchronization), you should stick with the simpler lock.
I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn