Lock vs. Mutex

July 12, 2013

15 comments

Here’s a quick brainteaser for you. Suppose you really want to find all the prime numbers in a certain range, and store them in a List<uint>. And also suppose that you want to parallelize that calculation to make it as quick as possible. You then need to synchronize access to the list so that it’s not corrupted by add operations performed in multiple threads. Would it be better to use a C# lock (CLR Monitor) or a Windows mutex to protect the list of primes?

Parallel.For(2, 400000, n =>
{
    if (IsPrime((uint)n))
    {
        mutex.WaitOne(); //Or maybe lock(something) would be better?
        primes.Add((uint)n);
        mutex.ReleaseMutex();
    }
});

It turns out the answer depends on the number of processors you have. If you limit the parallelization to only a single processor, the mutex version takes “only” 2x longer to run on my machine. However, when I let the loop parallelize across 2 processors, the mutex version becomes 90x slower (!!!). Finally, when parallelized across 4 processors, the mutex version grinds to a complete halt and runs 200x slower than the other version. In fact, the more I parallelize, the worse the performance of the mutex-based version becomes.

This is not very surprising considering that the mutex introduces a strong bottleneck and requires a system call to acquire and release. In fact, even if the mutex is free, it takes a system call to acquire and release it. But this seems also true of the lock keyword; how can it be so fast?

In fact, if you use the Visual Studio Concurrency Visualizer, you can see an interesting phenomenon. Here are the application’s threads in the mutex-based version – the reddish pink is time spent doing synchronization, the green is when the threads are actually running:

image

Here’s what it looks like zoomed in to a tiny segment of less than 10 milliseconds:

image

And here are the threads in the lock-based version (note that the bottom two threads complete their part of the work very quickly, so the synchronization you see towards the end is simply thread pool worker threads waiting for work, and not synchronization induced by our algorithm):

image

It is clearly evident that the lock-based version introduces almost no synchronization. How can that be? After all, there must be at least some contention over the lock, even though it might be faster to acquire?

It turns out (and this is partially documented elsewhere) that the CLR Monitor tries spinning for several hundred cycles before issuing a true wait on a kernel event handle. Because the time we spend inside the lock is very short, it’s always the case that the lock is released before the waiter gives up on the spinning (busy waiting). This is why we don’t see any synchronization, and, as a corollary, don’t incur any kernel transitions and expensive context switches.

Although this example is overly pessimistic, it illustrates well that you should take care to choose the synchronization mechanism that does exactly what you need and not more; specifically, if you don’t need the extra abilities of the mutex (such as cross-process synchronization), you should stick with the simpler lock.


I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

15 comments

  1. ErikJuly 15, 2013 ב 8:44 AM

    Wow, I had no idea. o.O I don’t often use mutexes anyway – only when I need cross-process sync as you mention – but this is still very, very good to know. Thank you for the insight! =)

    Reply
  2. QuintonJuly 16, 2013 ב 1:03 AM

    I always used lock() but never knew the alternative was so much slower. Wow, thanks for sharing!

    Reply
  3. JeremyCaJuly 16, 2013 ב 2:19 AM

    Isn’t this scenario the reason why there are thread-safe collections? e.g.:http://msdn.microsoft.com/en-us/library/dd997305%28v=vs.100%29.aspx

    Granted the concurrent collections are slower than a lock in some instances, but they do take the headache away from having to deal with synchronisation.

    Reply
  4. GheorgheJuly 16, 2013 ב 2:28 AM

    There is a difference between inter-process synchronization (IPS) and thread synchronization (TS). Mutex is mainly aimed to IPS, while locks for TS.
    You should not used Mutex-es if you stay within one process.

    Reply
  5. KavJuly 16, 2013 ב 2:54 AM

    Unfortunately, I cannot see your lock equivalent, so can only guess that you are comparing it with something like this. Is this correct?

    Parallel.For(2, 400000, n =>
    {
    if (IsPrime((uint)n))
    {
    lock(primes)
    {
    primes.Add((uint)n);
    }
    }
    });

    Reply
  6. thomasJuly 16, 2013 ב 8:45 AM

    Probably you dont even need any lock. There is no need
    to read and to write to your prime list at the same time.
    If you have n prime numbers in your list, you can calculate all primes up to n^2 from this list, while concurrent read access shouldnt be any problem.
    After this you could copy all calculated primes to your
    list. Should be a lot faster:)

    Reply
  7. Larry OHeronJuly 16, 2013 ב 9:36 AM

    I tried several variations and did not see this behavior. Both ran in equal time.
    Intel i5 (64-bit) Quad-core w/8GB ram
    Win7 Pro SP1
    VS 2012 SP1

    Reply
  8. Michael WatersJuly 16, 2013 ב 9:42 AM

    So is C#/CLR’s lock analogous to MFC’s CCriticialSection.Lock() and .Unlock()?

    Reply
  9. OlegJuly 16, 2013 ב 10:17 AM

    It’s called Spinlock, not just a Lock.

    Reply
  10. BradenJuly 16, 2013 ב 11:19 AM

    Very insightful, thanks!

    Reply
  11. TomJuly 16, 2013 ב 11:22 AM

    Curious how they compare for multiple threads running on a single processor. I think here the mutex should perform better – as the spinlock will eat cycles that could be used by the other process. Although it’s not particularly relevant to the parallelisation question, it is quite relevant to situations where threads are used to separate different tasks rather than all execute the same task.

    Reply
  12. RIchardJuly 16, 2013 ב 11:41 AM

    Yup.. always pick the synchronization mechanism that lives closest to your process. Inter-process mechanism are very bad if you are only synchronizing threads in a single process.

    On Windows inter-process is Mutex, single process are lock (in .NET). For native, those are called Critical Sections (single process locks).

    In native its a 10-1 difference between mutexes (10) and Critical Sections (1).

    Each operating system has a similar concept.. the difference between process-owned and OS-owned locks.. and the OS-owned locks are always more expensive. If we are talking about something that happens only 10 times during a run, it doesn’t really matter which you choose.. but if its heavily used, once you start to scale you’ll feel it big time. So always beware.

    Reply
  13. AxelJuly 17, 2013 ב 5:17 AM

    Did you consider a CAS algorithm to get lock & wait freedom?

    Reply
  14. ErvingJuly 17, 2013 ב 1:27 PM

    Don’t blame the locks.
    I don’t think you have optimized to solution correctly.

    No sane person would calculate every prime number starting from 1 !
    No sane thread would either.

    How can you properly parallelize this task?
    You can have several threads looking into an array,
    and waiting until it reaches certain N; plus a thread that calculates the array.
    That’s trivial.
    And unnecessary – you are trying to use parallelism on search in the array + output?

    You can create several threads for filtering out non-primes in different [ranges] of N.
    It is less trivial.

    Bottom line – you did not show the whole program.
    Who knows what it does, and why is it slow for you.

    Reply
  15. Sasha GoldshteinJuly 18, 2013 ב 8:46 AM

    Thank you all for the insightful comments. Indeed, switching to lock-free code, using a thread-safe collection, aggregating results on each thread without accessing shared state, or even improving the algorithm — are all very good ideas. I have indeed explored some of these ideas in my book, “Pro .NET Performance”: http://www.amazon.com/Pro-NET-Performance-Optimize-Applications/dp/1430244585

    However, the purpose of this post was to explain the difference between mutexes and CLR monitors in a simple setting. Hopefully, this was sufficiently convincing.

    Reply