Next Generation Production Debugging: Demo 4

April 8, 2008

After utilizing WinDbg and SOS to diagnose a memory leak in our application, I shifted focus to a whole different category of problems – deadlocks.

By issuing the “Move” command on a particular picture in the client application, the user ends up with a non-responsive UI. Without forcing our way in with a debugger, we can’t tell for sure whether the hang originates in the UI or in the WCF service being called.

However, there’s a basic way of diagnosing deadlocks on Windows Vista and Windows Server 2008, which is built into the operating system and requires no external tools whatsoever. This mechanism is called Wait Chain Traversal (WCT), and it enables us to traverse the threads and the synchronization objects these threads are waiting on. It will only work for synchronization mechanisms that Windows itself tracks (such as mutexes, critical sections, ALPC ports, COM calls and window messages sent with SendMessage), and if the objects have names we will benefit from knowing exactly what object we’re talking about.

To make this easier, I’ve written a simple tool called WCT Deadlock Detector which provides a very thin wrapper on top of the WCT APIs (the tool itself is available in the source code for the session I uploaded earlier).

If you run the tool and choose a process from the left, you get the output from the WCT API:

image

Analyzing what we see on the right, the first and most obvious thing is the big “DEADLOCK” we just detected between the two threads we are looking at.  The basic scenario is that thread A is waiting for mutex 1, which is currently held by thread B.  Thread B, on the other hand, is waiting for mutex 2, which is currently held by thread A.  This results in two threads waiting for each other to release the mutex in order to proceed, and so the system cannot make any forward progress – it is in a state of deadly embrace.
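To make the idea of a wait chain concrete, here is a minimal sketch in Python. It is not the WCT API and not the code of the WCT Deadlock Detector; it only models the scenario above with two hypothetical dictionaries (`waits_on` and `owned_by`, both invented for this illustration) and follows the thread, object, owner links until the chain either ends or loops back on itself.

```python
# Hypothetical snapshot of the scenario described above.
# The thread and mutex names are invented for illustration.
waits_on = {"thread A": "mutex 1", "thread B": "mutex 2"}  # thread -> object it is blocked on
owned_by = {"mutex 1": "thread B", "mutex 2": "thread A"}  # object -> thread that holds it


def wait_chain(start, waits_on, owned_by):
    """Follow thread -> object -> owner links starting at `start`.

    Returns (chain, is_deadlock). is_deadlock is True when the chain
    loops back to a thread that is already on it.
    """
    chain = [start]
    seen = {start}
    thread = start
    while thread in waits_on:
        obj = waits_on[thread]
        chain.append(obj)
        owner = owned_by.get(obj)
        if owner is None:      # nobody owns the object: the chain ends here
            return chain, False
        chain.append(owner)
        if owner in seen:      # back at a thread we already visited: a cycle
            return chain, True
        seen.add(owner)
        thread = owner
    return chain, False        # the last thread on the chain is not blocked


chain, is_deadlock = wait_chain("thread A", waits_on, owned_by)
print(" -> ".join(chain), "DEADLOCK" if is_deadlock else "")
```

For the snapshot above the chain comes back to thread A, which is exactly the cycle reported as a deadlock. If thread B were not waiting on anything, the chain would simply end at thread B and no deadlock would be flagged.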

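Since we have the mutex names, all we have to do now is go back to the source code and find out why the threads are entering this kind of deadly embrace. A typical technique for fixing deadlocks of this kind is defining a lock leveling strategy: every lock is assigned a level, and a thread may only acquire locks in one agreed order of levels, so two threads can never end up holding the same pair of locks in opposite order.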
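As a rough illustration of lock leveling, the following Python sketch wraps a lock with a numeric level and refuses any acquisition that goes against the level order. The `LeveledLock` class and the lock names are hypothetical and are not taken from the demo application; the sketch also assumes locks are released in reverse order of acquisition, which nested `with` blocks guarantee.

```python
import threading


class LeveledLock:
    """A lock with a numeric level. A thread may only acquire locks in
    strictly increasing level order; anything else raises immediately."""

    _held = threading.local()  # per-thread stack of the levels currently held

    def __init__(self, level):
        self.level = level
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(self._held, "stack", None)
        if stack is None:
            stack = self._held.stack = []
        if stack and stack[-1] >= self.level:
            raise RuntimeError(
                f"lock level violation: holding level {stack[-1]}, "
                f"requested level {self.level}"
            )
        self._lock.acquire()
        stack.append(self.level)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._held.stack.pop()
        self._lock.release()
        return False  # never swallow exceptions


source_lock = LeveledLock(1)  # hypothetical names for the two locks
target_lock = LeveledLock(2)

with source_lock:             # level 1, then level 2: allowed
    with target_lock:
        pass

try:
    with target_lock:         # level 2, then level 1: rejected
        with source_lock:
            pass
except RuntimeError as error:
    violation = str(error)
    print(violation)
```

The point of the design is that an ordering mistake fails loudly on the first run of the offending code path, instead of turning into a hang that only shows up when two threads happen to interleave badly.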

Let’s try something else. The user, hoping to somehow disentangle the deadlock, tries performing the “Move” operation with the “Overwrite” checkbox checked. However, the application is still stuck, and the WCT Deadlock Detector doesn’t show any useful output except for the fact that the threads are blocked waiting for something to happen.

In that case, we have to resort to breaking in with a debugger and analyzing what’s going on. Since we don’t know whether the service or the client is responsible for the hang, we have to guess. So let’s guess that it’s the service (as the author of the buggy application I can make a really educated guess here). After attaching to the service process and loading SOS, we can use the !threads and !clrstack commands to see what the individual threads are doing. From the debugger output we can establish the following two call stacks for the only threads that seem related to processing the client request:

image

image

So both threads are waiting for a .NET Monitor (which is the underlying mechanism for the C# lock keyword). What are they waiting on, and do we have a deadlock here? The built-in SOS commands make this difficult to answer, although the !syncblk command can help by listing the locked objects and the threads that own them. However, the entire process can be fully automated by using another community debugger extension called SOSEX (so it’s really a debugger extension extension). Once we have it loaded (using .load) we can use the !dlk command to see the following deadlock beautifully detected:

image

We can see the exact objects on which the two threads are waiting to take a lock, and we can see the exact point in code where each thread is stuck waiting to acquire it. From here we can proceed to the source to fix the problem, which is very similar to the previously discussed native case (where kernel synchronization objects, mutexes, caused the deadlock).
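The code pattern that typically produces this kind of deadlock is two code paths taking the same pair of locks in opposite order. The sketch below reproduces that pattern in Python, with `threading.Lock` standing in for the C# lock statement; the names are hypothetical and it is not the code of the demo service. It uses a timeout on the second acquisition so the script reports the problem and exits instead of hanging, and two barriers to force the unlucky interleaving every time.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
both_hold_first = threading.Barrier(2)  # both threads now hold their first lock
both_tried = threading.Barrier(2)       # both threads finished trying the second lock
results = {}


def worker(name, first, second):
    with first:
        both_hold_first.wait()
        # Without the timeout this call would block forever: the other
        # thread holds `second` and is waiting for `first`.
        got_second = second.acquire(timeout=0.2)
        results[name] = got_second
        both_tried.wait()
        if got_second:
            second.release()


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)  # neither thread managed to take its second lock
```

Making both workers take `lock_a` before `lock_b` removes the cycle, which is the same lock ordering idea that applies to the mutexes in the native case.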

2 comments

  1. Geva, April 8, 2008 at 8:32 PM

    1. You said that production debugging is good for the users, but you can’t (or at least I haven’t seen how you can) solve the problem in runtime anyway, so the users don’t really care. Maybe it helps *you* track down the bug faster and you don’t have to go back to dev and find the bug’s scenario, but the users don’t really get an instant fix, so why do they care?

    2. You said that we have to learn the theory. Where to start? Is there a book to read? 😉

  2. Mark Plummer, January 30, 2009 at 5:45 PM

    Have you seen Debug Inspector? http://www.debuginspector.com It detects deadlocks in managed and native code
