Next Generation Production Debugging: Demo 4
After using WinDbg and SOS to diagnose a memory leak in our application, I shifted focus to a whole different category of problems: deadlocks.
By issuing the "Move" command on a particular picture in the client application, the user ends up with a non-responsive UI. Without forcing our way in with a debugger, we can't tell for sure whether the cause of the hang is in the UI or in the WCF service it calls.
However, there's a basic way of diagnosing deadlocks on Windows Vista and Windows Server 2008 that is built into the operating system and requires no external tools whatsoever. This mechanism is called Wait Chain Traversal (WCT), and it enables us to traverse the threads and the synchronization objects these threads are waiting on. It only works for the synchronization mechanisms WCT can track (such as mutexes, critical sections, ALPC ports, COM calls and cross-thread SendMessage calls), and if the objects have names we benefit from knowing exactly which object we're talking about.
To make this easier, I've written a simple tool called WCT Deadlock Detector which provides a very thin wrapper on top of the WCT APIs (the tool itself is available in the source code for the session I uploaded earlier).
If you run the tool and choose a process from the left, you get the output from the WCT API:
Analyzing what we see on the right, the first and most obvious thing is the big "DEADLOCK" we just detected between the two threads we are looking at. The basic scenario is that thread A is waiting for mutex 1, which is currently held by thread B. Thread B, on the other hand, is waiting for mutex 2, which is currently held by thread A. Each thread is waiting for the other to release a mutex before it can proceed, so the system cannot make any forward progress - it is in a state of deadly embrace.
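To make the "thread A waits for mutex 1, held by thread B" reasoning concrete, here is a minimal Python sketch of walking a wait chain and reporting a cycle. It is not the WCT API and not the code of my tool; the function name find_wait_cycle and the two dictionaries are hypothetical stand-ins for the information a wait chain provides.

```python
from typing import Dict, List, Optional


def find_wait_cycle(
    waits_for: Dict[str, str],   # thread -> lock it is blocked on
    owned_by: Dict[str, str],    # lock -> thread that currently owns it
    start: str,
) -> Optional[List[str]]:
    """Follow the wait chain from `start`; return the cycle if one exists."""
    chain: List[str] = []
    position: Dict[str, int] = {}  # thread -> index of that thread in `chain`
    thread: Optional[str] = start
    while thread is not None:
        if thread in position:
            # We came back to a thread already on the chain: deadlock.
            return chain[position[thread]:]
        position[thread] = len(chain)
        chain.append(thread)
        lock = waits_for.get(thread)
        if lock is None:
            return None  # this thread is not blocked, so the chain ends
        chain.append(lock)
        thread = owned_by.get(lock)  # None if nobody owns the lock
    return None


# The scenario described above (names are illustrative only).
waits_for = {"thread A": "mutex 1", "thread B": "mutex 2"}
owned_by = {"mutex 1": "thread B", "mutex 2": "thread A"}
cycle = find_wait_cycle(waits_for, owned_by, "thread A")
```

With the data above, the chain starting at thread A leads back to thread A, which is exactly the condition reported as a deadlock. If thread B were not waiting on anything, the chain would simply end and no cycle would be reported.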
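Since we have the mutex names, all we have to do now is go back to the source code and find out why the threads are entering this kind of deadly embrace. A typical technique for fixing deadlocks of this kind is defining a lock leveling strategy: every lock is assigned a level, and a thread may only acquire locks in ascending level order, so two threads can never hold the same pair of locks in opposite orders.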
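As a rough illustration of lock leveling, the following Python sketch refuses to acquire a lock whose level is not higher than the level of the lock the current thread already holds. The LeveledLock class and the hold helper are hypothetical names I made up for this example; they are not part of the demo application, and the locks are deliberately non-reentrant to keep the sketch short.

```python
import threading
from contextlib import contextmanager


class LeveledLock:
    """A lock tagged with a level; lower levels must be taken first."""

    def __init__(self, name: str, level: int) -> None:
        self.name = name
        self.level = level
        self._lock = threading.Lock()


# Per-thread stack of the leveled locks the current thread holds.
_held = threading.local()


@contextmanager
def hold(lock: LeveledLock):
    stack = getattr(_held, "stack", None)
    if stack is None:
        stack = _held.stack = []
    # Levels on the stack strictly increase, so the top is the highest held.
    if stack and stack[-1].level >= lock.level:
        raise RuntimeError(
            f"lock order violation: {lock.name} (level {lock.level}) "
            f"requested while holding {stack[-1].name} "
            f"(level {stack[-1].level})"
        )
    lock._lock.acquire()
    stack.append(lock)
    try:
        yield lock
    finally:
        stack.pop()
        lock._lock.release()


mutex1 = LeveledLock("mutex 1", level=1)
mutex2 = LeveledLock("mutex 2", level=2)

# Allowed: ascending order.
with hold(mutex1):
    with hold(mutex2):
        pass
```

Taking mutex 2 first and then asking for mutex 1 raises a RuntimeError instead of blocking, which turns a potential hang in production into an immediate, debuggable failure during development.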
Let's try something else. The user, hoping to somehow disentangle the deadlock, tries performing the "Move" operation with the "Overwrite" checkbox checked. However, the application is still stuck, and this time the WCT Deadlock Detector doesn't show any useful output beyond the fact that the threads are blocked waiting for something to happen.
In that case, we have to resort to breaking in with a debugger and trying to analyze what's going on. Since we don't know whether the service or the client is responsible for the hang, we have to guess. So let's guess that it's the service (as the author of the buggy application I can make a really educated guess here). After attaching and loading SOS, we can use the !threads and !clrstack commands to see what the individual threads are doing. From the debugger spew we can establish the following two call stacks for the only threads that seem related to the message processing of the client request:
So both threads are waiting for a .NET Monitor (which is the underlying mechanism for the C# lock keyword). What are they waiting on, and do we have a deadlock here? Answering that with the built-in SOS commands is tedious, although the !syncblk command would get us there eventually. However, the entire process can be fully automated by using another community debugger extension called SOSEX (so it's really a debugger extension extension). Once we have it loaded (using .load) we can use the !dlk command to see the following deadlock beautifully detected:
We can see the exact objects whose locks the two threads are waiting for, and we can see the exact point in code where each thread is stuck waiting to acquire its lock. From here we can proceed to the source to fix the problem, which is very similar to the previously discussed native case (where kernel synchronization objects, namely mutexes, caused the deadlock).