Next Generation Production Debugging: Demo 5

April 9, 2008

no comments

After looking at managed and native deadlock diagnosis, we transitioned to a state of banging our heads against the table, which is a state familiar to many developers from their debugging all-nighters.  How did we get in such a state?  By issuing the modest “Batch Move” command on a set of pictures we wanted to move to a separate folder.  The application responds in a way we have already seen before – it gets completely stuck.

If you try deadlock diagnosis using the techniques I’ve shown earlier (WCT or the SOSEX !dlk extension), you won’t find anything useful.  Our last resort is trying to understand what our threads are doing, and see if we can come up with anything.

So at the client side (after attaching WinDbg and loading SOS), we have three threads, as shown in the debugger output of the !threads command:

image

Thread #0 is most likely the WinForm main GUI thread, #2 is the finalizer (we can see that in the Exception column not depicted in the above screenshot), and #6 is a thread we’re not entirely sure about.

So let’s go ahead and see what threads #0 and #6 are up to (the finalizer is of little interest unless other threads are waiting for finalization).  By switching into the thread and issuing the !clrstack and k commands we can see the managed and native call stack for the thread:

~0s
!clrstack

The debugger produces the following output (I’ve marked interesting frames in red and bold and cleaned the output a little bit by removing method parameters to make sure we can see what’s going on):

001be510 76ffa69d [NDirectMethodFrameStandalone: 001be510] System.Net.UnsafeNclNativeMethods+OSSOCK.recv
001be528 7a5ff523 System.Net.Sockets.Socket.Receive
001be554 7a5ff401 System.Net.Sockets.Socket.Receive
001be570 7a5d3798 System.Net.Sockets.NetworkStream.Read
001be5a0 7a5ab17a System.Net.PooledStream.Read
001be5b8 7a5b01d3 System.Net.Connection.SyncRead
001be60c 7a5b0089 System.Net.Connection.PollAndRead
001be624 7a5b52fb System.Net.ConnectStream.PollAndRead
001be630 7a582278 System.Net.HttpWebRequest.EndWriteHeaders
001be65c 7a580be5 System.Net.HttpWebRequest.WriteHeadersCallback
001be66c 7a5b52bc System.Net.ConnectStream.WriteHeaders
001be6bc 7a5820d2 System.Net.HttpWebRequest.EndSubmitRequest
001be6e8 7a57f61b System.Net.HttpWebRequest.CheckDeferredCallDone
001be6f8 7a57f842 System.Net.HttpWebRequest.GetResponse
001be738 50971909 System.ServiceModel.Channels.HttpChannelFactory+HttpRequestChannel+HttpChannelRequest.WaitForReply
001be774 5092e8c6 System.ServiceModel.Channels.RequestChannel.Request
001be7f0 50b81bd5 System.ServiceModel.Dispatcher.RequestChannelBinder.Request
001be800 50b5e105 System.ServiceModel.Channels.ServiceChannel.Call001be930 50b5df23 System.ServiceModel.Channels.ServiceChannel.Call001be950 50b63ab1 System.ServiceModel.Channels.ServiceChannelProxy.InvokeService001be978 5091291b System.ServiceModel.Channels.ServiceChannelProxy.Invoke
001be9bc 79374dc3 System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke
001bec58 79f98b43 [TPMethodFrame: 001bec58] PictureViewer.Client.PictureViewerService.IPictureViewer.BatchMovePictures
001bec6c 00332e13 PictureViewer.Client.PictureViewerForm.batchMoveToolStripMenuItem_Click

What we can make out of the following output is the propagation of a service call to the top of the stack, where it finally simply waits for a response.  So this is an innocent thread all in all.  Let’s take a look at #6 then:

05fbf780 76ffaec5 [HelperMethodFrame_1OBJ: 05fbf780] System.Threading.WaitHandle.WaitMultiple
05fbf84c 7940347e System.Threading.WaitHandle.WaitAny
05fbf868 7a5d2712 System.Net.TimerThread.ThreadProc
05fbf8b4 793b0d1f System.Threading.ThreadHelper.ThreadStart_Context
05fbf8bc 79373ecd System.Threading.ExecutionContext.Run
05fbf8d4 793b0c68 System.Threading.ThreadHelper.ThreadStart
05fbfafc 79e7c74b [GCFrame: 05fbfafc]

This isn’t even a thread we control – it doesn’t have a single frame related to our application.  It’s running some sort of timer thread, so we’ll leave it alone for now.  We still can’t reach any definitive conclusions, but it seems like the service side is a better place to start looking for the deadlock at.

If we look at the service side, we have thread #0 doing an innocent Console.ReadLine (that’s the service hosting code):

002feed8 758002b4 [NDirectMethodFrameStandaloneCleanup: 002feed8] System.IO.__ConsoleStream.ReadFile
002feef4 7948d2bb System.IO.__ConsoleStream.ReadFileNative
002fef20 7948d1ed System.IO.__ConsoleStream.Read
002fef40 793a3350 System.IO.StreamReader.ReadBuffer
002fef50 793aaa2f System.IO.StreamReader.ReadLine
002fef64 79497b5a System.IO.TextReader+SyncTextReader.ReadLine
002fef6c 001e01f1 PictureViewer.Service.Host.Main
002ff1ac 79e7c74b [GCFrame: 002ff1ac]

And then we have thread #5 left.  After a moment of dramatic suspense, let’s see what that thread is up to (I’ve snipped most of the frames from the bottom of the output):

0552ed78 76ffa69d [NDirectMethodFrameStandalone: 0552ed78] <Module>.WaitForSingleObject(Void*, UInt32)
0552ed88 001e7882 PictureViewer.NativeHelpers.PictureUtils.BatchMovePictures
0552ed8c 001e77f2 PictureViewer.Service.PictureViewerService.BatchMovePictures
0552eda0 00c00693 DynamicClass.SyncInvokeBatchMovePictures
0552edb8 50b8d90b System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke
0552ee30 50b6d245 System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin
0552ee84 509137ad System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage5
 

This is a WCF service-side stack, invoking the BatchMovePictures method, which is exactly what we should expect.  It ends up calling <Module>.WaitForSingleObject, which is something that might be a little bit more difficult to come to good terms with.  As a general rule, whenever you see <Module> in your call stack, you can be pretty much sure that you’re dealing with C++/CLI.  And as for WaitForSingleObject, that’s just a Win32 API from kernel32.dll, so we’re transitioning to native code.  Time for the native call stack (again, I snipped most of the bottom frames):

0:005> kb
ChildEBP RetAddr  Args to Child             
0552ecbc 757e1220 000004b4 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
0552ed2c 757e1188 000004b4 ffffffff 00000000 KERNEL32!WaitForSingleObjectEx+0xbe
0552ed40 00ba05d5 000004b4 ffffffff 4ecacae1 KERNEL32!WaitForSingleObject+0x12
WARNING: Frame IP not in any known module. Following frames may be wrong.
0552edac 50b8d90b 02b8e948 0552f054 00000000 0xba05d5

That’s a pretty standard wait call stack, what we should like to understand now is why the pictures aren’t getting moved, but instead we’re waiting for something that clearly isn’t happening.

The highlighted parameter (000004b4) is a handle to an object.  That’s the synchronization object we’re waiting for, so it could be nice to see what it is.  We can always use the !handle extension:

0:005> !handle 4b4
Could not duplicate handle 4b4, error 6

Error 6?  For those of us who don’t have the entire Win32 error code list memorized, there’s the !error extension:

0:005> !error 6
Error code: (Win32) 0x6 (6) – The handle is invalid.

Now is time for a little common sense.  We know that the handle is invalid.  However, if it were invalid at the time we called WaitForSingleObject, then we wouldn’t be waiting right now – instead, WaitForSingleObject would simply have returned with a last error indicating that the handle is invalid.  Therefore, we must deduce that the handle was closed after the WaitForSingleObject call was made – someone pulled the handle from right under our feet!

In a case like this, where a handle suddenly becomes invalid, our best resort is Application Verifier.  It is a fantastic tool (that I already mentioned before) which can enable a whole suit of verifications on your application’s code, without you having to modify a single line.  So fire up Application Verifier, and add the service application with the suite of checks you’re interested in.  In this case, we’re interested in the checks related to handle usage:

image

With that configured, we launch the service, attach a debugger (to ensure Application Verifier is loaded we can issue the !avrf command), and reproduce the scenario.  This time, instead of getting stuck, we have an exception in the debugger, also known as a verifier stop:

=======================================
VERIFIER STOP 00000300 : pid 0x2B54: Invalid handle exception for current stack trace.

    C0000008 : Exception code.
    0722F4EC : Exception record. Use .exr to display it.
    0722F53C : Context record. Use .cxr to display it.
    00000000 : Not used.

=======================================
This verifier stop is continuable.
After debugging it use `go’ to continue.

=======================================

What this tells us is that Application Verifier has detected an invalid handle being passed to a system API.  Since we’re looking for invalid handles, this is an interesting one to look at.  Note that we are certainly before the point where the application gets completely stuck, because we have seen earlier that the WaitForSingleObject call doesn’t fail.  Therefore, some other code fails when trying to use the handle.  Where are we?  (Formatted for clarity.)

0:009> kb
ChildEBP RetAddr  Args to Child             
0722f1cc 71603933 63408066 0079f6f8 0079f6f0 ntdll!DbgBreakPoint
0722f3d0 6f9a3001 6f9a7ba8 00000300 c0000008 vrfcore!VerifierStopMessageEx+0x4bd
0722f3f4 6f998f5b 00000300 6f993204 c0000008 vfbasics!VfBasicsStopMessage+0xd1
0722f428 7700aa43 0722f440 0722f4ec 0722f4ec vfbasics!AVrfpVectoredExceptionHandler+0x9b
0722f450 76feb852 00000000 0079f6f0 77090180 ntdll!RtlpCallVectoredHandlers+0x57
0722f464 76feb5d1 0722f4ec 0722f53c 00000000 ntdll!RtlCallVectoredExceptionHandlers+0x15
0722f4d4 76fcee57 0722f4ec 0722f53c 0722f4ec ntdll!RtlDispatchException+0x19
0722f4d4 76ffa7ba 0722f4ec 0722f53c 0722f4ec ntdll!KiUserExceptionDispatcher+0xf
0722f804 6f99a14e 000006f8 00000002 0722f820 ntdll!NtQueryObject+0x12
0722f8b0 6f99a221 000006f8 00000000 0722f8d4 vfbasics!AVrfpCheckObjectType+0x4e
0722f8c0 6f99a4c3 000006f8 00000000 76ffa778 vfbasics!AVrfpHandleSanityChecks+0x51
0722f8d4 757e18e1 000006f8 00000000 0722f8f0 vfbasics!AVrfpNtSetEvent+0x23
0722f8e4 715e2d48 000006f8 0722f930 6f99544f KERNEL32!SetEvent+0x10
0722f8f0 6f99544f 00000000 7dbad029 00000000 PictureViewer_NativeHelpers!PictureViewer::NativeHelpers::BatchMoveWorkerThread+0x18
0722f930 758519f1 002cd258 0722f97c 7702d109 vfbasics!AVrfpStandardThreadFunction+0x6f
0722f93c 7702d109 002cd258 07224433 00000000 KERNEL32!BaseThreadInitThunk+0xe
0722f97c 00000000 6f9953e0 002cd258 00000000 ntdll!_RtlUserThreadStart+0x23

This is our thread (BatchMoveWorkerThread) trying to call the SetEvent API.  Note that there’s no managed code on this stack – using !clrstack will only reveal that this is not a managed thread.  So we’re trying to use the SetEvent API, and the handle is invalid (by the way, we already know that the handle’s intended usage scenario was as an event).  Let’s take a look at the handle with the !htrace extension, which has the ability of tracing through every handle operation if Application Verifier handle tracking is enabled:

0:009> !htrace 000006f8
————————————–
Handle = 0x000006f8 – *** BAD REFERENCE ***
Thread ID = 0x00000378, Process ID = 0x00002b54

0x76e5037a: +0x76e5037a
————————————–
Handle = 0x000006f8 – CLOSE
Thread ID = 0x00000378, Process ID = 0x00002b54

0x76e5036a: +0x76e5036a
————————————–
Handle = 0x000006f8 – OPEN
Thread ID = 0x00002a7c, Process ID = 0x00002b54

0x76e506fa: +0x76e506fa
————————————–
Handle = 0x000006f8 – CLOSE
Thread ID = 0x000029d4, Process ID = 0x00002b54

0x76e5036a: +0x76e5036a
————————————–
Handle = 0x000006f8 – OPEN
Thread ID = 0x000029d4, Process ID = 0x00002b54

0x76e5039a: +0x76e5039a

————————————–
Parsed 0xBD6 stack traces.
Dumped 0x5 stack traces.

If we trace through the list of operations on the handle, we see that it was opened by threads 29d4 and 2a7c, and closed by threads 29d4 and 378.  Finally, thread 378 tried to reference that handle (this is the moment of time when we walked in) and that was an invalid reference that caused a verifier stop.  This gives us every reason to go and look at the code for the current thread, because before using the SetEvent API it somehow caused the handle to be closed.

But wait, this isn’t the thread calling WaitForSingleObject – that thread was a managed thread.  What if the thread calling WaitForSingleObject is waiting for the event (and the wait was issued before the handle was closed), and now the event was supposed to get signaled through SetEvent but the handle was already invalid?

Looking at the managed threads with !threads, !clrstack and kb again, we find the thread waiting for that very handle, in a wait that is unlikely to ever be satisfied…

0:006> kb
ChildEBP RetAddr  Args to Child             
06d3ef38 6f99d6e5 000006f8 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
06d3ef50 757e1220 000006f8 00000000 00000000 vfbasics!AVrfpNtWaitForSingleObject+0x25
06d3efc0 757e1188 000006f8 ffffffff 00000000 KERNEL32!WaitForSingleObjectEx+0xbe
06d3efd4 6f99d3c2 000006f8 ffffffff 00000000 KERNEL32!WaitForSingleObject+0x12
06d3eff0 011305d5 000006f8 ffffffff 30d379f5 vfbasics!AVrfpWaitForSingleObject+0xc2
WARNING: Frame IP not in any known module. Following frames may be wrong.
06d3f05c 50b8d90b 040fe948 06d3f304 00000000 0x11305d5

By the way, if you think this was impressive, there are so many other options in Application Verifier yet to be explored.  I extend my strongest recommendations to use this tool!

Add comment
facebook linkedin twitter email

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*