Next Generation Production Debugging: Demo 2 and Demo 3
After seeing what can be done without a debugger, it was time to dive in and start experimenting with actual production debugging techniques. I briefly explained what debugging symbols are (and how you configure your debugger to download symbols for Microsoft product automatically - just set the _NT_SYMBOL_PATH environment variable to srv*C:\Symbols*http://msdl.microsoft.com/download/symbols), and continued to demonstrate how a dump file can be generated. The one thing many people don't know yet is that on any Windows version you can generate a dump with the tools out of the box, without resorting to any external debugging package.
On NT 5.2 and below (Windows Server 2003, Windows XP, Windows 2000 and so on) we have the built-in NT debugger, ntsd. If you're really addicted to it you can get it to work on Windows Vista and Windows Server 2008. You can use the following command line to capture a dump using ntsd:
ntsd -p <ProcessId> -g -c ".dump MyDump.dmp"
On NT 6.0 and above (Windows Vista and Windows Server 2008), the built-in Task Manager can capture a dump for you. All you need to do is right-click the process and choose "Create Dump File", like in the following screenshot:
After you've captured the dump, it can be analyzed further on the production machine (if you have the tools) or on a development machine which is usually better equipped for the task. Again, there aren't many people who have used Visual Studio to open a dump file, which is a pity, because if you have symbols and sources all lined up, you can get a fantastic experience. Just fire up Visual Studio, choose File -> Open -> Project or Solution, load the dump and hit F5. What you have now is the exact state of the application when you have captured the dump, as if you are now debugging it:
After having demonstrated that, I proceed to show a more real-life scenario. In that scenario, the client application issues "Batch Delete" requests for pictures, and the application's memory usage seems to go up with each such request. This can be established by using Performance Monitor to monitor the native and managed memory usage for the client process. Subsequently, I fired up WinDbg (from the Debugging Tools for Windows package) and attached to the process (which could effectively be done by opening a dump file, as well). This is done by choosing File -> Attach to Process, or using the F6 shortcut key.
After attaching, you get a bunch of debugger output (a.k.a. debugger spew), and a command line at the bottom which you can use to input your commands. WinDbg also has some floating windows (such as call stack, processes and threads, memory etc.) but for our scenario they will prove less useful.
The next thing I did was load the SOS debugging extension, which is the primary debugging extension for .NET applications. It ships with the .NET framework, so you can expect to find it on any production machine, and of course on your development machine if you have .NET installed. Since we need to load the right version of SOS depending on the .NET version we have in the process, I issued the .loadby command:
.loadby sos mscorwks
Nothing is printed to the console, which means the debugger is happy and the extension is loaded. This can be verified by typing !help and seeing a list of commands offered by the SOS extension.
The first command I used was !dumpheap, which proceeds to give you a summary listing of all objects on your GC heap. This is an extremely useful command because to track down a memory leak you first need to know which objects are occupying the memory that is not being released.
The debugger output in this case was significantly more elaborate. It is sorted by total object size in ascending order:
The first column here is the method table address, the second column is the number of objects we have from that type, the third column is the total amount of memory these objects occupy, and the final column is the human-readable type name. So what we know now is that we have 660 arrays of bytes occupying 8.7MB of memory. If we try the "Batch Delete" command again, we will see even more such arrays.
How do we look at a list of specific array objects? Well, issue the !dumpheap command again, with the method table address for the System.Byte entry (you could also use -type System.Byte but it's less accurate because it does string matching against the type name):
!dumpheap -mt 7912dae8
This produces the following output, sorted by object size in ascending order again:
So, there are some objects that particularly stand out. I have performed a "Batch Delete" operation on four pictures, and there are four large arrays of bytes that are for some reason not garbage collected.
Who is keeping these arrays from being garbage collected? To determine that, we need to take the addresses of these individual arrays and issue the !gcroot command (we can also write a loop in the debugger to automate the work for us):
The output for any one of these four objects will look very similar to the following:
What we see here is that the stack pointer (ESP) is holding an Application.ThreadContext object, which holds a reference to the PictureViewerForm (which is the primary form for our application). The form holds a reference to an EventHandler, which holds a reference to an array of objects, which holds a reference to an EventHandler, which holds a reference to a PictureViewerForm.NotifyOnDeleteSubscriber object, which holds the array of bytes we were looking for in the first place.
This surely looks like an event handler wired up for an event and not getting removed from the event when appropriate, but we can even know which event it is. Let's remember the address of the first EventHandler in this list, in this case it's 02eabf44. Additionally, let's issue a !do (dump object) command on the form's address:
The result is:
If you examine this listing, you'll see the address I asked you to remember as the address of the instance field called NotifyOnDelete. This is certainly an event handler. (For many more walkthroughs on SOS, I suggest you stop whatever you're doing right now and go read the fantastic blog at http://blogs.msdn.com/tess by an Escalation Engineer working for Microsoft Sweden and doing some fantastic work utilizing and evangelizing WinDbg and SOS.)
To automate the process and make it significantly more visual, you can use another obscure extension command called !traverseheap. This command generates a log file (that can be binary or XML formatted), which you can then feed to CLR Profiler or your own XML reader which analyzes and visualizes the output.
When you launch CLR Profiler and tell it to open the log file, you get a gray dialog with a "Show Heap Graph" button on the bottom right:
After clicking that, you get a nice visual representation of what's going on in your application. Here's what it looks like:
If you continue following the graph along the thickest line, you'll eventually find the same objects we were looking at earlier, to get a visual confirmation of where our memory leak stems from:
If you are still uncertain, or want to try a different approach, there's a very cool tool called Hawkeye. For WinForms applications, it provides a runtime object explorer and editor, allowing you to view and edit object properties, invoke methods, view event registrations and lots of other interesting things. So if we just launch Hawkeye, we get the following dialog, with a large button waiting to be dragged to a form or control within a form:
If we drag it to our application's main form, we get the following window, displaying what's going on in our form (the magic works by injecting a DLL into our process):
If we click on the events tab, we see the list of events in our client application, and have the ability of viewing the registered subscribers for the event:
If you right-click one of the subscribers, you can analyze the tree further by viewing all properties to see where the byte array actually joins in:
As I mention SOS, it's also important to note another community tool called SOSAssist, which has the potential of providing a visual user interface on top of SOS commands. The current state of the tool is rather rudimentary, but with enough community support it can be picked and extended to a state of the art Production Debugging Studio like we all deserve.
A quick note regarding SOSAssist on 64-bit systems: if you try attaching to a process on a 64-bit system, SOSAssist will crash regardless of whether the target process is 32-bit or 64-bit (the latter is currently not supported one way or another). The reason for that is that the SOSAssist main executable is compiled as IL only without a bitness preference, so on a 64-bit system it will be loaded as a 64-bit process. However, some of the controls it uses are within 32-bit assemblies which will then fail to load. This can be easily mitigated by using the corflags tool:
corflags SOSAssist.exe /32bit+
... and voila, SOSAssist works on a 64-bit system! (This is a case of production debugging a production debugging tool, which really demonstrates that even production debugging is not without its sense of irony.)