July 2011 - Posts
Two years ago SELA had almost 20 experts attending Microsoft PDC 2009. And what a conference it was!
I just know this year’s BUILD/Windows is going to be amazing (like the website says, “Windows 8 changes everything”).
With the tidbits of rumors about Windows 8, HTML 5, Visual Studio vNext, Windows Phone, and everything else around the Microsoft stack – we’re going to come back home with enough stuff to learn and work on to fill 2012 to the brim.
At the time of writing, this is the SELA Olympic Team to the BUILD conference:
If you’re attending the conference and would like to meet, or will be in the Anaheim, CA area in mid-September, let me know. See you at BUILD!
Suppose you want a more detailed drill-down into your application's GC heap usage. For example, you want to see if there's fragmentation going on, or if there are lots of large objects, or if the XMLDocument objects you allocated a while ago are finally gone. This is something you can do with the CLR Profiler, another free Microsoft tool that supports memory allocation profiling as well as visualizing the managed heap at the individual object level.
Unfortunately, running the application under the CLR Profiler is very expensive. What do you need to do to obtain a memory view without running the application with the profiler attached?
- Open a dump file of the relevant application, or attach WinDbg to it.
- Load SOS and run the !traverseheap command to store a heap dump to a file (explained here in more detail; a sample session follows this list). The larger the heap, the longer this takes—it may take hours for really large processes.
- Run CLR Profiler, select File | Open Log File and navigate to your heap dump.
- Click the "Objects by Address" button to get a pretty view of the heap, with colors indicating various object types. If you zoom in (by adjusting the "Bytes/Pixel" and "Width/Address range" scales) you can see individual objects.
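For example, assuming a CLR 4 process and an arbitrary output path, step 2 of this recipe would look roughly like this (for a CLR 2 process, load SOS with .loadby sos mscorwks instead):

0:000> .loadby sos clr
0:000> !traverseheap C:\temp\myapp.heapdump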
Note: CLR Profiler has another useful visualization feature (that I already blogged about), "Heap Graph", which is often relevant for debugging memory leaks.
To summarize, you have lots of options when it comes to understanding your .NET application's memory usage. Even large applications with a variety of allocation sources can be dissected and analyzed, and every bit of memory can be accounted for. In the managed heaps, tools like CLR Profiler give you an object-level granular view of heap contents.
Corrupted stacks are no fun at all – when you get a crash dump or a live exception in an application, pretty much the first thing you do is take a look at the call stack. When the stack itself is corrupted, your primary investigation tool is taken away.
Still, it is sometimes possible to reconstruct the stack even in the face of corruption. I’ve been showing how in the .NET Debugging and C++ Debugging courses, but by popular demand I’ll show one example here as well.
You can follow along on your own with the dump file, symbol file, and sources from here.
Here we go – opening the dump file in WinDbg (32-bit) produces the following output:
User Mini Dump File: Only registers, stack and portions of memory are available
. . .
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(1ed0.870): Access violation - code c0000005 (first/second chance not available)
eax=00000000 ebx=00000001 ecx=73536122 edx=00000000 esi=002af37c edi=0000004e
eip=00000000 esp=002af1a8 ebp=00000000 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
00000000 ?? ???
WARNING: Frame IP not in any known module. Following frames may be wrong.
002af1a4 00000000 0x0
This is already bad news – the current instruction is at address 0x00000000, which means the instruction pointer (EIP) has been corrupted. You can also see that EBP has been corrupted – its value is 0x00000000 as well, which is why the k command has nothing meaningful to report.
Fortunately, ESP seems to have a valid value – well, we can’t really tell if it’s valid or not from looking at it, but we can try reading the memory it points to. If we manage to read the memory, it is almost 100% certain that ESP still points to the stack – because this is a mini dump that contains (almost) only stack memory.
If ESP indeed points to the stack, we can try looking at the stack manually and try to find something that looks like a return address. Immediately before the return address we should find a saved EBP value – unless the frame uses FPO, which I plan to discuss in a future post. This EBP value will provide the foundation for walking the stack further back – EBPs are chained in the sense that EBP always points to the previous saved EBP on the stack, which points to the even earlier saved EBP on the stack, and so on. (Refresh your memory on how an x86 stack is laid out.)
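As a quick refresher, this is the classic x86 stack frame layout for a frame compiled without FPO (higher addresses are toward the top):

[EBP+8]   function arguments
[EBP+4]   return address
[EBP+0]   saved EBP of the caller   <-- EBP points here
[EBP-4]   local variables

Following the saved EBP at [EBP+0] takes you one frame up the stack, where the same layout repeats.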
Here’s the raw stack contents from ESP (this would be a good time to set up the symbol path to include the folder which contains BatteryMeter.pdb):
0:000> dds ESP
...
002af1b8 002af0fc
002af1bc 742fd594 uxtheme!StreamInit+0x36
...
002af1d8 002af210
002af1dc 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
002af1e4 77dbc290 mfc100u!AfxDlgProc [...\dlgcore.cpp @ 22]
...
002af214 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
First of all, it’s nice to see that ESP points into a memory area that is included in the dump – which means we are looking at the stack. There are several things here that might be return addresses – and the values immediately preceding them (at 002af1b8 and 002af1d8 above) are saved-EBP candidates. To weed out false candidates, we can peek at the memory location each one points to – if it points into the stack, the candidate is viable.
0:000> dd 002af0fc L1
002af0fc ????????
0:000> dd 002af210 L1
002af210 002af248
The first attempt failed, but the second attempt succeeded – we might have a saved EBP on our hands. We can now proceed with manual reconstruction – the saved EBP points to another EBP, and immediately following it we should find another return address. Repeat several times to see if it makes sense:
0:000> dds 002af210 L2
002af210 002af248
002af214 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
0:000> dds 002af248 L2
002af248 002af280
002af24c 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
0:000> dds 002af280 L2
002af280 002af2b8
002af284 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
0:000> dds 002af2b8 L2
002af2b8 002af2f0
002af2bc 013719be BatteryMeter!RecurseDeep+0x4e [...\batterymeterdlg.cpp @ 135]
0:000> dds 002af2f0 L2
002af2f0 002af304
002af2f4 013719f7 BatteryMeter!CBatteryMeterDlg::OnCPUSelectorChanged+0x27 [...\batterymeterdlg.cpp @ 142]
0:000> dds 002af304 L2
002af304 002af318
002af308 77d92c8c mfc100u!_AfxDispatchCmdMsg+0x58 [...\cmdtarg.cpp @ 112]
We could keep doing this for a while – reconstructing the stack (as long as we don’t run into an FPO frame) until we hit the bottom. So far we have the RecurseDeep function calling itself at least four times before we hit the stack corruption.
There is also a WinDbg command that can perform this reconstruction for us – we only need to give it a guess for EBP, ESP, and EIP – and it constructs a plausible call stack. Our EBP guess can be the first saved EBP we found on the stack, our EIP guess can be the return address immediately following it, and our ESP guess can be the same as EBP, producing the following output:
0:000> k = 002af210 002af210 013719be
ChildEBP RetAddr
002af210 013719be BatteryMeter!RecurseDeep+0x4e
002af248 013719be BatteryMeter!RecurseDeep+0x4e
002af280 013719be BatteryMeter!RecurseDeep+0x4e
002af2b8 013719be BatteryMeter!RecurseDeep+0x4e
002af2f0 013719f7 BatteryMeter!RecurseDeep+0x4e
002af304 77d92c8c BatteryMeter!CBatteryMeterDlg::OnCPUSelectorChanged+0x27
002af318 77d92e51 mfc100u!_AfxDispatchCmdMsg+0x58
002af334 77dc6d36 mfc100u!CCmdTarget::OnCmdMsg+0x124
002af388 77e1bc7f mfc100u!CWnd::OnNotify+0x7b
... source information and the rest of the stack snipped for brevity
We have turned an impossible problem with very little information into a pretty decent call stack which gives us the likely culprit for the stack corruption. Inspecting the sources for BatteryMeter!RecurseDeep drives the point home – the function corrupts the stack, but does so in a sneaky fashion – instead of corrupting its own frame, it goes back several frames earlier on the stack and overwrites a small memory region with zeroes.
Synopsis: CLR Stack Explorer obtains reliable call stacks of managed processes, supports any combination of 32-bit/64-bit and CLR2/CLR4.
UPDATE [2011/7/20]: If you downloaded CLR Stack Explorer from the above link and are using a recent Windows version, you need to Unblock all the .exe files (right-click, Properties, Unblock) for the tool to run correctly. I have just updated the download location with a self-extracting executable which should solve this problem.
I’m happy to announce that CLR Stack Explorer, a tool I’ve been working on during the last few days, is now ready for preview. Frankly, I have Managed Stack Explorer to thank for this little project – its lack of support for CLR 4 has encouraged me to embark on this journey.
You can find the bits here – but please remember this is a very early preview with probably lots of bugs. Your process may crash as a result of using this tool to view its call stacks.
This is the basic experience when using CLR Stack Explorer:
The tool consists of several components:
- C++/CLI assembly which contains the necessary interactions with the unmanaged CLR Debugging APIs. This assembly exports an unmanaged class, too – so it can be used without the rest of the code.
- C# WinForms GUI that displays a list of relevant managed processes and obtains call stacks from…
- …a C# WCF service that listens over a named pipe for requests to obtain process call stacks. The reason for this service is that the CLR Debugging APIs don’t support cross-bitness debugging, i.e. a 32-bit debugger process can’t obtain call stacks of 64-bit processes and vice versa. Instead of having two separate GUI versions for 32-bit and 64-bit, the GUI talks to two separate services (32-bit and 64-bit) which provide the information. (A rough sketch of this arrangement follows the list.)
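If you’re curious how the pieces fit together, here is a rough C# sketch of what such a named-pipe service could look like — the contract, type names, and pipe address are my illustrative guesses, not the tool’s actual code:

using System;
using System.ServiceModel;

[ServiceContract]
public interface IStackWalkService
{
    [OperationContract]
    string[] GetCallStacks(int processId);   // one formatted call stack per thread
}

public class StackWalkService : IStackWalkService
{
    public string[] GetCallStacks(int processId)
    {
        // ...delegate to the C++/CLI wrapper around the CLR Debugging API...
        throw new NotImplementedException();
    }
}

class ServiceProgram
{
    static void Main()
    {
        // One copy of this host is built as 32-bit and another as 64-bit;
        // the GUI opens a channel to the pipe matching the target's bitness.
        using (var host = new ServiceHost(typeof(StackWalkService),
            new Uri("net.pipe://localhost/ClrStackExplorer")))
        {
            host.AddServiceEndpoint(typeof(IStackWalkService),
                new NetNamedPipeBinding(), "x64");
            host.Open();
            Console.ReadLine();   // keep the service alive
        }
    }
}

Named pipes keep the cross-process hop cheap, and NetNamedPipeBinding is local-machine only, which is exactly what this scenario needs.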
While working on it, I found that it would be pretty easy to add source browsing support, so if you have sources at the right location you can double-click a stack frame and go to the line of code:
Also, CLR 4 features a nice addition to the Debugging API which lets a debugger easily see which thread is holding a Monitor (lock) and which threads are waiting for a Monitor using Monitor.Wait. This information is made available through the GUI when right-clicking a thread:
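For example, here is a toy C# program (purely illustrative) that sets up exactly the situation this view untangles — the main thread owns a Monitor on which a worker thread is blocked:

using System;
using System.Threading;

class MonitorDemo
{
    static readonly object Gate = new object();

    static void Main()
    {
        Monitor.Enter(Gate);              // the main thread now owns the lock
        var worker = new Thread(() =>
        {
            lock (Gate) { }               // the worker blocks here
        });
        worker.Start();
        Console.ReadLine();               // inspect with CLR Stack Explorer at this point
        Monitor.Exit(Gate);
        worker.Join();
    }
}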
If you find this tool useful, please let me know. It’s been a pleasure writing it so far, and I’m planning to go on – some of the things I have on my TODO list:
- Use GetActiveInternalFrames to retrieve internal frames and display them on the stack (e.g. managed-unmanaged transition)
- Consider moving to ICorDebugStackWalk
- Consider support for unmanaged frames
- Perform automatic deadlock detection
- Cache module metadata and symbol information
- Symbol path support (currently it's just the current directory)
- Source path support (e.g. prompt the user for source location)
How can you map the memory usage of your .NET application? We'll start with VMMap, a free Sysinternals tool that visualizes your process' virtual address space. Below is VMMap's output for an example process:
The type statistics give you a detailed overview of how memory usage is distributed – there are 240MB of DLLs, 50MB of managed heaps (of which only 10MB are committed), etc. In the bottom details view you can see each individual address range on the heap, including its type, size, committed size, and other details (such as DLL names for the "Image" type and file names for the "Mapped File" type).
Within the GC heap, VMMap features the ability to identify the various generations (and the Large Object Heap) within the GC segments:
In the VMMap output above, there are 1,916KB of "Unusable" memory—what's wrong with it? Even though the VMMap documentation doesn't say so clearly, "Unusable" memory regions are address ranges that cannot be allocated because of VirtualAlloc's 64KB start address granularity guarantee (see the history of this decision).
Finally, for 32-bit processes, VMMap features a “Fragmentation View” in which you can see a visual representation of your memory space and zoom in on fragmented regions:
Note that virtual memory fragmentation is a real PITA for 32-bit managed processes. We’re not talking about internal fragmentation due to pinning – we’re talking about external fragmentation between GC segments; Maoni’s blog post What’s New in CLR 2.0 GC explains both phenomena.
How does this happen? The GC allocates virtual memory in large chunks, called segments, which are at least 16MB in size. As time elapses, segments are allocated and freed, and smaller chunks of memory fragment the virtual address space. At the extreme, you may get an out-of-memory exception when there is still lots of virtual memory available – the problem is that there isn’t enough contiguous memory to satisfy the GC’s allocation request for another segment. (VM hoarding is one alleviation option; another way of diagnosing these problems, even from a dump file, is !address -summary or !AnalyzeOOM.)
Another tool that gives a visual view of virtual memory is an experimental solution by Microsoft's John Allen, posted on Tess Ferrandez's blog:
In the third part we'll see how to further zoom-in on the managed heaps and see what's going on inside.
I hope y’all are using Sysinternals Process Explorer on a daily basis as your Task Manager replacement. It’s a really awesome tool with lots of functionality; among my favorites are:
- Seeing all the handles and DLLs opened by the process in the bottom pane
- Monitoring important .NET performance counters through the .NET Performance tab
- Viewing a list of the process’ threads and their respective call stacks
This last feature, however, has a minor drawback: it doesn’t display managed call stacks properly. The reason is the same as with many other debugging tools – managed symbol resolution requires different APIs than native symbol resolution (through Dbghelp.dll).
Enter Managed Stack Explorer, an open source tool that displays managed call stacks for arbitrary processes. It even has some tracing and logging support.
Unfortunately, the tool currently supports only CLR 2.0 processes, and looks somewhat abandoned (last release from 2006). I can’t see any recent updates to the “official” managed debugger API wrappers, so it’s not trivial to see how Managed Stack Explorer can be revived. I’m going to keep an eye on it.
To evaluate whether your application can scale with larger data sets and more concurrent users, you have to understand how it uses the memory available to it. Modern .NET applications (and especially mixed managed and native applications) have thousands of memory chunks across their address space—keeping track of these chunks in your head is an impossible endeavor.
First, a quick reminder about how virtual memory, managed heap allocations, and native heap allocations relate to each other (for more details, consider reading the Memory Manager chapter in Windows Internals):
- A Windows process has a virtual address space, which is a range of addresses your application can use. On 32-bit Windows, the address space available to your code is usually 2GB (roughly, the addresses 0x00000000 – 0x7FFFFFFF); on 64-bit systems, the address space available to your 64-bit process is 8TB.
- There is no requirement that enough physical memory (RAM) be available to support the virtual memory requirements of all applications. For example, you can install 2GB of physical memory on your PC and still run dozens of processes, each with a 2GB address space. Windows maps virtual addresses to physical addresses as demands dictate, and transfers data from physical memory to the disk (page file) to free physical memory space.
- Virtual memory can be allocated in two steps, using the VirtualAlloc Win32 API.
- You can tell Windows to reserve a chunk of virtual memory addresses—this reservation is very cheap, and doesn't incur any physical memory cost. Windows guarantees that no other component in your process will use the same address range.
- After reserving a chunk of memory, you can commit it—this is when you can start reading and writing to the allocated address range. (A short code sketch follows this list.)
- The VirtualAlloc API is not sufficiently granular for high-level language use (64KB start address granularity, 4KB allocation granularity), and incurs a system call for each allocation request.
- Most applications don't allocate virtual memory directly.
- Managed applications use the .NET allocator (new operator), which reserves from Windows large chunks of memory called GC segments and manages allocations within these segments, committing parts of them as necessary.
- Native applications use the Windows Heap Manager (HeapAlloc) or the CRT allocator (malloc, C++ new operator), which operates similarly to the .NET allocator.
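As promised, here is a minimal C# sketch of the two-step dance, P/Invoking VirtualAlloc directly (the sizes are arbitrary and error handling is omitted):

using System;
using System.Runtime.InteropServices;

class ReserveCommitDemo
{
    const uint MEM_RESERVE = 0x2000, MEM_COMMIT = 0x1000, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
        uint flAllocationType, uint flProtect);

    static void Main()
    {
        // Step 1: reserve 1MB of address space -- cheap, no physical memory yet
        IntPtr region = VirtualAlloc(IntPtr.Zero, (UIntPtr)(1024u * 1024),
            MEM_RESERVE, PAGE_READWRITE);

        // Step 2: commit the first 64KB -- only now can the range be read and written
        VirtualAlloc(region, (UIntPtr)(64u * 1024), MEM_COMMIT, PAGE_READWRITE);

        Marshal.WriteInt32(region, 42);   // fine; touching an uncommitted page would AV
    }
}

The reservation consumes only address space; physical memory (and page-file charge) is taken only when pages are committed.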
What kinds of things can you expect to see in your process' memory space?
- Loaded DLLs – code your application uses has to be mapped to your process' virtual memory. In large applications, a significant portion of the address space can be taken by DLLs.
- Native heaps – allocations performed by HeapAlloc, malloc, or the C++ new operator.
- Managed heaps – allocations performed by the .NET allocator.
- JIT heaps – areas in which just-in-time compiled code is stored (recall that .NET assemblies may contain platform-independent IL, which is compiled at runtime to executable machine code).
- Mapped files – files that your application maps to virtual memory for easier access without using stream-based APIs.
- Thread stacks – each Windows thread starts with a 1MB default stack reservation, of which 4KB (a single page) is initially committed.
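Incidentally, if address space is tight, that per-thread reservation is a knob you can turn — the Thread constructor accepts a maxStackSize parameter (the 256KB figure below is just an example):

using System.Threading;

class SmallStackDemo
{
    static void Main()
    {
        // Reserve only 256KB of address space for this thread's stack
        // instead of the 1MB default
        var worker = new Thread(() => { /* ... */ }, 256 * 1024);
        worker.Start();
        worker.Join();
    }
}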
In the second part we'll start mapping the memory usage of .NET applications and visualizing the various allocation types.
I don't typically rant about security or "The Cloud", but as an avid Dropbox and Instapaper user I've had some comments building up inside for the past few weeks.
Dropbox is a simple private file sharing service which gives you access to your files from a variety of devices (I use it on my Windows laptop, Windows desktop, MacBook Air, iPhone, and iPad). Instapaper is a tool for saving web pages for later viewing – when I don't have time to read a long blog post or interesting article, I click a bookmark in my browser and the text gets saved to my Instapaper archive (I use it on all my PCs, iPhone, iPad, and Kindle).
Recently both services have hit the headlines with unfortunate security-related stories. A brief recap of what I'm referring to:
- Dropbox rolled out an update that enabled you to log in without the correct password. This update was live for over four hours until it was detected and fixed. (The obvious question of "how on earth does this happen" is left as an exercise for the reader.)
- Instapaper's database server was captured by the FBI in a raid on Instapaper's Web hosting provider. It was later discovered that the FBI had not targeted that specific server, and had not captured its hard disks, which were stored in a separate enclosure. The server was subsequently returned.
These two seemingly-unrelated stories finally made me understand that I trust service providers with my data, having not much more than anecdotal information about how the data is stored, how it is secured, and what happens to it along the way. In fact, I have no idea where in the world my Dropbox files and Instapaper bookmarks are stored, how employee access to them is regulated, which governments can capture them given a court order, and what backups are in place in case the whole datacenter goes up in flames.
Am I supposed to perform this investigation every time I entrust my data to a service provider? What do you do?
During the last few months, SELA’s IT group has been evaluating new PC hardware for our classrooms. If you’ve ever visited our headquarters in Ramat-Gan, you know that we have nearly 20 classrooms of various sizes equipped with 10-25 PCs. Replacing them all at once is a rather expensive endeavor.
Before this replacement, our classroom PCs enjoyed a mixed variety of hardware, including:
- High-end Intel Core i5 workstations with 4GB RAM
- Somewhat outdated Intel Core 2 Duo workstations
- Somewhat more outdated Intel Core workstations, and even an occasional Pentium IV…
After the first evaluation phase (which was somewhat marred by the Intel Sandy Bridge motherboard recall of early 2011), we have ordered 100 new PCs to replace some of the existing inventory. The specification is as follows:
- Intel Core i7-2600 CPU with 8MB cache @ 3.4 GHz
- 8GB Kingston DDR3/1333 RAM
- Gigabyte GA-H67A-USB3-B3 motherboard (featuring the obvious Gigabit Ethernet, HDMI, SATA 6Gb/s connectors, and USB 3.0)
- Western Digital Caviar Blue 500GB hard drive (16MB cache, 7200rpm)
We installed Windows 7 Enterprise (64-bit) on these workstations, and they are blazing fast. I’ve already had the pleasure of teaching two courses (Windows Internals @ the Dev Days and .NET Debugging) on these PCs, and it’s a real joy.
To get a general idea of what kind of workloads we are running on this hardware, you need to consider a sampling of the courses we deliver in these classrooms:
- Windows Internals – requiring a virtual machine running Windows XP or Windows Server 2003 and a host with a kernel debugger
- Parallel Programming – requiring at least 4 physical cores to demonstrate the benefits of parallelism
- .NET Performance – requiring lots of physical memory and CPU horsepower to sustain micro-benchmarks and to run the Visual Studio Concurrency Profiler and other heavy-duty profiling tools
- …and of course the entire suite of Computer Graphics courses, in which the photo- and video-editing tools require CPU horsepower, lots of physical memory, and a powerful GPU
You don’t buy 100 brand-new PCs every day: this is a significant milestone and a huge improvement to the training experience we’re providing. I hope you enjoy these new classrooms at your next visit to SELA!
The first remotely useful thing we are going to do with our newly acquired knowledge about device driver development is to register a callback for whenever a process is created, and output information about the parent and child processes. (Frankly, this can be accomplished just as easily using the WMI Win32_ProcessStartTrace event class, but bear with me here.)
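In case you’re wondering, here is a minimal C# sketch of that user-mode WMI alternative, using System.Management (run it elevated):

using System;
using System.Management;

class ProcessWatcher
{
    static void Main()
    {
        // Subscribe to process-creation events from user mode
        var watcher = new ManagementEventWatcher(
            new WqlEventQuery("SELECT * FROM Win32_ProcessStartTrace"));
        watcher.EventArrived += (sender, e) =>
            Console.WriteLine("Process {0} created process {1}",
                e.NewEvent["ParentProcessID"], e.NewEvent["ProcessID"]);
        watcher.Start();
        Console.ReadLine();   // keep watching until ENTER is pressed
        watcher.Stop();
    }
}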
The PsSetCreateProcessNotifyRoutine function is a service provided by the process manager in the executive, which allows us to register a callback for when processes are created. This can be useful in the context of a security product, auditing software, or malicious code. (Note that this is a prime example of an API that is not accessible from user mode.)
So here’s what we’re going to do—we will add two new IOCTLs (IOCTL_HOOK and IOCTL_UNHOOK) to our driver’s interface. The “hook” IOCTL will register the callback and the “unhook” IOCTL will unregister it so that the driver can be unloaded safely. Consider the unfortunate case when a driver is unloaded but the process manager attempts to call a function in it!
Without further ado, here’s the code for the new IOCTLs and the DriverProcessNotifyRoutine:
#define IOCTL_HOOK (ULONG) CTL_CODE( FILE_DEVICE_HELLOWORLD, 0x01, METHOD_BUFFERED, FILE_ANY_ACCESS )
#define IOCTL_UNHOOK (ULONG) CTL_CODE( FILE_DEVICE_HELLOWORLD, 0x02, METHOD_BUFFERED, FILE_ANY_ACCESS )

VOID DriverProcessNotifyRoutine(IN HANDLE ParentId, IN HANDLE ProcessId, IN BOOLEAN Create)
{
    if (Create)
        DbgPrint("Process %d created process %d\n", (ULONG)(ULONG_PTR)ParentId, (ULONG)(ULONG_PTR)ProcessId);
    else
        DbgPrint("Process %d has ended\n", (ULONG)(ULONG_PTR)ProcessId);
}

/* in DriverDispatch (status is the IRP completion status) */
case IOCTL_HOOK:   /* register the callback */
    status = PsSetCreateProcessNotifyRoutine(DriverProcessNotifyRoutine, FALSE);
    break;
case IOCTL_UNHOOK: /* unregister it so the driver can be unloaded safely */
    status = PsSetCreateProcessNotifyRoutine(DriverProcessNotifyRoutine, TRUE);
    break;
As always, compile, deploy and register the driver on the target system, and then create and kill a few processes while watching the DebugView output.
In the next episode: using Direct Kernel Object Manipulation (DKOM) to hide a process from tools like Task Manager.