Tales from High Memory Scenarios: Part 1
A few days ago, I visited a customer to investigate high-memory scenarios in a 64-bit ASP.NET application. I’m saying “scenarios” because we still don’t know for sure what the problem was, and because we’re pretty confident that there’s more than one underlying cause.
Even though there are no conclusive results yet, I wanted to share with you some of the things we did, because they are interesting in their own right.
What we were facing was a set of four w3wp.exe processes whose memory consumption was consistently increasing over a period of several hours, starting from approximately 300MB and climbing to over 1GB. Because this was a 64-bit system, I was not concerned with memory fragmentation, but the system’s overall physical memory was starting to run out. To guard against the dangers of paging, the company had an alert system in place that automatically notified (and restarted the processes) when physical memory utilization reached around 95%.
Armed with this knowledge, we started looking at memory dumps, which of course contained millions of apparently interesting objects, all rooted at or related to a distributed cache component the company was using. The architecture of this distributed cache allows for two tiers of caching on each machine: a service that keeps the object data in serialized form, receiving notifications and objects from other nodes in the distributed cluster, and a first-tier managed cache inside the ASP.NET application that keeps the object data in non-serialized form, after having passed it to the service in serialized form.
Looking at the cache configuration, I noticed that it specified a maximum of 100MB of cached objects, along with a timeout value to ensure that objects were not retained indefinitely. Nonetheless, the application’s memory utilization suggested that many more objects were actually stored in the cache than reported.
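As a purely hypothetical illustration (the vendor’s actual schema and element names were different), such a configuration usually boils down to two knobs:

```xml
<!-- Hypothetical configuration fragment; names are invented,
     not the vendor's real schema. -->
<cache maxSize="100MB" objectTimeout="00:20:00" />
```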
Fortunately, the first-tier cache is written in .NET, so I was able to take a look at the source code with Reflector. After inspecting the code for a while, I wondered how the cache was even able to tell the size of a managed object (in order to keep the cache from exceeding the 100MB hard limit). Apparently, each add-to-cache operation went approximately as follows:
- Serialize the object using BinaryFormatter
- Send the serialized representation to the service, including the timeout parameter
- If the current cache size plus the size of the serialized object exceeds the cache maximum, remove objects from the cache until the limit is satisfied
- Store a local reference to the object
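From the decompiled code, the add path boiled down to something like the sketch below. All type and member names are invented (we can’t quote the vendor’s source), but the two bugs it exhibits are the real ones:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reconstruction of the first-tier cache's add path.
public class FirstTierCacheSketch
{
    private readonly long _maxBytes;
    private long _reportedBytes;   // accumulated from *serialized* sizes only
    private readonly LinkedList<(string Key, object Value, long SerializedSize)> _entries =
        new LinkedList<(string Key, object Value, long SerializedSize)>();

    public FirstTierCacheSketch(long maxBytes) { _maxBytes = maxBytes; }

    public long ReportedBytes => _reportedBytes;
    public int Count => _entries.Count;

    public void Add(string key, object value, byte[] serialized)
    {
        // The serialized form is what gets sent to the out-of-process service,
        // and -- bug #2 -- it is also what drives eviction, even though the
        // live object's footprint on the managed heap can be much larger.
        while (_entries.Count > 0 && _reportedBytes + serialized.Length > _maxBytes)
        {
            _reportedBytes -= _entries.First.Value.SerializedSize;
            _entries.RemoveFirst();
        }

        // Bug #1: a strong reference is kept with no timeout, so the object
        // is retained on the ASP.NET side indefinitely.
        _entries.AddLast((key, value, (long)serialized.Length));
        _reportedBytes += serialized.Length;
    }
}
```

Note that nothing in this bookkeeping ever looks at the object’s actual heap footprint, and nothing ever expires an entry by time.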
There are two fundamental things wrong here:
- There is no notion of a timeout on the ASP.NET first-tier side – objects are retained there indefinitely
- The serialized size of an object is not necessarily the same as its real size in memory!
This second problem is a very likely cause of the unpredictable memory consumption. If the cache receives an object that occupies 4K of memory but happens to serialize to just 1K, the cache retains the full 4K but reports only 1K as being in use.
We tried several types of objects, and one category in particular was serialized to a much smaller form than its in-memory representation – Hashtables. Since the cache was primarily used to store session objects, which are hardly anything more than a Hashtable instance, we had a fairly good grip on this one.
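One way to see the asymmetry: a Hashtable’s in-memory footprint is dominated by its bucket array, which is sized by capacity, while BinaryFormatter writes out only the live key/value pairs – a sparsely filled table therefore serializes far smaller than it really is. The sketch below measures only the in-memory side (figures are rough and machine-dependent; names are mine, not from the original investigation):

```csharp
using System;
using System.Collections;

// Illustrative measurement: an empty Hashtable with a large capacity
// allocates its bucket array up front, so it occupies megabytes on the
// heap even though BinaryFormatter would serialize it to almost nothing
// (only the key/value pairs are written, not the buckets).
public static class HashtableFootprint
{
    public static long MeasureEmptyTable(int capacity)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        var table = new Hashtable(capacity);   // bucket array allocated here
        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(table);
        return after - before;                 // rough in-memory footprint
    }
}
```

On a 64-bit CLR, a 100,000-capacity table comes to several megabytes of heap, while the BinaryFormatter output of the same empty table is on the order of a few hundred bytes.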
To verify, I wanted to change the cache configuration to store at most 1K of objects and see whether memory usage would go down. However, the leak took several hours to accumulate, and we wanted results right away. So I attached WinDbg to the live process, used !dumpheap and !do to find the cache configuration object (which, fortunately, was queried on every add-to-cache operation), and used the ed (edit memory) command to write the 1K value over the 100MB that was there.
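A rough sketch of what such a session looks like – the type name, addresses, and field offset below are placeholders, not the real ones from the customer’s process:

```
0:000> .loadby sos mscorwks
0:000> !dumpheap -type CacheConfiguration
$$ (hypothetical type name; take the object address from the output)
0:000> !do 00000000`12345678
$$ note the offset of the max-size field in the !do field listing
0:000> ed 00000000`12345678+0x20 0x400
$$ overwrite the 100MB limit (0x6400000) with 1K (0x400)
```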
After a few minutes, nothing had happened yet – memory was still high and wouldn’t go down. After inspecting the GC performance counters, I realized that no Gen 2 garbage collections were occurring, so there was no opportunity for the memory to be reclaimed. Because this was a 64-bit system with lots of memory still available, the GC didn’t feel that a collection was necessary – and rightfully so.
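Out of process, this comes from the “.NET CLR Memory” performance counters (“# Gen 2 Collections” in particular); from inside a process, the same sanity check is available via GC.CollectionCount:

```csharp
using System;

// In-process equivalent of the ".NET CLR Memory \ # Gen N Collections"
// performance counters: how many collections each generation has seen.
public static class GcStats
{
    public static string Snapshot() =>
        $"gen0={GC.CollectionCount(0)} gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)}";
}
```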
So now we were left with a question: how do you trigger a garbage collection when the GC sees no need for one? After exploring interesting ideas like injecting code into the process (e.g. with CreateRemoteThread) to trigger a GC, I figured we might as well consume all available physical memory. The CLR is known to listen for the low physical memory resource notification, and it triggers a GC when that notification is signaled.
I wrote a simple console app that allocated lots and lots of physical memory, let it run for a few seconds – and voila, a garbage collection occurred in our process. Most of the memory was reclaimed, confirming that the first-tier cache was the most likely culprit.
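The core of such a “memory hog” looks roughly like this (reconstructed from memory – the original wasn’t preserved; chunk and page sizes are assumptions). Merely allocating isn’t enough: each page has to be touched so it actually lands in the working set, which is what drives physical memory toward the low-memory threshold:

```csharp
using System;
using System.Collections.Generic;

// Sketch of a "memory hog" console app: allocate chunks and touch every
// page so the memory is committed to the working set rather than merely
// reserved, pushing the system toward its low physical memory notification.
public static class MemoryHog
{
    public const int ChunkSize = 64 * 1024 * 1024;   // 64 MB per allocation
    public const int PageSize = 4096;                // assumed x64 page size

    public static List<byte[]> AllocateAndTouch(int chunks)
    {
        var hoard = new List<byte[]>();
        for (int i = 0; i < chunks; i++)
        {
            var chunk = new byte[ChunkSize];
            for (int offset = 0; offset < chunk.Length; offset += PageSize)
                chunk[offset] = 1;                   // fault in every page
            hoard.Add(chunk);                        // keep it all reachable
        }
        return hoard;
    }
}
```

In the real run, the app kept allocating until free physical memory neared zero, held on to the hoard for a few seconds, and then exited to give the memory back.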