July 2009 - Posts
About a week ago, Dima and I were consulting for two different teams at the same customer site, without even knowing about each other’s presence. Around the middle of the day he gave me a call, and we discovered that we were only a couple of stories apart.
What Dima’s customer was experiencing was a non-deterministic delay in a managed application that processed buffers received from one physical device before sending them to another. While it would typically take less than 0.5ms to handle a single buffer, some buffers would take as long as 500ms to process. At arrival rates approaching 10GBps this was not an acceptable delay, as the underlying buffers would overflow while waiting for the application to drain them.
Seeing that there was no significant work done by the application other than the memory allocation to store the buffer, we immediately suspected the garbage collector. However, after several enquiries and measurements it became evident that most collections were gen 0 collections, which are supposed to be extremely cheap. After doing some more work, however, Dima concluded that it was the gen 0 garbage collections that were sometimes taking nearly a second to complete.
The one clue that we did get was that the delay times were always in multiples of 250ms. Most delays were 250ms, some 500ms, a chosen few 750ms and a very rare one would be 1000ms or 1250ms. Following the “there are no coincidences” rule, I wanted to understand whether there was a component in the system (or possibly in the garbage collector) that was performing a 250ms wait, possibly multiple times in a row before giving up.
To this end, we fired up WinDbg and set a breakpoint on ntdll!NtDelayExecution (which is called by the various flavors of Sleep). Not a single breakpoint was hit. Moving on, we set breakpoints on kernel32!WaitForSingleObject, WaitForSingleObjectEx, WaitForMultipleObjects and WaitForMultipleObjectsEx, conditional on a timeout value of exactly 250ms. (Setting a conditional breakpoint in WinDbg is a piece of cake, especially if you know the method signature.)
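For reference, a conditional breakpoint of the kind we used might look like the following sketch. It assumes 32-bit stdcall, where at function entry the dwMilliseconds argument of WaitForSingleObjectEx sits at esp+8; treat it as an illustration rather than a recipe:

```text
bp kernel32!WaitForSingleObjectEx ".if (poi(esp+8) != 0xfa) { gc }"
```

0xfa is 250 decimal, so the breakpoint stops only when the timeout is exactly 250ms; any other wait resumes immediately via gc.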
Sure enough, there was a breakpoint hit in the middle of a gen 0 garbage collection, with the following call stack (edited and snipped for clarity):
Breakpoint 0 hit
Args to Child
0000008c 000000fa 00000000 KERNEL32!WaitForSingleObjectEx
000000fa 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
000000fa 00000000 00000000 mscorwks!CLREvent::Wait+0x17
00000000 c1ca7196 00000000 mscorwks!Thread::SysSuspendForGC+0x697
00000001 007fc318 00000000 mscorwks!WKS::GCHeap::SuspendEE+0x16c
00000000 00000000 0036f054 mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x195
00000000 00000000 007fc318 mscorwks!WKS::GCHeap::GarbageCollectTry+0x33
00000000 00000000 00000000 mscorwks!WKS::GCHeap::GarbageCollect+0x67
632fe67e 02455fc4 63760272 mscorwks!GCInterface::CollectGeneration+0xaa
00210132 02455f00 02452d74 mscorlib_ni+0x1a9944
00be27e2 00000003 0036f200 mscorlib_ni+0x6b8127
I was able to tell that it was a gen 0 collection by observing the first parameter to the WKS::GCHeap::GarbageCollect method on the call stack. The 000000fa parameter to WaitForSingleObjectEx is the timeout value: 0xfa is 250 decimal, i.e. 250ms.
The wait object itself is a non-signaled manual reset event:
0:000> !handle 0000008c ff
Object Specific Information
Event Type Manual Reset
Event is Waiting
So we have the location of the wait, and we can even see that the wait returns with a timeout by stepping out and inspecting the return value (usually in EAX) of WaitForSingleObjectEx:
eax=00000102 ebx=007fc318 ecx=7545e9a7 edx=00000000 esi=754b1126 edi=00000000
eip=655bc659 esp=0036ee20 ebp=0036ee50 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
0:000> !error 102
Error code: (Win32) 0x102 (258) - The wait operation timed out.
Right, so we have a situation here where the GC is waiting for some event that is not getting signaled for 250ms while it should instead be very busy preparing itself for a quick gen 0 garbage collection.
By stepping back a few frames on the stack, it’s possible to see what’s happening inside the Thread::SysSuspendForGC method. If the wait returns 0x102 (timeout) then an entry is logged to the CLR stress log (which I hope to discuss in a future post). It’s also possible to trace the entry text to the following two messages:
Timed out waiting for rendezvous event %d threads remaining.
Failed to CheckForAndDoRedirectForGC(). Retry suspension 2 for thread %p.
Finally, when all attempts to gracefully wait for the event have failed, Thread::SysSuspendForGC calls Thread::SuspendThread which maps to kernel32!SuspendThread.
From all of the above, you might be already seeing the light at the end of the tunnel. The GC thread is trying to checkpoint with one of the application threads by waiting for an event that will allow it to continue the GC. The application thread doesn’t signal the event, which causes the GC thread to give up and eventually suspend the application thread brutally using the Win32 API.
Two questions remain.
Why would the GC thread checkpoint with an application thread?
This one’s relatively easy. A garbage collection is a fairly sensitive process, in that it walks through data structures and pointers that were in use by the application threads just a moment ago. Although during the collection itself all other threads are suspended (in our case we’re working with the workstation concurrent GC, so threads are suspended during the sweep phase), it’s still not advisable to brutally suspend a thread while it’s in the middle of some possibly sensitive operation.
This is why the JIT compiler inserts so-called “safe for GC” markers into the generated code, so that when a thread hits one of these markers it suspends itself gracefully if a GC is currently in progress. (There are numerous ways to implement this; one might involve temporarily unmapping a page that is accessed in the safepoint code and suspending the thread in the vectored exception handler of the resulting access violation exception.)
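To make the idea of cooperative suspension concrete, here is a minimal sketch in C#. This is emphatically not the CLR’s actual implementation—just an illustration of a “poll point” that is cheap in the common case and parks the thread when a collector asks it to:

```csharp
using System;
using System.Threading;

// Illustration only: cooperative suspension via an explicit poll point,
// analogous in spirit to JIT-inserted GC safepoints. All names here are
// made up for the example.
class CooperativeWorker
{
    static volatile bool suspendRequested;            // set by the "collector"
    static readonly ManualResetEvent resume = new ManualResetEvent(true);

    public static void RequestSuspend() { suspendRequested = true; resume.Reset(); }
    public static void Resume()         { suspendRequested = false; resume.Set(); }

    // The "safepoint": a cheap flag check, blocking only when asked to stop.
    public static void SafePoint()
    {
        if (suspendRequested)
            resume.WaitOne();                         // park until resumed
    }

    public static long DoWork(int iterations)
    {
        long sum = 0;
        for (int i = 0; i < iterations; ++i)
        {
            sum += i;
            SafePoint();                              // poll point inside the loop
        }
        return sum;
    }
}
```

A loop that never reaches a poll point—like the busy-wait discussed below—is exactly the kind of code the suspending thread cannot rendezvous with.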
What are the application threads doing to prevent them from checkpointing with the GC when necessary?
This one is a little bit harder, but the obvious things to look for are threads performing lots of CPU operations that don’t call into the CLR. Any form of busy loop, number crunching or similar things that are constrained to a single method are suspicious in this regard. In our specific case, we found an internal loop that was busy-waiting its way around a physical device buffer. Apparently, this busy-wait didn’t contain a GC safepoint, making the GC thread wait for at least 250ms until it gave up and brutally suspended the thread.
To overcome this problem, it was necessary to either break the inner loop so that it once in a while calls a CLR API (Dima discovered that GC.KeepAlive did the job, there are other alternatives as well), or redesign it so that it is based on true waiting instead of busy-waiting. Dima chose the latter alternative and the problem disappeared, yielding consistent processing times of less than 0.5ms, as desired.
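As a sketch of the second alternative, the fix amounts to replacing a busy-wait on a “data ready” condition with a true wait on an event, so the thread blocks in the kernel (where the GC can rendezvous with it) instead of spinning without ever reaching a safepoint. The names and the buffer-as-int simplification below are assumptions for illustration, not the customer’s actual code:

```csharp
using System;
using System.Threading;

// Sketch: event-based waiting instead of busy-waiting on a device buffer.
class BufferPump
{
    static readonly AutoResetEvent dataReady = new AutoResetEvent(false);
    static volatile int pendingBuffer;

    // Called by whatever fills the buffer (e.g. a device completion callback).
    public static void OnBufferArrived(int buffer)
    {
        pendingBuffer = buffer;
        dataReady.Set();
    }

    // Instead of: while (pendingBuffer == 0) ;   // busy-wait, no safepoint
    public static int WaitForBuffer(TimeSpan timeout)
    {
        if (!dataReady.WaitOne(timeout))
            throw new TimeoutException("no buffer arrived");
        return pendingBuffer;
    }
}
```

The waiting thread now spends its idle time inside WaitOne, which is a perfectly safe place for the GC to suspend it.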
Once we had this conclusion, it was easy to reproduce the problem in a sterile environment. It takes less than 30 lines of code to demonstrate this issue:
static volatile int i;
static void Foo() { for (i = 0; ; ++i) ; }
static void Main(string[] args)
{
    Thread churner1 = new Thread(() => Foo()) { IsBackground = true };
    churner1.Start();
    while (true)
    {
        Stopwatch sw = Stopwatch.StartNew();
        GC.Collect(0);
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}
Here’s a typical output from my machine:
^CPress any key to continue . . .
(Notice the peculiar accuracy of the delays, always in 250ms multiples or near.)
Some remarks before we conclude this post:
- The problem does not manifest when the 64-bit CLR is loaded into the process. I.e., if you want to reproduce the problem, run the above code on a 32-bit OS or make sure that you’re forcing it to run as x86 (if you’re on a 64-bit OS). This can be done from Visual Studio or using the corflags utility.
- The workstation concurrent GC flavor was used in this example, but the other GC flavors are not immune from this problem. I was able to reproduce it with the server GC just as easily.
- This is not the first time someone out there has encountered this problem. Even more surprisingly, this is not the first time I have heard about this problem—over a year ago, in February 2008, I got an email from William Levine (a reader of this blog) who was experiencing strange 500ms delays, and suspected that there was some sort of CLR activity that was capped at 500ms. After some analysis, Bill was able to put together a tester (slightly more complicated than the one above) and even discuss it with a Microsoft Escalation Engineer. Unfortunately, the results were the same—this is an inherent limitation of the CLR GC, and there’s not much to do about it other than sprinkle cooperative-multitasking-style API calls into the code.
It’s hard to argue with the 127K subscribers to Jeff Atwood’s blog, so when I first read his post titled “Windows 7: The Best Vista Service Pack Ever” I wasn’t immediately sure how to respond. Nonetheless, I would like to try and offer my opinion on some of Jeff’s points, in the hope of giving a different perspective on what Windows 7 really is.
Now that Windows 7 has reached its "release to manufacturing" milestone, I had the opportunity to install it for myself and see.
Jeff, I’m sorry, but where have you been for the past 8 months? Microsoft released the M3 build at the PDC in October ‘08, the Beta has been out for several months, the RC has been out since May… If it’s really only now that you first installed the OS, how can you possibly write something as definitive as… (brace yourself…)
Windows 7 is the best Vista Service pack ever.
This one has really got me. Just a quick reminder to my dear readers: there are a whole lot of shiny new features in Windows 7. To name a few off the top of my head:
- New taskbar
- Native multi-touch support
- Support for up to 256 logical processors
- Native Ribbon UI
- Sensor and Location platform
- Windows Web Services API
- Trigger-start services
- Shell libraries
- …and so many more!
Saying that a major release of the OS, which contains dozens of major new features, is a service pack is not just a slight inaccuracy: It’s borderline irresponsible to the 127K blog readers and the thousands of others who are going to perpetuate this claim. Sure enough, Windows 7 offers significant performance and reliability improvements compared to Vista—but that doesn’t make it a service pack. For that matter, Vista SP2 is almost on par with the performance and reliability of Windows 7. That’s the real service pack.
The core of the operating system isn’t that different.
That’s simply not true, regardless of what you define as “the core of the operating system”. The kernel has undergone significant changes, including major scalability improvements; the user interface has been revamped and the initial desktop experience is completely different; the fundamental Windows applications have a different look and feel. There’s no area of Windows 7 about which you could say it “isn’t that different”.
I think Windows 7 works well as a de-facto Vista service pack. I guess that's not surprising if you compare the version numbers. […] 6.0.6002 […] 6.1.7600
This is another example of poor fact-checking. In October ‘08, Mike Nash wrote on the Windows Blog that the 6.1 version number reflects nothing but a deep concern for application compatibility. There are too many applications out there that break at the first sight of an OS major version change, and keeping the major version number has likely significantly reduced the number of applications that refuse to run on Windows 7 for no apparent reason.
If you’re still unconvinced that Windows 7 is an incredibly large, major release, take a few minutes to skim through the archives of the Engineering Windows 7 Blog and see the massive amount of effort that has gone into the making of this OS.
Daniel Moth recently blogged about the sad fact that there are no managed APIs for working with more than 64 logical processors, even though as you probably know, Windows 7 (and of course Windows Server 2008 R2) features support for up to 256 logical processors.
Frankly, these APIs aren’t designed to be very friendly, but sure enough they give you the ability to create threads (within a single process) that will run on different processor groups. As each processor group is limited to 64 processors, by creating 4 threads you can utilize up to 256 processors, which is currently the cap.
I’ve decided to write a C++/CLI wrapper titled Multi-Processor Extensions for .NET (published on CodePlex today) that wraps these native APIs. Among the wrapped facilities you’ll find:
- Logical processor information (GetLogicalProcessorInformationEx), including cache relationships, processors and processor groups, NUMA nodes and everything else
- NUMA information, including memory available on each node, NUMA node per process allocation etc.
- Facilities for modifying the processor group affinity of a thread, retrieving the processor group affinity of a process and creating a new process inheriting the parent’s processor group affinity (which is not the default)
Some example use:
foreach (ProcessorGroup group in Processor.ProcessorGroups)
Console.WriteLine("Processors in group: " + group.Processors);
foreach (NumaNodeRelationship numa in Processor.GetNumaNodeRelationships())
Console.WriteLine("NUMA node number: " + numa.NodeNumber);
foreach (CacheRelationship cache in Processor.GetCacheRelationships())
Console.WriteLine("Cache: " + cache.Type + " " + cache.CacheSize);
Console.WriteLine("Memory on NUMA node 0: " + Numa.GetAvailableMemory(0));
Admittedly I haven’t wrapped the API for creating a new thread and assigning it to a processor group that is different from the processor group of its creating process, but I did wrap the API for modifying the processor group of a thread, so you can still do it in two steps. If there’s much demand, I will add it later, or you can do it yourself (you need to use the new API for modifying extended process & thread attributes).
If you’re the lucky owner of dedicated hardware with more than 64 logical processors, please give these wrappers a try and see if they work ;-)
In the previous installment we’ve seen how tinkering with the prefetch settings on Windows XP improved the cold startup performance of a client application. Unfortunately, the improvement wasn’t significant enough and we were called in for another round of thinking.
After looking at the virtual memory address space of the process we found out that there were many relocated DLLs. We could easily determine this with Process Explorer, which shows you the image base address and actual base address for each DLL loaded into the process. Lots of relocations cost you some startup time, and seeing that there were ~300 DLLs involved, there was room for improvement.
So we rebased every single DLL that the application was using (with the rebase.exe utility from the SDK) and did some measurements again. Unfortunately, the results were inconclusive. Additionally, it was hard to tell whether the observed performance improvements were due to the rebase operation itself, or due to the fact that after rebasing the binaries were no longer signed (rebasing changes the file contents).
Another thing we tried with more success was an open-source executable compressor (packer) called UPX. It’s fairly trivial to use, and in this case it managed to cut the size of the application’s binaries by approximately half. UPX doesn’t support managed assemblies yet, but it doesn’t fail when encountering them.
As a result of using a packer, we managed to transform some of the cold startup I/O work to CPU work. Seeing that the CPU was not fully utilized during startup, this has shaved off another couple of seconds in a fairly consistent fashion, and with hardly any effort.
A few months ago, Alon, Dima and I visited a customer with a client application experiencing very slow cold startup times. In this post (and possibly another) I will look into some of the things we found out in the process. Some of the details have been slightly changed to protect the innocent, but the general findings might be useful nonetheless.
Elsewhere there is some good reading on cold startup performance in general. First there’s Vance Morrison’s theoretical model. Claudio Caldato’s MSDN Magazine article is more relevant for managed apps, but still quite useful.
When transitioning to the latest release of the application (lots of code added, of course, incorporating managed code into a previously purely unmanaged application), the startup times went ballistic: From approximately 5-15 seconds with the old version to 30+ seconds with the new one. Now, a delay of 10 seconds is not very pleasant but users kind of learn to live with that, especially when launching the app immediately after logging on. 30 seconds, on the other hand (which could be as bad as over a minute on slow machines) is intolerable.
We looked at the obvious things first, trying to understand what the application was doing during these dreaded 30 seconds. We installed the application on multiple hardware combinations, including natively on a Vista laptop and inside an XP virtual machine on another box.
Sidenote: Troubleshooting cold startup performance sucks, because to reproduce the scenario you have to reboot the box, and preferably the first thing you do after logging on is launch the app, so you get very little time to set up diagnostic tools.
First we looked at the application in Process Explorer, and saw that it was loading over 300 different DLLs. Its total memory consumption after completing the startup sequence was around 250MB, most of it code. Seeing that most of these DLLs were signed, we immediately suspected that a large portion of the startup time was spent loading these DLLs from disk (which explains why cold startup times are much worse than warm ones) and verifying their digital signatures. Most of these DLLs were loaded eagerly at startup, with only some managed assemblies being loaded lazily.
To test this hypothesis, we wrote a small loader application that attempted to load the same DLLs in the same order (to obtain the load order we used Dependency Walker with its profiling option turned on). It took approximately 10 seconds to load them on a machine where the startup time was around 25 seconds, so we were fairly pleased with the result. Next, we tried launching the application after letting the loader app do its magic, and witnessed inconsistent behavior. In the VM, startup times went down proportionally to the time it took the loader app to load DLLs from disk; in the native installation, there was only an insignificant decrease in startup time.
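The pre-loader itself can be quite small. Here is a minimal sketch of the kind of tool we wrote—read the load order (as captured by Dependency Walker) from a text file, one DLL path per line, and touch each DLL with LoadLibrary so its pages land in the file cache. The file format and all names here are assumptions, not the original tool:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Sketch of a DLL pre-loader: warms the file cache by loading DLLs in a
// recorded order. List format (one path per line, # for comments) is assumed.
class Preloader
{
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    static extern IntPtr LoadLibrary(string fileName);

    // Parse the load-order file, skipping blank lines and comments.
    public static string[] ReadLoadOrder(string path)
    {
        return Array.FindAll(File.ReadAllLines(path),
            line => line.Trim().Length > 0 && !line.TrimStart().StartsWith("#"));
    }

    public static void Preload(string listPath)
    {
        foreach (string dll in ReadLoadOrder(listPath))
        {
            // A failed load is reported but not fatal; we only care about
            // pulling the file's pages off the disk.
            if (LoadLibrary(dll) == IntPtr.Zero)
                Console.WriteLine("Failed to load: " + dll);
        }
    }
}
```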
Seeing that pre-fetching the application’s DLLs might have a positive effect on cold startup, we tried experimenting with the Windows prefetcher settings. As you might already know, Windows XP has a built-in prefetching mechanism which observes the first few seconds of your application’s startup, records whatever I/O requests you are making, and in subsequent invocations of your application—tries to prefetch these I/O requests while your app is not utilizing the I/O path. (On Vista, by the way, this mechanism was replaced by a newer one called SuperFetch, which is smarter because it tries to prefetch I/O requests even if there is no application that needs them at this time.)
Dima found the registry settings for the Windows XP prefetcher and modified them so that the prefetcher records more than just the first few seconds of the application’s startup. This improved cold startup performance in a fairly consistent fashion, shaving off several seconds of the startup time even on relatively fast boxes.
The following are the old and new versions of the prefetcher settings in the registry. As always when tinkering with registry settings, bear in mind that “It works on my machine” and that you’re doing it at your own risk. (And also bear in mind that these are the XP settings—on Vista or Windows 7 this could be completely different.)
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
There are many well-documented ways for launching an elevated (administrative) process from within a process that is running using standard user credentials. The cleanest way is by giving that process a manifest; if this can’t be done, the ShellExecute “runas” verb is the way to go. This is all yesterday’s news, really.
However, there’s a more interesting question: how do you launch a process as the standard user, if the current process is already elevated? As you probably know, a process can’t change its elevation status – an elevated process can’t downgrade itself to a standard user, and a non-elevated process can’t elevate itself. Additionally, there is no special API that obtains the token of the standard user – the token that the elevated process is using is a high-privileged token.
A naive (and very error-prone) approach would be to strip some privileges and rights away from the privileged token. First of all, it’s not very robust because a future release of the OS might add new privileges that you didn’t strip away. Second, this “dumbed-down” token is still considered an elevated token by the OS.
But there’s a more serious problem, and that’s the problem of OTS – over-the-shoulder elevation. If the elevation was a result of an administrator typing in his credentials over the user’s shoulder, then the process is running under the credentials of a different user – that’s not the currently logged on user, so stripping this token will not magically make it the current user’s token.
Aaron Margosis tells this story and offers a C++ solution to launch a process as standard user. The general steps are to obtain the user token of the shell process (explorer.exe), make a primary token off of that and then launch the new process with that token.
I’ve taken the liberty of adapting the sample so that it can be used from managed code, and added it to the UAC Helpers project on CodePlex, release v0.1 (the last time it was updated was over a year and a half ago, so it’s about time we stirred it up a little bit). The new functionality is available in the UserAccountControl.CreateProcessAsStandardUser method, and it’s very easy to use (if the process is not elevated, it simply falls back to using System.Diagnostics.Process.Start).
Every time I see Pavel working on his laptop, a question pops into my head: Why don’t you use the Aero theme of Windows Vista and Windows 7, and instead use something that closely resembles Windows Server 2003 – especially considering that your laptop is such a beast of a machine?
Today I got my answer, well, kind of. Pavel writes:
But if I don’t use Aero I don’t get thumbnail previews and other goodies. Why is that? I asked around and didn’t get a satisfying answer.
I’m going to call a Raymond Chen and Eric Lippert on this one. Features aren’t magically conceived and born. The feature isn’t there because, to quote Eric, “no one ever designed, specified, implemented, tested, documented and shipped that feature”.
Noam chimed in and commented that there’s a way to create a new Windows theme that will have the “Classic” appearance but with Aero on, thumbnails and all. This isn’t out of the box because… well, it’s not a very useful theme. Most users turn off Aero because their machine grinds to a halt with Aero on, or because their video card does not support it (I, for one, am forced to live without it on my Asus Eee netbook). So that’s what Windows is trying to accomplish – it gives you a way to turn off Aero completely along with switching to the Basic theme. Or, you could customize the theme to your preference – but that’s something few users would ever think of doing, anyway.
A couple of years ago I had a really disturbing thought about a framework I was building, which I have since seen repeated on the Windows 7 Engineering Blog: among the millions of people using Windows, there will never be consensus on any feature. (I came to the same conclusion with a user base consisting of fairly smart developers, and only about 100 of them. Rinse, repeat for billions of users on both ends of any conceivable spectrum.)