The Point of No Return
Sometimes there's this point in debugging an issue when you can't take it anymore. You've tried to diagnose the *** from every possible perspective, tried out various configurations, used brute-force binary elimination, whatever. It just doesn't help. And as often happens with these particularly nasty bugs, it only gets worse over time. Was it easy to reproduce when you started? It's no longer easy. Could you work offline on your own machine and fix the issue for the customer? Now it works smoothly on your laptop, which hardly makes the customer happy if it doesn't work on their production server.
This is the case with a bug I'm hunting right now. I figured I might just as well post it here on the slim chance that someone has a valuable contribution. (Note that I am unable to disclose the customer's identity, or any debugging-related information gathered from the customer's machines.)
Basically, the customer's issues started with unexplained application crashes (ExecutionEngineException, if I may add) with a bizarre call stack indicating what appears to be an access violation while JITting a piece of code. Given a giant code base of over 160 Visual Studio projects, it took a fair amount of time to obtain a simple repro that could be dissected separately. Things got easier once we established that the same assembly which caused the exception at JIT time also caused an exception at NGEN time (which doesn't require any setup to perform). In NGEN, the exception was slightly different (some mumble about an RPC call failing), but the call stack up to that point was basically the same. Another noticeable detail was that ILDasm had no trouble whatsoever opening our assembly; on the other hand, PEVerify cried like a little girl and, instead of saying something about invalid IL, simply crashed with a similar call stack!
Having obtained an accurate repro of the problem, we tried to reproduce it on our own machines. We tried .NET 2.0 out of the box, we tried .NET 2.0 with VS2005 SP1, we tried all kinds of hotfixes issued by Microsoft over the years. We even tried compiling with the "updated" compiler and NGEN from the .NET 3.5 Beta 2 release. What we didn't succeed at was reproducing the problem.
Frustrated as hell, we approached the customer again, with some fresh ideas. It seemed that when the assembly was compiled from Visual Studio, the problem would occur; however, when it was compiled using the msbuild utility from the command line, the problem would not occur and everything was happy ever after.
Again we started a quest to determine what was wrong with that Visual Studio installation, to no avail. I was desperate enough to fire up Process Monitor and compare what the C# compiler (csc.exe) was doing in the two scenarios. At this point I was hit by a surprise.
Visual Studio did not even invoke the C# compiler; it did try looking for it in the usual (C:\Windows\Microsoft.NET\Framework\v2.0.50727) directory, but it didn't use it for compilation. However, it still managed to produce an assembly, albeit an invalid one, apparently without the aid of a compiler.
After a little bit of web investigation, I learned that Visual Studio 2005 has a curious performance optimization: unless otherwise specified, it doesn't actually invoke the C# compiler (csc.exe) or the VB.NET compiler (vbc.exe); instead, it uses an in-process hosted version of the compiler. This behavior can be controlled by inserting the <UseHostCompilerIfAvailable>false</UseHostCompilerIfAvailable> tag into your MSBuild project (you can do this in the Microsoft.Common.targets file, or in the .csproj file in question).
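For reference, here is roughly what that looks like in a .csproj file. This is only a minimal sketch, assuming the standard VS2005-era MSBuild project layout; the surrounding Project element is just scaffolding, and in a real project you would drop the property into one of the PropertyGroup elements that is already there rather than paste in the whole fragment.

```xml
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <!-- Ask the build to run the standalone compiler (csc.exe / vbc.exe)
         instead of Visual Studio's in-process hosted compiler. -->
    <UseHostCompilerIfAvailable>false</UseHostCompilerIfAvailable>
  </PropertyGroup>
  <!-- ...the rest of the project file stays unchanged... -->
</Project>
```

Putting it in the individual .csproj keeps the change local to the one project you are investigating, which is probably what you want while still hunting for the cause.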
Using this tag had the amazing effect of actually "fixing" the problem; the out-of-process standalone compiler now produced correct, verifiable code.
At this point I was happy enough to let the problem and myself go home. However, the customer reminded me that the original issue was brought up by QA engineers, who were not using a version of the application compiled in Visual Studio. They were using a version from the Team Foundation build server. Which uses msbuild. Which directly invokes csc.exe, with no "hosted compiler" babble.
Exhausted as we were, we tried to reproduce the problem on the build server. To make a long story short, it didn't reproduce. Everything worked with msbuild. Everything worked as part of the TF build. Everything worked from Visual Studio, with and without my "host compiler" patch.
Magic. Back to square one.