May 2012 - Posts
Yesterday we hosted the second meeting of the Jerusalem .NET/C++ User Group. We have a temporary website now, which I encourage you to bookmark to stay up to date with the group’s meetings. (I will also post any news to my blog, of course.)
The two presentations from the event are below. They were deep technical talks, and I strongly recommend that you peruse the links sprinkled throughout the slides for a deeper understanding of the topics.
Thanks everyone for coming, and see you at the next meeting!
This is a short excerpt (with slight modifications) from Chapter 10 of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.
Theoretically, .NET developers should never be concerned with optimizations tailored to a specific processor or instruction set. After all, the purpose of IL and JIT compilation is to allow managed applications to run on any hardware that has the .NET Framework installed, and to remain indifferent to operating system bitness, processor features, and instruction sets. However, squeezing the last bits of performance from managed applications may require reasoning at the assembly language level. At other times, understanding processor-specific features is a first step for even more significant performance gains.
Data-level parallelism, also known as Single Instruction Multiple Data (SIMD), is a feature of modern processors that enables the execution of a single instruction on a large set of data (larger than the machine word). The de-facto standard for SIMD instruction sets is SSE (Streaming SIMD Extensions), used by Intel processors since Pentium III. This instruction set adds new 128-bit registers (with the XMM prefix) as well as instructions that can operate on them. Recent Intel processors introduced Advanced Vector Extensions (AVX), which is an extension of SSE that offers 256-bit registers and even more SIMD instructions. Some examples of SSE instructions include:
- Integer and floating-point arithmetic
- Comparisons, shuffling, data type conversion (integer to floating-point)
- Bitwise operations
- Minimum, maximum, conditional copies, CRC32, population count (introduced in SSE4 and later)
You might be wondering whether instructions operating on these “new” registers are slower than their standard counterparts. If that were the case, any performance gains would be deceiving. Fortunately, that is not the case. On Intel i7 processors, a floating-point addition (FADD) instruction on 32-bit registers has throughput of one instruction per cycle and latency of 3 cycles. The equivalent ADDPS instruction on 128-bit registers also has throughput of one instruction per cycle and latency of 3 cycles.
Using these instructions in high-performance loops can provide up to 8-fold performance gains compared to naïve sequential programs that operate on a single floating-point or integer value at a time. For example, consider the following (admittedly trivial) code:
//Assume that A, B, C are equal-size float arrays
for (int i = 0; i < A.Length; ++i) {
    C[i] = A[i] + B[i];
}
The standard code emitted by the JIT in this scenario is the following:
; ESI has A, EDI has B, ECX has C, EDX has i
xor edx,edx
cmp dword ptr [esi+4],0
jle END_LOOP
NEXT_ITERATION:
fld dword ptr [esi+edx*4+8] ; load A[i], no range check
cmp edx,dword ptr [edi+4] ; range check accessing B[i]
jae OUT_OF_RANGE
fadd dword ptr [edi+edx*4+8]; add B[i]
cmp edx,dword ptr [ecx+4] ; range check accessing C[i]
jae OUT_OF_RANGE
fstp dword ptr [ecx+edx*4+8]; store into C[i]
inc edx
cmp dword ptr [esi+4],edx ; are we done yet?
jg NEXT_ITERATION
END_LOOP:
Each loop iteration performs a single FADD instruction that adds two 32-bit floating-point numbers. However, by using 128-bit SSE instructions, four iterations of the loop can be issued at a time, as follows (the code below performs no range checks and assumes that the number of iterations is evenly divisible by 4):
xor edx, edx
NEXT_ITERATION:
; copy 16 bytes from B to xmm1
movups xmm1, xmmword ptr [edi+edx*4+8]
; copy 16 bytes from A to xmm0
movups xmm0, xmmword ptr [esi+edx*4+8]
; add xmm0 to xmm1 and store the result in xmm1
addps xmm1, xmm0
; copy 16 bytes from xmm1 to C
movups xmmword ptr [ecx+edx*4+8], xmm1
add edx, 4 ; increase loop index by 4
cmp edx, dword ptr [esi+4] ; compare the index with A.Length
jl NEXT_ITERATION ; loop while the index is less than A.Length
On an AVX processor, we could move even more data around in each iteration (with the 256-bit YMM* registers), for an even bigger performance improvement:
xor edx, edx
NEXT_ITERATION:
; copy 32 bytes from B to ymm1
vmovups ymm1, ymmword ptr [edi+edx*4+8]
; copy 32 bytes from A to ymm0
vmovups ymm0, ymmword ptr [esi+edx*4+8]
; add ymm0 to ymm1 and store the result in ymm1
vaddps ymm1, ymm1, ymm0
; copy 32 bytes from ymm1 to C
vmovups ymmword ptr [ecx+edx*4+8], ymm1
add edx, 8 ; increase loop index by 8
cmp edx, dword ptr [esi+4] ; compare the index with A.Length
jl NEXT_ITERATION ; loop while the index is less than A.Length
The SIMD instructions used in these examples are only the tip of the iceberg. Modern applications and games use SIMD instructions to perform complex operations, including scalar product, shuffling data around in registers and memory, checksum calculation, and many others. Intel’s AVX portal is a good way to learn thoroughly what AVX can offer.
The JIT compiler uses only a small number of SSE instructions, even though they are available on practically every processor manufactured in the last 10 years. Specifically, the JIT compiler uses the SSE MOVQ instruction to copy medium-sized structures through the XMM* registers (for large structures, REP MOVS is used instead), uses SSE2 instructions for floating-point to integer conversion, and relies on SSE in a few other corner cases. The JIT compiler does not auto-vectorize loops by unifying iterations, as we did manually in the preceding code.
Unfortunately, C# doesn’t offer any keywords for embedding inline assembly code into your managed programs. Although you could factor out performance-sensitive parts to a C++ module and use .NET interoperability to access it, this is often clumsy and introduces a small performance penalty as the interoperability boundaries are crossed. There are two other approaches for embedding SIMD code without resorting to interoperability.
A brute-force way to run arbitrary machine code from a managed application (albeit with a light interoperability layer) is to dynamically emit the machine code and then call it. The Marshal.GetDelegateForFunctionPointer method is key, as it returns a managed delegate pointing to an unmanaged memory location, which may contain arbitrary code. The following code allocates virtual memory with the EXECUTE_READWRITE page protection, which enables us to copy code bytes into memory and then execute them. The result, on my Intel i7-860 CPU, is a more than 2-fold improvement in execution time!
[UnmanagedFunctionPointer(CallingConvention.StdCall)]
delegate void VectorAddDelegate(
float[] C, float[] B, float[] A, int length);
[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr VirtualAlloc(
    IntPtr lpAddress, UIntPtr dwSize,
    uint flAllocationType, uint flProtect);
//This array of bytes has been produced from
//the SSE assembly version – it is a complete
//function that accepts four parameters (three
//vectors and length) and adds the vectors
byte[] sseAssemblyBytes = {
0x8b, 0x5c, 0x24, 0x10, 0x8b, 0x74, 0x24, 0x0c, 0x8b,
0x7c, 0x24, 0x08, 0x8b, 0x4c, 0x24, 0x04, 0x31, 0xd2,
0x0f, 0x10, 0x0c, 0x97, 0x0f, 0x10, 0x04, 0x96, 0x0f,
0x58, 0xc8, 0x0f, 0x11, 0x0c, 0x91, 0x83, 0xc2, 0x04,
//0x39 0xda = cmp edx, ebx; 0x7c 0xea = jl back to the loop start
0x39, 0xda, 0x7c, 0xea, 0xc2, 0x10, 0x00 };
IntPtr codeBuffer = VirtualAlloc(
IntPtr.Zero, new UIntPtr((uint)sseAssemblyBytes.Length),
0x1000 | 0x2000, //MEM_COMMIT | MEM_RESERVE
0x40 //EXECUTE_READWRITE
);
Marshal.Copy(sseAssemblyBytes, 0,
codeBuffer, sseAssemblyBytes.Length);
VectorAddDelegate addVectors = (VectorAddDelegate)
Marshal.GetDelegateForFunctionPointer(
codeBuffer, typeof(VectorAddDelegate));
//We can now use ‘addVectors’ to add vectors!
A completely different approach, which unfortunately isn’t available on the Microsoft CLR, is extending the JIT compiler to emit SIMD instructions. This is the approach taken by Mono.Simd. Managed code developers who use the Mono .NET runtime can reference the Mono.Simd assembly and use JIT compiler support that converts operations on types such as Vector16b or Vector4f to the appropriate SSE instructions.
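To give a feel for what JIT-recognized vector types look like in code, here is a minimal sketch. It does not use Mono.Simd; it uses System.Numerics.Vector<T>, which ships with later versions of .NET and is an assumption beyond the 2012 landscape this excerpt describes. The VectorMath class and its Add method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Numerics;

static class VectorMath
{
    // Adds a and b element-wise into c; all three arrays must have equal length.
    public static void Add(float[] a, float[] b, float[] c)
    {
        if (a.Length != b.Length || a.Length != c.Length)
            throw new ArgumentException("Arrays must have equal length.");

        // Number of floats per hardware vector, e.g. 4 with 128-bit
        // registers or 8 with 256-bit registers.
        int width = Vector<float>.Count;
        int i = 0;

        // Vectorized part: each iteration handles 'width' elements.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(c, i);
        }

        // Scalar tail for lengths that are not a multiple of 'width'.
        for (; i < a.Length; ++i)
            c[i] = a[i] + b[i];
    }
}
```

Unlike the hand-written assembly above, this sketch keeps the runtime's range checks and handles array lengths that are not a multiple of the vector width.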
I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn
Note: This blog post assumes that you can capture and analyze managed dumps in WinDbg using SOS, and have encountered a bizarre technical problem when using dumps from a production environment. If this assumption is incorrect, feel free to peruse my .NET Debugging Resources link post.
When debugging managed dumps in WinDbg, you will need to load the SOS version that is compatible with the CLR version in the dump. SOS, in turn, requires the CLR “data access DLL” (mscordacwks.dll), which is a debugging helper shipping with the .NET Framework. If SOS and/or mscordacwks.dll are missing or have the wrong version, you will receive an error similar to the following when trying to run most SOS commands:
Failed to load data access DLL, 0x80004005
Verify that 1) you have a recent build of the debugger (6.2.14 or newer)
2) the file mscordacwks.dll that matches your version of mscorwks.dll is in the version directory
3) or, if you are debugging a dump file, verify that the file mscordacwks_<arch>_<arch>_<version>.dll is on your symbol path.
4) you are debugging on the same architecture as the dump file. For example, an IA64 dump file must be debugged on an IA64 machine.
You can also run the debugger command .cordll to control the debugger's load of mscordacwks.dll. .cordll -ve -u -l will do a verbose reload. If that succeeds, the SOS command should work on retry.
If you are debugging a minidump, you need to make sure that your executable path is pointing to mscorwks.dll as well.
Some CLR versions have been indexed by the Microsoft public symbol server, so you can instruct the debugger to download everything automatically by setting your symbol path and image path to the Microsoft symbol server.
However, some CLR versions have not been indexed and cannot be retrieved automatically—and this is where you need to crawl the Web and find the binaries manually. Today I had the chance to encounter a CLR version, 2.0.50727.3607, which WinDbg couldn’t retrieve from the Microsoft symbol server:
SYMSRV: http://msdl.microsoft.com/download/symbols/
mscordacwks_x86_x86_2.0.50727.3607.dll/4ADD5446590000/
mscordacwks_x86_x86_2.0.50727.3607.dll not found
Doug Stewart has been collecting updates to CLR 2.0 for several years now, and has an entry on his post for the 3607 CLR build, associated with KB article 976569.
If you go to the KB article, you’ll be able to download the update, an executable file. Instead of launching it, open it in an application that understands self-extracting archives, such as 7-Zip:

Locate the .msp file and open it again:

One of the .cab files contains the files we are looking for. If you got the wrong .cab file, try the other one :-)

Now all that’s left is to extract them to a temporary directory, rename them to mscordacwks.dll and mscorwks.dll, and issue a .cordll -u -ve -lp <path> command in WinDbg:
0:014> .cordll -u -ve -lp D:\temp\3607
CLRDLL: Loaded DLL D:\temp\3607\mscordacwks.dll
CLR DLL status: Loaded DLL D:\temp\3607\mscordacwks.dll
Voila—you can run any SOS commands now.
This is a short excerpt from Chapter 1 of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out. We have an Amazon page and a cover image now!
Where do you fit performance in the software development lifecycle? This innocent question carries the baggage of having to retrofit performance into an existing process. Although that is possible, a healthier approach is to consider every step of the development lifecycle an opportunity to understand the application’s performance better: first, the performance goals and important metrics; next, whether the application meets or exceeds its goals; and finally, whether maintenance, user loads, and requirement changes introduce any regressions.
During the requirements gathering phase, you should start thinking about the performance goals you would like to set.
During the architecture phase, you should refine the performance metrics important for your application and define concrete performance goals.
During the development phase, you should frequently perform exploratory performance testing on prototype code or partially complete features to verify that you are well within the system’s performance goals.
During the testing phase, you should perform significant load testing and performance testing to fully validate your system’s performance goals.
During subsequent development and maintenance, you should perform additional load testing and performance testing with every release (and preferably on a daily or weekly basis) to quickly identify any performance regressions introduced into the system.
Developing a suite of automatic load tests and performance tests, setting up an isolated lab environment in which to run them, and analyzing their results carefully to make sure no regressions are introduced is a very time-consuming process. Nevertheless, the performance benefits gained from systematically measuring and improving performance, and from making sure regressions don’t creep slowly into the system, are worth the initial investment in a robust performance development process.
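As a rough illustration of what one automated performance check might look like, here is a minimal sketch built on Stopwatch. The PerfCheck class, its method names, and the idea of comparing a median against a goal in milliseconds are assumptions made for this example, not something prescribed by the book.

```csharp
using System;
using System.Diagnostics;

static class PerfCheck
{
    // Runs 'action' the given number of times and returns the median
    // elapsed time in milliseconds (the upper middle sample for even counts).
    public static double MedianMilliseconds(Action action, int runs)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException("runs");
        action(); // warm-up run, so JIT compilation is not measured

        var samples = new double[runs];
        for (int i = 0; i < runs; ++i)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }
        Array.Sort(samples);
        return samples[runs / 2];
    }

    // Throws when the measured median exceeds the performance goal.
    public static void AssertWithinGoal(
        string name, Action action, double goalMs, int runs = 15)
    {
        double median = MedianMilliseconds(action, runs);
        if (median > goalMs)
            throw new InvalidOperationException(string.Format(
                "{0}: median {1:F2} ms exceeds the goal of {2:F2} ms",
                name, median, goalMs));
    }
}
```

A check such as PerfCheck.AssertWithinGoal("parse order file", () => ParseOrders(sample), 50.0), where ParseOrders and sample are placeholders for your own code, could then run with every build. Using the median rather than a single sample makes the check less sensitive to one-off noise, although a dedicated lab environment is still needed for results you can trust.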
CLR Profiler is a free Microsoft tool for diagnosing memory-related performance problems in managed applications. In this post, I’m using CLR Profiler v4.0, which you can download here.
I talked about CLR Profiler here as a post-mortem diagnostic tool that can open log files generated by SOS.dll’s !TraverseHeap command and present a reference graph of all live objects. This in itself is a little-known feature of CLR Profiler; it is even less known that CLR Profiler can generate these reference graphs live, and compare them automatically to show you where a memory leak is coming from.
All you need to do is run your application under CLR Profiler, and click the “Show heap now” button periodically. This is similar to the “Take snapshot” functionality in ANTS Memory Profiler and other tools. When the application terminates, you click the “Heap Graph” button in the Summary view.

This produces a reference graph in which you can see the differences between snapshots. This is the end of the graph, which makes it evident that almost all the retained objects are strings, held by string arrays and FileInformation objects:

And this is the beginning of the graph, which makes it clear (with some experience deciphering root reference chains) that the majority of objects are retained by a static EventHandler:
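For readers who haven't run into this kind of leak, the pattern behind a root chain that ends in a static EventHandler can be sketched as follows. This is not the application that was profiled: FileWatcherHub, LeakDemo, and the layout of FileInformation are hypothetical and exist only to show why subscribers of a static event stay alive.

```csharp
using System;
using System.Runtime.CompilerServices;

static class FileWatcherHub
{
    // A static event is a GC root: every subscribed object stays reachable.
    public static event EventHandler FileChanged;

    public static void Raise()
    {
        EventHandler handler = FileChanged;
        if (handler != null) handler(null, EventArgs.Empty);
    }
}

class FileInformation
{
    private readonly string[] _lines = new string[1000];

    public int LineCount { get { return _lines.Length; } }

    public void Subscribe() { FileWatcherHub.FileChanged += OnFileChanged; }
    public void Unsubscribe() { FileWatcherHub.FileChanged -= OnFileChanged; }

    private void OnFileChanged(object sender, EventArgs e) { }
}

static class LeakDemo
{
    // Creates a subscribed object and returns only a weak reference to it,
    // so the static event holds the only remaining strong reference.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference CreateSubscribed()
    {
        var info = new FileInformation();
        info.Subscribe();
        return new WeakReference(info);
    }

    // Forces a full collection and reports whether the object is still alive.
    public static bool SurvivesCollection(WeakReference reference)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return reference.IsAlive;
    }
}
```

LeakDemo.SurvivesCollection(LeakDemo.CreateSubscribed()) returns true because the static event's delegate still references the object and its string array. Once Unsubscribe is called and no other references remain, the object becomes eligible for collection.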

If you zoom into the snapshots, you’ll see three colors, indicating the amount of memory allocated and retained between the snapshots. For example, the darkest pink objects were created between the second and the third snapshots.


There are several advanced options available for further exploration. For example, you can view only new objects and then see who allocated them (which call stack in your program is responsible for creating them). You could even look at a GC timeline and see which objects were alive at every point in time, as well as who allocated them:


To summarize, CLR Profiler still has plenty of hidden gems and you should consider using it—especially in simple scenarios. After all, it’s hard to beat its price :-)
If you’re doing production debugging in a closed environment (locked-down servers with no Internet access, or an institution that doesn’t allow unrestricted Internet access from developer machines), this post will help you set up an offline production debugging environment. Incidentally, this post will also help if you’re going to host SELA’s .NET Debugging or C++ Debugging courses, and want to make sure your workstations are ready for the numerous hands-on labs.
First and foremost, you are going to need all the tools you plan to use for production debugging. At the very least, this includes:
- Debugging Tools for Windows
  Note there are 32-bit and 64-bit versions, you need whatever your application is using. For several releases now, the installation package is only available through the WDK or the Windows SDK, but you can follow the instructions here to configure the SDK web installer to download only the Debugging Tools for Windows installation package.
- CLR Profiler
  The v4.0 version works for CLR 2.0 applications, so that’s the one you need.
- Sysinternals Suite
  If you’re really low on disk space, the least you need are Process Explorer, Process Monitor, and VMMap.
- .NET disassembler – any of the following will do:
- Application Verifier
  Note there are 32-bit and 64-bit versions.
- Debugging extensions
Next, you’re going to need debugging symbols for your environment. This includes debugging symbols for your own code, but more importantly, symbols for Windows, the CLR, and the .NET Framework versions your application is using.
In an online environment, obtaining symbols is trivial through the Microsoft symbol server. In an offline environment, you need to jump through a few hoops to make sure you have all the symbols set up, so here goes.
Windows symbols are available for download as symbol packages for various OS versions. As always, you need to make sure you’re downloading symbols for the precise operating system version on which your application runs. If there are several platforms you’re using, you’ll need symbols for all of them.
If you installed any hotfixes or updates from Windows Update, you’ll need to download symbols for them manually. Use the symchk.exe utility that ships with the Debugging Tools for Windows to obtain the symbols on an Internet-connected machine. Follow these instructions to generate a text file on the disconnected machine and use it on a connected machine to download symbols.
.NET Framework symbols are available online on the Reference Source Code Center. These packages contain source code (!) and symbol files for the .NET Framework assemblies: mscorlib.dll, System.dll, and others.
CLR symbols are not available online as part of the .NET Framework symbol packages. You won’t usually need them unless you’re exploring CLR internals or dealing with very specific problems that require inspecting unmanaged call stacks of managed threads. To obtain CLR symbols, you’ll have to use the same technique mentioned above with symchk.exe.
Key CLR binaries and SOS.DLL are required for properly debugging dump files on a different machine. If you have access to the original machine, you can simply copy mscorwks.dll/clr.dll, SOS.dll and mscordacwks.dll from the %windir%\Microsoft.NET\Framework* directory corresponding to your .NET version. If you don’t have access to the original machine, you’ll have to track down the installation package for the appropriate CLR version and retrieve these binaries from it—I’ll cover this in a future post.
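If you do have access to the original machine, the copy step can be scripted. The following is a minimal sketch under stated assumptions: the DebugBinaries class and its Copy method are hypothetical, and the file names are the ones listed in the paragraph above. It copies whichever of those files exist in the framework directory you point it at.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DebugBinaries
{
    // File names from the text: mscorwks.dll belongs to CLR 2.0 and
    // clr.dll to CLR 4.0, so normally only one of the two is present.
    static readonly string[] Names =
        { "mscorwks.dll", "clr.dll", "SOS.dll", "mscordacwks.dll" };

    // Copies the files that exist in 'frameworkDir' into 'targetDir'
    // (creating it if needed) and returns the names that were copied.
    public static List<string> Copy(string frameworkDir, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        var copied = new List<string>();
        foreach (string name in Names)
        {
            string source = Path.Combine(frameworkDir, name);
            if (!File.Exists(source)) continue;
            File.Copy(source, Path.Combine(targetDir, name), true);
            copied.Add(name);
        }
        return copied;
    }
}
```

You would pass the version-specific framework directory (for example, the v2.0.50727 subdirectory) as frameworkDir. Keeping the copies in a folder named after the exact CLR build makes it easier to point .cordll -lp at the right set later.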
After following these instructions, you should have a fully functional offline production debugging environment. Additions and comments are welcome!