DCSIMG
February 2007 - Posts - All Your Base Are Belong To Us

All Your Base Are Belong To Us

Mostly .NET internals and other kinds of gory details

February 2007 - Posts

JIT Optimizations, Inlining, and Interface Method Dispatching (Part 1 of N)

The .NET Just-In-Time Compiler (JIT) is considered by many to be one of the primary performance advantages of the CLR in comparison to the JVM and other managed environments that use just-in-time-compiled byte-code.  It is this advantage that makes the JIT internals so secret and the JIT optimizations so mysterious.  It is also the reason why the SSCLI (Rotor) contains only a naive fast-JIT implementation, which performs only the minimal optimizations while translating each IL instruction to its corresponding machine language instruction sequence.

 

During my research for the .NET Performance course I wrote for Sela, I've found a couple of sources regarding the known JIT optimizations.  The following is a non-exhaustive list of those optimizations, followed by a detailed discussion of interface method dispatching and the JIT inlining techniques employed thereof.

 

Before we start, there are a couple of points worth mentioning.  First of all, it's pretty obvious that in order to see JIT optimizations, you must look at the assembly code emitted by the JIT in run-time for the release build of your application.  However, there is a minor caveat: if you try to look at the assembly code from within a Visual Studio debug session, you will not see optimized code.  This is due to the fact JIT optimizations are disabled by default when the process is launched under the debugger (for convenience reasons).  Therefore, in order to see JIT-optimized code, you must attach Visual Studio to a running process, or run the process from within CorDbg with the JitOptimizations flag on (by issuing the "mode JitOptimizations 1" command from within the cordbg> command prompt).  Finally, in the caveats section, bear in mind that I do not work for Microsoft and do not have access to the actual sources of the JIT compiler.  Therefore, nothing in the following analysis should be taken for granted.

 

Another important point to mention is the fact this article is based on the x86 JIT that comes with the .NET 2.0 CLR.  It is quite likely that results will be different under the x64 JIT and under the x64 NGEN.  This, however, is the material for an additional, future, post.  I also hope to explore (in this fictional future post) the behavior of the CLR 1.1 JIT.

 

Range-check elimination

When accessing an array in a loop, and the loop's termination condition relies upon the length of the array, the array bounds access check can be eliminated.  Consider the following code.  Do you see anything wrong with it?

 

        private static int[] _array = new int[10];

        private static volatile int _i;

 

        static void Main(string[] args)

        {

            for (int i = 0; i < _array.Length; ++i)

                _i += _array[i];

        }

 

00BE00A5  xor         edx,edx                          ; edx = 0 (i)

00BE00A7  mov         eax,dword ptr ds:[02491EC4h]     ; eax = numbers

00BE00AC  cmp         edx,dword ptr [eax+4]            ; edx >= Length?

00BE00AF  jae         00BE00C2                         ; exception!

00BE00B1  mov         eax,dword ptr [eax+edx*4+8]      ; eax = numbers[i]

00BE00B5  add         dword ptr ds:[0A22FC8h],eax      ; _i += eax

00BE00BB  inc         edx                              ; ++edx (i)

00BE00BC  cmp         edx,0Ah                          ; edx < 100?

00BE00BF  jl          00BE00A7

 

The yellow line is the preamble, the green lines are the loop body, and the blue lines are the code to increment the counter and test for the loop termination condition.

 

The third line in the above listing (00BE00AC) performs a range check – it ensures that the EDX register, used to index inside the array, is not greater than or equal to the array length at [EAX + 4] (EAX holds the array address, which is known to hold the array length at a 4-byte offset from its start).  In addition, there is a loop termination check performed at the second-to-last line of the listing (00BE00BC).

 

Where is the range check elimination, then?  The reason for this behavior is the simple fact that the array reference itself is static in this scenario.  Working with a static reference causes the above code to be generated (also note how in 00BE00A7 we actually fetch the array reference into the register with each iteration of the loop).  This behavior can be eliminated by making a very simple modification to the above program:

 

        private static int[] _array = new int[10];

        private static volatile int _i;

 

        static void Main(string[] args)

        {

            int[] localRef = _array;

            for (int i = 0; i < localRef.Length; ++i)

            {

                _i += localRef[i];

            }

        }

 

009D00D2  mov         ecx,dword ptr ds:[1C21EC4h]      ; ecx = numbers

009D00D8  xor         edx,edx                          ; edx = 0 (i)

009D00DA  mov         esi,dword ptr [ecx+4]            ; esi = numbers.Length

009D00DD  test        esi,esi                          ; esi == 0?

009D00DF  jle         009D00F0

009D00E1  mov         eax,dword ptr [ecx+edx*4+8]      ; loop proper

009D00E5  add         dword ptr ds:[922FC8h],eax       ; _i += numbers[i]

009D00EB  inc         edx                              ; ++edx (i)

009D00EC  cmp         esi,edx                          ; edx > numbers.Length?

009D00EE  jg          009D00E1

 

Using the local reference causes the above code to be generated: note how the range check has been eliminated, and now the loop termination condition is the only test performed in this code.  Note: this discussion mirrors Greg Young's post on the subject, here: http://codebetter.com/blogs/gregyoung/archive/2006/07/08/147230.aspx

 

It's also worth noting that it is very easy to break this optimization by using something other than Array.Length as the loop termination condition.  For example, the following code (which is functionally identical) will generate array bounds check and ruin the optimization for us:

 

            for (int i = 0; i < localRef.Length * 2; ++i)

            {

                _i += localRef[i / 2];

            }

 

Examples of standard compiler optimizations

  1. Compile-time constant folding.
  2. Constant and copy propagation.
  3. Loop unrolling. 

Method inlining

A short and simple method that is frequently used can be inlined into the calling code.  Currently, the JIT is documented (perhaps "blogged about" would be more precise here) to inline methods that are less than 32 bytes in length, do not contain any complex branching logic, and do not contain any exception handling-related mechanisms.  See David Notario's blog for some additional information on this topic (note that is not entirely relevant for CLR 2.0) –  http://blogs.msdn.com/davidnotario/archive/2004/11/01/250398.aspx.

 

Given the following code fragment, let's examine what happens when the JIT doesn't optimize and inline the method call, and what happens when it does.

 

    public class Util

    {

        public static int And(int first, int second)

        {

            return first & second;

        }

 

        [DllImport("kernel32")]

        public static extern void DebugBreak();

    }

 

    class Program

    {

        private static volatile int _i;

 

        static void Main(string[] args)

        {

            Util.DebugBreak();

 

            _i = Util.And(5, 4);

        }

    }

 

Note the use of the kernel32.dll DebugBreak function (which internally issues a software interrupt, int 3).  I use it here so that Windows will offer me the opportunity to debug the process when this method is called, so that I don't have to attach to it manually from within Visual Studio or any other debugger of choice.  Finally, note that I made the _i field volatile so that the assignment to it is not optimized away.

 

When stepping into the disassembly with JIT optimizations disabled (for example, when the process is started from within a Visual Studio debugging session), the following code is emitted for the method call site:

 

00000013  mov         edx,4

00000018  mov         ecx,5

0000001d  call        dword ptr ds:[00A2967Ch]         ; this is Util.And

00000023  mov         esi,eax

00000025  mov         dword ptr ds:[00A28A6Ch],esi     ; this is _i

 

If we keep stepping into the code at 00A2967C, we find the method itself:

 

00000000  push        edi 

00000001  push        esi 

00000002  mov         edi,ecx

00000004  mov         esi,edx

00000006  cmp         dword ptr ds:[00A28864h],0

0000000d  je          00000014

0000000f  call        794F1116

00000014  mov         eax,edi

00000016  and         eax,esi

00000018  pop         esi 

00000019  pop         edi 

0000001a  ret             

 

Note that there is no optimization or inlining here: the parameters are passed to the method in the EDX and ECX registers (fastcall calling convention), and the AND instruction at offset 0x16 performs the method's actual purpose.

 

Now let's look at the inlined call, generated when I attach to the process after the debug breakpoint is issued inside it.  This is what the method call site looks like, this time:

 

00B90075  mov         dword ptr ds:[0A22FD0h],4    ; _i = 4

00B9007F  ret             

 

The result of AND(5, 4) is 4, and this is the value that is directly written into the volatile member field.  Note that the optimization is so aggressive that the AND operation doesn't even occur – it can be computed directly (folded) during compile-time.

 

However, it is seemingly impossible to inline a virtual method call.  This, of course, is due to the fact that the actual method to be called is unknown at compile-time and can change between method invocations.  For example, consider the following code:

 

    class A

    {

        public virtual void Foo() { }

    }

 

    class B : A

    {

        public override void Foo() { }

    }

 

    class Program

    {

        static void Method(A a)

        {

            a.Foo();

        }

 

        static void Main(string[] args)

        {

            for (int i = 0; i < 10; ++i)

            {

                A a = (i % 2 == 0) ? new A() : new B();

                Method(a);

            }

        }

    }

 

When the JIT compiles the Method method, it has no means of knowing which of the two implementations should be called – A.Foo or B.Foo.  Therefore, it is seemingly impossible to inline the call at the call site.  (Note my repeated use of "seemingly" here – it is theoretically possible to perform a partial optimization that would introduce inlining in selected cases, and I discuss it in depth later, when talking about interface method dispatching.)

 

Therefore, a virtual call must go through the actual object's method table.  (If you need some refreshment on this, have a brief skim through http://msdn.microsoft.com/msdnmag/issues/05/05/JITCompiler/.)  Going through the method table involves two levels of indirection: using the object header to reach into the method table, and using a compile-time known offset into the method table to determine the method to call.

 

In the preceding example, the following code will be emitted at the call site (in the Method method):

 

007E0076  xor         esi,esi                   ; esi = 0 (i)

007E0078  jmp         007E00AA                  ; jump to compare condition

007E007A  mov         eax,esi                   ; i % 2 == 0?

007E007C  and         eax,80000001h

007E0081  jns         007E0088

007E0083  dec         eax 

007E0084  or          eax,0FFFFFFFEh

007E0087  inc         eax 

007E0088  test        eax,eax

007E008A  je          007E0098

007E008C  mov         ecx,2B3180h

007E0091  call        002A201C

007E0096  jmp         007E00A2

007E0098  mov         ecx,2B3100h

007E009D  call        002A201C

007E00A2  mov         ecx,eax                   ; ecx = a

007E00A4  mov         eax,dword ptr [ecx]       ; eax = a's method table

007E00A6  call        dword ptr [eax+38h]       ; call through offset 0x38

007E00A9  inc         esi 

007E00AA  cmp         esi,0Ah

007E00AD  jl          007E007A

007E00AF  pop         esi 

007E00B0  ret             

 

As noted above, the virtual call is dispatched in two steps: first, reaching into the object's method table (007E00A4 mov eax, dword ptr [ecx]), and then calling the method pointer at a compile-time known offset into the method table (007EE00A6 call dword ptr [eax+38h]).

 

Note that theoretically, the sealed C# keyword (and its corresponding IL counterpart, final) is meant to minimize the inefficiency involved with virtual method dispatch by indicating that in spite of a method's being virtual, it cannot be overridden anymore by any derived classes.  For example, the following code does not have to involve the virtual method dispatch we have just seen:

 

    class A

    {

        public virtual void Foo() { }

    }

 

    class B : A

    {

        public override sealed void Foo() { }

    }

 

    class C : B

    {

    }

 

    class Program

    {

        static void Method(B b)

        {

            b.Foo();

        }

 

        static void Main(string[] args)

        {

            for (int i = 0; i < 10; ++i)

            {

                B b = (i % 2 == 0) ? new B() : new C();

                Method(b);

            }

        }

    }

 

It is obvious that the call target b.Foo in the Method method is statically known: it's B.Foo that is going to be called.  However, the JIT does not choose to use this information to prevent the virtual method dispatch – it is emitted nonetheless, as we can see from the following assembly code (this time I've snipped the object setup).

 

005000A2  mov         ecx,eax

005000A4  mov         eax,dword ptr [ecx]

005000A6  call        dword ptr [eax+38h]

 

Note that using the sealed keyword on the class itself has no effect either, even though the call target can be statically known as well.

 

To fully understand what's going on in the background, it's worth your time to learn that IL has two instructions used for dispatching method calls: call and callvirt.  One of the main reasons for using the callvirt instruction for instance method calls is the fact that the JIT-emitted code contains a check that ensures the instance is not null and throws a NullReferenceException otherwise.  This is why the C# compiler emits the callvirt IL instruction even when calling non-virtual instance methods.  If this were not the behavior, the following code could compile and run successfully:

 

    class A

    {

        public void Foo() { }   // Foo doesn't use "this"

    }

 

    class Program

    {

        static void Main(string[] args)

        {

            for (int i = 0; i < 10; ++i)

            {

                A a = (i % 2 == 0) ? new A() : null;

                a.Foo();

            }

        }

    }

 

Here's the IL that is emitted for this scenario (note the callvirt instruction at L_0013):

.method private hidebysig static void Main(string[] args) cil managed
{
      .entrypoint
      .maxstack 2
      .locals init (
            [0] int32 num1,
            [1] CallAndCallVirt.A a1)
      L_0000: ldc.i4.0 
      L_0001: stloc.0 
      L_0002: br.s L_001c
      L_0004: ldloc.0 
      L_0005: ldc.i4.2 
      L_0006: rem 
      L_0007: brfalse.s L_000c
      L_0009: ldnull 
      L_000a: br.s L_0011
      L_000c: newobj instance void CallAndCallVirt.A::.ctor()
      L_0011: stloc.1 
      L_0012: ldloc.1 
      L_0013: callvirt instance void CallAndCallVirt.A::Foo()
      L_0018: ldloc.0 
      L_0019: ldc.i4.1 
      L_001a: add 
      L_001b: stloc.0 
      L_001c: ldloc.0 

      L_001d: ldc.i4.s 10

      L_001f: blt.s L_0004
      L_0021: ret 

}

 

And here's the assembly that is emitted for this scenario:

 

00AD0076  xor         esi,esi

00AD0078  jmp         00AD009F

00AD007A  xor         edx,edx                   ; represents the "a" local variable

00AD007C  mov         eax,esi

00AD007E  and         eax,80000001h

00AD0083  jns         00AD008A

00AD0085  dec         eax 

00AD0086  or          eax,0FFFFFFFEh

00AD0089  inc         eax 

00AD008A  test        eax,eax

00AD008C  jne         00AD009A

00AD008E  mov         ecx,0A230E0h

00AD0093  call        00A1201C                  ; constructor call (in the branch)

00AD0098  mov         edx,eax

00AD009A  mov         eax,edx    

00AD009C  cmp         dword ptr [eax],eax       ; null reference check

00AD009E  inc         esi                       ; move on, the method isn't really called

00AD009F  cmp         esi,0Ah

00AD00A2  jl          00AD007A

00AD00A4  pop         esi 

00AD00A5  ret             

 

So, is the JIT capable of inlining such a method?  Yes!  Note how the JIT completely eliminates the call itself (the method is empty and therefore inlining it means simply optimizing it away), but it can't eliminate the null reference check, because that was the intent of the callvirt keyword.  Therefore, the instruction at 00AD009C performs a trivial null reference check by attempting to dereference EAX, which contains the value of the "a" local variable.  If an access violation occurs, it is caught and a NullReferenceException exception is thrown in its stead.

 

If the IL in L_0003 were to use the call instruction instead of the callvirt instruction, the assembly code that would have been emitted wouldn't have the null reference check embedded in it, and the above code could run successfully.  This is not the behavior adopted by the C# language developers.

 

For the sake of completeness, it is worth noting that sometimes calling a virtual method involves the call instruction and not the callvirt instruction.  This occurs, for example, in the following code snippet:

 

    class Employee

    {

        public override string ToString()

        {

            return base.ToString();

        }

    }

 

The ToString method compiled to IL: 

.method public hidebysig virtual instance string ToString() cil managed
{
      .maxstack 8
      L_0000: ldarg.0 
      L_0001: call instance string object::ToString()
      L_0006: ret 

}

 

If the instruction emitted in this case were the callvirt instruction, Employee.ToString would call itself recursively forever.  It is explicitly our purpose here to ignore the normal virtual method dispatching mechanisms and delegate our implementation to the base class' method.  Therefore, the call to Object.ToString will not be emitted with the callvirt instruction, but with the call instruction.

 

Interface method dispatch is not very different in essence from virtual method dispatch.  It is not very different in the sense that a level of indirection must be added and therefore the actual implementation to call cannot be determined at the call site.  For example, consider the following code:

 

    interface IA

    {

        void Foo();

    }

 

    class A : IA

    {

        public void Foo()

        {

        }

    }

 

    class B : IA

    {

        void IA.Foo()

        {

        }

    }

 

    class Program

    {

        [MethodImpl(MethodImplOptions.NoInlining)]

        static void Method(IA a)

        {

            a.Foo();

        }

 

        static void Main(string[] args)

        {

            for (int i = 0; i < 10; ++i)

            {

                IA a = (i % 2 == 0) ? (IA)new A() : (IA)new B();

                Method(a);

            }

        }

    }

 

The trivial method dispatch technique in this case would be as follows:

  1. Reach into the method table for the actual object passed.
  2. Reach into the interface method table map for the interface used within the method table for the actual object, using the process-wide interface ID.
  3. Reach into the interface method table for the method at the compile-time known offset and execute it.

Therefore, the approximate assembly code that should be generated for the actual interface method call site (inside the Method method) is supposed to look like the following:

 

mov    ecx, edi                   ; ecx holds "this"

mov    eax, dword ptr [ecx]       ; eax holds method table

mov    eax, dword ptr [eax+0Ch]   ; eax holds interface method table

mov    eax, dword ptr [eax+30h]   ; eax holds method pointer

call   dword ptr [eax]

 

This is described in http://msdn.microsoft.com/msdnmag/issues/05/05/JITCompiler/ and other sources, but is not the case in real life.  Even the debug (non-optimized) version of the interface dispatch doesn't look like this.  I will examine what the code does look like, but for now it will suffice to say that the naïve approach probably had dire performance characteristics and was therefore replaced with something different.

 

By the way, note that interface methods are always implicitly marked as virtual (consider the following IL for the A and B classes).

.class private auto ansi beforefieldinit A extends object implements NaiveMethodDispatch.IA
      .method public hidebysig newslot virtual final instance void Foo() cil managed
.class private auto ansi beforefieldinit B extends object implements NaiveMethodDispatch.IA
      .method private hidebysig newslot virtual final instance void NaiveMethodDispatch.IA.Foo() cil managed
            .override NaiveMethodDispatch.IA::Foo

 

If the interface implementation is explicit, or if you didn't specify the virtual keyword when implementing the interface implicitly, the compiler also emits the final keyword (equivalent to the C# sealed keyword) on the method.  This means that interface methods cannot be overridden unless they are explicitly marked as virtual in the base class' code; however, they are still virtual in the sense that the callvirt instruction must be used to call them, and that a method table lookup is required so that they are properly dispatched.

 

However, I would not even consider writing this article if everything were as simple as presented so far.  The trigger for my post has actually been Ayende's post http://www.ayende.com/Blog/archive/2007/02/10/On-the-Framework-Design-Principles-from-Raymond-Lewallen.aspx on a vaguely related topic, where he mentioned that the JIT is capable of inlining interface method calls.  A mild theoretical discussion over e-mail has resulted in some (very) preliminary research which I present below.

 

Before diving in, the conclusion of the following discussion can be summarized as follows: it is theoretically impossible to perfectly inline virtual method calls (interface method calls fall in this category as well); the JIT does not inline interface method calls; instead, the JIT performs an optimization that does not use the naïve interface method dispatch outlined above.

 

Before we follow what the actual CLR 2.0 JIT does at interface method call sites, it is probably wise to consider any optimization that could theoretically be performed on such calls.  The most obvious are the following two:

 

  1. Flow analysis can determine that a particular static type is used for the interface method call and therefore can be directly dispatched through the type instead of using the technique described above.

 

            IA a = new A();

            a.Foo();

 

It is fairly clear that A.Foo is being called at this call site, no matter what happens.  The managed compiler's flow analysis can detect this condition and emit the appropriate byte-code, which will allow the JIT to inline the method if it complies with the other inlining requirements (see above).

 

  1. One of the interface implementations is called considerably more frequently than others, at a given call site.  This can be determined by dynamically profiling and patching the code or by some kind of hint from the programmer.  (See http://www.sable.mcgill.ca/publications/papers/2005-4/sable-paper-2005-4.ps and http://www.research.ibm.com/journal/sj/391/suganuma.html for more information on the topic in general and its JVM implementation in particular.)  In this case, the interface method dispatch for that specific interface can be modified to direct dispatch or even inlined, as follows:

 

if (obj->MethodTable == expectedMethodTable) {

      // Inlined code of "hot" method

}

else {

      // Normal interface method dispatch code

}

 

The JIT is not documented (or "blogged about") to perform either of those optimizations.  However, poking through the JIT-emitted code reveals that the case is not as simple as it appears to be.

 

Let's look at the call site for the following code and attempt to analyze what happens in the method dispatching process.  I've given the interface and the methods meaningful names and content so that we could look into the example more easily.

 

    interface IOperation

    {

        int Do(int a, int b);

    }

 

    class And : IOperation

    {

        public int Do(int a, int b) { return a & b; }

    }

 

    class Or : IOperation

    {

        public int Do(int a, int b) { return a | b; }

    }

 

    class Program

    {

        static volatile int _i;

        static void Main()

        {

            for (int i = 0; i < 10; ++i)

            {

                IOperation op = (i % 2 == 0) ?

(IOperation)new And() : (IOperation)new Or();

                _i = op.Do(5, 3);

            }

        }

    }

 

In this case, we have a single call site for IOperation.Do, and it is impossible to the jitter to statically determine which implementation will be called.  Therefore, direct inlining or direct method dispatch is impossible.  What is, then, the code that gets emitted in this case?

 

Let's first have a look at the call site itself, which is the Main entry point.  It is compiled when the entry point is called, so we should be able to look at the code immediately.  The following is the code only for the actual call to IOperation.Do inside the loop.

 

007E00A4  push        3   

007E00A6  mov         edx,5                            ; setup parameters

007E00AB  call        dword ptr ds:[2C0010h]           ; the actual call

007E00B1  mov         dword ptr ds:[002B2FE8h],eax     ; save return value

 

Note that this is not the interface method dispatch pattern we were supposed to see here.  In its stead, we get an indirect call through the address 002C0010.  You should remember this address because we will mention it in the following discussion.  Stepping further inside, we see:

 

002C6012  push        eax 

002C6013  push        30000h

002C6018  jmp         79EE9E4F

 

The actual implementation has not been compiled yet, and therefore we find ourselves tracing through the code of the JIT-compiler.  Eventually, we are redirected (through a complex series of jumps) to the actual interface method's code.  Issuing the ret instruction from there gets us back to the main loop (where the call instruction has been issued, at 007E00AB).

 

However, after the loop has run three times, the code that 002C0010 points to (recall this is where our original call site is going through) is back-patched to an optimized version.  This optimized version follows:

 

002D7012  cmp         dword ptr [ecx],2C3210h

002D7018  jne         002DA011

002D701E  jmp         007E00D0

 

Recall that a .NET object header starts with the object's method table, and there we are at the trivial profiling optimization: if the method table is the expected one (i.e. the "common" or "hot" implementation), we can perform a direct jump to that code.  (Note that its actual address is embedded within the instruction, because the JIT has back-patched this code with the complete and intimate knowledge of the method's location.  This means no extra memory access is required to dispatch this call.)  Otherwise, we must go through the usual dispatching layer which we will see in a moment.  For completeness, here is the code at 007E00D0 (which is the "expected" result of the CMP instruction):

 

007E00D0  and         edx,dword ptr [esp+4]

007E00D4  mov         eax,edx

007E00D6  ret         4   

 

This is simply the And.Do implementation.  Note that the JIT-compiled code was not inlined into the call site or inlined into the dispatching helper.  However, jumping directly to this code should cause as little overhead as possible.  The remaining question is: what's at 002DA011, or, in other words, what happens if the method table is not the expected one?  This time the code is significantly more complex.

 

002DA011  sub         dword ptr ds:[14D3D0h],1

002DA018  jl          002DA056

002DA01A  push        eax 

002DA01B  mov         eax,dword ptr [ecx]

002DA01D  push        edx 

002DA01E  mov         edx,eax

002DA020  shr         eax,0Ch

002DA023  add         eax,edx

002DA025  xor         eax,3984h

002DA02A  and         eax,3FFCh

002DA02F  mov         eax,dword ptr [eax+151A6Ch]

002DA035  cmp         edx,dword ptr [eax]

002DA037  jne         002DA04B

002DA039  cmp         dword ptr [eax+4],30000h

002DA040  jne         002DA04B

002DA042  mov         eax,dword ptr [eax+8]

002DA045  pop         edx 

002DA046  add         esp,4

002DA049  jmp         eax 

002DA04B  pop         edx 

002DA04C  push        30000h

002DA051  jmp         79EED9A8

002DA056  call        79F02065

002DA05B  jmp         002DA01A

 

I've marked the most significant sections so we can focus on them.  First, 1 is decremented from a global variable.  Its initial value is 0x64 (i.e. 100).  We will later see its purpose in a moment.  If the resulting value is less than 0, we jump to a call to one of the JIT back-patching functions, and continue the flow.  What's in that flow?  The normal interface method dispatching as performed by the JIT.  Note that eventually there's a JMP EAX instruction at 002DA049, which actually gets us to the required code:

 

007E00F0  or          edx,dword ptr [esp+4]

007E00F4  mov         eax,edx

007E00F6  ret         4   

 

This is clearly the Or.Do implementation.  Right, so everything seems peachy.  What is the purpose of this global variable we just saw, then?  Consider the optimization we discussed earlier, with the "hot" path being inlined (or at least directly jumped to) if the method table of the actual object matches the method table for the "hot" implementation.  It might be the case that the frequency of calls through each of the implementations changes dynamically during run-time.  For example, it might be the case that for the first 500 times the user calls And.Do, but for the next 5000 times he will call Or.Do.  This makes our optimization look a bit stupid, as we have in fact optimized for the least common case.  To prevent this scenario, a counter is established for each call site.  It is decremented every time there's a "miss" – i.e., when the method table of the actual object does not match the expected method's method table.  When this counter decreases below 0, the JIT back-patches the code that 002C0010 points to again, to be the following:

 

0046A01A  push        eax 

0046A01B  mov         eax,dword ptr [ecx]

0046A01D  push        edx 

0046A01E  mov         edx,eax

0046A020  shr         eax,0Ch

0046A023  add         eax,edx

0046A025  xor         eax,3984h

0046A02A  and         eax,3FFCh

0046A02F  mov         eax,dword ptr [eax+151A74h]

0046A035  cmp         edx,dword ptr [eax]

0046A037  jne         0046A04B

0046A039  cmp         dword ptr [eax+4],30000h

0046A040  jne         0046A04B

0046A042  mov         eax,dword ptr [eax+8]

0046A045  pop         edx 

0046A046  add         esp,4

0046A049  jmp         eax 

0046A04B  pop         edx 

0046A04C  push        30000h

0046A051  jmp         79EED9A8

0046A056  call        79F02065

0046A05B  jmp         0046A01A

 

Again, I've emphasized the important parts.  Without getting into much detail, the purpose of this code is to check whether the current object's type matches the last object's type (note that unlike the previous snippet, the check here is not against a literal constant address – instead, it is calculated here using the object pointer itself).  If there's a match, the location to jump to is calculated and the JMP EAX executes (at 0046A049).  If there isn't a mtach, the JIT's back-patching code is called again, and the process repeats.

 

Note that this code is less efficient than the state we had before the counter decremented itself below 0.  Back then, we had a direct jump to a literal constant address.  Now, we have a calculation of the jump address based on the object pointer itself.  Also note that this time there is no counter – every time there is a miss, the meaning of this code will change.  Summarized in pseudo-code, this looks somewhat like the following:

 

start: if (obj->Type == expectedType) {

      // Jump to the expected implementation

}

else {

      expectedType = obj->Type;

      goto start;

}

 

This is the final behavior that we get for the entire course of the program.  It means that per call site, we get a one-shot counter (initialized to 100) which counts the times the "hot" implementation was missed.  After the counter decays below 0, JIT back-patching is triggered and the code is replaced with the version we just saw, that alternates the "hot" implementation with each miss.

 

Finally, note that the 00C20010 stub is generated for each call site that attempts to dispatch an interface call.  This means that the counter data and related code are per-call-site optimization, which can be highly valuable in certain scenarios.

 

Testing the above hypothesis is fairly trivial by writing a test program that performs interface method dispatching.  In one mode, the program will call the first implementation in a loop and then call the second implementation in a loop; in another mode, the program will interleave every call to the first implementation with a call to the second implementation.  Granted the behavior of the final dispatching code that we just saw, it is reasonable to expect that the first test case will perform better than the second one.  This was tested using the following program:

 

    interface IOperation

    {

        int Do(int a, int b);

    }

 

    class And : IOperation

    {

        public int Do(int a, int b) { return a & b; }

    }

 

    class Or : IOperation

    {

        public int Do(int a, int b) { return a | b; }

    }

 

    class Program

    {

        [MethodImpl(MethodImplOptions.NoInlining)]

        static void Method(IOperation op)

        {

            _i = op.Do(5, 3);

        }

 

        static readonly int NUM_ITERS = 100000000;

        static readonly int HALF_ITERS = NUM_ITERS / 2;

 

        static volatile int _i;

        static void Main()

        {

            IOperation and = new And();

            IOperation or = new Or();

 

            Stopwatch sw = Stopwatch.StartNew();

            for (int i = 0; i < HALF_ITERS; ++i)

            {

                Method(and);

                Method(and);

            }

            for (int i = 0; i < HALF_ITERS; ++i)

            {

                Method(or);

                Method(or);

            }

            Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);

 

            sw.Reset();

            sw.Start();

            for (int i = 0; i < HALF_ITERS; ++i)

            {

                Method(and);

                Method(or);

            }

            for (int i = 0; i < HALF_ITERS; ++i)

            {

                Method(and);

                Method(or);

            }

            Console.WriteLine("Interleaved: {0} ms", sw.ElapsedMilliseconds);

        }

    }

 

The test results on my laptop averaged to 2775ms for the sequential case, and 2960ms for the round-robin (interleaved) case.  The difference was consistent, but it's obviously not very noticeable.  Therefore, I conclude for now that the two usage patterns have little (if any) effect on the program's performance, especially if the methods are more "chunky" than just a single x86 instruction.

 

Wrap-up

As I already mentioned, this is only a preliminary version of what I (hopefully) intend to write on this subject.  The following is a brief summary of what's missing:

  • Understanding the considerations behind the choice of two different call dispatching stubs (which are swapped when the counter reaches 0)
  • Designing a realistic scenario that is pertinent to some of the above discussion
  • Ensuring that none of this is high-level compiler-specific by checking similar behaviors for the VB.NET compiler (e.g. the behavior of the "sealed" keyword, emitting call/callvirt for non-virtual instance methods, etc.)
  • Stepping through NGEN-ed code (32-bit) to see if there are any differences
  • Repeating the above process for 64-bit code (the JIT is different)
  • Repeating the above process for NGEN-ed 64-bit code
  • Repeating the above process for CLR 1.1 JIT code

Attached to this post you will find an archive of several Visual Studio projects I've used to demonstrate various aspects of the above discussion.  These aren't real-world examples, so don't rush with the download.  J

Log Remoteability Through log4net

For those of you not aware of it, the log4net logging framework provides an open source complete solution to all kinds of logging facilities an application might require.  The framework contains implementation that allows you to log to rolling files, databases (via ADO), e-mail, "net send" and many more.  In addition, the framework is very extensible in the sense that customization is possible almost everywhere (especially given that it's open source).

It's also necessary to mention that during the last couple of years, Microsoft has been working on and releasing the Logging Application Block, which is part of the Patterns and Practices Enterprise Library initiative.  By the way, the artchitecture offered by both frameworks is quite similar.

What I wanted to talk about this time is log remoteability.  I.e., I want the log writing to occur on one machine (A), but the actual log4net appenders that perform the logging reside on another machine (B).  Most people wouldn't concern themselves too much with this scenario, and would simply provide a distributed logging facility that resides on machine A and that all other machines can communicate with using standard mechanisms to perform the logging.  However, I believe that the skeleton of my suggestion is a cleaner solution both in terms of complexity and in terms of appropriate design.

What I would like is that machine B would have a full configuration structure simulating the loggers it wants to write to, so that the actual logging code would be exactly the same even though the logging would not be performed locally.  Machine A would then have a shadow of the loggers with the actual appenders performing the logging.  The missing part is the link that connects the two together, which I provide below as a skeleton implementation that uses rolling files being written to a shared directory (obviously, any other communication mechanism could be used here).  The reason I chose this particular implementation is that the original requirements for this "remote logging" facility required it to use textual files only, for security reasons (communication between a highly secured and highly insecured network).

The "Remoted" Machine

So what the developer has to do is write a configuration file on machine B that contains the appropriate loggers that will be shadowed on machine A.  Each of these loggers on machine B has to have a RemotedAppender (a class sketched below) attached.  The remote appender does not perform actual logging - it only serializes the logging events to a rolling file in a specified shared directory.  The below implementation also takes batching into consideration, so that a new file will not be written for each event.  Note that I use the SoapFormatter because in my particular case, the information recorded must be purely textual and validatable using an XML schema (XSD).

using System;

using System.IO;

using System.Runtime.Serialization.Formatters.Soap;

using log4net.Appender;

using log4net.Core;

 

namespace log4net.Appender

{

    public sealed class RemotedAppender : AppenderSkeleton

    {

        protected override void Append(LoggingEvent loggingEvent)

        {

            if (_batchSize < 0 || string.IsNullOrEmpty(_directoryName) || !Directory.Exists(_directoryName))

            {

                throw new ArgumentException(

                    "RemotedAppender parameters are set incorrectly.  Ensure that the batch size " +

                    "is greater than or equal to 0, and that the directory specified exists.");

            }

 

            if (_eventCounter == 0)

                _events = new LoggingEvent[Math.Max(BatchSize, 1)];

 

            loggingEvent.Fix = FixFlags.All;    // Otherwise, volatile data isn't serialized.

            _events[_eventCounter] = loggingEvent;

 

            if (++_eventCounter >= BatchSize)

            {

                LogEvents(_events);

                _eventCounter = 0;

            }

        }

 

        private void LogEvents(LoggingEvent[] loggingEvents)

        {

            string generatedFileName =

                Path.ChangeExtension(Path.Combine(DirectoryName, Guid.NewGuid().ToString()), ".remotelog");

            using (FileStream fs = File.Create(generatedFileName))

            {

                SoapFormatter formatter = new SoapFormatter();

                formatter.Serialize(fs, loggingEvents);

            }

        }

 

        private int _batchSize;

        private string _directoryName;

        private int _eventCounter;

        private LoggingEvent[] _events;

 

        /// <summary>

        /// Tests if this appender requires a <see cref="P:log4net.Appender.AppenderSkeleton.Layout"/> to be set.

        /// </summary>

        /// <value>Always false.</value>

        protected override bool RequiresLayout

        {

            get

            {

                return false;

            }

        }

 

        /// <summary>

        /// Gets or sets the batch size - the number of events that will be buffered

        /// until the file is written and closed.

        /// </summary>

        public int BatchSize

        {

            get { return _batchSize; }

            set { _batchSize = value; }

        }

 

        /// <summary>

        /// Gets or sets the directory name to which the rolling files will be written.

        /// </summary>

        public string DirectoryName

        {

            get { return _directoryName; }

            set { _directoryName = value; }

        }

    }

}

As you can see from the above code, the LoggingEvent class that is passed to the appender is serializable, so it's embedded in an array and serialized to a file.  The batch size controls the buffer of events that must be accumulated until a file is actually written.

The "Remote" Machine

So what does the developer has to do on machine A?  He has to shadow the logging hierarchy from machine B, and use the RemoteLogPlayer class (sketched below) to replay the log files from the shared directory.  I also use a helper LogDirectoryWatcher class that takes into consideration the fact we can only receive notification when the file is created*, and not when it is closed (so we always use the before-last file).

So here's the code for the player itself:

using System;

using System.IO;

using System.Runtime.Serialization.Formatters.Soap;

using System.Text;

using log4net;

using log4net.Core;

 

namespace RemoteAppendersTest

{

    class RemoteLogPlayer

    {

        private string _fileName;

 

        public RemoteLogPlayer(string fileName)

        {

            _fileName = fileName;

        }

 

        public void Play()

        {

            LoggingEvent[] loggingEvents;

            using (FileStream fs = File.Open(_fileName, FileMode.Open))

            {

                SoapFormatter formatter = new SoapFormatter();

                loggingEvents = (LoggingEvent[]) formatter.Deserialize(fs);

            }

            foreach (LoggingEvent loggingEvent in loggingEvents)

            {

                ILog logger = LogManager.GetLogger(loggingEvent.LoggerName);

                logger.Logger.Log(loggingEvent);

            }

        }

    }

}

There isn't much to say about this code - it deserializes the specified file and sends every event to the appropriate logger, obtained using the LogManager.GetLogger method.  This class is invoked by the directory watcher:

using System;

using System.IO;

using System.Threading;

 

namespace RemoteAppendersTest

{

    public class LogDirectoryWatcher

    {

        private FileSystemWatcher _watcher = new FileSystemWatcher();

        private string _lastLogFileName;

        private RemoteLogPlayer _logPlayer;

        private volatile bool _enabled;

 

        public LogDirectoryWatcher(string directoryName)

        {

            _watcher.Path = directoryName;

            _watcher.Filter = "*.*";

            _watcher.Created += new FileSystemEventHandler(OnNewLogFileCreated);

        }

 

        public bool Enabled

        {

            get

            {

                return _enabled;

            }

            set

            {

                _enabled = value;

                _watcher.EnableRaisingEvents = value;

            }

        }

 

        private void OnNewLogFileCreated(object sender, FileSystemEventArgs e)

        {

            if (!Enabled)

                return;

 

            if (String.IsNullOrEmpty(_lastLogFileName))

            {

                _lastLogFileName = e.FullPath;

                return;

            }

 

            _logPlayer = new RemoteLogPlayer(_lastLogFileName);

 

            if (!Enabled)

                return;

 

            _lastLogFileName = e.FullPath;

 

            _logPlayer.Play();

        }

    }

}

* This is a limitation of the Win32 API that is used by the FileSystemWatcher class.  That API is called FindFirstChangeNotification and it eventually boils down to a call to the file system driver.  It simply doesn't support notifications when the file is being closed.  However, it is possible to use the change journal feature of NTFS to receive notifications about a file being closed (however, it requires interop because the change journal API is simply DeviceIoControl with the appropriate codes; and requires the Indexing Service to be enabled).

Wrap-Up

And that's basically all there is to it.  Note that as I said above, the use of a shared directory and rolling files is just one of the options - any communication mechanism could be used.  What I find enlightening about this example is the fact very little cooperation is required on the user's side once everything is set up correctly - he doesn't have to modify the logging code.  This goes along very well with the idea of log4net appenders - the actual delivery of the log event should be isolated and decoupled completely from the code that provides the log message.

Run FxCop from Code

As a very general introduction, FxCop is a code analysis tool developed by Microsoft that generates errors and warnings that should be part of the build process.  For example, existing FxCop rules can warn you about poorly performing constructs, about code that does not abide to naming conventions, about code that is not properly globalized or secure, and many more.  It is also possible to develop custom rules using the FxCop SDK, or download existing rules written by other developers.  Here's a couple of links that have a collection of rules (I bet there are more being added by the minute as people are getting accustomed to the idea of managed code analysis):

http://dotnetjunkies.com/WebLog/tshak/archive/2005/04/12/65389.aspx
http://davidkean.net/archive/2005/02/19/341.aspx

For those of you who are aware of the tool, but have always used FxCop only as part of the Visual Studio Team Suite (i.e. using the Code Analysis tab on the project settings), you should probably be aware of the fact that FxCop offers a command line interface, in the form of the FxCopCmd.exe stand-alone executable.  Note that even if you don't have VSTS, you can still use FxCop - it's a free download.  Finally, the definitive resource to look for answers is the FxCop team blog.

A few weeks ago I've encountered the need to measure the performance of some custom FxCop rules on a project I'm consulting for.  These rules rely heavily on metadata that is obtained using reflection, which is not the standard FxCop way of looking at code.  FxCop has a means of analyzing assemblies and types using a so-called "introspection engine", that looks at the IL directly, without loading the assembly (using Assembly.Load or the similar).  Some of the existing custom rules seemed to provide feasible performance, especially on small projects, and some of the rules caused Visual Studio to hang for minutes in a row while analysis was being performed.

So what I really needed to do is a means to run FxCop from code, so that I can easily measure the time it took to perform analysis for each of the rules.  The only resource I could find on the net was this forum question, which doesn't really help.  Therefore, I started poking around with Reflector, to finally produce the following fragile code.  In spite of its being fragile, I believe this is the only resource on the web that provides a working implementation that does not invoke FxCopCmd.exe to analyze an assembly.  However, use it at your own risk - this is an undocumented scenario and all the APIs used below are subject to change in any future release (this is based on the version of FxCop that comes with VSTS).

using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;
using System.Globalization;
using System.IO;
using System.Text;
using System.Threading;
using System.Xml;
using System.Xml.Xsl;
using Microsoft.FxCop.Common;
using Microsoft.FxCop.Sdk;

namespace RunFxCopFromCode
{
struct RuleResultData
{
public string RuleName;
public long ElapsedMs;
}

class FxCopExecutor
{

public static RuleResultData Execute(string targetLocation, string ruleLocation, string referencesDirectory,
string specificRuleCategory, string specificRuleId)
{
RuleResultData result;
result.RuleName = String.Empty;
result.ElapsedMs = -1;

Stopwatch stopper = null;
try
{
ExceptionCollection exceptionCollection = FxCopOM.Initialize();
if ((exceptionCollection != null) && (exceptionCollection.Count > 0))
{
foreach (Exception exception in exceptionCollection)
{
Console.WriteLine("* " + exception.Message);
Console.WriteLine(exception.StackTrace);
}
}

FxCopOM.Project = new Project();
FxCopOM.Project.Options.SharedProject = false;

FxCopOM.Project.Targets.AddReferenceDirectory(referencesDirectory);
TargetFile file = new TargetFile(targetLocation, FxCopOM.Project.Targets);
FxCopOM.Engines.LoadTargets(file);
FxCopOM.Project.Targets.Add(file);
RuleFile rule = new RuleFile(ruleLocation, FxCopOM.Project.RuleFiles);
rule.CheckAllChildren(true);
FxCopOM.Project.RuleFiles.Add(rule);

FxCopOM.Project.Options.Stylesheet = string.Empty;

if (!String.IsNullOrEmpty(specificRuleCategory) && !String.IsNullOrEmpty(specificRuleId))
{
rule.CheckAllChildren(false);
Rule specificRule = FxCopOM.Project.AllRules.Find(specificRuleCategory, specificRuleId);
specificRule.Enabled = true;
specificRule.Checked = true;
result.RuleName = specificRule.DisplayName;
}

stopper = Stopwatch.StartNew();
FxCopOM.Engines.Analyze(FxCopOM.Project, true);
stopper.Stop();
if (!FxCopOM.Project.AnalysisResults.AnalysisOccurred)
{
Console.WriteLine("Analysis was not performed");
}
if (FxCopOM.Project.AnalysisResults.Exceptions.Count > 0)
{
Console.WriteLine("Exceptions occurred during analysis");
}
if (FxCopOM.Project.AnalysisResults.RuleExceptions.Count > 0)
{
Console.WriteLine("Rule exceptions occurred during analysis");
}

string reportFileName = Path.Combine(@"C:\Temp", "Report" + specificRuleId);
reportFileName = Path.ChangeExtension(reportFileName, ".xml");
FileStream reportStream = new FileStream(reportFileName, FileMode.Create);
FxCopOM.Project.SaveReport(reportStream, string.Empty, false, Encoding.ASCII);
XmlDocument reportXml = new XmlDocument();
reportStream.Position = 0;
reportXml.Load(reportStream);
reportStream.Close();

int numMessages = reportXml.SelectNodes("//Message").Count;
int numExceptions = reportXml.SelectNodes("//Exception").Count;
if (numMessages > 0)
{
Console.WriteLine(numMessages + " messages");
}
if (numExceptions > 0)
{
Console.WriteLine(numExceptions + " exceptions");
}

if ((FxCopOM.Project != null) && FxCopOM.Project.ContainsBuildBreakingMessage)
{
Console.WriteLine("Build breaking message encountered");
}
}
catch (FxCopException fxCopException)
{
Console.Error.WriteLine(fxCopException.Message);
}
catch (Exception exception)
{
Console.Error.WriteLine(exception.GetType().FullName + ": " + exception.Message);
Console.Error.WriteLine(exception.StackTrace);
}
result.ElapsedMs = stopper == null ? -1 : stopper.ElapsedMilliseconds;
return result;
}
}
}

If the above code looks horrible in your browser, you can use the FxCopExecutor.zip attached to this post.  (It is zipped because the blog doesn't allow .cs extensions.  It only contains a single .cs file inside.)  Also bear in mind that this is merely a proof of concept, and not a fully working example.

To wrap us this post, there's another interesting point worth mentioning.  While I was writing this code, Roy Osherove noted that except for measuring the performance characteristics of the various rules, this POC code can be used to develop a testing framework for FxCop rules.  In the project I just mentioned, the "testing" framework for the rules includes a PERL script that executes FxCopCmd.exe over a set of testing assemblies, and compares the XML files generated to ensure they are identical.  This is obviously quite fragile, and a more detailed framework can be developed.  This seems like a relatively unexplored subject, so you are more than welcome to explore it.  :-)

Fun with the .NET GC, Paging, Generations and Write Barriers

There are plenty of resources on the .NET garbage collector, ranging from fairly basic overviews that explain what a Garbage Collector is, to detailed articles delving deep into the GC implementation, the server/workstation GC, the implementation of concurrent collections, the generational model and so forth.  You can find most of them fairly easily - here's a very brief list of references:

  • If the basic concepts don't sound familiar to you, you might want to refer to Jeffrey Richter's excellent article for a good primer (note that there are two parts).
  • Another fantastic GC resource is Maoni Stephens' blog, one of the Microsoft developers working on the GC.  The following presentation from the PDC 2005 lists the new stuff in CLR 2.0, and discusses some relatively obscure corners of the GC.
  • If the theory of the subject interests you, you can look into the Wikipedia article on garbage collection.
  • If you're looking for a definitive summary (without too many words of explanation), look into Vineet Gupta's post.

If so, why have I decided to write YAPOGC ("yet another post on garbage collection")?  To emphasize a point and to pose a question.  You might find the thoughts below a bit random, and if you feel that I should be more elaborate, feel free to state so in the comments.

Point

The point I'd like to emphasize is that the .NET Garbage Collector, in spite of being tuned for workstation and server applications, in spite of having lots of theoretical academic work behind it, in spite of some major concepts being "borrowed" from the JVM, and in spite of being developed for quite a few years now, is not suitable for every existing application.  For example, the GC allocation strategy, that requires pooling and recycling objects that reach generation 2 so that they are not released and so that full collections do not occur is not necessarily appropriate for every server application.  It is also possible that security requirements or latency considerations require that the GC doesn't run in a specified interval of time or code.  Indeed, a question I encounter quite frequently during my lectures is "how do I disable GC for X seconds?" or "how do I receive a notification that GC is about to start?".

Indeed, there is a way to receive a notification that GC is about to start.  This can be achieved by writing a custom CLR host.  The default host does not provide any managed event or other notification that a GC is happening or has just completed.  However, the hosting API allows us to implement the IHostGCManager interface, which will be sent notifications when the EE (execution engine) is starting thread suspension, ending thread suspension, and when a given thread is going to be blocked for a GC to occur.  However, the host cannot decide that it wants to stop the collection from occuring.  One way or another, there is an overwhelming amount of material on CLR hosting out there, including the excellent book by Steven Pratschner, "Customizing the .NET CLR".

A disastrous condition occurs when the managed memory (the managed heap) is starting to be paged out.  This can happen even if you have more than 2GB memory on your machine, because other processes might "misbehave" and occupy memory, so that your working set is decreased and the heap is being paged out.  This seems OK: if only unused parts of the GC heap are being paged out, this is still fairly acceptable.  After all, this is also what happens when you use a native allocator (e.g. the CRT malloc or new).

However, if the GC is forced to perform a full collection, it is required to traverse all the objects that have references from active roots.  If you are indeed using most of the memory, and some of it is paged out, the memory will have to be paged back in so that the GC gets a chance to examine it.  This is required even though it may be for a few hours until your application requires the objects that have been paged out.  However, the GC isn't aware of that fact, and still must traverse all live objects, even if they are in the page file.  This has a disastrous effect on the % of time your application spends doing collections: in fact, it's just sitting there waiting for all the memory to be paged in and out.  Maoni has related to this scenario and noted that it is being worked on.

This kind of situation will probably direct you in one of the following directions: write a custom native allocator that performs memory pooling and defragmentation, or write some kind of managed pool.

Consider the scenario where you have lots of managed objects that are fairly small, but they most contain a buffer of memory of varying size (e.g. TCP packets).  It's a bit difficult to pool these buffers in managed code, because you would have to keep track of the buffer and have a notion of "here it starts, here it ends".  It is possible to wrap these operations in a managed class, and I have encountered at least one implementation.  It is also possible to pool these buffers in unmanaged code, and provide a window into these buffers using a managed wrapper.  I've encountered an implementation of this idea as well, and it might even provide better performance because array bounds checks will be eliminated when using unsafe pointers.  Note that if you are writing a custom native pool that deals with fragmentation by returning buffers to buckets and allocating them from buckets, you might as well use the Low Fragmentation Heap that comes with Windows XP and higher (it is a flag that you pass to HeapSetInformation after you have a heap, stating that you want to use the LFH).

Note that if any kind of pooling is not an option for you, you might still benefit from using a native allocator to allocate your buffers and access them via a managed wrapper.  You will have to handle your own memory allocation and deallocation issues, but you will be freed from the concern that a full collection will trigger a huge number of page faults to bring all memory from the disk.

Question

If you are familiar with the concept of generational GC, you know that the managed heap is divided into 3 generations (gen0, gen1 and gen2).  A garbage collection does not have to traverse all 3 generations: it is possible to perform a collection only in gen0, for example.  The idea of this optimization, of course, is to minimize the amount of memory the GC has to touch during a collection, and thus minimize the time that is spent performing garbage collections.  Conceptually, the LOH also belongs in gen2 because it is collected when a full collection (a gen2 collection) occurs.

It might have occurred to you that if the GC is only looking at objects that are in gen0 while performing a gen0 collection, it might fail to note that objects in gen2 might also references those younger objects.  If the GC were forced to traverse all objects in gen2 just to make sure that they do not have an outstanding reference to a young object in gen0, the entire optimization of the generational model would be lost.

Therefore, this is not the case.  The JIT establishes write barriers that are triggered every time there is a write into a reference field in an object.  If the value of the reference (which is basically a pointer, after all) is in the ephemeral segment (which is where gen0 and gen1 reside), the write barrier knows that it must update a global data structure, called a card table, with a bit that indicates that we might have a condition where a young object is referenced by an old object.

If every old object would have a bit in the card table, the card table would have been too big.  Consider that the minimum object size is 12 bytes, and assuming that we can access about 1.6GB of memory for gen2, we can have up to 143 million objects, which would require almost 18MB of memory just for the card table to be properly synched.  Therefore, the card table contains a bit for every 128-byte range, which requires a single DWORD for every page (4KB) on x86.

When a GC occurs, at a later point, it consults the card table to see whether there were any "dirty" writes to objects in an older generation, thus not allowing objects that are still referenced from an older generation to be incorrectly sweeped.

When I first heard of this implementation (which, by the way, is quite similar in the JVM, and is a subject of theoretical research and practical analysis), the first obvious question I had was: doesn't this mean that every write to a reference field triggers some kind of write barrier code that must perform the tracking before the actual assignment is made?!

Well, yes-it-does.  Every time the JIT sees that an update to a reference field is about to be emitted, it emits a call to the write barrier thunk.  It can be examined using the SSCLI in the jithelp.asm file (for the x86 implementation).  The code ensures that the address being written to (the destination) is in the GC heap, and then checks whether the address being written (the source) is within the ephemeral segment.

        cmp     rg, g_ephemeral_low
        jb      WriteBarrier_NotInEphemeral_&rg
        cmp     rg, g_ephemeral_high
        jae     WriteBarrier_NotInEphemeral_&rg
        shr     edx, 10
        add     edx, [g_card_table]

Note that whenever the ephemeral segment is moved, the JIT thunk must be updated to contain the correct segment low and high addresses.  There is actual assembly-patching code there, that can be found in jitinterfacex86.cpp, in the StompWriteBarrierResize function.

To make a long (and getting longer) story short, this seems very inefficient.  Every reference write must be patched with this thunk, and while it's not exceptionally expensive, it's much costier than the simple MOV instruction you might have expected.  But before criticizing this solution, is there an alternative at all?

It seems that there is.  The Windows VirtualAlloc function allows us to specify that the allocated region should be "write watched" (using the MEM_WRITE_WATCH flag) so that further writes to it will be recorded and can then be retrieved using the GetWriteWatch API.  The granularity of the watch, naturally, is in pages, so it requires more work on the garbage collector's side after determining that the area has been written to, but it seems that the performance gains (from not using the JIT-generated write barrier thunk) should be more significant.

The question I was talking about, then, is why doesn't the .NET GC use this built-in memory watch mechanism, supplied by the Win32 API?  The following blog entry notes this possibility (in section 3.2.4), but does not elaborate regarding the reasons behind the particular choice made in .NET.  I have several speculations (which are just that - speculations) and am still pursuing a more definitive answer:

  • The aforementioned API is not supported on Windows 95 (which is, perhaps, not so surprising), but it is not supported on Windows 2000 as well.  This would limit the .NET framework's compatibility with these platforms (although in those particular cases the JIT-thunk approach could be adopted).
  • The aforementioned API is Windows-specific and does not provide any compatibility with other platforms.  The JIT-generated write barrier is generic and theoretically can work on any platform.
  • The performance penalty of using the MEM_WRITE_WATCH flag for writing to a region of memory is bigger than the thunk generated by the JIT.  Note that a very primitive measurement I've performed indicates an 8% performance penalty when writing to memory protected by a write watch as opposed to writing to memory that is not protected by a write watch (don't quote me on this :-)).

First Post (Introduction)

Hi everyone.  My name is Sasha Goldshtein, and I work for the Sela Group as an instructor and consultant.  My primary interests are in the .NET area, although my history as a developer includes gory things like C and digging into Win32, COM and ATL, MFC, STL and stuff of that ilk.  :-)  In the lecturing part of my work, I have recently been focusing on .NET performance and internals (and this is what my (real) first post will be about...).

I really hope to keep this blog updated with interesting things that I come across.  If there will be a topic you'd like me to write about, feel free to contact me - it might be already under the umbrella of my research interests, or might be a new thing that I'd like to work on.

One more thing I forgot to add: the name of the blog is a geeky tribute to the all-too-famous quote from the Zero Wing game.  You can find the full transcript on the Wikipedia article.  Other honorable mentions include "You have no chance to survive make your time" and "Somebody set up us the bomb".