DCSIMG
June 2012 - Posts - All Your Base Are Belong To Us

All Your Base Are Belong To Us

Mostly .NET internals and other kinds of gory details

June 2012 - Posts

Slides from the Third Jerusalem .NET/C++ User Group Meeting

On Sunday we hosted the third meeting of the Jerusalem .NET/C++ User Group. My presentation on generics in Java, .NET and C++ is below, and Noam’s presentation on the Windows Azure new features can be found in the Windows Azure Training Kit (June 2012 refresh), which I strongly recommend.

Thanks for coming, and see you at the next meeting!

Micro-Benchmarking Done Wrong, And For The Wrong Reasons

This is a short excerpt from Chapter 2 (Performance Measurement) of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.

The purpose is to determine “which is faster”—using the is keyword and then casting to the desired type, or using the as keyword and relying on the result. Here’s a test class for this comparison I’ve seen lately:

//Test class
class Employee {
  public void Work() {}
}

//Fragment 1 – casting safely and then checking for null
static void Fragment1(object obj) {
  Employee emp = obj as Employee;
  if (emp != null) {
    emp.Work();
  }
}

//Fragment 2 – first checking the type and then casting
static void Fragment2(object obj) {
  if (obj is Employee) {
    Employee emp = obj as Employee;
    emp.Work();
  }
}

A rudimentary benchmarking framework might go along the following lines:

static void Main() {
  object obj = new Employee();
  Stopwatch sw = Stopwatch.StartNew();
  for (int i = 0; i < 500; i++) {
    Fragment1(obj);
  }
  Console.WriteLine(sw.ElapsedTicks);
  sw = Stopwatch.StartNew();
  for (int i = 0; i < 500; i++) {
    Fragment2(obj);
  }
  Console.WriteLine(sw.ElapsedTicks);
}

This is not a convincing microbenchmark, although the results are fairly reproducible. More often than not, the output is 4 ticks for the first loop and 200-400 ticks for the second loop. This might lead to the conclusion that the first fragment is 50-100 times faster. However, there are significant errors in this measurement and the conclusion that stems from it:

  1. The loop runs only once and 500 iterations are not enough to reach any meaningful conclusions—it takes a negligible amount of time to run the whole benchmark, so it can be affected by many environmental factors.
  2. No effort was made to prevent optimization, so the JIT compiler may have inlined and discarded both measurement loops completely.
  3. The Fragment1 and Fragment2 methods measure not only the cost of the is and as keywords, but also the cost of a method invocation (to the FragmentN method itself!). It may be the cast that invoking the method is significantly more expensive than the rest of the work.

Improving upon these problems, the following microbenchmark more closely depicts the actual costs of both operations:

class Employee {
  //Prevent inlining this method
  [MethodImpl(MethodImplOptions.NoInlining)]
  public void Work() {}
}

static void Measure(object obj) {
  const int OUTER_ITERATIONS = 10;
  const int INNER_ITERATIONS = 100000000;
  for (int i = 0; i < OUTER_ITERATIONS; ++i) {
    Stopwatch sw = Stopwatch.StartNew();
    for (int j = 0; j < INNER_ITERATIONS; ++j) {
      Employee emp = obj as Employee;
      if (emp != null)
        emp.Work();
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
  }
  for (int i = 0; i < OUTER_ITERATIONS; ++i) {
    Stopwatch sw = Stopwatch.StartNew();
    for (int j = 0; j < INNER_ITERATIONS; ++j) {
      if (obj is Employee) {
        Employee emp = obj as Employee;
        emp.Work();
      }
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
  }
}

The results on desktop (after discarding the first iteration) were around 410ms for the first loop and 440ms for the second loop, a reliable and reproducible performance difference, which might have you convinced that indeed, it’s more efficient to use just the as keyword for casts and checks.

However, the riddles aren’t over yet. If we add the virtual modifier to the Work method, the performance difference disappears completely, even if we increase the number of iterations. This cannot be explained by the virtues or maladies of our micro-benchmarking framework—it is a result from the problem domain. There is no way to understand this behavior without going to the assembly language level and inspecting the loop generated by the JIT compiler in both cases. First, before the virtual modifier:

; Disassembled loop body – the first loop
mov edx,ebx
mov ecx,163780h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,eax
call dword ptr ds:[163774h] ;Employee.Work()
WRONG_TYPE:

; Disassembled loop body – the second loop
mov edx,ebx
mov ecx,163780h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,ebx
cmp dword ptr [ecx],ecx
call dword ptr ds:[163774h] ;Employee.Work()
WRONG_TYPE:

The instruction sequence emitted by the JIT compiler to call a non-virtual method and to call a virtual method was discussed in one of my posts more than five years ago. When calling a non-virtual method, the JIT compiler must emit an instruction that makes sure we are not making a method call on a null reference. The CMP instruction in the second loop serves that task. In the first loop, the JIT compiler is smart enough to optimize this check away, because immediately prior to the call, there is a null reference check on the result of the cast (if (emp != null) ...). In the second loop, the JIT compiler’s optimization heuristics are not sufficient to optimize the check away (although it would have been just as safe), and this extra instruction is responsible for the extra 7-8% of performance overhead.

After adding the virtual modifier, however, the JIT compiler generates exactly the same code in both loop bodies:

; Disassembled loop body – both cases
mov edx,ebx
mov ecx,1A3794h ;MT of Employee
call clr!JIT_IsInstanceOfClass
test eax,eax
je WRONG_TYPE
mov ecx,eax
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+28h]
call dword ptr [eax+10h]
WRONG_TYPE:

The reason is that when invoking a virtual method, there’s no need to perform a null reference check explicitly—it is inherent in the method dispatch sequence. When the loop bodies are identical, so are the timing results.

This small example should not dissuade you from ever writing your own micro-benchmarks. However, the worst performance optimization is one that is based on incorrect measurements; unfortunately, manual benchmarking often leads into this trap.


I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn

Sessions From SELA C++ Conference: “The Style of C++ 11” and “The Future of C++”

Thanks for coming to my talks today at SELA’s C++ conference! It’s been a great audience and there was lots of interest in the new standard and what’s still coming ahead. There are still two more conference days ahead of us, and it’s great C++ born again!

You can view my two talks below.


I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn

Visual Studio 2012 C++ Auto-Parallelizer

As you might have gathered from some scarce reports on the Web and the initial list of new features in Visual Studio 2012, the new C++ compiler is now capable of automatically vectorizing loop bodies—a feature I’ve already covered here, and also automatically parallelizing them using multiple threads.

Here’s an example. Consider the classic prime number calculation loop, designed to count the number of primes in a given range:

__declspec(noinline) bool is_prime(int n) {
    for (int x = 2; x < n; ++x) {
        if (n % x == 0 && n != x) return false;
    }
    return true;
}

LONG count = 0;
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    ++count;
  }
}
printf(“Count = %d"\n”, count);

This is a classic, ripe candidate for parallelization—although we need to be a little careful with the shared count variable. With N=100000 the loop completes in ~1600ms on my desktop; perhaps the compiler can make it faster automatically.

We go ahead and enable the /Qpar switch in the project properties. This allows the C++ compiler to perform automatic parallelization, but it still sometimes requires an explicit hint regarding the loops that might benefit from parallelization.

image

This hint is given in the form of a #pragma, indicating also how many threads you recommend that the runtime should use:

#pragma loop(hint_parallel(4))
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    ++count;
  }
}

This still takes ~1600ms on my machine, and no parallelization is visible. What’s wrong? The shared variable, of course. The compiler notices that it would be unsafe to parallelize the loop body and refrains from doing it. Changing the loop to…

#pragma loop(hint_parallel(4))
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    InterlockedIncrement(&count);
  }
}

…suddenly works, and brings down the time to ~450ms. Here are the four threads and a representative call stack, showing that the underlying engine is the same as in OpenMP (with its #pragma omp directives introduced in Visual Studio 2005!):

>Debug.ListCallStack
Index  Function
--------------------------------------------------------------------------------
*1      ParallelizingCompilerCpp.exe!wmain$par$1()
2      vcomp110.dll!_vcomp::C2VectParallelRegion::serialCallback(_vcomp::C2VectParallelRegion * c2pr, int)
3      vcomp110.dll!_vcomp::C2VectParallelRegion::parallelCallback_Guided(_vcomp::C2VectParallelRegion * c2pr=0x002af840)
4      vcomp110.dll!_vcomp::fork_helper_wrapper(void (...) *)
5      vcomp110.dll!_vcomp::ParallelRegion::HandlerThreadFunc(void * context=0x002af7dc, unsigned long index=0x00000000)
6      vcomp110.dll!InvokeThreadTeam(_THREAD_TEAM * ptm=0x002dddd8, void (void *, unsigned long) * pvContext=0x002af7dc, void *)
7      vcomp110.dll!_vcomp_fork(int if_test=0x00000001, int arg_count=0x00000001, void (...) * funclet=0x0f941d54, ...)
8      vcomp110.dll!_vcomp::C2VectParallelRegion::Execute()
9      vcomp110.dll!C2VectParallel(int start=0x00000003, int end=0x000186a0, int stride=0x00000001, int inclusive=0x00000000, unsigned int numChunks=0x00000004, int schedule=0x00000003, void (int, int, ...) * func=0x012018d0, int argcnt, ...)
10     Demo.exe!wmain(int argc=0x00000001, wchar_t * * argv=0x002dc178)
11     Demo.exe!__tmainCRTStartup()
12     kernel32.dll!@BaseThreadInitThunk@12()
13     ntdll.dll!___RtlUserThreadStart@8()
14     ntdll.dll!__RtlUserThreadStart@8()

>Debug.ListThreads
Index Id     Name                           Location
--------------------------------------------------------------------------------
*1     2340   Main Thread                    wmain$par$1
2     8084   vcomp110.dll!_vcomp::PersistentThreadFunc _RtlUserThreadStart@8
3     2444   vcomp110.dll!_vcomp::PersistentThreadFunc @RtlpAllocateHeap@24
4     6764   vcomp110.dll!_vcomp::PersistentThreadFunc _RtlUserThreadStart@8
>

The documentation now is much better than it was in the Beta, and you can find online more details about the /Qpar compiler switch and the parallelization #pragmas.


I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn

Third Meeting of the Jerusalem .NET/C++ User Group

The third meeting of the Jerusalem .NET/C++ User Group is going to be on June 24, and we’ll be shifting our focus slightly to the .NET side. We’re having Noam Sheffer talk about all the cool new features in Windows Azure with a bunch of hands-on demos, and then yours truly will talk about the implementation of generics in .NET, C++ and Java. If you’re a developer in the Jerusalem area, you really should come.

The full agenda is the following:

18:00-18:15 - Networking and Refreshments
18:15-19:15 - Hands-On With The New Windows Azure (Noam Sheffer)
19:15-19:30 - Networking and Refreshments
19:30-20:15 - Generics in .NET, C++ and Java (Sasha Goldshtein)

Hands-On With The New Windows Azure
Windows Azure, Microsoft's cloud offering, has just undergone a significant facelift. Azure can now be used as an IaaS offering - you can host your own Windows or Linux virtual machine in the cloud. Azure supports PHP, Python, Node.js and many other programming languages and environments in its cloud services. Finally, Azure Websites, Microsoft's web hosting solution, supports seamless updating of your cloud web app by using push from TFS or git. In this hands-on session we'll demo as many of these features as possible.

Generics in .NET, C++ and Java
Generics are a first-class programming language construct these days, augmenting "standard" object-oriented practices. Different languages implement generics in very different ways, which affect the performance of generic code and its expressiveness. In this session we'll take a look at the implementation of generics in .NET, C++ and Java, understand the performance implications of "true" templates versus run-time generics, and even take a quick detour into the world of template metaprogramming.

Choosing a Collection Is a Matter of Cache, Too

This is a short excerpt from Chapter 5 (Collections and Generics) of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.

Choosing the right collection is not only about its performance considerations. The way data are laid out in memory is often more critical to CPU-bound applications than any other criterion, and collections affect this layout greatly. The main factor behind carefully examining data layout in memory is the CPU cache.

Modern systems ship with large main memories. Eight GB of memory is fairly standard on a mid-range workstation or a gaming laptop. Fast DDR3 SDRAM memory units are capable of ∼15ns memory access latencies, and theoretical transfer rates of ∼15GB/s. Fast processors, on the other hand, can issue billions of instructions per second; theoretically speaking, stalling for 15ns while waiting for memory access can prevent the execution of dozens (and sometimes hundreds) of CPU instructions. Stalling for memory access is the phenomenon known as hitting the memory wall.

To distance applications from this wall, modern processors are equipped with several levels of cache memory, which has different internal characteristics than main memory and tends to be very expensive and relatively small. For example, my desktop’s Intel i7-860 processor ships with three cache levels:

  • Level 1 cache for program instructions, 32KB, one for each core (total of 4 caches).
  • Level 1 cache for data, 32KB, one for each core (total of 4 caches).
  • Level 2 cache for data, 256KB, one for each core (total of 4 caches).
  • Level 3 cache for data, 8MB, shared (total of 1 cache).

image

When the processor attempts to access a memory address, it begins by checking whether the data is already in its L1 cache. If it is, the memory access is satisfied from the cache, which takes ∼5 CPU cycles (this is called a cache hit). If it isn’t, the L2 cache is checked; satisfying the access from the L2 cache takes ∼10 cycles. Similarly, satisfying the access from L3 cache takes ∼40 cycles. Finally, if the data isn’t in any of the cache levels, the processor will stall for the system’s main memory (this is called a cache miss). When the processor accesses main memory, it reads from it not a single byte or word, but a cache line, which on modern systems consists of 32 or 64 bytes. Accessing any word on the same cache line will not involve another cache miss until the line is evicted from the cache.

To calculate cache access latencies on actual hardware, you can consult hardware specs or use a live benchmarking utility like CPUID’s cache latency computation tool. Here is an example of its output on my desktop PC:

Cache latency computation, ver 1.0
www.cpuid.com

Computing ...
<diagnostic output skipped>

3 cache levels detected
Level 1         size = 32Kb     latency = 3 cycles
Level 2         size = 256Kb    latency = 10 cycles
Level 3         size = 8192Kb   latency = 39 cycles

Although this description does not do justice to the true hardware complexities of how SRAM caches and DRAM memories operate, it provides enough food for thought and discussion of how our high-level software algorithms can be affected by data layout in memory. We now consider a simple example that involves a single core’s cache. (There are interesting cases to consider where multiple caches from different cores are involved—Chapter 6 in the book covers these in more detail.)

Suppose the task at hand is traversing a large collection of integers and performing some aggregation on them, such as finding their sum or average. Below are two alternatives; one accesses the data from a LinkedList<int> and the other from an array of integers (int[]), two of the built-in .NET collections.

LinkedList<int> numbers = new LinkedList<int>(
    Enumerable.Range(0, 20000000));
int sum = 0;
for (LinkedListNode<int> curr = numbers.First;
     curr != null; curr = curr.Next)
{
  sum += curr.Value;
}

int[] numbers = Enumerable.Range(0, 20000000).ToArray();
int sum = 0;
for (int curr = 0; curr < numbers.Length; ++curr)
{
  sum += numbers[curr];
}

The second version of the code runs 2× faster than the first on my desktop. This is a non-negligible difference, and if you only consider the number of CPU instructions issued, you might not be convinced that there should be a difference at all. After all, traversing a linked list involves moving from one node to the next, and traversing an array involves incrementing an index into the array. (In fact, without JIT optimizations, accessing an array element would require also a range check to make sure that the index is within the array’s bounds.)

  ; possible x86 assembly for the first loop
  ; assume ‘sum’ is in EAX and ‘numbers’ is in ECX
  xor eax, eax
  mov ecx, dword ptr [ecx+4] ; curr = numbers.First
  test ecx, ecx
  jz LOOP_END
LOOP_BEGIN:
  add eax, dword ptr [ecx+10] ; sum += curr.Value
  mov ecx, dword ptr [ecx+8] ; curr = curr.Next
  test ecx, ecx
  jnz LOOP_BEGIN ; total of 4 instructions per iteration
LOOP_END:
  ...

  ; possible x86 assembly for the second loop
  ; assume ‘sum’ is in EAX and ‘numbers’ is in ECX
  mov edi, dword ptr [ecx+4] ; numbers.Length
  test edi, edi
  jz LOOP_END
  xor edx, edx ; loop index
LOOP_BEGIN:
  add eax, dword ptr [ecx+edx*4+8] ; sum += numbers[i]
  inc edx
  cmp esi, edx
  jg LOOP_BEGIN ; total of 4 instructions per iteration
LOOP_END:
  ...

Given this code generation for both loops (and barring optimizations such as using SIMD instructions to traverse the array, which is contiguous in memory), it is hard to explain the significant performance difference by inspecting only the instructions executed. Indeed, we must analyze the memory access patterns of this code to reach any acceptable conclusions.

In both loops, each integer is accessed only once, and it would seem that cache considerations are not as critical because there is no reusable data to benefit from cache hits. Still, the way the data is laid out in memory affects greatly the performance of this program — not because the data is reused, but because of the way it is brought into memory. When accessing the array elements, a cache miss at the beginning of a cache line brings into the cache 16 consecutive integers (cache line = 64 bytes = 16 integers). Because array access is sequential, the next 15 integers are now in the cache and can be accessed without a cache miss. This is an almost-ideal scenario, with a 1:16 cache miss ratio. On the other hand, when accessing linked list elements, a cache miss at the beginning of a cache line can bring into cache at most 3 consecutive linked list nodes, a 1:4 cache miss ratio! (A node consists of a back pointer, forward pointer, and integer datum, which occupy 12 bytes on a 32-bit system; the reference type header brings the tally to 20 bytes per node.)

The much higher cache miss ratio is the reason for most of the performance difference between the two pieces of code we tested above. Furthermore, we are assuming the ideal scenario in which all linked list nodes are positioned sequentially in memory, which would be the case only if they were allocated simultaneously with no other allocations taking place, and is fairly unlikely. Had the linked list nodes been distributed less ideally in memory, the cache miss ratio would have been even higher, and the performance even poorer.

More practical examples—and everything you will ever want to know about CPU caches and performance considerations of system memory—are covered in Ulrich Drepper’s excellent essay, “What Every Programmer Should Know About Memory”. I recommend re-reading this paper at least once a year :-)


I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn

SELA C++ Conference Is Near!

I just realized that I completely forgot to tell you about the C++ conference we’ve having in the middle of June. Basically, we realized that there’s a lot of latent interest in C++ – what with the “C++ renaissance” – that we are not tapping into with our SDP conferences. Even though the last SDP had sessions on C++11 and debugging C++ applications, we felt there was room for more.

image

At the C++ conference on June 18-20, we have two parts, as usual: a day of breakout sessions at the Crowne Plaza hotel in Tel-Aviv, and two days packed with full-day tutorials. Yours truly is delivering two breakout sessions—a keynote on C++11 and a session on the future of C++; and a tutorial day on managed-native interoperability. Other nuggets include a C++11 deep dive, a session for performance tuning C++ apps (which has sold out quite a while ago!), and others.

I’ve been working on my sessions most of last week, and can promise you they’re going to be a lot of fun! In the breakout sessions I’m planning to cram as many examples of modern and future C++ as I can, and in the interop workshop there will be four labs and plenty of cool demos, including my recently posted SIMD example.


I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn