SIMD-Optimized C++ Code in Visual Studio 11

October 17, 2011


The C++ compiler in Visual Studio 11 has another neat optimization feature up its sleeve. Unlike intrusive features, such as running code on the GPU using the C++ AMP extensions, this one requires no additional compilation switches and no changes – not even the slightest – to the code.

The new compiler will use SIMD (Single Instruction, Multiple Data) instructions from the SSE/SSE2 and AVX families to "parallelize" loops. This is not standard thread-level parallelism, which runs different iterations of the loop on separate threads; rather, it is the processor’s inherent ability to perform the same operation on multiple data elements packed into a single wide register.

The following trivial example illustrates the benefits of this optimization. Suppose you want to sum two vectors of floating-point numbers, element-by-element. The following C/C++ loop performs this task:

    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
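
For reference, here is the loop embedded in a minimal complete program (my reconstruction, not necessarily the exact benchmark; the array size of 1,000 floats is inferred from the 0FA0h loop bound in the disassembly below):

    #include <cstdio>

    int wmain()
    {
        const int N = 1000;       // 0FA0h = 4,000 bytes = 1,000 floats
        float A[N], B[N], C[N];   // on the stack, as the EBP-relative addressing suggests

        // Give the inputs some values so the addition isn't optimized away.
        for (int i = 0; i < N; ++i)
        {
            A[i] = static_cast<float>(i);
            B[i] = static_cast<float>(N - i);
        }

        // The loop under discussion.
        for (int i = 0; i < N; ++i)
            C[i] = A[i] + B[i];

        // Consume the result so the compiler can't eliminate the work.
        printf("%f\n", C[0]);
        return 0;
    }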

The current VC++ compiler compiles this loop to the following 32-bit code with optimizations:

013E105A  xor         eax,eax 
013E105C  lea         esp,[esp] 
013E1060  fld         dword ptr B[eax] 
013E1067  add         eax,28h 
013E106A  fadd        dword ptr [ebp+eax-0FCCh] 
013E1071  fstp        dword ptr [ebp+eax-2F0Ch] 
013E1078  fld         dword ptr [ebp+eax-0FC8h] 
013E107F  fadd        dword ptr [ebp+eax-1F68h] 
013E1086  fstp        dword ptr [ebp+eax-2F08h] 
013E108D  fld         dword ptr [ebp+eax-0FC4h] 
013E1094  fadd        dword ptr [ebp+eax-1F64h] 
013E109B  fstp        dword ptr [ebp+eax-2F04h] 
013E10A2  fld         dword ptr [ebp+eax-0FC0h] 
013E10A9  fadd        dword ptr [ebp+eax-1F60h] 
013E10B0  fstp        dword ptr [ebp+eax-2F00h] 
013E10B7  fld         dword ptr [ebp+eax-0FBCh] 
013E10BE  fadd        dword ptr [ebp+eax-1F5Ch] 
013E10C5  fstp        dword ptr [ebp+eax-2EFCh] 
013E10CC  fld         dword ptr [ebp+eax-0FB8h] 
013E10D3  fadd        dword ptr [ebp+eax-1F58h] 
013E10DA  fstp        dword ptr [ebp+eax-2EF8h] 
013E10E1  fld         dword ptr [ebp+eax-0FB4h] 
013E10E8  fadd        dword ptr [ebp+eax-1F54h] 
013E10EF  fstp        dword ptr [ebp+eax-2EF4h] 
013E10F6  fld         dword ptr [ebp+eax-0FB0h] 
013E10FD  fadd        dword ptr [ebp+eax-1F50h] 
013E1104  fstp        dword ptr [ebp+eax-2EF0h] 
013E110B  fld         dword ptr [ebp+eax-0FACh] 
013E1112  fadd        dword ptr [ebp+eax-1F4Ch] 
013E1119  fstp        dword ptr [ebp+eax-2EECh] 
013E1120  fld         dword ptr [ebp+eax-0FA8h] 
013E1127  fadd        dword ptr [ebp+eax-1F48h] 
013E112E  fstp        dword ptr i[eax] 
013E1135  cmp         eax,0FA0h 
013E113A  jb          wmain+60h (013E1060h) 

Note the aggressive loop unrolling employed by the compiler – each iteration of the compiled loop performs ten element additions (ten fld/fadd/fstp sequences).

The new VC++ compiler compiles the loop to the following 32-bit code with optimizations:

00381041  xor         eax,eax 
00381043  jmp         wmain+50h (0381050h) 
00381045  lea         esp,[esp] 
0038104C  lea         esp,[esp] 
00381050  movups      xmm1,xmmword ptr B[eax] 
00381058  movups      xmm0,xmmword ptr A[eax] 
00381060  add         eax,10h 
00381063  addps       xmm1,xmm0 
00381066  movups      xmmword ptr [ebp+eax-2EF4h],xmm1 
0038106E  cmp         eax,0FA0h 
00381073  jb          wmain+50h (0381050h) 

This time, each iteration of the loop processes four elements at once, using the SIMD instructions MOVUPS and ADDPS. The first, MOVUPS, moves four packed floating-point values between memory and an XMM register (in either direction, with no alignment requirement). The second, ADDPS, adds the four floating-point values packed in one register to the four packed in another, element by element.

What’s the performance difference? On my Intel i7-860 processor, there is exactly a 2x difference between the code generated by the two compiler toolsets.*
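
The comparison is a plain wall-clock measurement of the same source compiled with each toolset; a minimal sketch of such a harness (the helper name and methodology are mine, not necessarily what was used here):

    #include <ctime>

    // Time `reps` repetitions of the element-wise sum, so the elapsed
    // time is large enough for clock() to measure meaningfully.
    double time_sum(const float* A, const float* B, float* C, int N, int reps)
    {
        clock_t t0 = clock();
        for (int rep = 0; rep < reps; ++rep)
            for (int i = 0; i < N; ++i)
                C[i] = A[i] + B[i];
        return static_cast<double>(clock() - t0) / CLOCKS_PER_SEC;
    }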

The loop above is a silly example, but it shows the potential of automatic vectorization. Until now, using SIMD instructions from C++ programs meant dropping down to low-level intrinsics such as _mm_add_ps and low-level types such as __m128. I’m willing to bet that most C++ developers have never considered using these intrinsics in their programs. That’s why this is an important feature, even though it is just a tiny step in the right direction.
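
For comparison, here is roughly what the hand-rolled intrinsics version of the same loop looks like – the code the auto-vectorizer now spares you from writing (a sketch; the function name is mine):

    #include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...

    // Assumes N is a multiple of 4; otherwise a scalar loop has to
    // finish the last few elements.
    void add_floats_sse(const float* A, const float* B, float* C, int N)
    {
        for (int i = 0; i < N; i += 4)
        {
            __m128 a = _mm_loadu_ps(&A[i]);           // MOVUPS: unaligned load of 4 floats
            __m128 b = _mm_loadu_ps(&B[i]);
            _mm_storeu_ps(&C[i], _mm_add_ps(a, b));   // ADDPS, then a MOVUPS store
        }
    }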


* It’s worth mentioning that the VC++11 compiler can also produce AVX instructions (operating on the 256-bit YMM registers), which should be even faster, but this is not the default – you have to opt in with the /arch:AVX switch. My first-generation i7 processor doesn’t support them – feel free to check them out on a Sandy Bridge processor and let me know if it helps.
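
If you do have a Sandy Bridge machine to experiment on, a hand-written AVX version would look along these lines (a sketch; the function name is mine, and it is best built with /arch:AVX):

    #include <immintrin.h>   // AVX intrinsics: __m256, _mm256_add_ps, ...

    // Eight floats per iteration in a 256-bit YMM register. Assumes N is
    // a multiple of 8; on a CPU without AVX support this code raises an
    // illegal-instruction exception.
    void add_floats_avx(const float* A, const float* B, float* C, int N)
    {
        for (int i = 0; i < N; i += 8)
        {
            __m256 a = _mm256_loadu_ps(&A[i]);            // VMOVUPS: unaligned 256-bit load
            __m256 b = _mm256_loadu_ps(&B[i]);
            _mm256_storeu_ps(&C[i], _mm256_add_ps(a, b)); // VADDPS: 8 additions at once
        }
    }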

6 comments

  1. Cat, October 17, 2011 at 8:10 PM

    So the VC2010 compiler, when told to use the SSE2 instruction set, does not auto-vectorize? Which SSE/SSE2 instructions does it automatically employ?

  2. Josh, October 19, 2011 at 2:49 AM

    Will apps compiled with the SIMD-optimized code still run on non-SIMD-capable CPUs? (Intel’s compiler allows for this via separate copies of the functions, AFAIK.)

    When writing desktop apps, it can be hard to control what spec of CPU the software gets run on – e.g. a VIA C3 netbook.

  3. Sasha Goldshtein, October 19, 2011 at 12:22 PM

    @Cat: I checked, and it uses the SSE/SSE2 instructions to move data around (similar to loop unrolling), but it *doesn’t* use the parallel instructions (e.g. uses ADDSS instead of ADDPS).

    @Josh: No, you will get an illegal instruction exception. There are some functions in the CRT that detect the instruction set and act accordingly, but VS doesn’t generate anything like that for user code. However, I think it’s pretty safe to assume SSE and SSE2 nowadays – SSE has been supported since the Pentium III, and SSE2 since the Pentium 4.
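
    For readers who do need a runtime check, a minimal sketch using the MSVC __cpuid intrinsic (the bit position comes from Intel’s CPUID documentation):

        #include <intrin.h>

        // Returns true if the processor reports SSE2 support.
        bool has_sse2()
        {
            int info[4];         // EAX, EBX, ECX, EDX after CPUID
            __cpuid(info, 1);    // function 1: processor feature flags
            return (info[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
        }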

  4. Michael, November 30, 2011 at 4:51 AM

    Will it ever use aligned move instructions (e.g. MOVAPS) if the data has been declared aligned?

  5. Gabest, December 5, 2011 at 10:51 PM

    Cat: SSE mode also replaces FPU instructions, yielding basically the second code snippet, but with scalar SSE code.

    “most C++ developers have never considered using these intrinsics in their programs.”

    I’d argue with that. When performance really matters developers will try everything they can to improve their code. Until x64 became popular it was either assembly or inline assembly. Those who wanted to support 32/64-bit at the same time, without much pain, quickly switched to SSE intrinsics. Wrap them into some class, overload operators, and it becomes fully transparent.

    Intel’s compiler supported auto-vectorization ages ago, though I never liked their style, too many temporary variables. MSVC was always smarter keeping things in registers.
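
    (To illustrate the wrapper approach Gabest describes – my sketch, not his code:)

        #include <xmmintrin.h>

        // Four packed floats behind a natural operator+.
        struct vec4
        {
            __m128 v;
            explicit vec4(__m128 x) : v(x) {}
            explicit vec4(const float* p) : v(_mm_loadu_ps(p)) {}
            void store(float* p) const { _mm_storeu_ps(p, v); }
        };

        inline vec4 operator+(const vec4& a, const vec4& b)
        {
            return vec4(_mm_add_ps(a.v, b.v));   // compiles straight to ADDPS
        }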

  6. Michael, December 8, 2011 at 6:33 PM

    Great to hear that this feature will finally be addressed, given that Intel and GCC have supported such things for quite a few years now. However, (optionally) generating multiple code paths would be really handy, since otherwise it will not be possible to make use of AVX for quite a long time to come (non-AVX computers are likely to be around for another 5-10 years, I guess)…
