C# Vectorization with Microsoft.Bcl.Simd

April 22, 2014


tl;dr

A couple of weeks ago at Build, the .NET/CLR team announced a preview release of a library, Microsoft.Bcl.Simd, that exposes a set of JIT intrinsics on top of CPU vector instructions (a.k.a. SIMD). This library relies on RyuJIT, another preview technology, which is intended to replace the existing JIT compiler. When using Microsoft.Bcl.Simd, you program against a vector abstraction that is then translated at runtime to the appropriate SIMD instructions your processor supports.

In this post, I’d like to take a look at what exactly this SIMD support is about, and show you some examples of what kind of speedups and optimizations can be expected from using it.

Introduction to SIMD

Let's start with the basics. What is SIMD and why should I care? It turns out that modern processors can perform vector operations, which operate on more than a single word of data at a time. For example, SSE-capable processors have vector instructions that add two vectors of four floating-point numbers each in the same time (latency and throughput) as adding two individual floating-point numbers. Vector operations are pretty incredible, and if your compiler can take advantage of them, you can potentially gain 4-8x speedups for certain kinds of operations on modern hardware.

For example, consider the following method that performs pointwise addition of two floating-point arrays a and b, in place:

void AddPointwise(float[] a, float[] b) {
  int len = Math.Min(a.Length, b.Length);
  for (int i = 0; i < len; ++i) {
    a[i] += b[i];
  }
}

The scalar (non-vectorized) version of this method might look as follows in x86 assembly instructions:

AddPointwise:
  xor ebx, ebx                      ; i = 0
  mov eax, dword ptr [ecx+4]        ; a.Length
  cmp eax, dword ptr [edx+4]        ; b.Length
  jle loop
  mov eax, dword ptr [edx+4]
loop:
  cmp ebx, eax                      ; have we reached the end of the array?
  jge end
  movss xmm0, dword ptr [ecx+8+4*ebx] ; a[i]
  addss xmm0, dword ptr [edx+8+4*ebx] ; a[i] + b[i]
  movss dword ptr [ecx+8+4*ebx], xmm0 ; a[i] += b[i]
  inc ebx                           ; ++i
  jmp loop
end:
  ret

Each loop iteration performs a single ADDSS (scalar single-precision add) instruction, which has a latency of 3 cycles and a throughput of one instruction per cycle on current Intel Core processors. The vectorized version of this method might look as follows (I’m assuming for simplicity that the size of the arrays is divisible by 4):

AddPointwise:
  xor ebx, ebx                          ; i = 0
  mov eax, dword ptr [ecx+4]            ; a.Length
  cmp eax, dword ptr [edx+4]            ; b.Length
  jle loop
  mov eax, dword ptr [edx+4]
loop:
  cmp ebx, eax                          ; have we reached the end of the array?
  jge end
  mov edi, ebx
  shl edi, 2                            ; EDI = 4*i (byte offset into the arrays)
  movups xmm0, xmmword ptr [edx+8+edi]  ; b[i], ..., b[i+3]
  movups xmm1, xmmword ptr [ecx+8+edi]  ; a[i], ..., a[i+3]
  addps xmm1, xmm0                      ; a[i], ..., a[i+3] += b[i], ..., b[i+3]
  movups xmmword ptr [ecx+8+edi], xmm1  ; store the sums back into a
  add ebx, 4                            ; i += 4
  jmp loop
end:
  ret

In this version, every loop iteration performs a single ADDPS (packed single-precision add) instruction, which has the same latency and throughput as the scalar ADDSS. The result could be a 4x speedup, because we’re issuing 4x fewer instructions, each of which costs the same as its scalar counterpart.

Taxonomy of vectorizing compilers/runtimes

The approach to vectorization most familiar to Windows developers is probably C/C++’s. For years, Visual C++ has provided a set of intrinsic data types and functions that are recognized by the compiler and translated to vector instructions. For example, the preceding method could be written as follows using the SSE vector intrinsics:

void AddPointwise(float* a, float* b, int len) {
  __m128* va = reinterpret_cast<__m128*>(a);
  __m128* vb = reinterpret_cast<__m128*>(b);
  for (int i = 0; i < len/4; ++i) {
    *va = _mm_add_ps(*va, *vb);
    ++va;
    ++vb;
  }
}

This translates directly to the following x64 instructions:

; parameters: RCX = a, RDX = b, R8D = len
  test        r8d,r8d        ; len <= 0?
  jle         bail_out
  dec         r8d            ; --len
  sub         rdx,rcx        ; b = (char*)b - (char*)a
  shr         r8d,2          ; len /= 4
  inc         r8d            ; ++len
loop:
  movaps      xmm0,xmmword ptr [rdx+rcx]  ; xmm0 = b["i"]
  add         rcx,10h                     ; "++i"
  addps       xmm0,xmmword ptr [rcx-10h]  ; xmm0 += a["i-1"]
  movaps      xmmword ptr [rcx-10h],xmm0  ; a["i-1"] = xmm0
  dec         r8                          ; --len
  jne         loop
bail_out:
  ret

(It’s interesting to note how cryptic the compiler-generated code is. It starts with the apparently absurd operation b - a, so that it can later advance the first array’s pointer directly instead of maintaining a separate index register. I suspect the reason is that memory operands of the form [reg0 + reg1*scale + imm] only allow a scale of 1, 2, 4, or 8, whereas skipping a full vector here requires a stride of 16 bytes.)

Clearly, manual vectorization leaves much to be desired. The developer is responsible for recognizing when vectorization can be beneficial, and the code takes a hard dependency on the platform’s vector register size. For example, the preceding loop cannot take advantage of AVX instructions such as VADDPS, which operates on eight 32-bit floats (a 256-bit register) at a time.

Therefore, some modern C/C++ compilers, Visual C++ included, can perform automatic vectorization of certain operations. Tight loops such as ours are great candidates for automatic vectorization. There are many difficult problems with automatic vectorization, however: the compiler must be able to prove that there are no dependencies between the loop’s iterations, that there is no pointer aliasing, and so on. The Microsoft.Bcl.Simd package is not (at present) an automatic vectorization framework. It is instead a set of library types and intrinsics that are recognized by the JIT compiler, much like the __m128 type is recognized by the Visual C++ compiler.
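To illustrate just one of the difficulties automatic vectorizers face, here is a contrived loop of mine with a carried dependency: each iteration reads a value written by the previous one, so the compiler cannot blindly process four elements at a time.

// A running (prefix) sum: a[i] depends on a[i-1], which was computed by the
// previous iteration, so the iterations cannot simply be executed four at a time.
void PrefixSum(float[] a) {
  for (int i = 1; i < a.Length; ++i) {
    a[i] += a[i - 1];
  }
}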

Configuring Microsoft.Bcl.Simd

To use Microsoft.Bcl.Simd, you need to install it as a NuGet package. On its own, however, this package only contains a set of library types and methods, such as Vector4f, which represents four floating-point numbers packed in a vector register. To get the runtime performance benefits, you need a JIT compiler that can recognize these types and emit the appropriate CPU instructions at runtime, and that’s RyuJIT.

After you install RyuJIT, you have both JIT compilers on your system. The standard one (clrjit.dll) will be used unless you ask for RyuJIT by setting an environment variable:

SET COMPLUS_AltJit=*

To enable the SIMD-specific extensions in RyuJIT, you need to set another environment variable as well:

SET COMPLUS_FeatureSIMD=1

Now we can actually explore some code that uses the types in the Microsoft.Bcl.Simd package.

SIMD pointwise vector addition

To vectorize the loop we had earlier that adds two arrays pointwise, we can use the various .NET SIMD vector types. The fixed-size Vector2f, Vector3f, and Vector4f are designed for situations where you already have a vector-like representation of your data, such as points in 2- or 3-dimensional space. There is also a generic Vector<T> type that works with various integral and floating-point element types and is sized according to your hardware’s capabilities. Although the current preview only supports 128-bit vector operations, a future release will hopefully add support for 256-bit (AVX) vector operations if the hardware provides them. That’s where Vector<T> is useful: you don’t have to commit to a vector size that ultimately depends on what your hardware can offer.
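You can check at runtime how many elements the JIT packs into a single vector register by reading Vector<T>.Length; the values in the comments below are what I’d expect on SSE-only hardware:

// Number of elements that fit in one hardware vector register for each element type.
// With the current 128-bit (SSE) preview this should print 4 for both float and int;
// with AVX-sized vectors it could become 8.
Console.WriteLine("floats per vector: {0}", Vector<float>.Length);
Console.WriteLine("ints per vector:   {0}", Vector<int>.Length);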

So let’s use Vector<T> to vectorize our pointwise vector addition, assuming again that the array sizes are divisible by the “SIMD length” — the number of elements that fit in a vector register:

void AddPointwiseStandard(float[] a, float[] b) {
  for (int i = 0; i < a.Length; ++i) {
    a[i] += b[i];
  }
}

void AddPointwiseSimd(float[] a, float[] b) {
  int simdLength = Vector<float>.Length;
  for (int i = 0; i < a.Length; i += simdLength) {
    Vector<float> va = new Vector<float>(a, i);
    Vector<float> vb = new Vector<float>(b, i);
    va += vb;
    va.CopyTo(a, i);
  }
}

In the preceding code, initializing a Vector<float>, adding two vectors together, and copying the result back to the source array are all JIT intrinsics that should map directly to vector registers and instructions without incurring any method calls. In the current preview, unfortunately, the CopyTo method is not recognized by the JIT as an intrinsic, and incurs a pretty sizeable overhead.
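As an aside, if the array length is not a multiple of the SIMD length, the standard pattern is to process whole vector-sized blocks and then finish with a scalar loop for the remainder. Here is a minimal sketch using the same Vector<float> API (the method name and structure are mine):

void AddPointwiseSimdWithRemainder(float[] a, float[] b) {
  int simdLength = Vector<float>.Length;
  int lastBlockEnd = a.Length - (a.Length % simdLength);

  // Vectorized part: whole SIMD-width blocks.
  for (int i = 0; i < lastBlockEnd; i += simdLength) {
    Vector<float> va = new Vector<float>(a, i);
    Vector<float> vb = new Vector<float>(b, i);
    (va + vb).CopyTo(a, i);
  }

  // Scalar tail: the few elements left at the end of the array.
  for (int i = lastBlockEnd; i < a.Length; ++i) {
    a[i] += b[i];
  }
}

For the rest of this post, though, I’ll stick with the divisible-length assumption and the AddPointwiseSimd version above.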

So, how about we take this vectorized version for a spin? Unfortunately, we’re in for a disappointment: the vectorized version, even after getting rid of CopyTo, is slower than the scalar one. For example, on my Core i7-3720QM processor, the results are:

Add Standard: 0.262ms
Add SIMD: 0.300ms
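Timings like these can be collected with a simple Stopwatch-based harness along the following lines (the iteration count is arbitrary, and the warm-up run matters so that JIT compilation is not included in the measurement):

// Requires: using System; using System.Diagnostics;
// Runs the action once to force JIT compilation, then reports the average
// wall-clock time per iteration.
static void Measure(string name, Action action) {
  const int iterations = 100;
  action();                                   // warm-up: trigger JIT compilation
  var sw = Stopwatch.StartNew();
  for (int i = 0; i < iterations; ++i) {
    action();
  }
  sw.Stop();
  Console.WriteLine("{0}: {1:F3}ms", name, sw.Elapsed.TotalMilliseconds / iterations);
}

// Usage, e.g. with large float arrays a and b of equal length:
// Measure("Add Standard", () => AddPointwiseStandard(a, b));
// Measure("Add SIMD", () => AddPointwiseSimd(a, b));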

Let’s take a look at the actual instructions generated and try to understand why that’s the case. Here’s what the compiler produced in the scalar case (I have annotated, reordered, and removed some instructions for brevity and clarity):

; parameters: RCX = a, RDX = b
  xor     eax,eax                ; EAX = i
  mov     r8d,dword ptr [rcx+8]  ; R8D = a.Length
  test    r8d,r8d
  jle     bail_out               ; R8D <= 0?
  test    rdx,rdx
  setne   r9b
  movzx   r9d,r9b
  test    r9d,1
  je      00007ff8`72f90aae
  cmp     dword ptr [rdx+8],r8d  ; b.Length < a.Length?
  jl      loop_with_range_checks_for_b

loop_with_no_range_checks:
  lea     r9,[rcx+r9*4+10h]
  movss   xmm0,dword ptr [r9]             ; low 32 bits of XMM0 = a[i]
  movsxd  r10,eax                         ; sign-extend EAX into R10
  addss   xmm0,dword ptr [rdx+r10*4+10h]  ; low 32 bits of XMM0 += b[i]
  movss   dword ptr [r9],xmm0             ; a[i] = low 32 bits of XMM0

  inc     eax
  cmp     r8d,eax    ; a.Length <= i?
  jg      loop_with_no_range_checks
  jmp     bail_out

; this is pretty much the same code, but we are verifying
; that i is within range for accessing the b array
loop_with_range_checks_for_b:
  movsxd  r9,eax
  lea     r9,[rcx+r9*4+10h]
  movss   xmm0,dword ptr [rcx+r9*4+10h]
  mov     r10d,dword ptr [rdx+8]
  cmp     eax,r10d                ; i >= b.Length?
  jae     range_check_fail
  movsxd  r10,eax
  addss   xmm0,dword ptr [rdx+r10*4+10h]
  movss   dword ptr [r9],xmm0
  inc     eax
  cmp     r8d,eax
  jg      loop_with_range_checks_for_b

bail_out:
  add     rsp,28h
  ret

range_check_fail:
  call    clr!JIT_RngChkFail
  int     3

As you can see, the scalar version is pretty well-optimized, and there is special treatment for the case when a.Length <= b.Length, which implies that no range checks are required when going through the loop. Another thing to note is that even though the compiler uses vector registers (specifically XMM0 is a 128-bit register), it only stores 32-bit values in them, so there is no automatic vectorization magic taking place.

Moving on, here’s what the compiler produced in the vectorized case:

; parameters: RCX = a, RDX = b (as before)
  sub     rsp,38h
  xor     eax,eax
  mov     qword ptr [rsp+28h],rax
  mov     qword ptr [rsp+30h],rax
  xor     eax,eax                 ; EAX = i
  mov     r8d,dword ptr [rcx+8]   ; R8D = a.Length
  test    r8d,r8d
  jle     bail_out                ; R8D <= 0?
  mov     r9d,dword ptr [rdx+8]   ; R9D = b.Length

loop:
  lea     r10d,[rax+3]            ; R10D = i+3
  cmp     r10d,r8d                ; i+3 >= a.Length?
  jae     range_check_fail
  movupd  xmm0,xmmword ptr [rcx+rax*4+10h]  ; XMM0 = (a[i], a[i+1], a[i+2], a[i+3])
  cmp     r10d,r9d                ; i+3 >= b.Length?
  jae     range_check_fail
  movupd  xmm1,xmmword ptr [rdx+rax*4+10h]  ; XMM1 = (b[i], b[i+1], b[i+2], b[i+3])
  addps   xmm0,xmm1               ; XMM0 += XMM1
  xor     r10d,r10d               ; s = 0

copy_back:
  cmp     r10d,4                  ; s >= 4?
  jae     range_check_fail
  movupd  xmmword ptr [rsp+28h],xmm0      ; va = XMM0
  movss   xmm1,dword ptr [rsp+r10*4+28h]  ; lower 32 bits of XMM1 = va[s]
  lea     r11d,[rax+r10]          ; R11D = i + s
  cmp     r11d,r8d                ; i + s >= a.Length?
  jae     range_check_fail
  movsxd  r11,r11d                ; sign-extend R11D into R11
  movss   dword ptr [rcx+r11*4+10h],xmm1  ; a[i + s] = lower 32 bits of XMM1

  inc     r10d                    ; ++s
  cmp     r10d,4                  ; s < 4?
  jl      copy_back

  add     eax,4                   ; i += 4
  cmp     r8d,eax                 ; a.Length > i?
  jg      loop

bail_out:
  add     rsp,38h
  ret

range_check_fail:
  call    clr!JIT_RngChkFail
  int     3

This code is much longer, and by reviewing it you can hopefully see the problematic sections that make it slower than the scalar version. The first problem is that the compiler no longer elides the range checks when accessing the arrays: there are numerous checks to verify that i is within the bounds of both a and b, that i + s is within the bounds of a, and so on. Additionally, va is treated like a stack location when copying it back to the original array a: the copies go through the stack, one MOVSS instruction for each 32-bit component of the 128-bit register. All in all, these factors produce a subpar running time.

(It’s important to understand that this is not an inevitable result. The Visual C++ compiler can take manually vectorized code and produce much faster running times. It’s only a matter of sufficient tuning in the RyuJIT compiler, which is still in preview, so I definitely hope there will be some improvement before it’s released.)

Vectorized min-max

Let’s take a look at a slightly more complex algorithm that can benefit even more from using vector instructions. Suppose we have an array of integers and we need to find the minimum and maximum elements in the array. The following scalar method will do the job:

void MinMax(int[] a, out int minimum, out int maximum) {
  int min = int.MaxValue, max = int.MinValue;
  for (int i = 0; i < a.Length; ++i) {
    min = Math.Min(min, a[i]);
    max = Math.Max(max, a[i]);
  }
  minimum = min;
  maximum = max;
}

This loop can also be vectorized if we make the following simple observation: if we find the minimum and maximum of all the even-indexed elements, and the minimum and maximum of all the odd-indexed elements, then the global minimum and maximum can be derived from these partial results. The same idea applies if we partition the array into more than two sets, and that’s where vectorization comes in. Here’s what we can do (again, under the simplifying assumption that the input array’s length is divisible by the SIMD vector length):

void MinMax(int[] a, out int minimum, out int maximum) {
  int simdLength = Vector<int>.Length;
  Vector<int> vmin = new Vector<int>(int.MaxValue);
  Vector<int> vmax = new Vector<int>(int.MinValue);
  for (int i = 0; i < a.Length; i += simdLength) {
      Vector<int> va = new Vector<int>(a, i);
      Vector<int> vLessThan = Vector.LessThan(va, vmin);
      vmin = Vector.ConditionalSelect(vLessThan, va, vmin);
      Vector<int> vGreaterThan = Vector.GreaterThan(va, vmax);
      vmax = Vector.ConditionalSelect(vGreaterThan, va, vmax);
  }
  int min = int.MaxValue, max = int.MinValue;
  for (int i = 0; i < simdLength; ++i) {
      min = Math.Min(min, vmin[i]);
      max = Math.Max(max, vmax[i]);
  }
  minimum = min;
  maximum = max;
}

The crux of the matter is the pair of Vector.LessThan and Vector.ConditionalSelect functions. Vector.LessThan performs a pointwise comparison of two vector registers and puts the Boolean results in another vector register: vLessThan will have zeroes in the lanes where va was not less than vmin, and all bits set (-1, for integer elements) in the lanes where it was. Vector.ConditionalSelect then uses this mask to conditionally place values from va or vmin back into vmin, based on which ones were smaller. The same thing applies to Vector.GreaterThan and the rest of the code.
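To make this concrete, here is a tiny worked example with made-up values, assuming a SIMD length of 4:

Vector<int> va   = new Vector<int>(new[] { 3, 9, 1, 7 }, 0);
Vector<int> vmin = new Vector<int>(new[] { 5, 5, 5, 5 }, 0);

// All bits set (-1) in the lanes where va is smaller, zero elsewhere.
Vector<int> mask = Vector.LessThan(va, vmin);     // < -1, 0, -1, 0 >

// Takes the element from va where the mask is set, and from vmin elsewhere.
vmin = Vector.ConditionalSelect(mask, va, vmin);  // <  3, 5,  1, 5 >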

The code is slightly less readable than a simple if statement or a Math.Min call, because we had to express conditional operations on single scalar values as vector operations that map to CPU instructions operating on vector registers. The speedup, though, is considerable. On my box, I get a clean 4x speedup from the vector version:

MinMax Standard: 1.136ms
MinMax SIMD: 0.274ms

More examples

If you are looking for more complex and realistic examples of using vector operations to improve performance, you should check out the SIMD Sample on MSDN Code Gallery, which includes a vectorized Mandelbrot fractal generator and a ray tracer.

Summary

As you can see, modern processors are equipped with powerful vector units that can significantly improve the performance of certain algorithms without resorting to explicit parallelism in the form of multiple threads. With Microsoft.Bcl.Simd and RyuJIT, .NET applications can also take advantage of vectorization, something that until now was only available through interop with native code.


I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn
