DCSIMG
June 2009 - Posts - Pavel's Blog
Sign in | Join | Help

Pavel's Blog

Pavel is a software guy that is interested in almost everything
software related... way too much for too little time

June 2009 - Posts

Cache Locality For Performance

Published at Jun 22 2009, 06:14 PM by pavely

It is a well known fact that accessing the CPU cache is faster than accessing physical RAM, but how do you ensure, or at least target the CPU cache? Does it actually gain a performance boost that is significant?

Here’s a relatively simple example to test this idea: matrix multiplication. To multiply two matrices together (let’s assume for simplicity sake that these are square matrices), you need to do some summations and multiplications to get the result of a single cell in the output matrix. Here’s a simple native implementation:

static double[,] NativeMultiply(double[,] a, double[,] b) {

   var c = new double[size, size];

   for(int i = 0; i < size; i++)

      for(int j = 0; j < size; j++)

         for(int k = 0; k < size; k++)

            c[i, j] += a[i, k] * b[k, j];

   return c;

}

Assume size is the dimension of the matrix, something like a.GetLength(0). Here’s the result of running this algorithm on my machine with two random 1000x1000 matrices and measuring execution time:

image

That’s about 10.7 seconds on my machine.

To get a better cache locality we can divide the problem to use smaller “squares” while doing a sum for a particular cell. For example, we can divide the 1000x1000 matrix to smaller matrices, say 64x64 cells and then do the summation and multiplications in parts, using these squares on a per cell basis:

static double[,] BlockMultiply(double[,] a, double[,] b, int blockSize) {

   var c = new double[size, size];

   int begin = 0, end;

   while(begin < size) {

      end = begin + blockSize;

      end = Math.Min(end, size);

 

      for(int i = 0; i < size; i++)

         for(int j = 0; j < size; j++)

            for(int k = begin; k < end; k++)

               c[i, j] += a[i, k] * b[k, j];

 

      begin = end// next block starts where we left off

   }

 

   return c;

}

blockSize is the size small matrix dimension (e.g. 64). Running this algorithm produces this output:

image

That’s 3.8 seconds! That’s definitely a major improvement! And that’s before parallelization by the number of cores and perhaps unsafe code to remove the array bounds checking.

By the way, this works in the native world just as effectively (and even more so). I used .NET just as an example.

So, the moral here is, to write high performing code, multithreaded / multicore considerations may not be enough, or at least, keep your eyes open for opportunities with cache locality!

The attached file has the test program with a #define for the naive vs. block implementations. Happy caching!

XNA Game Studio 3.1 Released

Published at Jun 15 2009, 08:01 PM by pavely

For all those of you programming games with XNA game studio (and those that don’t but would like to), a new version has been released and can be downloaded here.

New features / enhancements include (partial list):

  • Ability to play video files (including full screen and as a texture)
  • New avatar support
  • Audio improvements: new XACT 3 tool; SoundEffect.Play now returns true/false (and not SoundEffectInstance that needs to be maintained). The SoundEffect object will be GCed automatically after play is done.
  • More… read the docs

Check it out!