Cache Locality For Performance

June 22, 2009


It is a well-known fact that accessing the CPU cache is much faster than accessing physical RAM, but how do you ensure, or at least encourage, that your data is served from the cache? And does it actually yield a significant performance boost?

Here’s a relatively simple example to test this idea: matrix multiplication. To multiply two matrices together (let’s assume for simplicity’s sake that they are square), each cell of the output matrix is the sum of products of a row of the first matrix and a column of the second. Here’s a simple naive implementation:

static double[,] NaiveMultiply(double[,] a, double[,] b) {
   int size = a.GetLength(0);   // matrix dimension (square matrices assumed)
   var c = new double[size, size];
   for(int i = 0; i < size; i++)
      for(int j = 0; j < size; j++)
         for(int k = 0; k < size; k++)
            c[i, j] += a[i, k] * b[k, j];
   return c;
}

Running this algorithm on my machine, multiplying two random 1000×1000 matrices and measuring the execution time, takes about 10.7 seconds.
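In case you want to reproduce the numbers, here is a minimal sketch of such a measurement; the RandomMatrix helper and the exact timing code are illustrative, not necessarily what the attached test program does:

using System;
using System.Diagnostics;

static double[,] RandomMatrix(int size) {
   var rnd = new Random(42);   // fixed seed so runs are comparable
   var m = new double[size, size];
   for(int i = 0; i < size; i++)
      for(int j = 0; j < size; j++)
         m[i, j] = rnd.NextDouble();
   return m;
}

static void Main() {
   var a = RandomMatrix(1000);
   var b = RandomMatrix(1000);
   var sw = Stopwatch.StartNew();
   var c = NaiveMultiply(a, b);   // the naive implementation from above
   sw.Stop();
   Console.WriteLine("Elapsed: {0:F1} s", sw.Elapsed.TotalSeconds);
}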

To get better cache locality we can break the work into smaller blocks, so that the data touched in the inner loops is small enough to stay in the cache. For example, instead of running the inner summation over all 1000 values of k at once, we can split it into blocks of, say, 64 and accumulate each cell’s sum in parts, one block at a time:

static double[,] BlockMultiply(double[,] a, double[,] b, int blockSize) {
   int size = a.GetLength(0);   // matrix dimension (square matrices assumed)
   var c = new double[size, size];
   int begin = 0, end;
   while(begin < size) {
      end = Math.Min(begin + blockSize, size);

      // accumulate the partial sums for the current block of k
      for(int i = 0; i < size; i++)
         for(int j = 0; j < size; j++)
            for(int k = begin; k < end; k++)
               c[i, j] += a[i, k] * b[k, j];

      begin = end; // next block starts where we left off
   }

   return c;
}

blockSize is the block dimension (e.g. 64). Running this algorithm on the same matrices takes about 3.8 seconds. That’s definitely a major improvement! And that’s before parallelizing across cores, and perhaps using unsafe code to remove the array bounds checking.
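Just to sketch what that parallelization could look like (this is my illustration, not the code in the attached file), the outer row loop of the block version can be handed to the Task Parallel Library’s Parallel.For; each parallel iteration writes a distinct row of c, so no locking is needed:

using System;
using System.Threading.Tasks;

static double[,] BlockMultiplyParallel(double[,] a, double[,] b, int blockSize) {
   int size = a.GetLength(0);
   var c = new double[size, size];
   int begin = 0;
   while(begin < size) {
      int end = Math.Min(begin + blockSize, size);
      int from = begin, to = end;   // local copies for the lambda
      // each parallel iteration owns a distinct row i of c, so no locking is needed
      Parallel.For(0, size, i => {
         for(int j = 0; j < size; j++)
            for(int k = from; k < to; k++)
               c[i, j] += a[i, k] * b[k, j];
      });
      begin = end;
   }
   return c;
}

The block structure stays exactly the same; only the i loop changes.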

By the way, this works in the native world just as effectively (and even more so). I used .NET just as an example.

So, the moral here is: to write high-performing code, multithreaded/multicore considerations may not be enough; at the very least, keep your eyes open for cache locality opportunities!

The attached file has the test program with a #define for the naive vs. block implementations. Happy caching!


3 comments

  1. Cyrus, December 15, 2011 at 02:55

    This is insightful! Cheers Pavel!

  2. Red Baron, December 20, 2012 at 12:13

    Really great. I was looking for something like this. The bigger the matrices we multiply, the greater the performance advantage compared to the naive method.
