DCSIMG
Cache Locality For Performance - Pavel's Blog
Sign in | Join | Help

Pavel's Blog

Pavel is a software guy that is interested in almost everything
software related... way too much for too little time

Cache Locality For Performance

It is a well known fact that accessing the CPU cache is faster than accessing physical RAM, but how do you ensure, or at least target the CPU cache? Does it actually gain a performance boost that is significant?

Here’s a relatively simple example to test this idea: matrix multiplication. To multiply two matrices together (let’s assume for simplicity sake that these are square matrices), you need to do some summations and multiplications to get the result of a single cell in the output matrix. Here’s a simple native implementation:

static double[,] NativeMultiply(double[,] a, double[,] b) {

   var c = new double[size, size];

   for(int i = 0; i < size; i++)

      for(int j = 0; j < size; j++)

         for(int k = 0; k < size; k++)

            c[i, j] += a[i, k] * b[k, j];

   return c;

}

Assume size is the dimension of the matrix, something like a.GetLength(0). Here’s the result of running this algorithm on my machine with two random 1000x1000 matrices and measuring execution time:

image

That’s about 10.7 seconds on my machine.

To get a better cache locality we can divide the problem to use smaller “squares” while doing a sum for a particular cell. For example, we can divide the 1000x1000 matrix to smaller matrices, say 64x64 cells and then do the summation and multiplications in parts, using these squares on a per cell basis:

static double[,] BlockMultiply(double[,] a, double[,] b, int blockSize) {

   var c = new double[size, size];

   int begin = 0, end;

   while(begin < size) {

      end = begin + blockSize;

      end = Math.Min(end, size);

 

      for(int i = 0; i < size; i++)

         for(int j = 0; j < size; j++)

            for(int k = begin; k < end; k++)

               c[i, j] += a[i, k] * b[k, j];

 

      begin = end// next block starts where we left off

   }

 

   return c;

}

blockSize is the size small matrix dimension (e.g. 64). Running this algorithm produces this output:

image

That’s 3.8 seconds! That’s definitely a major improvement! And that’s before parallelization by the number of cores and perhaps unsafe code to remove the array bounds checking.

By the way, this works in the native world just as effectively (and even more so). I used .NET just as an example.

So, the moral here is, to write high performing code, multithreaded / multicore considerations may not be enough, or at least, keep your eyes open for opportunities with cache locality!

The attached file has the test program with a #define for the naive vs. block implementations. Happy caching!

Comments List

# re: Cache Locality For Performance

Published at Thursday, December 15, 2011 2:55 AM by Cyrus  

This is insightful! Cheers Pavel!

# re: Cache Locality For Performance

Published at Wednesday, March 07, 2012 3:50 AM by ThE uSeFuL  

Leave a Comment

(required) 
(
required
)
 
(optional)
(required) 

Enter the numbers above: