It is well known that accessing the CPU cache is faster than accessing main memory, but how do you ensure, or at least encourage, that your data is served from the cache? And is the performance gain actually significant?

Here’s a relatively simple example to test this idea: matrix multiplication. To multiply two matrices together (let’s assume for simplicity’s sake that these are square matrices), you need to do some summations and multiplications to get the result of a single cell in the output matrix. Here’s a simple naive implementation:

```csharp
static double[,] NaiveMultiply(double[,] a, double[,] b) {
    int size = a.GetLength(0); // square matrices: size × size
    var c = new double[size, size];
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
                c[i, j] += a[i, k] * b[k, j];
    return c;
}
```

Here **size** is the dimension of the matrix, i.e. **a.GetLength(0)**. I ran this algorithm on my machine with two random 1000×1000 matrices and measured the execution time:

That’s about 10.7 seconds on my machine.

To get better cache locality we can divide the problem into smaller “squares” while doing the sum for a particular cell, so that the slice of data being reused stays small enough to remain in the cache. For example, we can divide the 1000×1000 matrix into smaller blocks, say 64×64 cells, and then do the summations and multiplications in parts, using these squares on a per-cell basis:

```csharp
static double[,] BlockMultiply(double[,] a, double[,] b, int blockSize) {
    int size = a.GetLength(0); // square matrices: size × size
    var c = new double[size, size];
    int begin = 0;
    while (begin < size) {
        int end = Math.Min(begin + blockSize, size);
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                for (int k = begin; k < end; k++)
                    c[i, j] += a[i, k] * b[k, j];
        begin = end; // next block starts where we left off
    }
    return c;
}
```

**blockSize** is the dimension of the small blocks (e.g. 64). Running this algorithm on the same matrices and measuring execution time:

That’s 3.8 seconds! That’s definitely a major improvement! And that’s before parallelizing across cores, and perhaps using unsafe code to remove the array bounds checking.

By the way, this works in the native world just as effectively (and even more so). I used .NET just as an example.
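To see that the blocking transformation really computes the same thing, here is a minimal Python sketch of both versions (my own port, not the attached test program; pure Python loops won’t reproduce the .NET timings, but they demonstrate the algorithm). Note that because blocking only splits the k loop, each output cell accumulates its terms in exactly the same order, so the results match exactly:

```python
def naive_multiply(a, b):
    """Plain triple-loop multiply: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def block_multiply(a, b, block_size):
    """Same computation, but the k dimension is processed in cache-friendly blocks."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    begin = 0
    while begin < n:
        end = min(begin + block_size, n)
        for i in range(n):
            for j in range(n):
                for k in range(begin, end):
                    c[i][j] += a[i][k] * b[k][j]
        begin = end  # next block starts where we left off
    return c
```

In C# or C the payoff comes from the block of **b** staying resident in the cache while it is reused across many output cells.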

So the moral here is: to write high-performing code, multithreaded/multicore considerations alone may not be enough; keep your eyes open for opportunities to improve cache locality!
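One classic example of such an opportunity (a sketch of my own, not from the attached program): traverse 2D data in the order it is laid out in memory. C# multidimensional arrays are stored row-major, so summing row by row touches memory sequentially, while summing column by column jumps between rows and defeats the cache. The two functions below compute the same total; in a compiled language the row-major one is typically much faster (in pure Python the effect is mostly hidden by interpreter overhead):

```python
def sum_row_major(m):
    """Visit elements in storage order, row by row: sequential, cache-friendly."""
    total = 0.0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    """Visit elements column by column: same result, but poor locality
    for row-major layouts, since consecutive accesses land in different rows."""
    total = 0.0
    n_rows, n_cols = len(m), len(m[0])
    for j in range(n_cols):
        for i in range(n_rows):
            total += m[i][j]
    return total
```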

The attached file has the test program with a #define for the naive vs. block implementations. Happy caching!

This is insightful! Cheers Pavel!

For further reading,

http://en.wikipedia.org/wiki/Locality_of_reference

Really great. I was looking for something like this. The bigger the matrices we multiply, the greater the performance advantage over the naive method.