It is a well known fact that accessing the CPU cache is faster than accessing physical RAM, but how do you ensure, or at least target the CPU cache? Does it actually gain a performance boost that is significant?
Here’s a relatively simple example to test this idea: matrix multiplication. To multiply two matrices together (let’s assume for simplicity sake that these are square matrices), you need to do some summations and multiplications to get the result of a single cell in the output matrix. Here’s a simple native implementation:
static double[,] NativeMultiply(double[,] a, double[,] b) {
var c = new double[size, size];
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
for(int k = 0; k < size; k++)
c[i, j] += a[i, k] * b[k, j];
return c;
}
Assume size is the dimension of the matrix, something like a.GetLength(0). Here’s the result of running this algorithm on my machine with two random 1000x1000 matrices and measuring execution time:
That’s about 10.7 seconds on my machine.
To get a better cache locality we can divide the problem to use smaller “squares” while doing a sum for a particular cell. For example, we can divide the 1000x1000 matrix to smaller matrices, say 64x64 cells and then do the summation and multiplications in parts, using these squares on a per cell basis:
static double[,] BlockMultiply(double[,] a, double[,] b, int blockSize) {
var c = new double[size, size];
int begin = 0, end;
while(begin < size) {
end = begin + blockSize;
end = Math.Min(end, size);
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
for(int k = begin; k < end; k++)
c[i, j] += a[i, k] * b[k, j];
begin = end; // next block starts where we left off
}
return c;
}
blockSize is the size small matrix dimension (e.g. 64). Running this algorithm produces this output:
That’s 3.8 seconds! That’s definitely a major improvement! And that’s before parallelization by the number of cores and perhaps unsafe code to remove the array bounds checking.
By the way, this works in the native world just as effectively (and even more so). I used .NET just as an example.
So, the moral here is, to write high performing code, multithreaded / multicore considerations may not be enough, or at least, keep your eyes open for opportunities with cache locality!
The attached file has the test program with a #define for the naive vs. block implementations. Happy caching!