DCSIMG
Micro-Benchmarking Considered Harmful - All Your Base Are Belong To Us

All Your Base Are Belong To Us

Mostly .NET internals and other kinds of gory details

Micro-Benchmarking Considered Harmful

Another title I’ve considered for this post was:

When measuring something, make sure you’re really measuring it.

Micro-benchmarking is the art of measuring tiny operations, and as always when measuring something tiny – there’s the problem of making sure that you’re actually measuring it.  Let’s take a couple of “trivial” examples, because usually trivial examples end up being more convoluted than the difficult ones.

Example 1 – Measuring the time of the getpid() system call

Assume that we want to measure the time it takes to trap into the OS kernel, and do so by measuring the time it takes to call the getpid() function lots of times in a loop.  Depending on the system you try this on (I tried it on Ubuntu running in VMWare Server) you might find that getpid() takes no more than 10-20 clock cycles on average.

Does this mean that a system call costs 10-20 clock cycles?  This does not align with our prior knowledge of what a system call is comprised of, but on the other hand – how can you argue with empiric measurements?  (Read on to find out how.)

Example 2 – Measuring the time to copy memory using memcpy()

Assume that we want to measure memory bandwidth and latency by using the memcpy() function from the CRT library.  We measure the time it takes to copy various block sizes and try to deduce an expression that describes memory bandwidth as a function of block size.  If you do this naively, and again – depending on the system you try this on – you might find that the results are a scatter of {block size, time} pairs with no apparent order or meaning.  Or you might even find that memory latency and bandwidth are exponential in block size.

Again, this does not align with our understanding of memory bandwidth and latency.  The time to access a memory location (say, a system word) should not be dependent on the size of the block accessed.  How do we argue with this measurement?

Hmm?

Well, apparently there’s an easy way out of the argument in both cases.  In the latter case, it’s pretty obvious that what we’re measuring it cache access times and cache bandwidth and not memory bandwidth.  And if you think about it, it’s actually not that easy to avoid the cache effects.  And even if you avoid them somehow, there’s also TLB effects to take into consideration.  But one way or another, writing a loop to call memcpy() on the same source and destination locations a million times is not a good way to measure memory bandwidth.  There’s a difference between measuring the time it takes to call the memcpy() function – which is an accurate measurement – and measuring the underlying mechanism, which is not necessarily reflected by the time it takes to call memcpy().

In the former case, it’s apparent that the getpid() system call either (1) does not really occur (i.e. is optimized somehow) or (2) does not involve a system call.  Indeed, in some libc implementations the process id is cached after the first call to getpid(), so it doesn’t actually involve a system call except for the first time it’s called.  This again means that we’re not measuring the cost of a system call – we’re measuring the cost of calling getpid() in a loop.

Armed with the consequences of this discussion, what do you say about the following results (slightly altered to protect the innocent):

I wanted to measure the time it takes to perform a cast from a string to an object.  So I wrote a loop and executed it 1,000,000,000 times.  In the loop body I assigned a string variable to an object reference.

It took 550ms to execute the loop.  I conclude that it takes 0.55ns (nanoseconds) to perform a single cast of a string to an object.  (In the Debug build, it takes over 5 seconds.  Bizarre.)

The likely flaw with this measurement is that the result is obviously way too low.  On a 2GHz processor, 0.55ns is just short of 2 clock cycles.  The overhead of actually running the measurement loop for each iteration is almost certainly more than 2 clock cycles, so it’s impossible for the overall measurement to be so fast unless…

  1. There’s no measurement taking place.  It’s possible that the entire measurement loop is optimized away because it does, well, nothing?  (Assigning the same string variable to the same object reference a billion times is considered nothing, indeed.) – or –
  2. There’s nothing inside the loop and the only thing measured was the loop overhead.  (Now this is a tricky one, because due to compiler imperfections it might actually be possible that if you remove the loop body altogether, then the entire loop will be optimized away; but if the loop body stays, then the loop body is optimized away and the loop itself is not optimized away.  This is why micro-benchmarking is hard.)

I will not do any finger-pointing in this post, but if you’re reading this and are considering to perform a micro-benchmark, bear in mind that the first most important thing to do is to make sure that you’re actually measuring something.  The next step is to make sure that this something you’re measuring is actually the same thing that you want to measure.  The analysis of the results – well, this is the subject of an entirely different post.

Comments

Dew Drop - Weekend Edition - May 9-10, 2009 | Alvin Ashcraft's Morning Dew said:

Pingback from  Dew Drop - Weekend Edition - May 9-10, 2009 | Alvin Ashcraft's Morning Dew

# May 10, 2009 4:07 AM
Leave a Comment

(required) 

(required) 

(optional)

(required) 


Enter the numbers above: