On Measuring Performance
To put this post (or rant) in a nutshell: measuring performance is not as simple as people think it is. I have seen all kinds of information in books, on the web, in hallway conversations etc. that is based on some kind of simple performance measurement; these performance measurements almost never reflect the actual state of affairs. To measure performance, it isn’t enough to wrap a block of code with two Environment.TickCount samples; it isn’t enough to use a Stopwatch, either, though even a Stopwatch is not often to be seen. Measuring performance correctly, and more importantly – deciding what to measure before you slap together the actual implementation – is not nearly as trivial as that.
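For reference, here is roughly what that naive measurement looks like; a minimal sketch, where DoWork is a hypothetical stand-in for whatever code is being measured. The rest of this post is about why this, by itself, tells you very little.

```csharp
// The naive micro-benchmark this post argues is insufficient on its own.
// DoWork is a hypothetical stand-in for the code under measurement.
using System;
using System.Diagnostics;

static class NaiveMeasurement
{
    static void DoWork()
    {
        double x = 0;
        for (int i = 0; i < 1_000_000; i++) x += Math.Sqrt(i);
    }

    static void Main()
    {
        int before = Environment.TickCount;   // ~10-16 ms resolution at best
        DoWork();
        int after = Environment.TickCount;
        Console.WriteLine($"TickCount says: {after - before} ms");

        var sw = Stopwatch.StartNew();        // high-resolution, but still
        DoWork();                             // only wall-clock time
        sw.Stop();
        Console.WriteLine($"Stopwatch says: {sw.ElapsedMilliseconds} ms");
    }
}
```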
First and foremost, performance is not always about execution time. This is something you learn in college or at the university – algorithms have a run-time complexity as well as a space complexity (and there are some interesting attempts to bring the formal concepts of complexity analysis, which mostly works for theoretical algorithms and data structures written on paper, to the world of real class libraries). But it is not always about memory usage either; measuring performance as the amount of time it takes to perform some action completely disregards other aspects such as latency vs. throughput, performance under load, the scalability of a particular solution, the effect a solution has on overall system performance, and many other aspects of that ilk. An algorithm that processes 10 requests per second (and therefore, one would think, spends 100ms on each request) might nevertheless have an average latency of 200ms or even more; the same algorithm might be consuming twice as much memory, which will go unnoticed on a powerful development machine but will cause page thrashing and cache thrashing on the actual production system; the same algorithm might be perfect for the not-so-mighty dual-core system it is being written on, but will grind to a screeching halt on a massively parallel, 16-way system due to high contention, locking that’s too granular or too coarse, lock convoys, and a dozen other issues.
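To make the latency vs. throughput distinction concrete, here is a toy sketch (all numbers illustrative): two workers, each spending 200ms on a request, still push 10 requests per second between them, yet no individual request completes in under 200ms.

```csharp
// Throughput and latency are different axes: high throughput does not
// imply low latency. The 200 ms "request" is a simulated delay.
using System;
using System.Diagnostics;
using System.Threading;

static class LatencyVsThroughput
{
    static void Main()
    {
        const int requestsPerWorker = 10;    // 2 workers x 10 requests = 20 total
        ThreadStart worker = () =>
        {
            for (int i = 0; i < requestsPerWorker; i++)
                Thread.Sleep(200);           // each request takes ~200 ms of latency
        };

        var sw = Stopwatch.StartNew();
        var t1 = new Thread(worker);
        var t2 = new Thread(worker);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        sw.Stop();

        Console.WriteLine($"Throughput: {2 * requestsPerWorker / sw.Elapsed.TotalSeconds:F1} requests/s");
        Console.WriteLine("Average latency per request: ~200 ms");
    }
}
```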
This is a topic of such great importance that I find it necessary to restate the most obvious and apparent truths of what you might be interested in measuring. (This is not a comprehensive drill-down into the subject; I’ll reserve the right to do that in a separate post.)
Run-time is important. Do measure the CPU time that your algorithm is taking to complete. But while you’re at it, don’t just measure CPU time. You also care about context switches. You also care about the actual CPU cycles you consumed (which, before Vista, the system did not account for directly, and which are not necessarily a simple function of CPU time). You also care whether there were any interrupts that were charged to your account but in fact serviced an entirely different part of the system. It doesn’t help if you’re saying 100ms without having any additional data. And by the way, this is one of the reasons that I often don’t care about the time your algorithm takes; instead, I care how good it is compared to other algorithms. After all, if I should choose your algorithm over someone else’s, I am not choosing between 100ms and 200ms, which is going to be different on the production machine anyway; I’m choosing between something that’s potentially good and something that’s potentially twice as good.
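A minimal sketch of what measuring more than wall-clock time might look like, assuming Vista or later for the cycle counts; QueryThreadCycleTime and GetCurrentThread are Win32 APIs, and BusyWork is a hypothetical stand-in for the code under measurement.

```csharp
// Compare wall-clock time with process CPU time and (on Vista and later)
// the actual cycles charged to the current thread.
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class CpuTimeSketch
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    static extern bool QueryThreadCycleTime(IntPtr threadHandle, out ulong cycles);

    static void BusyWork()
    {
        double x = 0;
        for (int i = 0; i < 10_000_000; i++) x += Math.Sqrt(i);
    }

    static void Main()
    {
        var process = Process.GetCurrentProcess();
        TimeSpan cpuBefore = process.TotalProcessorTime;
        QueryThreadCycleTime(GetCurrentThread(), out ulong cyclesBefore);
        var wall = Stopwatch.StartNew();

        BusyWork();

        wall.Stop();
        QueryThreadCycleTime(GetCurrentThread(), out ulong cyclesAfter);
        process.Refresh();
        TimeSpan cpuAfter = process.TotalProcessorTime;

        Console.WriteLine($"Wall clock: {wall.ElapsedMilliseconds} ms");
        Console.WriteLine($"CPU time:   {(cpuAfter - cpuBefore).TotalMilliseconds} ms");
        Console.WriteLine($"Cycles:     {cyclesAfter - cyclesBefore:N0}");
        // A large gap between wall clock and CPU time hints at waiting,
        // context switches, or interference from the rest of the system.
    }
}
```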
Memory utilization is important. And memory utilization is a subject wide enough to deserve a blog post (if not a book) on its own. So you’re not just looking at the number of bytes your application occupies in system memory (BTW, are you looking at physical memory or virtual memory?). You need to care about the working set size. You need to care about private (non-shared) memory as opposed to shared memory – this has an impact on overall system performance. You need to focus on page faults and see if they’re soft page faults or hard page faults – the difference between them is vast, and ignoring it is inexcusable. In a managed environment, you need to understand whether your memory is a single consecutive chunk or really a million different objects, because this has a grave effect on garbage collection timing. Oh, and you’re using garbage collection? That’s cool, what kind of garbage collector are you using? Does concurrent GC make a difference for your application? In .NET, would your algorithm run faster under the multiple-heap, multiple-GC-thread server GC or the single-heap, single-GC-thread workstation GC? You need to make sure that your memory usage patterns are consistent, and that you are not fragmenting system memory or your language-of-choice memory allocator’s internal data structures; besides fragmentation, there are dozens of best practices and strategies for allocating memory in a way that is friendly to the environment you’re using – learn them, and use them. Obvious as it sounds, you must make sure that your memory usage doesn’t creep up over time, suddenly unveiling a memory leak. And that’s just memory.
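A minimal sketch of the most basic numbers worth watching in a .NET process; these are standard System.Diagnostics and GC APIs, and the thresholds you would compare them against are entirely up to your application.

```csharp
// The working set, private bytes, virtual size and GC activity of the
// current process; a starting point, not a complete memory picture.
using System;
using System.Diagnostics;

static class MemorySketch
{
    static void Main()
    {
        var p = Process.GetCurrentProcess();
        Console.WriteLine($"Working set:   {p.WorkingSet64 / 1024 / 1024} MB");
        Console.WriteLine($"Private bytes: {p.PrivateMemorySize64 / 1024 / 1024} MB");
        Console.WriteLine($"Virtual size:  {p.VirtualMemorySize64 / 1024 / 1024} MB");
        Console.WriteLine($"Gen 0 / 1 / 2 collections: " +
            $"{GC.CollectionCount(0)} / {GC.CollectionCount(1)} / {GC.CollectionCount(2)}");
        // Page faults (and whether they are soft or hard) are not exposed here;
        // use the system performance counters or a profiler to tell them apart.
    }
}
```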
I/O utilization is important. Oftentimes, page faults are ignored from this perspective – but a page fault is just another kind of I/O. Do you have lots of them? Can your file system accesses be satisfied from the system cache? Are you using every hint there is to let the system optimize, pre-fetch and cache your I/O patterns? Are you using asynchronous (a.k.a. overlapped) I/O, completion ports, I/O priorities, bandwidth reservations, QoS, handle priority hints, every single mechanism there is to ensure that this is really the best the system can do for you?
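As one small example of giving the system a chance to help you, here is a sketch of opening a file with an access-pattern hint and reading it asynchronously; the file name is hypothetical, and the buffer size is an illustrative choice, not a recommendation.

```csharp
// Open a file with a sequential-scan hint and read it with overlapped
// (asynchronous) I/O instead of blocking a thread on the disk.
using System;
using System.IO;

static class IoSketch
{
    static void Main()
    {
        const string path = "input.dat";   // hypothetical input file
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.Read, 64 * 1024,     // buffer size (illustrative)
            FileOptions.Asynchronous | FileOptions.SequentialScan))
        {
            var buffer = new byte[64 * 1024];
            IAsyncResult ar = stream.BeginRead(buffer, 0, buffer.Length, null, null);
            // ... do useful work while the read is in flight ...
            int bytesRead = stream.EndRead(ar);
            Console.WriteLine($"Read {bytesRead} bytes without blocking up front.");
        }
    }
}
```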
Degree of concurrency is important. You are not running on a single-core machine anymore. You are not. This is a nice dream that has come to an abrupt end with the frightening popularity of multi-core machines in every kitchen. How do you scale to 4 processors? How do you scale to more? How many parts of your application can run in parallel, and how many are inherently synchronized? How much contention do you have on each lock you hold, on each thread you create, per request, per client, per process? And what about lock convoys, starvation, priority inversion, NUMA systems, cache-line collisions – is there a chance you’ll be seeing those in the near future? Have you considered lock-free algorithms and data structures? Have you considered functional languages? Should you?
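A minimal sketch of measuring how contention scales: the same increment work done under a shared lock and with an interlocked primitive, timed across a varying number of threads. The iteration counts and thread counts are illustrative parameters, not recommendations.

```csharp
// Time identical work under a shared lock vs. Interlocked, at 1..8 threads,
// to see how much the shared resource costs as concurrency grows.
using System;
using System.Diagnostics;
using System.Threading;

static class ContentionSketch
{
    const int IterationsPerThread = 1_000_000;

    static long TimeMs(int threads, Action body)
    {
        var workers = new Thread[threads];
        for (int i = 0; i < threads; i++) workers[i] = new Thread(() => body());
        var sw = Stopwatch.StartNew();
        foreach (var t in workers) t.Start();
        foreach (var t in workers) t.Join();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        var gate = new object();
        long lockedCounter = 0, interlockedCounter = 0;

        foreach (int threads in new[] { 1, 2, 4, 8 })
        {
            long withLock = TimeMs(threads, () =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                    lock (gate) lockedCounter++;
            });
            long withInterlocked = TimeMs(threads, () =>
            {
                for (int i = 0; i < IterationsPerThread; i++)
                    Interlocked.Increment(ref interlockedCounter);
            });
            Console.WriteLine($"{threads} threads: lock {withLock} ms, Interlocked {withInterlocked} ms");
        }
    }
}
```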
Performance under load is important. There’s a nice graph you can try sketching for your current system, whatever it’s supposed to do – the X axis would be the number of requests being added to the system every second, and the Y axis the average time it takes to service a request. You might see some surprising results – have you considered the fact that the more threads you spin off for servicing requests, the more locking and waiting you might induce into the system? While you have load in mind, do consider whether latency or throughput or both are important to you; do consider where you want to invest the majority of your optimization time; do invest in making the decision and writing the specification as to how many concurrent requests you are required to support. Don’t let that specification and that consideration stem from the system as you implemented it.
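A minimal sketch of producing that graph: offer requests at an increasing rate and record the average time each one takes to complete. The "request" here is a hypothetical stand-in (a short lock-guarded computation), the rates and durations are illustrative, and a real load generator would control timing far more precisely than Thread.Sleep allows.

```csharp
// Offer load at several rates and report completions and average latency;
// watch what happens once the shared lock becomes the bottleneck.
using System;
using System.Diagnostics;
using System.Threading;

static class LoadSketch
{
    static readonly object Gate = new object();

    static void ServiceRequest()
    {
        lock (Gate)                        // shared resource: requests serialize here
            Thread.SpinWait(200_000);
    }

    static void Main()
    {
        foreach (int requestsPerSecond in new[] { 10, 50, 100 })
        {
            int completed = 0;
            long totalTicks = 0;
            var interval = TimeSpan.FromSeconds(1.0 / requestsPerSecond);

            var timer = Stopwatch.StartNew();
            while (timer.Elapsed < TimeSpan.FromSeconds(2))   // offer load for ~2 s
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    var sw = Stopwatch.StartNew();
                    ServiceRequest();
                    Interlocked.Add(ref totalTicks, sw.ElapsedTicks);
                    Interlocked.Increment(ref completed);
                });
                Thread.Sleep(interval);
            }
            Thread.Sleep(1000);            // crude drain; a real harness would wait properly

            double avgMs = completed == 0 ? 0 :
                (double)totalTicks / completed / Stopwatch.Frequency * 1000;
            Console.WriteLine($"{requestsPerSecond} req/s offered: " +
                              $"{completed} completed, avg latency {avgMs:F1} ms");
        }
    }
}
```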
Finally, when you decide what to measure, spend some time thinking about which parts of your system you are willing to optimize. If you don’t know your hot path (and even if you think you know your hot path), give the profiler a chance. It isn’t biased into thinking that lookup in a Dictionary<,> is an O(1) operation which can therefore be disregarded. It isn’t biased into thinking that there’s nothing to optimize if you’re just moving a piece of memory from one place to the next. You can yell at it later when it gives you unexpected results; and there will always be unexpected results, no matter the size of your application. (And this applies to any kind of tool, not just profilers; the authors of these tools have often already made the mistakes you are going to make for the first time – isn’t it best to learn from their experience? Is there really no alternative to writing a console application that calls a function in a poorly written loop, guarded by a sloppy Stopwatch?)
Another common pitfall is incorrectly choosing the input size and the number of iterations for testing the code in question. Assume you’re measuring your own home-bred algorithm for searching a substring in a sequence of characters. Measuring your algorithm with a constant input size is just as bad as measuring it with a constant number of iterations. The most common case I’ve seen is measuring just one iteration and then happily concluding something.
Nothing can bring you further from the truth than the fallacy of the single iteration. If you’re writing managed code, your method must be JIT-compiled when it’s first invoked; the compilation cost is often much higher than the cost of the algorithm itself. For any kind of code, the first time it is executed the processor doesn’t have it in its instruction cache yet; bringing the instruction sequence into the instruction cache can add significant delays to the overall result. Finally, your code is probably accessing data (few algorithms run without data); this data must be brought in from secondary storage, must be paged into memory, must be brought into the processor’s data cache. If you think that’s cheap, or if you think your language of choice is giving you a good enough abstraction, think again. We are all running on roughly the same kind of architecture; we are all enjoying the same leaky abstraction. Programming in Ruby, Boo, Python or JScript gives you no advantage over plain old assembler; in fact, it probably gives you a disadvantage, because assembly programmers who don’t know their way around the CPU cache are rare.
The opposite of the single iteration fallacy is choosing poor input sizes; searching a substring in a sequence of 5 characters is not something that lends itself to optimization easily. Does your home-bred search algorithm consider small input sizes? Does your home-bred sort algorithm use a fall-back insertion sort when the inputs are small? How is your home-bred garbage collector dealing with page faults, optimizing memory access, making correct use of the cache? Oh, it isn’t?
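A minimal harness sketch that avoids both pitfalls: warm up once to pay the JIT and cold-cache costs, then time many iterations over several input sizes. FindSubstring is a hypothetical placeholder for the home-bred algorithm; the sizes and iteration counts are illustrative.

```csharp
// Warm-up plus repeated, timed iterations over several input sizes.
using System;
using System.Diagnostics;

static class HarnessSketch
{
    static int FindSubstring(string haystack, string needle)
        => haystack.IndexOf(needle, StringComparison.Ordinal);  // placeholder implementation

    static void Main()
    {
        const int iterations = 10_000;
        foreach (int size in new[] { 10, 1_000, 100_000 })
        {
            string haystack = new string('a', size) + "needle";

            // Warm-up: pays the JIT and cold-cache costs outside the timed region.
            FindSubstring(haystack, "needle");

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                FindSubstring(haystack, "needle");
            sw.Stop();

            Console.WriteLine($"size {size,7}: {sw.Elapsed.TotalMilliseconds / iterations:F4} ms/iteration");
        }
        // Even this is only a sketch: it says nothing about context switches,
        // page faults, GC pauses or the variance between runs.
    }
}
```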
Surprising as it may seem, it’s necessary to thoroughly understand your execution environment before you can measure anything. There aren’t many guides to creating a clean system that can be used for performance tests; but there are too many systems used for these tests which are entirely unsuitable for the task. For example, I wouldn’t trust any performance measurement from the average bloatware-infested laptop in a large enterprise; I wouldn’t trust any performance measurement from my parents’ spyware-infested desktop at home; I wouldn’t trust any performance measurement from a single-core system, especially today. The point I’m trying to make is that you really need to understand the kind of environment you need; even if all you care about is CPU time, don’t ignore the fact that you have a 5,400RPM hard drive – the first page fault you take when the drive has spun down to sleep will cost you so much more than on an ultra-fast 10,000RPM beast with embedded NVRAM. After spending quite some time writing performance tests and measuring the performance of simple and complex algorithms alike, I can certainly conclude that a clean environment is not easy to find, and not easy to construct even if you have all the necessary resources. But if you can’t make it perfectly clean, at least strive to come close; and it is easy to tell when the environment is affecting your measurements: two successive samples giving you wildly different results is a sure sign. Oh yes, there are non-deterministic algorithms that genetically evolve, and so there’s no way of predicting their run-time. Yours probably isn’t one of them.
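A small sketch of that sanity check: take a series of samples of the same operation and look at the spread; a large standard deviation relative to the mean suggests the environment, not the code, is driving the numbers. The operation and the threshold are illustrative.

```csharp
// Repeatedly sample the same operation and report mean and standard
// deviation; wildly differing samples point at a noisy environment.
using System;
using System.Diagnostics;
using System.Linq;

static class VarianceSketch
{
    static void OperationUnderTest()
    {
        var data = new byte[1_000_000];
        new Random(42).NextBytes(data);
        Array.Sort(data);
    }

    static void Main()
    {
        const int samples = 15;
        var timings = new double[samples];

        OperationUnderTest();   // warm-up
        for (int i = 0; i < samples; i++)
        {
            var sw = Stopwatch.StartNew();
            OperationUnderTest();
            timings[i] = sw.Elapsed.TotalMilliseconds;
        }

        double mean = timings.Average();
        double stdDev = Math.Sqrt(timings.Sum(t => (t - mean) * (t - mean)) / samples);
        Console.WriteLine($"Mean {mean:F2} ms, std dev {stdDev:F2} ms");
        if (stdDev > mean * 0.1)   // illustrative threshold, not a rule
            Console.WriteLine("Samples vary a lot; suspect the environment before the code.");
    }
}
```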
I don’t believe I have to write this, but there’s another trick that will save you lots of embarrassment should you decide to make your performance measurements public. Please, oh please, make sure that you understand the domain in question before you run around trying to measure it. Do you understand degrees of transactional isolation? If you don’t, then you shouldn’t be comparing ReadCommitted isolation to Serializable isolation and posting a bunch of useless code on the web. Do you understand the premises of concurrent execution? If you don’t, why try to coerce an existing framework into parallelizing a piece of entirely non-parallelizable code? Are you familiar with the scenarios that motivated the development of the various collection classes? If you aren’t, how useful is it to compare the cost of looking up an element in an unsorted bag as opposed to a hash-indexed dictionary? If you don’t understand how a virtual method is different from a non-virtual method (aside from the fact that it can be overridden in a derived class), don’t try to come up with a clever way of optimizing virtual method calls. I am only writing this because your time is important, and my time is important.
Last but not least, performance testing is not a one-time activity. It is becoming clear today, with the advent of unit testing frameworks, evangelists, methodologies – the entire Agile development industry – that testing a piece of software for correctness is not something you leave to the QA phase of development. But that’s testing correctness; testing performance is an entirely different beast. What’s most bewildering is the fact that some of these unit-testing principles are so involved with proving code correctness that they sacrifice performance for the very sake of being able to test correctness (e.g. arguing that all methods should be virtual by default because it makes it so much easier to mock methods on objects and provide stubs for those objects). It makes no sense for anyone, myself included, to argue that performance should be preferred over correctness. But why are we forced to choose? Why are people trying to convince us that correctness is something that comes at the expense of performance, citing the ever-abused “premature optimization is the root of all evil” quote with frothing mouths?
Back to topic, performance testing is not just running a profiler over your code a couple of times to make sure you know what’s going on. Performance testing is about setting up performance goals and performance specifications; it’s about executing performance tests on a daily basis and comparing their results to ensure that no regression has occurred; it’s about rigorously making sure that there isn’t a single performance datum that you can define as perplexing; it’s all about understanding that your code is not “just correct” – that there is an entirely different set of criteria, an evolving set of criteria, that is relevant if performance is important to you (and trust me, it is). We have to cope with an enormous gap in methodology, tools and processes for performance testing; but I think we can give the correctness people a decent fight.
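As one concrete shape such a daily performance test could take, here is a sketch that times an operation and fails the build when it regresses past a stored baseline; the baseline, tolerance and the operation itself are all illustrative, and a real harness would also track history and account for variance.

```csharp
// A daily performance regression check: time the operation, compare against
// a recorded baseline with a tolerance, fail the build on regression.
using System;
using System.Diagnostics;

static class RegressionCheck
{
    const double BaselineMs = 50.0;   // recorded from a previous, trusted run (illustrative)
    const double Tolerance  = 1.2;    // fail if we regress by more than 20%

    static void OperationUnderTest()
    {
        // Hypothetical stand-in for the code whose performance we track.
        var data = new int[100_000];
        for (int i = 0; i < data.Length; i++) data[i] = i;
        Array.Sort(data, (a, b) => b.CompareTo(a));
    }

    static int Main()
    {
        OperationUnderTest();   // warm-up

        const int runs = 20;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++) OperationUnderTest();
        sw.Stop();

        double averageMs = sw.Elapsed.TotalMilliseconds / runs;
        Console.WriteLine($"Average: {averageMs:F1} ms (baseline {BaselineMs} ms)");

        if (averageMs > BaselineMs * Tolerance)
        {
            Console.WriteLine("Performance regression detected.");
            return 1;           // fail the daily build
        }
        return 0;
    }
}
```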
I know this post is riddled with questions; I also know that I’m not giving you any easy answers. Performance is important; it’s crucial; it’s often overlooked when Correctness raises its ugly head. It’s crucial when designing a system; it’s crucial when developing it; it’s crucial when testing it and when deploying it to production. I don’t think there has been a subject of such importance left so unexplored in computer software; so neglected and forgotten and washed away by the Big Methodology Wave; so inexcusably coerced and distorted by people, tools and processes alike. We’re just taking our first baby steps in exploring it today, and there is still so much that lies further ahead.