Long-time readers of this blog know that I really don’t like rehashing someone else’s thoughts and linking to material that isn’t my own. However, the ACM article The Future of Microprocessors (S. Borkar, A. Chien) warrants an exception to this rule.
If you can afford the time (roughly two hours), I strongly recommend reading the article instead of my somewhat incoherent ramblings below. If you’re looking for an executive summary of some of the biggest challenges and likely solutions, and are willing to sacrifice some accuracy and polish, read on 🙂
Moore’s Law and related observations have somewhat clouded the way we look at application performance. For years, it has been common to assume that CPU-bound software doubles its performance every year or two, making it easy to process bigger sets of data, support higher display resolutions, or handle faster and wider network streams. With the end of the free lunch, parallelism rears its ugly head and forces us to think about how processing can be broken into multiple parts and executed simultaneously on multiple cores. For some workloads this is easy; others warrant highly inventive parallel algorithms that often deviate significantly from their sequential counterparts. The cost of synchronization, inter-core communication, and cache coherency drills a large hole in the high-level language abstractions to which we have grown accustomed.
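The synchronization cost mentioned above is exactly why parallel speedup flattens out: Amdahl’s law says the serial fraction of a program bounds the speedup no matter how many cores you throw at it. A quick sketch (the 5% serial fraction is an arbitrary illustration, not a measurement):

```python
def amdahl_speedup(serial_fraction, cores):
    """Maximum speedup on `cores` cores when `serial_fraction`
    of the work cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Even a small serial fraction dominates at high core counts:
for n in (2, 16, 256, 4096):
    print(n, round(amdahl_speedup(0.05, n), 1))
# 2 cores -> ~1.9x, but 4096 cores -> only ~19.9x (the limit is 20x)
```

With 5% serial work the speedup can never exceed 20x, which is why “just add cores” stops being a plan long before core counts reach the thousands.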
Which of these trends is going to dominate the next 20 years of microprocessors? Are we in for a 1000x increase in processor speeds or numbers of processor cores? Or is there something completely different that we will have to embrace, revolutionizing again the way we reason about software performance?
The Future of Microprocessors addresses some of these questions in a very accessible way. First, it outlines the way some of the greatest advances in processor performance have been achieved:
- Increasing individual transistor speeds by scaling them down to unthinkably tiny sizes (from 10 micrometers 40 years ago to 30 nanometers today)
- Microarchitecture tricks, including multi-cycle execution, pipelining, and branch prediction
- Multiple layers of cache memory, reducing the stalls associated with fetching data directly from main memory
Unfortunately, some of these “automatic” trends cannot proceed further without significant changes. The primary constraint is the power and area budget that is reasonable for a desktop CPU. This limited energy budget caps the advances attainable by straightforward scaling of today’s technology far below the 1000x performance increase expected by 2030. Some of the ways of addressing this challenge are outlined below.
Multiple cores don’t have to be replicas of today’s large cores. From a power-efficiency perspective, it may make sense to have a set of small cores with lower single-thread performance but a reasonable energy signature. A hybrid approach, where large cores are used for sequential or latency-sensitive workloads and smaller cores are used for highly parallel execution, is also a feasible alternative.
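The article leans on Pollack’s Rule, the observation that single-thread performance grows roughly with the square root of core area, to argue for the small-core approach. A back-of-envelope sketch under that assumption (the area budget of 16 units is made up for illustration):

```python
import math

def core_performance(area):
    """Pollack's Rule: single-thread performance scales
    roughly as the square root of core area."""
    return math.sqrt(area)

budget = 16  # total die area available for cores (illustrative units)

# One big core using the whole budget vs. sixteen unit-size cores.
big_core_perf = core_performance(budget)               # sqrt(16) = 4.0
small_cores_throughput = budget * core_performance(1)  # 16 * 1 = 16.0

print(big_core_perf, small_cores_throughput)
```

Under these assumptions the big core wins 4x on a single thread, while the sea of small cores offers 4x more aggregate throughput, provided the workload parallelizes. That tension is precisely what motivates the hybrid large-plus-small designs.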
Data movement across the interconnect and within the processor caches must be rethought to fit within the given energy budget. With only a 30x increase in processor performance, and assuming operands travel an average distance of 1mm, 90% of the processor’s energy budget would be consumed by data movement alone. This restriction points toward (unconventionally) larger register files, so that more data can be kept within 0.1mm of the relevant execution units.
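To get a feel for why operand movement dominates, here is a rough model. All the constants below are assumed for illustration only (they are not taken from the article); the point is the shape of the trade-off, not the exact figures:

```python
# Illustrative back-of-envelope: what fraction of energy goes to
# moving operands rather than computing on them?
compute_pj_per_op = 1.0    # assumed energy of one arithmetic op (pJ)
wire_pj_per_bit_mm = 0.05  # assumed energy to move one bit 1 mm (pJ)
bits_per_operand = 64
operands_per_op = 3        # two sources + one destination

def movement_fraction(distance_mm):
    """Fraction of per-operation energy spent on data movement
    when operands travel `distance_mm` on average."""
    move = (wire_pj_per_bit_mm * bits_per_operand
            * operands_per_op * distance_mm)
    return move / (move + compute_pj_per_op)

print(round(movement_fraction(1.0), 2))  # ~1 mm:   0.91 of the energy
print(round(movement_fraction(0.1), 2))  # ~0.1 mm: 0.49 of the energy
```

Even with these toy constants, shortening the average operand trip from 1mm to 0.1mm roughly halves the movement share, which is the intuition behind keeping data in larger register files close to the execution units.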
The interconnect network between CPU components and multiple CPU cores is in need of another radical redesign. This calls for multiple types of buses, combining commonplace packet-switched networks with circuit-switched networks.
Finally, some of the work may fall to us developers: an inevitable conclusion is that software itself will have to be part of the solution to the scaling problem. Some of the conveniences afforded by modern hardware, such as a flat address space and coherent caches, may have to give way to explicit software alternatives, possibly reaching as high up as the programming languages themselves. (By the way, some of these trends are already visible in NUMA systems on modern servers and desktops, and in the challenges of scaling even OS kernels to 256 processor cores or more.)
I wonder how long it will be until another “The Free Lunch Is Over” article is due. It might be 5 years, or 10, or 20, but it’s hard to see today how our current software will adapt to these processor scaling trends.