Uneven Work Distribution and Oversubscription

October 23, 2013


A few days ago I was teaching our Win32 Concurrent Programming course and showed students an experiment with the std::thread class introduced in C++ 11. The experiment is designed to demonstrate how to partition work across multiple threads and coordinate their execution, and the work to partition is simply counting the number of primes in a certain interval.

You can find the whole benchmark here. The heart of the code is the parallelize_count function, below:

void parallelize_count(unsigned nthreads, unsigned begin, unsigned end) {
    std::vector<std::thread> threads;
    unsigned chunk_size = (end - begin) / nthreads;
    for (unsigned i = 0; i < nthreads; ++i) {
        unsigned chunk_begin = begin + chunk_size * i;
        unsigned chunk_end = chunk_begin + chunk_size;
        threads.emplace_back([=]() {
            count_primes(chunk_begin, chunk_end);
        });
    }
    std::for_each(threads.begin(), threads.end(),
                  std::mem_fn(&std::thread::join));
}
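For completeness, here is one possible trial-division implementation of count_primes, a minimal sketch for illustration; the actual function in the linked benchmark may differ, but any implementation shares the key property that larger numbers take longer to test:

```cpp
// Hypothetical implementation of count_primes for illustration.
// Trial division: testing n costs roughly O(sqrt(n)) divisions,
// which is why high ranges are much more expensive than low ones.
unsigned count_primes(unsigned begin, unsigned end) {
    unsigned count = 0;
    for (unsigned n = begin; n < end; ++n) {
        if (n < 2) continue;
        bool prime = true;
        for (unsigned d = 2; d * d <= n; ++d) {
            if (n % d == 0) { prime = false; break; }
        }
        if (prime) ++count;
    }
    return count;
}
```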

Many people expect this to scale linearly with the number of threads, up to the hardware limit: the number of processors actually installed on the system. The reality is slightly more complex. Below are the results from one of the trial runs on my system, with an i7-3720QM processor (4 physical cores, 8 logical cores):


The orange and blue lines are results from gcc 4.8.2 on OS X 10.9 and from Visual C++ 2013 on Windows 8. VC's spike around 50 threads is not reproducible, and can be attributed to measurement perturbations. The interesting results, which are pretty similar across both compilers, are the following:

  1. The speedup is not quite linear. For example, going from 1 to 2 threads cuts the runtime by only 27% (as opposed to the 50% reduction linear scaling would predict).
  2. Even though I only have 4 physical cores on this machine, there is still a considerable 37% speed bump from 4 to 8 threads.
  3. Even after the hardware limits are exceeded, we can still see speedups from introducing more threads to the mix. For example, going from 8 to 16 threads produces a 27% speedup.

The explanation for this is the uneven workload distribution between the threads. Counting primes in an interval of larger numbers takes much longer than in an interval of smaller numbers. The Visual Studio Concurrency Visualizer can show you this graphically – and even embed markers from your code into the generated report. For example, here is a sample showing the three threads working on the range [0, 100000):


And here are the eight threads working on the same range:


This uneven distribution is why adding more threads, i.e., forcefully oversubscribing the scheduler with more threads than processors, is actually beneficial here: with more (and therefore smaller) chunks, the threads that draw the cheap low ranges finish quickly and free their processors for the threads grinding through the expensive high ranges.

The practical lesson here is that distributing work across threads is not easy. One good way to get rid of these (and other) problems is to use smaller chunks of work and queue them to a thread pool. There are of course even higher-level abstractions, such as parallel_for, designed to handle the work item partitioning and distribution for you. But that's a story for a different post.
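To make the smaller-chunks idea concrete, here is a minimal sketch of dynamic work distribution using an atomic counter instead of a full thread pool: each thread repeatedly grabs the next small chunk, so a thread that finishes a cheap low range immediately moves on to help with the expensive high ranges. The count_primes implementation and the chunk size of 1000 are assumptions for illustration, not the benchmark's actual code:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical trial-division count_primes, included so the sketch is
// self-contained; the benchmark's actual implementation may differ.
unsigned count_primes(unsigned begin, unsigned end) {
    unsigned count = 0;
    for (unsigned n = begin; n < end; ++n) {
        if (n < 2) continue;
        bool prime = true;
        for (unsigned d = 2; d * d <= n; ++d) {
            if (n % d == 0) { prime = false; break; }
        }
        if (prime) ++count;
    }
    return count;
}

// Dynamic distribution: threads claim small chunks from a shared atomic
// counter, so no thread is stuck with a disproportionately expensive range.
unsigned count_primes_dynamic(unsigned begin, unsigned end,
                              unsigned nthreads, unsigned chunk = 1000) {
    std::atomic<unsigned> next{begin};
    std::atomic<unsigned> total{0};
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nthreads; ++i) {
        threads.emplace_back([&] {
            for (;;) {
                unsigned chunk_begin = next.fetch_add(chunk);
                if (chunk_begin >= end) break;
                unsigned chunk_end = std::min(chunk_begin + chunk, end);
                total += count_primes(chunk_begin, chunk_end);
            }
        });
    }
    for (auto& t : threads) t.join();
    return total;
}
```

With chunks this small, the load balances itself even with exactly one thread per core, so there is no need to oversubscribe at all.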

I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn
