C++ Renaissance: Getting Back The Free Lunch

April 16, 2011

At the beginning of 2005 Herb Sutter had an article stating that the developer’s free lunch is over. We had an assumption that more transistors in the CPU imply better application execution speed. The CPU executes the code in a sequential manner hence the performance of the CPU-bound code is directly related to CPU frequency. This used to be our “Free Lunch”: an old program runs faster on a new CPU. Using this assumption with modern low power consumption multi-core CPUs is wrong, we might even find that an old program runs slower on a new CPU. Since performance is no longer tied to CPU frequency we need to leverage parallelism. To some extent this is not a new concept, multiprocessing systems exist for years. For example Windows NT was designed back in 1989 as a Symmetric Multiprocessing Operating System. Hyper threading CPUs exist at least half a decade. But do we get better performance by moving from one-way machine to 16-ways? We need a better abstraction!

Our software design can be based on multi-threading execution. We can align the number of threads to the number of cores, but does this mean that if we double the cores, we double the performance? Probably not! There are many obstacles that we need to overcome. It is not easy to split a task to many threads. Creating threads is a costly operation. Kernel based context switch implies more overhead. We need to make sure that the number of running thread concurrently will not be more than the number of cores, however if a thread gets blocked we need to free or create another thread to run to utilized the free core. Add to that the fact that systems with many cores have NUMA architecture, which means that allocating and affinity of memory to threads and core can gain more performance.

Parallelism is hard, instead of inventing the underlying abstraction; we want to use existing well-known ones. In this post we will see how the C++ Concurrency Runtime (ConcRT) and Parallel Pattern Library (PPL) provide the needed abstraction and bring us back the free lunch. Using these abstractions we will gain performance from running the same application on many cores. We will also compare the new ConcRT libraries to the .NET Task Parallel Library (TPL) and see that the concept are quiet similar, however C++ has a better runtime (Resource Manager and Task Scheduler) that taking NUMA into consideration and can result better performance on such architectures.

 

To show that throwing more cores result better performance we will calculate prime numbers in a loop. We want to find the highest prime number that we can in 10 seconds. We will start with one core and do the same algorithm throwing more cores. Since I am using a Core i7 Hyper-threaded enabled machine, we will see some degradation in scalability when we will reach a high number of cores (actually virtual CPUs since it is hyper-threaded).

Bear in mind that finding the highest prime number is non-linear problem. I.E, finding a 10 digits number will take much more time than finding a 5 digits prime number. So it will be hard to deduce the scalability factor. What we will see is that adding more cores brings back our free lunch!

To keep contention to the minimum we will use the combinable<T> class. This class provides a local copy for each thread. In our case each thread will hold its maximum finding and then using the combine() method, we will pick the maximum number from all the threads. To handle the time limit, we will use the task task_group.cancel() method to finish the current parallel loop execution. To limit the range of numbers, we will use groups of 8192 numbers in parallel at a time.

The C++ Code:

 

 

// ----------------------------------------------------------------------
// <copyright file="GettingBackTheFreeLunch.cpp" company="CodeValue">
//     Copyright (c) 2011 by CodeValue Ltd. All rights reserved
// </copyright>
//
// http://codevalue.com
// Licensed under the Educational Community License version 1.0 (http://www.opensource.org/licenses/ecl1)
// This example was written as a demonstration of principles only, 
// as part of the Windows Concurrent programming course for C++ developers // // ------------------------------------------------------------------------
#include
"stdafx.h" #include <Windows.h> #include <ppl.h> #include <concurrent_vector.h> #include <iostream> #include <functional> #undef max using namespace std; using namespace Concurrency; ULONGLONG FindLargestPrimeNumberInTime(DWORD concurrencyLevel, DWORD timeInterval); int _tmain(int argc, _TCHAR* argv[]) { SYSTEM_INFO systemInformation; ::GetSystemInfo(&systemInformation); wcout << L"Number of logical processor: " << systemInformation.dwNumberOfProcessors << endl; for (DWORD concurrencyLevel = 1; concurrencyLevel <= systemInformation.dwNumberOfProcessors; ++concurrencyLevel) { DWORD timeInterval = 1000 * 10; ULONGLONG largetPrimeNumber = FindLargestPrimeNumberInTime(concurrencyLevel, timeInterval); wcout << L"Result: " << largetPrimeNumber << L" in " << timeInterval/1000 << L" seconds using " << concurrencyLevel << L" Logical Processors" << endl; } return 0; } bool IsPrime(ULONGLONG number, DWORD tickCountLimit) { if ((number & 1) == 0) return false; ULONGLONG limit = static_cast<ULONGLONG>(sqrt(static_cast<double>(number))); for (ULONGLONG n = 3; n <= limit; n += 2) { if ( (number % n == 0) || (::GetTickCount() > tickCountLimit)) return false; } return true; } ULONGLONG FindLargestPrimeNumberInTime(DWORD concurrencyLevel, DWORD timeInterval) { ULONGLONG result = 3; SchedulerPolicy policy; policy.SetConcurrencyLimits(concurrencyLevel, concurrencyLevel); CurrentScheduler::Create(policy); task_group taskGroup; taskGroup.run_and_wait([&] { DWORD dueTime = ::GetTickCount() + timeInterval; combinable<ULONGLONG> maxPrime([]()->ULONGLONG { return 3; }); ULONGLONG slice = 8192; for (ULONGLONG range = 3; range < ~static_cast<ULONGLONG>(0); range += slice) { parallel_for<ULONGLONG>(range,range + slice, 1, [&](ULONGLONG n) { if (IsPrime(n, dueTime)) maxPrime.local() = max(maxPrime.local(), n); if (::GetTickCount() > dueTime) taskGroup.cancel(); }); result = maxPrime.combine(max<ULONGLONG>); } }); return result; }

The result is:

 

Number of logical processor: 8

Result: 16252919 in 10 seconds using 1 Logical Processors

Result: 19365887 in 10 seconds using 2 Logical Processors

Result: 23945203 in 10 seconds using 3 Logical Processors

Result: 26845127 in 10 seconds using 4 Logical Processors

Result: 29343719 in 10 seconds using 5 Logical Processors

Result: 31547357 in 10 seconds using 6 Logical Processors

Result: 33079289 in 10 seconds using 7 Logical Processors

Result: 26673149 in 10 seconds using 8 Logical Processors

Press any key to continue . . .

 

 

To show the C# free lunch and to compare the results, I implement the same program using .NET TPL. Instead of combinable, the .NET parallel programing sample pack provides the ReductionVariable class. To stop the loop I use the loopState loop controlling mechanism. To control parallelism I use the ParallelOptions.MaxDegreeOfParallelism  property: 

 

using System;
using System.Threading;
using System.Threading.Tasks;

namespace GettingBackTheFreeLunch
{
    class Program
    {
        static void Main()
        {
            Console.WriteLine("Number of logical processor: {0}", Environment.ProcessorCount);

            for (int concurrencyLevel = 1; concurrencyLevel <= Environment.ProcessorCount; ++concurrencyLevel)
            {
                const int timeInterval = 1000 * 10;
                long largetPrimeNumber = FindLargestPrimeNumberInTime(concurrencyLevel, timeInterval);
                Console.WriteLine("Result: {0} in {1} seconds using {2} logical processors",
                                  largetPrimeNumber, timeInterval / 1000, concurrencyLevel);
            }
        }

        static bool IsPrime(long number, long tickCountLimit)
        {
            if ((number & 1) == 0)
                return false;

            var limit = Math.Sqrt(number);

            for (long n = 3; n <= limit; n += 2)
            {
                if ((number % n == 0) || (Environment.TickCount > tickCountLimit))
                    return false;
            }
            return true;
        }


        private static long FindLargestPrimeNumberInTime(int concurrencyLevel, long timeInterval)
        {
            var parallelOption = new ParallelOptions
                                     {
                                         MaxDegreeOfParallelism = concurrencyLevel
                                     };
            
            long dueTime = Environment.TickCount + timeInterval;

            var maxPrime = new ReductionVariable<long>(() => 3);

            const long slice = 8192;

            for (long range = 3; range < long.MaxValue; range += slice)
            {
                Parallel.For(range, range + slice,
                             parallelOption, (n, loopState) =>
                                                 {
                                                     if (loopState.IsStopped)
                                                         return;

                                                     if (IsPrime(n, dueTime))
                                                         maxPrime.Value =
                                                             Math.Max(maxPrime.Value, n);

                                                     if (Environment.TickCount > dueTime)
                                                         loopState.Stop();
                                                 });


                if (Environment.TickCount > dueTime)
                    break;
            }
            long result = maxPrime.Reduce(Math.Max);
            return result;
        }
    }
}

 

The result of the C# code is:

Number of logical processor: 8

Result: 7946903 in 10 seconds using 1 logical processors

Result: 12932119 in 10 seconds using 2 logical processors

Result: 16153307 in 10 seconds using 3 logical processors

Result: 19601821 in 10 seconds using 4 logical processors

Result: 21543757 in 10 seconds using 5 logical processors

Result: 23133361 in 10 seconds using 6 logical processors

Result: 24362383 in 10 seconds using 7 logical processors

Result: 25337269 in 10 seconds using 8 logical processors

Press any key to continue . . .

The C++ code is a little bit faster, it is hard to tell how much faster due to the non-linearity of the problem. We also need to check it on a real multi-core NUMA based server to have a better comparison between the C++ and the C# code. The nice thing about the C# and the C++ implementation is that we have frameworks that can bring us back the free lunch.

 

image

Add comment
facebook linkedin twitter email

Leave a Reply to Thoai Cancel Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*

7 comments

  1. ripper234April 16, 2011 ב 12:38

    I find this post confusing. For starters, even for one thread, your C++ code out-performs C# by a factor of 24, so this is clearly not a good comparison of _parallelism_. I don’t know TPL API very well, but are you sure the parallel loop actually stops when it encounters a divisor? I would verify that to make sure “all is kosher”.

    Also, comparing the results is a bit problematic. E.g. for C++, the score of 21946369 with 8 cores over 14131147 with 1 cores looks like it’s only a 50% increase (as opposed to the theoretical 700%), but we have to remember the scale is not linear – the amount of work done with 8 cores is much more than 50% more than 1 core, because the algorithm is not linear, but pseudo-quadrupedal (pseudo because it depends on the prime number distribution).

    Finally, if we ignore the scale issue and just look at the parallelism, it looks that C# can obtain a X3.2 improvement, while C++ only got a X1.5 improvement…

    I would love to see a followup post that addresses these issues.

    Reply
  2. Alon FliessApril 16, 2011 ב 17:10

    Thanks for the comment. The purpose of the blog post was to show that we get back our free lunch, I.E getting better result with more cores. The comparison of the C# and C++ is more an anecdote. But to be on the safe side, I will check again the C# code. There are two factors that make the comparison very difficult. The first is the non-linearity of the problem. Maybe I need to sample the code for longer time than 10 seconds, or find a linear problem. The second is the Hyper-Threading effect. If C++ utilized the CPU better, Hyper-threading will hurts the scalability factor, since on virtual CPU is interfering with the other on the same core.

    Reply
  3. Alon FliessApril 16, 2011 ב 18:48

    After more checks and thanks to Eran Stiller who found that I use the costly DateTime.Now, the C# example works much faster. So thanks again ripper to make me double-check the post. I have also updated the content to clarify my intensions.

    Reply
  4. Eran StillerApril 16, 2011 ב 19:21

    It was my pleasure to help. 🙂

    Reply
  5. icon librarySeptember 18, 2012 ב 10:56

    I consider, that you are not right. Let’s discuss. Write to me in PM.

    P.S. Please review Android Menu Icons from martinking33

    Reply
  6. yqlbosehtm@gmail.comDecember 23, 2012 ב 06:24

    this is a demo post, thank you very much.

    Reply
  7. ThoaiJanuary 29, 2013 ב 22:53

    Hey everybody tdayos home computer games does not use much more than 4 cores, hmmm i really wonder why most of tdayos processors don’t have much more than 4 cores per CPU strange huh. Seriously this is overkill for now. Also remember Doom 4 that was shown with a 8 core Mac G5 tower, and that was shown using one Core according to John Carmack, lead designer of Id Software,so we dont need something this overkill, at least for now. I understand that this is what is potential

    Reply