Microsoft HPC Server 2008 and Parallel Techniques - Pavel's Blog
Pavel's Blog

Pavel is a software guy that is interested in almost everything
software related... way too much for too little time

Microsoft HPC Server 2008 and Parallel Techniques

The multi-core mini-revolution has pushed multithreading techniques into much more vigorous discussion: exploiting multiple cores is no longer the preserve of a few individuals, but is as common as the computer itself.

Another way to get performance scalability is simply to use multiple machines, running some application or parts of it at the same time, hopefully distributing the workload appropriately and communicating if and as required to get the final result. Combining the two techniques (multiple cores and multiple machines) can be even more powerful.

I was recently introduced to Microsoft HPC Server 2008 (HPC stands for High Performance Computing), which runs a cluster of 64-bit Windows Server 2008 machines managed by the HPC Server software. This package allows executing jobs (consisting of individual tasks, which are simply executables) that may be configured to run on multiple cores on multiple machines managed by the cluster.

HPC Server 2008 is the second version of Microsoft’s computing cluster, the first being Windows Compute Cluster Server 2003 (CCS), built on top of Windows Server 2003 and released in 2006 (followed by an SP1 in 2007).

HPC centres around cluster management, with some client tools to manage a cluster of computers: add and remove nodes (computers), submit jobs, and otherwise manage and monitor the cluster. The only requirement is that every machine in the cluster must run 64-bit Windows Server 2008. A client, running any version of Windows, can issue jobs programmatically, or by running the client management tools (supplied with the HPC 2008 Pack).

A typical configuration of HPC Server is as follows:

image

The Head node functions as the “controller” of the cluster, effectively dispatching the jobs to the compute nodes. There is nothing special about the Head node – in fact, the Head node can also be a compute node.

The broker nodes are a special case, needed when a job is “invoked” in the typical WCF way, by calling a service. Internally, the HPC infrastructure translates the call to a job, as HPC is a batch-oriented system that simply pushes jobs to the compute nodes.

image

Programming the cluster

HPC Server does not deal directly with how to parallelize jobs or tasks. It’s simply a manager and a job dispatcher with decent tools for configuring and monitoring operations. The actual programming is another matter altogether.

What follows is a non-exhaustive list of ways to parallelize stuff on a Windows platform. Some work within a single machine and some are cluster-based techniques.

Libraries and APIs

  • Hard core multithreading – using native Windows threads, either working with the Windows API (CreateThread) or the (roughly equivalent) .NET Thread class. This is definitely the most difficult option, requiring careful planning, correct synchronization, attention to the number of available cores, etc. There are plenty of sources of information on how, why and why not to do hard core threading.
  • OpenMP – OpenMP is a standard API for doing parallel stuff more easily. It’s implemented as a set of #pragmas that add threading support to a C/C++ application, without the need to create OS threads explicitly (the runtime creates and manages them for you). Microsoft has implemented the OpenMP 2.0 standard (the current standard being 3.0), and it is integrated into Visual Studio 2008. Here’s a simple program that does matrix multiplication in a single threaded manner and then with OpenMP:

#include <windows.h>   // GetTickCount, DWORD
#include <tchar.h>     // _tmain, _TCHAR
#include <stdio.h>     // printf
#include <stdlib.h>    // srand, rand
#include <omp.h>       // build with the /openmp compiler switch

const int size = 1000;

// Globals: three 1000x1000 int matrices are too large for the stack
int m1[size][size], m2[size][size];
int m3[size][size];   // result

int _tmain(int argc, _TCHAR* argv[]) {
   ::srand(::GetTickCount());

   // Fill the two input matrices with random values
   for (int i = 0; i < size; i++)
      for (int j = 0; j < size; j++) {
         m1[i][j] = ::rand() % 100;
         m2[i][j] = ::rand() % 100;
      }

   printf("Starting single threaded...\n");
   DWORD start = ::GetTickCount();

   for (int i = 0; i < size; i++)
      for (int j = 0; j < size; j++) {
         m3[i][j] = 0;
         for (int k = 0; k < size; k++)
            m3[i][j] += m1[i][k] * m2[k][j];
      }

   DWORD stop = ::GetTickCount();
   printf("Elapsed: %lu msec\n", stop - start);

   printf("\nStarting with OpenMP parallel for...\n");
   start = ::GetTickCount();

   // The rows (i) are divided among the threads; j and k are declared
   // inside the parallel loop, so each thread has its own copies
#pragma omp parallel for
   for (int i = 0; i < size; i++)
      for (int j = 0; j < size; j++) {
         m3[i][j] = 0;
         for (int k = 0; k < size; k++)
            m3[i][j] += m1[i][k] * m2[k][j];
      }

   stop = ::GetTickCount();
   printf("Elapsed: %lu msec\n", stop - start);

   return 0;
}

The #pragma omp parallel for directive is the key: it splits the iterations of the outer for loop among a team of threads. By default, the OpenMP runtime picks the number of threads, which here is the number of cores. Here’s the output on my machine (x64 release):

image

We get a nice 84% speed increase, as OpenMP created 2 threads on my dual-core machine. On a quad-core machine it would create 4 threads by default and divide the iterations evenly among them.
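To check how many threads the runtime really uses on a given machine, something along these lines should do. It is only a sketch: openmp_thread_count is a name made up for illustration, and the #ifdef _OPENMP guard is an added assumption so that the function still compiles (and returns 1) when the compiler's OpenMP switch is off.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: reports the size of the thread team that OpenMP
// creates for a parallel region with default settings.
int openmp_thread_count() {
    int count = 1;  // fallback when OpenMP is not enabled
#ifdef _OPENMP
#pragma omp parallel
    {
        // Only one thread of the team writes the shared variable
#pragma omp single
        count = omp_get_num_threads();
    }
#endif
    assert(count >= 1);  // a team always has at least one thread
    return count;
}
```

On a default setup this would normally return the number of logical processors, but the OMP_NUM_THREADS environment variable (or a call to omp_set_num_threads) can change that.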

  • Task Parallel Library (TPL) a.k.a. Parallel Extensions – in the upcoming .NET 4.0 framework, the high-level Task class, along with some even higher-level types such as Parallel, will make some things easier in the .NET world. The Parallel.For method, for instance, is the rough equivalent of the above OpenMP #pragma omp parallel for:

Parallel.For(0, 100000, x => DoSomethingWithX(x));

This will automatically parallelize the loop with respect to the number of existing cores, and works great assuming there is no dependency between iterations.
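The "no dependency between iterations" caveat applies to the OpenMP version just as much. A common case is a loop in which every iteration adds to the same variable; parallelized naively, that is a race condition. A minimal sketch of the usual OpenMP fix, the reduction clause, follows (parallel_sum is an illustrative name, not something from the program above):

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Sums the elements of v. With reduction(+:sum) each thread accumulates
// into its own private copy of sum, and the copies are added together
// when the loop ends. Without OpenMP enabled the pragma is ignored and
// the loop simply runs serially.
long long parallel_sum(const std::vector<int>& v) {
    assert(v.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(v.size());  // OpenMP 2.0 wants a signed int loop index
    long long sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

Calling parallel_sum on a vector of 1000 elements that are all 2 gives 2000, whether or not OpenMP is switched on.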

  • MPI – MPI (Message Passing Interface) is a C-based library that takes a different approach to parallelism. Instead of focusing on threads, it focuses on processes: data does not need to be synchronized (each process has its own address space), but to share information the processes exchange messages, which is the heart of MPI. A typical scenario is a master process handing work to worker processes (which may run on different machines) and collecting the results when the work is done. The master and workers are actually written with the same code. An MPI-based application starts up using a dispatcher tool, mpiexec.exe, which launches all processes with different IDs (called “ranks” in MPI terminology). Each process examines its rank (passed indirectly by a command line argument) and decides what to do (be a master or a worker). This idea scales well on clusters and is supported on HPC Server 2008.
  • MPI.NET – a .NET object-oriented wrapper around the MPI library (Codeplex – no release at this time, but available from Indiana University). It’s still in a kind of alpha, being supported but not developed by Microsoft, and wraps most, but not all, of the MPI API. It may be a reasonable alternative for managed developers to harness the cluster for parallel computing.
  • MPI with OpenMP – there is nothing to stop you from combining OpenMP multithreading support in a single process with MPI capabilities to stretch across processes and machines. Not usually done, because things may get hairy pretty quickly, but definitely a worthwhile alternative.
  • Intel tools – Intel has some tools (such as the recently released Intel Parallel Studio) for doing multithreaded stuff (all targeting native developers only), as well as tools to analyze multithreading issues, such as deadlocks and race conditions. They all cost a fair amount of money, as opposed to others such as OpenMP and MPI. Intel tools are a topic unto itself which I will not tackle here.
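Going back to MPI for a moment: the "same code for master and workers" idea boils down to each process working out, from its rank alone, which part of the job is its own. Here is a rough sketch of that decision in plain C++ with no MPI calls at all. The function name items_for_rank is made up for illustration, and in a real MPI program rank and size would come from MPI_Comm_rank and MPI_Comm_size rather than being passed in by hand.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: returns the half-open range [first, last) of work
// items owned by process `rank` when `total` items are split across `size`
// processes. The first (total % size) ranks get one extra item each.
std::pair<int, int> items_for_rank(int rank, int size, int total) {
    assert(size > 0 && rank >= 0 && rank < size && total >= 0);
    const int base = total / size;    // items every rank gets
    const int extra = total % size;   // leftovers, handed to the lowest ranks
    const int first = rank * base + (rank < extra ? rank : extra);
    const int last = first + base + (rank < extra ? 1 : 0);
    return std::make_pair(first, last);
}
```

With 10 items and 4 processes, ranks 0 to 3 get the ranges [0, 3), [3, 6), [6, 8) and [8, 10). For the matrix example above, the items would be rows of the result matrix, with each process computing its rows and sending them back to rank 0.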

There’s definitely a lot going on in the multi-core, multithreaded, and cluster worlds right now and it won’t be going away any time soon (quite the opposite). Now is a good time to jump on the parallel wagon and see where it takes you (avoiding deadlocks of course…)
