PDC 2009 Day 3: DirectCompute: Capturing the Teraflop

November 19, 2009

Chas Boyd’s session on DirectX11 DirectCompute is going to focus on bringing the power of the GPU to general-purpose computing (and not necessarily graphics applications).

A modern CPU has 4 cores running at 3GHz, each with 4-float-wide SIMD units, for a peak theoretical performance of 48-96GFlops. It offers 2x hyperthreading, a 64KB L1 cache, and a memory interface of about 20GB/s, and it draws about 200W from the wall at a cost of about $200.

A GPU is usually built from 32 cores, each 32 floats wide, running at 1GHz, which gives about 1 Teraflop (high-end parts deliver much more). It offers 32x hyperthreading and a very high 150GB/s of memory bandwidth to the GPU RAM, and it also draws about 200W at about $200. The GPU has 16K 32-bit registers per core! The compiler allocates the registers available to each thread from this pool.
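The peak figures quoted for both chips follow from multiplying cores, SIMD width, and clock rate. Here is a minimal C++ sketch of that arithmetic. `peak_gflops` is a hypothetical helper name, and the upper CPU figure (96GFlops) is reproduced by assuming a multiply-add counts as two operations per lane per cycle, which the session notes do not state.

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak in GFlops =
// cores * SIMD lanes per core * clock (GHz) * flops per lane per cycle.
double peak_gflops(int cores, int simd_lanes, double clock_ghz,
                   int flops_per_lane_per_cycle = 1) {
    assert(cores > 0 && simd_lanes > 0 && clock_ghz > 0.0);
    return cores * simd_lanes * clock_ghz * flops_per_lane_per_cycle;
}

// Figures quoted in the session:
//   CPU: peak_gflops(4, 4, 3.0)     gives 48
//   CPU: peak_gflops(4, 4, 3.0, 2)  gives 96 (assumes multiply-add = 2 flops)
//   GPU: peak_gflops(32, 32, 1.0)   gives 1024, i.e. about 1 Teraflop
```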

How can applications exploit the performance of the GPU architecture? There are advantages beyond the Teraflop: high-bandwidth memory, higher GFlops per watt, etc. One of the problems is that the link between the CPU and the GPU (the PCI bus) delivers only about 1-2GB/s of practical bandwidth.
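To get a feel for what that bus bottleneck means, here is a rough C++ sketch that converts the bandwidth figures above into transfer times. The workload (a 1-megapixel image at 16 bytes per pixel) and the helper name `transfer_ms` are assumptions made for illustration, not something shown in the session.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: milliseconds needed to move `bytes` at `gb_per_s`
// (1 GB taken as 1e9 bytes).
double transfer_ms(std::size_t bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e3;
}

// Assumed workload: 1,000,000 pixels, four 4-byte floats per pixel.
constexpr std::size_t kImageBytes = 1000000u * 16u;
```

Under these assumptions the image needs roughly 8 to 16 ms to cross the CPU-GPU bus at 1-2GB/s, but only about 0.1 ms to stream through GPU memory at 150GB/s.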

The GPU has thousands of ALUs, meaning we need hundreds of thousands of threads (considering 32x hyperthreading) to hit peak, and only data elements come in such numbers. E.g., performing an operation on a 1MP image makes the GPU very happy if you create 1M threads. (Another important notion is that the CPU is optimized for random memory access, but the GPU likes sequential memory accesses.)
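The "one thread per data element" idea can be sketched on the CPU. In the C++ below, `brighten_kernel` stands in for the code a single GPU thread would run and `dispatch` stands in for launching one thread per pixel. Both names and the brightening operation are hypothetical, chosen only to illustrate the mapping; on a real GPU the loop iterations would execute concurrently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Body of one logical thread: brightens the single pixel it owns,
// clamping the result to the 0..255 range.
// Thread i touches element i, so neighbouring threads read neighbouring memory.
inline void brighten_kernel(std::size_t thread_id,
                            std::vector<std::uint8_t>& pixels, int amount) {
    int v = pixels[thread_id] + amount;
    if (v > 255) v = 255;
    if (v < 0) v = 0;
    pixels[thread_id] = static_cast<std::uint8_t>(v);
}

// CPU stand-in for a GPU dispatch: one logical thread per pixel.
// No iteration depends on another, which is what makes the work data-parallel.
void dispatch(std::vector<std::uint8_t>& pixels, int amount) {
    for (std::size_t tid = 0; tid < pixels.size(); ++tid)
        brighten_kernel(tid, pixels, amount);
}
```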

Some scenarios: image processing (highly parallelizable), video processing, audio, linear algebra, simulation and modeling, and many others. A huge number of small units of work (tens of bytes each) is ideal.

Oftentimes the algorithm has to be replaced to be data-parallel. For example, for sorting, quicksort can be replaced with bitonic sort.
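To show why bitonic sort suits this model, here is a minimal CPU sketch in C++. It is not the code from the session, and it assumes the input length is a power of two. The compare-and-swap pattern is fixed in advance and does not depend on the data, so every iteration of the innermost loop could be handed to its own GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts `a` ascending using a bitonic sorting network.
// Assumption: a.size() is a power of two (pad the input otherwise).
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0 && "size must be a power of two");
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of sequences being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // distance between compared elements
            // Each pair (i, i ^ j) is touched exactly once per pass, so the
            // iterations of this loop are independent and can run in parallel.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

Quicksort, by contrast, chooses its partitions based on the data it sees, which makes it much harder to spread evenly across thousands of threads.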

A sample app that performs an N-body simulation of particles shows almost 750GFlops on the GPU and hardly 25GFlops on the CPU. Of course, when the GPU is working, the CPU is free for other tasks.

DirectCompute is essentially a low-level API for programming the GPU, with the GPU code itself written in HLSL. All DirectX11 chips will support DirectCompute, and some DirectX10 chips already support it.

To use DirectCompute: initialize, create some GPU code in an .hlsl file, compile it using the DirectX compiler, load the code onto the GPU, set up a GPU buffer and a view into it, make that data view current, execute the code on the GPU, and copy the data back to CPU memory. (It’s recommended, of course, to execute lots of code on the GPU and keep buffers there instead of copying them back to the CPU every time. Chunky interfaces.)

The HLSL syntax is rather similar to C/C++: there are preprocessor defines, basic types, operators, variables, functions, and a lot of intrinsics, but NO POINTERS. The language is compiled to an intermediate language which is then consumed by the hardware driver.

DirectCompute is part of DirectX11, which ships with Windows 7 and is also available as part of a Platform Update for Windows Vista. (The DirectX SDK can be installed on either OS.)

[Here’s a link to a tutorial on the DirectCompute framework that you might find useful.]

one comment

  1. eyal, November 26, 2009 at 6:14 PM

    Any chance Microsoft Israel is looking for a GPU/CUDA programmer for a GPU project?