Next Generation Production Debugging: Demo 6

April 10, 2008

This is the last in a series of posts summarizing my TechEd 2008 presentation titled “Next Generation Production Debugging”.  Previous posts in the series:

After spending some quality time with the debugger, analyzing an invalid handle situation, I approached the final demo.  In this particular case, the application is requested to perform some heavy processing operation on a set of images (I called that operation “Batch Get Average Color”).

Customers were highly displeased with the performance of that operation, particularly since they were running the service side of the application on powerful multi-processor servers and not seeing all the cores being utilized for processing purposes.  Therefore, we implemented a parallelized version of the same operation.  By the way, implementing it in parallel only required a tiny change to the code.  These are the two versions side by side:

public List<PictureColor> BatchGetAverageColor(List<string> pictureFileNames)
{
    List<PictureColor> avgColorList = new List<PictureColor>();
    pictureFileNames.Aggregate(avgColorList,
        (list, picFn) => { list.Add(GetAverageColor(picFn)); return list; });
    return avgColorList;
}

public List<PictureColor> ParallelBatchGetAverageColor(List<string> pictureFileNames)
{
    List<PictureColor> avgColorList = new List<PictureColor>();

    //Could be rewritten as an aggregate with PLINQ:
    Parallel.ForEach(pictureFileNames, picFn =>
    {
        PictureColor c = GetAverageColor(picFn);
        lock (avgColorList) avgColorList.Add(c);
    });

    return avgColorList;
}


Parallel.ForEach comes from the Parallel Extensions for .NET CTP, and makes parallelizing a heavy processing operation across multiple CPUs a trivial task.
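As the comment in the code hints, the same operation could also be expressed as a PLINQ query instead of a Parallel.ForEach with a shared, locked list.  Here's a rough sketch of what that might look like (using the PLINQ operator names from the released API rather than the 2008 CTP; PlinqBatchGetAverageColor is just an illustrative name):

```csharp
// Hypothetical PLINQ version of the same batch operation:
// each file is processed on some worker thread, and PLINQ
// assembles the results -- no shared list, no lock needed.
public List<PictureColor> PlinqBatchGetAverageColor(List<string> pictureFileNames)
{
    return pictureFileNames
        .AsParallel()
        .AsOrdered()                           // keep results in input order, if callers rely on it
        .Select(picFn => GetAverageColor(picFn))
        .ToList();
}
```

Letting the query assemble the output also sidesteps the lock contention on avgColorList that the Parallel.ForEach version pays for on every item.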

However, when using the new version of the application, performance not only failed to go up, it even seemed slightly degraded in the multiprocessor scenario.  How is this possible?  I've mentioned multiple reasons for parallelism blockers before (such as cache coherency and contention), but in this particular case the operation seems to be 100% CPU-bound.

The easiest way of getting a bird's-eye view of what's going on in the system is using the Windows Performance Toolkit I've blogged about before the conference.  I enabled it, ran the application's batch processing again, and got the following results on the CPU and I/O activity graphs for the service process:


We can see that the operation is hardly using the CPU; the disk, on the other hand, is working at full power.  Subsequently, I asked xperf to show me a detailed graph of I/O activity and what I saw was a huge number of disk accesses to temporary files at random disk offsets during the time of the batch processing operation!


Since the operation is clearly I/O bound, increasing the number of processors on which it runs is not going to improve performance whatsoever (if the disk bandwidth is completely saturated).
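A quick back-of-the-envelope calculation makes the point: once the disk is saturated, total runtime is bounded below by total bytes divided by disk bandwidth, no matter how many cores participate.  The numbers here are purely illustrative assumptions, not measurements from the demo:

```csharp
using System;

class IoBoundEstimate
{
    static void Main()
    {
        // Assumed workload: 1,000 images of ~5 MB each.
        long totalBytes = 1000L * 5 * 1024 * 1024;

        // Assumed effective disk bandwidth (random access hurts this badly).
        long bytesPerSecond = 50L * 1024 * 1024;

        // Lower bound on runtime -- independent of the number of CPUs.
        double minSeconds = (double)totalBytes / bytesPerSecond;
        Console.WriteLine("Runtime is at least {0:F0} seconds", minSeconds);
    }
}
```

Adding cores can shrink the compute portion of the work, but it cannot push the total below this I/O floor; the random-offset accesses to temporary files that xperf revealed only make the effective bandwidth worse.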

So thanks again for coming to my presentation or for reading this brief summary of the demos.  When the session recording hits the web, I'll be sure to let you know.  In the meantime, if you're looking for more material, you can tune in to my DevAcademy session recording, or read my post on debugging and investigation tools.
