Introduction
In this post I would like to take a look at the For method of the Parallel class.
To try this out for yourselves you need to be running .Net 3.5 or later and to have installed the Parallel Extensions. You can download the June 2008 CTP of Parallel Extensions from here.
The API Reference Help (the “documentation”) that comes with the installation provides an excellent overview of the technology.
As described in the Overview section of the documentation, there are 3 major components in the Parallel Extensions.
- PLINQ - which enables you implement Declarative Data Parallelism.
- The Parallel static class – for Imperative Data Parallelism
- Tasks and Futures for Imperative Task Parallelism
The For method of the Parallel class is an implementation of structured parallelism that sits under the second item.
This post is structured as follows:
- Optimizing Matrix Multiplication.
In this section I will present a fairly simple optimization for matrix multiplication that can be performed even before applying parallelism. Then I apply Parallel.For. - Test Results
In this section I will compare the execution time of the three implementations: The trivial sequential implementation, the optimized sequential implementation and the parallelized optimized implementation. - Comparison to Unmanaged Code
In this section I tested three similar implementations in C++. Of course, I had to write a lot of code to mimic Parallel.For. You can’t beat C# for coding simplicity but when it comes to performance… See the conclusions - Consolidated Test Results
Here is a comparison of the 3 implementations in managed and in unmanaged code. - Conclusions
As usual, I provide the full source code for my tests (download from here)
Optimizing Matrix Multiplication
The documentation offers matrix multiplication as an example for the Parallel.For method.
I implemented my own Matrix class to test the example and used the following sequential implementation, which is a little different but functionally equivalent to the one provided in the documentation.
public static Matrix Multiply(Matrix lhs, Matrix rhs)
{
Matrix result = new Matrix(lhs.Rows, rhs.Cols);
double[,] left = lhs.m_Array;
double[,] right = rhs.m_Array;
double[,] dest = result.m_Array;
for (int i = 0; i < lhs.Rows; i++)
{
for (int j = 0; j < rhs.Cols; j++)
{
for (int k = 0; k < lhs.Cols; k++)
{
dest[i, j] += left[i, k] * right[k, j];
}
}
}
return result;
}
The documentation goes ahead to optimize this method with Parallel.For.
But before we apply parallelism, lets perform a simple optimization for the sequential code:
public static Matrix MultiplyReorderLoops(Matrix lhs, Matrix rhs)
{
Matrix result = new Matrix(lhs.Rows, rhs.Cols);
double[,] left = lhs.m_Array;
double[,] right = rhs.m_Array;
double[,] dest = result.m_Array;
for (int i = 0; i < lhs.Rows; i++)
{
for (int k = 0; k < lhs.Cols; k++)
{
double value = left[i, k];
for (int j = 0; j < rhs.Cols; j++)
{
dest[i, j] += value * right[k, j];
}
}
}
return result;
}
All I did was to reorder the two inner loops. As the results will show this cuts approximately 40% in execution time! Does anyone have any ideas why?
This is my explanation:
- In the inner loop we now perform only two indexing operations - instead of three.
The third indexing operation occurs outside the inner loop. - Both indexers in the inner loop iterate over rows.
Previously one indexing operation iterated over columns.
Assuming the array is laid out in memory in raster order, this will reduce the number of data cache misses when accessing data from the array.
OK. The next step is to apply parallelism.
The implementation is really simple – and this is the beauty of Parallel.For
public static Matrix MultiplyReorderLoopsParallel(Matrix lhs, Matrix rhs)
{
Matrix result = new Matrix(lhs.Rows, rhs.Cols);
double[,] left = lhs.m_Array;
double[,] right = rhs.m_Array;
double[,] dest = result.m_Array;
Parallel.For(0, lhs.Rows, i =>
{
for (int k = 0; k < lhs.Cols; k++)
{
double value = left[i, k];
for (int j = 0; j < rhs.Cols; j++)
{
dest[i, j] += value * right[k, j];
}
}
});
return result;
}
As you can see, I replaced the for statement with Parallel.For and converted the loop itself to a lambda expression. On my dual Pentium laptop, this modification cut another 40% of the execution time relative to MultiplyReorderLoops.
The documentation says, that though it is also possible to apply Parallel.For to the inner loops too, the benefit is likely to be small, if any, due to the overhead of scheduling so many small parallel tasks. Indeed, I found this to be the case.
Test Results
I created two 500 x 500 matrix of double values and multiplied them. I edited the multiplication operator in Matrix to select each of the three implementations and measure the execution time for each.
I ran this on my Del D530 with Dual Core Pentium @ 2 GHz. Typical results were as follows:
| Implementation | Execution Time [msec] | Reduction |
| Multiply | 5421 | |
| MultiplyReorderLoops | 3398 | 37% |
| MultiplyReorderLoopsParallel | 2067 | 39% |
Comparison to Unmanaged Code
I created a very similar project in C++ with three implementations corresponding to those described above.
A few words of explanation are warranted.
- The Matrix class is templated by the type of elements in the matrix. (I used double for testing). This is actually a convenience I couldn’t mimic in C# because there is no syntax for a generic constraint that testifies that a type supports operators (such as + or * that I use in the class).
- I implemented a copy constructor and an assignment constructor. These are not needed in the C# because all matrices in the main program are passed around by reference and not copied or assigned from.
- The Matrix * operator is implemented by invoking a computation constructor of Matrix.
This trick encourages the compiler to apply Return Value Optimization (RVO), reducing the number of temporaries created in and around the multiplication operator. - I implemented a nested class inside Matrix called Calculator. This is a command or functor object that performs the calculations required for multiplication in portions.
- Calculator accepts all the information it requires for the calculation including a range of columns and range of rows that it is responsible to calculate.
- The trivial and optimized sequential implementations of the multiplication operator are provided in the Calculator class under the names Multiply and MultiplyReorderLoops respectively.
- The implementation of parallelization in C++ took quite a lot of effort. First I wrote a Thread class (see Thread.h) from which I derived the Calculator helper class. Then I had to portion up the calculations into smaller chunks of work each of which would be assigned to a Calculator object and run in parallel. Finally, I ran the Calculator objects in parallel using the Win32 API WaitForMultipleObjects to make sure that I would return from the multiplication operation - only after the entire computation is complete.
Consolidated Test Results
| Implementation | C# Execution Time [msec] | C++ Execution Time [msec] |
| Multiply | 5421 | 1950 |
| MultiplyReorderLoops | 3398 (37% reduction) | 850 (56% reduction) |
| MultiplyReorderLoopsParallel | 2067 (39% reduction) | 680 (20% reduction) |
Conclusions
- Interestingly enough, the loop reordering technique provides the most significant performance improvement both in C# and in C++.
- Parallelizing in C++ was tough and only offered a 20% improvement.
- In C#, “Parallel.For” was really easy to use and offered 39% which is definitely significant.
- It is possible that in the C++ code I didn’t choose the right parallelization strategy. Had I done so, I might have observed a more marked performance improvement. But this would require me to write “domain specific” code that could not be applied to other parallelization problems.
- There can be no doubt therefore, that Parallel.For allows us to handle parallelism in a simple, reusable way, saving a lot of coding. On the other hand, if you must optimize your code to get the best performance possible, C++ would seem to be the best candidate for that. (more than 3 times more efficient than the C# implementation)
Download source code from here.
Today I completed exam 70-549 and earned my MCPD Enterprise Applications Developer.
It’s really not too difficult and I think its worth while taking the time to get certified. I recommend you take a look at this post for more information on how.
My next step on the program is to upgrade my certifications to .Net 3.5. (I need to take two upgrade exams for that).
Meanwhile, I have been taking a close look at Parallel Extensions which will be part of .Net 4.0. In my next post I will be sharing some of my findings with you.