PDC 2009 Day 1: Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
Dryad is a distributed execution environment and framework (by Microsoft Research) that strives to optimize data flow across nodes in a computation network and to optimize the computation itself across nodes. A Dryad job is a directed acyclic graph (DAG) of inputs, processing vertices and outputs. This is very similar to Unix pipes, only in 2D (scale-out).

It’s very natural to attempt to run LINQ queries and operations on a distributed cluster of machines. PLINQ is the first, intra-machine step for parallelizing declarative queries automatically; DryadLINQ is the second, inter-machine step, with full Visual Studio integration, type safety of queries, automatic serialization of data and code, etc.
Bringing LINQ queries to the cluster with DryadLINQ requires some query planning and optimizations – e.g. optimize data locality on a node, combine multiple vertices of the graph that could run on different nodes into one vertex, etc.. This can be done by the job manager that sees the entire query. The managers hands off the job graph to the Dryad execution engine, which in turn uses the HPC job scheduler to execute the query on the cluster.
The only difference between a standard LINQ query over objects and a DryadLINQ query is that instead of IEnumerable<T> you use a PartitionedTable<T>, which is initialized with information specifying where the data set is stored across the cluster’s data storage nodes (the underlying storage can be a partitioned file, an SQL table, a cluster FS or anything else). There’s even an extension method for IEnumerable<T> – ToPartitionedTable() – that converts it to a DryadLINQ query. [Obviously, if this were to become a real product, there would be need to specify additional optimizations manually, such as PLINQ’s WithDegreeOfParallelism(), scheduler extensions and other mechanisms.]
Almost all LINQ operators are supported by DryadLINQ, but there are some new operators as well: HashPartition, RangePartition, Merge, Fork and Dryad Apply. Using the standard LINQ query operators, writing the famous MapReduce framework boils down to:
PartitionedTable<T> source = …;
Func<T, M> mapper = …;
Func<M, K> keySelector = …;
Func<IGrouping<K, M>, R> reducer = …;
source.
SelectMany(mapper).
GroupBy(keySelector).
SelectMany(reducer);
One of the examples for using DryadLINQ is a k-means clustering algorithm (finding the k centers of a set of points using a specific distance metric). When writing this kind of algorithm, there are hints you can specify to DryadLINQ to achieve further optimizations. For example, if you have an operator+ on a Vector class, you might want to specify that it is associative using a special [Associative] attribute so that the query evaluator can generate a more efficient execution plan.
Another example shown during the session is analysis performed on a Major League Baseball dataset with pitch data from 2007. There are tens of thousands of tiny XML files that contain pitch information for each batter, and yet after performing the initial partitioning across the cluster, analysis is as easy as running a LINQ query.
There’s a lot of ongoing work – query optimizations with annotation, runtime sampling and dynamic tuning of query execution on the go, dynamic allocation of resources between multiple DryadLINQ applications, support for Azure, and other areas.
There’s now an academic release of DryadLINQ that you can publicly obtain from Microsoft Connect, including sources! I hope to have time after the PDC to evaluate DryadLINQ myself. If you happen to try out DryadLINQ, you don’t have to set up an HPC cluster – it’s possible to debug the execution locally, on your development machine.
Finally, Microsoft Research released also a very interesting academic article on DryadLINQ with examples of queries that can be executed using the Dryad execution engine. Go ahead and read it for more information – I’m hoping DryadLINQ will become a mainstream technology, emerging from Microsoft Research.