DryadLINQ Tutorial: Part 1 – Distributed Storage Catalog (DSC) Basics

June 14, 2011

One of the most exciting additions to Windows HPC Server 2008 R2 SP2 (currently in beta) is support for DryadLINQ. DryadLINQ is an API for creating and executing large-scale, data-parallel compute tasks. One of the core capabilities of Dryad (the underlying framework used by DryadLINQ) is the ability to distribute data over the cluster and maintain data locality by executing code on the node that stores the data. To do both, Dryad relies on a mechanism called the Distributed Storage Catalog (DSC), which I will cover in this post.

Overview

DryadLINQ provides a powerful mechanism for distributing LINQ queries over a cluster. To do so, Dryad needs to distribute both the code to be executed and the data on which it operates. Before we can execute DryadLINQ queries over any type of data, we must save that data to DSC. This is what allows DryadLINQ to execute parts of the query (called vertices) in a distributed manner: each vertex emits a result that is either returned to the caller or saved to DSC for further querying.

The HPC Dsc Service

When you install a head node using SP2, a new Windows service – HPC Dsc – is added. (Note: for the current beta version of SP2, DryadLINQ requires a new installation; upgrades are not supported.) HPC Dsc exposes a service API (via net.tcp binding) with two primary functions:

  • Provide logical management for DSC entities such as DSC nodes, file sets, files, etc. (we will discuss those shortly) in a database called HPCDsc.
  • Physically manage the participating nodes’ files by accessing the file system via a file share.

Figure 1: Overview of the HPC Dsc service

To enable DSC on a node, run the following command:

dsc node add [compute node name] /tempath:[local path for HpcTemp share] /datapath:[local path for HpcData share] /service:[head node name]

This will create the shared directories on the target node and register them with the HPC Dsc service. Once this command has executed successfully, the corresponding compute node can be used for DSC operations (thereby enabling Dryad).

Working Directly With DSC Files

DSC allows the storage of either text files or serialized objects. The most straightforward way to save either is to write directly to DSC files as the following code demonstrates (you can download all code samples for this post from here):

// creating a fileset and two files manually
if (context.DscService.FileSetExists("TextFileSet"))
    context.DscService.DeleteFileSet("TextFileSet");

DscFileSet fileSet = context.DscService.CreateFileSet("TextFileSet");

DscFile fileA = fileSet.AddNewFile(40000);
DscFile fileB = fileSet.AddNewFile(40000);

// copying the content of text files in to the DSC files
File.Copy(Path.Combine(Directory.GetCurrentDirectory(), @"TextFiles\TextFile1.txt"),
          fileA.WritePath);
File.Copy(Path.Combine(Directory.GetCurrentDirectory(), @"TextFiles\TextFile2.txt"),
          fileB.WritePath);

fileSet.Seal();

var lines = context.FromDsc<LineRecord>("TextFileSet");
Console.WriteLine("The number of lines is {0}", lines.Count());

In this snippet I used a couple of the DSC entities (or, better said, their .NET APIs, which reside in the Microsoft.Hpc.Dsc namespace), and I would like to review them before we continue:

  • DscFileSet:
    The DscFileSet provides a mechanism to access, manage and most importantly query a group of files distributed over the cluster.   
  • DscFile:
    These represent actual files in the FileSet and provide a few useful properties including the WritePath I used in the code above to simply write to the files.

In the above snippet we use the DscFileSet to create two DscFiles. The DscFileSet creates each file in the HpcData shared directory on whichever DSC node it sees fit to host the file. We can influence this decision by passing a preferred node name to the AddNewFile method. Once a file is created, we can write to it manually through DscFile.WritePath. When we are done writing to the files, we call the Seal method, which informs the HPC Dsc service that no more writes will follow. The HPC Dsc service can then update the database with the size of each file and change the state of the DscFileSet to sealed.

Another API I used in this snippet is the HpcLinqContext (from the Microsoft.Hpc.Linq namespace), which is declared outside of the presented code and which I will cover in more detail in later parts of this tutorial. For now, I'll just explain its role in the above code: once configured, the HpcLinqContext allows us to communicate with DSC in a couple of ways. The first is by exposing the HPC Dsc service via the DscService property: we use the Dsc service to create the new file set. The second is by allowing us to execute a DryadLINQ query against the file set using the FromDsc<T> method.

Saving Serialized Objects To Files

As I mentioned above, DSC supports serialized objects. While this ability exposes the real strength of Dryad, it is unfortunately a mess in its current state (SP2 beta). Fortunately, the DryadLINQ samples that are part of the SP2 installation contain a project named HpcLinqExtras, which smooths over many of the oddities of DSC in its current state. I will review these oddities and the solutions HpcLinqExtras offers for them.

So let's begin by defining a class named Person and marking it as [Serializable]. There are some restrictions on objects that can be serialized to DSC, which can be found in the DryadLINQ and DSC Programmer's Guide. For our sample we will use the following class definition, and you'll just have to take my word that it is compliant with the DSC restrictions:

[Serializable]
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

One would believe this should be enough for saving instances of Person to DSC, but with DSC nothing is as simple as it seems. The biggest DSC serialization quirk is the need for a type that implements custom HPC serialization. Note that not all types in the saved graph need to implement custom HPC serialization, only the root type that is used for saving the data manually. The HpcLinqExtras project contains the ObjectRecord class, which can be used as a sort of generic root type for any .NET object. To implement custom HPC serialization, a type must be decorated with the CustomHpcSerializerAttribute and implement the IHpcSerializer<T> interface. This allows it to read and write the type passed as the T type parameter from and to DSC. The ObjectRecord class, for example, is declared as follows:

[CustomHpcSerializer(typeof(ObjectRecord))]
public class ObjectRecord : IHpcSerializer<ObjectRecord>

The next step is to implement the serialization mechanism that writes the actual data, by implementing the two methods declared in the IHpcSerializer interface: Read and Write. This can be done by simply using runtime serialization into and out of memory streams and passing the contents of those streams to the HpcBinaryWriter and from the HpcBinaryReader. Since the same behavior also needs to be invoked manually (without the HPC binary reader/writer), ObjectRecord exposes overloads of Read and Write that use a BinaryReader and a BinaryWriter.
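To make the "runtime serialization into a memory stream" idea concrete, here is a minimal sketch of a BinaryWriter/BinaryReader pair of that kind. It is not the HpcLinqExtras source: ObjectEnvelope and EnvelopeDemo are hypothetical names, and the Int32 length prefix is my own assumption about how a payload might be framed. It uses BinaryFormatter, which fits the .NET Framework versions this post targets but is obsolete and disabled by default in recent .NET releases.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical stand-in for ObjectRecord; not the HpcLinqExtras implementation.
public class ObjectEnvelope
{
    public object Value { get; private set; }

    public ObjectEnvelope() { }
    public ObjectEnvelope(object value) { Value = value; }

    // Serialize Value into memory, then emit a length prefix followed by the bytes.
    public void Write(BinaryWriter writer)
    {
        using (var buffer = new MemoryStream())
        {
            new BinaryFormatter().Serialize(buffer, Value);
            byte[] payload = buffer.ToArray();
            writer.Write(payload.Length); // assumed framing: Int32 byte count
            writer.Write(payload);
        }
    }

    // Read the length prefix, then deserialize exactly that many bytes.
    public void Read(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        byte[] payload = reader.ReadBytes(length);
        using (var buffer = new MemoryStream(payload))
        {
            Value = new BinaryFormatter().Deserialize(buffer);
        }
    }
}

public static class EnvelopeDemo
{
    // Writes a value to an in-memory stream and reads it back through the envelope.
    public static object RoundTrip(object value)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new BinaryWriter(stream);
            new ObjectEnvelope(value).Write(writer);
            writer.Flush();
            stream.Position = 0;

            var copy = new ObjectEnvelope();
            copy.Read(new BinaryReader(stream));
            return copy.Value;
        }
    }
}
```

The length prefix is what lets several records sit back to back in one file: a reader can consume one record without knowing anything about the type stored inside it.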

Now we can create new instances of Person and write our own objects one by one, wrapped in ObjectRecords:

// creating a fileset and two files manually
if (context.DscService.FileSetExists("PersonFileSet"))
    context.DscService.DeleteFileSet("PersonFileSet");

DscFileSet fileSet = context.DscService.CreateFileSet("PersonFileSet");

DscFile fileA = fileSet.AddNewFile(100);
DscFile fileB = fileSet.AddNewFile(100);

var andy = new Person { Id = 1, Name = "Andy" };
var kelly = new Person { Id = 2, Name = "Kelly" };

// writing each person to a different file
// the using block disposes the writer, which also closes the underlying file
using (BinaryWriter bw = new BinaryWriter(File.OpenWrite(fileA.WritePath)))
{
    new ObjectRecord(andy).Write(bw);
}

using (BinaryWriter bw = new BinaryWriter(File.OpenWrite(fileB.WritePath)))
{
    new ObjectRecord(kelly).Write(bw);
}

fileSet.Seal();

Once the objects are saved, we can query the DscFileSet using FromDsc<ObjectRecord> to get them back, and use a simple Select to return an IQueryable<Person>:

var champs = context.FromDsc<ObjectRecord>("PersonFileSet")
                    .Select<ObjectRecord, Person>(or => or.Value as Person);
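One detail of the Select above is worth spelling out: the `as` operator yields null, rather than throwing, for any record whose value is not a Person. The following sketch shows that behavior using plain LINQ to Objects over an in-memory list. Record and UnwrapDemo are hypothetical names introduced for illustration, and I have not checked whether OfType is supported inside a DryadLINQ query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical wrapper with an untyped Value, mirroring how ObjectRecord is used above.
public class Record
{
    public object Value { get; set; }
}

public static class UnwrapDemo
{
    // Same shape as the query above: non-Person values come back as null.
    public static List<Person> UnwrapWithAs(IEnumerable<Record> records)
    {
        return records.Select(r => r.Value as Person).ToList();
    }

    // Alternative that drops non-Person values instead of yielding nulls.
    public static List<Person> UnwrapWithOfType(IEnumerable<Record> records)
    {
        return records.Select(r => r.Value).OfType<Person>().ToList();
    }
}
```

If a file set is known to contain only Person instances, the two forms return the same elements; they differ only when something else has been written to it.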

Saving Serialized Objects To Files – Take 2

The above code might seem reasonable when working with two instances and a small file set. But what if we have a large collection and we want to distribute it over the cluster? Well, our good friend HpcLinqExtras provides a couple of nifty extension methods that make all this pain go away. For starters, we can use FromEnumerable<T> to write our data structure to a temporary DSC file set and get back a good old IQueryable<T>, which we can then save using ToDsc<T>:

var andy = new Person { Id = 1, Name = "Andy" };
var kelly = new Person { Id = 2, Name = "Kelly" };
var source = new List<Person>();
source.Add(andy);
source.Add(kelly);

context.FromEnumerable<Person>(source)
       .ToDsc<Person>("PersonFileSet")
       .SubmitAndWait(context);

So, let’s just see what we have, shall we?

  • First we call the FromEnumerable<Person> method that works in the following manner:
    • It creates a temporary fileset with one file
    • Then, the FromEnumerable method loops over the IEnumerable it receives as a parameter and wraps each member in an ObjectRecord for writing to the file.
    • Finally the FromEnumerable<Person> calls FromDsc<ObjectRecord>, and uses a simple Select to return IQueryable<Person>.
  • Next, we use the ToDsc<Person> call to create a file set named “PersonFileSet” from the IQueryable<Person> that the FromEnumerable<Person> method returned. You might have noticed that ToDsc<T> has the magical ability to write runtime-serializable objects and not just custom HPC-serializable ones. I did warn you: the API for working with serialized objects is a mess.
  • Finally, we call the SubmitAndWait method which obviously submits a job to the head node, and then, wait for it… it waits for it to complete.

You might get the feeling we are no better off than before. Granted, we wrote a lot less code than in the previous version, but FromEnumerable<T> does exactly what we did by hand in our first sample, and on top of that we then query the temporary file set and write it out again to the final file set. Astute readers might even notice that both file sets contain a single file: for the temporary file set a single file was created manually, and if you take a closer look at the sample you will notice that “PersonFileSet” was created implicitly by the ToDsc<Person> method. So we are not even distributing our data. There is a way to distribute data using this technique, called partitioning, which I'll cover in the next part of this tutorial.

Summary

The fact that DSC and the Dryad stack are still in their early stages doesn't undermine the fact that we are looking at a technology that can change the face of parallel applications. And while it is likely that some (if not all) of the extensions available today in HpcLinqExtras will find their way into the final release of DryadLINQ, it is important to understand how they work and what their implications are.
