<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.microsoft.co.il/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>I&amp;#39;m on a mission from God object : HPC</title><link>http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HPC/default.aspx</link><description>Tags: HPC</description><dc:language>en</dc:language><generator>CommunityServer 2007.1 (Build: 20917.1142)</generator><item><title>LINQ to HPC (Formerly known as DryadLINQ) Tutorial: Part 2–Data Partitioning (DSC)</title><link>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/08/09/linq-to-hpc-formerly-known-as-dryadlinq-tutorial-part-2-data-partitioning-dsc.aspx</link><pubDate>Tue, 09 Aug 2011 18:39:00 GMT</pubDate><guid isPermaLink="false">b5c4f5bc-c09b-4439-a595-91a98c1847df:881685</guid><dc:creator>Yaniv Rodenski</dc:creator><slash:comments>3</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.microsoft.co.il/blogs/roadan/rsscomments.aspx?PostID=881685</wfw:commentRss><comments>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/08/09/linq-to-hpc-formerly-known-as-dryadlinq-tutorial-part-2-data-partitioning-dsc.aspx#comments</comments><description>&lt;p&gt;A new &lt;a href="http://blogs.technet.com/b/windowshpc/archive/2011/07/07/announcing-linq-to-hpc-beta-2.aspx"&gt;beta&lt;/a&gt; has been released since I wrote &lt;a 
href="http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx"&gt;part 1&lt;/a&gt; of this tutorial. While very little has changed in the product, we have a new name. Another thing that held me back personally from publishing this part was the fact that LINQ to HPC is not a part of Windows HPC R2 SP2. So without further ado, I am proud to present the second part of my tutorial about LINQ to HPC.&lt;/p&gt;  &lt;p&gt;In &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx"&gt;part 1&lt;/a&gt; of this tutorial we discussed the fundamentals of DSC: how to manually write data to DSC files and how to use the FromEnumerable&amp;lt;T&amp;gt; extension method (from the &lt;b&gt;HpcLinqExtras&lt;/b&gt; project) to implicitly save object data to a temporary file set (in order to use it inline in a subsequent query). We also saw a caveat in this method, namely that because FromEnumerable&amp;lt;T&amp;gt; saves the data to a single file in the temporary file set, &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_014B1D74_4909AB78.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="Windows HPC Server 2008 R2 " border="0" alt="Windows HPC Server 2008 R2 DSC DryadLINQ Dryad LINQ to HPC" align="right" src="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_014B1D74_thumb_7D6A3226.png" width="198" height="107" /&gt;&lt;/a&gt;the subsequent query cannot be parallelized. This is because LINQ to HPC runs any query logic locally on the DSC node containing the data to which it refers. 
&lt;/p&gt;  &lt;p&gt;The task at hand is quite straightforward: we would like to partition our data into logical pieces that can be distributed across the cluster. Before we start discussing how we can physically partition data in LINQ to HPC, I would like to consider the logic we will use for dividing the data into groups. In order to do so we will take a look at vertices, which are the basic tasks that execute the query on the cluster. I will describe vertices in detail in a later part of this tutorial, but for now there are a few facts I would like you to consider:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;A vertex can only use data from a single DSC file, located on the node it is executing on. This is, of course, in order to preserve data locality. The main implication of this little fun fact is that we should make sure that pieces of data that depend on each other reside contiguously in the DSC file set. A good example of this is the use of GroupBy in a query. Let’s create a Student class defined as follows:      &lt;br /&gt;      &lt;br /&gt;      &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:dcde419d-d821-46a4-9c15-7796cc71c91a" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;[&lt;span style="color:#2b91af;"&gt;Serializable&lt;/span&gt;]&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;class&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;int&lt;/span&gt; Id { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt;     &lt;span 
style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;string&lt;/span&gt; Name { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;string&lt;/span&gt; Nationality { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;double&lt;/span&gt; AvgGrade { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt; }&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;      &lt;br /&gt;      &lt;br /&gt;Now let’s say we are grouping our students by nationality, so our data should be ordered like this:       &lt;br /&gt;      &lt;br /&gt;&lt;a href="http://blogs.microsoft.co.il/blogs/roadan/image_3670E95C.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="A file set containing students" border="0" alt="Windows HPC Server 2008 R2 DSC DryadLINQ Dryad LINQ to HPC" src="http://blogs.microsoft.co.il/blogs/roadan/image_thumb_072ED7F8.png" width="524" height="359" /&gt;&lt;/a&gt;&amp;#160; &lt;br /&gt;&amp;#160; &lt;br /&gt;Dryad can execute local queries in each vertex and then union all the groups. If the same data needs to be reordered by the query (let’s say items were ordered by Id in the query), the first thing LINQ to HPC would need to do is to reorganize the data into intermediate files, and only then execute the necessary logic.       &lt;br /&gt;&lt;b&gt;Note:&lt;/b&gt; grouping operators are a bit more complex when it comes to LINQ to HPC and will be discussed in a later part of this tutorial.       
&lt;br /&gt;&lt;/li&gt;    &lt;li&gt;A vertex will process all the data in the DSC file it is accessing. This means that if we would like to break down the processing of local queries into smaller pieces, we need to break the data into smaller files. This is possible since DSC file sets support creating more files than the number of nodes. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;We can control the order in which our objects are written to file when using custom HPC serialization (as I have shown in &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx"&gt;part 1&lt;/a&gt; of the tutorial). However, this can become tedious, especially if we need to use the same data in different queries that can benefit from different partitioning and ordering. &lt;/p&gt;  &lt;h5&gt;Repartitioning Operators&lt;/h5&gt;  &lt;p&gt;Repartitioning operators are LINQ to HPC operators that result in intermediate DSC files partitioned in a way that is not dependent on the partitioning of the input files. There are two repartitioning operators in LINQ to HPC: hash partitioning and range partitioning.&lt;/p&gt;  &lt;p&gt;&lt;b&gt;Hash Partitioning &lt;/b&gt;&lt;/p&gt;  &lt;p&gt;Hash partitioning provides a mechanism for partitioning data that is not sorted. Returning to our students sample, nationality is a prime candidate for hash partitioning. 
To use hash partitioning you need to call the HashPartition operator, which provides an overload that accepts the number of partitions to be created. Once it is called, you can use the ToDsc operator to create a new DSC file set and call SubmitAndWait to commit the operation (I have reviewed these steps in &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx"&gt;part 1&lt;/a&gt; of this tutorial):&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:26694206-f26f-4a48-b249-839d7a66ce09" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#008000;"&gt;// getting the list of students&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;List&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt; students = GetStudentsList();&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#008000;"&gt;// saving the students hash partitioned to the file set with 5 partitions &lt;/span&gt;&lt;br /&gt; context.FromEnumerable&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(students)&lt;br /&gt;        .HashPartition(std =&amp;gt; std.AvgGrade, 5)&lt;br /&gt;        .ToDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(&lt;span style="color:#a31515;"&gt;&amp;quot;StudentsFileSet&amp;quot;&lt;/span&gt;)&lt;br /&gt;        .SubmitAndWait(context);&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;The way hash partitioning selects the partition for a specific entity is by performing a mod operation between the hash code of the key returned by the key selector and the number of partitions. The following code mimics the behavior of hash partitioning regarding the partition 
selection:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:2580fa4b-1400-46d1-95ab-cea821ecc784" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; students = GetStudentsList();&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#0000ff;"&gt;foreach&lt;/span&gt; (&lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; student &lt;span style="color:#0000ff;"&gt;in&lt;/span&gt; students)&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#008000;"&gt;// hash the partitioning key (AvgGrade, as in the HashPartition call), not the Student object&lt;/span&gt;&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;int&lt;/span&gt; partNum = student.AvgGrade.GetHashCode() % 5;&lt;br /&gt; &lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; str = &lt;span style="color:#a31515;"&gt;&amp;quot;the student {0} with nationality {1} will be written into partition no: {2}&amp;quot;&lt;/span&gt;;&lt;br /&gt;     &lt;span style="color:#2b91af;"&gt;Console&lt;/span&gt;.WriteLine(str, &lt;br /&gt;                       student.Name, &lt;br /&gt;                       student.Nationality,&lt;br /&gt;                       partNum);&lt;br /&gt; }&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;This method is disappointingly crude. If you run this code (supplied with my &lt;a href="https://skydrive.live.com/?cid=8de1cdea3626e8c0&amp;amp;sc=documents&amp;amp;id=8DE1CDEA3626E8C0%21172#"&gt;samples&lt;/a&gt;) you will see that although we have instructed the HashPartition operator to create 5 partitions, the mod operation yields only four different values. This is of course due to the nature of the values produced by our key selector (none of them divides evenly by 5). 
This result is somewhat arbitrary, and we could have had the result distributed in many ways (even and uneven) depending on the result of GetHashCode for the key. To overcome this pitfall, HashPartition has another overload that accepts an IEqualityComparer that can be used to override the implementation of GetHashCode for the key.&lt;/p&gt;  &lt;h5&gt;Range Partitioning&lt;/h5&gt;  &lt;h5&gt;&lt;font style="font-weight:normal;"&gt;Range partitioning allows the ordered partitioning of sorted keys. Returning once more to our students sample, the average grade can be used as such a key. This is useful if our query uses this key selector ordering in its logic. The way range partitioning works is by assigning a range of keys to every file: any object whose key belongs in that range will be placed in that DSC file. By using this method files can be created unevenly, but we can ensure that objects within a specific range will reside in the same file. Range separators are used to define ranges: these are values that mark the border points between one range and another. Let’s say we now would like to partition our students into files by grade. We will use two range separators to split the data into three files&lt;/font&gt;&lt;font style="font-weight:normal;"&gt;: &lt;/font&gt;    &lt;br /&gt;    &lt;br /&gt;&lt;a href="http://blogs.microsoft.co.il/blogs/roadan/image_53A2A334.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="A file set containing students" border="0" alt="Windows HPC Server 2008 R2 DSC DryadLINQ Dryad LINQ to HPC" src="http://blogs.microsoft.co.il/blogs/roadan/image_thumb_3E6477BF.png" width="550" height="388" /&gt;&lt;/a&gt;     &lt;br /&gt;    &lt;br /&gt;&lt;font style="font-weight:normal;"&gt;In this case our range separators are 3 and 6. 
One thing that is very easy to overlook is the fact that if a student’s grade equals the value of a range separator, it could belong, range-wise, to the file on either side of the separator. Range separators can be assigned in two ways:&lt;/font&gt;&lt;/h5&gt;  &lt;ul&gt;   &lt;li&gt;Statically assigned by user:      &lt;br /&gt;In some cases we would like to explicitly force the range structure. This is useful when we know the structure of our data and queries and believe we can benefit from it. Let’s say we know our queries mostly filter students with grades of 6 and above; we can reflect this knowledge in our file structure even though it results in an uneven distribution.       &lt;br /&gt;We can pass an array of range separators like this:&amp;#160; &lt;br /&gt;      &lt;br /&gt;      &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:d4990d36-d333-4568-b125-092e1a6b4d6f" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:500px;overflow:auto;padding:2px 5px;"&gt;&lt;span style="color:#008000;"&gt;// getting the list of students&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;List&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt; students = GetStudentsList();&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#008000;"&gt;// saving the students range partitioned to the file set&lt;/span&gt;&lt;br /&gt; context.FromEnumerable&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(students)&lt;br /&gt;        .RangePartition(std =&amp;gt; std.AvgGrade, &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt;[] { 3d, 6d })&lt;br /&gt;        .ToDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(&lt;span style="color:#a31515;"&gt;&amp;quot;StudentsFileSet&amp;quot;&lt;/span&gt;)&lt;br /&gt;        
.SubmitAndWait(context);&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;      &lt;br /&gt;      &lt;br /&gt;All we need to provide here is a key selector delegate, to select the value on which we partition, and the rangeKeys parameter, which holds the array of range separators of the same type as the return type of the key selector. &lt;/li&gt;    &lt;li&gt;Dynamically sampled:&amp;#160; &lt;br /&gt;Another, perhaps simpler, approach is to use a different overload that allows LINQ to HPC to generate partition separators for us. When we allow RangePartition to select the range separators for us, it will try to create DSC files of approximately equal size, but on the other hand we do lose much of the control we had when creating the range separators ourselves. There are a few overloads of RangePartition; the simplest looks like this:       &lt;br /&gt;      &lt;br /&gt;      &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:008f8b7f-6ca6-43b0-a34b-8f58648deddd" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#008000;"&gt;// getting the list of students&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;List&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt; students = GetStudentsList();&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#008000;"&gt;// saving the students range partitioned to the file set with 5 partitions &lt;/span&gt;&lt;br /&gt; context.FromEnumerable&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(students)&lt;br /&gt;        .RangePartition(std =&amp;gt; std.AvgGrade, 5)&lt;br /&gt;        .ToDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;Student&lt;/span&gt;&amp;gt;(&lt;span 
style="color:#a31515;"&gt;&amp;quot;StudentsFileSet&amp;quot;&lt;/span&gt;)&lt;br /&gt;        .SubmitAndWait(context);&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;      &lt;br /&gt;      &lt;br /&gt;Other than losing control with dynamic range partitioning, there are a few key points you should bear in mind:       &lt;ul&gt;       &lt;li&gt;Currently dynamic sampling will take place for every 1,000 records - not really useful for small datasets. &lt;/li&gt;        &lt;li&gt;Dynamic range partitioning uses range separators even if you did not set them yourself. If the key selector returns non-proportional ranges, the files will have to differ in size. &lt;/li&gt;     &lt;/ul&gt;   &lt;/li&gt; &lt;/ul&gt;  &lt;h5&gt;Summary&lt;/h5&gt;  &lt;p&gt;Data partitioning allows us to implicitly distribute our data over the cluster, thus adding more control over how (and where) our queries will execute. Now that we have all our data just where we want it, we can start creating distributed kick-ass queries. But this calls for a completely different post. 
&lt;/p&gt;  &lt;p&gt;Source code for all the samples can be found &lt;a href="https://skydrive.live.com/?cid=8de1cdea3626e8c0&amp;amp;sc=documents&amp;amp;id=8DE1CDEA3626E8C0%21172#"&gt;here&lt;/a&gt;.&lt;/p&gt;</description><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DEV/default.aspx">DEV</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HPC/default.aspx">HPC</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Parallel+Programing/default.aspx">Parallel Programing</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DryadLINQ/default.aspx">DryadLINQ</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Dryad/default.aspx">Dryad</category><category 
domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DSC/default.aspx">DSC</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HashPartition/default.aspx">HashPartition</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/RanePartition/default.aspx">RanePartition</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Windows+HPC+Server+2008+R2+SP2/default.aspx">Windows HPC Server 2008 R2 SP2</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/LINQ+to+HPC/default.aspx">LINQ to HPC</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HpcLinqExtras/default.aspx">HpcLinqExtras</category></item><item><title>DryadLINQ Tutorial: Part 1 – Distributed Storage Catalog (DSC) Basics</title><link>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx</link><pubDate>Tue, 14 Jun 2011 22:29:59 GMT</pubDate><guid isPermaLink="false">b5c4f5bc-c09b-4439-a595-91a98c1847df:842867</guid><dc:creator>Yaniv Rodenski</dc:creator><slash:comments>3</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.microsoft.co.il/blogs/roadan/rsscomments.aspx?PostID=842867</wfw:commentRss><comments>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/06/14/dryadlinq-tutorial-part-1-distributed-storage-catalog-dsc-basics.aspx#comments</comments><description>&lt;p&gt;One of the most exciting additions to &lt;a 
href="http://connect.microsoft.com/HPC"&gt;Windows HPC Server 2008 R2 SP2&lt;/a&gt; (currently in beta) is the support for &lt;a href="http://research.microsoft.com/en-us/projects/dryadlinq/"&gt;DryadLINQ&lt;/a&gt;. DryadLINQ is an API that allows the creation and execution of large-scale, data-parallel compute tasks. One of the core capabilities of Dryad &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_014B1D74_5CC19687.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="logo_hpc_466x165_014B1D74" border="0" alt="logo_hpc_466x165_014B1D74" align="right" src="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_014B1D74_thumb_3E472FD1.png" width="198" height="107" /&gt;&lt;/a&gt;(the underlying framework used by DryadLINQ) is the ability to distribute the data over the cluster and maintain data locality by executing the code on the node storing the data. In order to do both, Dryad utilizes a mechanism called the Distributed Storage Catalog (DSC), which I will cover in this post.&lt;/p&gt;  &lt;h5&gt;Overview&lt;/h5&gt;  &lt;p&gt;DryadLINQ provides a powerful mechanism for distributing LINQ queries over a cluster.&amp;#160; In order to do so, Dryad needs to distribute both the code to be executed and the data on which it needs to operate. Before we can execute DryadLINQ queries over any type of data we must save it to DSC. 
This is what allows DryadLINQ to execute parts of the query (called vertices) in a distributed manner:&amp;#160; each vertex will then emit a result that is either returned to the caller or saved to DSC for further querying.&lt;/p&gt;  &lt;h5&gt;The HPC Dsc Service&lt;/h5&gt;  &lt;p&gt;When you install a head node using SP2, a new Windows service, HPC Dsc, is added&amp;#160; (Note: for the current beta version of SP2, DryadLINQ requires a new installation; no upgrades are supported). HPC Dsc exposes a service API (via net.tcp binding) with two primary functions:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;Provide logical management for DSC entities such as DSC nodes, file sets, files, etc. (we will discuss those shortly) in a database called HPCDsc. &lt;/li&gt;    &lt;li&gt;Physically manage the participating nodes’ files by accessing the file system via a file share. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;&lt;a href="http://blogs.microsoft.co.il/blogs/roadan/image_65BDFC3B.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="image" border="0" alt="image" src="http://blogs.microsoft.co.il/blogs/roadan/image_thumb_24FB6432.png" width="492" height="361" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;font size="1"&gt;Figure 1: Overview of the HPC Dsc service&lt;/font&gt;&lt;/p&gt;  &lt;p&gt;To enable DSC on a node, simply run the following command:&lt;/p&gt;  &lt;p&gt;&lt;strong&gt;dsc node add [compute node name] /tempath:[local path for HpcTemp share] /datapath:[local path for HpcData share] /service:[head node name]&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;This will create the shared directories on the target node, and register them with the HPC Dsc service. 
Once this command is successfully executed, the corresponding compute node can be used for DSC (thereby enabling Dryad) operations. &lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Working Directly With DSC Files&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;DSC allows the storage of either text files or serialized objects. The most straightforward way to save either is to write directly to DSC files as the following code demonstrates (you can download all code samples for this post from &lt;a href="http://cid-8de1cdea3626e8c0.office.live.com/self.aspx/.Public/DryadLINQ.Samples.rar"&gt;here&lt;/a&gt;):&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:75f1b14b-8aad-457a-8054-05bf8e1ae3a4" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:500px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#008000;"&gt;// creating a fileset and two files manually&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;if&lt;/span&gt; (context.DscService.FileSetExists(&lt;span style="color:#a31515;"&gt;&amp;quot;TextFileSet&amp;quot;&lt;/span&gt;))&lt;br /&gt;     context.DscService.DeleteFileSet(&lt;span style="color:#a31515;"&gt;&amp;quot;TextFileSet&amp;quot;&lt;/span&gt;);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFileSet&lt;/span&gt; fileSet = context.DscService.CreateFileSet(&lt;span style="color:#a31515;"&gt;&amp;quot;TextFileSet&amp;quot;&lt;/span&gt;);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFile&lt;/span&gt; fileA = fileSet.AddNewFile(40000);&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFile&lt;/span&gt; fileB = fileSet.AddNewFile(40000);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#008000;"&gt;// copying the content of text files in to the DSC files&lt;/span&gt;&lt;br /&gt; &lt;span 
style="color:#2b91af;"&gt;File&lt;/span&gt;.Copy(&lt;span style="color:#2b91af;"&gt;Path&lt;/span&gt;.Combine(&lt;span style="color:#2b91af;"&gt;Directory&lt;/span&gt;.GetCurrentDirectory(), &lt;span style="color:#a31515;"&gt;@&amp;quot;TextFiles&amp;#92;TextFile1.txt&amp;quot;&lt;/span&gt;), &lt;br /&gt;           fileA.WritePath);&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;File&lt;/span&gt;.Copy(&lt;span style="color:#2b91af;"&gt;Path&lt;/span&gt;.Combine(&lt;span style="color:#2b91af;"&gt;Directory&lt;/span&gt;.GetCurrentDirectory(), &lt;span style="color:#a31515;"&gt;@&amp;quot;TextFiles&amp;#92;TextFile2.txt&amp;quot;&lt;/span&gt;), &lt;br /&gt;           fileB.WritePath);&lt;br /&gt; &lt;br /&gt; fileSet.Seal();&lt;br /&gt; &lt;br /&gt; &lt;br /&gt; &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; lines = context.FromDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;LineRecord&lt;/span&gt;&amp;gt;(&lt;span style="color:#a31515;"&gt;&amp;quot;TextFileSet&amp;quot;&lt;/span&gt;);&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;Console&lt;/span&gt;.WriteLine(&lt;span style="color:#a31515;"&gt;&amp;quot;The number of lines is {0}&amp;quot;&lt;/span&gt;, lines.Count());&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;In this snippet I used a couple of the DSC entities (or better said, their .NET APIs, which reside in the Microsoft.Hpc.Dsc namespace) and I would like to review them before we continue:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;strong&gt;DscFileSet&lt;/strong&gt;:       &lt;br /&gt;The DscFileSet provides a mechanism to access, manage and, most importantly, query a group of files distributed over the cluster.&amp;#160;&amp;#160;&amp;#160; &lt;/li&gt;    &lt;li&gt;&lt;strong&gt;DscFile:        &lt;br /&gt;&lt;/strong&gt;These represent actual files in the FileSet and provide a few useful properties, including the WritePath I used in the code above to simply write to the files. 
&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;In the above snippet we use the DscFileSet to create two DscFiles. The DscFileSet creates each file in the DscData shared directory on whichever DSC node it sees fit to host the file.&amp;#160; We can influence this decision by passing a preferred node name to the AddNewFile method. Once created, we can then manually write to the DscFile.WritePath.&amp;#160; When we are done writing to the files we call the Seal method. The Seal method informs the HPC Dsc service that we are done writing to the files.&amp;#160; This means that the HPC Dsc service can now update the database with the size of each file, and change the state of the DscFileSet to sealed.&lt;/p&gt;  &lt;p&gt;Another API I used in this snippet is the (declared outside of the presented code) &lt;strong&gt;HpcLinqContext &lt;/strong&gt;(from the Microsoft.Hpc.Linq namespace), which I will cover in more detail in later parts of this tutorial.&amp;#160; For now, however, I’ll just explain its role in the above code: once configured, the HpcLinqContext&amp;#160; allows us to communicate with the DSC in a couple of ways. The first one is by exposing the HPC Dsc service via the DscService property:&amp;#160; we use the Dsc service to create the new file set. The second way is by allowing us to execute a DryadLINQ query against the FileSet using the FromDsc&amp;lt;T&amp;gt; method. &lt;/p&gt;  &lt;p&gt;&lt;strong&gt;Saving Serialized Objects To Files&lt;/strong&gt;&lt;/p&gt;  &lt;p&gt;As I mentioned above, DSC supports serialized objects. While this ability exposes the real strength of Dryad, it is unfortunately a mess in its current state (SP2 beta).&amp;#160; Fortunately, the DryadLINQ samples that are a part of the SP2 installation contain a project named &lt;strong&gt;HpcLinqExtras&lt;/strong&gt;, which smooths over many of the oddities of DSC in its current state. 
I will review these oddities and their solutions in HpcLinqExtras.&lt;/p&gt;  &lt;p&gt;So let’s begin by defining a class named Person and marking it as [Serializable]. The restrictions on objects that can be serialized to DSC are listed in the DryadLINQ and DSC Programmer’s Guide; for our sample we will use the following class definition, and you’ll just have to take my word that it complies with those restrictions:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:ccc06f25-2efd-4590-a9f7-3208500f82a2" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;[&lt;span style="color:#2b91af;"&gt;Serializable&lt;/span&gt;]&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;class&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;int&lt;/span&gt; Id { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;string&lt;/span&gt; Name { &lt;span style="color:#0000ff;"&gt;get&lt;/span&gt;; &lt;span style="color:#0000ff;"&gt;set&lt;/span&gt;; }&lt;br /&gt; }&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;One would believe this should be enough for saving instances of Person to DSC, but with DSC nothing is as simple as it seems. The biggest DSC serialization quirk is the need for a type that implements custom HPC serialization. 
Note that not all types in the saved object graph need to implement custom HPC serialization, only the root type that will be used for manually saving the data. The HpcLinqExtras project contains the ObjectRecord class, which can be used as a sort of generic root type for any .NET object. To implement custom HPC serialization, a type must be decorated with the CustomHpcSerializerAttribute and implement the IHpcSerializer&amp;lt;T&amp;gt; interface. This allows it to read and write the type passed as the T type parameter from and to DSC. The ObjectRecord class, for example, is defined as follows:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:483c5860-4754-4aca-abc9-497c1cf950c6" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;[&lt;span style="color:#2b91af;"&gt;CustomHpcSerializer&lt;/span&gt;(&lt;span style="color:#0000ff;"&gt;typeof&lt;/span&gt;(&lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;))]&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;public&lt;/span&gt; &lt;span style="color:#0000ff;"&gt;class&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt; : &lt;span style="color:#2b91af;"&gt;IHpcSerializer&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;&amp;gt;&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;The next step is to implement a serialization mechanism for the actual data by implementing the two methods declared in the IHpcSerializer interface: Read and Write. This can be done by simply using runtime serialization into and out of memory streams and passing the latter to the &lt;strong&gt;HpcBinaryWriter &lt;/strong&gt;and &lt;strong&gt;HpcBinaryReader&lt;/strong&gt;. 
Since the same behavior needs to be invoked manually (without using the HPC binary reader/writer) the object record exposes overloads that use a &lt;strong&gt;BinaryWriter&lt;/strong&gt; and a &lt;strong&gt;BinaryReader &lt;/strong&gt;for the Read and Write methods.&lt;/p&gt;  &lt;p&gt;Now we can create new instances of Person and write our own objects one by one, wrapped in ObjectRecords:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:e32bd811-c0b7-4ea0-af92-6eb1fe4bd188" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:500px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#008000;"&gt;// creating a fileset and two files manually&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;if&lt;/span&gt; (context.DscService.FileSetExists(&lt;span style="color:#a31515;"&gt;&amp;quot;PersonFileSet&amp;quot;&lt;/span&gt;))&lt;br /&gt;     context.DscService.DeleteFileSet(&lt;span style="color:#a31515;"&gt;&amp;quot;PersonFileSet&amp;quot;&lt;/span&gt;);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFileSet&lt;/span&gt; fileSet = context.DscService.CreateFileSet(&lt;span style="color:#a31515;"&gt;&amp;quot;PersonFileSet&amp;quot;&lt;/span&gt;);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFile&lt;/span&gt; fileA = fileSet.AddNewFile(100);&lt;br /&gt; &lt;span style="color:#2b91af;"&gt;DscFile&lt;/span&gt; fileB = fileSet.AddNewFile(100);&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; andy = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt; { Id = 1, Name = &lt;span style="color:#a31515;"&gt;&amp;quot;Andy&amp;quot;&lt;/span&gt; };&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; kelly = &lt;span 
style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt; { Id = 2, Name = &lt;span style="color:#a31515;"&gt;&amp;quot;Kelly&amp;quot;&lt;/span&gt; };&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#008000;"&gt;// writing each person to a different file&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;using&lt;/span&gt; (&lt;span style="color:#2b91af;"&gt;BinaryWriter&lt;/span&gt; bw = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;BinaryWriter&lt;/span&gt;(&lt;span style="color:#2b91af;"&gt;File&lt;/span&gt;.OpenWrite(fileA.WritePath)))&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;(andy).Write(bw);&lt;br /&gt;     bw.Close();&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; &lt;span style="color:#0000ff;"&gt;using&lt;/span&gt; (&lt;span style="color:#2b91af;"&gt;BinaryWriter&lt;/span&gt; bw = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;BinaryWriter&lt;/span&gt;(&lt;span style="color:#2b91af;"&gt;File&lt;/span&gt;.OpenWrite(fileB.WritePath)))&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;(kelly).Write(bw);&lt;br /&gt;     bw.Close();&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; fileSet.Seal();&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;Once saved we can query the DscFileSet using the FromDsc&amp;lt;ObjectRecord&amp;gt; to get them back, and use a simple Select to return IQueryable&amp;lt;Person&amp;gt;:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:c082fbb3-0f45-4453-85b1-42f0fd27b676" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div 
style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; champs = context.FromDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;&amp;gt;(&lt;span style="color:#a31515;"&gt;&amp;quot;PersonFileSet&amp;quot;&lt;/span&gt;)&lt;br /&gt;                     .Select&amp;lt;&lt;span style="color:#2b91af;"&gt;ObjectRecord&lt;/span&gt;, &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;&amp;gt;(or =&amp;gt; or.Value &lt;span style="color:#0000ff;"&gt;as&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;);&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;h5&gt;Saving Serialized Objects To Files – Take 2&lt;/h5&gt;  &lt;p&gt;The above code might seem reasonable when working with two instances and a small FileSet. But what if we have a large collection, and we want to distribute it over the cluster? Well our good friend HpcLinqExtras provides a couple of nifty extension methods that make all this pain go away. 
For starters we can use the FromEnumerable&amp;lt;T&amp;gt; to write our data structure to a temporary DSC file set and return the good old IQueryable&amp;lt;T&amp;gt; that we can save using the ToDsc&amp;lt;T&amp;gt;:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:ade32917-83bf-46b5-819b-6a30e845dbd9" class="wlWriterSmartContent"&gt; &lt;div class="le-pavsc-container"&gt; &lt;div style="background-color:#ffffff;max-height:300px;overflow:auto;padding:2px 5px;white-space:nowrap;"&gt;&lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; andy = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt; { Id = 1, Name = &lt;span style="color:#a31515;"&gt;&amp;quot;Andy&amp;quot;&lt;/span&gt; };&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; kelly = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt; { Id = 2, Name = &lt;span style="color:#a31515;"&gt;&amp;quot;Kelly&amp;quot;&lt;/span&gt; };&lt;br /&gt; &lt;span style="color:#0000ff;"&gt;var&lt;/span&gt; source = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;List&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;&amp;gt;();&lt;br /&gt; source.Add(andy);&lt;br /&gt; source.Add(kelly);&lt;br /&gt; &lt;br /&gt; context.FromEnumerable&amp;lt;&lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;&amp;gt;(source)&lt;br /&gt;        .ToDsc&amp;lt;&lt;span style="color:#2b91af;"&gt;Person&lt;/span&gt;&amp;gt;(&lt;span style="color:#a31515;"&gt;&amp;quot;PersonFileSet&amp;quot;&lt;/span&gt;)&lt;br /&gt;        .SubmitAndWait(context);&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;p&gt;So, let’s just see what we have, shall we?&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;First we call the FromEnumerable&amp;lt;Person&amp;gt; 
method that works in the following manner:      &lt;ul&gt;       &lt;li&gt;It creates a temporary file set with one file. &lt;/li&gt;        &lt;li&gt;Then, the FromEnumerable method loops over the IEnumerable it receives as a parameter and wraps each member in an ObjectRecord for writing to the file. &lt;/li&gt;        &lt;li&gt;Finally, FromEnumerable&amp;lt;Person&amp;gt; calls FromDsc&amp;lt;ObjectRecord&amp;gt; and uses a simple Select to return IQueryable&amp;lt;Person&amp;gt;. &lt;/li&gt;     &lt;/ul&gt;   &lt;/li&gt;    &lt;li&gt;Next, we use the ToDsc&amp;lt;Person&amp;gt; call to create a file set named “PersonFileSet” from the IQueryable&amp;lt;Person&amp;gt; that the FromEnumerable&amp;lt;Person&amp;gt; method just returned. You might have noticed that ToDsc&amp;lt;T&amp;gt; has the magical ability to write runtime serializable objects and not just custom HPC serializable ones. I warned you: the API for working with serialized objects is a mess. &lt;/li&gt;    &lt;li&gt;Finally, we call the SubmitAndWait method which obviously submits a job to the head node, and then, wait for it… it waits for it to complete. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;You might get the feeling we are no better off than before. Granted, we wrote a lot less code than in the previous version. But FromEnumerable&amp;lt;T&amp;gt; does exactly what we did in our first sample, and on top of that we then query the temporary file set and write it again to the final file set. Astute readers might even notice that both file sets contain a single file: for the temporary file set a single file was created manually, and if you take a closer look at the sample you will notice that “PersonFileSet” was created implicitly by the ToDsc&amp;lt;Person&amp;gt; method. So we are not even distributing our data. 
There is a way to distribute data using this technique, called partitioning, which I’ll cover in the next part of this tutorial.&lt;/p&gt;  &lt;h5&gt;Summary&lt;/h5&gt;  &lt;p&gt;The fact that DSC and the rest of the Dryad stack are still in their early stages doesn&amp;#39;t change the fact that we are looking at a technology that can change the face of parallel applications. And while it is likely that some (if not all) of the extensions available today in HpcLinqExtras will find their way into the final release of DryadLINQ, it is important to understand how they work and what implications come with them.&lt;/p&gt;</description><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DEV/default.aspx">DEV</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HPC/default.aspx">HPC</category><category
domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Windows+HPC+Server+2008+R2+SP1/default.aspx">Windows HPC Server 2008 R2 SP1</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Parallel+Programing/default.aspx">Parallel Programing</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DryadLINQ/default.aspx">DryadLINQ</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Dryad/default.aspx">Dryad</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DSC/default.aspx">DSC</category></item><item><title>There's nothing like a job well done</title><link>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/04/14/there-s-nothing-like-a-job-well-done.aspx</link><pubDate>Thu, 14 Apr 2011 13:21:00 GMT</pubDate><guid isPermaLink="false">b5c4f5bc-c09b-4439-a595-91a98c1847df:818413</guid><dc:creator>Yaniv Rodenski</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.microsoft.co.il/blogs/roadan/rsscomments.aspx?PostID=818413</wfw:commentRss><comments>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/04/14/there-s-nothing-like-a-job-well-done.aspx#comments</comments><description>&lt;p&gt;Today @ MIX &lt;a href="http://blogs.microsoft.co.il/blogs/idof/archive/2011/04/14/published-article-windows-hpc-with-burst-to-windows-azure-application-models-and-data-considerations.aspx"&gt;Ido&lt;/a&gt; and I got some very good news. 
The first part of our work for the HPC Azure Burst training kit was released today for &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_1A58897E.png"&gt;&lt;img style="background-image:none;border-bottom:0px;border-left:0px;padding-left:0px;padding-right:0px;display:inline;float:right;border-top:0px;border-right:0px;padding-top:0px;" title="logo_hpc _466x165" border="0" alt="logo_hpc _466x165" align="right" src="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_thumb_2117A4CF.png" width="198" height="107" /&gt;&lt;/a&gt;&lt;a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=acde41c6-153a-4181-912e-78024fcc86da"&gt;download&lt;/a&gt; by Microsoft. Oh, and we got a Kinect.&lt;/p&gt;  &lt;p&gt;The released document provides an overview of the architectural considerations for using Azure compute nodes as a part of your cluster.&lt;/p&gt;  &lt;p&gt;There is a lot of work still ahead of us and we will kick out some very cool content on MSDN (and cooler stuff on our blogs &lt;img style="border-bottom-style:none;border-left-style:none;border-top-style:none;border-right-style:none;" class="wlEmoticon wlEmoticon-smile" alt="Smile" src="http://blogs.microsoft.co.il/blogs/roadan/wlEmoticon-smile_079F7BC8.png" /&gt;) so stay tuned.&lt;/p&gt;  &lt;p&gt; Yaniv&lt;/p&gt;</description><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DEV/default.aspx">DEV</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HPC/default.aspx">HPC</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Azure/default.aspx">Azure</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/MIX/default.aspx">MIX</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/Windows+HPC+Server+2008+R2+SP1/default.aspx">Windows HPC Server 2008 R2
SP1</category></item><item><title>Understanding HPC SOA application</title><link>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/04/10/understanding-hpc-soa-application.aspx</link><pubDate>Sun, 10 Apr 2011 21:00:00 GMT</pubDate><guid isPermaLink="false">b5c4f5bc-c09b-4439-a595-91a98c1847df:817491</guid><dc:creator>Yaniv Rodenski</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://blogs.microsoft.co.il/blogs/roadan/rsscomments.aspx?PostID=817491</wfw:commentRss><comments>http://blogs.microsoft.co.il/blogs/roadan/archive/2011/04/10/understanding-hpc-soa-application.aspx#comments</comments><description>&lt;p&gt;As server developers, we are used to a certain level of interactivity. Our services get a request and often return some kind of response. Lately I have found myself justifying the use of HPC batch jobs. It’s sometimes hard to grasp that classic HPC programs, like the Human Genome Project, rendering a feature-length 3D animated movie, &lt;a href="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_014B1D74.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="logo_hpc _466x165" border="0" alt="logo_hpc _466x165" align="right" src="http://blogs.microsoft.co.il/blogs/roadan/logo_hpc_466x165_thumb_2FE12656.png" width="198" height="107" /&gt;&lt;/a&gt;or simply operating a “civilian purposes” nuclear reactor, take a long time. And when you execute such programs for days, weeks, months, etc., interactivity is not even a consideration. &lt;/p&gt;  &lt;p&gt;Batch jobs are the very essence of classic HPC applications. Still, Microsoft is trying to expose Windows HPC Server 2008 to new verticals, some of which would like to use it for more interactive tasks. 
For such purposes Windows HPC Server 2008 supports an SOA model exposed via WCF. Microsoft also released a &lt;a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=B6208B6F-6E22-40F2-955B-EA82556656BF&amp;amp;displaylang=en"&gt;cluster debugger&lt;/a&gt; that has some cool features such as:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;Cluster debugging &lt;/li&gt;    &lt;li&gt;Local debugging &lt;/li&gt;    &lt;li&gt;Running service code locally in a simulated Windows Azure environment &lt;/li&gt;    &lt;li&gt;Two project templates for creating both Interactive and Durable Session clients &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;There is also a decent MSDN &lt;a href="http://msdn.microsoft.com/en-us/library/ff686939.aspx"&gt;walkthrough&lt;/a&gt; that shows how to create and debug an HPC SOA application. I recommend running through it, to get a feeling for how HPC SOA applications are built. In this post, I would like to take a deeper look at both session types and some of the mechanisms and techniques they utilize, both on the client and in the cluster.&lt;/p&gt;  &lt;h5&gt;Sessions&lt;/h5&gt;  &lt;p&gt;Sessions are an essential part of HPC SOA clients. There are two types of sessions: Interactive and Durable. Neither HPC client session type provides the same semantics that WCF sessions provide, where every call during the session uses the same instance of the service class on the server side. In fact, one of the roles of sessions, when used correctly, is to ensure that calls during the session will be load-balanced between different compute nodes in the cluster. Durable sessions provide an additional key capability, which I will describe below. To understand this process we need to take a look at two more components: the Job Scheduler and the Broker Node.&lt;/p&gt;  &lt;h5&gt;The Job Scheduler&lt;/h5&gt;  &lt;p&gt;The job scheduler is the main component that runs on the head node. It handles units of work called jobs. 
The main concern of the job scheduler is to allocate the necessary resources for the job and start sub-units of the job, called tasks, on the allocated compute nodes. In an SOA application, the tasks that the job scheduler creates are called service tasks and they host the services defined in the job.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.microsoft.co.il/blogs/roadan/image_339ECB26.png"&gt;&lt;img style="background-image:none;border-right-width:0px;padding-left:0px;padding-right:0px;display:inline;border-top-width:0px;border-bottom-width:0px;border-left-width:0px;padding-top:0px;" title="A session starting service tasks using the job scheduler" border="0" alt="HPC SOA job scheduler" src="http://blogs.microsoft.co.il/blogs/roadan/image_thumb_204DEEBA.png" width="500" height="245" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;font size="1"&gt;Figure 1: A session starting service tasks using the job scheduler&lt;/font&gt;&lt;/p&gt;  &lt;p&gt;In SOA applications the job scheduler has one more task: starting a broker node that will be used to load balance all service calls between the service tasks.&lt;/p&gt;  &lt;h5&gt;The Broker Node&lt;/h5&gt;  &lt;p&gt;The broker node provides a few key capabilities for SOA applications. The first is exposing an endpoint that receives every service call targeting the specific service job. Every call that is sent to the broker node through that endpoint will be load balanced between the service tasks in that job. 
&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.microsoft.co.il/blogs/roadan/image_70CF5A20.png"&gt;&lt;img style="background-image:none;border-bottom:0px;border-left:0px;padding-left:0px;padding-right:0px;display:inline;border-top:0px;border-right:0px;padding-top:0px;" title="A message being routed through the broker node" border="0" alt="broker node SOA HPC" src="http://blogs.microsoft.co.il/blogs/roadan/image_thumb_0524F9DF.png" width="505" height="289" /&gt;&lt;/a&gt;&lt;/p&gt;  &lt;p&gt;&lt;font size="1"&gt;Figure 2: A Windows HPC Service Message Lifecycle&lt;/font&gt;&lt;/p&gt;  &lt;p&gt;Now that we understand the basics, we can look into two of the more powerful (and cool) scenarios HPC provides for SOA applications.&lt;/p&gt;  &lt;h5&gt;Durable Sessions&lt;/h5&gt;  &lt;p&gt;So far we have discussed mechanisms that work the same way in both Interactive and Durable sessions. The main difference between the two session types is (not surprisingly) durability. Durable sessions simply save the response messages in an MSMQ queue on the broker node, where they can be retrieved by clients: either the initiating client or any other client with the permissions to attach itself to the session. 
Another difference is that in order to use durable sessions, one must use the BrokerClient class in order to send and receive messages defined as MessageContracts as shown in the following snippet:&lt;/p&gt;  &lt;div style="padding-bottom:0px;margin:0px;padding-left:0px;padding-right:0px;display:inline;float:none;padding-top:0px;" id="scid:9ce6104f-a9aa-4a17-a79f-3a39532ebf7c:bb5050d2-ffe3-444c-883b-d6375fadd40a" class="wlWriterEditableSmartContent"&gt; &lt;div style="border:#000080 1px solid;color:#000;font-family:&amp;#39;Courier New&amp;#39;, Courier, Monospace;font-size:10pt;"&gt; &lt;div style="background-color:#ffffff;max-height:500px;overflow:auto;padding:2px 5px;"&gt;&lt;span style="color:#0000ff;"&gt;using&lt;/span&gt; (&lt;span style="color:#2b91af;"&gt;DurableSession&lt;/span&gt; session = &lt;br /&gt;        &lt;span style="color:#2b91af;"&gt;DurableSession&lt;/span&gt;.CreateSession(info))&lt;br /&gt; {&lt;br /&gt;     &lt;span style="color:#2b91af;"&gt;Binding&lt;/span&gt; binding = &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;BasicHttpBinding&lt;/span&gt;();&lt;br /&gt; &lt;br /&gt;     &lt;span style="color:#0000ff;"&gt;using&lt;/span&gt; (&lt;span style="color:#2b91af;"&gt;BrokerClient&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;ISquareService&lt;/span&gt;&amp;gt; client =&lt;br /&gt;         &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;BrokerClient&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;ISquareService&lt;/span&gt;&amp;gt;(session, binding))&lt;br /&gt;     {&lt;br /&gt;         &lt;span style="color:#008000;"&gt;// Set the response handler&lt;/span&gt;&lt;br /&gt;         client.SetResponseHandler&amp;lt;&lt;span style="color:#2b91af;"&gt;SquareResponse&lt;/span&gt;&amp;gt;((response) =&amp;gt;&lt;br /&gt;         {&lt;br /&gt; &lt;br /&gt;             &lt;span style="color:#0000ff;"&gt;int&lt;/span&gt; reply = 
response.Result.SquareResult;&lt;br /&gt;             &lt;span style="color:#2b91af;"&gt;Console&lt;/span&gt;.WriteLine(&lt;span style="color:#a31515;"&gt;&amp;quot;Received response for request {0}: {1}&amp;quot;&lt;/span&gt;,&lt;br /&gt;                   response.GetUserData&amp;lt;&lt;span style="color:#0000ff;"&gt;int&lt;/span&gt;&amp;gt;(), reply);&lt;br /&gt;         });&lt;br /&gt; &lt;br /&gt;         client.SendRequest&amp;lt;&lt;span style="color:#2b91af;"&gt;SquareRequest&lt;/span&gt;&amp;gt;(&lt;br /&gt;             &lt;span style="color:#0000ff;"&gt;new&lt;/span&gt; &lt;span style="color:#2b91af;"&gt;SquareRequest&lt;/span&gt;(1000 + i), i);&lt;br /&gt;     }&lt;br /&gt; }&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;  &lt;h5&gt;Grow/Shrink&lt;/h5&gt;  &lt;p&gt;Grow/Shrink is basically a scheduling policy, and while the job scheduler provides a few of these, Grow/Shrink is the most relevant for SOA applications. As its name suggests, Grow/Shrink allows administrators to add or remove resources for a job over time. This helps in dealing with peaks and can even be done using Windows Azure Worker Roles as additional compute nodes. The cool thing is that once you add the resources to the job, the job scheduler notifies the broker node about the new resources and the broker node can now take them into account while load balancing.&lt;/p&gt;  &lt;h5&gt;Conclusion&lt;/h5&gt;  &lt;p&gt;SOA is still the ugly duckling in the world of HPC. In fact, the whole concept of using sessions to start a service on the cluster comes from the classic HPC paradigm of a single job that needs to be distributed on a cluster. 
However, you can also start a service job directly on the server using the HPC 2008 Cluster Manager, or use some very simple patterns to start a session that will be shared among all clients, transforming Windows HPC Server 2008 R2 into a lean, mean SOA machine.&lt;/p&gt;  &lt;p&gt;There are many capabilities just waiting to be used in commercial SOA applications and the hardware does not need to be a monstrosity (but that can definitely make things cooler). The constant improvement in Windows Azure support makes this type of solution ideal for dealing with uncommon or unpredictable peaks, since there is no need to buy the hardware to support them.&lt;/p&gt;  &lt;p&gt;So until next time, remember: real developers use big computers.&lt;/p&gt;  &lt;p&gt;Yaniv&lt;/p&gt;</description><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/DEV/default.aspx">DEV</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/HPC/default.aspx">HPC</category><category domain="http://blogs.microsoft.co.il/blogs/roadan/archive/tags/SOA/default.aspx">SOA</category></item></channel></rss>