I had a very good time today talking at the Israeli Web Developers Community meeting held at Microsoft Raanana. I want to thank everyone for participating in a great discussion about REST, Hypermedia and ASP.NET Web API.
I would like to take this chance to share my demo code (available here) and slide deck:
It seems that every time I give a talk about Web API something big happens: last time Microsoft moved it from WCF to ASP.NET, and this week Microsoft announced the release of the ASP.NET web stack under an open source license.
This announcement comes at the beginning of a couple of very exciting months for me, packed with talks about a few things that were the focus of my professional life for the past year:
REST, Web API
On Monday, April 2, I will be talking @ the Israeli web developers community (WDCIL) about REST via ASP.NET Web API. In this session we will discuss REST/Hypermedia services and how ASP.NET Web API can help us develop them.
You can register here for the event, hosted at the Microsoft office in Ra'anana.
Apache Hadoop-based services for Windows Azure
For the last 6 months I have been a part of a wonderful team at Sela that created training content focusing on Apache Hadoop-based services for Windows Azure. In June, I will have a couple of exciting sessions sharing the experiences we had developing and using Hadoop on Azure:
At the beginning of June, I’ll be traveling to Oslo with my friend and colleague Ido Flatow, where we will both be speaking @ the Norwegian Developers Conference (NDC). If you are there I must recommend Ido’s Fiddler session (I have seen it myself a couple of times).
Towards the end of June, I will be traveling to New York to speak at the first ever QCon New York. I am honored to be joining a real all-star panel of speakers and can’t wait to be a part of this exciting conference.
If you are planning on attending QCon New York, you are also invited to use my discount code, RODE100, and get a $100 discount on all prices.
There are some more talks coming up soon, but for now I have a lot of work ahead of me. I hope to see you at one of my talks, at a conference or developers community near you.
For those of you who do not follow me on twitter (shame on you), last Thursday’s announcement about the reincarnation of the WCF Web API as the ASP.NET Web API could not have come at a worse time: I was scheduled to deliver a 3-hour session at Microsoft, for an audience of WCF developers, entitled REST via the WCF Web API. Since yours truly is not one of those speakers who is willing to talk about last week's technology, I spent most of the weekend rewriting my demos and rebuilding the session (the demos can be found here).
Now that the storm has passed, I had some time to go back and revisit my newfound love. Before we begin I have one confession to make: the last time I actually wrote ASP code, that is what it was called, ASP, and .NET was starting its first beta. I guess what I am going through right now is going to be a common pain for WCF developers who developed HTTP services in the past few years.
So for my session I built a fantasy football site called my footy. My first and most basic task was to create my services controller (formerly known as a service) using the ASP.NET MVC 4 template for Web API. The basic template creates a project built more or less like most MVC projects (as far as a WCF guy like me can tell) with minor differences. My first stop was to create the players controller, which looked more or less like this:
As you can see I have different methods (that’s C# methods) for different HTTP verbs. The verbs are mapped to the methods by convention (apparently this is called convention over configuration). This convention will map the verb to any method whose name starts with the verb, so I can have even more meaningful method names. As you can see, my Post and Delete methods accept a name parameter that in the old days (last week) we used to map using a Uri template in our WebGet or WebInvoke attributes. Now, in the days of ASP.NET Web API, this is mapped as part of the routing registration in the Global.asax like this:
    routes.MapHttpRoute(
        name: "DefaultApi",
        routeTemplate: "{controller}/{name}",
        defaults: new { name = RouteParameter.Optional }
    );
This tells ASP.NET to use the above template when accessing the application: the first segment is the name of the controller and the second is a parameter called name. Since I am using IIS to run my application, I needed to allow Delete, which is blocked by default by WebDAV (thanks to Sebastian Pederiva for his IIS support), so I added the following section to my default web site’s Web.Config:
Works like a charm. The next feature I added to my players service was transfers, so my users could trade players. I added the Transfer method, which looks like this:
So now I had to tweak my routing to support this, so I’ve added a second HTTP route mapping like this:
    routes.MapHttpRoute(
        name: "TransferApi",
        routeTemplate: "{controller}/{action}/{teamName}",
        defaults: new { teamName = RouteParameter.Optional }
    );
Now I could use HTTP Post to access my method using the Uri template “players/transfer/{teamName}”. But wait, that is not all: I could now also access my method using HTTP Get. This is apparently the usual behavior for ASP.NET MVC, and we can solve it by adding the HttpPost attribute:
That’s better, but we are not nearly done. One of the side effects of my second route mapping is that my Get, Post and Delete methods are now also mapped as actions (besides being mapped to HTTP verbs), so I can get the representation of a player using either http://localhost/players/neymar or http://localhost/players/get/neymar as an address. While I can live with two addresses exposing the same resource, the issue is a bit trickier with my Delete method, which is now accessible either by using the verb Delete with the http://localhost/players/neymar Uri or by using the verb Get with the http://localhost/players/delete/neymar Uri. Using Get to delete resources is just bad HTTP, if for no other reason than the fact that HTTP assumes Get is both safe and idempotent (you can read more about safe and idempotent verbs here), so I added the HttpDelete attribute to make sure my service is consumed the way I planned it to be consumed:
I guess if you are coming from an ASP.NET MVC background most of what I showed here might seem trivial, but for a WCF developer there are a lot of things to notice here. With that said and done, I do hope ASP.NET Web API will encourage developers to write more HTTP services, no matter what their background is.
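To make the naming convention described above a bit more concrete, here is a minimal, self-contained sketch of the idea: pick the method whose name starts with the HTTP verb. It deliberately uses plain reflection and no ASP.NET types, so it is only an illustration of the convention and not how ASP.NET Web API actually resolves actions. The PlayersController and VerbConvention types here are hypothetical names made up for the example.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a Web API controller; no ASP.NET types are involved.
public class PlayersController
{
    public string GetPlayer(string name) { return "GET " + name; }
    public string PostPlayer(string name) { return "POST " + name; }
    public string DeletePlayer(string name) { return "DELETE " + name; }
}

public static class VerbConvention
{
    // Returns the first public instance method declared on the controller whose
    // name starts with the verb (case-insensitive), or null when none matches.
    public static MethodInfo Resolve(Type controller, string httpVerb)
    {
        return controller
            .GetMethods(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
            .FirstOrDefault(m => m.Name.StartsWith(httpVerb, StringComparison.OrdinalIgnoreCase));
    }

    // Invokes the method selected by the convention, passing the name parameter.
    public static object Dispatch(object controller, string httpVerb, string name)
    {
        MethodInfo method = Resolve(controller.GetType(), httpVerb);
        if (method == null)
            throw new InvalidOperationException("No method matches verb " + httpVerb);
        return method.Invoke(controller, new object[] { name });
    }
}
```

Calling VerbConvention.Dispatch(new PlayersController(), "DELETE", "neymar") lands on DeletePlayer simply because of its name, which is why a more descriptive method name still works as long as the verb prefix is kept.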
Dependency injection is a great technique to reduce coupling between components and improve testability. There are a few ways to implement dependency injection: you can use a framework like MEF or Spring to automate it, but I personally favor manually injected dependencies. Call me old-fashioned, but I like creating objects via simple constructor calls (most of the time).
This is really straightforward most of the time, but when dealing with WCF services there is a slight complexity to take into consideration. In most scenarios WCF is in charge of instantiating the service class (the only exception is the single instance context mode, where we can supply ServiceHost with a ready-made instance of our service class).
Lately I have come across a really cool (and simple) option in WCF Web API. The WCF Web API supplies an HttpConfiguration API that exposes a CreateInstance delegate we can use to manually create a new instance of our service class:
    var factory = new HttpServiceHostFactory() { Configuration = config };
While this API is cool, it can only be used for HTTP-based services (using the WCF Web API). I really felt like using something like that in a SOAP-based project I am currently working on, so I figured, what the heck, I can create a similar solution (source code can be found here) for any WCF service host out there.
The first stop was creating an IExtension<ServiceHostBase> that can transport the delegate down the WCF pipeline:
To hook up the ManualInstanceProvider we need to create a service behavior and implement the ApplyDispatchBehavior method like this:
    public void ApplyDispatchBehavior(ServiceDescription serviceDescription, ServiceHostBase serviceHostBase)
    {
        foreach (ChannelDispatcherBase cdb in serviceHostBase.ChannelDispatchers)
        {
            ChannelDispatcher cd = cdb as ChannelDispatcher;
            if (cd != null)
            {
                foreach (EndpointDispatcher ed in cd.Endpoints)
                {
                    ed.DispatchRuntime.InstanceProvider = new ManualInstanceProvider();
                }
            }
        }
    }
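For readers who want to picture the two supporting types named in this post, here is a rough sketch of what InstanceInitializerExtension and ManualInstanceProvider might look like. The type and property names are taken from the snippets in this post, but the bodies are my assumption of a minimal implementation rather than the downloadable source; it needs a reference to System.ServiceModel.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.ServiceModel.Dispatcher;

// Carries the factory delegate on the host so the dispatch pipeline can reach it.
public class InstanceInitializerExtension : IExtension<ServiceHostBase>
{
    public Func<object> InstanceInitializer { get; set; }

    // Nothing to wire up here; the extension only stores the delegate.
    public void Attach(ServiceHostBase owner) { }
    public void Detach(ServiceHostBase owner) { }
}

// Asks the host's extension for a service instance whenever WCF needs one.
public class ManualInstanceProvider : IInstanceProvider
{
    public object GetInstance(InstanceContext instanceContext)
    {
        var extension = instanceContext.Host.Extensions.Find<InstanceInitializerExtension>();
        if (extension == null || extension.InstanceInitializer == null)
            throw new InvalidOperationException("No instance factory was configured on the host.");
        return extension.InstanceInitializer();
    }

    public object GetInstance(InstanceContext instanceContext, Message message)
    {
        return GetInstance(instanceContext);
    }

    // Assumption: disposing the instance on release is the desired cleanup.
    public void ReleaseInstance(InstanceContext instanceContext, object instance)
    {
        var disposable = instance as IDisposable;
        if (disposable != null) disposable.Dispose();
    }
}
```

The design choice worth noting is that the provider holds no state of its own: it looks the delegate up through the host, which is what lets the behavior create it with a parameterless constructor.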
The last thing we need to do is create an extension method for ServiceHostBase that will allow setting the delegate as the factory function of our host:
    public static void ConfigureInstanceFactory(this ServiceHostBase host, Func<object> instanceInitializer)
    {
        // adding a behavior that hooks up the ManualInstanceProvider
        host.Description.Behaviors.Add(new InstanceCreationBehavior());

        // passing the instance initializer down the rabbit hole
        host.Extensions.Add(new InstanceInitializerExtension { InstanceInitializer = instanceInitializer });
    }
Now we are ready to create our service host:
    var host = new ServiceHost(typeof(PlayersService));
    host.ConfigureInstanceFactory(() => new PlayersService(new PlayersDal()));

    host.Open();
Summary
Using dependency injection can be, some (if not most) of the time, a simple task. Using this extension we can utilize whichever technique we choose.
I am very happy to announce the opening of a new Windows Azure and SQL Azure forum in Hebrew. The new forum is managed by Shay Friedman and myself.
If you are new to Windows Azure, I would like to use this opportunity to invite you to experience cloud development with our support and guidance. If you are already experienced with Windows Azure development I would like to assure you that you can find in our forum help from highly experienced professionals.
With your participation I am sure we can create an awesome and vibrant community.
The Asynchronous Queuing Pattern describes a classic way to improve service throughput in distributed applications. Over the years I have seen quite a few implementations of this pattern, from the use of MSMQ to ReactiveQueue, each with its own strengths and weaknesses. Windows Azure queue storage is designed for passing messages between applications in a persisted, scalable and controlled manner. With the above attributes, queue storage is a natural choice for enabling the Asynchronous Queuing Pattern, as described in detail in this MSDN magazine article.
A recent implementation I ran across at a client challenged the performance of the Azure queue storage, especially when dealing with a large queue. Their initial implementation was too slow due to a design issue we identified easily, but now they were stuck with a queue containing millions of records and they could not retrieve the messages fast enough. I decided to measure the duration of the different queue operations they were using.
The code I used to measure the performance is very simple and can be found here so you can reproduce the tests for yourself. Keep these considerations in mind:
We are using a public storage infrastructure that is prone to preemption by other applications.
The Windows Azure storage infrastructure and API implementations are subject to change.
The following totals reflect 1000 iterations (minus the first 2 to remove the additional cost of the JIT compiler and other potential initialization overhead) of a standard consumer/producer use of Windows Azure queue storage:
Operation                 Total ticks     % Execution
new CloudQueueMessage         3013727     0.000513214
AddMessage               444328027497     75.66556694
GetMessage                79718072883     13.57536056
DeleteMessage             63164926400     10.75648996
AsString                     12151612     0.002069324
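Each percentage above is simply the operation's share of the total ticks across all measured operations. A minimal sketch of a measuring helper along those lines follows; it is not the code linked above, just an illustration, and the OperationTimer type is a hypothetical name. It accumulates Stopwatch ticks per operation and ignores the first few calls of each operation as warm-up.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public class OperationTimer
{
    private readonly Dictionary<string, long> _ticks = new Dictionary<string, long>();
    private readonly Dictionary<string, int> _calls = new Dictionary<string, int>();
    private readonly int _warmUpCalls;

    public OperationTimer(int warmUpCalls) { _warmUpCalls = warmUpCalls; }

    // Runs the operation once and records how many ticks it took.
    public void Measure(string name, Action operation)
    {
        var watch = Stopwatch.StartNew();
        operation();
        watch.Stop();
        Record(name, watch.ElapsedTicks);
    }

    // Adds a measurement; the first warm-up calls per operation are not counted.
    public void Record(string name, long elapsedTicks)
    {
        int calls;
        _calls.TryGetValue(name, out calls);
        _calls[name] = calls + 1;
        if (calls < _warmUpCalls) return;

        long total;
        _ticks.TryGetValue(name, out total);
        _ticks[name] = total + elapsedTicks;
    }

    public long TotalTicks(string name)
    {
        long total;
        _ticks.TryGetValue(name, out total);
        return total;
    }

    // Share of this operation in the total counted ticks, as a percentage.
    public double PercentOfTotal(string name)
    {
        long all = _ticks.Values.Sum();
        return all == 0 ? 0.0 : 100.0 * TotalTicks(name) / all;
    }
}
```

In a producer/consumer loop you would wrap each queue call, for example timer.Measure("AddMessage", () => queue.AddMessage(message)), and print PercentOfTotal for each operation at the end.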
The first thing we notice is that we can easily improve the message retrieval code. In the above code we used the GetMessage method to retrieve the messages one by one. However, the Windows Azure Queue API also allows the retrieval of up to 32 messages at a time using the GetMessages method. As you can see in the results from the following run, message retrieval was over 6 times faster.
Note: since I omitted the first two iterations of GetMessages, I also omitted the first 64 iterations of every other queue operation, so at the end of the day we are looking at 936 messages rather than 998, but still the improvement is clearly noticeable.
Operation                 Total ticks     % Execution
new CloudQueueMessage         2907419     0.000599733
AddMessage               428481361062     88.3858044
GetMessages(32)           12041770036     2.483938924
DeleteMessage             44255399085     9.128866277
AsString                      3833020     0.000790663
The next stop on our quest for throughput improvement is the deletion of messages from the queue after we retrieve them. The consumer has to perform this operation in order to clear the message from the queue and ensure reliability. The call to DeleteMessage can also be easily improved. If you take a closer look at the code, you can see that we are using the DeleteMessage method, which is a synchronous call to the Azure Queue service. However there is no real need to wait for this call, so we can use its async implementation by calling BeginDeleteMessage. The results of this run (again for 1000 iterations minus 64) are shown here:
Operation                 Total ticks     % Execution
new CloudQueueMessage         4904719     0.001183763
AddMessage               401853371789     96.98804476
GetMessages(32)           12041770036     2.906303177
BeginDeleteMessage          429024316     0.103545802
AsString                      3822202     0.000922495
In our sample code, we do not handle exceptions for BeginDeleteMessage (as well as for DeleteMessage) but we can easily do so by passing a callback function to BeginDeleteMessage, which calls the EndDeleteMessage method inside a try/catch block.
Until this point, we have dramatically improved the consumer code for our queue, which I must admit was the easy part. The producer part is going to be a bit more problematic. Windows Azure Queue Storage exposes an APM-based API for adding messages to the queue (using the BeginAddMessage/EndAddMessage methods). If you are adding to the queue from a client application you can use this API to release the calling thread and let the network card perform the majority of the heavy lifting.
If you are adding to the queue from a WCF service this will not be enough; you should consider using an asynchronous service contract. More information about implementing asynchronous services (and asynchronous calls in WCF in general) can be found in this blog post by Wenlong Dong.
Summary
Windows Azure Queue Storage was created with the SOA Asynchronous Queuing Pattern in mind. Using its async APIs (based on WCF's awesome async capabilities) and calling the GetMessages batch method, we were able to improve its throughput and lower the need for more compute instances.
A new beta has been released since I wrote part 1 of this tutorial. While very little was changed in the product, we have a new name. Another thing that held me back personally from publishing this part was the fact that LINQ to HPC is not a part of Windows HPC R2 SP2. So without further ado I am proud to present the second part of my tutorial about LINQ to HPC.
In part 1 of this tutorial we discussed the fundamentals of DSC: how to manually write data to DSC files and how to use the FromEnumerable<T> extension method (from the HpcLinqExtras project) to implicitly save object data to a temporary file set (in order to use it inline in a subsequent query). We also saw a caveat in this method, namely that because FromEnumerable<T> saves the data to a single file in the temporary file set, the subsequent query cannot be parallelized. This is due to the fact that LINQ to HPC runs any query logic locally on the DSC node containing the data to which it refers.
The task at hand is quite straightforward: we would like to partition our data into logical pieces that can be distributed across the cluster. Before we start discussing how we can physically partition data in LINQ to HPC, I would like to consider the logic we will use for dividing the data into groups. In order to do so we will take a look at vertices, which are the basic tasks that execute the query on the cluster. I will describe vertices in detail in a later part of this tutorial, but for now there are a few facts I would like you to consider:
A vertex can only use data from a single DSC file, located on the node it is executing on. This is, of course, in order to preserve data locality. The main implication of this little fun fact is that we should make sure that pieces of data that are dependent on each other reside contiguously in the DSC file set. A good example of this is the use of GroupBy in a query. Let's create a Student class defined as follows:
Now let’s say we are grouping our students by nationality, so our data should be ordered like this:
Dryad can execute local queries in each vertex and then union all the groups. If the same data needs to be reordered by the query (let’s say items were ordered by Id in the query), the first thing LINQ to HPC would need to do is to reorganize the data into intermediate files, and only then execute the necessary logic. Note: grouping operators are a bit more complex when it comes to LINQ to HPC and will be discussed in a later part of this tutorial.
A vertex will process all the data in the DSC file it is accessing. This means that if we would like to break down the processing of local queries into smaller pieces, we need to break the data into smaller files. This is possible since DSC file sets support creating more files than the number of nodes.
We can control the order in which our objects are written to file when using custom HPC serialization (as I have shown in part 1 of the tutorial). However this can become tedious, especially if we need to use the same data in different queries that can benefit from different partitioning and ordering.
Repartitioning Operators
Repartitioning operators are LINQ to HPC operators that result in intermediate DSC files partitioned in a way that is not dependent on the partitioning of the input files. There are two Repartitioning operators in LINQ to HPC: Hash and Range Partitioning.
Hash Partitioning
Hash partitioning provides a mechanism for partitioning data that is not sorted. Returning to our students sample, nationality is a prime candidate for hash partitioning. To use hash partitioning, call the HashPartition operator, which provides an overload that accepts the number of partitions to create. You can then use the ToDsc operator to create a new DSC file set and call SubmitAndWait to commit the operation (I reviewed these steps in part 1 of this tutorial):
    // getting the list of students
    List<Student> students = GetStudentsList();

    // saving the students hash partitioned to the file set with 5 partitions
    context.FromEnumerable<Student>(students)
        .HashPartition(std => std.AvgGrade, 5)
        .ToDsc<Student>("StudentsFileSet")
        .SubmitAndWait(context);
The way hash partitioning selects the partition for a specific entity is by performing a mod operation between the hash code of the value returned by the key selector and the number of partitions. The following code mimics the behavior of hash partitioning regarding the partition selection:
    var students = GetStudentsList();

    foreach (var student in students)
    {
        int portNum = student.GetHashCode() % 5;

        var str = "the student {0} with nationality {1} will be written into partition no: {2}";
        Console.WriteLine(str, student.Name, student.Nationality, portNum);
    }
This method is disappointingly crude. If you run this code (supplied with my samples) you will see that although we have instructed the HashPartition operator to create 5 partitions, the mod operation yields only four different values. This is of course due to the nature of the values in our key selector (none of them divides evenly by 5). This result is somewhat arbitrary, and the data could have been distributed in many ways (even and uneven) depending on the result of the key selector's GetHashCode. To overcome this pitfall, HashPartition has another overload that accepts an IEqualityComparer that can be used to override the GetHashCode implementation of the key selector.
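To see how a comparer can change the distribution, here is a small self-contained sketch that does not use LINQ to HPC at all. HashPartitionSketch and NationalityComparer are hypothetical names, the list of nationalities is made up for the example, and clearing the sign bit before the mod is my own assumption to keep the index non-negative (I have not verified how HashPartition treats negative hash codes).

```csharp
using System;
using System.Collections.Generic;

public static class HashPartitionSketch
{
    // Maps a key to a partition index in [0, partitionCount) using its hash code.
    public static int PartitionOf<TKey>(TKey key, int partitionCount, IEqualityComparer<TKey> comparer = null)
    {
        comparer = comparer ?? EqualityComparer<TKey>.Default;
        int hash = comparer.GetHashCode(key);
        // Clear the sign bit so the remainder is never negative.
        return (hash & int.MaxValue) % partitionCount;
    }
}

// Hypothetical comparer that gives each known nationality its own hash code,
// so five nationalities spread over five partitions.
public class NationalityComparer : IEqualityComparer<string>
{
    private static readonly string[] Known = { "Brazil", "Spain", "Italy", "France", "Norway" };

    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    // Known nationalities hash to their index; anything else shares one extra value.
    public int GetHashCode(string nationality)
    {
        int index = Array.IndexOf(Known, nationality);
        return index >= 0 ? index : Known.Length;
    }
}
```

With the default comparer the spread depends entirely on whatever hash codes the keys happen to produce, while the custom comparer makes the mapping explicit, at the cost of having to know the key values in advance.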
Range Partitioning
Range partitioning allows the ordered partitioning of sorted keys. Returning once more to our students sample, the average grade can be used as such a key. This is useful if our query uses this key selector ordering in its logic. Range partitioning works by assigning a range of keys to every file: any object whose key belongs in that range will be placed in that DSC file. With this method files can be created unevenly, but we can ensure that objects within a specific range will reside in the same file. Range separators are used to define ranges: these are values that mark the border points between one range and another. Let’s say we now would like to partition our students into files by grade. We will use two range separators to split the data into three files:
In this case our range separators are 3 and 6. One thing that is very easy to overlook is that if a student’s grade equals the value of a range separator, it can belong, range-wise, to either of the two files on both sides of the separator. Range separators can be assigned in two ways:
Statically assigned by user: In some cases we would like to explicitly force the range structure. This is useful when we know the structure of our data and queries and believe we can benefit from it. Let’s say we know our queries mostly filter students with grades of 6 and above; we can reflect this knowledge in our file structure even though it results in an uneven distribution. We can pass an array of range separators like this:
    // getting the list of students
    List<Student> students = GetStudentsList();

    // saving the students range partitioned to the file set
    context.FromEnumerable<Student>(students)
        .RangePartition(std => std.AvgGrade, new[] { 3d, 6d })
        .ToDsc<Student>("StudentsFileSet")
        .SubmitAndWait(context);
All we need to provide here is a key selector delegate, to select the value on which we partition and the rangeKeys parameter which holds the array of range separators of the same type as the return type of the key selector.
Dynamically sampled: Another, perhaps simpler, approach is to use a different overload that allows LINQ to HPC to generate partition separators for us. When we allow RangePartition to select the range separators for us, it will try to create DSC files of approximately equal size, but on the other hand we lose much of the control we had when creating the range separators ourselves. There are a few overloads of RangePartition; the simplest looks like this:
    // getting the list of students
    List<Student> students = GetStudentsList();

    // saving the students range partitioned to the file set with 5 partitions
    context.FromEnumerable<Student>(students)
        .RangePartition(std => std.AvgGrade, 5)
        .ToDsc<Student>("StudentsFileSet")
        .SubmitAndWait(context);
Other than losing control with dynamic range partitioning, there are a few key points you should bear in mind:
Currently dynamic sampling will take place for every 1,000 records - not really useful for small datasets.
Dynamic range partitioning uses range separators even if you did not set them yourself. If the key selector returns non-proportional ranges, the files will have to differ in size.
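Both flavors of range partitioning can be pictured with a short self-contained sketch. This is not LINQ to HPC code and not its sampling algorithm: RangePartitionSketch is a hypothetical name, placing a key that equals a separator in the lower range is an assumption (as noted above, it could go either way), and the separator selection is a simple equal-count split over a sorted sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangePartitionSketch
{
    // Returns the index of the range a key falls into, given ascending separators.
    // Assumption: a key equal to a separator is placed in the lower range.
    public static int PartitionOf(double key, double[] separators)
    {
        int index = 0;
        while (index < separators.Length && key > separators[index])
            index++;
        return index;
    }

    // Picks partitionCount - 1 separators from a sample so that each range
    // holds roughly the same number of sampled keys.
    public static double[] SampleSeparators(IEnumerable<double> sampledKeys, int partitionCount)
    {
        double[] sorted = sampledKeys.OrderBy(k => k).ToArray();
        if (partitionCount < 1 || sorted.Length < partitionCount)
            throw new ArgumentException("Need at least one sampled key per partition.");

        var separators = new double[partitionCount - 1];
        for (int i = 1; i < partitionCount; i++)
            separators[i - 1] = sorted[i * sorted.Length / partitionCount - 1];
        return separators;
    }
}
```

With the static separators 3 and 6 from the sample above, a grade of 3 lands in the first file and 3.5 in the second. When the sample contains many identical keys, neighboring separators can coincide, which is one way to picture why dynamically chosen ranges may still produce files of different sizes.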
Summary
Data partitioning allows us to implicitly distribute our data over the cluster, thus adding more control over how (and where) our queries will execute. Now that we have all our data just where we want it, we can start creating distributed kick-ass queries. But this calls for a completely different post.
Source code for all the samples can be found here.
One of the most exciting additions to Windows HPC Server 2008 R2 SP2 (currently in beta) is the support for DryadLINQ. DryadLINQ is an API that allows the creation and execution of large scale, data-parallel compute tasks. One of the core capabilities of Dryad (the underlying framework used by DryadLINQ) is the ability to distribute the data over the cluster and maintain data locality by executing the code on the node storing the data. In order to do both, Dryad utilizes a mechanism called the Distributed Storage Catalog (DSC), which I will cover in this post.
Overview
DryadLINQ provides a powerful mechanism for distributing LINQ queries over a cluster. In order to do so, Dryad needs to distribute both the code to be executed and the data on which it needs to operate. Before we can execute DryadLINQ queries over any type of data we must save it to DSC. This is what allows DryadLINQ to execute parts of the query (called vertices) in a distributed manner: each vertex will then emit a result that is either returned to the caller or saved to DSC for further querying.
The HPC Dsc Service
When you install a head node using SP2, a new Windows service - HPC Dsc – is added (Note: for the current beta version of SP2, DryadLINQ requires a new installation, no upgrades are supported). HPC Dsc exposes a service API (via net.tcp binding) with two primary functions:
Provide logical management for DSC entities such as DSC nodes, file sets, files, etc. (we will discuss those shortly) in a database called HPCDsc.
Physically manage the participating nodes’ files by accessing the file system via a file share.
Figure 1: Overview of the HPC Dsc service
In order to allow DSC on any node simply run the following command:
dsc node add [compute node name] /tempath:[local path for HpcTemp share] /datapath:[local path for HpcData share] /service:[head node name]
This will create the shared directories on the target node, and register them with the HPC Dsc service. Once this command is successfully executed, the corresponding compute node can be used for DSC (thereby enabling Dryad) operations.
Working Directly With DSC Files
DSC allows the storage of either text files or serialized objects. The most straightforward way to save either is to write directly to DSC files as the following code demonstrates (you can download all code samples for this post from here):
    // creating a fileset and two files manually
    if (context.DscService.FileSetExists("TextFileSet"))
        context.DscService.DeleteFileSet("TextFileSet");

    // copying the content of text files into the DSC files
    File.Copy(Path.Combine(Directory.GetCurrentDirectory(), @"TextFiles\TextFile1.txt"), fileA.WritePath);
    File.Copy(Path.Combine(Directory.GetCurrentDirectory(), @"TextFiles\TextFile2.txt"), fileB.WritePath);

    fileSet.Seal();

    var lines = context.FromDsc<LineRecord>("TextFileSet");
    Console.WriteLine("The number of lines is {0}", lines.Count());
In this snippet I used a couple of the DSC entities (or better said, their .NET APIs which reside in the Microsoft.Hpc.Dsc namespace) and I would like to review them before we continue:
DscFileSet: The DscFileSet provides a mechanism to access, manage and most importantly query a group of files distributed over the cluster.
DscFile: These represent actual files in the FileSet and provide a few useful properties including the WritePath I used in the code above to simply write to the files.
In the above snippet we use the DscFileSet to create two DscFiles. The DscFileSet creates each file in the DscData shared directory on whichever DSC node it sees fit to host the file. We can influence this decision by passing a preferred node name to the AddNewFile method. Once created, we can then manually write to the DscFile.WritePath. When we are done writing to the files we call the Seal method. The Seal method informs the HPC Dsc service that we are done writing to the files. This means that the HPC Dsc service can now update the database with the size of each file, and change the state of the DscFileSet to sealed.
Another API I used in this snippet is the HpcLinqContext (declared outside of the presented code, from the Microsoft.Hpc.Linq namespace), which I will cover in more detail in later parts of this tutorial. For now, however, I’ll just explain its role in the above code: once configured, the HpcLinqContext allows us to communicate with the DSC in a couple of ways. The first is by exposing the HPC Dsc service via the DscService property: we use the Dsc service to create the new file set. The second is by allowing us to execute a DryadLINQ query against the FileSet using the FromDsc<T> method.
Saving Serialized Objects To Files
As I mentioned above, DSC supports serialized objects. While this ability exposes the real strength of Dryad, it is unfortunately a mess in its current state (SP2 beta). Fortunately, the DryadLINQ samples that are a part of the SP2 installation contain a project named HpcLinqExtras, which smooths over many of the oddities of DSC in its current state. I will review these oddities and their solutions in HpcLinqExtras.
So let's begin by defining a class named Person and marking it as [Serializable]. There are some restrictions on objects that can be serialized to DSC, which can be found in the DryadLINQ and DSC Programmer’s Guide, but for our sample we will use the following class definition, and you’ll just have to take my word that it is compliant with the DSC restrictions:
    [Serializable]
    public class Person
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }
One would believe this should be enough for saving instances of Person to DSC, but with DSC nothing is as simple as it seems. The biggest DSC serialization crankiness is the need for a type implementing custom HPC serialization. Note that not all types in the saved graph need to implement custom HPC serialization, only the root type that will be used for manually saving the data. The HpcLinqExtras project contains the ObjectRecord class that can be used as a sort of generic root type for any .NET object. To implement custom HPC serialization, a type must be decorated with the CustomHpcSerializerAttribute and implement the IHpcSerializer<T> interface; this allows it to read and write the type passed as the T type parameter from and to DSC. The ObjectRecord class, for example, is defined as follows:
The next step is to implement a serialization mechanism for writing the actual data by writing the two methods declared in the IHpcSerializer interface: Read and Write. This can be done by simply using runtime serialization into and out of memory streams and passing the latter to the HpcBinaryWriter and HpcBinaryReader. Since the same behavior needs to be invoked manually (without using the HPC binary reader/writer), the ObjectRecord exposes overloads of the Read and Write methods that use a BinaryReader and a BinaryWriter.
Now we can create new instances of Person and write our own objects one by one, wrapped in ObjectRecords:
    // creating a fileset and two files manually
    if (context.DscService.FileSetExists("PersonFileSet"))
        context.DscService.DeleteFileSet("PersonFileSet");

    var andy = new Person { Id = 1, Name = "Andy" };
    var kelly = new Person { Id = 2, Name = "Kelly" };

    // writing each person to a different file
    using (BinaryWriter bw = new BinaryWriter(File.OpenWrite(fileA.WritePath)))
    {
        new ObjectRecord(andy).Write(bw);
        bw.Close();
    }

    using (BinaryWriter bw = new BinaryWriter(File.OpenWrite(fileB.WritePath)))
    {
        new ObjectRecord(kelly).Write(bw);
        bw.Close();
    }

    fileSet.Seal();
Once saved we can query the DscFileSet using the FromDsc<ObjectRecord> to get them back, and use a simple Select to return IQueryable<Person>:
    var champs = context.FromDsc<ObjectRecord>("PersonFileSet")
        .Select<ObjectRecord, Person>(or => or.Value as Person);
Saving Serialized Objects To Files – Take 2
The above code might seem reasonable when working with two instances and a small file set. But what if we have a large collection, and we want to distribute it over the cluster? Well, our good friend HpcLinqExtras provides a couple of nifty extension methods that make all this pain go away. For starters, we can use FromEnumerable<T> to write our data structure to a temporary DSC file set and return the good old IQueryable<T>, which we can then save using ToDsc<T>:
var andy = new Person { Id = 1, Name = "Andy" };
var kelly = new Person { Id = 2, Name = "Kelly" };
var source = new List<Person>();
source.Add(andy);
source.Add(kelly);
First we call the FromEnumerable<Person> method that works in the following manner:
It creates a temporary fileset with one file
Then, the FromEnumerable method loops over the IEnumerable it receives as a parameter and wraps each member in an ObjectRecord for writing to the file.
Finally the FromEnumerable<Person> calls FromDsc<ObjectRecord>, and uses a simple Select to return IQueryable<Person>.
Next, we call ToDsc<Person> to create a file set named “PersonFileSet” from the IQueryable<Person> that FromEnumerable<Person> returned. You might have noticed that ToDsc<T> has the magical ability to write runtime-serializable objects and not just custom HPC-serializable ones. I warned you: the API for working with serialized objects is a mess.
Finally, we call the SubmitAndWait method which obviously submits a job to the head node, and then, wait for it… it waits for it to complete.
You might get the feeling we are no better off than before. Granted, we wrote a lot less code than in the previous version, but not only does FromEnumerable<T> do exactly what we did in our first sample, we then query the temporary file set and write it again to the final file set. Astute readers might even notice that both times the file sets contain a single file: for the temporary file set a single file was created manually, and if you take a closer look at the sample you will notice that “PersonFileSet” was created implicitly by the ToDsc<Person> method. So we are not even distributing our data. There is a way to distribute data using this technique, called partitioning, which I’ll cover in the next part of this tutorial.
Summary
The fact that DSC and the Dryad stack are still in their early stages doesn't undermine the fact that we are looking at a technology that can change the face of parallel applications. And while it is likely that some (if not all) of the extensions available today in HpcLinqExtras will find their way into the final release of DryadLINQ, it is important to understand how it works and what implications come with it.
I was woken up tonight by the sounds of Ido and Gil getting ready for their live interview on the Sela College Channel. And let me tell you, it was a slippery slippery slope from there on. I was forced to wait behind the scenes since the main view behind the guys was my bed. The result was quite a funny session with me participating off-screen. If you saw the session, here is how it looked behind the scenes:
And with that piece of cinematic magic, I’m going back to bed.
Today @ MIX Ido and I got some very good news. The first part of our work for the HPC Azure Burst training kit was released today for download by Microsoft. Oh and we got a Kinect.
The released document provides an overview of the architectural considerations for using Azure compute nodes as part of your cluster.
There is a lot of work still ahead of us and we will kick out some very cool content on MSDN (and cooler stuff on our blogs), so stay tuned.
As server developers, we are used to a certain level of interactivity. Our services get a request and often return some kind of response. Lately I found myself justifying having HPC batch jobs. It’s sometimes hard to grasp that classic HPC programs, like the Human Genome Project, rendering a full feature 3D animation movie, or simply operating a “civilian purposes” nuclear reactor, take a long time. And when you execute such programs for days, weeks, months, etc., interactivity is not even a consideration.
Batch jobs are the very essence of classic HPC applications. Still, Microsoft is trying to expose Windows HPC Server 2008 to new verticals, some of which would like to use it for more interactive tasks. For such purposes Windows HPC Server 2008 supports an SOA model exposed via WCF. Microsoft also released a cluster debugger that has some cool features such as:
Cluster debugging
Local Debugging
Running service code locally in a simulated Windows Azure environment
Two project templates for creating both Interactive and Durable Session clients
There is also a decent MSDN walkthrough that shows how to create and debug an HPC SOA application. I recommend running through it, to get a feeling for how HPC SOA applications are built. In this post, I would like to take a deeper look at both session types and some of the mechanisms and techniques they utilize, both on the client and in the cluster.
Sessions
Sessions are an essential part of HPC SOA clients. There are two types of sessions: Interactive and Durable. Neither type provides the semantics of WCF sessions, where every call during the session uses the same instance of the service class on the server side. In fact, one of the roles of sessions, when used correctly, is to ensure that calls during the session are load balanced between different compute nodes in the cluster. Durable sessions provide another key functionality, which I will describe below. To understand this process we need to take a look at two more components: the job scheduler and the broker node.
The Job Scheduler
The job scheduler is the main component that runs on the head node. It handles units of work called jobs. The main concern of the job scheduler is to allocate the necessary resources for the job and start the job's sub-units, called tasks, on the allocated compute nodes. In an SOA application, the tasks that the job scheduler creates are called service tasks, and they host the services defined in the job.
Figure 1: A session starting service tasks using the job scheduler
In SOA applications the job scheduler has one more task: starting a broker node that will be used to load balance all service calls between the service tasks.
The Broker Node
The broker node provides a few key capabilities for SOA applications. The first is exposing an endpoint for every service call targeting the specific service job. Every call sent to the broker node through that endpoint is load balanced between the service tasks in that job.
Figure 2: A Windows HPC Service Message Lifecycle
And now that we understand the basic stuff we can look into two of the more powerful (and cool) scenarios HPC provides for SOA applications.
Durable Sessions
So far we have discussed mechanisms that work the same way in both Interactive and Durable sessions. The main difference between the two is (not surprisingly) durability. Durable sessions simply save the response messages in an MSMQ queue on the broker node, where they can be retrieved by clients: either the initiating client or any other client with permission to attach itself to the session. Another difference is that in order to use durable sessions, one must use the BrokerClient class to send and receive messages defined as MessageContracts, as shown in the following snippet:
using (DurableSession session = DurableSession.CreateSession(info))
{
    Binding binding = new BasicHttpBinding();

    using (BrokerClient<ISquareService> client = new BrokerClient<ISquareService>(session, binding))
    {
        // Set the response handler
        client.SetResponseHandler<SquareResponse>((response) =>
        {
            int reply = response.Result.SquareResult;
            Console.WriteLine("Received response for request {0}: {1}", response.GetUserData<int>(), reply);
        });

        // ...
    }
}
Grow/Shrink
Grow/Shrink is basically a scheduling policy, and while the job scheduler provides a few of these, Grow/Shrink is the most relevant for SOA applications. As its name suggests, Grow/Shrink allows administrators to add or remove resources for a job over time. This helps in dealing with peaks and can even be done using Windows Azure Worker Roles as additional compute nodes. The cool thing is that once you add the resources to the job, the job scheduler notifies the broker node about the new resources, and the broker node can then take them into account while load balancing.
Conclusion
SOA is still the ugly duckling in the world of HPC. In fact, the whole concept of using sessions to start a service on the cluster comes from the classic HPC paradigm of a single job that needs to be distributed on a cluster. However, you can start a service job directly on the server using the HPC 2008 Cluster Manager, or use some very simple patterns to start a session that will be shared among all clients, transforming Windows HPC Server 2008 R2 into a lean, mean SOA machine.
There are many capabilities just waiting to be used in commercial SOA applications, and the hardware does not need to be a monstrosity (but that can definitely make things cooler). The constant improvement in Windows Azure support makes this type of solution ideal for dealing with uncommon or unpredicted peaks, since there is no need to buy the hardware to support them.
So until next time, remember: real developers use big computers.
The SDP conference is over and it’s time to say a big thanks to both my partners: Bnaya for the Parallel programming tutorial day and Ran for our very special ETW session. I would also like to thank everybody who attended the session; I really enjoyed seeing you all. You can get the slide deck and demo code from here.
ETW is a powerful, high-performance tracing facility. In this post I will describe how to create your own ETW provider and publish events from your application. This post assumes you already understand ETW concepts, so I won’t cover in detail what ETW is and why to use it. If you’re an ETW newbie, then I invite you to check out Ran Wahle and myself at the SDP or to read this MSDN magazine article: Improve Debugging And Performance Tuning With ETW.
Before we start I’d like to go over this one point:
ETW provides two provider models:
Classic provider – This pre-Vista API has the following characteristics:
defines its events only in code.
must implement the logic for registering with and deregistering from trace sessions.
can be enabled by only one session.
Manifest-based provider – for providers that run on Vista or higher OS the following improvements are available:
allows resources and localization mapping capabilities (This is useful since the mapping is done during the interpretation of the trace to save the write overhead).
can be enabled by up to eight sessions.
Creating a Manifest-based provider is not a complicated task. It’s just a somewhat long and poorly documented process. In this guide I’ll demonstrate how to enable event tracing using the Manifest-based provider.
The Manifest
The first step when creating a Manifest-based ETW provider is to create a manifest file (naturally…). To do so, you can either write the XML yourself (minimal documentation of the manifest schema exists on MSDN) or use the Manifest Generator (ecmangen.exe) provided with the Microsoft Windows SDK, as I do:
Select “new provider” and you should see the following form:
Most of this form is pretty self-explanatory: the human-readable name of the provider, a symbol that allows access to the provider from other manifests (I used the provider’s name here as well) and a Guid that we will use to identify the provider from now on. The decoding files are something we should pay attention to:
These files are manifested assemblies that allow the mapping of localized messages and resources while interpreting the trace. Note however that using such files requires a full path, so no relative paths at this time.
The next step is to create an event in the manifest by selecting the “new event” option in the right hand sidebar. You should see the following form:
This is also pretty self-explanatory for the most part: we give the event a symbol, an ID and a version. As for the message, this is a localized message in en-US; to add more locales you will need to edit the manifest’s XML.
Next we will define a template. Templates describe the payload of the event which is an int in our case.
We create a template named MyEventData and a parameter named ReturnNumber to output a simple counter. Click “add” and save the template and now we can go back to our event and add the template.
Now we will create a channel where the provider can publish to:
We define a name, a symbol and a type (we can choose between admin, operational, analytic and debug), and go back to the event to select the channel. Now we can also select the Level, Task, Opcode and Keywords, which are all additional filtering options.
Save the event; MyEvent now looks like this:
Generating a header file using the Message Compiler
Our next step is to use the Message Compiler (MC.exe) to generate a header file, using the following command line:
mc.exe [our manifest.man] -h [output path]
In the output path you’ll find a .h file that we will use shortly.
Little bit of code
Let’s put the manifest aside for a moment and write some code:
The first step is to create an event provider and pass the same Guid we used in the manifest to its ctor:
var provider = new EventProvider(new Guid("{1B22749B-5EE3-49B5-9C1F-83AA56D393D6}"));
The next step is to create an event descriptor, which represents the event we have defined in the manifest. We need to initialize the event descriptor with the event ID, version, channel, opcode, task and keywords we have defined in the manifest. To do so, we can open the header file we’ve created with MC and look for the following section:
Note that the keyword is an unsigned long value and cannot be cast directly to a long, so we call the ctor inside an unchecked block.
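To make the unchecked point concrete, here is a minimal sketch of just the cast. The keyword mask below is a hypothetical value with the top bit set, not one taken from any real header (use whatever your generated header contains), and KeywordSketch is a made-up helper name.

```csharp
using System;

public static class KeywordSketch
{
    // Hypothetical keyword mask with the highest bit set; the real value
    // comes from the header file generated by the Message Compiler.
    public const ulong HeaderKeyword = 0x8000000000000000UL;

    // Reinterpret the unsigned 64-bit mask as a signed long, keeping the
    // bit pattern. Writing (long)HeaderKeyword on the constant directly
    // does not compile without unchecked, because the value is larger
    // than long.MaxValue.
    public static long ToDescriptorKeyword(ulong keyword)
    {
        return unchecked((long)keyword);
    }
}
```

The result of KeywordSketch.ToDescriptorKeyword(KeywordSketch.HeaderKeyword) is a negative number with the same bits as the original mask, which is what the event descriptor's signed keywords argument expects.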
Now we can write the events:
We will ask the user to enter the number of events to be written, then loop that many times and write the event with the iteration number as its payload.
Console.WriteLine("Please enter the number of events to be written and press enter");
var returns = int.Parse(Console.ReadLine());

for (int i = 0; i < returns; i++)
{
    // we pass the event descriptor by ref and the counter
    // as our payload. Please note that the payload is passed
    // as params object[] so we can enjoy some good ol' boxing
    provider.WriteEvent(ref descriptor, i);
}
Compilation
Compiling a provider takes a few steps of its own:
We will now use the message compiler once again and run it with the following arguments:
mc.exe [our manifest.man] -r [output path]
This will output the two bin files and an rc file.
We will now use the Resource Compiler (rc.exe) to generate a resource file out of the rc file we’ve just created:
rc.exe [our resource.rc]
We can now add the resource file to our visual studio project in the project properties application tab like this:
and build our project.
Deployment
The next step is to place our resource files (which is the entire application in our case) at the location we gave them in the manifest.
We now need to register our provider using wevtutil:
wevtutil im [our manifest.man]
And now our provider is installed, as we can see by running xperf:
xperf -providers i
What’s left now is starting a new session and writing the events.
As you can see, enabling ETW is not that hard at all and can come in handy when trying to understand the inner workings of your application.
Ever since Microsoft announced that Silverlight was going to be the framework for developing applications on the Windows Phone platform, I was very excited. I thought Microsoft had finally come to their senses and were going to take the mobile world by storm. Soon we would have an Android implementation, RIM would surely hop on the wagon, and one of the biggest issues in mobile development, the lack of a cross-platform client, would become history, at least for a big chunk of the smartphone market, and that’s a step in the right direction. This would also attract a lot of developers to Windows Phone development, and everybody wins. But since then there has been no movement in that direction.
Jason Zander with the Sela experts (Photograph by: Ido Flatow)
This afternoon I was present at Jason Zander’s meeting with the Sela experts, and I got to ask a question I had had in mind for a long time: is Microsoft going to invest in Silverlight on other mobile OSs? The answer was a polite no. He did say that HTML 5 is going to be the cross-platform environment for mobile applications as well. I ran it by Alex Golesh, who explained the financial side of it: while developing Silverlight support for other OSs would cost Microsoft many millions, HTML 5 support is developed individually by every platform.
I still think there is a huge market need for a runtime like Silverlight, a place that is being filled today by Java, and that’s a shame. In the near future we will see some Mono implementations on other mobile OSs, but probably, much like their older sibling, they will not be able to keep up with the original.
I hope Microsoft will come around, since they can change the developer story for the mobile world, and that would be great for us. Until then, we will always have HTML 5.