All Your Base Are Belong To Us

Mostly .NET internals and other kinds of gory details

July 2008 - Posts

Practical Concurrency Patterns: Lock-Free Operations

In the previous installments we have reviewed multiple strategies for caching or storing calculated key-value data so that accesses to it are optimized to the highest applicable degree.  For simpler storage types, such as a work item queue, there are even cheaper alternatives that what we've seen so far.

Consider the following scenario:image

  • A stock trade application receives trade orders at the alarming rate of thousands of operations per second from multiple sources
  • Orders are placed in a queue for further processing in order to free the front-end component(s)
  • Orders are removed from the queue by multiple processing components which proceed to perform very brief update operations on in-memory data

This scenario can be easily dominated by CPU work, as most of the operations are performed in memory.  The single bottleneck in the above design is the queue.  A read-write strategy will be useless because all operations on the queue are writes (insert and remove operations), so we need something else to minimize contention.

We observe that the only two operations actually requiring a lock are the inserting items to and removing items from the queue.  Creating, filtering or otherwise manipulating work items can be performed without synchronization.

This is the classic use case for a lock-free queue.  If the queue is maintained as a singly linked list, an insert operation can be implemented by performing one atomic write - updating the list tail.  Similarly, a remove operation can be implemented by performing one atomic write - updating the list head.  (Or so it might appear.)

Since reference writes under the CLR memory model are guaranteed to be atomic, it might appear that we don't need synchronization around the queue at all.  This assumption quickly breaks, however, if we consider the following naive implementation:

class NaiveLockFreeQueue<T>

{

    volatile QueueNode<T> head;

    volatile QueueNode<T> tail;

 

    public NaiveLockFreeQueue()

    {

        tail = head = new QueueNode<T>(default(T));

    }

    public void Insert(T data)

    {

        QueueNode<T> node = new QueueNode<T>(data);

        tail.Next = node;

        tail = node;

    }

    public T Remove()

    {

        QueueNode<T> oldHead = head.Next;

        head = oldHead.Next;

        return oldHead.Data;

    }

}

The funny thing around here is that you might test this code a long many times and it will work properly across multiple threads, especially if you're testing it on a single CPU.  The reason is that it is rather unlikely for a race condition to occur and invalidate the above implementation - it would require a context switch between two specific instructions in the Insert or Remove implementation.  But the results would be disastrous.  Consider the following sequence:

Thread 1 Thread 2
Sets tail.Next to X1  
  Sets tail.Next to X2
  Sets tail to X2
Sets tail to X1  

As a result, we have a split list!

image

Different and boring is the way Remove can break:

Thread 1 Thread 2
Sets oldHead to X  
  Sets oldHead to X
  Sets head to oldHead.Next (Y)
Sets head to oldHead.Next (Y)  
Returns X.Data  
  Returns X.Data

As a result, both threads observe the same removed item!

As you see, reasoning about lock-free data structures is already not trivial.  To make the insert and remove logic valid, we must introduce some sort of synchronization to address both problems:

  • While inserting an element to the queue, we must ensure that the tail doesn't lag behind
  • While removing an element from the queue, we must ensure that the head was not updated before we update it

Fortunately, it is possible through the use of Interlocked.CompareExchange and very cautious reasoning about what can possibly break.  An analysis of a full implementation is beyond the scope of this post, but there are multiple sources on the web, including Julian M. Bucknail's excellent series.  To examine a practical implementation, you might want to take a look under the hood of System.Threading.dll in the Parallel Extensions CTP - the ConcurrentQueue<T> class is your friend.

Replacing the stock lock implementation with a lock-free data structure removes the single biggest bottleneck in the system.  Interlocked operations are not free of cost (cache invalidation and memory fences are expensive), but they are orders of magnitude cheaper than a full kernel lock.  By the way, if you think a lock is expensive only because a kernel mechanism is involved, you're not seeing the full picture.  The dire consequence of contending for a frequently taken lock is lock convoys!

The following diagram depicts a happy situation, where threads are making forward progress taking a lock for a very short time but doing so frequently.  Note that there is no stalling at all because everything appears to be arranged and timed so that threads never contend for the lock.

image

On the other hand, a minor aberration in the system can lead to a thread being switched out while holding a lock (in the above diagram, the third and fourth threads are pretty damn close to it happening).  This results in the following:

image

Our threads are stalling more than half of the time now, once the convoy is in effect.  Due to the initial context switch while a lock was held, we now witness the effect of multiple threads stalling for the lock.  With lock-free structures, this would never happen.  QED.

Guaranteed, Finalization Order Is Not

image As part of our ongoing quest to categorize all finalization-related problems, I'd like to present another honorable mention in the series.  This time, it's finalization order (or the lack thereof).

A quick refresher on finalization: When an object with a finalizer is created, a reference to it is placed in the finalization queue.  When the object is no longer referenced by the application, the GC moves it to the f-reachable queue.  This wakes up the finalizer thread, which in turn removes the object from the queue and runs its finalizer.  At the next GC, the object's memory is reclaimed.

With that in mind, consider a graph of objects with finalizers, such as a StreamWriter and a FileStream.  The FileStream is a thin wrapper around a Win32 file handle and the related APIs.  The StreamWriter is a buffering convenience wrapper around a stream.  Both require finalization to work properly - the StreamWriter needs a finalizer to flush its internal buffer and close the underlying stream, and the FileStream needs a finalizer to close the Win32 file handle.  Both should also implement IDisposable or provide other means for deterministic finalization (e.g. a Close method).

So what happens if the user doesn't use deterministic finalization and relies on a finalizer instead?

FileStream fs = new FileStream("1.txt", FileAccess.ReadWrite);

StreamWriter sw = new StreamWriter(fs);

sw.WriteLine("Hello World");

//No sw.Close() here

Scenario #1: If the StreamWriter's finalizer is called first, it will flush the buffer and close the FileStream.  Since the stream was deterministically closed, it will not be finalized.

Scenario #2: If the FileStream's finalizer is called first, it will close the Win32 file handle.  When the StreamWriter's finalizer is called, it will attempt to flush its internal buffer into a closed stream!

Since finalization order is undefined (and there's no way for it to be defined, because it depends on the GC's traversal order which is also undefined), we can never tell if we're going to land in scenario #1 (yay) or scenario #2 (ouch).  Therefore, StreamWriter does not have a finalizer.

Bonus question: What happens if you forget to deterministically close a StreamWriter?  (You can try it at home.  Nothing horrible will happen, but you will lose the buffered data that wasn't yet flushed to the underlying stream.)

This also brings up the subject of resurrection, a little-known "feature" that boils down to making an object accessible from within a finalizer.  For example, if we were to implement object pooling, we would want to ensure that even if the user didn't explicitly return the instance to the pool, it will still be returned to the pool by the finalizer:

class PooledObject

{

    //Implementation omitted for clarity

    //...

 

    ~PooledObject()

    {

        Pool.ReturnToPool(this);

    }

}

The first problem with this approach is that the finalizer for this specific instance won't ever be called again.  We get one shot at the finalization queue - no one is going to add our instance back automatically, so we have to take care of it ourselves:

GC.ReRegisterForFinalize(this);

Another problem we are certainly going to have revolves around other finalizable objects we might be referencing.  Since finalization order is undefined, it is entirely possible that the finalizers for our contained (referenced) objects have already run, rendering their state inconsistent.  Any attempt to use them will yield undefined results - few objects are willing to work properly after their finalizer has already run, or are even aware of the possibility this might happen!

Effectively, the only "safe" way of resurrecting such referenced objects that are outside our control is discarding and reinitializing them from scratch.

Finalizer vs. Application: A Race Condition from Hell

One of my favorite managed debugging demos is analyzing a memory leak caused by a blocking finalizer.  This tension between the finalizer thread and the application threads making allocations should be kept in mind by any programmer using this powerful but dangerous feature.  However, there is another subtle category of bugs that can surface "thanks" to finalization: Race conditions between the finalizer and your application threads.

For example, consider the following scenario in which the File class acts as a wrapper over an implementation-provided Handle type:

class File : IDisposable

{

    Handle h;

 

    public File(string fn)

    {

        h = new Handle(fn);

    }

 

    public void Read()

    {

        Util.InternalRead(h);

    }

 

    ~File()

    {

        h.Close();

    }

 

    public void Dispose()

    {

        h.Close();

        GC.SuppressFinalize(this);

    }

}

 

class Program

{

    static void Main(string[] args)

    {

        File f = new File("1.txt");

        f.Read();

    }

}

The class' code seems perfectly solid even to the seasoned reviewer (although utterly useless :-)).  It provides deterministic finalization through IDisposable, it provides a backup through a non-deterministic finalizer, it is very cautious to close the handle only once...  The client code is a bit flaky as it doesn't call Dispose (nor uses the using statement), but the finalizer should take care of things.  So what am I hinting at?

image

Due to the GC's eager nature, it's entirely possible for the file handle to be closed in the middle of the InternalRead operation.  This is a very disturbing consequence that can be the source of extremely difficult debugging scenarios.

Once you have overcome the shock, let's follow the reasoning:  The object itself is considered in use as long as it is referenced.  Once it is no longer referenced, it becomes eligible for garbage collection and subsequently for finalization as well.  In our case, the initial references are the local variable f in the main method, and the this implicit parameter (which is normally kept in the ECX register) in the Read method.  The f local variable is not used after the f.Read() call, so it does not prevent the object from being collected.  The surprise stems from the fact that the this parameter is not used after the file handle is passed to the InternalRead method, and therefore doesn't prevent the object from being collected either!  Closing the handle in the middle of the InternalRead call is a bug from hell, and it's non-deterministic by definition because we have no idea when exactly the finalizer will be called.

Oh, and have I mentioned it?  The bug will surface only in the Release build, because in the Debug build the GC treats local variables as active roots until the end of their enclosing scope to facilitate debugging.  Did you need more motivation for running your tests against Release build as well?...

Exposing Custom Performance Counters in .NET

One of the first things I do as part of the .NET Performance course is demonstrating some of the ways we have for measuring the performance of individual applications and of the system as a whole.

As part of that quest, we encounter tools that require no ad-hoc participation on our part: Fire them up, give them an executable or a process to do their magic, and analyze the results.  Some other tools need more love and care on our part to produce anything remotely useful.

A profiler can tell you that you're spending 90% of the time in the SpamHouse.GenerateLotsOfEmails method, even if it condemns the idea and even if you haven't actively aided the profiler when you implemented the method.  On the other hand, performance counters are among those tools that can't tell you anything without our participation, because there are no performance counters for your business logic by default.

Sprinkling performance counters into the application code is not an easy task.  You basically attempt to predict what problems you are going to have and where you're going to have them and put performance counters around these sections of code to facilitate instrumentation.  Doing this by hand is tedious, although .NET makes it relatively easy; a more appropriate solution is using some sort of AOP framework to make it happen.

For example, the following attribute-based design could specify that all employees should have a performance counter attached monitoring the number of hours they worked this week:

[PerformanceCounterCategory]

public class Employee : IIdentity

{

    private int _hours;

 

    public string Id { get; set; }

    public string Name { get; set; }

 

    [PerformanceCounter]

    public int HoursWorked

    {

        get { return ++_hours; }

    }

 

    public Employee(string id, string name)

    {

        Id = id;

        Name = name;

    }

}

The IIdentity implementation (which is an interface I made up, not the .NET security interface) is there to enforce an instance name for each performance counter instance that is being created.

To actually monitor these objects, we use a background timer that updates all counters once a second.  We use weak references to the actual instances to ensure that we don't have a memory leak because of performance counters (i.e., if an object is no longer referenced by the application, the performance counters infrastructure we just built is not going to keep it alive).  For example, assume we have the following code:

Employee e1 = Activator.CreateInstance<Employee>("12", "Joe");

Employee e2 = Activator.CreateInstance<Employee>("15", "Kate");

 

GC.Collect();

Thread.Sleep(TimeSpan.FromSeconds(5));

GC.KeepAlive(e1);

 

GC.Collect();

Thread.Sleep(TimeSpan.FromSeconds(5));

GC.KeepAlive(e2);

 

GC.Collect();

Console.ReadLine();

imageIf we look at the results in Performance Monitor, we will see that the first employee object is getting collected and stops working 20 seconds before the other.  The reason is that the garbage collector collects the first employee object after the control in the Main method passes the GC.KeepAlive line (yes, the GC is that eager to collect unused objects!).

The point at which we can intercept the fact an object needs to be monitored is the point of instantiation, so we provide our own activator (this is very similar to the technique we've used in the past when implementing the Design-By-Contract sample):

public static class Activator

{

    public static T CreateInstance<T>(params object[] args)

        where T : class, IIdentity

    {

        EnsureTimerActive();

        EnsureTypeRegistered<T>();

 

        T instance = (T)System.Activator.CreateInstance(typeof(T), args);

        RegisterCountersFor<T>(instance);

 

        return instance;

    }

}

This activator relies on System.Activator to perform the actual instantiation, but it ensures that the timer is active, that the type is registered (i.e. the performance counter category exists), and finally registers the object's properties to be monitored by the background thread.

The sample code used to implement this facility can be downloaded from my SkyDrive.  Please note that it's sample code - it's inefficient, lacks thread-safety and generally suffers from a quick-and-dirty approach.  However, it shows that the infrastructure code for introducing performance counters into your application can be written once, and its fruits can be enjoyed basically in every module you write that might one day require instrumentation.

AppDomains and Remoting Life-Time Service

Sometimes we forget that .NET AppDomains are really separate light-weight processes inside a single .NET process.  We enjoy all the benefits of a light-weight process: fault isolation, assembly "unloading", custom security policies, and many other things.

But we often forget that AppDomains are separate from their host and from each other, and that under the covers they use good old .NET Remoting to communicate.  With that in mind, perhaps you can answer the following question:

I'm creating an object in a separate AppDomain and storing a reference to it in a dictionary.  At some point, when I'm trying to access the object I'm getting an exception as if the object has been collected - even though I have a reference to it.  What's going on?

Well, if you remember, when you create a remote object (including an object in a separate AppDomain), it can either be a marshal-by-value or a marshal-by-ref object.  If it's serializable, it's automatically treated as marshal-by-value, in which case you have a copy of the object in the "host" domain and there are no lifetime issues to consider.  But if the object derives from MarshalByRefObject, then it's deemed marshal-by-ref, meaning that you actually have a remote proxy to the original object, which lives in a separate AppDomain.

As such, the remote object's lifetime is no longer tied to the standard .NET GC rules.  It is tied to the remoting lease and sponsor mechanism.  This means that unless we explicitly opt-in and manage the object's lifetime, it's possible for it to appear collected and disappear even though we still have a reference to its proxy.  If this happens, the exception will resemble the following:

System.Runtime.Remoting.RemotingException: Object '/76e7cd41_2cd2_4e89_9c03_fae752ec4d59

/zb_uualy_cm6kwizjlentfdl_3.rem' has been disconnected or does not exist at the server.

(Note that the object URI is automatically generated when you use methods such as AppDomain.CreateInstanceAndUnwrap to create an object within the other domain.)

If we desire to achieve singleton semantics for the remote object, it's simplest to ensure that it never dies.  This can be done by overriding the InitializeLifetimeService method on your MarshalByRefObject-derived class and returning null instead of a valid ILease implementation.

Sela Technology Center Course: .NET Performance, Internals and Debugging

Yesterday was the second session in our series of courses for our instructors and consultants.  It was my turn to be the lecturer, and I talked about performance measurement on Windows in general and using .NET in particular.

In a pretty standard but significantly faster-paced session taken from the .NET Performance course, I've reviewed the CLR Profiler, the Visual Studio 2008 Profiler, performance counters as well as exposing your own performance counters, querying WMI information (I've also mentioned how to expose WMI information), and demonstrated my own tool that I developed for the course - called "Performance Harness" - which can be used for quick-and-dirty measurements of a little piece of code at a time (such as comparing the performance of for to foreach).  This is the first time I've mentioned the tool in public, and I'm considering to go open source and let the community turn it into a complete product.

image

In the subsequent sessions we are going to talk about the .NET garbage collector from a performance perspective, and get our hands on some production debugging techniques.

This time some "outsiders" were also allowed to attend, and many people have watched the session from the comfort of their homes thanks to our live broadcasting team.  The session was also recorded, and the recording is of excellent quality.

 image image image image

If this has whetted your appetite and you're very very interested in hearing subsequent sessions, feel free to contact me and I'll see if I can arrange some extra credential sets for accessing the live broadcasts.  If not, you can always take the full .NET Performance course at Sela :-)

Deadlock In Traffic

No, I'm not talking about deadlocks resulting from high traffic on an IIS server.  I'm talking about a real-world physical deadlock involving cars that I've been part of today.  I didn't even need Wait Chain Traversal - it was the perfect visualization, quotable for books and articles.

image

It was one of the most amusing things I've ever seen.  Consider the two-way (right-hand driving) street depicted in the diagram.  Car A has just emerged from a parking lot on the right hand side, and is waiting for the left lane of cars to clear.  Car B has simultaneously emerged from a parking lot on the left hand side, and is waiting for the right lane of cars to clear.

Analyzing the situation with standard wait-chain algorithms, we obtain the following sequence: The cars in the left lane can't make any progress until car B moves; car B can't move until the cars in the right lane move; the cars in the right lane can't move until car A moves; car A can't move until the cars in the left lane move - here's our deadlock.

I was in one of the cars in the right lane, and immediately realized this was a deadlock.  It took several minutes to resolve - the typical Israeli driver won't back out of a lane once he has managed to squeeze into it, and it takes a critical mass of yelling and horn-blowing until progress is made.  Note that behind cars A and B there was an additional row of cars trying to get out of the parking lots, so it wasn't really trivial to resolve the deadlock without multiple cars backing out of the way.

Considering that this scenario took place at 6:15PM on Ha-Barzel street in Tel-Aviv, an industrial area with a dense concentration of hi-tech companies, I wonder how many of the other drivers have realized this is a deadlock, and how many of them are recording this thought on their very own blog as we speak.  :-)

Is It a Managed or a Native Memory Leak?

How do you determine whether the memory issue you seem to be experiencing is a .NET memory leak or a native memory leak?  Why is it even important?

It's the first question I as a consultant would ask you if we were to talk over the phone regarding a high-memory problem you've having.  It's extremely important because in most of the cases it narrows the scope of the problem and facilitates bringing the appropriate people, tools and resources for focusing on the task at hand.

The process for diagnosing a .NET memory leak revolves around tools such as the CLR Profiler, WinDbg, SOS.DLL and the useful set of commands such as !VMStat, !DumpHeap, !TraverseHeap, !GCRoot and others.  The process for diagnosing a native memory leak, on the other hand, is usually significantly more complicated and vague, consisting of analyzing the memory for patterns, searching strings in memory, attempting to correlate memory usage to modules, and so on.

So how do you tell?  The first step that you're definitely capable of taking in the right direction is opening Performance Monitor and looking at the values of two important performance counters for your application's process:

  • .NET CLR Memory / # Bytes in All Heaps
  • Process / Private Bytes

The private bytes counter for your process is the amount of memory committed to the process by the Windows memory manager.  It doesn't have to reside in physical memory - it might be paged out or in transition - but it does reflect the memory usage of your application.  Note the "private" in the counter's name - only memory that isn't shared with other processes on the system is considered for this counter's value.  Shared memory sections or loaded assemblies are not considered.

The .NET bytes in all heaps counter is the amount of managed memory allocated in your process (for all AppDomains).  This is garbage-collected memory, and a leak here indicates that we can use the standard arsenal for managed high-memory scenarios.

There can really be three distinct scenarios from this point on:

  1. The private bytes counter is climbing but the .NET bytes in all heaps counter remains constant.  This indicates a native memory leak.
  2. The private bytes counter and the .NET bytes in all heaps counter are climbing at the same rate (the difference between the two remains constant).  This indicates a managed memory leak.
  3. The private bytes counter and the .ET bytes in all heaps counter are climbing at different rates (the difference between the two is not constant).  This indicates a combined managed and native memory leak, e.g. leaking a managed component that holds a reference to a COM object (via an RCW) or creating a CCW/RCW cycle.

Let's take an example from one of my recent debugging consultations.  At the client's site I've set up a performance counters log to monitor the above two counters and determine what kind of leak we're talking about.  Here's what the graph looked like:

image

The black highlighted line is the managed memory counter, the blue line is the private bytes counter.

From this graph, it's evident that the applications is leaking managed memory only.  The difference between the two counters stays constant, meaning we're in leak category #2 above.  This enables a smooth transition to dump analysis for managed leaks.

Interview Questions Considered Harmful: A Sequel

I've come up with two additional categories of interview questions after having incorporated some feedback on my earlier post from colleagues and friends.  Hopefully this list won't grow any longer than it is...

Impossible Technical Questions

With a pen and paper, provide an implementation that would automatically normalize a database representation, generate the appropriate constraints and index tables based on SQL profiling information from runtime.

Yes, it's a great idea and one the interviewer might be considering for a future startup, but it's not practical to expect anything practical in an interview.  You might discuss the design of this kind of feature, especially if you're interviewing a talented database professional, but if you're expecting any code...

Alternatively, the question might sound reasonable at first, but then the interviewer proceeds to add complexity such that it makes no sense to implement at all, not to mention in an interview:

Implement a function that backs up modified files to a network share.  Oh, and make it configurable.  And add a notification feature when something goes wrong.  Ah, and address the condition when there's low disk space on the network share and you're required to balance the storage across multiple shares.  Add an algorithm for redundantly storing and later reconstructing the store from these multiple shares.

(And so on.)

The Illegal Kind

Do you have any kids?  Does your husband have any health issues?  Have you done military service?  Are you planning to get pregnant?

This is somewhat less of an issue in Israel (from my experience), but appears to have more significance elsewhere.  One of my friends told me a story a couple of years ago about her not getting hired at a kids' game construction company because she appeared religious and the interviewer didn't like her telling him that she's planning to get married soon.

Eventually, most of these irrelevant personal questions are downright illegal.  With them being irrelevant, you should constrain your curiosity to a situation less formal than a job interview.

Sharing WCF Contracts as Visual Studio File Links

There are multiple ways of sharing WCF contract types across service boundaries.  The recommended way, which I have no intent of undermining, is sharing contracts via service metadata.  This means that the contract CLR types can be private (internal) to the service, and regenerated on the consumer's side from the service metadata.

image

This provides full decoupling between the service and its consumers, guarantees that there is no dependency on CLR-specific information that can't be conveyed in an interoperable fashion, and ensures that only contractual information is shared.  For example, a method declared on a data contract type is not shared; a custom non-WCF attribute on a service operation contract is not shared, etc.

The primary disadvantage of this approach in a practical development scenario is that it makes deploying and testing services quite complicated, especially if the services are undergoing constant change.  Additionally, it often introduces difficult-to-resolve scenarios, such as those when two services have a dependency on each other or when a service has a dependency on an infrastructure (non-service) component which in turn consumes the service when used for the implementation of other services.

The alternative to sharing contracts via metadata is sharing contracts via their assembly.  This means that the contract CLR types are contained in a shared assembly that is referenced by the service and its consumers.  No generation of contracts from metadata is necessary, but only .NET consumers (aware of WCF) can consume such services.  Additionally, strong coupling is introduced as methods and additional non-contractual implementation leaks across service boundaries.

image

On the other hand, this approach is very easy to use while developing services.  You change the contract in a single place and the change is automatically observed by the service implementation and by its consumers.  If you're part of a team writing multiple services which have to communicate with one another, this is the most time-conserving solution available.

In spite of the above, there is a third solution that I'm using with relative success in my current project (co-invented by Alon Fliess and myself).  This is a pragmatic technique that facilitates sharing the contract code but still allowing for a redefinition or addition of data and behavior on each side of the service boundary.

image

imageWhat we have come up with is a fairly simple approach.  Each service contract and each individual data contract (including fault contracts) is defined in a separate code file in the service's project (assembly).  All data contracts are defined as partial classes.  All contracts are defined as internal.

Service consumers add a link to the code files they find relevant, and may choose to replace a link to a code file with a customized replica of it, or extend one of the data contracts by extending the partial class.

We called this technique WCF contract links.  This trick relies on Visual Studio's ability to have a link to a code file, and features the flexibility and convenience of both approaches:

  • A contract can always be customized on either side of the service boundary.
  • No special work is required to generate contracts on the consumer side - add a link to a code file and you're good to go.
  • You can have a contract link to an assembly and reference that assembly because the contracts are internal.

There are some obvious caveats and limitations with this approach, but with enough discipline they can be overcome:

  • It is now possible to leak implementation details via the shared code file.  This is something to look for in code reviews or using static code analysis rules.
  • If you need to use contracts in high-level application code, their being internal can compromise some constructs.  This can be addressed by making contracts public on a case-by-case basis (this creates the risk of referencing an assembly that you have a contract link to, which creates a compilation ambiguity).
  • This approach is only applicable for WCF consumers, preferably within the same team for access to the linked code file.  This is not the way to go for sharing services across team or enterprise boundaries.
Shutdown Performance: Just Exit!

The best shutdown sequence for an application is the shutdown sequence that doesn't require any work.

If all resources held by your application are process-wide resources that do not affect the rest of the system, then your shutdown sequence doesn't require any work.  All you need to do is let the system close you down.

Many resources fit in this category - memory, for one, is something that you shouldn't be obsessively freeing when asked to shutdown.  Windows will reclaim the memory, don't worry about it.  Handles to kernel objects are another example, although some of them don't like being abandoned (e.g. mutex handles would rather be closed or released than abandoned in mid-acquire).  Database connections, on the other hand, represent a resource that might take some time until it's reclaimed by a timeout.

The primary reason for minimizing the shutdown sequence as much as possible is fairly simple.  If your application hasn't been used for quite a while, then your shutdown sequence might result in lots and lots of paging.  There are myriads of applications that take a long time to shut down, and the disk LED starts flashing repeatedly for several minutes.  During the process, you're paging in memory that is completely useless (strictly speaking, it's needed for your shutdown sequence, but...) and you're causing other memory that is useful to be paged out, effectively incurring yet another series of paging activity when your application has finally been closed.

But why would the user close your application after it hasn't been used?  I hope you're not asking this question, but just in case you are, here are some common reasons:

  • The system has just come out of standby or hibernation, and the user wants to shut down or restart the system.
  • The user has just finished watching three consecutive episodes of his favorite TV show, and while going to bed decided to shut down the computer.
  • The user has just finished reading his 300-message email backlog, with your application in the background, and now wants to shut down the system and go to bed.
  • The user has been prompted with some Windows Updates installation, and during the process was incapable of using the system.  Now that updates have completed, the user wants to restart the system.

Wrapping up this short piece of advice:  Don't start freeing memory and obsessively validating resources when the user has requested a shutdown.  If your application hasn't been used for a while, you're now paging the entire world back in from the disk.  Do the bare minimum and get out!

Practical Concurrency Patterns: Safe/Unsafe Cache

After having examined a classical reader-writer-lock-based cache and a thread-local cache, we have come to terms with the deficiencies of both alternatives.  A classical RWL-cache requires a lock (albeit a cheap one) for every operation, including a read, even though contention doesn't occur until there's a write.  A TLS-cache is fine where the worker threads are countable and controllable, but becomes inapplicable when hundreds of threads from the thread pool are spawned and destroyed.

An alternative to these approaches is not using locks at all, or at least avoid from locking the entire collection.  For example, consider the following scenario:

  • A bank maintains a dictionary mapping a currency to its current rate
  • The data is updated very often as the various currencies are traded, and the updates are received from many different sources
  • The data is read very often as different clients request the values of various currencies

image

Using an RWL-cache or a TLS-cache in this scenario is unrealistic.  The TLS-cache is not feasible because the number of clients is unknown, so we cannot dedicate a worker thread to each client and use the thread pool instead.  The RWL-cache is not feasible because there's a gigantic amount of reads and writes and the contention would dash any hopes of scaling this kind of application.  (And it certainly sounds like it would need scaling...)

However, there is one noticeable thing in this scenario.  The values in the dictionary are changing very frequently; but the dictionary itself is kept unchanged.  While it is theoretically possible that through the course of a day new currencies will enter the bank's trading system, it is a fairly unlikely event that we shouldn't be optimizing for.

This effectively means that in this example, the dictionary can be kept read-only, with the values changing - requiring a lock only on something associated with the value (e.g. the currency string representation), and not the entire dictionary.  Additionally, since changing the value in the dictionary is often an operation that we can perform atomically (e.g. a reference swap or an integer assignment), we can eliminate locking in the general case.  (We'll discuss lock-free operations in a subsequent installment.)

However, we still have to handle the specific scenario where a new currency is added during trade hours.  It has to be addressed because if the general dictionary is used without locking, we can't simply perform an insert.  We need another dictionary that will be used for these exceptional currencies that have been added during trade hours - the rest of the currencies will still be stored in the read-only dictionary.

I've first encountered this pattern in code written by Alon Fliess, so I will use his terminology - he called it a Safe/Unsafe Cache, even though it's effectively an adapter over any collection type.  A sample implementation goes along the following lines:

public interface ICache<K, V>

{

    V Get(K key);

    V GetOrCreate(K key, Func<K, V> creator);

}

 

public class SafeUnsafeAdapter<T, K, V> : ICache<K, V>

    where T : ICache<K, V>, new()

{

    Dictionary<K, V> _safeCache = new Dictionary<K, V>();

    T _unsafeCache = new T();

    volatile bool _isUnsafe;

 

    public void SwitchToUnsafe()

    {

        _isUnsafe = true;

        //Thread-safe from here on,

        //single-threaded until this point

    }

 

    public V Get(K key)

    {

        if (!_isUnsafe)

        {

            return _safeCache[key];

        }

        else

        {

            V val;

            if (!_safeCache.TryGetValue(key, out val))

            {

                return _unsafeCache.Get(key);

            }

            return val;

        }

    }

 

    public V GetOrCreate(K key, Func<K, V> creator)

    {

        V val;

        if (!_safeCache.TryGetValue(key, out val))

        {

            if (!_isUnsafe)

            {

                val = creator(key);

                _safeCache.Add(key, val);

                return val;

            }

            else

            {

                return _unsafeCache.GetOrCreate(key, creator);

            }

        }

        return val;

    }

}

This can be best summarized as follows:

  • Single-threaded initialization work proceeds against the "safe" cache, which cannot be accessed from multiple threads.  This work doesn't require synchronization.
  • At some point, the "safe" cache is locked for modification - further inserts will not modify it.  From this point on, the cache can be accessed from multiple threads but the safe cache doesn't require locking - because it's only accessed for reading.
  • Inserts to the "unsafe" cache go to a separate dictionary.  Whenever a lookup in the "safe" cache fails, it is attempted from the "unsafe" cache - by taking a lock.

image

Going back to our example, most currencies are statically known when the system is initialized.  This means that during initialization, we insert the currency data into the safe cache and switch to unsafe, thereby preventing its further modification and making all work lock-free.  In the exceptional case where a currency is added during the day, we resort to the unsafe cache to perform the insert and the subsequent lookups - but most of the time, our work is lock-free.

Practical Concurrency Patterns: Thread-Local Cache

In the previous post we have looked at the fairly simple implementation of a read-write cache, where elements are created at the moment they are needed.  We've also noticed that for the specific cases where the data is accessed very frequently and calculating the data doesn't cost too much in terms of CPU time and memory, there's a large overhead to this approach because access to the cache must be synchronized.

We can introduce a simple optimization by noticing that locks are unnecessary if each worker thread gets its own copy of the cache.  This proves very efficient if the cache size isn't very large, and the cache hit ratio is very high (e.g. after having created 100 elements, there are 10,000 accesses to these elements).

imageimage 

There are multiple ways of accomplishing this with the Parallel Extensions, one of them is an overload of Parallel.For which gives us the ability to specify a thread-local state:

Parallel.For(1, 11,

    () => new ReadWriteCache<int, int>(),   //Thread-local

    (i, loc) =>

    {

        for (int k = 0; k < 100; ++k)

        {

            Thread.Sleep(k%i);

            Console.WriteLine(

                loc.ThreadLocalState.GetOrCreate(k%i, x =>

            {

                Console.WriteLine("CALCULATING {0}", x);

                return x * x;

            }));

        }

    });

This is a fairly limiting approach, however, because we have to use Parallel.For and it doesn't feel natural to launch our worker threads in this way.

Giving each thread its own copy of the cache in .NET is as easy as using the [ThreadStatic] attribute.  We can devise a static class that will contain the thread-local data for us, and then use it instead of the shared ReadWriteCache instance:

public static class ThreadContext

{

    public static TLSItems Current

    {

        get

        {

            //No lock needed here

            if (_current == null)

                _current = new TLSItems();

            return _current;

        }

    }

 

    [ThreadStatic] private static TLSItems _current;

}

 

public class TLSItems

{

    public ReadWriteCache<int, int> Squares =

        new ReadWriteCache<int, int>();

}

We can add more items to the TLSItems class as necessary.  This is easily usable without any initialization from the worker threads:

Parallel.For(1, 11, (i) =>

{

    for (int k = 0; k < 100; ++k)

    {

        Thread.Sleep(k % i);

        Console.WriteLine(

            ThreadContext.Current.Squares.GetOrCreate(

                k % i, x =>

                {

                    Console.WriteLine("CALCULATING {0}",x);

                    return x * x;

                }));

    }

});

Each thread is going to get its own cache, and perform all the work against that cache.  This is an excellent solution if we have a limited number of controllable worker threads, but what if we're using a thread pool?

This technique is dangerous and inapplicable when working with thread-pool threads.  We have no control over the number of these threads (which means we might have 500 copies of the cache polluting our process' memory), and it's poor etiquette to change the TLS of a thread-pool thread and then return it to the pool because of the possible side-effects (there might be other pieces of code in the process using these threads for their own purposes!).

Therefore, the thread local cache optimization trick doesn't apply to pool threads.  In the next installment we will look at a different optimization technique that will hold for pool threads as well as regular worker threads we create.

Practical Concurrency Patterns: Read-Write Cache

Assume that you have a set of worker threads producing and consuming information.  In the process, there's some generated data that can be cached instead of being calculated every time it is accessed.

A typical solution for this problem is a thread-safe read-write cache (e.g. a .NET Dictionary) that will contain the calculated data.  Due to its changing nature, a lock must be taken when reading from the cache and when writing to the cache.  This is typically a reader-writer lock, unless the data is changing very frequently.

The following is a straightforward sketch of an implementation:

public class ReadWriteCache<TKey, TValue>

{

    private Dictionary<TKey, TValue> _dict = new Dictionary<TKey, TValue>();

    private ReaderWriterLockSlim _rwl = new ReaderWriterLockSlim();

 

    public TValue Get(TKey key)

    {

        //Throw if not found:

        _rwl.EnterReadLock();

        try

        {

            return _dict[key];

        }

        finally

        {

            _rwl.ExitReadLock();

        }

    }

    public TValue GetOrCreate(TKey key, Func<TKey, TValue> creator)

    {

        //Start with a read lock to optimize

        _rwl.EnterReadLock();

        try

        {

            TValue oldVal;

            if (_dict.TryGetValue(key, out oldVal))

                return oldVal;

        }

        finally

        {

            _rwl.ExitReadLock();

        }

 

        //Multiple threads can calculate the value concurrently

        TValue val = creator(key);

        _rwl.EnterUpgradeableReadLock();

        try

        {

            TValue newVal;

            if (_dict.TryGetValue(key, out newVal))

                return newVal;  //Someone else has added

            _rwl.EnterWriteLock();

            try

            {

                _dict.Add(key, val);

                return val;

            }

            finally

            {

                _rwl.ExitWriteLock();

            }

        }

        finally

        {

            _rwl.ExitUpgradeableReadLock();

        }

    }

}

This solution uses a ReaderWriterLockSlim, introduced in .NET 3.5, and attempts to minimize the write locks as much as possible.  Nonetheless, accessing the cache is always associated with a lock, even if the lock is not taken and only readers enter it.

Using this cache is very easy.  The following is a simple Parallel Extensions loop that uses 10 threads to access the cache and retrieve integer squares from it:

ReadWriteCache<int, int> squares = new ReadWriteCache<int, int>();

Parallel.For(0, 10, (i) =>

{

    for (int k = 0; k < 100; ++k)

    {

        Thread.Sleep(Math.Abs(k - i));

        Console.WriteLine(squares.GetOrCreate(k, x =>

        {

            Console.WriteLine("CALCULATING square {0}", x);

            return x * x;

        }));

    }

});

If the data stored under the lock is accessed very frequently, and the cost of calculating it is not very high, this approach might introduce too much contention into the system, diminishing the effects of caching in the first place.  In the next two posts, we will examine two different techniques for removing this overhead in specific scenarios.

Sela Technology Center: Course for Our Professionals

imageAs part of our continuous effort to maintain the highest level of professional expertise among our consultants, Sela Technology Center's elite team of 25 professionals has been invited to a private training course consisting of 24 evenings taught by our best consultants and instructors.

This training course is unique in the level of participation exhibited by the attendees, in the technological breadth and depth of the covered topics, and in the attempt to cover all recently released Microsoft technologies to achieve a uniform level of expertise across our entire consulting team.  It is not a set of TechEd-style one-hour lectures; it is a full spectrum of courses aimed to achieve the highest level of involvement and comprehension.

Every session is recorded from the lecturer's computer (screen capture) and a video camera focusing on the lecturer and the class.  The sessions are streamed live to Sela consultants who couldn't attend the class in person, and can be viewed later from our online course archive.

Among the subjects we have chosen to include in this unique offering are:

  • C# 3.0 and LINQ with an eye toward Data Services and Entity Framework
  • SOA, WCF, WF and Workflow Services with an eye toward OSLO
  • ASP.NET MVC, DDC and AJAX
  • MOSS 2007
  • .NET Performance and Internals
  • Hardcore Production Debugging

The last two topics will be taught by yours truly - I will attempt to cover the material from the .NET Performance and .NET Debugging courses I've authored in a 4-evening series of concise meetings.

The first meeting was yesterday, with Alon Fliess (Sela CTO and VC++ MVP) covering the C# 3.0 language enhancements, motivation and design patterns.  Alon also spoke about our open-source projects that have recently been released, with the Non-Paged CLR Host among them.

More Posts Next page »