Fun with the .NET GC, Paging, Generations and Write Barriers
There are plenty of resources on the .NET garbage collector, ranging from fairly basic overviews that explain what a Garbage Collector is, to detailed articles delving deep into the GC implementation, the server/workstation GC, the implementation of concurrent collections, the generational model and so forth. You can find most of them fairly easily - here's a very brief list of references:
- If the basic concepts don't sound familiar to you, you might want to refer to Jeffrey Richter's excellent article for a good primer (note that there are two parts).
- Another fantastic GC resource is the blog of Maoni Stephens, one of the Microsoft developers working on the GC. The following presentation from PDC 2005 lists the new features in CLR 2.0, and discusses some relatively obscure corners of the GC.
- If the theory of the subject interests you, you can look into the Wikipedia article on garbage collection.
- If you're looking for a definitive summary (without too many words of explanation), look into Vineet Gupta's post.
So why have I decided to write YAPOGC ("yet another post on garbage collection")? To emphasize a point and to pose a question. You might find the thoughts below a bit random; if you feel that I should elaborate further, feel free to say so in the comments.
The point I'd like to emphasize is that the .NET Garbage Collector, in spite of being tuned for workstation and server applications, in spite of having lots of theoretical academic work behind it, in spite of some major concepts being "borrowed" from the JVM, and in spite of being developed for quite a few years now, is not suitable for every existing application. For example, the GC allocation strategy, which effectively requires pooling and recycling objects that reach generation 2 so that they are not repeatedly released and full collections do not occur, is not necessarily appropriate for every server application. It is also possible that security requirements or latency considerations require that the GC not run during a specified interval of time or within a specified region of code. Indeed, a question I encounter quite frequently during my lectures is "how do I disable GC for X seconds?" or "how do I receive a notification that GC is about to start?".
Indeed, there is a way to receive a notification that GC is about to start. This can be achieved by writing a custom CLR host. The default host does not provide any managed event or other notification that a GC is happening or has just completed. However, the hosting API allows us to implement the IHostGCManager interface, which is sent notifications when the EE (execution engine) is starting thread suspension, ending thread suspension, and when a given thread is going to be blocked for a GC to occur. However, the host cannot decide that it wants to stop the collection from occurring. Either way, there is an overwhelming amount of material on CLR hosting out there, including the excellent book by Steven Pratschner, "Customizing the .NET CLR".
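For illustration, here is a minimal C++ sketch of what the host-side GC manager object might look like. The interface and method names (IHostGCManager and its three methods, from mscoree.h) are the real hosting-API ones, but the logging and the surrounding host plumbing are purely illustrative, not a complete host:

    #include <windows.h>
    #include <mscoree.h>
    #include <cstdio>

    // Sketch of a host-provided GC manager; a real host must also implement
    // IUnknown properly and hand this object to the CLR from
    // IHostControl::GetHostManager when asked for IID_IHostGCManager.
    class SampleGCManager : public IHostGCManager
    {
    public:
        // The EE is starting to suspend threads - a GC is about to begin.
        STDMETHODIMP SuspensionStarting()
        {
            printf("GC starting - runtime threads are being suspended\n");
            return S_OK;
        }

        // Thread suspension is ending; 'generation' is the generation that was collected.
        STDMETHODIMP SuspensionEnding(DWORD generation)
        {
            printf("GC of generation %u finished - threads resuming\n", (unsigned)generation);
            return S_OK;
        }

        // The calling thread is about to block, waiting for the GC to complete.
        STDMETHODIMP ThreadIsBlockingForSuspension()
        {
            return S_OK;
        }

        // Minimal IUnknown stubs, for the sake of the sketch only.
        STDMETHODIMP QueryInterface(REFIID, void**) { return E_NOINTERFACE; }
        STDMETHODIMP_(ULONG) AddRef() { return 1; }
        STDMETHODIMP_(ULONG) Release() { return 1; }
    };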
A disastrous condition occurs when the managed memory (the managed heap) starts being paged out. This can happen even if you have more than 2GB of memory on your machine, because other processes might "misbehave" and occupy memory, so that your working set is trimmed and the heap is paged out. So far this might seem tolerable: if only unused parts of the GC heap are being paged out, this is still fairly acceptable. After all, this is also what happens when you use a native allocator (e.g. the CRT malloc or new).
However, if the GC is forced to perform a full collection, it is required to traverse all the objects that are reachable from active roots. If you are indeed using most of the memory, and some of it is paged out, that memory will have to be paged back in so that the GC gets a chance to examine it. This is required even if your application won't touch the paged-out objects for hours: the GC isn't aware of that fact, and must still traverse all live objects, even if they are in the page file. This has a disastrous effect on the percentage of time your application spends doing collections: in fact, it's just sitting there waiting for all the memory to be paged back in. Maoni has addressed this scenario and noted that it is being worked on.
This kind of situation will probably steer you in one of the following directions: write a custom native allocator that performs memory pooling and defragmentation, or write some kind of managed pool.
Consider the scenario where you have lots of managed objects that are fairly small, but most of them contain a buffer of memory of varying size (e.g. TCP packets). It's a bit difficult to pool these buffers in managed code, because you would have to keep track of each buffer and have a notion of "here it starts, here it ends". It is possible to wrap these operations in a managed class, and I have encountered at least one such implementation. It is also possible to pool these buffers in unmanaged code, and provide a window into them through a managed wrapper. I've encountered an implementation of this idea as well, and it might even provide better performance because array bounds checks are eliminated when using unsafe pointers. Note that if you are writing a custom native pool that deals with fragmentation by returning buffers to buckets and allocating them from buckets, you might as well use the Low Fragmentation Heap that comes with Windows XP and higher (it is a flag that you pass to HeapSetInformation after you have a heap, stating that you want to use the LFH).
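As a rough illustration, here is a minimal native (C++) sketch of turning on the LFH for a dedicated heap and carving buffers out of it; error handling is omitted, and the buffer size is just an arbitrary example:

    #include <windows.h>

    int main()
    {
        // A dedicated heap for the pooled buffers.
        HANDLE hHeap = HeapCreate(0, 0, 0);

        // 2 is the documented value that enables the Low Fragmentation Heap.
        ULONG enableLFH = 2;
        HeapSetInformation(hHeap, HeapCompatibilityInformation,
                           &enableLFH, sizeof(enableLFH));

        // Allocate a buffer (e.g. roughly TCP-packet-sized) from the LFH-backed heap.
        void* buffer = HeapAlloc(hHeap, 0, 1500);

        // ... expose the buffer to managed code through a wrapper, use it, then: ...
        HeapFree(hHeap, 0, buffer);
        HeapDestroy(hHeap);
    }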
Note that if pooling of any kind is not an option for you, you might still benefit from using a native allocator to allocate your buffers and accessing them via a managed wrapper. You will have to handle your own memory allocation and deallocation issues, but you will be freed from the concern that a full collection will trigger a huge number of page faults to bring all that memory back from disk.
If you are familiar with the concept of generational GC, you know that the managed heap is divided into 3 generations (gen0, gen1 and gen2). A garbage collection does not have to traverse all 3 generations: it is possible to perform a collection only in gen0, for example. The idea of this optimization, of course, is to minimize the amount of memory the GC has to touch during a collection, and thus minimize the time spent performing garbage collections. Conceptually, the large object heap (LOH) also belongs in gen2, because it is collected when a full collection (a gen2 collection) occurs.
It might have occurred to you that if the GC only looks at objects in gen0 while performing a gen0 collection, it might fail to notice that objects in gen2 might also reference those younger objects. If the GC were forced to traverse all objects in gen2 just to make sure that they do not hold an outstanding reference to a young object in gen0, the entire optimization of the generational model would be lost.
Therefore, this is not the case. The JIT establishes write barriers that are triggered every time there is a write into a reference field in an object. If the value of the reference (which is basically a pointer, after all) is in the ephemeral segment (which is where gen0 and gen1 reside), the write barrier knows that it must update a global data structure, called a card table, with a bit that indicates that we might have a condition where a young object is referenced by an old object.
If every old object had a bit in the card table, the card table would be too big. Consider that the minimum object size is 12 bytes; assuming that we can use about 1.6GB of memory for gen2, we can have up to roughly 143 million objects, which would require almost 18MB of memory just for the card table. Therefore, the card table instead contains a bit for every 128-byte range, which requires a single DWORD for every page (4KB) on x86.
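Here's a quick back-of-the-envelope check of these numbers (the 1.6GB figure and the 12-byte minimum object size are the assumptions from the paragraph above, not exact CLR constants):

    #include <cstdio>

    int main()
    {
        const double heapBytes  = 1.6 * 1024 * 1024 * 1024;  // ~1.6GB usable by gen2 (assumed)
        const double minObjSize = 12;                         // minimum object size on x86
        const double maxObjects = heapBytes / minObjSize;     // ~143 million objects
        const double perObjectTableMB = maxObjects / 8 / 1e6; // one bit per object -> ~18MB

        const int bitsPerPage = 4096 / 128;                   // one bit per 128-byte range
        printf("max objects: %.0f million\n", maxObjects / 1e6);
        printf("per-object card table: %.1f MB\n", perObjectTableMB);
        printf("bits per 4KB page: %d (= one DWORD)\n", bitsPerPage);
    }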
When a GC occurs at a later point, it consults the card table to see whether there were any "dirty" writes into objects in an older generation, thus ensuring that objects still referenced from an older generation are not incorrectly swept.
When I first heard of this implementation (which, by the way, is quite similar in the JVM, and is a subject of theoretical research and practical analysis), the first obvious question I had was: doesn't this mean that every write to a reference field triggers some kind of write barrier code that must perform the tracking before the actual assignment is made?!
Well, yes-it-does. Every time the JIT sees that an update to a reference field is about to be emitted, it emits a call to the write barrier thunk. This can be examined in the SSCLI, in the jithelp.asm file (for the x86 implementation). The code ensures that the address being written to (the destination) is in the GC heap, and then checks whether the address being written (the source) is within the ephemeral segment:
    cmp rg, g_ephemeral_low      ; compare the reference against the bottom of the ephemeral segment...
    cmp rg, g_ephemeral_high     ; ...and against its top; if it lies outside, the card update is skipped
    shr edx, 10                  ; otherwise, shift the destination address to locate its card
    add edx, [g_card_table]      ; and index into the card table so the card can be marked as dirty
Note that whenever the ephemeral segment moves, the JIT thunk must be updated to contain the correct segment low and high addresses. There is actual assembly-patching code for this, which can be found in jitinterfacex86.cpp, in the StompWriteBarrierResize function.
To make a long (and getting longer) story short, this seems very inefficient. Every reference write must be patched with this thunk, and while it's not exceptionally expensive, it's much costlier than the simple MOV instruction you might have expected. But before criticizing this solution, is there an alternative at all?
It seems that there is. The Windows VirtualAlloc function allows us to specify that the allocated region should be "write watched" (using the MEM_WRITE_WATCH flag), so that subsequent writes to it are recorded and can later be retrieved using the GetWriteWatch API. The granularity of the watch, naturally, is in pages, so it requires more work on the garbage collector's side after determining that an area has been written to, but it seems that the performance gained by dropping the JIT-generated write barrier thunk could be more significant.
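To make the mechanism concrete, here is a minimal C++ sketch of reserving a write-watched region, dirtying a couple of pages, and asking the OS which pages were touched; the region size and counts are arbitrary illustration values:

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        const SIZE_T regionSize = 16 * 4096;     // 16 pages
        BYTE* region = (BYTE*)VirtualAlloc(NULL, regionSize,
            MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH, PAGE_READWRITE);

        region[0] = 1;                           // dirty the first page
        region[5 * 4096] = 1;                    // dirty the sixth page

        PVOID dirtyPages[16];
        ULONG_PTR count = 16;                    // in: capacity, out: number of dirty pages
        DWORD granularity = 0;                   // receives the tracking granularity (a page)
        GetWriteWatch(WRITE_WATCH_FLAG_RESET,    // retrieve the dirty pages and reset the watch
                      region, regionSize, dirtyPages, &count, &granularity);

        // A GC-like scanner would now only have to examine the 'count' pages in
        // 'dirtyPages' for cross-generation references.
        printf("%u dirty pages, granularity %u bytes\n",
               (unsigned)count, (unsigned)granularity);

        VirtualFree(region, 0, MEM_RELEASE);
    }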
The question I was talking about, then, is: why doesn't the .NET GC use this built-in write-watch mechanism supplied by the Win32 API? The following blog entry notes this possibility (in section 3.2.4), but does not elaborate on the reasons behind the particular choice made in .NET. I have several speculations (which are just that - speculations) and am still pursuing a more definitive answer:
- The aforementioned API is not supported on Windows 95 (which is, perhaps, not so surprising), but it is not supported on Windows 2000 either. This would limit the .NET Framework's compatibility with these platforms (although in those particular cases the JIT-thunk approach could still be adopted).
- The aforementioned API is Windows-specific and does not provide any compatibility with other platforms. The JIT-generated write barrier is generic and theoretically can work on any platform.
- The performance penalty of writing to a region of memory allocated with the MEM_WRITE_WATCH flag is bigger than that of the thunk generated by the JIT. Note that a very primitive measurement I've performed indicates an 8% performance penalty when writing to memory protected by a write watch, as opposed to writing to memory that is not (don't quote me on this :-)). A sketch of that kind of measurement appears below.
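Here is a rough sketch of the kind of primitive measurement I mean; the region size, stride and pass count are arbitrary, and the results will obviously vary by machine:

    #include <windows.h>
    #include <cstdio>

    // Time a series of writes into a region allocated with the given extra flags.
    static double TimeWrites(DWORD extraFlags)
    {
        const SIZE_T size = 64 * 1024 * 1024;    // 64MB region
        BYTE* p = (BYTE*)VirtualAlloc(NULL, size,
            MEM_RESERVE | MEM_COMMIT | extraFlags, PAGE_READWRITE);

        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        for (int pass = 0; pass < 10; ++pass)
            for (SIZE_T i = 0; i < size; i += 64)   // touch every cache line
                p[i] = (BYTE)i;
        QueryPerformanceCounter(&end);

        VirtualFree(p, 0, MEM_RELEASE);
        return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    }

    int main()
    {
        printf("plain region:         %.3f seconds\n", TimeWrites(0));
        printf("write-watched region: %.3f seconds\n", TimeWrites(MEM_WRITE_WATCH));
    }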