August 2012 - Posts
SELA conferences are going international: we’re now announcing SELA Dev Academy, a three day technical conference in Toronto, with a great lineup of speakers and sessions immediately after Microsoft’s BUILD.

If you’re in the Toronto area or flying back home from BUILD, this is a great opportunity to get up to speed on the latest Microsoft technologies with sessions on Windows 8, WCF, Parallel Programming, Windows Azure, ASP.NET Web API, and many more.
The location is the Hyatt Regency Toronto, which looks awesome from the pictures although I haven’t visited.
Registration is here. There’s an early bird price, so hurry up! :-)
This is an excerpt from Chapter 5 (Collections and Generics) of Pro .NET Performance, scheduled to appear in less than a month. I might be publishing a few more of these before and after the book is out.
Before we can concern ourselves with the runtime implementation of generics, it’s paramount to ask whether there is a runtime representation of generics—as we will see shortly, C++ templates, a similar mechanism, have no runtime representation to speak of. This question is easily answered if you look at the wonders Reflection can do with generic types at runtime:
Type openList = typeof(List<>);
Type listOfInt = openList.MakeGenericType(typeof(int));
IEnumerable<int> ints = (IEnumerable<int>
Activator.CreateInstance(listOfInt);
Dictionary<string, int> frequencies =
new Dictionary<string, int>();
Type openDictionary =
frequencies.GetType().GetGenericTypeDefinition();
Type dictStringToDouble = openDictionary.MakeGenericType(
typeof(string), typeof(double));
As you see, we can dynamically create generic types from an existing generic type and parameterize an “open” generic type to create an instance of a “closed” generic type. This demonstrates that generics are first-class citizens and have a runtime representation, which we now survey.
The syntactic features of CLR generics are fairly similar to Java generics and even slightly resemble C++ templates. However, it turns out that their internal implementation and the limitations imposed upon programs using them are very different from Java and C++. To understand these differences we should briefly review Java generics and C++ templates.
A generic class in Java can have generic type parameters, and there is even a constraint mechanism quite similar to what .NET has to offer (bounded type parameters and wildcards). For example, here’s a first attempt at a generic list in Java:
public class List<E> {
private E[] items;
private int size;
public List(int initialCapacity) {
items = new E[initialCapacity];
}
public void add(E item) {
if (size < items.Length – 1) {
items[size] = item;
++size;
} else {
//Allocate a larger array, copy the elements,
//then place ‘item’ in it
}
}
public E getAt(int index) {
if (index < 0 || index >= size)
throw IndexOutOfBoundsException(index);
return items[index];
}
//Many more methods omitted for brevity
}
Unfortunately, this code does not compile. Specifically, the expression new E[initialCapacity] is not legal in Java. The reason has to do with the way Java compiles generic code. The Java compiler removes any mentions of the generic type parameter and replaces them with java.lang.Object, a process called type erasure. As a result, there is only one type at runtime — List, the raw type — and any information about the generic type argument provided is lost. (To be fair, by using type erasure Java retains binary compatibility with libraries and applications that were created before generics — something that .NET 2.0 does not offer for .NET 1.1 code.)
Not all is lost, though. By using an Object array instead, we can reconcile the compiler and still have a type-safe generic class that works well at compilation time:
public class List<E> {
private Object[] items;
private int size;
public void List(int initialCapacity) {
items = new Object[initialCapacity];
}
//The rest of the code is unmodified
}
List<Employee> employees = new List<Employee>(7);
employees.add(new Employee(“Kate”));
employees.add(42); //Does not compile!
However, adopting this approach in the CLR voices a concern: what is to become of value types? One of the two reasons for introducing generics was to avoid boxing at any cost. Inserting a value type to an array of objects requires boxing and is not acceptable.
Compared to Java generics, C++ templates may appear enticing. (They are also extremely powerful: you may have heard that the template resolution mechanism in itself is Turing-complete.) The C++ compiler does not perform type erasure — quite the opposite — and there’s no need for constraints, because the compiler is happy to compile whatever you throw in its way. Let’s start with the list example, and then consider what happens with constraints:
template <typename T>
class list {
private:
T* items;
int size;
int capacity;
public:
list(int initialCapacity) :
size(0), capacity(initialCapacity) {
items = new T[initialCapacity];
}
void add(const T& item) {
if (size < capacity) {
items[size] = item;
++size;
} else {
//Allocate a larger array, copy the elements,
//then place ‘item’ in it
}
}
const T& operator[](int index) const {
if (index < 0 || index >= size)
throw exception(“Index out of bounds”);
return items[index];
}
//Many more methods omitted for brevity
};
The list template class is completely type-safe: every instantiation of the template creates a new class that uses the template definition as a… template. Although this happens under the hood, here is an example of what it could look like:
//Original C++ code:
list<int> listOfInts(14);
//Expanded by the compiler to:
class __list__int {
private:
int* items;
int size;
int capacity;
public:
__list__int(int initialCapacity) :
size(0), capacity(initialCapacity) {
items = new int[initialCapacity];
}
};
__list__int listOfInts(14);
Note that the add and operator[] methods were not expanded —the calling code did not use them, and the compiler generates only the parts of the template definition that are used for the particular instantiation. Also note that the compiler does not generate anything from the template definition; it waits for a specific instantiation before it produces any code.
This is why there is no need for constraints in C++ templates. In a binary search example, the following is a perfectly reasonable implementation:
template <typename T>
int BinarySearch(T* array, int size, const T& element) {
//At some point in the algorithm, we need to compare:
if (array[x] < array[y]) {
...
}
}
There’s no need to prove anything to the C++ compiler. After all, the template definition is meaningless; the compiler waits patiently for any instantiations:
int numbers[10];
//Compiles, int has an operator <
BinarySearch(numbers, 10, 42);
class empty {};
empty empties[10];
//Does not compile, empty does not have an operator <
BinarySearch(empties, 10, empty());
Although this design is extremely tempting, C++ templates have unattractive costs and limitations, which are undesirable for CLR generics:
- Because the template expansion occurs at compile-time, there is no way to share template instantiations between different binaries. For example, two DLLs loaded into the same process may have separate compiled versions of list<int>. This consumes a large amount of memory and causes long compilation times, by which C++ is renowned.
- For the same reason, template instantiations in two different binaries are considered incompatible. There is no clean and supported mechanism for exporting template instantiations from DLLs (such as an exported function that returns list<int>).
- There is no way to produce a binary library that contains template definitions: template definitions exist only in source form, as header files that can be #included into a C++ file.
In the next post, we’ll take a look at how CLR generics are implemented at runtime.
I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn
As John Robbins repeatedly likes to say, Apple computers are the best hardware for running Windows. To quote, “a Mac Pro […] it's the best Windows machine money can buy”. After yesterday’s release of Windows 8 to MSDN subscribers, I went along and installed it on my new MacBook Pro.
Everything went fine—the setup was blazingly fast, all drivers were successfully installed, except for the pesky Apple trackpad settings that were way off what a Windows user comes to expect. In case you don’t know, MacBook trackpads look like this:

There’s no left or right button; just a solid sheet of multi-touch goodness. Well, not-so-goodness—the default gesture for right-click on this trackpad is the following: tap the trackpad with two fingers and simultaneously click with a third finger. (Or, you know, use Shift-F10 every time.)
Typically you’d go to the Control Panel, launch the Boot Camp control panel applet and change all these settings. When I did that, the control panel applet required UAC elevation and then refused to start, claiming that I don’t have access to my startup disk. It’s worth noting that I haven’t requested access to my startup disk.

Scouring the web for hints, I found a post indicating that if you use a standard user account to launch the Boot Camp control panel applet, it works just fine and lets you change the right-click trackpad settings. That’s what I did, only to find the changes made under the new user account do not affect my primary (“Microsoft Account”-enabled) user account.
That’s when I realized that I shouldn’t have been messing around with the secondary user account at all. In fact, it would suffice to launch the control panel applet as standard user, bypassing the elevation process, and it would work fine. So the final piece of puzzle is how to launch a process as standard user when it demands UAC elevation.
This takes me back to the first days of Windows Vista (and this blog), when I even wrote a library called UACHelpers that allows various customizations of this sort. One way or another, to determine whether the program should run elevated Windows consults its manifest. In the case the Boot Camp control panel applet, it says “highestAvailable”, which means that if the user is an administrator, he will be prompted for elevation prior to running the application. We need to modify this to “asInvoker”, so that if the current token is non-elevated, it will be used without prompting for elevation.
I made a copy of the Boot Camp control panel applet (C:\Windows\System32\AppleControlPanel.exe), extracted its manifest using the mt.exe tool, modified it to avoid requesting elevation, embedded it back into the executable, and voila!—everything worked just fine.
In other words:
> copy C:\windows\system32\AppleControlPanel.exe .
> mt.exe -inputresource:AppleControlPanel.exe;#1
-out:extracted.manifest
> sed –i '' -e's/highestAvailable/asInvoker'
extracted.manifest
> mt.exe -outputresource:AppleControlPanel.exe;#1
-manifest extracted.manifest
To summarize, the days of writing Application Compatibility training kits paid off again.
I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn
TL;DR – Parallel programming is hard enough with locks and interlocked memory access instructions. Don’t make it even harder by introducing subtle data races unprotected by locks into your programs.
Language compilers are free to introduce reordering of certain memory access operations into their output. For example, do you really care in which order the following two stores are issued to memory?
x = 12;
y = 13;
Indeed, the compiler may reorder these operations freely, and the thread executing them cannot observe any intermediate stages because it does not read x before writing y, and the value of y does not depend on the value of x.
Even if the compiler hasn’t reordered memory accesses in the original program, some processors will reorder these operations anyway. Although x86 and AMD64 processors Windows developers are used to have very strict memory ordering semantics, there exist processors (notably ARMv7) that perform a much wider set of reorderings. For example, on ARM it would be possible for the processor to reorder the two stores in the program listing above.
This is still not a problem as long as your code is running on a single thread, and hence a single processor. These memory ordering changes do not affect the way your single thread is observing memory. However, as soon as multiple threads come into the picture, things get complicated, fast.
Consider the following two threads executing on two separate processors concurrently, where g is a shared global variable initialized to NULL:
Thread 1 Thread 2
Obj* p = new Obj; if (g != NULL) {
p->data = new int[100]; g->data[10] = 42;
g = p; }
It is now possible for thread 2 to crash while accessing the g->data array, because it might be initialized with garbage although the global g pointer has already been set. For this to happen, the language compiler and/or processor must allow store-store reorderings so that the write to p->data occurs after the write to g.
Even on the much stricter x86-AMD64 architecture, which does not allow store-store reorderings, store-load reorderings are still allowed – it is possible for a store to move after a load if the locations are independent. For example, the two instructions in the following program could be reordered:
mov dword ptr [edi], eax
mov ecx, dword ptr [esi]
It’s somewhat harder to come up with an example that breaks in lieu of only store-load reorderings, but fortunately Jeff Preshing has already found one in the Intel architecture specification:
Thread 1 Thread 2
x = 1; y = 1;
printf("%d", y); printf("%d", x);
Although it might be reasonable to expect that at least one of the threads ends up printing 1, it is in fact possible for both threads to print 0, because the store and load are reordered on both processors.
To be sure, all these problems can be addressed at the lowest level possible by introducing so-called fences (memory barriers). These are special instructions that inhibit the processor from performing some or all types of reordering. For example, in the first problematic listing, we could add a memory barrier between the write to p->data and the write to g, preventing the erroneous reordering that affects the other thread:
Obj* p = new Obj;
p->data = new int[100];
MemoryBarrier();
g = p;
However, I would like to urge you to avoid using memory barriers or really any sort of data-race-prone code that you have to reason so hard about to write correctly. Use locks, use interlocked instructions (such as InterlockedCompareExchange), use high-level constructs that ensure synchronization (such as thread-safe collection classes or queues). If you have to think about memory ordering to prove that your program will run correctly, more often than not you are doing something wrong or trying to optimize your code prematurely.
I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn
Special treatment is always a good thing, especially when you find that the kernel part of Windows gives preferential treatment to its user mode counterpart.
Executive summary: It turns out that the Windows memory manager’s access fault handler, which handles the exception raised by the CPU when an invalid memory access occurs, has a special case for an expected access violation that might be raised when popping an item from an interlocked singly linked list (SList). The rationale for this special treatment is probably the prohibitive cost of setting up an exception handling frame in the popping function.
Before we get started, I assume that you know what a lock-free stack is, and how to implement one in terms of CPU instructions such as lock cmpxchg. For a reference implementation in C#, use Reflector to inspect System.Collections.Concurrent.ConcurrentStack<T> from .NET 4.0.
When implementing a lock-free stack in an unmanaged language, you find yourself in a bit of a pickle. It would be natural for the Pop method to free the memory of the node you just popped of the stack – and even if your Pop method does not free it, the user might free it shortly afterwards. This is a problem, however, because other threads concurrently executing the Pop method might access that memory. Suppose this is your lock-free Pop implementation:
bool Pop(T* item) {
Node* oldHead, newHead;
do {
oldHead = head;
if (oldHead == null) return false;
newHead = oldHead-Next;
} while (oldHead != InterlockedCompareExchange(&head, newHead, oldHead));
*item = oldHead->Value;
delete oldHead;
return true;
}
Now, after the blue line executes on one thread, another thread might be executing the red line, with the same object. This opens up several possibilities:
- The memory has been completely returned to the operating system. This means accessing oldHead->Next will cause an access violation, crashing the application.
- The memory has been returned to a free list but has not been returned to the operating system. This means accessing oldHead->Next will simply read some potentially garbage data, which is fine because another iteration will remedy that.
- The memory has been returned to a free list and then used again to satisfy a node allocation in the Push method. This is an extremely dangerous situation known as the ABA problem, where a head node removed from the list returns to it while another thread still thinks it’s the original head. The result is that data added to the list is lost, causing potential state corruptions and incorrect results throughout.
Whereas the ABA problem can be addressed by storing additional information (such as a monotonic counter) in every node and use that for the InterlockedCompareExchange operation, the problem of memory being returned to the OS requires some handling. The obvious option is to protect the accesses to oldHead with a __try…__except block, and ignore the access violation. However, at least on 32-bit systems, using Structured Exception Handling (SEH) is expensive – an exception activation record has to be constructed on the stack to install the exception handler, and has to be subsequently removed to uninstall it.
Windows uses a different approach in its implementation of a lock-free stack, called SList (also known as Interlocked Singly Linked List). It too encounters the problem of accessing potentially decommited memory in its Pop implementation, but relies on the kernel memory manager’s access fault handler to retry the operation automatically without using exception handling. To let the access fault handler know that the faulting instruction is precisely the potentially problematic one, there is a special label in the SList code called RtlInterlockedPopEntrySListFault (in our case, it would be on the line oldHead = head, which effectively retries the operation).
When the processor is unable to resolve a virtual address to a physical address, or encounters some other problem accessing virtual memory that demands the attention of the operating system, it raises a page fault (interrupt 0xE), which is handled by Windows in the KiTrap0E kernel function. After some minor preprocessing, KiTrap0E transfers control to the memory manager’s MmAccessFault function, which does the bulk of the page fault handling.
The following disassembly is from the Windows 7 SP1 32-bit ntoskrnl.exe. When MmAccessFault begins its fault handling process, the first real function it calls is KeInvalidAccessAllowed, bailing out quickly with a negative return value if the latter function returns an appropriate value:
@KeInvalidAccessAllowed@4 proc near
test ecx, ecx
jnz short do_check
not_allowed:
xor al, al
retn
do_check:
mov eax, [ecx+6Ch]
cmp eax, 8
jz short executive_slist
cmp eax, 1Bh
jnz short not_allowed
mov eax, ds:_KeUserPopEntrySListFault
jmp short check_address
executive_slist:
mov eax, offset _ExpInterlockedPopEntrySListFault
check_address:
cmp [ecx+68h], eax
setz al
retn
@KeInvalidAccessAllowed@4 endp
When KiTrap0E sees a negative return value from MmAccessFault, it transitions to the following code, which verifies that the reason for the fault is what we described above:
mov ecx, offset _ExpInterlockedPopEntrySListFault
cmp [ebp+68h], ecx
jz executive_slist_fault
mov ecx, ds:_KeUserPopEntrySListFault
cmp [ebp+68h], ecx
jz user_slist_fault
executive_slist_fault:
mov ecx, offset _ExpInterlockedPopEntrySListResume
jmp short finish_interrupt_successfully
user_slist_fault:
test byte ptr [ebp+6Ch], 1
jz loc_438A98
mov ecx, large fs:124h
cmp edi, [ecx+1F0h]
jnz short loc_438BBC
cmp dword ptr [ecx+1DCh], 400h
inc dword ptr [ecx+1DCh]
jb short loc_438BCC
mov dword ptr [ecx+1DCh], 0
jmp loc_438A98
loc_438BBC:
mov [ecx+1F0h], edi
mov dword ptr [ecx+1DCh], 0
loc_438BCC:
mov ecx, ds:_KeUserPopEntrySListResume
...
finish_interrupt_successfully:
mov [ebp+68h], ecx ; replace EIP with new return point
mov esp, ebp
test ds:_KdpOweBreakpoint, 1
jz Kei386EoiHelper
...
When MmAccessFault detects that the access violation occurred while executing the specific instruction in RtlInterlockedPopEntrySList that is subject to the problem described above, it returns from the exception (effectively handling it) and configures the restored context such that the function resumes and performs another loop iteration.
If you trust the ReactOS code to be a fair representation of what Windows does, at least in these low-level situations, check out their sources here:
if (TrapFrame->Eip == (ULONG_PTR)ExpInterlockedPopEntrySListFault)
{
PSLIST_HEADER SListHeader;
/* Sanity check that the assembly is correct:
This must be mov ebx, [eax]
Followed by cmpxchg8b [ebp] */
ASSERT((((UCHAR*)TrapFrame->Eip)[0] == 0x8B) &&
(((UCHAR*)TrapFrame->Eip)[1] == 0x18) &&
(((UCHAR*)TrapFrame->Eip)[2] == 0x0F) &&
(((UCHAR*)TrapFrame->Eip)[3] == 0xC7) &&
(((UCHAR*)TrapFrame->Eip)[4] == 0x4D) &&
(((UCHAR*)TrapFrame->Eip)[5] == 0x00));
/* Get the pointer to the SLIST_HEADER */
SListHeader = (PSLIST_HEADER)TrapFrame->Ebp;
/* Check if the Next member of the SLIST_HEADER was changed */
if (SListHeader->Next.Next != (PSLIST_ENTRY)TrapFrame->Eax)
{
/* Restart the operation */
TrapFrame->Eip = (ULONG_PTR)ExpInterlockedPopEntrySListResume;
/* Continue execution */
KiEoiHelper(TrapFrame);
}
}
I was quite surprised to discover this fact during the last Windows Internals course I delivered. It felt really wrong – a very low-level component adjusts its behavior specifically for a fairly high-level mechanism. Considering how important SLists are to Windows and to user-mode applications, however, this “hack” is probably worthwhile.
I am posting short updates and links on Twitter as well as on this blog. You can follow me: @goldshtn
Oh, I was so happy when I discovered the ever so slight pun in “post-BUILD event”. Programmer humor, very sad. Moving on.

The November SDP is going to be the biggest one yet. We’ll kick off the conference on Sunday, November 18 with three tracks in parallel, for “Client and Web”, “Server and Cloud” and “ALM”. With keynotes from leading Microsoft speakers and an additional five sessions in each track, this day should bring you up to par with the technological novelties from Microsoft in your area of expertise. Some examples:
And then we have four days of full-day workshops – 16 workshops with our best speakers diving deep into the flurry of technologies delivered by Microsoft during the last year. Here are some of the sessions:
Yours truly will have two sessions at the conference – a workshop titled Improving the Performance of .NET Applications, which I’m giving successfully for the fifth time in a row (I think :-)), and a brand new breakout session called JavaScript, Meet Cloud: Node.js on Azure, which really takes me out of my comfort zone. But from the experience of the last few months, I like Node.js quite a lot, and then the June 2012 refresh features of Windows Azure are incredible. This talk will combine these points of view to a coherent development experience, which I hope you’ll like.
See you at the SDP!