Pedro Teixeira is going to talk about processes and threads in systems with more than 64 logical processors as well as user-mode scheduling.
Surprisingly for some people, NUMA is not an esoteric hardware architecture. Even high-end gaming rigs today are NUMA; Pedro is going to use a machine on loan from HP that has 256 logical processors and 1TB of physical memory.
Adding support for more than 64 logical processors required a breaking app-compat change, because processor affinities were represented in Windows by a 64-bit bitmask. CPUs are therefore now addressed in groups of up to 64 processors, and there's a need for new APIs that interact with these groups. Currently there's a maximum of four groups, but this is an artificial limitation. (Processor groups are only available on 64-bit platforms.)
Groups are populated by proximity – all logical processors in a core (hyper-threading), all cores in a socket, nearby sockets on the board, NUMA nodes, and nearby NUMA nodes. This guarantees that the computational resources within one processor group are as close together as possible.
Windows actually measures the distances between NUMA nodes, regardless of what the hardware reports, to determine the true proximity metric for processor groups.
Each thread belongs to exactly one processor group at a time. Threads can be assigned to groups using CreateRemoteThreadEx with a special thread-initialization attribute, or using SetThreadGroupAffinity. Processes are assigned to groups round-robin, and threads inherit their group from the parent thread and from their process.
NUMA and Proximity Information
There are also new APIs that can be used to extract the system topology information with relationships between cores, NUMA nodes, and so forth. As part of this topology, information about memory technology and device location is also exposed – e.g. physical devices and network storage connected directly to a NUMA node. (So it's not just about memory anymore, which is not surprising considering DMA must be performed from these devices. NUEA – non-uniform everything access :-))
Future hardware will allow load-balancing across NUMA nodes within the same box – e.g. two NICs with the same IP address, connected to separate NUMA nodes, that can use sophisticated rules to balance requests between the nodes. Additional new APIs allow extracting NUMA node information from a socket or a file handle, among other things.
[None of the demos worked during the presentation because the network connection to the monster HP machine was very flaky. Hopefully we’ll be able to get the demos after the talk.]
UMS enables developers to write their own implementation of scheduling in user mode, without going into the kernel, including synchronization primitives (which ConcRT implements). If a worker thread happens to block on a kernel synchronization primitive, the user-mode scheduler still gets a chance to schedule another task on that processor.
UMS is nothing like fibers – fibers only allow switching between contexts, but threads are much more than a context (e.g. they carry impersonation information), and UMS lets you switch threads, not just contexts.
UMS is about splitting a thread into a user part and a kernel part so that they can be switched separately (lazily switched). In user mode, the user part of the thread (registers, stack, and meta-information) can be switched back and forth without entering the kernel, even while running on the "wrong" kernel thread, because the kernel part is not required as long as the thread stays in user mode. When the thread does enter the kernel, UMS makes sure the two parts match up (a real thread switch).
In UMS there is a scheduler thread to represent each core – you would usually create one scheduler thread per core – and these decide which thread runs next. When creating a user scheduler, you provide a callback function that is invoked when necessary (a "message loop"). The scheduler decides what to do next by executing a worker thread – from the kernel's perspective, these are full threads, not fibers. Worker threads are pushed by the kernel into a queue called the UMS completion list. (You can have as many UMS completion lists as you'd like – each UMS scheduler can create its own completion list, share one with other schedulers, and so forth.)
The first thing the scheduler should do is pop worker threads from the UMS completion list, put them in its own ready list, make a scheduling decision, and execute a worker thread.
A UMS thread can yield the processor, passing control to the scheduler. Yielding is a means for implementing arbitrary synchronization mechanisms in user-mode (e.g. an event wait in a library could mean storing away some contextual wait information and then yielding – the scheduler will be responsible for not scheduling the thread until the wait condition occurs).
When a worker thread performs a kernel call and blocks, the scheduler thread is again called with a notification that the thread was blocked. It can select another UMS thread and run it. (It's the same yield as in the previous paragraph, but a kernel yield rather than a user yield.) When the worker thread unblocks, it's put in the UMS completion list, and the scheduler is now responsible for looking at the UMS completion list when it's next invoked and moving the threads from there to its private ready list.
If there's no work in the ready list: the function for popping from the UMS completion list can block (it takes a timeout parameter), and you can also obtain an event that is signaled when something appears in the UMS completion list. This enables the scheduler to wait for work to arrive. However: Never return from the scheduler function!!! (When you return from the scheduler function, you're back to the normal thread that you converted into the scheduler thread.)
There's no support for preemption at the moment, so it's impossible to implement a quantum-based scheduler directly. You can approximate one with a watchdog thread that calls SuspendThread on a long-running worker – the suspension blocks it and puts it in the UMS completion list.
[All in all, UMS is a very interesting mechanism. Alon and I agreed to write a sample together that shows off the basic features of UMS. Hopefully we’ll be able to publish it soon.]