November 2009 - Posts
A few hours ago the Microsoft Professional Developers Conference 2009 was adjourned. What a conference it was! Lots of interesting sessions, our own Ariel Ben Horesh presenting together with Glenn Block, the free Acer multitouch tablet, lots of great food – an amazing week at the session halls, the Sela booth at the Partner Expo, and in Los Angeles in general.
Now that the PDC is over, it’s time to start preparing for the SDP – Sela Developer Practice. We’re going to bring the information back to Israel so that Israeli developers can learn about all the latest advances in Silverlight, Visual Studio 2010, concurrency, garbage collection and other topics.
At the SDP, Eran Stiller and I are going to talk about Parallel Programming with Visual Studio 2010 and about WCF 4.0, WF 4.0 and Workflow Services.
Here’s a list of my blog posts summarizing the sessions I attended (excluding the keynotes) along with brief highlights:
So again, it was a great PDC. If you were there, I’d love to hear what you think, and if you’re going to attend the SDP, come forward and say hi!
The last session at the PDC that I’m attending is about incubation tools for debugging, from Microsoft Research. Debugging is hard and the process of finding the root cause is manual and therefore tedious and long.
The formal debugging process – ask an expert, check the bug database, check the version history, reproduce the bug, trace in a debugger. [Some existing tools that help along the way are Visual Studio Test Impact analysis, Visual Studio Test Elements and Visual Studio Intellitrace (new in Visual Studio 2010).]
Can we automatically debug the code and find the root cause?
- Holmes – a statistical debugging tool that uses large test suites to diagnose failures. (available for download right now and integrates with Visual Studio)
- Darwin – a tool for debugging regressions using a stable version to diagnose failures.
- Debug Advisor – a recommendation system for bugs that mines software repositories for information related to a bug.
Holmes
Can we use test suites to help find the cause of the failures and not just the failures themselves? Statistical debugging is about collecting instrumented data (all acyclic path fragments within a method) from a large set of successful and failing test cases. Next, Holmes looks for code paths that strongly correlate with failure: paths for which tests almost always fail when the path is exercised, and almost always pass when it is not.
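To make the idea concrete, here is a minimal sketch of scoring paths by how strongly they correlate with failure. This is not Holmes's actual analysis; the `TestRun` type, the `score_paths` function, and the scoring formula (failure rate when the path is hit, multiplied by pass rate when it is not) are my own assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical record of one test execution: the set of path
# identifiers it exercised and whether the test passed.
TestRun = namedtuple("TestRun", ["paths", "passed"])

def score_paths(runs):
    """Rank paths by a naive failure-correlation score (an assumed formula)."""
    all_paths = set().union(*(run.paths for run in runs))
    scores = {}
    for path in all_paths:
        with_path = [r for r in runs if path in r.paths]
        without_path = [r for r in runs if path not in r.paths]
        # Fraction of runs that failed among those that exercised the path.
        fail_when_hit = sum(not r.passed for r in with_path) / len(with_path)
        # Fraction of runs that passed among those that did not exercise it.
        if without_path:
            pass_when_missed = sum(r.passed for r in without_path) / len(without_path)
        else:
            pass_when_missed = 0.0
        scores[path] = fail_when_hit * pass_when_missed
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

runs = [
    TestRun({"a", "b"}, False),
    TestRun({"a"}, True),
    TestRun({"b"}, False),
    TestRun({"c"}, True),
]
ranking = score_paths(runs)
```

With so few runs the scores are meaningless in practice, which is exactly why the approach needs a large test suite.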
Holmes collects path coverage information (could also collect Intellitrace information), and once it’s available Holmes comes up with a set of potential root causes.
After injecting a bug, the presenter ran a suite of unit tests and saw some test failures. The actual test failures don’t contain sufficient information for full diagnostics. Next, you load the Holmes package for Visual Studio, select the failing test run and Holmes provides two possible root causes. Double-clicking on one of the results brings you to the actual code with the highlighted code path.
Holmes also supports the external Visual Studio Test Manager so that the developer can open a test run from a TFS server later as long as the Holmes coverage data was collected during the test run.
Because of path coverage (instead of full coverage) the extra overhead introduced by Holmes is about 10% – 30%. As for the statistical analysis, it completes quite quickly. The analysis involves measuring correlation between paths and test failures, but it’s not just a simple correlation – if a path correlates with failure it doesn’t necessarily imply that it causes failure (e.g. exception handling, error recovery etc. are associated with failures but don’t cause them). The smarter analysis idea used in Holmes is that you’re looking for paths that correlate with failure, but only inside methods/loops/try-catch statements that do not correlate with failure. This is a very effective heuristic in practice.
Holmes relies on many test cases – if you have lots of test cases, Holmes can help with a good correlation; but for a small number of tests, there’s hardly anything that can be done with the output of a test run. The presenter’s recommendation is about 100 tests with approximately 10-20 failing test cases.
Darwin
Darwin is useful when you have a stable, working build of your application and then a change is made that introduces a regression. Manual debugging would involve comparing the old version with the new version… (This is tedious for large changes and doesn’t work with regressions present in the previous version but unmasked in this one.)
Comparing test cases is another approach – comparing a trace of the failing test case with a similar, passing test case. Armed with this information, you can look for the place where the test cases diverge. The point of divergence is the likely cause of the bug. The problem is that coming up with a passing test is not easy, and that’s the problem Darwin tries to solve.
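The trace comparison itself is the easy part. A minimal sketch, assuming a trace is simply a list of step names (the `first_divergence` helper and the sample traces are hypothetical, not part of Darwin):

```python
def first_divergence(passing_trace, failing_trace):
    """Return the index of the first step where two traces differ, or None."""
    for index, (passed_step, failed_step) in enumerate(zip(passing_trace, failing_trace)):
        if passed_step != failed_step:
            return index
    # One trace is a strict prefix of the other: they diverge where the shorter ends.
    if len(passing_trace) != len(failing_trace):
        return min(len(passing_trace), len(failing_trace))
    return None

passing = ["parse", "validate", "resize", "save"]
failing = ["parse", "validate", "crop", "save"]
diverged_at = first_divergence(passing, failing)
```

The hard part, as noted above, is obtaining a passing trace that is similar enough for this comparison to be meaningful.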
Darwin defines similarity between test cases and is capable of generating similar, passing tests given a failing test case. To do that, Darwin uses the previous (stable) version of the application. Two tests are considered similar if they follow the same code path in the stable version but different code paths in the new version.
Finding similar tests is a constraint solving problem, and techniques similar to Pex can be used to solve it. The tool is currently in prototype but it has already been used to detect a bug in a web server as well as an image processing application.
Debug Advisor
Debug Advisor is a recommendation system for bugs. When a bug report is received, Debug Advisor helps answer the questions: Has someone else looked at this bug before or fixed it before? What do we know about this category of bugs? Who should I ask for help? Where should I start looking?
What you do with the tool is take all the information you know about the bug (in text form) and put it in a big search box. Debug Advisor shows you similar bugs, as well as lots of related information about the bug – people related to it (e.g. working on the same part of the project or encountered a similar bug before), source files that might be related to this bug, as well as a list of relevant binaries.
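To give a feel for the "big search box" idea, here is a toy sketch that ranks stored bug descriptions by word overlap with the query. The real tool is certainly far more sophisticated; the Jaccard similarity measure, the `similar_bugs` function, and the sample bug database are all assumptions of mine.

```python
def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())

def similar_bugs(query, bug_db, top=3):
    """Rank bug ids by Jaccard similarity between query and description."""
    query_words = tokenize(query)
    scored = []
    for bug_id, description in bug_db.items():
        words = tokenize(description)
        union = query_words | words
        score = len(query_words & words) / len(union) if union else 0.0
        scored.append((score, bug_id))
    # Highest score first; ties are broken by bug id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [bug_id for score, bug_id in scored[:top] if score > 0]

# Hypothetical bug database, for illustration only.
bug_db = {
    101: "crash in image decoder on large file",
    102: "typo in settings dialog",
    103: "decoder crash when file is truncated",
}
matches = similar_bugs("decoder crash on truncated file", bug_db)
```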
Existing Windows Embedded offerings: Windows Embedded Compact runs on consumer devices such as GPS units, Windows Embedded Standard runs on devices such as microscopes and projectors, and Windows Embedded POSReady runs on point-of-sale devices.
Windows Embedded Standard 2011 is a way to build devices with a custom Windows 7-based OS, with only the features you need. It supports standard Windows applications, 64-bit and 32-bit drivers, Windows servicing tools, and additional features.

The Developer Toolkit has a wizard experience: the Image Build Wizard (IBW) installs a Standard 2011 image on the device interactively, which is suitable for prototyping and evaluation. The advanced experience (Image Configuration Editor, or ICE, with a Target Analyzer) provides more advanced OS settings. Tools such as WinPE, WDS etc. are also supported.
Building blocks: Embedded core (bootable, command line, networking features), Feature Packages (e.g. IE, .NET, DirectX), Language Packs, Driver Packs, Embedding Enabled Features, and of course any third party software. This all goes through the Image Builder Engine and results in the embedded OS.
The IBW (wizard) operates from a template or an empty starting point. The templates can contain all the building blocks described in the previous paragraph. There’s also some smart dependency analysis in the wizard so that you don’t exclude packages or features required for proper operation of the device.
The advanced editor (ICE) allows changing the machine settings (e.g. firewall rules) as well as scripting installations of other applications with answer files to all questions asked by installers (including product license keys), the Windows welcome screen etc. (these are just smart macros ;-)).
[The rest of the session was basically one big demo of IBW and ICE and how they can be used to configure an image for prototyping, deployment, installation, etc.]
About an hour ago, the Silverlight 4 application that Alex wrote for the Sela booth at the PDC was used to draw one lucky winner who took home an HP TouchSmart IQ846 – an all-in-one PC with a 25.5” dual-touch display that can be used as a great media-center. Its tech specs are really neat, and in fact back home I have a slightly older model of the same machine.
Other than the TouchSmart, we also gave away two 8MP Polaroid cameras. Thanks to everyone who came to the booth in aisle 400, and we look forward to seeing you at the next PDC!
[If you’re not at the PDC, don’t forget about the Sela Developer Practice on December 27-30, with great tutorials and sessions based on this PDC’s materials and presented by Sela experts.]
Chas Boyd’s session on DirectX11 DirectCompute is going to focus on bringing the power of the GPU for general-purpose computing (and not necessarily graphics applications).
A modern CPU would have 4 cores, run at 3GHz, 4 float-wide SIMDs, peak theoretical performance of 48-96GFlops, 2x hyperthreaded capability, 64KB L1 cache, a memory interface of about 20GB/s, and take about 200W out of the wall at a cost of about $200.
A GPU is usually constructed from 32 cores, each 32-float wide, at 1GHz, giving us about 1Teraflop (with high-end ones giving much more), 32x hyperthreading, and a very high 150GB/s memory bandwidth to the GPU RAM, also taking about 200W at about $200. The GPU has 16K 32-bit registers per core! The compiler allocates the registers available to each thread from this pool.
How can applications exploit the performance of the GPU architecture? There are some advantages aside from the Teraflop – it’s also about high bandwidth memory, higher GFlops per watt, etc. One of the problems is that the bandwidth between the CPU and the GPU (the PCI bus) gives us about 1-2GB/s practical bandwidth only…
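The numbers above are easy to reproduce with back-of-the-envelope arithmetic. In this sketch, the factor of 2 floating-point operations per lane per cycle (to explain the upper end of the 48-96GFlops range), the 4 bytes per pixel, and the 1.5GB/s midpoint for the PCI bus are my assumptions, not figures from the talk.

```python
def peak_gflops(cores, simd_width, clock_ghz, flops_per_lane_per_cycle=1):
    """Theoretical peak: every SIMD lane of every core busy on every cycle."""
    return cores * simd_width * clock_ghz * flops_per_lane_per_cycle

def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Milliseconds needed to move num_bytes at the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

cpu_low = peak_gflops(cores=4, simd_width=4, clock_ghz=3.0)
cpu_high = peak_gflops(cores=4, simd_width=4, clock_ghz=3.0,
                       flops_per_lane_per_cycle=2)  # assumed multiply-add
gpu = peak_gflops(cores=32, simd_width=32, clock_ghz=1.0)

# A 1MP image with one 4-byte float per pixel (assumed layout).
image_bytes = 1_000_000 * 4
over_pci_ms = transfer_ms(image_bytes, 1.5)    # assumed midpoint of 1-2GB/s
in_gpu_ram_ms = transfer_ms(image_bytes, 150)
```

Under these assumptions, moving the image across the bus takes about a hundred times longer than reading it from GPU memory, which is why keeping data on the GPU matters so much.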
The GPU has thousands of ALUs, meaning we need hundreds of thousands of threads (considering 32x hyperthreading) to hit peak, and only data elements come in such numbers. E.g., performing an operation on a 1MP image makes the GPU very happy if you create 1M threads. (Another important notion is that the CPU is optimized for random memory access, but the GPU likes sequential memory accesses.)
Some scenarios: Image processing (highly parallelizable), video processing, audio, linear algebra, simulation and modeling, and many others. A huge amount of small units of work (tens of bytes each) is ideal.
Oftentimes the algorithm has to be replaced to be data-parallel. For example, for sorting, quicksort can be replaced with bitonic sort.
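Bitonic sort is attractive for the GPU because its compare-and-swap pattern is fixed in advance and independent of the data, so every pass of the inner loop could run as one thread per element. Here is a minimal serial sketch of the sorting network (my own illustration, not code from the session), assuming the input length is a power of two:

```python
def bitonic_sort(values):
    """Sort a sequence whose length is a power of two with a bitonic network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            # Every iteration of this loop is independent of the others,
            # so on a GPU it would map to one thread per index.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data

result = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
```

The algorithm performs more comparisons than quicksort, but they come in large batches with no data-dependent branching, which suits the GPU far better.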
A sample app that performs an N-body simulation of particles shows almost 750GFlops on the GPU and hardly 25GFlops on the CPU. Of course, when the GPU is working, the CPU is free for other tasks.
DirectCompute is essentially a low-level computer language for programming the GPU. All DirectX11 chips will support DirectCompute, and some DirectX10 chips already support it.
To use DirectCompute: Initialize, create some GPU code in an .hlsl file, compile it using the DirectX compiler, load the code onto the GPU, set up a GPU buffer and set up a view into it, make that data view current, execute the code on the GPU, copy the data back to CPU memory. (It’s recommended, of course, to execute lots of code on the GPU and keep buffers there instead of copying them around every time to the CPU. Chunky interfaces.)
The HLSL language syntax is rather similar to C/C++, there are preprocessor defines, basic types, operators, variables, functions, and also a lot of intrinsics, NO POINTERS. The language is compiled to an intermediate language which is then consumed by the hardware driver.
DirectCompute is part of DirectX11 which ships with Windows 7, and also available as part of a Platform Update for Windows Vista. (The DirectX SDK can be installed on either OS.)
[Here’s a link to a tutorial on the DirectCompute framework that you might find useful.]
Pedro Teixeira is going to talk about processes and threads in systems with more than 64 logical processors as well as user-mode scheduling.
Surprisingly for some people, NUMA is not an esoteric hardware architecture. Even high-end gaming rigs today are NUMA; Pedro is going to use a machine loaned by HP that has 256 processors with 1TB of physical memory.

Processor Groups
Adding support for more than 64 logical processors required a breaking app compat change, because CPU affinity masks were represented in Windows by a 64-bit bitmask. Therefore, CPUs are now addressed by 64-processor groups and there’s a need for new APIs that interact with these groups. Currently there’s a maximum of four groups, but this is an artificial limitation. (Processor groups are only available on 64-bit platforms.)
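A small sketch shows why a single mask stops being enough and what a (group, mask) pair buys you. The flat processor numbering and the simple division by 64 are assumptions for illustration only; Windows assigns processors to groups by proximity, and the `group_affinity` helper is hypothetical rather than a Windows API.

```python
GROUP_SIZE = 64  # one bit per logical processor in a 64-bit mask

def group_affinity(cpus):
    """Map flat CPU numbers to {group: bitmask within that group}."""
    masks = {}
    for cpu in cpus:
        group, index = divmod(cpu, GROUP_SIZE)
        masks[group] = masks.get(group, 0) | (1 << index)
    return masks

# CPU 200 cannot be expressed in a single 64-bit mask, but it is
# simply bit 8 of group 3 (200 = 3 * 64 + 8).
masks = group_affinity([0, 63, 64, 200])
```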
Groups are distributed by proximity – all logical processors in a core (hyper-threading), all cores in a socket, closer sockets on the board, NUMA nodes, and closer NUMA nodes. This guarantees that within one CPU group the computational resources are as close as possible.
Windows actually tests the distances between NUMA nodes in spite of what the hardware reports, to determine the true proximity metric for processor groups.
Each thread can belong only to one CPU group. Threads can be assigned to groups using CreateRemoteThreadEx with a special thread initialization attribute, or using SetThreadGroupAffinity. Processes are assigned to groups in round-robin, and the threads inherit the group from the parent thread and from their process.
NUMA and Proximity Information
There are also new APIs that can be used to extract the system topology information with relationships between cores, NUMA nodes, and so forth. As part of this topology, information about memory technology and device location is also exposed – e.g. physical devices and network storage connected to a NUMA node directly. (So it’s not just about memory anymore, and it’s not surprising considering DMA must be performed from these devices. NUEA – non-uniform everything access :-))
Future hardware allows for load-balancing of NUMA nodes within the same box – e.g. by having two NICs with the same IP address, connected to separate NUMA nodes, that can use sophisticated rules for load-balancing of requests to the nodes. More new APIs allow extracting the NUMA node information from a socket or a file handle, and more.
[None of the demos worked during the presentation because the network connection to the monster HP machine was very flaky. Hopefully we’ll be able to get the demos after the talk.]
User-Mode Scheduling
UMS enables developers to write their own implementation of scheduling in user-mode, without going into the kernel, including synchronization primitives (which ConcRT implements). If one thread happens to use a kernel synchronization primitive, the user-mode scheduler still gets a chance to reschedule another task on that thread.
UMS is nothing like fibers – fibers allow for switching between contexts, but threads are much more than that (e.g. impersonation information), and UMS allows you to switch threads and not just contexts.
UMS is about splitting threads into a user part and a kernel part so that they can be switched separately (lazily switched). In user-mode, it’s possible to switch the user part of the thread (registers, stack and meta-information) back and forth without going into the kernel, even while running on the “wrong” kernel thread, because the kernel information is not required as long as the thread runs in user-mode. When going into the kernel, UMS makes sure that the thread parts match (a thread switch).

In UMS there is a scheduler thread to represent each core, and you would usually create one scheduler thread per core – these decide which thread to run next. When creating a user scheduler, you provide a callback function that is called when necessary (a “message loop”). The scheduler decides what to do next by executing a worker thread – from a kernel perspective, these are full threads, not fibers. The worker threads are pushed by the kernel into a queue called the UMS completion list. (You can have as many UMS completion lists as you’d like – each UMS scheduler can create a completion list or share them, and so forth.)
The first thing the scheduler should do is pop worker threads from the UMS completion list and put them in its own ready list, perform scheduling, and execute the chosen worker thread.
A UMS thread can yield the processor, passing control to the scheduler. Yielding is a means for implementing arbitrary synchronization mechanisms in user-mode (e.g. an event wait in a library could mean storing away some contextual wait information and then yielding – the scheduler will be responsible for not scheduling the thread until the wait condition occurs).
When a worker thread performs a kernel call and blocks, the scheduler thread is again called with a notification that the thread was blocked. It can select another UMS thread and run it. (It’s the same yield as in the previous paragraph, but a kernel yield and not a user yield.) When the worker thread unblocks, it’s put in the UMS completion list and the scheduler is now responsible for looking at the UMS completion list when it’s next invoked and moving the threads from there to its private ready list.
If there’s no work in the ready list: The function for popping from the UMS completion list can block (with a timeout parameter), and you can also obtain an event that will be signaled when there’s something in the UMS completion list. This enables the scheduler to wait for work to be scheduled. However: Never return from the scheduler function!!! (When you return from the scheduler function, you’re back to the normal thread that you converted to the scheduler thread.)
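The interplay between the completion list, the ready list, yielding and blocking is easier to follow in a toy model. This is a pure simulation and has nothing to do with the actual Win32 UMS APIs: worker "threads" are Python generators, the "kernel" is a few lines inside the loop, and every name here is hypothetical.

```python
from collections import deque

def run_scheduler(workers):
    """Simulate one UMS-style scheduler; workers is a list of (name, generator)."""
    completion_list = deque(workers)  # filled by the simulated "kernel"
    ready_list = deque()              # private to the scheduler
    blocked = []                      # (wake_tick, worker) pairs
    log = []
    tick = 0
    while completion_list or ready_list or blocked:
        # Simulated kernel: workers whose wait is over go to the completion list.
        for entry in [b for b in blocked if b[0] <= tick]:
            blocked.remove(entry)
            completion_list.append(entry[1])
        # Scheduler: first drain the completion list into the ready list.
        while completion_list:
            ready_list.append(completion_list.popleft())
        if ready_list:
            worker = ready_list.popleft()
            name, body = worker
            try:
                event = next(body)    # run the worker until it gives up the core
            except StopIteration:
                log.append((name, "done"))
            else:
                if event == "yield":  # user-mode yield: immediately ready again
                    log.append((name, "yield"))
                    ready_list.append(worker)
                else:                 # ("block", ticks): a blocking kernel call
                    log.append((name, "block"))
                    blocked.append((tick + event[1], worker))
        tick += 1
    return log

def worker_a():
    yield "yield"
    yield ("block", 2)

def worker_b():
    yield "yield"

log = run_scheduler([("A", worker_a()), ("B", worker_b())])
```

Note how worker B gets to run while A is blocked, and how A only becomes schedulable again after passing through the completion list.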
There’s no support for preemption at the moment, so it’s impossible to implement a quantum-style scheduler. It’s possible to implement something like this using a watchdog thread that would call SuspendThread – that blocks and puts the thread in the UMS completion list.
[All in all, UMS is a very interesting mechanism. Alon and I agreed to write a sample together that shows off the basic features of UMS. Hopefully we’ll be able to publish it soon.]
Dana Groff, Senior Program Manager on the ConcRT team, is going to talk about the new Concurrency Runtime – an abstraction on top of the underlying operating system, supported from Windows XP through Windows Server 2008 R2.

The ConcRT Resource Manager is an abstraction over the hardware that allows vendors like Microsoft and Intel (OpenMP, TBB) to program at a higher layer and compose these platforms, as well as coming up with one set of concepts for providing parallel code such as tasks, task groups and so forth.
Dana uses a high-end AMD server with 48 cores (eight six-core processors, with eight NUMA nodes - the HP ProLiant DL785 G6). He uses it to scale a raw image processing application, and it scales almost linearly. When all the processors are working, the machine also generates a lot of noise that can be heard throughout the room :-)

We think serially, and what we have to do is to start decomposing our algorithms into tasks, so that each task separately runs serially. Then we group these tasks – which ones depend on each other or work on the same data. Finally, we have to schedule these task groups together. What the runtime does is to cooperatively handle blocking events, dependencies and other things explicitly stated in the code.
A task can be a lambda, can be a pointer to a function, a function object – it’s a unit of work that has to run serially. Under the covers there are task_handle and lightweight tasks. The latter are used by agents – they are fire and forget; the former can be controlled and managed in a richer way, and they are cancelable.
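ConcRT is a C++ library, so the following is only an analogy: a sketch of the same decompose, group and wait pattern using Python's standard `concurrent.futures` module. The chunking scheme and the function names are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    """One task: a unit of work that runs serially."""
    return sum(x * x for x in chunk)

def run_task_group(data, chunk_size=4):
    """Decompose the work into tasks, run them as a group, and wait for all."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        # pool.map schedules one task per chunk and yields results in order;
        # leaving the with-block waits for the whole group to finish.
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)

total = run_task_group(list(range(10)))
```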
When tasks are blocked, a notification is given to the runtime and the runtime decides that it can use that core for another task until it completes or blocks. When tasks unblock, the ConcRT scheduler prefers putting tasks on the same core where they ran before blocking.
Standard Threads and UMS
When using standard Windows threads, applications must use synchronization mechanisms provided by ConcRT to take advantage of cooperative scheduling. For example, there is an abstraction of an event that works differently from the Win32 event. When using Win32 synchronization mechanisms, there’s no way for ConcRT to know that its threads can run other tasks; when using ConcRT synchronization mechanisms, the blocking and switching is handled by ConcRT.
Some of the cooperative blocking mechanisms include wait for a task group, an event, critical sections, RWLs, receives and waits in the agent library, and others.
When running on UMS threads (on Windows 7 64-bit or Windows Server 2008 R2), user-mode scheduling allows ConcRT to be aware of kernel blocking, even when using Win32 synchronization mechanisms. When a thread enters the kernel performing a wait, ConcRT is notified and can schedule some other user-mode thread on the same thread. UMS threads look exactly like regular threads, which is a great advantage compared to fibers.
The user-mode context of the UMS thread is handled by ConcRT; the kernel facility communicates with ConcRT when there’s need for rescheduling. ConcRT tries as hard as possible to reschedule the same user-mode context on the same core (or at least on the same NUMA node), if possible.
The promise of UMS is that you can sometimes afford to be sloppy about not creating enough threads or about oversubscribing some resources, and that it brings a performance win of up to 8% in some scheduling scenarios.
Another great advantage of ConcRT is that it seamlessly takes advantage of the 256-processor support in Windows 7 and Windows Server 2008 R2. ConcRT has done the proper testing in that kind of environment (with 128- and 256-core machines) so that you don’t have to worry about that abstraction.
[The next talk, Developing Applications for Scale-Up Servers Running Windows Server 2008 R2, will focus more on the UMS and the low-level APIs. Stay tuned.]
Yochay, a good friend and co-author of “Introducing Windows 7 for Developers” and of the “Windows 7 Taskbar APIs” MSDN Magazine article, is delivering a presentation on the Windows API Code Pack. (Which is a library of managed APIs to interact with Vista and Windows 7 features that are otherwise accessible only from native code through COM and Win32 APIs.)
This library replaces many of the sample managed integration libraries that our team at Sela developed for the Windows 7 Metro Training, such as the Taskbar integration library, the Sensor and Location integration library, and many others. It’s great to see these technologies converging into an almost-supported library, and especially great to see some of them emerge in .NET 4.0 (Location, taskbar, multitouch and other features).
Yochay announced the latest version of the Windows API Code Pack – version 1.0.1, which mainly offers bug fixes and performance improvements, as well as many new demos.
Next, he demonstrated an application called Fishbowl, which is the WPF version of the Facebook Silverlight application featured in Scott Guthrie’s keynote earlier today. It’s a full client application, so it can take advantage of the new Windows 7 features. It has a jump list with the latest notifications when the app is open or some launch tasks when the app is closed, and a thumbnail toolbar. Another example of an application taking advantage of Windows 7 is the Amazon Kindle for PC, which has a smart jump list with recent books and some tasks, as well as proper multitouch support.
Then, Yochay shows the XP to Windows 7 reference application called “PhotoView” which we at Sela developed. It’s an application that demos all the features of Vista and Windows 7 and still supports Windows XP with a smart plugin architecture. Yochay showed sensor integration in the app, which responds to ambient light changes by changing the color intensity of the picture displayed.
Throughout the session, Yochay proceeds to show us lots of code from the Windows API Code Pack itself as well as various demos that ship with it or that were built during the past few months.
If you’re not writing Windows 7 applications today, you should start thinking about giving yourself a competitive advantage. If you’re not using the Windows API Code Pack, go ahead and download it today, and use the Windows 7 Training Kit to bring yourself up to par with all its features.
Burton Smith’s session on the state of parallel programming was standing-room only – I’m sitting on the floor with some chairs blocking my view of the presentation :-)
Generally, Burton Smith lays out a theory of parallel programming that I tried to cover in the notes below.
Imperative languages prefer putting values in variables, and they are prone to data races, which are rather hard to detect considering the number of possible paths.
Pure functional languages avoid variables – they compute new constants from their old values and provide means for efficient reclamation of old constants. Consider Excel, for example, where cells can be calculated from other cells, and so on.
No variables implies no mutable state at all. Unfortunately, mutable state is crucial for efficiency – especially to prevent always copying all the data whenever it’s passed around. Monads provide mutable state (and input/output operations) to pure functional languages, but they are not inherently parallel.
The fundamental problem is maintaining invariants of state – conservation laws of one’s program (e.g. whenever a reference points to something, the something points back to the original reference – like in a linked list; or a graph that must be acyclic, and so forth). These properties, these invariants, are crucial for a program’s correct execution with parallelism. We have to invent something that describes these properties, because we’re not used to declaring them.
Where do invariants come from? Well, we can sometimes generate invariants from code. At times we can also generate code from invariants – here’s the invariant and the termination condition, generate the code for me please :-) We can also write invariants and code separately and have the compiler or runtime check their preservation (as with Code Contracts for .NET and the C++ annotation language used by the secure CRT). Finally, we can add to the languages a capacity to make sure that the transaction covers the invariant’s domain. What this means for debugging is that there’s a need for programs that check the object’s invariants, so that we can at least check when invariants fail.
Updates performed to object state perturb but then restore an invariant, and this is what proofs of correctness of programs are based on. For parallelism, updates must be isolated to make sure they do not interfere with one another. Updates must also be atomic. Both of these properties (atomic and isolated) bring transactional semantics to our vocabulary.
Invariants provide a general commutativity property. If p and q preserve an invariant I, and p and q don’t interfere with each other, then the parallel execution {p || q} also preserves I. The definition of “don’t interfere” with each other is more complex, and it’s possible to prove that if they execute in isolation and atomically (as transactions), then they do not interfere. Even if the operations do not commute with respect to the state of the object, they do commute (can be reordered) with respect to the invariant (this is not absolute commutativity). This leads to a weaker form of determinism called “good non-determinism”, and something like an operating system cannot allow a stronger form of determinism without losing a great deal of performance.
Assume we want to implement a hash table. The invariant is that an item is in the set iff its latest insertion followed all of its removals. There are also some invariants on the structure of the actual buckets in the hash table. Parallel insertions and removals need only maintain the conjunction of these invariants, and do not require deterministic state. (E.g. the order of items in a bucket is not specified and we have no interest in it.)
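Here is a small sketch of that idea: two insertion orders produce different internal states, yet both satisfy the invariant and answer membership queries identically. The `BucketSet` class and its `check_invariant` method are my own illustration, not something shown in the talk, and the example runs serially just to show the two possible outcomes.

```python
class BucketSet:
    """A tiny chained hash set with an explicit invariant check."""

    def __init__(self, bucket_count=4):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, item):
        return self.buckets[hash(item) % len(self.buckets)]

    def insert(self, item):
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)

    def remove(self, item):
        bucket = self._bucket(item)
        if item in bucket:
            bucket.remove(item)

    def __contains__(self, item):
        return item in self._bucket(item)

    def check_invariant(self):
        """No duplicates, and every item sits in the bucket its hash selects."""
        for index, bucket in enumerate(self.buckets):
            if len(set(bucket)) != len(bucket):
                return False
            if any(hash(item) % len(self.buckets) != index for item in bucket):
                return False
        return True

# 1 and 5 land in the same bucket (both are 1 modulo 4).
first, second = BucketSet(), BucketSet()
first.insert(1); first.insert(5)
second.insert(5); second.insert(1)
```

The bucket contents differ (`[1, 5]` versus `[5, 1]`), so the operations do not commute with respect to state, but they do commute with respect to the invariant.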
If we isolate some loads and stores so that they are also atomic, we might still have a situation where they cover only a part of an invariant (e.g. copying around more data than the system’s data bus can perform in a single bus transfer). The domain of the invariant is not necessarily covered by the individual transactions. (E.g. a bank account transfer can be protected separately on each account, but that wouldn’t preserve the bigger, more important, invariant.)
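The bank example can be sketched as follows. The invariant is that the sum of all balances never changes, so a transfer has to hold the locks of both accounts for its whole duration, and an audit has to hold all of them. Acquiring locks in a fixed (sorted) order is one common way to avoid deadlock; the `Bank` class is hypothetical and purely illustrative.

```python
import threading

class Bank:
    """Accounts with per-account locks; invariant: the total never changes."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.locks = {name: threading.Lock() for name in balances}

    def transfer(self, src, dst, amount):
        if src == dst:
            return
        # The transaction covers the invariant's whole domain (both accounts),
        # and locks are always taken in sorted order to prevent deadlock.
        first, second = sorted((src, dst))
        with self.locks[first], self.locks[second]:
            self.balances[src] -= amount
            self.balances[dst] += amount

    def total(self):
        ordered = sorted(self.locks)
        for name in ordered:
            self.locks[name].acquire()
        try:
            return sum(self.balances.values())
        finally:
            for name in reversed(ordered):
                self.locks[name].release()
```

Locking each account only around its own update would keep every balance internally consistent, but an audit running between the withdrawal and the deposit could observe money that has temporarily vanished.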
SQL transactions achieve consistency via atomicity and isolation, but they are not general-purpose; operating systems use locks for isolation and perform lock ordering to prevent deadlock. There’s a need for a more general-purpose parallel language that could handle both types of applications.
Implementing isolation means one of the following techniques or a nested combination of several of them: Proving that concurrent state updates are isolated; Locking – where deadlock has to be handled somehow; Buffering updates; Partitioning information (e.g. quicksort); Serializing execution and not allowing parallelism (e.g. COM STA).
Some existing languages provide isolation features – MPI, Erlang and others. Some allow only one thread per address space, some allow only serial execution and lock changes, and there are additional approaches.
Atomicity means “all or nothing”, which requires undo in many scenarios. Isolation without atomicity is not very interesting, but atomicity is vital even in serial execution because of exceptions. The obvious implementation techniques are explicit compensation and logging (restore), and both are challenging for distributed computing and for input/output that often can’t be compensated or restored (non-repeatable, non-testable, non-compensatable actions).
Exceptions threaten atomicity because they require undo in the middle of an aborted state update.
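A minimal sketch of the logging (restore) technique, assuming the state is a plain dictionary: every update records the old value in an undo log, and an exception in the middle rolls everything back before propagating. The `atomic_update` helper is hypothetical.

```python
def atomic_update(state, updates):
    """Apply (key, function) updates to state: all of them, or none."""
    undo_log = []
    try:
        for key, compute in updates:
            # Remember whether the key existed and what it held.
            undo_log.append((key, key in state, state.get(key)))
            state[key] = compute(state.get(key))
    except Exception:
        # Restore in reverse order, then let the exception propagate.
        for key, existed, old_value in reversed(undo_log):
            if existed:
                state[key] = old_value
            else:
                state.pop(key, None)
        raise

state = {"x": 1}
try:
    atomic_update(state, [("x", lambda v: v + 1), ("y", lambda v: 1 / 0)])
except ZeroDivisionError:
    pass  # the failed transaction left no trace in state
```

This works for in-memory state only; as noted above, input/output that has already happened cannot be undone this way.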
Transactional memory means transaction semantics on a language or framework level, and there’s obviously a lot of optimization to do because atomicity and isolation require a huge amount of CPU cycles. There are of course traditional ways to achieve transactional semantics (via compensation, implicit undo, and so forth).
Burton Smith’s opinion is that pure functional languages with transactions are the way to go with regard to parallel programming, and this is the direction Microsoft is heading. Implementations of isolation and atomicity are important and must be optimized for efficiency, and hopefully the hardware architecture will support these things in the future.
Additionally, the von Neumann programming model needs to be replaced – putting band-aids on issues such as coherent cache and memory access is definitely not going to bring us forward to the next generation of parallel programming.
As I wrote a few hours ago, every PDC attendee got himself a nice little Acer Aspire 1420P laptop (by the way, kudos to the conference organizers – I went to pick up the laptop during the 30-minute break at 12:30PM, and the queue was very long but also very quick – in less than 10 minutes I was holding the laptop in my hands). This laptop comes preinstalled with Windows 7 Ultimate 64-bit edition, Office 2010 Beta, Windows Live Essentials and Virtual PC (XP Mode) for Windows 7, and the Windows 7 Touch Pack which is a bundle of really neat applications demonstrating multitouch capabilities.
The machine itself sports a rather weak Intel Celeron U2300 CPU, running two cores at 1.2GHz, 2GB of DDR3 1066MHz RAM (upgradeable to 8GB), a 250GB HD, a multitouch tablet display with two touch points, HDMI and VGA output, WWAN support, an accelerometer sensor for automatic display re-orientation, and a few other perks.
It scores only 3.2 on the Windows Experience Index, due to the comparatively weak graphics adapter. The CPU scores a 3.9, memory 4.7, and the hard disk scores a surprising 5.6.
The 11.6” screen looks very crisp and bright with the native resolution of 1366x768. The touch display is not amazing but it’s somewhat more pleasant to use than the N-Trig display I have on my multitouch laptop. The digitizer, on the other hand, is fairly primitive and doesn’t have any buttons (such as an eraser function) that are pretty common on other tablets.
Tomorrow I’m going to try using it exclusively during the day, without resorting to my well-tested friend, the Dell Latitude XT. I’ll let you know if its battery is going to be up to this task. (Apparently it should last 8 hours on the Power Saver plan. Wow.)
If you’re looking for a slightly larger multitouch companion and are at the PDC 2009, go ahead and hurry to the Sela Group booth at the Partner Expo area. There’s a 25.5” HP Touchsmart all-in-one PC with multitouch support that will be drawn tomorrow from a pool of all attendees that come to swipe their badge at the booth. You can also take a look at our book (in print), hear about the XP to Windows 7 reference application, ask around about Silverlight 4 innovation, and let us know what you think.
If you missed the PDC but are going to be in Israel in December, stay tuned for the Sela Developer Practice (SDP) where we are going to repeat many of the PDC talks (often in more depth) for the Israeli developer community.
Patrick Dussud described the variants of garbage collection that exist in the world today – specifically, reference counting and tracing GC.
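As a rough illustration of the difference between the two families (my own toy model, not something shown in the session), the following Python sketch represents a heap as a dictionary from object names to the names they reference. The function names and the example heap are hypothetical. It shows the classic weakness of plain reference counting: a cycle of unreachable objects is never reclaimed, while a tracing collector finds it.

```python
from collections import Counter


def refcount_garbage(heap, roots):
    """Objects a plain reference-counting scheme would free."""
    counts = Counter(roots)
    for targets in heap.values():
        counts.update(targets)
    freed = set()
    worklist = [name for name in heap if counts[name] == 0]
    while worklist:
        name = worklist.pop()
        freed.add(name)
        # Freeing an object drops the references it was holding.
        for target in heap[name]:
            counts[target] -= 1
            if counts[target] == 0:
                worklist.append(target)
    return freed


def tracing_garbage(heap, roots):
    """Objects a tracing collector would free: everything not reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in marked:
            continue
        marked.add(name)
        stack.extend(heap[name])
    return set(heap) - marked


# "c" and "d" reference each other but are unreachable; "e" is simply unreferenced.
heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"], "e": ["b"]}
roots = ["a"]
```

With this heap, reference counting frees only `e`, while tracing also reclaims the `c`/`d` cycle.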
GCs are measured by the speed of the allocator, the GC time overhead, the pause time (latency), the working set, and multi-core scaling.
Patrick described the general architecture of the GC, and said that he wants to focus on GC policies and mechanisms and not on the actual implementation of root scanning, thread suspension, virtual memory management and similar traits.
Evolution of GC
GC v1 – workstation and server GC, concurrent GC, scalable GC for servers. Assumption is that on workstation GC there are few allocations but latency is critical; on servers there’s uniform concurrent workload and latency is tolerable.
GC v2 – no significant changes to policies or mechanisms, only implementation.
GC v3 – low latency mode API, GC notifications API for full GCs.
GC v4 – on workstation: background GC allows gen0 and gen1 GCs while in the middle of a concurrent gen2 GC (two collections occurring at the same time). The assumption is that allocation activity is moderate. The effect is reduced pause time in full GC for all workloads.
Future Trends
There are more cores, more memory (especially considering 64-bit operating systems becoming more prevalent, so address spaces grow), and virtualization.
As far as the speed of the allocation and GC time are concerned, there are no worries with these trends. Pause time for large heaps, however, is very bad; and so is the working set for a given workload.
Background GC needs more feedback; the same approach can be brought to the server GC flavor.
Real-time (fully concurrent, no-stop) GC is not likely because of the effect of read barriers required to implement it. The performance hit is immense. [See my post on write barriers, which was my first post on this blog!]
The problem with the working set is that the LOH isn’t compacted (to avoid copying large objects), and it seems that some workloads would benefit from LOH compaction.
Background GC isn’t compacting either, and fragmentation occurs. In background GC, if there is a low-memory condition then a full blocking GC occurs. In the future it seems that partial compaction for background GC is likely – pick a badly fragmented area and compact it while doing an ephemeral GC.
Hardware assistance is another option – it was done in the 80s but was too slow (TI Explorer, Symbolics Ivory chips), and now that transistors are cheaper it seems that implementing read barriers (mechanisms to intercept memory dereferences) in hardware would be the most valuable step towards a fully concurrent GC.
Operating system assistance is another direction, and the CLR GC team has cooperated with the Windows team in the past. There is good research on avoiding paged out object scanning, which is a major problem if a full GC occurs and there’s paging. This can be somewhat improved by having more integration with the Windows memory manager so that paging is delayed for GC segments, and enabling better virtualization support, e.g. idle VMs can use less working set. [See my Non Paged CLR Host open source project for more information.]
Some other ideas – tracing GC is worthwhile when the survival rate is low, but in gen2 it’s usually high. Therefore, it would be possible to maintain an (inaccurate) reference count on gen2 objects, which would allow reclaiming some of them without performing a tracing GC (there’s research literature on the subject).
[Shameless plug: If some of the discussion of GC was incomprehensible to you, the .NET Performance course at Sela (that I teach) will give you a comprehensive overview of the .NET tracing GC including all latest features incorporated in CLR 4.0, and a vast amount of additional topics.]
An impressive array of speakers is sitting on the podium in front of us, taking live questions from the audience using Twitter. Here are some of the interesting thoughts provoked by these questions:
Worried About
Some of the things that the speakers are “worried about”: Concurrency, dependability (taking dependencies we don’t understand), developing rich applications easily, providing simple programming solutions while still giving experts the opportunity to “complicate”, and how to make database access viable.
Concurrency and Parallel Programming
How do we do parallel programming simply? One take on it is that LINQ makes it easy to write parallel programs. Not all problems are easily decomposable in a way that allows for parallelism. The industry needs to do for parallelism what was done for GUI and for objects during the past decades with VB 1.0, MFC 1.0 and so on.
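The appeal of the query style is that the per-item work is a pure function, so the runtime is free to spread it across workers. The panel was talking about LINQ; the sketch below is only a Python analogue of that shape using the standard `concurrent.futures` module, with hypothetical function names. In CPython a thread pool does not speed up CPU-bound work, so treat this as an illustration of the programming model rather than of the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    """Pure per-item function: no shared state, so items can run in any order."""
    return len(line.split())


def total_words(lines, max_workers=4):
    """Map the pure function over the input in parallel, then reduce sequentially."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(word_count, lines))
```

The only change from a sequential version is swapping the built-in `map` for `pool.map`; nothing in `word_count` has to know about threads, which is exactly what breaks down once the per-item work touches shared state.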
The biggest problem that remains is what to do about state, as opposed to functional style (like SQL). Dealing with state in a parallel program is the biggest problem.
Transactions
What about transactions? Are transactions in software or database the silver bullet that solves the problem with concurrency? Transactions are great when they work – they are like fairy dust. STM is another direction that has to be addressed in somewhat untraditional ways. Even if there are no transactions in the language for you, make sure every method on an object takes the object from a consistent state to another consistent state, even in the face of exceptions – which is another hard thing to do.
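One common way to follow that advice without language-level transactions is to compute the new state on the side and commit it in a single step at the end. The following Python sketch is my own illustration of the idea, not something presented by the panel; the `Inventory` class and its invariant are hypothetical.

```python
class Inventory:
    """Invariant: every quantity in the stock is non-negative."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def remove_all(self, order):
        # Work on a private copy, so a failure halfway leaves self._stock untouched.
        updated = dict(self._stock)
        for item, quantity in order.items():
            remaining = updated.get(item, 0) - quantity
            if remaining < 0:
                raise ValueError(f"not enough {item}")
            updated[item] = remaining
        # Commit: a single reference assignment, performed only if nothing failed.
        self._stock = updated

    def quantity(self, item):
        return self._stock.get(item, 0)
```

If the second item of an order cannot be fulfilled, the exception leaves the inventory exactly as it was before the call, including the first item that had already been processed in the copy.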
Software tends to work when it works, and fail when it fails.
Don Box says that there will be support for optimized transactions with the in-memory coordinator and SQL Server, so if the transaction is only between these two RMs, there will be no distributed transaction (no two-phase commit, etc.). This is good news for software transactional memory.
Another observation: Transactions must be small. Big transactions don’t work. (A counterpoint from the panel: big transactions can work too, but they require nesting and complexity management among multiple transactions.)
Type Safety
For Herb Sutter, type safety would mean that an object can be used if all its invariants are met. (E.g. its constructor has run and completed, and its destructor or finalizer has not started to run yet.) Don’t rely on the type system for everything.
Herb Sutter: You can design a language without null pointers, but oddly you still have a NullPointerException!
For the question of whether you can live without pointers – well, even if you don’t have them, there is great (irreplaceable) value in having the semantics of a thing and a reference to the thing which is not the thing itself. It’s next to impossible to design a useful language that has only value semantics, and no references.
Boundaries
How should we treat distributed problems – make boundaries clear or hide them implicitly? Differences that don’t matter should be hidden; differences that matter should be explicit in the developer’s and user’s face. This applies to all boundary transitions. On the other hand – taking this to the extreme, the path between the cache and the RAM should also be made explicit.
Computation Diversity
How do we leverage the GPU? It’s underway, and what seems to happen is that there will be more diversity in the types of processing units, especially seeing that processing-power-per-watt considerations become more prominent. [See Multikernel about an OS targeting diverse types of processing units.]
With lots of variety in the hardware, languages, programming software – the interesting question becomes how to glue these components together, and glue becomes the most important factor. The answer is not any particular runtime.
Herb Sutter: No one wants multicore! We would like faster chips every year but we can’t get them anymore, and therefore we have to work hard to parallelize applications. [See “The Free Lunch Is Over”.]
Modeling
Is modeling the answer, so that we can stop programming? We’ve seen a trend of extracting parts of the program specification to lots of external files (be it XML, DB, .ini files). Representation of the intent (instead of programming, implementing this intent) is meant to be easier. [Is it really?]
Also, if you have the model – you can reason against the model instead of against the program. On the other hand, static analysis (such as compilation – AST construction) is an example of reasoning against a program, which is perfectly viable.
Modeling allows for proving theorems about programs. Extracting a model from a program is one thing, but having a declarative basis for the truths of a program (such as invariants) such that they can undergo formal verification is a whole different thing. And when faced with formal verification of programs and theorem proving, the question has to be: where does the theorem come from? (The specification can be harder than the program to write. Right now we don’t have declarative means to specify the theorems that we want to prove about the program.)
Miscellaneous
Stanley Lipman: There are many pages on amazon.com that don’t work quite well or serve incorrect information. But it doesn’t matter if they serve 5% incorrect information. The only page that absolutely has to be right is the big orange Order button!
Herb Sutter, while answering a question about garbage collection in C++: “… what was the question again?” The panel host: “His memory has been garbage collected!”
Graphical programming environments are useful when they are useless (five things on the screen, demos) but useless when they need to be usable (five hundred things on the screen). Don Box: “You will have to peel the text editor out of my cold dead fingers.”
This doesn’t mean we have to store programs as text, but we can use a more structured way. [See Kirill Osenkov’s structured program editor.]
All in All
The concluding question: What will we talk about in 5-10 years? Some answers: Parallel programming, safety, composability, reliability. Another answer: What happens when Moore’s Law ends? After all, you can’t go exponential forever, and then optimization will become very, very sexy again.
All in all, there was fairly little discussion of the FUTURE, which was the title of the talk. Nonetheless, there was some interesting insight.
[Also see part 1 – keynote by Steven Sinofsky; and part 2 – keynote by Scott Guthrie.]
The final part of the keynote was delivered by Kurt DelBene, Senior Vice President, Microsoft Corp. He talked about Office and SharePoint 2010.
A Technical Preview of Office 2010 was released several months ago (in fact, I’m using it on two of my computers already and am very happy with it, including the x64 version).
As of the last two minutes, Office 2010 Beta and SharePoint 2010 Beta are available for download, as well as Windows Mobile 6.5 Office 2010 Beta clients!
Another exciting (for some people :-)) piece of news is that SharePoint 2010 is finally supported on client OSs, so there’s no need to install a server VM to develop for SharePoint. There’s also support for sandboxed solutions, which can be deployed without even being an administrator on the server.
There were also additional features demonstrated, but despite my best efforts I simply can’t get myself to focus on SharePoint and Office development :-)
[Also see the first part – the keynote by Steven Sinofsky.]
The second part of the keynote was delivered by Scott Guthrie, Corporate Vice President, Developer Division, Microsoft Corp.
Scott started by talking about Silverlight 3 and its new features that shipped several months ago. Specifically, Scott discussed the SketchFlow tool for quick UI prototyping. He also mentioned the success of Silverlight – it’s being used on more and more sites around the world, as well as in demanding enterprise environments. Today Silverlight is installed on 45% of the Internet-connected devices in the world (the number was 33% in the summer).
Scott announced Silverlight 4 with three foci: Media, Business Applications, and Out-of-Browser Experiences. The rest of the keynote is going to focus on Silverlight 4 features.
Silverlight 4 allows access to webcams and microphones on the machine, enables multicast streaming, supports output protection as well as offline DRM support for premium content.
For webcams, there is support for some neat pixel shader effects on the video stream. Scott demoed a simple app with these effects as well as a barcode scanner using an open source library for barcode reading.
IIS Smooth Streaming works by analyzing the network and CPU conditions and adaptively changing the bitrate by scaling down if necessary. The iPhone does not directly support Smooth Streaming today, but there is an extension for IIS Manager that enables iPhone output with Smooth Streaming. Unfortunately the demo didn’t work :-) It’s available online though at http://iis.net/iphone.
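The adaptive part can be pictured as repeatedly choosing a rung on a ladder of pre-encoded bitrates. The Python sketch below is purely illustrative and is not the actual Smooth Streaming heuristic: the `pick_bitrate` function, the safety margin and the example ladder are all assumptions of mine.

```python
def pick_bitrate(ladder_kbps, measured_kbps, cpu_overloaded=False, safety=0.8):
    """Choose the highest encoded bitrate that fits the measured bandwidth.

    ladder_kbps    -- available encodings of the same content (hypothetical values)
    measured_kbps  -- recent network throughput estimate
    cpu_overloaded -- True if the client is dropping frames while decoding
    safety         -- fraction of the measured bandwidth we are willing to use
    """
    ladder = sorted(ladder_kbps)
    budget = measured_kbps * safety
    affordable = [rate for rate in ladder if rate <= budget]
    # Fall back to the lowest rung if even that exceeds the budget.
    choice = affordable[-1] if affordable else ladder[0]
    if cpu_overloaded:
        # Scale down one step when the CPU, not the network, is the bottleneck.
        index = ladder.index(choice)
        choice = ladder[max(index - 1, 0)]
    return choice
```

For example, with a ladder of 350, 700, 1500 and 3000 kbps and a measured throughput of 2000 kbps, this sketch selects the 1500 kbps encoding, and steps down to 700 kbps if the client reports that it cannot keep up with decoding.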
Other new features for application development include printing support, rich text editing, clipboard access, right-click support for context menus, and mouse wheeling for all the standard Silverlight controls (it’s funny that they have to add these things and everyone is excited about it – that’s what happens when the first releases are weak on features…). There’s also implicit styling support, drag and drop support, bidi and RTL support, hosting HTML controls in the app, commanding and MVVM, and of course additional controls and extensions for existing controls.
It’s possible to share assemblies across Silverlight and .NET 4.0 – no need to compile twice, which is great for common business logic to share across the client and the server, or for unit tests. There are some data binding improvements, UDP multicast support, REST (with ADO.NET Data Services) and WCF enhancements and improvements, and finally WCF RIA Services.
There are also enhancements in Visual Studio 2010 support for Silverlight, with a WYSIWYG design surface, XAML intellisense improvements, data binding, layout, styles and other features. Scott Hanselman demoed these improvements in VS10. The DataGrid demo that Scott showed – with dragging a data source to the design surface – reminded me of the first days of WinForms 1.0. It’s about time Silverlight got on par with true UI development frameworks. :-)
OOB support contains windowing APIs, notification popups, HTML support, and the OOB application can be a drop target even if it’s sandboxed. It’s also possible to build trusted applications that run outside the sandbox (this is called “elevated trust” but it’s not about running as admin, it’s just running outside of IE Protected Mode, which is low IL). For trusted applications there is additional support for custom window chrome, local file system access, cross-site network support, keyboard support in full screen mode, hardware device access, and the ability to access COM objects (through IDispatch only, not custom interfaces). This is done through the dynamic support in the current wave of languages (e.g. C# 4.0).
Finally, there was a significant focus on performance in Silverlight 4 – with the goal of being twice as fast, delivering 30% faster startup times and profiling support for Silverlight applications in Visual Studio 2010 and other profilers. It still remains a 10 second installation even though a lot of features were added.
The roadmap for Silverlight 4 is: There will be a Beta, an RC, and an RTM. The Beta will be feature-complete. The Silverlight 4 Beta is now available for download from http://silverlight.net. The final release will ship in H1 2010.
My big question remains: What’s to become of WPF with these extensions for Silverlight?…
The second keynote started with Steven Sinofsky, President, Windows and Windows Live Division, Microsoft Corp.
Steven is talking about the development process of Windows 7 and what it means to develop for Windows 7. He’s also going to say a few words about what we’re going to see going forward.
In the Windows 7 engineering process, Steven emphasized the new innovative features and said that they’ve learned that they need to balance the new features with fundamental improvements to the system, exactly as in Windows 7. The E7 Blog fostered a great dialog about the nuances of developing Windows; on the other hand, the hardware, drivers, application compatibility – ecosystem readiness – were in a significantly better state than in earlier releases of Windows. The major milestones (M3, Beta, RC) were ready for testing and real work – these were not just preview releases. [Even though I know first-hand that there were still significant changes in the product, especially between Beta and RC.]
Microsoft collected a huge amount of telemetry from Windows 7. Some examples of the information collected were comments from the “Send Feedback” button in the caption of each window; information about devices plugged in and drivers loaded; reliability diagnostics; usage statistics for buttons, keyboard accelerators, sequences of common events and tasks; and of course WER reports. Over 80% of customers voluntarily opted in to sending this information to Microsoft. (Steven emphasized that the information is private, confidential and does not contain any personally identifiable details.)
Some numbers from the period of the pre-RTM versions of Windows 7 – 1.7M “Send Feedback” reports; 91K unique external devices plugged into the system, of them 14K unique printers; 883K unique applications; 8.1M installations, of which 4.3M of the RC; 10.4M WER reports submitted; 4,753 code changes driven by WER; 6K measurement points in Windows 7 for telemetry reports; 900M user sessions – logon/logoff; 514M times the Start menu was clicked; 46M times the Aero Snap and Aero Shake were used (which are new features!!).
Screen resolution information: 55% with 1024x768; less than 0.5% with 1920x1200. This is not intuitive information for developers ;-) [Also see E7 Blog for more numbers.]
Steven also mentioned new features in Windows 7 (see my Windows7 tag) – Ribbon, multitouch, Sensor and Location, jump lists, and many others. There are no new features in Windows 7 announced at this time. :-)
One of the lessons learned is the need to support netbooks, which are the fastest emerging market. (And laptop sales have long ago taken over desktop PC sales.) This changes the hardware requirements for Windows 7 – and there are some changes that application vendors must make. [See Chapter 13 in our book.]
DirectCompute is another technology mentioned – using the power of the GPU even for work that was usually intended for the CPU. GPUs are highly parallel, and I will attend a session on DirectCompute tomorrow.
Steven just announced that Microsoft is giving away the Acer laptop designed especially for the PDC 2009 to all PDC attendees for free! It’s got multitouch, an ambient light sensor, and lots of other features for experiencing Windows 7! This is a really incredible surprise after there were no significant giveaways during the first day of the conference.

Steven wrapped up the talk by discussing how Internet Explorer 9 is going to focus on performance esp. in JavaScript, support HTML 5, and use hardware support for graphics and text acceleration. For example, in IE9 there will be support for D2D and DirectWrite for high-fidelity text rendering (replacing GDI). Of course Direct2D and DirectWrite are only available on Windows 7 (or with an additional installation, on Windows Vista).