Event Tracing for Windows has been with us since Windows 2000. It is an infrastructure for raising events from various system components, and has only been used by a small number of kernel-mode entities. In Windows XP, MOF files (familiar from WMI provider metadata) were used to describe events. Finally, in Windows Vista and Windows Server 2008 events were described by XML manifests, an investment was made in popularizing ETW, and hundreds of new event providers were added.
What kind of information is generated by all these providers? Well, first of all, there’s the Windows Event Log which consumes some of the information generated by ETW providers (but not all). So we get all kind of diagnostic messages on things happening in the system. Another provider is the Performance Monitor, which features the ability to query a collection set of ETW events. Integrating these various sources of information is not an easy task, especially if you are alternating between analyzing a system as a whole and analyzing a set of specific applications within the same trace.
It is this integration that has led to the birth of the Windows Performance Toolkit. It features the data collection and integration tools necessary to interpret and utilize ETW output correctly. It is specifically constructed so that lots of information can be viewed in a coherent fashion, and so that a system-wide image of what’s going on can be obtained. Additionally, through the use of the kernel sampling interrupt, a global sampling profiler is available (including call stack analysis). And best of all? It’s completely free!
So let’s take a look at some of the abilities of this new toolkit. First of all, you need to install it from WHDC. Note that the installation is only supported on Windows Vista SP1 and Windows Server 2008 (I got it to be almost fully functional without Vista SP1, but caveat emptor). Data collection can be performed on a Windows XP SP2 or Windows Server 2003 system (you only need xperf.exe and perfctrl.dll for that), but the trace decoding can only be performed on NT 6.0 and higher.
Let’s start with a sample global data collection. (By the way, if you’re not into my walkthroughs, the toolkit comes with an extensive set of documentation – online and offline. The starter document is 65 pages, which gives you an idea of how big this thing really is.)
I went to an administrative command line prompt on my Vista, navigated to the toolkit’s installation directory (C:\Program Files\Microsoft Windows Performance Toolkit by default) and typed:
xperf -on Base
This enables a set of ETW providers to publish their events (the exact set of providers can be seen by typing xperf -providers). These events are then collected and written to temporary buffers. Note that these temporary buffers might grow very large, so if you plan to perform data collection across a long period of time, you better have some free disk space and lots of memory available on the box.
I then proceeded to write a few lines in this post, open a couple of programs, and then went back to the command line and typed:
xperf -d result.etl
This disables the selected ETW data collection and writes the raw results to the result.etl file (note that this command might take quite some time to complete). These results are, well, raw, so they need to be analyzed soon. We can accomplish this via the command line by launching a set of actions on the output file, or using the GUI Performance Analyzer tool (xperfview) which is part of the toolkit.
An example of what you can accomplish on the command line can be demonstrated by typing the following action sequence:
xperf -i result.etl -o proclist.txt -a process
This generates a list of processes active during the trace into the proclist.txt file. Here’s some of the sample content from my box:
Start Time, End Time, Process, DataPtr, Process Name ( PID), ParentPID, SessionID, UniqueKey MIN, MAX, Process, 0X0000000000168EE0, Idle ( 0), 0, 0, 0x0000f80002150400 MIN, MAX, Process, 0X00000000001693C0, System ( 4), 0, 0, 0x0000fa8001897040 MIN, MAX, Process, 0X00000000001A17A0, svchost.exe ( 356), 708, 0, 0x0000fa8004ce4040 MIN, MAX, Process, 0X0000000000171080, smss.exe ( 520), 4, 0, 0x0000fa800455f610 MIN, MAX, Process, 0X00000000001712B0, csrss.exe ( 616), 604, 0, 0x0000fa8004804c10 MIN, MAX, Process, 0X00000000001904A0, svchost.exe ( 620), 708, 0, 0x0000fa8004d2ac10
Another example of command line info generation would be:
xperf -i result.etl -o diskio.txt -a diskio
This gives you some disk I/O statistics for the duration of the trace. Here’s the sample from my box:
Start Time, End Time, Read, Write, Usage % 0, 1000000, 18, 30, 51.37 1000000, 2000000, 0, 11, 2.44 2000000, 3000000, 0, 2, 1.05 3000000, 4000000, 1, 4, 4.26 4000000, 5000000, 1, 15, 20.17 5000000, 6000000, 0, 4, 1.26 6000000, 7000000, 0, 7, 2.19 7000000, 8000000, 1, 30, 2.02 8000000, 9000000, 9, 3, 13.47
Finally, if we’re interested in a graphical interpretation of everything, just go ahead and launch the Performance Analyzer (xperfview) and then navigate to the trace file, or just type:
…from the command line. Here’s what the initial output looks like on my system:
That sure looks like a lot of useful data! The first graph is CPU utilization by process, the second one is disk utilization by process. You can filter some processes out, of course:
…and you can zoom in, select, make the graphs overlay so you can see the information side by side:
And then there’s the summary table feature, if you just right click on the graph you’re interested in and choose “Summary Table”:
…and if you look at the Disk I/O summary and right click for a Detail Graph, you get the breakdown of specific I/O requests and can even visually see fragmented file access if present (!):
Note how you can see the volume on the right (C and D in my case), and if you hover on each individual dot you see the information on the process making the request and the actual file path. On the left you see the disk location information (offset in bytes from the disk start), so that if there are jumps across the disk all the time, we have some fragmented non-sequential access going on.
And then there’s the page fault details, to the level of the file being requested, the size of the I/O request, the average time spent servicing the request, the total I/O time, etc.:
Bear in mind: we have seen nothing yet. There are some preconfigured profiles for analyzing standby/resumption, hibernation, boot, application startup, network activity, and many other scenarios. You can get events generated on virtual memory allocation, power management events, registry access, driver events, system calls, interrupts, DPCs/APCs, context switches, what not. (Friendly reminder once more: there’s documentation available both online and offline, go and read it right away! :-))
However, there’s another piece of coolness to see if you’re still with me and interested in what else the tool is capable of. Using the kernel sample profile interrupt, the ETW system can capture the instruction pointer and the stack during trace execution. This data is logged to an ETL file, and can then be analyzed to see what kind of call stacks your application encountered. (This also gives away the information on performance bottlenecks – i.e. where your application spends the most of its time.)
So what I did to demonstrate this was write a very simple application which displays prime numbers in a given range. It’s written in a very inefficient way so that it takes quite some time to execute. To profile what’s going on inside, I went to the administrative command prompt again and typed:
xperf -on SysProf
Note for advanced users: “Base” would also have worked, because it also enables profiling. If I’m not interested in collecting disk I/O and memory statistics, however, then I’m better off with “SysProf” (which resolves to PROC_THREAD+LOADER+PROFILE). Without PROC_THREAD and LOADER you won’t get reliable results.
Then all I had to do is launch my application, let it run to termination, generate a results file as before (xperf -d …) and launch the Performance Analyzer to see what we got.
The first and foremost thing that you have to do at this point is have debugging symbols configured properly. You could spend hours on trying to understand what’s wrong with what you have done, only to discover it’s due to symbols improperly configured. To get symbols for Windows code lined up properly all you need to do is set up the _NT_SYMBOL_PATH system environment variable (My Computer -> Properties -> Advanced System Settings -> Environment Variables) to the following string:
You can replace C:\Symbols with any other downstream store on your disk that you prefer (or you could omit this section entirely). What this tells the symbols engine is that whenever symbols for Windows are needed, they are automagically downloaded from Microsoft and cached at the downstream store (C:\Symbols). Yes, this requires a connection to the Internet. Yes, the symbols will be re-downloaded if Windows updates are installed or a service pack is deployed. Yes, if you’re offline, you could also download the entire cache and install it, but whenever something is updated you risk getting out of sync.
For your own code, you can prepend the path to the PDB files to the above string, for example:
set _NT_SYMBOL_PATH = C:\Code\MyApp\release;%NT_SYMBOL_PATH% ''' OR ''' set _NT_SYMBOL_PATH = C:\Code\MyApp\release;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
When you finally launch the tool, before you go ahead and open any graphs, go to the Trace menu and click Load Symbols. This might take some time now or later, depending on what you’re trying to do. Essentially, it might be downloading symbols for the entire system for the first time, which could take several minutes or even more on a slow connection. On my system, there were 142 modules for which symbols were downloaded, occupying a total of 93MB disk space.
After you’ve got symbols loaded, choose a region on the graph or the entire graph, right click and select “Summary Table”. Use the column chooser on the left to group by Process and Stack, and show you the Weight and the Weight%. Note that you can do this for any process, not just for your code – all you need is symbols (which you can have on debug and release builds nowadays with no problem at all). For my prime numbers application on my system I get the following output:
So it seems that MyApp.exe!IsPrime took 38.26% of the weight in this profiling session. printf, on the other hand, took 0.71% of the weight.
Note we can’t deduce the execution time from this output; we can only see the relative weight a function had (i.e. the number of samples where this function was on the stack compared to the total number of samples taken by the profiler).
After optimizing the code a little bit (so that prime numbers are calculated more efficiently) and running it again under the profiler, here are the results:
The code was so optimized that there are no samples for the IsPrime function anymore (it was not inlined – we’re talking about a debug build), and printf is responsible for 2.92% of the weight.
Yes, there is some room for improvement here because commercial tools like Microsoft’s own Visual Studio Profiler give you so much more, including the ability to compare performance reports, analyze managed code, use a convenient API to determine when profiling should start and when it should end, profile memory allocations etc. – but this one is for free and it’s part of an integrated suite of lots and lots of additional functionality. Additionally, if you look closely into the output, you can see that the grouping performed by this profiler can be done by the stack trace and not the function – so you can see how many times your function or sequence of functions was called in a particular order and manner. This is something even commercial profilers make a difficult objective to achieve.
Just one final note before you go ahead and try it: stack profiling works only on the 32-bit editions of Windows Vista or Windows Server 2008. To be more accurate, I couldn’t figure out how to make it work on a 64-bit system and gave up (and the following blog post gives me some hope that one day we’ll figure it out). All I could get on a 64-bit system is just a list of functions getting called with their sample times and counts (which is great), but not the detailed elegant call stacks you can get on a 32-bit machine.
So get your hands on the Windows Performance Toolkit and try it out! Let me know what you discovered.