On Windows, .NET contains numerous very useful ETW events, which can be used for tracing garbage collections, assembly loading, exceptions thrown, object allocations, and other interesting scenarios. All events can come with a stack trace, which helps understand where they’re coming from. In fact, I’d say for some events, not getting the stack trace means the event is completely useless — e.g. what good is the ExceptionThrown event if you don’t have the exception stack trace?
On Linux, .NET Core doesn’t use ETW (Event Tracing for Windows, ya know). It uses LTTng instead, which is an awesome tracing framework but doesn’t have stack trace support for userspace events. But I think we can hack around it. Specifically, all LTTng events used by .NET Core are fired through a set of auto-generated functions, named
EventXplatEnabled<EventName>. If we trace these functions using standard dynamic tracing (uprobes) with ftrace, perf, or BPF, we don’t get the event payload (which can be also quite important), but we do get the stack trace. If we only need the event count (a rough replacement for Windows performance counters) or the stack traces, we don’t have create an LTTng session and record the events, which can also help lower the overhead. The downside is using hacky internal details, which can change at any moment — but that’s the nature of dynamic tracing.
Here’s a simple demo. Suppose you know from looking at the heap statistics or the GC LTTng events that you have lots of garbage collections, and would like to reduce the object allocations in your app. To do so, you have to figure out where the allocations are coming from. The GCAllocationTick event can tell you roughly which objects you’re allocating by using a low-overhead sampling approach, but it doesn’t tell you where the allocations are coming from, which is quite important. What we’re going to do, then, is trace the
EventXplatEnabledGCAllocationTick function in libcoreclr.so, and gather its stack traces. Then, we’ll generate a flame graph that points to the heavy allocation sites in the app. I’ll demonstrate two ways — with perf, and with the stackcount tool from BCC (which is based on eBPF).
The perf way:
perf probe -x $APPDIR/libcoreclr.so -a EventXplatGCEnabledAllocationTick* perf record -p $PID probe_libcoreclr:* -g -o allocs.data perf script | $FLAMEGRAPH/stackcollapse-perf.pl | $FLAMEGRAPH/flamegraph.pl > allocs.svg
The stackcount way:
$BCC/stackcount $APPDIR/libcoreclr.so:EventXplatEnabledGCAllocationTick* -f > allocs.stacks $FLAMEGRAPH/flamegraph.pl < allocs.stacks > allocs.svg
Just for fun, let me show you the flame graph, pointing to the
StatsController.Get method as the primary source of allocations:
To quantify the overhead, I tested this approach on an ASP.NET Core app with a trivial endpoint that performs hundreds of thousands of allocations per second. I also tested LTTng event collection, where I created an LTTng session and enabled only the GCAllocationTick event. I ran the benchmark for 20 seconds with 10 concurrent clients in each mode. The results were as follows:
- No tracing: 12.32 ms/request
- Dynamic tracing with perf: 12.51 ms/request (total of 58,422 events recorded; almost 3,000 events per second)
- Dynamic tracing with stackcount: 12.59 ms/request
- LTTng recording: 12.72 ms/request (total of 56,931 events recorded)
Just to clarify, any of the approaches above still require launching the application with the
COMPlus_EnableEventLog=1environment variable. If it is off, the entire eventing infrastructure is not invoked at all. Incidentally, for this application, which has the potential of generating hundreds of thousands of events per second, turning off this environment variable produces a huge speedup: 7.39 ms/request. For an app with more reasonable event rates, it will probably make sense to keep the environment variable on, because turning it off means you can’t do any meaningful event collection without restarting the process.
In conclusion, it seems that using dynamic tracing to probe the CoreCLR methods directly is a feasible approach for collecting stack traces of interesting CLR events. You don’t get the event payload (although in some cases it can be collected as well from the function’s arguments), but you do get the code location, which is often enough.