I hope you’re outraged that your performance tools are lying to you. For quite a while, many Java sampling profilers have been known to blatantly misrepresent reality. In a nutshell, stack sampling using the documented JVMTI GetStackTrace method produces results that are biased towards safepoints, and not representative of the real CPU processing performed by your program.
Over the years, alternative profilers popped up that try to fix this problem by using AsyncGetCallTrace (AGCT), a less-documented API that doesn’t wait for a safepoint and can produce more accurate results. Simply calling AGCT from a timer signal handler gives you a fairly reliable way to do stack sampling of JVM processes. Unfortunately, even AGCT can sometimes fail, and in any case, it doesn’t help with profiling the non-Java parts of your process: JVM code, GC, JIT, syscalls, kernel work performed on your behalf, and really anything else that’s not pure JVM bytecode.
Another popular alternative is using Linux perf, which doesn’t directly support Java but has great support for profiling native code, and doesn’t have any trouble looking at kernel stacks as well. For JVM support, you need two pieces:
- A perf map that maps JIT-compiled addresses to function names (as a corollary, only compiled frames are supported; interpreter frames are invisible)
- The JVM flag -XX:+PreserveFramePointer, available since OpenJDK 8u60, which makes the JIT keep frame pointers so that perf can walk the Java stack
This method has several drawbacks:
- You end up losing interpreter frames
- You can’t profile an older JVM that doesn’t have the PreserveFramePointer flag
- You risk having stale entries in your perf map because the JIT can throw away and recompile code
- You risk not having certain functions in your perf map because the JIT threw the code away
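To make the perf map idea concrete, here is a minimal Java sketch of how such a map could be read and queried. It assumes the conventional perf map layout of one `START SIZE NAME` entry per line, with `START` and `SIZE` in hexadecimal. The class name `PerfMapSketch`, its methods, and the symbol names in the usage note are hypothetical and chosen only for illustration; this is not how perf itself is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a perf-map symbolizer; all names here are hypothetical. */
final class PerfMapSketch {
    /** One "START SIZE NAME" entry from the map file. */
    static final class Entry {
        final long start;
        final long size;
        final String name;

        Entry(long start, long size, String name) {
            this.start = start;
            this.size = size;
            this.name = name;
        }
    }

    // Keyed by start address. Assumes user-space addresses below 2^63,
    // so signed ordering of long keys matches address ordering.
    private final TreeMap<Long, Entry> byStart = new TreeMap<>();

    /** Parses lines in file order; a later entry with the same start replaces an earlier one. */
    static PerfMapSketch parse(List<String> lines) {
        PerfMapSketch map = new PerfMapSketch();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 3) {
                continue; // skip blank or malformed lines
            }
            long start = Long.parseUnsignedLong(parts[0], 16);
            long size = Long.parseUnsignedLong(parts[1], 16);
            map.byStart.put(start, new Entry(start, size, parts[2]));
        }
        return map;
    }

    /** Returns the symbol whose range covers the address, or null if none does. */
    String resolve(long address) {
        Map.Entry<Long, Entry> floor = byStart.floorEntry(address);
        if (floor == null) {
            return null;
        }
        Entry e = floor.getValue();
        return address - e.start < e.size ? e.name : null;
    }
}
```

Calling `PerfMapSketch.parse(lines).resolve(address)` returns a name only when some entry covers the address. A frame in code that was never written to the map resolves to nothing, and a frame in memory the JIT has reused resolves to whichever entry was written last. The sketch only handles reuse of the exact same start address; overlapping ranges with different starts are not reconciled.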
At JPoint 2017, Andrei Pangin and Vadim Tsesko from Odnoklassniki introduced a new approach for JVM profiling on Linux, which brings together the best from both worlds: perf for native code and kernel frames, and AGCT for Java frames. Thus, async-profiler was born.
Async-profiler’s method of operation is fairly simple. It uses the perf_events API to configure CPU sampling into a memory buffer, and asks for a signal to be delivered when a sample occurs. The signal handler then calls AsyncGetCallTrace, and merges the two stacks together: the Java stack, captured by AsyncGetCallTrace, and the native + kernel stack, captured by perf_events. For non-Java threads, only the perf_events stack is retained.
Async-profiler’s approach for constructing a merged call stack, from Andrei Pangin’s and Vadim Tsesko’s presentation at JPoint 2017
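The merge step can be sketched in a few lines of Java. This is an illustration under stated assumptions, not async-profiler's actual code (the real agent is native code running in a signal handler). It assumes both stacks arrive leaf-first, that frames are already symbolized to strings, and that a hypothetical `isJavaCode` predicate can tell when a perf_events frame lies in JVM-generated code. The names `StackMergeSketch` and `merge` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of merging a perf_events stack with an AsyncGetCallTrace stack. */
final class StackMergeSketch {
    /**
     * Both inputs are leaf-first (innermost frame at index 0).
     * isJavaCode is a hypothetical test for "this perf frame is in JVM-generated code".
     */
    static List<String> merge(List<String> perfFrames,
                              List<String> javaFrames,
                              Predicate<String> isJavaCode) {
        if (javaFrames.isEmpty()) {
            // Non-Java thread: only the perf_events stack is retained.
            return new ArrayList<>(perfFrames);
        }
        List<String> merged = new ArrayList<>();
        for (String frame : perfFrames) {
            if (isJavaCode.test(frame)) {
                break; // from here on, trust the Java stack instead
            }
            merged.add(frame); // kernel and native frames near the leaf
        }
        merged.addAll(javaFrames);
        return merged;
    }
}
```

For a thread blocked in a write, the result would start with the kernel and native frames near the leaf, followed by the Java frames. The perf_events view of the Java portion is dropped, which is why frame pointers in JIT-compiled code are not needed in this scheme.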
This approach has its limitations, but it also offers a lot of appeal. You don’t need a special switch to preserve frame pointers. You get full-fidelity data about interpreter frames. The agent supports older JVMs. The stack aggregation happens in the agent, so there are no expensive perf.data files to store and parse.
A flame graph generated by using async-profiler
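Aggregating stacks in the agent amounts to counting identical call stacks instead of storing every sample. The Java sketch below shows one way this could look, emitting the folded `root;...;leaf count` lines that flame graph tooling commonly consumes. The class `FoldedStacksSketch` and its methods are hypothetical, and whether the agent uses this representation internally is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of in-process stack aggregation into folded-stack lines. */
final class FoldedStacksSketch {
    // Insertion-ordered so output follows the order in which stacks were first seen.
    private final Map<String, Long> counts = new LinkedHashMap<>();

    /** Records one sample; frames are leaf-first, as a stack walker would produce them. */
    void addSample(List<String> leafFirstFrames) {
        List<String> rootFirst = new ArrayList<>(leafFirstFrames);
        Collections.reverse(rootFirst);
        counts.merge(String.join(";", rootFirst), 1L, Long::sum);
    }

    /** Returns one "root;...;leaf count" line per distinct stack. */
    List<String> toFoldedLines() {
        List<String> lines = new ArrayList<>();
        counts.forEach((stack, n) -> lines.add(stack + " " + n));
        return lines;
    }
}
```

Because memory use grows with the number of distinct stacks rather than the number of samples, a long profiling run stays compact, which is the advantage over writing out and post-processing a perf.data file.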
To try async-profiler, you can build from source (it’s very simple) and then use the helper profiler.sh script, which I contributed:
    ./profiler.sh start $(pidof java)
    ./profiler.sh stop -o flamegraph -f /tmp/java.stacks
Full instructions are in the README. Feedback, contributions, and suggestions are very welcome. Odnoklassniki are using this in production, but I’m sure they’ll be delighted to know that you found it useful, too!