Two New eBPF Tools: memleak and argdist

February 14, 2016

Warning: This post requires a bit of background. I strongly recommend Brendan Gregg’s introduction to eBPF and bcc. With that said, the post below describes two new bcc-based tools, which you can use directly without perusing the implementation details.

A few weeks ago, I started experimenting with eBPF. In a nutshell, eBPF (introduced in Linux kernel 3.19 and further improved in 4.x kernels) allows you to attach verifiably safe programs to arbitrary functions in the kernel or a user process. These little programs, which execute in kernel mode, can collect performance information, trace diagnostic data, and aggregate statistics that are then exposed to user mode. Although BPF’s lingua franca is a custom instruction set, the bcc project provides a C-to-BPF compiler and a Python module that can be used from user mode to load BPF programs, attach them, and print their results. The bcc repository contains numerous examples of BPF programs, and a growing collection of tracing tools that perform in-kernel aggregations, offering much lower overhead than perf or similar alternatives.

The result of my work so far is two new scripts: memleak and argdist. memleak helps detect memory leaks in kernel components or user processes by keeping track of allocations that haven’t been freed, along with the call stack that performed each allocation. argdist is a generic tool that traces function arguments into a histogram or frequency counting table to explore a function’s behavior over time. To experiment with the tools in this post, you will need to install bcc on a modern kernel (4.1+ is recommended). Instructions and prerequisites are available on the bcc installation page.

memleak

In basic mode, memleak attaches to either malloc and free, or kmalloc and kfree, and collects outstanding allocations. Outstanding allocations older than a certain age are printed along with the allocating call stack. For example, here’s some output from a leaking user program:

# ./memleak -p $(pidof allocs) 
Attaching to malloc and free in pid 5193, Ctrl+C to quit. 
[11:16:33] Top 10 stacks with outstanding allocations: 
  80 bytes in 5 allocations from stack 
    main+0x6d [/home/vagrant/allocs] (400862) 
    __libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:38] Top 10 stacks with outstanding allocations: 
  160 bytes in 10 allocations from stack 
    main+0x6d [/home/vagrant/allocs] (400862) 
    __libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:43] Top 10 stacks with outstanding allocations: 
  240 bytes in 15 allocations from stack 
    main+0x6d [/home/vagrant/allocs] (400862) 
    __libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)

It looks like main keeps allocating more and more memory that isn’t being freed.
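To make the bookkeeping concrete, here is a rough user-space sketch of the idea described above: remember every allocation by address, forget it on free, and periodically group the survivors by call stack. This is a toy model in plain Python, not memleak's actual implementation; the class name, method names, and addresses are all made up for illustration.

```python
import time
from collections import defaultdict


class OutstandingAllocations:
    """Toy model of tracking unfreed allocations (hypothetical, not memleak's code)."""

    def __init__(self):
        # address -> (size, timestamp, call stack as a tuple of frame names)
        self.live = {}

    def on_alloc(self, address, size, stack, now=None):
        """Record an allocation, as a probe on the allocator's return might."""
        stamp = time.time() if now is None else now
        self.live[address] = (size, stamp, tuple(stack))

    def on_free(self, address):
        """Forget an allocation, as a probe on the free function might."""
        self.live.pop(address, None)

    def top_stacks(self, min_age=0.0, top=10, now=None):
        """Group surviving allocations older than min_age by call stack."""
        now = time.time() if now is None else now
        totals = defaultdict(lambda: [0, 0])  # stack -> [bytes, count]
        for size, stamp, stack in self.live.values():
            if now - stamp >= min_age:
                totals[stack][0] += size
                totals[stack][1] += 1
        ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
        return [(stack, nbytes, count) for stack, (nbytes, count) in ranked[:top]]


tracker = OutstandingAllocations()
stack = ("main+0x6d", "__libc_start_main+0xf0")
for i in range(5):
    tracker.on_alloc(0x1000 + 16 * i, 16, stack, now=float(i))
tracker.on_alloc(0x9000, 64, ("helper",), now=0.0)
tracker.on_free(0x9000)  # freed, so it never shows up in the report

for frames, nbytes, count in tracker.top_stacks(min_age=1.0, now=10.0):
    print(f"{nbytes} bytes in {count} allocations from stack")
    for frame in frames:
        print(f"    {frame}")
```

With five unfreed 16-byte allocations from the same stack, the sketch prints a report shaped like the first one above. The `min_age` parameter mirrors the "older than a certain age" rule, which keeps short-lived allocations that are about to be freed out of the report.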

Additional options include printing only the top N allocating stacks, only stacks that allocated more than N bytes, capturing only specific allocation sizes, reducing overhead by capturing only every N-th allocation, and more. The really cool part is how easy it was to build this tool, even though I had to roll my own user symbol decoding support (currently based on a rather hackish invocation of `objdump`).

argdist

argdist is a Swiss Army knife designed to analyze the distribution of a function’s arguments. It attaches to functions in the kernel or a user process, collects specific argument values, stores them in a histogram or frequency counting collection, and displays them for further analysis. Let’s start with a couple of simple examples. Suppose you want to find what allocation sizes are common in your application. The probe syntax for malloc is the following: p:c:malloc(size_t size):size_t:size. This obscure-looking string is rather simple, really: p stands for probe (could also be r, which is a probe on the return from the function), c is the library that contains the function, malloc(size_t size) is the function’s signature, and the rest is the type and value of the expression that you want to collect.

# ./argdist -p 2420 -C 'p:c:malloc(size_t size):size_t:size' 
[01:42:29] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
[01:42:30] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
[01:42:31] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
       1 size = 16 
[01:42:32] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
       2 size = 16 
[01:42:33] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
       3 size = 16 
[01:42:34] 
p:c:malloc(size_t size):size_t:size 
   COUNT EVENT 
       4 size = 16 
^C

This application seems to be allocating only blocks of size 16. We can do a similar thing with kernel functions. For example, here’s a probe on __kmalloc, where the grouping is by both the allocation flags (gfp_t) and the allocation size:

# ./argdist -C 'p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size'
[03:42:29] 
p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size 
  COUNT EVENT 
      1 flags = 16, size = 152 
      2 flags = 131280, size = 8 
      7 flags = 131280, size = 16 
[03:42:30] 
p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size 
  COUNT EVENT 
      1 flags = 16, size = 152 
      6 flags = 131280, size = 8 
     19 flags = 131280, size = 16 
[03:42:31] 
p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size 
  COUNT EVENT 
      2 flags = 16, size = 152 
     10 flags = 131280, size = 8 
     31 flags = 131280, size = 16 
[03:42:32] 
p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size 
  COUNT EVENT 
      2 flags = 16, size = 152 
     14 flags = 131280, size = 8 
     43 flags = 131280, size = 16 
^C
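Conceptually, the frequency counting mode boils down to a table keyed by the tuple of collected values. The following is a minimal sketch of that idea in plain Python, assuming a hypothetical `on_kmalloc` callback and a made-up event stream chosen to match the first report above; the real tool does its counting in the kernel.

```python
from collections import Counter

# (flags, size) -> number of times that combination was seen
counts = Counter()


def on_kmalloc(size, flags):
    """Hypothetical callback invoked once per traced __kmalloc call."""
    counts[(flags, size)] += 1


def report():
    """Format the table in ascending order of count, like the output above."""
    lines = ["  COUNT EVENT"]
    for (flags, size), n in sorted(counts.items(), key=lambda kv: kv[1]):
        lines.append(f"{n:>7} flags = {flags}, size = {size}")
    return lines


# Made-up events: one 152-byte call, two 8-byte calls, seven 16-byte calls.
events = [(152, 16)] + [(8, 131280)] * 2 + [(16, 131280)] * 7
for size, flags in events:
    on_kmalloc(size, flags)

print("\n".join(report()))
```

Because the table is never cleared between reports in this sketch, the counts only grow from one interval to the next, which is also what the output above shows.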

Now, let’s get a histogram of write sizes to file descriptor 1 (STDOUT) across all processes on the system. Note the probe syntax, which now includes an additional filter (fd==1):

# ./argdist -H 'p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1' 
[01:47:17] 
p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1 
 len :  count  distribution 
   0 -> 1  : 0 |                                        | 
   2 -> 3  : 0 |                                        | 
   4 -> 7  : 0 |                                        | 
   8 -> 15 : 1 |****************************************| 
  16 -> 31 : 0 |                                        | 
  32 -> 63 : 1 |****************************************| 
[01:47:19] 
p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1 
 len :  count    distribution 
   0 -> 1   : 0  |                                        | 
   2 -> 3   : 0  |                                        | 
   4 -> 7   : 0  |                                        | 
   8 -> 15  : 3  |*********                               | 
  16 -> 31  : 0  |                                        | 
  32 -> 63  : 5  |***************                         | 
  64 -> 127 : 13 |****************************************| 
^C
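The histogram buckets are powers of two: 0 to 1, 2 to 3, 4 to 7, and so on. Here is a small sketch of that bucketing and of the bar scaling, in plain Python. The function names are mine, the column layout is only approximate, and the write sizes are made up to reproduce the counts in the second report above.

```python
from collections import Counter


def log2_bucket(value):
    """Index of the power-of-two bucket: 0 -> [0, 1], 1 -> [2, 3], 2 -> [4, 7], ..."""
    return max(value.bit_length() - 1, 0)


def render_histogram(values, label="len", width=40):
    """Return text lines resembling the reports above for non-negative ints."""
    counts = Counter(log2_bucket(v) for v in values)
    if not counts:
        return []
    peak = max(counts.values())
    lines = [f"{label} : count distribution"]
    for index in range(max(counts) + 1):
        low = 0 if index == 0 else 1 << index
        high = (1 << (index + 1)) - 1
        count = counts[index]
        # Scale each bar so that the fullest bucket spans the whole width.
        bar = "*" * (count * width // peak)
        lines.append(f"{low:>4} -> {high:<4}: {count:<3}|{bar:<{width}}|")
    return lines


# Made-up write sizes: 3 in [8, 15], 5 in [32, 63], 13 in [64, 127].
sizes = [12] * 3 + [40] * 5 + [100] * 13
print("\n".join(render_histogram(sizes)))
```

Scaling against the fullest bucket explains why the same bucket can show a full bar in one report and a much shorter one in the next: the bars are relative to the current maximum, not absolute.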

Here’s another example: let’s snoop on gets return values, effectively capturing any user input read through gets across the whole system! The probe syntax this time includes the special $retval keyword:

# ./argdist -i 10 -n 1 -C 'r:c:gets():char*:(char*)$retval:$retval!=0' 
[02:12:23] 
r:c:gets():char*:(char*)$retval:$retval!=0 
  COUNT EVENT 
      1 (char*)$retval = hi there 
      3 (char*)$retval = sasha 
      8 (char*)$retval = hello
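The probe strings shown so far all follow the same colon-separated layout: probe type, library, function signature, value types, value expressions, and an optional filter. As a rough illustration, here is how such a string could be split into its parts in plain Python. This is a sketch of the layout only, not argdist's real parser; `ProbeSpec` and `parse_probe` are hypothetical names, and the naive splitting would break on expressions that themselves contain colons or commas.

```python
from typing import List, NamedTuple, Optional


class ProbeSpec(NamedTuple):
    kind: str                   # "p" (entry probe) or "r" (return probe)
    library: str                # e.g. "c"; empty in the __kmalloc example
    function: str               # e.g. "malloc"
    signature: str              # e.g. "size_t size"
    types: List[str]            # e.g. ["size_t"]
    exprs: List[str]            # e.g. ["size"]
    filter_expr: Optional[str]  # e.g. "fd==1", or None when absent


def parse_probe(spec: str) -> ProbeSpec:
    """Split a probe string of the form used in this post into its fields."""
    parts = spec.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {spec!r}")
    kind, library, call, types, exprs = parts[:5]
    if kind not in ("p", "r"):
        raise ValueError(f"unknown probe type: {kind!r}")
    function, _, rest = call.partition("(")
    signature = rest[:-1] if rest.endswith(")") else rest
    return ProbeSpec(
        kind=kind,
        library=library,
        function=function,
        signature=signature.strip(),
        types=[t.strip() for t in types.split(",")],
        exprs=[e.strip() for e in exprs.split(",")],
        filter_expr=parts[5] if len(parts) == 6 else None,
    )


print(parse_probe("p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1"))
```

Running the sketch on the write probe yields `write` as the function, `len` as the single collected expression, and `fd==1` as the filter.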

Another thing you can do is wait for the function to return and then refer to its execution time (latency) and the values of the arguments it had on entry. For example, ever wondered how many nanoseconds it takes to allocate a typical byte using malloc?

# ./argdist -H 'r:c:malloc(size_t size):u64:$latency/$entry(size);ns per byte' -n 1 -i 10
[01:11:13] 
  ns per byte  : count distribution 
  0    -> 1    : 0     |                                        | 
  2    -> 3    : 4     |*****************                       | 
  4    -> 7    : 3     |*************                           | 
  8    -> 15   : 2     |********                                | 
  16   -> 31   : 1     |****                                    | 
  32   -> 63   : 0     |                                        | 
  64   -> 127  : 7     |*******************************         | 
  128  -> 255  : 1     |****                                    | 
  256  -> 511  : 0     |                                        | 
  512  -> 1023 : 1     |****                                    | 
  1024 -> 2047 : 1     |****                                    | 
  2048 -> 4095 : 9     |****************************************| 
  4096 -> 8191 : 1     |****                                    |
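Measuring latency against an entry argument requires pairing each return with the matching entry, because the arguments are only visible when the function is called. One way to picture this is a per-thread table that stashes the timestamp and argument at entry and consumes them at return. The sketch below models that in plain Python; the class and its methods are hypothetical, the timestamps are made up, and the zero-size guard is my own addition rather than documented argdist behavior.

```python
class LatencyPerByte:
    """Toy model of pairing a function's entry with its return, per thread."""

    def __init__(self):
        self.pending = {}  # thread id -> (entry timestamp in ns, size argument)
        self.samples = []  # computed ns-per-byte values

    def on_entry(self, tid, now_ns, size):
        # Arguments are only visible at entry, so stash them until the return.
        self.pending[tid] = (now_ns, size)

    def on_return(self, tid, now_ns):
        entry = self.pending.pop(tid, None)
        if entry is None:
            return  # no recorded entry, e.g. tracing started mid-call
        start_ns, size = entry
        if size > 0:  # skip zero-size requests to avoid dividing by zero
            self.samples.append((now_ns - start_ns) // size)


tracer = LatencyPerByte()
tracer.on_entry(tid=1, now_ns=1000, size=16)
tracer.on_entry(tid=2, now_ns=1010, size=1024)  # a second thread, interleaved
tracer.on_return(tid=2, now_ns=5106)            # 4096 ns / 1024 bytes
tracer.on_return(tid=1, now_ns=1048)            # 48 ns / 16 bytes
print(tracer.samples)
```

Keying the table by thread ID keeps interleaved calls from different threads apart, which is why the two calls in the usage example resolve correctly even though they overlap in time.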

argdist is a fairly sophisticated tool, so it has a lot of switches and features that I haven’t described here. You can control tracing frequency, monitor functions that have struct parameter types, capture complex expressions and filters, capture multiple variables in a single probe, and even capture data from multiple functions in a single run.

Summary

This post introduced memleak and argdist, two new tools based on bcc/eBPF that demonstrate the power of dynamic tracing. memleak helps diagnose memory leaks, and argdist helps analyze function arguments using histograms and frequency collections. I intend to continue working on these and other bcc-based tools in the future. If this looks cool or useful, please head over to GitHub and contribute!


I am posting shorter links, comments and thoughts on Twitter as well. Follow me: @goldshtn
