08-08-2019, 08:13 AM
Trace code in Fedora with bpftrace
<div><p>bpftrace is <a href="https://github.com/iovisor/bpftrace">a new eBPF-based tracing tool</a> that was first included in Fedora 28. It was developed by Brendan Gregg, Alastair Robertson and Matheus Marchini with the help of a loosely-knit team of hackers across the Net. A tracing tool lets you analyze what a system is doing behind the curtain. It tells you which functions in code are being called, with which arguments, how many times, and so on.</p>
<p>This article covers some basics about bpftrace, and how it works. Read on for more information and some useful examples.</p>
<p> <span id="more-29015"></span> </p>
<h4>eBPF (extended Berkeley Packet Filter)</h4>
<p><a href="https://lwn.net/Articles/740157/">eBPF</a> is a tiny virtual machine, or a virtual CPU to be more precise, in the Linux Kernel. The eBPF can load and run small programs in a safe and controlled way in kernel space. This makes it safer to use, even in production systems. This virtual machine has its own instruction set architecture (<a href="https://github.com/iovisor/bpf-docs/blob/master/eBPF.md">ISA</a>) resembling a subset of modern processor architectures. The ISA makes it easy to translate those programs to the real hardware. The kernel performs just-in-time translation to native code for main architectures to improve the performance. </p>
<p>The eBPF virtual machine allows the kernel to be extended programmatically. Nowadays several kernel subsystems take advantage of this new powerful Linux Kernel capability. Examples include networking, seccomp, tracing, and more. The main idea is to attach eBPF programs into specific code points, and thereby extend the original kernel behavior.</p>
<p>eBPF machine language is very powerful. But writing code directly in it is extremely painful, because it’s a low level language. This is where bpftrace comes in. It provides a high-level language to write eBPF tracing scripts. The tool then translates these scripts to eBPF with the help of clang/LLVM libraries, and then attached to the specified code points.</p>
<h2>Installation and quick start</h2>
<p>To install bpftrace, run the following command in a terminal <a href="https://fedoramagazine.org/howto-use-sudo/">using </a><em><a href="https://fedoramagazine.org/howto-use-sudo/">sudo</a></em>:</p>
<pre class="wp-block-preformatted">$ sudo dnf install bpftrace</pre>
<p>Try it out with a “hello world” example:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 'BEGIN { printf("hello world\n"); }'</pre>
<p>Note that you must run <em>bpftrace</em> as <em>root</em> due to the privileges required. Use the <em>-e</em> option to specify a program, and to construct the so-called “one-liners.” This example only prints <em>hello world</em>, and then waits for you to press <strong>Ctrl+C</strong>.</p>
<p><em>BEGIN</em> is a special probe name that fires only once at the beginning of execution. Every action inside the curly braces <em>{ }</em> fires whenever the probe is hit — in this case, it’s just a <em>printf</em>.</p>
<p style="text-align:left">Let’s jump now to a more useful example:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_execve { printf("%s called %s\n", comm, str(args->filename)); }'</pre>
<p>This example prints the parent process name <em>(comm)</em> and the name of every new process being created in the system. <em>t:syscalls:sys_enter_execve</em> is a kernel tracepoint. It’s a shorthand for <em>tracepoint:syscalls:sys_enter_execve</em>, but both forms can be used. The next section shows you how to list all available tracepoints.</p>
<p><em>comm</em> is a bpftrace builtin that represents the process name. <em>filename</em> is a field of the <em>t:syscalls:sys_enter_execve</em> tracepoint. You can access these fields through the <em>args</em> builtin.</p>
<p>All available fields of the tracepoint can be listed with this command:</p>
<pre class="wp-block-preformatted">bpftrace -lv "t:syscalls:sys_enter_execve"</pre>
<h2>Example usage</h2>
<h3>Listing probes</h3>
<p>A central concept for <em>bpftrace</em> are <strong>probe points</strong>. Probe points are instrumentation points in code (kernel or userspace) where eBPF programs can be attached. They fit into the following categories:</p>
<ul>
<li><em>kprobe</em> – kernel function start</li>
<li><em>kretprobe</em> – kernel function return</li>
<li><em>uprobe</em> – user-level function start</li>
<li><em>uretprobe</em> – user-level function return</li>
<li><em>tracepoint</em> – kernel static tracepoints</li>
<li><em>usdt</em> – user-level static tracepoints</li>
<li><em>profile</em> – timed sampling</li>
<li><em>interval</em> – timed output</li>
<li><em>software</em> – kernel software events</li>
<li><em>hardware</em> – processor-level events</li>
</ul>
<p>All available <em>kprobe/kretprobe</em>, <em>tracepoints</em>, <em>software</em> and <em>hardware</em> probes can be listed with this command:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -l</pre>
<p>The <em>uprobe/uretprobe</em> and <em>usdt</em> probes are userspace probes specific to a given executable. To use them, use the special syntax shown later in this article.</p>
<p>The <em>profile</em> and <em>interval</em> probes fire at fixed time intervals. Fixed time intervals are not covered in this article.</p>
<h3>Counting system calls</h3>
<p><strong>Maps</strong> are special BPF data types that store counts, statistics, and histograms. You can use maps to summarize how many times each syscall is being called:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_* { @[probe] = count(); }'</pre>
<p>Some probe types allow wildcards to match multiple probes. You can also specify multiple attach points for an action block using a comma separated list. In this example, the action block attaches to all tracepoints whose name starts with <em>t:syscalls:sys_enter_</em>, which means all available syscalls.</p>
<p>The bpftrace builtin function <em>count()</em> counts the number of times this function is called. <em>@[]</em> represents a map (an associative array). The key of this map is <em>probe</em>, which is another bpftrace builtin that represents the full probe name.</p>
<p>Here, the same action block is attached to every syscall. Then, each time a syscall is called the map will be updated, and the entry is incremented in the map relative to this same syscall. When the program terminates, it automatically prints out all declared maps.</p>
<p>This example counts the syscalls called globally, it’s also possible to filter for a specific process by <em>PID</em> using the bpftrace filter syntax:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_* / pid == 1234 / { @[probe] = count(); }'</pre>
<h3>Write bytes by process</h3>
<p>Using these concepts, let’s analyze how many bytes each process is writing:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_exit_write /args->ret > 0/ { @[comm] = sum(args->ret); }'</pre>
<p><em>bpftrace</em> attaches the action block to the write syscall return probe (<em>t:syscalls:sys_exit_write</em>). Then, it uses a filter to discard the negative values, which are error codes <em>(/args->ret > 0/)</em>.</p>
<p>The map key <em>comm</em> represents the process name that called the syscall. The <em>sum()</em> builtin function accumulates the number of bytes written for each map entry or process. <em>args</em> is a bpftrace builtin to access tracepoint’s arguments and return values. Finally, if successful, the <em>write</em> syscall returns the number of written bytes. <em>args->ret</em> provides access to the bytes.</p>
<h3>Read size distribution by process (histogram):</h3>
<p><em>bpftrace</em> supports the creation of histograms. Let’s analyze one example that creates a histogram of the <em>read</em> size distribution by process:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'</pre>
<p>Histograms are BPF maps, so they must always be attributed to a map (<em>@</em>). In this example, the map key is <em>comm</em>. </p>
<p>The example makes <em>bpftrace</em> generate one histogram for every process that calls the <em>read</em> syscall. To generate just one global histogram, attribute the <em>hist()</em> function just to <em>‘@’</em> (without any key).</p>
<p>bpftrace automatically prints out declared histograms when the program terminates. The value used as base for the histogram creation is the number of read bytes, found through <em>args->ret</em>.</p>
<h3>Tracing userspace programs</h3>
<p>You can also trace userspace programs with <em>uprobes/uretprobes</em> and <em>USDT</em> (User-level Statically Defined Tracing). The next example uses a <em>uretprobe</em>, which probes to the end of a user-level function. It gets the command lines issued in every <em>bash</em> running in the system:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'</pre>
<p>To list all available <em>uprobes/uretprobes</em> of the <em>bash</em> executable, run this command:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -l "uprobe:/bin/bash"</pre>
<p><em>uprobe</em> instruments the beginning of a user-level function’s execution, and <em>uretprobe</em> instruments the end (its return). <em>readline()</em> is a function of <em>/bin/bash</em>, and it returns the typed command line. <em>retval</em> is the return value for the instrumented function, and can only be accessed on <em>uretprobe</em>.</p>
<p>When using <em>uprobes</em>, you can access arguments with <em>arg0..argN</em>. A <em>str()</em> call is necessary to turn the <em>char *</em> pointer to a <em>string</em>.</p>
<h2>Shipped Scripts</h2>
<p>There are many useful scripts shipped with bpftrace package. You can find them in the <em>/usr/share/bpftrace/tools/</em> directory.</p>
<p>Among them, you can find:</p>
<ul>
<li><em>killsnoop.bt</em> – Trace signals issued by the kill() syscall.</li>
<li><em>tcpconnect.bt</em> – Trace all TCP network connections.</li>
<li><em>pidpersec.bt</em> – Count new procesess (via fork) per second.</li>
<li><em>opensnoop.bt</em> – Trace open() syscalls.</li>
<li><em>vfsstat.bt</em> – Count some VFS calls, with per-second summaries.</li>
</ul>
<p>You can directly use the scripts. For example:</p>
<pre class="wp-block-preformatted">$ sudo /usr/share/bpftrace/tools/killsnoop.bt</pre>
<p>You can also study these scripts as you create new tools.</p>
<h2>Links</h2>
<ul>
<li>bpftrace reference guide – <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md">https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md</a></li>
<li>bpftrace (DTrace 2.0) for Linux 2018 – <a href="http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux-2018.html">http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux-2018.html</a></li>
<li>BPF: the universal in-kernel virtual machine – <a href="https://lwn.net/Articles/599755/">https://lwn.net/Articles/599755/</a></li>
<li>Linux Extended BPF (eBPF) Tracing Tools – <a href="http://www.brendangregg.com/ebpf.html">http://www.brendangregg.com/ebpf.html</a></li>
<li>Dive into BPF: a list of reading material – <a href="https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/">https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf</a></li>
</ul>
<hr class="wp-block-separator" />
<p><em>Photo by </em><a href="https://unsplash.com/@wehavemegapixels?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"><em>Roman Romashov</em></a><em> on </em><a href="https://unsplash.com/search/photos/trace?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"><em>Unsplash</em></a><em>.</em></p>
</div>
<div><p>bpftrace is <a href="https://github.com/iovisor/bpftrace">a new eBPF-based tracing tool</a> that was first included in Fedora 28. It was developed by Brendan Gregg, Alastair Robertson and Matheus Marchini with the help of a loosely-knit team of hackers across the Net. A tracing tool lets you analyze what a system is doing behind the curtain. It tells you which functions in code are being called, with which arguments, how many times, and so on.</p>
<p>This article covers some basics about bpftrace, and how it works. Read on for more information and some useful examples.</p>
<p> <span id="more-29015"></span> </p>
<h4>eBPF (extended Berkeley Packet Filter)</h4>
<p><a href="https://lwn.net/Articles/740157/">eBPF</a> is a tiny virtual machine, or a virtual CPU to be more precise, in the Linux Kernel. The eBPF can load and run small programs in a safe and controlled way in kernel space. This makes it safer to use, even in production systems. This virtual machine has its own instruction set architecture (<a href="https://github.com/iovisor/bpf-docs/blob/master/eBPF.md">ISA</a>) resembling a subset of modern processor architectures. The ISA makes it easy to translate those programs to the real hardware. The kernel performs just-in-time translation to native code for main architectures to improve the performance. </p>
<p>The eBPF virtual machine allows the kernel to be extended programmatically. Nowadays several kernel subsystems take advantage of this new powerful Linux Kernel capability. Examples include networking, seccomp, tracing, and more. The main idea is to attach eBPF programs into specific code points, and thereby extend the original kernel behavior.</p>
<p>eBPF machine language is very powerful. But writing code directly in it is extremely painful, because it’s a low level language. This is where bpftrace comes in. It provides a high-level language to write eBPF tracing scripts. The tool then translates these scripts to eBPF with the help of clang/LLVM libraries, and then attached to the specified code points.</p>
<h2>Installation and quick start</h2>
<p>To install bpftrace, run the following command in a terminal <a href="https://fedoramagazine.org/howto-use-sudo/">using </a><em><a href="https://fedoramagazine.org/howto-use-sudo/">sudo</a></em>:</p>
<pre class="wp-block-preformatted">$ sudo dnf install bpftrace</pre>
<p>Try it out with a “hello world” example:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 'BEGIN { printf("hello world\n"); }'</pre>
<p>Note that you must run <em>bpftrace</em> as <em>root</em> due to the privileges required. Use the <em>-e</em> option to specify a program, and to construct the so-called “one-liners.” This example only prints <em>hello world</em>, and then waits for you to press <strong>Ctrl+C</strong>.</p>
<p><em>BEGIN</em> is a special probe name that fires only once at the beginning of execution. Every action inside the curly braces <em>{ }</em> fires whenever the probe is hit — in this case, it’s just a <em>printf</em>.</p>
<p style="text-align:left">Let’s jump now to a more useful example:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_execve { printf("%s called %s\n", comm, str(args->filename)); }'</pre>
<p>This example prints the parent process name <em>(comm)</em> and the name of every new process being created in the system. <em>t:syscalls:sys_enter_execve</em> is a kernel tracepoint. It’s a shorthand for <em>tracepoint:syscalls:sys_enter_execve</em>, but both forms can be used. The next section shows you how to list all available tracepoints.</p>
<p><em>comm</em> is a bpftrace builtin that represents the process name. <em>filename</em> is a field of the <em>t:syscalls:sys_enter_execve</em> tracepoint. You can access these fields through the <em>args</em> builtin.</p>
<p>All available fields of the tracepoint can be listed with this command:</p>
<pre class="wp-block-preformatted">bpftrace -lv "t:syscalls:sys_enter_execve"</pre>
<h2>Example usage</h2>
<h3>Listing probes</h3>
<p>A central concept for <em>bpftrace</em> are <strong>probe points</strong>. Probe points are instrumentation points in code (kernel or userspace) where eBPF programs can be attached. They fit into the following categories:</p>
<ul>
<li><em>kprobe</em> – kernel function start</li>
<li><em>kretprobe</em> – kernel function return</li>
<li><em>uprobe</em> – user-level function start</li>
<li><em>uretprobe</em> – user-level function return</li>
<li><em>tracepoint</em> – kernel static tracepoints</li>
<li><em>usdt</em> – user-level static tracepoints</li>
<li><em>profile</em> – timed sampling</li>
<li><em>interval</em> – timed output</li>
<li><em>software</em> – kernel software events</li>
<li><em>hardware</em> – processor-level events</li>
</ul>
<p>All available <em>kprobe/kretprobe</em>, <em>tracepoints</em>, <em>software</em> and <em>hardware</em> probes can be listed with this command:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -l</pre>
<p>The <em>uprobe/uretprobe</em> and <em>usdt</em> probes are userspace probes specific to a given executable. To use them, use the special syntax shown later in this article.</p>
<p>The <em>profile</em> and <em>interval</em> probes fire at fixed time intervals. Fixed time intervals are not covered in this article.</p>
<h3>Counting system calls</h3>
<p><strong>Maps</strong> are special BPF data types that store counts, statistics, and histograms. You can use maps to summarize how many times each syscall is being called:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_* { @[probe] = count(); }'</pre>
<p>Some probe types allow wildcards to match multiple probes. You can also specify multiple attach points for an action block using a comma separated list. In this example, the action block attaches to all tracepoints whose name starts with <em>t:syscalls:sys_enter_</em>, which means all available syscalls.</p>
<p>The bpftrace builtin function <em>count()</em> counts the number of times this function is called. <em>@[]</em> represents a map (an associative array). The key of this map is <em>probe</em>, which is another bpftrace builtin that represents the full probe name.</p>
<p>Here, the same action block is attached to every syscall. Then, each time a syscall is called the map will be updated, and the entry is incremented in the map relative to this same syscall. When the program terminates, it automatically prints out all declared maps.</p>
<p>This example counts the syscalls called globally, it’s also possible to filter for a specific process by <em>PID</em> using the bpftrace filter syntax:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_enter_* / pid == 1234 / { @[probe] = count(); }'</pre>
<h3>Write bytes by process</h3>
<p>Using these concepts, let’s analyze how many bytes each process is writing:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_exit_write /args->ret > 0/ { @[comm] = sum(args->ret); }'</pre>
<p><em>bpftrace</em> attaches the action block to the write syscall return probe (<em>t:syscalls:sys_exit_write</em>). Then, it uses a filter to discard the negative values, which are error codes <em>(/args->ret > 0/)</em>.</p>
<p>The map key <em>comm</em> represents the process name that called the syscall. The <em>sum()</em> builtin function accumulates the number of bytes written for each map entry or process. <em>args</em> is a bpftrace builtin to access tracepoint’s arguments and return values. Finally, if successful, the <em>write</em> syscall returns the number of written bytes. <em>args->ret</em> provides access to the bytes.</p>
<h3>Read size distribution by process (histogram):</h3>
<p><em>bpftrace</em> supports the creation of histograms. Let’s analyze one example that creates a histogram of the <em>read</em> size distribution by process:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'</pre>
<p>Histograms are BPF maps, so they must always be attributed to a map (<em>@</em>). In this example, the map key is <em>comm</em>. </p>
<p>The example makes <em>bpftrace</em> generate one histogram for every process that calls the <em>read</em> syscall. To generate just one global histogram, attribute the <em>hist()</em> function just to <em>‘@’</em> (without any key).</p>
<p>bpftrace automatically prints out declared histograms when the program terminates. The value used as base for the histogram creation is the number of read bytes, found through <em>args->ret</em>.</p>
<h3>Tracing userspace programs</h3>
<p>You can also trace userspace programs with <em>uprobes/uretprobes</em> and <em>USDT</em> (User-level Statically Defined Tracing). The next example uses a <em>uretprobe</em>, which probes to the end of a user-level function. It gets the command lines issued in every <em>bash</em> running in the system:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'</pre>
<p>To list all available <em>uprobes/uretprobes</em> of the <em>bash</em> executable, run this command:</p>
<pre class="wp-block-preformatted">$ sudo bpftrace -l "uprobe:/bin/bash"</pre>
<p><em>uprobe</em> instruments the beginning of a user-level function’s execution, and <em>uretprobe</em> instruments the end (its return). <em>readline()</em> is a function of <em>/bin/bash</em>, and it returns the typed command line. <em>retval</em> is the return value for the instrumented function, and can only be accessed on <em>uretprobe</em>.</p>
<p>When using <em>uprobes</em>, you can access arguments with <em>arg0..argN</em>. A <em>str()</em> call is necessary to turn the <em>char *</em> pointer to a <em>string</em>.</p>
<h2>Shipped Scripts</h2>
<p>There are many useful scripts shipped with bpftrace package. You can find them in the <em>/usr/share/bpftrace/tools/</em> directory.</p>
<p>Among them, you can find:</p>
<ul>
<li><em>killsnoop.bt</em> – Trace signals issued by the kill() syscall.</li>
<li><em>tcpconnect.bt</em> – Trace all TCP network connections.</li>
<li><em>pidpersec.bt</em> – Count new procesess (via fork) per second.</li>
<li><em>opensnoop.bt</em> – Trace open() syscalls.</li>
<li><em>vfsstat.bt</em> – Count some VFS calls, with per-second summaries.</li>
</ul>
<p>You can directly use the scripts. For example:</p>
<pre class="wp-block-preformatted">$ sudo /usr/share/bpftrace/tools/killsnoop.bt</pre>
<p>You can also study these scripts as you create new tools.</p>
<h2>Links</h2>
<ul>
<li>bpftrace reference guide – <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md">https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md</a></li>
<li>bpftrace (DTrace 2.0) for Linux 2018 – <a href="http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux-2018.html">http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux-2018.html</a></li>
<li>BPF: the universal in-kernel virtual machine – <a href="https://lwn.net/Articles/599755/">https://lwn.net/Articles/599755/</a></li>
<li>Linux Extended BPF (eBPF) Tracing Tools – <a href="http://www.brendangregg.com/ebpf.html">http://www.brendangregg.com/ebpf.html</a></li>
<li>Dive into BPF: a list of reading material – <a href="https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/">https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf</a></li>
</ul>
<hr class="wp-block-separator" />
<p><em>Photo by </em><a href="https://unsplash.com/@wehavemegapixels?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"><em>Roman Romashov</em></a><em> on </em><a href="https://unsplash.com/search/photos/trace?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"><em>Unsplash</em></a><em>.</em></p>
</div>