SystemTap Notes (16): probe alias

Probe alias syntax:

probe <alias> = <probepoint> { <prologue_stmts> }
probe <alias> += <probepoint> { <epilogue_stmts> }

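(1) The prologue_stmts defined in the first form run before the probe handler, while the <epilogue_stmts> defined in the second form run after the probe handler. Note that both forms only define a probe alias; they do not activate it (see "Re: How does stap execute probe aliases?"):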

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
# ./timer_test.stp
semantic error: no probes found
Pass 2: analysis failed.  [man error::pass2]

The following version runs correctly, because the alias is actually used by a probe:

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
probe timer_alias {}
# ./timer_test.stp
Entering timer
......

(2) Consider the output of the following script:

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
probe timer_alias += timer.s(3) {printf("Leaving timer\n")}
probe timer_alias {printf("In timer \n")}
# ./timer_test.stp
Entering timer
In timer
In timer
Leaving timer
......

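Each of the two alias definitions is expanded separately, so the handler body runs once per definition. The script is equivalent to the following one (see "Re: Why is the same log printed twice when using probe alias?"):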

# cat timer_test.stp
#!/usr/bin/stap

probe timer.s(3)
{
        printf("Entering timer\n")
        printf("In timer\n")
}
probe timer.s(3)
{
        printf("In timer\n")
        printf("Leaving timer\n")
}

(3) Alias suffixes

It is possible to include a suffix with a probe alias invocation. If only the initial part of a probe point matches an alias, the remainder is treated as a suffix and attached to the underlying probe point(s) when the alias is expanded. For example:

/* Define an alias: */
probe sendrecv = tcp.sendmsg, tcp.recvmsg { … }

/* Use the alias in its basic form: */
probe sendrecv { … }

/* Use the alias with an additional suffix: */
probe sendrecv.return { … }

Here, the second use of the probe alias is equivalent to writing probe tcp.sendmsg.return, tcp.recvmsg.return.
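
The documentation example above elides the handler bodies. A minimal runnable sketch of the same idea follows. It assumes a kernel with debuginfo installed in which vfs_read and vfs_write can be probed; the alias name vfsrw is made up for illustration and is not part of any tapset.

```systemtap
#!/usr/bin/stap

# Hypothetical alias covering two kernel functions; the prologue is empty.
probe vfsrw = kernel.function("vfs_read"), kernel.function("vfs_write") { }

# The .return suffix is attached to each underlying probe point, so this
# is expected to expand to kernel.function("vfs_read").return and
# kernel.function("vfs_write").return.
probe vfsrw.return
{
        # pp() prints the fully resolved probe point that fired.
        printf("%s\n", pp())
        exit()
}
```

The printed probe point should end in .return, which is a quick way to check how the suffix was applied.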

 

SystemTap Notes (15): syscall probes

SystemTap provides probes for system calls (syscall):

# stap -L "syscall.*"
syscall.accept sockfd:long addr_uaddr:long addrlen_uaddr:long name:string flags:long flags_str:string argstr:string
syscall.accept4 sockfd:long addr_uaddr:long addrlen_uaddr:long flags:long name:string flags_str:string argstr:string
syscall.access name:string pathname:string mode:long mode_str:string argstr:string $filename:long int $mode:long int
syscall.acct name:string filename:string argstr:string $name:long int
......
# stap -L "syscall.*.return"
syscall.accept.return
syscall.accept4.return
syscall.access.return name:string retstr:string $return:long int $filename:long int $mode:long int
syscall.acct.return name:string retstr:string $return:long int $name:long int
......

On the variables defined by syscall probes:

Each probe alias defines a variety of variables. Look at the tapset source code to find the most reliable source of variable definitions. Generally, each variable listed in the standard manual page is available as a script-level variable. For example, syscall.open exposes file name, flags, and mode. In addition, a standard suite of variables is available at most aliases, as follows:

argstr: A pretty-printed form of the entire argument list, without parentheses.
name: The name of the system call.
retstr: For return probes, a pretty-printed form of the system call result.

Taking syscall.open and syscall.open.return as examples:

# stap -L "syscall.open"
syscall.open filename:string mode:long __nr:long name:string flags:long argstr:string $filename:long int $flags:long int $mode:long int

# stap -e 'probe syscall.open{printf("argstr is %s, __nr is %d\n", argstr, __nr)}'
argstr is "/sys/fs/cgroup/systemd/system.slice/systemd-udevd.service/cgroup.procs", O_RDONLY|O_CLOEXEC, __nr is 2
argstr is "/etc/passwd", O_RDONLY|O_CLOEXEC, __nr is 2
argstr is "/proc/self/maps", O_RDONLY|O_CLOEXEC, __nr is 2
......

# stap -e 'probe syscall.open{printf("filename is %s, name is %s, flags is 0x%x, mode is 0x%x\n", filename, name, flags, mode)}'
filename is "/sys/fs/cgroup/systemd/system.slice/systemd-udevd.service/cgroup.procs", name is open, flags is 0x80000, mode is 0x1b6
filename is "/proc/interrupts", name is open, flags is 0x0, mode is 0x1b6
filename is "/proc/stat", name is open, flags is 0x0, mode is 0x1b6
......

# stap -e 'probe syscall.open{printf("filename is 0x%x, $flags is 0x%x, $mode is 0x%x\n", $filename, $flags, $mode)}'
filename is 0x1a658f0, $flags is 0x80000, $mode is 0x1b6
filename is 0x7f750760b26e, $flags is 0x0, $mode is 0x1b6
filename is 0x7f750760b291, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
......

# stap -L "syscall.open.return"
syscall.open.return __nr:long name:string retstr:string $return:long int $filename:long int $flags:long int $mode:long int

# stap -e 'probe syscall.open.return{printf("__nr is %d, name is %s, retstr is %s\n", __nr, name, retstr)}'
__nr is 2, name is open, retstr is 13
__nr is 2, name is open, retstr is 3
__nr is 2, name is open, retstr is 3
__nr is 2, name is open, retstr is -2 (ENOENT)
__nr is 2, name is open, retstr is -2 (ENOENT)
__nr is 2, name is open, retstr is -2 (ENOENT)
......

# stap -e 'probe syscall.open.return{printf("filename is 0x%x, $flags is 0x%x, $mode is 0x%x\n", $filename, $flags, $mode)}'
filename is 0x7f750760b26e, $flags is 0x0, $mode is 0x1b6
filename is 0x7f750760b291, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
......

 

SystemTap Notes (14): Tracing user-space programs requires debug-info packages

When SystemTap traces a user-space program, the debug-info package for that program must be installed. For example:

 # stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}
                      ^

semantic error: no match (similar functions: malloc, calloc, realloc, close, mbrtowc)
Pass 2: analysis failed.  [man error::pass2]

Install the coreutils-debuginfo package:

# zypper in coreutils-debuginfo
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
  coreutils-debuginfo

The following package is not supported by its vendor:
  coreutils-debuginfo

1 new package to install.
Overall download size: 2.1 MiB. Already cached: 0 B. After the operation, additional 18.6 MiB will be used.
Continue? [y/n/? shows all options] (y): y
Retrieving package coreutils-debuginfo-8.22-9.1.x86_64                                                (1/1),   2.1 MiB ( 18.6 MiB unpacked)
Retrieving: coreutils-debuginfo-8.22-9.1.x86_64.rpm ...................................................................[done (105.3 KiB/s)]
Checking for file conflicts: ........................................................................................................[done]
(1/1) Installing: coreutils-debuginfo-8.22-9.1 ......................................................................................[done]

Run the same command again:

# stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
bin  boot  dev  etc  home  lib  lib64  lost+found  mnt  opt  proc  root  run  sbin  selinux  srv  sys  tmp  usr  var
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403828 : main+0xa58/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403887 : main+0xab7/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x4039e4 : main+0xc14/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
.....

 

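SystemTap Notes (13): Statistical aggregate

A statistical aggregate collects statistics over a set of values. The "<<< value" operator adds a value to the set. For example, starting from an empty set, "<<< 1" leaves the set with one element, 1; a further "<<< 2" leaves it with two elements, 1 and 2. Several extractor operations work on a statistical aggregate, such as @count and @avg. The following example makes their meaning clear: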

# cat test.stp
#!/usr/bin/stap

global reads

probe vfs.read
{
  reads[execname(),pid()] <<< $count
}
probe timer.s(3)
{
  foreach([execname, pid] in reads)
  {
        if (execname == "stapio")
        {
                printf("count (%s %d) : %d \n", execname, pid, @count(reads[execname, pid]))
                printf("sum (%s %d) : %d \n", execname, pid, @sum(reads[execname, pid]))
                printf("min (%s %d) : %d \n", execname, pid, @min(reads[execname, pid]))
                printf("max (%s %d) : %d \n", execname, pid, @max(reads[execname, pid]))
                printf("avg (%s %d) : %d \n", execname, pid, @avg(reads[execname, pid]))
                exit()
        }
  }
}
# ./test.stp
count (stapio 10762) : 17
sum (stapio 10762) : 1982472
min (stapio 10762) : 8196
max (stapio 10762) : 131072
avg (stapio 10762) : 116616
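
To try the operators without depending on live vfs.read traffic, here is a minimal sketch that feeds the two samples from the description above into a scalar aggregate. The variable name s is chosen only for illustration.

```systemtap
#!/usr/bin/stap

global s

probe begin
{
        # Add two samples, as in the "<<< 1" / "<<< 2" description.
        s <<< 1
        s <<< 2
        # Every extractor returns an integer.
        printf("count=%d sum=%d min=%d max=%d avg=%d\n",
               @count(s), @sum(s), @min(s), @max(s), @avg(s))
        exit()
}
```

Because SystemTap values are 64-bit integers or strings, @avg yields an integer rather than a fraction.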

References:
Computing for Statistical Aggregates

 

SystemTap Notes (12): Associative arrays and foreach

SystemTap also provides associative arrays, which must be declared as global variables:

global array_name[index_expression]

Up to 9 index expressions may be used, separated by commas (,):

device[pid(),execname(),uid(),ppid(),"W"] = devname

An if statement can test whether a given key exists:

if([index_expression] in array_name) statement

Take the following program as an example:

# cat test.stp
#!/usr/bin/stap
global reads

probe vfs.read
{
  reads[execname()] ++
}

probe timer.s(3)
{
  printf("=======\n")
  foreach (count in reads+)
    printf("%s : %d \n", count, reads[count])
  if(["stapio"] in reads) {
    printf("stapio read detected, exiting\n")
    exit()
  }
}

# ./test.stp
=======
systemd-udevd : 4
gmain : 5
avahi-daemon : 11
stapio : 17
stapio read detected, exiting

The delete operation removes a single element or the whole array:

delete removes an element.

The following statement removes from ARRAY the element specified by the index tuple. The value will no longer be available, and subsequent iterations will not report the element. It is not an error to delete an element that does not exist.

delete ARRAY[INDEX1, INDEX2, …]

The following syntax removes all elements from ARRAY:
delete ARRAY
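
A common use of delete is to reset a counter array at the end of each reporting interval. The sketch below assumes the same vfs.read probe used elsewhere in these notes; the 3-second interval and the limit of 5 are arbitrary choices.

```systemtap
#!/usr/bin/stap

global reads

probe vfs.read
{
        reads[execname()]++
}

probe timer.s(3)
{
        # Print the busiest readers of the last interval.
        foreach (name in reads- limit 5)
                printf("%s : %d\n", name, reads[name])
        printf("=======\n")
        # Clear the whole array outside the foreach loop, so that the
        # next interval starts counting from zero.
        delete reads
}
```

The script runs until interrupted with Ctrl-C. The delete is placed after the loop because an array must not be modified inside a foreach over it.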

foreach is an important way to print the contents of an associative array:

General syntax:
foreach (VAR in ARRAY) STMT
The foreach statement loops over each element of a named global array, assigning the current key to VAR. The array must not be modified within the statement. If you add a single plus (+) or minus (-) operator after the VAR or the ARRAY identifier, the iteration order will be sorted by the ascending or descending index or value.

The following statement behaves the same as the first example, except it is used when an array is indexed with a tuple of keys. Use a sorting suffix on at most one VAR or ARRAY identifier.

foreach ([VAR1, VAR2, …] in ARRAY) STMT

You can combine the first and second syntax to capture both the full tuple and the keys at the same time as follows.

foreach (VALUE = [VAR1, VAR2, …] in ARRAY) STMT

The following statement is the same as the first example, except that the limit keyword limits the number of loop iterations to EXP times. EXP is evaluated once at the beginning of the loop.

foreach (VAR in ARRAY limit EXP) STMT

The following example shows how to work with an associative array using foreach:

# cat test.stp
#!/usr/bin/stap 
global reads 

probe vfs.read 
{ 
    reads[pid(), execname()]++ 
} 

probe timer.s(3) 
{ 
    printf("======Total=====\n") 
    foreach ([pid, name] in reads) 
        printf("%d %s: %d\n", pid, name, reads[pid, name]) 
    printf("======Another Total=====\n") 
    foreach (val = [pid, name] in reads) 
        printf("%d %s: %d\n", pid, name, val) 
    printf("======PID Ascending=====\n") 
    foreach ([pid+, name] in reads limit 3) 
        printf("%d %s: %d\n", pid, name, reads[pid, name]) 
    printf("======Count Descending=====\n") 
    foreach ([pid, name] in reads- limit 3) 
        printf("%d %s: %d\n", pid, name, reads[pid, name]) 
    exit() 
} 

Running it gives:

# ./test.stp
======Total=====
14367 stapio: 17
976 gmain: 5
791 avahi-daemon: 11
806 irqbalance: 16
14368 systemd-udevd: 1
432 systemd-udevd: 3
======Another Total=====
14367 stapio: 17
976 gmain: 5
791 avahi-daemon: 11
806 irqbalance: 16
14368 systemd-udevd: 1
432 systemd-udevd: 3
======PID Ascending=====
432 systemd-udevd: 3
791 avahi-daemon: 11
806 irqbalance: 16
======Count Descending=====
14367 stapio: 17
806 irqbalance: 16
791 avahi-daemon: 11

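The script first prints the full contents of the associative array (in two different ways), then prints the first three entries sorted by ascending PID, and finally the first three entries sorted by descending value.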

 

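SystemTap Notes (11): Command-line arguments

SystemTap uses $ and @ to pass command-line arguments: $N substitutes the N-th argument as an integer, and @N substitutes it as a string. For example: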

# cat test.stp
#!/usr/bin/stap

probe begin
{
        printf("arg1 is %d, arg2 is %s\n", $1, @2)
        exit()
}

Running it gives:

 # ./test.stp 100 "test"
arg1 is 100, arg2 is test

References:
Command-Line Arguments

 

SystemTap Notes (10): "@defined" and "@choose_defined"

As code evolves, some target variables may no longer exist in newer versions. @defined checks whether a target variable exists. For example:

probe vm.pagefault = kernel.function("__handle_mm_fault@mm/memory.c") ?,
                     kernel.function("handle_mm_fault@mm/memory.c") ?
{
        write_access = (@defined($flags) ? $flags & FAULT_FLAG_WRITE : $write_access)
}

The code above assigns write_access a different value depending on whether $flags exists.

There is also @choose_defined: @choose_defined($a, $b) is equivalent to @defined($a) ? $a : $b. For example:

probe vm.pagefault = kernel.function("handle_mm_fault@mm/memory.c")
{
        write_access = @choose_defined($write_access, 0)
}
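
The two snippets above are alias definitions and do not run on their own. A self-contained sketch of the same checks follows. It assumes that handle_mm_fault can be probed on the running kernel; whether $flags exists there depends on the kernel version, which is exactly what @defined is meant to absorb.

```systemtap
#!/usr/bin/stap

probe kernel.function("handle_mm_fault")
{
        # Only the branch whose condition holds at translation time
        # needs to resolve its target variables.
        if (@defined($flags))
                printf("flags=0x%x\n", $flags)
        else
                printf("flags is not available at this probe point\n")

        # Same decision in expression form: fall back to 0 when
        # $flags does not exist.
        printf("choose_defined gives 0x%x\n", @choose_defined($flags, 0))
        exit()
}
```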

 

References:
Checking Target Variable Availability

Arguments

 

SystemTap Notes (9): "?" and "!"

A probe point is sometimes followed by a ? or ! character:

kernel.function("no_such_function") ?
module("awol").function("no_such_function") !

The man page explains this as follows:

However, a probe point may be followed by a “?” character, to indicate that it is optional, and that no error should result if it fails to resolve. Optionalness passes down through all levels of alias/wildcard expansion. Alternately, a probe point may be followed by a “!” character, to indicate that it is both optional and sufficient. (Think vaguely of the prolog cut operator.) If it does resolve, then no further probe points in the same comma-separated list will be resolved. Therefore, the “!” sufficiency mark only makes sense in a list of probe point alternatives.

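? marks the probe point as optional: even if no matching probe exists, the command does not fail, and resolution continues with the other probe points. ! means that once this probe point resolves successfully, the probe points after it are not resolved. Therefore ! is only meaningful within a list of probe points.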
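
A minimal sketch that shows both markers side by side. It assumes vfs_read and vfs_write can be probed on the running kernel; no_such_function is deliberately a name that does not exist, and the hits array is introduced only for this illustration.

```systemtap
#!/usr/bin/stap

global hits

# The first probe point does not resolve, but "?" makes that harmless:
# the handler ends up attached to vfs_read only.
probe kernel.function("no_such_function") ?,
      kernel.function("vfs_read")
{
        hits["optional", pp()]++
}

# vfs_read resolves and is marked sufficient with "!", so vfs_write
# should never be resolved for this handler.
probe kernel.function("vfs_read") !,
      kernel.function("vfs_write")
{
        hits["sufficient", pp()]++
}

probe timer.s(3)
{
        foreach ([kind, point] in hits)
                printf("%s %s : %d\n", kind, point, hits[kind, point])
        exit()
}
```

If the markers behave as the man page describes, every line of the report names vfs_read, and no "sufficient" line names vfs_write.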

 

SystemTap Notes (8): typecasting

When a pointer has type void *, or has been stored as an integer, the @cast operator can specify the pointer's data type:

@cast(p, "type_name"[, "module"])->member

The optional module argument specifies where type_name is looked up (The optional module tells the translator where to look for information about that type. Multiple modules may be specified as a list with : separators. If the module is not specified, it will default either to the probe module for dwarf probes, or to “kernel” for functions and all other probes types.).

In addition, the translator can build type information from header files:

The translator can create its own module with type information from a header surrounded by angle brackets, in case normal debuginfo is not available. For kernel headers, prefix it with “kernel” to use the appropriate build system. All other headers are build with default GCC parameters into a user module. Multiple headers may be specified in sequence to resolve a codependency.

@cast(tv, "timeval", "<sys/time.h>")->tv_sec
@cast(task, "task_struct", "kernel<linux/sched.h>")->tgid
@cast(task, "task_struct", "kernel<linux/sched.h><linux/fs_struct.h>")->fs->umask

Example:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%d\n", $f->f_flags); exit(); }'
32768

Using the @cast operator:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%d\n", @cast($f, "file", "kernel<linux/fs.h>" )->f_flags); exit(); }'
32768

SystemTap Notes (7): target variable (2)

SystemTap can generate a set of printable strings for target variables:

$$vars

Expands to a character string that is equivalent to sprintf("parm1=%x … parmN=%x var1=%x … varN=%x", parm1, …, parmN, var1, …, varN) for each variable in scope at the probe point. Some values may be printed as "=?" if their run-time location cannot be found.

$$locals

Expands to a subset of $$vars containing only the local variables.

$$parms

Expands to a subset of $$vars containing only the function parameters.

$$return

Is available in return probes only. It expands to a string that is equivalent to sprintf("return=%x", $return) if the probed function has a return value, or else an empty string.

See the following examples:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$vars); exit(); }'
f=0xffff880022ec6080 open=0x0 cred=0xffff880030d483c0 empty_fops={...} inode=? error=?
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms); exit(); }'
f=0xffff880030d453c0 open=0x0 cred=0xffff880030d483c0
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$locals); exit(); }'
empty_fops={...} inode=? error=?
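
The commands above do not exercise $$return. A minimal sketch, assuming the same do_dentry_open function can also be probed at return on the running kernel:

```systemtap
#!/usr/bin/stap

probe kernel.function("do_dentry_open").return
{
        # $$return is only valid in return probes.
        printf("%s\n", $$return)
        exit()
}
```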

Appending $ or $$ to the variables above prints more detailed structure information. See the following example:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms$); exit(); }'
f={.f_u={...}, .f_path={...}, .f_inode=0x0, .f_op=0x0, .f_lock={...}, .f_count={...}, .f_flags=32768, .f_mode=0, .f_pos=0, .f_owner={...}, .f_cred=0xffff880030d483c0, .f_ra={...}, .f_version=0, .f_security=0xffff880012982b80, .private_data=0x0, .f_ep_links={...}, .f_tfile_llink={...}, .f_mapping=0x0} open=<function>:0x0 cred={.usage={...}, .uid={...}, .gid={...}, .suid={...}, .sgid={...}, .euid={...}, .egid={...}, .fsuid={...}, .fsgid={...}, .securebits=0, .cap_inheritable={...}, .cap_permitted={...}, .cap
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms$$); exit(); }'
f={.f_u={.fu_llist={.next=0x0}, .fu_rcuhead={.next=0x0, .func=0x0}}, .f_path={.mnt=0xffff8800337510e0, .dentry=0xffff8800318b82d8}, .f_inode=0x0, .f_op=0x0, .f_lock={<union>={.rlock={.raw_lock={.head_tail=0, <class>={.tickets={.head=0, .tail=0}, .owner=0}}}}}, .f_count={.counter=1}, .f_flags=32768, .f_mode=0, .f_pos=0, .f_owner={.lock={.raw_lock={.lock=1048576, .write=1048576}}, .pid=0x0, .pid_type=0, .uid={.val=0}, .euid={.val=0}, .signum=0}, .f_cred=0xffff880030d89300, .f_ra={.start=0, .size=0, .async_si