SystemTap Notes (16): probe alias

The syntax for a probe alias:

probe <alias> = <probepoint> { <prologue_stmts> }
probe <alias> += <probepoint> { <epilogue_stmts> }

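(1) The prologue_stmts defined with the first form run before the probe handler, while the <epilogue_stmts> defined with the second form run after the probe handler. Note that both forms only define a probe alias; they do not activate it (see "Re: How does stap execute probe aliases?"):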

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
# ./timer_test.stp
semantic error: no probes found
Pass 2: analysis failed.  [man error::pass2]

The following script, which actually uses the alias, runs correctly:

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
probe timer_alias {}
# ./timer_test.stp
Entering timer
......

(2) Consider the output of the following script:

# cat timer_test.stp
#!/usr/bin/stap

probe timer_alias = timer.s(3) {printf("Entering timer\n")}
probe timer_alias += timer.s(3) {printf("Leaving timer\n")}
probe timer_alias {printf("In timer \n")}
# ./timer_test.stp
Entering timer
In timer
In timer
Leaving timer
......

It is equivalent to running the following script (see "Re: Why is the same log printed twice when using probe alias?"):

# cat timer_test.stp
#!/usr/bin/stap

probe timer.s(3)
{
        printf("Entering timer\n")
        printf("In timer\n")
}
probe timer.s(3)
{
        printf("In timer\n")
        printf("Leaving timer\n")
}
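
The ordering rule can also be modelled outside SystemTap. The C sketch below is only an illustration, not how stap is implemented: it assumes that each alias definition yields one derived probe, that `=` places its statements before the handler body and `+=` places them after, and that the derived probes fire in definition order (as in the output shown in this post). The names `alias_def` and `expand_alias` are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* One alias definition: "probe a = pp { stmt }" or "probe a += pp { stmt }".
 * Hypothetical type for this sketch; stap's internals differ. */
struct alias_def {
    const char *stmt;   /* text printed by the alias body */
    int is_epilogue;    /* 0 for "=", 1 for "+=" */
};

/* Append src to dst without overflowing a buffer of dstsz bytes. */
static void append(char *dst, size_t dstsz, const char *src)
{
    size_t used = strlen(dst);
    if (used + 1 < dstsz)
        strncat(dst, src, dstsz - used - 1);
}

/* Write into out what gets printed when "probe alias { handler }" is
 * expanded: one derived probe per definition, in definition order. */
void expand_alias(const struct alias_def *defs, size_t ndefs,
                  const char *handler, char *out, size_t outsz)
{
    if (outsz == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; i < ndefs; i++) {
        if (!defs[i].is_epilogue)
            append(out, outsz, defs[i].stmt);   /* prologue runs first */
        append(out, outsz, handler);            /* then the handler body */
        if (defs[i].is_epilogue)
            append(out, outsz, defs[i].stmt);   /* epilogue runs last */
    }
}
```

Calling `expand_alias` with the two definitions from the script and the handler text "In timer\n" produces the same four lines as the stap run: the handler body appears once per definition, which is why "In timer" is printed twice.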

(3) Alias suffixes

It is possible to include a suffix with a probe alias invocation. If only the initial part of a probe point matches an alias, the remainder is treated as a suffix and attached to the underlying probe point(s) when the alias is expanded. For example:

/* Define an alias: */
probe sendrecv = tcp.sendmsg, tcp.recvmsg { … }

/* Use the alias in its basic form: */
probe sendrecv { … }

/* Use the alias with an additional suffix: */
probe sendrecv.return { … }

Here, the second use of the probe alias is equivalent to writing probe tcp.sendmsg.return, tcp.recvmsg.return.
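
The suffix rule can be sketched as plain string handling. This is a rough model under the assumption that an alias matches only a whole leading component of the probe point; `expand_suffix` and its parameters are hypothetical names, not part of SystemTap.

```c
#include <stdio.h>
#include <string.h>

/* If probe_point is the alias itself, or the alias followed by ".suffix",
 * write every underlying probe point with that suffix attached into out.
 * Returns 0 on success, -1 if the alias does not match or out is too small. */
int expand_suffix(const char *alias, const char *const *targets,
                  size_t ntargets, const char *probe_point,
                  char *out, size_t outsz)
{
    size_t alen = strlen(alias);
    size_t used = 0;

    if (outsz == 0 || strncmp(probe_point, alias, alen) != 0)
        return -1;

    /* Whatever follows the alias name is the suffix (possibly empty). */
    const char *suffix = probe_point + alen;
    if (suffix[0] != '\0' && suffix[0] != '.')
        return -1;

    out[0] = '\0';
    for (size_t i = 0; i < ntargets; i++) {
        int n = snprintf(out + used, outsz - used, "%s%s%s",
                         i ? ", " : "", targets[i], suffix);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

With the alias `sendrecv` defined over `tcp.sendmsg` and `tcp.recvmsg`, expanding `sendrecv.return` yields `tcp.sendmsg.return, tcp.recvmsg.return`, matching the equivalence stated above.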

 

Mesos Notes (1): Architecture

The content of this post is excerpted from the following articles:
APACHE MESOS: THE TRUE OS FOR THE SOFTWARE DEFINED DATA CENTER?
Mesos Architecture

Imagine if instead of individual physical servers, we could aggregate all the resources in a data center into a single large virtual pool, exposing not virtual machines but primitives such as CPU, RAM, and I/O? In conjunction, imagine if we could break applications into small isolated units of tasks that could be dynamically assigned resources from our virtual data center pool, based on the needs of the applications in our data center? The analogy here would be a PC with an operating system that is pooling the PC’s processors and RAM and coordinating the allocation and deallocation of those resources for use by different processes. Now extend that analogy to make the data center the PC with Mesos as the operating system kernel. That, in a nutshell, is how Mesos is transforming the data center and making true SDDC a reality.

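Put simply, all the hardware resources in a data center can be treated as a single whole. Mesos's job is to manage and allocate those resources for applications.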

The Mesos architecture is shown in the figure below:

[Figure: mesos-arch1, the Mesos architecture diagram]

The modified diagram above from the Apache Mesos website shows how Mesos implements its two-level scheduling architecture for managing multiple types of applications. The first level is the master daemon which manages slave daemons running on each node in the Mesos cluster. The cluster consists of all servers, physical or virtual, that will be running application tasks, such as Hadoop and MPI jobs. The second level consists of a component called a framework. A framework includes a scheduler and an executor process, the latter of which also runs on each node. Mesos is able to communicate with different types of frameworks with each one managing a different clustered application. The diagram above shows Hadoop and MPI but other frameworks have been written as well for other types of applications.

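Every node in a Mesos cluster runs a slave daemon, and all of them are managed by a single master daemon. A program that runs on Mesos is called a framework, and it has two parts: a scheduler and an executor process. A distributed system usually consists of one controller and multiple workers, and the workers can run independently of the controller. For a framework, the scheduler is the controller and the executor process is the worker.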

 

A framework running on top of Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework’s tasks (see the App/Framework development guide for more details about application schedulers and executors). While the master determines how many resources are offered to each framework, the frameworks’ schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes to Mesos a description of the tasks it wants to run on them. In turn, Mesos launches the tasks on the corresponding slaves.

The scheduler obtains the resources it needs from the master, while the executor process runs the framework's tasks on the slave nodes.
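
The division of labour between the two levels can be sketched in a few lines of C. This is a toy model with a single framework, not the Mesos API: the types `offer` and `task` and the functions `scheduler_pick` and `master_offer_and_launch` are invented for illustration, and a real master also applies an allocation policy to decide which framework receives which offers.

```c
#include <stddef.h>

/* Free resources on one slave node, as seen by the master. */
struct offer {
    int slave_id;
    double cpus;
    int mem_mb;
};

/* A task the framework wants to run. */
struct task {
    double cpus;
    int mem_mb;
    int slave_id;   /* set when the task is launched, -1 while pending */
};

/* Second level: the framework's scheduler chooses among the offers.
 * Here it takes the first offer that fits; -1 means "decline all". */
int scheduler_pick(const struct offer *offers, size_t n, const struct task *t)
{
    for (size_t i = 0; i < n; i++)
        if (offers[i].cpus >= t->cpus && offers[i].mem_mb >= t->mem_mb)
            return (int)i;
    return -1;
}

/* First level: the master makes the offers, then launches the task on the
 * slave whose offer was accepted and deducts the resources it consumes. */
int master_offer_and_launch(struct offer *offers, size_t n, struct task *t)
{
    int i = scheduler_pick(offers, n, t);
    if (i < 0) {
        t->slave_id = -1;
        return -1;
    }
    offers[i].cpus -= t->cpus;
    offers[i].mem_mb -= t->mem_mb;
    t->slave_id = offers[i].slave_id;
    return 0;
}
```

The point of the split is that the master never needs to understand the framework's placement preferences: it only decides what to offer, and the scheduler decides what to accept.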

First impressions of Ubuntu

I tried Ubuntu today. My first impression is that it handles the root account rather differently from RHEL and SuSE:

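(1) The Ubuntu installer has no step for setting a root password. You have to log in with the account you created and then set the root password with the sudo passwd command.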

(2) It seems that the root account cannot be used for remote ssh connections (update: a fix is described here):

$ ssh root@10.10.249.177

Use the account you created instead:

$ ssh nan@10.10.249.177

 

Haskell Notes (4): Writing expressions in prefix form

The infix style of writing an expression is just a convenience; we can also write an expression in prefix form, where the operator precedes its arguments. To do this, we must enclose the operator in parentheses.

For example (the last input fails because the operator is not enclosed in parentheses):

ghci> 3 ^ 4
81
ghci> (^) 3 4
81
ghci> ^ 3 4

<interactive>:68:1: parse error on input ‘^’

 

SystemTap Notes (15): syscall probes

SystemTap provides probes for system calls (syscalls):

# stap -L "syscall.*"
syscall.accept sockfd:long addr_uaddr:long addrlen_uaddr:long name:string flags:long flags_str:string argstr:string
syscall.accept4 sockfd:long addr_uaddr:long addrlen_uaddr:long flags:long name:string flags_str:string argstr:string
syscall.access name:string pathname:string mode:long mode_str:string argstr:string $filename:long int $mode:long int
syscall.acct name:string filename:string argstr:string $name:long int
......
# stap -L "syscall.*.return"
syscall.accept.return
syscall.accept4.return
syscall.access.return name:string retstr:string $return:long int $filename:long int $mode:long int
syscall.acct.return name:string retstr:string $return:long int $name:long int
......

On the variables defined by syscall probes:

Each probe alias defines a variety of variables. Look at the tapset source code to find the most reliable source of variable definitions. Generally, each variable listed in the standard manual page is available as a script-level variable. For example, syscall.open exposes file name, flags, and mode. In addition, a standard suite of variables is available at most aliases, as follows:

argstr: A pretty-printed form of the entire argument list, without parentheses.
name: The name of the system call.
retstr: For return probes, a pretty-printed form of the system call result.

Take syscall.open and syscall.open.return as examples:

# stap -L "syscall.open"
syscall.open filename:string mode:long __nr:long name:string flags:long argstr:string $filename:long int $flags:long int $mode:long int

# stap -e 'probe syscall.open{printf("argstr is %s, __nr is %d\n", argstr, __nr)}'
argstr is "/sys/fs/cgroup/systemd/system.slice/systemd-udevd.service/cgroup.procs", O_RDONLY|O_CLOEXEC, __nr is 2
argstr is "/etc/passwd", O_RDONLY|O_CLOEXEC, __nr is 2
argstr is "/proc/self/maps", O_RDONLY|O_CLOEXEC, __nr is 2
......

# stap -e 'probe syscall.open{printf("filename is %s, name is %s, flags is 0x%x, mode is 0x%x\n", filename, name, flags, mode)}'
filename is "/sys/fs/cgroup/systemd/system.slice/systemd-udevd.service/cgroup.procs", name is open, flags is 0x80000, mode is 0x1b6
filename is "/proc/interrupts", name is open, flags is 0x0, mode is 0x1b6
filename is "/proc/stat", name is open, flags is 0x0, mode is 0x1b6
......

# stap -e 'probe syscall.open{printf("filename is 0x%x, $flags is 0x%x, $mode is 0x%x\n", $filename, $flags, $mode)}'
filename is 0x1a658f0, $flags is 0x80000, $mode is 0x1b6
filename is 0x7f750760b26e, $flags is 0x0, $mode is 0x1b6
filename is 0x7f750760b291, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
......

# stap -L "syscall.open.return"
syscall.open.return __nr:long name:string retstr:string $return:long int $filename:long int $flags:long int $mode:long int

# stap -e 'probe syscall.open.return{printf("__nr is %d, name is %s, retstr is %s\n", __nr, name, retstr)}'
__nr is 2, name is open, retstr is 13
__nr is 2, name is open, retstr is 3
__nr is 2, name is open, retstr is 3
__nr is 2, name is open, retstr is -2 (ENOENT)
__nr is 2, name is open, retstr is -2 (ENOENT)
__nr is 2, name is open, retstr is -2 (ENOENT)
......

# stap -e 'probe syscall.open.return{printf("filename is 0x%x, $flags is 0x%x, $mode is 0x%x\n", $filename, $flags, $mode)}'
filename is 0x7f750760b26e, $flags is 0x0, $mode is 0x1b6
filename is 0x7f750760b291, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
filename is 0x7ffcf45d73d0, $flags is 0x0, $mode is 0x1b6
......

 

The find command's “-exec COMMAND \;”

The following find command lists the *.stp files under the current directory:

# find . -name '*.stp' -exec ls {} \;
./Documents/one.stp
./Documents/two.stp

About the find command's “-exec COMMAND \;”:

find

-exec COMMAND \;

Carries out COMMAND on each file that find matches. The command sequence terminates with ; (the “;” is escaped to make certain the shell passes it to find literally, without interpreting it as a special character).

If COMMAND contains {}, then find substitutes the full path name of the selected file for “{}”.

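The ; marks the end of the command, and writing it as \; makes the shell pass the ; through to find unchanged. {} is replaced with the full path name of each file that find matches.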

References:
16.2. Complex Commands

 

Linux kernel Notes (59): “depends on” and “select” in Kconfig

In a Kconfig file:

config A
    depends on B
    select C

This means that whether CONFIG_A can be enabled depends on whether CONFIG_B is enabled. Once CONFIG_A is enabled, CONFIG_C is enabled automatically as well.
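
The two relations point in opposite directions, which a tiny C model makes explicit. This is a simplification that treats every symbol as a plain boolean and ignores tristate values and C's own dependencies; `struct kconfig` and `enable_a` are hypothetical names, not part of the kernel build system.

```c
#include <stdbool.h>

/* The three symbols from the Kconfig fragment, as plain booleans. */
struct kconfig {
    bool a;
    bool b;
    bool c;
};

/* Try to enable A.
 * "depends on B": A cannot be turned on unless B already is.
 * "select C":     turning A on forces C on as well.
 * Returns true if A ended up enabled. */
bool enable_a(struct kconfig *k)
{
    if (!k->b)
        return false;   /* dependency not met, A stays off */
    k->a = true;
    k->c = true;        /* C follows A automatically */
    return true;
}
```

In other words, B constrains A from above, while A drags C along from below.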

References:
“select” vs “depends” in kernel Kconfig

 

SystemTap Notes (14): Tracing user-space programs requires the debug-info package

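When SystemTap traces a user-space program, the debug-info package for that program must be installed. For example, without it the probe point cannot be resolved: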

 # stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}
                      ^

semantic error: no match (similar functions: malloc, calloc, realloc, close, mbrtowc)
Pass 2: analysis failed.  [man error::pass2]

Install the coreutils-debuginfo package:

# zypper in coreutils-debuginfo
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
  coreutils-debuginfo

The following package is not supported by its vendor:
  coreutils-debuginfo

1 new package to install.
Overall download size: 2.1 MiB. Already cached: 0 B. After the operation, additional 18.6 MiB will be used.
Continue? [y/n/? shows all options] (y): y
Retrieving package coreutils-debuginfo-8.22-9.1.x86_64                                                (1/1),   2.1 MiB ( 18.6 MiB unpacked)
Retrieving: coreutils-debuginfo-8.22-9.1.x86_64.rpm ...................................................................[done (105.3 KiB/s)]
Checking for file conflicts: ........................................................................................................[done]
(1/1) Installing: coreutils-debuginfo-8.22-9.1 ......................................................................................[done]

Run the same command again:

# stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
bin  boot  dev  etc  home  lib  lib64  lost+found  mnt  opt  proc  root  run  sbin  selinux  srv  sys  tmp  usr  var
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403828 : main+0xa58/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403887 : main+0xab7/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x4039e4 : main+0xc14/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
.....

 

Linux kernel Notes (58): ioctl

The function prototype of the ioctl system call:

int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).

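The ... does not denote a variable number of arguments. It stands for a single optional argument, and it is written this way to prevent type checking at compile time.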
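
To illustrate how the meaning of that single argument depends on the command, here is a small user-space mock of a driver-side handler. It is a sketch only: the command numbers and the names `demo_ioctl` and `demo_level` are invented, and a real driver would copy data with helpers such as put_user() or copy_to_user() instead of dereferencing the pointer directly.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical command numbers, for this sketch only. */
enum { DEMO_RESET = 0, DEMO_SET_LEVEL = 1, DEMO_GET_LEVEL = 2 };

static int demo_level;

/* The handler always receives one unsigned long; each command decides
 * whether to ignore it, use it as a value, or treat it as a pointer. */
long demo_ioctl(unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case DEMO_RESET:            /* no argument: arg is ignored */
        demo_level = 0;
        return 0;
    case DEMO_SET_LEVEL:        /* argument passed by value */
        demo_level = (int)arg;
        return 0;
    case DEMO_GET_LEVEL:        /* argument is a pointer to the result */
        *(int *)(uintptr_t)arg = demo_level;
        return 0;
    default:
        return -ENOTTY;         /* conventional error for an unknown command */
    }
}
```

A caller would pass `7` for DEMO_SET_LEVEL but `(unsigned long)(uintptr_t)&out` for DEMO_GET_LEVEL, and the compiler cannot check either one, which is exactly what the untyped third argument implies.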

struct file_operations no longer has the old ioctl member, which used to be declared as:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);

It has been replaced by unlocked_ioctl and compat_ioctl:

long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);

unlocked_ioctl replaces ioctl, while compat_ioctl handles the case where a 32-bit program running on a 64-bit operating system calls the ioctl system call.

An ioctl command is 32 bits long and consists of the following four fields:

---------------------------------------------------------------
| direction(2/3-bit)|size(14/13-bit)| type(8-bit)|number(8-bit)|

The fields are defined as follows:

type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).

number
The ordinal (sequential) number. It’s eight bits (_IOC_NRBITS) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; _IOC_READ means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so _IOC_READ and _IOC_WRITE can be extracted using a logical AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro _IOC_SIZEBITS. It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.

相关的macro定义:

extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
    ((sizeof(t) == sizeof(t[1]) && \
      sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
      sizeof(t) : __invalid_size_argument_for_IOC)

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/* used to create numbers */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOR_BAD(type,nr,size)  _IOC(_IOC_READ,(type),(nr),sizeof(size))
#define _IOW_BAD(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size))

/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)        (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)       (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)     (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)       (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)
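
As a worked example of the encoding, the sketch below packs and unpacks a command number by hand. It assumes the asm-generic layout used on x86 (number 8 bits, type 8 bits, size 14 bits, direction 2 bits, with direction values none = 0, write = 1, read = 2); the DEMO_ constants and demo_ functions are stand-ins written for this post so that the example compiles without kernel headers, and real code should use the _IOC macros above.

```c
#include <stdint.h>

/* Field widths and shifts, assuming the asm-generic (x86) layout. */
enum {
    DEMO_NRBITS    = 8,
    DEMO_TYPEBITS  = 8,
    DEMO_SIZEBITS  = 14,
    DEMO_DIRBITS   = 2,
    DEMO_NRSHIFT   = 0,
    DEMO_TYPESHIFT = DEMO_NRSHIFT + DEMO_NRBITS,      /* 8 */
    DEMO_SIZESHIFT = DEMO_TYPESHIFT + DEMO_TYPEBITS,  /* 16 */
    DEMO_DIRSHIFT  = DEMO_SIZESHIFT + DEMO_SIZEBITS   /* 30 */
};

/* Direction values in the same layout. */
enum { DEMO_NONE = 0, DEMO_WRITE = 1, DEMO_READ = 2 };

/* Pack the four fields into one 32-bit command, like _IOC(). */
uint32_t demo_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir << DEMO_DIRSHIFT) | (type << DEMO_TYPESHIFT) |
           (nr << DEMO_NRSHIFT) | (size << DEMO_SIZESHIFT);
}

/* Extract each field again, like _IOC_DIR() and friends. */
uint32_t demo_ioc_dir(uint32_t cmd)
{
    return (cmd >> DEMO_DIRSHIFT) & ((1u << DEMO_DIRBITS) - 1);
}

uint32_t demo_ioc_type(uint32_t cmd)
{
    return (cmd >> DEMO_TYPESHIFT) & ((1u << DEMO_TYPEBITS) - 1);
}

uint32_t demo_ioc_nr(uint32_t cmd)
{
    return (cmd >> DEMO_NRSHIFT) & ((1u << DEMO_NRBITS) - 1);
}

uint32_t demo_ioc_size(uint32_t cmd)
{
    return (cmd >> DEMO_SIZESHIFT) & ((1u << DEMO_SIZEBITS) - 1);
}
```

For a read command with magic number 'k' (0x6b), ordinal 1 and a 4-byte payload, `demo_ioc(DEMO_READ, 'k', 1, 4)` gives 0x80046b01: 0x80000000 for the direction, 0x00040000 for the size, 0x6b00 for the type and 0x01 for the number.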

References:
The new way of ioctl()
Linux Kernel ioctl(), unlocked_ioctl(), and compat_ioctl()
Advanced Char Driver Operations