openat VS open

Starting with kernel version 2.6.16, Linux provides the openat system call:

#define _XOPEN_SOURCE 700 /* Or define _POSIX_C_SOURCE >= 200809 */
#include <fcntl.h>
int openat(int  dirfd , const char * pathname , int  flags , ... /* mode_t  mode */);
Returns file descriptor on success, or –1 on error

Compared with open, it takes one extra argument, dirfd. The following explanation describes how it is used:

If pathname specifies a relative pathname, then it is interpreted relative to the directory referred to by the open file descriptor dirfd, rather than relative to the process’s current working directory.

If pathname specifies a relative pathname, and dirfd contains the special value AT_FDCWD , then pathname is interpreted relative to the process’s current working directory (i.e., the same behavior as open(2)).

If pathname specifies an absolute pathname, then dirfd is ignored.

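In summary: if pathname is absolute, dirfd is ignored. If pathname is relative and dirfd is not AT_FDCWD, pathname is resolved relative to the directory that dirfd refers to, rather than the process's current working directory. If dirfd is AT_FDCWD, pathname is resolved relative to the process's current working directory, which is the same behavior as open. The kernel code makes this clear: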

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;

    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
        umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;

    return do_sys_open(dfd, filename, flags, mode);
}
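
The dirfd semantics can also be tried from user space. The following is a minimal sketch, assuming Python 3 on Linux, where os.open with the dir_fd argument is implemented on top of openat(2). The helper name read_relative and the file name note.txt are made up for illustration.

```python
import os
import tempfile

def read_relative(dirfd, name, size=4096):
    """Open `name` relative to the directory descriptor `dirfd` and read it."""
    # With dir_fd set, a relative `name` is resolved against dirfd,
    # not against the current working directory.
    fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)
    try:
        return os.read(fd, size)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as workdir:
    # Create a small file inside a throwaway directory.
    with open(os.path.join(workdir, "note.txt"), "wb") as f:
        f.write(b"hello")

    # Open the directory itself; this descriptor plays the role of dirfd.
    dirfd = os.open(workdir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Relative path: interpreted relative to dirfd.
        via_dirfd = read_relative(dirfd, "note.txt")
        # Absolute path: dirfd is ignored.
        via_abs = read_relative(dirfd, os.path.join(workdir, "note.txt"))
    finally:
        os.close(dirfd)

print(via_dirfd, via_abs)
```

Leaving dir_fd at its default of None corresponds to passing AT_FDCWD, that is, plain open behavior.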

openat (and the other functions whose names end in at) was introduced for the following two reasons:

First, openat() allows an application to avoid race conditions that could occur when using open(2) to open files in directories other than the current working directory. These race conditions result from the fact that some component of the directory prefix given to open(2) could be changed in parallel with the call to open(2). Such races can be avoided by opening a file descriptor for the target directory, and then specifying that file descriptor as the dirfd argument of openat().

Second, openat() allows the implementation of a per-thread “current working directory”, via file descriptor(s) maintained by the application. (This functionality can also be obtained by tricks based on the use of /proc/self/fd/dirfd, but less efficiently.)

References:
openat(2) – Linux man page
The Linux programming interface

 

Linux kernel notes (42): container_of

container_of is defined in <linux/kernel.h>:

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:    the pointer to the member.
 * @type:   the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({          \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
    (type *)( (char *)__mptr - offsetof(type,member) );})

Given the address of a structure member, it returns the address of the containing structure. For example:

struct st_A
{
        int member_b;
        int member_c;
};

struct st_A a;

container_of(&(a.member_c), struct st_A, member_c) yields the address of the variable a, that is, the value of &a.
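
The macro itself is kernel C, but its pointer arithmetic (member address minus member offset) can be mirrored in user space. The sketch below is only an illustration, assuming Python 3 with ctypes; the class StA and the Python function container_of are stand-ins written for this example, not kernel code.

```python
import ctypes

class StA(ctypes.Structure):
    # Mirrors struct st_A from the example above.
    _fields_ = [("member_b", ctypes.c_int),
                ("member_c", ctypes.c_int)]

def container_of(member_addr, struct_type, member_name):
    """Return the struct_type instance that contains the member at member_addr."""
    # The field descriptor's .offset plays the role of offsetof(type, member).
    offset = getattr(struct_type, member_name).offset
    # Subtracting the offset gives the start address of the containing struct.
    return struct_type.from_address(member_addr - offset)

a = StA(member_b=1, member_c=2)

# Address of a.member_c, i.e. what &(a.member_c) would be in C.
member_c_addr = ctypes.addressof(a) + StA.member_c.offset

recovered = container_of(member_c_addr, StA, "member_c")

# The recovered structure lives at the same address as a.
print(ctypes.addressof(recovered) == ctypes.addressof(a))
print(recovered.member_b, recovered.member_c)
```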

 

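SystemTap notes (6): printing user-space stack backtraces

When SystemTap prints the call stack of a user-space program, enough debugging information has to be available. That is what the -d and --ldd options are for: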

-d MODULE
          Add symbol/unwind information for the given module into the kernel object module. This may enable symbolic tracebacks from those modules/programs, even if they do not have an explicit probe placed into them.

--ldd  Add symbol/unwind information for all shared libraries suspected by ldd to be necessary for user-space binaries being probed or listed with the -d option. Caution: this can make the probe modules considerably larger.

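The -d option loads symbol information for a module or executable, while --ldd loads symbol information for the shared libraries needed by the module given to -d or by the probed binaries. For example: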

 # stap -d /usr/lib/systemd/systemd-udevd --ldd -e 'probe kprocess.create {print_ubacktrace()}'
<no user backtrace at kernel.function("copy_process@../kernel/fork.c:1146").return>
 0x7fec1d14f011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f6feb135011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
WARNING: Missing unwind data for module, rerun with 'stap -d /usr/lib64/libglib-2.0.so.0.3800.2'
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c2ff7ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f22c3a01c35 [/usr/lib64/libglib-2.0.so.0.3800.2+0x8cc35/0x302000]
 0x7f20966a5011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f20966a5011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
WARNING: Missing unwind data for module, rerun with 'stap -d /usr/lib/systemd/systemd'
 0x7f4e59945ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x4364f3 [/usr/lib/systemd/systemd+0x364f3/0x113000]
 0x7f22c2ff7ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f22c3a01c35 [/usr/lib64/libglib-2.0.so.0.3800.2+0x8cc35/0x302000]
 0x7fb1bdfb6011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7fb1bdfb6011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f3bb6e94011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f3bb6e94011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f783f704ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f783fd2169b [/usr/lib64/libpython2.7.so.1.0+0x10f69b/0x3a0000]

References:
Is there any better method to pass “-d OBJECT” options in command line?
User-Space Stack Backtraces

 

 

libvirt notes (4): log configuration

The libvirt library configures logging through the following three environment variables:

The library configuration of logging is through 3 environment variables allowing to control the logging behaviour:

LIBVIRT_DEBUG: it can take the four following values:

1 or “debug”: asking the library to log every message emitted, though the filters can be used to avoid filling up the output

2 or “info”: log all non-debugging information

3 or “warn”: log warnings and errors, that’s the default value

4 or “error”: log only error messages

LIBVIRT_LOG_FILTERS: defines logging filters

LIBVIRT_LOG_OUTPUTS: defines logging outputs

Note that, for example, setting LIBVIRT_DEBUG= is the same as unset. If you specify an invalid value, it will be ignored with a warning. If you have an error in a filter or output string, some of the settings may be applied up to the point at which libvirt encountered the error.

The libvirtd daemon has three similar settings, stored in the configuration file libvirtd.conf:

log_level: accepts the following values:

4: only errors

3: warnings and errors

2: information, warnings and errors

1: debug and everything

log_filters: defines logging filters

log_outputs: defines logging outputs

For libvirtd, the log settings take precedence as follows:

When starting the libvirt daemon, any logging environment variable settings will override settings in the config file. Command line options take precedence over all. If no outputs are defined for libvirtd, it will try to use

0.10.0 or later: systemd journal, if /run/systemd/journal/socket exists

0.9.0 or later: file /var/log/libvirt/libvirtd.log if running as a daemon

before 0.9.0: syslog if running as a daemon

all versions: to stderr stream if running in the foreground

References:
Logging in the library and the daemon

 

SystemTap notes (5): target variable (1)

An explanation of target variables:

The probe events that map to actual locations in the code (for example kernel.function(“function”) and kernel.statement(“statement”)) allow the use of target variables to obtain the value of variables visible at that location in the code. You can use the -L option to list the target variable available at a probe point.

In fact, the name "context variable" is now preferred over "target variable" (see this email). Using target variables requires kernel debuginfo. Consider the following example:

# stap -L 'kernel.function("vfs_read")'
kernel.function("vfs_read@../fs/read_write.c:381") $file:struct file* $buf:char* $count:size_t $pos:loff_t*

Each target variable is prefixed with $, and the part after the colon is its type. For example, $file has type struct file*. Compare this with the definition of vfs_read:

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)

In addition, a target variable that is not local to the current probe point can be accessed with @var("varname@src/file.c"):

When a target variable is not local to the probe point, like a global external variable or a file local static variable defined in another file then it can be referenced through “@var(“varname@src/file.c”)”.

Consider this example:

# stap -e 'probe kernel.function("vfs_read") {
           printf ("current files_stat max_files: %d\n",
                   @var("files_stat@fs/file_table.c")->max_files);
           exit(); }'
current files_stat max_files: 82002

Data of some basic types can also be read through a pointer:

kernel_char(address)
Obtain the character at address from kernel memory.
kernel_short(address)
Obtain the short at address from kernel memory.
kernel_int(address)
Obtain the int at address from kernel memory.
kernel_long(address)
Obtain the long at address from kernel memory
kernel_string(address)
Obtain the string at address from kernel memory.
kernel_string_n(address, n)
Obtain the string at address from the kernel memory and limits the string to n bytes.

 

Linux kernel notes (41): the "i_rdev" member of the "inode" structure

The inode structure has a member named i_rdev (defined in <linux/fs.h>):

struct inode {
    ......
    dev_t           i_rdev;
    ......
}

If the inode represents a device, i_rdev holds its device number. For better portability, use the imajor and iminor functions to obtain the inode's major and minor numbers:

static inline unsigned iminor(const struct inode *inode)
{
    return MINOR(inode->i_rdev);
}

static inline unsigned imajor(const struct inode *inode)
{
    return MAJOR(inode->i_rdev);
}
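
The same major/minor split is visible from user space through the st_rdev field returned by stat. The following is a rough user-space analogue, assuming Python 3 on Linux; the helper device_numbers is a name invented for this sketch, and it does not touch the kernel's imajor or iminor.

```python
import os
import stat

def device_numbers(path):
    """Return (major, minor) for a device node, or None for anything else."""
    st = os.stat(path)
    # st_rdev is only meaningful for character and block device nodes.
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)

# Round trip: pack a major/minor pair into a device number and split it again.
rdev = os.makedev(1, 3)
decoded = (os.major(rdev), os.minor(rdev))
print(decoded)

# On a typical Linux system /dev/null is a character device node.
if os.path.exists("/dev/null"):
    print(device_numbers("/dev/null"))
```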

 

Linux kernel notes (40): comparing the "file" and "inode" structures

LDD describes the file structure as follows:

struct file, defined in <linux/fs.h>, is the second most important data structure used in device drivers. Note that a file has nothing to do with the FILE pointers of user-space programs. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.

The file structure represents an open file . (It is not specific to device drivers; every open file in the system has an associated struct file in kernel space.) It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure.

In the kernel sources, a pointer to struct file is usually called either file or filp (“file pointer”). We’ll consistently call the pointer filp to prevent ambiguities with the structure itself. Thus, file refers to the structure and filp to a pointer to the structure.

Its description of the inode structure:

The inode structure is used by the kernel internally to represent files. Therefore, it is different from the file structure that represents an open file descriptor. There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

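In summary: in the kernel, every file is represented by one inode structure, while a file structure is associated with an open file descriptor. If a file is opened several times and therefore has several file descriptors, there are correspondingly several file structures associated with it, but there is always only one inode.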
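
This relationship can be observed indirectly from user space. The sketch below assumes Python 3 on a POSIX system. It opens one temporary file twice: both descriptors report the same inode number, but each open has its own file offset, which is consistent with the offset being kept per struct file rather than per inode.

```python
import os
import tempfile

# Create a small temporary file to open twice.
tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"abcdef")
os.close(tmp_fd)

fd1 = os.open(path, os.O_RDONLY)
fd2 = os.open(path, os.O_RDONLY)
try:
    # Both descriptors refer to the same underlying inode.
    same_inode = os.fstat(fd1).st_ino == os.fstat(fd2).st_ino

    # Reading through fd1 advances only fd1's offset.
    os.read(fd1, 4)
    pos1 = os.lseek(fd1, 0, os.SEEK_CUR)
    pos2 = os.lseek(fd2, 0, os.SEEK_CUR)
finally:
    os.close(fd1)
    os.close(fd2)
    os.unlink(path)

print(same_inode, pos1, pos2)
```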

 

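libvirt notes (3): getting the capabilities of the virtualization host

The getCapabilities method returns a string that describes the capabilities of the virtualization host and the kinds of guest OS it can create. Consider the following code: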

#!/usr/bin/python

from __future__ import print_function
import sys
import libvirt

# Connect to the local Xen hypervisor.
conn = libvirt.open('xen:///')
if conn is None:
    print('Failed to open connection to xen:///', file=sys.stderr)
    sys.exit(1)

caps = conn.getCapabilities()  # caps will be a string of XML
print('Capabilities:\n' + caps)

conn.close()
sys.exit(0)

Running it produces:

Capabilities:
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
      <features>
        <pae/>
      </features>
    </cpu>
    <power_management/>
    <migration_features>
      <live/>
    </migration_features>
    <topology>
      <cells num='1'>
        <cell id='0'>
          <memory unit='KiB'>1048512</memory>
          <cpus num='0'>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>

  <guest>
    <os_type>xen</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/lib/xen/bin/qemu-system-i386</emulator>
      <machine>xenpv</machine>
      <domain type='xen'/>
    </arch>
  </guest>

  <guest>
    <os_type>xen</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/lib/xen/bin/qemu-system-i386</emulator>
      <machine>xenpv</machine>
      <domain type='xen'/>
    </arch>
    <features>
      <pae/>
    </features>
  </guest>

</capabilities>

References:
Capability information

 

SystemTap notes (4): timer event

A timer event runs its handler periodically. For example:

# stap -e 'probe timer.s(1) { printf("Hello world!\n");}'
Hello world!
Hello world!
Hello world!
Hello world!

The script above prints Hello world! once per second.

The timer events are defined as follows:

timer.ms(milliseconds)
timer.us(microseconds)
timer.ns(nanoseconds)
timer.hz(hertz)
timer.jiffies(jiffies)

There is also a randomized form (quoted from here):

timer.jiffies(N).randomize(M)

The probe handler is run every N jiffies (a kernel-defined unit of time, typically between 1 and 60 ms). If the “randomize” component is given, a linearly distributed random value in the range [-M..+M] is added to N every time the handler is run. N is restricted to a reasonable range (1 to around a million), and M is restricted to be smaller than N.

Alternatively, intervals may be specified in units of time. There are two probe point variants similar to the jiffies timer:

timer.ms(N)

timer.ms(N).randomize(M)

Here, N and M are specified in milliseconds, but the full options for units are seconds (s/sec), milliseconds (ms/msec), microseconds (us/usec), nanoseconds (ns/nsec), and hertz (hz). Randomization is not supported for hertz timers.

Finally, here is an example of how to use timer events (taken from here):

global count_jiffies, count_ms
probe timer.jiffies(100) { count_jiffies ++ }
probe timer.ms(100) { count_ms ++ }
probe timer.ms(12345)
{
  hz=(1000*count_jiffies) / count_ms
  printf ("jiffies:ms ratio %d:%d => CONFIG_HZ=%d\n",
    count_jiffies, count_ms, hz)
  exit ()
}

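First, note that HZ jiffies occur every second.

Second, count_jiffies is incremented once per 100 jiffies, so by the time the script exits a total of 100 * count_jiffies jiffies have elapsed. Likewise, count_ms is incremented once per 100 ms, so the elapsed time is count_ms / 10 seconds.

Finally, CONFIG_HZ = (100 * count_jiffies) / (count_ms / 10) = (1000 * count_jiffies) / count_ms.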
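
The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula, not of SystemTap itself. The counter values are hypothetical: they assume a 12345 ms run in which the 100 ms timer fires 123 times, and a kernel with HZ=250 in which 100 jiffies last 400 ms, so that timer fires 30 times.

```python
def estimate_hz(count_jiffies, count_ms):
    """Estimate CONFIG_HZ from the two counters kept by the script."""
    elapsed_jiffies = 100 * count_jiffies   # one increment per 100 jiffies
    elapsed_seconds = count_ms / 10         # one increment per 100 ms
    # Algebraically this equals elapsed_jiffies / elapsed_seconds.
    # Integer division is used, like the script's integer arithmetic.
    return (1000 * count_jiffies) // count_ms

# Hypothetical counters for a kernel with HZ=250 (see the assumptions above).
hz_250 = estimate_hz(30, 123)

# Hypothetical counters for a kernel with HZ=1000, where both timers fire every 100 ms.
hz_1000 = estimate_hz(123, 123)

print(hz_250, hz_1000)
```

Because the counters are whole numbers and the run length is not a multiple of every timer period, the estimate can land slightly below the real value.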

 

Linux kernel notes (39): "THIS_MODULE"

THIS_MODULE is a macro defined in <linux/module.h>:

#ifdef MODULE
#define MODULE_GENERIC_TABLE(gtype,name)            \
extern const struct gtype##_id __mod_##gtype##_table        \
  __attribute__ ((unused, alias(__stringify(name))))

extern struct module __this_module;
#define THIS_MODULE (&__this_module)
#else  /* !MODULE */
#define MODULE_GENERIC_TABLE(gtype,name)
#define THIS_MODULE ((struct module *)0)
#endif

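THIS_MODULE is simply the address of the variable __this_module. __this_module points to the start of the module's address space, which is exactly where the struct module variable is defined.

The first member of the file_operations structure is a pointer to struct module, defined in <linux/fs.h>: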

struct file_operations {
    struct module *owner;
    ......
}

LDD explains it as follows:

struct module *owner

The first file_operations field is not an operation at all; it is a pointer to the module that “owns” the structure. This field is used to prevent the module from being unloaded while its operations are in use. Almost all the time, it is simply initialized to THIS_MODULE , a macro defined in <linux/module.h>.

owner points to the module that the file_operations structure is bound to. In most cases it is enough to assign THIS_MODULE to it.

References:
Where is the memory allocation of “__this_module” variable?
深入淺出 insmod, #1