/dev/mem,/dev/kmem和/dev/port

/dev/mem/dev/kmem/dev/port这三个文件分别代表物理内存,kernel虚拟内存和I/O端口。参考下面:

/dev/mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system. Byte addresses in /dev/mem are interpreted as physical memory addresses. References to nonexistent locations cause errors to be returned.

The file /dev/kmem is the same as /dev/mem, except that the kernel virtual memory rather than physical memory is accessed.

/dev/port is similar to /dev/mem, but the I/O ports are accessed.

 

SystemTap 笔记 (8)—— typecasting

当指针是一个void *类型,或是保存为整数后,可以使用cast运算符指定指针的数据类型:

@cast(p, "type_name"[, "module"])->member

可选的module参数用来指定从哪里得到type_name(The optional module tells the translator where to look for information about that type. Multiple modules may be specified as a list with : separators. If the module is not specified, it will default either to the probe module for dwarf probes, or to “kernel” for functions and all other probes types.)。

另外,translator可以从头文件中创建type信息:

The translator can create its own module with type information from a header surrounded by angle brackets, in case normal debuginfo is not available. For kernel headers, prefix it with “kernel” to use the appropriate build system. All other headers are build with default GCC parameters into a user module. Multiple headers may be specified in sequence to resolve a codependency.

@cast(tv, “timeval”, “<sys/time.h>”)->tvsec
@cast(task, “task
struct”, “kernel<linux/sched.h>”)->tgid
@cast(task, “taskstruct”, “kernel<linux/sched.h><linux/fsstruct.h>”)->fs->umask

参考例子:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%d\n", $f->f_flags); exit(); }'
32768

使用cast运算符:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%d\n", @cast($f, "file", "kernel<linux/fs.h>" )->f_flags); exit(); }'
32768

SystemTap 笔记 (7)—— target variable (2)

SystemTap可以为target variable生成一系列可打印的字符串:

$$vars

Expands to a character string that is equivalent to sprintf(“parm1=%x … parmN=%x var1=%x … varN=%x”, parm1, …, parmN, var1, …, varN) for each variable in scope at the probe point. Some values may be printed as “=?” if their run-time location cannot be found.

$$locals

Expands to a subset of $$vars containing only the local variables.

$$parms

Expands to a subset of $$vars containing only the function parameters.

$$return

Is available in return probes only. It expands to a string that is equivalent to sprintf(“return=%x”, $return) if the probed function has a return value, or else an empty string.

参考下面例子:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$vars); exit(); }'
f=0xffff880022ec6080 open=0x0 cred=0xffff880030d483c0 empty_fops={...} inode=? error=?
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms); exit(); }'
f=0xffff880030d453c0 open=0x0 cred=0xffff880030d483c0
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$locals); exit(); }'
empty_fops={...} inode=? error=?

在上述变量后加上$$$可以打印更详细的结构体信息。参考下例:

# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms$); exit(); }'
f={.f_u={...}, .f_path={...}, .f_inode=0x0, .f_op=0x0, .f_lock={...}, .f_count={...}, .f_flags=32768, .f_mode=0, .f_pos=0, .f_owner={...}, .f_cred=0xffff880030d483c0, .f_ra={...}, .f_version=0, .f_security=0xffff880012982b80, .private_data=0x0, .f_ep_links={...}, .f_tfile_llink={...}, .f_mapping=0x0} open=<function>:0x0 cred={.usage={...}, .uid={...}, .gid={...}, .suid={...}, .sgid={...}, .euid={...}, .egid={...}, .fsuid={...}, .fsgid={...}, .securebits=0, .cap_inheritable={...}, .cap_permitted={...}, .cap
# stap -e 'probe kernel.function("do_dentry_open") {printf("%s\n", $$parms$$); exit(); }'
f={.f_u={.fu_llist={.next=0x0}, .fu_rcuhead={.next=0x0, .func=0x0}}, .f_path={.mnt=0xffff8800337510e0, .dentry=0xffff8800318b82d8}, .f_inode=0x0, .f_op=0x0, .f_lock={<union>={.rlock={.raw_lock={.head_tail=0, <class>={.tickets={.head=0, .tail=0}, .owner=0}}}}}, .f_count={.counter=1}, .f_flags=32768, .f_mode=0, .f_pos=0, .f_owner={.lock={.raw_lock={.lock=1048576, .write=1048576}}, .pid=0x0, .pid_type=0, .uid={.val=0}, .euid={.val=0}, .signum=0}, .f_cred=0xffff880030d89300, .f_ra={.start=0, .size=0, .async_si

 

Linux kernel 笔记 (44)——使用字符设备

Linux kernel 使用 cdev结构体代表字符设备(char device),定义在<linux/cdev.h>

#include <linux/kobject.h>
#include <linux/kdev_t.h>
#include <linux/list.h>

struct file_operations;
struct inode;
struct module;

struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
};

void cdev_init(struct cdev *, const struct file_operations *);

struct cdev *cdev_alloc(void);

void cdev_put(struct cdev *p);

int cdev_add(struct cdev *, dev_t, unsigned);

void cdev_del(struct cdev *);

void cd_forget(struct inode *);

分配和初始化cdev结构体的两种方式:

(1)

struct cdev *my_cdev = cdev_alloc();
my_cdev->ops = &my_fops;
my_cdev->owner = THIS_MODULE;

(2)另外一种是cdev嵌入到代表设备的结构体中:

struct scull_dev {
    ......
    struct cdev cdev; /* Char device structure */
    ......
};

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
    int err, devno = MKDEV(scull_major, scull_minor + index);
    cdev_init(&dev->cdev, &scull_fops);
    dev->cdev.owner = THIS_MODULE;
    dev->cdev.ops = &scull_fops;
    ......
}

两种方式都要注意把owner赋值为THIS_MODULE

初始化cdev结构体以后,要使用cdev_add把设备加入系统:

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
    int error;

    p->dev = dev;
    p->count = count;

    error = kobj_map(cdev_map, dev, count, NULL,
             exact_match, exact_lock, p);
    if (error)
        return error;

    kobject_get(p->kobj.parent);

    return 0;
}

要注意count指定的是连续的minor number数。

删除设备使用cdev_del函数:

/**
 * cdev_del() - remove a cdev from the system
 * @p: the cdev structure to be removed
 *
 * cdev_del() removes @p from the system, possibly freeing the structure
 * itself.
 */
void cdev_del(struct cdev *p)
{
    cdev_unmap(p->dev, p->count);
    kobject_put(&p->kobj);
}

2.6版本之前的注册和删除设备的register_chrdevunregister_chrdev函数已经过时,不再使用。

 

Linux kernel 笔记 (43)——do_sys_open

以下是do_sys_openkernel 3.12版本的代码:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_flags op;
    int fd = build_open_flags(flags, mode, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

核心部分如下:

a)get_unused_fd_flags得到一个文件描述符;
b)do_filp_open得到一个struct file结构;
c)fd_install把文件描述符和struct file结构关联起来。

struct file包含f_op成员:

struct file {
    ......
    const struct file_operations    *f_op;
    ......
    void            *private_data;
    ......
}

struct file_operations又包含open成员:

struct file_operations {
    ......
    int (*open) (struct inode *, struct file *);
    ......
}

open成员的两个参数:实际文件的inode节点和struct file结构。

open系统调用执行驱动中open方法之前(struct file_operations中的open成员),会将private_data置成NULL,用户可以根据自己的需要设置private_data的值(参考do_dentry_open函数)。

 

openat VS open

2.6.16版本开始,GNU/Linux引入openat系统调用:

#define _XOPEN_SOURCE 700 /* Or define _POSIX_C_SOURCE >= 200809 */
#include <fcntl.h>
int openat(int  dirfd , const char * pathname , int  flags , ... /* mode_t  mode */);
Returns file descriptor on success, or –1 on error

open相比,多了一个dirfd参数。关于它的用法,参考以下解释:

If pathname specifies a relative pathname, then it is interpreted relative to the directory referred to by the open file descriptor dirfd, rather than relative to the process’s current working directory.

If pathname specifies a relative pathname, and dirfd contains the special value AT_FDCWD , then pathname is interpreted relative to the process’s current working directory (i.e., the same behavior as open(2)).

If pathname specifies an absolute pathname, then dirfd is ignored.

总结起来,如果pathname是绝对路径,则dirfd参数没用。如果pathname是相对路径,并且dirfd的值不是AT_FDCWD,则pathname的参照物是相对于dirfd指向的目录,而不是进程的当前工作目录;反之,如果dirfd的值是AT_FDCWDpathname则是相对于进程当前工作目录的相对路径,此时等同于open。参考kernel代码则一目了然:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;

    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
        umode_t, mode)
{
    if (force_o_largefile())
        flags |= O_LARGEFILE;

    return do_sys_open(dfd, filename, flags, mode);
}

引入openat(及其它at结尾的函数)有以下两个原因:

First, openat() allows an application to avoid race conditions that could occur when using open(2) to open files in directories other than the current working directory. These race conditions result from the fact that some component of the directory prefix given to open(2) could be changed in parallel with the call to open(2). Such races can be avoided by opening a file descriptor for the target directory, and then specifying that file descriptor as the dirfd argument of openat().

Second, openat() allows the implementation of a per-thread “current working directory”, via file descriptor(s) maintained by the application. (This functionality can also be obtained by tricks based on the use of /proc/self/fd/dirfd, but less efficiently.)

参考资料:
openat(2) – Linux man page
The Linux programming interface

 

Linux kernel 笔记 (42)——container_of

container_of定义在<linux/kernel.h>中:

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:    the pointer to the member.
 * @type:   the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({          \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
    (type *)( (char *)__mptr - offsetof(type,member) );})

它的功能是通过一个结构体成员的地址,得到结构体的地址。举例如下:

struct st_A
{
        int member_b;
        int member_c;
};

struct st_A a;

container_of(&(a.member_c), struct st_A, member_c)会得到变量a的地址,也就是&a的值。

 

SystemTap 笔记 (6)—— 打印userspace堆栈信息

使用SystemTap打印user-space程序的调用栈信息时,需要产生足够的调试信息。这时需要-d--ldd两个选项:

-d MODULE
          Add symbol/unwind information for the given module into the kernel object module.  This  may  enable  symbolic  tracebacks
          from those modules/programs, even if they do not have an explicit probe placed into them.

--ldd  Add symbol/unwind  information  for  all  shared libraries suspected by ldd to be necessary for user-space binaries being
          probe or listed with the -d option.  Caution: this can make the probe modules considerably larger.

-d选项负责加载模块/可执行程序的符号表信息,而-ldd则加载-d modulemodule或是probe需要的共享库符号表信息。参考下例:

 # stap -d /usr/lib/systemd/systemd-udevd --ldd -e 'probe kprocess.create {print_ubacktrace()}'
<no user backtrace at kernel.function("copy_process@../kernel/fork.c:1146").return>
 0x7fec1d14f011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f6feb135011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
WARNING: Missing unwind data for module, rerun with 'stap -d /usr/lib64/libglib-2.0.so.0.3800.2'
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c2ff7ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f22c3a01c35 [/usr/lib64/libglib-2.0.so.0.3800.2+0x8cc35/0x302000]
 0x7f20966a5011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f20966a5011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
WARNING: Missing unwind data for module, rerun with 'stap -d /usr/lib/systemd/systemd'
 0x7f4e59945ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x4364f3 [/usr/lib/systemd/systemd+0x364f3/0x113000]
 0x7f22c2ff7ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f22c3a01c35 [/usr/lib64/libglib-2.0.so.0.3800.2+0x8cc35/0x302000]
 0x7fb1bdfb6011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f22c3026011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7fb1bdfb6011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f3bb6e94011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f3bb6e94011 : clone+0x31/0x90 [/lib64/libc-2.19.so]
 0x7f783f704ed4 : __fork+0xb4/0x320 [/lib64/libc-2.19.so]
 0x7f783fd2169b [/usr/lib64/libpython2.7.so.1.0+0x10f69b/0x3a0000]

参考资料:
Is there any better method to pass “-d OBJECT” options in command line?
User-Space Stack Backtraces

 

 

libvirt笔记 (4) —— log配置

libvirt库通过以下三个环境变量配置log

The library configuration of logging is through 3 environment variables allowing to control the logging behaviour:

LIBVIRT_DEBUG: it can take the four following values:

1 or “debug”: asking the library to log every message emitted, though the filters can be used to avoid filling up the output

2 or “info”: log all non-debugging information

3 or “warn”: log warnings and errors, that’s the default value

4 or “error”: log only error messages

LIBVIRTLOGFILTERS: defines logging filters

LIBVIRTLOGOUTPUTS: defines logging outputs

Note that, for example, setting LIBVIRT_DEBUG= is the same as unset. If you specify an invalid value, it will be ignored with a warning. If you have an error in a filter or output string, some of the settings may be applied up to the point at which libvirt encountered the error.

libvirtd daemon程序也有三个类似的配置项(存储在配置文件libvirtd.conf):

log_level: accepts the following values:

4: only errors

3: warnings and errors

2: information, warnings and errors

1: debug and everything

log_filters: defines logging filters

log_outputs: defines logging outputs

对于libvirtd程序来讲,log配置项的优先级如下:

When starting the libvirt daemon, any logging environment variable settings will override settings in the config file. Command line options take precedence over all. If no outputs are defined for libvirtd, it will try to use

0.10.0 or later: systemd journal, if /run/systemd/journal/socket exists 0.9.0 or later: file /var/log/libvirt/libvirtd.log if running as a daemon before 0.9.0: syslog if running as a daemon all versions: to stderr stream if running in the foreground

参考资料:
Logging in the library and the daemon

 

SystemTap 笔记 (5)—— target variable (1)

关于target variable的解释:

The probe events that map to actual locations in the code (for example kernel.function(“function”) and kernel.statement(“statement”)) allow the use of target variables to obtain the value of variables visible at that location in the code. You can use the -L option to list the target variable available at a probe point.

其实,目前更倾向于使用context variable这个名字,而不是target variable(可以参考这封邮件)。使用target variable需要有kerneldebuginfo。参考下面例子:

# stap -L 'kernel.function("vfs_read")'
kernel.function("vfs_read@../fs/read_write.c:381") $file:struct file* $buf:char* $count:size_t $pos:loff_t*

每个target variable前面有$:后面跟着变量类型。例如:file变量的类型就是struct file*。也可对照vfs_read的定义:

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)

此外,对于target variable不属于当前probelocal变量,可以使用@var("varname@src/file.c")来访问:

When a target variable is not local to the probe point, like a global external variable or a file local static variable defined in another file then it can be referenced through “@var(“varname@src/file.c”)”.

请看下面这个例子:

# stap -e 'probe kernel.function("vfs_read") {
           printf ("current files_stat max_files: %d\n",
                   @var("files_stat@fs/file_table.c")->max_files);
           exit(); }'
current files_stat max_files: 82002

也可以通过指针访问一些基本类型的数据:

kernel_char(address)
Obtain the character at address from kernel memory.
kernel_short(address)
Obtain the short at address from kernel memory.
kernel_int(address)
Obtain the int at address from kernel memory.
kernel_long(address)
Obtain the long at address from kernel memory
kernel_string(address)
Obtain the string at address from kernel memory.
kernel_string_n(address, n)
Obtain the string at address from the kernel memory and limits the string to n bytes.