SystemTap 笔记 (14)—— Tracing user-space程序需要安装debug-info包


 # stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}

semantic error: no match (similar functions: malloc, calloc, realloc, close, mbrtowc)
Pass 2: analysis failed.  [man error::pass2]


# zypper in coreutils-debuginfo
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:

The following package is not supported by its vendor:

1 new package to install.
Overall download size: 2.1 MiB. Already cached: 0 B. After the operation, additional 18.6 MiB will be used.
Continue? [y/n/? shows all options] (y): y
Retrieving package coreutils-debuginfo-8.22-9.1.x86_64                                                (1/1),   2.1 MiB ( 18.6 MiB unpacked)
Retrieving: coreutils-debuginfo-8.22-9.1.x86_64.rpm ...................................................................[done (105.3 KiB/s)]
Checking for file conflicts: ........................................................................................................[done]
(1/1) Installing: coreutils-debuginfo-8.22-9.1 ......................................................................................[done]


# stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
bin  boot  dev  etc  home  lib  lib64  lost+found  mnt  opt  proc  root  run  sbin  selinux  srv  sys  tmp  usr  var
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403828 : main+0xa58/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403887 : main+0xab7/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x4039e4 : main+0xc14/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/]


Linux kernel 笔记 (58)——ioctl


int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).


目前在struct file_operations结构体中已不再有ioctl成员:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);


long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);



| dirction(2/3-bit)|size(14/13-bit)| type(8-bit)|number(8-bit)|


The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide ( IOCTYPEBITS ).

The ordinal (sequential) number. It’s eight bits ( IOCNRBITS ) wide.

The direction of data transfer, if the particular command involves a data transfer. The possible values are IOCNONE (no data transfer), IOCREAD , IOCWRITE , and IOCREAD|IOCWRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; IOCREAD means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so IOC READ and IOCWRITE can be extracted using a logical AND operation.

The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro IOCSIZEBITS . It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.


extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
    ((sizeof(t) == sizeof(t[1]) && \
      sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
      sizeof(t) : __invalid_size_argument_for_IOC)

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/* used to create numbers */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOR_BAD(type,nr,size)  _IOC(_IOC_READ,(type),(nr),sizeof(size))
#define _IOW_BAD(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size))

/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)        (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)       (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)     (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)       (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)

The new way of ioctl()
Linux Kernel ioctl(), unlockedioctl(), and compatioctl()
Advanced Char Driver Operations


Haskell笔记 (3)—— list和tuple


Because strings are lists, we can use list functions on them.


> "Hello " ++ "world"
"Hello world"


>let list = [1, 2, 3, 4]
> print list
> [5] : list

    Non type-variable argument in the constraint: Num [t]
    (Use FlexibleContexts to permit this)
    When checking that ‘it’ has the inferred type
      it :: forall t. (Num t, Num [t]) => [[t]]
> 5 : list


Ranges are a way of making lists that are arithmetic sequences of elements that can be enumerated. Numbers can be enumerated. One, two, three, four, etc. Characters can also be enumerated. The alphabet is an enumeration of characters from A to Z. Names can’t be enumerated. What comes after “John”?


> [1 .. 20]


> [2, 4 .. 20]


cycle takes a list and cycles it into an infinite list. If you just try to display the result, it will go on forever so you have to slice it off somewhere.

> take 10 (cycle [1,2,3])  
> take 12 (cycle "LOL ")  


repeat takes an element and produces an infinite list of just that element. It’s like cycling a list with only one element.

> take 10 (repeat 5)  


replicate function if you want some number of the same element in a list.

> replicate 10 5



>  [1,2]
> [1,2,]

<interactive>:12:6: parse error on input ‘]’


There’s a special type, () , that acts as a tuple of zero elements. This type has only one value, which is also written () . Both the type and the value are usually pronounced “unit.” If you are familiar with C, () is somewhat similar to void.


> :type fst
fst :: (a, b) -> a
> :type snd
snd :: (a, b) -> b


The list [1,2,3] in Haskell is actually shorthand for the list 1:(2:(3:[])), where [] is the empty list and : is the infix operator that adds its first argument to the front of its second argument (a list). (: and [] are like Lisp’s cons and nil, respectively.) Since : is right associative, we can also write this list as 1:2:3:[].

Linux kernel 笔记 (56)——init进程


Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly.






Linux kernel 笔记 (55)——抢占(Preemption)

Linux下,一个task运行在user-mode时,是总可以被抢占的(preemptible):kernel通过正常的clock tick中断来切换task

Linux kernel2.6版本之前,是不支持kernel-mode抢占的:比如task执行了系统调用,则只能等调用执行完毕,才能让出CPU;或者taskkernel-mode代码主动调用schedule来调度其它task运行。从2.6版本开始,Linux支持了kernel-mode抢占,因此除非kernel-mode代码关闭了local CPU的中断,否则它任何时候都可能被抢占。

Preemption under Linux


为什么单CPU没有“memory reorder”问题?

这篇文章提到单核系统上不会有“memory reorder”问题:

Two threads being timesliced on a single CPU core won’t run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won’t really know about each other’s reordering.

仍以Memory Reordering Caught in the Act的图为例:



其实可以这样理解:单核CPU系统上,多个线程实际是交替顺序执行的,无法真正做到“并行”。无论两个线程或多个线程的代码如何乱序执行,CPU知道它们原本应该的执行顺序,一旦这种乱序会改变程序的运行结果,CPU会做出相应的“补救”措施,比如丢弃结果,重新执行等等,来保证代码会按照应该执行的顺序执行。所以“memory reorder”问题不会在单核系统上出现。

Why doesn’t the instruction reorder issue occur on a single CPU core?


Linux kernel 笔记 (54)——如何选择“spinlock”或“mutex”


If the critical section needs to sleep, you have no choice but to use a mutex. It’s illegal to schedule,preempt, or sleep on a wait queue after acquiring a spinlock.

Because mutexes put the calling thread to sleep in the face of contention, you have no choice but to use spinlocks inside interrupt handlers.


SystemTap 笔记 (13)—— Statistical aggregate

Statistical aggregate主要用来统计一组数据。它使用“<<< value”运算来把一个value加到这个集合里。举个例子,假设目前是个空集合,执行“<<< 1”以后,集合里有了一个元素:1;再执行“<<< 2”以后,集合里有了两个元素:12。还有一些针对statistical aggregate的运算:countavg等等。通过以下例子,可以很容易知道这些运算的意义:

# cat test.stp

global reads

  reads[execname(),pid()] <<< $count
probe timer.s(3)
  foreach([execname, pid] in reads)
        if (execname == "stapio")
                printf("count (%s %d) : %d \n", execname, pid, @count(reads[execname, pid]))
                printf("sum (%s %d) : %d \n", execname, pid, @sum(reads[execname, pid]))
                printf("min (%s %d) : %d \n", execname, pid, @min(reads[execname, pid]))
                printf("max (%s %d) : %d \n", execname, pid, @max(reads[execname, pid]))
                printf("avg (%s %d) : %d \n", execname, pid, @avg(reads[execname, pid]))
# ./test.stp
count (stapio 10762) : 17
sum (stapio 10762) : 1982472
min (stapio 10762) : 8196
max (stapio 10762) : 131072
avg (stapio 10762) : 116616

Computing for Statistical Aggregates