SystemTap 笔记 (14)—— Tracing user-space程序需要安装debug-info包

SystemTap追踪user-space程序时需要安装user-space程序的debug-info包。举个例子:

 # stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
semantic error: while resolving probe point: identifier 'process' at <input>:1:7
        source: probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}
                      ^

semantic error: no match (similar functions: malloc, calloc, realloc, close, mbrtowc)
Pass 2: analysis failed.  [man error::pass2]

安装coreutils-debuginfo包:

# zypper in coreutils-debuginfo
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
  coreutils-debuginfo

The following package is not supported by its vendor:
  coreutils-debuginfo

1 new package to install.
Overall download size: 2.1 MiB. Already cached: 0 B. After the operation, additional 18.6 MiB will be used.
Continue? [y/n/? shows all options] (y): y
Retrieving package coreutils-debuginfo-8.22-9.1.x86_64                                                (1/1),   2.1 MiB ( 18.6 MiB unpacked)
Retrieving: coreutils-debuginfo-8.22-9.1.x86_64.rpm ...................................................................[done (105.3 KiB/s)]
Checking for file conflicts: ........................................................................................................[done]
(1/1) Installing: coreutils-debuginfo-8.22-9.1 ......................................................................................[done]

再次执行上述命令:

# stap -d /bin/ls --ldd -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' -c "ls /"
bin  boot  dev  etc  home  lib  lib64  lost+found  mnt  opt  proc  root  run  sbin  selinux  srv  sys  tmp  usr  var
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403828 : main+0xa58/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x411674 : xmemdup+0x14/0x30 [/usr/bin/ls]
 0x40ee4a : clone_quoting_options+0x2a/0x40 [/usr/bin/ls]
 0x403887 : main+0xab7/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
 0x404f39 : _start+0x29/0x30 [/usr/bin/ls]
 0x4114a0 : xmalloc+0x0/0x20 [/usr/bin/ls]
 0x4039e4 : main+0xc14/0x2140 [/usr/bin/ls]
 0x7fad37eefb05 : __libc_start_main+0xf5/0x1c0 [/lib64/libc-2.19.so]
.....

 

Linux kernel 笔记 (58)——ioctl

ioctl系统调用的函数原型:

int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).

...并不是代表可变参数,而只是一个可选参数,...在这里防止编译时进行类型检查。

目前在struct file_operations结构体中已不再有ioctl成员:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);

取而代之是unlocked_ioctlcompat_ioctl

long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);

unlocked_ioctl代替ioctl,而compat_ioctl用在32位程序运行在64位操作系统上调用ioctl系统调用。

ioctl的命令是32-bit长,包含以下4个字段:

---------------------------------------------------------------
| dirction(2/3-bit)|size(14/13-bit)| type(8-bit)|number(8-bit)|

各个字段定义:

type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide ( IOCTYPEBITS ).

number
The ordinal (sequential) number. It’s eight bits ( IOCNRBITS ) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are IOCNONE (no data transfer), IOCREAD , IOCWRITE , and IOCREAD|IOCWRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; IOCREAD means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so IOC READ and IOCWRITE can be extracted using a logical AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro IOCSIZEBITS . It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.

相关的macro定义:

extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
    ((sizeof(t) == sizeof(t[1]) && \
      sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
      sizeof(t) : __invalid_size_argument_for_IOC)

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/* used to create numbers */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOR_BAD(type,nr,size)  _IOC(_IOC_READ,(type),(nr),sizeof(size))
#define _IOW_BAD(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size))

/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)        (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)       (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)     (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)       (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)

参考资料:
The new way of ioctl()
Linux Kernel ioctl(), unlockedioctl(), and compatioctl()
Advanced Char Driver Operations

 

Haskell笔记 (3)—— list和tuple

(1)

Because strings are lists, we can use list functions on them.

举例如下:

> "Hello " ++ "world"
"Hello world"

(2)
:用来连接一个元素和list++连接两个list

>let list = [1, 2, 3, 4]
> print list
[1,2,3,4]
> [5] : list

<interactive>:23:1:
    Non type-variable argument in the constraint: Num [t]
    (Use FlexibleContexts to permit this)
    When checking that ‘it’ has the inferred type
      it :: forall t. (Num t, Num [t]) => [[t]]
> 5 : list
[5,1,2,3,4]

(3)

Ranges are a way of making lists that are arithmetic sequences of elements that can be enumerated. Numbers can be enumerated. One, two, three, four, etc. Characters can also be enumerated. The alphabet is an enumeration of characters from A to Z. Names can’t be enumerated. What comes after “John”?

举例如下:

> [1 .. 20]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

也可以指定元素之间的“步长”(24之间差2):

> [2, 4 .. 20]
[2,4,6,8,10,12,14,16,18,20]

另外,cycle循环一个list

cycle takes a list and cycles it into an infinite list. If you just try to display the result, it will go on forever so you have to slice it off somewhere.

> take 10 (cycle [1,2,3])  
[1,2,3,1,2,3,1,2,3,1]  
> take 12 (cycle "LOL ")  
"LOL LOL LOL "   

repeat循环一个元素

repeat takes an element and produces an infinite list of just that element. It’s like cycling a list with only one element.

> take 10 (repeat 5)  
[5,5,5,5,5,5,5,5,5,5]  

也可以使用replicate函数:

replicate function if you want some number of the same element in a list.

> replicate 10 5
[5,5,5,5,5,5,5,5,5,5]

(4)

list最后一个元素后面不能再跟,

>  [1,2]
[1,2]
> [1,2,]

<interactive>:12:6: parse error on input ‘]’

(5)

There’s a special type, () , that acts as a tuple of zero elements. This type has only one value, which is also written () . Both the type and the value are usually pronounced “unit.” If you are familiar with C, () is somewhat similar to void.

(6)fstsnd这两个函数只能操作包含两个元素的tuple

> :type fst
fst :: (a, b) -> a
> :type snd
snd :: (a, b) -> b

(7)

The list [1,2,3] in Haskell is actually shorthand for the list 1:(2:(3:[])), where [] is the empty list and : is the infix operator that adds its first argument to the front of its second argument (a list). (: and [] are like Lisp’s cons and nil, respectively.) Since : is right associative, we can also write this list as 1:2:3:[].

Linux kernel 笔记 (56)——init进程

以下摘自plka

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly.

init进程是Linux运行的第一个进程,是其它所有进程的“祖先”。

sinit展示了一个最简单的init进程实现:初始化服务和等待清理子进程。

 

希望国内出版社多引进英文技术原版书

目前国内出版社引进的英文技术原版书太少了,大多数还是出版一些国人翻译的版本。我不知道直接引进原版书和找人翻译在成本方面有多大差距,但是凭心而论,很多翻译的书真是不敢恭维,不客气地讲,感觉有些翻译者似乎自己都没看明白原著的意思。毕竟目前很多先进的技术和经典的书籍都是从西方传过来的,也是用英文写的。读原汁原味的书,可以更准确地理解原著的意思,也可以提高自己的英文水平。所以真的很希望国内的出版社可以多引进文技术原版书!

Linux kernel 笔记 (55)——抢占(Preemption)

Linux下,一个task运行在user-mode时,是总可以被抢占的(preemptible):kernel通过正常的clock tick中断来切换task

Linux kernel2.6版本之前,是不支持kernel-mode抢占的:比如task执行了系统调用,则只能等调用执行完毕,才能让出CPU;或者taskkernel-mode代码主动调用schedule来调度其它task运行。从2.6版本开始,Linux支持了kernel-mode抢占,因此除非kernel-mode代码关闭了local CPU的中断,否则它任何时候都可能被抢占。

参考资料:
Preemption under Linux

 

为什么单CPU没有“memory reorder”问题?

这篇文章提到单核系统上不会有“memory reorder”问题:

Two threads being timesliced on a single CPU core won’t run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won’t really know about each other’s reordering.

仍以Memory Reordering Caught in the Act的图为例:

marked-example2

reordered

其实可以这样理解:单核CPU系统上,多个线程实际是交替顺序执行的,无法真正做到“并行”。无论两个线程或多个线程的代码如何乱序执行,CPU知道它们原本应该的执行顺序,一旦这种乱序会改变程序的运行结果,CPU会做出相应的“补救”措施,比如丢弃结果,重新执行等等,来保证代码会按照应该执行的顺序执行。所以“memory reorder”问题不会在单核系统上出现。

参考资料:
Why doesn’t the instruction reorder issue occur on a single CPU core?
preempt_disable的问题

 

Linux kernel 笔记 (54)——如何选择“spinlock”或“mutex”

ELDD中提到如何选择spinlockmutex

If the critical section needs to sleep, you have no choice but to use a mutex. It’s illegal to schedule,preempt, or sleep on a wait queue after acquiring a spinlock.

Because mutexes put the calling thread to sleep in the face of contention, you have no choice but to use spinlocks inside interrupt handlers.

 

SystemTap 笔记 (13)—— Statistical aggregate

Statistical aggregate主要用来统计一组数据。它使用“<<< value”运算来把一个value加到这个集合里。举个例子,假设目前是个空集合,执行“<<< 1”以后,集合里有了一个元素:1;再执行“<<< 2”以后,集合里有了两个元素:12。还有一些针对statistical aggregate的运算:countavg等等。通过以下例子,可以很容易知道这些运算的意义:

# cat test.stp
#!/usr/bin/stap

global reads

probe vfs.read
{
  reads[execname(),pid()] <<< $count
}
probe timer.s(3)
{
  foreach([execname, pid] in reads)
  {
        if (execname == "stapio")
        {
                printf("count (%s %d) : %d \n", execname, pid, @count(reads[execname, pid]))
                printf("sum (%s %d) : %d \n", execname, pid, @sum(reads[execname, pid]))
                printf("min (%s %d) : %d \n", execname, pid, @min(reads[execname, pid]))
                printf("max (%s %d) : %d \n", execname, pid, @max(reads[execname, pid]))
                printf("avg (%s %d) : %d \n", execname, pid, @avg(reads[execname, pid]))
                exit()
        }
  }
}
# ./test.stp
count (stapio 10762) : 17
sum (stapio 10762) : 1982472
min (stapio 10762) : 8196
max (stapio 10762) : 131072
avg (stapio 10762) : 116616

参考资料:
Computing for Statistical Aggregates