Linux kernel 笔记 (58)——ioctl

ioctl系统调用的函数原型:

int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).

...并不是代表可变参数,而只是一个可选参数,...在这里防止编译时进行类型检查。

目前在struct file_operations结构体中已不再有ioctl成员:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);

取而代之是unlocked_ioctlcompat_ioctl

long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);

unlocked_ioctl代替ioctl,而compat_ioctl用在32位程序运行在64位操作系统上调用ioctl系统调用。

ioctl的命令是32-bit长,包含以下4个字段:

---------------------------------------------------------------
| dirction(2/3-bit)|size(14/13-bit)| type(8-bit)|number(8-bit)|

各个字段定义:

type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide ( IOCTYPEBITS ).

number
The ordinal (sequential) number. It’s eight bits ( IOCNRBITS ) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are IOCNONE (no data transfer), IOCREAD , IOCWRITE , and IOCREAD|IOCWRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; IOCREAD means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so IOC READ and IOCWRITE can be extracted using a logical AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro IOCSIZEBITS . It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.

相关的macro定义:

extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
    ((sizeof(t) == sizeof(t[1]) && \
      sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
      sizeof(t) : __invalid_size_argument_for_IOC)

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/* used to create numbers */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOR_BAD(type,nr,size)  _IOC(_IOC_READ,(type),(nr),sizeof(size))
#define _IOW_BAD(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size))

/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)        (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)       (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)     (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)       (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)

参考资料:
The new way of ioctl()
Linux Kernel ioctl(), unlockedioctl(), and compatioctl()
Advanced Char Driver Operations

 

Linux kernel 笔记 (56)——init进程

以下摘自plka

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly.

init进程是Linux运行的第一个进程,是其它所有进程的“祖先”。

sinit展示了一个最简单的init进程实现:初始化服务和等待清理子进程。

 

Linux kernel 笔记 (55)——抢占(Preemption)

Linux下,一个task运行在user-mode时,是总可以被抢占的(preemptible):kernel通过正常的clock tick中断来切换task

Linux kernel2.6版本之前,是不支持kernel-mode抢占的:比如task执行了系统调用,则只能等调用执行完毕,才能让出CPU;或者taskkernel-mode代码主动调用schedule来调度其它task运行。从2.6版本开始,Linux支持了kernel-mode抢占,因此除非kernel-mode代码关闭了local CPU的中断,否则它任何时候都可能被抢占。

参考资料:
Preemption under Linux

 

Linux kernel 笔记 (54)——如何选择“spinlock”或“mutex”

ELDD中提到如何选择spinlockmutex

If the critical section needs to sleep, you have no choice but to use a mutex. It’s illegal to schedule,preempt, or sleep on a wait queue after acquiring a spinlock.

Because mutexes put the calling thread to sleep in the face of contention, you have no choice but to use spinlocks inside interrupt handlers.

 

Linux kernel 笔记 (53)——为什么“interrupt handler”不能被抢占?

Interrupt handler会复用当前被中断taskkernel stack,它并不是一个真正的task,也不拥有task_struct。因此一旦被调度出去,就无法再被调度回来继续执行。所以interrupt handler不允许被抢占。

参考资料:
Why can’t you sleep in an interrupt handler in the Linux kernel? Is this true of all OS kernels?
Why kernel code/thread executing in interrupt context cannot sleep?;
Are there any difference between “kernel preemption” and “interrupt”?;
Why can not processes switch in atomic context?

 

Linux kernel 笔记 (52)——使用“spinlock”的进程不能被抢占

以下摘自这封邮件

A process cannot be preempted nor sleep while holding a spinlock due spinlocks behavior. If a process grabs a spinlock and goes to sleep before releasing it. A second process (or an interrupt handler) that to grab the spinlock will busy wait. On an uniprocessor machine the second process will lock the CPU not allowing the first process to wake up and release the spinlock so the second process can continue, it is basically a deadlock.

This happens since grabbing an spinlocks also disables interrupts and this is required to synchronize threads with interrupt handlers.

当一个task获得spinlock以后,它不能被抢占(比如调用sleep)。因为如果这时有另外一个task也想获得这个spinlock,在UP系统上,这个task就会一直占据CPU,并且不停地尝试获得锁。而第一个task没有机会重新执行来释放锁,这就造成“死锁”。Interrupt handler也是同样道理。

 

Linux kernel 笔记 (51)——”atomic context”

以下摘自这篇文章

Kernel code generally runs in one of two fundamental contexts. Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary. But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.

There is more to it than that, though: any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.

“Deadlocking forever” tends not to appear on users’ wishlists for the kernel, so the kernel developers go out of their way to avoid that situation. To that end, code which is running in atomic context carefully follows a number of rules, including (1) no access to user space, and, crucially, (2) no sleeping. Problems can result, though, when a particular kernel function does not know which context it might be invoked in. The classic example is kmalloc() and friends, which take an explicit argument (GFPKERNEL or GFPATOMIC) specifying whether sleeping is possible or not.

处理中断代码属于atomic context,必须遵守下面的原则:
a)不能访问user space
b)不能sleep

 

Linux kernel 笔记 (50)——”context switch”和”mode switch”

以下内容摘自stackoverflow上的这个帖子

At a high level, there are two separate mechanisms to understand. The first is the kernel entry/exit mechanism: this switches a single running thread from running usermode code to running kernel code in the context of that thread, and back again. The second is the context switch mechanism itself, which switches in kernel mode from running in the context of one thread to another.

So, when Thread A calls sched_yield() and is replaced by Thread B, what happens is:

Thread A enters the kernel, changing from user mode to kernel mode;
Thread A in the kernel context-switches to Thread B in the kernel;
Thread B exits the kernel, changing from kernel mode back to user mode.

Each user thread has both a user-mode stack and a kernel-mode stack. When a thread enters the kernel, the current value of the user-mode stack (SS:ESP) and instruction pointer (CS:EIP) are saved to the thread’s kernel-mode stack, and the CPU switches to the kernel-mode stack – with the int $80 syscall mechanism, this is done by the CPU itself. The remaining register values and flags are then also saved to the kernel stack.

When a thread returns from the kernel to user-mode, the register values and flags are popped from the kernel-mode stack, then the user-mode stack and instruction pointer values are restored from the saved values on the kernel-mode stack.

When a thread context-switches, it calls into the scheduler (the scheduler does not run as a separate thread – it always runs in the context of the current thread). The scheduler code selects a process to run next, and calls the switchto() function. This function essentially just switches the kernel stacks – it saves the current value of the stack pointer into the TCB for the current thread (called struct taskstruct in Linux), and loads a previously-saved stack pointer from the TCB for the next thread. At this point it also saves and restores some other thread state that isn’t usually used by the kernel – things like floating point/SSE registers.

So you can see that the core user-mode state of a thread isn’t saved and restored at context-switch time – it’s saved and restored to the thread’s kernel stack when you enter and leave the kernel. The context-switch code doesn’t have to worry about clobbering the user-mode register values – those are already safely saved away in the kernel stack by that point.

总结如下:
mode switch”是一个运行的taskuser-mode切换到kernel-mode,或者切换回来。而“context switch”一定发生在kernel mode,进行task的切换。

每个user task有一个user-mode stack和一个kernel-mode stack,当从user-mode切换到kernel-mode时,寄存器的值要保存到kernel-mode stack,反之,从kernel-mode切换回user-mode时,要把寄存器的值恢复出来。

进行“context switch”时,scheduler将当前kernel-mode stack中的值保存在task_struct中,并把下一个将要运行tasktask_struct值恢复到kernel-mode stack中。这样,从kernel-mode返回到user-mode,就会运行另外一个task

 

Crash工具笔记 (3)—— 在Xen环境使用crash

这两周一直在crash邮件列表里讨论如何在SuSE Xen上使用crash调试Dom0 kernel。邮件来来回回讨论很多(参见这里),最后还发现了一个bug。细节不说了,把最后的结果总结一下:

(1)由于SuSE kerenl默认编译打开CONFIG_STRICT_DEVMEM编译开关,所以crash工具无法完全访问/dev/mem,可以使用/proc/kcore作为代替;

(2)SuSE带有crash.ko驱动(位于:“/lib/modules/uname -r/updates/crash.ko”),但默认没有安装,可以自己手动安装(使用insmod命令),然后就可以使用了:

# crash

crash 7.1.3
Copyright (C) 2002-2014  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

crash: /boot/xen-4.5.gz: original filename unknown
       Use "-f /boot/xen-4.5.gz" on command line to prevent this message.

WARNING: machine type mismatch:
         crash utility: X86_64
         /var/tmp/xen-4.5.gz_ud3IRy: X86

crash: /boot/symtypes-3.12.49-6-default.gz: original filename unknown
       Use "-f /boot/symtypes-3.12.49-6-default.gz" on command line to
prevent this message.

crash: /boot/symvers-3.12.49-6-default.gz: original filename unknown
       Use "-f /boot/symvers-3.12.49-6-default.gz" on command line to
prevent this message.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /boot/vmlinux-3.12.49-6-xen.gz
   DEBUGINFO: /usr/lib/debug/boot/vmlinux-3.12.49-6-xen.debug
    DUMPFILE: /dev/crash
        CPUS: 128
        DATE: Fri Nov 20 06:55:06 2015
      UPTIME: 18:51:36
LOAD AVERAGE: 1.76, 1.48, 1.21
       TASKS: 1230
    NODENAME: dl980-5
     RELEASE: 3.12.49-6-xen
     VERSION: #1 SMP Mon Oct 26 16:05:37 UTC 2015 (11560c3)
     MACHINE: x86_64  (1995 Mhz)
      MEMORY: 125.9 GB
         PID: 6618
     COMMAND: "crash"
        TASK: ffff881ea93b2140  [THREAD_INFO: ffff881e869f2000]
         CPU: 112
       STATE: TASK_RUNNING (ACTIVE)