Linux kernel notes (63): changing the kernel that boots

The original article is here.

List the kernels available in the GRUB2 menu (the system is CentOS). The entries are numbered from 0, top to bottom:

# egrep ^menuentry /etc/grub2.cfg | cut -f 2 -d \'
CentOS Linux (4.8.3) 7 (Core)
CentOS Linux (3.10.0-327.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-d07a2009dd34415fa45624985dccbdf6) 7 (Core)

Use grub2-set-default to change the default boot kernel; the argument is the index of the menu entry in the list above:

# grub2-set-default 0

If the change should apply to the next boot only, use the grub2-reboot command instead:

# grub2-reboot 0

Linux kernel notes (62): list_head

The doubly linked list is a data structure used throughout the Linux kernel. It is defined as follows:

struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}
...

The following figure is taken from PLKA:

[Figure: doubly linked list built from struct list_head]

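As the figure shows, defining a list requires a head node; through the head node, elements can then be inserted, deleted, and so on. Consider an example (list.c):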

struct list_head {
        struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)


int main(void) {
        LIST_HEAD(dev_list);
        return 0;
}

Check the output of the gcc preprocessor:

# gcc -E -P list.c
struct list_head {
 struct list_head *next, *prev;
};
int main(void) {
 struct list_head dev_list = { &(dev_list), &(dev_list) };
 return 0;
}

As shown, both prev and next of the head node dev_list point to dev_list itself. The following code achieves the same effect:

struct list_head {
    struct list_head *next, *prev;
};

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

int main(void) {
    struct list_head dev_list;
    INIT_LIST_HEAD(&dev_list);
    return 0;
}

 

Linux kernel notes (60): scheduling domain

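On a NUMA system, the time a CPU needs to access local memory differs greatly from the time it needs to access remote memory, so a good scheduling algorithm becomes important. The Linux kernel introduces the concept of the scheduling domain for this purpose. Consider the following example: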

[root@localhost ~]# cd /proc/sys/kernel/sched_domain/
[root@localhost sched_domain]# ls
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7
[root@localhost sched_domain]# ls -alt *
cpu0:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu1:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu2:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu3:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu4:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu5:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu6:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu7:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

Under /proc/sys/kernel/sched_domain/ every CPU has its own directory, and each CPU directory holds the domain information that applies to that CPU.

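A multi-level system also has multi-level scheduling domains (the kernel structure is struct sched_domain). Each scheduling domain contains a set of CPUs that share properties and scheduling policies. Each scheduling domain contains one or more CPU groups (the kernel structure is struct sched_group), and the scheduling domain treats each CPU group as a single unit.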

The core scheduling domain code lives in kernel/sched/core.c; the meaning of each file under /proc/sys/kernel/sched_domain/cpu$/domain$ can be found there.

On a NUMA system, if one node is very heavily utilized, say above 90%, while another node is only at 60%~70%, it may be worth trying to disable wakeup affinity.

References:
Scheduling domains
sched-domains.txt
Does domain0 in /proc/sys/kernel/sched_domain/cpu$ refer top-level domain in the system?
How to understand /proc/sys/kernel/sched_domain/cpu$/domain$/flags?

Linux kernel notes (59): "depends on" and "select" in Kconfig

In a Kconfig file:

config A
    depends on B
    select C

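This means that CONFIG_A can be enabled only if CONFIG_B is enabled. Once CONFIG_A is enabled, CONFIG_C is enabled automatically as well. Note that select forces CONFIG_C on without checking CONFIG_C's own dependencies.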

References:
“select” vs “depends” in kernel Kconfig

 

Linux kernel notes (58): ioctl

The prototype of the ioctl system call:

int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).

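In other words, the ... does not denote a variable number of arguments but a single optional argument; the dots are there only to prevent type checking at compile time.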

struct file_operations no longer has the old ioctl member:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);

It has been replaced by unlocked_ioctl and compat_ioctl:

long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);

unlocked_ioctl replaces ioctl, while compat_ioctl is used when a 32-bit program running on a 64-bit kernel invokes the ioctl system call.

An ioctl command is 32 bits long and consists of the following 4 fields:

--------------------------------------------------------------------------
| direction (2/3-bit) | size (14/13-bit) | type (8-bit) | number (8-bit) |
--------------------------------------------------------------------------

The fields are defined as follows:

type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).

number
The ordinal (sequential) number. It’s eight bits (_IOC_NRBITS) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; _IOC_READ means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so _IOC_READ and _IOC_WRITE can be extracted using a logical AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro _IOC_SIZEBITS. It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.

The related macro definitions:

extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
    ((sizeof(t) == sizeof(t[1]) && \
      sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
      sizeof(t) : __invalid_size_argument_for_IOC)

#define _IOC(dir,type,nr,size) \
    (((dir)  << _IOC_DIRSHIFT) | \
     ((type) << _IOC_TYPESHIFT) | \
     ((nr)   << _IOC_NRSHIFT) | \
     ((size) << _IOC_SIZESHIFT))

/* used to create numbers */
#define _IO(type,nr)        _IOC(_IOC_NONE,(type),(nr),0)
#define _IOR(type,nr,size)  _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOW(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#define _IOR_BAD(type,nr,size)  _IOC(_IOC_READ,(type),(nr),sizeof(size))
#define _IOW_BAD(type,nr,size)  _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size))

/* used to decode ioctl numbers.. */
#define _IOC_DIR(nr)        (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK)
#define _IOC_TYPE(nr)       (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK)
#define _IOC_NR(nr)     (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK)
#define _IOC_SIZE(nr)       (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK)

References:
The new way of ioctl()
Linux Kernel ioctl(), unlocked_ioctl(), and compat_ioctl()
Advanced Char Driver Operations

 

Linux kernel notes (56): the init process

The following is quoted from PLKA:

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly.

The init process is the first process that Linux runs and the "ancestor" of all other processes.

sinit shows about the simplest possible init implementation: it starts the services and then waits to reap child processes.

 

Linux kernel notes (55): preemption

On Linux, a task running in user mode can always be preempted: the kernel switches tasks from the regular clock tick interrupt.

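Before version 2.6, the Linux kernel did not support kernel-mode preemption. For example, a task that made a system call gave up the CPU only when the call finished, or when the task's kernel-mode code voluntarily called schedule() to let another task run. Starting with version 2.6, Linux supports kernel-mode preemption (when the kernel is built with it enabled), so kernel-mode code can be preempted at any time unless it has disabled interrupts on the local CPU or otherwise disabled preemption, for example by holding a spinlock.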

References:
Preemption under Linux

 

Linux kernel notes (54): how to choose between "spinlock" and "mutex"

ELDD explains how to choose between a spinlock and a mutex:

If the critical section needs to sleep, you have no choice but to use a mutex. It’s illegal to schedule, preempt, or sleep on a wait queue after acquiring a spinlock.

Because mutexes put the calling thread to sleep in the face of contention, you have no choice but to use spinlocks inside interrupt handlers.