kernel | 我的站点

Linux kernel 笔记（63）——改变启动的kernel

原文在这里。

得到当前系统运行的kernel（系统为CentOS）：

# egrep ^menuentry /etc/grub2.cfg | cut -f 2 -d \'
CentOS Linux (4.8.3) 7 (Core)
CentOS Linux (3.10.0-327.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-d07a2009dd34415fa45624985dccbdf6) 7 (Core)

使用grub2-set-default改变启动的kernel：

# grub2-set-default 0

如果仅仅想生效一次，可以使用grub2-reboot命令：

＃ grub2-reboot 0

“Page out”和“swap out”

下文摘自Linux performance and tuning guidelines：

The pages are used mainly for two purposes: page cache and process address space. The page cache is pages mapped to a file on disk. The pages that belong to a process address space (called anonymous memory because it is not mapped to any files, and it has no name) are used for heap and stack. When kswapd reclaims pages, it would rather shrink the page cache than page out (or swap out) the pages owned by processes. A large proportion of page cache that is reclaimed and process address space that is reclaimed might depend on the usage scenario and will affect performance. You can take some control of this behavior by using /proc/sys/vm/swappiness.

Page out and swap out: The phrases “page out” and “swap out” are sometimes confusing. The phrase “page out” means take some pages (a part of entire address space) into swap space while “swap out” means taking entire address space into swap space. They are sometimes used interchangeably.

Linux kernel 笔记（62）——list_head

双向链表是Linux kernel中常用的数据结构，定义如下：

struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}
...

下图选自plka：

从上图可以看出，定义链表需要一个头结点，通过头结点继而可以完成插入，删除元素等操作。来看一个例子（list.c）：

struct list_head {
        struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)


int main(void) {
        LIST_HEAD(dev_list);
        return 0;
}

检查gcc预处理的输出：

# gcc -E -P list.c
struct list_head {
 struct list_head *next, *prev;
};
int main(void) {
 struct list_head dev_list = { &(dev_list), &(dev_list) };
 return 0;
}

可以看到，头结点dev_list的prev和next都指向了自己。下面代码达到同样的效果：

struct list_head {
    struct list_head *next, *prev;
};

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

int main(void) {
    struct list_head dev_list;
    INIT_LIST_HEAD(&dev_list);
    return 0;
}

Linux kernel 笔记（61）——PID是0，1，2的进程

在Linux kernel中，0号进程是scheduler，1号进程是init/systemd（所有user thread的祖先），2号进程是[kthreadd]（所有kernel thread的父进程）。

参考资料：
Which process has PID 0?；
init process: ancestor of all processes?.

Linux kernel 笔记（60）——scheduling domain

在NUMA系统上，由于不同CPU直接访问本地内存和远端内存的时间相差很大，所以更好地调度算法就显得很重要。Linux kernel引入了scheduling domain的概念。可以参看下面例子：

[root@localhost ~]# cd /proc/sys/kernel/sched_domain/
[root@localhost sched_domain]# ls
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7
[root@localhost sched_domain]# ls -alt *
cpu0:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu1:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu2:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu3:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu4:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu5:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu6:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

cpu7:
total 0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain0
dr-xr-xr-x. 1 root root 0 Feb 26 20:06 domain1
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 .
dr-xr-xr-x. 1 root root 0 Feb 26 19:37 ..

在/proc/sys/kernel/sched_domain/目录下每个CPU都有一个自己的目录，并且每个CPU目录下都有和自己相关的domain信息。

在multi-level系统中，也拥有multi-level的scheduling domain（内核中结构体是struct sched_domain）。每个scheduling domain包含一组共享属性和调度策略的CPU；每个scheduling domain包含至少一个或多个CPU group（内核中结构体是struct sched_group），每个CPU group会被scheduling domain看做一个独立的单元。

scheduling domain的核心代码位于kernel\sched\core.c中，关于/proc/sys/kernel/sched_domain/cpu$/domain$中各个文件的含义，都可以在这里找到。

在NUMA系统上，如果一个node利用率非常高，比如高于90%，而另一个node利用率可能只有60%~70%，这时可以尝试disable wakeup affinity。

参考资料：
Scheduling domains；
sched-domains.txt；
Does domain0 in /proc/sys/kernel/sched_domain/cpu$ refer top-level domain in the system?；
How to understan /proc/sys/kernel/sched_domain/cpu$/domain$/flags?。

shmmax和shmall

Linux kernel中针对shared memory有两个重要的配置项：shmmax和shmall：

shmmax定义了一次分配shared memory的最大长度，单位是byte：

# cat /proc/sys/kernel/shmmax
18446744073692774399

shmall定义了一共能分配shared memory的最大长度，单位是page：

最大“shared memory” = shmall（cat /proc/sys/kernel/shmall） * pagesize（getconf PAGE_SIZE）

以shmmax为例，介绍一下修改值的方法：

（1）现在系统shmmax的值：

# sysctl -a | grep shmmax
kernel.shmmax = 18446744073692774399

（2）修改shmmax的值：

# echo "536870912" > /proc/sys/kernel/shmmax
# sysctl -a | grep shmmax
kernel.shmmax = 536870912

可以看到值发生了变化。但是重启系统以后，shmmax又变回之前的值。如果要让值永久生效，可以使用下列方法：

# echo "kernel.shmmax = 536870912" >>  /etc/sysctl.conf
# sysctl -a | grep shmmax
kernel.shmmax = 18446744073692774399
# sysctl -p
kernel.shmmax = 536870912
# sysctl -a | grep shmmax
kernel.shmmax = 536870912

另外，关于如何设置shmall和shmmax的值，也可以参考这个脚本。

参考资料：
The Mysterious World of Shmmax and Shmall；
Configuring SHMMAX and SHMALL for Oracle in Linux；
What is shmmax, shmall, shmmni? Shared Memory Max。

Linux kernel 笔记（59）——Kconfig中的“depends on”和“select”

在Kconfig文件中：

config A
    depends on B
    select C

它的含义是：CONFIG_A配置与否，取决于CONFIG_B是否配置。一旦CONFIG_A配置了，CONFIG_C也自动配置了。

参考资料：
“select” vs “depends” in kernel Kconfig。

Linux kernel 笔记（58）——ioctl

ioctl系统调用的函数原型：

int ioctl(int fd, unsigned long cmd, ...);

In a real system, however, a system call can’t actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware “gates.” Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp . The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument).

...并不是代表可变参数，而只是一个可选参数，...在这里防止编译时进行类型检查。

目前在struct file_operations结构体中已不再有ioctl成员：

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);

取而代之是unlocked_ioctl和compat_ioctl：

long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);

unlocked_ioctl代替ioctl，而compat_ioctl用在32位程序运行在64位操作系统上调用ioctl系统调用。

ioctl的命令是32-bit长，包含以下4个字段：

---------------------------------------------------------------
| dirction(2/3-bit)|size(14/13-bit)| type(8-bit)|number(8-bit)|

各个字段定义：

type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide ( IOCTYPEBITS ).

number
The ordinal (sequential) number. It’s eight bits ( IOCNRBITS ) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are IOCNONE (no data transfer), IOCREAD , IOCWRITE , and IOCREAD|IOCWRITE (data is transferred both ways). Data transfer is seen from the application’s point of view; IOCREAD means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so IOC READ and IOCWRITE can be extracted using a logical AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro IOCSIZEBITS . It’s not mandatory that you use the size field—the kernel does not check it—but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We’ll see how this field is used soon.

Linux kernel 笔记（57）——hardware interrtupt不能访问userspace

以下摘自plka：

系统调用（system call）可以访问userspace，而硬件中断（hardware interrupt）则不能。

Linux kernel 笔记（56）——init进程

以下摘自plka：

Linux employs a hierarchical scheme in which each process depends on a parent process. The kernel starts the init program as the first process that is responsible for further system initialization actions and display of the login prompt or (in more widespread use today) display of a graphical login interface. init is therefore the root from which all processes originate, more or less directly.

init进程是Linux运行的第一个进程，是其它所有进程的“祖先”。

sinit展示了一个最简单的init进程实现：初始化服务和等待清理子进程。

一	二	三	四	五	六	日
« 12月
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30