Linux kernel notes (53): Why can't an "interrupt handler" be preempted?

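An interrupt handler reuses the kernel stack of the task it interrupted. It is not a real task and has no task_struct of its own, so if it were scheduled out there would be no way to schedule it back in and let it continue. That is why an interrupt handler must not be preempted.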

References:
Why can’t you sleep in an interrupt handler in the Linux kernel? Is this true of all OS kernels?
Why kernel code/thread executing in interrupt context cannot sleep?
Are there any difference between “kernel preemption” and “interrupt”?
Why can not processes switch in atomic context?

 

Linux kernel notes (52): A process holding a "spinlock" cannot be preempted

The following is quoted from this email:

A process cannot be preempted nor sleep while holding a spinlock due spinlocks behavior. If a process grabs a spinlock and goes to sleep before releasing it. A second process (or an interrupt handler) that to grab the spinlock will busy wait. On an uniprocessor machine the second process will lock the CPU not allowing the first process to wake up and release the spinlock so the second process can continue, it is basically a deadlock.

This happens since grabbing an spinlocks also disables interrupts and this is required to synchronize threads with interrupt handlers.

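Once a task holds a spinlock, it must not be preempted and must not sleep (for example by calling a function that sleeps). If another task then tried to take the same spinlock on a UP system, that second task would occupy the CPU and keep spinning on the lock, while the first task would never get to run again and release it. The result is a deadlock. The same reasoning applies to an interrupt handler.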
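The busy-wait behaviour described above can be sketched in user space. This is only a toy model built on C11 atomic_flag, not the kernel's spinlock_t; the names toy_spinlock and toy_spin_lock_bounded are made up for illustration, and the spin is bounded so the example terminates where a real spinlock would spin forever.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock: a single test-and-set flag (NOT the kernel's spinlock_t). */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

void toy_spin_init(toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/*
 * Try to take the lock, spinning at most max_spins times.
 * A real spinlock has no bound: it spins until the holder releases the lock.
 */
bool toy_spin_lock_bounded(toy_spinlock *l, unsigned long max_spins)
{
    assert(max_spins > 0);
    for (unsigned long i = 0; i < max_spins; i++) {
        /* test_and_set returns the previous value; false means we got it. */
        if (!atomic_flag_test_and_set(&l->locked))
            return true;
    }
    return false; /* the holder never ran, so the lock never became free */
}

void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}
```

If "task A" takes the lock and is then switched out, a call such as toy_spin_lock_bounded(&lock, 1000000) made by "task B" on the same CPU burns all of its spins and fails, because nothing else can run to clear the flag while B is spinning.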

 

Linux kernel notes (51): "atomic context"

The following is quoted from this article:

Kernel code generally runs in one of two fundamental contexts. Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary. But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.

There is more to it than that, though: any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.

“Deadlocking forever” tends not to appear on users’ wishlists for the kernel, so the kernel developers go out of their way to avoid that situation. To that end, code which is running in atomic context carefully follows a number of rules, including (1) no access to user space, and, crucially, (2) no sleeping. Problems can result, though, when a particular kernel function does not know which context it might be invoked in. The classic example is kmalloc() and friends, which take an explicit argument (GFP_KERNEL or GFP_ATOMIC) specifying whether sleeping is possible or not.

Interrupt-handling code runs in atomic context and must follow these rules:
a) no access to user space
b) no sleeping

 

Linux kernel notes (50): "context switch" and "mode switch"

The following is quoted from this Stack Overflow post:

At a high level, there are two separate mechanisms to understand. The first is the kernel entry/exit mechanism: this switches a single running thread from running usermode code to running kernel code in the context of that thread, and back again. The second is the context switch mechanism itself, which switches in kernel mode from running in the context of one thread to another.

So, when Thread A calls sched_yield() and is replaced by Thread B, what happens is:

Thread A enters the kernel, changing from user mode to kernel mode;
Thread A in the kernel context-switches to Thread B in the kernel;
Thread B exits the kernel, changing from kernel mode back to user mode.

Each user thread has both a user-mode stack and a kernel-mode stack. When a thread enters the kernel, the current value of the user-mode stack (SS:ESP) and instruction pointer (CS:EIP) are saved to the thread’s kernel-mode stack, and the CPU switches to the kernel-mode stack – with the int $80 syscall mechanism, this is done by the CPU itself. The remaining register values and flags are then also saved to the kernel stack.

When a thread returns from the kernel to user-mode, the register values and flags are popped from the kernel-mode stack, then the user-mode stack and instruction pointer values are restored from the saved values on the kernel-mode stack.

When a thread context-switches, it calls into the scheduler (the scheduler does not run as a separate thread – it always runs in the context of the current thread). The scheduler code selects a process to run next, and calls the switch_to() function. This function essentially just switches the kernel stacks – it saves the current value of the stack pointer into the TCB for the current thread (called struct task_struct in Linux), and loads a previously-saved stack pointer from the TCB for the next thread. At this point it also saves and restores some other thread state that isn’t usually used by the kernel – things like floating point/SSE registers.

So you can see that the core user-mode state of a thread isn’t saved and restored at context-switch time – it’s saved and restored to the thread’s kernel stack when you enter and leave the kernel. The context-switch code doesn’t have to worry about clobbering the user-mode register values – those are already safely saved away in the kernel stack by that point.

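To summarize:
A "mode switch" is a running task switching from user mode to kernel mode, or back again. A "context switch" always happens in kernel mode and switches from one task to another.

Each user task has a user-mode stack and a kernel-mode stack. On the switch from user mode to kernel mode, the register values are saved on the kernel-mode stack; on the way back from kernel mode to user mode, they are restored from it.

During a "context switch", the scheduler saves the current kernel-mode stack pointer in the current task's task_struct and loads the stack pointer previously saved in the task_struct of the next task to run. When the kernel then returns to user mode on that stack, it is the other task that runs.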
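The "save my stack pointer, load yours" idea can be tried in user space with the POSIX ucontext API. This is only an analogy: the kernel's switch_to() is architecture-specific and is not what runs here, and the names thread_b, record and run_switch_demo are invented for the sketch. It assumes a libc that provides <ucontext.h>, such as glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t ctx_a, ctx_b;   /* saved register state of "thread" A and B */
static char stack_b[64 * 1024];   /* B gets its own stack, as each task does */
static char trace[8];             /* records the order in which code ran */
static size_t trace_len;

static void record(char c)
{
    assert(trace_len < sizeof(trace) - 1);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

static void thread_b(void)
{
    record('B');
    swapcontext(&ctx_b, &ctx_a);  /* save B's state, resume A */
    record('b');                  /* runs when A switches back to B */
    /* returning from here resumes uc_link, i.e. ctx_a */
}

/* Switch A -> B -> A -> B -> A and return the recorded order. */
const char *run_switch_demo(void)
{
    trace_len = 0;
    trace[0] = '\0';

    getcontext(&ctx_b);
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = sizeof(stack_b);
    ctx_b.uc_link = &ctx_a;       /* where to go when thread_b returns */
    makecontext(&ctx_b, thread_b, 0);

    record('A');
    swapcontext(&ctx_a, &ctx_b);  /* save A's state, start B */
    record('a');
    swapcontext(&ctx_a, &ctx_b);  /* B continues right after its own swapcontext */
    record('!');
    return trace;
}
```

Each swapcontext call returns only when some other context switches back, which is the same property the quoted text describes for the scheduler: a task that calls into the switch resumes later from exactly that point, on its own stack.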

 

Linux kernel notes (49): ERESTARTSYS and EINTR

LDD3 explains how driver code should choose between returning ERESTARTSYS and EINTR:

Note the check on the return value of down_interruptible; if it returns nonzero, the operation was interrupted. The usual thing to do in this situation is to return -ERESTARTSYS. Upon seeing this return code, the higher layers of the kernel will either restart the call from the beginning or return the error to the user. If you return -ERESTARTSYS, you must first undo any user-visible changes that might have been made, so that the right thing happens when the system call is retried. If you cannot undo things in this manner, you should return -EINTR instead.

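In other words, if the device state visible to the user can be rolled back completely to what it was before the driver code ran, return ERESTARTSYS; otherwise return EINTR. EINTR makes the system call fail and hands the error code EINTR to the application, whereas ERESTARTSYS may cause the kernel to restart the operation without the application noticing. See this post for details.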
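Seen from the application, the difference is whether the program has to retry by itself. A minimal user-space sketch, assuming a plain read() on some file descriptor (read_retry is a hypothetical helper name, not part of any library):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Call read() again for as long as it fails with EINTR.
 * Any other result (data, end of file, a different error) is returned as is.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

When the driver returns -EINTR, a loop like this is the only way the call gets repeated. When it returns -ERESTARTSYS and the kernel decides to restart the call, the loop body simply runs once and the application never sees an error.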

 

Linux kernel notes (48): CONFIG_STRICT_DEVMEM and /dev/crash

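The CONFIG_STRICT_DEVMEM option restricts access to /dev/mem: when it is set to yes, only certain regions can be accessed. On x86, for example, only the first 1 MB of memory can be read: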

# dd if=/dev/mem of=/dev/null
dd: error reading ‘/dev/mem’: Operation not permitted
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.0349979 s, 30.0 MB/s

Red Hat developed a driver, /dev/crash, that can be used in place of /dev/mem so that debuggers (such as crash) can access physical memory.

References:
/dev/crash Driver
Tools:Memory Imaging

 

Linux kernel notes (47): Functions that operate on semaphores

The functions that operate on semaphores are:

#include <linux/semaphore.h>
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_killable(struct semaphore *sem);
int down_trylock(struct semaphore *sem); 
int down_timeout(struct semaphore *sem, long jiffies);
void up(struct semaphore *sem);

Using down is no longer recommended.

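down_interruptible can be interrupted by a signal, so its return value must be checked: only a return value of 0 means the semaphore was acquired. An example of using down_interruptible: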

if (down_interruptible(&sem)) return -ERESTARTSYS;

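down_killable can only be interrupted by fatal signals, which are normally used to terminate a process. It therefore guarantees that a user process can still be killed; otherwise a deadlocked process could only be cleared by rebooting the system.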

down_trylock is the non-blocking version of down, and its return value must be checked as well. For example:

if (file->f_flags & O_NONBLOCK) {
    if (down_trylock(&iosem)) return -EAGAIN;
} else {
    if (down_interruptible(&iosem)) return -ERESTARTSYS;
}

down_timeout waits for the semaphore for at most the given time; like down, the wait cannot be interrupted by a signal.

up releases the semaphore. It never waits, so no interruptible version is needed.

References:
Mutex, semaphore and the proc file system

 

Linux kernel notes (46): Configuring the crashkernel parameter

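crashkernel configures the size and location of the memory reserved for the second kernel booted by Kexec (the crash kernel), that is, the kernel that captures the crash dump of the first kernel. The crashkernel parameter takes four forms: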

(1)

crashkernel=size[@offset]  

Reserves the memory region [offset, offset + size]. If @offset is omitted, a suitable offset is chosen automatically.
(2)

crashkernel=range1:size1[,range2:size2,...][@offset]
range=start-[end] (`start` is inclusive, `end` is exclusive)

For example:

crashkernel=512M-2G:64M,2G-:128M

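This means:
a) if the system has less than 512M of memory, nothing is reserved;
b) if it has at least 512M but less than 2G, 64M is reserved;
c) if it has 2G or more, 128M is reserved.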
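The range rule can be written down as a small user-space function. This is a sketch of the selection logic only, not the kernel's parser: parse_size and crashkernel_size are hypothetical names, only the K, M and G suffixes are handled, and a trailing @offset is ignored.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Parse "<number>[K|M|G]" at *s and advance *s past it. */
unsigned long long parse_size(const char **s)
{
    char *end;
    unsigned long long v = strtoull(*s, &end, 10);

    switch (*end) {
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/*
 * Given "start-[end]:size[,start-[end]:size...]" and the amount of RAM,
 * return the size of the first range with start <= ram < end, or 0.
 */
unsigned long long crashkernel_size(const char *spec, unsigned long long ram)
{
    const char *p = spec;

    assert(spec != NULL);
    while (*p) {
        unsigned long long start, end = ULLONG_MAX, size;

        start = parse_size(&p);
        if (*p != '-')
            return 0;                 /* malformed range */
        p++;
        if (*p != ':')
            end = parse_size(&p);     /* an omitted end means "no upper limit" */
        if (*p != ':')
            return 0;                 /* malformed range */
        p++;
        size = parse_size(&p);
        if (ram >= start && ram < end)
            return size;
        if (*p != ',')
            break;                    /* end of the list */
        p++;
    }
    return 0;
}
```

With the spec from the example, crashkernel_size("512M-2G:64M,2G-:128M", ram) gives 0 for a 256M machine, 64M for a 1G machine, and 128M for a machine with exactly 2G, since the upper bound of a range is exclusive.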

(3)

crashkernel=size,high

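x86_64 only. When the system has more than 4G of memory, this allows the kernel to allocate the region from the top, that is, from addresses above 4G. With less than 4G of memory, the region is naturally allocated below 4G. If crashkernel=size is also specified, this option is ignored.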

(4)

crashkernel=size,low

x86_64 only. When crashkernel=size,high is given, some memory must also be reserved low, that is, below 4G. By default the system tries to reserve at least 256M there automatically.

References:
Kernel Parameters

 

Linux kernel notes (45): f_pos

f_pos is a member of struct file (defined in <linux/fs.h>) and holds the file's current read/write position:

struct file {
    ......
    loff_t          f_pos;
    ......
};

LDD3 describes f_pos as follows:

loff_t f_pos;

The current reading or writing position. loff_t is a 64-bit value on all platforms ( long long in gcc terminology). The driver can read this value if it needs to know the current position in the file but should not normally change it; read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly. The one exception to this rule is in the llseek method, the purpose of which is to change the file position.

A driver's read and write methods should not update filp->f_pos directly. For the reason, see this note.
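The rule from LDD3 can be modelled in user space. The sketch below is shaped like a driver read method but takes no struct file at all, which makes the point that the position pointer is the only thing it needs. demo_data and demo_read are hypothetical names, memcpy stands in for copy_to_user, and long long stands in for loff_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical device contents for the model. */
static const char demo_data[] = "hello, f_pos";

/*
 * Copy up to count bytes starting at *ppos into buf and advance *ppos
 * by the number of bytes copied. Returns 0 at end of data.
 */
ssize_t demo_read(char *buf, size_t count, long long *ppos)
{
    size_t size = sizeof(demo_data) - 1;   /* exclude the trailing '\0' */

    assert(buf != NULL && ppos != NULL);
    if (*ppos < 0)
        return -1;                          /* invalid position */
    if ((unsigned long long)*ppos >= size)
        return 0;                           /* end of file */
    if (count > size - (size_t)*ppos)
        count = size - (size_t)*ppos;       /* clamp to the remaining bytes */
    memcpy(buf, demo_data + *ppos, count);
    *ppos += (long long)count;              /* update via the pointer only */
    return (ssize_t)count;
}
```

Because the function only touches *ppos, the caller decides which position is being advanced. That matters for calls in the style of pread, where the position passed in is not the file's own f_pos.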

 

Linux kernel notes (44): Using character devices

The Linux kernel represents a character device (char device) with struct cdev, defined in <linux/cdev.h>:

#include <linux/kobject.h>
#include <linux/kdev_t.h>
#include <linux/list.h>

struct file_operations;
struct inode;
struct module;

struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
};

void cdev_init(struct cdev *, const struct file_operations *);

struct cdev *cdev_alloc(void);

void cdev_put(struct cdev *p);

int cdev_add(struct cdev *, dev_t, unsigned);

void cdev_del(struct cdev *);

void cd_forget(struct inode *);

There are two ways to allocate and initialize a struct cdev:

(1)

struct cdev *my_cdev = cdev_alloc();
my_cdev->ops = &my_fops;
my_cdev->owner = THIS_MODULE;

(2) The other way is to embed the cdev in the structure that represents the device:

struct scull_dev {
    ......
    struct cdev cdev; /* Char device structure */
    ......
};

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
    int err, devno = MKDEV(scull_major, scull_minor + index);
    cdev_init(&dev->cdev, &scull_fops);
    dev->cdev.owner = THIS_MODULE;
    dev->cdev.ops = &scull_fops;
    ......
}

Either way, remember to set owner to THIS_MODULE.

After the cdev structure has been initialized, call cdev_add to add the device to the system:

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
    int error;

    p->dev = dev;
    p->count = count;

    error = kobj_map(cdev_map, dev, count, NULL,
             exact_match, exact_lock, p);
    if (error)
        return error;

    kobject_get(p->kobj.parent);

    return 0;
}

Note that count is the number of consecutive minor numbers.
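What "consecutive minor numbers" means for a given dev and count can be checked with a small user-space model. The macros below copy the 20-bit minor layout used by <linux/kdev_t.h>; dev_model_t and cdev_range_covers are hypothetical names that stand in for dev_t and for the range check implied by cdev_add.

```c
#include <assert.h>

/* Same layout as <linux/kdev_t.h>: major number above a 20-bit minor. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

typedef unsigned int dev_model_t;   /* stands in for the kernel's dev_t */

/*
 * Would a registration of count device numbers starting at first
 * be responsible for devno?
 */
int cdev_range_covers(dev_model_t first, unsigned int count, dev_model_t devno)
{
    assert(count > 0);
    return devno >= first && devno - first < count;
}
```

For instance, registering MKDEV(250u, 0u) with a count of 4 covers minors 0 through 3 of major 250, so MKDEV(250u, 3u) is covered and MKDEV(250u, 4u) is not. The major number 250 is arbitrary here.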

To remove a device, use cdev_del:

/**
 * cdev_del() - remove a cdev from the system
 * @p: the cdev structure to be removed
 *
 * cdev_del() removes @p from the system, possibly freeing the structure
 * itself.
 */
void cdev_del(struct cdev *p)
{
    cdev_unmap(p->dev, p->count);
    kobject_put(&p->kobj);
}

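The register_chrdev and unregister_chrdev functions, which were used to register and remove devices before version 2.6, are considered obsolete and should not be used in new code.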