Linux | My Site

Linux kernel notes (49): ERESTARTSYS and EINTR

LDD3 explains how driver code should choose between returning ERESTARTSYS and EINTR:

Note the check on the return value of down_interruptible; if it returns nonzero, the operation was interrupted. The usual thing to do in this situation is to return -ERESTARTSYS. Upon seeing this return code, the higher layers of the kernel will either restart the call from the beginning or return the error to the user. If you return -ERESTARTSYS, you must first undo any user-visible changes that might have been made, so that the right thing happens when the system call is retried. If you cannot undo things in this manner, you should return -EINTR instead.

In other words, return -ERESTARTSYS if the device state visible to the user can be rolled back completely to what it was before the driver code ran; otherwise return -EINTR. -EINTR makes the system call fail and hands the EINTR error code to the application, whereas -ERESTARTSYS may let the kernel reissue the operation without the application ever noticing. See this post for reference.
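The difference can be observed from user space. The following is a minimal sketch, not taken from LDD3: it blocks in read() on an empty pipe while a periodic SIGALRM fires. The helper names (interrupted_read, on_alarm) and the 50 ms interval are assumptions made for illustration. Without SA_RESTART the application sees EINTR; with SA_RESTART the kernel quietly restarts the read, which completes once the handler has written a byte on the second tick.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;
static int wake_fd = -1;   /* write end of the pipe, used by the handler */

/* On the second tick, make data available so a restarted read() can finish. */
static void on_alarm(int sig)
{
    (void)sig;
    if (++ticks == 2) {
        ssize_t ignored = write(wake_fd, "x", 1);
        (void)ignored;
    }
}

/*
 * Block in read() on an empty pipe while SIGALRM fires every 50 ms.
 * Returns the byte count from read(), or -errno on failure.
 */
static int interrupted_read(int use_sa_restart)
{
    int fds[2];
    char c;
    struct sigaction sa;
    struct itimerval on = { {0, 50000}, {0, 50000} };
    struct itimerval off = { {0, 0}, {0, 0} };
    ssize_t n;
    int saved;

    if (pipe(fds) < 0)
        return -errno;
    wake_fd = fds[1];
    ticks = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    sigaction(SIGALRM, &sa, NULL);

    setitimer(ITIMER_REAL, &on, NULL);    /* start the periodic signal */
    n = read(fds[0], &c, 1);              /* blocks: the pipe is empty */
    saved = errno;
    setitimer(ITIMER_REAL, &off, NULL);   /* stop the timer again */

    close(fds[0]);
    close(fds[1]);
    return n < 0 ? -saved : (int)n;
}
```

Calling interrupted_read(0) is expected to yield -EINTR, while interrupted_read(1) returns 1 because the restarted read picks up the byte written by the handler.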

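Linux kernel notes (48): CONFIG_STRICT_DEVMEM and /dev/crash

The CONFIG_STRICT_DEVMEM option restricts access to /dev/mem: when it is set to y, only certain regions can be accessed. On x86, for example, only the first 1 MB of memory can be read: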

# dd if=/dev/mem of=/dev/null
dd: error reading ‘/dev/mem’: Operation not permitted
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.0349979 s, 30.0 MB/s

Red Hat developed a driver, /dev/crash, that can be used in place of /dev/mem so that debuggers such as crash can access physical memory.

References:
/dev/crash Driver;
Tools:Memory Imaging.

Linux kernel notes (47): semaphore functions

The functions that operate on semaphores are:

#include <linux/semaphore.h>
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_killable(struct semaphore *sem);
int down_trylock(struct semaphore *sem); 
int down_timeout(struct semaphore *sem, long jiffies);
void up(struct semaphore *sem);

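down is deprecated; use down_interruptible or down_killable instead.

down_interruptible can be interrupted by a signal, so its return value must be checked: only a return value of 0 means the semaphore was acquired. For example: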

if (down_interruptible(&sem)) return -ERESTARTSYS;

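down_killable can be interrupted only by fatal signals, which are normally used to terminate a process. It therefore guarantees that a user process can still be killed; otherwise a deadlocked process could only be cleared by rebooting the system.

down_trylock is the non-blocking variant of down, and its return value must be checked as well: it returns 0 when the semaphore was acquired and 1 when it was not. For example: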

if (file->f_flags & O_NONBLOCK) {
    if (down_trylock(&iosem)) return -EAGAIN;
} else {
    if (down_interruptible(&iosem)) return -ERESTARTSYS;
}

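down_timeout waits for up to the given number of jiffies. The wait cannot be interrupted by signals, and -ETIME is returned if the timeout expires.

up releases the semaphore. It never sleeps, so no interruptible variant is needed.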
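The return-value convention of down_trylock is easy to get backwards, because 0 means success. The following is a toy, single-threaded user-space model of that convention, not kernel code: struct toy_sem and the toy_* functions are hypothetical names invented for this sketch, and there is no real blocking or locking involved.

```c
#include <errno.h>

/* Toy, single-threaded stand-in for the kernel's struct semaphore. */
struct toy_sem {
    int count;
};

/* Mirrors down_trylock(): 0 when acquired, 1 when the semaphore is busy. */
static int toy_down_trylock(struct toy_sem *sem)
{
    if (sem->count > 0) {
        sem->count--;
        return 0;
    }
    return 1;
}

/* Mirrors up(): release one unit; never blocks. */
static void toy_up(struct toy_sem *sem)
{
    sem->count++;
}

/* The O_NONBLOCK branch of the driver pattern: busy maps to -EAGAIN. */
static int toy_acquire_nonblock(struct toy_sem *sem)
{
    if (toy_down_trylock(sem))
        return -EAGAIN;
    return 0;
}
```

With a semaphore initialized to a count of 1, the first toy_acquire_nonblock call succeeds, the second returns -EAGAIN, and a toy_up in between makes it succeed again.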

References:
Mutex, semaphore and the proc file system.

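Linux kernel notes (46): configuring the crashkernel parameter

crashkernel sets the size and location of the memory reserved for the second kernel booted through kexec (the crash kernel), that is, the kernel that captures the crash dump of the first kernel. The parameter takes four forms:

(1)

crashkernel=size[@offset]

Reserves the memory region [offset, offset + size). If @offset is omitted, a suitable offset is chosen automatically.

(2)

crashkernel=range1:size1[,range2:size2,...][@offset]
range=start-[end] (start is inclusive, end is exclusive)

For example: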

crashkernel=512M-2G:64M,2G-:128M

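This means:
a) with less than 512M of memory, nothing is reserved;
b) with at least 512M but less than 2G of memory, 64M is reserved;
c) with 2G of memory or more, 128M is reserved.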
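The range selection described above can be sketched in user space. This is an illustrative parser written for this note, not the kernel's implementation; parse_size and crash_size_for are hypothetical names, only the K/M/G suffixes are handled, and any trailing @offset is ignored.

```c
#include <stdlib.h>

/* Parse a size such as "512M" or "2G"; *end is set just past the suffix. */
static unsigned long long parse_size(const char *s, const char **end)
{
    char *p;
    unsigned long long v = strtoull(s, &p, 10);

    switch (*p) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; p++; break;
    default: break;
    }
    *end = p;
    return v;
}

/*
 * Return the number of bytes reserved for total_ram bytes of memory by a
 * specification such as "512M-2G:64M,2G-:128M", or 0 if no range matches
 * or the specification is malformed.
 */
static unsigned long long crash_size_for(const char *spec,
                                         unsigned long long total_ram)
{
    const char *p = spec;

    while (*p) {
        unsigned long long start, end = ~0ULL, size;

        start = parse_size(p, &p);
        if (*p != '-')
            return 0;                   /* malformed range */
        p++;
        if (*p != ':')                  /* an omitted end means "no upper limit" */
            end = parse_size(p, &p);
        if (*p != ':')
            return 0;
        p++;
        size = parse_size(p, &p);

        if (total_ram >= start && total_ram < end)   /* start inclusive, end exclusive */
            return size;
        if (*p != ',')
            break;                      /* '@offset' or end of string */
        p++;
    }
    return 0;
}
```

For the example specification, a machine with 1G of memory falls into the first range (64M reserved), while a machine with exactly 2G falls into the second (128M reserved), because the end of a range is exclusive.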

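(3)

crashkernel=size,high

x86_64 only. When the system has more than 4G of memory, this lets the kernel allocate the region from the top, that is, from addresses above 4G. With less than 4G of memory, the region naturally comes from below 4G. This option is ignored if crashkernel=size is also specified.

(4)

crashkernel=size,low

x86_64 only. When crashkernel=size,high is given, a region below 4G must also be reserved, and this option sets its size. By default, the kernel tries to reserve at least 256M of low memory automatically.

References:
Kernel Parameters.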

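An introduction to kmod

kmod provides a set of tools for working with Linux kernel modules. It is built on top of the libkmod library, which ships with the kmod source. Source code: http://git.kernel.org/cgit/utils/kernel/kmod/kmod.git/.

On SuSE Linux, run: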

/sbin # ls -alt | grep kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 depmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 insmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 lsmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 modinfo -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 modprobe -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 rmmod -> /usr/bin/kmod

As the listing shows, familiar commands such as insmod and modprobe are all symbolic links to the kmod binary.
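A single binary can behave as several commands by looking at the name it was invoked under. The sketch below illustrates that general multi-call technique; it is not kmod's actual code, and tool_from_argv0 and known_tools are hypothetical names. The tool list is copied from the symlink listing above.

```c
#include <string.h>

/* Tool names taken from the symlink listing above. */
static const char *const known_tools[] = {
    "depmod", "insmod", "lsmod", "modinfo", "modprobe", "rmmod",
};

/*
 * Return the tool selected by the invocation name (argv[0]),
 * or NULL if the binary was run under an unknown name.
 */
static const char *tool_from_argv0(const char *argv0)
{
    const char *base = strrchr(argv0, '/');
    size_t i;

    base = base ? base + 1 : argv0;     /* strip any directory prefix */
    for (i = 0; i < sizeof(known_tools) / sizeof(known_tools[0]); i++)
        if (strcmp(base, known_tools[i]) == 0)
            return known_tools[i];
    return NULL;
}
```

A main function would pass argv[0] to tool_from_argv0 and dispatch on the result, so that running the binary through the /sbin/insmod link selects the insmod behaviour.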

Linux kernel notes (45): f_pos

f_pos is a member of struct file (defined in <linux/fs.h>) and holds the file's current read/write position:

struct file {
    ......
    loff_t          f_pos;
    ......
};

LDD3 describes f_pos as follows:

loff_t f_pos;

The current reading or writing position. loff_t is a 64-bit value on all platforms (long long in gcc terminology). The driver can read this value if it needs to know the current position in the file but should not normally change it; read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly. The one exception to this rule is in the llseek method, the purpose of which is to change the file position.

A driver's read and write methods should not update filp->f_pos directly. For the reasons, see this note.
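One way to picture the rule is a read routine that only touches the position through the pointer it is handed. The following user-space sketch assumes a device whose contents are a plain memory buffer; toy_read and toy_loff_t are hypothetical names, and memcpy stands in for the copy_to_user call a real driver would use.

```c
#include <stddef.h>
#include <string.h>

typedef long long toy_loff_t;   /* stand-in for the kernel's loff_t */

/*
 * Copy up to count bytes from a device buffer of size bytes, starting at
 * *ppos, and advance *ppos by the number of bytes copied. Returns that
 * number, or 0 if the position lies outside the data.
 */
static long toy_read(const char *dev_data, size_t size,
                     char *buf, size_t count, toy_loff_t *ppos)
{
    toy_loff_t pos = *ppos;

    if (pos < 0 || (size_t)pos >= size)
        return 0;
    if (count > size - (size_t)pos)
        count = size - (size_t)pos;       /* clamp to the remaining data */
    memcpy(buf, dev_data + pos, count);   /* a real driver would use copy_to_user() */
    *ppos = pos + (toy_loff_t)count;      /* update via the pointer, never filp->f_pos */
    return (long)count;
}
```

Because the position arrives as a pointer, the caller decides which position is updated. Calls such as pread pass a position that is not the file's own offset, which is one reason writing to filp->f_pos directly would be wrong.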

The /run and /var/run directories on Linux

The following is quoted from Wikipedia:

Modern Linux distributions include a /run directory as a temporary filesystem (tmpfs) which stores volatile runtime data, following the FHS version 3.0. According to the FHS version 2.3, such data were stored in /var/run but this was a problem in some cases because this directory isn’t always available at early boot. As a result, these programs have had to resort to trickery, such as using /dev/.udev, /dev/.mdadm, /dev/.systemd or /dev/.mount directories, even though the device directory isn’t intended for such data. Among other advantages, this makes the system easier to use normally with the root filesystem mounted read-only.

/run is a temporary filesystem that holds information about the running system since boot. Files under this directory should be removed or truncated when the system reboots. If your system has a /var/run directory, it should point to /run. Here is how SuSE 12 does it:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
......
tmpfs           431M  7.1M  424M   2% /run
......

# ls -lt /var/run
lrwxrwxrwx 1 root root 4 Nov  5 21:14 /var/run -> /run

/dev/mem, /dev/kmem and /dev/port

These three files represent physical memory, kernel virtual memory and I/O ports respectively. See the description below:

/dev/mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system. Byte addresses in /dev/mem are interpreted as physical memory addresses. References to nonexistent locations cause errors to be returned.

The file /dev/kmem is the same as /dev/mem, except that the kernel virtual memory rather than physical memory is accessed.

/dev/port is similar to /dev/mem, but the I/O ports are accessed.

Linux kernel notes (44): using character devices

The Linux kernel represents a character device (char device) with struct cdev, defined in <linux/cdev.h>:

#include <linux/kobject.h>
#include <linux/kdev_t.h>
#include <linux/list.h>

struct file_operations;
struct inode;
struct module;

struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
};

void cdev_init(struct cdev *, const struct file_operations *);

struct cdev *cdev_alloc(void);

void cdev_put(struct cdev *p);

int cdev_add(struct cdev *, dev_t, unsigned);

void cdev_del(struct cdev *);

void cd_forget(struct inode *);

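There are two ways to allocate and initialize a struct cdev:

(1) Allocate a standalone structure at runtime:

struct cdev *my_cdev = cdev_alloc();   /* returns NULL on failure; check before use */
my_cdev->ops = &my_fops;
my_cdev->owner = THIS_MODULE;

(2) Embed the cdev in the structure that represents the device: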

struct scull_dev {
    ......
    struct cdev cdev; /* Char device structure */
    ......
};

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
    int err, devno = MKDEV(scull_major, scull_minor + index);

    cdev_init(&dev->cdev, &scull_fops);   /* zeroes the cdev and sets ops */
    dev->cdev.owner = THIS_MODULE;
    dev->cdev.ops = &scull_fops;          /* redundant after cdev_init(); kept as in LDD3 */
    ......
}

Either way, remember to set owner to THIS_MODULE.

Once the cdev is initialized, add the device to the system with cdev_add:

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
    int error;

    p->dev = dev;
    p->count = count;

    error = kobj_map(cdev_map, dev, count, NULL,
             exact_match, exact_lock, p);
    if (error)
        return error;

    kobject_get(p->kobj.parent);

    return 0;
}

Note that count is the number of consecutive minor numbers.
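To make the (dev, count) pair concrete, the sketch below reproduces the device-number macros from <linux/kdev_t.h> locally so that it builds in user space, and adds a hypothetical helper, toy_cdev_covers, that checks whether a device number falls inside the range a cdev was registered for. The major number 250 used in the usage note is an arbitrary example value.

```c
/* Local copies of the macros in <linux/kdev_t.h> so this builds in user space. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

typedef unsigned int toy_dev_t;   /* stand-in for the kernel's dev_t */

/*
 * Does a cdev added with cdev_add(p, first, count) cover device number dev?
 * The covered range is first .. first + count - 1.
 */
static int toy_cdev_covers(toy_dev_t first, unsigned int count, toy_dev_t dev)
{
    return dev >= first && dev - first < count;
}
```

With first = MKDEV(250, 0) and count = 4, minors 0 through 3 of major 250 are covered, and minor 4 is not.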

To remove the device, use cdev_del:

/**
 * cdev_del() - remove a cdev from the system
 * @p: the cdev structure to be removed
 *
 * cdev_del() removes @p from the system, possibly freeing the structure
 * itself.
 */
void cdev_del(struct cdev *p)
{
    cdev_unmap(p->dev, p->count);
    kobject_put(&p->kobj);
}

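The pre-2.6 interface for registering and removing devices, register_chrdev and unregister_chrdev, is the older way of doing things; new code should use the cdev interface instead.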

Linux kernel notes (43): do_sys_open

Here is do_sys_open as of kernel 3.12:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_flags op;
    int fd = build_open_flags(flags, mode, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

The core steps are:

a) get_unused_fd_flags obtains a file descriptor;
b) do_filp_open obtains a struct file;
c) fd_install associates the file descriptor with the struct file.
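The reserve, obtain, then publish-or-roll-back shape of these steps can be modelled with a small table. This is a toy user-space sketch, not kernel code: every toy_* name is hypothetical, the table has a fixed size of 8, and passing a NULL file stands in for do_filp_open failing.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

struct toy_file {
    const char *name;
};

struct toy_fdtable {
    struct toy_file *fd[TOY_MAX_FDS];      /* installed files */
    unsigned char reserved[TOY_MAX_FDS];   /* descriptor handed out, maybe not installed yet */
};

/* Like get_unused_fd_flags(): reserve the lowest free descriptor, or -EMFILE. */
static int toy_get_unused_fd(struct toy_fdtable *t)
{
    int i;

    for (i = 0; i < TOY_MAX_FDS; i++) {
        if (!t->reserved[i]) {
            t->reserved[i] = 1;
            return i;
        }
    }
    return -EMFILE;
}

/* Like put_unused_fd(): give back a descriptor that was never installed. */
static void toy_put_unused_fd(struct toy_fdtable *t, int fd)
{
    t->reserved[fd] = 0;
}

/* Like fd_install(): make the file reachable through the descriptor. */
static void toy_fd_install(struct toy_fdtable *t, int fd, struct toy_file *f)
{
    t->fd[fd] = f;
}

/* Same shape as do_sys_open(): reserve fd, obtain file, publish or roll back. */
static int toy_open(struct toy_fdtable *t, struct toy_file *f)
{
    int fd = toy_get_unused_fd(t);

    if (fd >= 0) {
        if (!f) {                      /* stands in for do_filp_open() failing */
            toy_put_unused_fd(t, fd);
            fd = -ENOENT;
        } else {
            toy_fd_install(t, fd, f);
        }
    }
    return fd;
}
```

A failed open leaves no trace in the table: the descriptor reserved for it is released, so the next successful open reuses the same number.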

struct file has an f_op member:

struct file {
    ......
    const struct file_operations    *f_op;
    ......
    void            *private_data;
    ......
};

struct file_operations in turn has an open member:

struct file_operations {
    ......
    int (*open) (struct inode *, struct file *);
    ......
};

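The open method takes two arguments: the inode of the actual file and the struct file.

Before the open system call invokes the driver's open method (the open member of struct file_operations), private_data is set to NULL; the driver can then set private_data to whatever it needs (see the do_dentry_open function).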
