Linux kernel notes (49): ERESTARTSYS and EINTR

LDD3 explains how driver code should choose between returning ERESTARTSYS and EINTR:

Note the check on the return value of down_interruptible; if it returns nonzero, the operation was interrupted. The usual thing to do in this situation is to return -ERESTARTSYS. Upon seeing this return code, the higher layers of the kernel will either restart the call from the beginning or return the error to the user. If you return -ERESTARTSYS, you must first undo any user-visible changes that might have been made, so that the right thing happens when the system call is retried. If you cannot undo things in this manner, you should return -EINTR instead.

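In other words: if the device state visible to the user can be rolled back completely to what it was before the driver code ran, return ERESTARTSYS; otherwise return EINTR. EINTR makes the system call fail and hands the error code EINTR to the application, whereas ERESTARTSYS may let the kernel reissue the operation without the application ever noticing. See also this post.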
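Seen from the application side, a call that fails with EINTR is usually just retried. Below is a minimal userspace sketch of that retry loop. The helper name read_retry is made up for illustration and does not come from LDD3.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call read() again whenever it fails with EINTR.
 * If the kernel restarts the call transparently, read() never reports
 * EINTR and the loop body runs exactly once. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

A driver that returns -EINTR pushes this retry onto every caller, which is one reason to prefer -ERESTARTSYS when the operation can be undone.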

 

Linux kernel notes (48): CONFIG_STRICT_DEVMEM and /dev/crash

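The CONFIG_STRICT_DEVMEM option restricts access to /dev/mem: when it is set to yes, only a specific region can be accessed. On x86, for example, only the first 1 MB of memory can be read: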

# dd if=/dev/mem of=/dev/null
dd: error reading ‘/dev/mem’: Operation not permitted
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.0349979 s, 30.0 MB/s

Red Hat developed a driver, /dev/crash, that can be used in place of /dev/mem so that debuggers (such as crash) can access physical memory.

References:
/dev/crash Driver
Tools:Memory Imaging

 

Linux kernel notes (47): functions that operate on semaphores

The functions that operate on semaphores are:

#include <linux/semaphore.h>
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_killable(struct semaphore *sem);
int down_trylock(struct semaphore *sem); 
int down_timeout(struct semaphore *sem, long jiffies);
void up(struct semaphore *sem);

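down is no longer recommended.

down_interruptible can be interrupted by a signal, so its return value must be checked: only a return value of 0 means the semaphore was acquired. An example of using down_interruptible: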

if (down_interruptible(&sem)) return -ERESTARTSYS;

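down_killable can be interrupted only by fatal signals, which are normally used to terminate a process. down_killable therefore guarantees that a user process can still be killed; otherwise a deadlocked process could only be cleared by rebooting the system.

down_trylock is the non-blocking version of down, and its return value must be checked as well: it returns 0 on success and nonzero if the semaphore is not available. For example: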

if (file->f_flags & O_NONBLOCK) {
    if (down_trylock(&iosem)) return -EAGAIN;
} else {
    if (down_interruptible(&iosem)) return -ERESTARTSYS;
}

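down_timeout waits for the semaphore for at most the given number of jiffies. Like down, the wait cannot be interrupted by signals; it returns -ETIME if the timeout expires.

up releases the semaphore; no interruptible version is needed.

References: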
Mutex, semaphore and the proc file system

 

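Linux kernel notes (46): configuring the crashkernel parameter

crashkernel configures the size and location of the memory reserved for the second kernel booted by Kexec (the crash kernel), i.e. the kernel used to capture the crash dump of the first kernel. The crashkernel parameter takes four forms: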

(1)

crashkernel=size[@offset]  

Reserves the memory region [offset, offset + size]. If @offset is omitted, a suitable offset is chosen automatically.

(2)

crashkernel=range1:size1[,range2:size2,...][@offset]
range=start-[end] (start is inclusive, end is exclusive)

For example:

crashkernel=512M-2G:64M,2G-:128M

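This means:
a) with less than 512M of memory, nothing is reserved;
b) with at least 512M but less than 2G of memory, 64M is reserved;
c) with 2G of memory or more, 128M is reserved.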
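The range rules above can be sketched in userspace C. This is only an illustration of the syntax, not the kernel's parser: the function names parse_size and crash_size_for are invented here, and any trailing @offset is simply ignored.

```c
#include <stdlib.h>

/* Parse a size such as "512M" or "2G"; *end is left just past the parsed text. */
unsigned long long parse_size(const char *s, char **end)
{
    unsigned long long v = strtoull(s, end, 0);

    switch (**end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Return the size reserved for `ram` bytes of memory, or 0 when no range
 * matches or the spec is malformed. */
unsigned long long crash_size_for(const char *spec, unsigned long long ram)
{
    const char *p = spec;
    char *end;

    while (*p) {
        unsigned long long start, stop = ~0ULL, size;

        start = parse_size(p, &end);
        if (*end != '-')
            return 0;
        p = end + 1;
        if (*p != ':') {            /* an omitted end means "no upper bound" */
            stop = parse_size(p, &end);
            p = end;
        }
        if (*p != ':')
            return 0;
        size = parse_size(p + 1, &end);

        if (ram >= start && ram < stop)   /* start inclusive, end exclusive */
            return size;

        p = end;
        if (*p != ',')
            break;
        p++;
    }
    return 0;
}
```

With the spec from the example, a machine with exactly 2G of memory falls into the second range, because the end of the first range is exclusive.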

(3)

crashkernel=size,high

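Only for X86_64. When the system has more than 4G of memory, this allows the kernel to allocate from the top, i.e. from addresses above 4G. With less than 4G of memory, the allocation naturally comes from the address space below 4G. This option is ignored if crashkernel=size is specified.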

(4)

crashkernel=size,low

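Only for X86_64. When crashkernel=size,high is specified, a region of low memory, i.e. below 4G, must be allocated as well. By default the system tries to allocate at least 256M of it automatically.

References: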
Kernel Parameters

 

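Introduction to kmod

kmod provides a set of tools for handling Linux kernel modules. It is built on top of the libkmod library, which ships with the kmod source. Source: http://git.kernel.org/cgit/utils/kernel/kmod/kmod.git/

On SuSE Linux, run the following command: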

/sbin # ls -alt | grep kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 depmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 insmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 lsmod -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 modinfo -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 modprobe -> /usr/bin/kmod
lrwxrwxrwx 1 root root        13 Nov  5 21:17 rmmod -> /usr/bin/kmod

As shown, commonly used commands such as insmod and modprobe all end up invoking the kmod binary.
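One binary can act as several tools by looking at the name it was invoked under. The sketch below shows that multi-call pattern in its simplest form. It is not kmod's actual source, and tool_name is a hypothetical helper.

```c
#include <string.h>

/* Return the tool name a multi-call binary would act as:
 * the last path component of argv[0]. */
const char *tool_name(const char *argv0)
{
    const char *slash = strrchr(argv0, '/');

    return slash ? slash + 1 : argv0;
}
```

A main function would then compare the result against names such as "insmod" or "modprobe" and branch to the matching implementation.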

 

Linux kernel notes (45): f_pos

f_pos is a member of struct file (defined in <linux/fs.h>) and holds the current read/write position in the file:

struct file {
    ......
    loff_t          f_pos;
    ......
};

LDD3 describes f_pos as follows:

loff_t f_pos;

The current reading or writing position. loff_t is a 64-bit value on all platforms ( long long in gcc terminology). The driver can read this value if it needs to know the current position in the file but should not normally change it; read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly. The one exception to this rule is in the llseek method, the purpose of which is to change the file position.

A driver's read and write operations should not update filp->f_pos directly. For the reasons, see this note.
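The rule is easier to see in a small userspace model of a read method. Everything here is an assumption made for illustration: model_read is an invented name, the data/size pair stands in for the device contents, and memcpy stands in for copy_to_user. The point is only that the position is advanced through the pointer argument.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Userspace model of a driver read method: copy up to `count` bytes
 * starting at *ppos and advance *ppos by the number of bytes copied. */
ssize_t model_read(const char *data, size_t size,
                   char *buf, size_t count, long long *ppos)
{
    if (*ppos < 0)
        return -EINVAL;
    if ((unsigned long long)*ppos >= size)
        return 0;                           /* end of file */
    if (count > size - (size_t)*ppos)
        count = size - (size_t)*ppos;       /* clamp to what is left */

    memcpy(buf, data + *ppos, count);
    *ppos += (long long)count;              /* update via the pointer only */
    return (ssize_t)count;
}
```

Because the method only touches the offset it was handed, the caller decides which offset that is; positional calls such as pread can pass their own offset without disturbing the file's f_pos.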

 

The "/run" and "/var/run" directories on Linux

The following is quoted from Wikipedia:

Modern Linux distributions include a /run directory as a temporary filesystem (tmpfs) which stores volatile runtime data, following the FHS version 3.0. According to the FHS version 2.3, such data were stored in /var/run but this was a problem in some cases because this directory isn’t always available at early boot. As a result, these programs have had to resort to trickery, such as using /dev/.udev, /dev/.mdadm, /dev/.systemd or /dev/.mount directories, even though the device directory isn’t intended for such data. Among other advantages, this makes the system easier to use normally with the root filesystem mounted read-only.

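/run is a temporary filesystem that holds information about the system since it was booted. Files under this directory should be removed or truncated when the system reboots. If your system has a /var/run directory, it should point to /run. See the SuSE 12 implementation: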

# df -h
Filesystem      Size  Used Avail Use% Mounted on
......
tmpfs           431M  7.1M  424M   2% /run
......

# ls -lt /var/run
lrwxrwxrwx 1 root root 4 Nov  5 21:14 /var/run -> /run

 

/dev/mem, /dev/kmem and /dev/port

The three files /dev/mem, /dev/kmem and /dev/port represent physical memory, kernel virtual memory and I/O ports respectively. See the following:

/dev/mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system. Byte addresses in /dev/mem are interpreted as physical memory addresses. References to nonexistent locations cause errors to be returned.

The file /dev/kmem is the same as /dev/mem, except that the kernel virtual memory rather than physical memory is accessed.

/dev/port is similar to /dev/mem, but the I/O ports are accessed.

 

Linux kernel notes (44): using character devices

The Linux kernel represents a character device (char device) with struct cdev, defined in <linux/cdev.h>:

#include <linux/kobject.h>
#include <linux/kdev_t.h>
#include <linux/list.h>

struct file_operations;
struct inode;
struct module;

struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
};

void cdev_init(struct cdev *, const struct file_operations *);

struct cdev *cdev_alloc(void);

void cdev_put(struct cdev *p);

int cdev_add(struct cdev *, dev_t, unsigned);

void cdev_del(struct cdev *);

void cd_forget(struct inode *);

There are two ways to allocate and initialize a cdev structure:

(1)

struct cdev *my_cdev = cdev_alloc();
my_cdev->ops = &my_fops;
my_cdev->owner = THIS_MODULE;

(2) The other is to embed the cdev in the structure that represents the device:

struct scull_dev {
    ......
    struct cdev cdev; /* Char device structure */
    ......
};

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
    int err, devno = MKDEV(scull_major, scull_minor + index);
    cdev_init(&dev->cdev, &scull_fops);
    dev->cdev.owner = THIS_MODULE;
    dev->cdev.ops = &scull_fops;
    ......
}

Either way, remember to set owner to THIS_MODULE.

Once the cdev structure is initialized, use cdev_add to add the device to the system:

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
    int error;

    p->dev = dev;
    p->count = count;

    error = kobj_map(cdev_map, dev, count, NULL,
             exact_match, exact_lock, p);
    if (error)
        return error;

    kobject_get(p->kobj.parent);

    return 0;
}

Note that count is the number of consecutive minor numbers.
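To make the meaning of count concrete, here is a userspace sketch of how a device number is composed and which numbers a (first, count) pair covers. The MODEL_ macros copy the layout in <linux/kdev_t.h> (20 minor bits), while model_cdev_covers is an invented helper and not a kernel function.

```c
/* Userspace copies of the kernel's device-number macros; the MODEL_
 * prefix marks them as illustrations, not the kernel's definitions. */
#define MODEL_MINORBITS 20
#define MODEL_MINORMASK ((1U << MODEL_MINORBITS) - 1)
#define MODEL_MKDEV(ma, mi) (((unsigned int)(ma) << MODEL_MINORBITS) | (mi))
#define MODEL_MAJOR(dev) ((unsigned int)((dev) >> MODEL_MINORBITS))
#define MODEL_MINOR(dev) ((unsigned int)((dev) & MODEL_MINORMASK))

/* Hypothetical helper: would a cdev added with (first, count) be
 * responsible for device number dev? */
int model_cdev_covers(unsigned int first, unsigned int count, unsigned int dev)
{
    return dev >= first && dev - first < count;
}
```

For instance, a cdev added with first = MODEL_MKDEV(250, 0) and count = 4 covers minors 0 through 3 of major 250, but not minor 4.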

To remove a device, use cdev_del:

/**
 * cdev_del() - remove a cdev from the system
 * @p: the cdev structure to be removed
 *
 * cdev_del() removes @p from the system, possibly freeing the structure
 * itself.
 */
void cdev_del(struct cdev *p)
{
    cdev_unmap(p->dev, p->count);
    kobject_put(&p->kobj);
}

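The register_chrdev and unregister_chrdev functions, which were used to register and remove devices before 2.6, are the older interface and should no longer be used in new code.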

 

Linux kernel notes (43): do_sys_open

The following is the code of do_sys_open in kernel 3.12:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_flags op;
    int fd = build_open_flags(flags, mode, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

The core steps are:

a) get_unused_fd_flags obtains a file descriptor;
b) do_filp_open obtains a struct file;
c) fd_install associates the file descriptor with the struct file.
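The three steps can be modeled with a toy descriptor table in userspace. All names below (model_fdtable, model_open and so on) are invented for this sketch; it mirrors only the reserve, open, install-or-roll-back control flow, with a NULL file pointer standing in for a failed do_filp_open.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FDS 8

struct model_file { int dummy; };

struct model_fdtable {
    struct model_file *fd[MODEL_MAX_FDS];   /* installed files */
    unsigned char reserved[MODEL_MAX_FDS];  /* descriptors handed out */
};

/* Step a): reserve the lowest free descriptor, or fail with -EMFILE. */
int model_get_unused_fd(struct model_fdtable *t)
{
    for (int i = 0; i < MODEL_MAX_FDS; i++) {
        if (!t->reserved[i]) {
            t->reserved[i] = 1;
            return i;
        }
    }
    return -EMFILE;
}

/* Roll back a reservation when the open itself failed. */
void model_put_unused_fd(struct model_fdtable *t, int fd)
{
    t->reserved[fd] = 0;
}

/* Step c): make the descriptor refer to the file. */
void model_fd_install(struct model_fdtable *t, int fd, struct model_file *f)
{
    t->fd[fd] = f;
}

/* `f` plays the result of step b); NULL models a failed open. */
int model_open(struct model_fdtable *t, struct model_file *f)
{
    int fd = model_get_unused_fd(t);

    if (fd >= 0) {
        if (!f) {
            model_put_unused_fd(t, fd);
            fd = -ENOENT;
        } else {
            model_fd_install(t, fd, f);
        }
    }
    return fd;
}
```

Reserving the descriptor first and installing it last means a half-opened file is never visible through the table.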

struct file contains an f_op member:

struct file {
    ......
    const struct file_operations    *f_op;
    ......
    void            *private_data;
    ......
};

struct file_operations in turn contains an open member:

struct file_operations {
    ......
    int (*open) (struct inode *, struct file *);
    ......
};

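The open member takes two arguments: the inode of the actual file and the struct file.

Before the open system call invokes the driver's open method (the open member of struct file_operations), it sets private_data to NULL; the driver can then set private_data to whatever it needs (see the do_dentry_open function).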
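A common use of private_data is to remember which device an open file belongs to, in the style LDD3 uses for scull. The sketch below is a userspace model with invented sketch_ types; only the fields involved are kept, and sketch_container_of is a simplified stand-in for the kernel's container_of macro.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_cdev  { int unused; };
struct sketch_inode { struct sketch_cdev *i_cdev; };
struct sketch_file  { void *private_data; };

/* Device structure with the cdev embedded in it. */
struct sketch_dev {
    int id;
    struct sketch_cdev cdev;
};

/* Model of an open method: recover the enclosing device from the
 * embedded cdev and stash it in private_data for later calls. */
int sketch_open(struct sketch_inode *inode, struct sketch_file *filp)
{
    struct sketch_dev *dev =
        sketch_container_of(inode->i_cdev, struct sketch_dev, cdev);

    filp->private_data = dev;
    return 0;
}
```

Later read and write calls on the same open file can then fetch the device straight from filp->private_data without looking it up again.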