FreeBSD kernel notes (13): delaying execution

Four ways to delay execution (from FreeBSD Device Drivers):

Sleeping: Sleeping is done when you must wait for something to occur before you can proceed.
Event Handlers: Event handlers let you register one or more functions to be executed when an event occurs.
Callouts: Callouts let you perform asynchronous code execution. Callouts are used to execute your functions at a specific time.
Taskqueues: Taskqueues also let you perform asynchronous code execution. Taskqueues are used for deferred work.

FreeBSD kernel notes (11): condition variables

Besides mutexes, threads can also be synchronized with condition variables (the following is quoted from FreeBSD Device Drivers):

Condition variables synchronize the execution of two or more threads based upon the value of an object. In contrast, locks synchronize threads by controlling their access to objects.

Condition variables are used in conjunction with locks to “block” threads until a condition is true. It works like this: A thread first acquires the foo lock. Then it examines the condition. If the condition is false, it sleeps on the bar condition variable. While asleep on bar, threads relinquish foo. A thread that causes the condition to be true wakes up the threads sleeping on bar. Threads woken up in this manner reacquire foo before proceeding.

Using condition variables always involves a lock as well. These are the rules for that lock (quoted from the FreeBSD Kernel Developer’s Manual):

The lock argument is a pointer to either mutex(9), rwlock(9), or sx(9) lock. A mutex(9) argument must be initialized with MTX_DEF and not MTX_SPIN. A thread must hold lock before calling cv_wait(), cv_wait_sig(), cv_wait_unlock(), cv_timedwait(), or cv_timedwait_sig(). When a thread waits on a condition, lock is atomically released before the thread is blocked, then reacquired before the function call returns. In addition, the thread will fully drop the Giant mutex (even if recursed) while the thread is suspended and will reacquire the Giant mutex before the function returns. The cv_wait_unlock() function does not reacquire the lock before returning. Note that the Giant mutex may be specified as lock. However, Giant may not be used as lock for the cv_wait_unlock() function. All waiters must pass the same lock in conjunction with cvp.

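In short, a thread must already hold lock when it calls cv_wait() or one of its relatives to wait for the condition to become true. Inside cv_wait() the thread first releases lock and then blocks until the condition becomes true; by the time cv_wait() returns, the thread holds lock again. Note that cv_wait_unlock() does not reacquire lock before it returns.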
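The acquire, check, sleep, reacquire cycle described above can be tried outside the kernel. The following is only a userspace analogue built on POSIX threads, not the kernel cv(9) API; the names foo, bar, ready, payload, producer and wait_for_payload are made up for this sketch, and the comments note the kernel call each pthread call is assumed to correspond to.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins: in the kernel these would be a struct mtx
 * (initialized with MTX_DEF) and a struct cv. */
static pthread_mutex_t foo = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bar = PTHREAD_COND_INITIALIZER;
static bool ready = false;      /* the condition, protected by foo */
static int payload = 0;         /* data published together with the condition */

/* Makes the condition true and wakes the sleepers. */
static void *
producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&foo);           /* kernel: mtx_lock(&foo) */
    payload = 42;
    ready = true;
    pthread_cond_broadcast(&bar);       /* kernel: cv_broadcast(&bar) */
    pthread_mutex_unlock(&foo);         /* kernel: mtx_unlock(&foo) */
    return (NULL);
}

/* Waits until the producer has set the condition, then returns the payload. */
int
wait_for_payload(void)
{
    pthread_t td;
    int value;

    if (pthread_create(&td, NULL, producer, NULL) != 0)
        return (-1);

    pthread_mutex_lock(&foo);           /* must hold foo before waiting */
    while (!ready)                      /* re-check the condition after every wakeup */
        pthread_cond_wait(&bar, &foo);  /* kernel: cv_wait(&bar, &foo); drops foo while asleep */
    value = payload;                    /* foo is held again here */
    pthread_mutex_unlock(&foo);

    pthread_join(td, NULL);
    return (value);
}
```

Calling wait_for_payload() returns only after the producer has set ready under the lock. The while loop, rather than a single if, is the usual way to guard against wakeups that arrive when the condition is not (or no longer) true.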

FreeBSD kernel notes (10): mutex

The FreeBSD kernel provides two kinds of mutex, spin mutexes and sleep mutexes (the following is quoted from FreeBSD Device Drivers):

Spin Mutexes
Spin mutexes are simple spin locks. If a thread attempts to acquire a spin lock that is being held by another thread, it will “spin” and wait for the lock to be released. Spin, in this case, means to loop infinitely on the CPU. This spinning can result in deadlock if a thread that is holding a spin lock is interrupted or if it context switches, and all subsequent threads attempt to acquire that lock. Consequently, while holding a spin mutex all interrupts are blocked on the local processor and a context switch cannot be performed.

Spin mutexes should be held only for short periods of time and should be used only to protect objects related to nonpreemptive interrupts and low-level scheduling code (McKusick and Neville-Neil, 2005). Ordinarily, you’ll never use spin mutexes.

Sleep Mutexes
Sleep mutexes are the most commonly used lock. If a thread attempts to acquire a sleep mutex that is being held by another thread, it will context switch (that is, sleep) and wait for the mutex to be released. Because of this behavior, sleep mutexes are not susceptible to the deadlock described above.

Sleep mutexes support priority propagation. When a thread sleeps on a sleep mutex and its priority is higher than the sleep mutex’s current owner, the current owner will inherit the priority of this thread (Baldwin, 2002). This characteristic prevents a lower priority thread from blocking a higher priority thread.

NOTE Sleeping (for example, calling a *sleep function) while holding a mutex is never safe and must be avoided; otherwise, there are numerous assertions that will fail and the kernel will panic.

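While a spin mutex is held, interrupts on the local CPU are disabled and no context switch can take place; this is what prevents the deadlock. In most cases you should use a sleep mutex. Also note that a thread holding a mutex must not sleep, or the kernel will panic.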

In addition, there are shared/exclusive locks:

Shared/exclusive locks (sx locks) are locks that threads can hold while asleep. As the name implies, multiple threads can have a shared hold on an sx lock, but only one thread can have an exclusive hold on an sx lock. When a thread has an exclusive hold on an sx lock, other threads cannot have a shared hold on that lock.

sx locks do not support priority propagation and are inefficient compared to mutexes. The main reason for using sx locks is that threads can sleep while holding one.

reader/writer locks

Reader/writer locks (rw locks) are basically mutexes with sx lock semantics. Like sx locks, threads can hold rw locks as a reader, which is identical to a shared hold, or as a writer, which is identical to an exclusive hold. Like mutexes, rw locks support priority propagation and threads cannot hold them while sleeping (or the kernel will panic).

rw locks are used when you need to protect an object that is mostly going to be read from instead of written to.

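Shared/exclusive locks and reader/writer locks have similar semantics, with these differences: a thread holding a shared/exclusive lock may sleep, but the lock does not support priority propagation; a thread holding a reader/writer lock must not sleep, but the lock does support priority propagation.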

FreeBSD kernel notes (9): the definition of modeventtype_t

modeventtype_t is defined as follows:

typedef enum modeventtype {
    MOD_LOAD,
    MOD_UNLOAD,
    MOD_SHUTDOWN,
    MOD_QUIESCE
} modeventtype_t;
typedef int (*modeventhand_t)(module_t, int /* modeventtype_t */, void *);

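MOD_LOAD, MOD_UNLOAD and MOD_SHUTDOWN are easy to understand: they are the values passed to the module event handler when the module is loaded, when it is unloaded, and when the system shuts down. For MOD_QUIESCE, see FreeBSD Device Drivers: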

When one issues the kldunload(8) command, MOD_QUIESCE is run before MOD_UNLOAD. If MOD_QUIESCE returns an error, MOD_UNLOAD does not get executed. In other words, MOD_QUIESCE verifies that it is safe to unload your module.

NOTE The kldunload -f command ignores every error returned by MOD_QUIESCE. So you can always unload a module, but it may not be the best idea.

On the difference between MOD_QUIESCE and MOD_UNLOAD, see also the FreeBSD Kernel Developer’s Manual:

The difference between MOD_QUIESCE and MOD_UNLOAD is that the module should fail MOD_QUIESCE if it is currently in use, whereas MOD_UNLOAD should only fail if it is impossible to unload the module, for instance because there are memory references to the module which cannot be revoked.
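To make the MOD_QUIESCE / MOD_UNLOAD ordering concrete, here is a small userspace model. It is a sketch under stated assumptions, not kernel code: the enum and module_t are re-declared locally so the file compiles on its own, and demo_open_count, demo_modevent and demo_kldunload are hypothetical names. demo_kldunload only imitates the behaviour described in the quotes above (quiesce first, skip the unload on error unless forced).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins so this sketch builds outside the kernel. */
typedef struct module *module_t;
typedef enum modeventtype {
    MOD_LOAD,
    MOD_UNLOAD,
    MOD_SHUTDOWN,
    MOD_QUIESCE
} modeventtype_t;

/* Hypothetical usage counter: how many users currently have the device open. */
int demo_open_count;

/* Event handler with the same shape as modeventhand_t. */
int
demo_modevent(module_t mod, int what, void *arg)
{
    (void)mod;
    (void)arg;

    switch (what) {
    case MOD_LOAD:
        demo_open_count = 0;
        return (0);
    case MOD_QUIESCE:
        /* Fail if the module is currently in use. */
        return (demo_open_count > 0 ? EBUSY : 0);
    case MOD_UNLOAD:
        /* Would fail only if unloading were impossible; never in this model. */
        return (0);
    case MOD_SHUTDOWN:
        return (0);
    default:
        return (EOPNOTSUPP);
    }
}

/* Model of the unload sequence: MOD_QUIESCE first, then MOD_UNLOAD.
 * A non-zero force imitates "kldunload -f" by ignoring the quiesce error. */
int
demo_kldunload(int force)
{
    int error;

    error = demo_modevent(NULL, MOD_QUIESCE, NULL);
    if (error != 0 && !force)
        return (error);
    return (demo_modevent(NULL, MOD_UNLOAD, NULL));
}
```

With demo_open_count set to 1, demo_kldunload(0) returns EBUSY and never reaches MOD_UNLOAD, while demo_kldunload(1) proceeds anyway.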

FreeBSD kernel notes (8): doubly linked lists

The FreeBSD kernel provides support for doubly linked lists (defined in sys/sys/queue.h):

/*
 * List declarations.
 */
#define LIST_HEAD(name, type)                       \
struct name {                               \
    struct type *lh_first;  /* first element */         \
}

#define LIST_CLASS_HEAD(name, type)                 \
struct name {                               \
    class type *lh_first;   /* first element */         \
}

#define LIST_HEAD_INITIALIZER(head)                 \
    { NULL }

#define LIST_ENTRY(type)                        \
struct {                                \
    struct type *le_next;   /* next element */          \
    struct type **le_prev;  /* address of previous next element */  \
}

#define LIST_CLASS_ENTRY(type)                      \
struct {                                \
    class type *le_next;    /* next element */          \
    class type **le_prev;   /* address of previous next element */  \
}

#define LIST_EMPTY(head)    ((head)->lh_first == NULL)

#define LIST_FIRST(head)    ((head)->lh_first)

#define LIST_FOREACH(var, head, field)                  \
    for ((var) = LIST_FIRST((head));                \
        (var);                          \
        (var) = LIST_NEXT((var), field))

#define LIST_NEXT(elm, field)   ((elm)->field.le_next)

#define LIST_INSERT_HEAD(head, elm, field) do {             \
    QMD_LIST_CHECK_HEAD((head), field);             \
    if ((LIST_NEXT((elm), field) = LIST_FIRST((head))) != NULL) \
        LIST_FIRST((head))->field.le_prev = &LIST_NEXT((elm), field);\
    LIST_FIRST((head)) = (elm);                 \
    (elm)->field.le_prev = &LIST_FIRST((head));         \
} while (0)

......

Taking code from FreeBSD Device Drivers as an example:

(1) Definition of the race_softc structure:

struct race_softc {
    LIST_ENTRY(race_softc) list;
    int unit;
};

After macro expansion this becomes:

struct race_softc {
    struct {
        struct race_softc *le_next; /* next element */
        struct race_softc **le_prev;    /* address of previous next element */
    } list;
    int unit;
};

(2) Definition of the list head:

static LIST_HEAD(, race_softc) race_list = LIST_HEAD_INITIALIZER(&race_list);

After macro expansion this becomes:

static struct {struct race_softc *lh_first;} race_list = {NULL};

(3) Inserting an element:

sc = (struct race_softc *)malloc(sizeof(struct race_softc), M_RACE, M_WAITOK | M_ZERO);
sc->unit = unit;    
LIST_INSERT_HEAD(&race_list, sc, list);

Expanding LIST_INSERT_HEAD gives:

sc = (struct race_softc *)malloc(sizeof(struct race_softc), M_RACE, M_WAITOK | M_ZERO);
sc->unit = unit;
do {
    QMD_LIST_CHECK_HEAD((&race_list), list);
    if ((LIST_NEXT((sc), list) = LIST_FIRST((&race_list))) != NULL)
        LIST_FIRST((&race_list))->list.le_prev = &LIST_NEXT((sc), list);
    LIST_FIRST((&race_list)) = (sc);
    (sc)->list.le_prev = &LIST_FIRST((&race_list));
} while (0);

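Expanding the remaining macros (QMD_LIST_CHECK_HEAD expands to nothing unless the kernel is built with INVARIANTS) gives: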

do { 
    if (((((sc))->list.le_next) = (((&race_list))->lh_first)) != ((void *)0)) (((&race_list))->lh_first)->list.le_prev = &(((sc))->list.le_next); 
    (((&race_list))->lh_first) = (sc); 
    (sc)->list.le_prev = &(((&race_list))->lh_first); 
} while (0);

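In other words, the element is inserted at the head of the list. Because sc is now the first element, its list.le_prev holds the address of the head's lh_first pointer, and that pointer in turn points at sc (so *sc->list.le_prev == sc).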
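The same macros are also available to userspace programs through <sys/queue.h>, so the head-insertion behaviour can be checked without building a kernel module. The sketch below reuses the race_softc layout from the example; race_head, race_order and race_demo are names invented here, and calloc()/free() stand in for the kernel malloc(9) call.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

struct race_softc {
    LIST_ENTRY(race_softc) list;
    int unit;
};

/* A named head type, so it can be used as a local variable. */
LIST_HEAD(race_head, race_softc);

/* Filled by race_demo() with the unit numbers in traversal order. */
int race_order[3];

/* Inserts units 0, 1, 2 at the head, records the traversal order, and
 * returns the number of elements seen, or -1 if an allocation fails or
 * a back-pointer check fails. */
int
race_demo(void)
{
    struct race_head head;
    struct race_softc *sc;
    int i, n = 0, ok = 1;

    LIST_INIT(&head);
    for (i = 0; i < 3; i++) {
        sc = calloc(1, sizeof(*sc));
        if (sc == NULL) {
            ok = 0;
            break;
        }
        sc->unit = i;
        LIST_INSERT_HEAD(&head, sc, list);
        /* The new first element's le_prev is the address of lh_first. */
        if (sc->list.le_prev != &head.lh_first || *sc->list.le_prev != sc)
            ok = 0;
    }

    /* Head insertion means the most recently added unit comes first. */
    LIST_FOREACH(sc, &head, list) {
        if (n < 3)
            race_order[n] = sc->unit;
        n++;
    }

    /* Tear the list down again. */
    while (!LIST_EMPTY(&head)) {
        sc = LIST_FIRST(&head);
        LIST_REMOVE(sc, list);
        free(sc);
    }
    return (ok ? n : -1);
}
```

Storing the address of the previous next pointer in le_prev, instead of a pointer to the previous element, is what lets LIST_REMOVE unlink an element without knowing whether it is first in the list.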

FreeBSD kernel notes (7): unsupported operations in the cdevsw structure

The following is quoted from FreeBSD Device Drivers:

If a d_foo function is undefined the corresponding operation is unsupported. However, d_open and d_close are unique; when they’re undefined the kernel will automatically define them as follows:

int
nullop(void)
{
    return (0);
}

This ensures that every registered character device can be opened and closed.

In other words, in a registered cdevsw structure d_open and d_close are never NULL.

/*
 * Character device switch table
 */
struct cdevsw {
    int         d_version;
    u_int           d_flags;
    const char      *d_name;
    d_open_t        *d_open;
    d_fdopen_t      *d_fdopen;
    d_close_t       *d_close;
    d_read_t        *d_read;
    d_write_t       *d_write;
    d_ioctl_t       *d_ioctl;
    d_poll_t        *d_poll;
    d_mmap_t        *d_mmap;
    d_strategy_t        *d_strategy;
    dumper_t        *d_dump;
    d_kqfilter_t        *d_kqfilter;
    d_purge_t       *d_purge;
    d_mmap_single_t     *d_mmap_single;

    int32_t         d_spare0[3];
    void            *d_spare1[3];

    /* These fields should not be messed with by drivers */
    LIST_HEAD(, cdev)   d_devs;
    int         d_spare2;
    union {
        struct cdevsw       *gianttrick;
        SLIST_ENTRY(cdevsw) postfree_list;
    } __d_giant;
};
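The defaulting rule quoted above can be pictured with a toy model. This is not the kernel's implementation: toy_cdevsw, toy_op_t, toy_nullop, toy_fixup and toy_call are all hypothetical, every operation is reduced to a function taking no arguments, and reporting ENODEV for an unsupported operation is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy operation type; the real d_open_t, d_read_t, ... take arguments. */
typedef int toy_op_t(void);

/* Drastically reduced stand-in for struct cdevsw. */
struct toy_cdevsw {
    const char *d_name;
    toy_op_t *d_open;
    toy_op_t *d_close;
    toy_op_t *d_read;
};

/* Same body as the nullop() shown in the quote. */
int
toy_nullop(void)
{
    return (0);
}

/* What "the kernel will automatically define them" amounts to in this model:
 * only open and close receive a succeeding default. */
void
toy_fixup(struct toy_cdevsw *sw)
{
    if (sw->d_open == NULL)
        sw->d_open = toy_nullop;
    if (sw->d_close == NULL)
        sw->d_close = toy_nullop;
}

/* Dispatch helper: an operation left undefined is reported as unsupported. */
int
toy_call(toy_op_t *op)
{
    return (op != NULL ? op() : ENODEV);
}
```

After toy_fixup(), a switch table that defined nothing but d_name can still be opened and closed, while toy_call(sw.d_read) reports the read as unsupported.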

FreeBSD kernel notes (6): device communication and control

On FreeBSD, device communication and control go mainly through the sysctl and ioctl interfaces:

Generally, sysctls are employed to adjust parameters, and ioctls are used for everything else—that’s why ioctls are the catchall of I/O operations.

ioctl is fairly simple and is not covered here.

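To add sysctl support to a kernel module, first call sysctl_ctx_init to initialize a sysctl_ctx_list structure (when you are done with it, release it with sysctl_ctx_free); then use the SYSCTL_ADD_* family of macros to add the parameters the module supports. Note that the second argument of the SYSCTL_ADD_* macros specifies which parent node the new parameter belongs to. Two macros are available for specifying that position: SYSCTL_STATIC_CHILDREN and SYSCTL_CHILDREN (if SYSCTL_STATIC_CHILDREN is given no argument, a new top-level category is added to the system).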

In addition, SYSCTL_ADD_PROC registers a handler function. Its parameters are SYSCTL_HANDLER_ARGS:

#define SYSCTL_HANDLER_ARGS struct sysctl_oid *oidp, void *arg1,    \
    intptr_t arg2, struct sysctl_req *req

arg1 points to the data that the sysctl operates on, and arg2 carries an integer, typically the length of that data.

References:
FreeBSD Device Drivers

FreeBSD kernel notes (5): allocating memory

For allocating memory in FreeBSD kernel code, see these two manual pages: MALLOC(9) and CONTIGMALLOC(9). Keep the following points in mind:

(1) When calling the malloc family of allocation functions in interrupt context, use the M_NOWAIT flag;

(2) contigmalloc has a boundary parameter:

If the given value “boundary” is non-zero, then the set of physical pages cannot cross any physical address boundary that is a multiple of that value.

For example, if boundary is set to 1M, the physical pages actually allocated may lie within 0~1M or within 1M~2M, but they cannot span 1.9M~2.1M.
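The boundary rule is plain address arithmetic, so it can be written down as a predicate. The function below is an illustrative check only, not something contigmalloc exposes; fits_boundary and its parameters are hypothetical names, and the byte range is treated as [start, start + size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the physical range [start, start + size) does not cross
 * any address that is a multiple of boundary. A boundary of 0 means
 * "no constraint", matching the "non-zero" wording in the manual page.
 */
bool
fits_boundary(uint64_t start, uint64_t size, uint64_t boundary)
{
    if (boundary == 0 || size == 0)
        return (true);
    /* First and last byte must fall into the same boundary-sized window. */
    return (start / boundary == (start + size - 1) / boundary);
}
```

With boundary = 1M, the ranges 0~1M and 1M~2M each stay inside a single window and pass, while a range starting near 1.9M and ending near 2.1M straddles the 2M mark and is rejected.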

FreeBSD kernel notes (4): UIO

Structure and function definitions related to UIO:

 #include <sys/types.h>
 #include <sys/uio.h>

 struct uio {
     struct  iovec *uio_iov;         /* scatter/gather list */
     int     uio_iovcnt;         /* length of scatter/gather list */
     off_t   uio_offset;         /* offset in target object */
     ssize_t uio_resid;          /* remaining bytes to copy */
     enum    uio_seg uio_segflg;     /* address space */
     enum    uio_rw uio_rw;      /* operation */
     struct  thread *uio_td;         /* owner */
 };

 int
 uiomove(void *buf, int howmuch, struct uio *uiop);

 int
 uiomove_nofault(void *buf, int howmuch, struct uio *uiop);

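A note on the uio structure: if uio_iovcnt is not 1, the struct iovec entries that uio_iov points to can be viewed as one large buffer made by joining them together. uio_offset is then the offset into this buffer, and uio_resid indicates how many bytes still remain to be copied. During a read operation, uio_offset indicates how much of the buffer has already been filled, and uio_resid indicates how much space is left in the buffer. This program can serve as a reference.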

Both uiomove and uiomove_nofault ultimately call the uiomove_faultflag function:

static int
uiomove_faultflag(void *cp, int n, struct uio *uio, int nofault)
{
    struct thread *td;
    struct iovec *iov;
    size_t cnt;
    int error, newflags, save;

    td = curthread;
    error = 0;

    KASSERT(uio->uio_rw == UIO_READ || uio->uio_rw == UIO_WRITE,
    ("uiomove: mode"));
    KASSERT(uio->uio_segflg != UIO_USERSPACE || uio->uio_td == td,
    ("uiomove proc"));
    if (!nofault)
        WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
        "Calling uiomove()");

    /* XXX does it make a sense to set TDP_DEADLKTREAT for UIO_SYSSPACE ? */
    newflags = TDP_DEADLKTREAT;
    if (uio->uio_segflg == UIO_USERSPACE && nofault) {
        /*
         * Fail if a non-spurious page fault occurs.
         */
        newflags |= TDP_NOFAULTING | TDP_RESETSPUR;
    }
    save = curthread_pflags_set(newflags);

    while (n > 0 && uio->uio_resid) {
        iov = uio->uio_iov;
        cnt = iov->iov_len;
        if (cnt == 0) {
            uio->uio_iov++;
            uio->uio_iovcnt--;
            continue;
        }
        if (cnt > n)
            cnt = n;

        switch (uio->uio_segflg) {

        case UIO_USERSPACE:
            maybe_yield();
            if (uio->uio_rw == UIO_READ)
                error = copyout(cp, iov->iov_base, cnt);
            else
                error = copyin(iov->iov_base, cp, cnt);
            if (error)
                goto out;
            break;

        case UIO_SYSSPACE:
            if (uio->uio_rw == UIO_READ)
                bcopy(cp, iov->iov_base, cnt);
            else
                bcopy(iov->iov_base, cp, cnt);
            break;
        case UIO_NOCOPY:
            break;
        }
        iov->iov_base = (char *)iov->iov_base + cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= cnt;
        uio->uio_offset += cnt;
        cp = (char *)cp + cnt;
        n -= cnt;
    }
out:
    curthread_pflags_restore(save);
    return (error);
}

As you can see, this function modifies the contents of the uio structure that is passed in.
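The bookkeeping in that loop is easier to follow in a reduced form. The sketch below is a userspace model of only the UIO_READ / UIO_SYSSPACE path, with memcpy() in place of bcopy(); toy_uio and toy_uiomove_read are hypothetical names, struct iovec comes from the userspace <sys/uio.h>, and the fault handling and thread flags are left out. It assumes uio_resid equals the sum of the iov_len fields, as a well-formed uio would.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Cut-down stand-in for struct uio: only the fields the copy loop updates. */
struct toy_uio {
    struct iovec *uio_iov;      /* scatter/gather list */
    int uio_iovcnt;             /* length of scatter/gather list */
    off_t uio_offset;           /* offset in target object */
    ssize_t uio_resid;          /* remaining bytes to copy */
};

/* Copies up to n bytes from cp into the iovecs, advancing the uio exactly
 * the way the kernel loop above does. Always returns 0 in this model. */
int
toy_uiomove_read(const void *cp, size_t n, struct toy_uio *uio)
{
    struct iovec *iov;
    size_t cnt;

    while (n > 0 && uio->uio_resid > 0) {
        iov = uio->uio_iov;
        cnt = iov->iov_len;
        if (cnt == 0) {
            /* Current iovec is full: step to the next one. */
            uio->uio_iov++;
            uio->uio_iovcnt--;
            continue;
        }
        if (cnt > n)
            cnt = n;

        memcpy(iov->iov_base, cp, cnt);

        /* Everything below mutates the caller's uio and iovec. */
        iov->iov_base = (char *)iov->iov_base + cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= (ssize_t)cnt;
        uio->uio_offset += (off_t)cnt;
        cp = (const char *)cp + cnt;
        n -= cnt;
    }
    return (0);
}
```

Copying the five bytes of "hello" into two iovecs of 3 and 7 bytes leaves uio_offset at 5 and uio_resid at 5, with uio_iov advanced to the second entry. A driver's read routine can therefore call the real uiomove repeatedly and let the uio track its own progress.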

For the uiomove_nofault() function, see the following description:

The function uiomove_nofault() requires that the buffer and I/O vectors be accessible without incurring a page fault. The source and destination addresses must be physically mapped for read and write access, respectively, and neither the source nor destination addresses may be pageable. Thus, the function uiomove_nofault() can be called from contexts where acquiring virtual memory system locks or sleeping are prohibited.

References:
UIO