Linux kernel notes (43): do_sys_open

Below is the code of do_sys_open from kernel 3.12:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    struct open_flags op;
    int fd = build_open_flags(flags, mode, &op);
    struct filename *tmp;

    if (fd)
        return fd;

    tmp = getname(filename);
    if (IS_ERR(tmp))
        return PTR_ERR(tmp);

    fd = get_unused_fd_flags(flags);
    if (fd >= 0) {
        struct file *f = do_filp_open(dfd, tmp, &op);
        if (IS_ERR(f)) {
            put_unused_fd(fd);
            fd = PTR_ERR(f);
        } else {
            fsnotify_open(f);
            fd_install(fd, f);
        }
    }
    putname(tmp);
    return fd;
}

The core steps are:

a) get_unused_fd_flags obtains a file descriptor;
b) do_filp_open obtains a struct file;
c) fd_install associates the file descriptor with the struct file.
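The three steps can be mimicked in user space with a toy descriptor table. This is only a sketch of the control flow, not kernel code: every toy_* name below is invented for the illustration, and a NULL file pointer stands in for a failing do_filp_open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

/* Toy stand-in for struct file; only a name is kept here. */
struct toy_file {
    const char *name;
};

/* Toy per-process descriptor table: reservation flags plus file pointers. */
struct toy_fdtable {
    bool reserved[TOY_MAX_FDS];
    struct toy_file *files[TOY_MAX_FDS];
};

/* Step a): reserve the lowest free descriptor, or return -1 if none is left. */
int toy_get_unused_fd(struct toy_fdtable *t)
{
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (!t->reserved[fd]) {
            t->reserved[fd] = true;
            return fd;
        }
    }
    return -1;
}

/* Error path: give back a descriptor that was reserved but never installed. */
void toy_put_unused_fd(struct toy_fdtable *t, int fd)
{
    assert(fd >= 0 && fd < TOY_MAX_FDS);
    t->reserved[fd] = false;
}

/* Step c): bind the reserved descriptor to the file object. */
void toy_fd_install(struct toy_fdtable *t, int fd, struct toy_file *f)
{
    assert(fd >= 0 && fd < TOY_MAX_FDS);
    t->files[fd] = f;
}

/* Same shape as do_sys_open; f == NULL models a failed step b). */
int toy_open(struct toy_fdtable *t, struct toy_file *f)
{
    int fd = toy_get_unused_fd(t);

    if (fd >= 0) {
        if (f == NULL) {
            toy_put_unused_fd(t, fd);
            fd = -1;
        } else {
            toy_fd_install(t, fd, f);
        }
    }
    return fd;
}
```

Note how a failed open releases the reserved descriptor, so the next successful toy_open can reuse the same number.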

struct file contains an f_op member:

struct file {
    ......
    const struct file_operations    *f_op;
    ......
    void            *private_data;
    ......
}

struct file_operations in turn contains an open member:

struct file_operations {
    ......
    int (*open) (struct inode *, struct file *);
    ......
}

The open member takes two parameters: the inode of the actual file and the struct file.

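Before the open system call invokes the driver's open method (the open member of struct file_operations), it sets private_data to NULL. Driver authors can then set private_data to whatever they need (see the do_dentry_open function).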

 

Linux kernel notes (42): container_of

container_of is defined in <linux/kernel.h>:

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:    the pointer to the member.
 * @type:   the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({          \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
    (type *)( (char *)__mptr - offsetof(type,member) );})

Given the address of a structure member, it returns the address of the enclosing structure. For example:

struct st_A
{
        int member_b;
        int member_c;
};

struct st_A a;

container_of(&(a.member_c), struct st_A, member_c) yields the address of the variable a, that is, the value of &a.
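The example can be tried in user space with an adapted copy of the macro. The sketch below assumes gcc or clang, since it relies on the GNU extensions __typeof__ and statement expressions; st_A_from_member_c is a helper name made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

/* User-space adaptation of the kernel macro (GNU C extensions required). */
#define container_of(ptr, type, member) ({                      \
    const __typeof__(((type *)0)->member) *__mptr = (ptr);      \
    (type *)((char *)__mptr - offsetof(type, member)); })

struct st_A {
    int member_b;
    int member_c;
};

/* Recover the enclosing struct st_A from a pointer to its member_c field. */
struct st_A *st_A_from_member_c(int *member_c_ptr)
{
    assert(member_c_ptr != NULL);
    return container_of(member_c_ptr, struct st_A, member_c);
}
```

The macro subtracts offsetof(struct st_A, member_c) from the member's address, so the result is only meaningful when the pointer really points into a struct st_A.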

 

Linux kernel notes (41): the "i_rdev" member of the "inode" structure

The inode structure has an i_rdev member (defined in <linux/fs.h>):

struct inode {
    ......
    dev_t           i_rdev;
    ......
}

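If the inode represents a device, i_rdev holds the device number. For better portability, the major and minor numbers of an inode should be obtained with the imajor and iminor functions: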

static inline unsigned iminor(const struct inode *inode)
{
    return MINOR(inode->i_rdev);
}

static inline unsigned imajor(const struct inode *inode)
{
    return MAJOR(inode->i_rdev);
}
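To see what these helpers compute, here is a user-space sketch. The MINORBITS, MAJOR, MINOR and MKDEV definitions are written as I recall them from <linux/kdev_t.h> (a 20-bit minor below the major) and should be treated as an assumption; toy_inode, toy_imajor and toy_iminor are hypothetical stand-ins for the kernel types and functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror <linux/kdev_t.h>: minor in the low 20 bits, major above. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* Toy inode carrying only the device number. */
struct toy_inode {
    unsigned int i_rdev;
};

/* Same body as imajor, applied to the toy inode. */
unsigned toy_imajor(const struct toy_inode *inode)
{
    assert(inode != NULL);
    return MAJOR(inode->i_rdev);
}

/* Same body as iminor, applied to the toy inode. */
unsigned toy_iminor(const struct toy_inode *inode)
{
    assert(inode != NULL);
    return MINOR(inode->i_rdev);
}
```

Going through the helpers keeps driver code independent of how the two numbers are packed into i_rdev.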

 

Linux kernel notes (40): comparing the "file" and "inode" structures

LDD describes the file structure as follows:

struct file, defined in <linux/fs.h>, is the second most important data structure used in device drivers. Note that a file has nothing to do with the FILE pointers of user-space programs. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.

The file structure represents an open file. (It is not specific to device drivers; every open file in the system has an associated struct file in kernel space.) It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure.

In the kernel sources, a pointer to struct file is usually called either file or filp (“file pointer”). We’ll consistently call the pointer filp to prevent ambiguities with the structure itself. Thus, file refers to the structure and filp to a pointer to the structure.

Its description of the inode structure:

The inode structure is used by the kernel internally to represent files. Therefore, it is different from the file structure that represents an open file descriptor. There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

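To summarize: in the kernel, every file is represented by one inode structure, whereas a file structure is tied to an open file descriptor. If a file is opened several times and therefore has several file descriptors, there are correspondingly several file structures associated with it, but there is only ever one inode.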
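The difference can be observed from user space. The sketch below opens the same path twice: both descriptors report the same device and inode number through fstat, yet each keeps its own file offset, which is state that lives in the per-open file structure. The function name same_inode_separate_offsets is made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open path twice. Returns 1 if both descriptors refer to the same inode
 * while keeping independent offsets, 0 if not, and -1 on any error.
 */
int same_inode_separate_offsets(const char *path)
{
    struct stat st1, st2;
    off_t pos1, pos2;
    int result = -1;
    int fd1, fd2;

    assert(path != NULL);
    fd1 = open(path, O_RDONLY);
    fd2 = open(path, O_RDONLY);
    if (fd1 < 0 || fd2 < 0)
        goto out;
    if (fstat(fd1, &st1) != 0 || fstat(fd2, &st2) != 0)
        goto out;

    /* Move only the first descriptor's offset, then query the second. */
    pos1 = lseek(fd1, 1, SEEK_SET);
    pos2 = lseek(fd2, 0, SEEK_CUR);
    if (pos1 < 0 || pos2 < 0)
        goto out;

    result = (st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino &&
              pos1 == 1 && pos2 == 0);
out:
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return result;
}
```

Calling it on any readable regular file should return 1 on Linux, since each open call creates its own open file description.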

 

Linux kernel notes (39): "THIS_MODULE"

THIS_MODULE is a macro defined in <linux/module.h>:

#ifdef MODULE
#define MODULE_GENERIC_TABLE(gtype,name)            \
extern const struct gtype##_id __mod_##gtype##_table        \
  __attribute__ ((unused, alias(__stringify(name))))

extern struct module __this_module;
#define THIS_MODULE (&__this_module)
#else  /* !MODULE */
#define MODULE_GENERIC_TABLE(gtype,name)
#define THIS_MODULE ((struct module *)0)
#endif

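THIS_MODULE is simply the address of the __this_module variable. __this_module points to the start of the module's address space, which is exactly where the module's struct module variable is defined.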

The first member of the file_operations structure, defined in <linux/fs.h>, is a pointer to struct module:

struct file_operations {
    struct module *owner;
    ......
}

LDD explains it as follows:

struct module *owner

The first file_operations field is not an operation at all; it is a pointer to the module that “owns” the structure. This field is used to prevent the module from being unloaded while its operations are in use. Almost all the time, it is simply initialized to THIS_MODULE, a macro defined in <linux/module.h>.

owner points to the module that the file_operations structure belongs to. In most cases it is enough to assign THIS_MODULE to it.

References:
Where is the memory allocation of “_thismodule” variable?
深入淺出 insmod, #1

 

Linux kernel notes (38): the "__user" annotation

In kernel code, some function declarations have parameters that carry the __user annotation:

ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);

LDD gives this explanation:

This annotation is a form of documentation, noting that a pointer is a user-space address that cannot be directly dereferenced. For normal compilation, __user has no effect, but it can be used by external checking software to find misuse of user-space addresses.

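__user marks the parameter as a user-space pointer that must not be dereferenced directly in kernel code. It also makes it easier for other tools to check the code.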

 

Linux kernel notes (37): "System.map" and "/proc/kallsyms"

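System.map contains the symbol table of the kernel image. /proc/kallsyms contains the symbol tables of the kernel image and of all dynamically loaded modules. If a function has been inlined or optimized away by the compiler, it may not show up in /proc/kallsyms.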

In addition, for a non-root user all addresses shown in /proc/kallsyms are 0:

$ cat /proc/kallsyms | more
0000000000000000 A irq_stack_union
0000000000000000 A __per_cpu_start
0000000000000000 A cpu_debug_store
0000000000000000 A cpu_tss_rw
......

$ sudo cat /proc/kallsyms | more
[sudo] password for xiaonan:
0000000000000000 A irq_stack_union
0000000000000000 A __per_cpu_start
0000000000004000 A cpu_debug_store
0000000000005000 A cpu_tss_rw
0000000000008000 A gdt_page
0000000000009000 A exception_stacks
......
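Each line in the listings above has three whitespace-separated fields: address, symbol type and name. A minimal sketch of reading them from C follows; struct ksym, parse_kallsyms_line and find_symbol are names invented for this illustration, not kernel or libc APIs, and any trailing module column is simply ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed symbol line. */
struct ksym {
    unsigned long long addr;
    char type;
    char name[256];
};

/* Parse "address type name"; returns 0 on success, -1 on malformed input. */
int parse_kallsyms_line(const char *line, struct ksym *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%llx %c %255s", &out->addr, &out->type, out->name) != 3)
        return -1;
    return 0;
}

/*
 * Scan a kallsyms-style file (for example "/proc/kallsyms") for a symbol.
 * Returns 0 and fills *out when found; returns -1 otherwise, in which case
 * the contents of *out are unspecified.
 */
int find_symbol(const char *path, const char *name, struct ksym *out)
{
    char line[512];
    FILE *fp = fopen(path, "r");
    int rc = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_kallsyms_line(line, out) == 0 &&
            strcmp(out->name, name) == 0) {
            rc = 0;
            break;
        }
    }
    fclose(fp);
    return rc;
}
```

Run without root privileges, a lookup through find_symbol would be expected to succeed but report an address of 0, matching the first listing.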

It appears that kallsyms_lookup_name requires CONFIG_KALLSYMS_ALL to be set to Y (see CONFIG_KALLSYMS_ALL).

Module.symvers contains all symbols exported by the kernel (see What is the purpose of “Module.symvers” in Linux?). This link discusses /proc/kallsyms, which can be compared with Module.symvers.

References:

Reading kallsyms in user-mode
Does kallsyms have all the symbol of kernel functions?
System.map file and /proc/kallsyms
system.map

 

Linux kernel notes (36): introduction to "procfs"

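procfs is a RAM-based virtual file system mounted at /proc. /proc acts like a mirror of the system: a great deal of information about the running system can be obtained through it. The files in /proc are generated dynamically by the kernel when they are accessed.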

dl980-5:/proc # ls
1     1166  1234  1452  170   198   243   287   308   3334  372   416   46    503  550   5989  640   756   911  960        execdomains
10    1167  1236  1453  1702  199   244   2872  3087  334   373   4163  460   504  551   599   641   7591  912  961        fb
100   1168  1237  146   1703  2     245   2873  309   3342  374   417   461   505  552   6     642   76    913  962        filesystems
1003  1169  124   147   1704  20    246   2876  3093  3344  375   418   462   506  553   60    643   764   914  963        fs
1007  117   1240  1478  1706  200   247   2877  31    335   376   419   4627  507  554   600   645   77    915  964        interrupts
1008  1170  1241  1479  171   201   248   288   310   3352  377   42    463   508  555   601   646   78    916  965        iomem
1009  1171  1242  148   172   2018  249   2881  311   3358  378   420   464   509  556   602   647   7818  917  966        ioports
101   1172  1243  149   1721  202   25    2884  312   336   379   421   4646  51   557   603   648   8     918  967        ipmi
1010  1173  1244  15    1729  203   250   289   313   3360  38    422   465   510  558   604   649   80    919  968        irq
1011  1174  1245  150   173   204   251   2890  3137  3361  380   423   466   511  56    605   65    81    92   969        kallsyms
1012  1175  1246  151   1736  205   252   29    3139  337   381   424   467   512  560   6052  650   82    920  97         kcore
1013  1176  1247  152   1737  206   253   290   314   338   382   425   468   513  561   6058  651   83    921  970        key-users
1015  1177  1248  1520  174   2060  254   2908  315   3389  3827  426   469   514  562   606   652   84    922  971        kmsg
1016  1178  1249  1522  175   207   255   291   3155  339   383   427   47    515  563   607   653   8461  923  972        kpagecount
1017  1179  125   1529  1751  208   256   2911  3156  34    384   428   470   516  564   608   654   85    924  973        kpageflags
1018  118   1250  153   1755  209   257   2912  3158  340   385   429   471   517  565   6086  655   86    925  974        latency_stats
1019  1180  1251  154   176   21    258   2915  316   341   386   43    472   518  566   609   656   87    926  975        loadavg
102   1181  1252  155   177   210   259   292   3161  342   387   430   473   519  567   61    657   874   927  976        locks
1025  1182  1253  156   178   211   26    2921  3163  3426  388   431   475   52   568   610   658   876   928  977        meminfo
103   1183  1254  157   1787  212   260   2927  3166  343   389   432   476   520  57    611   659   877   929  978        misc
1042  1186  1255  158   1788  213   261   293   317   344   390   433   477   521  570   612   66    879   93   98         modules
1044  1187  1259  16    179   215   262   2932  318   345   391   434   4777  522  5708  613   660   88    930  980        mounts
1045  1188  126   160   1790  216   263   2939  3183  346   392   435   478   523  571   614   661   880   931  981        mtrr
....

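Each number is a process ID and is also a directory: information about a process is available under /proc/pid. Other entries, such as kmsg and meminfo, provide other information about the system.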
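A program can make the same distinction as the listing above by checking whether a directory entry name is purely numeric. The sketch below uses opendir and readdir; is_pid_entry and count_pid_entries are helper names made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the entry name consists only of digits, i.e. names a PID directory. */
bool is_pid_entry(const char *name)
{
    assert(name != NULL);
    if (*name == '\0')
        return false;
    for (const char *p = name; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return false;
    }
    return true;
}

/* Count PID directories under proc_root (normally "/proc"); -1 if it cannot be opened. */
int count_pid_entries(const char *proc_root)
{
    DIR *dir = opendir(proc_root);
    struct dirent *ent;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (is_pid_entry(ent->d_name))
            count++;
    }
    closedir(dir);
    return count;
}
```

Because /proc is generated on access, two calls to count_pid_entries("/proc") may return different values as processes start and exit.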

References:
EXPLORING LINUX PROCFS VIA SHELL SCRIPTS

 

Linux kernel notes (35): the "linux/version.h" file

<linux/version.h> is generated by the Makefile in the top-level directory:

......
define filechk_version.h
    (echo \#define LINUX_VERSION_CODE $(shell                         \
    expr $(VERSION) \* 65536 + 0$(PATCHLEVEL) \* 256 + 0$(SUBLEVEL)); \
    echo '#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))';)
endef

$(version_h): $(srctree)/Makefile FORCE
    $(call filechk,version.h)
    $(Q)rm -f $(old_version_h)
......

It contains the definitions of two macros, LINUX_VERSION_CODE and KERNEL_VERSION. Take the following version as an example:

......
#define LINUX_VERSION_CODE 199680
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
......

199680 corresponds to version 3.12.0: 3 * 65536 + 12 * 256 + 0 = 199680.
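The encoding can be checked, and reversed, in a few lines of C. The KERNEL_VERSION macro is copied from the generated header above; struct kver and decode_version_code are hypothetical helpers added for this sketch, and the decoding assumes each of the patch level and sublevel fits in 8 bits.

```c
#include <assert.h>

#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Compile-time check of the value quoted in the text. */
static_assert(KERNEL_VERSION(3, 12, 0) == 199680, "3.12.0 encodes to 199680");

/* The three components packed into a LINUX_VERSION_CODE value. */
struct kver {
    unsigned version;
    unsigned patchlevel;
    unsigned sublevel;
};

/* Split a version code back into version, patch level and sublevel. */
struct kver decode_version_code(unsigned code)
{
    struct kver v = { code >> 16, (code >> 8) & 0xff, code & 0xff };
    return v;
}
```

In driver code the two macros are typically combined in a preprocessor test such as #if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 12, 0) to select version-specific code paths.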

References:
How the “<linux/version.h>” file is generated?

 

Linux kernel notes (34): module parameters

module_param and module_param_named are defined in <linux/moduleparam.h>:

/**
 * module_param - typesafe helper for a module/cmdline parameter
 * @value: the variable to alter, and exposed parameter name.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * @value becomes the module parameter, or (prefixed by KBUILD_MODNAME and a
 * ".") the kernel commandline parameter.  Note that - is changed to _, so
 * the user can use "foo-bar=1" even for variable "foo_bar".
 *
 * @perm is 0 if the the variable is not to appear in sysfs, or 0444
 * for world-readable, 0644 for root-writable, etc.  Note that if it
 * is writable, you may need to use kparam_block_sysfs_write() around
 * accesses (esp. charp, which can be kfreed when it changes).
 *
 * The @type is simply pasted to refer to a param_ops_##type and a
 * param_check_##type: for convenience many standard types are provided but
 * you can create your own by defining those variables.
 *
 * Standard types are:
 *  byte, short, ushort, int, uint, long, ulong
 *  charp: a character pointer
 *  bool: a bool, values 0/1, y/n, Y/N.
 *  invbool: the above, only sense-reversed (N = true).
 */
#define module_param(name, type, perm)              \
    module_param_named(name, name, type, perm)

/**
 * module_param_named - typesafe helper for a renamed module/cmdline parameter
 * @name: a valid C identifier which is the parameter name.
 * @value: the actual lvalue to alter.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * Usually it's a good idea to have variable names and user-exposed names the
 * same, but that's harder if the variable must be non-static or is inside a
 * structure.  This allows exposure under a different name.
 */
#define module_param_named(name, value, type, perm)            \
    param_check_##type(name, &(value));                \
    module_param_cb(name, &param_ops_##type, &value, perm);        \
    __MODULE_PARM_TYPE(name, #type)

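module_param defines a module parameter: type specifies its type (int, bool, and so on), and perm specifies the access permissions, built from the following values (<linux/stat.h>):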

#define S_IRWXU 00700
#define S_IRUSR 00400
#define S_IWUSR 00200
#define S_IXUSR 00100

#define S_IRWXG 00070
#define S_IRGRP 00040
#define S_IWGRP 00020
#define S_IXGRP 00010

#define S_IRWXO 00007
#define S_IROTH 00004
#define S_IWOTH 00002
#define S_IXOTH 00001

#define S_IRWXUGO   (S_IRWXU|S_IRWXG|S_IRWXO)
#define S_IALLUGO   (S_ISUID|S_ISGID|S_ISVTX|S_IRWXUGO)
#define S_IRUGO     (S_IRUSR|S_IRGRP|S_IROTH)
#define S_IWUGO     (S_IWUSR|S_IWGRP|S_IWOTH)
#define S_IXUGO     (S_IXUSR|S_IXGRP|S_IXOTH)

module_param_named, by contrast, exposes the variable under a more readable name.

Take the ktap source code as an example:

int kp_max_loop_count = 100000;
module_param_named(max_loop_count, kp_max_loop_count, int, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(max_loop_count, "max loop execution count");

Load the ktapvm module and read the value of kp_max_loop_count:

[root@Linux ~]# cat /sys/module/ktapvm/parameters/max_loop_count
100000
[root@Linux ~]# ls -lt /sys/module/ktapvm/parameters/max_loop_count
-rw-r--r--. 1 root root 4096 Oct 22 22:51 /sys/module/ktapvm/parameters/max_loop_count

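As shown, the kp_max_loop_count variable appears in the /sys/module/ktapvm/parameters directory under the name max_loop_count with the value 100000, and only root has write permission. Writing to this file changes the value of kp_max_loop_count: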

[root@Linux ~]# echo 200000 > /sys/module/ktapvm/parameters/max_loop_count
[root@Linux ~]# cat /sys/module/ktapvm/parameters/max_loop_count
200000
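The -rw-r--r-- mode in the ls output follows directly from the perm argument S_IRUGO | S_IWUSR. The sketch below rebuilds S_IRUGO in user space from the POSIX bits in <sys/stat.h>, since that shorthand is kernel-only, and renders the result the way ls does; perm_to_string is a helper name made up for this illustration.

```c
#include <assert.h>
#include <sys/stat.h>   /* S_IRUSR, S_IWUSR, S_IRGRP, S_IROTH */

/* Kernel-only shorthand from <linux/stat.h>, rebuilt from the POSIX bits. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* 0444 | 0200: readable by everyone, writable by the owner only. */
static_assert((S_IRUGO | S_IWUSR) == 0644, "perm used by ktap is 0644");

/* Render the nine permission bits as ls -l prints them, e.g. "rw-r--r--". */
void perm_to_string(unsigned perm, char out[10])
{
    static const char flags[] = "rwxrwxrwx";

    for (int i = 0; i < 9; i++)
        out[i] = (perm & (0400u >> i)) ? flags[i] : '-';
    out[9] = '\0';
}
```

Since the file is owned by root, the single owner write bit is what restricts changes to the root user.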

MODULE_PARM_DESC defines a description for the parameter, which can be viewed with the modinfo command:

[root@Linux ~]# modinfo ktapvm.ko
.....
parm:           max_loop_count:max loop execution count (int)

References:
Everything You Wanted to Know About Module Parameters