
Linux kernel notes (22): a simple module Makefile

Take the module Makefile from the "Compiling and Loading" section of LDD3 as an example:

# If KERNELRELEASE is defined, we've been invoked from the
# kernel build system and can use its language.
ifneq ($(KERNELRELEASE),)
    obj-m := hello.o 

# Otherwise we were called directly from the command
# line; invoke the kernel build system.
else

    KERNELDIR ?= /lib/modules/$(shell uname -r)/build
    PWD  := $(shell pwd)

default:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules

endif

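When make is run from the command line (with the module source directory as the current working directory), KERNELRELEASE is not defined at that point, so the else branch is taken and the KERNELDIR and PWD variables are assigned.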

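Next, the command "$(MAKE) -C $(KERNELDIR) M=$(PWD) modules" runs. The -C option makes make change into KERNELDIR and read the Makefile found there. The M= option makes the kernel build system switch back to the module directory when it builds the modules target. On this second pass the KERNELRELEASE variable is already defined, so the first branch of the Makefile is read and the build system learns from obj-m what it has to build.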

References:
Understanding a make file for making .ko files.

Linux kernel notes (21): per-CPU variables

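As the name suggests, when you declare a per-CPU variable, every CPU in the system gets its own copy of it. The benefit is that accessing it needs almost no locking, because each CPU works on its own copy. In addition, a CPU can keep the variable in its own cache, so access is very fast. A per-CPU variable is defined like this: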

DEFINE_PER_CPU(type, name);

If the per-CPU variable is an array, it is defined like this:

DEFINE_PER_CPU(type[length], array);

A per-CPU variable can be exported for use by other modules:

EXPORT_PER_CPU_SYMBOL(per_cpu_var);
EXPORT_PER_CPU_SYMBOL_GPL(per_cpu_var);

To use a per-CPU variable in another module, it has to be declared there:

DECLARE_PER_CPU(type, name);

A per-CPU variable can be accessed with the two macros get_cpu_var(var) and put_cpu_var(var):

/* <linux/percpu.h>*/

/*
 * Must be an lvalue. Since @var must be a simple identifier,
 * we force a syntax error here if it isn't.
 */
#define get_cpu_var(var) (*({               \
    preempt_disable();              \
    &__get_cpu_var(var); }))

/*
 * The weird & is necessary because sparse considers (void)(var) to be
 * a direct dereference of percpu variable (var).
 */
#define put_cpu_var(var) do {               \
    (void)&(var);                   \
    preempt_enable();               \
} while (0)

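Because kernel threads can be preempted (and might then resume on a different CPU), get_cpu_var has to call preempt_disable, and every get_cpu_var must be paired with a put_cpu_var.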

Accessing another CPU's copy of a per-CPU variable:

per_cpu(variable, int cpu_id);
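To make the "one copy per CPU" idea concrete, here is a rough userspace model, not kernel code. It assumes a fixed CPU count and represents the per-CPU variable as a plain array indexed by CPU id; NR_CPUS_DEMO, per_cpu_demo, count_hit_on and total_hits are names made up for this sketch and are not kernel APIs.

```c
#include <assert.h>

#define NR_CPUS_DEMO 4            /* hypothetical CPU count for this model */

/* Model of DEFINE_PER_CPU(long, hits): one slot per CPU. */
static long hits[NR_CPUS_DEMO];

/* Model of per_cpu(var, cpu): select one specific CPU's copy. */
#define per_cpu_demo(var, cpu) ((var)[(cpu)])

/* What code running on CPU `cpu` would do between get_cpu_var and put_cpu_var. */
static void count_hit_on(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_DEMO);
    per_cpu_demo(hits, cpu)++;    /* touches only this CPU's slot */
}

/* A reader walks over every CPU's copy to get the overall total. */
static long total_hits(void)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
        sum += per_cpu_demo(hits, cpu);
    return sum;
}
```

Each simulated CPU only ever writes its own slot, which is why the real mechanism gets away with disabling preemption instead of taking a lock.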

References:
Driver porting: per-CPU variables;
Per-CPU Variables.

Linux kernel notes (20): device major and minor numbers

Run the ls -lt command in the /dev directory:

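In the output (a screenshot in the original post), the part highlighted with a red box is the device number: the major number comes first, then the minor number. The major number identifies the driver the device uses, and the minor number identifies the specific device. In that screenshot, every tty device uses driver 4, and the minor number tells the individual tty devices apart. The driver a device uses, that is, its major number, can also be looked up in /proc/devices: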

linux-a21w:/dev # cat /proc/devices
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  7 vcs
......

dev_t and the major and minor numbers are defined as follows (kernel version 4.0):

/* <linux/types.h>: */
typedef __u32 __kernel_dev_t;
typedef __kernel_dev_t      dev_t;

/* <linux/kdev_t.h> */
#define MINORBITS   20
#define MINORMASK   ((1U << MINORBITS) - 1)

#define MAJOR(dev)  ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev)  ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma,mi)    (((ma) << MINORBITS) | (mi))

dev_t is 32 bits wide: the high 12 bits hold the major number and the low 20 bits hold the minor number.
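The 12/20 split is easy to check in userspace. The sketch below copies the macros quoted above and adds a small helper; make_dev and the dev_t_demo typedef are names invented for this example (the typedef avoids clashing with the C library's own dev_t, which has a different layout).

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copies of the macros quoted from <linux/kdev_t.h>. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

typedef uint32_t dev_t_demo;   /* stand-in for the kernel's 32-bit dev_t */

/* Pack a major/minor pair, rejecting values that do not fit in 12/20 bits. */
static dev_t_demo make_dev(unsigned int major, unsigned int minor)
{
    assert(major < (1U << 12));
    assert(minor <= MINORMASK);
    return (dev_t_demo)MKDEV(major, minor);
}
```

For instance, make_dev(4, 64) gives 0x400040: the major number 4 sits above bit 20 and the minor number 64 (0x40) occupies the low bits.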

There are two ways to obtain device numbers:

(1) Specify the device number in advance:

int register_chrdev_region(dev_t from, unsigned count, const char *name)

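from contains both the major and the minor number; the minor is usually 0. count is the number of consecutive device numbers requested, and name is the name of the device. register_chrdev_region is implemented as follows: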

/**
 * register_chrdev_region() - register a range of device numbers
 * @from: the first in the desired range of device numbers; must include
 *        the major number.
 * @count: the number of consecutive device numbers required
 * @name: the name of the device or driver.
 *
 * Return value is zero on success, a negative error code on failure.
 */
int register_chrdev_region(dev_t from, unsigned count, const char *name)
{
    struct char_device_struct *cd;
    dev_t to = from + count;
    dev_t n, next;

    for (n = from; n < to; n = next) {
        next = MKDEV(MAJOR(n)+1, 0);
        if (next > to)
            next = to;
        cd = __register_chrdev_region(MAJOR(n), MINOR(n),
                   next - n, name);
        if (IS_ERR(cd))
            goto fail;
    }
    return 0;
fail:
    to = n;
    for (n = from; n < to; n = next) {
        next = MKDEV(MAJOR(n)+1, 0);
        kfree(__unregister_chrdev_region(MAJOR(n), MINOR(n), next - n));
    }
    return PTR_ERR(cd);
}

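As you can see, register_chrdev_region registers count consecutive device numbers (of type dev_t, each containing a major and a minor) starting at from. When the range crosses a major boundary, the loop registers it piece by piece, one piece per major number.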
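The way the for loop cuts a range at major boundaries can be replayed in userspace. This is only a sketch of the loop's arithmetic: struct chunk and split_region are hypothetical names, nothing is actually registered, and it assumes the range stays below major 4095 so that MAJOR(n)+1 does not overflow.

```c
#include <assert.h>
#include <stdint.h>

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* One per-major piece of a requested range (hypothetical helper type). */
struct chunk {
    unsigned int major;
    unsigned int baseminor;
    unsigned int count;
};

/*
 * Split [from, from + count) at major boundaries, mirroring the first for
 * loop of register_chrdev_region(). Writes at most max chunks to out and
 * returns how many were written.
 */
static int split_region(uint32_t from, unsigned int count,
                        struct chunk *out, int max)
{
    uint32_t to = from + count;
    uint32_t n, next;
    int used = 0;

    assert(count > 0 && max > 0);
    for (n = from; n < to && used < max; n = next) {
        next = MKDEV(MAJOR(n) + 1, 0);   /* first number of the next major */
        if (next > to)
            next = to;
        out[used].major = MAJOR(n);
        out[used].baseminor = MINOR(n);
        out[used].count = next - n;      /* what the kernel passes down */
        used++;
    }
    return used;
}
```

Asking for 4 numbers starting at major 4, minor 1048574 yields two chunks: 2 minors at the end of major 4 and 2 minors at the start of major 5. Each chunk corresponds to one __register_chrdev_region call in the kernel code.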

For example (/drivers/tty/tty_io.c):

register_chrdev_region(MKDEV(TTYAUX_MAJOR, 1), 1, "/dev/console")

(2) Allocate the device number dynamically (recommended):

int alloc_chrdev_region(dev_t *dev, unsigned int firstminor, unsigned int count, char *name);

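dev is an output parameter that receives the dynamically assigned device number; firstminor is the first minor number to use; count and name mean the same as in register_chrdev_region. alloc_chrdev_region is implemented as follows: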

/**
 * alloc_chrdev_region() - register a range of char device numbers
 * @dev: output parameter for first assigned number
 * @baseminor: first of the requested range of minor numbers
 * @count: the number of minor numbers required
 * @name: the name of the associated device or driver
 *
 * Allocates a range of char device numbers.  The major number will be
 * chosen dynamically, and returned (along with the first minor number)
 * in @dev.  Returns zero or a negative error code.
 */
int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
            const char *name)
{
    struct char_device_struct *cd;
    cd = __register_chrdev_region(0, baseminor, count, name);
    if (IS_ERR(cd))
        return PTR_ERR(cd);
    *dev = MKDEV(cd->major, cd->baseminor);
    return 0;
}

For example (/drivers/watchdog/watchdog_dev.c):

alloc_chrdev_region(&watchdog_devt, 0, MAX_DOGS, "watchdog");

Releasing device numbers:

void unregister_chrdev_region(dev_t first, unsigned int count);

Linux kernel notes (19): the "soft lockup - CPU# stuck ..." bug

The kernel log for a "soft lockup - CPU# stuck ..." bug looks something like this:

[   28.124107] BUG: soft lockup - CPU#0 stuck for 23s! [init:1]
[   28.124720] Modules linked in:
[   28.125247] Supported: Yes
[   28.125763] Modules linked in:
[   28.126277] Supported: Yes
[   28.126774] 
[   28.127264] Pid: 1, comm: init Not tainted 3.0.101-63-xen #1  
[   28.127765] EIP: 0061:[<c00ded0a>] EFLAGS: 00000202 CPU: 0
[   28.128002] EIP is at handle_mm_fault+0x18a/0x2b0
[   28.128002] EAX: 0002bfc1 EBX: 00000000 ECX: 00000000 EDX: 00000000
[   28.128002] ESI: 2bfc1067 EDI: 00000000 EBP: ebfc6200 ESP: ebc35d48
[   28.128002]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: e021
[   28.128002] Process init (pid: 1, ti=ebc08000 task=ebc32ce0 task.ti=ebc34000)
[   28.128002] Stack:
[   28.128002]  ebfc1778 ebfc6200 00000029 0002bfc1 00000000 080efc90 ebfc2570 ebfc9e40
[   28.128002]  ec7bd000 ebc35dfc 00000003 ebfc2570 080efc90 c0350ad4 00000029 00000100
[   28.128002]  00000008 00000003 ebfc9e78 ebc32ce0 ebfc9e40 00000000 00000029 00000003
[   28.128002] Call Trace:
[   28.128002]  [<c0350ad4>] do_page_fault+0x1f4/0x4b0
[   28.128002]  [<c034df54>] error_code+0x30/0x38
[   28.128002]  [<c01da35f>] clear_user+0x2f/0x50   
[   28.128002]  [<c01480d4>] load_elf_binary+0xae4/0xc30    
[   28.128002]  [<c01094d1>] search_binary_handler+0x1e1/0x2e0  
[   28.128002]  [<c01097b4>] do_execve_common+0x1e4/0x280   
[   28.128002]  [<c000a9c2>] sys_execve+0x52/0x80   
[   28.128002]  [<c035443e>] ptregs_execve+0x12/0x18    
[   28.128002]  [<c034dc3d>] syscall_call+0x7/0x7   
[   28.128002]  [<c000933f>] kernel_execve+0x1f/0x30    
[   28.128002]  [<c000424e>] init_post+0xde/0x130   
[   28.128002]  [<c057d638>] kernel_init+0x160/0x18f    
[   28.128002]  [<c0354526>] kernel_thread_helper+0x6/0x10  
[   28.128002] Code: 89 f2 89 f8 81 e2 00 f0 ff ff 25 ff 0f 00 00 89 54 24 0c 89 44 24 10 8b 44 24 0c 8b 54 24 10 0f ac d0 0c 89 44 24 0c 8b 44 24 0c <c1> ea 0c 89 54 24 10 c1 e0 05 03 44 24 20 e8 b3 90 ff ff 8b 54 
......

The mechanism behind this bug is as follows:

The Linux kernel runs one watchdog thread per CPU. You can see them with ps -ef | grep watchdog:

[nan@longriver ~]$ ps -ef | grep watchdog
root         6     2  0 Apr20 ?        00:00:16 [watchdog/0]
root        10     2  0 Apr20 ?        00:00:11 [watchdog/1]
root        14     2  0 Apr20 ?        00:00:10 [watchdog/2]
root        18     2  0 Apr20 ?        00:00:09 [watchdog/3]
nan   6726  4608  0 17:28 pts/28   00:00:00 grep watchdog

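Each watchdog thread collects timing information about the CPU it monitors (the X in [watchdog/X] is the ID of that CPU) and stores it in the kernel. A dedicated interrupt handler in the kernel invokes the softlockup counter and compares the current time with the timestamp stored earlier. If the difference exceeds a threshold, the kernel concludes that the watchdog thread has not had enough execution time to update the stored information, in other words that the CPU has been held by some other task the whole time. The kernel treats this as abnormal and prints the call trace, registers and other information shown above.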
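The comparison described above boils down to a timestamp check. The following userspace sketch shows just that logic; struct cpu_watch, watchdog_touch and check_soft_lockup are made-up names, time is passed in as plain seconds, and the 20-second threshold used in the usage note is an assumed value for illustration rather than something taken from this post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping: when the watchdog thread last ran. */
struct cpu_watch {
    uint64_t last_touch_sec;
};

/* Called by the watchdog thread every time it gets to run on its CPU. */
static void watchdog_touch(struct cpu_watch *w, uint64_t now_sec)
{
    w->last_touch_sec = now_sec;
}

/*
 * Called from the periodic interrupt. Returns the number of seconds the CPU
 * has been stuck, or 0 if the watchdog thread ran recently enough.
 */
static uint64_t check_soft_lockup(const struct cpu_watch *w,
                                  uint64_t now_sec, uint64_t thresh_sec)
{
    assert(now_sec >= w->last_touch_sec);
    uint64_t stalled = now_sec - w->last_touch_sec;
    return stalled > thresh_sec ? stalled : 0;
}
```

If the watchdog thread last ran at t = 5 s and the interrupt fires at t = 28 s with a 20 s threshold, the check reports 23 seconds, the kind of number that ends up in a "stuck for 23s!" message.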

References:
Linux Kernel BUG: soft lockup CPU#1 stuck

LuaJIT notes (2): the FFI library (1)

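The FFI library that LuaJIT provides (ffi.*) lets Lua code call external C functions and use C data structures. By default, however, the FFI library is not loaded or initialized, so it is recommended to load it at the top of every Lua file that uses it: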

local ffi = require("ffi")

Consider the following example:

local ffi = require("ffi")

ffi.cdef[[
int printf(const char *format, ...);
]]

ffi.C.printf("Hello world!\n")

The output is:

Hello world!

(1)
ffi.cdef is defined as follows:

ffi.cdef(def)

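def must be a Lua string; the "[[...]]" long-string form is recommended. ffi.cdef holds definitions of C types and declarations of external symbols (variables and functions). These are declarations only: the symbols are not yet bound to actual memory addresses, and the binding itself happens through a C library namespace. Note that the C declarations are not passed through a C preprocessor. Apart from #pragma pack, preprocessor directives are not accepted, so directives such as #define have to be replaced by hand, for example with an enum.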

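(2)
ffi.C is the default C library namespace. It offers roughly what a C compiler gives you by default, without having to name any extra link libraries explicitly. On POSIX systems, ffi.C binds to symbols in the default or global namespace. This includes libc, libm and so on, as well as the API that LuaJIT itself provides.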

What is a "devel" package?

When installing packages on RHEL, you often see two packages with the same name, one with and one without a devel suffix. For example:

elfutils-libelf.x86_64 : Library to read and write ELF files
elfutils-libelf-devel.x86_64 : Development support for libelf

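The difference is this: the package without the devel suffix usually contains only the shared libraries and configuration files needed to run programs. The package with the devel suffix contains everything required to develop programs against that package, such as header files. Sometimes the devel package also contains static libraries.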

References:
What are *-devel packages?

First impressions of SUSE

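I spent the last couple of days playing with SUSE (SLES 12 Beta) and found that it differs from the Red Hat family in a few ways. I am writing them down for future reference: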

(1) YaST2

YaST2 (YaST stands for Yet another Setup Tool) is the configuration tool on SUSE systems.

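I find it very handy: configuring the network, FTP, a proxy and so on is convenient. Also, a single click on an icon launches the program, which took some getting used to after years of double-clicking.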

(2) zypper

Software is installed from the command line with the zypper command (in stands for install):

zypper in git-core

Uninstalling (rm):

zypper rm git-core

Also note that the git package is named git-core.

(3) Command window

The Alt + F2 shortcut brings up the command window.

Typing gnome-terminal there opens a terminal.

(4) supportconfig

SUSE provides the supportconfig tool, which collects information about the system and is a great help when debugging.

Linux kernel notes (18): the current variable

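Kernel code has a current variable. It is a pointer to the process on whose behalf the current piece of kernel code is running. For example, when a process makes the open system call, the kernel can use current to get at that process. current is defined in <asm/current.h>; taking the x86 platform as an example: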

#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */

As you can see, current is really a pointer to a struct task_struct, and struct task_struct holds the information about the process.

Shark code analysis notes (4): bpf.lua (to be continued)

bpf.lua lives in the bpf directory. Its code is as follows (copyright notice omitted):

local ffi = require("ffi")
local C = ffi.C

local bpf = {}

bpf.cargs_extra = ""

---------------------------------------------------------------

bpf.cargs_add = function(s)
  bpf.cargs_extra = s
end

---------------------------------------------------------------

ffi.cdef[[
int load_bpf_file(const char *path);
]]

local function load_bpf_file(path)
  local ret = C.load_bpf_file(path)
  if tonumber(ret) ~= 0 then
    os.exit(-1)
  end
end

local function run_llc_bpf(bc_file, bpfobj_file)
  os.execute("llc-bpf -march=bpf -filetype=obj -o " ..
              bpfobj_file .. " " .. bc_file)
end

---------------------------------------------------------------
local function split(s, delimiter)
  local result = {};
  for match in (s..delimiter):gmatch("(.-)"..delimiter) do
    table.insert(result, match);
  end
  return result;
end

local basic_type_tbl = {
  "char", "u8", "short", "u16", "int", "u32", "long", "long long", "u64"
}

local function get_real_type(typestr)
  for _, v in pairs(basic_type_tbl) do
    if v == typestr then
      return v
    end
  end

  return "cdata"
end

local function process_type(typestr0)
  local typestr, size, size

  typestr, size = string.match(typestr0, "(.*) %[(.*)%]")
  if typestr ~= nil and size ~= nil then
    sizestr = "sizeof("..typestr .. ") * "..size
    typestr = typestr
    if typestr == "char" then
      typestr = "string"
    end
  else
    typestr, size = string.match(typestr0, "(.*)%[(.*)%]")
    if typestr ~= nil and size ~= nil then
      sizestr = "sizeof("..typestr .. ") * "..size
      typestr = typestr
      if typestr == "char" then
        typestr = "string"
      end
    else
      sizestr = "sizeof("..typestr0..")"
      typestr = get_real_type(typestr0)
    end
  end

  return typestr, sizestr
end

local function gen_map_decl(map_type, key_type, val_type, entries, name)
  local key_typestr, key_sizestr = process_type(key_type)
  local val_typestr, val_sizestr = process_type(val_type)

  if not entries then
    entries = 1024;
  end

  local str = string.format("struct bpf_map_def SEC(\"maps\") %s = {\n" ..
                            "\t.name = \"%s\",                 \n" ..
                            "\t.key_type = \"%s\",             \n" ..
                            "\t.val_type = \"%s\",             \n" ..
                            "\t.type = BPF_MAP_TYPE_%s,        \n" ..
                            "\t.key_size = %s,                 \n" ..
                            "\t.value_size = %s,               \n" ..
                            "\t.max_entries = %d,              \n};\n",
        name, name, key_typestr, val_typestr, map_type, key_sizestr,
        val_sizestr, entries)
  io.write(str)
end

local function translate_cdef(s)
  local n = split(s, '\n')
  for _, line in pairs(n) do
    local key_type, val_type, entries, name
    local match = false

    key_type, val_type, name = string.match(line,
             "bpf_map_hash<([^,]*), ([^,]*)> (.*);")
    if key_type and val_type and name then
      gen_map_decl("HASH", key_type, val_type, nil, name)
      match = true
    end

    key_type, val_type, entries, name = string.match(line,
             "bpf_map_hash<([^,]*), ([^,]*), (%d+)> (.*);")
    if key_type and val_type and entries and name then
      gen_map_decl("HASH", key_type, val_type, entries, name)
      match = true
    end

    key_type, val_type, name = string.match(line,
             "bpf_map_array<([^,]*), ([^,]*)> (.*);")
    if key_type and val_type and name then
      gen_map_decl("ARRAY", key_type, val_type, nil, name)
      match = true
    end

    key_type, val_type, entries, name = string.match(line,
             "bpf_map_array<([^,]*), ([^,]*), (%d+)> (.*);")
    if key_type and val_type and entries and name then
      gen_map_decl("ARRAY", key_type, val_type, entries, name)
      match = true
    end

    if not match then
      io.write(line, "\n")
    end
  end

  io.write("\nchar _license[] SEC(\"license\") = \"GPL\";\n")
  io.write("\n#include <linux/version.h>\n")
  io.write("\nu32 _version SEC(\"version\") = LINUX_VERSION_CODE;\n")
end

local function compile(srcfile, objfile)
   local f = io.popen("uname -r", "r")
   local release = f:read()
   local linuxinc = string.format("/lib/modules/%s/build/", release)
   local bpfinc = "bpf/libbpf"

   local clang_cmd = string.format("clang -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.8/include -I%s -I%s/arch/x86/include -I%s/arch/x86/include/generated/uapi -I%s/arch/x86/include/generated  -I%s/include -I%s/arch/x86/include/uapi -I%s/arch/x86/include/generated/uapi -I%s/include/uapi -I%s/include/generated/uapi -include %s/include/linux/kconfig.h %s -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign -O2 -emit-llvm -x c -c %s -o %s", bpfinc, linuxinc, linuxinc, linuxinc, linuxinc, linuxinc, linuxinc, linuxinc, linuxinc, linuxinc, bpf.cargs_extra, srcfile, objfile)

  os.execute(clang_cmd)
end

-- builtin cdef function
bpf.cdef = function(s)
  local file

  local srctmp = os.tmpname()
  file = io.open(srctmp, "w")
  io.output(file)
  translate_cdef(s)

  io.close(file)
  io.output(io.stdout)

  -- dump source
  local f = io.open(srctmp, "rb")
  local content = f:read("*all")
  print(content)
  f:close()

  local bctmp = os.tmpname()
  compile(srctmp, bctmp)

  local bpftmp = os.tmpname()
  run_llc_bpf(bctmp, bpftmp)

  -- pass bpf table for bpf object loading, need clear it after load done.
  _G["bpf"] = bpf
  load_bpf_file(bpftmp)

  os.remove(srctmp)
  os.remove(bctmp)
  os.remove(bpftmp)
end

---------------------------------------------------------------

bpf.print_map = function(map)
  local map = map
  for k in pairs(map) do
    print(k, map[k])
  end
end

---------------------------------------------------------------

bpf.copy_map = function(map)
  local new = {}
  local map = map
  for k in pairs(map) do
    new[k] = map[k]
  end
  return new
end

---------------------------------------------------------------

local function fill_line(n, max)
  for i = 1, max do
    if i < n then
      io.write("*")
    else
      io.write(" ")
    end
  end
end

-- print histogram for bpf.var.map
-- Only support number key now.
bpf.print_hist_map = function(t)
  local histo = {}
  local stdSum, max = 0, 0

  for k in pairs(t) do
    if t[k] ~= 0 then
      local k1 = math.pow(2, math.floor(math.log(k, 2)))
      if histo[k1] == nil then histo[k1] = 0 end
      histo[k1] = histo[k1] + t[k]

      stdSum = stdSum + t[k]
      if k1 > max then max = k1 end
    end
  end

  print("        value  ------------- Distribution -------------  count")
  local k = 0
  while k <= max do
    local v = histo[k]
    if v == nil then v = 0 end

    io.write(string.format("%13d |", k))
    fill_line(v * 40 / stdSum, 40)
    io.write(string.format("| %d", v))

    print()

    if k == 0 then k = 1 else k = k * 2 end
  end
end

return bpf

(1)

local ffi = require("ffi")
local C = ffi.C

local bpf = {}

bpf.cargs_extra = ""

The code above imports the ffi module and defines the bpf table. The bpf table has a cargs_extra key whose value is an empty string.

BIOS and UEFI

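BIOS (Basic Input/Output System) is a piece of code stored in a chip on the motherboard. When the computer starts, the BIOS initializes the hardware components and then hands control over to the operating system. The BIOS uses the MBR (Master Boot Record, which holds the partition table) to decide which operating system to boot. The BIOS also gives the user an interface for configuring the hardware.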

UEFI (Unified Extensible Firmware Interface) is the successor to BIOS. The two compare as follows:

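a) 16-bit vs 32/64-bit
The BIOS can only run in 16-bit processor mode and can address only 1 MB of memory. UEFI can run in 32-bit or 64-bit processor mode and can address a much larger memory space.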

b) Booting
The MBR limits each disk to 4 partitions, and the size of a bootable disk is limited to 2.2 TB. UEFI uses the GUID Partition Table and can address much larger disks.

c) Extensions
UEFI supports older extensions (for example, ACPI).
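The 1 MB and 2.2 TB figures in the comparison both fall out of simple address-width arithmetic. The sketch below assumes 20-bit addresses for the BIOS memory limit and, for the MBR, 32-bit sector numbers with 512-byte sectors (the usual explanation for the 2.2 TB limit, stated here as an assumption); addressable_bytes is a helper name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes reachable with address_bits-wide addresses when each
 * address names one unit of unit_size bytes (a byte, a disk sector, ...).
 */
static uint64_t addressable_bytes(unsigned int address_bits, uint64_t unit_size)
{
    assert(address_bits < 64);
    return (UINT64_C(1) << address_bits) * unit_size;
}
```

addressable_bytes(20, 1) is 1,048,576 bytes, i.e. 1 MB, and addressable_bytes(32, 512) is 2,199,023,255,552 bytes, which is where the roughly 2.2 TB limit comes from.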

References:
HTG Explains: Learn How UEFI Will Replace Your PC’s BIOS
