Use clang-tindy on VoidLinux

To use clang-tidy on VoidLinux, you should install clang-tools-extra first:

# xbps-install -Suv clang-tools-extra

Take following code as an example:

# cat main.cpp
struct Base
{
        virtual void do_thing() = 0;
        int data;
};

struct Derived : Base
{
        virtual void do_thing(int i) {}
};

int main()
{
}

Use clang-tidy to check it:

# clang-tidy -checks=* *.cpp --
387 warnings generated.
/root/main.cpp:1:8: warning: constructor does not initialize these fields: data [cppcoreguidelines-pro-type-member-init]
struct Base
       ^
/root/main.cpp:4:6: warning: member variable 'data' has public visibility [misc-non-private-member-variables-in-classes]
        int data;
            ^
Suppressed 384 warnings (384 in non-user code).
Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well.

 

dmesg intro

Because dmesg saved me from an installing CUDA driver issue this month, I decide to write this short post to introduce it. According to Wikipedia:

dmesg (display message or driver message) is a command on most Unix-like operating systems that prints the message buffer of the kernel.[1] The output of this command typically contains the messages produced by the device drivers.

On Unix/Linux systems, kernel, kernel modules (e.g., device drivers), and even user-space processes may output logs in kernel buffer. So dmesg is a neat tool for debugging purpose (please refer Linux Performance Analysis in 60,000 Milliseconds). Compared to BSD families, dmesg on Linux provides more options. So this post will use dmesg on Linux as an example.

Firstly, how to know the underlying kernel buffer size? It depends on CONFIG_LOG_BUF_SHIFT. On my Arch Linux:

$ zgrep CONFIG_LOG_BUF_SHIFT /proc/config.gz
CONFIG_LOG_BUF_SHIFT=17

It means the buffer’s size is 2 ^ 17 = 128 KiB (please refer How to find out a linux kernel ring buffer size? and default kernel .config file).

Secondly, dmesg on Linux supports many handy options. E.g. -H is used to display in human-readable format:

$ dmesg -H
[Apr15 09:26] Linux version 5.0.5-arch1-1-ARCH (builduser@heftig-17705) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREE>
[  +0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=5bee8879-dca0-4b3a-8abd-1bdf96e17a1f rw quiet
[  +0.000000] KERNEL supported cpus:
[  +0.000000]   Intel GenuineIntel
[  +0.000000]   AMD AuthenticAMD
[  +0.000000]   Hygon HygonGenuine
[  +0.000000]   Centaur CentaurHauls
......

and -k tells dmesg only shows user-space messages:

$ dmesg -u -H
[Apr15 09:26] systemd[1]: systemd 241.67-1-arch running in system mode. (+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -S>
[  +0.021954] systemd[1]: Detected architecture x86-64.
[  +0.021551] systemd[1]: Set hostname to <tesla-p100>.
[  +0.098510] systemd[1]: Listening on Journal Socket (/dev/log).
[  +0.000203] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[  +0.000032] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[  +0.000047] systemd[1]: Listening on udev Control Socket.
[  +0.001751] systemd[1]: Listening on Process Core Dump Socket.
[  +0.000010] systemd[1]: Reached target System Time Synchronized.
[  +0.000050] systemd[1]: Listening on Journal Socket.
[  +0.273822] systemd-journald[1148]: Received request to flush runtime journal from PID 1
(END)

To know about other options, please refer manual.

Last but not least, dmesg‘s source code is a good place to learn how to develop command line applications on Linux.

Put dmesg into your toolkit, maybe it will save you one day too.

 

A performance issue caused by NUMA

The essence of NUMA is accessing local memory fast while remote slow, and I was bit by it today.

The original code is like this:

/* Every thread create one partition of a big vector and process it*/
#pragma omp parallel for
for (...)
{
    ......
    vector<> local_partition = create_big_vector_partition();
    /* processing the vector partition*/
    ......
}

I tried to create a big vector out of OpenMP block, then every thread just grabs a partition and processes it:

vector<> big_vector = create_big_vector();

#pragma omp parallel for
for (...)
{
    ......
    vector<>& local_partition = get_partition(big_vector);
    /* processing the vector partition*/
    ......
}

I measure the execution time of OpenMP block:

#pragma omp parallel for
for (...)
{
    ......
}

Though in original code, every thread needs to create partition of vector itself, it is still faster than the modified code.

After some experiments and analysis, numastat helps me to pinpoint the problem:

$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106

In original solution, every thread creates vector partition in local memory of CPU. However, in second case, the threads often need to access memory in remote node, and this overhead is bigger than creating vector partition locally.

What I learn from Practical Binary Analysis as a non-reverse-engineering engineer?

I spent the past two weeks in reading Practical Binary Analysis. Since I am not a professional reverse engineer, I glossed over the “Part III: Advanced Binary Analysis”, so I only read half the book. Even though, I still get a big gain:

(1) Know better of ELF file. On *nix Operating system, ELF file is everywhere: executable file, object file, shared library and coredump. “Chapter 2: The ELF format” gives me a clear explanation of the composition of ELF. E.g., I know why some functions have “@plt” suffix when using gdb to debug it.

(2) Master many tricks about GNU Binutils. GNU Binutils is a toolbox which provides versatile command line programs to analyze ELF files. Literally it relies heavily on BFD library. I also get some sense about how to use BFD library to tweak ELF files.

(3) “Appendix A: A crash course on X86 assembly” is a good tutorial for refreshing X86 architecture and assembly language.

(4) Others: E.g., I understand how to use LD_PRELOAD environmental variable and dynamic linking functions to manipulate shared library.

All in all, if you are working on *nix (although this book is based on Linux, I think most knowledge are also applicable to other *nix), you should try to read this book. I promise it is not a waste of time and you can always learn something, believe me!

The display format of Bash’s built-in time command

Check the default output format of bash‘s built-in time command:

# time

real    0m0.000s
user    0m0.000s
sys     0m0.000s

You can use -p option to output in POSIX format:

# time -p
real 0.00
user 0.00
sys 0.00

From bash source code, we know the the definitions of these two formats:

#define POSIX_TIMEFORMAT "real %2R\nuser %2U\nsys %2S"
#define BASH_TIMEFORMAT  "\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS"

To decipher the meanings of them, we need to refer bash manual:

TIMEFORMAT

The value of this parameter is used as a format string specifying how the timing information for pipelines prefixed with the time reserved word should be displayed. The ‘%’ character introduces an escape sequence that is expanded to a time value or other information. The escape sequences and their meanings are as follows; the braces denote optional portions.

%%
A literal ‘%’.

%[p][l]R
The elapsed time in seconds.

%[p][l]U
The number of CPU seconds spent in user mode.

%[p][l]S
The number of CPU seconds spent in system mode.

%P
The CPU percentage, computed as (%U + %S) / %R.

The optional p is a digit specifying the precision, the number of fractional digits after a decimal point. A value of 0 causes no decimal point or fraction to be output. At most three places after the decimal point may be specified; values of p greater than 3 are changed to 3. If p is not specified, the value 3 is used.

The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
……

Take POSIX_TIMEFORMAT as an example: %2R denotes using second as time unit, and the precision is two digits after a decimal point; %2U and %2S are similar.

Now you can comprehend the output of time, correct? Try using BASH_TIMEFORMAT as a practice.