nanxiao | Nan Xiao's Blog

dmesg intro

Because dmesg saved me from an installing CUDA driver issue this month, I decide to write this short post to introduce it. According to Wikipedia:

dmesg (display message or driver message) is a command on most Unix-like operating systems that prints the message buffer of the kernel.[1] The output of this command typically contains the messages produced by the device drivers.

On Unix/Linux systems, kernel, kernel modules (e.g., device drivers), and even user-space processes may output logs in kernel buffer. So dmesg is a neat tool for debugging purpose (please refer Linux Performance Analysis in 60,000 Milliseconds). Compared to BSD families, dmesg on Linux provides more options. So this post will use dmesg on Linux as an example.

Firstly, how to know the underlying kernel buffer size? It depends on CONFIG_LOG_BUF_SHIFT. On my Arch Linux:

$ zgrep CONFIG_LOG_BUF_SHIFT /proc/config.gz
CONFIG_LOG_BUF_SHIFT=17

It means the buffer’s size is 2 ^ 17 = 128 KiB (please refer How to find out a linux kernel ring buffer size? and default kernel .config file).

Secondly, dmesg on Linux supports many handy options. E.g. -H is used to display in human-readable format:

$ dmesg -H
[Apr15 09:26] Linux version 5.0.5-arch1-1-ARCH (builduser@heftig-17705) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREE>
[  +0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=5bee8879-dca0-4b3a-8abd-1bdf96e17a1f rw quiet
[  +0.000000] KERNEL supported cpus:
[  +0.000000]   Intel GenuineIntel
[  +0.000000]   AMD AuthenticAMD
[  +0.000000]   Hygon HygonGenuine
[  +0.000000]   Centaur CentaurHauls
......

and -k tells dmesg only shows user-space messages:

$ dmesg -u -H
[Apr15 09:26] systemd[1]: systemd 241.67-1-arch running in system mode. (+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -S>
[  +0.021954] systemd[1]: Detected architecture x86-64.
[  +0.021551] systemd[1]: Set hostname to <tesla-p100>.
[  +0.098510] systemd[1]: Listening on Journal Socket (/dev/log).
[  +0.000203] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[  +0.000032] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[  +0.000047] systemd[1]: Listening on udev Control Socket.
[  +0.001751] systemd[1]: Listening on Process Core Dump Socket.
[  +0.000010] systemd[1]: Reached target System Time Synchronized.
[  +0.000050] systemd[1]: Listening on Journal Socket.
[  +0.273822] systemd-journald[1148]: Received request to flush runtime journal from PID 1
(END)

To know about other options, please refer manual.

Last but not least, dmesg‘s source code is a good place to learn how to develop command line applications on Linux.

Put dmesg into your toolkit, maybe it will save you one day too.

Different RVO behaviors between gcc and clang

Regarding RVO (Return Value Optimization), I think this video gives a real good explanation. Let’s cut to the chase. Check following code:

# cat rvo.cpp
#include <iostream>

class Foo
{
public:
        Foo(){std::cout << "Foo default constructor!\n";}
        ~Foo(){std::cout << "Foo destructor!\n";}
        Foo(const Foo&){std::cout << "Foo copy constructor!\n";}
        Foo& operator=(const Foo&){std::cout << "Foo assignment!\n"; return *this;}
        Foo(const Foo&&){std::cout << "Foo move constructor!\n";}
        Foo& operator=(const Foo&&){std::cout << "Foo move assignment!\n"; return *this;}
};

Foo func(bool flag)
{
        Foo temp;
        if (flag)
        {
                std::cout << "if\n";
        }
        else
        {
                std::cout << "else\n";
        }
        return temp;
}

int main()
{
        auto f = func(true);
        return 0;
}

On my Arch Linux platform, gcc version is 8.2.1 and clang version is 8.0.0. I tried to use -std=c++11, -std=c++14, -std=c++17 and -std=c++2a for both compilers, all generated same output:

Foo default constructor!
if
Foo destructor!

So both compilers are clever enough to realize there is no need to create “Foo temp” variable (please refer Small examples show copy elision in C++). Modify the func a little:

Foo func(bool flag)
{
        if (flag)
        {
                Foo temp;
                std::cout << "if\n";
                return temp;
        }
        else
        {
                Foo temp;
                std::cout << "else\n";
                return temp;
        }
}

This time, For clang compiler (All four compiling options: -std=c++11, -std=c++14, -std=c++17 and -std=c++2a), the program generated output as same as above:

Foo default constructor!
if
Foo destructor!

But for gcc, (All four compiling options: -std=c++11, -std=c++14, -std=c++17 and -std=c++2a), the program generated different output:

Foo default constructor!
if
Foo move constructor!
Foo destructor!
Foo destructor!

So it means gcc generated both two variables: “Foo temp” and “auto f“. It not only means clang does better optimization than gcc in this scenario, but means if you do something in move constructor, and expect it should be called. You program logic will depend on the compiler: it works under gcc but not for clang. Beware of this trap, otherwise it may bite you one day, like me today!

Beware of OpenMP’s thread pool

For the sake of efficiency, OpenMP‘s implementation always uses thread pool to cache threads (please refer this topic). Check following simple code:

#include <unistd.h>
#include <stdio.h>
#include <omp.h>

int main(void){
        #pragma omp parallel for
        for(int i = 0; i < 256; i++)
        {
            sleep(1);
        }

        printf("Exit loop\n");

        while (1)
        {
            sleep(1);
        }

        return 0;
}

Mys server has 104 logical CPUs. Build and run it:

$ gcc -fopenmp test.c -o test
$ ./test
Exit loop

After “Exit loop” is printed, there is actually only master thread is active. Check number of threads:

$ ps --no-headers -T `pidof test` | wc -l
104

We can see all non-active threads are not destroyed and ready for future use (clang also uses thread pool inside).

The 103 non-active threads are not free; they consume resource and Operating System needs to take care of them. Sometimes they can encumber your process’s performance, especially on a system which already has heavy load. So when you write following code next time:

 #pragma omp parallel for
 for(...)
 {
    ......
 }

Try to answer following questions:
1) How many threads will be spawned?
2) Will these threads be actively used in future or only this time? If they are only valid for this time, is it possible that they become burden of the process? Please try to measure the performance of program. If the answer is yes, how about use other thread implementation instead?

P.S., the full code is here.

Use different build system for cmake

Besides idiomatic make, cmake also supports other build systems (e.g., ninja):

# cmake -h
......
Generators

The following generators are available on this platform (* marks default):
* Unix Makefiles               = Generates standard UNIX makefiles.
  Ninja                        = Generates build.ninja files.
  Watcom WMake                 = Generates Watcom WMake makefiles.
  CodeBlocks - Ninja           = Generates CodeBlocks project files.
  CodeBlocks - Unix Makefiles  = Generates CodeBlocks project files.
  CodeLite - Ninja             = Generates CodeLite project files.
  CodeLite - Unix Makefiles    = Generates CodeLite project files.
  Sublime Text 2 - Ninja       = Generates Sublime Text 2 project files.
  Sublime Text 2 - Unix Makefiles
                               = Generates Sublime Text 2 project files.
  Kate - Ninja                 = Generates Kate project files.
  Kate - Unix Makefiles        = Generates Kate project files.
  Eclipse CDT4 - Ninja         = Generates Eclipse CDT 4.0 project files.
  Eclipse CDT4 - Unix Makefiles= Generates Eclipse CDT 4.0 project files.
......

My OS is Arch Linux. To use ninja, first install it:

# pacman -Syu ninja

Then specify using ninja when executing cmake command:

# cmake -G Ninja ..

Finally, use ninja to build:

# ninja

Tuning performance is harder than debugging bugs

In the past week, I focused on resolving an application performance issue, i.e., try to pinpoint why the code didn’t run as fast as I expected. Once upon a time, I am convinced that tuning performance is indeed harder than debugging bugs.

I have more than 10 years experience in software programming, and in recent 4 years, I spend ~20% working time in performance tuning related work: mostly application, sometimes the whole system. Regarding to debugging software bug, if the bug can be always reproduced, it should not be hard to find the root cause. Please notice, I never say it should be easy to fix: e.g., some huge technical debt. If for some reasons, the bug is not 100% reproducible, e.g., the notorious multi-thread bug, you can resort to methods to increase reproduce ratio and help you to pinpoint the culprit: add more logs, change execution time sequence, and so on. However, when talking about performance issue, the thing is “you don’t know something you don’t know“.

In most cases, as a software engineer, you don’t need to keep a watchful eye on hardware, Operating System, compiler, etc. You just need to concentrate on your own code. But to make your program performant, it is not enough to only analyze your code, you need to find answers to questions like this: why does the program run slower in this more powerful platform? Why does profiling make program run even faster? Why can’t multi-thread give a big performance rise? The more you dive into, the more you find you don’t know: architectures of CPU, the mechanism behind Operating System, tons of compiler’s options, and so forth. Even small catch can make your program hiccup! Furthermore, the stackoverflow is not the good place to call for help for performance issue, so the only guy you can rely on is yourself at most of time.

Nonetheless, the fun of performance tuning is also here: after days even weeks of endeavor, I finally find the bottleneck. It is not only exciting experience but every time I learn something I totally don’t know before amid this process. Performance tuning forces you to get a whole picture of the computer system, not only the code you write. This can broaden your view and let you know the essence of computer science.

Performance tuning is harder than debugging bugs, but it also pays off! Enjoy it!