Fix “Permission denied, please try again.” issue when using git protocol

If you want to use the git protocol instead of https, you need to set up SSH keys; otherwise you will encounter errors like the following:

$ git clone git@xxxxx/xxx.git
Cloning into 'xxx'...
git@xxx's password:
Permission denied, please try again.

If you don’t have SSH keys, use ssh-keygen to generate a key pair, then add the public key to your account. In GitLab this lives under your profile’s SSH Keys settings (GitHub is similar).
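A minimal sketch (the key type, comment, and host are just examples; adjust to your setup):

$ ssh-keygen -t ed25519 -C "you@example.com"
$ cat ~/.ssh/id_ed25519.pub    # paste this into the SSH Keys settings page
$ ssh -T git@gitlab.com        # verify the key is accepted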

Reference:
Which remote URL should I use?

The takeaways of GCC optimization

GCC optimization is a good document but a little verbose, so I summarize its takeaways in this post:

(1) Use -march to generate code optimized for a specific CPU, e.g., -march=native. But beware that the code will not be backwards compatible with older or different CPUs (see the example after this list).

(2) -O2 is recommended. But note that if you play with cmake, “cmake -DCMAKE_BUILD_TYPE=Release ..” will use -O3 to compile the code. At least from my own experience, -O3 should be OK.

(3) -pipe has no effect on the generated code, but it speeds up compilation by using pipes rather than temporary files between compilation stages. If the system has enough memory, use it.
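By the way, if you are curious what -march=native actually resolves to on your machine, gcc can dump its target settings (the exact output varies per CPU):

$ gcc -march=native -Q --help=target | grep march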

So the best way of compiling code with gcc should be:

# gcc -march=native -O2 -pipe test.c -o test

Or use -O3 instead of -O2? I dunno, choose -O2 or -O3 at your own risk.

dmesg intro

Because dmesg saved me from a CUDA driver installation issue this month, I decided to write this short post to introduce it. According to Wikipedia:

dmesg (display message or driver message) is a command on most Unix-like operating systems that prints the message buffer of the kernel. The output of this command typically contains the messages produced by the device drivers.

On Unix/Linux systems, the kernel, kernel modules (e.g., device drivers), and even user-space processes may write logs into the kernel ring buffer. So dmesg is a neat tool for debugging purposes (please refer to Linux Performance Analysis in 60,000 Milliseconds). Compared to the BSD family, dmesg on Linux provides more options, so this post uses dmesg on Linux as the example.
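As a quick illustration of user space writing into the ring buffer (this uses the well-known /dev/kmsg interface, not anything specific to this post):

$ echo 'hello from user space' | sudo tee /dev/kmsg
$ dmesg | tail -1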

Firstly, how do you find out the underlying kernel ring buffer size? It depends on CONFIG_LOG_BUF_SHIFT. On my Arch Linux:

$ zgrep CONFIG_LOG_BUF_SHIFT /proc/config.gz
CONFIG_LOG_BUF_SHIFT=17

It means the buffer’s size is 2 ^ 17 bytes = 128 KiB (please refer to How to find out a linux kernel ring buffer size? and the default kernel .config file).
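A quick sanity check of the arithmetic, plus an alternative for distros that do not ship /proc/config.gz (the /boot path is distro-dependent):

$ echo $((1 << 17))
131072
$ grep CONFIG_LOG_BUF_SHIFT /boot/config-$(uname -r)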

Secondly, dmesg on Linux supports many handy options. E.g., -H displays messages in a human-readable format:

$ dmesg -H
[Apr15 09:26] Linux version 5.0.5-arch1-1-ARCH (builduser@heftig-17705) (gcc version 8.2.1 20181127 (GCC)) #1 SMP PREE>
[  +0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=5bee8879-dca0-4b3a-8abd-1bdf96e17a1f rw quiet
[  +0.000000] KERNEL supported cpus:
[  +0.000000]   Intel GenuineIntel
[  +0.000000]   AMD AuthenticAMD
[  +0.000000]   Hygon HygonGenuine
[  +0.000000]   Centaur CentaurHauls
......

and -u tells dmesg to show only user-space messages:

$ dmesg -u -H
[Apr15 09:26] systemd[1]: systemd 241.67-1-arch running in system mode. (+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -S>
[  +0.021954] systemd[1]: Detected architecture x86-64.
[  +0.021551] systemd[1]: Set hostname to <tesla-p100>.
[  +0.098510] systemd[1]: Listening on Journal Socket (/dev/log).
[  +0.000203] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[  +0.000032] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[  +0.000047] systemd[1]: Listening on udev Control Socket.
[  +0.001751] systemd[1]: Listening on Process Core Dump Socket.
[  +0.000010] systemd[1]: Reached target System Time Synchronized.
[  +0.000050] systemd[1]: Listening on Journal Socket.
[  +0.273822] systemd-journald[1148]: Received request to flush runtime journal from PID 1

To learn about the other options, please refer to the manual.
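For instance, two more that I find handy (both from util-linux dmesg): -T prints absolute, human-readable timestamps, --level filters by priority, and -w waits for new messages like tail -f:

$ dmesg -T --level=err,warn
$ dmesg -w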

Last but not least, dmesg's source code is a good place to learn how to develop command line applications on Linux.

Put dmesg into your toolkit; maybe it will save you one day too.

Different RVO behaviors between gcc and clang

Regarding RVO (Return Value Optimization), I think this video gives a really good explanation. Let’s cut to the chase. Check the following code:

# cat rvo.cpp
#include <iostream>

class Foo
{
public:
        Foo(){std::cout << "Foo default constructor!\n";}
        ~Foo(){std::cout << "Foo destructor!\n";}
        Foo(const Foo&){std::cout << "Foo copy constructor!\n";}
        Foo& operator=(const Foo&){std::cout << "Foo assignment!\n"; return *this;}
        Foo(Foo&&){std::cout << "Foo move constructor!\n";}
        Foo& operator=(Foo&&){std::cout << "Foo move assignment!\n"; return *this;}
};

Foo func(bool flag)
{
        Foo temp;
        if (flag)
        {
                std::cout << "if\n";
        }
        else
        {
                std::cout << "else\n";
        }
        return temp;
}

int main()
{
        auto f = func(true);
        return 0;
}

On my Arch Linux platform, the gcc version is 8.2.1 and the clang version is 8.0.0. I tried -std=c++11, -std=c++14, -std=c++17 and -std=c++2a with both compilers, and all combinations generated the same output:

Foo default constructor!
if
Foo destructor!

So both compilers are clever enough to realize there is no need to create the “Foo temp” variable separately (please refer to Small examples show copy elision in C++).
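By the way, if you want to watch the constructor calls that elision removes, both gcc and clang accept -fno-elide-constructors. One caveat: since C++17, eliding the copy of the returned prvalue into “f” is mandatory and cannot be disabled, while NRVO of the named “temp” remains optional, so the exact output depends on the -std level:

$ g++ -std=c++11 -fno-elide-constructors rvo.cpp -o rvo && ./rvo

Modify the func a little: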

Foo func(bool flag)
{
        if (flag)
        {
                Foo temp;
                std::cout << "if\n";
                return temp;
        }
        else
        {
                Foo temp;
                std::cout << "else\n";
                return temp;
        }
}

This time, for clang (with all four options: -std=c++11, -std=c++14, -std=c++17 and -std=c++2a), the program generated the same output as above:

Foo default constructor!
if
Foo destructor!

But for gcc (again with all four options), the program generated different output:

Foo default constructor!
if
Foo move constructor!
Foo destructor!
Foo destructor!

So gcc created both variables, “Foo temp” and “auto f”, and move-constructed one from the other. This not only means clang optimizes better than gcc in this scenario; it also means that if you do something in a move constructor and expect it to be called, your program logic will depend on the compiler: it works under gcc but not under clang. Beware of this trap, otherwise it may bite you one day, like it bit me today!
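To reproduce, commands along these lines should work (the binary names are arbitrary):

$ g++ -std=c++17 rvo.cpp -o rvo_gcc && ./rvo_gcc
$ clang++ -std=c++17 rvo.cpp -o rvo_clang && ./rvo_clang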

Beware of OpenMP’s thread pool

For the sake of efficiency, OpenMP implementations typically use a thread pool to cache threads (please refer to this topic). Check the following simple code:

#include <unistd.h>
#include <stdio.h>
#include <omp.h>

int main(void){
        #pragma omp parallel for
        for(int i = 0; i < 256; i++)
        {
            sleep(1);
        }

        printf("Exit loop\n");

        while (1)
        {
            sleep(1);
        }

        return 0;
}

My server has 104 logical CPUs. Build and run it:

$ gcc -fopenmp test.c -o test
$ ./test
Exit loop

After “Exit loop” is printed, only the master thread is actually active. Check the number of threads:

$ ps --no-headers -T `pidof test` | wc -l
104

We can see the non-active threads are not destroyed; they are kept ready for future use (clang’s OpenMP runtime also uses a thread pool internally).

The 103 inactive threads are not free: they consume resources, and the operating system needs to manage them. Sometimes they can encumber your process’s performance, especially on a system that is already heavily loaded. So the next time you write the following code:

 #pragma omp parallel for
 for(...)
 {
    ......
 }

Try to answer the following questions:
1) How many threads will be spawned?
2) Will these threads be actively used in the future, or only this once? If they are only used once, could they become a burden on the process? Try to measure your program’s performance; if the answer is yes, how about using another threading implementation instead?
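If the pool does turn out to hurt, one mitigation is to cap it explicitly. A minimal sketch; the num_threads clause, the OMP_NUM_THREADS environment variable, and OMP_WAIT_POLICY are all standard OpenMP (the cap of 8 is arbitrary):

 /* cap this region at 8 threads instead of one per logical CPU */
 #pragma omp parallel for num_threads(8)
 for(int i = 0; i < 256; i++)
 {
     sleep(1);
 }

Or, without touching the code, at run time:

$ OMP_NUM_THREADS=8 OMP_WAIT_POLICY=passive ./test

(OMP_WAIT_POLICY=passive asks idle threads to sleep instead of busy-waiting.)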

P.S., the full code is here.