CUDA P2P is not guaranteed to be faster than staging through the host

Today I wrote a simple test to check whether a CUDA peer-to-peer (P2P) memory copy is always faster than a copy staged through the CPU. At least on my platform, it is not:

(1) With P2P disabled, CPU utilization is very high at 86.7%, and the bandwidth is about 10.67GB/s.

(2) With P2P enabled, CPU utilization drops to only 1.3%, but the bandwidth is about 1.67GB/s lower, at 9.00GB/s.

The test file is here.


Arch Linux: a developer-friendly Operating System

I have been using Arch Linux as my working environment for nearly 2 years. Generally speaking, the experience has been very good, and I want to recommend it to more people, especially software engineers.

Because I am a developer, an SSH client is mostly enough for me, and a fancy desktop doesn’t appeal to me. Since Arch Linux follows the “rolling release” model, I can always get the newest kernel and software packages (one “pacman -Syu” command refreshes the whole system), and my favorite thing is making use of the newest features provided by the compiler and kernel. By contrast, one pain point of “point release” distributions is that you sometimes need to compile the vanilla kernel yourself if you want to try up-to-date features (e.g., using eBPF on RHEL 7).

The package management system is another killer feature. For instance, I once tried to develop OpenMP programs with clang. Unlike gcc, clang requires an additional package:

# pacman -S clang
Optional dependencies for clang
    openmp: OpenMP support in clang with -fopenmp
    python2: for scan-view and git-clang-format

The prompt tells me not only that I need to install the openmp package, but also that the “-fopenmp” option is required to compile OpenMP programs. This is very user-friendly. I have tried to enable clang’s OpenMP support on some other OSs, and the process was not as smooth as on Arch Linux.

Last but not least, the Arch Linux community is friendly, and I can always get help from other enthusiastic users.

For me as a programmer, what I need is just a stable Operating System which always provides the latest software toolchains to meet my requirements, and I don’t want to spend much time tweaking it. Arch Linux seems to fulfill these demands perfectly. If your requirements are as simple as mine, why not give it a shot?

A performance engineering team’s story

This is a story about a performance engineering team: most of it is genuine, and some is made up.

Server performance work requires broad knowledge and experience: software, hardware, Operating System kernel, and so on. The team members’ backgrounds are also richly varied: embedded system developer, PCB designer, QA, DevOps, compiler researcher, etc.

The team’s daily work is like this:

(1) Since the company’s main products are servers, running SPEC benchmark programs and cooperating with other teams to get the best results is the most important task. Customers are willing to pay for your servers only if they outperform the competitors’.

(2) It is uncontroversial that Linux dominates the server market, so being proficient in Linux profiling and tracing tools (from perf and ftrace to BPF) is compulsory. Meanwhile, the company also has its own proprietary Unix, which still serves some critical businesses such as banking and telecommunications. The members also need to use some esoteric tools to help diagnose issues on this proprietary Unix. At the same time, coding is another important task: developing the company’s own profiling tools and contributing to FOSS projects.

(3) Support customers and other teams. E.g., one customer wants to know Oracle’s performance on some server models; another finds that Docker doesn’t run as expected. A colleague comes to your desk: “When you are free, could you help check how to optimize this program? The boss isn’t satisfied with it.”

(4) The team encourages sharing. During every weekly meeting, you can introduce a Unix command trick, a debugging skill, and the like. Members can spend 5% ~ 10% of their time on hobby projects, but the prerequisite is that work comes first. This benefit is not a free lunch: you should report on it in the meeting too.

So what is the point of this article? Nothing, just telling a story, that’s it.

First taste of building a Cilk program on Arch Linux

This week I bumped into Cilk, which is much like OpenMP, and gave it a try on Arch Linux:

(1) Follow the Quick start to build Tapir-Meta:

$ git clone --recursive
$ cd Tapir-Meta/
$ ./ release
$ source ./

Please note that gcc is the default compiler on Arch Linux, and the build fails with a compile error under gcc. So I switched to clang:

$ CC=clang CXX=clang++ ./ release

The script sourced above is simple; it just makes the Tapir/LLVM compilers the default ones:

$ which clang
$ which clang++

(2) Build libcilkrts (please refer to this issue: “/usr/bin/ld: cannot find -lcilkrts” error).

(3) Write a simple program:

$ cat test.c
#include <unistd.h>
#include <cilk/cilk.h>

int main(void)
{
        /* Spawn three calls that may run in parallel with the caller. */
        cilk_spawn sleep(100);
        cilk_spawn sleep(100);
        cilk_spawn sleep(100);

        return 0;
}

Build it:

$ clang -L/home/xiaonan/cilkrts/build/install/lib -fcilkplus test.c
$ ldd a.out
     (0x00007ffdccd32000)
     => /home/xiaonan/cilkrts/build/install/lib/ (0x00007f1fe4d60000)
     => /usr/lib/ (0x00007f1fe4b48000)
     => /usr/lib/ (0x00007f1fe478c000)
     => /usr/lib/ (0x00007f1fe4588000)
     => /usr/lib/ (0x00007f1fe436a000)
     => /usr/lib/ (0x00007f1fe3fe1000)
     => /usr/lib/ (0x00007f1fe3c4c000)
    /lib64/ => /usr/lib64/ (0x00007f1fe4f7e000)

From the ldd output, we can see that it links against the libcilkrts library. Execute the program:

$ ./a.out &
[1] 25530
$ ps -T 25530 | wc -l

We can see a.out spawns many threads.

Learn new technology through writing a tutorial about it

I like to get my feet wet with new technologies, but I find that if I don’t use one for some time, e.g., several months, I forget a lot of details (not sure whether other people have the same feeling :-)). To refresh my memory of a technology quickly after a while, I resort to the old-school method: writing notes. But one day, I came up with an idea: why not try to write a tutorial while studying instead of only taking notes? So in the past 2 years, Golang 101 hacks and OpenMP Little Book were born. The whole process is really rewarding:

(1) Sometimes you think you have grasped the knowledge, but when you begin to write an article to explain it, you find there are some points you don’t understand thoroughly. To make your article clearer, you need to write code to verify it, look for help on the internet, etc. Through this process, you gain a deeper understanding.

(2) Your tutorial can be reviewed by other brilliant engineers who can point out your mistakes; in the meantime, the tutorial can also help others. E.g., I occasionally find my tutorial quoted in Stack Overflow answers, and it really encourages me!

(3) Since I am not a native English speaker, creating an English tutorial also helps me practice and improve my English skills. I highly recommend writing in English, because that lets your ideas be shared with people all over the world!

Based on the above points, writing a technical tutorial is definitely a win-win process. Why not give it a shot? I think you can!