Technology | Nan Xiao's Blog

An experiment about OpenMP parallel loop

From my testing, the OpenMP will launch threads number equals to “virtual CPU”, though it is not 100% guaranteed. Today I do a test about whether loop levels affect OpenMP performance.

Given the “virtual CPU” is 104 on my system, I define following constants and variables:

#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)

double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];

(1) Just one-level loop:

#pragma omp parallel
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        func(a, b, c, i);
    }

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 7.208773
Time consumed is 7.080540
Time consumed is 7.643123
Time consumed is 7.377163
Time consumed is 7.418053
Time consumed is 7.226235
Time consumed is 7.887611
Time consumed is 7.200167
Time consumed is 7.264515
Time consumed is 7.140937

(2) Use two-level loop:

for (int i = 0; i < LOOP_NUM; i++)
{
    #pragma omp parallel
        for (int j = 0; j < CPU_NUM; j++)
        {
            func(a, b, c, i * CPU_NUM + j);
        }
}

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 8.333529
Time consumed is 8.164226
Time consumed is 9.705631
Time consumed is 8.695201
Time consumed is 8.972555
Time consumed is 8.126084
Time consumed is 8.286818
Time consumed is 8.162565
Time consumed is 7.884917
Time consumed is 8.073982

At least from this test, one-level loop has a better performance. If you are interested, the source code is here.

CUDA P2P is not guaranteed to be faster than staged through the host

Today, I write a simple test to verify whether CUDA Peer-to-Peer Memory Copy is always faster than using CPU to transfer. At least from my platform, it is not:

(1) Disable P2P, you can see CPU utilization ratio is very high: 86.7%, and the bandwidth is nearly 10.67GB/s:

(2) Enable P2P, CPU utilization drops down to 1.3% only, and the bandwidth is about 1.6GB/s fall behind: 9.00GB/s:

P.S., the full code is here.

Arch Linux: a developer-friendly Operating System

I have been using Arch Linux as the working environment for nearly 2 years. Generally speaking, the experience is very good, and I want to recommend it for more people, especially software engineers.

Because I am a developer, one SSH client is mostly enough, and fascinating desktop doesn’t appeal to me. Since Arch Linux is “rolling release” mode, it means I can always get the newest kernel and software packages (one “pacman -Syu” command will refresh the whole system), and my favorite thing is to make use of the newest functions provided by compiler and kernel . On the contrary, one pain point of distributions whose mode are “point release” is sometimes you need to compile the vanilla kernel yourself if you want to try some up-to-date features (E.g., use eBPF on RHEL 7) .

The package management system is another killer feature. For instance, I used to try to develop OpenMP program using clang. Not similar as gcc, clangrequires additional package:

# pacman -S clang
......
Optional dependencies for clang
    openmp: OpenMP support in clang with -fopenmp
    python2: for scan-view and git-clang-format
......

The prompt not only shows me that I need to install openmp package, but also requires “-fopenmp” option to compile OpenMP program. This is very humanized. I tried to enable OpenMP feature of clang on some other OSs, and the process is not as smooth as Arch Linux.

Last but not least, Arch Linux community is friendly, I can always get help from other enthusiastic guys.

For me as a programmer, what I need is just a stable Operating System which can always provide latest software toolchains to meet my requirements, and I don’t want to spend much time to tweak it. Arch Linux seems fulfill these demands perfectly. If you have simple requirement as me, why not give it a shot?

First taste of building cilk program on Arch Linux

I bump into cilk that is much like OpenMP this week, and give it a try on Arch Linux:

(1) Follow the Quick start to build Tapir-Meta:

$ git clone --recursive https://github.com/wsmoses/Tapir-Meta.git
$ cd Tapir-Meta/
$ ./build.sh release
$ source ./setup-env.sh

Please note gcc is the default compiler on Arch Linux, and there is a compile error of using gcc. So I switch to clang:

$ CC=clang CXX=claang++ ./build.sh release

The setup-env.sh is simple, just make Tapir/LLVM compilers as the default ones:

$ which clang
/home/xiaonan/Tapir-Meta/tapir/build/bin/clang
$ which clang++
/home/xiaonan/Tapir-Meta/tapir/build/bin/clang++

(2) Build libcilkrts (please refer this issue: “/usr/bin/ld: cannot find -lcilkrts” error).

(3) Write a simple program:

$ cat test.c
#include <unistd.h>
#include <cilk/cilk.h>

int main(void)
{
        cilk_spawn sleep(100);
        cilk_spawn sleep(100);
        cilk_spawn sleep(100);

        cilk_sync;
        return 0;
}

Build it:

$ clang -L/home/xiaonan/cilkrts/build/install/lib -fcilkplus test.c
$ ldd a.out
    linux-vdso.so.1 (0x00007ffdccd32000)
    libcilkrts.so.5 => /home/xiaonan/cilkrts/build/install/lib/libcilkrts.so.5 (0x00007f1fe4d60000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f1fe4b48000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f1fe478c000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f1fe4588000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f1fe436a000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f1fe3fe1000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007f1fe3c4c000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f1fe4f7e000)

From ldd output, we can see it links the ibcilkrts library. Execute the program:

$ ./a.out &
[1] 25530
[xiaonan@tesla-p100 ~]$ ps -T 25530 | wc -l
105

We can see a.out spawns many threads.

configure script may not check pthread correctly on OpenBSD

I have come into at least 2 times that one project was built well on Linux, while can’t find pthread related definitions on OpenBSD, like this:

......
../../runtime/cilk-internal.h:39:6: error: unknown type name 'pthread_mutex_t'
     pthread_mutex_t posix;
     ^
../../runtime/cilk-internal.h:211:6: error: unknown type name 'pthread_t'
     pthread_t *tid;
     ^
../../runtime/cilk-internal.h:216:6: error: unknown type name 'pthread_cond_t'
     pthread_cond_t  waiting_workers_cond;
     ^
../../runtime/cilk-internal.h:217:6: error: unknown type name 'pthread_cond_t'
     pthread_cond_t  wakeup_first_worker_cond;
     ^
../../runtime/cilk-internal.h:218:6: error: unknown type name 'pthread_cond_t'
     pthread_cond_t  wakeup_other_workers_cond;
     ^
../../runtime/cilk-internal.h:219:6: error: unknown type name 'pthread_mutex_t'
     pthread_mutex_t workers_mutex;
     ^
../../runtime/cilk-internal.h:220:6: error: unknown type name 'pthread_cond_t'
     pthread_cond_t  workers_done_cond;
......

The source code is as following:

......
#if HAVE_PTHREAD
#include <pthread.h>
#endif
......

While the generated config.h doesn’t define HAVE_PTHREAD macro:

/* Define if you have POSIX threads libraries and header files. */
/* #undef HAVE_PTHREAD */

But in fact, the OpenBSD has provided all support of pthread. So please be aware of this issue.