Beware of OpenMP’s thread pool

For the sake of efficiency, OpenMP implementations typically use a thread pool to cache threads (please refer to this topic). Check the following simple code:

#include <unistd.h>
#include <stdio.h>
#include <omp.h>

int main(void){
        #pragma omp parallel for
        for(int i = 0; i < 256; i++)
        {
            sleep(1);
        }

        printf("Exit loop\n");

        while (1)
        {
            sleep(1);
        }

        return 0;
}

My server has 104 logical CPUs. Build and run it:

$ gcc -fopenmp test.c -o test
$ ./test
Exit loop

After “Exit loop” is printed, only the master thread is actually active. Check the number of threads:

$ ps --no-headers -T `pidof test` | wc -l
104

We can see that the non-active threads are not destroyed; they stay ready for future use (clang’s OpenMP runtime also uses a thread pool internally).

The 103 non-active threads are not free: they consume resources, and the operating system needs to take care of them. Sometimes they can encumber your process’s performance, especially on a system that is already under heavy load. So the next time you write the following code:

 #pragma omp parallel for
 for(...)
 {
    ......
 }

Try to answer the following questions:
1) How many threads will be spawned?
2) Will these threads be actively used again later, or only this one time? If they are only needed this once, could they become a burden on the process? Try to measure the performance of the program, and if the answer is yes, consider using another threading implementation instead.

P.S., the full code is here.
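
If measurement shows that the idle threads are indeed a burden, one straightforward mitigation is to cap the team size explicitly. Here is a minimal sketch (the value 4 is just an arbitrary example) using the standard num_threads clause; setting the OMP_NUM_THREADS environment variable achieves the same effect without recompiling:

#include <unistd.h>
#include <stdio.h>
#include <omp.h>

int main(void){
        /* Cap this parallel region at 4 threads regardless of the CPU count,
           so at most 3 extra threads stay cached in the pool afterwards. */
        #pragma omp parallel for num_threads(4)
        for(int i = 0; i < 256; i++)
        {
            sleep(1);
        }

        printf("Exit loop\n");
        return 0;
}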

A performance issue caused by NUMA

The essence of NUMA is that accessing local memory is fast while accessing remote memory is slow, and I was bitten by it today.

The original code is like this:

/* Every thread creates one partition of a big vector and processes it */
#pragma omp parallel for
for (...)
{
    ......
    vector<> local_partition = create_big_vector_partition();
    /* processing the vector partition*/
    ......
}

I tried creating the big vector outside the OpenMP block, so that every thread just grabs a partition and processes it:

vector<> big_vector = create_big_vector();

#pragma omp parallel for
for (...)
{
    ......
    vector<>& local_partition = get_partition(big_vector);
    /* processing the vector partition*/
    ......
}

I measured the execution time of the OpenMP block:

#pragma omp parallel for
for (...)
{
    ......
}

Though in the original code every thread needs to create its partition of the vector itself, it is still faster than the modified code.

After some experiments and analysis, numastat helped me pinpoint the problem:

$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106

In the original solution, every thread creates its vector partition in the local memory of its own CPU. In the second case, however, the threads often need to access memory on a remote node, and this overhead is bigger than creating the vector partition locally.
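
A common way to get the best of both worlds is first-touch initialization: allocate the big buffer once, but let each thread initialize (and therefore place) its own partition, so the backing pages land on that thread’s local NUMA node. Below is a minimal sketch using a plain C array instead of the original vector<>; PART_SIZE and the process() function are hypothetical stand-ins:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define PART_SIZE (1000000)   /* hypothetical partition size */

/* Hypothetical stand-in for the real per-partition processing. */
static double process(const double *part, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += part[i];
    return sum;
}

int main(void)
{
    int nthreads = omp_get_max_threads();
    /* One big allocation; physical pages are not placed until first touched. */
    double *big = malloc((size_t)nthreads * PART_SIZE * sizeof(double));
    double total = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        double *part = big + (size_t)omp_get_thread_num() * PART_SIZE;

        /* First touch: each thread writes its own partition, so the OS backs
           those pages with memory local to the thread's NUMA node (assuming
           threads are not migrated, e.g. with OMP_PROC_BIND=true). */
        for (size_t i = 0; i < PART_SIZE; i++)
            part[i] = 1.0;

        total += process(part, PART_SIZE);
    }

    printf("total = %f\n", total);
    free(big);
    return 0;
}

Pinning threads (for example with OMP_PROC_BIND or numactl) matters here, because first-touch placement only helps if the thread that initialized a partition is also the one that processes it.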

An experiment about OpenMP parallel loop

From my testing, OpenMP launches a number of threads equal to the number of “virtual CPUs” (logical CPUs), though this is not 100% guaranteed. Today I ran a test on whether the loop nesting structure affects OpenMP performance.
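
It is easy to verify the default team size on a given machine; here is a minimal sketch using the standard omp_get_max_threads() and omp_get_num_threads() APIs:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Outside a parallel region: how many threads a region would use by default. */
    printf("Default team size: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* Inside the region, let one thread report the actual team size. */
        #pragma omp single
        printf("Actual team size: %d\n", omp_get_num_threads());
    }

    return 0;
}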

Given that there are 104 “virtual CPUs” on my system, I define the following constants and variables:

#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)

double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];  

(1) Just a one-level loop:

#pragma omp parallel
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        func(a, b, c, i);
    }

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 7.208773
Time consumed is 7.080540
Time consumed is 7.643123
Time consumed is 7.377163
Time consumed is 7.418053
Time consumed is 7.226235
Time consumed is 7.887611
Time consumed is 7.200167
Time consumed is 7.264515
Time consumed is 7.140937

(2) Use a two-level loop:

for (int i = 0; i < LOOP_NUM; i++)
{
    #pragma omp parallel
        for (int j = 0; j < CPU_NUM; j++)
        {
            func(a, b, c, i * CPU_NUM + j);
        }
}

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 8.333529
Time consumed is 8.164226
Time consumed is 9.705631
Time consumed is 8.695201
Time consumed is 8.972555
Time consumed is 8.126084
Time consumed is 8.286818
Time consumed is 8.162565
Time consumed is 7.884917
Time consumed is 8.073982

At least in this test, the one-level loop has better performance. If you are interested, the source code is here.
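
For reference, the “Time consumed” figures come from timing the parallel region with omp_get_wtime(). Below is a minimal sketch of such a harness for variant (1); it is not the linked source, and func() here is a hypothetical stand-in for the real per-element work:

#include <stdio.h>
#include <omp.h>

#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)

double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];

/* Hypothetical stand-in for the real per-element work. */
static void func(double *x, double *y, double *z, int i)
{
    for (int k = 0; k < 100000; k++)
        z[i] += x[i] * y[i] + (double)k;
}

int main(void)
{
    double start = omp_get_wtime();

    /* Variant (1): the bare "parallel" directive (without "for") makes every
       thread in the team execute the whole loop. */
    #pragma omp parallel
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        func(a, b, c, i);
    }

    printf("Time consumed is %f\n", omp_get_wtime() - start);
    return 0;
}

Note that if the code above is taken literally, the bare parallel directive means every thread executes the full loop in both variants, so the per-thread work is identical and the measured difference mainly reflects the overhead of entering the parallel region once versus LOOP_NUM times.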

Clang may be a better choice than gcc for developing OpenMP programs

As mentioned in The first gcc bug I ever met, I upgraded gcc to the newest 7.1.0 version to overcome OpenMP build errors. Unfortunately, when using the taskloop construct, a weird issue happened again. My application uses HElib, and I just added the following statement in a source file:

#pragma omp taskloop

Then this strange link error was reported:

In function `EncryptedArray::EncryptedArray(EncryptedArray const&)':
/root/Project/../../HElib/src/EncryptedArray.h:539: undefined reference to `cloned_ptr<EncryptedArrayBase, deep_clone<EncryptedArrayBase> >::cloned_ptr(cloned_ptr<EncryptedArrayBase, deep_clone<EncryptedArrayBase> > const&)'
collect2: error: ld returned 1 exit status

I tried to debug it; nevertheless, nothing valuable was found.

So I attempted to use clang. Install it on ArchLinux like this:

# pacman -S clang
resolving dependencies...
looking for conflicting packages...

Packages (2) llvm-libs-4.0.0-3  clang-4.0.0-3

Total Download Size:53.24 MiB
Total Installed Size:  275.24 MiB

:: Proceed with installation? [Y/n] y
......
checking available disk space  [#########################################] 100%
:: Processing package changes...
(1/2) installing llvm-libs   [#########################################] 100%
(2/2) installing clang   [#########################################] 100%
Optional dependencies for clang
openmp: OpenMP support in clang with -fopenmp
python2: for scan-view and git-clang-format [installed]
:: Running post-transaction hooks...
(1/1) Arming ConditionNeedsUpdate...

Unlike gcc, to enable the OpenMP feature in clang, we need to install an additional openmp package:

# pacman -S openmp

Write a simple program:

# cat parallel.cpp
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(5);

    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {

        #pragma omp taskloop
        for (int j = 0; j < 3; j++) {
            printf("%d\n", omp_get_thread_num());
        }

    }   
}

Compile and run it:

# clang++ -fopenmp parallel.cpp
# ./a.out
0
0
0
0
0
1
1
2
4
4
4
4
3
0
1

Clang’s OpenMP works as I expected. I built my project again: no eccentric errors! It works like a charm!

So, based on my testing experience, clang may be a better choice than gcc for developing OpenMP programs, especially for some of the newer OpenMP features.

The first gcc bug I ever met

I have used gcc for more than 10 years, but had never met a bug in it before. In my mind, gcc is one of the most stable pieces of software in the world, but yesterday the myth ended.

I tried to use OpenMP to optimize my program, and all was OK until the taskloop construct was added:

#pragma omp taskloop

The build terminated with the following error:

xxxxxxx.cpp:142:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

Whoops! It seemed I had won the lucky draw! Since my project uses a lot of compile options:

... -g -O2 -fopenmp -fprofile-arcs -ftest-coverage ...

I had to narrow them down to find the root cause. First, I used only -fopenmp: everything was OK. Then I added -g -O2: still no problem. And so on. After trying combinations, -fopenmp -fprofile-arcs turned out to trigger the problem.

To confirm it, I wrote a simple program:

int main(void) {
    #pragma omp taskloop
    for (int i = 0; i < 2; i++) {
    }
    return 0;
}

Compile it:

# gcc -fopenmp -fprofile-arcs parallel.c
parallel.c:6:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

Yeah, the bug was reproduced! It verified my assumption.

To bypass this issue, I decided to try the newest gcc. My OS is ArchLinux, and the installed gcc version is 6.3.1:

# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 6.3.1 20170306 (GCC)

ArchLinux doesn’t yet provide a gcc 7.1 installation package, so I need to download and build it myself:

# wget http://gcc.parentingamerica.com/releases/gcc-7.1.0/gcc-7.1.0.tar.gz
# tar xvf gcc-7.1.0.tar.gz
# cd gcc-7.1.0/
# mkdir build
# cd build

Selecting the configuration options is a headache for me, so I decided to copy the current options from 6.3.1 (please refer to the gcc -v output above):

# ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info .....

Since I don’t need to compile ada and lto, I removed them from --enable-languages:

--enable-languages=c,c++,fortran,go,objc,obj-c++

Besides this, I also need to build and install the isl library myself, or install it through the ArchLinux isl package. Once the configuration succeeds, I can build and install gcc:

# make
# make install

Check the newest gcc:

# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 7.1.0 (GCC)

Compile the program again:

# gcc -fopenmp -fprofile-arcs parallel.c
#

This time the compilation is successful!

P.S. gcc may have other bugs in its support for newer OpenMP directives, so please watch out for them.