Tag Archives: OpenMP

Clang may be a better choice than gcc in developing OpenMP program

Published / by nanxiao / Leave a Comment

As referred in The first gcc bug I ever meet, I upgraded gcc to the newest 7.1.0 version to conquer building OpenMP errors. But unfortunately, when using taskloop clause, weird issue happened again. My application utilizes HElib, and I just added following statement in a source file:

#pragma omp taskloop

Then the strange link error reported:

In function `EncryptedArray::EncryptedArray(EncryptedArray const&)':
/root/Project/../../HElib/src/EncryptedArray.h:539: undefined reference to `cloned_ptr<EncryptedArrayBase, deep_clone<EncryptedArrayBase> >::cloned_ptr(cloned_ptr<EncryptedArrayBase, deep_clone<EncryptedArrayBase> > const&)'
collect2: error: ld returned 1 exit status

I tried to debug it, nevertheless, nothing valuable was found.

So I attempted to use clang. Install it on ArchLinux like this:

# pacman -S clang
resolving dependencies...
looking for conflicting packages...

Packages (2) llvm-libs-4.0.0-3  clang-4.0.0-3

Total Download Size:53.24 MiB
Total Installed Size:  275.24 MiB

:: Proceed with installation? [Y/n] y
......
checking available disk space  [#########################################] 100%
:: Processing package changes...
(1/2) installing llvm-libs   [#########################################] 100%
(2/2) installing clang   [#########################################] 100%
Optional dependencies for clang
openmp: OpenMP support in clang with -fopenmp
python2: for scan-view and git-clang-format [installed]
:: Running post-transaction hooks...
(1/1) Arming ConditionNeedsUpdate...

Unlike gcc, to enable OpenMP feature in clang, we need to install an additional openmp package:

# pacman -S openmp

Write a simple program:

# cat parallel.cpp
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(5);

    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {

        #pragma omp taskloop
        for (int j = 0; j < 3; j++) {
            printf("%d\n", omp_get_thread_num());
        }

    }   
}

Compile and run it:

# clang++ -fopenmp parallel.cpp
# ./a.out
0
0
0
0
0
1
1
2
4
4
4
4
3
0
1

Clang OpenMP works as I expected. Build my project again, no eccentric errors! Work like a charm!

So according to my testing experience, clang may be a better choice than gcc in developing OpenMP program, especially for some new OpenMP features.

The first gcc bug I ever meet

Published / by nanxiao / Leave a Comment

I have used gcc for more than 10 years, but never met a bug before. In my mind, the gcc is one of the stable software in the world, but at yesterday, the myth ended.

I tried to use OpenMP to optimize my program, and all was OK until the taskloop construct was added:

#pragma omp taskloop

The build flow terminated with the following errors:

xxxxxxx.cpp:142:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

Whoops! It seemed I got the lucky draw! Since my project uses a lot of compile options:

... -g -O2 -fopenmp -fprofile-arcs -ftest-coverage ...

I must narrow down to find the root cause. Firstly, I only use -fopenmp, then everything was OK; Secondly, adding -g -O2, no problem; …. After combination trial, -fopenmp -fprofile-arcs can cause the problem.

To confirm it, I wrote a simple program:

int main(void) {
    #pragma omp taskloop
    for (int i = 0; i < 2; i++) {
    }
    return 0;
}

Compile it:

# gcc -fopenmp -fprofile-arcs parallel.c
parallel.c:6:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.archlinux.org/> for instructions.

Yeah, the bug was reproduced! It verified my assumption.

To bypass this issue, I decided to try the newest gcc. My OS is ArchLinux,and the gcc version is 6.3.1:

# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 6.3.1 20170306 (GCC)

Now ArchLinux doesn’t provide gcc 7.1 installation package, so I need to download and build it myself:

# wget http://gcc.parentingamerica.com/releases/gcc-7.1.0/gcc-7.1.0.tar.gz
# tar xvf gcc-7.1.0.tar.gz
# cd gcc-7.1.0/
# mkdir build
# cd build

Select the configuration options is a headache task for me, and I decide to copy the current options for 6.3.1 (Please refer the above output from gcc -v):

# ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info .....

Since I don’t need to compile ada and lto, I remove these from --enable-languages:

--enable-languages=c,c++,fortran,go,objc,obj-c++

Besides this, I also need to build and install isl library myself or through ArchLinux isl package. Once configuration is successful, I can build and install it:

# make
# make install

Check the newest gcc:

# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 7.1.0 (GCC)

Compile the program again:

# gcc -fopenmp -fprofile-arcs parallel.c
#

This time compilation is successful!

P.S. The gcc may have other bugs when supporting some new OpenMP directives, so please pay attention to it.

The pitfalls of using OpenMP parallel for-loops

Published / by nanxiao / Leave a Comment

According to this discussion:

#pragma omp parallel for
for (...)
{
}

is a shortcut of

#pragma omp parallel
{ 
#pragma omp for
    for (...)
    {
    }
}

and it seems more convenient of using “#pragma omp parallel for“. But there are some pitfalls which you should pay attention to:

(1) You can’t assume the number of threads will be equal to for-loops iteration counts even it is very small. For example (The machine has only cores.):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Build and run this program:

# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1

We can see only 2 threads are generated. Run it in another 32-core machine:

# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3

We can see 5 threads are launched.

(2) Use num_threads clause to modify the program as following:

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Rebuild and run it on original 2-core machine:

# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0

We can see this time 5 threads are created. But you should notice the actual thread count depends the system resource. E.g., change the code like this:

#pragma omp parallel for num_threads(30000)
    for (int i = 0; i < 30000; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }

Execute it:

# ./a.out

libgomp: Thread creation failed: Resource temporarily unavailable

So we should notice the the created thread number.

(3) If you use C++ thread_local variable, you should take care:

#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }

    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}

Compile and run:

# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0

We can see only the first element is changed, so it must be thread 0‘s work. Remove the thread_local qualifier, and rebuild. This time you get the wanted result:

# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5

The caveat of building OpenMP program

Published / by nanxiao / Leave a Comment

When building OpenMP program, you must be sure to use -fopenmp option in both compile and link stage (refer stackoverflow), else you may get a hit.

Take the following example:

#include <unistd.h>
#include <omp.h>

int main(void){

        #pragma omp parallel num_threads(4)
        for(;;)
        {
            sleep(1);
        }

        return 0;
}

Use gcc to build it (contains both compile and link):

gcc -fopenmp parallel.c

Execute the program and check the threads number:

$ ./a.out &
[1] 5684
$ ps -T 5684
  PID  SPID TTY      STAT   TIME COMMAND
 5684  5684 pts/16   Sl     0:00 ./a.out
 5684  5685 pts/16   Sl     0:00 ./a.out
 5684  5686 pts/16   Sl     0:00 ./a.out
 5684  5687 pts/16   Sl     0:00 ./a.out

There are 4 threads which as our expected.

Then we create a neat Makefile and split the compile and link stages separately:

all:
        gcc -fopenmp -c parallel.c -o parallel.o
        gcc parallel.o
clean:
        rm *.o a.out

Run the Makefie:

$ make
gcc -fopenmp -c parallel.c -o parallel.o
gcc parallel.o
parallel.o: In function `main':
parallel.c:(.text+0x19): undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
make: *** [Makefile:3: all] Error 1

During the link phase, the gcc complained it can’t find GOMP_parallel. So we need to add -fopenmp in link command too:

all:
        gcc -fopenmp -c parallel.c -o parallel.o
        gcc -fopenmp parallel.o
clean:
        rm *.o a.out

This time all is OK:

$ make
gcc -fopenmp -c parallel.c -o parallel.o
gcc -fopenmp parallel.o
$ ./a.out &
[2] 6502
$ ps -T 6502
  PID  SPID TTY      STAT   TIME COMMAND
 6502  6502 pts/16   Sl     0:00 ./a.out
 6502  6503 pts/16   Sl     0:00 ./a.out
 6502  6504 pts/16   Sl     0:00 ./a.out
 6502  6505 pts/16   Sl     0:00 ./a.out

You can also use ldd tool to check a.out‘s dynamic libraries:

$ ldd /usr/lib/libgomp.so.1
    linux-vdso.so.1 (0x00007ffd9c0dd000)
    libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007fe5554ee000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007fe5552d0000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007fe554f2c000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007fe554d28000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe55571c000)

The libgomp includes the GOMP_parallel definition.

An example of debugging parallel program

Published / by nanxiao / Leave a Comment

In the past 2 weeks, I was tortured by a nasty bug, yes, a not-always-reproducible, use-third-party-code, parallel-program issue. Thank Goodness! The root cause was found on this Thursday, and I think the whole progress is a perfect experience and tutorial of how to debug parallel program. So I record it here and hope it can give you some tips when facing similar situation.

Background: I am leveraging FHEW to implement some complex calculations and experiments. Now the FHEW only works in single-thread environment, since many functions share the same global variables, like this:

......
double *in;
fftw_complex *out;
fftw_plan plan_fft_forw, plan_fft_back;

void FFTsetup() {
  in = (double*) fftw_malloc(sizeof(double) * 2*N);
  out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (N + 2));
  plan_fft_forw = fftw_plan_dft_r2c_1d(2*N, in, out,  FFTW_PATIENT);
  plan_fft_back = fftw_plan_dft_c2r_1d(2*N, out, in,  FFTW_PATIENT);
}

void FFTforward(Ring_FFT res, const Ring_ModQ val) {
  for (int k = 0; k < N; ++k)   {
    in[k] = (double) (val[k]);
    in[k+N] = 0.0;          
  }
  fftw_execute(plan_fft_forw); 
  for (int k = 0; k < N2; ++k) 
    res[k] = (double complex) out[2*k+1];               
}
.....

So to make it can be used in parallel, I first need to modify it to be used in multi-thread program. This work involves of allocating resources of every thread, adding lock of necessary global variables, etc. The multi-thread FHEW works always OK until I employed it to implement a complicated function. At most times the result was correct except some rare occasions. So I began the tough debugging work.

(1) Log. Undoubtedly, the log is the most effective one in all debugging tools. I tried to add as many traces as I can. An important attention you must pay is now that multiple threads are running simultaneously, you must identify the logs are from which thread. For example, if you use OpenMP, you can use following code to differentiate thread IDs:

......
printf("Thread(%d/%d) ......", omp_get_thread_num(), omp_get_num_threads(), ......);
......

After analyzing the log, a function was spotted, but it did noting but use 2 consecutive HomNAND to implement the AND operations (You should consider HomNAND is just a NAND gate in logical circuit, and use two NAND gates can realize AND gate.). According to AND gate truth-table, when two inputs are 0, the output should be 0 too, but in my program, the result is 1, which causes the final error.

(2) Simplify the program and try to find the reproducible condition. Since the bug was related to HomNAND and my original program was a little complicated, I needed to do some simple tests and try to find the reproducible condition. I designed 3 experiments:

a) Use original single-thread FHEW, and execute one HomNAND continuously:

for (;;) {
    ......
    HomNAND();
    ......
}

b) Use modified multi-thread FHEW, and spawn 8 threads. Every thread run one HomNAND in dead-loop:

#pragma omp parallel for
for (int loop = 0; loop < 8; loop++) {
    ......
    HomNAND();
    ......
}

c) Use modified multi-thread FHEW, and spawn 8 threads. Every thread run two HomNANDs to achieve AND function in dead-loop:

#pragma omp parallel for
for (int loop = 0; loop < 8; loop++) {
    ......
    HomNAND()
    HomNAND();
    ......
}

After testing, I found case a) and b) were always OK, while c) would generate wrong results after running 1 or 2 hours.

(3) Now the thing became intricate. Because I modified the FHEW, and FHEW also used FFTW, it was not an easy task to locate where the problem was. Additionally, the thread-safe usage of FFTW also involved many discussions (like this) and pitfalls (like this), so I resort the authors of FFTW to make sure my code is not wrong.

(4) Back to the experiments, I suddenly found that I have tested single-thread-single-HomNAND, multiple-thread-single-HomNAND, multiple-thread-two-HomNAND cases, but how about single-thread-two-HomNAND? If single-thread-two-HomNAND case can fail, it can prove the my modification code was innocent, and the problem should reside on the original FHEW code:

for (;;) {
    ......
    HomNAND();
    HomNAND();
    ......
}

Yeah, I was very lucky! After one hour, the error occurred. The murder was found, so the next step was to seek solution with the author.

Based on this experience, I think analyzing log and simplifying test model are very important techniques during resolving parallel program bugs. Hope everyone can improve debugging ability, happy debugging!