The pitfalls of using OpenMP parallel for-loops

According to this discussion:

#pragma omp parallel for
for (...)
{
}

is a shortcut for

#pragma omp parallel
{ 
#pragma omp for
    for (...)
    {
    }
}

and it seems more convenient to use "#pragma omp parallel for". But there are some pitfalls you should pay attention to:

(1) You can't assume the number of threads will be equal to the loop's iteration count, even when that count is very small. For example (the machine has only 2 cores):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Build and run this program:

# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1

We can see only 2 threads are generated. Run it on another, 32-core machine:

# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3

We can see that 5 threads are launched.
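
The default team size is implementation-defined; in practice it typically follows the number of available cores, and it can be overridden with the OMP_NUM_THREADS environment variable or omp_set_num_threads(). A minimal sketch for checking the default before entering a parallel region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* omp_get_max_threads() reports the team size the next
       parallel region would get by default. */
    printf("default team size is %d\n", omp_get_max_threads());
    return 0;
}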

(2) Use the num_threads clause to modify the program as follows:

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Rebuild and run it on the original 2-core machine:

# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0

We can see that this time 5 threads are created. But notice that the actual thread count depends on system resources. E.g., change the code like this:

#pragma omp parallel for num_threads(30000)
    for (int i = 0; i < 30000; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }

Execute it:

# ./a.out

libgomp: Thread creation failed: Resource temporarily unavailable

So we should be careful about how many threads we request.
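
If the requested thread count scales with the problem size, a defensive approach is to clamp it against what the runtime would give you anyway. A minimal sketch (the clamping policy here is just an illustration):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int iterations = 30000;
    /* Request at most the default team size, and never more
       threads than there are iterations to hand out. */
    int nthreads = omp_get_max_threads();
    if (iterations < nthreads)
        nthreads = iterations;

#pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < iterations; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}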

P.S., if the iteration count is smaller than the core count, the number of threads will still be equal to the core count (still on the 32-core machine):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        if (0 == omp_get_thread_num()) {
            printf("thread number is %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

The output is:

thread number is 32
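
If spawning a full team for such a tiny loop is undesirable, OpenMP's if clause lets the region fall back to serial execution below a threshold. A minimal sketch (the cutoff value is arbitrary):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 4;
    /* When the if condition is false, the region runs on one thread. */
#pragma omp parallel for if(n >= 16)
    for (int i = 0; i < n; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}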

(3) If you use a C++ thread_local variable, you should take care:

#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }

    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}

Compile and run:

# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0

We can see only the first element is changed: with thread_local, each thread writes into its own private copy of array, and after the parallel region we read the main thread's (thread 0's) copy, where only iteration i = 0 landed. Remove the thread_local qualifier and rebuild. This time you get the wanted result:

# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5
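
To make the mechanism visible, you can print each thread's private copy inside the parallel region. A small C++ sketch mirroring the example above:

#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int value = 0;
#pragma omp parallel num_threads(5)
    {
        value = omp_get_thread_num() + 1;  // writes this thread's own copy
        printf("thread %d sees value %d\n", omp_get_thread_num(), value);
    }
    /* Back outside the region we are on thread 0 again,
       so only its copy (value 1) is visible. */
    printf("main thread sees value %d\n", value);
    return 0;
}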

A caveat when building OpenMP programs

When building an OpenMP program, you must be sure to use the -fopenmp option in both the compile and the link stage (refer to stackoverflow), or else you may hit an error.

Take the following example:

#include <unistd.h>
#include <omp.h>

int main(void){

        #pragma omp parallel num_threads(4)
        for(;;)
        {
            sleep(1);
        }

        return 0;
}

Use gcc to build it (this performs both compiling and linking):

gcc -fopenmp parallel.c

Execute the program and check the threads number:

$ ./a.out &
[1] 5684
$ ps -T 5684
  PID  SPID TTY      STAT   TIME COMMAND
 5684  5684 pts/16   Sl     0:00 ./a.out
 5684  5685 pts/16   Sl     0:00 ./a.out
 5684  5686 pts/16   Sl     0:00 ./a.out
 5684  5687 pts/16   Sl     0:00 ./a.out

There are 4 threads, as we expected.

Then we create a neat Makefile that splits the compile and link stages:

all:
        gcc -fopenmp -c parallel.c -o parallel.o
        gcc parallel.o
clean:
        rm *.o a.out

Run the Makefile:

$ make
gcc -fopenmp -c parallel.c -o parallel.o
gcc parallel.o
parallel.o: In function `main':
parallel.c:(.text+0x19): undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
make: *** [Makefile:3: all] Error 1

During the link stage, gcc complained that it can't find GOMP_parallel. So we need to add -fopenmp to the link command too:

all:
        gcc -fopenmp -c parallel.c -o parallel.o
        gcc -fopenmp parallel.o
clean:
        rm *.o a.out

This time all is OK:

$ make
gcc -fopenmp -c parallel.c -o parallel.o
gcc -fopenmp parallel.o
$ ./a.out &
[2] 6502
$ ps -T 6502
  PID  SPID TTY      STAT   TIME COMMAND
 6502  6502 pts/16   Sl     0:00 ./a.out
 6502  6503 pts/16   Sl     0:00 ./a.out
 6502  6504 pts/16   Sl     0:00 ./a.out
 6502  6505 pts/16   Sl     0:00 ./a.out

You can also use the ldd tool to check a.out's dynamic libraries:

$ ldd ./a.out
    linux-vdso.so.1 (0x00007ffd9c0dd000)
    libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007fe5554ee000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007fe5552d0000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007fe554f2c000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007fe554d28000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe55571c000)

The libgomp library provides the definition of GOMP_parallel.
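
You can confirm that libgomp exports this symbol with nm (the library path may differ on your distribution):

$ nm -D /usr/lib/libgomp.so.1 | grep GOMP_parallel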

An example of debugging a parallel program

In the past 2 weeks, I was tortured by a nasty bug: yes, a not-always-reproducible, third-party-code, parallel-program issue. Thank goodness, the root cause was found this Thursday, and I think the whole process was a perfect exercise in how to debug a parallel program. So I record it here and hope it can give you some tips when you face a similar situation.

Background: I am leveraging FHEW to implement some complex calculations and experiments. Currently FHEW only works in a single-threaded environment, since many functions share the same global variables, like this:

......
double *in;
fftw_complex *out;
fftw_plan plan_fft_forw, plan_fft_back;

void FFTsetup() {
  in = (double*) fftw_malloc(sizeof(double) * 2*N);
  out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (N + 2));
  plan_fft_forw = fftw_plan_dft_r2c_1d(2*N, in, out,  FFTW_PATIENT);
  plan_fft_back = fftw_plan_dft_c2r_1d(2*N, out, in,  FFTW_PATIENT);
}

void FFTforward(Ring_FFT res, const Ring_ModQ val) {
  for (int k = 0; k < N; ++k)   {
    in[k] = (double) (val[k]);
    in[k+N] = 0.0;          
  }
  fftw_execute(plan_fft_forw); 
  for (int k = 0; k < N2; ++k) 
    res[k] = (double complex) out[2*k+1];               
}
......

So I first needed to modify FHEW so that it could be used in a multi-threaded program. This work involved allocating separate resources for every thread, adding locks around the necessary global variables, etc. The multi-threaded FHEW always worked OK until I employed it to implement a complicated function. Most of the time the result was correct, except on some rare occasions. So I began the tough debugging work.
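
To give an idea of the kind of change involved, here is a hedged sketch (not the actual FHEW patch; types and sizes are simplified): plan once up front, since the FFTW planner is not thread-safe, then let each thread run the shared plan on its own buffers via FFTW's thread-safe new-array execute function:

#include <complex.h>
#include <fftw3.h>

#define N 1024  /* illustrative size; FHEW defines its own N */

static fftw_plan plan_fft_forw;

void FFTsetup_mt(void) {
  /* Planner calls must not run concurrently with any other FFTW
     call, so do all planning before spawning the threads. */
  double *tmp_in = fftw_malloc(sizeof(double) * 2*N);
  fftw_complex *tmp_out = fftw_malloc(sizeof(fftw_complex) * (N + 2));
  plan_fft_forw = fftw_plan_dft_r2c_1d(2*N, tmp_in, tmp_out, FFTW_PATIENT);
  fftw_free(tmp_in);
  fftw_free(tmp_out);
}

void FFTforward_mt(double complex *res, const int *val,
                   double *in, fftw_complex *out) {
  /* in/out are per-thread buffers (each allocated once per thread
     with fftw_malloc), so nothing in here is shared. */
  for (int k = 0; k < N; ++k) {
    in[k] = (double) val[k];
    in[k+N] = 0.0;
  }
  /* The new-array execute functions are safe to call from
     multiple threads simultaneously. */
  fftw_execute_dft_r2c(plan_fft_forw, in, out);
  for (int k = 0; k < N/2; ++k)  /* N2 == N/2 in the original code */
    res[k] = out[2*k+1];
}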

(1) Logs. Undoubtedly, the log is the most effective of all debugging tools. I tried to add as many traces as I could. One important thing to pay attention to: now that multiple threads are running simultaneously, you must identify which thread each log entry comes from. For example, if you use OpenMP, you can use the following code to differentiate thread IDs:

......
printf("Thread(%d/%d) ......", omp_get_thread_num(), omp_get_num_threads(), ......);
......
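
For convenience you can wrap this in a small macro so that every trace is prefixed automatically. A sketch (the macro name is my own, and ##__VA_ARGS__ is a GNU extension supported by gcc):

#include <omp.h>
#include <stdio.h>

/* Prefix every trace line with "Thread(id/total)". */
#define TLOG(fmt, ...) \
    printf("Thread(%d/%d) " fmt "\n", \
           omp_get_thread_num(), omp_get_num_threads(), ##__VA_ARGS__)

int main(void) {
#pragma omp parallel num_threads(4)
    TLOG("entering parallel region");
    return 0;
}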

After analyzing the logs, a function was spotted, but it did nothing but use 2 consecutive HomNANDs to implement the AND operation (you can think of HomNAND as just a NAND gate in a logic circuit, and two NAND gates can realize an AND gate, since NAND(x, x) = NOT x). According to the AND gate truth table, when both inputs are 0 the output should be 0 too, but in my program the result was 1, which caused the final error.
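
In plain logic terms (hypothetical helper names, just to illustrate the construction):

#include <stdio.h>

/* NAND(a, b) gives !(a && b), and NAND(x, x) negates x, so feeding
 * NAND's output into a second NAND with both inputs tied together
 * yields AND. */
static int nand(int a, int b) { return !(a && b); }

static int and_from_nand(int a, int b) {
    int t = nand(a, b);
    return nand(t, t);
}

int main(void) {
    /* The buggy case: both inputs 0 must give 0. */
    printf("AND(0, 0) = %d\n", and_from_nand(0, 0));
    return 0;
}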

(2) Simplify the program and try to find a reproducible condition. Since the bug was related to HomNAND and my original program was a little complicated, I needed to do some simple tests to try to find a reproducible condition. I designed 3 experiments:

a) Use the original single-threaded FHEW, and execute one HomNAND continuously:

for (;;) {
    ......
    HomNAND();
    ......
}

b) Use the modified multi-threaded FHEW, and spawn 8 threads. Every thread runs one HomNAND in an endless loop:

#pragma omp parallel for
for (int loop = 0; loop < 8; loop++) {
    ......
    HomNAND();
    ......
}

c) Use the modified multi-threaded FHEW, and spawn 8 threads. Every thread runs two HomNANDs, to achieve the AND function, in an endless loop:

#pragma omp parallel for
for (int loop = 0; loop < 8; loop++) {
    ......
    HomNAND();
    HomNAND();
    ......
}

After testing, I found cases a) and b) were always OK, while c) would generate wrong results after running for 1 or 2 hours.

(3) Now things became intricate. Because I had modified FHEW, and FHEW also used FFTW, it was not an easy task to locate where the problem was. Additionally, the thread-safe usage of FFTW has been the subject of many discussions (like this) and pitfalls (like this), so I resorted to the authors of FFTW to make sure my code was not wrong.

(4) Back to the experiments: I suddenly realized that I had tested the single-thread-single-HomNAND, multi-thread-single-HomNAND, and multi-thread-two-HomNAND cases, but what about single-thread-two-HomNAND? If the single-thread-two-HomNAND case could also fail, it would prove that my modifications were innocent, and that the problem resided in the original FHEW code:

for (;;) {
    ......
    HomNAND();
    HomNAND();
    ......
}

Yeah, I was very lucky! After one hour, the error occurred. The culprit was found, so the next step was to seek a solution with the author.

Based on this experience, I think analyzing logs and simplifying the test model are very important techniques for resolving parallel-program bugs. I hope everyone can improve their debugging ability. Happy debugging!