The pitfalls of using OpenMP parallel for-loops

According to this discussion:

#pragma omp parallel for
for (...)
{
}

is a shortcut of

#pragma omp parallel
{ 
#pragma omp for
    for (...)
    {
    }
}

and it seems more convenient of using “#pragma omp parallel for“. But there are some pitfalls which you should pay attention to:

(1) You can’t assume the number of threads will be equal to for-loops iteration counts even it is very small. For example (The machine has only cores.):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Build and run this program:

# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1

We can see only 2 threads are generated. Run it in another 32-core machine:

# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3

We can see 5 threads are launched.

(2) Use num_threads clause to modify the program as following:

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Rebuild and run it on original 2-core machine:

# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0

We can see this time 5 threads are created. But you should notice the actual thread count depends the system resource. E.g., change the code like this:

#pragma omp parallel for num_threads(30000)
    for (int i = 0; i < 30000; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }

Execute it:

# ./a.out

libgomp: Thread creation failed: Resource temporarily unavailable

So we should notice the the created thread number.

P.S., if the iteration number is smaller than core number, the number of threads will be equal to core number (still in 32-core machine):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        if (0 == omp_get_thread_num()) {
            printf("thread number is %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

The output is:

thread number is 32

(3) If you use C++ thread_local variable, you should take care:

#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }

    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}

Compile and run:

# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0

We can see only the first element is changed, so it must be thread 0‘s work. Remove the thread_local qualifier, and rebuild. This time you get the wanted result:

# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5