According to this discussion:
#pragma omp parallel for
for (...)
{
}
is a shortcut for
#pragma omp parallel
{
    #pragma omp for
    for (...)
    {
    }
}
and using “#pragma omp parallel for” seems more convenient. But there are some pitfalls you should pay attention to:
(1) You can’t assume the number of threads will equal the for-loop’s iteration count, even when that count is very small. For example, on a machine with only 2 cores:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}
Build and run this program:
# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1
We can see that only 2 threads are created. Run it on another machine with 32 cores:
# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3
This time 5 different threads execute the iterations. (The team itself is larger than that, as the P.S. below shows.)
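To check this on your own machine, it helps to separate two numbers: the size of the team, and how many of its threads actually received an iteration. Below is a minimal sketch of one way to measure both. count_working_threads is a hypothetical helper name, not part of OpenMP, and the #ifdef _OPENMP guards are only there so the code also builds without -fopenmp (in which case everything runs on a single thread).

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not an OpenMP API): runs a loop of `iterations`
 * steps and returns how many distinct threads executed at least one step.
 * The team size is stored in *team_size. Returns -1 if allocation fails. */
int count_working_threads(int iterations, int *team_size)
{
    int team = 1;      /* shared: written once inside the single block */
    int *seen = NULL;  /* shared: seen[id] != 0 means thread id ran a step */
    int working = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            team = omp_get_num_threads();
#endif
            seen = calloc((size_t)team, sizeof *seen);
        }
        /* The implicit barrier after `single` makes `seen` visible to all. */

        #pragma omp for
        for (int i = 0; i < iterations; i++) {
            int id = 0;
#ifdef _OPENMP
            id = omp_get_thread_num();
#endif
            if (seen != NULL) {
                seen[id] = 1;  /* each thread only touches its own slot */
            }
        }
    }

    *team_size = team;
    if (seen == NULL) {
        return -1;
    }
    for (int t = 0; t < team; t++) {
        working += seen[t];
    }
    free(seen);
    return working;
}
```

With 5 iterations the return value can never exceed 5, no matter how large the team stored in *team_size turns out to be.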
(2) Use the num_threads clause to request a specific number of threads. Modify the program as follows:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}
Rebuild and run it on the original 2-core machine:
# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0
This time 5 threads are created. But note that the actual thread count depends on the available system resources. E.g., change the code like this:
#pragma omp parallel for num_threads(30000)
for (int i = 0; i < 30000; i++) {
    printf("thread is %d\n", omp_get_thread_num());
}
Execute it:
# ./a.out
libgomp: Thread creation failed: Resource temporarily unavailable
So we should keep an eye on how many threads we ask for.
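One way to guard against this is to cap the requested value before it reaches num_threads. The sketch below assumes the processor count is a reasonable upper bound; that is a choice made for illustration, not an OpenMP rule, and you may well want a different cap. bounded_thread_count and parallel_sum are hypothetical names, and the _OPENMP guard only lets the code build without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: clamp a requested thread count into [1, cap]. */
int bounded_thread_count(int requested, int cap)
{
    if (cap < 1) {
        cap = 1;
    }
    if (requested < 1) {
        return 1;
    }
    return requested < cap ? requested : cap;
}

/* Sums 0 .. n-1, asking for at most one thread per processor. */
long parallel_sum(int n, int requested_threads)
{
    int cap = 1;
#ifdef _OPENMP
    cap = omp_get_num_procs();  /* assumed cap; pick what suits your system */
#endif
    int threads = bounded_thread_count(requested_threads, cap);
    long sum = 0;

    #pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += i;
    }
    return sum;
}
```

Calling parallel_sum(30000, 30000) still covers all 30000 iterations; it simply spreads them over fewer threads than were asked for.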
P.S., if the iteration count is smaller than the core count, the number of threads by default still equals the core count (still on the 32-core machine):
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        if (0 == omp_get_thread_num()) {
            printf("thread number is %d\n", omp_get_num_threads());
        }
    }
    return 0;
}
The output is:
thread number is 32
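Note that the program above calls omp_get_num_threads inside the parallel region. That matters: called from sequential code, the function returns 1, because the team there consists of the initial thread only. Here is a small sketch of the difference. team_size_outside and team_size_inside are hypothetical names, and the _OPENMP guards just let the code build without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: team size as seen from sequential code. */
int team_size_outside(void)
{
#ifdef _OPENMP
    return omp_get_num_threads();  /* 1 when no parallel region is active */
#else
    return 1;
#endif
}

/* Hypothetical helper: team size as seen from inside a parallel region. */
int team_size_inside(void)
{
    int size = 1;

    #pragma omp parallel
    {
        /* Only one thread writes the shared variable, so there is no race. */
        #pragma omp single
        {
#ifdef _OPENMP
            size = omp_get_num_threads();
#endif
        }
    }
    return size;
}
```

If you need an estimate before entering a parallel region, omp_get_max_threads is the call to look at; omp_get_num_threads only tells you about the team you are currently in.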
(3) If you use a C++ thread_local variable, take care:
#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};
    #pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }
    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}
Compile and run:
# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0
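We can see only the first element is changed, so it must be thread 0’s work: every thread gets its own copy of a thread_local variable, and the final loop runs on the initial thread (thread 0), which only sees the writes it made itself. Remove the thread_local qualifier and rebuild. This time you get the expected result: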
# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5
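The same effect can be reproduced in plain C with the C11 _Thread_local specifier. The sketch below writes to a thread-local array and to an ordinary shared array in the same loop, then counts how many of those writes the calling thread can see in each. fill_and_count is a hypothetical name, and schedule(static) is added so that thread 0 is the one that takes the first iterations. How native thread-local storage interacts with OpenMP threads is implementation behavior (the g++ run above shows what GCC does), so treat this as an illustration rather than a guarantee.

```c
#define N 5

static _Thread_local int tls_array[N];  /* one private copy per thread */
static int shared_array[N];             /* a single copy seen by all threads */

/* Hypothetical helper: fills both arrays in parallel, stores the number of
 * correct entries of shared_array in *shared_filled, and returns the number
 * of correct entries the calling thread sees in its own tls_array. */
int fill_and_count(int *shared_filled)
{
    int tls_visible = 0;

    /* Reset the calling thread's copies so repeated calls start clean. */
    for (int i = 0; i < N; i++) {
        tls_array[i] = 0;
        shared_array[i] = 0;
    }

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        tls_array[i] = i + 1;     /* lands in the executing thread's copy */
        shared_array[i] = i + 1;  /* lands in the one shared copy */
    }

    *shared_filled = 0;
    for (int i = 0; i < N; i++) {
        if (tls_array[i] == i + 1) {
            tls_visible++;
        }
        if (shared_array[i] == i + 1) {
            (*shared_filled)++;
        }
    }
    return tls_visible;
}
```

The shared array always ends up completely filled. The count returned for the thread-local array depends on how many iterations thread 0 happened to run, so it only reaches N when the loop runs on a single thread.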