According to this discussion:
#pragma omp parallel for
for (...)
{
}
is a shortcut for
#pragma omp parallel
{
    #pragma omp for
    for (...)
    {
    }
}
and using “#pragma omp parallel for” seems more convenient. But there are some pitfalls you should pay attention to:
(1) You can’t assume that the number of threads equals the loop’s iteration count, even when that count is very small. For example, on a machine with only 2 cores:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}
Build and run this program:
# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1
We can see that only 2 threads are created. Run it on another machine with 32 cores:
# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3
This time 5 different threads execute the iterations.
(2) Use the num_threads clause to request a thread count explicitly. Modify the program as follows:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}
Rebuild and run it on the original 2-core machine:
# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0
This time 5 threads are created. But note that the actual thread count depends on the available system resources. E.g., change the code like this:
#pragma omp parallel for num_threads(30000)
for (int i = 0; i < 30000; i++) {
    printf("thread is %d\n", omp_get_thread_num());
}
Execute it:
# ./a.out
libgomp: Thread creation failed: Resource temporarily unavailable
So pay attention to the number of threads you request.
P.S. If the iteration count is smaller than the core count, the number of threads in the team still equals the core count by default (again on the 32-core machine):
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        if (0 == omp_get_thread_num()) {
            printf("thread number is %d\n", omp_get_num_threads());
        }
    }
    return 0;
}
The output is:
thread number is 32
(3) If you use a C++ thread_local variable, take care:
#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};

    #pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }

    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}
Compile and run:
# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0
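We can see that only the first element is changed, so it must be thread 0’s work: each thread writes to its own copy of the array, and the printing loop after the parallel region reads only the initial thread’s copy. Remove the thread_local qualifier and rebuild. This time you get the expected result: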
# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5
Concerning (3):
monk = thread
book = array
You tell 5 monks to each write one page of a book, and you hand each one their own thread_local book. You do not tell them to keep the books, and you dismiss 4 of the monks after they finish writing. You should not be surprised that you end up with a book with only one page written (the page that thread 0 wrote, the master thread, which you did not dismiss).
If you want all 5 pages of the book, make them write to a shared book instead of a thread_local book.
I think the example shows that, in this case, C++11’s thread_local works with OpenMP just as you would hope.
Kind Regards,
Simon
Hi Simon,
Thanks for your detailed and vivid explanation!
Best Regards
Nan Xiao