From my testing, OpenMP launches as many threads as there are “virtual CPUs”,
though this is not 100% guaranteed. Today I tested whether the number of loop
levels affects OpenMP performance.
Given that there are 104 “virtual CPUs” on my system, I define the following
constants and variables:
#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)
double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];
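The thread-count observation above can be checked with a minimal sketch like the following. The helper names team_size and visible_procs are made up for illustration and are not part of the original source; the code falls back to 1 when built without -fopenmp.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: count the threads that actually join a default
 * parallel region. Returns 1 when compiled without -fopenmp. */
int team_size(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the team size; the others wait at the
         * implicit barrier that ends the single construct. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    assert(n >= 1);
    return n;
}

/* Hypothetical helper: number of processors the OpenMP runtime reports
 * (1 when compiled without -fopenmp). */
int visible_procs(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}
```

Printing both values from main should show whether the default team size matches the processor count on a given machine. Settings such as the OMP_NUM_THREADS environment variable can change the team size, which is one reason the match is not guaranteed.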
(1) Just a one-level loop:
#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++)
{
    func(a, b, c, i);
}
Execute 10 times consecutively:
$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 7.208773
Time consumed is 7.080540
Time consumed is 7.643123
Time consumed is 7.377163
Time consumed is 7.418053
Time consumed is 7.226235
Time consumed is 7.887611
Time consumed is 7.200167
Time consumed is 7.264515
Time consumed is 7.140937
(2) Use a two-level loop:
for (int i = 0; i < LOOP_NUM; i++)
{
    #pragma omp parallel for
    for (int j = 0; j < CPU_NUM; j++)
    {
        func(a, b, c, i * CPU_NUM + j);
    }
}
Execute 10 times consecutively:
$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 8.333529
Time consumed is 8.164226
Time consumed is 9.705631
Time consumed is 8.695201
Time consumed is 8.972555
Time consumed is 8.126084
Time consumed is 8.286818
Time consumed is 8.162565
Time consumed is 7.884917
Time consumed is 8.073982
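At least in this test, the one-level loop performs better: it averages about 7.34 versus 8.44 for the two-level loop. A likely reason is that the two-level version starts and joins a parallel region LOOP_NUM (100) times instead of once. If you are interested, the source code is here.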
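For readers who want to try the comparison themselves, here is a rough sketch of how the two variants could be timed. It is not the original source: func is a hypothetical stand-in (the real body is not shown in this post), now, time_one_level and time_two_level are made-up names, and the sketch assumes the loops are meant to be work-shared with "#pragma omp parallel for".

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)

static double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];

/* Hypothetical stand-in for func(): each call touches only its own index,
 * so iterations are independent of each other. */
static void func(double *x, double *y, double *z, int idx)
{
    assert(idx >= 0 && idx < ARRAY_SIZE);
    x[idx] = idx;
    y[idx] = 2.0 * idx;
    z[idx] = x[idx] + y[idx];
}

/* Wall-clock time with OpenMP; CPU time as a fallback without -fopenmp. */
static double now(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* One-level loop: a single parallel region covers all iterations. */
double time_one_level(void)
{
    double start = now();
    #pragma omp parallel for
    for (int i = 0; i < ARRAY_SIZE; i++)
        func(a, b, c, i);
    return now() - start;
}

/* Two-level loop: a parallel region is entered LOOP_NUM times. */
double time_two_level(void)
{
    double start = now();
    for (int i = 0; i < LOOP_NUM; i++) {
        #pragma omp parallel for
        for (int j = 0; j < CPU_NUM; j++)
            func(a, b, c, i * CPU_NUM + j);
    }
    return now() - start;
}
```

Calling each function ten times from main and printing the returned value would give output in the same shape as the runs above. With such a cheap func the absolute numbers will be far smaller than those reported here, so only the relative difference is worth comparing.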