The essence of NUMA
is that accessing local memory is fast while accessing remote memory is slow, and I was bitten by it today.
The original code looks like this:
/* Every thread creates one partition of a big vector and processes it */
#pragma omp parallel for
for (...)
{
......
vector<> local_partition = create_big_vector_partition();
/* processing the vector partition*/
......
}
I tried creating the big vector outside the OpenMP
block, so that every thread just grabs a partition and processes it:
vector<> big_vector = create_big_vector();
#pragma omp parallel for
for (...)
{
......
vector<>& local_partition = get_partition(big_vector);
/* processing the vector partition*/
......
}
I measured the execution time of the OpenMP
block:
#pragma omp parallel for
for (...)
{
......
}
Although in the original code every thread has to create its own vector partition, it is still faster than the modified code.
After some experiments and analysis, numastat
helped me pinpoint the problem:
$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106
In the original solution, every thread creates its vector partition in the local memory of the CPU it runs on (on Linux, the default first-touch policy places each page on the node of the thread that first writes it). In the second case, however, the threads often need to access memory on a remote node, and this overhead is bigger than the cost of creating the vector partitions locally.