The essence of NUMA
is that accessing local memory is fast while accessing remote memory is slow, and I was bitten by it today.
The original code looks like this:
/* Every thread creates one partition of a big vector and processes it */
#pragma omp parallel for
for (...)
{
......
vector<> local_partition = create_big_vector_partition();
/* processing the vector partition*/
......
}
I tried creating the big vector outside the OpenMP
block, so that every thread just grabs a partition and processes it:
vector<> big_vector = create_big_vector();
#pragma omp parallel for
for (...)
{
......
vector<>& local_partition = get_partition(big_vector);
/* processing the vector partition*/
......
}
I measured the execution time of the OpenMP
block:
#pragma omp parallel for
for (...)
{
......
}
Although in the original code every thread has to create its own vector partition, it is still faster than the modified code.
After some experiments and analysis, numastat
helped me pinpoint the problem:
$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106
In the original solution, every thread creates its vector partition in the local memory of the CPU it runs on (on Linux, the default first-touch policy places each page on the node of the thread that first writes it). In the second case, however, the threads often need to access memory on a remote node, and this overhead is bigger than the cost of creating the vector partitions locally.