Tuning performance is harder than debugging bugs

This past week I focused on resolving an application performance issue, i.e., trying to pinpoint why the code didn’t run as fast as I expected. Once again, I am convinced that tuning performance is indeed harder than debugging bugs.

I have more than 10 years of experience in software programming, and in the past 4 years I have spent roughly 20% of my working time on performance-tuning related work: mostly applications, sometimes the whole system. As for debugging software bugs, if a bug can always be reproduced, it should not be hard to find the root cause. Please note, I never said it should be easy to fix: e.g., some huge technical debt. If for some reason the bug is not 100% reproducible, e.g., the notorious multi-thread bug, you can resort to methods that increase the reproduction ratio and help you pinpoint the culprit: adding more logs, changing the execution timing, and so on. However, when it comes to performance issues, the problem is that “you don’t know what you don’t know”.

In most cases, as a software engineer, you don’t need to keep a watchful eye on the hardware, the Operating System, the compiler, etc. You just need to concentrate on your own code. But to make your program performant, it is not enough to analyze only your own code; you need to find answers to questions like these: why does the program run slower on this more powerful platform? Why does profiling make the program run even faster? Why doesn’t multi-threading give a big performance boost? The more you dive in, the more you find you don’t know: CPU architectures, the mechanisms behind the Operating System, tons of compiler options, and so forth. Even a small catch can make your program hiccup! Furthermore, Stack Overflow is not a good place to ask for help with performance issues, so most of the time the only person you can rely on is yourself.

Nonetheless, this is also where the fun of performance tuning lies: after days or even weeks of endeavor, I finally find the bottleneck. Not only is it an exciting experience, but every time I learn something along the way that I didn’t know at all before. Performance tuning forces you to get the whole picture of the computer system, not only the code you write. This broadens your view and brings you closer to the essence of computer science.

Performance tuning is harder than debugging bugs, but it also pays off! Enjoy it!

A performance issue caused by NUMA

The essence of NUMA is that accessing local memory is fast while accessing remote memory is slow, and I was bitten by it today.

The original code is like this:

/* Every thread creates one partition of a big vector and processes it */
#pragma omp parallel for
for (...)
{
    ......
    vector<> local_partition = create_big_vector_partition();
    /* processing the vector partition*/
    ......
}

I tried creating the big vector outside the OpenMP block, so that every thread just grabs a partition and processes it:

vector<> big_vector = create_big_vector();

#pragma omp parallel for
for (...)
{
    ......
    vector<>& local_partition = get_partition(big_vector);
    /* processing the vector partition*/
    ......
}

I measured the execution time of the OpenMP block:

#pragma omp parallel for
for (...)
{
    ......
}

Although in the original code every thread needs to create its own partition of the vector, it is still faster than the modified code.

After some experiments and analysis, numastat helped me pinpoint the problem:

$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106

In the original solution, every thread creates its vector partition in the local memory of its CPU. In the second case, however, the threads often need to access memory on the remote node, and this overhead is bigger than the cost of creating the vector partition locally.
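For reference, the same locality can also be obtained with a shared buffer through the Linux “first touch” policy: a page lands on the NUMA node of the thread that writes it first. Below is a minimal sketch of that idea (not the code from this post; the partition size and the processing step are placeholders):

/* A minimal sketch of "first touch": malloc() does not touch the pages,
 * so each page is placed on the NUMA node of the thread that writes it
 * first; letting every thread initialize and then process its own
 * partition keeps the accesses local. */
#include <omp.h>
#include <stdlib.h>

#define PARTITION_SIZE (1 << 16)

int main(void)
{
    int nthreads = omp_get_max_threads();
    double *big = malloc((size_t)nthreads * PARTITION_SIZE * sizeof(double));

    #pragma omp parallel
    {
        double *part = big + (size_t)omp_get_thread_num() * PARTITION_SIZE;

        /* First touch: this thread's NUMA node gets these pages. */
        for (size_t i = 0; i < PARTITION_SIZE; i++)
            part[i] = 0.0;

        /* ...process the same partition with the same thread... */
    }

    free(big);
    return 0;
}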

Parallelization may cause GPU utilization to become worse

Recently, I did an experiment with CUDA. The following is a simple dead-loop snippet:

......
while (1)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

Running it in a single thread, the GPU utilization is ~80%; with two threads the utilization drops to ~60%, and with three threads it drops to ~40%. I couldn’t comprehend this phenomenon, so I posted topics on both Stack Overflow and the CUDA developer forum. Unfortunately, there was no response.

After some investigation, I found this post and learned that kernel launches first go into a launch queue. I modified the code and profiled it again (using nvprof instead of the GUI):

......
for (int i = 0; i < 10000; i++)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

The following is the profiling output for one, two, and three threads:

==22209== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  17.577ms     10000  1.7570us  1.7270us  2.8800us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   97.83%  70.567ms     10000  7.0560us  4.4700us  13.296ms  cudaLaunchKernel
                    2.17%  1.5644ms     10000     156ns     119ns  15.779us  cudaGetLastError

==23662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  35.288ms     20000  1.7640us  1.7270us  12.704us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.09%  473.79ms     20000  23.689us  5.0040us  13.294ms  cudaLaunchKernel
                    0.91%  4.3564ms     20000     217ns     117ns  6.4690us  cudaGetLastError

==27597== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  52.587ms     30000  1.7520us  1.7270us  2.9440us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.23%  2.10159s     30000  70.053us  13.545us  13.778ms  cudaLaunchKernel
                    0.77%  16.328ms     30000     544ns     368ns  19.316us  cudaGetLastError

We can see that the average execution time of cudaLaunchKernel scales up with the number of threads, which indicates there is a bottleneck in the kernel launch queue when running more threads.
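For reference, the multi-threaded variant of this experiment can be pictured roughly as below. This is only a sketch under my assumptions (each host thread launching into its own stream, placeholder grid/block sizes and buffer sizes), not the exact benchmark code:

// A rough sketch of the multi-threaded launch pattern: each host thread owns
// one stream and issues 10000 launches of dgt_mul (the signature is taken
// from the nvprof output above); grid/block sizes and arguments are
// placeholders.
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void dgt_mul(unsigned int *a, unsigned int *b, unsigned int *c,
                        int n, int m)
{
    /* ...kernel body omitted... */
}

static void worker(unsigned int *a, unsigned int *b, unsigned int *c,
                   int n, int m)
{
    cudaStream_t st;
    cudaStreamCreate(&st);

    dim3 gDim(8), bDim(128);
    for (int i = 0; i < 10000; i++)
        dgt_mul<<<gDim, bDim, 0, st>>>(a, b, c, n, m);

    cudaStreamSynchronize(st);
    cudaStreamDestroy(st);
}

int main()
{
    unsigned int *a, *b, *c;
    cudaMalloc(&a, 1024 * sizeof(unsigned int));
    cudaMalloc(&b, 1024 * sizeof(unsigned int));
    cudaMalloc(&c, 1024 * sizeof(unsigned int));

    int host_threads = 3;            // 1, 2 or 3 in the experiment
    std::vector<std::thread> ts;
    for (int t = 0; t < host_threads; t++)
        ts.emplace_back(worker, a, b, c, 1024, 1024);
    for (auto &t : ts)
        t.join();
    return 0;
}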

Small tips for optimizing CUDA programs

In this post, I will introduce some tips that can improve the performance of CUDA programs:

(1) Use “pool” to cache resources.

A “pool” is a common data structure that can be used to boost performance (you can refer to my other post, which introduces the “pool” specifically). From my experience, using a memory pool to avoid frequently allocating/freeing CUDA memory is a very effective trick. The other resource I want to cache is the CUDA stream. Yes, since CUDA 7 you can use the --default-stream per-thread compile option to give every host thread its own “regular” stream, but if you want to use multiple streams in one thread, a “stream” pool may be a choice.

P.S., this is my implementation of a memory pool.
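As a rough illustration of the stream-pool idea (just a sketch, not the implementation linked above): streams are created lazily and handed back to the pool instead of being destroyed, so repeated cudaStreamCreate/cudaStreamDestroy calls are avoided.

// A rough sketch of a stream pool: acquire() reuses a cached stream if one is
// available and only creates a new stream otherwise; release() returns the
// stream to the pool instead of destroying it.
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class StreamPool {
public:
    cudaStream_t acquire()
    {
        std::lock_guard<std::mutex> lock(mtx_);
        if (!free_.empty()) {
            cudaStream_t st = free_.back();
            free_.pop_back();
            return st;
        }
        cudaStream_t st;
        cudaStreamCreate(&st);
        return st;
    }

    void release(cudaStream_t st)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        free_.push_back(st);
    }

    ~StreamPool()
    {
        for (cudaStream_t st : free_)
            cudaStreamDestroy(st);
    }

private:
    std::mutex mtx_;
    std::vector<cudaStream_t> free_;
};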

(2) Batch processing in stream.

The point of this tip is to reduce the number of stream synchronizations, i.e., do this:

kernel_1<<<gDim, bDim, 0, st>>>(......);

kernel_2<<<gDim, bDim, 0, st>>>(......);

cudaStreamSynchronize(st);

instead of:

kernel_1<<<gDim, bDim, 0, st>>>(......);
cudaStreamSynchronize(st);

kernel_2<<<gDim, bDim, 0, st>>>(......);
cudaStreamSynchronize(st);

(3) Use Peer-to-Peer communication.

In most cases, Peer-to-Peer communication among devices should be faster than using the host as an intermediary, but it is not an “absolute truth” (you can refer to this post).
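For illustration, here is a minimal sketch of the Peer-to-Peer path (the device IDs, buffers, and sizes are placeholders); whether it actually beats staging through the host is exactly what you should measure:

// A minimal sketch of a device-to-device copy that uses Peer-to-Peer when the
// hardware supports it; d_dst/d_src and nbytes are placeholders.
#include <cuda_runtime.h>
#include <cstddef>

void copy_between_devices(void *d_dst, int dst_dev,
                          const void *d_src, int src_dev, size_t nbytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);

    if (can_access) {
        // Enable direct access to src_dev from dst_dev's context
        // (returns an error that can be ignored if already enabled).
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);
    }

    // With P2P enabled the copy can go directly over PCIe/NVLink;
    // otherwise it is staged through host memory.
    cudaMemcpyPeer(d_dst, dst_dev, d_src, src_dev, nbytes);
}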

These tips are just my own ideas, and you should test them and pick the appropriate ones for your own application environment.

An experiment about OpenMP parallel loop

From my testing, OpenMP will launch a number of threads equal to the number of “virtual CPUs”, though this is not 100% guaranteed. Today I did a test on whether the loop nesting level affects OpenMP performance.
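A quick way to check the actual team size on a given machine (separate from the benchmark below) is to print it from inside a parallel region:

/* A quick check, separate from the benchmark: print how many threads
 * OpenMP launches by default. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP launched %d threads\n", omp_get_num_threads());
    }
    return 0;
}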

Given that the number of “virtual CPUs” is 104 on my system, I define the following constants and variables:

#define CPU_NUM (104)
#define LOOP_NUM (100)
#define ARRAY_SIZE (CPU_NUM * LOOP_NUM)

double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];  

(1) Just one-level loop:

#pragma omp parallel
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        func(a, b, c, i);
    }

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 7.208773
Time consumed is 7.080540
Time consumed is 7.643123
Time consumed is 7.377163
Time consumed is 7.418053
Time consumed is 7.226235
Time consumed is 7.887611
Time consumed is 7.200167
Time consumed is 7.264515
Time consumed is 7.140937

(2) Use two-level loop:

for (int i = 0; i < LOOP_NUM; i++)
{
    #pragma omp parallel
        for (int j = 0; j < CPU_NUM; j++)
        {
            func(a, b, c, i * CPU_NUM + j);
        }
}

Execute 10 times consecutively:

$ cc -O2 -fopenmp parallel.c
$ ./a.out
Time consumed is 8.333529
Time consumed is 8.164226
Time consumed is 9.705631
Time consumed is 8.695201
Time consumed is 8.972555
Time consumed is 8.126084
Time consumed is 8.286818
Time consumed is 8.162565
Time consumed is 7.884917
Time consumed is 8.073982

At least in this test, the one-level loop has better performance. If you are interested, the source code is here.