Tuning performance is harder than debugging bugs

In the past week, I focused on resolving an application performance issue, i.e., trying to pinpoint why the code didn't run as fast as I expected. Once more, I am convinced that tuning performance is indeed harder than debugging bugs.

I have more than 10 years of experience in software programming, and in the past 4 years I have spent roughly 20% of my working time on performance tuning: mostly applications, sometimes the whole system. As for debugging bugs, if the bug can always be reproduced, it should not be hard to find the root cause. Please notice, I never said it should be easy to fix: think of some huge technical debt. If for some reason the bug is not 100% reproducible, e.g., the notorious multi-threading bug, you can still resort to methods that increase the reproduction ratio and help you pinpoint the culprit: add more logs, change the execution timing, and so on. However, when it comes to performance issues, the problem is that “you don't know what you don't know”.

In most cases, as a software engineer, you don't need to keep a watchful eye on the hardware, the Operating System, the compiler, etc.; you just need to concentrate on your own code. But to make your program performant, analyzing your own code alone is not enough; you need to find answers to questions like these: Why does the program run slower on this more powerful platform? Why does profiling make the program run even faster? Why doesn't multi-threading give a big performance boost? The more you dive in, the more you find you don't know: CPU architectures, the mechanisms behind the Operating System, tons of compiler options, and so forth. Even a small catch can make your program hiccup! Furthermore, Stack Overflow is not a good place to call for help with performance issues, so most of the time the only person you can rely on is yourself.

Nonetheless, this is also where the fun of performance tuning lies: after days or even weeks of endeavor, I finally find the bottleneck. Not only is that an exciting experience, but every time I also learn something along the way that I didn't know at all before. Performance tuning forces you to get the whole picture of the computer system, not only the code you write. This broadens your view and brings you closer to the essence of computer science.

Performance tuning is harder than debugging bugs, but it also pays off. Enjoy it!

A performance issue caused by NUMA

The essence of NUMA is that accessing local memory is fast while accessing remote memory is slow, and I was bitten by it today.

The original code is like this:

/* Every thread creates one partition of a big vector and processes it */
#pragma omp parallel for
for (...)
{
    ......
    vector<> local_partition = create_big_vector_partition();
    /* processing the vector partition*/
    ......
}

I tried creating the big vector outside the OpenMP block, so that every thread just grabs a partition and processes it:

vector<> big_vector = create_big_vector();

#pragma omp parallel for
for (...)
{
    ......
    vector<>& local_partition = get_partition(big_vector);
    /* processing the vector partition*/
    ......
}

I measured only the execution time of the OpenMP block:

#pragma omp parallel for
for (...)
{
    ......
}
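The post does not show the measurement code itself; a minimal, self-contained sketch of how such a region can be timed with OpenMP's omp_get_wtime() could look like this (the data and loop body here are placeholders, not the original workload):

/* Timing sketch: wrap the parallel region with omp_get_wtime() to measure
   its wall-clock time. Compile with -fopenmp. */
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> data(1 << 24, 1.0);          /* placeholder data */

    double start = omp_get_wtime();

    #pragma omp parallel for
    for (long long i = 0; i < (long long)data.size(); ++i)
    {
        data[i] *= 2.0;                              /* placeholder work */
    }

    double elapsed = omp_get_wtime() - start;
    printf("OpenMP block took %.6f seconds\n", elapsed);
    return 0;
}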

Although in the original code every thread needs to create its own vector partition, it is still faster than the modified code.

After some experiments and analysis, numastat helped me pinpoint the problem:

$ numastat
                           node0           node1
numa_hit              6259740856      7850720376
numa_miss              120468683       128900132
numa_foreign           128900132       120468683
interleave_hit             32881           32290
local_node            6259609322      7850520401
other_node             120600217       129100106

In the original solution, every thread creates its vector partition in the local memory of the CPU it runs on. In the second case, however, the threads often need to access memory on a remote node, and this overhead outweighs the cost of creating the vector partition locally.
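For what it's worth, one way to keep the single-big-vector design while avoiding remote accesses is to rely on Linux's first-touch policy: a page is placed on the NUMA node of the thread that first writes to it. The following is my own minimal sketch, not code from the post (the array and workload are placeholders):

/* First-touch sketch: allocate without initializing, then let each thread
   touch (initialize) exactly the chunk it will later process, so those pages
   land on that thread's local NUMA node. Compile with -fopenmp. */
#include <omp.h>
#include <memory>

int main()
{
    const long long n = 1LL << 26;

    /* new double[n] leaves the elements uninitialized, so the pages are not
       faulted in yet (std::vector would zero-initialize them in the main
       thread and defeat first-touch). */
    std::unique_ptr<double[]> big(new double[n]);

    /* First touch: the same static schedule is used here and below, so each
       thread initializes the chunk it will process later. */
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        big[i] = 0.0;

    /* Processing: each thread now mostly accesses pages on its local node. */
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        big[i] = big[i] * 2.0 + 1.0;

    return 0;
}

Alternatively, numactl can bind the CPUs and memory of the whole process to a single node, at the cost of using only part of the machine.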

Parallelization may cause GPU utilization to become worse

Recently, I did an experiment with CUDA. The following is a simple dead-loop snippet:

......
while (1)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

Running it in a single thread, the GPU utilization is ~80%; with two threads, the utilization drops to ~60%; with three threads, it drops to ~40%. I couldn't comprehend this phenomenon, so I posted topics on both Stack Overflow and the CUDA developer forum. Unfortunately, there was no response.

After some investigation, I found this post and learned that kernel launches are asynchronous and go into a kernel launch queue first. I modified the code to launch a fixed number of kernels and profiled it again (using nvprof instead of the GUI):

......
for (int i = 0; i < 10000; i++)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

The following are the profiling results for one, two, and three threads:

==22209== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  17.577ms     10000  1.7570us  1.7270us  2.8800us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   97.83%  70.567ms     10000  7.0560us  4.4700us  13.296ms  cudaLaunchKernel
                    2.17%  1.5644ms     10000     156ns     119ns  15.779us  cudaGetLastError

==23662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  35.288ms     20000  1.7640us  1.7270us  12.704us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.09%  473.79ms     20000  23.689us  5.0040us  13.294ms  cudaLaunchKernel
                    0.91%  4.3564ms     20000     217ns     117ns  6.4690us  cudaGetLastError

==27597== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  52.587ms     30000  1.7520us  1.7270us  2.9440us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.23%  2.10159s     30000  70.053us  13.545us  13.778ms  cudaLaunchKernel
                    0.77%  16.328ms     30000     544ns     368ns  19.316us  cudaGetLastError

We can see that the average execution time of cudaLaunchKernel scales up with the number of threads, which shows that the kernel launch queue becomes a bottleneck when more threads are launching kernels.

What did I learn from Practical Binary Analysis as a non-reverse-engineering engineer?

I spent the past two weeks reading Practical Binary Analysis. Since I am not a professional reverse engineer, I glossed over “Part III: Advanced Binary Analysis”, so I only read about half of the book. Even so, I still gained a lot:

(1) I know ELF files better. On *nix Operating Systems, ELF files are everywhere: executables, object files, shared libraries, and core dumps. “Chapter 2: The ELF format” gave me a clear explanation of the composition of ELF. E.g., I now know why some functions carry an “@plt” suffix when debugging with gdb.

(2) I mastered many tricks with GNU Binutils. GNU Binutils is a toolbox that provides versatile command line programs for analyzing ELF files, and it relies heavily on the BFD library. I also got some sense of how to use the BFD library to tweak ELF files.

(3) “Appendix A: A crash course on X86 assembly” is a good tutorial for refreshing your knowledge of the x86 architecture and assembly language.

(4) Others: e.g., I now understand how to use the LD_PRELOAD environment variable and the dynamic linking functions to manipulate shared libraries, as sketched below.
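To illustrate point (4), here is my own minimal sketch of the LD_PRELOAD trick (not code from the book; file and program names are hypothetical): a shared library that intercepts puts() and forwards to the real implementation obtained via dlsym(RTLD_NEXT, ...).

/* intercept.cpp (hypothetical name): intercept puts() and forward to the
   real libc implementation looked up with dlsym(RTLD_NEXT, ...). */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1                     /* needed for RTLD_NEXT on glibc */
#endif
#include <dlfcn.h>
#include <cstdio>

extern "C" int puts(const char* s)
{
    using puts_fn = int (*)(const char*);
    /* Resolve the next "puts" in the lookup order, i.e., the one in libc. */
    static puts_fn real_puts =
        reinterpret_cast<puts_fn>(dlsym(RTLD_NEXT, "puts"));
    fprintf(stderr, "[intercepted] %s\n", s);
    return real_puts(s);
}

/* Usage (hypothetical names):
 *   g++ -shared -fPIC -o libintercept.so intercept.cpp -ldl
 *   LD_PRELOAD=./libintercept.so ./some_program
 */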

All in all, if you are working on *nix (although this book is based on Linux, I think most of the knowledge also applies to other *nix systems), you should try to read this book. I promise it is not a waste of time and you will always learn something, believe me!

What you need may be only “pipelines + Unix commands”

I came across Taco Bell Programming recently, and I think this article is worth reading for every software engineer. The post mentions a scenario that you might consider solving with Hadoop, but for which xargs may be a simpler and better choice. This reminds me of a similar experience: last year a client wanted me to process a data file with 5 million records. After some investigation, no novel technology was needed; a concise awk script (less than 10 lines) worked like a charm! What surprised me more is that awk is just a single-threaded program, with no nifty concurrency involved.

The IT field never lacks “new” technologies: cloud computing, big data, high concurrency, etc. However, the thinking behind these “fancy” words may date back to the era when Unix arose. The Unix command line tools are an invaluable treasure. In many cases, picking the right components and gluing them together with pipes can satisfy your requirements perfectly. So instead of exhaustedly chasing state-of-the-art techniques, you may gain more by spending some time reviewing the Unix command line manuals.

BTW, if your data set can be handled by an awk script, it should not be called “big data”.