CUDA | Nan Xiao's Blog

Fix “identifier “__builtin_is_constant_evaluated” is undefined” error on Arch Linux

The Arch Linux has shipped gcc 9, but the newest CUDA V10.1.168 still only supports gcc 8. So I met following error after upgrading Arch Linux:

......
/usr/include/c++/9.1.0/bits/stl_function.h(437): error: identifier "__builtin_is_constant_evaluated" is undefined
......

Unfortunately, gcc-8 package is not ready now. So I fell back to gcc-7:

SET(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -ccbin=gcc-7"}

Although the CUDA release note claims it already supports clang 8, I indeed bumped into some compile errors when using clang 8. So using gcc-7 is a feasible work-around for me.

Parallelization may cause GPU utilization become worse

Recently, I did an experiment about CUDA. The following is a simple dead-loop code:

......
while (1)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

Run it in single thread, the GPU utilization is ~80%, while in two threads, the utilization is reduced to ~60%; in three threads, the utilization is reduced to ~40%. I can’t comprehend this phenomenon, so posted topics in both stackoverflow and CUDA developer forum. Unfortunately, there was no response.

After some investigation, I found this post and know there is a kernel launch queue firstly. I modified the code and profile again (use nvprof instead of GUI):

......
for (int i = 0; i < 10000; i++)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

The following is the profile output of one, two and three threads:

==22209== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  17.577ms     10000  1.7570us  1.7270us  2.8800us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   97.83%  70.567ms     10000  7.0560us  4.4700us  13.296ms  cudaLaunchKernel
                    2.17%  1.5644ms     10000     156ns     119ns  15.779us  cudaGetLastError

==23662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  35.288ms     20000  1.7640us  1.7270us  12.704us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.09%  473.79ms     20000  23.689us  5.0040us  13.294ms  cudaLaunchKernel
                    0.91%  4.3564ms     20000     217ns     117ns  6.4690us  cudaGetLastError

==27597== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  52.587ms     30000  1.7520us  1.7270us  2.9440us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.23%  2.10159s     30000  70.053us  13.545us  13.778ms  cudaLaunchKernel
                    0.77%  16.328ms     30000     544ns     368ns  19.316us  cudaGetLastError

We can see the average execution time of cudaLaunchKernel scales up, so it manifests there is bottle neck in the kernel launch queue when running more threads.

Modifying memory pool helps me find a hidden bug

My project has a CUDA memory pool which uses C++‘s std::queue. Allocating from the head of queue:

ptr = q.front();
q.pop();

While freeing memory insert it into the tail of queue:

q.push(ptr);

I changed the implementation from std::queue to std::deque. Both allocating and freeing all occur in the front of queue:

ptr = q.front();
q.pop_front();
......
q.push_front(ptr);

This modification helps me find a hidden bug which is releasing memory early. In origin code, the memory is inserted at the end of queue. So there is a interval between it is reused by other threads and current thread, and the work can still be done correctly as long as it is not reused by others. But after using std::deque, the memory is immediately used by other threads, which disclose the bug.

Beware of synchronizing steam when using “default-stream per-thread” in CUDA

Yesterday, I refactored a project through adding”--default-stream per-thread” option to improve its performance. Unfortunately, program will crash in cudaMemcpy:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
[Current thread is 1 (Thread 0x7f5620fa1700 (LWP 31206))]
(gdb) bt
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
#1  0x00007f570d45ffef in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f570d3bff90 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f570d3198d5 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f570d319da7 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f570d21d665 in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f570d21de08 in ?? () from /usr/lib/libcuda.so.1
#7  0x00007f570d352455 in cuMemcpy_ptds () from /usr/lib/libcuda.so.1
#8  0x00007f570ee1b0f9 in cudart::driverHelper::memcpyDispatch(void*, void const*, unsigned long, cudaMemcpyKind, bool) ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#9  0x00007f570ede70f9 in cudart::cudaApiMemcpy_ptds(void*, void const*, unsigned long, cudaMemcpyKind) () from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#10 0x00007f570ee2772b in cudaMemcpy_ptds ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so  
......

After reading GPU Pro Tip: CUDA 7 Streams Simplify Concurrency and How to Overlap Data Transfers in CUDA C/C++ carefully, I found the root cause. Because in my program, the CUDA memory is allocated through cudaMalloc (not unified memory), I also need synchronizing stream, like this:

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyDefault);  
cudaStreamSynchronize(0);

Small tips of optimizing CUDA programs

In this post, I will introduce some tips which can improve CUDA programs’ performance:

(1) Use “pool” to cache resources.

“Pool” is a common data structure which can be used in boosting performance (You can refer my another post which introduces “pool” specially). From my experience, using memory pool to avoid allocating/freeing CUDA memory frequently is a very effective trick. The other resource I want to cache is CUDAstream. Yes, since CUDA 7, you can use --default-stream per-thread compile option to enable a “regular” stream for every host thread, but if you want to use multiple streams in one thread, a “stream” pool may be a choice.

P.S., this is my implementation of memory pool.

(2) Batch processing in stream.

The effect of tip is to reduce synchronizing stream. I.e.:

kernel_1< , , , st>();

kernel_2< , , , st>();

cudaStreamSynchronize(st);

instead of:

kernel_1< , , , st>();
cudaStreamSynchronize(st);

kernel_2< , , , st>();
cudaStreamSynchronize(st);

(3) Use Peer-to-Peer communication.

In most cases, the Peer-to-Peer communication among devices should be faster than using host as an agent, but it is not “absolute truth” (You can refer this post).

These tips are just my own ideas, and you should test and pick appropriate ones for your own application environment.