Modifying memory pool helps me find a hidden bug

My project has a CUDA memory pool which uses C++‘s std::queue. Allocating from the head of queue:

ptr = q.front();
q.pop(); 

While freeing memory insert it into the tail of queue:

q.push(ptr);  

I changed the implementation from std::queue to std::deque. Both allocating and freeing all occur in the front of queue:

ptr = q.front();
q.pop_front();
......
q.push_front(ptr);

This modification helps me find a hidden bug which is releasing memory early. In origin code, the memory is inserted at the end of queue. So there is a interval between it is reused by other threads and current thread, and the work can still be done correctly as long as it is not reused by others. But after using std::deque, the memory is immediately used by other threads, which disclose the bug.

 

Beware of synchronizing steam when using “default-stream per-thread” in CUDA

Yesterday, I refactored a project through adding”--default-stream per-thread” option to improve its performance. Unfortunately, program will crash in cudaMemcpy:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
[Current thread is 1 (Thread 0x7f5620fa1700 (LWP 31206))]
(gdb) bt
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
#1  0x00007f570d45ffef in ?? () from /usr/lib/libcuda.so.1
#2  0x00007f570d3bff90 in ?? () from /usr/lib/libcuda.so.1
#3  0x00007f570d3198d5 in ?? () from /usr/lib/libcuda.so.1
#4  0x00007f570d319da7 in ?? () from /usr/lib/libcuda.so.1
#5  0x00007f570d21d665 in ?? () from /usr/lib/libcuda.so.1
#6  0x00007f570d21de08 in ?? () from /usr/lib/libcuda.so.1
#7  0x00007f570d352455 in cuMemcpy_ptds () from /usr/lib/libcuda.so.1
#8  0x00007f570ee1b0f9 in cudart::driverHelper::memcpyDispatch(void*, void const*, unsigned long, cudaMemcpyKind, bool) ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#9  0x00007f570ede70f9 in cudart::cudaApiMemcpy_ptds(void*, void const*, unsigned long, cudaMemcpyKind) () from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#10 0x00007f570ee2772b in cudaMemcpy_ptds ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so  
......

After reading GPU Pro Tip: CUDA 7 Streams Simplify Concurrency and How to Overlap Data Transfers in CUDA C/C++ carefully, I found the root cause. Because in my program, the CUDA memory is allocated through cudaMalloc (not unified memory), I also need synchronizing stream, like this:

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyDefault);  
cudaStreamSynchronize(0);

 

Small tips of optimizing CUDA programs

In this post, I will introduce some tips which can improve CUDA programs’ performance:

(1) Use “pool” to cache resources.

“Pool” is a common data structure which can be used in boosting performance (You can refer my another post which introduces “pool” specially). From my experience, using memory pool to avoid allocating/freeing CUDA memory frequently is a very effective trick. The other resource I want to cache is CUDAstream. Yes, since CUDA 7, you can use --default-stream per-thread compile option to enable a “regular” stream for every host thread, but if you want to use multiple streams in one thread, a “stream” pool may be a choice.

P.S., this is my implementation of memory pool.

(2) Batch processing in stream.

The effect of tip is to reduce synchronizing stream. I.e.:

kernel_1< , , , st>();

kernel_2< , , , st>();

cudaStreamSynchronize(st);

instead of:

kernel_1< , , , st>();
cudaStreamSynchronize(st);

kernel_2< , , , st>();
cudaStreamSynchronize(st);  

(3) Use Peer-to-Peer communication.

In most cases, the Peer-to-Peer communication among devices should be faster than using host as an agent, but it is not “absolute truth” (You can refer this post).

These tips are just my own ideas, and you should test and pick appropriate ones for your own application environment.

Fix cudart::globalState::registerEntryFunction core dump issue

In the past 2 days, I was in trouble handling a program crash dump which uses CUDA:

(gdb) bt
#0  0x00007ffff73fc559 in cudart::globalState::registerEntryFunction(void**, char const*, char*, char const*, int, uint3*, uint3*, dim3*, dim3*, int*) () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#1  0x00007ffff73decbc in __cudaRegisterFunction () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#2  0x00007ffff73d9098 in __nv_cudaEntityRegisterCallback(void**) () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#3  0x00000000004283d6 in __cudaRegisterLinkedBinary(__fatBinC_Wrapper_t const*, void (*)(void**), void*) ()
#4  0x00000000004282e5 in __cudaRegisterLinkedBinary_66_tmpxft_00002dac_00000000_12_cuda_device_runtime_compute_70_cpp1_ii_8b1a5d37 ()
#5  0x00007ffff7de76ba in ?? () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7de77cb in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7dd7c6a in ?? () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000001 in ?? ()
#9  0x00007fffffffe7e3 in ?? ()
#10 0x0000000000000000 in ?? ()

Long story to short, my OS is Ubuntu 16.04.5 LTS, and CUDA version is:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

After tough trial and error, the solution is renaming one file from “.cpp” to “.cu“. But on another Arch Linux with the newest CUDA (V10.0.130), this problem doesn’t exist.

If you are interested in more information, please refer this topic.

Some tips of creating streams in using CUDA

Check following simple program:

cat test_stream.cu
int main()
{
        cudaStream_t st_00, st_01, st_11;

        cudaSetDevice(0);
        cudaStreamCreate(&st_00);
        cudaStreamCreate(&st_01);

        cudaSetDevice(1);
        cudaStreamCreate(&st_11);

        return 0;
}

In my system, device 0 is Nvidia Tesla-V100 GPU while device 1 is Tesla-P100. Use cuda-gdb to debug the program step by step:

(1)

Temporary breakpoint 1, main () at /home/xiaonan/temp/test_stream.cu:2
2       {
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:2
(cuda-gdb) n
5               cudaSetDevice(0);
(cuda-gdb)
[New Thread 0x7fffdffff700 (LWP 82532)]
6               cudaStreamCreate(&st_00);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6

When the program was launched, there is only main thread (Id is 1). Then after calling cudaSetDevice(0);, a new thread is spawned (Id is 2).

(2)

(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
(cuda-gdb) n
[New Thread 0x7fffdf7fe700 (LWP 82652)]
7               cudaStreamCreate(&st_01);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:7
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

On device 0, only first calling cudaStreamCreate will generate a new thread. Check used memory through nvidia-smi command:

$ nvidia-smi
Tue Nov 13 16:53:37 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

Create another stream, you will find the memory usage is the same as before.

(3)

(cuda-gdb) n
9               cudaSetDevice(1);
(cuda-gdb)
10              cudaStreamCreate(&st_11);
(cuda-gdb)
[New Thread 0x7fffdeffd700 (LWP 82993)]
12              return 0;
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:12
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6
  4    Thread 0x7fffdeffd700 (LWP 82993) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

Switch to another device and create stream; check memory usage now:

$ nvidia-smi
Tue Nov 13 16:54:24 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    30W / 250W |    291MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82365      C   /home/xiaonan/temp/build/test_stream         281MiB |
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

You will find different devices consume different memory for creating streams.