Yesterday, I refactored a project by adding the "--default-stream per-thread" option to improve its performance. Unfortunately, the program then crashed in cudaMemcpy:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
[Current thread is 1 (Thread 0x7f5620fa1700 (LWP 31206))]
(gdb) bt
#0 0x00007f570d3eb7f0 in ?? () from /usr/lib/libcuda.so.1
#1 0x00007f570d45ffef in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f570d3bff90 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f570d3198d5 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f570d319da7 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f570d21d665 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f570d21de08 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007f570d352455 in cuMemcpy_ptds () from /usr/lib/libcuda.so.1
#8 0x00007f570ee1b0f9 in cudart::driverHelper::memcpyDispatch(void*, void const*, unsigned long, cudaMemcpyKind, bool) ()
from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#9 0x00007f570ede70f9 in cudart::cudaApiMemcpy_ptds(void*, void const*, unsigned long, cudaMemcpyKind) () from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
#10 0x00007f570ee2772b in cudaMemcpy_ptds ()
from /home/xiaonan/DSI_cuRlib_v2.0/build/src/libtest.so
......
After carefully reading "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" and "How to Overlap Data Transfers in CUDA C/C++", I found the root cause. Because the CUDA memory in my program is allocated with cudaMalloc (not unified memory), I also need to synchronize the stream, like this:
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyDefault);
cudaStreamSynchronize(0);
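
For reference, here is a minimal, self-contained sketch of the same pattern. It is not the code from my project. The CHECK_CUDA macro and the roundtrip_ok function are hypothetical names made up for this illustration, and the sketch assumes the file is compiled with nvcc using the "--default-stream per-thread" option, in which case stream 0 refers to the calling thread's own default stream. Checking the return value of every runtime call may also help narrow down where a failure first appears.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): abort with a message
// when a CUDA runtime call returns an error.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical example function: copy n floats to the device and back,
// synchronizing the default stream after each copy as described above.
bool roundtrip_ok(size_t n) {
    std::vector<float> x(n, 1.0f);  // host input
    std::vector<float> y(n, 0.0f);  // host output
    float *d_x = nullptr;

    // Plain device memory from cudaMalloc, not unified memory.
    CHECK_CUDA(cudaMalloc(&d_x, n * sizeof(float)));

    // Host to device copy, followed by an explicit synchronization.
    CHECK_CUDA(cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyDefault));
    CHECK_CUDA(cudaStreamSynchronize(0));

    // Device to host copy, synchronized the same way.
    CHECK_CUDA(cudaMemcpy(y.data(), d_x, n * sizeof(float), cudaMemcpyDefault));
    CHECK_CUDA(cudaStreamSynchronize(0));

    CHECK_CUDA(cudaFree(d_x));
    return x == y;  // true when the data survived the round trip
}

int main() {
    const size_t N = 1 << 20;
    if (!roundtrip_ok(N)) {
        std::fprintf(stderr, "round trip mismatch\n");
        return EXIT_FAILURE;
    }
    std::printf("round trip completed\n");
    return EXIT_SUCCESS;
}
```

In a multi-threaded program, each host thread would call cudaStreamSynchronize(0) on its own per-thread default stream, so the call only waits for work issued by that thread.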