Beware of synchronizing stream when using “default-stream per-thread” in CUDA

Yesterday, I refactored a project by adding the “--default-stream per-thread” option to improve its performance. Unfortunately, the program then crashed in cudaMemcpy:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/
[Current thread is 1 (Thread 0x7f5620fa1700 (LWP 31206))]
(gdb) bt
#0  0x00007f570d3eb7f0 in ?? () from /usr/lib/
#1  0x00007f570d45ffef in ?? () from /usr/lib/
#2  0x00007f570d3bff90 in ?? () from /usr/lib/
#3  0x00007f570d3198d5 in ?? () from /usr/lib/
#4  0x00007f570d319da7 in ?? () from /usr/lib/
#5  0x00007f570d21d665 in ?? () from /usr/lib/
#6  0x00007f570d21de08 in ?? () from /usr/lib/
#7  0x00007f570d352455 in cuMemcpy_ptds () from /usr/lib/
#8  0x00007f570ee1b0f9 in cudart::driverHelper::memcpyDispatch(void*, void const*, unsigned long, cudaMemcpyKind, bool) ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/
#9  0x00007f570ede70f9 in cudart::cudaApiMemcpy_ptds(void*, void const*, unsigned long, cudaMemcpyKind) () from /home/xiaonan/DSI_cuRlib_v2.0/build/src/
#10 0x00007f570ee2772b in cudaMemcpy_ptds ()
   from /home/xiaonan/DSI_cuRlib_v2.0/build/src/  

After carefully reading “GPU Pro Tip: CUDA 7 Streams Simplify Concurrency” and “How to Overlap Data Transfers in CUDA C/C++”, I found the root cause. Because the CUDA memory in my program is allocated through cudaMalloc (not unified memory), I also need to synchronize the stream, like this:

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyDefault);  
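The line above shows only the copy, so here is a minimal, self-contained sketch of what "copy, then synchronize the stream" could look like. It assumes the synchronization is done with cudaStreamSynchronize on cudaStreamPerThread, the handle that names the calling thread's default stream. The helper name copyToDeviceAndSync, the buffer size, and the build command in the first comment are illustrative assumptions, not taken from the original project.

```cuda
// Assumed build command: nvcc --default-stream per-thread example.cu -o example
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: copy `bytes` bytes to the device, then block until all
// work queued on the calling thread's default stream has finished.
static cudaError_t copyToDeviceAndSync(void *dst, const void *src, size_t bytes)
{
    cudaError_t err = cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
    if (err != cudaSuccess) {
        return err;
    }
    // Assumption: cudaStreamPerThread is the stream to wait on.
    return cudaStreamSynchronize(cudaStreamPerThread);
}

int main()
{
    const int N = 1 << 20;  // illustrative size

    // Host buffer allocated with plain malloc, filled with a known value.
    float *x = static_cast<float *>(malloc(N * sizeof(float)));
    if (x == nullptr) {
        return EXIT_FAILURE;
    }
    for (int i = 0; i < N; ++i) {
        x[i] = 1.0f;
    }

    // Device buffer allocated with cudaMalloc (not unified memory).
    float *d_x = nullptr;
    cudaError_t err = cudaMalloc(&d_x, N * sizeof(float));
    if (err == cudaSuccess) {
        err = copyToDeviceAndSync(d_x, x, N * sizeof(float));
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_x);  // safe even if d_x is still nullptr
    free(x);
    return err == cudaSuccess ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Checking the return value of every runtime call, as this sketch does, can also make a failure like the one above easier to locate, since an error code from cudaMalloc or cudaMemcpy is reported instead of being silently ignored.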

