Recently, I did an experiment with CUDA. The following is a simple dead-loop snippet:
......
while (1)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}
......
When the program runs in a single thread, the GPU utilization is ~80%; with two threads it drops to ~60%, and with three threads to ~40%. I couldn't comprehend this phenomenon, so I posted questions on both Stack Overflow and the NVIDIA CUDA developer forum. Unfortunately, there was no response.
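To make the setup concrete, here is a minimal sketch of the experiment as I understand it (not the original code): each host thread creates its own stream and launches the kernel in a dead loop. The kernel body, buffer sizes, parameter names and launch configuration are all placeholders of mine; only the launch pattern and the kernel signature (taken from the profiler output further below) match the real program.

#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernel: the signature matches the profiler output below,
// the body and the parameter names are guesses.
__global__ void dgt_mul(unsigned int *a, unsigned int *b, unsigned int *c,
                        int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// One of these runs per host thread: allocate private buffers, create a
// private stream, and launch the kernel forever.
static void launch_loop(int n)
{
    unsigned int *a, *b, *c;
    cudaMalloc(&a, n * sizeof(unsigned int));
    cudaMalloc(&b, n * sizeof(unsigned int));
    cudaMalloc(&c, n * sizeof(unsigned int));

    cudaStream_t st;
    cudaStreamCreate(&st);                  // one stream per host thread
    dim3 gDim((n + 255) / 256), bDim(256);  // placeholder launch configuration

    while (1)
        dgt_mul<<<gDim, bDim, 0, st>>>(a, b, c, n, 0);
}

int main()
{
    const int n = 1 << 20;
    const int num_threads = 3;              // 1, 2 or 3 in the experiment
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++)
        workers.emplace_back(launch_loop, n);
    for (std::thread &w : workers)
        w.join();                           // never returns: the loops run forever
    return 0;
}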
After some investigation, I found this post and learned that kernel launches first go into a kernel launch queue. I modified the code and profiled it again (using nvprof instead of the GUI):
......
for (int i = 0; i < 10000; i++)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}
......
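For reference, the bounded loop can be plugged into the per-thread launcher from the sketch above like this (again only a sketch, not the original code; the final synchronization is my addition so that the process exits and nvprof, run as nvprof ./app, can print its summary):

// Modified worker from the earlier sketch: a bounded loop instead of while (1).
static void launch_loop(int n)
{
    unsigned int *a, *b, *c;
    cudaMalloc(&a, n * sizeof(unsigned int));
    cudaMalloc(&b, n * sizeof(unsigned int));
    cudaMalloc(&c, n * sizeof(unsigned int));

    cudaStream_t st;
    cudaStreamCreate(&st);
    dim3 gDim((n + 255) / 256), bDim(256);  // placeholder launch configuration

    for (int i = 0; i < 10000; i++)
        dgt_mul<<<gDim, bDim, 0, st>>>(a, b, c, n, 0);

    cudaStreamSynchronize(st);              // wait until all 10000 launches finish
    cudaStreamDestroy(st);
    cudaFree(a); cudaFree(b); cudaFree(c);
}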
The following is the profiling output for one, two, and three threads, respectively:
==22209== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 17.577ms 10000 1.7570us 1.7270us 2.8800us dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
API calls: 97.83% 70.567ms 10000 7.0560us 4.4700us 13.296ms cudaLaunchKernel
2.17% 1.5644ms 10000 156ns 119ns 15.779us cudaGetLastError
==23662== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 35.288ms 20000 1.7640us 1.7270us 12.704us dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
API calls: 99.09% 473.79ms 20000 23.689us 5.0040us 13.294ms cudaLaunchKernel
0.91% 4.3564ms 20000 217ns 117ns 6.4690us cudaGetLastError
==27597== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 52.587ms 30000 1.7520us 1.7270us 2.9440us dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
API calls: 99.23% 2.10159s 30000 70.053us 13.545us 13.778ms cudaLaunchKernel
0.77% 16.328ms 30000 544ns 368ns 19.316us cudaGetLastError
We can see that the average time of the cudaLaunchKernel API call scales up with the number of threads (7.0560us → 23.689us → 70.053us; the Calls column aggregates the 10000 launches from every thread), which shows that the kernel launch queue becomes a bottleneck when more host threads launch kernels concurrently.
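The same trend can be cross-checked without nvprof by timing the launch calls on the host. Below is a sketch under my own assumptions (a trivial stand-in kernel, an arbitrary launch configuration, and the thread count taken from the command line): each thread times only its asynchronous launch calls and prints the average cost per launch, which corresponds to the cudaLaunchKernel time reported above.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

__global__ void dummy_kernel() {}           // trivial stand-in for dgt_mul

static void timed_launches(int tid, int iters)
{
    cudaStream_t st;
    cudaStreamCreate(&st);

    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
        dummy_kernel<<<1, 64, 0, st>>>();   // each launch goes through cudaLaunchKernel
    auto end = std::chrono::steady_clock::now();  // launches are asynchronous, so this
                                                  // measures launch overhead only
    double us = std::chrono::duration<double, std::micro>(end - begin).count();
    std::printf("thread %d: %.3f us per launch\n", tid, us / iters);

    cudaStreamSynchronize(st);
    cudaStreamDestroy(st);
}

int main(int argc, char **argv)
{
    int num_threads = (argc > 1) ? std::atoi(argv[1]) : 1;  // e.g. 1, 2, 3
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++)
        workers.emplace_back(timed_launches, t, 10000);
    for (std::thread &w : workers)
        w.join();
    return 0;
}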