In this post, I will introduce some tips that can improve the performance of CUDA programs:
(1) Use a “pool” to cache resources.
A “pool” is a common data structure for boosting performance (you can refer to my other post, which introduces the “pool” specifically). From my experience, using a memory pool to avoid frequently allocating/freeing CUDA memory is a very effective trick. The other resource I want to cache is the CUDA stream. Yes, since CUDA 7 you can use the --default-stream per-thread compile option to give every host thread its own default stream, but if you want to use multiple streams in one thread, a “stream” pool may be a good choice.
P.S., this is my implementation of a memory pool.
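As an illustration of the stream-pool idea, here is a minimal sketch: streams are created lazily and recycled, so hot paths never pay for cudaStreamCreate/cudaStreamDestroy. The class and member names are my own hypothetical choices, not a real library API:

```cuda
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

// Hypothetical stream pool sketch: hand out idle streams, recycle them
// on release, and only create a new stream when the pool is empty.
class CudaStreamPool {
public:
    cudaStream_t acquire() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (!idle_.empty()) {               // fast path: reuse a cached stream
            cudaStream_t s = idle_.back();
            idle_.pop_back();
            return s;
        }
        cudaStream_t s;
        cudaStreamCreate(&s);               // slow path: real creation
        return s;
    }
    void release(cudaStream_t s) {
        std::lock_guard<std::mutex> lock(mtx_);
        idle_.push_back(s);                 // cache instead of destroying
    }
    ~CudaStreamPool() {
        for (cudaStream_t s : idle_) cudaStreamDestroy(s);
    }
private:
    std::mutex mtx_;
    std::vector<cudaStream_t> idle_;
};
```

A caller would acquire a stream per task, launch work on it, synchronize, and release it back to the pool.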
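The linked implementation is the real one; purely as a sketch of the idea, a minimal memory pool might look like the following, assuming blocks are only reused for requests of exactly the same size (class and member names are hypothetical):

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical memory pool sketch: freed blocks are cached per size
// instead of being returned to the driver with cudaFree().
class CudaMemoryPool {
public:
    void* acquire(size_t bytes) {
        std::lock_guard<std::mutex> lock(mtx_);
        std::vector<void*>& bucket = free_blocks_[bytes];
        if (!bucket.empty()) {              // fast path: reuse a cached block
            void* p = bucket.back();
            bucket.pop_back();
            return p;
        }
        void* p = nullptr;
        cudaMalloc(&p, bytes);              // slow path: real allocation
        return p;
    }
    void release(void* p, size_t bytes) {
        std::lock_guard<std::mutex> lock(mtx_);
        free_blocks_[bytes].push_back(p);   // cache instead of cudaFree()
    }
    ~CudaMemoryPool() {
        for (auto& kv : free_blocks_)
            for (void* p : kv.second) cudaFree(p);
    }
private:
    std::mutex mtx_;
    std::map<size_t, std::vector<void*>> free_blocks_;
};
```

A real pool would also need error checking and perhaps size-class rounding, but the core trick is just replacing cudaMalloc/cudaFree on the hot path with container operations.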
(2) Batch processing in a stream.
The point of this tip is to reduce how often the stream is synchronized. Kernels issued to the same stream already execute in order, so an intermediate synchronization is redundant and only stalls the host. I.e.:
kernel_1<<<grid, block, 0, st>>>();
kernel_2<<<grid, block, 0, st>>>();
cudaStreamSynchronize(st);
instead of:
kernel_1<<<grid, block, 0, st>>>();
cudaStreamSynchronize(st);
kernel_2<<<grid, block, 0, st>>>();
cudaStreamSynchronize(st);
(3) Use Peer-to-Peer communication.
In most cases, Peer-to-Peer communication between devices should be faster than using the host as an intermediary, but this is not an “absolute truth” (you can refer to this post).
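A minimal sketch of enabling and using P2P with the CUDA runtime API follows; it assumes two devices, 0 and 1, and omits error checking. Note that cudaMemcpyPeer works either way: it copies directly between devices when P2P access is enabled, and otherwise falls back to staging through host memory.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    // Check whether device 0 can address device 1's memory directly.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(/*peerDevice=*/1, /*flags=*/0);
    }

    const size_t bytes = 1 << 20;
    float *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Device-to-device copy: direct over the interconnect when P2P is
    // enabled, otherwise staged through the host by the runtime.
    cudaMemcpyPeer(d1, /*dstDevice=*/1, d0, /*srcDevice=*/0, bytes);

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    return 0;
}
```

Whether the direct path actually wins depends on the topology (NVLink vs. PCIe, same vs. different root complex), which is why it is worth benchmarking on your own hardware.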
These tips are just my own ideas; you should test them and pick the ones appropriate for your own application environment.