CUDA P2P is not guaranteed to be faster than staging through the host

Today I wrote a simple test to check whether a CUDA Peer-to-Peer (P2P) memory copy is always faster than a copy that goes through the CPU. On my platform, at least, it is not:

(1) With P2P disabled, CPU utilization is very high (86.7%) and the bandwidth is about 10.67 GB/s.

(2) With P2P enabled, CPU utilization drops to only 1.3%, but the bandwidth falls behind by about 1.67 GB/s, to 9.00 GB/s.
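For readers who want to try a similar comparison, here is a minimal sketch. It is not the code linked from this post. It assumes a machine with at least two GPUs (devices 0 and 1), and the buffer size and repetition count are arbitrary choices of mine. It relies on the fact that `cudaMemcpyPeer` works whether or not peer access is enabled: without peer access the runtime stages the copy through host memory, and with it the copy can go directly between the GPUs. It measures bandwidth only, so CPU utilization has to be watched separately (for example with `top`).

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Wait until all pending work on devices 0 and 1 has finished.
static void syncBoth() {
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceSynchronize());
}

// Copy `bytes` from device 0 to device 1 `reps` times; return GB/s.
static double measure(void* dst, const void* src, size_t bytes, int reps) {
    syncBoth();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));
    syncBoth();
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * reps / sec / 1e9;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per copy (arbitrary)
    const int reps = 100;            // arbitrary repetition count

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        printf("Need at least two GPUs.\n");
        return 0;
    }

    void *src = nullptr, *dst = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // Peer access is off by default, so this copy is staged via the host.
    printf("P2P disabled: %.2f GB/s\n", measure(dst, src, bytes, reps));

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        // Peer access is enabled per direction, from the current device.
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));
        printf("P2P enabled:  %.2f GB/s\n", measure(dst, src, bytes, reps));
    } else {
        printf("Devices 0 and 1 do not support peer access.\n");
    }

    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(dst));
    return 0;
}
```

Build it with something like `nvcc p2p_test.cu -o p2p_test` (the file name is hypothetical). Which path wins will likely depend on how the two GPUs are connected, so the numbers above should not be expected to reproduce on other hardware.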

P.S., the full code is here.
