Today I wrote a simple test to check whether CUDA peer-to-peer (P2P) memory copy is always faster than routing the transfer through the CPU. At least on my platform, it is not:
(1) With P2P disabled, CPU utilization is very high (86.7%), and the bandwidth is nearly 10.67GB/s.
(2) With P2P enabled, CPU utilization drops to only 1.3%, but the bandwidth falls behind by about 1.6GB/s, to 9.00GB/s.
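For readers who want to try a similar comparison, here is a minimal sketch of such a test. It is not the code linked below. It assumes a machine with at least two GPUs, and the device IDs (0 and 1), the 256 MiB buffer size, the repetition count, and the helper names `bandwidth_gbps` and `time_copy` are all my own choices for illustration. It times `cudaMemcpyPeer` from device 0 to device 1 first with peer access off (the default), then with peer access on.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                         cudaGetErrorString(err_));                    \
            std::exit(1);                                              \
        }                                                              \
    } while (0)

// Hypothetical helper: bandwidth in GB/s, taking 1 GB = 1e9 bytes.
static double bandwidth_gbps(size_t bytes, double seconds) {
    return static_cast<double>(bytes) / 1e9 / seconds;
}

// Hypothetical helper: average wall-clock seconds per copy, device 0 -> device 1.
static double time_copy(void* dst, const void* src, size_t bytes, int reps) {
    // Drain both devices so earlier work does not leak into the measurement.
    CHECK(cudaSetDevice(0)); CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1)); CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));
    // cudaMemcpyPeer may return before the copy finishes, so wait on both devices.
    CHECK(cudaSetDevice(0)); CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1)); CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) { std::printf("Need at least two GPUs.\n"); return 0; }

    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary choice
    const int reps = 20;              // arbitrary choice
    void *d0 = nullptr, *d1 = nullptr;
    CHECK(cudaSetDevice(0)); CHECK(cudaMalloc(&d0, bytes));
    CHECK(cudaSetDevice(1)); CHECK(cudaMalloc(&d1, bytes));

    // Peer access is off by default, so this copy is staged through the host.
    double s = time_copy(d1, d0, bytes, reps);
    std::printf("P2P disabled: %.2f GB/s\n", bandwidth_gbps(bytes, s));

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        CHECK(cudaSetDevice(0)); CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1)); CHECK(cudaDeviceEnablePeerAccess(0, 0));
        s = time_copy(d1, d0, bytes, reps);
        std::printf("P2P enabled:  %.2f GB/s\n", bandwidth_gbps(bytes, s));
        CHECK(cudaSetDevice(0)); CHECK(cudaDeviceDisablePeerAccess(1));
        CHECK(cudaSetDevice(1)); CHECK(cudaDeviceDisablePeerAccess(0));
    } else {
        std::printf("P2P is not supported between devices 0 and 1.\n");
    }

    CHECK(cudaSetDevice(0)); CHECK(cudaFree(d0));
    CHECK(cudaSetDevice(1)); CHECK(cudaFree(d1));
    return 0;
}
```

Something like `nvcc -o p2p_test p2p_test.cu` should build it (the file name is arbitrary). The sketch does not measure CPU utilization itself; one way to observe it is to watch the process in a tool such as `top` while the copies run. Results will depend heavily on how the two GPUs are connected.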
P.S., the full code is here.