Today, I wrote a simple test to verify whether CUDA Peer-to-Peer (P2P) memory copy is always faster than transferring through the CPU. At least on my platform, it is not:
Without P2P, you can see that the CPU utilization ratio is very high: 86.7%.
With P2P, CPU utilization drops to only 1.3%, but the bandwidth falls behind by about 1.6GB/s:
The test file is here.
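
For readers who cannot open the test file, here is a minimal sketch of how such a comparison could be written. It is not the original test: the buffer size, the iteration count, the use of devices 0 and 1, and the helper name `measureGBps` are all assumptions for illustration. It times a copy staged through a pinned host buffer, then the same copy via `cudaMemcpyPeer` with peer access enabled.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: copies `bytes` from GPU 0 to GPU 1 `iters` times
// and returns the average bandwidth in GB/s (wall-clock time on the host).
static double measureGBps(void *dst, const void *src, void *host,
                          size_t bytes, int iters, bool usePeerCopy) {
    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        if (usePeerCopy) {
            // Device-to-device copy: dst lives on GPU 1, src on GPU 0.
            CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));
        } else {
            // Staged copy: GPU 0 -> pinned host buffer -> GPU 1.
            CHECK(cudaSetDevice(0));
            CHECK(cudaMemcpy(host, src, bytes, cudaMemcpyDeviceToHost));
            CHECK(cudaSetDevice(1));
            CHECK(cudaMemcpy(dst, host, bytes, cudaMemcpyHostToDevice));
        }
    }
    // Make sure all copies have finished on both devices before stopping.
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceSynchronize());
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;
    return (double)bytes * iters / elapsed.count() / 1e9;
}

int main() {
    const size_t bytes = (size_t)256 << 20;  // 256 MiB per copy (assumption)
    const int iters = 20;                    // number of copies (assumption)

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    void *src = nullptr, *dst = nullptr, *host = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMallocHost(&host, bytes));  // pinned staging buffer

    printf("Staged through host: %.2f GB/s\n",
           measureGBps(dst, src, host, bytes, iters, false));

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        // Enable direct access in both directions before the peer copy.
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));
        printf("P2P enabled:         %.2f GB/s\n",
               measureGBps(dst, src, host, bytes, iters, true));
    } else {
        printf("P2P is not supported between GPU 0 and GPU 1.\n");
    }

    CHECK(cudaFreeHost(host));
    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return 0;
}
```

It should build with something like `nvcc p2p_test.cu -o p2p_test` (the file name is arbitrary). To compare CPU utilization as above, watch the process in a tool such as `top` while each phase runs. The numbers depend heavily on how the two GPUs are connected, so they will probably differ from mine.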