CUDA P2P is not guaranteed to be faster than staging through the host

Today I wrote a simple test to check whether a CUDA peer-to-peer (P2P) memory copy is always faster than a copy that goes through the CPU. On my platform, at least, it is not:

(1) With P2P disabled, CPU utilization is very high (86.7%) and the bandwidth is nearly 10.67GB/s.

(2) With P2P enabled, CPU utilization drops to only 1.3%, but the bandwidth falls to 9.00GB/s, about 1.67GB/s lower.
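
For readers who want to try a similar comparison on their own machine, here is a minimal sketch. It is not the code linked from this post, and it rests on a few assumptions: the machine has at least two GPUs, "P2P disabled" simply means that cudaDeviceEnablePeerAccess has not been called (so the driver stages cudaMemcpyPeer through host memory), and the 256 MiB buffer and 20 repetitions are arbitrary choices. The names CHECK, bandwidth_gb_per_s and time_peer_copy are hypothetical helpers made up for this illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Convert bytes moved and elapsed seconds to GB/s (1 GB = 1e9 bytes).
static double bandwidth_gb_per_s(size_t bytes, double seconds) {
    return static_cast<double>(bytes) / 1e9 / seconds;
}

// Wait until all pending work on devices 0 and 1 has finished.
static void sync_both_devices() {
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceSynchronize());
}

// Copy `bytes` from device 0 (src) to device 1 (dst) `reps` times and
// return the average bandwidth, measured with a host wall clock.
static double time_peer_copy(void* dst, const void* src, size_t bytes, int reps) {
    CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));  // untimed warm-up copy
    sync_both_devices();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));
    sync_both_devices();
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return bandwidth_gb_per_s(bytes * static_cast<size_t>(reps), seconds);
}

int main() {
    int device_count = 0;
    CHECK(cudaGetDeviceCount(&device_count));
    if (device_count < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    const size_t bytes = static_cast<size_t>(256) << 20;  // 256 MiB (assumption)
    const int reps = 20;                                   // arbitrary
    void* src = nullptr;
    void* dst = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // Case 1: peer access has not been enabled.
    printf("P2P disabled: %.2f GB/s\n", time_peer_copy(dst, src, bytes, reps));

    // Case 2: enable peer access in both directions, if the hardware allows it.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));
        printf("P2P enabled:  %.2f GB/s\n", time_peer_copy(dst, src, bytes, reps));
    } else {
        printf("Devices 0 and 1 do not support peer access.\n");
    }

    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(dst));
    return 0;
}
```

The sketch only reports bandwidth. To compare CPU utilization as well, watch the process with a tool such as top while it runs. Results will depend on the GPUs, the PCIe topology and the driver, so the numbers may not match mine.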

P.S. The full code is here.

 
