CUDA Programming Notes (10): Streaming Multiprocessors

These notes are excerpted from Professional CUDA C Programming.

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs). GPU hardware parallelism is achieved through the replication of this architectural building block.
Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.

A GPU contains multiple Streaming Multiprocessors, and each Streaming Multiprocessor in turn contains multiple cores. Streaming Multiprocessors support concurrent execution of many threads.
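To make the "multiple SMs, each running many threads" picture concrete, the following is a minimal sketch that asks the CUDA runtime how many SMs a device has and how many threads each SM can keep resident. The helper name `max_resident_threads` and the choice of device 0 are assumptions made for illustration; they are not from the book.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the book): upper bound on the number of
// threads that can be resident on the whole device at once, computed as
// (number of SMs) x (maximum resident threads per SM).
// Returns -1 if the device properties cannot be queried.
long long max_resident_threads(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return (long long)prop.multiProcessorCount *
           (long long)prop.maxThreadsPerMultiProcessor;
}

int main() {
    const int device = 0;  // assumption: inspect the first GPU
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device:                  %s\n", prop.name);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads:    %lld\n", max_resident_threads(device));
    return 0;
}
```

The numbers printed depend entirely on the GPU in the machine, and the product is only an upper bound: how many threads are actually resident also depends on each kernel's register and shared memory usage.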

A thread block is scheduled on only one SM. Once a thread block is scheduled on an SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. The following figure illustrates the corresponding components from the logical view and hardware view of CUDA programming:

A block can be scheduled onto only one Streaming Multiprocessor. A Streaming Multiprocessor can run multiple blocks at the same time.
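The block-to-SM assignment can be observed with a small experiment. The sketch below is an illustration, not the book's code: each thread reads the PTX special register `%smid` and the host checks that every thread of a block reported the same SM. The names `current_sm_id`, `record_sm_ids`, and `blocks_stay_on_one_sm`, and the launch configuration of 8 blocks of 128 threads, are assumptions. The PTX documentation describes `%smid` as a volatile value that could change if a thread is rescheduled after preemption, so treat the check as an observation under ordinary conditions rather than a guarantee.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the ID of the SM on which the
// calling thread is executing at the moment of the read.
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Each thread records the SM it is running on.
__global__ void record_sm_ids(unsigned int *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = current_sm_id();
}

// Hypothetical helper: launches the kernel and returns true if, for every
// block, all of its threads reported the same SM ID.
bool blocks_stay_on_one_sm(int numBlocks, int threadsPerBlock,
                           bool verbose = false) {
    size_t n = (size_t)numBlocks * threadsPerBlock;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess) {
        return false;
    }

    record_sm_ids<<<numBlocks, threadsPerBlock>>>(d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<unsigned int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        return false;
    }

    for (int b = 0; b < numBlocks; ++b) {
        unsigned int first = h[(size_t)b * threadsPerBlock];
        for (int t = 1; t < threadsPerBlock; ++t) {
            if (h[(size_t)b * threadsPerBlock + t] != first) {
                return false;  // threads of one block saw different SMs
            }
        }
        if (verbose) {
            printf("block %d ran on SM %u\n", b, first);
        }
    }
    return true;
}

int main() {
    bool ok = blocks_stay_on_one_sm(8, 128, /*verbose=*/true);
    printf("every block stayed on a single SM: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

Launching more blocks than the device has SMs makes some SM IDs repeat in the output. That shows several blocks being assigned to the same SM, although this experiment alone does not show whether they were resident at the same time.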

[Figure: the logical view and the hardware view of CUDA programming]

1 thought on "CUDA Programming Notes (10): Streaming Multiprocessors"

  1. Is each core's shared memory given to a single thread? For example, if each core has 96k of shared memory, can each thread also use 96k?
    I checked deviceQuery: 96k shared memory per core, Total amount of shared memory per block 49152 bytes, Maximum number of threads per block 1024. 96 * 1024 / 2 = 49152. I don't understand why it is divided by 2. Is one shared memory being shared between two threads?
    Thanks.
