These notes are excerpted from Professional CUDA C Programming:
The GPU architecture is built around a scalable array of Streaming Multiprocessors (SMs). GPU hardware parallelism is achieved through the replication of this architectural building block.
Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.
A GPU contains multiple Streaming Multiprocessors, and each Streaming Multiprocessor in turn contains multiple cores. A Streaming Multiprocessor supports the concurrent execution of many threads.
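This block-to-SM mapping can be observed at runtime: a kernel can read the ID of the SM it is running on through the %smid PTX special register. Below is a minimal sketch; the kernel name whichSM and the <<<8, 32>>> launch configuration are arbitrary illustrative choices, and %smid is for inspection only, since the PTX ISA notes its value may change if threads are rescheduled. With more blocks than SMs, several blocks will typically report the same SM.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each block prints the ID of the SM it was scheduled on.
// %smid is a PTX special register meant for inspection,
// not for program logic.
__global__ void whichSM(void)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, smid);
}

int main(void)
{
    whichSM<<<8, 32>>>();     // 8 blocks of 32 threads each
    cudaDeviceSynchronize();  // wait so the device printf output is flushed
    return 0;
}
```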
A thread block is scheduled on only one SM. Once a thread block is scheduled on an SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. The following figure (not reproduced in these notes) illustrates the corresponding components from the logical view and hardware view of CUDA programming:
A block can be scheduled onto only one Streaming Multiprocessor. A single Streaming Multiprocessor can run multiple blocks at the same time.
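How many blocks can be resident on one SM at the same time depends on the kernel's resource demands (registers, shared memory) and its block size, and can be queried with the runtime occupancy API. In the sketch below, dummyKernel and the block size of 256 are placeholder assumptions; the printed count varies by kernel and GPU.

```
#include <cstdio>
#include <cuda_runtime.h>

// A stand-in kernel; real occupancy depends on the actual kernel's
// register and shared-memory usage.
__global__ void dummyKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main(void)
{
    int blockSize = 256;  // threads per block (illustrative choice)
    int numBlocks = 0;    // max resident blocks per SM for this kernel

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, dummyKernel, blockSize, /*dynamicSMemSize=*/0);

    printf("Up to %d blocks of %d threads can be resident on one SM\n",
           numBlocks, blockSize);
    return 0;
}
```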
Is the shared memory of each core dedicated to one thread? For example, if each core has 96 KB of shared memory, can each thread also use 96 KB?
I looked at the deviceQuery output: 96 KB of shared memory per core, Total amount of shared memory per block: 49152 bytes, Maximum number of threads per block: 1024. 96 * 1024 / 2 = 49152. I don't understand why it is divided by 2. Is one piece of shared memory shared between two threads?
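For reference, the numbers that deviceQuery prints come from cudaGetDeviceProperties; a minimal sketch that queries device 0 and prints the fields in question. Note that the runtime reports shared memory per multiprocessor and shared memory per block as two separate limits.

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:                  %s\n", prop.name);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```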
Thanks.