CUDA Programming Notes (17): Matrix transpose (shared memory)

The "Coalesced Transpose Via Shared Memory" section of An Efficient Matrix Transpose in CUDA C/C++ explains how to use shared memory to implement an efficient matrix transpose:

#define TILE_DIM   32   // each tile is TILE_DIM x TILE_DIM elements (32 x 32)
#define BLOCK_ROWS  8   // each block has TILE_DIM x BLOCK_ROWS threads (32 x 8)

__global__ void transposeCoalesced(float *odata, const float *idata)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  // read one 32x32 tile from idata into shared memory, 8 rows per iteration
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  // write the transposed tile to odata, again 8 rows per iteration
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
}

(1) idata and odata are contiguous buffers in memory, each representing a 1024x1024 matrix of float elements:

[Figure: memory layout of the idata and odata matrices]

(2) For the values of blockIdx and threadIdx, refer to the figure below:

[Figure: blockIdx and threadIdx values across the grid]

For the shared memory tile, refer to the figure below:

[Figure: the shared memory tile]

(3) In the following code:

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

Each block is 32x8 threads, so the loop runs 4 times (TILE_DIM / BLOCK_ROWS = 32 / 8) to copy one 32x32 block of the matrix into the shared-memory array tile. idata is read row by row, so the reads are coalesced.

[Figure: copying one 32x32 block into the shared memory tile]

(4) The hardest part to understand is the final section:

x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
y = blockIdx.x * TILE_DIM + threadIdx.y;

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];

Compare reading from idata with writing to odata:

......
tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];
......
odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];
......

You can see that the data written to odata is the transposed tile (rows become columns and columns become rows). The code that determines where the tile should be placed in odata is:

x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
y = blockIdx.x * TILE_DIM + threadIdx.y;

Suppose blockIdx.x is 31, blockIdx.y is 0, threadIdx.x is 1, and threadIdx.y is 2. Plugging these values into the code above gives:

x = 0 * 32 + 1;
y = 31 * 32 + 2;

As the figure below shows, the contents of the north-east corner are copied to the south-west corner:

[Figure: the north-east block is written to the south-west corner]
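
For completeness, here is a minimal host-side launch sketch, assuming the 1024x1024 matrix and the TILE_DIM = 32 / BLOCK_ROWS = 8 configuration described above; the buffer names d_idata and d_odata and the omitted error handling are illustrative, not part of the original post.

// Hypothetical launch configuration: a 32x32 grid of blocks, each with 32x8 threads,
// so every block transposes one 32x32 tile of the 1024x1024 matrix.
const int nx = 1024, ny = 1024;
dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);          // 32 x 8 threads per block
dim3 dimGrid(nx / TILE_DIM, ny / TILE_DIM, 1);   // 32 x 32 blocks

float *d_idata, *d_odata;
cudaMalloc(&d_idata, nx * ny * sizeof(float));
cudaMalloc(&d_odata, nx * ny * sizeof(float));
// ... copy the input matrix into d_idata ...

transposeCoalesced<<<dimGrid, dimBlock>>>(d_odata, d_idata);
cudaDeviceSynchronize();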

CUDA Programming Notes (16): Shared Memory

These notes are excerpted from Professional CUDA C Programming.

Global memory is large, on-board memory and is characterized by relatively high latencies. Shared memory is smaller, low-latency on-chip memory that offers much higher bandwidth than global memory. You can think of it as a program-managed cache. Shared memory is generally useful as:
➤ An intra-block thread communication channel
➤ A program-managed cache for global memory data
➤ Scratch pad memory for transforming data to improve global memory access patterns

Shared memory is partitioned among all resident thread blocks on an SM; therefore, shared memory is a critical resource that limits device parallelism. The more shared memory used by a kernel, the fewer possible concurrently active thread blocks.
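
To make the "program-managed cache" idea concrete, here is a hedged sketch (not from the book) of the two usual ways to declare shared memory: a statically sized array inside the kernel, and a dynamically sized array whose byte count is supplied as the third launch parameter. The kernel names and the fixed size of 64 are illustrative.

// Static shared memory: the size is fixed at compile time.
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];        // one 64-element tile per block
  int t = threadIdx.x;
  s[t] = d[t];                 // stage global data in shared memory
  __syncthreads();             // intra-block synchronization
  d[t] = s[n - t - 1];         // read it back in reversed order
}

// Dynamic shared memory: the size comes from the launch configuration.
__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[n - t - 1];
}

// Hypothetical launches for n == 64:
//   staticReverse<<<1, 64>>>(d_d, 64);
//   dynamicReverse<<<1, 64, 64 * sizeof(int)>>>(d_d, 64);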

 

CUDA Programming Notes (15): Memory Access Patterns

These notes are excerpted from Professional CUDA C Programming.

Global memory loads/stores are staged through caches, as shown in Figure 4-6. Global memory is a logical memory space that you can access from your kernel. All application data initially resides in DRAM, the physical device memory. Kernel memory requests are typically served between the device DRAM and SM on-chip memory using either 128-byte or 32-byte memory transactions.

All accesses to global memory go through the L2 cache. Many accesses also pass through the L1 cache, depending on the type of access and your GPU’s architecture. If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction. On architectures that allow the L1 cache to be used for global memory caching, the L1 cache can be explicitly enabled or disabled at compile time.

An L1 cache line is 128 bytes, and it maps to a 128-byte aligned segment in device memory. If each thread in a warp requests one 4-byte value, that results in 128 bytes of data per request, which maps perfectly to the cache line size and device memory segment size.

There are two characteristics of device memory accesses that you should strive for when optimizing your application:
➤ Aligned memory accesses
➤ Coalesced memory accesses


Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity being used to service the transaction (either 32 bytes for L2 cache or 128 bytes for L1 cache). Performing a misaligned load will cause wasted bandwidth.

Coalesced memory accesses occur when all 32 threads in a warp access a contiguous chunk of memory.


Memory store operations are relatively simple. The L1 cache is not used for store operations on either Fermi or Kepler GPUs; store operations are only cached in the L2 cache before being sent to device memory. Stores are performed at a 32-byte segment granularity. Memory transactions can be one, two, or four segments at a time. For example, if two addresses fall within the same 128-byte region but not within an aligned 64-byte region, one four-segment transaction will be issued (that is, issuing a single four-segment transaction performs better than issuing two one-segment transactions).
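
To illustrate the load patterns described above, here is a hedged sketch (the kernel and parameter names are illustrative): when consecutive threads read consecutive elements starting at an aligned address, a warp's requests are aligned and coalesced; adding an offset that is not a multiple of 32 makes the same access pattern misaligned and wastes bandwidth.

// Aligned and coalesced: consecutive threads read consecutive floats, so a warp's
// 32 x 4-byte requests map to a single 128-byte segment.
__global__ void copyAligned(float *out, const float *in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Misaligned: with an offset that is not a multiple of 32, a warp's requests
// straddle two 128-byte segments, so extra memory transactions are issued.
__global__ void copyOffset(float *out, const float *in, int n, int offset)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int k = i + offset;
  if (k < n) out[i] = in[k];
}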


CUDA Programming Notes (14): Zero-Copy Memory

These notes are excerpted from Professional CUDA C Programming.

In general, the host cannot directly access device variables, and the device cannot directly access host variables. There is one exception to this rule: zero-copy memory. Both the host and device can access zero-copy memory.

GPU threads can directly access zero-copy memory. There are several advantages to using zero-copy memory in CUDA kernels, such as:
➤ Leveraging host memory when there is insufficient device memory
➤ Avoiding explicit data transfer between the host and device
➤ Improving PCIe transfer rates
When using zero-copy memory to share data between the host and device, you must synchronize memory accesses across the host and device. Modifying data in zero-copy memory from both the host and device at the same time will result in undefined behavior.

There are two common categories of heterogeneous computing system architectures: Integrated and discrete.

In integrated architectures, CPUs and GPUs are fused onto a single die and physically share main memory. In this architecture, zero-copy memory is more likely to benefit both performance and programmability because no copies over the PCIe bus are necessary.

For discrete systems with devices connected to the host via PCIe bus, zero-copy memory is advantageous only in special cases.

Because the mapped pinned memory is shared between the host and device, you must synchronize memory accesses to avoid any potential data hazards caused by multiple threads accessing the same memory location without synchronization.

Be careful not to overuse zero-copy memory. Device kernels that read from zero-copy memory can be very slow due to its high latency.
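
A minimal sketch (not from the book) of allocating mapped pinned memory with the CUDA runtime; nbytes, someKernel, grid, and block are placeholders:

// Allow the device to map host allocations (required on some systems; must be
// called before any other CUDA call that creates a context).
cudaSetDeviceFlags(cudaDeviceMapHost);

// Allocate pinned host memory that is mapped into the device address space.
float *h_a;
cudaHostAlloc((void **)&h_a, nbytes, cudaHostAllocMapped);

// Obtain the device pointer that refers to the same zero-copy allocation.
float *d_a;
cudaHostGetDevicePointer((void **)&d_a, (void *)h_a, 0);

// The kernel accesses host memory directly; no explicit cudaMemcpy is needed,
// but the host must synchronize before touching h_a again.
someKernel<<<grid, block>>>(d_a);
cudaDeviceSynchronize();

cudaFreeHost(h_a);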

 

CUDA Programming Notes (13): Pinned Memory

These notes are excerpted from Professional CUDA C Programming.

Allocated host memory is by default pageable, that is, subject to page fault operations that move data in host virtual memory to different physical locations as directed by the operating system. Virtual memory offers the illusion of much more main memory than is physically available, just as the L1 cache offers the illusion of much more on-chip memory than is physically available.

The GPU cannot safely access data in pageable host memory because it has no control over when the host operating system may choose to physically move that data. When transferring data from pageable host memory to device memory, the CUDA driver first allocates temporary page-locked or pinned host memory, copies the source host data to pinned memory, and then transfers the data from pinned memory to device memory, as illustrated on the left side of Figure 4-4.

capture

The CUDA runtime allows you to directly allocate pinned host memory using:
cudaError_t cudaMallocHost(void **devPtr, size_t count);
This function allocates count bytes of host memory that is page-locked and accessible to the device. Since the pinned memory can be accessed directly by the device, it can be read and written with much higher bandwidth than pageable memory. However, allocating excessive amounts of pinned memory might degrade host system performance, since it reduces the amount of pageable memory available to the host system for storing virtual memory data.
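
A short sketch (not from the book) of the typical pinned-memory transfer pattern; the size and variable names are illustrative:

const size_t nbytes = 1 << 22;                 // 4 MB, arbitrary example size
float *h_pinned, *d_data;

cudaMallocHost((void **)&h_pinned, nbytes);    // page-locked host memory
cudaMalloc((void **)&d_data, nbytes);

// ... fill h_pinned on the host ...
cudaMemcpy(d_data, h_pinned, nbytes, cudaMemcpyHostToDevice);  // no staging copy needed
// ... launch kernels, copy results back ...

cudaFree(d_data);
cudaFreeHost(h_pinned);                        // pinned memory is freed with cudaFreeHost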

 

A Brief Introduction to nvcc

nvcc is "The main wrapper for the NVIDIA CUDA Compiler suite. Used to compile and link both host and gpu code." You can check its version with the --version option:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

The compute capabilities supported by different CUDA versions are summarized here:

CUDA VERSION      Min CC   Deprecated CC   Default CC
5.5 (and prior)   1.0      N/A             1.0
6.0               1.0      1.0             1.0
6.5               1.1      1.x             2.0
7.0 / 7.5         2.0      N/A             2.0
8.0               2.0      2.x             2.0

Min CC = minimum compute capability that can be specified to nvcc

Deprecated CC = If you specify this CC, you will get a deprecation message, but compile should still proceed.

Default CC = The architecture that will be targeted if no `-arch` or `-gencode` switches are used

According to the CUDA naming convention, GPUs are named sm_xy, where x denotes the GPU generation number, and y the version in that generation.

This situation is different for GPUs, because NVIDIA cannot guarantee binary compatibility without sacrificing regular opportunities for GPU improvements. Rather, as is already conventional in the graphics programming domain, nvcc relies on a two stage compilation model for ensuring application compatibility with future GPU generations.

nvcc can guarantee application compatibility of the compiled program, but not binary compatibility. During compilation, the first stage produces virtual GPU architecture code, i.e. PTX; only the second stage compiles code that runs on a real GPU, so the real GPU must implement the features required by the virtual GPU. From this it follows that the virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run.

The --gpu-architecture arch compile option specifies the NVIDIA virtual GPU architecture, for example compute_50. Normally --gpu-architecture is used to generate PTX code, not code that runs on a particular physical GPU. The --gpu-code code,... option specifies the name of the NVIDIA GPU to assemble and optimize PTX for, for example sm_50. For examples of values for these two options, see here.
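
As a hypothetical example (the source file name transpose.cu is illustrative), the two options are often combined like this, generating PTX for the compute_50 virtual architecture and a binary for the sm_50 real architecture; keeping compute_50 in --gpu-code embeds the PTX so the driver can JIT-compile for newer GPUs:

$ nvcc --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50 transpose.cu -o transpose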

References:
NVIDIA CUDA Compiler Driver NVCC
What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

CUDA Programming Notes (11): Warp

These notes are excerpted from Professional CUDA C Programming.

Warps are the basic unit of execution in an SM. When you launch a grid of thread blocks, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads and all threads in a warp are executed in Single Instruction Multiple Thread (SIMT) fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data. The following figure illustrates the relationship between the logical view and hardware view of a thread block.

[Figure: logical view vs. hardware view of a thread block]

From the logical perspective, a thread block is a collection of threads organized in a 1D, 2D, or 3D layout.
From the hardware perspective, a thread block is a 1D collection of warps. Threads in a thread block are organized in a 1D layout, and each set of 32 consecutive threads forms a warp.

In actual execution, each block is split into warps, and the threads within a warp execute in lockstep.

Threads in the same warp executing different instructions is referred to as warp divergence.
If threads of a warp diverge, the warp serially executes each branch path, disabling threads that do not take that path. Warp divergence can cause significantly degraded performance.
Take note that branch divergence occurs only within a warp. Different conditional values in different warps do not cause warp divergence.

Warp divergence can only occur within a single warp; see the figure below:

[Figure: warp divergence occurs within a warp, not across warps]
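
A hedged sketch of the classic divergence example (the kernel names are illustrative): branching on the thread index splits every warp into two paths, while branching on the warp index keeps each warp on a single path.

// Divergent: even and odd threads within the same warp take different branches,
// so the warp executes both paths serially.
__global__ void mathKernelDivergent(float *c)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float a = 0.0f, b = 0.0f;
  if (tid % 2 == 0) a = 100.0f;
  else              b = 200.0f;
  c[tid] = a + b;
}

// Not divergent: the condition is uniform across each warp (warpSize == 32),
// so every warp follows exactly one branch.
__global__ void mathKernelUniform(float *c)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float a = 0.0f, b = 0.0f;
  if ((tid / warpSize) % 2 == 0) a = 100.0f;
  else                           b = 200.0f;
  c[tid] = a + b;
}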

The local execution context of a warp mainly consists of the following resources:
➤ Program counters
➤ Registers
➤ Shared memory
The execution context of each warp processed by an SM is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost.

Each SM has a set of 32-bit registers stored in a register file that are partitioned among threads, and a fixed amount of shared memory that is partitioned among thread blocks. The number of thread blocks and warps that can simultaneously reside on an SM for a given kernel depends on the number of registers and amount of shared memory available on the SM and required by the kernel.

In other words, registers are partitioned among threads, while shared memory is partitioned among thread blocks.

A thread block is called an active block when compute resources, such as registers and shared memory, have been allocated to it. The warps it contains are called active warps. Active warps can be further classified into the following three types:
➤ Selected warp
➤ Stalled warp
➤ Eligible warp
The warp schedulers on an SM select active warps on every cycle and dispatch them to execution units. A warp that is actively executing is called a selected warp. If an active warp is ready for execution but not currently executing, it is an eligible warp. If a warp is not ready for execution, it is a stalled warp. A warp is eligible for execution if both of the following conditions are met:
➤ Thirty-two CUDA cores are available for execution.
➤ All arguments to the current instruction are ready.

GUIDELINES FOR GRID AND BLOCK SIZE
Using these guidelines will help your application scale on current and future devices:
➤ Keep the number of threads per block a multiple of warp size (32).
➤ Avoid small block sizes: Start with at least 128 or 256 threads per block.
➤ Adjust block size up or down according to kernel resource requirements.
➤ Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device.
➤ Conduct experiments to discover the best execution configuration and resource usage.
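
A small sketch of how these guidelines are commonly applied when launching a 1D kernel over n elements; the kernel name someKernel and the buffer d_data are placeholders:

int n = 1 << 20;                          // example problem size
dim3 block(256);                          // multiple of the warp size, at least 128 threads
dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover all n elements

someKernel<<<grid, block>>>(d_data, n);   // many more blocks than SMs exposes parallelism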

CUDA Programming Notes (10): Streaming Multiprocessors

These notes are excerpted from Professional CUDA C Programming.

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SM). GPU hardware parallelism is achieved through the replication of this architectural building block.
Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.

A GPU contains multiple Streaming Multiprocessors, and each Streaming Multiprocessor in turn contains multiple cores. A Streaming Multiprocessor supports the concurrent execution of many threads.

A thread block is scheduled on only one SM. Once a thread block is scheduled on an SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. The following figure illustrates the corresponding components from the logical view and hardware view of CUDA programming:

A block is scheduled onto only one Streaming Multiprocessor, while a single Streaming Multiprocessor can run multiple blocks at the same time.

[Figure: corresponding components in the logical view and hardware view of CUDA programming]

CUDA Programming Notes (8): CUDA Kernel

These notes are excerpted from Professional CUDA C Programming.

A CUDA kernel call is a direct extension to the C function syntax that adds a kernel’s execution configuration inside triple-angle-brackets:
kernel_name <<<grid, block>>>(argument list);
As explained in the previous section, the CUDA programming model exposes the thread hierarchy. With the execution configuration, you can specify how the threads will be scheduled to run on the GPU. The first value in the execution configuration is the grid dimension, the number of blocks to launch. The second value is the block dimension, the number of threads within each block. By specifying the grid and block dimensions, you configure:
➤ The total number of threads for a kernel
➤ The layout of the threads you want to employ for a kernel

In kernel_name <<<grid, block>>>(argument list);, the grid argument specifies the number of blocks and the block argument specifies the number of threads per block; the product of the two is the total number of threads in the grid.
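
For example (a hedged sketch, not from the book), launching the kernel below with 4 blocks of 256 threads creates 4 x 256 = 1024 threads in total, and each thread computes its own global index; the kernel name scale and the buffer d_data are illustrative:

__global__ void scale(float *data, float factor, int n)
{
  // Global index of this thread across the whole grid.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// grid dimension = 4 blocks, block dimension = 256 threads per block.
scale<<<4, 256>>>(d_data, 2.0f, 1024);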

Unlike a C function call, all CUDA kernel launches are asynchronous. Control returns to the CPU immediately after the CUDA kernel is invoked.

A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread, and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.

The following restrictions apply for all kernels:
➤ Access to device memory only
➤ Must have void return type
➤ No support for a variable number of arguments
➤ No support for static variables
➤ No support for function pointers
➤ Exhibit an asynchronous behavior