A Brief Introduction to nvcc

nvcc is "The main wrapper for the NVIDIA CUDA Compiler suite. Used to compile and link both host and gpu code." To check its version, use the --version option:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

The compute capabilities supported by different CUDA versions can be found here:

CUDA VERSION      Min CC   Deprecated CC   Default CC
5.5 (and prior)   1.0      N/A             1.0
6.0               1.0      1.0             1.0
6.5               1.1      1.x             2.0
7.0               2.0      N/A             2.0
7.5               (same as 7.0)
8.0               2.0      2.x             2.0

Min CC = minimum compute capability that can be specified to nvcc

Deprecated CC = If you specify this CC, you will get a deprecation message, but compile should still proceed.

Default CC = The architecture that will be targeted if no `-arch` or `-gencode` switches are used

According to the CUDA naming convention: "GPUs are named sm_xy, where x denotes the GPU generation number, and y the version in that generation."

This situation is different for GPUs, because NVIDIA cannot guarantee binary compatibility without sacrificing regular opportunities for GPU improvements. Rather, as is already conventional in the graphics programming domain, nvcc relies on a two stage compilation model for ensuring application compatibility with future GPU generations.

nvcc guarantees application compatibility for the compiled program, but not binary compatibility. Compilation proceeds in two stages: the first stage produces code for a virtual GPU architecture, i.e. PTX; only the second stage produces code that runs on a real GPU. The real GPU must therefore implement the features required by the virtual GPU. From this it follows that the virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run.

The --gpu-architecture arch option specifies the NVIDIA virtual GPU architecture, for example compute_50. Normally, --gpu-architecture arch is used to generate PTX code, not code that runs on a specific GPU. The --gpu-code code,... option specifies the name of the NVIDIA GPU to assemble and optimize PTX for, for example sm_50. Examples of values for these two options can be found here.
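For example, an invocation along the following lines (the source file name is just for illustration) generates PTX for the compute_50 virtual architecture and SASS for the sm_50 real architecture, while also embedding the PTX so that newer GPUs can JIT-compile it:

$ nvcc --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50 vector_add.cu -o vector_add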

References:
NVIDIA CUDA Compiler Driver NVCC
What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

CUDA Programming Notes (12): CUDA Memory Model

This note is excerpted from Professional CUDA C Programming.

The CUDA memory model exposes many types of programmable memory to you:
➤ Registers
➤ Shared memory
➤ Local memory
➤ Constant memory
➤ Texture memory
➤ Global memory
The following figure illustrates the hierarchy of these memory spaces. Each has a different scope, lifetime, and caching behavior. A thread in a kernel has its own private local memory. A thread block has its own shared memory, visible to all threads in the same thread block, and whose contents persist for the lifetime of the thread block. All threads can access global memory. There are also two read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different uses. Texture memory offers different address modes and filtering for various data layouts. The contents of global, constant, and texture memory have the same lifetime as an application.

[figure: the CUDA memory hierarchy]

The principal traits of the various memory types are summarized in the following table:

[table: principal traits of the various memory types]
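A minimal kernel sketch (names and sizes are illustrative, not from the book) showing where these memory spaces appear in CUDA C; it assumes blockDim.x is 256 and that gid stays within the arrays:

__constant__ float coef[16];                    // constant memory: read-only for kernels, visible to all threads

__global__ void memorySpaces(const float *in, float *out)   // in and out point to global memory
{
    __shared__ float tile[256];                 // shared memory: per block, lives as long as the block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float v = in[gid];                          // v, tid, gid live in registers (or local memory if spilled)
    tile[tid] = v;                              // stage the value in shared memory
    __syncthreads();                            // block-local synchronization

    out[gid] = tile[tid] * coef[tid % 16];      // read constant memory, write global memory
}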

Switching Perspectives in Eclipse

When using the Nsight Eclipse Edition tool shipped with CUDA, the perspective selector is in the upper-right corner; switch between perspectives there as needed:

[figure: the perspective switcher in Nsight Eclipse Edition]

CUDA Programming Notes (11): warp

This note is excerpted from Professional CUDA C Programming.

Warps are the basic unit of execution in an SM. When you launch a grid of thread blocks, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads and all threads in a warp are executed in Single Instruction Multiple Thread (SIMT) fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data. The following figure illustrates the relationship between the logical view and hardware view of a thread block.

[figure: logical view and hardware view of a thread block]

From the logical perspective, a thread block is a collection of threads organized in a 1D, 2D, or 3D layout.
From the hardware perspective, a thread block is a 1D collection of warps. Threads in a thread block are organized in a 1D layout, and each set of 32 consecutive threads forms a warp.

In actual execution, each block is split into warps, and the threads within a warp run in lockstep.

Threads in the same warp executing different instructions is referred to as warp divergence.
If threads of a warp diverge, the warp serially executes each branch path, disabling threads that do not take that path. Warp divergence can cause significantly degraded performance.
Take note that branch divergence occurs only within a warp. Different conditional values in different warps do not cause warp divergence.

Warp divergence can only occur within a single warp; see the figure below:

[figure: warp divergence within a single warp]
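A sketch in the spirit of the book's simpleDivergence example (the kernel names here are illustrative): the first kernel branches on the thread index, so threads inside each warp diverge; the second branches on the warp index, so all 32 threads of a warp take the same path and no divergence occurs.

__global__ void mathKernel1(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if (tid % 2 == 0) {                 // even and odd threads diverge within each warp
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}

__global__ void mathKernel2(float *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.0f, b = 0.0f;
    if ((tid / warpSize) % 2 == 0) {    // whole warps take one branch: no divergence
        a = 100.0f;
    } else {
        b = 200.0f;
    }
    c[tid] = a + b;
}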

The local execution context of a warp mainly consists of the following resources:
➤ Program counters
➤ Registers
➤ Shared memory
The execution context of each warp processed by an SM is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost.

Each SM has a set of 32-bit registers stored in a register file that are partitioned among threads, and a fixed amount of shared memory that is partitioned among thread blocks. The number of thread blocks and warps that can simultaneously reside on an SM for a given kernel depends on the number of registers and amount of shared memory available on the SM and required by the kernel.

Registers are partitioned per thread, while shared memory is shared within a thread block.

A thread block is called an active block when compute resources, such as registers and shared memory, have been allocated to it. The warps it contains are called active warps. Active warps can be further classified into the following three types:
➤ Selected warp
➤ Stalled warp
➤ Eligible warp
The warp schedulers on an SM select active warps on every cycle and dispatch them to execution units. A warp that is actively executing is called a selected warp. If an active warp is ready for execution but not currently executing, it is an eligible warp. If a warp is not ready for execution, it is a stalled warp. A warp is eligible for execution if both of the following two conditions are met:
➤ Thirty-two CUDA cores are available for execution.
➤ All arguments to the current instruction are ready.

GUIDELINES FOR GRID AND BLOCK SIZE
Using these guidelines will help your application scale on current and future devices:
➤ Keep the number of threads per block a multiple of warp size (32).
➤ Avoid small block sizes: Start with at least 128 or 256 threads per block.
➤ Adjust block size up or down according to kernel resource requirements.
➤ Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device.
➤ Conduct experiments to discover the best execution configuration and resource usage.
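A minimal host-side sketch of these guidelines (the sizes are illustrative and someKernel is a hypothetical kernel):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);          // prop.multiProcessorCount = number of SMs

int n = 1 << 24;                            // problem size (illustrative)
int block = 256;                            // a multiple of the warp size, at least 128
int grid = (n + block - 1) / block;         // normally far more blocks than prop.multiProcessorCount
// someKernel<<<grid, block>>>(...);        // hypothetical launch; adjust block per kernel resource usage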

CUDA Programming Notes (10): Streaming Multiprocessors

This note is excerpted from Professional CUDA C Programming.

The GPU architecture is built around a scalable array of Streaming Multiprocessors (SM). GPU hardware parallelism is achieved through the replication of this architectural building block.
Each SM in a GPU is designed to support concurrent execution of hundreds of threads, and there are generally multiple SMs per GPU, so it is possible to have thousands of threads executing concurrently on a single GPU. When a kernel grid is launched, the thread blocks of that kernel grid are distributed among available SMs for execution. Once scheduled on an SM, the threads of a thread block execute concurrently only on that assigned SM. Multiple thread blocks may be assigned to the same SM at once and are scheduled based on the availability of SM resources. Instructions within a single thread are pipelined to leverage instruction-level parallelism, in addition to the thread-level parallelism you are already familiar with in CUDA.

A GPU contains multiple Streaming Multiprocessors, and each Streaming Multiprocessor in turn contains multiple cores. A Streaming Multiprocessor supports concurrent execution of many threads.

A thread block is scheduled on only one SM. Once a thread block is scheduled on an SM, it remains there until execution completes. An SM can hold more than one thread block at the same time. The following figure illustrates the corresponding components from the logical view and hardware view of CUDA programming:

A block is scheduled onto exactly one Streaming Multiprocessor, and a single Streaming Multiprocessor can run multiple blocks at the same time.

[figure: corresponding components in the logical view and hardware view of CUDA programming]

CUDA Programming Notes (9): Function Type Qualifiers

This note is excerpted from Professional CUDA C Programming.

[table: CUDA function type qualifiers]
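The original post shows the qualifiers only in a table; a minimal sketch of how they appear in code (the function names are illustrative):

__device__ float deviceHelper(float x)          // executes on the device, callable only from device code
{
    return x * 2.0f;
}

__host__ __device__ float bothSides(float x)    // compiled for both the host and the device
{
    return x + 1.0f;
}

__global__ void kernelFunc(float *out)          // executes on the device, callable from the host, must return void
{
    out[threadIdx.x] = deviceHelper(bothSides(1.0f));
}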

 

CUDA Programming Notes (8): CUDA kernel

This note is excerpted from Professional CUDA C Programming.

A CUDA kernel call is a direct extension to the C function syntax that adds a kernel’s execution configuration inside triple-angle-brackets:
kernel_name <<<grid, block>>>(argument list);
As explained in the previous section, the CUDA programming model exposes the thread hierarchy. With the execution configuration, you can specify how the threads will be scheduled to run on the GPU. The first value in the execution configuration is the grid dimension, the number of blocks to launch. The second value is the block dimension, the number of threads within each block. By specifying the grid and block dimensions, you configure:
➤ The total number of threads for a kernel
➤ The layout of the threads you want to employ for a kernel

In kernel_name <<<grid, block>>>(argument list);, the grid argument specifies the number of blocks, and the block argument specifies the number of threads in each block; their product is the total number of threads in the grid.
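For example (the kernel name and pointers are illustrative), the launch below runs 4 blocks of 256 threads each, i.e. 1024 threads in total:

dim3 block(256);                          // 256 threads per block
dim3 grid(4);                             // 4 blocks
vecAdd<<<grid, block>>>(d_a, d_b, d_c);   // grid.x * block.x = 1024 threads execute vecAdd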

Unlike a C function call, all CUDA kernel launches are asynchronous. Control returns to the CPU immediately after the CUDA kernel is invoked.

A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread, and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.

The following restrictions apply for all kernels:
➤ Access to device memory only
➤ Must have void return type
➤ No support for a variable number of arguments
➤ No support for static variables
➤ No support for function pointers
➤ Exhibit an asynchronous behavior

CUDA Programming Notes (7): thread hierarchy

This note is excerpted from Professional CUDA C Programming.

When a kernel function is launched from the host side, execution is moved to a device where a large number of threads are generated and each thread executes the statements specified by the kernel function. CUDA exposes a thread hierarchy abstraction to enable you to organize your threads. This is a two-level thread hierarchy decomposed into blocks of threads and grids of blocks:

[figure: the two-level thread hierarchy of grids and blocks]

All threads spawned by a single kernel launch are collectively called a grid. All threads in a grid share the same global memory space. A grid is made up of many thread blocks. A thread block is a group of threads that can cooperate with each other using:
➤ Block-local synchronization
➤ Block-local shared memory
Threads from different blocks cannot cooperate.

Threads rely on the following two unique coordinates to distinguish themselves from each other:
➤ blockIdx (block index within a grid)
➤ threadIdx (thread index within a block)
These variables appear as built-in, pre-initialized variables that can be accessed within kernel functions. When a kernel function is executed, the coordinate variables blockIdx and threadIdx are assigned to each thread by the CUDA runtime. Based on the coordinates, you can assign portions of data to different threads. The coordinate variable is of type uint3, a CUDA built-in vector type, derived from the basic integer type.

CUDA organizes grids and blocks in three dimensions. The dimensions of a grid and a block are specified by the following two built-in variables:
➤ blockDim (block dimension, measured in threads)
➤ gridDim (grid dimension, measured in blocks)
These variables are of type dim3, an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

blockIdx and threadIdx are of type uint3 and represent coordinates, so their components start from 0; blockDim and gridDim are of type dim3 and represent dimensions, i.e. how many threads a block contains and how many blocks the current grid contains, so unspecified components default to 1.

There are two distinct sets of grid and block variables in a CUDA program: manually-defined dim3 data type and pre-defined uint3 data type. On the host side, you define the dimensions of a grid and block using a dim3 data type as part of a kernel invocation. When the kernel is executing, the CUDA runtime generates the corresponding built-in, pre-initialized grid, block, and thread variables, which are accessible within the kernel function and have type uint3. The manually-defined grid and block variables for the dim3 data type are only visible on the host side, and the built-in, pre-initialized grid and block variables of the uint3 data type are only visible on the device side.

It is important to distinguish between the host and device access of grid and block variables. For example, using a variable declared as block from the host, you define the coordinates and access them as follows:
block.x, block.y, and block.z
On the device side, you have pre-initialized, built-in block size variable available as:
blockDim.x, blockDim.y, and blockDim.z
In summary, you define variables for grid and block on the host before launching a kernel, and access them there with the x, y and z fields of the vector structure from the host side. When the kernel is launched, you can use the pre-initialized, built-in variables within the kernel.
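A minimal, self-contained sketch in the spirit of the book's checkDimension example (the sizes are illustrative): the host defines grid and block with dim3 and reads their .x/.y/.z fields, while the kernel reads the pre-initialized built-in variables.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void checkIndex(void)
{
    printf("threadIdx:(%d,%d,%d) blockIdx:(%d,%d,%d) blockDim:(%d,%d,%d) gridDim:(%d,%d,%d)\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

int main(void)
{
    int nElem = 6;
    dim3 block(3);                                   // host-side dim3: unspecified components default to 1
    dim3 grid((nElem + block.x - 1) / block.x);

    printf("grid:(%d,%d,%d) block:(%d,%d,%d)\n",     // host-side access via the .x/.y/.z fields
           grid.x, grid.y, grid.z, block.x, block.y, block.z);

    checkIndex<<<grid, block>>>();                   // device-side built-ins are filled in by the runtime
    cudaDeviceSynchronize();                         // wait so the kernel's printf output is flushed
    return 0;
}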

The following three pages are taken from Learn CUDA In An Afternoon:

[slides 1-3 from Learn CUDA In An Afternoon]

CUDA Programming Notes (6): Runtime API vs. Driver API

This note is excerpted from Professional CUDA C Programming.

CUDA provides two API levels for managing the GPU device and organizing threads:
➤ CUDA Driver API
➤ CUDA Runtime API
The driver API is a low-level API and is relatively hard to program, but it provides more control over how the GPU device is used. The runtime API is a higher-level API implemented on top of the driver API. Each function of the runtime API is broken down into more basic operations issued to the driver API.

There is no noticeable performance difference between the runtime and driver APIs. How your kernels use memory and how you organize your threads on the device have a much more pronounced effect.

These two APIs are mutually exclusive. You must use one or the other, but it is not possible to mix function calls from both.

[figure: CUDA Runtime API and Driver API]
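A minimal sketch contrasting the two styles (the two functions are placed side by side only for comparison; as noted above, a real application should stick to one API):

#include <cuda.h>               // driver API
#include <cuda_runtime.h>       // runtime API

void allocWithRuntimeApi(void)  // runtime API: initialization and context management are implicit
{
    float *d_ptr = NULL;
    cudaMalloc((void **)&d_ptr, 1024 * sizeof(float));
    cudaFree(d_ptr);
}

void allocWithDriverApi(void)   // driver API: explicit initialization and context management
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);
    CUdeviceptr d_ptr;
    cuMemAlloc(&d_ptr, 1024 * sizeof(float));
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
}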

CUDA Programming Notes (5): CUDA Program Structure

This note is excerpted from Professional CUDA C Programming.

A typical CUDA program structure consists of five main steps:
1. Allocate GPU memories.
2. Copy data from CPU memory to GPU memory.
3. Invoke the CUDA kernel to perform program-specific computation.
4. Copy data back from GPU memory to CPU memory.
5. Destroy GPU memories.
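A minimal sketch of these five steps (error checking omitted; myKernel, grid, and block are hypothetical placeholders):

size_t nbytes = 1024 * sizeof(float);
float *h_data = (float *)malloc(nbytes);                     // host buffer, assumed initialized
float *d_in = NULL, *d_out = NULL;

cudaMalloc((void **)&d_in, nbytes);                          // 1. allocate GPU memory
cudaMalloc((void **)&d_out, nbytes);
cudaMemcpy(d_in, h_data, nbytes, cudaMemcpyHostToDevice);    // 2. copy CPU -> GPU
myKernel<<<grid, block>>>(d_in, d_out);                      // 3. run the computation on the GPU
cudaMemcpy(h_data, d_out, nbytes, cudaMemcpyDeviceToHost);   // 4. copy GPU -> CPU
cudaFree(d_in);                                              // 5. free GPU memory
cudaFree(d_out);
free(h_data);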

CUDA exposes you to the concepts of both memory hierarchy and thread hierarchy, extending your ability to control thread execution and scheduling to a greater degree, using:
➤ Memory hierarchy structure
➤ Thread hierarchy structure
For example, a special memory, called shared memory, is exposed by the CUDA programming model. Shared memory can be thought of as a software-managed cache, which provides great speedup by conserving bandwidth to main memory. With shared memory, you can control the locality of your code directly.

When writing a parallel program in ANSI C, you need to explicitly organize your threads with either pthreads or OpenMP, two well-known techniques to support parallel programming on most processor architectures and operating systems. When writing a program in CUDA C, you actually just write a piece of serial code to be called by only one thread. The GPU takes this kernel and makes it parallel by launching thousands of threads, all performing that same computation. The CUDA programming model provides you with a way to organize your threads hierarchically. Manipulating this organization directly affects the order in which threads are executed on the GPU. Because CUDA C is an extension of C, it is often straightforward to port C programs to CUDA C. Conceptually, peeling off the loops of your code yields the kernel code for a CUDA C implementation.
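For example (the names are illustrative), "peeling off the loop" of a serial vector addition yields the per-thread kernel body:

// Serial C version:
// for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

__global__ void vecAddKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // the loop index becomes the thread's global index
    if (i < n)                                       // guard: more threads may be launched than elements
        c[i] = a[i] + b[i];
}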

CUDA abstracts away the hardware details and does not require applications to be mapped to traditional graphics APIs. At its core are three key abstractions: a hierarchy of thread groups, a hierarchy of memory groups, and barrier synchronization, which are exposed to you as a minimal set of language extensions.

 
