Nan Xiao's Blog

A system software / performance engineer's home

Tag: GPU

Use “.cu” as file extension name when playing Thrust

Today, I tried the simple Thrust program:

$ cat a.c
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(void) {
        // H has storage for 4 integers
        thrust::host_vector<int> H(4);

        // initialize individual elements
        H[0] = 14;
        H[1] = 20;
        H[2] = 38;
        H[3] = 46;

        // H.size() returns the size of vector H
        std::cout << "H has size " << H.size() << std::endl;

        // print contents of H
        for(int i = 0; i < H.size(); i++)
                std::cout << "H[" << i << "] = " << H[i] << std::endl;

        // resize H
        H.resize(2);
        std::cout << "H now has size " << H.size() << std::endl;

        // Copy host_vector H to device_vector D
        thrust::device_vector<int> D = H;

        // elements of D can be modified
        D[0] = 99;
        D[1] = 88;

        // print contents of D
        for(int i = 0; i < D.size(); i++)
                std::cout << "D[" << i << "] = " << D[i] << std::endl;

        // H and D are automatically deleted when the function returns
        return 0;
}

Built it:

$ nvcc -arch=sm_37 a.c
In file included from a.c:1:0:
/opt/cuda/bin/..//include/thrust/host_vector.h:25:18: fatal error: memory: No such file or directory
compilation terminated.

It seemed very weird! After scanning Thrust’s FAQ, I came across the following tip:

Make sure that files that #include Thrust have a .cu extension. Other extensions (e.g., .cpp) will cause nvcc to treat the file incorrectly and produce an error message.

Renamed the source file name and rebuilt it:

$ mv a.c a.cu
$ nvcc -arch=sm_37 a.cu
$ ./a.out
H has size 4
H[0] = 14
H[1] = 20
H[2] = 38
H[3] = 46
H now has size 2
D[0] = 99
D[1] = 88

Worked like a charm!

Don’t use “-G” compile option for profiling CUDA programs

I use Nsight as an IDE to develop CUDA programs:

capture

Use nvprof to measure the load efficiency and store efficiency of accessing global memory:

$ nvprof --devices 2 --metrics gld_efficiency,gst_efficiency ./cuHE_opt

................... CRT polynomial Terminated ...................

==1443== Profiling application: ./cuHE_opt
==1443== Profiling result:
==1443== Metric result:
Invocations   Metric NameMetric Description Min Max Avg
Device "Tesla K80 (2)"
Kernel: gpu_cuHE_crt(unsigned int*, unsigned int*, int, int, int, int)
  1gld_efficiency Global Memory Load Efficiency  62.50%  62.50%  62.50%
  1gst_efficiencyGlobal Memory Store Efficiency 100.00% 100.00% 100.00%
Kernel: gpu_crt(unsigned int*, unsigned int*, int, int, int, int)
  1gld_efficiency Global Memory Load Efficiency  39.77%  39.77%  39.77%
  1gst_efficiencyGlobal Memory Store Efficiency 100.00% 100.00% 100.00%

But if I use nvcc to compile the program directly:

 nvcc -arch=sm_37 cuHE_opt.cu  -o cuHE_opt

The nvprof displays the different measuring results:

$ nvprof --devices 2 --metrics gld_efficiency,gst_efficiency ./cuHE_opt
......
................... CRT polynomial Terminated ...................

==1801== Profiling application: ./cuHE_opt
==1801== Profiling result:
==1801== Metric result:
Invocations   Metric NameMetric Description Min Max Avg
Device "Tesla K80 (2)"
Kernel: gpu_cuHE_crt(unsigned int*, unsigned int*, int, int, int, int)
  1gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
  1gst_efficiencyGlobal Memory Store Efficiency 100.00% 100.00% 100.00%
Kernel: gpu_crt(unsigned int*, unsigned int*, int, int, int, int)
  1gld_efficiency Global Memory Load Efficiency  50.00%  50.00%  50.00%
  1gst_efficiencyGlobal Memory Store Efficiency 100.00% 100.00% 100.00%

After some investigations, the reason is using -G compile option in the first case. As the document of nvcc has mentioned:

--device-debug (-G)
    Generate debug information for device code. Turns off all optimizations.
    Don't use for profiling; use -lineinfo instead.

So don’t use -G┬ácompile option for profiling CUDA programs.

Is the warp size always 32 in CUDA?

Last week, I began to read the awesome Professional CUDA C Programming, and bumped into the following words in GPU Architecture Overview section:

CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps.

Since this book is published in 2014, I just wonder whether the warp size is still 32 in CUDA no matter the different Compute Capability is. To figure out it, I turn to the official CUDA C Programming Guide, and get the answer from Compute Capability table:

capture

Yep, for all Compute Capabilities, the warp size is always 32.

BTW, you can also use following program to determine the warp size value:

#include <stdio.h>

int main(void) {
        cudaDeviceProp deviceProp;
        if (cudaSuccess != cudaGetDeviceProperties(&deviceProp, 0)) {
                printf("Get device properties failed.\n");
                return 1;
        } else {
                printf("The warp size is %d.\n", deviceProp.warpSize);
                return 0;
        }
}

The running result in my CUDA box is here:

The warp size is 32.

Powered by WordPress & Theme by Anders Norén