Nan Xiao's Blog | A system software / performance engineer's home

Import existing CUDA project into Nsight

The steps to import an existing CUDA project (one that uses CMake) into Nsight are as follows:

(1) Select File -> New -> CUDA C/C++ Project:

Untick “Use default location”, and select the root directory of your project.

(2) In Properties, change the Build location so that it points to the directory containing the Makefile.

(3) After the build succeeds, right-click the project: Run As -> Local C/C++ Application, then select the binary you want to execute.

References:
Setting Nsight to run with existing Makefile project;
How to create Eclipse project from CMake project;
How to change make location in Eclipse.

First taste of stxxl

stxxl is an interesting project:

STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i. e. STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the closeness to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.

Recently, I got a task that operates on data too large to fit in memory, so I wanted to give stxxl a shot.

My OS is OpenBSD and the machine has less than 4G of memory. I modified the loop in vector1.cpp from:

for (int i = 0; i < 1024 * 1024; i++)

to:

for (int i = 0; i < 1024 * 1024 * 1024; i++)

Since every integer occupies 4 bytes, the vector needs at least 4G of storage. Build and run the program:

# ./vector1
[STXXL-MSG] STXXL v1.4.99 (prerelease/Debug) (git 263df0c54dc168212d1c7620e3c10c93791c9c29)
[STXXL-ERRMSG] Warning: no config file found.
[STXXL-ERRMSG] Using default disk configuration.
[STXXL-MSG] Warning: open()ing /var/tmp/stxxl without DIRECT mode, as the system does not support it.
[STXXL-MSG] Disk '/var/tmp/stxxl' is allocated, space: 1000 MiB, I/O implementation: syscall delete_on_exit queue=0 devid=0
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
......
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
101
[STXXL-ERRMSG] Removing disk file: /var/tmp/stxxl

The program outputs 101, which is the correct result. Check /var/tmp/stxxl before it is deleted:

# ls -alt /var/tmp/stxxl
-rw-r-----  1 root  wheel  4294967296 Mar  9 15:58 /var/tmp/stxxl

It is indeed the data file, and its size is 4G.

Based on this simple test, stxxl gives me a good impression and is worth exploring further.
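The 4G figure can be checked with a little arithmetic. The following is only a minimal sketch: the names kElements and bytesNeeded are made up for illustration and are not part of stxxl or vector1.cpp. It also shows why the modified loop can keep an int counter, while the byte total has to be computed in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Loop bound from the modified vector1.cpp: 1024 * 1024 * 1024 = 2^30 elements.
// (kElements is a hypothetical name used only in this sketch.)
constexpr std::int64_t kElements = 1024LL * 1024 * 1024;

// The bound still fits in a 32-bit signed int, so `int i` does not overflow.
static_assert(kElements <= std::numeric_limits<std::int32_t>::max(),
              "loop bound must fit in a 32-bit int");

// Total payload in bytes, computed in 64 bits because the result (2^32)
// no longer fits in 32 bits. (bytesNeeded is a hypothetical helper.)
inline std::uint64_t bytesNeeded(std::uint64_t elements, std::uint64_t elementSize) {
    // Guard against wrapping around in the 64-bit multiplication.
    assert(elementSize == 0 ||
           elements <= std::numeric_limits<std::uint64_t>::max() / elementSize);
    return elements * elementSize;
}
```

With 4-byte integers, bytesNeeded(kElements, 4) evaluates to 4294967296, which is exactly the size that ls reports for /var/tmp/stxxl.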

P.S., use cmake -DBUILD_TESTS=ON .. to enable building the examples.

Clang may be a better option than gcc when requiring much memory

I used an old machine (Arch Linux, less than 3G of memory) to build the stxxl project:

# cmake -DBUILD_TESTS=ON ..
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
......
# make VERBOSE=1
......
cd /root/stxxl/build/examples/applications && /usr/bin/c++  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/stxxl/include -I/root/stxxl/build/include  -W -Wall -pedantic -Wno-long-long -Wextra -ftemplate-depth=1024 -std=c++11 -fopenmp -g   -o CMakeFiles/skew3-lcp.dir/skew3-lcp.cpp.o -c /root/stxxl/examples/applications/skew3-lcp.cpp

The default compiler was gcc 7.3.0, and the build got stuck while compiling skew3-lcp.cpp. The output of htop showed that nearly all memory was occupied.

Switch to clang:

# cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DBUILD_TESTS=ON ..
-- The C compiler identification is Clang 5.0.1
-- The CXX compiler identification is Clang 5.0.1
......
# make
......
[100%] Linking CXX executable test1
[100%] Built target test1

The project was built successfully, and the peak memory used for compiling skew3-lcp.cpp was 1.78G. Based on this test, if you have a compilation task that needs a lot of memory, clang may be a better choice than gcc.

Some tips of using “pool” in programming

In the past weeks, I have been diving into using CUDA APIs to operate on multiple GPUs. Part of my work is a “memory pool” module which manages allocating and freeing memory on different devices. In this post, I want to share some thoughts about the “pool”, an interesting technique in programming.

The “pool” works like a “cache”: it allocates some resources beforehand, and when the application wants one, the “pool” picks an available one for it. One classical example is the “database connection pool”: the “pool” preallocates some TCP connections and keeps them alive, which saves the client from handshaking with the server every time. Another instance is the “memory pool”, which I implemented recently. The “pool” keeps the freed memory and does not really release it to the device. Based on my benchmark test, the application’s performance can improve by 100% in the extreme case. (Caveat: for a multithreaded application, if the locking mechanism of the “pool” becomes the performance bottleneck, try giving every thread its own private “pool”.)

The other use of a “pool” is debugging. Take my “memory pool” as an example again: every memory pointer is accompanied by a device ID (which GPU the memory was allocated from) and the memory size, so you can follow the whole life of a memory block clearly by analyzing the trace log. This has saved me from the notorious “an illegal memory access was encountered” error many times in multi-GPU programming.

Last but not least, although this “memory pool” is merely ~100 lines of code, it already uses the following data structures: queue, vector, pair and map, not to mention the mutex and lock, which are essential to avoid nasty data-race issues. So writing this module is good practice for honing my programming craft.
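To make the idea concrete, here is a minimal sketch of such a pool. It is not my actual module: the class name MemoryPool and its methods are hypothetical, std::malloc/std::free stand in for the device allocator so the example compiles without CUDA, and the device ID is only a label attached to each block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch. std::malloc/std::free stand in for the device
// allocator, and deviceId is only a label attached to a block.
class MemoryPool {
public:
    typedef std::pair<int, std::size_t> Key;  // (device ID, block size)

    MemoryPool() {}
    MemoryPool(const MemoryPool&) = delete;
    MemoryPool& operator=(const MemoryPool&) = delete;

    ~MemoryPool() {
        // Really release whatever is still cached.
        for (auto& entry : free_)
            for (void* ptr : entry.second)
                std::free(ptr);
    }

    // Hand out a cached block of the same device and size if there is one;
    // otherwise fall back to a real allocation.
    void* allocate(int deviceId, std::size_t size) {
        assert(size > 0 && "zero-sized requests are not supported in this sketch");
        std::lock_guard<std::mutex> lock(mutex_);
        const Key key(deviceId, size);
        std::vector<void*>& cached = free_[key];
        void* ptr = nullptr;
        if (!cached.empty()) {
            ptr = cached.back();
            cached.pop_back();
        } else {
            ptr = std::malloc(size);
            if (ptr == nullptr) return nullptr;
        }
        inUse_[ptr] = key;
        return ptr;
    }

    // Keep the block for reuse instead of freeing it. Returns false for a
    // pointer this pool did not hand out (or that was already released).
    bool release(void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = inUse_.find(ptr);
        if (it == inUse_.end()) return false;
        free_[it->second].push_back(ptr);
        inUse_.erase(it);
        return true;
    }

    // Debugging aid: report which device and size a live pointer belongs to.
    bool lookup(void* ptr, Key* out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = inUse_.find(ptr);
        if (it == inUse_.end()) return false;
        *out = it->second;
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::map<Key, std::vector<void*> > free_;  // released blocks, ready for reuse
    std::map<void*, Key> inUse_;               // blocks currently handed out
};
```

Keying the cache on the (device ID, size) pair means a block is only ever reused for a request on the same device, and the false return value of release gives a natural place to log a double free or a stray pointer. A single mutex keeps the sketch short; a per-thread pool would avoid the lock entirely, at the cost of not sharing cached blocks between threads.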

Enjoy coding!

An empirical method of debugging “illegal memory access” bug in CUDA programming

Recently, I ran into the “an illegal memory access was encountered” error in CUDA programming, reported at a call such as:

cudaSafeCall(cudaMemcpy(h_res, d_res, gParams.n*sizeof(uint32), cudaMemcpyDeviceToHost));

Because kernels in CUDA are executed asynchronously, the code which reports the error is usually not the original culprit. After referring to this piece of code, I wrapped up a new cudaMemoryTest function:

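#include <cstdlib>
#include <cstring>
#include <iostream>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call fails.
#define cudaSafeCall(call)  \
        do {\
            cudaError_t err = (call);\
            if (cudaSuccess != err) \
            {\
                std::cerr << "CUDA error in " << __FILE__ << "(" << __LINE__ << "): " \
                    << cudaGetErrorString(err) << std::endl;\
                exit(EXIT_FAILURE);\
            }\
        } while(0)

// Round-trip a small buffer through the device. The blocking cudaMemcpy calls
// can surface an error left behind by an earlier asynchronous kernel.
void cudaMemoryTest()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaSafeCall(cudaMalloc((void**)&d_a, bytes));

    memset(h_a, 0, bytes);
    cudaSafeCall(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    cudaSafeCall(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));

    // Release both buffers so repeated calls do not leak memory.
    cudaSafeCall(cudaFree(d_a));
    free(h_a);
}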

Then insert calls to this function at the suspicious positions:

{
    cudaMemoryTest();
    kernel_A<<<...>>>(...);

    ......

    cudaMemoryTest();
    kernel_B<<<...>>>(...);

    ......
}

This method notifies me promptly once CUDA memory access goes wrong, so I can investigate further.

In most cases, the cause is a NULL pointer or a pointer to memory that has already been freed. In a multi-GPU environment, you must also make sure that all the memory used by one operation is allocated on the same device.