Clang may be a better option than gcc when compilation requires a lot of memory

I used an old machine (running Arch Linux, with less than 3 GB of memory) to build the stxxl project:

# cmake -DBUILD_TESTS=ON ..
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
......
# make VERBOSE=1
......
cd /root/stxxl/build/examples/applications && /usr/bin/c++  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/stxxl/include -I/root/stxxl/build/include  -W -Wall -pedantic -Wno-long-long -Wextra -ftemplate-depth=1024 -std=c++11 -fopenmp -g   -o CMakeFiles/skew3-lcp.dir/skew3-lcp.cpp.o -c /root/stxxl/examples/applications/skew3-lcp.cpp

The default compiler was gcc 7.3.0, and the build got stuck compiling skew3-lcp.cpp. The output of htop showed that nearly all memory was occupied:

[htop screenshot: memory usage nearly at capacity]

Then I switched to clang:

# cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DBUILD_TESTS=ON ..
-- The C compiler identification is Clang 5.0.1
-- The CXX compiler identification is Clang 5.0.1
......
# make
......
[100%] Linking CXX executable test1
[100%] Built target test1

The project was built successfully, and the peak memory used to compile skew3-lcp.cpp was 1.78 GB. Based on this test, if you have a compilation task that needs a lot of memory, clang may be a better choice than gcc.

Some tips on using a “pool” in programming

In the past weeks, I have been diving into CUDA APIs to work with multiple GPUs. One part of my work is a “memory pool” module which manages allocating and freeing memory on different devices. In this post, I want to share some thoughts about the “pool”, an interesting technique in programming.

The “pool” works like a “cache”: it allocates some resources beforehand, and when the application asks for one, the “pool” hands over an available one. One classical example is the “database connection pool”: the “pool” preallocates some TCP connections and keeps them alive, which saves the client from handshaking with the server every time. Another instance is the “memory pool”, which I implemented recently. The “pool” keeps freed memory instead of really releasing it back to the device. Based on my benchmark test, the application’s performance can improve by 100% in extreme cases. (Caveat: for a multithreaded application, if the locking mechanism of the “pool” becomes the performance bottleneck, try giving every thread its own private “pool”.)

Another benefit of using a “pool” is for debugging. Still using my “memory pool” as a demonstration: every memory pointer is accompanied by a device ID (which GPU the memory is allocated from) and the memory size. So you can trace the whole life of a memory block clearly by analyzing the log. This has saved me from the notorious “an illegal memory access was encountered” error many times in multi-GPU programming.

Last but not least, although the “memory pool” is merely ~100 lines of code, it already uses the following data structures: queue, vector, pair and map, not to mention the mutex and lock, which are essential to avoid nasty data-race issues. So writing this module was good practice to hone my programming craft.
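
To give a rough idea of the structure, here is a minimal sketch of such a pool. It is only an illustration, not the real module: the class name MemoryPool, its acquire/release methods, and the lack of error handling are simplifications for this post. It keeps a map from (device ID, block size) to a queue of free pointers, guarded by a mutex:

#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>
#include <queue>
#include <utility>

// Illustrative sketch only: a pool keyed by (device ID, block size),
// protected by a mutex. Error handling is omitted for brevity.
class MemoryPool
{
public:
    // Return a cached block for (device, size), or allocate a new one.
    void* acquire(int device, size_t size)
    {
        std::lock_guard<std::mutex> guard(mtx_);
        std::queue<void*> &free_list = free_blocks_[std::make_pair(device, size)];
        if (!free_list.empty())
        {
            void *ptr = free_list.front();
            free_list.pop();
            return ptr;
        }
        void *ptr = nullptr;
        cudaSetDevice(device);
        cudaMalloc(&ptr, size);
        return ptr;
    }

    // Keep the block for later reuse instead of returning it to the device.
    void release(int device, size_t size, void *ptr)
    {
        std::lock_guard<std::mutex> guard(mtx_);
        free_blocks_[std::make_pair(device, size)].push(ptr);
    }

private:
    std::mutex mtx_;
    std::map<std::pair<int, size_t>, std::queue<void*>> free_blocks_;
};

With this layout, release only pushes the block back into its queue, so the next acquire with the same device ID and size can reuse it without calling cudaMalloc again.
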

Enjoy coding!

An empirical method of debugging the “illegal memory access” bug in CUDA programming

Recently, I encountered the “an illegal memory access was encountered” error during CUDA programming, reported at a call such as:

cudaSafeCall(cudaMemcpy(h_res, d_res, gParams.n*sizeof(uint32), cudaMemcpyDeviceToHost));

Because kernels in CUDA are executed asynchronously, the code which reports the error is usually not the original culprit. After referring to this piece of code, I encapsulated a cudaMemoryTest function:

#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>
#include <iostream>

#define cudaSafeCall(call)  \
        do {\
            cudaError_t err = call;\
            if (cudaSuccess != err) \
            {\
                std::cerr << "CUDA error in " << __FILE__ << "(" << __LINE__ << "): " \
                    << cudaGetErrorString(err) << std::endl;\
                exit(EXIT_FAILURE);\
            }\
        } while(0)

void cudaMemoryTest()
{
    // Round-trip a small buffer between host and device; if an earlier,
    // asynchronous error is pending, one of these calls will report it.
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaSafeCall(cudaMalloc((int**)&d_a, bytes));

    memset(h_a, 0, bytes);
    cudaSafeCall(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    cudaSafeCall(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));

    cudaSafeCall(cudaFree(d_a));
    free(h_a);
}

Then insert this function call at the suspicious positions:

{
    cudaMemoryTest();
    kernel_A<<<...>>>;

    ......

    cudaMemoryTest();
    kernel_B<<<...>>>;

    ......
}

This method notifies me promptly once a CUDA memory access has gone wrong, so I can investigate further.

Mostly, the cause of this issue is a NULL pointer or a pointer to already-freed memory. But in a multi-GPU environment, you must also make sure the memory used by an operation is allocated on the device that operation runs on.
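
As an illustration of the multi-GPU pitfall (the kernel name and the device numbering below are made up for this example, and peer access is assumed to be disabled), a kernel running on one device that dereferences memory allocated on another device will trigger exactly this error:

// Illustration only: device 0 dereferences memory that lives on device 1.
__global__ void touch(int *p)
{
    p[0] = 42;
}

void crossDeviceBug()
{
    int *d_p = nullptr;

    cudaSetDevice(1);                          // allocate on device 1
    cudaSafeCall(cudaMalloc((int**)&d_p, sizeof(int)));

    cudaSetDevice(0);                          // but launch the kernel on device 0
    touch<<<1, 1>>>(d_p);                      // illegal access unless peer access is enabled

    cudaSafeCall(cudaDeviceSynchronize());     // the error surfaces here, not at the launch
}

On hardware that supports it, cudaDeviceEnablePeerAccess can make such an access legal, but the safer rule is still the one above: allocate and use the memory on the same device.
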

Function application in Haskell

As a Haskell newbie, I find life becomes easier once I understand function application:

(1) Function application is actually a “function call”. For example, define a simple add function that returns the sum of 2 numbers:

# cat add.hs
add :: Num a => a -> a -> a
add a b = a + b

Load it in ghci, and call this function:

# ghci
GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
Prelude> :l add
[1 of 1] Compiling Main             ( add.hs, interpreted )
Ok, one module loaded.
*Main> add 2 4
6
*Main> add 3 6
9

Beware that the tokens in a function application are separated by spaces. So once you see the following format:

a b ..

you know it is a function application, and therefore a “function call”.

(2) Function application has the highest precedence. Check the following example:

*Main> add 1 2 ^ add 1 2
27

It is equivalent to “(add 1 2) ^ (add 1 2)”: each add 1 2 evaluates to 3, and 3 ^ 3 is 27.

(3) The $ operator is the “application operator”; it is right-associative and has the lowest precedence. Check the following instance:

*Main> add 1 $ add 2 $ add 3 4
10

The $ operator divides the expression into 3 parts: “add 1”, “add 2” and “add 3 4”. Because $ is right-associative, the result of add 3 4 is fed into the add 2 function first; then the result of add 2 $ add 3 4 is passed into add 1. It is equivalent to “add 1 (add 2 (add 3 4))”, so $ can be used to remove parentheses.

References:
Prelude;
Calling functions.

Porting software is fun and rewarding

Regarding porting software, I think there are several kinds:

a) In the simplest case, a tool is created for Linux, and you want to use it on FreeBSD. Because there is no out-of-the-box package for this operating system, you grab the code and compile it yourself, with no complaint from the compiler. You run it and it seems to work, bingo! This is the perfect experience!

b) Life would be pleasant if everything were like the above case, but in reality it is definitely not. Sometimes the process doesn’t go so smoothly. Take socket programming as an example: Solaris has some specific requirements that may surprise you if you are only familiar with the Linux environment (please check this post). In this scenario you may have to tweak the compiler options and even customize your code.

c) The third case is when you need to read the whole source code and modify it, and this is what I am currently doing. Back on Monday, I received a task to verify a concept. I remembered an Open Source framework which implements a similar function, so I downloaded it and went through the code carefully. Fortunately, this project indeed satisfies our requirement, but since our computation environment is Nvidia GPUs, I need to replace the related code with CUDA APIs in addition to integrating this framework into our code repository. Barring accidents, I think I can finish the whole work next week.

From my personal experience, porting software is really rewarding! Take this week’s work as an example: I learnt a new C++ library and refreshed my knowledge of the graph data structure. Furthermore, porting software can also be fun: after several hours’ or even days’ hard work, a tool tailored to your needs finally meets your requirement, and that makes you feel very fulfilled!

In the end, I must declare that I am not encouraging you to be lazy and avoid thinking through problems yourself; instead, you should leverage existing resources rationally. Moreover, please conform to the software license, and don’t violate it.

Enjoy porting!