GCC’s “-fstack-protector-strong” option

GCC’s “-fstack-protector-strong” option helped me catch an array overflow bug recently. This option adds a “canary” to the function’s stack frame; when the function returns, it checks whether the guard value has been corrupted. If it has, __stack_chk_fail() is invoked:

    0x00007ffff5138674 <+52>:   mov    -0x38(%rbp),%rax
    0x00007ffff5138678 <+56>:   xor    %fs:0x28,%rax
    0x00007ffff5138681 <+65>:   jne    0x7ffff5138ff3 <function+2483>
    ......
    0x00007ffff5138ff3 <+2483>: callq  0x7ffff50c2100 <__stack_chk_fail@plt>

And the program will crash:

*** stack smashing detected ***: program terminated
Segmentation fault

Use gdb to check:

(gdb) bt
#0  0x00007fffde26e0b8 in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007fffde26efb9 in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#2  0x00007fffde890aa6 in backtrace () from /usr/lib64/libc.so.6
#3  0x00007fffde7f4ef4 in __libc_message () from /usr/lib64/libc.so.6
#4  0x00007fffde894577 in __fortify_fail () from /usr/lib64/libc.so.6
#5  0x00007fffde894532 in __stack_chk_fail () from /usr/lib64/libc.so.6
#6  0x00007ffff5138ff8 in function () at src.c:685
#7  0x045b9fd4c77e2ff3 in ?? ()
#8  0x9a8ad8e7e2eb8ca8 in ?? ()
#9  0x0fa0e627193655f1 in ?? ()
#10 0xfc295178098bb96f in ?? ()
#11 0xa09a574a7780cd13 in ?? ()
......

The stack frames and return addresses have been overwritten, so the call stack can’t be recovered. Also be aware that the line which gdb prints:

#6  0x00007ffff5138ff8 in function () at src.c:685

may not point at the real culprit!
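
As a minimal illustration (my own sketch, not the code from the real bug), a reproducer could look like this; compiled with “gcc -fstack-protector-strong”, the overflow is caught when function() returns:

/* Hypothetical reproducer: the out-of-bounds strcpy() overwrites the stack
   canary, so the program aborts with "*** stack smashing detected ***"
   when function() returns, instead of returning to a corrupted address. */
#include <string.h>

void function(const char *input)
{
    char buf[16];
    strcpy(buf, input);    /* writes past buf[] for long inputs */
}

int main(void)
{
    function("a string that is definitely longer than sixteen bytes");
    return 0;
}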

Treasure the core dump file

Recently I fixed a memory corruption issue: in an 8-byte memory address, one byte in the middle was set to 0, so the address became invalid and dereferencing it crashed the program. This reminded me of another memory corruption issue I had fixed before. In my experience, this kind of corruption is very difficult to debug: the adjacent memory is all fine, and only one or a few bytes are changed to other values. Such bugs are not obvious out-of-bounds accesses, and it is hard to find a way to reproduce them.

Generally speaking, logs can’t always help when memory is randomly corrupted, not to mention that in some situations traces aren’t available at all. The only thing you have is the core dump file, and you must use it to unearth as much information as possible. For example, from the program’s perspective: what was the state of the program when it crashed? Apart from the ruined memory, were there other abnormalities? From the system’s perspective: have you examined all of the registers’ values? Are they all valid? If not, which part of the code could have caused that?
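
For example, these are the kind of gdb commands I run against a core dump first (generic commands, not output from this particular bug):

(gdb) bt                   # where did the program crash?
(gdb) info registers       # are all register values sane?
(gdb) frame 0
(gdb) info locals          # state of the crashing frame
(gdb) x/16gx $rsp          # raw memory around the stack pointer
(gdb) disassemble          # which instruction actually faulted?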

So every time you meet a hard-to-reproduce bug, don’t freak out. Calm down and start analyzing the core dump file carefully. You become the detective, and the core dump file is the crime scene. In real life you can’t ask the criminal to commit the crime again; similarly, not every bug will reoccur, so you must try your best to find the root cause from the core dump file. In my experience, every tough debugging session makes you understand the program and the system better, so it is a precious learning opportunity.

Treasure the core dump file and enjoy debugging!

OpenBSD saves me again! — Debug a memory corruption issue

Yesterday, I came across an issue in a third-party library which crashes while allocating memory:

......
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f594a5a9b6b in _int_malloc () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f594a5a9b6b in _int_malloc () from /usr/lib/libc.so.6
#1  0x00007f594a5ab503 in malloc () from /usr/lib/libc.so.6
#2  0x00007f594b13f159 in operator new (sz=5767168) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/new_op.cc:50
......

It is obvious that the heap metadata (the “memory tags”) is corrupted, but who is the murderer? Since the library involves a lot of math computation, it is not easy to grasp the code quickly, so I need to find another way:

(1) Enable all compiler warnings with -Wall. Nothing found.

(2) Use valgrind, but unfortunately, valgrind crashes itself:

......
valgrind: the 'impossible' happened:
   Killed by fatal signal

host stacktrace:
==43326==    at 0x58053139: get_bszB_as_is (m_mallocfree.c:303)
==43326==    by 0x58053139: get_bszB (m_mallocfree.c:315)
==43326==    by 0x58053139: vgPlain_arena_malloc (m_mallocfree.c:1799)
==43326==    by 0x5800BA84: vgMemCheck_new_block (mc_malloc_wrappers.c:372)
==43326==    by 0x5800BD39: vgMemCheck___builtin_vec_new (mc_malloc_wrappers.c:427)
==43326==    by 0x5809F785: do_client_request (scheduler.c:1866)
==43326==    by 0x5809F785: vgPlain_scheduler (scheduler.c:1433)
==43326==    by 0x580AED50: thread_wrapper (syswrap-linux.c:103)
==43326==    by 0x580AED50: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1
......

(3) Change the compiler: use clang instead of gcc and hope it gives me some clues. Still nothing.

(4) Switch the operating system from Linux to OpenBSD. The program crashes again, but this time it tells me where the memory corruption occurs:

......
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000014b07f01e52d in addMod (r=<error reading variable>, a=4693443247995522, b=28622907746665631,
......

I figure out the issue quickly, without having to understand the whole code base. OpenBSD saves me again, thanks!
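
For the record, the class of bug behind this kind of crash is usually an out-of-bounds write that tramples the allocator’s bookkeeping next to the block, so a later allocation crashes far away from the real culprit. A contrived sketch (my own illustration, not the library’s code):

// Contrived illustration: the write at index N lands in the heap metadata
// behind the block, so a *later* new[] may crash inside the allocator.
// OpenBSD's stricter malloc tends to make such bugs fault much earlier.
#include <cstdint>

int main()
{
    const int N = 16;
    uint64_t *a = new uint64_t[N];
    for (int i = 0; i <= N; ++i)    // off-by-one: i == N is out of bounds
        a[i] = 0x4141414141414141;
    uint64_t *b = new uint64_t[N];  // can crash here, not at the bad write
    delete[] a;
    delete[] b;
    return 0;
}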


Some tips on using a “pool” in programming

In the past few weeks, I have been diving into the CUDA APIs to work with multiple GPUs. One part of that work is a “memory pool” module which manages allocating/freeing memory on different devices. In this post, I want to share some thoughts about the “pool”, an interesting technique in programming.

A “pool” works like a “cache”: it allocates some resources beforehand, and when the application asks for one, the pool hands out an available one. One classic example is the “database connection pool”: the pool preallocates some TCP connections and keeps them alive, which saves the client from handshaking with the server every time. Another instance is the “memory pool” which I implemented recently: the pool keeps freed memory instead of really releasing it back to the device. Based on my benchmark, the application’s performance can improve by 100% in extreme cases. (Caveat: for a multithreaded application, if the pool’s locking becomes the performance bottleneck, try giving every thread its own private pool.)

The other benefit of a “pool” is for debugging. Using my “memory pool” as a demonstration again: every memory pointer is accompanied by a device ID (which GPU the memory was allocated from) and the memory size, so you can trace the whole life of a block of memory clearly from the log. This has saved me many times from the notorious “an illegal memory access was encountered” error in multi-GPU programming.

Last but not least, although this “memory pool” is merely ~100 lines of code, it already uses the following data structures: queue, vector, pair, and map, not to mention the mutex and lock, which are essential to avoid nasty data races. Writing this module was good practice for honing my programming craft; a minimal sketch is shown below.
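
Here is a minimal sketch of such a pool (my own simplification, using the CUDA runtime API; the real module also has error checking and logging):

#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>
#include <queue>
#include <utility>

// Minimal sketch: freed blocks are cached per (device, size) and handed back
// to later allocations instead of being returned to the driver right away.
class MemoryPool {
public:
    void* allocate(int device, size_t bytes) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = free_blocks_.find({device, bytes});
            if (it != free_blocks_.end() && !it->second.empty()) {
                void* p = it->second.front();
                it->second.pop();
                return p;                      // reuse a cached block
            }
        }
        void* p = nullptr;
        cudaSetDevice(device);                 // allocate on the right GPU
        cudaMalloc(&p, bytes);
        return p;
    }

    // "Freeing" only returns the block to the pool; cudaFree() is deferred.
    void release(int device, size_t bytes, void* p) {
        std::lock_guard<std::mutex> guard(mutex_);
        free_blocks_[{device, bytes}].push(p);
    }

private:
    std::mutex mutex_;
    std::map<std::pair<int, size_t>, std::queue<void*>> free_blocks_;
};

When the single mutex becomes the bottleneck, the same structure can simply be instantiated once per thread, as mentioned above.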

Enjoy coding!

An empirical method of debugging the “illegal memory access” bug in CUDA programming

Recently, I encountered the “an illegal memory access was encountered” error during CUDA programming, at calls such as:

cudaSafeCall(cudaMemcpy(h_res, d_res, gParams.n*sizeof(uint32), cudaMemcpyDeviceToHost));

Because kernels in CUDA are executed asynchronously, the code which reports the error is usually not the original culprit. After referring to this piece of code, I encapsulated a new cudaMemoryTest function (plus a cudaSafeCall checking macro):

#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>
#include <iostream>

#define cudaSafeCall(call)  \
        do {\
            cudaError_t err = call;\
            if (cudaSuccess != err) \
            {\
                std::cerr << "CUDA error in " << __FILE__ << "(" << __LINE__ << "): " \
                    << cudaGetErrorString(err) << std::endl;\
                exit(EXIT_FAILURE);\
            }\
        } while(0)

// Round-trip a buffer between host and device; if an earlier operation has
// already poisoned the CUDA context, one of these calls reports the error.
void cudaMemoryTest()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaSafeCall(cudaMalloc((int**)&d_a, bytes));

    memset(h_a, 0, bytes);
    cudaSafeCall(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    cudaSafeCall(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));

    cudaSafeCall(cudaFree(d_a));
    free(h_a);
}

And insert calls to this function at the suspicious positions:

{
    cudaMemoryTest();
    kernel_A<<<...>>>;

    ......

    cudaMemoryTest();
    kernel_B<<<...>>>;

    ......
}

This method notifies me promptly once a CUDA memory access becomes abnormal, and then I can investigate further.
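
A related sketch, which is my own addition rather than part of the method above: because the error is reported asynchronously, calling cudaGetLastError() and cudaDeviceSynchronize() right after a suspicious launch also pins the failure to that specific kernel (checkLastKernel is an assumed helper name):

#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

// Check both launch-time and execution-time errors right at the launch site.
void checkLastKernel(const char *label)
{
    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised during execution
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA error after " << label << ": "
                  << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }
}

// Usage: kernel_A<<<grid, block>>>(...); checkLastKernel("kernel_A");

Since the extra synchronization hurts performance, this check is only meant for debugging builds.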

Mostly, the cause of this issue is a NULL pointer or a pointer to already-freed memory. But in a multi-GPU environment, you must also make sure that the memory used by one operation is allocated on the same device.