Reading code is still the most effective way to debug multi-threaded bugs

In the past month, I fixed two multi-threaded bugs. Their symptoms were:

a) The first bug: some threads deadlocked. It occurred only on a few production machines, and not often. It never happened in the testbed.

b) The second bug: the program crashed after running for 3 ~ 5 hours, because it entered a should-never-enter code path that triggers an assert. Although there was a core dump file, I couldn't find any clues at the crime scene.

The straightforward way to debug the first bug was to check that every lock operation is paired with an unlock on every code path. Unfortunately, that was not the root cause, so I began to check all the code related to the lock. After two days, I finally found a copy-paste error that opened a can of worms.
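
The "lock and unlock are paired on every path" check is easier to do when a function has a single exit that releases the lock. Below is a minimal sketch of that shape, not the code from the bug: `table_lock`, `table` and `table_set` are hypothetical names, and POSIX threads are assumed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state protected by one mutex. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table[8];

/* Store value at idx. Returns 0 on success, -1 if idx is out of range.
 * Every path, including the error path, leaves through `out`, so the
 * unlock cannot be skipped by an early return. */
static int table_set(size_t idx, int value)
{
    int rc = -1;

    pthread_mutex_lock(&table_lock);
    if (idx >= sizeof(table) / sizeof(table[0]))
        goto out;               /* error path still reaches the unlock */
    table[idx] = value;
    rc = 0;
out:
    pthread_mutex_unlock(&table_lock);
    return rc;
}
```

If the error path returned directly instead of jumping to `out`, the next call to `table_set` would block forever on the default mutex type, which is one way a deadlock like the one above can appear only on rare inputs.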

For the second bug, I went through all the code where multiple threads access the problematic variable, line by line, to see whether some corner case could cause contention. Thank god! While I was taking a break at noon, the idea finally came to me.

As you can see, while debugging these two bugs I couldn't find any better method than reading the code again and again (I did try adding more traces, but it didn't help). BTW, what the two bugs have in common is that the fix was simple: changing just one line of code.

Beware of out-of-bounds array access

Today my colleague fixed a bug related to out-of-bounds array access: a hash function returns the selected index into an array, but its return type is int, so in a corner case where the hash value overflows, the result can become negative, and an invalid element of the array gets accessed. The lessons I learnt from this bug:
(1) Review the return type and value range of the hash function;
(2) Pay attention to the index whenever accessing an array: can it go out of bounds?
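
To make the failure mode concrete, here is a small sketch. It is not my colleague's code: `bucket_signed` and `bucket_unsigned` are hypothetical names, and the large hash value is modelled by converting a `uint32_t` to `int`, which wraps to a negative number on common compilers such as GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buggy shape: the hash is carried in an int. A value above INT_MAX
 * becomes negative (implementation-defined conversion; it wraps on
 * common compilers), and % keeps the sign of its left operand. */
static int bucket_signed(uint32_t hash, int nbuckets)
{
    int h = (int)hash;
    return h % nbuckets;        /* may be negative */
}

/* Safer shape: stay unsigned from the hash to the index, so the
 * result is always in [0, nbuckets). */
static size_t bucket_unsigned(uint32_t hash, size_t nbuckets)
{
    assert(nbuckets > 0);
    return (size_t)hash % nbuckets;
}
```

With the signed version, a hash such as 0x80000001 can produce a negative index and the array access lands before the first element. Keeping both the hash and the index in unsigned types removes that case.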

The experience of fixing a memory corruption issue

I came across a program crash last week:

Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff365bd29 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
#0  0x00007ffff365bd29 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
#1  0x00007ffff606025c in memcpy (__len=<optimized out>, __src=0x0, __dest=0x0) at /usr/include/bits/string3.h:51
......
#5  0x0000000000000000 in ?? () 

The address of the 5th stack frame is 0x0000000000000000, which does not look right. To debug it, I first got the register values (e.g. with gdb's info registers command).

On x86_64, when frame pointers are in use, the value at memory address (%rbp) should be the previous %rbp value, and the value at (%rbp) + 8 should be the return address. I checked these two values and found both were 0, which means the stack was corrupted.
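
The two words being checked can be modelled in C. This is only a sketch of the layout, assuming x86_64 and a build with frame pointers: `struct frame_record`, `frame_looks_corrupted` and `current_frame_record` are hypothetical names, and `__builtin_frame_address` is a GCC/Clang extension rather than standard C.

```c
#include <stddef.h>

/* Model of the two words at (%rbp) and (%rbp) + 8 for a function
 * compiled with frame pointers on x86_64. */
struct frame_record {
    const struct frame_record *prev_rbp; /* value at (%rbp)     */
    const void *ret_addr;                /* value at (%rbp) + 8 */
};

/* Both words being zero is the symptom described above. */
static int frame_looks_corrupted(const struct frame_record *fr)
{
    return fr->prev_rbp == NULL && fr->ret_addr == NULL;
}

/* Copy out the live frame record of this function. noinline keeps
 * the function in its own stack frame. */
__attribute__((noinline))
static struct frame_record current_frame_record(void)
{
    const struct frame_record *fr = __builtin_frame_address(0);
    return *fr;
}
```

In a healthy process the return address of a live frame is never 0, so `frame_looks_corrupted` applied to `current_frame_record()` should report false. In the core dump, the same two words read from the crashed frame were both 0.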

The next step was to dump the memory between %rsp and %rbp and read it alongside the assembly code of the function. That showed me which part of the memory looked wrong, so I could review the code accordingly. Finally I found the root cause and fixed it.

P.S. In an optimised build, some functions may be inlined, so the stack frames may not map one-to-one onto the functions in the source. Please be aware of this caveat.

The pitfall of upgrading a 3rd-party library

Today, I debugged a tricky issue related to a 3rd-party library. When I used gdb to check a structure's values, I found that the last member was missing compared with the definition in the header file. I began to suspect the 3rd-party library. I checked the upgrade log and found the root cause: when I compiled the code, the library's version was v1.1, but by the time I ran the program, someone else had upgraded the library to v1.2, which caused this mysterious bug. The solution was simple: rebuild the code. But the debugging process was exhausting.
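
One way to catch this kind of mismatch early is to compare the version the program was compiled against with the version that is actually loaded at run time. The sketch below is hypothetical: `FOO_HEADER_VERSION`, `foo_runtime_version` and `foo_version_matches` are made-up names, and the runtime function is a stand-in defined in the same file so the example is self-contained. Some real libraries offer such a pair (zlib has `ZLIB_VERSION` and `zlibVersion()`), but whether the library in this story does is not something I checked.

```c
#include <string.h>

/* Hypothetical: the version string the library header provides at
 * compile time. */
#define FOO_HEADER_VERSION "1.1"

/* Hypothetical stand-in for a function exported by the shared
 * library, reporting the version that is loaded at run time. */
static const char *foo_runtime_version(void)
{
    return "1.2";
}

/* Returns 1 when the program was built against the same version it
 * is running with, 0 otherwise. */
static int foo_version_matches(const char *built_against,
                               const char *loaded)
{
    return strcmp(built_against, loaded) == 0;
}
```

Calling `foo_version_matches(FOO_HEADER_VERSION, foo_runtime_version())` at start-up and refusing to continue on a mismatch would turn a mysterious structure-layout bug into an immediate, readable error.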

Bisection assert is a good debugging method

Recently, I fixed an issue caused by an uninitialised bit-field in C. Because an uninitialised bit-field can be either 0 or 1, the bug occurs randomly. The good news is that the reproduction rate was very high, nearly 50%. Although I was not familiar with the code, I used bisection assert to help:

 {
  ......
  assert(bit_field == 0);   /* checkpoint 1 */
  ......
  assert(bit_field == 0);   /* checkpoint 2 */
  ......
 }

If the first assert is not triggered but the second one is, I know the bug lies in the code block between them. I then bisect that block and add asserts again, until the root cause is found.
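
Here is a compilable sketch of the same idea. None of it is the real code: `struct session`, its `dirty` bit-field, `session_new`, `stage_one`, `stage_two` and `process` are hypothetical names. The allocation is shown in its fixed form (`calloc` zero-fills the structure), assuming the buggy form used `malloc` and never assigned the bit-field.

```c
#include <assert.h>
#include <stdlib.h>

struct session {                /* hypothetical structure */
    unsigned int id;
    unsigned int dirty : 1;     /* the bit-field under suspicion */
};

/* Fixed allocation: calloc zero-fills, so dirty starts at 0. */
static struct session *session_new(unsigned int id)
{
    struct session *s = calloc(1, sizeof(*s));
    if (s != NULL)
        s->id = id;
    return s;
}

static void stage_one(struct session *s) { s->id += 1; }
static void stage_two(struct session *s) { s->id *= 2; }

static unsigned int process(struct session *s)
{
    assert(s->dirty == 0);  /* fails here: the bug is before process() */
    stage_one(s);
    assert(s->dirty == 0);  /* fails here: the bug is in stage_one()   */
    stage_two(s);
    assert(s->dirty == 0);  /* fails here: the bug is in stage_two()   */
    return s->id;
}
```

Each assert that passes rules out everything before it, so the first one that fires narrows the search to the code since the previous checkpoint. Once the range is small enough to read, the asserts can be removed again.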