debug | Nan Xiao's Blog

Why do I need a debugger?

When I begin learning a new programming language, I try to master its debugger as early as possible. For example, in 2013, when I first touched Go, gdb seemed to be the only debugger available, even though gdb itself is not a good choice (from Debugging Go Code with GDB):

As a consequence, although GDB can be useful in some situations, it is not a reliable debugger for Go programs, particularly heavily concurrent ones.

But at that time there was no other choice. So after delve came out, I switched to it without hesitation. Though I am not an expert on the delve code, I still try my best to contribute and make delve better: improving documents, reporting issues, etc. Why am I so keen on debuggers? The answer is that a debugger is a truly irreplaceable and necessary tool for software engineers. The reasons are as follows:

(1) If print statements are the only debugging method available for a programming language, it will make you upset. For example, a bug may be fully reproducible, yet you can’t figure out the reason from the logs. If there is a debugger which lets you step through every statement and inspect the value of every variable, I think you can get to the root cause quickly.

Certainly, the debugger isn’t omnipotent. Take nasty multi-threaded bugs, for example (if you are interested in this topic, you can read this post, which describes an experience I went through). If you can’t find the conditions that reproduce the issue, and adding logs still fails, you can try adding some assert statements which are triggered when the issue happens again. Then you get a core dump file which records the scene of the crime, and you can use a debugger to analyze it. So the debugger still plays an important part in this scenario.

(2) “A debugger is a perfect unit-test tool”: these words come from my director at my first full-time job after leaving school. The reason is that when you finish a code segment, you can use the debugger to check whether it is correct by stepping through it: check the value of every variable, mock conditions which can’t easily be constructed in a black-box test, inspect the stack and registers, etc. By this means, you can fix many corner-case bugs.

(3) A debugger is a good tool to help you understand code. When I dive into some big Go projects, I find so many channels, interfaces, and goroutines that they sometimes drive me crazy. But with a debugger I can set breakpoints, and once the program stops, I can understand the code logic better by examining the call stack.

Based on the above, a debugger is an invaluable tool for everyone who makes a living writing code. Try to know and master it better. Maybe one day a colleague can’t find the reason for a bug, and you use a small debugger trick to spot the root cause immediately. Isn’t that cool? 🙂

A brief intro of delve

delve is a debugger written in Golang and dedicated to troubleshooting Golang programs (home page is here). Though it is still in a pre-1.0 release, I think it is stable enough for daily use. BTW, if you find bugs, you can report them to the developers and help make delve stronger! P.S.: although fmt.Print and its buddies are useful in most cases, I strongly recommend you try to use delve to inspect the internal mechanisms of your code, because it helps you know Golang more deeply, not just superficially.

Installing delve is very simple. Taking the Linux platform as an example, it is no different from setting up other Golang projects; a “go get” is enough:

go get github.com/derekparker/delve/cmd/dlv

Now, in $GOPATH/bin, there is an extra dlv executable binary file (notice: the project is named delve, while the executable file is called dlv. I even made a foolish mistake over this when I began to use it). Run the dlv command, and it will show you a detailed manual of delve:

# dlv
Delve is a source level debugger for Go programs.

Delve enables you to interact with your program by controlling the execution of the process,
evaluating variables, and providing information of thread / goroutine state, CPU register state and more.

The goal of this tool is to provide a simple yet powerful interface for debugging Go programs.

Usage:
  dlv [command]

Available Commands:
  version     Prints version.
  run         Deprecated command. Use 'debug' instead.
  debug       Compile and begin debugging program.
......

Let’s check this artificial Hello.go program:

package main

import "fmt"

func main() {
        var s []byte
        s = append(s, []byte("Hello, Debugging!")...)
        fmt.Println(string(s))
}

Use delve to debug it:

# dlv debug Hello.go
Type 'help' for list of commands.
(dlv) help
The following commands are available:
    help (alias: h) ------------- Prints the help message.
    break (alias: b) ------------ Sets a breakpoint.
    trace (alias: t) ------------ Set tracepoint.
    restart (alias: r) ---------- Restart process.
    continue (alias: c) --------- Run until breakpoint or program termination.
    step (alias: s) ------------- Single step through program.
    step-instruction (alias: si)  Single step a single cpu instruction.
    next (alias: n) ------------- Step over to next source line.
    threads --------------------- Print out info for every traced thread.
    thread (alias: tr) ---------- Switch to the specified thread.
......

If you are familiar with gdb, you will find the commands are very similar, and I promise you can master delve soon.

An interesting thing is that delve doesn’t provide the start command which gdb offers, so you should set breakpoints first, then run the continue command:

(dlv) b Hello.go:8
Breakpoint 1 set at 0x4011ea for main.main() ./Hello.go:8
(dlv) c
> main.main() ./Hello.go:8 (hits goroutine(1):1 total:1) (PC: 0x4011ea)
     3: import "fmt"
     4:
     5: func main() {
     6:         var s []byte
     7:         s = append(s, []byte("Hello, Debugging!")...)
=>   8:         fmt.Println(string(s))
     9: }
(dlv) p s
[]uint8 len: 17, cap: 32, [72,101,108,108,111,44,32,68,101,98,117,103,103,105,110,103,33]
(dlv) goroutines
[4 goroutines]
* Goroutine 1 - User: ./Hello.go:8 main.main (0x4011ea)
  Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:263 runtime.gopark (0x42a153)
  Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:263 runtime.gopark (0x42a153)
  Goroutine 4 - User: /usr/local/go/src/runtime/mfinal.go:144 runtime.runfinq (0x413f80)

Cool, isn’t it? Now you can observe almost everything you want to know about your program.

Happy Debugging! Happy delving!

The tips of debugging Mesos

In the past week, I was following this tutorial to build a “kubernetes on Mesos” testbed. All went well, but the Mesos master kept complaining with the following message:

......
E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected
......

Firstly, I tried to get help from the Mesos mailing list and stackoverflow, but when others couldn’t give a direct answer, I knew I had to depend on myself. After enduring a tough debugging process, I worked out the root cause. Since I am a newbie to Mesos and C++ (Mesos is implemented in C++, and the last time I touched C++ was 7 years ago), I think the experience and tips may also be useful for other novices. So I summarize them as follows:

(1) LOG VS VLOG

When you meet an issue, analyzing the logs should be the first step. Mesos uses google-glog to generate its logs, and the log format is explained here:

Log lines have this form:
    Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg...
where the fields are defined as follows:
    L                A single character, representing the log level (eg 'I' for INFO)
    mm               The month (zero padded; ie May is '05')
    dd               The day (zero padded)
    hh:mm:ss.uuuuuu  Time in hours, minutes and fractional seconds
    threadid         The space-padded thread ID as returned by GetTID()
    file             The file name
    line             The line number
    msg              The user-supplied message

So with the above explanation, you can easily understand this log:

E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected

By default, Mesos doesn’t output logs generated by the VLOG function; you need to set GLOG_v=m, where m is the verbosity level (3 in the example below), if you want to see the information from VLOG (refer to this post):

$ sudo GLOG_v=3 ./bin/mesos-master.sh --ip=15.242.100.56 --work_dir=/var/lib/mesos
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1229 22:42:38.818521 11830 process.cpp:2426] Spawned process __gc__@15.242.100.56:5050
I1229 22:42:38.818613 11846 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818540032+00:00
I1229 22:42:38.818749 11847 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818712832+00:00
I1229 22:42:38.818802 11844 process.cpp:2436] Resuming help@15.242.100.56:5050 at 2015-12-30 03:42:38.818746112+00:00
......

You can also use the LOG function to add logs yourself at suspected locations.

(2) Gdb

If logs can’t save you, it is time for the debugger to be your hero. Gdb is no doubt a great tool for debugging C/C++ programs. To use gdb, you should pass the --enable-debug configuration option before compiling Mesos:

nan@ubuntu:~/mesos-0.25.0/build$ ../configure --enable-debug

You can set a breakpoint on a class member function like this:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

You can also make use of gdb’s “auto-complete” feature. Input the incomplete function name:

(gdb) b process::SocketManager::cl

Then press Tab:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

Notice: if too many symbols match, gdb may hang. So try to narrow the scope as much as possible.

Additionally, since source file names are relative to the directory where the code was compiled (please refer to breakpoints in GDB), you can achieve the same effect with the “b file:line” command:

(gdb) b ../../../3rdparty/libprocess/src/process.cpp:1279
Breakpoint 2 at 0x7fd07857973a: file ../../../3rdparty/libprocess/src/process.cpp, line 1279.
(gdb) c
Continuing.
......
[Switching to Thread 0x7fd06b9d4700 (LWP 16677)]

Breakpoint 2, process::SocketManager::link_connect (this=0xca1a30, future=..., socket=0x7fd0500026d0, to=...)
    at ../../../3rdparty/libprocess/src/process.cpp:1279
1279      if (future.isDiscarded() || future.isFailed()) {

You can see the breakpoint is set in the process::SocketManager::link_connect(process::Future<Nothing> const&, process::network::Socket*, process::UPID const&) function.

P.S.: There are also handy out-of-the-box gdb scripts in the build/bin directory:

# ls bin/gdb-mesos-*
bin/gdb-mesos-local.sh  bin/gdb-mesos-master.sh  bin/gdb-mesos-slave.sh  bin/gdb-mesos-tests.sh

(3) Tcpdump and wireshark

Network packet analysis tools such as tcpdump and wireshark are essential for diagnosing programs which interact with other hosts. E.g., you can use the following command to see what comes in and out of the Mesos master:

sudo tcpdump -A -s 0 'tcp port 5050' -i em1 -w capture.pcap

BTW, my issue was finally fixed by analyzing the following packet:

(4) Pstack script

Personally, I think the pstack script is useful for monitoring thread status; please refer to Use pstack to track threads on Linux for details.

Enjoy debugging!

An experience of fixing a memory-corruption bug

For the last 4 months, I was troubled by a memory-corruption bug which caused the program to crash. Last Monday, I finally found the root cause and fixed it. The debugging process was a difficult but memorable experience, so I will share it in this article.

My program works as an SMS hub. When it receives an SMS, it allocates a structure in heap memory like this:

typedef struct
{
......
int *a[8];
......
} info;

After processing the SMS, the program frees the memory and sends the SMS to the next hub or operator.

Since last November, the program has crashed from time to time, and the cause is that the second element of array a (a[1]) is changed from a valid value to NULL.

(1) Checking the logs and reproducing the bug
Firstly, I checked the commercial logs, but no clues could be found. The SMS which killed the program also seemed no different from the others. I also tried to use this SMS to reproduce the bug in the testbed, but that failed too.

(2) Using libumem
Because our program runs on Solaris, I linked it with libumem and hoped it could help me. After a few days, the program crashed again. But the tags before and after the corrupted memory were all OK, so it was not an out-of-bounds memory bug. I also checked the memory before and after the corrupted memory, but found nothing valuable.

(3) Adding more logs
By then, the only thing I could think of was adding more logs. After adding enough logs, I found the variable was modified between function calls in the same thread: when leaving one function the variable was OK, but on entering the next function it had changed. So I could be sure the variable was being corrupted by another thread. But how could I find the murderer?

(4) Asking for help from other geeks
I began to ask for help from other geeks. I put this question on stackoverflow: How to debug the memory is changed randomly issue, and received a very good and comprehensive answer. I recommend everyone read this post. I also sent emails to other geeks; most of them replied and gave suggestions, such as changing the order of some variable definitions. I also wanted to get the root cause from the core dump file directly: Can I know which thread change the global variable’s value from core dump file?, but in the end I failed. Then one day, I found an article by accident (aha, written in Chinese!) describing a bug very similar to mine, except that its program was written in C++ and mine in C. I began to follow the steps it provided to find the root cause.

(5) Porting the program to Linux and using valgrind
I wanted to use valgrind to help find the bug, but my program runs on Solaris, and valgrind couldn’t be used on Solaris, so a colleague helped port it to Linux. After running valgrind, I did find some memory-related bugs, but they couldn’t explain this issue. I still couldn’t find the root cause.

(6) Using electric-fence
I tried to use electric-fence, but its home page had been closed. I found some packages and source code on the internet and tried to integrate them into my program. But this time I failed as well.

(7) Using the mprotect function
Because the second element of the array was always changed to NULL, I stopped using plain malloc for the array, placed it in page-aligned memory instead, and used mprotect to mark that memory read-only. So if the memory is written to, the program will crash. But mprotect works on whole pages, so the array now occupies a page of its own, and this may change the program’s behavior; I was not sure whether the bug would occur again.

(8) Finding the cause
About 2 weeks later, when a colleague stopped the application, it crashed again. The cause was that a global variable had been changed. I immediately checked all the old core dumps and found that this global variable had always been changed and pointed to the address of the array! It seemed I was very close to the truth.

After about 2 days of analysis, the cause was found: when a thread calls pthread_create to create another thread, it changes a global variable. But in a few cases, the child thread executes first, and it also changes the global variable. The code assumes the parent thread always executes first, and this causes the program to crash.

Looking back on the 4-month experience, I have learned a lot of things while debugging this bug: libumem, valgrind, libefence, etc. It is a really memorable and cool experience!