The tips of debugging Mesos

In the past week, I was following this tutorial to build a “kubernetes on Mesos” testbed. All went well but the Mesos master always complains following words:

......
E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected
......

Firstly, I tried to get help from Mesos mailing list and stackoverflow, but after other friends can’t give correct answers directly, I knew I must depend on myself. Enduring a tough debugging process, I worked out the root cause. Since I am a newbie of Mesos and C++(Mesos is implemented in C++, and I last time touch C++ was 7 years ago), I think the experiences and tips may also be useful for other novices. So I summarize them as the following words:

(1) LOG VS VLOG

When you meet an issue, analyzing log should be the first step. Mesos utilizes the google-glog to generate the logs. And the log format explanation is here:

Log lines have this form:
    Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg...
where the fields are defined as follows:
    L                A single character, representing the log level (eg 'I' for INFO)
    mm               The month (zero padded; ie May is '05')
    dd               The day (zero padded)
    hh:mm:ss.uuuuuu  Time in hours, minutes and fractional seconds
    threadid         The space-padded thread ID as returned by GetTID()
    file             The file name
    line             The line number
    msg              The user-supplied message

So compared with the above words, you can easily understand this log:

E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected

By default, the Mesos doesn’t output logs generated by VLOG function, and you need to set GLOG_v=m if you want to see the information from VLOG function (Refer this post):

$ sudo GLOG_v=3 ./bin/mesos-master.sh --ip=15.242.100.56 --work_dir=/var/lib/mesos
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1229 22:42:38.818521 11830 process.cpp:2426] Spawned process __gc__@15.242.100.56:5050
I1229 22:42:38.818613 11846 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818540032+00:00
I1229 22:42:38.818749 11847 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818712832+00:00
I1229 22:42:38.818802 11844 process.cpp:2436] Resuming help@15.242.100.56:5050 at 2015-12-30 03:42:38.818746112+00:00
......

You can also use LOG function to add logs yourself on suspected locations.

(2) Gdb

If logs can’t save you, it is time for debugger to be your hero. Gdb is no doubt a great tool of debugging C/C++ programs. To use gdb, you should enable --enable-debug configuration option before compiling Mesos:

nan@ubuntu:~/mesos-0.25.0/build$ ../configure --enable-debug

You can set breakpoint on class member function like this:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

You can also make use of “auto-complete” feature of gdb. Input the uncompleted function name:

(gdb) b process::SocketManager::cl  

Then click tab:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

Notice: If the matched symbols are too many, it may hang gdb. So try to reduce the scope as small as possible.

Additionally, since source file names are relative to the directory where the code was compiled (please refer breakpoints in GDB),you can reach the same effect through “b file:line“command :

(gdb) b ../../../3rdparty/libprocess/src/process.cpp:1279
Breakpoint 2 at 0x7fd07857973a: file ../../../3rdparty/libprocess/src/process.cpp, line 1279.
(gdb) c
Continuing.
......
[Switching to Thread 0x7fd06b9d4700 (LWP 16677)]

Breakpoint 2, process::SocketManager::link_connect (this=0xca1a30, future=..., socket=0x7fd0500026d0, to=...)
    at ../../../3rdparty/libprocess/src/process.cpp:1279
1279      if (future.isDiscarded() || future.isFailed()) {

You can see the breakpoint is set onprocess::SocketManager::link_connect(process::Future<Nothing> const&, process::network::Socket*, process::UPID const&)function.

P.S.:There are also handy out-of-box gdb scripts in build/bin directory:

# ls bin/gdb-mesos-*
bin/gdb-mesos-local.sh  bin/gdb-mesos-master.sh  bin/gdb-mesos-slave.sh  bin/gdb-mesos-tests.sh

(3) Tcpdump and wireshark

Network packet analyzing tools such as tcpdump and wireshark are essential to diagnose programs which interact with other hosts. E.g., you can use following command to see what come in and out of Mesos master:

sudo tcpdump -A -s 0 'tcp port 5050' -i em1 -w capture.pcap

BTW, my issue is finally fixed by analyzing the following packet:

1

(4) Pstack script

Personally, I think pstack script is useful when monitoring thread status, and please refer Use pstack to track threads on Linux for detail.

Enjoy debugging!

 

Execute “mesos-slave.sh” using root privilege

I have spent the past week to try building “Kubernetes on Mesos” following Getting started with Kubernetes on Mesos document, but “kubectl get node” always reports the node state is not ready:

$ kubectl get node
NAME      LABELS                          STATUS     AGE
ubuntu    kubernetes.io/hostname=ubuntu   NotReady   4h

Today, when executing “./bin/mesos-slave.sh --master=127.0.0.1:5050” command, I notice this log:

W1217 05:52:42.213497 24074 backend.cpp:50] Failed to create 'bind' backend: BindBackend requires root privileges

So I use sudo command to run the ./bin/mesos-slave.sh script again, then the “Kubernetes on Mesos” runs successfully:

 

Capture