Leverage comprehensive debugging tricks in one shot

Last Friday, a colleague told me that when connecting to an invalid address, a gRPC client blocks forever. To verify this, I used the example code shipped with gRPC:

GreeterClient greeter(grpc::CreateChannel(
  "localhost:50051", grpc::InsecureChannelCredentials()));

Change the "localhost:50051" to "badhost:50051", then compile and execute the program. Sure enough, the client hang without any response. At the outset, I thought it should be a common issue, and there must be a solution already. So I just submitted a post in the discussion group, although there was some responses, but since they were not the satisfactory explanations, I knew I need to trouble-shooting myself.

(1) The first thing I wanted to confirm was whether the network card had sent any request to badhost, so I used tcpdump to capture packets:

$ sudo tcpdump -A -s 0 'port 50051' -i enp7s0f0

But nothing was captured. To double-check, I also used the tcpconnect program:

$ sudo tcpconnect -P 50051
PID    COMM         IP SADDR            DADDR            DPORT

Still no output.
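
Another quick way to watch connect() calls without attaching a debugger is strace (a side note of mine, not part of the original session):

$ strace -f -e trace=connect ./greeter_client

The gdb approach below gives the same information plus full stack traces.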

(2) Although I couldn't find any connect request to port 50051, every application on *NIX that opens a connection ultimately calls the connect function. So I changed tactics and tried to find out who calls connect:

a) Build gRPC with debugging info (the reason for using "PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig" is explained here):

$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" make

b) Modify the Makefile to build client program with debugging info:

CXXFLAGS += -g -std=c++11

c) Use gdb to debug the program. After starting it, set a breakpoint on the connect function:

$ gdb -q greeter_client
Reading symbols from greeter_client...done.
(gdb) start
Temporary breakpoint 1 at 0x146fe: file greeter_client.cc, line 74.
Starting program: /home/xiaonan/Project/grpc/examples/cpp/helloworld/greeter_client
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffea88) at greeter_client.cc:74
74      int main(int argc, char** argv) {
(gdb) b connect
Breakpoint 2 at 0x7ffff6619b80 (2 locations)

Then continue executing the program. When the breakpoint is hit, check the stack:

(gdb) c
Continuing.
[New Thread 0x7ffff4edc700 (LWP 28396)]
[New Thread 0x7ffff46db700 (LWP 28397)]
[Switching to Thread 0x7ffff4edc700 (LWP 28396)]

Thread 2 "greeter_client" hit Breakpoint 2, 0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

Continuing the program, the breakpoint was hit again with the same backtrace:

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6
(gdb)

Oh, I see! The resolution of badhost must have failed, so naturally there was no subsequent connection to port 50051. But why was the client trying to resolve the name again and again? Finding that cause would explain why the client was blocking.
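
This is easy to confirm outside the program (a side note of mine): getent ahosts goes through the same getaddrinfo path, so it prints nothing and exits with a non-zero status for a name that cannot be resolved:

$ getent ahosts badhost
$ echo $?
2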

(3) Since the frames from /usr/lib/libgrpc.so.4 show only ??, I couldn't tell which function was the culprit. I could read through the code, but I wanted direct proof. Building gRPC with CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" was apparently not enough. After some tweaking, I came up with the following solution:

a) According to the Makefile:

# TODO(nnoble): the strip target is stripping in-place, instead
# of copying files in a temporary folder.
# This prevents proper debugging after running make install.  

"make install" strips the debugging information, so instead of running "make install", I set the LD_LIBRARY_PATH environment variable to make the client load the libraries from the build output directory:

$ export LD_LIBRARY_PATH=/home/xiaonan/Project/grpc/libs/opt
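
To double-check that the client now picks up the freshly built libraries rather than the installed ones, ldd can be used (a quick sanity check; the paths depend on where the repository lives):

$ ldd ./greeter_client | grep grpc

The grpc libraries should resolve to paths under libs/opt instead of /usr/lib.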

b) Hardcode -g in the Makefile:

CFLAGS += -g -std=c99 -Wsign-conversion -Wconversion $(W_SHADOW) $(W_EXTRA_SEMI)
CXXFLAGS += -g -std=c++11

Then all the symbols can be resolved:

(gdb) bt
#0  0x00007ffff6486b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff64bb61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff64bbae2 in __nscd_get_mapping () from /usr/lib/libc.so.6
#3  0x00007ffff64bbed5 in __nscd_get_map_ref () from /usr/lib/libc.so.6
#4  0x00007ffff64b9ba3 in __nscd_getai () from /usr/lib/libc.so.6
#5  0x00007ffff64708bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#6  0x00007ffff6471724 in getaddrinfo () from /usr/lib/libc.so.6
#7  0x00007ffff7473ec5 in blocking_resolve_address_impl (name=0x55555578edf0 "badhost:50051",
    default_port=0x555555790220 "https", addresses=0x55555578f1f0) at src/core/lib/iomgr/resolve_address_posix.c:83
#8  0x00007ffff74742e3 in do_request_thread (exec_ctx=0x7ffff5043c30, rp=0x55555578e630, error=<optimized out>)
    at src/core/lib/iomgr/resolve_address_posix.c:157
#9  0x00007ffff7472b86 in run_closures (exec_ctx=<optimized out>, list=...) at src/core/lib/iomgr/executor.c:64
#10 executor_thread (arg=0x555555789fc0) at src/core/lib/iomgr/executor.c:152
#11 0x00007ffff74e5286 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:42
#12 0x00007ffff6181049 in start_thread () from /usr/lib/../lib64/libpthread.so.0
#13 0x00007ffff6485f0f in clone () from /usr/lib/libc.so.6

Now I just needed to step into the code; more information about this issue can also be found here.
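
As an aside, the wait can at least be bounded on the caller's side. The following is only a sketch of mine, not part of the gRPC example: it waits for the channel to become connected with a deadline (WaitForConnected is part of the grpc::Channel API) instead of issuing the RPC against a channel that may never connect.

#include <chrono>
#include <memory>
#include <string>

#include <grpc++/grpc++.h>

// Minimal sketch: return false if the channel cannot reach the READY state
// within five seconds, instead of letting the caller block forever.
bool ConnectWithDeadline(const std::string& target) {
  std::shared_ptr<grpc::Channel> channel = grpc::CreateChannel(
      target, grpc::InsecureChannelCredentials());
  return channel->WaitForConnected(
      std::chrono::system_clock::now() + std::chrono::seconds(5));
}

With "badhost:50051" this should return false once the deadline expires, rather than hanging.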

During the whole process, I used a sniffer tool (tcpdump), a kernel tracing tool (tcpconnect, which belongs to bcc and utilizes eBPF), networking knowledge (setting a breakpoint on the connect function), a debugger (gdb), and a library-linking trick (setting LD_LIBRARY_PATH to bypass installing gRPC). That's why I call the whole procedure "leveraging comprehensive debugging tricks".

 

Use a network analyzer to learn SSH session establishment

The establishment of an SSH session consists of 2 parts: building up the encryption channel and authenticating the user. To understand the whole flow better, I used tcpdump/Wireshark to capture and analyze the packets. The server is OpenBSD 6.1 and the client is Arch Linux. The tcpdump command is like this:

sudo tcpdump -A -s 0 'net 192.168.38.176' -i enp7s0f0 -w capture.pcap

(1) Connect to the server for the first time:

[screenshot: ssh client output]

The captured packets:

[screenshot: captured packets in Wireshark]

We can see the client and server negotiated the SSH version first (in fact, the client and server send their version strings simultaneously, so don't assume the client sends first and the server then responds; use the "nc 192.168.38.176 22" command to check, as shown below), then exchanged public keys to generate the shared secret key. The server issued the "New Keys" message and waited for the client to answer.
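
For example, connecting with nc shows the server's version banner arriving immediately, before the client has sent anything (the exact version string depends on the server's OpenSSH build):

$ nc 192.168.38.176 22
SSH-2.0-OpenSSH_7.5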

(2) Accept the server's public key but don't input the password yet:

[screenshot: ssh client output]

The captured packets:

[screenshot: captured packets in Wireshark]

The first packet should be the client acknowledging the server's "New Keys" message, followed by some further interactions. Now the encryption channel is set up.

(3) Enter the password and authenticate the user:

[screenshot: ssh client output]

The captured packets:

[screenshot: captured packets in Wireshark]

These packets are all encrypted data. If the user's password is correct, the whole SSH session is ready, and you can administer the server now.
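
The same phases (version exchange, key exchange, "New Keys", authentication) can also be watched from the client's point of view by running ssh in verbose mode, which is a handy cross-check against the capture ("user" below is a placeholder account name):

$ ssh -v user@192.168.38.176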

Reference:
Understanding the SSH Encryption and Connection Process.

A brief intro of TCP keep-alive in Go’s HTTP implementation

Let's look at a simple Go HTTP client program:

package main

import (
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
        "os"
        "time"
)

func main() {
        for {
                resp, err := http.Get("https://www.google.com/")
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
                        os.Exit(1)
                }
                _, err = io.Copy(ioutil.Discard, resp.Body)
                resp.Body.Close()
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: reading: %v\n", err)
                        os.Exit(1)
                }
                time.Sleep(45 * time.Second)
        }
}

The logic of the above application is simple: it just retrieves content from the specified URL in a loop. As a layman in web development, I assumed HTTP communication would be "short-lived": when the HTTP client issues a request, it opens a TCP connection to the HTTP server, and once the client receives the response, the session ends and the TCP connection is torn down. But is that really the case? I used the lsof command to check my guess:

# lsof -P -n -p 907
......
fetch   907 root    3u    IPv4 0xfffff80013677810              0t0     TCP 192.168.80.129:32618->xxx.xxx.xxx.xxx:8080 (ESTABLISHED)
......

Oh! My assumption was wrong: there is a "long-lived" TCP connection. Why does this happen? Whenever I come across network-related trouble, I turn to tcpdump and Wireshark and capture packets for analysis. This time was no exception, and the communication flow is shown in the following picture:

[screenshot: captured packets in Wireshark]

(1) The packet numbers start at 4 because the first 3 packets are the TCP handshake, and it is safe to ignore them;
(2) Packets 4 ~ 43 are the first HTTP GET flow; the process lasts about 2 seconds and ends at 19:20:37;
(3) Then, half a minute later at 19:21:07, a TCP keep-alive packet appears on the wire. Oh dear! The root cause has been found! Although the HTTP session is over, the TCP connection still exists, and the keep-alive mechanism keeps it alive so that the connection can be reused by subsequent HTTP messages;
(4) As expected, 15 seconds later a new HTTP session begins at packet 46, exactly 45 seconds after the first HTTP conversation.

That's the effect of TCP keep-alive: it keeps the TCP connection alive so it can be reused by subsequent HTTP sessions. Another thing to pay attention to when reusing connections is that the Response.Body must be read to completion and closed (please refer here):

type Response struct {
    ......

    // Body represents the response body.
    //
    // The http Client and Transport guarantee that Body is always
    // non-nil, even on responses without a body or responses with
    // a zero-length body. It is the caller's responsibility to
    // close Body. The default HTTP client's Transport does not
    // attempt to reuse HTTP/1.0 or HTTP/1.1 TCP connections
    // ("keep-alive") unless the Body is read to completion and is
    // closed.
    //
    // The Body is automatically dechunked if the server replied
    // with a "chunked" Transfer-Encoding.
    Body io.ReadCloser

    ......
}

As an example, modify the above program as follows:

package main

import (
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
        "os"
        "time"
)

func closeResp(resp *http.Response) {
        _, err := io.Copy(ioutil.Discard, resp.Body)
        resp.Body.Close()
        if err != nil {
                fmt.Fprintf(os.Stderr, "fetch: reading: %v\n", err)
                os.Exit(1)
        }
}

func main() {
        for {
                resp1, err := http.Get("https://www.google.com/")
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
                        os.Exit(1)
                }

                resp2, err := http.Get("https://www.facebook.com/")
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
                        os.Exit(1)
                }

                time.Sleep(45 * time.Second)
                for _, v := range []*http.Response{resp1, resp2} {
                        closeResp(v)
                }
        }
}

While running it, you will see 2 TCP connections, not 1:

# lsof -P -n -p 1982
......
fetch   1982 root    3u    IPv4 0xfffff80013677810              0t0     TCP 192.168.80.129:43793->xxx.xxx.xxx.xxx:8080 (ESTABLISHED)
......
fetch   1982 root    6u    IPv4 0xfffff80013677000              0t0     TCP 192.168.80.129:12105->xxx.xxx.xxx.xxx:8080 (ESTABLISHED)

If you call the closeResp function before issuing a new HTTP request, the TCP connection can be reused:

package main

import (
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
        "os"
        "time"
)

func closeResp(resp *http.Response) {
        _, err := io.Copy(ioutil.Discard, resp.Body)
        resp.Body.Close()
        if err != nil {
                fmt.Fprintf(os.Stderr, "fetch: reading: %v\n", err)
                os.Exit(1)
        }
}

func main() {
        for {
                resp1, err := http.Get("https://www.google.com/")
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
                        os.Exit(1)
                }
                closeResp(resp1)


                time.Sleep(45 * time.Second)

                resp2, err := http.Get("https://www.facebook.com/")
                if err != nil {
                        fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
                        os.Exit(1)
                }
                closeResp(resp2)
        }
}

P.S., the full code is here.

References:
Package http;
Reusing http connections in Golang;
Is HTTP Shortlived?.

Tips for debugging Mesos

In the past week, I was following this tutorial to build a "Kubernetes on Mesos" testbed. Everything went well except that the Mesos master kept complaining:

......
E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected
......

First, I tried to get help from the Mesos mailing list and Stack Overflow, but when no one could give a direct answer, I knew I had to depend on myself. After a tough debugging process, I worked out the root cause. Since I am a newbie to both Mesos and C++ (Mesos is implemented in C++, and the last time I touched C++ was 7 years ago), I think these experiences and tips may also be useful for other novices, so I summarize them below:

(1) LOG vs VLOG

When you hit an issue, analyzing the logs should be the first step. Mesos uses google-glog to generate its logs, and the log format is explained here:

Log lines have this form:
    Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg...
where the fields are defined as follows:
    L                A single character, representing the log level (eg 'I' for INFO)
    mm               The month (zero padded; ie May is '05')
    dd               The day (zero padded)
    hh:mm:ss.uuuuuu  Time in hours, minutes and fractional seconds
    threadid         The space-padded thread ID as returned by GetTID()
    file             The file name
    line             The line number
    msg              The user-supplied message

Comparing against the description above, you can easily understand this log line:

E1228 21:57:13.138357 27257 process.cpp:1911] Failed to shutdown socket with fd 17: Transport endpoint is not connected

By default, Mesos doesn't output logs generated by the VLOG function; you need to set GLOG_v=m if you want to see the information from VLOG calls (refer to this post):

$ sudo GLOG_v=3 ./bin/mesos-master.sh --ip=15.242.100.56 --work_dir=/var/lib/mesos
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1229 22:42:38.818521 11830 process.cpp:2426] Spawned process __gc__@15.242.100.56:5050
I1229 22:42:38.818613 11846 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818540032+00:00
I1229 22:42:38.818749 11847 process.cpp:2436] Resuming __gc__@15.242.100.56:5050 at 2015-12-30 03:42:38.818712832+00:00
I1229 22:42:38.818802 11844 process.cpp:2436] Resuming help@15.242.100.56:5050 at 2015-12-30 03:42:38.818746112+00:00
......

You can also use the LOG function to add logs yourself at suspected locations (see the sketch below).
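
For example, a couple of ad-hoc statements dropped into a suspected spot might look like this sketch (the function and variable names are made up for illustration; LOG and VLOG are the glog macros Mesos already uses):

#include <glog/logging.h>

void handle_socket(int fd) {
  // Always emitted at INFO severity.
  LOG(INFO) << "handling socket with fd " << fd;
  // Only emitted when the process is started with GLOG_v >= 2.
  VLOG(2) << "verbose detail for fd " << fd;
}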

(2) Gdb

If logs can't save you, it is time for the debugger to be your hero. Gdb is without doubt a great tool for debugging C/C++ programs. To use gdb effectively, you should pass the --enable-debug configuration option before compiling Mesos:

nan@ubuntu:~/mesos-0.25.0/build$ ../configure --enable-debug

You can set a breakpoint on a class member function like this:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

You can also make use of gdb's auto-completion feature. Type the incomplete function name:

(gdb) b process::SocketManager::cl  

Then press Tab:

(gdb) b process::SocketManager::close(int)
Breakpoint 1 at 0x7fd07857c162: file ../../../3rdparty/libprocess/src/process.cpp, line 1849.

Note: if too many symbols match, completion may hang gdb, so try to narrow the scope as much as possible.

Additionally, since source file names are relative to the directory where the code was compiled (please refer to breakpoints in GDB), you can achieve the same effect with the "b file:line" command:

(gdb) b ../../../3rdparty/libprocess/src/process.cpp:1279
Breakpoint 2 at 0x7fd07857973a: file ../../../3rdparty/libprocess/src/process.cpp, line 1279.
(gdb) c
Continuing.
......
[Switching to Thread 0x7fd06b9d4700 (LWP 16677)]

Breakpoint 2, process::SocketManager::link_connect (this=0xca1a30, future=..., socket=0x7fd0500026d0, to=...)
    at ../../../3rdparty/libprocess/src/process.cpp:1279
1279      if (future.isDiscarded() || future.isFailed()) {

You can see the breakpoint is set on the process::SocketManager::link_connect(process::Future<Nothing> const&, process::network::Socket*, process::UPID const&) function.

P.S.: There are also handy out-of-the-box gdb scripts in the build/bin directory:

# ls bin/gdb-mesos-*
bin/gdb-mesos-local.sh  bin/gdb-mesos-master.sh  bin/gdb-mesos-slave.sh  bin/gdb-mesos-tests.sh

(3) Tcpdump and Wireshark

Network packet analysis tools such as tcpdump and Wireshark are essential for diagnosing programs that interact with other hosts. E.g., you can use the following command to see what goes in and out of the Mesos master:

sudo tcpdump -A -s 0 'tcp port 5050' -i em1 -w capture.pcap
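
The saved file can then be opened in Wireshark, or replayed on the command line with the -r flag, which reads from a capture file instead of a live interface:

$ tcpdump -nn -r capture.pcap 'tcp port 5050'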

BTW, my issue was finally fixed by analyzing the following packet:

[screenshot: the captured packet in Wireshark]

(4) Pstack script

Personally, I find the pstack script useful for monitoring thread status; please refer to Use pstack to track threads on Linux for details.
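
Usage is simply pstack followed by the process ID (12345 below is a placeholder; substitute the PID of the mesos-master process):

$ pstack 12345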

Enjoy debugging!