Leverage comprehensive debugging tricks in one shot

Last Friday, a colleague told me that when connecting an invalid address, the client using gRPC will block forever. To verify it, I use the example code shipped in gRPC:

GreeterClient greeter(grpc::CreateChannel(
  "localhost:50051", grpc::InsecureChannelCredentials()));

Change the "localhost:50051" to "badhost:50051", then compile and execute the program. Sure enough, the client hang without any response. At the outset, I thought it should be a common issue, and there must be a solution already. So I just submitted a post in the discussion group, although there was some responses, but since they were not the satisfactory explanations, I knew I need to trouble-shooting myself.

(1) The first thing I wanted to make sure was whether the network card had sent requests to badhost or not, so I used tcpdump to capture the packets:

$ sudo tcpdump -A -s 0 'port 50051' -i enp7s0f0

But there isn’t any data captured. To double-confirm, I also used tcpconnect program to check:

$ sudo tcpconnect -P 50051
PID    COMM         IP SADDR            DADDR            DPORT

Still nothing output.

(2) Although I couldn’t find the connect request to port 50051, no matter what application on *NIX, it will definitely call connect function at the end. So I changed the tactic, and tried to find who calls the connect:

a) Build gRPC with debugging info (The reason of using “PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig” is here):

$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" make

b) Modify the Makefile to build client program with debugging info:

CXXFLAGS += -g -std=c++11

c) Use gdb to debug the program, after starting it, set breakpoint at connect function:

$ gdb -q greeter_client
Reading symbols from greeter_client...done.
(gdb) start
Temporary breakpoint 1 at 0x146fe: file greeter_client.cc, line 74.
Starting program: /home/xiaonan/Project/grpc/examples/cpp/helloworld/greeter_client
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffea88) at greeter_client.cc:74
74      int main(int argc, char** argv) {
(gdb) b connect
Breakpoint 2 at 0x7ffff6619b80 (2 locations)

Then continue executing the program. When the breakpoint was hit, check the stack:

(gdb) c
Continuing.
[New Thread 0x7ffff4edc700 (LWP 28396)]
[New Thread 0x7ffff46db700 (LWP 28397)]
[Switching to Thread 0x7ffff4edc700 (LWP 28396)]

Thread 2 "greeter_client" hit Breakpoint 2, 0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

Then continue to run the program, the breakpoint was hit again:

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6
(gdb)

Oh, I see! The resolving of badhost must be failed, so there would definitely no subsequent connecting port 50051. But why the client was trying to resolve name again and again? If I find this cause, it can explain why client was blocking.

(3) Since there is ?? from /usr/lib/libgrpc.so.4, I can’t know which function was the culprit. I can go over the code, but I think I need the direct proof. Build gRPC with CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" seems not enough. After some tweaking, I come out the following solutions:

a) According to the Makefile:

# TODO(nnoble): the strip target is stripping in-place, instead
# of copying files in a temporary folder.
# This prevents proper debugging after running make install.  

make install” will strip the debugging information, so instead of executing “make install” command, I set LD_LIBRARY_PATH environment variable to let client link library in the specified directory:

$ export LD_LIBRARY_PATH=/home/xiaonan/Project/grpc/libs/opt

b) Hardcode -g in the Makefile:

CFLAGS += -g -std=c99 -Wsign-conversion -Wconversion $(W_SHADOW) $(W_EXTRA_SEMI)
CXXFLAGS += -g -std=c++11

Then the symbols can all be resolved:

(gdb) bt
#0  0x00007ffff6486b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff64bb61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff64bbae2 in __nscd_get_mapping () from /usr/lib/libc.so.6
#3  0x00007ffff64bbed5 in __nscd_get_map_ref () from /usr/lib/libc.so.6
#4  0x00007ffff64b9ba3 in __nscd_getai () from /usr/lib/libc.so.6
#5  0x00007ffff64708bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#6  0x00007ffff6471724 in getaddrinfo () from /usr/lib/libc.so.6
#7  0x00007ffff7473ec5 in blocking_resolve_address_impl (name=0x55555578edf0 "badhost:50051",
    default_port=0x555555790220 "https", addresses=0x55555578f1f0) at src/core/lib/iomgr/resolve_address_posix.c:83
#8  0x00007ffff74742e3 in do_request_thread (exec_ctx=0x7ffff5043c30, rp=0x55555578e630, error=<optimized out>)
    at src/core/lib/iomgr/resolve_address_posix.c:157
#9  0x00007ffff7472b86 in run_closures (exec_ctx=<optimized out>, list=...) at src/core/lib/iomgr/executor.c:64
#10 executor_thread (arg=0x555555789fc0) at src/core/lib/iomgr/executor.c:152
#11 0x00007ffff74e5286 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:42
#12 0x00007ffff6181049 in start_thread () from /usr/lib/../lib64/libpthread.so.0
#13 0x00007ffff6485f0f in clone () from /usr/lib/libc.so.6

Now I just need to step-into code, and the information of this issue can also be referred here.

During the whole process, I used sniffer tool (tcpdump), kernel tracing tool(tcpconnect, which belongs to bcc and utilizes eBPF), networking knowledge (set breakpoint on connect function), debugging tool (gdb), and the trick of linking library (set LD_LIBRARY_PATH to bypass installing gRPC), that’s why I call the whole procedure “leverage comprehensive debugging tricks”.

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.