Leverage comprehensive debugging tricks in one shot

Last Friday, a colleague told me that when connecting an invalid address, the client using gRPC will block forever. To verify it, I use the example code shipped in gRPC:

GreeterClient greeter(grpc::CreateChannel(
  "localhost:50051", grpc::InsecureChannelCredentials()));

Change the "localhost:50051" to "badhost:50051", then compile and execute the program. Sure enough, the client hang without any response. At the outset, I thought it should be a common issue, and there must be a solution already. So I just submitted a post in the discussion group, although there was some responses, but since they were not the satisfactory explanations, I knew I need to trouble-shooting myself.

(1) The first thing I wanted to make sure was whether the network card had sent requests to badhost or not, so I used tcpdump to capture the packets:

$ sudo tcpdump -A -s 0 'port 50051' -i enp7s0f0

But there isn’t any data captured. To double-confirm, I also used tcpconnect program to check:

$ sudo tcpconnect -P 50051
PID    COMM         IP SADDR            DADDR            DPORT

Still nothing output.

(2) Although I couldn’t find the connect request to port 50051, no matter what application on *NIX, it will definitely call connect function at the end. So I changed the tactic, and tried to find who calls the connect:

a) Build gRPC with debugging info (The reason of using “PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig” is here):

$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" make

b) Modify the Makefile to build client program with debugging info:

CXXFLAGS += -g -std=c++11

c) Use gdb to debug the program, after starting it, set breakpoint at connect function:

$ gdb -q greeter_client
Reading symbols from greeter_client...done.
(gdb) start
Temporary breakpoint 1 at 0x146fe: file greeter_client.cc, line 74.
Starting program: /home/xiaonan/Project/grpc/examples/cpp/helloworld/greeter_client
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffea88) at greeter_client.cc:74
74      int main(int argc, char** argv) {
(gdb) b connect
Breakpoint 2 at 0x7ffff6619b80 (2 locations)

Then continue executing the program. When the breakpoint was hit, check the stack:

(gdb) c
Continuing.
[New Thread 0x7ffff4edc700 (LWP 28396)]
[New Thread 0x7ffff46db700 (LWP 28397)]
[Switching to Thread 0x7ffff4edc700 (LWP 28396)]

Thread 2 "greeter_client" hit Breakpoint 2, 0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

Then continue to run the program, the breakpoint was hit again:

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6
(gdb)

Oh, I see! The resolving of badhost must be failed, so there would definitely no subsequent connecting port 50051. But why the client was trying to resolve name again and again? If I find this cause, it can explain why client was blocking.

(3) Since there is ?? from /usr/lib/libgrpc.so.4, I can’t know which function was the culprit. I can go over the code, but I think I need the direct proof. Build gRPC with CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" seems not enough. After some tweaking, I come out the following solutions:

a) According to the Makefile:

# TODO(nnoble): the strip target is stripping in-place, instead
# of copying files in a temporary folder.
# This prevents proper debugging after running make install.  

make install” will strip the debugging information, so instead of executing “make install” command, I set LD_LIBRARY_PATH environment variable to let client link library in the specified directory:

$ export LD_LIBRARY_PATH=/home/xiaonan/Project/grpc/libs/opt

b) Hardcode -g in the Makefile:

CFLAGS += -g -std=c99 -Wsign-conversion -Wconversion $(W_SHADOW) $(W_EXTRA_SEMI)
CXXFLAGS += -g -std=c++11

Then the symbols can all be resolved:

(gdb) bt
#0  0x00007ffff6486b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff64bb61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff64bbae2 in __nscd_get_mapping () from /usr/lib/libc.so.6
#3  0x00007ffff64bbed5 in __nscd_get_map_ref () from /usr/lib/libc.so.6
#4  0x00007ffff64b9ba3 in __nscd_getai () from /usr/lib/libc.so.6
#5  0x00007ffff64708bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#6  0x00007ffff6471724 in getaddrinfo () from /usr/lib/libc.so.6
#7  0x00007ffff7473ec5 in blocking_resolve_address_impl (name=0x55555578edf0 "badhost:50051",
    default_port=0x555555790220 "https", addresses=0x55555578f1f0) at src/core/lib/iomgr/resolve_address_posix.c:83
#8  0x00007ffff74742e3 in do_request_thread (exec_ctx=0x7ffff5043c30, rp=0x55555578e630, error=<optimized out>)
    at src/core/lib/iomgr/resolve_address_posix.c:157
#9  0x00007ffff7472b86 in run_closures (exec_ctx=<optimized out>, list=...) at src/core/lib/iomgr/executor.c:64
#10 executor_thread (arg=0x555555789fc0) at src/core/lib/iomgr/executor.c:152
#11 0x00007ffff74e5286 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:42
#12 0x00007ffff6181049 in start_thread () from /usr/lib/../lib64/libpthread.so.0
#13 0x00007ffff6485f0f in clone () from /usr/lib/libc.so.6

Now I just need to step-into code, and the information of this issue can also be referred here.

During the whole process, I used sniffer tool (tcpdump), kernel tracing tool(tcpconnect, which belongs to bcc and utilizes eBPF), networking knowledge (set breakpoint on connect function), debugging tool (gdb), and the trick of linking library (set LD_LIBRARY_PATH to bypass installing gRPC), that’s why I call the whole procedure “leverage comprehensive debugging tricks”.

 

Build static library of gRPC

If you want to build static library of gRPC, you can check following steps:

$ git clone https://github.com/grpc/grpc.git
$ cd grpc
$ git submodule update --init
$ CC=clang CXX=clang++ make REQUIRE_CUSTOM_LIBRARIES_opt=1 static

If you bump into following errors:

third_party/protobuf/src/google/protobuf/stubs/port.h:263:5: error: macro expansion producing 'defined' has undefined
  behavior [-Werror,-Wexpansion-to-defined]
#if GOOGLE_PROTOBUF_USE_UNALIGNED
    ^

Please check this patch. After compile, you can find all libraries are static:

$ ls -alt libs/opt/*
-rw-r--r-- 1 xiaonan xiaonan  8109188 Oct  6 17:54 libs/opt/libgrpc++_unsecure.a
-rw-r--r-- 1 xiaonan xiaonan  3705192 Oct  6 17:54 libs/opt/libgrpc++_reflection.a
-rw-r--r-- 1 xiaonan xiaonan  2089436 Oct  6 17:54 libs/opt/libgrpc++_error_details.a
-rw-r--r-- 1 xiaonan xiaonan 24675866 Oct  6 17:54 libs/opt/libgrpc++_cronet.a
-rw-r--r-- 1 xiaonan xiaonan  9276572 Oct  6 17:54 libs/opt/libgrpc++.a
-rw-r--r-- 1 xiaonan xiaonan   917148 Oct  6 17:54 libs/opt/libgrpc++_core_stats.a
-rw-r--r-- 1 xiaonan xiaonan  7180018 Oct  6 17:54 libs/opt/libgrpc_plugin_support.a
-rw-r--r-- 1 xiaonan xiaonan  8163444 Oct  6 17:41 libs/opt/libgrpc_unsecure.a
-rw-r--r-- 1 xiaonan xiaonan 18854796 Oct  6 17:41 libs/opt/libgrpc_cronet.a
-rw-r--r-- 1 xiaonan xiaonan 20064046 Oct  6 17:41 libs/opt/libgrpc.a
-rw-r--r-- 1 xiaonan xiaonan 10669618 Oct  6 17:40 libs/opt/libboringssl.a
-rw-r--r-- 1 xiaonan xiaonan   456458 Oct  6 17:39 libs/opt/libgpr.a
-rw-r--r-- 1 xiaonan xiaonan   379730 Oct  6 17:39 libs/opt/libz.a
-rw-r--r-- 1 xiaonan xiaonan   629666 Oct  6 17:39 libs/opt/libares.a

libs/opt/protobuf:
total 114748
drwxr-xr-x 4 xiaonan xiaonan     4096 Oct  6 17:54 ..
-rw-r--r-- 1 xiaonan xiaonan 50492478 Oct  6 17:49 libprotobuf.a
drwxr-xr-x 2 xiaonan xiaonan     4096 Oct  6 17:49 .
-rw-r--r-- 1 xiaonan xiaonan 66986782 Oct  6 17:49 libprotoc.a

libs/opt/pkgconfig:
total 24
drwxr-xr-x 4 xiaonan xiaonan 4096 Oct  6 17:54 ..
drwxr-xr-x 2 xiaonan xiaonan 4096 Oct  6 17:41 .
-rw-r--r-- 1 xiaonan xiaonan  283 Oct  6 17:41 grpc++_unsecure.pc
-rw-r--r-- 1 xiaonan xiaonan  253 Oct  6 17:41 grpc++.pc
-rw-r--r-- 1 xiaonan xiaonan  283 Oct  6 17:39 grpc_unsecure.pc
-rw-r--r-- 1 xiaonan xiaonan  264 Oct  6 17:39 grpc.pc

First taster of Standard C++ Thread Library on OpenBSD

Today I tried Standard C++ Thread Library on OpenBSD, since it requires the compiler to support C++11 standard, and the default c++ only support C++98 (please refer here), so I need to switch to clang++. The program is just a classic “Hello World”:

#include <thread>
#include <iostream>

void hello()
{
    std::cout << "Hello World!\n";
}

int main(void)
{
    std::thread t(hello);
    t.join();
    return 0;
}

Built and run it:

# clang++ -std=c++11 hello.cpp
root:/root/Project# ./a.out
terminate called after throwing an instance of 'std::system_error'
  what():  Enable multithreading to use std::thread: Operation not permitted
Abort trap (core dumped)

Whoops! The program crashed. After reading this post, adding -pthread during compilation fixed this issue:

# clang++ -pthread -std=c++11 hello.cpp
# ./a.out
Hello World!

The anatomy of ldd program on OpenBSD

In the past week, I read the ldd source code on OpenBSD to get a better understanding of how it works. And this post should also be a reference for other*NIX OSs.

The ELF file is divided into 4 categories: relocatable, executable, shared, and core. Only the executable and shared object files may have dynamic object dependencies, so the ldd only check these 2 kinds of ELF file:

(1) Executable.

ldd leverages the LD_TRACE_LOADED_OBJECTS environment variable in fact, and the code is as following:

if (setenv("LD_TRACE_LOADED_OBJECTS", "true", 1) < 0)
    err(1, "setenv(LD_TRACE_LOADED_OBJECTS)");

When LD_TRACE_LOADED_OBJECTS is set to 1 or true, running executable file will show shared objects needed instead of running it, so you even not needldd to check executable file. See the following outputs:

# /usr/bin/ldd
usage: ldd program ...
# LD_TRACE_LOADED_OBJECTS=1 /usr/bin/ldd
        Start            End              Type Open Ref GrpRef Name
        00000b6ac6e00000 00000b6ac7003000 exe  1    0   0      /usr/bin/ldd
        00000b6dbc96c000 00000b6dbcc38000 rlib 0    1   0      /usr/lib/libc.so.89.3
        00000b6d6ad00000 00000b6d6ad00000 rtld 0    1   0      /usr/libexec/ld.so  

(2) Shared object.

The code to print dependencies of shared object is as following:

if (ehdr.e_type == ET_DYN && !interp) {
    if (realpath(name, buf) == NULL) {
        printf("realpath(%s): %s", name,
            strerror(errno));
        fflush(stdout);
        _exit(1);
    }
    dlhandle = dlopen(buf, RTLD_TRACE);
    if (dlhandle == NULL) {
        printf("%s\n", dlerror());
        fflush(stdout);
        _exit(1);
    }
    _exit(0);
}

Why the condition of checking a ELF file is shared object or not is like this:

if (ehdr.e_type == ET_DYN && !interp) {
    ......
}

That’s because the file type of position-independent executable (PIE) is the same as shared object, but normally PIE contains a interpreter program header since it needs dynamic linker to load it while shared object lacks (refer this article). So the above condition will filter PIE file.

The dlopen(buf, RTLD_TRACE) is used to print dynamic object information. And the actual code is like this:

if (_dl_traceld) {
    _dl_show_objects();
    _dl_unload_shlib(object);
    _dl_exit(0);
}

In fact, you can also implement a simple application which outputs dynamic object information for shared object yourself:

#include <dlfcn.h>

int main(int argc, char **argv)
{
    dlopen(argv[1], RTLD_TRACE);
    return 0;
}

Compile and use it to analyze /usr/lib/libssl.so.43.2:

# cc lddshared.c
# ./a.out /usr/lib/libssl.so.43.2
    Start            End              Type Open Ref GrpRef Name
    000010e2df1c5000 000010e2df41a000 dlib 1    0   0      /usr/lib/libssl.so.43.2
    000010e311e3f000 000010e312209000 rlib 0    1   0      /usr/lib/libcrypto.so.41.1

The same as using ldd directly:

# ldd /usr/lib/libssl.so.43.2
/usr/lib/libssl.so.43.2:
    Start            End              Type Open Ref GrpRef Name
    00001d9ffef08000 00001d9fff15d000 dlib 1    0   0      /usr/lib/libssl.so.43.2
    00001d9ff1431000 00001d9ff17fb000 rlib 0    1   0      /usr/lib/libcrypto.so.41.1

Through the studying of ldd source code, I also get many by-products: such as knowledge of ELF file, linking and loading, etc. So diving into code is a really good method to learn *NIX deeper!

The analysis of gRPC communication mode

The route_guide program is a very good tutorial to learn gRPC, I will use it as an example to analyze the gRPC programming.

Prologue:
Compile the program with debug information, so it is easy to use gdb to manipulate them. Modify Makefile:

CXXFLAGS += -std=c++11 -g

(1) Start route_guide_server application, and use gdb to step-into route_guide_client:

221     int main(int argc, char** argv) {
(gdb) n
223       std::string db = routeguide::GetDbFileContent(argc, argv);
(gdb)
226                               grpc::InsecureChannelCredentials()),
(gdb)
[New Thread 0x7ffff502f700 (LWP 25947)]
[New Thread 0x7ffff482e700 (LWP 25948)]
225           grpc::CreateChannel("localhost:50051",
(gdb)
227           db);
(gdb)
DB parsed, loaded 100 features.

Check current network status:

$ netstat -an | grep 50051
tcp6       0      0 :::50051                :::*                    LISTEN

From this, we can know grpc::CreateChannel() function doesn’t create connection to server, and it just does some initialization work.

(2) Since simple RPC is one-request/one-response mode, and it’s really easy, I won’t discuss it here.

(3) Server-side streaming RPC is one-request/multiple-response mode. The client code skeleton is like this:

std::unique_ptr<ClientReader<Feature> > reader(
        stub_->ListFeatures(&context, rect));
while (reader->Read(&feature)) {
  ......
}
Status status = reader->Finish();
if (status.ok()) {
  ......
} else {
  ......
}

After client sends request (stub_->ListFeatures(&context, rect)), it will read all the responses from server until reader->Read() returns false. Then using reader->Finish() to check the status. The server code is simple, just send all responses:

for (......) {
    ......
    writer->Write(f);
 } 

But how about client not want to continue to process the responses in the middle? It likes this:

while (reader->Read(&feature)) {
  ......
  if (some error) {
    break;
  } 
}
Status status = reader->Finish();
if (status.ok()) {
  ......
} else {
  ......
}

After experimenting, the client will block at reader->Finish() function. You can refer Be careful of using grpc::ClientStreamingInterface::Finish() function. There are 2 methods I can figure out to handle this case:
a) Don’t call reader->Finish():

while (reader->Read(&feature)) {
  ......
  if (some error) {
    break;
  } 
}

b) Or although there is error, still consume other messages:

while (reader->Read(&feature)) {
  ......
  if (some error) {
    break;
  } 
}
while (reader->Read(&feature)) {
    // do nothing, just comsume other messages.
}
Status status = reader->Finish();
if (status.ok()) {
  ......
} else {
  ......
}

Honestly, I don’t dive into the gRPC code. But from testing, the above 2 methods seems all work correctly.

(4) Client-side streaming RPC is multiple-request/one-response mode, the client code framework is like this:

std::unique_ptr<ClientWriter<Point> > writer(
    stub_->RecordRoute(&context, &stats));
for (......) {
  ......
  if (!writer->Write(f.location())) {
    // Broken stream.
    break;
  }
  ......
}
writer->WritesDone();
Status status = writer->Finish();
if (status.ok()) {
  // Check stats from server
} else {
  ......
}

The client code is simple, just tries best to send all requests. Once finish transmitting, calls writer->WritesDone() and blocks writer->Finish() to wait server’s response. For server code, it’s also brief:

while (reader->Read(&point)) {
  ......
}

If the program breaks the while loop half way:

while (reader->Read(&point)) {
  ......
  if (some error) {
    break;
  } 
}

That is OK too.

(5) The last flow is bidirectional streaming RPC, which is multiple-request/multiple-response mode. For the server code framework, it is neat, just sends response according request:

while (stream->Read(&note)) {
  ......
  stream->Write(n);
  ......
}

But for client side, it is a little complicated:

stream->Write(note);
stream->Read(&server_note);
stream->Write(note);
stream->Read(&server_note);
......
stream->WritesDone();
Status status = stream->Finish();
if (!status.ok()) {
  ......
} else {
  ......
}

You must pay attention to the same problem as Server-side streaming RPC, that is, if there is pending server’s responses before stream->Finish(), it will cause client block forever:

while (reader->Read(&feature)) {
  ......
  if (some error) {
    break;
  } 
}
Status status = reader->Finish();

Epilogue:
In a summary, gRPC is a brilliant communication framework which hides so many dirty details and let you focus on message definitions. If you are considering doing some network programming, you can try it, and I think it can’t make you disappointed.