Enable generating core dump file on Debian Linux

The default core dump file size is 0 on Debian Linux:

$ ulimit -c
0

To enable generating core dump file, I need to run following command:

$ ulimit -c unlimited  

But if you re-login, the core dump file size is changed back to 0 from unlimited. So “ulimit -c unlimited” need to be executed during your login. E.g., if you use zsh, append it in .zshrc file.

OpenBSD gives a hint on forgetting unlock mutex

Check following simple C++ program:

#include <mutex>

int main(void)
{
    std::mutex m;
    m.lock();

    return 0;
}

The mutex m forgot unlock itself before exiting main function:

m.unlock();

Test it on GNU/Linux, and I chose ArchLinux as the testbed:

$ uname -a
Linux fujitsu-i 4.13.7-1-ARCH #1 SMP PREEMPT Sat Oct 14 20:13:26 CEST 2017 x86_64 GNU/Linux
$ clang++ -g -pthread -std=c++11 test_mutex.cpp
$ ./a.out
$

The process exited normally, and no more words was given. Build and run it on OpenBSD 6.2:

# clang++ -g -pthread -std=c++11 test_mutex.cpp
# ./a.out
pthread_mutex_destroy on mutex with waiters!

The OpenBSD prompts “pthread_mutex_destroy on mutex with waiters!“. Interesting!

Leverage comprehensive debugging tricks in one shot

Last Friday, a colleague told me that when connecting an invalid address, the client using gRPC will block forever. To verify it, I use the example code shipped in gRPC:

GreeterClient greeter(grpc::CreateChannel(
  "localhost:50051", grpc::InsecureChannelCredentials()));

Change the "localhost:50051" to "badhost:50051", then compile and execute the program. Sure enough, the client hang without any response. At the outset, I thought it should be a common issue, and there must be a solution already. So I just submitted a post in the discussion group, although there was some responses, but since they were not the satisfactory explanations, I knew I need to trouble-shooting myself.

(1) The first thing I wanted to make sure was whether the network card had sent requests to badhost or not, so I used tcpdump to capture the packets:

$ sudo tcpdump -A -s 0 'port 50051' -i enp7s0f0

But there isn’t any data captured. To double-confirm, I also used tcpconnect program to check:

$ sudo tcpconnect -P 50051
PID    COMM         IP SADDR            DADDR            DPORT

Still nothing output.

(2) Although I couldn’t find the connect request to port 50051, no matter what application on *NIX, it will definitely call connect function at the end. So I changed the tactic, and tried to find who calls the connect:

a) Build gRPC with debugging info (The reason of using “PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig” is here):

$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" make

b) Modify the Makefile to build client program with debugging info:

CXXFLAGS += -g -std=c++11

c) Use gdb to debug the program, after starting it, set breakpoint at connect function:

$ gdb -q greeter_client
Reading symbols from greeter_client...done.
(gdb) start
Temporary breakpoint 1 at 0x146fe: file greeter_client.cc, line 74.
Starting program: /home/xiaonan/Project/grpc/examples/cpp/helloworld/greeter_client
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffea88) at greeter_client.cc:74
74      int main(int argc, char** argv) {
(gdb) b connect
Breakpoint 2 at 0x7ffff6619b80 (2 locations)

Then continue executing the program. When the breakpoint was hit, check the stack:

(gdb) c
Continuing.
[New Thread 0x7ffff4edc700 (LWP 28396)]
[New Thread 0x7ffff46db700 (LWP 28397)]
[Switching to Thread 0x7ffff4edc700 (LWP 28396)]

Thread 2 "greeter_client" hit Breakpoint 2, 0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

Then continue to run the program, the breakpoint was hit again:

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6
(gdb)

Oh, I see! The resolving of badhost must be failed, so there would definitely no subsequent connecting port 50051. But why the client was trying to resolve name again and again? If I find this cause, it can explain why client was blocking.

(3) Since there is ?? from /usr/lib/libgrpc.so.4, I can’t know which function was the culprit. I can go over the code, but I think I need the direct proof. Build gRPC with CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" seems not enough. After some tweaking, I come out the following solutions:

a) According to the Makefile:

# TODO(nnoble): the strip target is stripping in-place, instead
# of copying files in a temporary folder.
# This prevents proper debugging after running make install.  

make install” will strip the debugging information, so instead of executing “make install” command, I set LD_LIBRARY_PATH environment variable to let client link library in the specified directory:

$ export LD_LIBRARY_PATH=/home/xiaonan/Project/grpc/libs/opt

b) Hardcode -g in the Makefile:

CFLAGS += -g -std=c99 -Wsign-conversion -Wconversion $(W_SHADOW) $(W_EXTRA_SEMI)
CXXFLAGS += -g -std=c++11

Then the symbols can all be resolved:

(gdb) bt
#0  0x00007ffff6486b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff64bb61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff64bbae2 in __nscd_get_mapping () from /usr/lib/libc.so.6
#3  0x00007ffff64bbed5 in __nscd_get_map_ref () from /usr/lib/libc.so.6
#4  0x00007ffff64b9ba3 in __nscd_getai () from /usr/lib/libc.so.6
#5  0x00007ffff64708bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#6  0x00007ffff6471724 in getaddrinfo () from /usr/lib/libc.so.6
#7  0x00007ffff7473ec5 in blocking_resolve_address_impl (name=0x55555578edf0 "badhost:50051",
    default_port=0x555555790220 "https", addresses=0x55555578f1f0) at src/core/lib/iomgr/resolve_address_posix.c:83
#8  0x00007ffff74742e3 in do_request_thread (exec_ctx=0x7ffff5043c30, rp=0x55555578e630, error=<optimized out>)
    at src/core/lib/iomgr/resolve_address_posix.c:157
#9  0x00007ffff7472b86 in run_closures (exec_ctx=<optimized out>, list=...) at src/core/lib/iomgr/executor.c:64
#10 executor_thread (arg=0x555555789fc0) at src/core/lib/iomgr/executor.c:152
#11 0x00007ffff74e5286 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:42
#12 0x00007ffff6181049 in start_thread () from /usr/lib/../lib64/libpthread.so.0
#13 0x00007ffff6485f0f in clone () from /usr/lib/libc.so.6

Now I just need to step-into code, and the information of this issue can also be referred here.

During the whole process, I used sniffer tool (tcpdump), kernel tracing tool(tcpconnect, which belongs to bcc and utilizes eBPF), networking knowledge (set breakpoint on connect function), debugging tool (gdb), and the trick of linking library (set LD_LIBRARY_PATH to bypass installing gRPC), that’s why I call the whole procedure “leverage comprehensive debugging tricks”.

 

The anatomy of ldd program on OpenBSD

In the past week, I read the ldd source code on OpenBSD to get a better understanding of how it works. And this post should also be a reference for other*NIX OSs.

The ELF file is divided into 4 categories: relocatable, executable, shared, and core. Only the executable and shared object files may have dynamic object dependencies, so the ldd only check these 2 kinds of ELF file:

(1) Executable.

ldd leverages the LD_TRACE_LOADED_OBJECTS environment variable in fact, and the code is as following:

if (setenv("LD_TRACE_LOADED_OBJECTS", "true", 1) < 0)
    err(1, "setenv(LD_TRACE_LOADED_OBJECTS)");

When LD_TRACE_LOADED_OBJECTS is set to 1 or true, running executable file will show shared objects needed instead of running it, so you even not needldd to check executable file. See the following outputs:

# /usr/bin/ldd
usage: ldd program ...
# LD_TRACE_LOADED_OBJECTS=1 /usr/bin/ldd
        Start            End              Type Open Ref GrpRef Name
        00000b6ac6e00000 00000b6ac7003000 exe  1    0   0      /usr/bin/ldd
        00000b6dbc96c000 00000b6dbcc38000 rlib 0    1   0      /usr/lib/libc.so.89.3
        00000b6d6ad00000 00000b6d6ad00000 rtld 0    1   0      /usr/libexec/ld.so  

(2) Shared object.

The code to print dependencies of shared object is as following:

if (ehdr.e_type == ET_DYN && !interp) {
    if (realpath(name, buf) == NULL) {
        printf("realpath(%s): %s", name,
            strerror(errno));
        fflush(stdout);
        _exit(1);
    }
    dlhandle = dlopen(buf, RTLD_TRACE);
    if (dlhandle == NULL) {
        printf("%s\n", dlerror());
        fflush(stdout);
        _exit(1);
    }
    _exit(0);
}

Why the condition of checking a ELF file is shared object or not is like this:

if (ehdr.e_type == ET_DYN && !interp) {
    ......
}

That’s because the file type of position-independent executable (PIE) is the same as shared object, but normally PIE contains a interpreter program header since it needs dynamic linker to load it while shared object lacks (refer this article). So the above condition will filter PIE file.

The dlopen(buf, RTLD_TRACE) is used to print dynamic object information. And the actual code is like this:

if (_dl_traceld) {
    _dl_show_objects();
    _dl_unload_shlib(object);
    _dl_exit(0);
}

In fact, you can also implement a simple application which outputs dynamic object information for shared object yourself:

#include <dlfcn.h>

int main(int argc, char **argv)
{
    dlopen(argv[1], RTLD_TRACE);
    return 0;
}

Compile and use it to analyze /usr/lib/libssl.so.43.2:

# cc lddshared.c
# ./a.out /usr/lib/libssl.so.43.2
    Start            End              Type Open Ref GrpRef Name
    000010e2df1c5000 000010e2df41a000 dlib 1    0   0      /usr/lib/libssl.so.43.2
    000010e311e3f000 000010e312209000 rlib 0    1   0      /usr/lib/libcrypto.so.41.1

The same as using ldd directly:

# ldd /usr/lib/libssl.so.43.2
/usr/lib/libssl.so.43.2:
    Start            End              Type Open Ref GrpRef Name
    00001d9ffef08000 00001d9fff15d000 dlib 1    0   0      /usr/lib/libssl.so.43.2
    00001d9ff1431000 00001d9ff17fb000 rlib 0    1   0      /usr/lib/libcrypto.so.41.1

Through the studying of ldd source code, I also get many by-products: such as knowledge of ELF file, linking and loading, etc. So diving into code is a really good method to learn *NIX deeper!

The basics of Client/Server socket programming

While Client/Server communication model is ubiquitous nowadays, most of them involve socket programming knowledge. In this post, I will introduce some rudimentary aspects of it:

(1) Short/Long-lived TCP connection.
Short-lived TCP connection refers to following pattern: Client creates a connection to server; send message, then close the connection. If Client wants to transmit information again, repeat the above steps. Because establishing and destroying TCP sessions have overhead, if Client needs to have transactions with Server frequently, long-lived connection may be a better choice: connect Server; deliver message, deliver message, …, disconnect Server. A caveat about long-lived connection is Client may need to send heartbeat message to Server to keep the TCP session active.

(2) Synchronous/Asynchronous communication.
After Client sends the request, it can block here to wait for Server’s response, this is called synchronous mode. Certainly the Client should set a timer in case the response never come. The Client can also choose not to block, and continue to do other things. On the contrary, this is called asynchronous mode. If the Client can send multiple requests before receiving responses, it needs to add ID for every message, so it can distinguish the corresponding response for every request.

(3) Error handling.
A big pain point of socket programming is you need to consider so many exceptional cases. For example, the Server suddenly crashes; the network cable is plugged out, or the response message is half-received, etc. So your code should process as many abnormalities as possible. It is no exaggeration to say that error-handling code quality is the cornerstone of program’s robustness.

(4) Portability.
Different *NIX flavors may have small divergences on socket programming, so the program works well on Linux may not guarantee it also run as you expect on FreeBSD. BTW, I summarized a post about tips of Solaris/illumos socket programming before, and you can read it if you happen to work on these platforms.

(5) Leverage sniffer tools.
Tcpdump/Wireshark/snoop are amazing tools for debugging network programming, and they can tell you what really happens under the hood. Try to be sophisticated at these tools, and they will save you at one day, trust me!