OpenBSD gives a hint on forgetting unlock mutex

Check following simple C++ program:

#include <mutex>

int main(void)
{
    std::mutex m;
    m.lock();

    return 0;
}

The mutex m forgot unlock itself before exiting main function:

m.unlock();

Test it on GNU/Linux, and I chose ArchLinux as the testbed:

$ uname -a
Linux fujitsu-i 4.13.7-1-ARCH #1 SMP PREEMPT Sat Oct 14 20:13:26 CEST 2017 x86_64 GNU/Linux
$ clang++ -g -pthread -std=c++11 test_mutex.cpp
$ ./a.out
$

The process exited normally, and no more words was given. Build and run it on OpenBSD 6.2:

# clang++ -g -pthread -std=c++11 test_mutex.cpp
# ./a.out
pthread_mutex_destroy on mutex with waiters!

The OpenBSD prompts “pthread_mutex_destroy on mutex with waiters!“. Interesting!

The first successful build of OpenBSD base system

Although the document of building OpenBSD base system is very simple, it still costs me nearly half a week to complete this work. I shared the experience here and hope the lessons I learned can help more newbies like me:

(1) The most important thing is about /usr/obj folder: it must be clean; belongs to wobj and mode is 770. I confronted the annoyed and weird “Permission denied” issue and have recorded the incident in this post before.

(2) I used to think once /usr/obj is calibrated, the compilation will go smoothly. Unfortunately, a more bizarre linking issue harassed me again:

cc -O2 -pipe  -DPIE_DEFAULT=1         -o gdb gdb.o libgdb.a ../bfd/libbfd.a -lreadline ../opcodes/libopcodes.a  -liberty -lncurses -lm     -liberty  -lkvm
libgdb.a(main.o): In function `captured_main':
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x7e6): warning:
warning: strcpy() is almost always misused, please use strlcpy()
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x7ff): warning:
warning: strcat() is almost always misused, please use strlcat()
libgdb.a(target.o): In function `normal_pid_to_str':
/usr/src/gnu/usr.bin/binutils/gdb/target.c:(.text+0x3047): warning:
warning: sprintf() is often misused, please use snprintf()
libgdb.a(main.o): In function `captured_main':
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0xa4): undefined reference to `bindtextdomain'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0xac): undefined reference to `textdomain'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x4b7): undefined reference to `dcgettext'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x606): undefined reference to `dcgettext'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x78d): undefined reference to `dcgettext'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0x9b9): undefined reference to `dcgettext'
/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0xa60): undefined reference to `dcgettext'
libgdb.a(main.o):/usr/src/gnu/usr.bin/binutils/gdb/main.c:(.text+0xb46): more undefined references to `dcgettext' follow
cc: error: linker command failed with exit code 1 (use -v to see invocation)
*** Error 1 in gnu/usr.bin/binutils/obj/gdb (Makefile:1176 'gdb')
*** Error 1 in gnu/usr.bin/binutils/obj (Makefile:21479 'all-gdb')
*** Error 1 in gnu/usr.bin/binutils (Makefile.bsd-wrapper:46 'all')
*** Error 1 in gnu/usr.bin (<bsd.subdir.mk>:48 'all')
*** Error 1 in gnu (<bsd.subdir.mk>:48 'all')
*** Error 1 in . (<bsd.subdir.mk>:48 'all')
*** Error 1 in . (Makefile:95 'do-build')
*** Error 1 in /usr/src (Makefile:74 'build')

I totally get lost in this blocking issue. After sending helps in both mailing list and Facebook, though many veterans gave good comments, after all they can’t approach my server, I still need to fix this issue myself at the end.

I began to doubt the upgrade of v6.2 because this is also the first time I upgrade OpenBSD. Although I couldn’t explain why the upgrade can be related to this link error, I don’t have other methods So I tried to follow the manual to do upgrade again. Now that I had done the OS upgrade, I also tried to refresh -current source code:

# cvs -q up -Pd
? mcount.d
? mcount.po
? gnu/usr.bin/binutils/gdb/.gdbinit
? gnu/usr.bin/binutils/gdb/Makefile
? gnu/usr.bin/binutils/gdb/config.cache
? gnu/usr.bin/binutils/gdb/config.h
? gnu/usr.bin/binutils/gdb/config.log
? gnu/usr.bin/binutils/gdb/config.status
? gnu/usr.bin/binutils/gdb/stamp-h
? gnu/usr.bin/binutils/gdb/doc/Makefile
? gnu/usr.bin/binutils/gdb/doc/config.log
? gnu/usr.bin/binutils/gdb/doc/config.status
? gnu/usr.bin/binutils/gdb/testsuite/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/config.log
? gnu/usr.bin/binutils/gdb/testsuite/config.status
? gnu/usr.bin/binutils/gdb/testsuite/gdb.ada/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.ada/gnat_ada.gpr
? gnu/usr.bin/binutils/gdb/testsuite/gdb.arch/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.asm/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.base/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.cp/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.disasm/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.dwarf2/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.fortran/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.java/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.mi/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.objc/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.stabs/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.stabs/config.log
? gnu/usr.bin/binutils/gdb/testsuite/gdb.stabs/config.status
? gnu/usr.bin/binutils/gdb/testsuite/gdb.threads/Makefile
? gnu/usr.bin/binutils/gdb/testsuite/gdb.trace/Makefile
...... 

Man! Why there are files began with ?? I see, I did some experiments in /usr/src directory before, and I didn’t notice that the source code had been polluted. That may be the root cause! As expected, after cleaning up the /usr/src, the build succeed finally!

It is time to wrap up, and the takeaways of building OpenBSD base system are as following:
(1) Clean up the /usr/obj and check its attributes carefully!
(2) Keep /usr/src synchronized with upstream and make sure it’s not messed up!

Clean out /usr/obj directory before compiling base system of OpenBSD

I tried to build base system of OpenBSD:

# cd /usr/src 
# make obj && make build

After a while, the following weird error was reported:

cc -O2 -pipe -g -Wimplicit -I/usr/src/lib/libc/include -I/usr/src/lib/libc/hidden -D__LIBC__  -Werror-implicit-function-declaration -include namespace.h -Werror=deprecated-declarations -DAPIWARN -DYP -I/usr/src/lib/libc/yp -I/usr/src/lib/libc -I/usr/src/lib/libc/gdtoa -I/usr/src/lib/libc/arch/amd64/gdtoa -DINFNAN_CHECK -DMULTIPLE_THREADS -DNO_FENV_H -DUSE_LOCALE -I/usr/src/lib/libc -I/usr/src/lib/libc/citrus -DRESOLVSORT -DFLOATING_POINT -DPRINTF_WIDE_CHAR -DSCANF_WIDE_CHAR -DFUTEX  -MD -MP  -c -fno-pie /usr/src/lib/libc/gmon/mcount.c -o mcount.po
error: error opening 'mcount.po.d': Permission denied
1 error generated.
*** Error 1 in lib/libc (gmon/Makefile.inc:12 'mcount.po': @cc -O2 -pipe -g -Wimplicit -I/usr/src/lib/libc/include -I/usr/src/lib/libc/hidde...)
*** Error 1 in lib (<bsd.subdir.mk>:48 'all')
*** Error 1 in . (Makefile:90 'do-build')
*** Error 1 in /usr/src (Makefile:74 'build')

I worked as a root, why don’t I have permission?

After checking manual again:

The build process will place the object files in a tree under /usr/obj. This directory must be owned by build:wobj with mode 770.

The first time, the /usr/obj directory must be cleaned out completely before proceeding to avoid permission issues. After a successful release build, this is no longer needed.

I checked my /usr/obj immediately:

# ls -alt
......
drwxrwx---  15 build  wobj    512 Sep  6 09:44 obj
......

The group and mode is correct, wait, today is October, 14th, but the date of directory is September, 6th. That’s should be the issue.

After cleaning out the /usr/obj and reissuing “make obj && make build“, the above error disappeared.

Leverage comprehensive debugging tricks in one shot

Last Friday, a colleague told me that when connecting an invalid address, the client using gRPC will block forever. To verify it, I use the example code shipped in gRPC:

GreeterClient greeter(grpc::CreateChannel(
  "localhost:50051", grpc::InsecureChannelCredentials()));

Change the "localhost:50051" to "badhost:50051", then compile and execute the program. Sure enough, the client hang without any response. At the outset, I thought it should be a common issue, and there must be a solution already. So I just submitted a post in the discussion group, although there was some responses, but since they were not the satisfactory explanations, I knew I need to trouble-shooting myself.

(1) The first thing I wanted to make sure was whether the network card had sent requests to badhost or not, so I used tcpdump to capture the packets:

$ sudo tcpdump -A -s 0 'port 50051' -i enp7s0f0

But there isn’t any data captured. To double-confirm, I also used tcpconnect program to check:

$ sudo tcpconnect -P 50051
PID    COMM         IP SADDR            DADDR            DPORT

Still nothing output.

(2) Although I couldn’t find the connect request to port 50051, no matter what application on *NIX, it will definitely call connect function at the end. So I changed the tactic, and tried to find who calls the connect:

a) Build gRPC with debugging info (The reason of using “PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig” is here):

$ PKG_CONFIG_PATH=/usr/lib/openssl-1.0/pkgconfig CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" make

b) Modify the Makefile to build client program with debugging info:

CXXFLAGS += -g -std=c++11

c) Use gdb to debug the program, after starting it, set breakpoint at connect function:

$ gdb -q greeter_client
Reading symbols from greeter_client...done.
(gdb) start
Temporary breakpoint 1 at 0x146fe: file greeter_client.cc, line 74.
Starting program: /home/xiaonan/Project/grpc/examples/cpp/helloworld/greeter_client
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=1, argv=0x7fffffffea88) at greeter_client.cc:74
74      int main(int argc, char** argv) {
(gdb) b connect
Breakpoint 2 at 0x7ffff6619b80 (2 locations)

Then continue executing the program. When the breakpoint was hit, check the stack:

(gdb) c
Continuing.
[New Thread 0x7ffff4edc700 (LWP 28396)]
[New Thread 0x7ffff46db700 (LWP 28397)]
[Switching to Thread 0x7ffff4edc700 (LWP 28396)]

Thread 2 "greeter_client" hit Breakpoint 2, 0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

Then continue to run the program, the breakpoint was hit again:

(gdb) bt
#0  0x00007ffff6619b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff664e61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff664f156 in __nscd_open_socket () from /usr/lib/libc.so.6
#3  0x00007ffff664ccc6 in __nscd_getai () from /usr/lib/libc.so.6
#4  0x00007ffff66038bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#5  0x00007ffff6604724 in getaddrinfo () from /usr/lib/libc.so.6
#6  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#7  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#8  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#9  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#10 0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#11 0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6
(gdb)

Oh, I see! The resolving of badhost must be failed, so there would definitely no subsequent connecting port 50051. But why the client was trying to resolve name again and again? If I find this cause, it can explain why client was blocking.

(3) Since there is ?? from /usr/lib/libgrpc.so.4, I can’t know which function was the culprit. I can go over the code, but I think I need the direct proof. Build gRPC with CC=clang CXX=clang++ CFLAGS="-g -O0" CXXFLAGS="-g -O0" seems not enough. After some tweaking, I come out the following solutions:

a) According to the Makefile:

# TODO(nnoble): the strip target is stripping in-place, instead
# of copying files in a temporary folder.
# This prevents proper debugging after running make install.  

make install” will strip the debugging information, so instead of executing “make install” command, I set LD_LIBRARY_PATH environment variable to let client link library in the specified directory:

$ export LD_LIBRARY_PATH=/home/xiaonan/Project/grpc/libs/opt

b) Hardcode -g in the Makefile:

CFLAGS += -g -std=c99 -Wsign-conversion -Wconversion $(W_SHADOW) $(W_EXTRA_SEMI)
CXXFLAGS += -g -std=c++11

Then the symbols can all be resolved:

(gdb) bt
#0  0x00007ffff6486b80 in connect () from /usr/lib/libc.so.6
#1  0x00007ffff64bb61e in open_socket () from /usr/lib/libc.so.6
#2  0x00007ffff64bbae2 in __nscd_get_mapping () from /usr/lib/libc.so.6
#3  0x00007ffff64bbed5 in __nscd_get_map_ref () from /usr/lib/libc.so.6
#4  0x00007ffff64b9ba3 in __nscd_getai () from /usr/lib/libc.so.6
#5  0x00007ffff64708bc in gaih_inet.constprop () from /usr/lib/libc.so.6
#6  0x00007ffff6471724 in getaddrinfo () from /usr/lib/libc.so.6
#7  0x00007ffff7473ec5 in blocking_resolve_address_impl (name=0x55555578edf0 "badhost:50051",
    default_port=0x555555790220 "https", addresses=0x55555578f1f0) at src/core/lib/iomgr/resolve_address_posix.c:83
#8  0x00007ffff74742e3 in do_request_thread (exec_ctx=0x7ffff5043c30, rp=0x55555578e630, error=<optimized out>)
    at src/core/lib/iomgr/resolve_address_posix.c:157
#9  0x00007ffff7472b86 in run_closures (exec_ctx=<optimized out>, list=...) at src/core/lib/iomgr/executor.c:64
#10 executor_thread (arg=0x555555789fc0) at src/core/lib/iomgr/executor.c:152
#11 0x00007ffff74e5286 in thread_body (v=<optimized out>) at src/core/lib/support/thd_posix.c:42
#12 0x00007ffff6181049 in start_thread () from /usr/lib/../lib64/libpthread.so.0
#13 0x00007ffff6485f0f in clone () from /usr/lib/libc.so.6

Now I just need to step-into code, and the information of this issue can also be referred here.

During the whole process, I used sniffer tool (tcpdump), kernel tracing tool(tcpconnect, which belongs to bcc and utilizes eBPF), networking knowledge (set breakpoint on connect function), debugging tool (gdb), and the trick of linking library (set LD_LIBRARY_PATH to bypass installing gRPC), that’s why I call the whole procedure “leverage comprehensive debugging tricks”.

 

Build static library of gRPC

If you want to build static library of gRPC, you can check following steps:

$ git clone https://github.com/grpc/grpc.git
$ cd grpc
$ git submodule update --init
$ CC=clang CXX=clang++ make REQUIRE_CUSTOM_LIBRARIES_opt=1 static

If you bump into following errors:

third_party/protobuf/src/google/protobuf/stubs/port.h:263:5: error: macro expansion producing 'defined' has undefined
  behavior [-Werror,-Wexpansion-to-defined]
#if GOOGLE_PROTOBUF_USE_UNALIGNED
    ^

Please check this patch. After compile, you can find all libraries are static:

$ ls -alt libs/opt/*
-rw-r--r-- 1 xiaonan xiaonan  8109188 Oct  6 17:54 libs/opt/libgrpc++_unsecure.a
-rw-r--r-- 1 xiaonan xiaonan  3705192 Oct  6 17:54 libs/opt/libgrpc++_reflection.a
-rw-r--r-- 1 xiaonan xiaonan  2089436 Oct  6 17:54 libs/opt/libgrpc++_error_details.a
-rw-r--r-- 1 xiaonan xiaonan 24675866 Oct  6 17:54 libs/opt/libgrpc++_cronet.a
-rw-r--r-- 1 xiaonan xiaonan  9276572 Oct  6 17:54 libs/opt/libgrpc++.a
-rw-r--r-- 1 xiaonan xiaonan   917148 Oct  6 17:54 libs/opt/libgrpc++_core_stats.a
-rw-r--r-- 1 xiaonan xiaonan  7180018 Oct  6 17:54 libs/opt/libgrpc_plugin_support.a
-rw-r--r-- 1 xiaonan xiaonan  8163444 Oct  6 17:41 libs/opt/libgrpc_unsecure.a
-rw-r--r-- 1 xiaonan xiaonan 18854796 Oct  6 17:41 libs/opt/libgrpc_cronet.a
-rw-r--r-- 1 xiaonan xiaonan 20064046 Oct  6 17:41 libs/opt/libgrpc.a
-rw-r--r-- 1 xiaonan xiaonan 10669618 Oct  6 17:40 libs/opt/libboringssl.a
-rw-r--r-- 1 xiaonan xiaonan   456458 Oct  6 17:39 libs/opt/libgpr.a
-rw-r--r-- 1 xiaonan xiaonan   379730 Oct  6 17:39 libs/opt/libz.a
-rw-r--r-- 1 xiaonan xiaonan   629666 Oct  6 17:39 libs/opt/libares.a

libs/opt/protobuf:
total 114748
drwxr-xr-x 4 xiaonan xiaonan     4096 Oct  6 17:54 ..
-rw-r--r-- 1 xiaonan xiaonan 50492478 Oct  6 17:49 libprotobuf.a
drwxr-xr-x 2 xiaonan xiaonan     4096 Oct  6 17:49 .
-rw-r--r-- 1 xiaonan xiaonan 66986782 Oct  6 17:49 libprotoc.a

libs/opt/pkgconfig:
total 24
drwxr-xr-x 4 xiaonan xiaonan 4096 Oct  6 17:54 ..
drwxr-xr-x 2 xiaonan xiaonan 4096 Oct  6 17:41 .
-rw-r--r-- 1 xiaonan xiaonan  283 Oct  6 17:41 grpc++_unsecure.pc
-rw-r--r-- 1 xiaonan xiaonan  253 Oct  6 17:41 grpc++.pc
-rw-r--r-- 1 xiaonan xiaonan  283 Oct  6 17:39 grpc_unsecure.pc
-rw-r--r-- 1 xiaonan xiaonan  264 Oct  6 17:39 grpc.pc