November, 2017 | Nan Xiao's Blog

The gRPC server program will crash when can’t bind successfully

a server program which uses gRPC crashed when started:

$ ./server 60001
......
E1123 15:37:54.133480971   14408 server_chttp2.c:38]         {"created":"@1511422674.133408109","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.c","file_line":245,"referenced_errors":[{"created":"@1511422674.133405147","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.c","file_line":338,"referenced_errors":[{"created":"@1511422674.133394827","description":"Unable to configure socket","fd":4,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c","file_line":200,"referenced_errors":[{"created":"@1511422674.133385167","description":"OS Error","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c","file_line":173,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1511422674.133404647","description":"Unable to configure socket","fd":4,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c","file_line":200,"referenced_errors":[{"created":"@1511422674.133401558","description":"OS Error","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}]}
Segmentation fault (core dumped)

It is weird because it runs well yesterday. Check the core dump file:

......
Core was generated by `./server 60001'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f30edbbf9b0 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f30edbbf9b0 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#1  0x000055c312c87077 in grpc::Server::Wait() ()
......

No clue. So format the above json log:

{  
   "created":"@1511422674.133408109",
   "description":"No address added out of total 1 resolved",
   "file":"src/core/ext/transport/chttp2/server/chttp2_server.c",
   "file_line":245,
   "referenced_errors":[  
      {  
         "created":"@1511422674.133405147",
         "description":"Failed to add any wildcard listeners",
         "file":"src/core/lib/iomgr/tcp_server_posix.c",
         "file_line":338,
         "referenced_errors":[  
            {  
               "created":"@1511422674.133394827",
               "description":"Unable to configure socket",
               "fd":4,
               "file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c",
               "file_line":200,
               "referenced_errors":[  
                  {  
                     "created":"@1511422674.133385167",
                     "description":"OS Error",
                     "errno":98,
                     "file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c",
                     "file_line":173,
                     "os_error":"Address already in use",
                     "syscall":"bind"
                  }
               ]
            },
            {  
               "created":"@1511422674.133404647",
               "description":"Unable to configure socket",
               "fd":4,
               "file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c",
               "file_line":200,
               "referenced_errors":[  
                  {  
                     "created":"@1511422674.133401558",
                     "description":"OS Error",
                     "errno":98,
                     "file":"src/core/lib/iomgr/tcp_server_utils_posix_common.c",
                     "file_line":173,
                     "os_error":"Address already in use",
                     "syscall":"bind"
                  }
               ]
            }
         ]
      }

Oh, I see. “Address already in use” means the 60001 port is occupied already. I switch to another port, and it works.

Build SEAL on Linux

Building SEAL(Simple Encrypted Arithmetic Library) on Linux needs some tweaks:

(1) Add executable attribute for configure file:

$ chmod a+x configure

(2) Running configure generates following errors:

$ ./configure
-bash: ./configure: /bin/sh^M: bad interpreter: No such file or directory

dos2unix doesn’t take effect too:

$ dos2unix configure
dos2unix: Binary symbol 0x07 found at line 3911
dos2unix: Skipping binary file configure

Need tr to save me:

$ tr -d '\r' < configure > temp
$ mv temp configure

Then compiling is OK:

$ ./configure
$ make

The difference of loopback packets on Linux and OpenBSD

Capture the packets on loopback network card on Linux:

# tcpdump -i lo -w lo.pcap port 33333
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
......

Download it onto Windows and use wireshark to analyze it:

We can see every packet conforms to standard ethernet format.

Capture lookback packets on OpenBSD:

# tcpdump -i lo0 -w lo.pcap port 33333
tcpdump: listening on lo0, link-type LOOP
......

Also download it onto Windows and open it with wireshark:

The wireshark just recognizes the packet as “Raw IP” format, but can’t show details.

After referring discussion in Wireshark mailing list, I know it is related to network link-layer header type. 0x0C stands for “Raw IP”:

I modified the 0x0C to 0x6C, which means “OpenBSD loopback”:

Now the packets can be decoded successfully:

P.S., I also started a discussion about this issue in mailing list.

Update: I write a script to do this conversion.

The gRPC will terminate process when it can’t allocate memory

My program leverages gRPC, and after a stress testing, it crashed. Use gdb to debug the core dump file:

[Current thread is 1 (Thread 0x7f73ef5f1780 (LWP 147393))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f73edee542a in __GI_abort () at abort.c:89
#2  0x00005580f74da93a in gpr_malloc ()
#3  0x00005580f74e4b65 in ?? ()
#4  0x00005580f74e28d4 in grpc_exec_ctx_flush ()
#5  0x00005580f758aee1 in ?? ()
#6  0x00005580f75061b7 in grpc_pollset_work ()
#7  0x00005580f74f1c78 in ?? ()
#8  0x00005580f72ab85e in grpc::CompletionQueue::Pluck (this=0x5580f96b3698, tag=0x7fff9cafbe90)
......

I doubt the root cause should be memory is not enough, but not sure. Check the source code of gpr_malloc:

void* gpr_malloc(size_t size) {
  void* p;
  if (size == 0) return nullptr;
  GPR_TIMER_BEGIN("gpr_malloc", 0);
  p = g_alloc_functions.malloc_fn(size);
  if (!p) {
    abort();
  }
  GPR_TIMER_END("gpr_malloc", 0);
  return p;
}

We can see if p is NULL, the abort() system call will be invoked. This verifies what I guessed. Besides gpr_malloc, other memory allocate functions (such as gpr_zalloc, gpr_realloc, etc) have the same behaviors.

Learn socket programming tips from netcat

Since netcat is honored as “TCP/IP swiss army knife”, I read its source code in OpenBSD to summarize some socket programming tips:

(1) Client connects in non-blocking mode:

......
s = socket(res->ai_family, res->ai_socktype |
            SOCK_NONBLOCK, res->ai_protocol);


......  
if ((ret = connect(s, name, namelen)) != 0 && errno == EINPROGRESS) {
        pfd.fd = s;
        pfd.events = POLLOUT;
        ret = poll(&pfd, 1, timeout));
}
......

Creating socket and set SOCK_NONBLOCK mode for it. Then calling connect() function, if ret is 0, it means connection is established successfully; if errno is EINPROGRESS, we can use timeout to control how long to wait; otherwise the connection can’t be built.

(2) The usage of poll():

......
/* stdin */
pfd[POLL_STDIN].fd = stdin_fd;
pfd[POLL_STDIN].events = POLLIN;

/* network out */
pfd[POLL_NETOUT].fd = net_fd;
pfd[POLL_NETOUT].events = 0;

/* network in */
pfd[POLL_NETIN].fd = net_fd;
pfd[POLL_NETIN].events = POLLIN;

/* stdout */
pfd[POLL_STDOUT].fd = stdout_fd;
pfd[POLL_STDOUT].events = 0;

......
/* poll */
num_fds = poll(pfd, 4, timeout);

/* treat poll errors */
if (num_fds == -1)
    err(1, "polling error");

/* timeout happened */
if (num_fds == 0)
    return;

/* treat socket error conditions */
for (n = 0; n < 4; n++) {
    if (pfd[n].revents & (POLLERR|POLLNVAL)) {
        pfd[n].fd = -1;
    }
}
/* reading is possible after HUP */
if (pfd[POLL_STDIN].events & POLLIN &&
    pfd[POLL_STDIN].revents & POLLHUP &&
    !(pfd[POLL_STDIN].revents & POLLIN))
    pfd[POLL_STDIN].fd = -1;

Usually, we just need to care about file descriptors for reading:

pfd[POLL_STDIN].fd = stdin_fd;
pfd[POLL_STDIN].events = POLLIN;

no need to monitor file descriptors for writing:

/* network out */
pfd[POLL_NETOUT].fd = net_fd;
pfd[POLL_NETOUT].events = 0;

According to poll() manual from OpenBSD, if no need for “high-priority” (maybe out-of-band) data, POLLIN is enough, otherwise the monitor events should be POLLIN|POLLPRI. And this is similar for POLLOUT and POLLWRBAND.

There are 3 values(POLLERR, POLLNVAL and POLLHUP) which are only used in struct pollfd‘s revents. If POLLERR or POLLNVAL is detected, it’s not necessary to poll this file descriptor furthermore:

if (pfd[n].revents & (POLLERR|POLLNVAL)) {
    pfd[n].fd = -1;
}

We should pay more attention to POLLHUP:
(a)

POLLHUP

The device or socket has been disconnected. This event and POLLOUT are mutually-exclusive; a descriptor can never be writable if a hangup has occurred. However, this event and POLLIN, POLLRDNORM, POLLRDBAND, or POLLPRI are not mutually-exclusive. This flag is only valid in the revents bitmask; it is ignored in the events member.

(b)

The second difference is that on EOF there is no guarantee that POLLIN will be set in revents, the caller must also check for POLLHUP.

So it means if POLLHUP and POLLIN are both set in revents, there must be data to read (maybe EOF?), otherwise if only POLLHUP is checked, there is no data to read from.