debug | Nan Xiao's Blog

Why does SSL client report google’s certificate “self-signed”?

In previous post, I implemented a simple HTTPS client, but the program has a small flaw, i.e., when connecting to “www.google.com:443“, it will report following error in verifying certificate:

error code is 18:self signed certificate

error code is from SSL_get_verify_result:

long SSL_get_verify_result(const SSL *ssl)
{
    return ssl->verify_result;
}

and 18 is mapping to X509_V_ERR_DEPTH_ZERO_SELF_SIGNED_CERT, which means “self-signed certificate”. But for other websites, e.g., facebook.com, no error is outputted.

Use OpenSSL‘s client-arg program to test:

# LD_LIBRARY_PATH=/root/openssl/build gdb --args ./client-arg -connect "www.google.com:443"
......
Thread 2 hit Breakpoint 1, main (argc=3, argv=0xfffffc7fffdf4c38) at client-arg.c:99
99      BIO_puts(sbio, "GET / HTTP/1.0\n\n");
(gdb) p ssl->verify_result
$1 = 18
(gdb)

The same error code: 18. But openssl-s_client can guarantee the certificate is not “self-signed”:

# LD_LIBRARY_PATH=/root/openssl/build openssl s_client -connect google.com:443
CONNECTED(00000004)
depth=2 OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
verify return:1
depth=1 C = US, O = Google Trust Services, CN = GTS CA 1O1
verify return:1
depth=0 C = US, ST = California, L = Mountain View, O = Google LLC, CN = *.google.com
verify return:1
---
Certificate chain
 0 s:C = US, ST = California, L = Mountain View, O = Google LLC, CN = *.google.com
   i:C = US, O = Google Trust Services, CN = GTS CA 1O1
 1 s:C = US, O = Google Trust Services, CN = GTS CA 1O1
   i:OU = GlobalSign Root CA - R2, O = GlobalSign, CN = GlobalSign
---
......

Hmm, I need to find the root cause.

First of all, I searched the code to see when X509_V_ERR_DEPTH_ZERO_SELF_SIGNED_CERT is set, and found only one spot:

if (self_signed)
            return verify_cb_cert(ctx, NULL, num - 1,
                                  sk_X509_num(ctx->chain) == 1
                                  ? X509_V_ERR_DEPTH_ZERO_SELF_SIGNED_CERT
                                  : X509_V_ERR_SELF_SIGNED_CERT_IN_CHAIN);

The interesting thing is the amount of certificates in the chain is only 1, but from above openssl-s_client‘s output, there are 2 certificates in the chain. OK, let’s see the content of this “self-signed” certificate.

After some debugging, I finally found tls_process_server_certificate, which is used to process the server’s certificate. With the help of gdb, I can dump the content of certificate:

# gdb --args ./client www.google.com:443
.......
(gdb) b tls_process_server_certificate
......
Thread 2 hit Breakpoint 1, tls_process_server_certificate (s=0xf09e90, pkt=0xfffffc7fffdefe30)
    at ../ssl/statem/statem_clnt.c:1768
1768        X509 *x = NULL;
......
1806            if (certbytes != (certstart + cert_len)) {
(gdb)
1811            if (SSL_IS_TLS13(s)) {
(gdb) dump binary memory cert certstart certstart + cert_len
......

Try to check the cert file:

# cat cert
......
�0� *�H��       (No SNI provided; please fix your client.10Uinvalid2.invalid0�"0
��bO����
.....

The reason is obvious: “No SNI provided; please fix your client.”. Ah, I need to set SNI explicitly. After invoking SSL_set_tlsext_host_name, the certificate chain becomes correct (The new code can be downloaded here).

Summary: I am not an SSL/TLS expert, and OpenSSL project is complex and daunting. But with some basic SSL/TLS knowledge and the help of debugger, I can find the root cause of issues independently. Don’t give up, digest code bit by bit, finally you will win!

The registers’ values in core dump file on x86_64

On x86_64 platforms, some registers are “caller-saved” whilst others are “callee-saved” (refer AMD64 Calling Conventions for Linux / Mac OSX), or from Optimizing subroutines in assembly language, section 4.1, Register usage, “Registers that
can be used freely” (“caller-saved”) and “Registers that must be saved and restored” (“callee-saved”). When using gdb to display registers values, the values are relative to the selected stack frame (Refer Registers):

Normally, register values are relative to the selected stack frame (see Selecting a Frame). This means that you get the value that the register would contain if all stack frames farther in were exited and their saved registers restored. In order to see the true contents of hardware registers, you must select the innermost frame (with ‘frame 0’).

……

Also, the more “outer” the frame is you’re looking at, the more likely a call-clobbered register’s value is to be wrong, in the sense that it doesn’t actually represent the value the register had just before the call.

So it means when using gdb to analyse core dump file, you must pay attention to the registers values since it may not reflect correct values of current stack frame. Check following diagram:

You can see only RSP, RIP and “callee-saved” registers are different among frame 0, 7 and 8.

Gcc’s “-fstack-protector-strong” option

Gcc‘s “-fstack-protector-strong” helped me catch an array overflow bug recently. The “-fstack-protector-strong” option will add “canary” in the function stack, when function returns, it would check whether the guard is corrupted or not. If corrupted, __stack_chk_fail() will be invoked:

    0x00007ffff5138674 <+52>:   mov    -0x38(%rbp),%rax
    0x00007ffff5138678 <+56>:   xor    %fs:0x28,%rax
    0x00007ffff5138681 <+65>:   jne    0x7ffff5138ff3 <function+2483>
    ......
    0x00007ffff5138ff3 <+2483>: callq  0x7ffff50c2100 <__stack_chk_fail@plt>

And the program will crash:

*** stack smashing detected ***: program terminated
Segmentation fault

Use gdb to check:

(gdb) bt
#0  0x00007fffde26e0b8 in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007fffde26efb9 in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#2  0x00007fffde890aa6 in backtrace () from /usr/lib64/libc.so.6
#3  0x00007fffde7f4ef4 in __libc_message () from /usr/lib64/libc.so.6
#4  0x00007fffde894577 in __fortify_fail () from /usr/lib64/libc.so.6
#5  0x00007fffde894532 in __stack_chk_fail () from /usr/lib64/libc.so.6
#6  0x00007ffff5138ff8 in function () at src.c:685
#7  0x045b9fd4c77e2ff3 in ?? ()
#8  0x9a8ad8e7e2eb8ca8 in ?? ()
#9  0x0fa0e627193655f1 in ?? ()
#10 0xfc295178098bb96f in ?? ()
#11 0xa09a574a7780cd13 in ?? ()
......

The function frames and return addresses are overwritten, so the call stack can’t be recovered. Please be aware that the line which gdb prints:

#6  0x00007ffff5138ff8 in function () at src.c:685

may not be related to culprit!

Treasure core dump file

Recently I fixed a memory corruption issue, i.e., for a 8-byte memory address, one byte in the middle was set to 0, so the memory address became invalid, and accessing of this memory caused program crash. This reminded me another memory corruption issue which I fixed before. From my experience, this kind of memory corruption issues are very difficult to debug: the adjacent memory is all good, only one or several bytes are changed to other values. These bugs are not obvious out-of-bound memory access problems, and difficult to find methods to reproduce.

Generally speaking, logs can’t always give you a hand when the memory is random corrupted, not mention in some situations, the traces won’t be provided for reasons. The only thing you can get is the core dump file, and you must utilize the file and try to unearth as much information as possible. E.g., from program’s perspective, what was the state of program when it crashed? Except the ruined memory, were there other abnormalities? From system’s perspective, have you observed all the registers’ values? Are they all valid? If not, which part of code can cause it?

So every time, when you meet a not-easy reproduced bug, don’t freak out. Just calm down and begin to analyze core dump file carefully. You become a detective and core dump file is the crime scene. In reality, you can’t require the criminal to commit again. Similarly, not every bug can reoccur; you must try your best to find the root cause from the core dump file. From my experience, every tough debugging experience can make you understand program and system better. So it is a precious learning opportunity.

Treasure core dump file and enjoy debugging!

OpenBSD saves me again! — Debug a memory corruption issue

Yesterday, I came across a third-part library issue, which crashes at allocating memory:

......
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f594a5a9b6b in _int_malloc () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f594a5a9b6b in _int_malloc () from /usr/lib/libc.so.6
#1  0x00007f594a5ab503 in malloc () from /usr/lib/libc.so.6
#2  0x00007f594b13f159 in operator new (sz=5767168) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/new_op.cc:50
......

It is obvious that the memory tags are corrupted, but who is the murder? Since the library involves a lot of maths computation, it is not an easy task to grasp the code quickly. So I need to find another way:

(1) Open all warnings during compilation: -Wall. Nothing found.

(2) Use valgrind, but unfortunately, valgrind crashes itself:

......
valgrind: the 'impossible' happened:
   Killed by fatal signal

host stacktrace:
==43326==    at 0x58053139: get_bszB_as_is (m_mallocfree.c:303)
==43326==    by 0x58053139: get_bszB (m_mallocfree.c:315)
==43326==    by 0x58053139: vgPlain_arena_malloc (m_mallocfree.c:1799)
==43326==    by 0x5800BA84: vgMemCheck_new_block (mc_malloc_wrappers.c:372)
==43326==    by 0x5800BD39: vgMemCheck___builtin_vec_new (mc_malloc_wrappers.c:427)
==43326==    by 0x5809F785: do_client_request (scheduler.c:1866)
==43326==    by 0x5809F785: vgPlain_scheduler (scheduler.c:1433)
==43326==    by 0x580AED50: thread_wrapper (syswrap-linux.c:103)
==43326==    by 0x580AED50: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1
......

(3) Change compiler, use clang instead of gcc, and hope it can give me some clues. Still no effect.

(4) Switch Operating System from Linux to OpenBSD, the program crashes again. But this time, it tells me where the memory corruption occurs:

......
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000014b07f01e52d in addMod (r=<error reading variable>, a=4693443247995522, b=28622907746665631,
......

I figure out the issue quickly, and not bother to understand the whole code. OpenBSD saves me again, thanks!