Use OpenSSL to simulate TLS 1.3 “Session Resumption”

Thanks to the great help from the OpenSSL community, I was finally able to simulate a TLS 1.3 “Session Resumption”. The operating system I used is OmniOS, and the OpenSSL version is 1.1.1k, but I think the method here can also be applied to other platforms:

(1) Open one terminal to launch tcpdump to capture TLS packets:

$ pfexec /opt/ooce/sbin/tcpdump -w tls.pcap port 443

(2) Open another terminal to initiate the first TLS 1.3 session:

$ openssl s_client -connect cloudflare.com:443 -tls1_3 -sess_out sess.pem -keylogfile keys1.txt
......

Once the connection is established, input “GET /” to trigger the TLS 1.3 server to send the “New Session Ticket” message; the session will be saved in the sess.pem file.

(3) Initiate another TLS 1.3 session to reuse the saved “Session Ticket”:

$ echo | openssl s_client -connect cloudflare.com:443 -tls1_3 -sess_in sess.pem -keylogfile keys2.txt

(4) Stop the tcpdump process.

(5) Combine the two key files into one:

$ cat keys1.txt keys2.txt > keys.txt

Then keys.txt can be used to decrypt the two TLS 1.3 sessions (refer to Use Wireshark to decrypt TLS flows).
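
For reference, the same flow can also be driven from the OpenSSL C API. Below is a minimal sketch (assuming OpenSSL 1.1.1 or later; certificate verification and error handling are omitted, so treat it as an illustration rather than production code). In TLS 1.3 the session arrives in a “New Session Ticket” message after the handshake, so the client picks it up through the new-session callback and offers it on the next connection with SSL_set_session():

/* Sketch of TLS 1.3 session resumption via the OpenSSL C API.
 * Compile with: cc resume.c -lssl -lcrypto */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <openssl/ssl.h>

static SSL_SESSION *saved;     /* session ticket from the last connection */

/* TLS 1.3 delivers sessions in NewSessionTicket messages after the
 * handshake, so the client receives them through this callback. */
static int new_session_cb(SSL *ssl, SSL_SESSION *sess)
{
    (void)ssl;
    if (saved != NULL)
        SSL_SESSION_free(saved);
    saved = sess;
    return 1;                  /* 1 = we keep the reference */
}

static int tcp_connect(const char *host, const char *port)
{
    struct addrinfo hints = { 0 }, *res;
    hints.ai_socktype = SOCK_STREAM;
    getaddrinfo(host, port, &hints, &res);
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(fd, res->ai_addr, res->ai_addrlen);
    freeaddrinfo(res);
    return fd;
}

int main(void)
{
    SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
    SSL_CTX_set_min_proto_version(ctx, TLS1_3_VERSION);
    /* Client-side session caching must be on for the callback to fire. */
    SSL_CTX_set_session_cache_mode(ctx, SSL_SESS_CACHE_CLIENT);
    SSL_CTX_sess_set_new_cb(ctx, new_session_cb);

    for (int i = 0; i < 2; i++) {
        int fd = tcp_connect("cloudflare.com", "443");
        SSL *ssl = SSL_new(ctx);
        SSL_set_tlsext_host_name(ssl, "cloudflare.com");   /* SNI */
        SSL_set_fd(ssl, fd);
        if (i == 1 && saved != NULL)
            SSL_set_session(ssl, saved);   /* offer the saved ticket */
        SSL_connect(ssl);
        /* Like "GET /" in s_client: exchanging application data also
         * processes the post-handshake NewSessionTicket message. */
        const char *req = "GET / HTTP/1.0\r\n\r\n";
        SSL_write(ssl, req, (int)strlen(req));
        char buf[256];
        SSL_read(ssl, buf, sizeof buf);
        printf("connection %d reused: %d\n", i + 1, SSL_session_reused(ssl));
        SSL_shutdown(ssl);
        SSL_free(ssl);
        close(fd);
    }
    SSL_SESSION_free(saved);
    SSL_CTX_free(ctx);
    return 0;
}

The second connection should print “reused: 1”; this is the programmatic equivalent of the -sess_out/-sess_in pair above.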

Build OpenSSL on macOS

The default OpenSSL installed by brew is actually LibreSSL:

$ openssl version
LibreSSL 2.8.3

The method for building the real OpenSSL is as follows:

$ git clone https://github.com/openssl/openssl.git
$ cd openssl
$ mkdir build
$ cd build
$ ../Configure darwin64-x86_64 --debug --prefix=/Users/nanxiao/install
$ make 
$ make install

Check the freshly built OpenSSL:

$ /Users/nanxiao/install/bin/openssl version
OpenSSL 3.0.0-beta2-dev  (Library: OpenSSL 3.0.0-beta2-dev )
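
If you want to double-check from code which library a binary is actually linked against, a tiny program calling OpenSSL_version() works too (the include/library paths below just reuse the install prefix from the example above):

/* Print the version string of the linked OpenSSL library.
 * Build against the fresh install, e.g.:
 *   cc version.c -I/Users/nanxiao/install/include \
 *      -L/Users/nanxiao/install/lib -lcrypto */
#include <stdio.h>
#include <openssl/crypto.h>

int main(void)
{
    printf("%s\n", OpenSSL_version(OPENSSL_VERSION));
    return 0;
}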

How to process a large file?

In Process large data in external memory, I mentioned:

Update: Splitting a large file into smaller ones and using multiple threads to handle them is a good idea.

I want to elaborate here on how to process a large file:

(1) Split the large file into small ones that are independent of each other, e.g., based on users. Then you can spawn multiple threads to process the small files.

(2) For the output: if all threads write to the same file, the write operations must be atomic, and the shared file will become a bottleneck of the program. So every thread should have its own output file.

(3) After all threads exit, the main thread can use cat or other methods to consolidate all the output files into one.
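
Here is a minimal sketch of this recipe in C with POSIX threads; the chunk/output file names and the process_line() helper are placeholders for your own splitting scheme and per-record logic:

/* One worker thread per chunk, each with its own output file, so the
 * write path needs no locking. Compile with: cc split.c -pthread */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

/* Placeholder: real per-record work goes here. */
static void process_line(const char *line, FILE *out)
{
    fputs(line, out);
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    char in_name[64], out_name[64], line[4096];

    snprintf(in_name, sizeof in_name, "chunk.%d", id);
    snprintf(out_name, sizeof out_name, "out.%d", id);

    FILE *in = fopen(in_name, "r");
    FILE *out = fopen(out_name, "w");   /* private output: no contention */
    while (in && out && fgets(line, sizeof line, in))
        process_line(line, out);
    if (in) fclose(in);
    if (out) fclose(out);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    /* Main thread consolidates, e.g. "cat out.* > result.txt". */
    return 0;
}

Because each worker owns its output file, the only synchronization needed is the final pthread_join().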

Beware of using GNU libc basename() function

From the manual page, we know there are two versions of the basename() implementation. One is POSIX-compliant:

#include <libgen.h>

char *basename(char *path);

The other is the GNU version:

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <string.h>

But the manual doesn’t mention that the prototype of the GNU version differs from the POSIX-compliant one (the parameter type is const char*, not char*):

char *basename (const char *__filename)

And the implementation is also simple; it just invokes strrchr():

char *
__basename (const char *filename)
{
  char *p = strrchr (filename, '/');
  return p ? p + 1 : (char *) filename;
}
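
The difference is easy to demonstrate. In the sketch below (glibc assumed), including only <string.h> with _GNU_SOURCE selects the GNU version, which never writes to its argument; including <libgen.h> would switch to the POSIX version, which may modify the buffer in place, so passing a string literal to it is unsafe:

/* With glibc, you get the POSIX basename() when <libgen.h> is included,
 * and the GNU version (from <string.h>) otherwise. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>   /* GNU basename(): const char *, non-destructive */

int main(void)
{
    /* Safe with the GNU version: the argument is never modified. */
    printf("%s\n", basename("/usr/lib/"));   /* prints "" (empty line) */
    printf("%s\n", basename("/usr/lib"));    /* prints "lib" */

    /* With <libgen.h> instead, basename("/usr/lib/") would return "lib"
     * and might modify the string, so a writable buffer is required:
     *   char path[] = "/usr/lib/";
     *   printf("%s\n", basename(path));
     */
    return 0;
}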

Rewrite a Python program in C to boost performance

Recently I converted a Python program to C. The Python program takes about 1 hour to finish the task:

$ /usr/bin/time -v taskset -c 35 python_program ....
......
        User time (seconds): 3553.48
        System time (seconds): 97.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12048772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11434463
        Voluntary context switches: 58873
        Involuntary context switches: 21529
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4704
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

while the C program only needs about 5 minutes:

$ /usr/bin/time -v taskset -c 35 c_program ....
......
        User time (seconds): 282.45
        System time (seconds): 8.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16430216
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3962437
        Voluntary context switches: 14
        Involuntary context switches: 388
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4960
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

From /usr/bin/time’s output, we can see the Python program uses less memory than the C program, but suffers more “page faults” and “context switches”.