Rewrite a Python program in C to boost performance

Recently I converted a Python program to C. The Python program takes about 1 hour to finish the task:

$ /usr/bin/time -v taskset -c 35 python_program ....
......
        User time (seconds): 3553.48
        System time (seconds): 97.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12048772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11434463
        Voluntary context switches: 58873
        Involuntary context switches: 21529
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4704
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

while the C program only needs about 5 minutes:

$ /usr/bin/time -v taskset -c 35 c_program ....
......
        User time (seconds): 282.45
        System time (seconds): 8.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16430216
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3962437
        Voluntary context switches: 14
        Involuntary context switches: 388
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4960
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

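From the output of /usr/bin/time, we can see that the Python program uses less memory than the C program (a maximum resident set size of 12048772 versus 16430216 kbytes), but incurs far more minor page faults (11434463 versus 3962437) and context switches (58873 voluntary and 21529 involuntary versus 14 and 388).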

Valgrind can’t work together with sanitizers

Valgrind can’t be used together with sanitizers. Consider the following program with an explicit memory leak:

# cat memory-leak.c
#include <stdlib.h>
void *p;
int main() {
  p = malloc(7);
  p = 0; // The memory is leaked here.
  return 0;
}

Build it and run with valgrind:

# gcc memory-leak.c -o memory-leak
# valgrind ./memory-leak
==1155== Memcheck, a memory error detector
==1155== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1155== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1155== Command: ./memory-leak
==1155==
==1155==
==1155== HEAP SUMMARY:
==1155==     in use at exit: 7 bytes in 1 blocks
==1155==   total heap usage: 1 allocs, 0 frees, 7 bytes allocated
==1155==
==1155== LEAK SUMMARY:
==1155==    definitely lost: 7 bytes in 1 blocks
==1155==    indirectly lost: 0 bytes in 0 blocks
==1155==      possibly lost: 0 bytes in 0 blocks
==1155==    still reachable: 0 bytes in 0 blocks
==1155==         suppressed: 0 bytes in 0 blocks
==1155== Rerun with --leak-check=full to see details of leaked memory
==1155==
==1155== For lists of detected and suppressed errors, rerun with: -s
==1155== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The memory leak is detected successfully. Now build the program with the “-fsanitize=address” option and run valgrind again:

# gcc -fsanitize=address memory-leak.c -o memory-leak
# valgrind ./memory-leak
==1193== Memcheck, a memory error detector
==1193== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1193== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1193== Command: ./memory-leak
==1193==
==1193==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
==1193==
==1193== HEAP SUMMARY:
==1193==     in use at exit: 0 bytes in 0 blocks
==1193==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==1193==
==1193== All heap blocks were freed -- no leaks are possible
==1193==
==1193== For lists of detected and suppressed errors, rerun with: -s
==1193== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

We can see that valgrind no longer works correctly: it reports no heap allocations at all and misses the leak.

Reference:
Sourceforge.

Building OmniOS requires some memory

I followed the Building OmniOS guide to build OmniOS, but ran into the following errors:

$ pfexec /opt/ooce/bin/omni build_world
......
./buildctl: fork: Not enough space
../lib/functions.sh: line 627: wait_for: No record of process 22984
../lib/functions.sh: fork: Not enough space
Waiting for illumos build...
......

The reason is that the virtual machine I used had only ~1.5G of memory, which is not enough. After increasing the memory to 4G, the build succeeded.

A core dump related to jemalloc

Recently I came across a core dump related to jemalloc:

#0  extent_arena_ind_get (extent=0x0) at include/jemalloc/internal/extent_inlines.h:40
#1  je_tcache_bin_flush_small (tsd=tsd@entry=0x7ffff7f717f8, tcache=tcache@entry=0x7ffff7f719e8,
    tbin=tbin@entry=0x7ffff7f71a70, binind=binind@entry=5, rem=<optimized out>) at src/tcache.c:159
#2  0x00007ffff356b97b in je_tcache_event_hard (tsd=tsd@entry=0x7ffff7f717f8,
    tcache=tcache@entry=0x7ffff7f719e8) at src/tcache.c:55
#3  0x00007ffff3512d49 in tcache_event (tcache=<optimized out>, tsd=<optimized out>)
    at include/jemalloc/internal/tcache_inlines.h:37
#4  tcache_dalloc_large (slow_path=<optimized out>, binind=<optimized out>, ptr=<optimized out>,
    tcache=<optimized out>, tsd=<optimized out>) at include/jemalloc/internal/tcache_inlines.h:212
#5  arena_dalloc_large (slow_path=<optimized out>, szind=<optimized out>, tcache=<optimized out>,
    ptr=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:276
#6  arena_dalloc (slow_path=<optimized out>, alloc_ctx=<optimized out>, tcache=<optimized out>,
    ptr=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:323
#7  idalloctm (slow_path=<optimized out>, is_internal=<optimized out>, alloc_ctx=<optimized out>,
    tcache=<optimized out>, ptr=<optimized out>, tsdn=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal_inlines_c.h:118
#8  ifree (slow_path=<optimized out>, tcache=<optimized out>, ptr=<optimized out>,
    tsd=<optimized out>) at src/jemalloc.c:2589
#9  je_free_default (ptr=0x7fff2ccf53c0) at src/jemalloc.c:2799

The Sanitizers helped me find the root cause: a classic “double-free” memory bug. One thing to note is that the Sanitizers and jemalloc can’t be used simultaneously, because both intercept the memory allocation/free functions. Consider the following code:

# cat memory-leak.c
#include <stdlib.h>
void *p;
int main() {
  p = malloc(7);
  p = 0; // The memory is leaked here.
  return 0;
}

Build with both Sanitizers and jemalloc:

# gcc -fsanitize=address -g memory-leak.c -L`jemalloc-config --libdir` -Wl,-rpath,`jemalloc-config --libdir` -ljemalloc `jemalloc-config --libs`
# ldd a.out
    linux-vdso.so.1 (0x00007ffdb85a6000)
    libasan.so.6 => /usr/lib/libasan.so.6 (0x00007ffb04605000)
    libjemalloc.so.2 => /usr/lib/libjemalloc.so.2 (0x00007ffb04362000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007ffb0421d000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007ffb03fb4000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007ffb03fae000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007ffb03f8d000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007ffb03dc5000)
    librt.so.1 => /usr/lib64/../lib64/librt.so.1 (0x00007ffb03dba000)
    libgcc_s.so.1 => /usr/lib64/../lib64/libgcc_s.so.1 (0x00007ffb03da0000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ffb04fdc000)

Use gdb to debug the program:

......
Breakpoint 2, 0x00007ffff76a90a4 in malloc () from /usr/lib/libasan.so.6
(gdb) bt
#0  0x00007ffff76a90a4 in malloc () from /usr/lib/libasan.so.6
#1  0x0000555555555183 in main () at memory-leak.c:4
......

The program uses the allocation functions from the Sanitizers instead of those from jemalloc.

An AES encryption/decryption program

I wrote a simple AES encryption/decryption program. I don’t recommend using it as is, but it shows some basic concepts:

(1) As described in my previous post, initialise EVP_CIPHER_CTX only once, which improves code efficiency:

......
    EVP_CIPHER_CTX *enc_ctx = EVP_CIPHER_CTX_new();

    if (EVP_EncryptInit_ex(enc_ctx, EVP_aes_128_ecb(), NULL, key, NULL) == 0) {
        goto END;
    }
......

(2) Because AES works on 128-bit blocks (the block size does not depend on the key length, which is also 128 bits here), the cipher text length must be a multiple of 16 bytes. The plain text length is 98 bytes: EVP_EncryptUpdate() will encrypt the first 96 bytes, and EVP_EncryptFinal_ex() will encrypt the remaining 2 bytes together with 14 bytes of padding. The total length of the encrypted text is 112 bytes.

......
        if (EVP_EncryptUpdate(enc_ctx, ct, &ct_len, pt, sizeof(pt)) == 0) {
            goto END;
        }

        if (EVP_EncryptFinal_ex(enc_ctx, ct + ct_len, &len) == 0) {
            goto END;
        }

        ct_len += len;
......

Correspondingly, EVP_DecryptUpdate() will decrypt the first 96 bytes, and EVP_DecryptFinal_ex() will decrypt the last block and return the trailing 2 bytes:

......
        if (EVP_DecryptUpdate(dec_ctx, decrypted, &decrypted_len, ct, ct_len) == 0) {
            goto END;
        }

        if (EVP_DecryptFinal_ex(dec_ctx, decrypted + decrypted_len, &len) == 0) {
            goto END;
        }

        decrypted_len += len;
......