Rewrite a Python program in C to boost performance

Recently I converted a Python program to C. The Python program takes about one hour to finish the task:

$ /usr/bin/time -v taskset -c 35 python_program ....
......
        User time (seconds): 3553.48
        System time (seconds): 97.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12048772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11434463
        Voluntary context switches: 58873
        Involuntary context switches: 21529
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4704
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

while the C program only needs about 5 minutes:

$ /usr/bin/time -v taskset -c 35 c_program ....
......
        User time (seconds): 282.45
        System time (seconds): 8.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16430216
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3962437
        Voluntary context switches: 14
        Involuntary context switches: 388
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4960
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

From /usr/bin/time’s output, we can see the Python program uses less memory than the C program, but suffers far more page faults and context switches.

“argument to variable-length array may be too large [-Wvla-larger-than=]” warning

I use gcc-9 from CentOS:

$ /opt/rh/devtoolset-9/root/usr/bin/gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

And I found that when compiling with the -O3 option, gcc reports the following warning for some variable-length arrays:

warning: argument to variable-length array may be too large [-Wvla-larger-than=]
  596 |  uint8_t header[header_size];
      |          ^~~~~~~~~~~~~~~~~~

Without the -O3 option, the warning is not generated.
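
For reference, here is a minimal sketch of the kind of code that triggers it (the parse_header function and its parameters are made up for illustration): a VLA whose length comes from a runtime value that gcc cannot bound:

#include <stdint.h>
#include <string.h>

/* Hypothetical example: header_size is only known at run time, so at -O3
 * gcc cannot prove an upper bound for the VLA and may emit the
 * -Wvla-larger-than= warning. */
void
parse_header(const uint8_t *buf, size_t header_size)
{
    uint8_t header[header_size];    /* variable-length array */

    memcpy(header, buf, header_size);
    /* ... process the header ... */
}

Checking the size against an explicit upper bound before declaring the VLA (or falling back to heap allocation for large sizes) usually gives gcc enough information to silence the warning.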

Exit main thread and keep other threads running

In C, returning from the main function terminates the whole process. To exit only the main thread and keep the other threads running, you can call thrd_exit in main instead. Check the following code:

#include <stdio.h>
#include <threads.h>
#include <unistd.h>

int
print_thread(void *s)
{
    /* Detach so the thread's resources are reclaimed automatically. */
    thrd_detach(thrd_current());
    for (size_t i = 0; i < 5; i++)
    {
        sleep(1);
        printf("i=%zu\n", i);
    }
    thrd_exit(0);
}

int
main(void)
{
    thrd_t tid;
    if (thrd_success != thrd_create(&tid, print_thread, NULL)) {
        fprintf(stderr, "Create thread error\n");
        return 1;
    }
    /* Exit only the main thread; the detached thread keeps running. */
    thrd_exit(0);
}

Run it:

$ ./main
i=0
i=1
i=2
i=3
i=4

You can see that even though the main thread exited, the other thread kept running.

P.S., the code can be downloaded here.

The c99 program on Void Linux

Today I found there is a c99 program in Void Linux:

$ c99
cc: fatal error: no input files
compilation terminated.
$ which c99
/usr/bin/c99

Check what it is:

$ file /usr/bin/c99
/usr/bin/c99: POSIX shell script, ASCII text executable
$ cat /usr/bin/c99
#!/bin/sh
exec /usr/bin/cc -std=c99 "$@"

Um, it is just a shell script which invokes /usr/bin/cc. So check the cc program:

$ ll /usr/bin/cc
lrwxrwxrwx 1 root root 3 Jun  9 05:32 /usr/bin/cc -> gcc

Oh, a symbolic link to gcc.

Cacheline-Orientated programming

From the CPU’s perspective, the memory hierarchy consists of registers, L1 cache, L2 cache, L3 cache, main memory, and so on. The smallest unit of cache is a cacheline, which is 64 bytes in most cases:

$ getconf LEVEL1_DCACHE_LINESIZE
64
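
The same value can also be queried from C; a minimal sketch, assuming glibc, which provides the _SC_LEVEL1_DCACHE_LINESIZE extension to sysconf:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Glibc extension; other C libraries may return -1 or 0 here. */
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 dcache line size: %ld\n", line_size);
    return 0;
}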

To make your applications run efficiently, you need to take the cacheline into account. Take the notorious cacheline false sharing as an example:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[14];
    };
    .....

The size of struct Foo is 64 bytes, so it fits in one cacheline. If CPU 0 writes Foo.a while CPU 1 writes Foo.b at the same time, there will be “cacheline ping-ponging” between the CPUs, and performance will degrade drastically.
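
To see the effect, here is a minimal sketch (the iteration count and worker functions are made up for illustration): two threads keep incrementing a and b of the same struct Foo, so the shared cacheline bounces between the two cores:

#include <stdio.h>
#include <threads.h>

struct Foo
{
    int a;
    int b;
    int c[14];
};

static struct Foo foo;    /* a and b live in the same cacheline */

enum { ITERATIONS = 100000000 };

/* Each worker only writes its own field, yet both fields share one
 * cacheline, so the line ping-pongs between the cores running them. */
static int
inc_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        foo.a++;
    return 0;
}

static int
inc_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        foo.b++;
    return 0;
}

int
main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, inc_a, NULL);
    thrd_create(&t2, inc_b, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("a=%d b=%d\n", foo.a, foo.b);
    return 0;
}

Placing a and b in different cachelines (via the padding or alignment tricks below) typically makes the same workload run noticeably faster.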

Another trick is to allocate cacheline-aligned memory. Still using the above struct Foo as the example, to guarantee the whole struct Foo lies in one cacheline, posix_memalign can be used:

    struct Foo *foo;
    /* posix_memalign expects a void **, hence the cast */
    posix_memalign((void **)&foo, 64, sizeof(struct Foo));

The 64 is the alignment requirement.

Last but not least, sometimes padding is needed. E.g.:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
        int padding[2];
    };
    ......
    struct Foo *foo;
    posix_memalign((void **)&foo, 64, sizeof(struct Foo) * 10);

Or use the compiler’s aligned attribute:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
    } __attribute__((aligned(64)));
    ......

The original struct Foo’s size is 56 bytes; after padding (or with the compiler’s aligned attribute), it becomes 64 bytes and can be loaded in one cacheline. Now we can allocate an array of struct Foo in which every CPU processes its own element, and no “cacheline ping-ponging” will occur, as in the sketch below.
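
A minimal sketch of that layout, assuming the padded 64-byte struct Foo above and a hypothetical worker function (error handling mostly omitted):

#include <stdio.h>
#include <stdlib.h>
#include <threads.h>

#define NUM_THREADS 4

struct Foo
{
    int a;
    int b;
    int c[12];
} __attribute__((aligned(64)));

/* Hypothetical worker: each thread touches only its own element, so no
 * two threads ever write to the same cacheline. */
static int
worker(void *arg)
{
    struct Foo *mine = arg;

    mine->a = 0;
    for (int i = 0; i < 1000000; i++)
        mine->a++;
    return 0;
}

int
main(void)
{
    struct Foo *foos;
    thrd_t tids[NUM_THREADS];

    /* One 64-byte aligned element per thread. */
    if (posix_memalign((void **)&foos, 64, sizeof(struct Foo) * NUM_THREADS) != 0)
        return 1;

    for (int i = 0; i < NUM_THREADS; i++)
        thrd_create(&tids[i], worker, &foos[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        thrd_join(tids[i], NULL);

    printf("foos[0].a=%d\n", foos[0].a);
    free(foos);
    return 0;
}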