__COUNTER__ macro in gcc/clang

Both gcc and clang define the __COUNTER__ macro:

Defined to an integer value that starts at zero and is incremented each time the __COUNTER__ macro is expanded.

Check the following code:

# cat foo.c
#include <stdio.h>

void foo1(void)
{
    printf("%s:%d\n", __func__, __COUNTER__);
}

void foo2(void)
{
    printf("%s:%d\n", __func__, __COUNTER__);
}
# cat bar.c
#include <stdio.h>

void bar1(void)
{
    printf("%s:%d\n", __func__, __COUNTER__);
}

void bar2(void)
{
    printf("%s:%d\n", __func__, __COUNTER__);
}
# cat main.c
#include "foo.h"
#include "bar.h"

int main(void)
{
    foo1();
    foo2();
    bar1();
    bar2();
    return 0;
}

Run the program:

# ./main
foo1:0
foo2:1
bar1:0
bar2:1

You can see that in every translation unit (.c file), __COUNTER__ starts at 0.

P.S., the code can be referenced here.
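A common use of __COUNTER__ is generating identifiers that are unique within one translation unit. The following is a minimal sketch of that idea; CONCAT, UNIQUE_NAME, the enumerators and print_counters are names made up for illustration and are not part of the code above. The enumerators also show that successive expansions in one file differ by exactly one.

```c
#include <stdio.h>

/* Two-level paste so that __COUNTER__ is expanded before ## is applied. */
#define CONCAT_INNER(a, b) a##b
#define CONCAT(a, b) CONCAT_INNER(a, b)

/* Hypothetical helper: appends the current counter value to a prefix. */
#define UNIQUE_NAME(prefix) CONCAT(prefix, __COUNTER__)

/* Each enumerator captures the counter value at its point of expansion. */
enum {
    counter_first  = __COUNTER__,
    counter_second = __COUNTER__,
    counter_third  = __COUNTER__
};

/* Same macro, two different variable names, so no redefinition error. */
int UNIQUE_NAME(slot_) = 10;
int UNIQUE_NAME(slot_) = 20;

/* Prints the three captured values; they are consecutive integers. */
void print_counters(void)
{
    printf("%d %d %d\n", counter_first, counter_second, counter_third);
}
```

Because the counter is per translation unit, names generated this way are only guaranteed to be unique inside one .c file; the same name can appear again in another file.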

Compile code using x86 intrinsics

Check the following simple program:

# cat foo.c
#include <inttypes.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    printf("%" PRIu32 "\n", _mm_crc32_u32(42, 2534474250));
    return 0;
}

Use gcc to compile with no options:

# gcc foo.c
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/immintrin.h:37,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/x86intrin.h:32,
                 from foo.c:3:
foo.c: In function ‘main’:
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/smmintrin.h:839:1: error: inlining failed in call to ‘always_inline’ ‘_mm_crc32_u32’: target specific option mismatch
  839 | _mm_crc32_u32 (unsigned int __C, unsigned int __V)
      | ^~~~~~~~~~~~~
foo.c:6:2: note: called from here
    6 |  printf("%" PRIu32 "\n", _mm_crc32_u32(42, 2534474250));
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

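_mm_crc32_u32 is an SSE4.2 intrinsic, and gcc does not enable SSE4.2 by default, so the always_inline function cannot be inlined into main. Passing -march=native (or, more narrowly, -msse4.2) fixes the build, provided the host CPU supports SSE4.2: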

# gcc -march=native foo.c
#
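If only one function needs the intrinsic, an alternative to a global flag is to enable SSE4.2 for just that function with the gcc/clang target attribute, and to check at run time that the CPU supports it. The following is a sketch under that assumption; crc32c_hw, crc32c_sw and crc32c_u32 are hypothetical names, and the bitwise fallback is only an illustration of the CRC32C (Castagnoli) step that the instruction performs.

```c
#include <stdint.h>
#include <x86intrin.h>

/* SSE4.2 is enabled for this function only, so no -msse4.2 is needed. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, uint32_t value)
{
    return _mm_crc32_u32(crc, value);
}

/* Portable fallback: bit-by-bit CRC32C update over one 32-bit word,
 * using the reflected Castagnoli polynomial 0x82F63B78. */
static uint32_t crc32c_sw(uint32_t crc, uint32_t value)
{
    crc ^= value;
    for (int i = 0; i < 32; i++)
        crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    return crc;
}

/* Picks the hardware path only when the running CPU reports SSE4.2. */
uint32_t crc32c_u32(uint32_t crc, uint32_t value)
{
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, value);
    return crc32c_sw(crc, value);
}
```

With this layout the file should compile with a plain "gcc foo.c", and the binary still runs on x86 CPUs without SSE4.2, at the cost of a run-time check on each call.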

Reference:
stackoverflow.

Gcc’s “-fstack-protector-strong” option

Gcc’s “-fstack-protector-strong” option helped me catch an array overflow bug recently. The option places a “canary” (guard value) in the stack frame of functions that gcc considers vulnerable, such as those with local arrays. When such a function returns, it checks whether the canary has been corrupted. If it has, __stack_chk_fail() is invoked:

    0x00007ffff5138674 <+52>:   mov    -0x38(%rbp),%rax
    0x00007ffff5138678 <+56>:   xor    %fs:0x28,%rax
    0x00007ffff5138681 <+65>:   jne    0x7ffff5138ff3 <function+2483>
    ......
    0x00007ffff5138ff3 <+2483>: callq  0x7ffff50c2100 <__stack_chk_fail@plt>

And the program will crash:

*** stack smashing detected ***: program terminated
Segmentation fault

Use gdb to check:

(gdb) bt
#0  0x00007fffde26e0b8 in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007fffde26efb9 in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#2  0x00007fffde890aa6 in backtrace () from /usr/lib64/libc.so.6
#3  0x00007fffde7f4ef4 in __libc_message () from /usr/lib64/libc.so.6
#4  0x00007fffde894577 in __fortify_fail () from /usr/lib64/libc.so.6
#5  0x00007fffde894532 in __stack_chk_fail () from /usr/lib64/libc.so.6
#6  0x00007ffff5138ff8 in function () at src.c:685
#7  0x045b9fd4c77e2ff3 in ?? ()
#8  0x9a8ad8e7e2eb8ca8 in ?? ()
#9  0x0fa0e627193655f1 in ?? ()
#10 0xfc295178098bb96f in ?? ()
#11 0xa09a574a7780cd13 in ?? ()
......

The stack frames and return addresses have been overwritten, so the call stack can’t be recovered. Be aware that the line gdb prints:

#6  0x00007ffff5138ff8 in function () at src.c:685

may not be related to the culprit!
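To see the canary check in action, a small reproducer can help. The sketch below is not the code from the bug described above; fill_buffer is a hypothetical, deliberately buggy function that writes past a stack array when len is larger than the buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Deliberately buggy: writes len bytes into an 8-byte stack buffer
 * without a bounds check. Any len > sizeof(buf) overflows the buffer,
 * which is undefined behaviour and may clobber the canary.
 */
int fill_buffer(size_t len)
{
    char buf[8];

    memset(buf, 'A', len);
    return len > 0 ? buf[0] : 0;
}
```

Calls with len up to 8 are fine. Built with "gcc -O0 -fstack-protector-strong", a call such as fill_buffer(64) is expected to abort with the “stack smashing detected” message when the function returns. The exact behaviour depends on the compiler, optimization level and libc; for example, _FORTIFY_SOURCE may catch the overflow earlier, inside memset.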

“argument to variable-length array may be too large [-Wvla-larger-than=]” warning

I use gcc-9 from CentOS:

$ /opt/rh/devtoolset-9/root/usr/bin/gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I found that with the -O3 compile option, gcc reports the following warning for some variable-length arrays (VLAs) in C:

warning: argument to variable-length array may be too large [-Wvla-larger-than=]
  596 |  uint8_t header[header_size];
      |          ^~~~~~~~~~~~~~~~~~

Without the -O3 option, the warning is not generated.
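The warning text suggests that gcc cannot prove an upper bound for header_size. One way to address that, sketched below, is to validate the size against an explicit limit before declaring the array. MAX_HEADER_SIZE and build_header are hypothetical names, and whether the check silences the warning may depend on the gcc version and on the -Wvla-larger-than= limit in effect.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical upper bound; pick one that is safe for your stack size. */
enum { MAX_HEADER_SIZE = 4096 };

/* Returns -1 for an out-of-range size, otherwise the sum of the bytes
 * written (each byte is set to 1, so the sum equals header_size). */
int build_header(size_t header_size)
{
    /* Reject sizes the VLA must never see, including zero. */
    if (header_size == 0 || header_size > MAX_HEADER_SIZE)
        return -1;

    uint8_t header[header_size];
    memset(header, 1, header_size);

    int sum = 0;
    for (size_t i = 0; i < header_size; i++)
        sum += header[i];
    return sum;
}
```

If the limit is small and known at compile time, a fixed-size array of MAX_HEADER_SIZE bytes (or a heap allocation for large sizes) avoids the VLA altogether.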

Cacheline-Orientated programming

From CPU’s perspective, the memory hierarchy is registers, L1 cache, L2 cache, L3 cache, main memory, among others. The smallest unit of cache is one cacheline, and it is 64 bytes in most cases:

$ getconf LEVEL1_DCACHE_LINESIZE
64

To make your applications run efficiently, you need to take cachelines into account. Take the notorious cacheline false sharing as an example:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[14];
    };
    .....

The size of struct Foo is 64 bytes (with a 4-byte int), so it fits in one cacheline. If CPU 0 keeps writing Foo.a while CPU 1 accesses Foo.b at the same time, the two fields are logically independent but share a cacheline, so there will be “cacheline ping-ponging” between the CPUs, and performance will degrade drastically.
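When two fields really are updated by different CPUs, a common remedy is to give each field its own cacheline. The following is a minimal sketch assuming a 64-byte cacheline and the gcc/clang aligned attribute; struct Counters, CACHELINE_SIZE and counters_gap are illustrative names.

```c
#include <stddef.h>

/* Assumed cacheline size; check it with getconf on the target machine. */
#define CACHELINE_SIZE 64

/* Each member starts on its own cacheline, so a write to one member
 * does not invalidate the line that holds the other. */
struct Counters {
    long a __attribute__((aligned(CACHELINE_SIZE)));
    long b __attribute__((aligned(CACHELINE_SIZE)));
};

/* Distance in bytes between the two members. */
size_t counters_gap(void)
{
    return offsetof(struct Counters, b) - offsetof(struct Counters, a);
}
```

The trade-off is memory: the struct grows from two longs to two full cachelines.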

Another trick is to allocate memory aligned to the cacheline size. Take the struct Foo above as the example again: to guarantee that the whole struct Foo sits in one cacheline, posix_memalign can be used:

    struct Foo *foo;
    /* posix_memalign() takes a void ** and returns 0 on success */
    int ret = posix_memalign((void **)&foo, 64, sizeof(struct Foo));

Here 64 is the required alignment in bytes.

Last but not least, sometimes padding is needed. E.g.:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
        int padding[2];
    };
    ......
    struct Foo *foo;
    posix_memalign((void **)&foo, 64, sizeof(struct Foo) * 10);

Or use the compiler’s aligned attribute:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
    } __attribute__((aligned(64)));
    ......

The original struct Foo’s size is 56 bytes; after padding (or with the compiler’s aligned attribute) it becomes 64 bytes and fits exactly in one cacheline. Now we can allocate an array of struct Foo in which every CPU processes its own element, and no “cacheline ping-ponging” will occur.
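The pieces above can be combined and checked in code. This is a sketch assuming a 64-byte cacheline and a 4-byte int; alloc_foo_array and is_cacheline_aligned are hypothetical helper names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct Foo
{
    int a;
    int b;
    int c[12];
} __attribute__((aligned(64)));

/* The aligned attribute rounds the size up from 56 to 64 bytes. */
_Static_assert(sizeof(struct Foo) == 64, "struct Foo must fill one cacheline");

/* Allocates count elements starting on a cacheline boundary.
 * Returns NULL on failure; release the memory with free(). */
struct Foo *alloc_foo_array(size_t count)
{
    void *p = NULL;

    if (posix_memalign(&p, 64, sizeof(struct Foo) * count) != 0)
        return NULL;
    return p;
}

/* Returns 1 if the address is a multiple of 64, 0 otherwise. */
int is_cacheline_aligned(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

Since the array starts on a cacheline boundary and each element is exactly 64 bytes, every element occupies its own cacheline.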