Cacheline-Orientated programming

From the CPU’s perspective, the memory hierarchy consists of registers, L1 cache, L2 cache, L3 cache, main memory, and so on. The smallest unit a cache works with is one cacheline, which is 64 bytes on most systems:

    $ getconf LEVEL1_DCACHE_LINESIZE
    64
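The same value can also be queried from inside a program. The following is a minimal sketch, assuming a glibc-based Linux system where `sysconf` accepts `_SC_LEVEL1_DCACHE_LINESIZE` (a glibc extension, not part of POSIX). The helper name `cacheline_size` and the fallback constant are illustrative assumptions, not something from the original post.

```c
#include <assert.h>
#include <unistd.h>

/* Assumption: fall back to 64 bytes when the system reports nothing. */
#define FALLBACK_CACHELINE 64

static_assert((FALLBACK_CACHELINE & (FALLBACK_CACHELINE - 1)) == 0,
              "fallback must be a power of two");

/* Hypothetical helper: return the L1 data cacheline size in bytes. */
static long cacheline_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when the value is unknown. */
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return n;
#endif
    return FALLBACK_CACHELINE;
}
```

Calling `cacheline_size()` once at startup is usually enough. Note that struct layouts still need a compile-time constant, so a runtime query is mainly useful for sanity checks.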

To make your applications run efficiently, you need to take the cacheline into account. Take the notorious cacheline false sharing as an example:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[14];
    };
    .....

The size of struct Foo is 64 bytes (with 4-byte int), so it can fit in one cacheline. If CPU 0 writes Foo.a while CPU 1 accesses Foo.b at the same time, there will be “cacheline ping-ponging” between the CPUs, and performance will degrade drastically.
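To make the layout problem concrete, here is a small sketch that checks whether two members land in the same cacheline. It assumes a 64-byte line and a 4-byte `int`. The helper `same_cacheline` and the variant `struct FooSplit` are hypothetical names introduced only for illustration. `FooSplit` uses C11 `alignas` to push `a` and `b` onto separate lines, which is one common way to avoid false sharing.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64  /* assumed line size; confirm with getconf */

/* The struct from the text: a and b are adjacent. */
struct Foo {
    int a;
    int b;
    int c[14];
};

/* Hypothetical variant: a and b each start their own cacheline. */
struct FooSplit {
    alignas(CACHELINE) int a;
    alignas(CACHELINE) int b;
    int c[14];
};

static_assert(offsetof(struct Foo, b) < CACHELINE,
              "in Foo, b lies within the first 64 bytes next to a");
static_assert(offsetof(struct FooSplit, b) == CACHELINE,
              "in FooSplit, b starts the second cacheline");

/* Return 1 if both addresses fall within the same cacheline. */
static int same_cacheline(const void *p, const void *q)
{
    return (uintptr_t)p / CACHELINE == (uintptr_t)q / CACHELINE;
}
```

For a 64-byte-aligned `struct Foo`, `same_cacheline(&foo.a, &foo.b)` is true, so writes from two CPUs contend for that line. For `struct FooSplit` it is false, at the cost of the struct growing to 128 bytes.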

Another trick is to allocate memory aligned to the cacheline size. Take the struct Foo above as the example again. To guarantee that the whole struct Foo sits in one cacheline, posix_memalign can be used:

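    void *p = NULL;
    /* posix_memalign takes a void ** and returns 0 on success */
    int err = posix_memalign(&p, 64, sizeof(struct Foo));
    struct Foo *foo = (err == 0) ? p : NULL;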

The 64 is the alignment requirement.
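A slightly fuller sketch of the aligned allocation follows, with error handling and an alignment check. The wrapper name `foo_alloc_aligned` is an assumption made for this example. `posix_memalign` itself is the POSIX call: it returns 0 on success or an error number on failure, and the memory is released with `free`.

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign in <stdlib.h> */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHELINE 64  /* assumed line size; confirm with getconf */

struct Foo {
    int a;
    int b;
    int c[14];
};

static_assert(sizeof(struct Foo) == CACHELINE,
              "Foo is expected to fill exactly one cacheline");

/* Hypothetical wrapper: allocate one zeroed, cacheline-aligned struct Foo.
 * Returns NULL on failure; release the result with free(). */
static struct Foo *foo_alloc_aligned(void)
{
    void *p = NULL;

    /* Returns 0 on success, an error number otherwise (errno is not set). */
    if (posix_memalign(&p, CACHELINE, sizeof(struct Foo)) != 0)
        return NULL;

    memset(p, 0, sizeof(struct Foo));
    return p;
}
```

Because the returned address is a multiple of 64 and the struct is 64 bytes, the whole object sits in a single cacheline instead of straddling two.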

Last but not least, sometimes padding is needed. E.g.:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
        int padding[2];
    };
    ......
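    void *p = NULL;
    /* room for 10 elements, each starting on a 64-byte boundary */
    int err = posix_memalign(&p, 64, sizeof(struct Foo) * 10);
    struct Foo *foo = (err == 0) ? p : NULL;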

Or use the compiler’s aligned attribute:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
    } __attribute__((aligned(64)));
    ......

The original struct Foo is 56 bytes; after padding (or with the compiler’s aligned attribute) it becomes 64 bytes, so when the allocation is 64-byte aligned, each element occupies exactly one cacheline. Now we can allocate an array of struct Foo and let every CPU process one element of the array, and no “cacheline ping-ponging” will occur.
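The array case can be sketched as follows. This version uses C11 `alignas` on the first member as a standard counterpart to the `aligned` attribute; it assumes a 64-byte line and a 4-byte `int`. The helpers `foo_array_alloc` and `each_on_own_line` are hypothetical names for this example only.

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign in <stdlib.h> */
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 64  /* assumed line size; confirm with getconf */

/* Aligning the first member raises the struct's alignment to 64,
 * so the compiler pads sizeof(struct Foo) from 56 up to 64 bytes. */
struct Foo {
    alignas(CACHELINE) int a;
    int b;
    int c[12];
};

static_assert(sizeof(struct Foo) == CACHELINE,
              "each element must fill exactly one cacheline");

/* Hypothetical helper: allocate n cacheline-aligned elements.
 * Returns NULL on failure; release the result with free(). */
static struct Foo *foo_array_alloc(size_t n)
{
    void *p = NULL;

    if (posix_memalign(&p, CACHELINE, n * sizeof(struct Foo)) != 0)
        return NULL;
    return p;
}

/* Return 1 if every element starts on a cacheline boundary. */
static int each_on_own_line(const struct Foo *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if ((uintptr_t)&arr[i] % CACHELINE != 0)
            return 0;
    return 1;
}
```

With this layout, a worker that only touches `arr[i]` never writes to a line that another worker's element lives in, which is the property the padding is meant to provide.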

2 thoughts on “Cacheline-Orientated programming”

  1. Hello! From your article, it’s not clear why using posix_memalign on the original 64-byte struct Foo would solve the “cacheline ping-ponging”. Also, could you please explain why it’s necessary to pad the 56-byte struct to 64 bytes (as in the example) so that it can “be loaded in one cacheline”? What could happen if the 56-byte struct is not padded and aligned?

    1. > Hello! From your article, it’s not clear why using posix_memalign on the original 64-byte struct Foo would solve the “cacheline ping-ponging”.
      No, the posix_memalign example I mentioned does not solve “cacheline ping-ponging”; it is only there to illustrate the other trick, allocating memory aligned to the cacheline size.

      > Also, could you please explain why it’s necessary to pad the 56-byte struct to 64 bytes (as in the example) so that it can “be loaded in one cacheline”? What could happen if the 56-byte struct is not padded and aligned?
      If there is an array of struct Foo, every element will occupy exactly one cacheline, which should be more efficient than an unpadded element spanning two cachelines.
