From the CPU’s perspective, the memory hierarchy consists of registers, L1 cache, L2 cache, L3 cache, main memory, and so on. The smallest unit of cache is one cacheline, which is 64 bytes in most cases:
$ getconf LEVEL1_DCACHE_LINESIZE
64
To make your applications run efficiently, you need to take the cacheline into account. Take the notorious cacheline false sharing as an example:
......
struct Foo
{
        int a;
        int b;
        int c[14];
};
......
The size of struct Foo is 64 bytes, so it can be stored in one cacheline. If CPU 0 accesses Foo.a while CPU 1 accesses Foo.b at the same time, the cacheline will “ping-pong” between the two CPUs, and performance will degrade drastically.
The other trick is to allocate memory aligned to the cacheline size. Still using the above struct Foo as an example: to guarantee the whole struct Foo sits in one cacheline, posix_memalign can be used:
struct Foo *foo;
posix_memalign((void **)&foo, 64, sizeof(struct Foo));
The 64 is the alignment requirement.
Last but not least, sometimes padding is needed. E.g.:
......
struct Foo
{
        int a;
        int b;
        int c[12];
        int padding[2];
};
......
struct Foo *foo;
posix_memalign((void **)&foo, 64, sizeof(struct Foo) * 10);
Or use the compiler’s aligned attribute:
......
struct Foo
{
        int a;
        int b;
        int c[12];
} __attribute__((aligned(64)));
......
The original struct Foo‘s size is 56 bytes; after padding (or through the compiler’s aligned attribute), it becomes 64 bytes and can be loaded in one cacheline. Now we can allocate an array of struct Foo, and every CPU will process one element of the array, so no “cacheline ping-ponging” will occur.