A performance engineering team’s story

This is a story about a performance engineering team: most of it is genuine, and some of it is made up.

Server performance work requires broad knowledge and experience: software, hardware, the operating system kernel, and so on. The team members' backgrounds are correspondingly varied: embedded systems developer, PCB designer, QA, DevOps, compiler researcher, etc.

The team's daily work looks like this:

(1) Since the company's main products are servers, running SPEC benchmark programs and cooperating with other teams to get the best results is the most important task. Only when your servers beat the competitors' will customers be willing to pay for them.

(2) It is not controversial to say that Linux dominates the server market, so being proficient in Linux profiling and tracing tools (from perf and ftrace to BPF) is a compulsory class. Meanwhile, the company also has its own proprietary Unix which still serves some critical businesses: banking, telecommunications, etc. The members also need some esoteric tools to help diagnose issues on this proprietary Unix. At the same time, coding is another important task: developing the company's own profiling tools and contributing to FOSS projects.

(3) Support customers and other teams. E.g., one customer wants to know Oracle's performance on some server models; another finds that Docker doesn't run as expected. A colleague comes to your desk: "When you are free, could you help check how to optimize this program? The boss isn't satisfied with it."

(4) The team encourages sharing. During every weekly meeting, you can introduce a Unix command trick, a debugging technique, and the like. Members can spend 5% ~ 10% of their time on hobby projects, but the prerequisite is that regular work comes first. This benefit is not a free lunch: you should report on your project in the meeting too.

So what is the point of this article? Nothing, just telling a story, that’s it.

A performance issue about copy constructor

Over the past two days, I debugged a performance issue related to a copy constructor: class A has a member b of type NTL::ZZX:

class A
{
    enum class type {zzx_t, ...} t;
    NTL::ZZX b;
    ......
};

When member t's value is zzx_t, b is valid; otherwise b's content is stale and should not be used.

There are two methods of implementing A's copy constructor:
(1)

A(const A& other) : t(other.t), b(other.b)
{
    ......
}

In this method, NTL::ZZX's copy constructor is called unconditionally.

(2)

A(const A& other) : t(other.t)
{
    ......
    if (t == type::zzx_t)
    {
        b = other.b;
    }
    ......
}

In this case, NTL::ZZX's default constructor is called first; its copy assignment operator is invoked only when the "t == type::zzx_t" condition is met.

NTL::ZZX's default constructor does almost nothing, and its copy constructor does approximately the same work as its copy assignment operator. But in our scenario, t's value is not zzx_t about 80 percent of the time, so the second implementation of the copy constructor gives a big performance boost compared to the first one.
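
To make the pattern concrete, here is a minimal, self-contained sketch of the same idea (my own illustration, not the original code: Expensive is a hypothetical stand-in for NTL::ZZX, and A2 mirrors A):

#include <vector>

// Hypothetical stand-in for NTL::ZZX: cheap to default-construct,
// expensive to copy.
struct Expensive {
    std::vector<long> coeffs;
};

class A2 {
public:
    enum class type { zzx_t, other_t };

    A2() = default;

    // Conditional copy: skip the expensive copy when b is not valid.
    A2(const A2& other) : t(other.t) {
        if (t == type::zzx_t) {
            b = other.b;   // copy assignment, only when actually needed
        }
        // otherwise b stays default-constructed, which is nearly free
    }

private:
    type t = type::other_t;
    Expensive b;
};

Since the default-constructed member is nearly free, skipping the copy of other.b in the common case avoids the expensive deep copy entirely.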

Performance comparison between string::at and string::operator[] in C++

Check the following testPalindrome_index.cpp program, which uses string::operator[]:

#include <string>

bool isPalindrome(
    std::string& s,
    std::string::size_type start,
    std::string::size_type end) {

        auto count = (end - start + 1) / 2;
        for (std::string::size_type i = 0; i < count; i++) {
            if (s[start] != s[end]) {
                return false;
            }
            start++;
            end--;
        }

        return true;
}

int main() {
        std::string s(1'000'000'000, 'a');

        isPalindrome(s, 0, s.size() - 1);
        return 0;
}

My compiler is clang++. Measure the execution time without and with optimization:

# c++ -std=c++14 testPalindrome_index.cpp -o index
# time ./index
    0m13.84s real     0m12.77s user     0m01.06s system
# c++ -std=c++14 -O2 testPalindrome_index.cpp -o index
# time ./index
    0m01.44s real     0m00.42s user     0m01.01s system

We can see the time difference is very large (13.84s vs 1.44s)!

Then change the code to use string::at:

#include <string>

bool isPalindrome(
    std::string& s,
    std::string::size_type start,
    std::string::size_type end) {

        auto count = (end - start + 1) / 2;
        for (std::string::size_type i = 0; i < count; i++) {
            if (s.at(start) != s.at(end)) {
                return false;
            }
            start++;
            end--;
        }

        return true;
}

int main() {
        std::string s(1'000'000'000, 'a');

        isPalindrome(s, 0, s.size() - 1);
        return 0;
}

Compile and test again:

# c++ -std=c++14 testPalindrome_at.cpp -o at
# time ./at
    0m07.31s real     0m06.36s user     0m00.96s system
# c++ -std=c++14 -O2 testPalindrome_at.cpp -o at
# time ./at
    0m06.42s real     0m05.45s user     0m00.97s system

We can see the gap here is only about 1 second, not as striking as in the first case. But the time with "-O2" optimization is 6.42s, far larger than the 1.44s of the string::operator[] version.
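
The gap comes from bounds checking: string::at verifies the index on every call and throws std::out_of_range on failure, a branch the optimizer cannot always eliminate, whereas string::operator[] performs no such check. Conceptually, at behaves like the following sketch (an illustration, not the actual library source):

#include <stdexcept>
#include <string>

// Illustrative only -- not the actual library implementation.
char& at_sketch(std::string& s, std::string::size_type pos) {
    if (pos >= s.size())        // bounds check on every call
        throw std::out_of_range("basic_string::at");
    return s[pos];              // operator[] skips this check
}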

The conclusion: if the string is long enough, the performance difference between string::operator[] and string::at is significant, so this factor should be considered when deciding which function to use.

P.S., the full code is here.

Some tips on using "pool" in programming

In the past weeks, I have been diving into the CUDA APIs to operate on multiple GPUs. One part of my work is a "memory pool" module which manages allocating/freeing memory from different devices. In this post, I want to share some thoughts about the "pool", an interesting technique in programming.

The "pool" works like a "cache": it allocates some resources beforehand, and when the application wants one, the "pool" picks an available one for it. One classical example is the "database connection pool": the "pool" preallocates some TCP connections and keeps them alive, which saves the client from handshaking with the server every time. Another instance is the "memory pool", which I implemented recently: the "pool" keeps freed memory instead of really releasing it to the device. Based on my benchmark tests, the application's performance can improve by 100% in extreme cases. (Caveat: for a multithreaded application, if the locking mechanism of the "pool" becomes the performance bottleneck, try giving every thread its own private "pool".)
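
To illustrate the core idea, here is a minimal sketch of such a pool (a simplified illustration, not my actual module; plain malloc/free stand in for the per-device CUDA allocation calls):

#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <queue>

// Minimal caching allocator: freed blocks are kept in per-size free
// lists and handed out again instead of being released to the system.
class MemoryPool {
public:
    void* allocate(std::size_t size) {
        std::lock_guard<std::mutex> guard(mtx_);
        auto& free_list = free_blocks_[size];
        if (!free_list.empty()) {            // reuse a cached block
            void* p = free_list.front();
            free_list.pop();
            return p;
        }
        return std::malloc(size);            // fall back to a real allocation
    }

    void deallocate(void* p, std::size_t size) {
        std::lock_guard<std::mutex> guard(mtx_);
        free_blocks_[size].push(p);          // cache the block for reuse
    }

private:
    std::mutex mtx_;                                        // guards free_blocks_
    std::map<std::size_t, std::queue<void*>> free_blocks_;  // size -> cached blocks
};

Note how every operation takes the mutex: this is exactly the lock that can become the bottleneck mentioned above, and a per-thread pool removes that contention.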

The other benefit of a "pool" is for debugging. Again, take my "memory pool" as a demonstration: every memory pointer is accompanied by a device ID (which GPU the memory was allocated from) and the memory size. So you can reconstruct the whole life of a block of memory by analyzing the trace log. This has saved me from the notorious "an illegal memory access was encountered" error many times in multi-GPU programming.
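
In the sketch above, this bookkeeping could look like a side map from each live pointer to its (device ID, size) pair, with a log line on every transition (again an illustration, not the original code):

#include <cstdio>
#include <cstddef>
#include <map>
#include <utility>

// Track (device ID, size) for every live pointer so each allocation
// and free shows up in the trace log.
std::map<void*, std::pair<int, std::size_t>> live_blocks;

void on_allocate(void* p, int device, std::size_t size) {
    live_blocks[p] = {device, size};
    std::printf("alloc ptr=%p device=%d size=%zu\n", p, device, size);
}

void on_free(void* p) {
    auto it = live_blocks.find(p);
    if (it == live_blocks.end()) {
        std::printf("free of unknown ptr=%p\n", p);  // likely a bug
        return;
    }
    std::printf("free  ptr=%p device=%d size=%zu\n",
                p, it->second.first, it->second.second);
    live_blocks.erase(it);
}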

Last but not least, although this "memory pool" is merely ~100 lines of code, it already uses the following data structures: queue, vector, pair, and map, not to mention the mutex and lock, which are essential to avoid nasty data-race issues. So writing this module was good practice for honing my programming craft.

Enjoy coding!

Use the "-g -O2" options when using gcc to compile your project

Honestly, I seldom used the "-g -O2" options before when compiling projects with gcc, since optimization makes the source code and the generated instructions inconsistent, which is annoying during debugging. But in the past half year, my work has involved a lot of encryption & decryption jobs, all compute-intensive tasks, and I have found that the -O2 option can give a really big performance improvement.

For example, when the project is compiled with only the -g option, the whole computation flow lasts more than 30 minutes, but with "-g -O2" the time is reduced to less than 3 minutes, roughly a 10x improvement.

So when you care about your program's performance, you should try the "-g -O2" options: the -g option enlarges your executable file but won't make it run slower, and once the program crashes it provides enough debug information; -O2 is the "best safe optimization" level. One caveat: you may run into some tricky bugs which occur only in optimized code.
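
A typical invocation looks like this (app.cpp is just a placeholder name):

# g++ -g -O2 app.cpp -o app
# ls -l app    # larger than a build without -g, but just as fast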

Hope you can try and enjoy it!

References:
Using -g and -O2 options in gcc;
Is a program compiled with -g gcc flag slower than the same program compiled without -g?