Fix performance issue related to hash table

Yesterday I ran into a performance issue: the consumer threads' CPU utilisation was nearly 100%. From the perf report, these consumer threads spent a remarkable amount of time searching the hash tables. Since every consumer thread has its own hash table, there is no lock-contention issue to worry about, but the number of buckets in the hash table was not large enough, so every bucket held too many elements, and searching through them cost a lot of time. The fix is straightforward: increase the number of buckets.
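
For illustration only (the project uses its own hash table implementation, and the names below are hypothetical), the same idea expressed with std::unordered_map is to give each per-thread table enough buckets up front so that the average chain scanned on a lookup stays short:

// Minimal sketch, assuming a per-thread table keyed by a flow id.
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Session {};  // hypothetical per-flow state kept by each consumer thread

// One table per consumer thread, so no locking is needed; the slowdown came
// purely from too few buckets, i.e. long chains scanned on every lookup.
thread_local std::unordered_map<uint64_t, Session> sessions;

void init_table(std::size_t expected_elements) {
    // Keep the load factor around 1 so a lookup walks a chain of roughly
    // one element on average instead of dozens.
    sessions.max_load_factor(1.0f);
    sessions.reserve(expected_elements);  // pre-allocates enough buckets
}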

Reading code is still the most effective method to debug multi-thread bugs

In the past month, I fixed two multi-thread bugs, and their symptoms were:

a) The first bug: some threads were dead-locked. It only occurred on a few production machines, not very frequently, and it never happened in the testbed.

b) The second bug: the program crashed after running for 3 ~ 5 hours because it entered a should-never-enter code path which triggered an assert. Though there was a core dump file, I couldn't find any clues at the crime scene.

The straightforward way to debug the first bug was to check that all lock and unlock operations are paired on every path. Unfortunately, that was not the root cause, so I began to check all the code related to the lock. After two days, I finally found a copy-paste error which could open a can of worms.
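
To give a flavour of what such a copy-paste error can look like (a hypothetical reconstruction, not the actual project code): an unlock copied from another function targets the wrong mutex, so the first mutex is never released and every other thread that tries to take it dead-locks:

#include <pthread.h>

pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

void update_stats(void) {
    pthread_mutex_lock(&stats_lock);
    /* ... update counters guarded by stats_lock ... */
    pthread_mutex_unlock(&queue_lock);  /* copy-pasted from the queue code:
                                           should be &stats_lock */
}

Every lock still has a matching unlock somewhere on the path, which is why a simple pairing check does not catch this kind of mistake.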

For the second bug, I went through all the code where multiple threads access the problematic variable, line by line, to see whether there was a corner case in which the accesses could race. Thank god, when I took a rest at noon, I finally had the idea!

You can see, during the debugging of these two bugs, I couldn't find any better method than reading the code again and again (I did try to add more traces, but it didn't help). BTW, the thing these two bugs have in common is that the fix was simple: just modify one line of code.

Reorder packets in chronological order for pcap file

Today I debugged a tricky issue, and the root cause was that the pcap file under test had some packets which were not in chronological order, e.g., packet 101 should come before packet 100. The solution is to use reordercap to process the pcap file first:

$ reordercap abnormal.pcap normal.pcap
155748 frames, 86 out of order

Then the test is OK.
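
For completeness, out-of-order timestamps can also be confirmed programmatically. Below is a minimal sketch using libpcap (an assumption on my part; reordercap's own summary already reports the count) that walks a capture file and counts frames whose timestamp goes backwards:

#include <pcap/pcap.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <file.pcap>\n", argv[0]);
        return 1;
    }
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *handle = pcap_open_offline(argv[1], errbuf);
    if (handle == nullptr) {
        std::fprintf(stderr, "pcap_open_offline: %s\n", errbuf);
        return 1;
    }
    struct pcap_pkthdr *hdr = nullptr;
    const u_char *data = nullptr;
    long frames = 0, out_of_order = 0;
    struct timeval prev = {0, 0};
    while (pcap_next_ex(handle, &hdr, &data) == 1) {
        ++frames;
        /* A frame is "out of order" if its timestamp is earlier than the
           previous frame's timestamp. */
        if (frames > 1 && (hdr->ts.tv_sec < prev.tv_sec ||
                           (hdr->ts.tv_sec == prev.tv_sec &&
                            hdr->ts.tv_usec < prev.tv_usec)))
            ++out_of_order;
        prev = hdr->ts;
    }
    pcap_close(handle);
    std::printf("%ld frames, %ld out of order\n", frames, out_of_order);
    return 0;
}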

A simple guide to using LinuxKI

I think LinuxKI is an underrated Linux performance tuning tool. When I worked at HPE, one of my colleagues relied heavily on LinuxKI for performance tuning. I still remember one of his workloads: 8 Oracle databases running in 8 docker containers simultaneously, and he did the following things every day:
(1) Execute runki to collect data;
(2) Use kiall to analyse data, then tune some parameters;
(3) Go back to step (1).

Below is a simple guide to using LinuxKI, assuming LinuxKI is already installed:
(1) Collect data in the /dev/shm directory to reduce the risk of missing LinuxKI events during tracing and to avoid adding to the disk workload, but be sure /dev/shm has enough memory:

$ cd /dev/shm

(2) Run the runki command (the -R option means capturing advanced CPU stats):

$ sudo /opt/linuxki/runki -R

After it finishes, there is a compressed *.tgz file:

$ ll -h
total 359M
-rw-r--r--. 1 root root 359M Apr 29 13:39 ki_all.pocket-p2.0429_1337.tgz

(3) Copy the *.tgz file into the home directory:

$ cp ki_all.pocket-p2.0429_1337.tgz ~/

and now the original file can be safely removed.

(4) The final step is generating the reports:

$ cd ~
$ /opt/linuxki/kiall -r
Processing files in: /home/nanxiao/pocket-p2/0429_1337
Merging KI binary files.  Please wait...
ki.bin files merged by kiinfo -likimerge
/opt/linuxki/kiinfo -kitrace -ts 0429_1337
/opt/linuxki/kiinfo -kiall csv -html -ts 0429_1337
kiall complete

Now we can check and analyse the reports.