A simple guide to using LinuxKI

I think LinuxKI is an underrated Linux performance tuning tool. When I worked at HPE, one of my colleagues relied heavily on LinuxKI for performance tuning. I still remember one of his workloads: 8 Oracle databases running in 8 Docker containers simultaneously, and he did the following things every day:
(1) Execute runki to collect data;
(2) Use kiall to analyse data, then tune some parameters;
(3) Go back to step (1).

Below is a simple guide to using LinuxKI, assuming LinuxKI is already installed:
(1) Collect the data in the /dev/shm directory. This reduces the risk of missing LinuxKI events during tracing and does not add to the disk workload, but be sure /dev/shm has enough free memory:

$ cd /dev/shm

(2) Run the runki command (the -R option captures advanced CPU stats):

$ sudo /opt/linuxki/runki -R

After it finishes, there is a compressed *.tgz file:

$ ll -h
total 359M
-rw-r--r--. 1 root root 359M Apr 29 13:39 ki_all.pocket-p2.0429_1337.tgz

(3) Copy the *.tgz file into the home directory:

$ cp ki_all.pocket-p2.0429_1337.tgz ~/

and now the original file can be safely removed.

(4) The final step is to generate the reports:

$ cd ~
$ /opt/linuxki/kiall -r
Processing files in: /home/nanxiao/pocket-p2/0429_1337
Merging KI binary files.  Please wait...
ki.bin files merged by kiinfo -likimerge
/opt/linuxki/kiinfo -kitrace -ts 0429_1337
/opt/linuxki/kiinfo -kiall csv -html -ts 0429_1337
kiall complete

Now we can check and analyse the reports.
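
Step (1) asks for enough free memory in /dev/shm. The usual quick check is df -h /dev/shm; the same number can also be read from a program. Below is a minimal sketch using the POSIX statvfs() call. The helper name available_bytes is made up for this illustration and has nothing to do with LinuxKI itself:

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper (not part of LinuxKI): store in *out the number of
 * bytes available to unprivileged users on the filesystem containing path.
 * Returns 0 on success, -1 on failure (statvfs() sets errno).
 */
int
available_bytes(const char *path, uint64_t *out)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        return -1;
    }

    /* f_bavail is counted in units of f_frsize bytes. */
    *out = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
    return 0;
}
```

Calling available_bytes("/dev/shm", &n) before runki and comparing n with the size of an earlier *.tgz file (359M in the run above) gives a rough idea of whether there is enough room.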

Test the behaviour of locking a spinlock twice

I was curious whether pthread_spin_lock() can really detect a deadlock scenario, so I wrote a simple program to test it:

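#include <stdio.h>
#include <string.h>
#include <pthread.h>

int
main(void)
{
    pthread_spinlock_t lock;
    int ret;

    /* pthread functions return the error number instead of setting errno,
       so report it with strerror() rather than perror(). */
    ret = pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_init error: %s\n", strerror(ret));
        return 1;
    }

    ret = pthread_spin_lock(&lock);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_lock 1 error: %s\n", strerror(ret));
        return 1;
    }

    /* Second lock from the same thread: this is the call under test. */
    ret = pthread_spin_lock(&lock);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_lock 2 error: %s\n", strerror(ret));
        return 1;
    }

    return 0;
}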

I tested it on both Linux and FreeBSD: the program blocked on the second pthread_spin_lock() and never returned:

$ ./double_lock

P.S., the code can be found here.

Beware of out-of-bounds array access

Today my colleague fixed a bug related to out-of-bounds array access: a hash function returns the index of the selected array element, but its return type is int, so in a corner case where the hash computation overflows, the value can become negative, and the code then accesses an invalid element of the array. The lessons I learnt from this bug:
(1) Review the return type and value range of the hash function;
(2) Pay attention to the index whenever accessing an array: can it go out of bounds?
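
To make the failure mode concrete, here is a minimal sketch. It is not the code from the actual bug; the function names bucket_naive and bucket_safe are made up for this illustration. In C the result of % takes the sign of the dividend, so a negative hash gives a negative index, while converting the hash to an unsigned type first keeps the index in range:

```c
#include <stddef.h>

/*
 * Hypothetical buggy mapping: when hash is negative, hash % size is
 * negative (or zero), so the result can point before the array.
 */
int
bucket_naive(int hash, int size)
{
    return hash % size;
}

/*
 * Hypothetical safer mapping: the int-to-unsigned conversion is well
 * defined, and the remainder of an unsigned value is always in [0, size).
 * size must be greater than 0.
 */
size_t
bucket_safe(int hash, size_t size)
{
    return (size_t)(unsigned int)hash % size;
}
```

For example, bucket_naive(-7, 16) yields a negative index, while bucket_safe(-7, 16) stays below 16. Declaring the hash function itself to return an unsigned type is another way to get the same effect.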

How to obtain a big-endian CPU machine

Last week, I wanted to test whether a trivial function works correctly on a big-endian CPU. I have ARM and x86_64 machines at hand, but both of them are little-endian. After searching online, I came across "Running a emulated SparcStation 20 with qemu-sparc". I had heard of qemu before but never used it, so I wanted to give it a spin.

The installation of qemu is straightforward. I then created a NetBSD-10.1-sparc machine in just 3 steps (omitting some configuration that I did not need):

$ qemu-img create -f qcow2 ss20.image 4G
$ qemu-system-sparc -M SS-20 -m 256 -drive file=NetBSD-10.1-sparc.iso,bus=0,unit=2,media=cdrom,readonly=on -drive file=ss20.image,bus=0,unit=0,media=disk -full-screen -boot d
$ qemu-system-sparc -M SS-20 -m 256 -drive file=ss20.image,bus=0,unit=0,media=disk -full-screen -boot c

Then the machine booted successfully and met my requirement perfectly!
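
For reference, this is the kind of check that can be compiled inside the guest to confirm the byte order. It is only a sketch: is_big_endian and put_be32 are names made up for this illustration, and put_be32 stands in for the "trivial function" under test, which is not shown in this post:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian host and 0 on a little-endian one. */
int
is_big_endian(void)
{
    const uint32_t value = 0x01020304;
    unsigned char bytes[sizeof(value)];

    /* Look at how the host lays out the integer in memory. */
    memcpy(bytes, &value, sizeof(value));
    return bytes[0] == 0x01;
}

/*
 * Hypothetical function under test: store v in big-endian order.
 * It uses shifts only, so it should behave the same on any host.
 */
void
put_be32(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}
```

On the emulated SPARC machine is_big_endian() should return 1, and on the ARM and x86_64 machines 0, while put_be32() should produce the same bytes everywhere.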

Test a multi-threaded program on one CPU

Today, I tested a multi-threaded program on one CPU. The testbed is a FreeBSD virtual machine, and the lscpu command shows that it indeed has one CPU:

$ lscpu
Architecture:            aarch64
Byte Order:              Little Endian
Total CPU(s):            1
Model name:              Apple Unknown CPU r0p0 (midr: 610f0000)

The multi-threaded program is simple too: 4 threads each increment one global variable 100000 times, so the correct result should be 400000 in every run. If the result is not 400000, the program exits:

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>

#define THREAD_NUM 4
#define SUM_LOOP_SIZE 100000

uint64_t sum;

void *
thread(void *arg)
{
    for (int i = 0; i < SUM_LOOP_SIZE; i++) {
        sum++;
    }
    return NULL;
}

int
main()
{
    pthread_t tid[THREAD_NUM];
    uint64_t counter = 0;
    while (1) {
        counter++;
        for (int i = 0; i < THREAD_NUM; i++) {
            int ret = pthread_create(&tid[i], NULL, thread, NULL);
            if (ret != 0) {
                fprintf(stderr, "Create thread error: %s\n", strerror(ret));
                return 1;
            }
        }

        for (int i = 0; i < THREAD_NUM; i++) {
            int ret = pthread_join(tid[i], NULL);
            if (ret != 0) {
                fprintf(stderr, "Join thread error: %s\n", strerror(ret));
                return 1;
            }
        }

        if (sum != THREAD_NUM * SUM_LOOP_SIZE) {
            fprintf(stderr, "Exit after running %" PRIu64 " times, sum=%" PRIu64 "\n", counter, sum);
            return 1;
        }

        sum = 0;
    }

    return 0;
}

Build and run the program:

$ ./multi_thread_one_cpu
Exit after running 17273076 times, sum=200000
$ ./multi_thread_one_cpu
Exit after running 1539708 times, sum=100000

Change “uint64_t sum;” to “volatile uint64_t sum;”, then compile and run again:

$ ./multi_thread_one_cpu
Exit after running 20 times, sum=200000
$ ./multi_thread_one_cpu
Exit after running 50 times, sum=200000

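It exits much faster. This is probably because volatile forces every increment to be a separate load, add and store, so there are many more points at which a thread can be preempted in the middle of an update.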

In summary, when multiple threads access the same variable, always use a lock, even on a single CPU. P.S., the code can be found here.
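
As a minimal sketch of that advice, the increment can be guarded with a pthread mutex. The names sum_lock, locked_thread and run_round are made up for this illustration and are not in the original program; run_round() performs one iteration of the while loop above and hands back the sum:

```c
#include <pthread.h>
#include <stdint.h>

#define THREAD_NUM 4
#define SUM_LOOP_SIZE 100000

static uint64_t sum;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
locked_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < SUM_LOOP_SIZE; i++) {
        /* Only one thread at a time may run the read-modify-write. */
        pthread_mutex_lock(&sum_lock);
        sum++;
        pthread_mutex_unlock(&sum_lock);
    }
    return NULL;
}

/*
 * Run one round with THREAD_NUM threads and store the total in *result.
 * Returns 0 on success or the first pthread error number seen.
 */
int
run_round(uint64_t *result)
{
    pthread_t tid[THREAD_NUM];
    int created = 0;
    int err = 0;

    sum = 0;
    for (; created < THREAD_NUM; created++) {
        err = pthread_create(&tid[created], NULL, locked_thread, NULL);
        if (err != 0) {
            break;
        }
    }

    /* Join whatever was started, even if a later create failed. */
    for (int i = 0; i < created; i++) {
        int ret = pthread_join(tid[i], NULL);
        if (err == 0) {
            err = ret;
        }
    }

    *result = sum;
    return err;
}
```

With the lock in place, every round should report 400000. The price is one lock and one unlock per increment.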