multi-thread | Nan Xiao's Blog

Reading code is still the most effective method to debug multi-threaded bugs

In the past month, I fixed two multi-threaded bugs. Their symptoms were:

a) The first bug: some threads deadlocked. It only occurred on a few production machines, not very often, and it never happened on the testbed.

b) The second bug: the program crashed after running for 3 ~ 5 hours because it entered a should-never-enter code path that triggers an assert. Although there was a core dump file, I couldn't find any clues at the crime scene.

The straightforward way to debug the first bug is to check that all lock and unlock operations are paired on every path. Unfortunately, that was not the root cause, so I began to check all the code related to the lock. After two days, I finally found a copy-paste error that could open a can of worms.

For the second bug, I went through all the code related to multi-threaded access to the problematic variable, line by line, to see whether there was a corner case that could cause contention. Thank god! While I was taking a rest at noon, I finally had the idea!

As you can see, while debugging these two bugs I couldn't find any better method than reading the code again and again (I did try adding more traces, but it didn't help). BTW, the two bugs have one thing in common: the fix was simple, just a one-line change.

Testing the behaviour of locking a spinlock twice

I was curious whether pthread_spin_lock() can really detect a deadlock scenario, so I wrote a simple program to test it:

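#include <stdio.h>
#include <string.h>
#include <pthread.h>

int
main(void)
{
    pthread_spinlock_t lock;
    int ret;

    /* pthread functions return an error number and do not set errno,
     * so report failures with strerror(ret) instead of perror(). */
    ret = pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_init error: %s\n", strerror(ret));
        return 1;
    }

    ret = pthread_spin_lock(&lock);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_lock 1 error: %s\n", strerror(ret));
        return 1;
    }

    /* Second lock attempt by the thread that already holds the lock. */
    ret = pthread_spin_lock(&lock);
    if (ret != 0) {
        fprintf(stderr, "pthread_spin_lock 2 error: %s\n", strerror(ret));
        return 1;
    }

    return 0;
}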

I tested it on both Linux and FreeBSD. On both systems the program blocked on the second pthread_spin_lock() call and never returned:

$ ./double_lock

P.S., the code can be found here.

Testing a multi-threaded program on one CPU

Today, I tested a multi-threaded program on one CPU. The testbed is a FreeBSD virtual machine, and the lscpu command confirms that it indeed has only one CPU:

$ lscpu
Architecture:            aarch64
Byte Order:              Little Endian
Total CPU(s):            1
Model name:              Apple Unknown CPU r0p0 (midr: 610f0000)

The multi-threaded program is simple too: 4 threads each increment one global variable 100000 times, so the correct result should be 400000 in every run. If the result is not 400000, the program exits:

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>

#define THREAD_NUM 4
#define SUM_LOOP_SIZE 100000

uint64_t sum;

void *
thread(void *arg)
{
    for (int i = 0; i < SUM_LOOP_SIZE; i++) {
        sum++;
    }
    return NULL;
}

int
main()
{
    pthread_t tid[THREAD_NUM];
    uint64_t counter = 0;
    while (1) {
        counter++;
        for (int i = 0; i < THREAD_NUM; i++) {
            int ret = pthread_create(&tid[i], NULL, thread, NULL);
            if (ret != 0) {
                fprintf(stderr, "Create thread error: %s\n", strerror(ret));
                return 1;
            }
        }

        for (int i = 0; i < THREAD_NUM; i++) {
            int ret = pthread_join(tid[i], NULL);
            if (ret != 0) {
                fprintf(stderr, "Join thread error: %s\n", strerror(ret));
                return 1;
            }
        }

        if (sum != THREAD_NUM * SUM_LOOP_SIZE) {
            fprintf(stderr, "Exit after running %" PRIu64 " times, sum=%" PRIu64 "\n", counter, sum);
            return 1;
        }

        sum = 0;
    }

    return 0;
}

Build and run the program:

$ ./multi_thread_one_cpu
Exit after running 17273076 times, sum=200000
$ ./multi_thread_one_cpu
Exit after running 1539708 times, sum=100000

Change “uint64_t sum;” to “volatile uint64_t sum;”, then compile and run again:

$ ./multi_thread_one_cpu
Exit after running 20 times, sum=200000
$ ./multi_thread_one_cpu
Exit after running 50 times, sum=200000

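The program exits much sooner. A likely reason is that volatile forces every increment to load and store sum in memory, so the compiler can no longer shorten the window in which the threads interfere with each other.

In summary, when multiple threads access the same variable, always use a lock, even on a single CPU. P.S., the code can be found here.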

The caveat of thread name length in glibc

Recently, our team hit an interesting bug: the process was configured to spawn 16 threads, but only 10 of them were actually running. The thread code looks like this:

static void *
stat_consumer_thread_run(void *data)
{
    stat_consumer_thread_t *thread = data;
    char thread_name[64];
    snprintf(thread_name, sizeof(thread_name), "stat.consumer.%d",
        thread->id);
    int rc = pthread_setname_np(pthread_self(), thread_name);
    if (rc != 0) {
        return NULL;
    }

    ......
    return NULL;
}

After checking the pthread_setname_np manual, we found:

The thread name is a meaningful C language string, whose length is restricted to 16 characters, including the terminating null byte ('\0').

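So the thread name is restricted to 16 bytes including the terminating null byte, which leaves 15 visible characters. “stat.consumer.0” ~ “stat.consumer.9” are exactly 15 characters long and are set successfully, but “stat.consumer.10” ~ “stat.consumer.15” are 16 characters long, so pthread_setname_np fails (the manual lists ERANGE for this case) and the corresponding threads return immediately without doing any work.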

Exit main thread and keep other threads running

In C, returning from the main function terminates the whole process. To let only the main thread exit and keep the other threads alive, you can call thrd_exit in the main function instead. Check the following code:

#include <stdio.h>
#include <threads.h>
#include <unistd.h>

int
print_thread(void *s)
{
    thrd_detach(thrd_current());
    for (size_t i = 0; i < 5; i++)
    {
        sleep(1);
        printf("i=%zu\n", i);
    }
    thrd_exit(0);
}

int
main(void)
{
    thrd_t tid;
    if (thrd_success != thrd_create(&tid, print_thread, NULL)) {
        fprintf(stderr, "Create thread error\n");
        return 1;
    }
    thrd_exit(0);
}

Run it:

$ ./main
i=0
i=1
i=2
i=3
i=4

You can see that even after the main thread exited, the other thread kept working.

P.S., the code can be downloaded here.