The caveat of thread name length in glibc

Recently, our team ran into an interesting bug: a process is configured to spawn 16 threads, but only 10 of them actually run. The thread code looks like this:

static void *
stat_consumer_thread_run(void *data)
{
    stat_consumer_thread_t *thread = data;
    char thread_name[64];
    snprintf(thread_name, sizeof(thread_name), "stat.consumer.%d",
        thread->id);
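    /* pthread_setname_np() fails with ERANGE when thread_name, including the
     * terminating NUL, exceeds 16 bytes, which is the case here for ids >= 10. */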
    int rc = pthread_setname_np(pthread_self(), thread_name);
    if (rc != 0) {
        return NULL;
    }

    ......
    return NULL;
}

After checking the pthread_setname_np(3) manual page, we found:

The thread name is a meaningful C language string, whose length is restricted to 16 characters, including the terminating null byte ('\0').

So a thread name is restricted to 16 bytes including the terminating null byte: “stat.consumer.0” through “stat.consumer.9” (15 characters plus the NUL) are set successfully, but “stat.consumer.10” through “stat.consumer.15” (16 characters plus the NUL) exceed the limit, pthread_setname_np() returns ERANGE, and the corresponding threads exit without doing any work.
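A minimal standalone check (assuming glibc; compile with -pthread) reproduces the limit:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 16 visible characters + NUL = 17 bytes: one byte over the limit */
    int rc = pthread_setname_np(pthread_self(), "stat.consumer.10");
    printf("stat.consumer.10: rc = %d (%s)\n", rc, strerror(rc));

    /* 15 visible characters + NUL = 16 bytes: exactly at the limit */
    rc = pthread_setname_np(pthread_self(), "stat.consumer.9");
    printf("stat.consumer.9:  rc = %d (%s)\n", rc, strerror(rc));
    return 0;
}

On glibc the first call returns ERANGE and the second returns 0.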

Use “LC_ALL=C” to improve performance

Using “LC_ALL=C” can improve some programs’ performance: in the C locale, string comparison is a plain byte-by-byte comparison instead of locale-aware collation, which is much cheaper. The following is a test of the join program without LC_ALL=C:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time join 1.sorted 2.sorted > 1-2.sorted.aggregated

real    0m49.903s
user    0m48.427s
sys 0m0.786s

And this is the same test with “LC_ALL=C”:

$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time LC_ALL=C join 1.sorted 2.sorted > 1-2.sorted.aggregated

real    0m12.752s
user    0m5.628s
sys 0m0.971s

Some good references on this topic are “Speed up grep searches with LC_ALL=C” and “Everyone knows grep is faster in the C locale”.
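To see the mechanism in isolation, here is a toy benchmark of my own (assuming glibc and an installed en_US.UTF-8 locale; the key strings are made up) that times strcoll(), the locale-aware comparison that join and sort rely on; in the C locale it degenerates into a plain byte comparison:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 1000
static char lines[N][32];

static void time_locale(const char *loc)
{
    if (setlocale(LC_ALL, loc) == NULL)
        fprintf(stderr, "locale %s is not installed\n", loc);
    long sink = 0;
    clock_t t0 = clock();
    for (int round = 0; round < 20000; round++)
        for (int i = 1; i < N; i++)
            sink += strcoll(lines[i - 1], lines[i]) > 0;  /* collation-aware compare */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%-12s %.2fs (checksum %ld)\n", loc, secs, sink);
}

int main(void)
{
    /* fill the array with varying keys so the comparisons cannot be hoisted */
    for (int i = 0; i < N; i++)
        snprintf(lines[i], sizeof(lines[i]), "key-%06d.value", (i * 7919) % N);
    time_locale("en_US.UTF-8");
    time_locale("C");
    return 0;
}

The gap between the two timings is the same effect behind the join numbers above.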

Clear file system cache before doing I/O-intensive benchmark on Linux

If you do any I/O-intensive benchmark, please run the following command before it; otherwise you may get a wrong impression of the program’s performance:

$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

sync writes dirty data from the page cache back to the file system (otherwise the dirty pages cannot be dropped); “echo 3 > /proc/sys/vm/drop_caches” then drops the clean page cache as well as reclaimable slab objects such as dentries and inodes. Compare the following runs:

$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time ./benchmark

real    0m12.434s
user    0m5.633s
sys 0m0.761s
$ time ./benchmark

real    0m6.291s
user    0m5.645s
sys 0m0.631s

The first run of the benchmark program takes ~12 seconds. Now that the files are cached, the second run is roughly halved: ~6 seconds.
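If a benchmark harness needs to drop the caches itself between runs, the shell one-liner can be reproduced in C; a minimal sketch (it must run as root):

#include <stdio.h>
#include <unistd.h>

/* Equivalent of: sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches" */
static int drop_caches(void)
{
    sync();                               /* flush dirty pages first            */
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL) {
        perror("open /proc/sys/vm/drop_caches (are you root?)");
        return -1;
    }
    fputs("3\n", f);                      /* 3 = page cache + reclaimable slabs */
    return fclose(f) == 0 ? 0 : -1;
}

int main(void)
{
    return drop_caches() == 0 ? 0 : 1;
}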

References:
Why drop caches in Linux?;
/proc/sys/vm;
CLEAR_CACHES.

Beware of the GNU libc basename() function

From the manual page, we know there are two implementations of basename(). One is the POSIX-compliant version:

#include <libgen.h>

char *basename(char *path);

The other is the GNU version:

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <string.h>

But the manual doesn’t mention that the prototype of the GNU version differs from the POSIX-compliant one (the parameter type is const char *, not char *):

char *basename (const char *__filename)

And the implementation is also simple; it just invokes strrchr():

char *
__basename (const char *filename)
{
  char *p = strrchr (filename, '/');
  return p ? p + 1 : (char *) filename;
}
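Besides the const parameter, the two versions also behave differently for a path with a trailing slash (this part is documented in basename(3)); a minimal sketch, assuming glibc:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>   /* GNU basename(); do NOT include <libgen.h>, or the POSIX one wins */

int main(void)
{
    const char *path = "/usr/lib/";
    /* The GNU version never modifies its argument and returns an empty
     * string when the path ends with '/'. */
    printf("GNU basename(\"%s\") = \"%s\"\n", path, basename(path));
    return 0;
}

The POSIX version (pulled in by including <libgen.h>) returns “lib” for the same path and is allowed to modify its argument, so code that silently switches between the two can break in subtle ways.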

A doubt about “perf record -p pid sleep 10”

In Brendan Gregg’s awesome perf Examples, there is a command which can be used to sample for a specified amount of time, e.g., 10 seconds:

# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10

I have a doubt about it: wouldn’t perf sample two processes, the one with the specified PID and the sleep task? Yes, I know sleep should do almost nothing and barely affect the samples, but as someone who likes to find the root cause, I want to know whether sleep will be sampled or not.

First, check the manual page; unfortunately, I couldn’t find the answer there.

Second, experiment. I wrote a simple program which just prints something forever:

$ cat a.c
#include <stdio.h>
int main()
{
    while (1)
        printf("Hello\n");
}

Compile and use perf record to profile it for a while:

$ gcc a.c
$ sudo perf record ./a.out
Hello
Hello
......
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.106 MB perf.data (2409 samples) ]

Check the perf.data:

Samples: 2K of event 'cycles:ppp', Event count (approx.): 853929764
Overhead  Command  Shared Object      Symbol
   9.66%  a.out    [kernel.kallsyms]  [k] retint_userspace_restore_args
   7.69%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   4.29%  a.out    [kernel.kallsyms]  [k] n_tty_write
   3.75%  a.out    [kernel.kallsyms]  [k] system_call_after_swapgs
   3.63%  a.out    [kernel.kallsyms]  [k] irq_return
   3.01%  a.out    [kernel.kallsyms]  [k] native_write_msr_safe
......

Nothing special; this just confirms that the a.out process can be profiled. Then launch an htop program, whose PID is 39741, and execute the following command for several seconds:

$ sudo perf record -p 39741 ./a.out
Hello
Hello
......
^CHello
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (98 samples) ]

a.out ran smoothly, but look at the perf report:

Samples: 98  of event 'cycles:ppp', Event count (approx.): 77850922
Overhead  Command  Shared Object       Symbol
   1.02%  htop     [kernel.kallsyms]   [k] sys_read                                                  ▒
   1.02%  htop     [kernel.kallsyms]   [k] getname                                                   ▒
   1.02%  htop     [kernel.kallsyms]   [k] __d_lookup_rcu                                            ▒
   1.02%  htop     [kernel.kallsyms]   [k] generic_permission                                        ▒
   1.02%  htop     [kernel.kallsyms]   [k] do_task_stat
......

I couldn’t see any samples from a.out; all of them are from htop. So it seems a.out won’t be profiled in this scenario.

Third, check the perf source code to verify this speculation. Roughly speaking, “perf record [options] command [options]” forks a child process to run command, and this happens in evlist__prepare_workload():

int evlist__prepare_workload(struct evlist *evlist, struct target *target, const char *argv[],
                 bool pipe_output, void (*exec_error)(int signo, siginfo_t *info, void *ucontext))
{
    ......
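    /*
     * With "-p PID" (or a CPU/system-wide target), target__none() is false,
     * so the branch below is skipped and the forked workload PID is never
     * added to the thread map.
     */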
    if (target__none(target)) {
        if (evlist->core.threads == NULL) {
            fprintf(stderr, "FATAL: evlist->threads need to be set at this point (%s:%d).\n",
                __func__, __LINE__);
            goto out_close_pipes;
        }
        perf_thread_map__set_pid(evlist->core.threads, 0, evlist->workload.pid);
    }
    ......
}

For target__none:

static inline bool target__none(struct target *target)
{
    return !target__has_task(target) && !target__has_cpu(target);
}

Because I have already specified the process of interest with -p, target__has_task(target) returns true, so target__none() returns false; therefore perf_thread_map__set_pid() is not invoked and the forked child is not added to evlist->core.threads. That is why the command process won’t be profiled. Similarly, if target__has_cpu(target) returns true, the command process is not tracked either.
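To make the two cases concrete, here is a condensed sketch of that check; it is my own simplification, not the real struct target (which also tracks tid and uid strings):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* simplified stand-in for perf's struct target */
struct target {
    const char *pid;        /* set by -p */
    const char *cpu_list;   /* set by -C */
    bool system_wide;       /* set by -a */
};

static bool target__has_task(const struct target *t) { return t->pid != NULL; }
static bool target__has_cpu(const struct target *t)  { return t->system_wide || t->cpu_list; }
static bool target__none(const struct target *t)
{
    return !target__has_task(t) && !target__has_cpu(t);
}

int main(void)
{
    struct target with_pid = { .pid = "39741" };  /* perf record -p 39741 ./a.out */
    struct target no_target = { 0 };              /* perf record ./a.out          */

    printf("-p 39741:  workload pid added to thread map? %s\n",
           target__none(&with_pid) ? "yes" : "no");
    printf("no target: workload pid added to thread map? %s\n",
           target__none(&no_target) ? "yes" : "no");
    return 0;
}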

In summary, to find the answer to such a doubt: read the code, write code to verify, and repeat. That’s all.