Two practical software engineering rules

There are so many huge books which introduce software engineering, and in this article, I want to share two practical rules which are based on my own experience.

(1) No fear of refactoring

As time goes on, refactoring code is inevitable: the original design can’t handle current situation seamlessly; we can use the new characteristics of programming language to polish existed code, etc. Since refactoring code is time-consuming, risking, and costly, many companies are reluctant to do it for some reasons. Whereas the refactoring is beneficial to both company and engineers literally.

For companies: After refactorig, the code should become more reasonable and easier-maintainable, and the consequence is that it will save you much time and cost to add new features. For engineers: refactoring code can let you be more familiar with the the code logic, try using new characteristics of programming language and practice module design skills, and it is a precious opportunity to enrich yourself. So in the long run, refactoring code is a win-win situation actually. ( If the software quality becomes worse, oh boy! Don’t refactor it!)

(2) “Real” peer-to-peer code review

I haven’t experienced pair-programming, but took part in many “fake” peer-to-peer code review: before reviewing, the reviewer didn’t read code before. During the reviewing, the code author needed to spell out what was the intention of this code, then the reviewer would analyze the code on the spot. It seemed the reviewer and code author were very busy in the review meeting, but in fact it was a totally time-wasting and inefficient!

From my viewpoint, there should be two maintainers for any software module, and the two maintainers have the same familiarity of code. No matter adding a new big feature or just fixing a small bug, the two maintainers should co-work the whole design flow in advance, then if the task is small, one maintainer can take over the whole work, otherwise they can share it. Since everybody has took part in the discussion before, he/she can review partner’s code alone. This method can avoid misleading by code author, saving time, and finding bug efficiently. The potential benefit for company is if one guy resigns, there is no loss because there is always another engineer who is an genuine backup.

These two rules seem feasible? Why not give them a shot?

Fix cudart::globalState::registerEntryFunction core dump issue

In the past 2 days, I was in trouble handling a program crash dump which uses CUDA:

(gdb) bt
#0  0x00007ffff73fc559 in cudart::globalState::registerEntryFunction(void**, char const*, char*, char const*, int, uint3*, uint3*, dim3*, dim3*, int*) () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#1  0x00007ffff73decbc in __cudaRegisterFunction () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#2  0x00007ffff73d9098 in __nv_cudaEntityRegisterCallback(void**) () from /home/xiaonan/dl2-he/3rdparty/libDSI_FV.so
#3  0x00000000004283d6 in __cudaRegisterLinkedBinary(__fatBinC_Wrapper_t const*, void (*)(void**), void*) ()
#4  0x00000000004282e5 in __cudaRegisterLinkedBinary_66_tmpxft_00002dac_00000000_12_cuda_device_runtime_compute_70_cpp1_ii_8b1a5d37 ()
#5  0x00007ffff7de76ba in ?? () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7de77cb in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7dd7c6a in ?? () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000001 in ?? ()
#9  0x00007fffffffe7e3 in ?? ()
#10 0x0000000000000000 in ?? ()

Long story to short, my OS is Ubuntu 16.04.5 LTS, and CUDA version is:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

After tough trial and error, the solution is renaming one file from “.cpp” to “.cu“. But on another Arch Linux with the newest CUDA (V10.0.130), this problem doesn’t exist.

If you are interested in more information, please refer this topic.

Some tips of creating streams in using CUDA

Check following simple program:

cat test_stream.cu
int main()
{
        cudaStream_t st_00, st_01, st_11;

        cudaSetDevice(0);
        cudaStreamCreate(&st_00);
        cudaStreamCreate(&st_01);

        cudaSetDevice(1);
        cudaStreamCreate(&st_11);

        return 0;
}

In my system, device 0 is Nvidia Tesla-V100 GPU while device 1 is Tesla-P100. Use cuda-gdb to debug the program step by step:

(1)

Temporary breakpoint 1, main () at /home/xiaonan/temp/test_stream.cu:2
2       {
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:2
(cuda-gdb) n
5               cudaSetDevice(0);
(cuda-gdb)
[New Thread 0x7fffdffff700 (LWP 82532)]
6               cudaStreamCreate(&st_00);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6

When the program was launched, there is only main thread (Id is 1). Then after calling cudaSetDevice(0);, a new thread is spawned (Id is 2).

(2)

(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
(cuda-gdb) n
[New Thread 0x7fffdf7fe700 (LWP 82652)]
7               cudaStreamCreate(&st_01);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:7
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

On device 0, only first calling cudaStreamCreate will generate a new thread. Check used memory through nvidia-smi command:

$ nvidia-smi
Tue Nov 13 16:53:37 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

Create another stream, you will find the memory usage is the same as before.

(3)

(cuda-gdb) n
9               cudaSetDevice(1);
(cuda-gdb)
10              cudaStreamCreate(&st_11);
(cuda-gdb)
[New Thread 0x7fffdeffd700 (LWP 82993)]
12              return 0;
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:12
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6
  4    Thread 0x7fffdeffd700 (LWP 82993) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

Switch to another device and create stream; check memory usage now:

$ nvidia-smi
Tue Nov 13 16:54:24 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    30W / 250W |    291MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82365      C   /home/xiaonan/temp/build/test_stream         281MiB |
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

You will find different devices consume different memory for creating streams.

Conditional variable takeaways

Conditional variable is a common concept in both user-space and kernel-space programming. IMHO, it is a little complicated synchronization mechanism. Recently, I came across Measuring context switching and memory overheads for Linux threads, and this article provides an example which I think is a good tutorial about how to understand and use conditional variable.

The parent thread code is like following:

  /* parent thread */
  pthread_mutex_lock(&si.mutex);
  pthread_t childt;
  pthread_create(&childt, NULL, threadfunc, (void*)&si);

  // Each iteration of this loop will switch context from the parent to the
  // child and back - two context switches. The child signals first.
  ......
  for (int i = 0; i < NUM_ITERATIONS; ++i) {
    pthread_cond_wait(&si.cv, &si.mutex);
    pthread_cond_signal(&si.cv);
  }
  pthread_mutex_unlock(&si.mutex);

And this is the child thread code:

// The child thread signals first
  pthread_mutex_lock(&si->mutex);
  pthread_cond_signal(&si->cv);
  for (int i = 0; i < NUM_ITERATIONS; ++i) {
    pthread_cond_wait(&si->cv, &si->mutex);
    pthread_cond_signal(&si->cv);
  }
  pthread_mutex_unlock(&si->mutex);

(1) The first takeaway is pthread_cond_signal() must be called after pthread_cond_wait() in timing sequence; otherwise the signal won’t be received.

Check preceding code, before launching child thread:

    ......
    pthread_t childt;
    pthread_create(&childt, NULL, threadfunc, (void*)&si);
    ......

The parent thread must hold mutex first:

    ......
    pthread_mutex_lock(&si.mutex);
    ......

Then in the first iteration of loop, release the mutex and wait for notification:

    ......
    pthread_cond_wait(&si.cv, &si.mutex);
    ......

This can guarantee when child thread sends signal, the parent thread is already in the wait queue:

  ......
  pthread_mutex_lock(&si->mutex);
  pthread_cond_signal(&si->cv);
  ......

(2) The other thing we should remember is before and after calling pthread_cond_wait(), the current thread must hold the mutex. I.e., before callingpthread_cond_wait(), the current thread get the mutex, then in pthread_cond_wait(), put the current thread in the wait queue and release the mutexatomically. Once another thread signals current thread, it will reacquire mutex and return from pthread_cond_wait().

Forgetting “-pthread” option may give you a big surprise!

Today, I wrote a small pthread program to do some testing:

#include <pthread.h>

int main(void)
{
        pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

        pthread_mutex_lock(&mutex);
        pthread_cond_wait(&cv, &mutex);
        return 0;
} 

Build and test it on OpenBSD-current (version is 6.4):

# cc cv_test.c -o cv_test
# ./cv_test

The program will block there and it is my expected result. Switch to Arch Linux (kernel version is 4.18.9):

# cc cv_test.c -o cv_test
# ./cv_test
#

The program will exit immediately. I doubt it is “spurious awake” firstly, but can’t get a convincing explanation. Using ldd to check program. On OpenBSD:

# ldd cv_test
cv_test:
        Start            End              Type  Open Ref GrpRef Name
        000000d4c3a00000 000000d4c3c02000 exe   1    0   0      cv_test
        000000d6e6007000 000000d6e62f6000 rlib  0    1   0      /usr/lib/libc.so.92.5
        000000d6db100000 000000d6db100000 ld.so 0    1   0      /usr/libexec/ld.so

On Arch Linux:

# ldd cv_test
        linux-vdso.so.1 (0x00007ffde91c6000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f3e3169b000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f3e3187a000)

Nothing special. After seeking help on stackoverflow, the answer is I need adding -pthread option:

# cc -pthread cv_test.c -o cv_test
# ./cv_test

This time it worked perfectly. Checking linked library:

# ldd cv_test
        linux-vdso.so.1 (0x00007fff48be8000)
        libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007fa46f84c000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fa46f688000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fa46f888000)

Why doesn’t Linux give me a link error which prompts I need link libpthread? It seems not make sense.