Some tips on creating streams in CUDA

Check the following simple program:

$ cat test_stream.cu
int main()
{
        cudaStream_t st_00, st_01, st_11;

        cudaSetDevice(0);
        cudaStreamCreate(&st_00);
        cudaStreamCreate(&st_01);

        cudaSetDevice(1);
        cudaStreamCreate(&st_11);

        return 0;
}

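On my system, CUDA device 0 is an NVIDIA Tesla V100 GPU while device 1 is a Tesla P100. (nvidia-smi lists GPUs in PCI bus order, whereas CUDA by default enumerates the fastest device first, so in the nvidia-smi output below these two devices appear as GPU 3 and GPU 0 respectively.) Use cuda-gdb to debug the program step by step: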

(1)

Temporary breakpoint 1, main () at /home/xiaonan/temp/test_stream.cu:2
2       {
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:2
(cuda-gdb) n
5               cudaSetDevice(0);
(cuda-gdb)
[New Thread 0x7fffdffff700 (LWP 82532)]
6               cudaStreamCreate(&st_00);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6

When the program is launched, there is only the main thread (Id 1). After calling cudaSetDevice(0), a new thread is spawned (Id 2).

(2)

(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:6
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
(cuda-gdb) n
[New Thread 0x7fffdf7fe700 (LWP 82652)]
7               cudaStreamCreate(&st_01);
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:7
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

On device 0, only the first call to cudaStreamCreate spawns a new thread. Check the memory usage with the nvidia-smi command:

$ nvidia-smi
Tue Nov 13 16:53:37 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

Create another stream, and you will find the memory usage is the same as before.

(3)

(cuda-gdb) n
9               cudaSetDevice(1);
(cuda-gdb)
10              cudaStreamCreate(&st_11);
(cuda-gdb)
[New Thread 0x7fffdeffd700 (LWP 82993)]
12              return 0;
(cuda-gdb) i threads
  Id   Target Id         Frame
* 1    Thread 0x7ffff7a74740 (LWP 82365) "test_stream" main () at /home/xiaonan/temp/test_stream.cu:12
  2    Thread 0x7fffdffff700 (LWP 82532) "test_stream" 0x00007ffff7b743e7 in accept4 () from /usr/lib/libc.so.6
  3    Thread 0x7fffdf7fe700 (LWP 82652) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6
  4    Thread 0x7fffdeffd700 (LWP 82993) "test_stream" 0x00007ffff7b67bb1 in poll () from /usr/lib/libc.so.6

Switch to another device and create a stream, then check the memory usage again:

$ nvidia-smi
Tue Nov 13 16:54:24 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.57                 Driver Version: 410.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0    30W / 250W |    291MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   26C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    47W / 250W |    769MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82365      C   /home/xiaonan/temp/build/test_stream         281MiB |
|    3     82365      C   /home/xiaonan/temp/build/test_stream         407MiB |
+-----------------------------------------------------------------------------+

You will find that different devices consume different amounts of memory when the first stream is created: 407MiB on the V100 versus 281MiB on the P100 here.

The “sophisticated” modern software engineer interview process

About 10 years ago, the interview process for software engineers was pretty simple and straightforward: if you met the requirements, you were invited on-site. The interviewer asked you some questions about previous or current projects, basic computer science knowledge, and so on. If you were a fresh graduate, there might be a written test on algorithms. Then, if the interviewer thought you were an appropriate candidate, you entered the next round, which was mostly the final one. This round usually involved the R&D Director or HR, with little or no technical discussion; it was only about salary and whether you could fit the culture of the team. Generally speaking, the whole process lasted only 1 week and comprised 1 ~ 2 rounds. No-nonsense!

If you want to know the contemporary interview process, please check the following capture of a job description, which I think is a good representative:

Currently, before you reach the on-site stage, you should have already passed 2 ~ 3 rounds of interviews. Usually, phone screening is the first round, in which HR gets a rough idea of your background. Sometimes HR will also ask you some technical questions even though he/she is not technically oriented. Then HR will notify you that there is a coding/homework test which you should finish by a deadline. Sometimes the phone screening is omitted, and a mail telling you to finish the coding/homework test lands in your mailbox directly. It gives me (and maybe you) the impression that the people in the company are very busy and have no time to talk crap with interviewees. You apply for the job, the job doesn’t apply for you, correct? So you should finish some work to prove you are qualified to talk to the company. Alas! I once got a homework test whose document was 7 pages long! Yes, 7 pages! Besides coding, I also needed to write a detailed test plan. The whole task cost me 30 hours, and I even suspected the company was just outsourcing its work to free labour. Anyway, if you pass this round, you can at least hear a voice or see a person; if not, you will receive a rejection letter or no response at all. The game is over, and you can’t even exchange one word with the company.

In some cases, there is an extra coding interview before the on-site. You have just passed the online coding test, but this time you will share a screen with the interviewer to test your “pair-programming” ability. During this process, you should interact with the interviewer actively: propose something, confirm something, etc. I don’t know whether this is the essence of “pair-programming”.

If you arrive at this stage, congratulations! You can be invited on-site. There may be another 2 ~ 4 rounds of interviews; every interview takes 1 ~ 2 hours, and the content falls into the following 4 categories:
a) Yet another coding test (e.g., a dynamic programming problem);
b) System design (e.g., how to design a car-parking lot?);
c) Basic computer science knowledge (e.g., the difference between TCP and UDP?);
d) Your previous/current projects (e.g., what is your current working field?).
There seems to be no standard for the interview, and you don’t know how many questions you should answer correctly to guarantee you enter the next round, especially since many problems are open-ended.

Based on the previous description, you can see that to get a job offer today, you will endure 4 ~ 6 rounds of interviews and ~10 hours in total (not counting the time you spend on homework or preparing for online coding). The whole process can last for 1 ~ 2 months. I can’t say whether this method is right or wrong, but it certainly increases the time cost for candidates. As for the companies: are so many rounds really meaningful? What result do you expect from every round? Is the candidate who passes all rounds truly the right person? If you are the owner of a company and can give specific answers to the above questions, I think you will know whether this interview method is suitable for your corporation.

BTW, I really like the following interview process:

The byproducts of reading OpenBSD netcat code

When I took part in a training course last year, I heard about netcat for the first time. During that class, the tutor showed some hacks and tricks using netcat, which appealed to me and motivated me to learn its internals. Fortunately, in the past 2 months I was not so busy, so I could spend my spare time diving into OpenBSD’s netcat source code, and I got abundant byproducts during this process.

(1) Brushing up on socket programming. I wrote my first network application more than 10 years ago, and I have always thought the socket APIs are marvelous. Just ~10 functions (socket, bind, listen, accept…) with some I/O multiplexing buddies (select, poll, epoll…) connect the whole world, wonderful! Since then, I have developed a habit: when touching a new programming language, network programming is an essential exercise. Even though I don’t write socket-related code now, reading netcat’s socket code indeed refreshed my knowledge and taught me new stuff.

(2) Writing a tutorial about netcat. I am a mediocre programmer and forget things when I don’t use them for a long time, so I take notes on what I think is useful. IMHO, this “tutorial” isn’t really meant to teach others something; it is just a journal which I can refer to when I need it in the future.

(3) Submitting patches to netcat. While reading the code, I also found bugs and some possible enhancements. Though these are trivial contributions to OpenBSD, I am still happy and enjoy it.

(4) Implementing a C++ encapsulation of libtls. OpenBSD’s netcat supports TLS/SSL connections, but it requires you to take full care of resource management (memory, sockets, etc.); otherwise a small mistake can lead to a resource leak, which is fatal for long-lived applications (in fact, the two bugs I reported to OpenBSD are both related to resource leaks). Therefore I developed a simple C++ library which wraps libtls, hoping it can free developers from this troublesome problem and let them put more energy into the application logic.

Long story short, reading classic source code is a rewarding process, and you can consider trying it yourself.

Condition variable takeaways

A condition variable is a common concept in both user-space and kernel-space programming. IMHO, it is a somewhat complicated synchronization mechanism. Recently, I came across “Measuring context switching and memory overheads for Linux threads”, and this article provides an example which I think is a good tutorial on how to understand and use condition variables.

The parent thread code looks like the following:

  /* parent thread */
  pthread_mutex_lock(&si.mutex);
  pthread_t childt;
  pthread_create(&childt, NULL, threadfunc, (void*)&si);

  // Each iteration of this loop will switch context from the parent to the
  // child and back - two context switches. The child signals first.
  ......
  for (int i = 0; i < NUM_ITERATIONS; ++i) {
    pthread_cond_wait(&si.cv, &si.mutex);
    pthread_cond_signal(&si.cv);
  }
  pthread_mutex_unlock(&si.mutex);

And this is the child thread code:

  // The child thread signals first
  pthread_mutex_lock(&si->mutex);
  pthread_cond_signal(&si->cv);
  for (int i = 0; i < NUM_ITERATIONS; ++i) {
    pthread_cond_wait(&si->cv, &si->mutex);
    pthread_cond_signal(&si->cv);
  }
  pthread_mutex_unlock(&si->mutex);

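(1) The first takeaway is that pthread_cond_signal() must be called only after the other thread is already waiting in pthread_cond_wait(); otherwise the signal is lost, because a condition variable does not remember signals sent while no thread is waiting.

Check the preceding code. Before launching the child thread: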

    ......
    pthread_t childt;
    pthread_create(&childt, NULL, threadfunc, (void*)&si);
    ......

The parent thread must hold the mutex first:

    ......
    pthread_mutex_lock(&si.mutex);
    ......

Then, in the first iteration of the loop, it releases the mutex and waits for notification:

    ......
    pthread_cond_wait(&si.cv, &si.mutex);
    ......

This guarantees that when the child thread sends the signal, the parent thread is already in the wait queue:

  ......
  pthread_mutex_lock(&si->mutex);
  pthread_cond_signal(&si->cv);
  ......

(2) The other thing to remember is that the current thread must hold the mutex both before and after calling pthread_cond_wait(). I.e., the current thread acquires the mutex before calling pthread_cond_wait(); then pthread_cond_wait() atomically puts the current thread in the wait queue and releases the mutex. Once another thread signals the waiting thread, it reacquires the mutex and returns from pthread_cond_wait().

The Dilemma brought by “outdated technology”

This week, I came across an interesting post: “Do You Know Cobol? If So, There Might Be a Job for You.” The general idea is that many big finance companies still use systems developed in Cobol, an ancient programming language. Currently, few people master Cobol, and even worse, most of them will retire soon or already have. IMHO, this article gives a good example of the dilemma that “outdated technology” brings to companies and individual engineers.

Let me tell two stories first:
(1) I once worked at HPE for nearly two years, and HPE has its own Unix operating system: HP-UX. Actually, my team did HP-UX related work before I joined, but by that time all of the team’s work had already switched to Linux. Why? Because Linux had come to dominate the server market. To earn money, more and more resources had to be invested in the Linux area, and for HP-UX, basic maintenance and development was enough. As far as I know, HP-UX is still the backbone of many critical services, such as banks, telecommunication operators, etc. But even for HPE itself, HP-UX’s priority has become much lower now.

(2) There is a service which was launched in the mid-1990s. It is a 32-bit program, written in C, and stable enough to serve people all over the world every day. About 20 years later, one engineer noticed the Year 2038 problem, because the program uses time_t, which is 32 bits long there. He began to discuss with his leader converting the program to 64-bit, but both of them knew it was not as simple as adding the -m64 compile option. After ~20 years, the code had become “mature”: the code repository was very large, about 40 engineers had committed code over time, and some modules had turned into a total “black box”, meaning no one knew the logic behind them, yet they worked like a charm! After converting it into a 64-bit program, maybe the compilation would pass, but no one knew whether it would really work! It needed careful code review and sufficient testing, which seemed not worth it for a problem that would occur ~20 years later. Therefore, this task has stayed in the “Todo” list year after year. Everyone prays the service will be shut down by 2038.

These examples all illustrate one fact: most “outdated technologies”, such as Cobol, HP-UX or 32-bit programs, are not too “bad”, but for some reason they are not mainstream now. For most companies, overhauling services built on these “outdated technologies” is not acceptable: besides the notable cost in time and people, one glitch can bring catastrophic results and can even put the company out of business. On the other side, the number of engineers who master these “outdated technologies” keeps getting smaller, so companies can mostly only develop such talent from their internal staff.

For engineers, there is also a dilemma. You can pick up an “outdated technology” as a hobby, but there is a huge risk in adopting it as a full-time job. Maybe one day you will need to find a job again, and many biased companies will reject you just because they deem that you don’t know some “hyped technology” and can only tame dinosaurs. Ridiculous, isn’t it? But it is the reality we must adapt to.