Parallelization may cause GPU utilization become worse

Recently, I did an experiment about CUDA. The following is a simple dead-loop code:

......
while (1)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

Run it in single thread, the GPU utilization is ~80%, while in two threads, the utilization is reduced to ~60%; in three threads, the utilization is reduced to ~40%. I can’t comprehend this phenomenon, so posted topics in both stackoverflow and CUDA developer forum. Unfortunately, there was no response.

After some investigation, I found this post and know there is a kernel launch queue firstly. I modified the code and profile again (use nvprof instead of GUI):

......
for (int i = 0; i < 10000; i++)
{
    dgt_mul<<<gDim, bDim, 0, st>>>(......);
}  
......

The following is the profile output of one, two and three threads:

==22209== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  17.577ms     10000  1.7570us  1.7270us  2.8800us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   97.83%  70.567ms     10000  7.0560us  4.4700us  13.296ms  cudaLaunchKernel
                    2.17%  1.5644ms     10000     156ns     119ns  15.779us  cudaGetLastError

==23662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  35.288ms     20000  1.7640us  1.7270us  12.704us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.09%  473.79ms     20000  23.689us  5.0040us  13.294ms  cudaLaunchKernel
                    0.91%  4.3564ms     20000     217ns     117ns  6.4690us  cudaGetLastError

==27597== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  52.587ms     30000  1.7520us  1.7270us  2.9440us  dgt_mul(unsigned int*, unsigned int*, unsigned int*, int, int)
      API calls:   99.23%  2.10159s     30000  70.053us  13.545us  13.778ms  cudaLaunchKernel
                    0.77%  16.328ms     30000     544ns     368ns  19.316us  cudaGetLastError

We can see the average execution time of cudaLaunchKernel scales up, so it manifests there is bottle neck in the kernel launch queue when running more threads.

The display format of Bash’s built-in time command

Check the default output format of bash‘s built-in time command:

# time

real    0m0.000s
user    0m0.000s
sys     0m0.000s

You can use -p option to output in POSIX format:

# time -p
real 0.00
user 0.00
sys 0.00

From bash source code, we know the the definitions of these two formats:

#define POSIX_TIMEFORMAT "real %2R\nuser %2U\nsys %2S"
#define BASH_TIMEFORMAT  "\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS"

To decipher the meanings of them, we need to refer bash manual:

TIMEFORMAT

The value of this parameter is used as a format string specifying how the timing information for pipelines prefixed with the time reserved word should be displayed. The ‘%’ character introduces an escape sequence that is expanded to a time value or other information. The escape sequences and their meanings are as follows; the braces denote optional portions.

%%
A literal ‘%’.

%[p][l]R
The elapsed time in seconds.

%[p][l]U
The number of CPU seconds spent in user mode.

%[p][l]S
The number of CPU seconds spent in system mode.

%P
The CPU percentage, computed as (%U + %S) / %R.

The optional p is a digit specifying the precision, the number of fractional digits after a decimal point. A value of 0 causes no decimal point or fraction to be output. At most three places after the decimal point may be specified; values of p greater than 3 are changed to 3. If p is not specified, the value 3 is used.

The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
……

Take POSIX_TIMEFORMAT as an example: %2R denotes using second as time unit, and the precision is two digits after a decimal point; %2U and %2S are similar.

Now you can comprehend the output of time, correct? Try using BASH_TIMEFORMAT as a practice.

 

Downgrade boost on Ubuntu 16.04

In the past 2 days, I was tortured by boost. The default boostversion on Ubuntu 16.04 is 1.58, but I met following compile errors:

......
/usr/include/boost/multi_index/detail/bucket_array.hpp: In static member function ‘static std::size_t boost::multi_index::detail::bucket_array_base<_>::size_index(std::size_t)’:
/usr/include/boost/multi_index/detail/bucket_array.hpp:84:62: error: invalid use of non-lvalue array
     const std::size_t *bound=std::lower_bound(sizes,sizes+sizes_length,n);
                                                              ^~~~~~~~~~~~
/usr/include/boost/multi_index/detail/bucket_array.hpp:85:25: error: invalid use of non-lvalue array
     if(bound==sizes+sizes_length)--bound;
                         ^~~~~~~~~~~~
/usr/include/boost/multi_index/detail/bucket_array.hpp:86:22: error: invalid use of non-lvalue array
     return bound-sizes;
......

I downgraded boost to 1.55; and downloaded and built it:

$ ./bootstrap.sh --prefix=/usr/local  
$ sudo ./b2 -a -q install

This time I found the default gcc-5 could not compile successfully. So I followed this post to install gcc-6, and modified/home/nan/boost_1_55_0/tools/build/v2/user-config.jam file to use gcc-6 to compile boost:

......
# Configure specific gcc version, giving alternative name to use.
using gcc : 6 : g++-6 ;
......

Then my project can be compiled successfully. Check gcc search header file path and library path:

$ echo | gcc-6 -E -Wp,-v -
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/6/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/6/include
 /usr/local/include
 /usr/lib/gcc/x86_64-linux-gnu/6/include-fixed
 /usr/include/x86_64-linux-gnu
 /usr/include
......

$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/cuda-9.0/targets/x86_64-linux/lib:
/usr/lib/x86_64-linux-gnu/libfakeroot:
/usr/local/lib:
/lib/x86_64-linux-gnu:
/usr/lib/x86_64-linux-gnu:
/usr/lib/nvidia-384:
/usr/lib32/nvidia-384:
/lib32:
/usr/lib32:
/lib:
/usr/lib:
/usr/lib/nvidia-384/tls: (hwcap: 0x8000000000000000)
/usr/lib32/nvidia-384/tls: (hwcap: 0x8000000000000000)

You will find the fresh installed boost 1.55 (in /usr/local directory) always be found before default boost 1.58 (header files are in /usr/include/boostand libraries in /usr/lib/x86_64-linux-gnu).

Update keyring first if your Arch Linux is old enough

My Arch Linux is not updated for nearly 3 months. When running pacman -Syu, it prompts following errors:

$ sudo pacman -Syu
......
error: python-dnspython: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python-dnspython-1.16.0-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: python-distlib: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python-distlib-0.2.8-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: python-pytoml: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python-pytoml-0.1.20-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: python2-distlib: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python2-distlib-0.2.8-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: python2-dnspython: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python2-dnspython-1.16.0-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: python2-pytoml: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/python2-pytoml-0.1.20-1-any.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: qbittorrent: signature from "Eli Schwartz <[email protected]>" is unknown trust
:: File /var/cache/pacman/pkg/qbittorrent-4.1.5-1-x86_64.pkg.tar.xz is corrupted (invalid or corrupted package (PGP signature)).
Do you want to delete it? [Y/n]
error: failed to commit transaction (invalid or corrupted package)
Errors occurred, no packages were upgraded.
......

The solution is updating archlinux-keyring first:

$ sudo pacman -S archlinux-keyring

Then all goes well!