The first gcc bug I ever met

I have used gcc for more than 10 years and had never met a bug before. In my mind, gcc was one of the most stable pieces of software in the world, but yesterday that myth ended.

I tried to use OpenMP to optimize my program, and all was OK until the taskloop construct was added:

#pragma omp taskloop

The build terminated with the following error:

xxxxxxx.cpp:142:1: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <> for instructions.

Whoops! It seemed I had won the lucky draw! My project uses a lot of compile options:

... -g -O2 -fopenmp -fprofile-arcs -ftest-coverage ...

so I had to narrow them down to find the root cause. First, I used only -fopenmp, and everything was OK. Second, I added -g -O2, and there was still no problem. ... After trying the combinations, I found that -fopenmp together with -fprofile-arcs triggers the problem.

To confirm it, I wrote a simple program:

int main(void) {
    #pragma omp taskloop
    for (int i = 0; i < 2; i++) {
    }
    return 0;
}

Compile it:

# gcc -fopenmp -fprofile-arcs parallel.c
parallel.c:6:1: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <> for instructions.

Yeah, the bug was reproduced, which confirmed my assumption.

To bypass this issue, I decided to try the newest gcc. My OS is ArchLinux, and the gcc version is 6.3.1:

# gcc -v
Using built-in specs.
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl= --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 6.3.1 20170306 (GCC)

ArchLinux doesn’t provide a gcc 7.1 package yet, so I needed to download and build it myself:

# wget
# tar xvf gcc-7.1.0.tar.gz
# cd gcc-7.1.0/
# mkdir build
# cd build

Selecting the configuration options is a headache for me, so I decided to copy the options used for 6.3.1 (please refer to the gcc -v output above):

# ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info .....

Since I don’t need to compile ada and lto, I removed these from --enable-languages:

--enable-languages=c,c++,fortran,go,objc,obj-c++

Besides this, I also needed the isl library, either built and installed by hand or installed through the ArchLinux isl package. Once the configuration succeeded, I could build and install gcc:

# make
# make install

Check the newest gcc:

# gcc -v
Using built-in specs.
Target: x86_64-pc-linux-gnu
Configured with: ../configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl= --enable-languages=c,c++,fortran,go,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 7.1.0 (GCC)

Compile the program again:

# gcc -fopenmp -fprofile-arcs parallel.c

This time the compilation succeeded!

P.S. gcc may have other bugs in its support for newer OpenMP directives, so keep an eye out for them.
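
If upgrading the compiler is not an option, one possible workaround is to skip the taskloop pragma on the affected compilers. The following is only a sketch: the version bound (gcc older than 7) is an assumption based on the two versions tried above, and USE_TASKLOOP and square_all are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the internal compiler error only affects gcc releases
 * older than 7, as the 6.3.1 vs 7.1.0 comparison above suggests.
 * Adjust the bound if your compiler behaves differently. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ < 7)
#define USE_TASKLOOP 0
#else
#define USE_TASKLOOP 1
#endif

/* Hypothetical helper: stores i * i into out[i] for 0 <= i < n. */
void square_all(int *out, int n)
{
    assert(out != NULL && n >= 0);
#pragma omp parallel
#pragma omp single
    {
#if USE_TASKLOOP
#pragma omp taskloop
#endif
        for (int i = 0; i < n; i++) {
            out[i] = i * i;
        }
    }
}
```

On a compiler where USE_TASKLOOP is 0, the loop simply runs on the single thread that enters the single region, so the result is the same but without task parallelism.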

The pitfalls of using OpenMP parallel for-loops

According to this discussion:

#pragma omp parallel for
for (...)

is shorthand for

#pragma omp parallel
#pragma omp for
    for (...)

so “#pragma omp parallel for” looks more convenient. But there are some pitfalls you should pay attention to:

(1) You can’t assume the number of threads equals the loop’s iteration count, even when that count is very small. For example (the machine has only 2 cores):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Build and run this program:

# gcc -fopenmp parallel.c
# ./a.out
thread is 0
thread is 0
thread is 0
thread is 1
thread is 1

We can see only 2 threads are created. Run it on another machine with 32 cores:

# ./a.out
thread is 1
thread is 0
thread is 2
thread is 4
thread is 3

This time the 5 iterations run on 5 different threads.
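
The 3/2 split on the 2-core machine and the one-iteration-per-thread split on the 32-core machine are both consistent with a static schedule that hands out contiguous blocks of iterations. The sketch below models that split. It assumes the loop is scheduled statically with no chunk size, which the OpenMP specification leaves to the implementation, and static_block is a hypothetical helper, not a libgomp function.

```c
#include <assert.h>

/* Hypothetical helper: computes the half-open range [*begin, *end) of
 * iterations that thread `tid` would run when `n` iterations are split
 * into contiguous blocks across `nthreads` threads, with the first
 * (n % nthreads) threads each taking one extra iteration. */
void static_block(int n, int nthreads, int tid, int *begin, int *end)
{
    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    int base = n / nthreads;   /* iterations every thread gets */
    int extra = n % nthreads;  /* leftover iterations to spread out */
    *begin = tid * base + (tid < extra ? tid : extra);
    *end = *begin + base + (tid < extra ? 1 : 0);
}
```

Under this model, 5 iterations on 2 threads give thread 0 the range [0, 3) and thread 1 the range [3, 5), while on 32 threads only threads 0 to 4 receive a non-empty range.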

(2) Use the num_threads clause to modify the program as follows:

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }
    return 0;
}

Rebuild and run it on the original 2-core machine:

# gcc -fopenmp parallel.c
# ./a.out
thread is 2
thread is 3
thread is 4
thread is 1
thread is 0

We can see that this time 5 threads are created. But note that the actual thread count depends on the system resources. E.g., change the code like this:

#pragma omp parallel for num_threads(30000)
    for (int i = 0; i < 30000; i++) {
        printf("thread is %d\n", omp_get_thread_num());
    }

Execute it:

# ./a.out

libgomp: Thread creation failed: Resource temporarily unavailable

So we should keep an eye on the number of threads we ask for.
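
One way to avoid asking for more threads than the machine can reasonably provide is to cap the request. The sketch below caps it at the processor count reported by omp_get_num_procs(). Both the cap policy and the capped_threads name are assumptions made for illustration, and the function falls back to 1 when the file is built without -fopenmp.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: caps a requested team size at the number of
 * processors OpenMP reports. Returns 1 when built without OpenMP. */
int capped_threads(int wanted)
{
    assert(wanted > 0);
#ifdef _OPENMP
    int procs = omp_get_num_procs();
#else
    int procs = 1;
#endif
    return wanted < procs ? wanted : procs;
}
```

Since the num_threads clause accepts an integer expression, the result can be passed directly, e.g. num_threads(capped_threads(30000)).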

P.S. If the iteration count is smaller than the core count, the number of threads in the team still equals the core count (still on the 32-core machine):

#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        if (0 == omp_get_thread_num()) {
            printf("thread number is %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

The output is:

thread number is 32

(3) If you use a C++ thread_local variable, you should take care:

#include <omp.h>
#include <stdio.h>

int main(void) {
    thread_local int array[5] = {0};
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < 5; i++) {
        array[i] = i + 1;
    }

    for (int i = 0; i < 5; i++) {
        printf("array[%d] is %d\n", i, array[i]);
    }
    return 0;
}

Compile and run:

# g++ -fopenmp parallel.cpp
# ./a.out
array[0] is 1
array[1] is 0
array[2] is 0
array[3] is 0
array[4] is 0

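We can see only the first element is changed, so it must be thread 0’s work: every thread gets its own copy of a thread_local array, and the printing loop runs on thread 0, which sees only its own copy. Remove the thread_local qualifier and rebuild. This time you get the wanted result: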

# ./a.out
array[0] is 1
array[1] is 2
array[2] is 3
array[3] is 4
array[4] is 5
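
The difference between the two runs can be made explicit by writing to a thread-local array and a shared array from the same loop. The following is a minimal sketch in C (using C11 _Thread_local rather than C++ thread_local). The names per_thread, fill and visible_per_thread are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 5

/* One copy of this array exists per thread. */
static _Thread_local int per_thread[N];

/* Hypothetical helper: fills both arrays from a parallel loop. Only the
 * writes to `shared` are guaranteed to be visible to the caller. */
void fill(int *shared)
{
    assert(shared != NULL);
#pragma omp parallel for num_threads(5)
    for (int i = 0; i < N; i++) {
        per_thread[i] = i + 1;  /* lands in the running thread's own copy */
        shared[i] = i + 1;      /* lands in the single array the caller owns */
    }
}

/* Hypothetical helper: counts the elements set in the calling thread's
 * own copy of per_thread. */
int visible_per_thread(void)
{
    int count = 0;
    for (int i = 0; i < N; i++) {
        count += (per_thread[i] != 0);
    }
    return count;
}
```

After calling fill(), the shared array always holds 1 to 5, whereas visible_per_thread() only reports the elements that the calling thread itself happened to write, which depends on how the iterations were distributed.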

Switch to https when git protocol doesn’t work

For many reasons (such as a firewall), you may not be able to clone from the remote server’s git port (9418 by default). For example:

# git clone git:// -b perf/core
Cloning into 'linux'...

The captured packets show that the TCP connection is established, but then there is no response at all! You can switch to the https or http protocol; it may save your life:

# git clone -b perf/core
Cloning into 'linux'...
POST git-upload-pack (gzip 25015 to 12570 bytes)
remote: Counting objects: 5287534, done.

Please refer to the discussion here.

Use perf and FlameGraph to profile program on Linux

In most Linux environments, the perf tools should be available by default. Otherwise, you can install them manually. E.g., on ArchLinux:

# pacman -S perf

Use the following program as an example (it is adapted from here, and you should focus only on the structure of the code):

# cat test.cpp
#include <NTL/ZZX.h>

using namespace std;
using namespace NTL;

void inner(int i, ZZX& t, Vec<ZZX>& phi)
{
        for (long j = 1; j <= i-1; j++)
                if (i % j == 0)
                        t *= phi(j);
}

void outer(int i, Vec<ZZX>& phi)
{
        ZZX t;
        t = 1;
        inner(i, t, phi);
        phi(i) = (ZZX(INIT_MONO, i) - 1)/t;
        cout << phi(i) << "\n";
}

int main()
{
        Vec<ZZX> phi(INIT_SIZE, 100);

        for (long i = 1; i <= phi.length(); i++) {
                outer(i, phi);
        }
}

Compile it:

# g++ -g -O2 -pthread test.cpp -lntl -lgmp

Using the -g -O2 options is suggested: -g provides the debug information perf needs, and -O2 enables compiler optimizations, so the profile reflects an optimized build.

Use perf record to sample the program:

# perf record --call-graph dwarf ./a.out
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.318 MB (38 samples) ]

To profile an already running program, use the -p pid flag. A perf.data file will be generated in the current directory, and you can use the perf report command to parse it:

# perf report

Detailed information for every function will be shown:


Another awesome tool is FlameGraph, which is used to visualize sampled call stacks:

# git clone --depth 1
# cd FlameGraph

Copy into current directory:

# cp ../ ./

Execute the following command:

# perf script | ./ |./ > perf.svg

The perf.svg is like this:


You can see the complete call stacks and the share of time each function consumes.

P.S., the full code is here.

Use “.cu” as the file extension when playing with Thrust

Today, I tried a simple Thrust program:

$ cat a.c
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(void) {
        // H has storage for 4 integers
        thrust::host_vector<int> H(4);

        // initialize individual elements
        H[0] = 14;
        H[1] = 20;
        H[2] = 38;
        H[3] = 46;

        // H.size() returns the size of vector H
        std::cout << "H has size " << H.size() << std::endl;

        // print contents of H
        for(int i = 0; i < H.size(); i++)
                std::cout << "H[" << i << "] = " << H[i] << std::endl;

        // resize H
        H.resize(2);

        std::cout << "H now has size " << H.size() << std::endl;

        // Copy host_vector H to device_vector D
        thrust::device_vector<int> D = H;

        // elements of D can be modified
        D[0] = 99;
        D[1] = 88;

        // print contents of D
        for(int i = 0; i < D.size(); i++)
                std::cout << "D[" << i << "] = " << D[i] << std::endl;

        // H and D are automatically deleted when the function returns
        return 0;
}

Build it:

$ nvcc -arch=sm_37 a.c
In file included from a.c:1:0:
/opt/cuda/bin/..//include/thrust/host_vector.h:25:18: fatal error: memory: No such file or directory
compilation terminated.

It seemed very weird! After scanning Thrust’s FAQ, I came across the following tip:

Make sure that files that #include Thrust have a .cu extension. Other extensions (e.g., .cpp) will cause nvcc to treat the file incorrectly and produce an error message.

I renamed the source file and rebuilt it:

$ mv a.c
$ nvcc -arch=sm_37
$ ./a.out
H has size 4
H[0] = 14
H[1] = 20
H[2] = 38
H[3] = 46
H now has size 2
D[0] = 99
D[1] = 88

Worked like a charm!