nanxiao | 我的站点

CUDA编程笔记（1）——Parallelism

There are two fundamental types of parallelism in applications:
➤ Task parallelism
➤ Data parallelism
Task parallelism arises when there are many tasks or functions that can be operated independently and largely in parallel. Task parallelism focuses on distributing functions across multiple cores.

Data parallelism arises when there are many data items that can be operated on at the same time. Data parallelism focuses on distributing the data across multiple cores.

CUDA programming is especially well-suited to address problems that can be expressed as data parallel computations. Many applications that process large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads.

There are two basic approaches to partitioning data:
➤ Block: Each thread takes one portion of the data, usually an equal portion of the data.
➤ Cyclic: Each thread takes more than one portion of the data.

简而言之，block就是按线程数等分数据，10个线程就把数据分成10份，一个线程处理一份；而cyclic则是数据的份数大于线程数，举个例子，10个线程把数据分成20份，第一个线程处理第1，11份，第二个线程处理第2，12份。。。。。。，循环处理多次。

HP/HPE公司（即通常说的惠普公司，因其在2015年已经拆分成HP和HPE两家独立运营公司，且拆分后是由HPE延续操作系统的相关工作，所以在这里使用HP/HPE。）拥有自己的Unix操作系统：HP-UX。以前中国是有团队参与HP-UX的相关工作：功能开发，Unix认证等等，现在相应的工作应该都转到印度了。目前HP-UX应该在一些银行，电信系统还在使用，不过的确是很难见到了。可以通过Wikipedia来了解HP-UX的一些信息。

再来说一下Linux，其实以前HP/HPE公司有一个很大的Linux团队，其甚至有能力做出自己的Linux发行版：

此外，这个团队也曾经是Linux kernel的一个很重要的贡献者。不过，随着这些年公司的战略调整，这个团队的绝大部分工程师都已经离开了，其中的很多人加盟了其它公司，继续为Linux贡献着力量。目前HP/HPE在Linux上的工作重心侧重在同Linux厂商的合作，譬如今年与SuSE的合作（详情请参考Sweet SUSE! HPE snags itself a Linux distro）。

Linux操作系统的pstack工具

Solaris操作系统提供了pstack工具，用来打印运行程序的线程堆栈信息。RedHat公司发行的Linux操作系统（RHEL，CentOS等等）也提供了pstack工具，只要安装gdb：

# yum install gdb

就会把pstack也一并安装成功。

首先看一下pstack：

# which pstack
/usr/bin/pstack
# ls -lt /usr/bin/pstack
lrwxrwxrwx. 1 root root 6 Nov 19 06:32 /usr/bin/pstack -> gstack

可以看出pstack实际上只是一个指向了gstack的符号链接。再看一下gstack：

# cat /usr/bin/gstack
#!/bin/sh

if test $# -ne 1; then
    echo "Usage: `basename $0 .sh` <process-id>" 1>&2
    exit 1
fi

if test ! -r /proc/$1; then
    echo "Process $1 not found." 1>&2
    exit 1
fi

# GDB doesn't allow "thread apply all bt" when the process isn't
# threaded; need to peek at the process to determine if that or the
# simpler "bt" should be used.

backtrace="bt"
if test -d /proc/$1/task ; then
    # Newer kernel; has a task/ directory.
    if test `/bin/ls /proc/$1/task | /usr/bin/wc -l` -gt 1 2>/dev/null ; then
    backtrace="thread apply all bt"
    fi
elif test -f /proc/$1/maps ; then
    # Older kernel; go by it loading libpthread.
    if /bin/grep -e libpthread /proc/$1/maps > /dev/null 2>&1 ; then
    backtrace="thread apply all bt"
    fi
fi

GDB=${GDB:-/usr/bin/gdb}

# Run GDB, strip out unwanted noise.
# --readnever is no longer used since .gdb_index is now in use.
$GDB --quiet -nx $GDBARGS /proc/$1/exe $1 <<EOF 2>&1 |
set width 0
set height 0
set pagination no
$backtrace
EOF
/bin/sed -n \
    -e 's/^\((gdb) \)*//' \
    -e '/^#/p' \
    -e '/^Thread/p'

可以看到gstack仅仅是一个shell脚本。简单浏览一下这个脚本：

（1）

if test $# -ne 1; then
    echo "Usage: `basename $0 .sh` <process-id>" 1>&2
    exit 1
fi

脚本要求一个参数：进程ID。

（2）

if test ! -r /proc/$1; then
    echo "Process $1 not found." 1>&2
    exit 1
fi

通过检测/proc目录下进程子目录是否可读，来查看相应进程是否存在。

（3）

# GDB doesn't allow "thread apply all bt" when the process isn't
# threaded; need to peek at the process to determine if that or the
# simpler "bt" should be used.

backtrace="bt"
if test -d /proc/$1/task ; then
    # Newer kernel; has a task/ directory.
    if test `/bin/ls /proc/$1/task | /usr/bin/wc -l` -gt 1 2>/dev/null ; then
    backtrace="thread apply all bt"
    fi
elif test -f /proc/$1/maps ; then
    # Older kernel; go by it loading libpthread.
    if /bin/grep -e libpthread /proc/$1/maps > /dev/null 2>&1 ; then
    backtrace="thread apply all bt"
    fi
fi

如果进程只有一个线程，那么使用gdb的“bt”命令打印线程堆栈信息，否则使用“thread apply all bt”命令。

（4）

GDB=${GDB:-/usr/bin/gdb}

# Run GDB, strip out unwanted noise.
# --readnever is no longer used since .gdb_index is now in use.
$GDB --quiet -nx $GDBARGS /proc/$1/exe $1 <<EOF 2>&1 |
set width 0
set height 0
set pagination no
$backtrace
EOF
/bin/sed -n \
    -e 's/^\((gdb) \)*//' \
    -e '/^#/p' \
    -e '/^Thread/p'

最后调用gdb，使用“bt”或“thread apply all bt”命令，并把输出重定向到sed工具，由sed工具打印出线程堆栈信息。

最后看一个使用pstack的例子：

# pstack 707
Thread 3 (Thread 0x7f69600d8700 (LWP 713)):
#0  0x00007f6968af269d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6969027a84 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2  0x00007f6969027bac in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3  0x00007f6969027be9 in glib_worker_main () from /lib64/libglib-2.0.so.0
#4  0x00007f696904d4f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5  0x00007f696af9fdc5 in start_thread (arg=0x7f69600d8700) at pthread_create.c:308
#6  0x00007f6968afcced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 2 (Thread 0x7f695eec3700 (LWP 716)):
#0  0x00007f6968af269d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6969027a84 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2  0x00007f6969027dca in g_main_loop_run () from /lib64/libglib-2.0.so.0
#3  0x00007f6969641336 in gdbus_shared_thread_func () from /lib64/libgio-2.0.so.0
#4  0x00007f696904d4f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5  0x00007f696af9fdc5 in start_thread (arg=0x7f695eec3700) at pthread_create.c:308
#6  0x00007f6968afcced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Thread 1 (Thread 0x7f696c5738c0 (LWP 707)):
#0  0x00007f6968af269d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6969027a84 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2  0x00007f6969027dca in g_main_loop_run () from /lib64/libglib-2.0.so.0
#3  0x0000560a080a80a3 in main ()

如果使用的Linux发行版没有pstack这个工具，可以考虑直接把gstack脚本拷贝过去。

求解最长增加子数列

求最长增加子数列是一道经典算法题，其动态规划解法最简单，但是时间复杂度是O(n²)。stackoverflow上有一个O(nlogn)解法，很是巧妙：

Now the improvement happens at the second loop, basically, you can improve the speed by using binary search. Besides the array dp[], let’s have another array c[], c is pretty special, c[i] means: the minimum value of the last element of the longest increasing sequence whose length is i.

sz = 1;
c[1] = array[0]; /*at this point, the minimum value of the last element of the size 1 increasing sequence must be array[0]*/
dp[0] = 1;
for( int i = 1; i < len; i++ ) {
   if( array[i] < c[1] ) {
  c[1] = array[i]; /*you have to update the minimum value right now*/
  dp[i] = 1;
   }
   else if( array[i] > c[sz] ) {
  c[sz+1] = array[i];
  dp[i] = sz+1;
  sz++;
   }
   else {
  int k = binary_search( c, sz, array[i] ); /*you want to find k so that c[k-1]<array[i]<c[k]*/
  c[k] = array[i];
  dp[i] = k;
   }
}

关键是理解这个c数组，其下标用来表示增加子数列的长度，而值则是这个长度的增加子数列中结尾元素最小的值。比如有两个相同长度的增加子数列：1,2,3和1,2,5，则c[3]的值为3，因为一旦将来有4出现，就可以把1,2,3扩展为1,2,3,4，而无法再把1,2,5扩展。完整的Go程序如下：

package main
import (
    "fmt"
    "os"
)

func bSearch(array []int, start int, end int, value int) int {
    for start <= end {
        mid := start + (end-start)/2
        if array[mid] == value {
            return mid
        } else if array[mid] < value {
            if mid+1 <= end && array[mid+1] > value {
                return mid+1
            } else {
                start = mid+1
            }            
        } else {
            if mid-1 >= start && array[mid-1] < value {
                return mid
            } else {
                end = mid-1
            }
        }
    }
    return start
}
func main() {
 //Enter your code here. Read input from STDIN. Print output to STDOUT
    var num int
    _, err := fmt.Scan(&num)
    if err != nil || num == 0{
        os.Exit(1)
    }
    s := make([]int, num)
    m := make([]int, num)
    e := make([]int, num+1)

    for i := 0; i < num; i++ {
        _, err := fmt.Scan(&s[i])
        if err != nil {
            os.Exit(1)
        }
    }

    m[0] = 1
    max := m[0]
    e[1] = s[0]
    for i := 1; i < num; i++ {
        if s[i] < e[1] {
            e[1] = s[i]
            m[i] = 1
        } else if s[i] > e[max] {
            max++
            e[max] = s[i]
            m[i] = max
        } else {
            k := bSearch(e, 1, max, s[i])
            e[k] = s[i]
            m[i] = k
        }
    }

    fmt.Println(max)
}

递归VS非递归

递归是一种很常见的解决问题办法，比如求解Fibonacci数列：

int Fibonacci(int n) {
    if (n <= 1) {
        return n;
    } else {
        return Fibonacci(n - 1) + Fibonacci(n - 2);
    }
}

但是递归会导致函数调用栈很深，此外有可能会有很多重复工作。比如求解Fibonacci(4)时，Fibonacci(2)，Fibonacci(1)都会被重复求值。看一下非递归解法：

int Fibonacci(int n) {
    if (n <= 1) {
        return n;
    } else {
        int fi = 0;
        int fj = 1;
        for (int i = 2; i <= n; i++) {
            int temp = fi + fj;
            fi = fj;
            fj = temp;
        }
        return fj;
    }
}

同递归方式相比，用循环迭代的方式取代了函数调用。

最后再看一下wikipedia中关于Binary search tree中查找某一元素的递归和非递归代码。
递归：

def search_recursively(key, node):
    if node is None or node.key == key:
        return node
    elif key < node.key:
        return search_recursively(key, node.left)
    else:  # key > node.key
    return search_recursively(key, node.right)

非递归：

def search_iteratively(key, node): 
    current_node = node
    while current_node is not None:
        if key == current_node.key:
            return current_node
        elif key < current_node.key:
            current_node = current_node.left
        else:  # key > current_node.key:
            current_node = current_node.right
    return None

分析求子数组最大值问题

求子数组的最大值是一道经典dynamic programming题，解法如下（参考这里）：

public int maxSubArray(int[] A) {
   int newsum=A[0];
   int max=A[0];
   for(int i=1;i<A.length;i++){
       newsum=Math.max(newsum+A[i],A[i]);
       max= Math.max(max, newsum);
   }
   return max;
}

理解这个算法的关键在于：求出数组中以每个元素为子数组的最后一个元素的最大值（上述代码中newsum），这些最大值中的最大者即为解（上述代码中max）。分析如下：从第一个元素A[0]起，newsum和max均为A[0]。而对下一个元素A[1]，以A[1]为子数组的最后一个元素的最大值或者是A[0]+A[1]（A[0]大于0），或是A[1]，取两者最大值。接下来再看A[2]，以A[2]为子数组的最后一个元素的最大值是A[0]+A[1]+A[2]，A[1]+A[2]和A[2]三者之间的最大值。以此类推。。。

C语言中使用scanf的一些注意事项

（1）scanf读取double类型数据应该使用%lf；而使用printf打印double时，要注意%lf用在printf中是C99才支持的，因此如果编译器不支持C99则要使用%f：

double d;
scanf("%lf", &d);
printf("%f", d);

参考资料：
Reading in double values with scanf in c；
Why does scanf() need “%lf” for doubles, when printf() is okay with just “%f”?。

（2）

char str[10];
scanf ("%[^\n]%*c", str);

%[^\n]含义是从stdin读取输入保存到str，直到遇到第一个\n；而%*c则丢弃掉这个\n。
参考资料：
scanf: “%[^\n]” skips the 2nd input but “ %[^\n]” does not. why?；
What does scanf(“%*[^\n]%*c”) mean?。

（3）限制输入字符串的长度，预留结尾的NUL：

char str[10]
scanf("%9s", str);

由一道题理解DFS算法

GENERATE PARENTHESES这篇文章讲解了如何使用Depth First Search，即DFS算法生成所有有效的括号组合：

void generateParentheses(int n) {
    dfs("", n, n);
}
void dfs(String s, int left, int right) // build the set of solutions using DFS approach
{
    if(left == 0 && right == 0) // BASE CASE: there is no more parentheses to add? we have a solution!
        System.out.println(s);
    if(left > 0) // while we have left parentheses to add, just add them
        dfs(s + "(", left - 1, right); // we call our function recursively with a left parentheses added
    if(right > left) // We are gonna add a right parentheses if we have more right parentheses than left ones
        dfs(s + ")", left, right - 1);
}

由此我也总结了一下DFS算法：
（1）DFS算法也是使用递归来解决问题；
（2）

if(left == 0 && right == 0) // BASE CASE: there is no more parentheses to add? we have a solution!
    System.out.println(s);

上述代码是递归的终止条件。
（3）

if(left > 0) // while we have left parentheses to add, just add them
    dfs(s + "(", left - 1, right); // we call our function recursively with a left parentheses added
if(right > left) // We are gonna add a right parentheses if we have more right parentheses than left ones
    dfs(s + ")", left, right - 1);

每调用一次dfs()函数，会为字符串添加(或)。

if(left > 0) // while we have left parentheses to add, just add them
    dfs(s + "(", left - 1, right); // we call our function recursively with a left parentheses added

会递归地生成当前位置是(的所有情况。if(right > left)分支则会生成当前位置是)的所有情况。

strace命令介绍

strace是Linux上的一个很好用的工具，它可以用来输出程序在运行过程中发生的系统调用以及收到的信号的相关信息，因此在调试和诊断问题时有很大的帮助，特别是在程序没有源码，或是在前期做一些粗略的分析时。strace命令格式如下：

strace [options] command [args]

举个例子：

＃ strace sleep 300
execve("/usr/bin/sleep", ["sleep", "300"], [/* 24 vars */]) = 0
brk(0)                                  = 0x22fa000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f70d1ef8000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
......
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=20243, si_uid=0} ---
......

从上面例子可以看出，对于系统调用，比如open，access，strace都会输出详细的参数和返回值，如果发生了错误，也会输出细致的错误信息。而对于接收到的信号，除了输出信息外，还要注意信号信息的前后都加了“---”，以示与系统调用的区别。

以下是一些常用的选项：
（1）-o：把strace执行结果输出到指定文件里：

# strace -o out ls

（2）-t：打印时间：

# strace -t ls
10:30:07 execve("/usr/bin/ls", ["ls"], [/* 24 vars */]) = 0
10:30:07 brk(0)
......

(3）-e：只关注某一系统调用：

# strace -e open ls
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
......

（4）-y：显示和文件描述符关联的文件路径：

# strace -y ls
......
fstat(3</etc/ld.so.cache>, {st_mode=S_IFREG|0644, st_size=32951, ...}) = 0
mmap(NULL, 32951, PROT_READ, MAP_PRIVATE, 3</etc/ld.so.cache>, 0) = 0x7fba3db13000
close(3</etc/ld.so.cache>)              = 0
......

（5）-f：追踪运行进程所生成的子进程。

参考资料：
strace(1) – Linux man page；
A swiss army knife of debugging tools。

什么是semaphore？

以下摘自The Little Book of Semaphores：

A semaphore is like an integer, with three differences:
1. When you create the semaphore, you can initialize its value to any integer, but after that the only operations you are allowed to perform are increment (increase by one) and decrement (decrease by one). You cannot read the current value of the semaphore.
2. When a thread decrements the semaphore, if the result is negative, the thread blocks itself and cannot continue until another thread increments the semaphore.
3. When a thread increments the semaphore, if there are other threads waiting, one of the waiting threads gets unblocked.
To say that a thread blocks itself (or simply “blocks”) is to say that it notifies the scheduler that it cannot proceed. The scheduler will prevent the thread from running until an event occurs that causes the thread to become unblocked. In the tradition of mixed metaphors in computer science, unblocking is often called “waking”.
That’s all there is to the definition, but there are some consequences of the definition you might want to think about.
• In general, there is no way to know before a thread decrements a semaphore whether it will block or not (in specific cases you might be able to prove that it will or will not).
• After a thread increments a semaphore and another thread gets woken up, both threads continue running concurrently. There is no way to know which thread, if either, will continue immediately.
• When you signal a semaphore, you don’t necessarily know whether another thread is waiting, so the number of unblocked threads may be zero or one.
Finally, you might want to think about what the value of the semaphore means. If the value is positive, then it represents the number of threads that can decrement without blocking. If it is negative, then it represents the number of threads that have blocked and are waiting. If the value is zero, it means there are no threads waiting, but if a thread tries to decrement, it will block.

semaphore的初始值为1，通常就用来构建mutex。

作者： nanxiao