My Site

Random scribbles of a systems software engineer


A Brief Look at the Linux Thread Model

Linux threads are "light-weight processes" (LWPs). When a program is run on Linux, the operating system creates a process for it, which is effectively the "main thread"; that thread can then spawn further threads. Every process has a PID (Process ID) and every thread has a TID (Thread ID). Threads belonging to the same process each have a distinct TID but share the same PID, which equals the TID of the main thread. In essence, therefore, there is no real difference between processes and threads on Linux, except that the threads of a process share certain resources. Consider an example:

#include <unistd.h>
#include <omp.h>

int main(void){

        /* spawn 4 OpenMP threads; each runs the infinite loop below */
        #pragma omp parallel num_threads(4)
        for(;;)
        {
            sleep(1);
        }

        return 0;
}

Compile the program and run it in the background:

$ gcc -fopenmp threads.c
$ ./a.out &
[1] 9802

The PID of the process is 9802. Use the ps -T pid command to view the process's thread information:

$ ps -T 9802
  PID  SPID TTY      STAT   TIME COMMAND
 9802  9802 pts/1    Sl     0:00 ./a.out
 9802  9803 pts/1    Sl     0:00 ./a.out
 9802  9804 pts/1    Sl     0:00 ./a.out
 9802  9805 pts/1    Sl     0:00 ./a.out

Here the SPID column is the TID. As you can see, the process's PID is 9802 and it contains four threads, whose TIDs are 9802, 9803, 9804, and 9805; the thread whose SPID equals the PID is the main thread.
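The same relationship can be observed from inside a program. The following is a minimal sketch of mine (not part of the original example) that prints the PID and the main thread's TID; it assumes a Linux system and reaches gettid via syscall(2), which works even on glibc versions that do not expose a gettid() wrapper:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* for the main thread, the TID equals the PID */
    pid_t pid = getpid();
    pid_t tid = (pid_t)syscall(SYS_gettid);
    printf("PID = %d, TID = %d\n", (int)pid, (int)tid);
    return 0;
}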

CUDA Programming Notes (16): Shared Memory

This note is excerpted from Professional CUDA C Programming.

Global memory is large, on-board memory and is characterized by relatively high latencies. Shared memory is smaller, low-latency on-chip memory that offers much higher bandwidth than global memory. You can think of it as a program-managed cache. Shared memory is generally useful as:
➤ An intra-block thread communication channel
➤ A program-managed cache for global memory data
➤ Scratch pad memory for transforming data to improve global memory access patterns

Shared memory is partitioned among all resident thread blocks on an SM; therefore, shared memory is a critical resource that limits device parallelism. The more shared memory used by a kernel, the fewer possible concurrently active thread blocks.
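To make this concrete, here is a minimal kernel sketch of mine (not from the book) that uses shared memory as an intra-block communication channel: every thread stages one element into a shared tile, the block synchronizes, and each thread then reads an element written by another thread. The kernel name and sizes are illustrative:

#include <stdio.h>

#define N 64  /* one block of N threads */

/* Stage data in shared memory, synchronize, then write it back in
 * reversed order; the shared tile is what lets threads exchange data. */
__global__ void reverseInBlock(int *d_out, const int *d_in)
{
    __shared__ int tile[N];          /* per-block, on-chip storage */

    int t = threadIdx.x;
    tile[t] = d_in[t];               /* each thread loads one element */
    __syncthreads();                 /* wait until the tile is fully populated */

    d_out[t] = tile[N - 1 - t];      /* read an element another thread wrote */
}

int main(void)
{
    int h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, N * sizeof(int));
    cudaMalloc((void **)&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d_out, d_in);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[0] = %d\n", h_out[0]);  /* expect 63 */

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}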


2016 Year-End Review

2016 is over, and it was truly an unforgettable year for me. I went through a great many "firsts" and gained much that was new. On this first day of 2017, let me look back properly on a year of ups and downs.

First, I opened a "Monthly Brief" column on my personal blog to summarize each month's experiences. It gives me a clearer view of how the past month went and shows me where to improve.

In January I launched a new WeChat official account: Unix. At first I posted a small tip every day, but that format did not really seem to help people much, so after a pause I returned to publishing original articles. So far it carries fewer than ten pieces; I will not claim they are well written, only that each was written with care.

My DTrace account published nothing of much value this year, mainly because my current work hardly touches DTrace, so I had no good material to share.

In July I took the IELTS exam for the first time in my life.

This year I wrote two beginner tutorials: one on the Go language, Go 101 Hacks, and one on FreeBSD, FreeBSD 101 Hacks.

From January to September I worked at company H, focusing on two areas: Docker performance testing and feature development for Swarm/Swarmkit. Unfortunately, the company restructured in September and our entire department was laid off. To its credit, company H was decent about it and paid adequate severance; it is at moments like these that you see whether a company is truly "humane".

From mid-September to December I was unemployed, for the first time in my life, and drifted through the days. When I used to work hard every day, I never felt especially tired; once idle, I became listless and even fell ill. It seems people really do need something to keep them busy.

In December I joined my current company and began a new job. So far it feels fine; nothing has been hard to adjust to.

I kept both my Chinese and English blogs updated throughout the year. The Chinese blog is mostly notes, while the English blog gained a few articles I am fairly happy with.

At the beginning of the year I went abroad for the first time, to Singapore, to see what a "foreign land" is like.

Well, that's it. A new year, a new start!

December 2016 Monthly Brief

This month I finally ended nearly three months of unemployment and started a new job.

Work:
I began to learn CUDA and GPU programming; I also picked Python back up and wrote a small program of some size.

Spare time (technical):
I wrote The Decline of Unix, a piece I put real care into.

Spare time (reading):
I read a pictorial about the Bayern Munich football club.

Life:
I attended two parties at friends' homes, made some new friends, and heard many interesting anecdotes from them.

CUDA Programming Notes (15): Memory Access Patterns

This note is excerpted from Professional CUDA C Programming.

Global memory loads/stores are staged through caches, as shown in Figure 4-6. Global memory is a logical memory space that you can access from your kernel. All application data initially resides in DRAM, the physical device memory. Kernel memory requests are typically served between the device DRAM and SM on-chip memory using either 128-byte or 32-byte memory transactions.

All accesses to global memory go through the L2 cache. Many accesses also pass through the L1 cache, depending on the type of access and your GPU’s architecture. If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction. On architectures that allow the L1 cache to be used for global memory caching, the L1 cache can be explicitly enabled or disabled at compile time.

An L1 cache line is 128 bytes, and it maps to a 128-byte aligned segment in device memory. If each thread in a warp requests one 4-byte value, that results in 128 bytes of data per request, which maps perfectly to the cache line size and device memory segment size.

There are two characteristics of device memory accesses that you should strive for when optimizing your application:
➤ Aligned memory accesses
➤ Coalesced memory accesses


Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity being used to service the transaction (either 32 bytes for L2 cache or 128 bytes for L1 cache). Performing a misaligned load will cause wasted bandwidth.

Coalesced memory accesses occur when all 32 threads in a warp access a contiguous chunk of memory.


Memory store operations are relatively simple. The L1 cache is not used for store operations on either Fermi or Kepler GPUs; store operations are only cached in the L2 cache before being sent to device memory. Stores are performed at a 32-byte segment granularity. Memory transactions can be one, two, or four segments at a time. For example, if two addresses fall within the same 128-byte region but not within an aligned 64-byte region, one four-segment transaction will be issued (that is, issuing a single four-segment transaction performs better than issuing two one-segment transactions).
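To illustrate the difference, here is a small sketch of mine (not from the book) that contrasts a perfectly coalesced copy with one shifted by an arbitrary offset; profiling the two kernels (for example with nvprof's gld_transactions metric) makes the extra transactions of the misaligned version visible. The kernel names and the offset value are illustrative:

__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];       /* each warp reads one aligned, contiguous segment */
}

__global__ void copyOffset(float *out, const float *in, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = i + offset;
    if (k < n) out[i] = in[k];       /* warp loads straddle segment boundaries */
}

int main(void)
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    copyCoalesced<<<n / 256, 256>>>(d_out, d_in, n);
    copyOffset<<<n / 256, 256>>>(d_out, d_in, n, 11);  /* shifted by 11 elements */
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}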


A Cross-Platform Scientific Calculator: SpeedCrunch

SpeedCrunch is a scientific calculator that runs on Windows, Mac OS X, and Linux; its home page is here: http://speedcrunch.org/. The interface is clean and simple.


It is quite powerful, too. If you are interested, give it a try!


Python Knowledge Base (1): Class Variables and Instance Variables

A good Python practice is to create instance variables in the __init__ method. They can of course be created elsewhere, but they must always be bound to a concrete instance. (Class variables, by contrast, are defined at class scope and shared by all instances.) For example:

class Base:
    def __init__(self):
        self.a = 10        # instance variable created in __init__

    def fun1(self):
        self.b = 20        # instance variable created in another method

p = Base()
p.fun1()
p.c = 30                   # instance variable attached directly to the instance

print(p.a, p.b, p.c)

Output:

10 20 30

References:
Must all Python instance variables be declared in def init?
python class instance variables and class variables

CUDA Programming Notes (14): Zero-Copy Memory

This note is excerpted from Professional CUDA C Programming.

In general, the host cannot directly access device variables, and the device cannot directly access host variables. There is one exception to this rule: zero-copy memory. Both the host and device can access zero-copy memory.

GPU threads can directly access zero-copy memory. There are several advantages to using zero-copy memory in CUDA kernels, such as:
➤ Leveraging host memory when there is insufficient device memory
➤ Avoiding explicit data transfer between the host and device
➤ Improving PCIe transfer rates
When using zero-copy memory to share data between the host and device, you must synchronize memory accesses across the host and device. Modifying data in zero-copy memory from both the host and device at the same time will result in undefined behavior.

There are two common categories of heterogeneous computing system architectures: integrated and discrete.

In integrated architectures, CPUs and GPUs are fused onto a single die and physically share main memory. In this architecture, zero-copy memory is more likely to benefit both performance and programmability because no copies over the PCIe bus are necessary.

For discrete systems with devices connected to the host via PCIe bus, zero-copy memory is advantageous only in special cases.

Because the mapped pinned memory is shared between the host and device, you must synchronize memory accesses to avoid any potential data hazards caused by multiple threads accessing the same memory location without synchronization.

Be careful not to overuse zero-copy memory. Device kernels that read from zero-copy memory can be very slow due to its high latency.
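For concreteness, here is a minimal sketch of mine (not from the book) showing the usual zero-copy pattern: allocate mapped pinned memory with cudaHostAlloc, obtain the device-side pointer with cudaHostGetDevicePointer, and synchronize before the host reads the result. The kernel and variable names are illustrative:

#include <stdio.h>

__global__ void incr(int *p)
{
    p[threadIdx.x] += 1;   /* device reads/writes host memory directly */
}

int main(void)
{
    int *h_p, *d_p;

    cudaSetDeviceFlags(cudaDeviceMapHost);  /* enable mapped pinned allocations */

    /* page-locked host memory, mapped into the device address space */
    cudaHostAlloc((void **)&h_p, 32 * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < 32; i++) h_p[i] = i;

    /* device-side pointer to the same allocation */
    cudaHostGetDevicePointer((void **)&d_p, h_p, 0);

    incr<<<1, 32>>>(d_p);
    cudaDeviceSynchronize();   /* synchronize before the host touches the data */

    printf("h_p[0] = %d\n", h_p[0]);  /* expect 1 */

    cudaFreeHost(h_p);
    return 0;
}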


CUDA Programming Notes (13): Pinned Memory

This note is excerpted from Professional CUDA C Programming.

Allocated host memory is by default pageable, that is, subject to page fault operations that move data in host virtual memory to different physical locations as directed by the operating system. Virtual memory offers the illusion of much more main memory than is physically available, just as the L1 cache offers the illusion of much more on-chip memory than is physically available.

The GPU cannot safely access data in pageable host memory because it has no control over when the host operating system may choose to physically move that data. When transferring data from pageable host memory to device memory, the CUDA driver first allocates temporary page-locked or pinned host memory, copies the source host data to pinned memory, and then transfers the data from pinned memory to device memory, as illustrated on the left side of Figure 4-4.

[Figure 4-4: data transfers with pageable vs. pinned host memory]

The CUDA runtime allows you to directly allocate pinned host memory using:
cudaError_t cudaMallocHost(void **devPtr, size_t count);
This function allocates count bytes of host memory that is page-locked and accessible to the device. Since the pinned memory can be accessed directly by the device, it can be read and written with much higher bandwidth than pageable memory. However, allocating excessive amounts of pinned memory might degrade host system performance, since it reduces the amount of pageable memory available to the host system for storing virtual memory data.


The Decline of Unix

A few days ago, rumors appeared online that Oracle would kill the Solaris project. The news remains unconfirmed, but the decline of the traditional Unix operating systems that Solaris represents is an indisputable fact. Last month, 498 of the world's 500 fastest supercomputers on the top500 list were running Linux.


Clearly, the present fortunes of Linux and Unix are worlds apart.

I don't know exactly what created Linux's current dominance, but I am certain it is not for technical reasons. I have never worked for IBM or touched AIX, so I have no standing to comment on it. My experience with the BSD family (FreeBSD, OpenBSD, NetBSD, and so on) is limited to installing and using them, without any deep insight. I did work for HP/HPE; although I never used HP-UX myself, many colleagues around me had worked on it, developing new features, doing Unix certification, and so on. According to them, HP-UX is extremely stable, and many environments with especially high stability requirements, such as telecoms and banks, still run it. Perhaps those businesses will gradually move to Linux? I don't know... As for Solaris, I spent more than four years doing full-time development on it. Solaris offers many very cool tools, such as mdb and DTrace; they helped my work enormously and greatly satisfied the curiosity of a low-level software engineer. Solaris is also famous for running stably; consider the Solaris machine that had been running continuously for ten years (image source: https://pbs.twimg.com/media/CjtxiOmWYAA5lHB.jpg).


Now look at Linux. For a long time it had no system tracing tool that could match DTrace; only with the recent maturing of BPF did that change, which means that in the tracing field Linux trailed Solaris by a full twelve years (see the article Linux in 2016 catches up to Solaris from 2004). Likewise, the ZFS file system recently introduced into the Ubuntu distribution also came out of Solaris. So judged purely on technology, Linux is not necessarily better than Unix, and in some respects it is still behind.

Of the Unix systems mentioned above, all but the BSD family are the proprietary operating systems of traditional server hardware vendors. There was once the open-source OpenSolaris, but it was short-lived (to my mind, OpenSolaris's greatest legacy is the illumos kernel derived from it, and the Solaris-like systems built on illumos, such as SmartOS). So could it be that, as the Internet has grown ever stronger in recent years and the hardware vendors' fortunes have declined, these Unixes have suffered collateral damage? I think that is part of it. But attributing Linux's success solely to being "open source" also seems one-sided: the BSD systems are open source too, and their licenses are even more permissive (see: Comparing BSD and Linux). The real reason for Linux's dominance is genuinely hard to pin down.

I believe most small and medium-sized companies have moved entirely to Linux by now, for the most straightforward of reasons: hiring is easier. How many job postings have you seen that require familiarity with FreeBSD? Certainly fewer than those requiring Linux. Ones requiring NetBSD? Perhaps they exist, but I have never seen one. So the demand for Unix talent now lives mainly in large companies, which alone have the will and the means to keep working on these increasingly niche Unixes. For example, Brendan Gregg mentioned on his social media account, addressing Solaris engineers, that Netflix currently uses FreeBSD.


I miss the era of a dozen years ago, when operating systems "bloomed" in all their variety. Not that I hold any grudge against Linux; it is just that when every server runs the same uniform Linux, things start to feel monotonous, much as today's phones fall into essentially two camps: iOS and Android (Linux again). The world ought to be diverse and colorful, so I hope the other Unixes can one day stage a "revival"...

