Monitoring CPU usage with the vmstat command

The vmstat command can be used to monitor CPU usage. For example:

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 5201924   1328 5578060    0    0     0     0 1582 6952  2  1 98  0  0
 1  0      0 5200984   1328 5577996    0    0     0     0 2020 20567  9  1 90  0  0
 0  0      0 5198668   1328 5577952    0    0     0     0 1568 7617  5  1 94  0  0
 0  0      0 5194844   1328 5578000    0    0     0   187 1249 7057  1  1 98  0  0
 0  0      0 5199956   1328 5578232    0    0     0     0 1496 7306  4  1 95  0  0

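The command above prints the system state once per second; the last five columns describe the CPU. The man page explains the meaning of these five columns clearly: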

CPU
       These are percentages of total CPU time.
       us: Time spent running non-kernel code.  (user time, including nice time)
       sy: Time spent running kernel code.  (system time)
       id: Time spent idle.  Prior to Linux 2.5.41, this includes IO-wait time.
       wa: Time spent waiting for IO.  Prior to Linux 2.5.41, included in idle.
       st: Time stolen from a virtual machine.  Prior to Linux 2.6.11, unknown.

vmstat essentially obtains the system state from the /proc/stat file:

# cat /proc/stat
cpu  381584 711 299364 1398303520 429839 0 251 0 0 0
cpu0 90740 58 44641 174627550 131209 0 120 0 0 0
cpu1 43141 26 22925 174746812 108219 0 10 0 0 0
cpu2 41308 35 25097 174831161 25877 0 40 0 0 0
cpu3 39301 70 27514 174836084 27792 0 4 0 0 0
cpu4 39187 78 46191 174750027 109013 0 0 0 0 0
......

Note that these counters are measured in jiffies (ticks of USER_HZ, 1/100 of a second on most architectures).
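As a rough sketch of what reading these counters involves, the Python snippet below parses the aggregate `cpu` line shown above and turns the tick deltas between two snapshots into percentages. The field names follow proc(5); the helper names `parse_cpu_line` and `cpu_percentages` and the second snapshot are invented for illustration, and this is not vmstat's actual code.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(line):
    """Turn one 'cpu ...' line of /proc/stat into a {field: ticks} dict."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:])))

def cpu_percentages(before, after):
    """Unrounded percentage of time per field between two snapshots."""
    # Guest time is already counted inside user time, so it is left out
    # of the total to avoid counting it twice.
    keys = FIELDS[:8]
    delta = {k: after[k] - before[k] for k in keys}
    total = sum(delta.values()) or 1
    return {k: 100.0 * v / total for k, v in delta.items()}

first = parse_cpu_line("cpu  381584 711 299364 1398303520 429839 0 251 0 0 0")
# Hypothetical second snapshot, taken one interval later.
second = parse_cpu_line("cpu  381604 711 299374 1398304490 429839 0 251 0 0 0")
pct = cpu_percentages(first, second)
print(pct["user"], pct["system"], pct["idle"])  # 2.0 1.0 97.0
```

Working with deltas matters because the counters in /proc/stat accumulate since boot; a single snapshot only gives the average over the whole uptime.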

In addition, vmstat rounds to the nearest integer when it computes the CPU time percentages (vmstat.c):

static void new_format(void){
    ......
    duse = *cpu_use + *cpu_nic;
    dsys = *cpu_sys + *cpu_xxx + *cpu_yyy;
    didl = *cpu_idl;
    diow = *cpu_iow;
    dstl = *cpu_zzz;
    Div = duse + dsys + didl + diow + dstl;
    if (!Div) Div = 1, didl = 1;
    divo2 = Div / 2UL;
    printf(w_option ? wide_format : format,
           running, blocked,
           unitConvert(kb_swap_used), unitConvert(kb_main_free),
           unitConvert(a_option?kb_inactive:kb_main_buffers),
           unitConvert(a_option?kb_active:kb_main_cached),
           (unsigned)( (unitConvert(*pswpin  * kb_per_page) * hz + divo2) / Div ),
           (unsigned)( (unitConvert(*pswpout * kb_per_page) * hz + divo2) / Div ),
           (unsigned)( (*pgpgin        * hz + divo2) / Div ),
           (unsigned)( (*pgpgout           * hz + divo2) / Div ),
           (unsigned)( (*intr          * hz + divo2) / Div ),
           (unsigned)( (*ctxt          * hz + divo2) / Div ),
           (unsigned)( (100*duse            + divo2) / Div ),
           (unsigned)( (100*dsys            + divo2) / Div ),
           (unsigned)( (100*didl            + divo2) / Div ),
           (unsigned)( (100*diow            + divo2) / Div ),
           (unsigned)( (100*dstl            + divo2) / Div )
    );
    ......
}

As a result, the CPU percentages can add up to more than 100, as in the first sample line above: 2 + 1 + 98 = 101
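The effect is easy to reproduce. The sketch below copies only the integer rounding expression `(100*x + divo2) / Div` from `new_format()`; the tick deltas are hypothetical values chosen so that two columns round up.

```python
def vmstat_percent(ticks, total):
    """Integer round-to-nearest, mirroring (100*x + Div/2) / Div."""
    return (100 * ticks + total // 2) // total

# Hypothetical tick deltas for one interval:
# 1.5% user, 0.5% system, 98% idle, no iowait, no steal.
duse, dsys, didl, diow, dstl = 3, 1, 196, 0, 0
div = duse + dsys + didl + diow + dstl

columns = [vmstat_percent(x, div) for x in (duse, dsys, didl, diow, dstl)]
print(columns, sum(columns))  # [2, 1, 98, 0, 0] 101
```

Each column is rounded independently, so 1.5 becomes 2 and 0.5 becomes 1 while 98 stays 98, and the total overshoots by one.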

In addition, on Linux the r field is the total number of tasks that are currently running or waiting to run.

 

References:
/proc/stat explained
procps

 

Profiling CPU usage

The content of this post is taken from Systems Performance: Enterprise and the Cloud.

CPU profiling works by sampling the CPU state periodically and then analyzing the samples. It involves 5 steps:

1. Select the type of profile data to capture, and the rate.
2. Begin sampling at a timed interval.
3. Wait while the activity of interest occurs.
4. End sampling and collect sample data.
5. Process the data.

CPU sample data is characterized by the following two factors:

a. User level, kernel level, or both
b. Function and offset (program-counter-based), function only, partial stack trace, or full stack trace

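Capturing full stack traces for both user level and kernel level does give a complete CPU profile, but it generates too much data. So it is usually enough to sample partial stack traces at user level or kernel level, and sometimes only the function names need to be kept.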
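To make the trade-off concrete, here is a small sketch of step 5 ("Process the data") that aggregates the same samples at two levels of detail. The sample stacks and the helper names `by_function` and `by_partial_stack` are hypothetical; a real profiler would collect the stacks from the system rather than hard-code them.

```python
from collections import Counter

# Hypothetical samples: each one is a call stack, outermost frame first.
samples = [
    ("main", "parse", "read_line"),
    ("main", "parse", "read_line"),
    ("main", "parse", "tokenize"),
    ("main", "render", "read_line"),
]

def by_function(samples):
    """Keep only the function that was on-CPU (the innermost frame)."""
    return Counter(stack[-1] for stack in samples)

def by_partial_stack(samples, depth):
    """Keep the innermost `depth` frames of each stack."""
    return Counter(stack[-depth:] for stack in samples)

print(by_function(samples).most_common())
print(by_partial_stack(samples, 2).most_common())
```

Function-only aggregation is the most compact but cannot tell which caller made `read_line` hot; keeping two frames recovers that at the cost of more distinct keys.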

Below is an example of sampling the CPU with DTrace:

 # dtrace -qn 'profile-997 /arg1/ {@[execname, ufunc(arg1)] = count();} tick-10s{exit(0)}'

 top                                                 libc.so.7`0x801154fec                                             1
 top                                                 libc.so.7`0x8011e5f28                                             1
 top                                                 libc.so.7`0x8011f18a9                                             1

 

The differences between CPU, GPU, and GPGPU

The following is excerpted from Why are we still using CPUs instead of GPUs?

GPUs have far more processor cores than CPUs, but because each GPU core runs significantly slower than a CPU core and do not have the features needed for modern operating systems, they are not appropriate for performing most of the processing in everyday computing. They are most suited to compute-intensive operations such as video processing and physics simulations.

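A GPU (Graphics Processing Unit) has more cores than a CPU; it is the processor of the video card. Its instruction set is less powerful than a CPU's, but because it has so many cores it suits relatively simple, compute-intensive work such as image processing. A GPGPU (General Purpose Graphics Processing Unit) is not limited to image-related computation and also performs general-purpose computation.

Update: how does an image end up on the computer screen? This post explains it well: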

The GPU has a series of registers that the BIOS maps. These permit the CPU to access the GPU’s memory and instruct the GPU to perform operations. The CPU plugs values into those registers to map some of the GPU’s memory so that the CPU can access it. Then it loads instructions into that memory. It then writes a value to a register that tells the GPU to execute the instructions the CPU loaded into its memory.

The information consists of the software that the GPU needs to run. This software is bundled with the driver and then the driver handles the responsibility split between the CPU and GPU (by running portions of its code on both devices).

The driver then manages a series of “windows” into GPU memory that the CPU can read from and write to. Generally, the access pattern involves the CPU writing instructions or information into mapped GPU memory and then instructing the GPU, through a register, to execute those instruction or process that information. The information includes shader logic, textures, and so on.

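Put simply, the CPU writes the data to display and the instructions into GPU memory that it reaches through the video card's registers, then writes to a register to tell the GPU (the video card's processor) to execute the drawing commands. The following figure from Wikipedia also illustrates the whole process: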

CUDA_processing_flow_(En)


 

Why does a single CPU have no “memory reorder” problem?

This article points out that the “memory reorder” problem does not occur on a single-core system:

Two threads being timesliced on a single CPU core won’t run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won’t really know about each other’s reordering.

Again, take the figures from Memory Reordering Caught in the Act as an example:

marked-example2

reordered

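One way to understand it: on a single-core CPU system, multiple threads actually execute alternately, one after another, and never truly run in parallel. However the code of two or more threads is executed out of order, the CPU knows the order in which it was supposed to execute. Whenever the reordering would change the program's result, the CPU takes corrective action, such as discarding results and re-executing, so that the code behaves as if it had run in the intended order. That is why the “memory reorder” problem does not occur on a single-core system.

References:
Why doesn’t the instruction reorder issue occur on a single CPU core?
The preempt_disable question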
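The figures are not reproduced here. Assuming they show the store/load test from Memory Reordering Caught in the Act (thread 1 runs `X = 1; r1 = Y`, thread 2 runs `Y = 1; r2 = X`, with X and Y initially 0), the toy model below enumerates every interleaving of the two threads. It is only a model of visible ordering, not of real hardware: the "reordered" case simply lets each load take effect before the preceding store.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(thread1, thread2):
    """Set of (r1, r2) results over all interleavings of the two threads."""
    results = set()
    for schedule in interleavings(thread1, thread2):
        memory = {"X": 0, "Y": 0}
        regs = {}
        for op in schedule:
            if op[0] == "store":
                memory[op[1]] = 1
            else:  # ("load", register, variable)
                regs[op[1]] = memory[op[2]]
        results.add((regs["r1"], regs["r2"]))
    return results

# Effects become visible in program order (threads merely timesliced).
in_order = outcomes([("store", "X"), ("load", "r1", "Y")],
                    [("store", "Y"), ("load", "r2", "X")])
# Each load becomes visible before the store that precedes it in the code.
reordered = outcomes([("load", "r1", "Y"), ("store", "X")],
                     [("load", "r2", "X"), ("store", "Y")])

print(sorted(in_order))   # [(0, 1), (1, 0), (1, 1)]
print(sorted(reordered))  # [(0, 0), (0, 1), (1, 0)]
```

The result `r1 == 0 and r2 == 0` never appears when each thread's effects stay in program order, which is the guarantee a single core gives to the threads it timeslices.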

 

Notes on “memory order”

The following figures are taken from Memory Reordering Caught in the Act, which describes the memory reorder problem.
Code:

marked-example2

Actual execution:

reordered

Why does memory reordering happen? In a word: performance.

On systems that allow memory reordering, there are three orders to consider:

Program order: the order in which the memory operations are specified in the code running on a given CPU.

Execution order: the order in which the individual memory-reference instructions are executed on a given CPU. The execution order can differ from program order due to both compiler and CPU-implementation optimizations.

Perceived order: the order in which a given CPU perceives its and other CPUs’ memory operations. The perceived order can differ from the execution order due to caching, interconnect and memory-system optimizations. Different CPUs might well perceive the same memory operations as occurring in different orders.

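Program order is the order in which the code accesses memory. Execution order is the order in which the code actually executes on the CPU; because of compiler optimizations and the CPU implementation, instructions may execute in a different order than the code specifies. Perceived order is the order in which a CPU “perceives” its own and other CPUs' memory operations; because of caching, interconnects, and similar factors, it may differ from the actual execution order.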

 

Summary of memory order:

A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs’ memory operations.

An operation is reordered with a store only if the operation accesses a different location than does the store.

Aligned simple loads and stores are atomic.

Linux-kernel synchronization primitives contain any needed memory barriers, which is a good reason to use these primitives.

References:
Memory Reordering Caught in the Act
Memory Ordering in Modern Microprocessors, Part I

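An introduction to DMA Remapping

DMA (Direct Memory Access) Remapping is a technique that restricts a hardware device's DMA to pre-assigned memory regions (domains, or physical memory regions). DMA Remapping translates the address in a DMA request into the correct physical memory address, and also checks whether the device is allowed to access that memory. See the figure below: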

dma_remapping

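The physical address supplied by a virtual machine's operating system (the Guest OS) is called the Guest Physical Address (GPA). It does not necessarily match the real physical address, the Host Physical Address (HPA), yet DMA has to access real physical addresses. DMA Remapping translates the GPA supplied by the Guest OS into an HPA, so the data can be delivered directly into the Guest OS's buffer.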
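A minimal sketch of that translation, assuming a flat per-domain table keyed by 4 KiB page number. The real hardware walks multi-level page tables; the names `remap`, `DmaFault`, and the example mappings are hypothetical.

```python
PAGE_SIZE = 4096

class DmaFault(Exception):
    """Raised when a device touches memory outside its domain."""

def remap(domain_pages, gpa):
    """Translate a guest physical address to a host physical address.

    domain_pages maps guest page numbers to host page numbers. A page
    that is missing from it lies outside the domain, so the request is
    rejected instead of translated.
    """
    page, offset = divmod(gpa, PAGE_SIZE)
    if page not in domain_pages:
        raise DmaFault(f"GPA {gpa:#x} is not mapped for this domain")
    return domain_pages[page] * PAGE_SIZE + offset

# Hypothetical domain: guest pages 0 and 1 live in host pages 0x80 and 0x93.
domain = {0x0: 0x80, 0x1: 0x93}
print(hex(remap(domain, 0x1010)))  # 0x93010
```

The same lookup provides both functions described above: the address is translated, and an access to an unmapped page is refused.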

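A host platform may support one or more DMA remapping hardware units, each of which remaps the DMA requests that originate from within the scope it controls. Host firmware (BIOS) must report each DMA remapping hardware unit to system software (such as the operating system).

A DMA remapping hardware unit uses the source-id to identify the device that issued a DMA request. For a PCI Express device, the source-id is the requester identifier: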

 ________________________________________________________
|____Bus(8 bits)_________|__Device(5 bits)|_func(3 bits)_|
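The layout above can be unpacked with a few shifts and masks. This sketch assumes the bus number sits in the most significant bits of the 16-bit value, as the diagram reads from left to right; the function names are illustrative.

```python
def decode_source_id(source_id):
    """Split a 16-bit source-id into (bus, device, function)."""
    bus = (source_id >> 8) & 0xFF      # upper 8 bits
    device = (source_id >> 3) & 0x1F   # next 5 bits
    function = source_id & 0x7         # lowest 3 bits
    return bus, device, function

def encode_source_id(bus, device, function):
    """Pack (bus, device, function) back into a 16-bit source-id."""
    return (bus << 8) | (device << 3) | function

print(decode_source_id(0x03FF))  # (3, 31, 7)
```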

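The root-entry is the top-level data structure; it maps the devices on a specific PCI bus to their domains. A context-entry maps one specific device on a bus to its domain. See the figure below:

root-entry

Each root-entry holds a pointer to a table of context-entries, and each context-entry contains the structures used for address translation.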
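The two-level lookup can be sketched with nested dictionaries: the root table is indexed by bus number and each context table by the remaining 8 bits (device and function). The table contents and the name `lookup_domain` are hypothetical, and a real context-entry holds more than a domain label.

```python
# Hypothetical tables. Context table keys are (device << 3) | function.
context_table_bus0 = {
    (2 << 3) | 0: "domain-A",   # bus 0, device 2, function 0
    (3 << 3) | 1: "domain-B",   # bus 0, device 3, function 1
}
root_table = {0: context_table_bus0}

def lookup_domain(root_table, source_id):
    """Find the domain for a 16-bit source-id, or None if unmapped."""
    bus = (source_id >> 8) & 0xFF
    devfn = source_id & 0xFF
    context_table = root_table.get(bus)   # root-entry -> context table
    if context_table is None:
        return None
    return context_table.get(devfn)       # context-entry -> domain

print(lookup_domain(root_table, 0x0010))  # domain-A
```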

Linux kernel notes (1): what is the CPU doing?

In fact, in Linux, we can generalize that each processor is doing exactly one of three things at any given moment:
a) In user-space, executing user code in a process
b) In kernel-space, in process context, executing on behalf of a specific process
c) In kernel-space, in interrupt context, not associated with a process, handling an interrupt

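In Linux, at any given moment the CPU is doing one of the following three things:

a) running a process's user-space code;
b) running kernel-space code on behalf of a process;
c) handling an interrupt (also in kernel space, but not associated with any process).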

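Chatting about CPUs (4)

CPU utilization is the percentage of a time interval that the CPU spends doing “useful work”. “Useful work” means the CPU is not running the kernel IDLE thread, but is instead running user-level application threads, other kernel threads, or handling interrupts.

The time the CPU spends executing user-level application code is called user-time, and the time it spends running kernel-level code is called kernel-time.

A computation-intensive program may spend almost all of its time executing user-level code, while an I/O-intensive program spends a considerable share of its time in system calls, which execute kernel code to perform I/O.

When a CPU's utilization reaches 100%, it is said to be saturated. In that state, threads that are waiting for the CPU experience scheduler latency.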
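As a small worked example of these definitions, the sketch below computes utilization from user-time, kernel-time, and idle time over one interval. The function names and the sample numbers are illustrative only.

```python
def utilization(user_time, kernel_time, idle_time):
    """Percentage of the interval spent on anything but the idle thread."""
    busy = user_time + kernel_time
    total = busy + idle_time
    return 100.0 * busy / total if total else 0.0

def is_saturated(util_percent):
    """A CPU at 100% utilization is saturated."""
    return util_percent >= 100.0

# Hypothetical 10-second interval: 6 s user, 2 s kernel, 2 s idle.
u = utilization(6.0, 2.0, 2.0)
print(u, is_saturated(u))  # 80.0 False
```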

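Chatting about CPUs (3)

Executing an instruction involves the following 5 steps, each of which is handled by a dedicated functional unit of the CPU:
(1) instruction fetch;
(2) decode;
(3) execute;
(4) memory access;
(5) register write-back.
The last two steps are optional, because many instructions only operate on registers and never access memory. Each step takes at least one clock cycle to complete. Memory access is usually the slowest and takes multiple clock cycles.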

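Instruction pipeline: a CPU architecture that can execute multiple instructions in parallel, that is, execute different parts of different instructions at the same time. Suppose each of the 5 steps above takes 1 clock cycle; completing one instruction then takes 5 clock cycles (assuming steps 4 and 5 are both needed). While that instruction executes, only one functional unit is working at each step and the others sit idle. With an instruction pipeline, several functional units can be active at once: for example, while one instruction is being decoded, the next one can be fetched. This greatly improves efficiency. Ideally, the CPU then completes one instruction per clock cycle.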
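The arithmetic behind that claim, under the same idealized assumptions (5 stages, 1 cycle per stage, no stalls), can be sketched as follows. The function names are illustrative.

```python
STAGES = 5  # fetch, decode, execute, memory access, write-back

def cycles_unpipelined(n):
    """Each instruction runs all stages before the next one starts."""
    return n * STAGES

def cycles_pipelined(n):
    """The first instruction fills the pipeline; after that, one
    instruction completes every cycle."""
    return STAGES + (n - 1) if n else 0

n = 100
print(cycles_unpipelined(n), cycles_pipelined(n))  # 500 104
```

For 100 instructions the pipelined case needs 104 cycles, about 1.04 cycles per instruction, and the figure approaches 1 as the instruction count grows.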

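Going further, if the CPU has more than one functional unit of the same type, even more instructions can complete in each clock cycle. This architecture is called “superscalar”. Instruction width describes how many instructions can be processed in parallel. Modern CPUs are generally 3-wide or 4-wide, meaning they can process 3 to 4 instructions per clock cycle.

Cycles per instruction (CPI) is an important metric for describing where the CPU spends its clock cycles and for understanding CPU utilization. It can also be expressed as instructions per cycle (IPC). CPI reflects the efficiency of instruction processing, not the efficiency of the instructions themselves.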
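Both metrics are simple ratios of two counters and are reciprocals of each other. The cycle and instruction counts below are hypothetical; on a real system they would come from hardware performance counters.

```python
def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions

def ipc(cycles, instructions):
    """Instructions per cycle, the reciprocal of CPI."""
    return instructions / cycles

# Hypothetical counter readings for one interval.
cycles, instructions = 2_000_000, 1_000_000
print(cpi(cycles, instructions), ipc(cycles, instructions))  # 2.0 0.5
```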

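Chatting about CPUs (2)

The clock is the digital signal that drives all processor logic.

CPU speed can be measured by its clock rate. For example, a 5 GHz CPU produces 5 billion clock cycles per second. Every CPU instruction takes one or more clock cycles to execute.

Clock rate is an important measure of CPU performance. But a faster clock does not necessarily improve performance; it depends on what the clock cycles are spent on. For example, if they are mostly spent waiting for memory accesses to complete, raising the clock rate will not bring a real performance gain.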
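A back-of-the-envelope model makes the point. It assumes, as a simplification, that time spent waiting on memory is fixed in seconds and does not shrink when the clock gets faster; the workload numbers are invented.

```python
def run_time_seconds(compute_cycles, clock_hz, memory_stall_seconds):
    """Run time = time spent computing + time spent waiting on memory."""
    return compute_cycles / clock_hz + memory_stall_seconds

# Hypothetical workload: 5 billion compute cycles plus 8 s of memory stalls.
slow = run_time_seconds(5_000_000_000, 2.5e9, 8.0)  # 2.5 GHz clock
fast = run_time_seconds(5_000_000_000, 5.0e9, 8.0)  # 5 GHz clock
print(slow, fast)  # 10.0 9.0
```

In this model, doubling the clock rate cuts the run time by only 10%, because most of the time goes to memory stalls that the faster clock does not shorten.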