How to Measure How Many Clock Cycles a Piece of Code Takes (x86)

Daniel Lemire's benchmark code shows how to measure, on x86, how many clock cycles a piece of code takes:

......
#define RDTSC_START(cycles)                                                    \
  do {                                                                         \
    unsigned cyc_high, cyc_low;                                                \
    __asm volatile("cpuid\n\t"                                                 \
                   "rdtsc\n\t"                                                 \
                   "mov %%edx, %0\n\t"                                         \
                   "mov %%eax, %1\n\t"                                         \
                   : "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx",    \
                     "%rdx");                                                  \
    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
  } while (0)

#define RDTSC_FINAL(cycles)                                                    \
  do {                                                                         \
    unsigned cyc_high, cyc_low;                                                \
    __asm volatile("rdtscp\n\t"                                                \
                   "mov %%edx, %0\n\t"                                         \
                   "mov %%eax, %1\n\t"                                         \
                   "cpuid\n\t"                                                 \
                   : "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx",    \
                     "%rdx");                                                  \
    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
  } while (0)
......
RDTSC_START(cycles_start);                                               
......                                                                   
RDTSC_FINAL(cycles_final);                                               
cycles_diff = (cycles_final - cycles_start);
......

This code is based on How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, and works as follows:

(1) The cpuid instructions at the start and end of the measurement guard against out-of-order execution: instructions issued before cpuid cannot be scheduled to run after it, so nothing but the code under measurement sits between the two cpuid instructions.

(2) Both rdtsc and rdtscp read the number of clock cycles elapsed since boot (the high 32 bits are returned in edx, the low 32 bits in eax); in addition, rdtscp does not execute until all preceding instructions have completed. Subtracting the two samples yields the cycle count we are after.

In summary, with just the three instructions cpuid, rdtsc, and rdtscp, we can work out how many clock cycles a piece of code consumes.
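To put the macros to use, here is a minimal, self-contained sketch (my own example, assuming the two macros above are in scope; the 1000-iteration loop is an arbitrary placeholder for the code you actually want to measure):

#include <stdint.h>
#include <stdio.h>

/* RDTSC_START and RDTSC_FINAL are the macros defined above. */

int main(void) {
  uint64_t cycles_start, cycles_final;
  volatile uint64_t sum = 0; /* volatile keeps the loop from being optimized away */

  RDTSC_START(cycles_start);
  for (int i = 0; i < 1000; i++) /* the code under measurement */
    sum += i;
  RDTSC_FINAL(cycles_final);

  printf("cycles elapsed: %llu\n",
         (unsigned long long)(cycles_final - cycles_start));
  return 0;
}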

P.S. There is a related discussion on Stack Overflow as well.

How to Measure the System Clock Frequency

Daniel Lemire's Measuring the system clock frequency using loops (Intel and ARM) describes how to measure the system clock frequency with a handful of assembly instructions. Take Intel x86 as an example (the ARM version works on the same principle):

; initialize 'counter' with the desired number
label:
dec counter ; decrement counter
jnz label ; goes to label if counter is not zero

At execution time, modern Intel x86 processors fuse the dec and jnz pair into a single micro-op that retires in one clock cycle. Therefore, once we know how long a given number of loop iterations takes, we can estimate the current clock frequency.
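To make the idea concrete, here is a minimal sketch of the principle (my own simplification, not Lemire's actual reportfreq.cpp; it omits CPU warm-up and the validity checks described below): time a dec/jnz loop with clock_gettime, and since one fused dec+jnz pair retires per cycle, iterations per nanosecond approximate the frequency in GHz:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Runs 'count' iterations of the fused dec/jnz loop and returns elapsed nanoseconds. */
static double time_loop(uint64_t count) {
  struct timespec t0, t1;
  uint64_t n = count;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  __asm volatile("1:\n\t"
                 "dec %0\n\t"
                 "jnz 1b\n\t"
                 : "+r"(n));
  clock_gettime(CLOCK_MONOTONIC, &t1);
  return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
  uint64_t count = 65536ULL * 1024; /* enough iterations to dwarf timer overhead */
  /* one fused dec+jnz per cycle => cycles ~= iterations, so iterations/ns ~= GHz */
  printf("estimated frequency: %.3f GHz\n", count / time_loop(count));
  return 0;
}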

In the code, Daniel Lemire uses a trick called "measure-twice-and-subtract". Suppose the iteration count is 65536, and each experiment runs the loop twice: the first run executes 65536 * 2 iterations and takes nanoseconds1; the second executes 65536 iterations and takes nanoseconds2. That gives three estimates of the time for 65536 iterations: nanoseconds1 / 2, nanoseconds1 - nanoseconds2, and nanoseconds2. The experiment is considered valid only if these estimates agree to within 5%:

......
double nanoseconds = (nanoseconds1 - nanoseconds2);
if ((fabs(nanoseconds - nanoseconds1 / 2) > 0.05 * nanoseconds) or
    (fabs(nanoseconds - nanoseconds2) > 0.05 * nanoseconds)) {
  return 0;
}
......

Finally, the valid measurements are sorted and the median is taken:

......
std::cout << "Got " << freqs.size() << " measures." << std::endl;
std::sort(freqs.begin(),freqs.end());
std::cout << "Median frequency detected: " << freqs[freqs.size() / 2] << " GHz" << std::endl;
......

On my system, the CPU clock frequency reported by lscpu:

$ lscpu
......
CPU MHz:             1000.007
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
......

And the actual measurement:

$ ./loop.sh
g++ -O2 -o reportfreq reportfreq.cpp  -std=c++11 -Wall -lm
measure using a tight loop:
Got 9544 measures.
Median frequency detected: 3.39196 GHz

measure using an unrolled loop:
Got 9591 measures.
Median frequency detected: 3.39231 GHz

measure using a tight loop:
Got 9553 measures.
Median frequency detected: 3.39196 GHz

measure using an unrolled loop:
Got 9511 measures.
Median frequency detected: 3.39231 GHz

measure using a tight loop:
Got 9589 measures.
Median frequency detected: 3.39213 GHz

measure using an unrolled loop:
Got 9540 measures.
Median frequency detected: 3.39196 GHz
.......


Getting CPU Information on FreeBSD

FreeBSD has neither the /proc/cpuinfo file found on GNU/Linux nor an lscpu command (lscpu itself just reads /proc/cpuinfo). So learning about the CPUs in a FreeBSD machine takes a few extra steps:

(1) Use the sysctl command:

# sysctl hw.model hw.machine hw.ncpu
hw.model: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
hw.machine: amd64
hw.ncpu: 2

(2) Read the /var/run/dmesg.boot file:

# grep -i cpu /var/run/dmesg.boot
CPU: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (2400.05-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
SMP: AP CPU #1 Launched!

(3) Use the dmidecode command to get CPU and cache information:

# dmidecode -t processor -t cache
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.4 present.

Handle 0x0004, DMI type 4, 35 bytes
Processor Information
        Socket Designation: LGA 775
        Type: Central Processor
        Family: Pentium 4
        Manufacturer: Intel
        ID: F6 06 00 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 15, Stepping 6
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
......
Handle 0x0005, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L1-Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 32 kB
        Maximum Size: 32 kB
......

References:
FreeBSD CPU Information Command
What is the equivalent of /proc/cpuinfo on FreeBSD v8.1?

How to See Which CPU a Process (Thread) Is Running on in Linux

This post shows how to find out which CPU a process (thread) is running on under Linux. Before diving in, we need to be clear about two basics:

(1) On Linux there is no fundamental difference between a process and a thread: the kernel treats both as tasks. Threads belonging to the same process share certain resources, and each thread has its own ID; the "main" thread's thread ID is identical to the process ID, i.e. the familiar PID (a short demo follows after this list).

(2) The lscpu command reports how many CPUs the current system has:

$ lscpu
......
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
......

The system has 2 physical CPUs (Socket(s): 2), each with 6 cores (Core(s) per socket: 6), and each core has 2 hardware threads (Thread(s) per core: 2). So the system has 2 x 6 x 2 = 24 (CPU(s): 24) logical CPUs, which is what programs actually run on.
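Here is the demo promised in point (1) (my own illustration; compile with gcc -pthread): it prints the PID and TID of the main thread and of a second thread, showing that the main thread's TID equals the PID:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Every task has its own TID; for the "main" thread, TID == PID. */
static void *worker(void *arg) {
  (void)arg;
  printf("worker: pid=%d tid=%ld\n", (int)getpid(), (long)syscall(SYS_gettid));
  return NULL;
}

int main(void) {
  printf("main:   pid=%d tid=%ld\n", (int)getpid(), (long)syscall(SYS_gettid));
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  pthread_join(t, NULL);
  return 0;
}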

The htop command can show which CPU each process (thread) runs on, but it does not display this column by default:

[screenshot: default htop view, without the CPU column]
To enable it:
(1) After starting htop, press F2 (Setup):

[screenshot: the htop Setup screen]
(2) In Setup choose Columns, then under Available Columns select "PROCESSOR - ID of the CPU the process last executed", and press F5 (Add) followed by F10 (Done):

[screenshot: adding the PROCESSOR column]

Now htop displays the CPU information. Note that what htop actually shows is the CPU a process (thread) last ran on, not the CPU it is running on right now: by the time htop renders the screen, the OS may already have migrated the task to another CPU.
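A program can also ask this question about itself via glibc's sched_getcpu(); the same caveat applies, since the scheduler may migrate the thread right after the call returns. A minimal sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  /* sched_getcpu() returns the CPU the calling thread is currently running on;
     the answer may already be stale by the time it is printed. */
  printf("pid %d is on CPU %d\n", (int)getpid(), sched_getcpu());
  return 0;
}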

Below is a program that runs 4 threads:

#include <omp.h>

int main(void){

        /* Spawn 4 OpenMP threads; each thread spins in an infinite
           loop so that it stays runnable and visible in htop. */
        #pragma omp parallel num_threads(4)
        for(;;)
        {
        }

        return 0;
}

Compile and run it:

$ gcc -fopenmp thread.c
$ ./a.out &
[1] 17235

htop then shows each thread's ID and the CPU it is running on:

[screenshot: htop showing the threads of a.out and their CPUs]

References:
How to find out which CPU core a process is running on
闲侃CPU(一)

CUDA Programming Notes (4): CPU Thread vs. GPU Thread

This note is excerpted from Professional CUDA C Programming:

Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches are slow and expensive.
Threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work. If the GPU must wait on one group of threads, it simply begins executing work on another.
CPU cores are designed to minimize latency for one or two threads at a time, whereas GPU cores are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
Today, a CPU with four quad core processors can run only 16 threads concurrently, or 32 if the CPUs support hyper-threading.
Modern NVIDIA GPUs can support up to 1,536 active threads concurrently per multiprocessor. On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.

CUDA Programming Notes (3): Heterogeneous Architecture

This note is excerpted from Professional CUDA C Programming:

A typical heterogeneous compute node nowadays consists of two multicore CPU sockets and two or more many-core GPUs. A GPU is currently not a standalone platform but a co-processor to a CPU. Therefore, GPUs must operate in conjunction with a CPU-based host through a PCI-Express bus. That is why, in GPU computing terms, the CPU is called the host and the GPU is called the device.

[figure: a heterogeneous compute node, with multicore CPU (host) and many-core GPU (device) connected by a PCI-Express bus]

A heterogeneous application consists of two parts:
➤ Host code
➤ Device code
Host code runs on CPUs and device code runs on GPUs. An application executing on a heterogeneous platform is typically initialized by the CPU. The CPU code is responsible for managing the environment, code, and data for the device before loading compute-intensive tasks on the device.

There are two important features that describe GPU capability:
➤ Number of CUDA cores
➤ Memory size
Accordingly, there are two different metrics for describing GPU performance:
➤ Peak computational performance
➤ Memory bandwidth
Peak computational performance is a measure of computational capability, usually defined as how many single-precision or double-precision floating point calculations can be processed per second. Peak performance is usually expressed in gflops (billion floating-point operations per second) or tflops (trillion floating-point calculations per second). Memory bandwidth is a measure of the ratio at which data can be read from or stored to memory. Memory bandwidth is usually expressed in gigabytes per second, GB/s.

CUDA Programming Notes (2): GPU Core vs. CPU Core

This note is excerpted from Professional CUDA C Programming:

Even though many-core and multicore are used to label GPU and CPU architectures, a GPU core is quite different than a CPU core.
A CPU core, relatively heavy-weight, is designed for very complex control logic, seeking to optimize the execution of sequential programs.
A GPU core, relatively light-weight, is optimized for data-parallel tasks with simpler control logic, focusing on the throughput of parallel programs.


Using vmstat to Inspect CPU Usage on Linux

This post is a follow-up to an earlier one on monitoring CPU usage with the vmstat command.

On Linux, the vmstat command reports the system's CPU usage:

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 1860352    948 131040    0    0  2433   137  252  897  2  7 90  1  0

The CPU state is described by the last 5 columns:

------cpu-----
us sy id wa st
2  7 90  1  0

Note that these numbers are percentages: the CPU spent 2% of its time running user-space code, 7% in the kernel, and so on down the columns.

The columns mean the following:

us (user time): time the CPU spent running user-space code;
sy (system time): time the CPU spent running kernel code, e.g. executing system calls;
id (idle time): time the CPU sat idle;
wa (IO-wait time): time the CPU was idle because every runnable process was waiting for I/O to complete, so there was nothing to schedule;
st (stolen time): time stolen by the hypervisor to run other virtual machines on the host.
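One more practical note: invoked without arguments as above, vmstat prints averages since boot. Giving it an interval and a count, e.g.:

# vmstat 1 5

makes it print a sample every second (5 in total); the first line is still the since-boot average, while the subsequent lines reflect the most recent interval, which is usually what you want when watching a live system.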

References:
The precise meaning of I/O wait time in Linux
Linux Performance Analysis in 60,000 Milliseconds

Swarmkit Notes (12): Specifying Resource Limits When Creating a Service with swarmctl

When creating a service with swarmctl, you can specify CPU and memory resource limits:

# swarmctl service create --help
Create a service

Usage:
  swarmctl service create [flags]

Flags:
......
  --cpu-limit string            CPU cores limit (e.g. 0.5)
  --cpu-reservation string      number of CPU cores reserved (e.g. 0.5)
......
  --memory-limit string         memory limit (e.g. 512m)
  --memory-reservation string   amount of reserved memory (e.g. 512m)
    ......

The *-reservation flags allocate and hold the corresponding resources for the container, so those resources are guaranteed to be available to it; the *-limit flags cap the resources the container's processes may consume. The code that parses these resource flags lives in cmd/swarmctl/service/flagparser/resource.go.
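For example, the following command should create a service whose containers are capped at half a CPU core and 512m of memory (the --cpu-limit and --memory-limit flags come from the help text above; the --name and --image flags are my assumption based on swarmctl's typical usage):

# swarmctl service create --name redis --image redis:3.0.5 --cpu-limit 0.5 --memory-limit 512m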

References:
Docker service Limits and Reservations

Docker Notes (16): Assigning CPU Resources to a Container

The --cpuset-cpus option of the docker run command pins a container to specific CPU cores. For example:

# docker run -ti --rm --cpuset-cpus=1,6 redis

There is also a --cpu-shares option, a relative weight whose default value is 1024: if two running containers both have --cpu-shares set to 1024, they receive equal shares of CPU time.
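For example, if the two containers below are both CPU-bound and competing for the same cores, the second should receive roughly twice as much CPU time as the first (redis here is just a placeholder image):

# docker run -d --cpu-shares=512 redis
# docker run -d --cpu-shares=1024 redis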