如何度量代码运行了多少时钟周期(X86平台)

Daniel Lemirebenchmark代码展示了在X86平台上,如何度量一段代码运行了多少时钟周期:

......
#define RDTSC_START(cycles)                                                    \
  do {                                                                         \
    unsigned cyc_high, cyc_low;                                                \
    __asm volatile("cpuid\n\t"                                                 \
                   "rdtsc\n\t"                                                 \
                   "mov %%edx, %0\n\t"                                         \
                   "mov %%eax, %1\n\t"                                         \
                   : "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx",    \
                     "%rdx");                                                  \
    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
  } while (0)

#define RDTSC_FINAL(cycles)                                                    \
  do {                                                                         \
    unsigned cyc_high, cyc_low;                                                \
    __asm volatile("rdtscp\n\t"                                                \
                   "mov %%edx, %0\n\t"                                         \
                   "mov %%eax, %1\n\t"                                         \
                   "cpuid\n\t"                                                 \
                   : "=r"(cyc_high), "=r"(cyc_low)::"%rax", "%rbx", "%rcx",    \
                     "%rdx");                                                  \
    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
  } while (0)
......
RDTSC_START(cycles_start);                                               
......                                                                   
RDTSC_FINAL(cycles_final);                                               
cycles_diff = (cycles_final - cycles_start);
......

这段代码其实参考自How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures,原理如下:

(1)测量开始和结束时的cpuid指令用来防止代码的乱序执行(out-of-order),即保证cpuid之前的指令不会调度到cpuid之后执行,因此两个cpuid指令之间只包含要度量的代码,没有掺杂其它的。

(2)rdtscrdtscp都是读取系统自启动以来的时钟周期数(cycles,高32位保存在edx寄存器,低32位保存在eax寄存器),并且rdtscp保证其之前的代码都已经完成。两次采样值相减就是我们需要的时钟周期数。

综上所述,通过cpuidrdtscrdtscp3条汇编指令,我们就可以计算出一段代码到底消耗了多少时钟周期。

P.S.,stackoverflow也有相关的讨论。

如何度量系统的时钟频率

Daniel LemireMeasuring the system clock frequency using loops (Intel and ARM)讲述了如何利用汇编指令来度量系统的时钟频率。以Intel X86处理器为例(ARM平台原理类似):

; initialize 'counter' with the desired number
label:
dec counter ; decrement counter
jnz label ; goes to label if counter is not zero

在实际执行时,现代的Intel X86处理器会把decjnz这两条指令“融合”成一条指令,并在一个时钟周期执行完毕。因此只要知道完成一定数量的循环花费了多长时间,就可以计算得出当前系统的时钟频率近似值。

代码中,Daniel Lemire使用了一种叫做“measure-twice-and-subtract”技巧:假设循环次数是65536,每次实验跑两次。第一次执行65536 * 2次,花费时间是nanoseconds1;第二次执行65536次,花费时间是nanoseconds2。那么我们就得到3个执行65536次数的时间:nanoseconds1 / 2nanoseconds1 - nanoseconds2nanoseconds2。这三个时间之间的误差必须小于一个值才认为此次实验结果是有效的:

......
double nanoseconds = (nanoseconds1 - nanoseconds2);
if ((fabs(nanoseconds - nanoseconds1 / 2) > 0.05 * nanoseconds) or
    (fabs(nanoseconds - nanoseconds2) > 0.05 * nanoseconds)) {
  return 0;
}
......

最后把有效的测量值排序取中位数(median):

......
std::cout << "Got " << freqs.size() << " measures." << std::endl;
std::sort(freqs.begin(),freqs.end());
std::cout << "Median frequency detected: " << freqs[freqs.size() / 2] << " GHz" << std::endl;
......

在我的系统上,lscpu显示的CPU时钟频率:

$ lscpu
......
CPU MHz:             1000.007
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
......

实际测量结果:

$ ./loop.sh
g++ -O2 -o reportfreq reportfreq.cpp  -std=c++11 -Wall -lm
measure using a tight loop:
Got 9544 measures.
Median frequency detected: 3.39196 GHz

measure using an unrolled loop:
Got 9591 measures.
Median frequency detected: 3.39231 GHz

measure using a tight loop:
Got 9553 measures.
Median frequency detected: 3.39196 GHz

measure using an unrolled loop:
Got 9511 measures.
Median frequency detected: 3.39231 GHz

measure using a tight loop:
Got 9589 measures.
Median frequency detected: 3.39213 GHz

measure using an unrolled loop:
Got 9540 measures.
Median frequency detected: 3.39196 GHz
.......