DTrace | My Site

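The content of this post is taken from "Systems Performance: Enterprise and the Cloud".

CPU profiling works by sampling the CPU state at regular intervals and then analyzing the samples. It consists of five steps: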

1. Select the type of profile data to capture, and the rate.
2. Begin sampling at a timed interval.
3. Wait while the activity of interest occurs.
4. End sampling and collect sample data.
5. Process the data.

The CPU sample data is determined by the following two factors:

a. User level, kernel level, or both
b. Function and offset (program-counter-based), function only, partial stack trace, or full stack trace

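Capturing full stack traces at both user level and kernel level gives a complete CPU profile, but it generates too much data. In practice it is usually enough to sample partial stack traces at either user level or kernel level, and sometimes only the function name needs to be kept.

Below is an example of sampling the CPU with DTrace. The profile-997 probe fires 997 times per second on every CPU, the /arg1/ predicate keeps only samples taken while running user-level code (arg1 is the user-level program counter), and tick-10s ends the run after 10 seconds: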

 # dtrace -qn 'profile-997 /arg1/ {@[execname, ufunc(arg1)] = count();} tick-10s{exit(0)}'

 top                                                 libc.so.7`0x801154fec                                             1
 top                                                 libc.so.7`0x8011e5f28                                             1
 top                                                 libc.so.7`0x8011f18a9                                             1

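This week, after a new product version went live, the monitoring log kept reporting that recv() returned an error with errno 131 (ECONNRESET). The man page does not list ECONNRESET among the errors recv() can return, so I decided to get to the bottom of it. Since I have been learning DTrace recently, I wrote the following simple script (check_recv.d):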

#!/usr/sbin/dtrace -qs

syscall::recv:return
/(int)arg0 <= 0 && pid == $1/
{
    printf("recv return: tid=%d, arg0=%d, errno=%d\n", tid, (int)arg0, errno);
}

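Line 4 is the probe predicate: the probe fires when recv() returns 0 or a negative value and the process ID equals the PID passed on the command line.
Line 6 is the output: thread ID, recv() return value, and errno.
Usage: check_recv.d 19771 (the PID of the process to monitor)
Running the script showed that recv() was actually returning 0, and errno was 0 as well. So why did the monitoring log print errno 131? I went back to the code added in this version and found the following logic: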

if (recv() <= 0)    /* true for 0 (peer closed) as well as -1 (error) */
{
    log(errno);
}

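So errno was also being logged when recv() returned 0. A return value of 0 means the peer closed the connection in an orderly way; it is not an error, and recv() does not update errno in that case. errno is only set, and only meaningful, when recv() returns -1. The errno in the monitoring log was therefore a stale value left behind by some earlier failed system call. The code should be changed to: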

if (recv() < 0)     /* errno is only meaningful when recv() returns -1 */
{
    log(errno);
}

Problem solved!


Tag: DTrace

Profiling CPU usage

Using DTrace to check the return value of recv()