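This article uses kernel version 3.14.69 on the x86 architecture. Taking the measurement of a process's memory bandwidth with perf as the running example, it walks through the call path inside the kernel.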

Using perf from user space

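In user space, a program uses perf by calling the perf_event_open system call.
Readers are encouraged to work through the material recommended below, which gives a much firmer grasp of perf_event_open.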

  1. man 2 perf_event_open
  2. perf events self profiling example
  3. Reading hardware or software events with performance counters
  4. Setting up performance monitoring with perf_event_open
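
glibc does not export a wrapper for perf_event_open, so user programs normally reach it through syscall(2). The following is a minimal sketch of that pattern, assuming a Linux system with <linux/perf_event.h> installed. The helper names sys_perf_event_open and init_hw_counter_attr are made up for illustration and are not part of any library.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: glibc has no wrapper, so go through syscall(2).
 * Returns a file descriptor, or -1 with errno set. */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                         int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: describe one hardware counter that starts disabled
 * and counts user-space events only. */
void init_hw_counter_attr(struct perf_event_attr *attr, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;          /* e.g. PERF_COUNT_HW_CACHE_MISSES */
    attr->disabled = 1;             /* enable later with ioctl() */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}
```

With these helpers, `init_hw_counter_attr(&attr, PERF_COUNT_HW_CACHE_MISSES)` followed by `sys_perf_event_open(&attr, 0, -1, -1, 0)` would request a counter for the calling thread on any CPU. The call can fail with -1 (for example when /proc/sys/kernel/perf_event_paranoid is restrictive), so the return value should always be checked.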

perf event subsystem architecture

The Linux perf_event subsystem consists of the files core.c and perf_event.c. These files are the interface between the Linux kernel and the various user-space performance monitoring tools.

Data structures in the perf event subsystem

The data structures are defined in perf_event.h.

The following are some of the important data structures which are used by the perf_event subsystem.

struct perf_event;
struct perf_event_attr;
struct perf_event_context;
struct pmu;

Important Fields in the Data Structures

perf_event

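struct perf_event {
    struct perf_event *group_leader;            /* leader of this event's group */
    struct pmu *pmu;                            /* PMU that implements the event */
    u64 total_time_enabled;                     /* ns the event has been enabled */
    u64 total_time_running;                     /* ns the event was on the CPU */
    struct perf_event_attr attr;                /* configuration from user space */
    atomic64_t child_count;                     /* counts folded in from child events */
    struct perf_event_context *ctx;             /* context the event belongs to */
    perf_overflow_handler_t overflow_handler;   /* called on counter overflow */
    struct task_struct *owner;                  /* task monitoring this event */
};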

Description:

  • group_leader
    This field specifies the leader of the group of events attached to the process.
  • pmu
    This field points to the generic performance monitoring unit structure.
  • total_time_enabled
    This field specifies the total time, in nanoseconds, that the event has been enabled.
  • total_time_running
    This field specifies the total time, in nanoseconds, that the event has been running
    (scheduled onto the CPU).
  • owner
    This field points to the task structure of the process that is monitoring this event.
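
The two time fields matter when more events are requested than the PMU has counters: the kernel then time-multiplexes the events, and total_time_running falls below total_time_enabled. A common way to estimate the full count in that case is to scale the raw value by enabled/running. The sketch below shows that arithmetic only. The function name scale_count is made up for illustration, and the result is an estimate, not an exact count.

```c
#include <stdint.h>

/* Estimate the count over the whole enabled period when the event was
 * only scheduled onto the CPU for part of it. */
uint64_t scale_count(uint64_t raw, uint64_t time_enabled, uint64_t time_running)
{
    if (time_running == 0)
        return 0;                   /* never scheduled: nothing to scale */
    if (time_running >= time_enabled)
        return raw;                 /* no multiplexing took place */
    /* double avoids 64-bit overflow of raw * time_enabled, at the cost
     * of rounding for very large counts */
    return (uint64_t)((double)raw * time_enabled / time_running + 0.5);
}
```

For example, a raw count of 1000 taken while the event ran for half of its enabled time is scaled to 2000.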

perf_event_attr

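struct perf_event_attr {
    __u32 type;                 /* overall event type (PERF_TYPE_*) */
    __u64 config;               /* type-specific event selector */
    __u64 sample_period;        /* shares a union with sample_freq in the kernel header */
    __u64 sample_freq;
    __u64 sample_type;          /* PERF_SAMPLE_* bits recorded in each sample */
    __u64 exclusive      : 1,
          exclude_user   : 1,
          exclude_kernel : 1,
          exclude_hv     : 1,
          exclude_idle   : 1,
          exclude_host   : 1,
          exclude_guest  : 1;
};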

Description:

  • type
    This field specifies the overall event type.
  • config
    This field specifies which event needs to be monitored. It is used along with type to
    decide the exact event.
  • sample_period, sample_freq
    sample_period is the value N, the number of event occurrences after which an
    interrupt is generated. Sampling can instead be specified as a frequency through sample_freq.
  • sample_type
    The various bits in this field specify which values to include in the sample.
  • exclude_user
    When this bit is set, the count excludes user-space events.
  • exclude_kernel
    When this bit is set, the count excludes kernel-space events.
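
As an illustration of how these fields combine, the sketch below fills a perf_event_attr for frequency-based sampling of user-space cache misses. It relies on the freq bit of perf_event_attr, which is not listed in the abridged struct above but is part of the kernel header, and it assumes <linux/perf_event.h> is available. The helper name init_sampling_attr is hypothetical.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: sample user-space cache misses roughly `hz`
 * times per second. */
void init_sampling_attr(struct perf_event_attr *attr, __u64 hz)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CACHE_MISSES;
    attr->freq = 1;                 /* treat the union as sample_freq */
    attr->sample_freq = hz;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr->exclude_kernel = 1;       /* user-space events only */
    attr->exclude_hv = 1;
}
```

Because sample_period and sample_freq occupy the same storage, only one of them is meaningful at a time, and the freq bit tells the kernel which interpretation to use.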

perf_event_context

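struct perf_event_context {
    struct list_head event_list;            /* all events in this context */
    int nr_events;                          /* number of events in the list */
    struct perf_event_context *parent_ctx;  /* context of the parent */
    u64 time;                               /* context clock */
    u64 timestamp;
};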

Description:

  • event_list
    This field specifies the list of events.
  • nr_events
    This field specifies the number of events that are currently monitored.
  • parent_ctx
    This field points to the context of the process's parent.
  • time,timestamp
    These are context clocks, they run when the context is enabled.

pmu

struct pmu {
    void (*pmu_enable) (struct pmu *pmu);                   /* enable the whole PMU */
    void (*pmu_disable) (struct pmu *pmu);                  /* disable the whole PMU */
    void (*start) (struct perf_event *event, int flags);    /* start one counter */
    void (*stop) (struct perf_event *event, int flags);     /* stop one counter */
    void (*read) (struct perf_event *event);                /* update the event value */
};

Description:
This structure mainly contains function pointers to the PMU-related operations.

  • pmu_enable,pmu_disable
    These functions are used to fully enable/disable a PMU.
  • start,stop
    These functions are used to start or stop a counter on a PMU.
  • read
    This function is used to update the event value for a particular counter.
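
To make the role of these function pointers concrete, here is a small user-space model of the same dispatch pattern. It is not kernel code: every name prefixed with toy_ is invented for this sketch, and a plain integer stands in for the hardware counter register. The read callback folds the change in the raw register since the last read into the accumulated event value.

```c
#include <stdint.h>

struct toy_event;

/* User-space model only; the real struct pmu lives in the kernel. */
struct toy_pmu {
    void (*start)(struct toy_event *event);
    void (*stop)(struct toy_event *event);
    void (*read)(struct toy_event *event);
};

struct toy_event {
    const struct toy_pmu *pmu;
    uint64_t hw_counter;    /* stands in for a hardware register */
    uint64_t prev;          /* raw register value at the last read */
    uint64_t count;         /* accumulated event value */
    int running;
};

static void toy_read(struct toy_event *e)
{
    e->count += e->hw_counter - e->prev;    /* fold in the delta */
    e->prev = e->hw_counter;
}

static void toy_start(struct toy_event *e)
{
    e->prev = e->hw_counter;                /* ignore earlier activity */
    e->running = 1;
}

static void toy_stop(struct toy_event *e)
{
    toy_read(e);                            /* capture the final delta */
    e->running = 0;
}

const struct toy_pmu toy_cpu_pmu = {
    .start = toy_start,
    .stop  = toy_stop,
    .read  = toy_read,
};
```

Callers only ever go through `event->pmu->start(event)` and friends, which is what lets generic code drive different PMU implementations without knowing their details.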

Counting Support in Linux Perf

perf_event assigns one file descriptor per event, and each event is attached either per-thread or per-CPU. The system call perf_event_open() configures the hardware MSRs and creates a file descriptor that can be used for reading the performance measurement data. Once the file descriptor is obtained, we can issue subsequent read calls to get the values of the performance counters. These values are then aggregated at the end of the program execution.

The following is the execution flow for getting the file descriptor.

For enabling and disabling performance monitoring events we use the ioctl and prctl system calls.
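
A minimal sketch of both control paths follows, assuming fd is a descriptor previously returned by perf_event_open. The request codes PERF_EVENT_IOC_* and PR_TASK_PERF_EVENTS_* come from the kernel headers; the wrapper names counter_start, counter_stop and task_counters_set are made up for illustration.

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>

/* Zero one counter and start it. Returns 0, or -1 with errno set. */
int counter_start(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Stop one counter; its value stays readable through read(). */
int counter_stop(int fd)
{
    return ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
}

/* Enable or disable all counters attached to the calling task at once. */
int task_counters_set(int enable)
{
    return prctl(enable ? PR_TASK_PERF_EVENTS_ENABLE
                        : PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
}
```

ioctl acts on a single event (or on a group, when PERF_IOC_FLAG_GROUP is passed as the argument), while prctl toggles every counter attached to the calling task without needing the file descriptors.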

Execution flow of the read system call:

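The perf_event_open system call in detail

A concrete example

With the background finally covered, here is a concrete example to make things easier to follow.
perf stat -e cache-misses -I 1000 -p 2234
Every 1000 ms, this prints the count of the cache-misses hardware event for process 2234 over the preceding 1000 ms. How is that done?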

struct perf_event_attr attr;
attr.type = PERF_TYPE_HARDWARE;             /* 0 */
attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* 3 */

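The 1000 ms output interval is controlled by the user-space program; reading the program in "Reading hardware or software events with performance counters" makes this clear.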

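Monitoring the cache misses of process 2234 actually means monitoring every thread in that process. Assuming the process has 5 threads, the perf_event_open system call is invoked 5 times.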
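
One way a user-space tool can find those threads is to list /proc/<pid>/task, where each entry is a thread ID that can then be passed as the pid argument of perf_event_open. The sketch below covers only the enumeration step, assuming a mounted /proc; the function name list_threads is made up for illustration.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Collect up to `max` thread IDs of `pid` from /proc/<pid>/task.
 * Returns the number stored, or -1 if the directory cannot be opened. */
int list_threads(pid_t pid, pid_t *tids, int max)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);

    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    int n = 0;
    struct dirent *ent;
    while (n < max && (ent = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue;               /* skip "." and ".." */
        tids[n++] = (pid_t)tid;
    }
    closedir(dir);
    return n;
}
```

For a process with 5 threads this would return 5 IDs, one per perf_event_open call. Threads created after the listing are not covered by this simple approach.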

static int perf_event_read_one(struct perf_event *event,
                               u64 read_format, char __user *buf)
{
    u64 enabled, running;
    u64 values[4];
    int n = 0;

    values[n++] = perf_event_read_value(event, &enabled, &running);
    /* debug output added for this article to trace the value being read */
    printk("<0>""liujunming perf_event_read_one value%llu\n", values[0]);
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        values[n++] = enabled;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        values[n++] = running;
    if (read_format & PERF_FORMAT_ID)
        values[n++] = primary_event_id(event);

    /* copy only the fields that were requested back to user space */
    if (copy_to_user(buf, values, n * sizeof(u64)))
        return -EFAULT;

    return n * sizeof(u64);
}

perf_event_read_value reads the cache-miss count recorded in the counter register for the corresponding thread.
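
On the user-space side, the buffer returned by read() has to be decoded in the same order in which perf_event_read_one fills it. The sketch below shows that decoding for a single (non-group) event, assuming <linux/perf_event.h> is available. The struct counter_reading and the function decode_reading are made up for illustration.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

struct counter_reading {
    uint64_t value;
    uint64_t time_enabled;
    uint64_t time_running;
    uint64_t id;
};

/* Decode a buffer laid out as perf_event_read_one() writes it.
 * Fields not selected by read_format are left as 0.
 * Returns the number of bytes consumed from buf. */
size_t decode_reading(const uint64_t *buf, uint64_t read_format,
                      struct counter_reading *out)
{
    size_t n = 0;

    memset(out, 0, sizeof(*out));
    out->value = buf[n++];
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        out->time_enabled = buf[n++];
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        out->time_running = buf[n++];
    if (read_format & PERF_FORMAT_ID)
        out->id = buf[n++];
    return n * sizeof(uint64_t);
}
```

The read_format passed here must be the same value that was stored in perf_event_attr.read_format when the event was opened. Events opened with PERF_FORMAT_GROUP use a different layout that this sketch does not handle.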

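When process 2234 is measured, the cache_misses values of its 5 threads are summed. Suppose the cumulative cache_misses of process 2234 over the first 2 seconds is 10000 and over the first 3 seconds is 30000. Then the process's cache_misses for the one-second interval from 2 s to 3 s is 30000 - 10000 = 20000.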
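
The arithmetic above can be sketched as two small helpers: one that sums the per-thread counters into a process total, and one that turns successive cumulative totals into per-interval values. Both function names are made up for illustration, and the per-thread numbers in the usage note are invented to match the totals in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the cumulative counts of all threads into a process-wide total. */
uint64_t sum_counts(const uint64_t *per_thread, size_t nthreads)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nthreads; i++)
        total += per_thread[i];
    return total;
}

/* Return the count for the interval that just ended and remember the
 * new cumulative total for the next call. */
uint64_t interval_delta(uint64_t now_total, uint64_t *prev_total)
{
    uint64_t delta = now_total - *prev_total;
    *prev_total = now_total;
    return delta;
}
```

With five threads reading {1000, 2000, 3000, 1500, 2500} at 2 s and {6000, 6000, 6000, 6000, 6000} at 3 s, the totals are 10000 and 30000, and interval_delta yields 20000 for the third second.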


References:

  1. A Study of Performance Monitoring Unit, perf and perf_events subsystem
  2. Design and Implementation of a Performance Analysis Tool on the Loongson Multi-core Platform