This article takes an in-depth look at the ioeventfd mechanism, with a focus on the KVM side.

1. Prerequisite

The kernel implementation of the eventfd system call

2. Introduction and Motivation

The commit message of "KVM: add ioeventfd support" explains the motivation for ioeventfd very well:

ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd signal when written to by a guest. Host userspace can register any arbitrary IO address with a corresponding eventfd and then pass the eventfd to a specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause side-effects in the emulated model or may return data to the caller. Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM “heavy-weight” exit back to userspace, and is ultimately serviced by qemu’s device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for other IO (such as to kick off an out-of-band DMA request, etc). For these patterns, the synchronous call is particularly expensive since we really only want to simply get our notification transmitted asynchronously and return as quickly as possible. All the synchronous infrastructure to ensure proper data-dependencies are met in the normal IO case are just unnecessary overhead for signalling. This adds additional computational load on the system, as well as latency to the signalling path.

Therefore, we provide a mechanism for registration of an in-kernel trigger point that allows the VCPU to only require a very brief, lightweight exit just long enough to signal an eventfd. This also means that any clients compatible with the eventfd interface (which includes userspace and kernelspace equally well) can now register to be notified. The end result should be a more flexible and higher performance notification API for the backend KVM hypervisor and peripheral components.

Readers who are unsure what "kick off an out-of-band DMA request" means can read the articles "深入理解DMA part1" and "深入理解DMA part2" (Understanding DMA in Depth, parts 1 and 2).

3. Overview

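Sequence:

  1. QEMU binds a PIO/MMIO region to an eventfd (concretely, it fills in a struct kvm_ioeventfd) and sets up the handler for the notification;
  2. QEMU passes the struct kvm_ioeventfd to KVM through the KVM_IOEVENTFD ioctl;
  3. Based on that information, KVM registers ioeventfd_ops as the handler for the PIO/MMIO region;
  4. When the guest writes to the PIO/MMIO region, a VM exit occurs and KVM eventually calls ioeventfd_write, which signals the eventfd to notify QEMU;
  5. QEMU detects the event on the ioeventfd and calls the corresponding handler to process the IO.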
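
Steps 4 and 5 boil down to ordinary eventfd semantics, which can be tried out without KVM at all. The following is only a userspace sketch: signal_event and wait_event are hypothetical helper names, and the write() call stands in for what KVM does in the kernel with eventfd_signal().

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for KVM's side of step 4: add 1 to the eventfd counter.
 * (Inside the kernel this is done with eventfd_signal(), not write().) */
int signal_event(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Stand-in for QEMU's side of step 5: wait until the eventfd becomes
 * readable, then read the counter, which also resets it to zero.
 * Returns the number of signals collected, or 0 on timeout or error. */
uint64_t wait_event(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t count = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return 0;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

Note that several signals that arrive before the consumer wakes up are merged into a single counter value. That is acceptable here because the write is only a trigger: the consumer is expected to look at the device state (for example a virtqueue) to find out how much work there is.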

4. Details

The parameters passed in from userspace are as follows:

struct kvm_ioeventfd {
    __u64 datamatch; /* value to compare against when KVM_IOEVENTFD_FLAG_DATAMATCH is set */
    __u64 addr;      /* legal pio/mmio address */
    __u32 len;       /* 0, 1, 2, 4, or 8 bytes */
    __s32 fd;        /* the eventfd to signal */
    __u32 flags;
    __u8  pad[36];
};

If KVM_IOEVENTFD_FLAG_DATAMATCH is set in flags, the event is triggered only when the value the guest writes to addr equals datamatch.
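
To make the userspace side concrete, here is a minimal sketch of how such a request could be filled in and handed to KVM. The helpers make_ioeventfd and register_ioeventfd are hypothetical names, not QEMU functions; vm_fd is assumed to be a VM file descriptor obtained earlier via KVM_CREATE_VM, and the struct and flag definitions are assumed to come from <linux/kvm.h>.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fill in the registration request. `match` may be NULL, meaning
 * "fire on any value written" (KVM_IOEVENTFD_FLAG_DATAMATCH not set). */
struct kvm_ioeventfd make_ioeventfd(uint64_t addr, uint32_t len, int efd,
                                    int is_pio, const uint64_t *match)
{
    struct kvm_ioeventfd args;

    memset(&args, 0, sizeof(args));   /* zero everything, including pad[] */
    args.addr = addr;
    args.len = len;
    args.fd = efd;
    if (is_pio)
        args.flags |= KVM_IOEVENTFD_FLAG_PIO;   /* default is MMIO */
    if (match) {
        args.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
        args.datamatch = *match;
    }
    return args;
}

/* Hand the request to KVM; returns 0 on success, -1 with errno set
 * on failure, like any ioctl. */
int register_ioeventfd(int vm_fd, struct kvm_ioeventfd *args)
{
    return ioctl(vm_fd, KVM_IOEVENTFD, args);
}
```

With datamatch, several eventfds can share one address and be told apart by the value written, which is presumably why the flag exists: a single notify register can then serve multiple queues.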

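The kvm_ioeventfd information from userspace has to be converted into a kernel-side representation. The in-kernel ioeventfd structure builds on eventfd and looks like this: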

/*
 * --------------------------------------------------------------------
 * ioeventfd: translate a PIO/MMIO memory write to an eventfd signal.
 *
 * userspace can register a PIO/MMIO address with an eventfd for receiving
 * notification when the memory has been touched.
 * --------------------------------------------------------------------
 */

struct _ioeventfd {
    struct list_head      list;
    u64                   addr;
    int                   length;
    struct eventfd_ctx   *eventfd;
    u64                   datamatch;
    struct kvm_io_device  dev;
    u8                    bus_idx;
    bool                  wildcard;
};
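
  • list links this ioeventfd into the kvm instance's list of ioeventfds.

  • addr is the IO address the ioeventfd is bound to.

  • length is the length of the access associated with the eventfd.

  • eventfd is the eventfd that belongs to this ioeventfd.

  • datamatch has already been described above.

  • dev ties the ioeventfd to the guest (by registering this dev with the guest).

  • bus_idx says which kvm bus the ioeventfd is registered on.

    • KVM divides the addresses registered for ioeventfds into four classes. Each class can be thought of as having its own address space, and they are modelled as addresses on four buses, namely the ones listed in kvm_bus: MMIO, PIO, VIRTIO_CCW_NOTIFY and FAST_MMIO. The difference between MMIO and FAST_MMIO is that MMIO checks whether the length of the value written matches the length given for the ioeventfd, whereas FAST_MMIO does not check the length.
  • wildcard and datamatch are mutually exclusive: if KVM_IOEVENTFD_FLAG_DATAMATCH is not set in the flags of kvm_ioeventfd, _ioeventfd->wildcard is set to true.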

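So _ioeventfd describes everything needed to register an ioeventfd with KVM: the ioeventfd information itself, plus the bus and device information needed to register it with the guest.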
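
How addr, length, datamatch and wildcard work together when a guest write arrives can be illustrated with a small userspace model. This is a simplified sketch that loosely follows the kernel's ioeventfd_in_range(); the names ioeventfd_model and model_in_range are made up for this example, and the real function operates on struct _ioeventfd.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct _ioeventfd. */
struct ioeventfd_model {
    uint64_t addr;
    int      length;     /* 0 means "match on the address only" */
    uint64_t datamatch;
    bool     wildcard;   /* true when no datamatch was requested */
};

/* Does a guest write of `len` bytes at `addr` with payload `val`
 * hit this ioeventfd? */
bool model_in_range(const struct ioeventfd_model *p, uint64_t addr,
                    int len, const void *val)
{
    uint64_t v = 0;

    if (addr != p->addr)
        return false;            /* the address must match exactly */
    if (p->length == 0)
        return true;             /* length 0: the address alone decides */
    if (len != p->length)
        return false;            /* the access size must match exactly */
    if (p->wildcard)
        return true;             /* any value triggers the event */

    /* Otherwise compare the written value with datamatch. */
    switch (len) {
    case 1: { uint8_t  t; memcpy(&t, val, sizeof(t)); v = t; break; }
    case 2: { uint16_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
    case 4: { uint32_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
    case 8: { uint64_t t; memcpy(&t, val, sizeof(t)); v = t; break; }
    default: return false;
    }
    return v == p->datamatch;
}
```

For example, an entry with length 2, wildcard false and datamatch 1 is hit by a 2-byte write of the value 1 to its address, but not by a 2-byte write of 2 or by a 4-byte write of 1.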


The function call chains inside KVM are as follows:

Registration path (entered through the KVM_IOEVENTFD ioctl):

kvm_ioeventfd
    kvm_assign_ioeventfd
        kvm_assign_ioeventfd_idx
            kvm_iodevice_init(&p->dev, &ioeventfd_ops)
            kvm_io_bus_register_dev

Guest write path:

kvm_io_bus_write
    __kvm_io_bus_write
        kvm_iodevice_write
            dev->ops->write    /* = ioeventfd_write */

The device operations registered for an ioeventfd:

static const struct kvm_io_device_ops ioeventfd_ops = {
    .write      = ioeventfd_write,
    .destructor = ioeventfd_destructor,
};
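
The dispatch pattern behind these call chains can be sketched in a few lines of userspace C. Everything here is a hypothetical model (io_device, io_device_ops, notify_write, bus_write and bus_read are made-up names), and the real kvm_io_bus code is more involved, for instance in how it looks up devices by address. The sketch only shows the idea: the bus calls whatever write callback a device registered, and a device that leaves a callback NULL simply does not handle that kind of access.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct io_device;

/* Hypothetical analogue of struct kvm_io_device_ops: callbacks a device
 * does not implement are left NULL. */
struct io_device_ops {
    int (*read)(struct io_device *dev, uint64_t addr, int len, void *val);
    int (*write)(struct io_device *dev, uint64_t addr, int len, const void *val);
};

struct io_device {
    const struct io_device_ops *ops;
    uint64_t addr;      /* the single address this device claims */
    unsigned signals;   /* stands in for the eventfd counter */
};

/* Analogue of ioeventfd_write: on a hit, signal and report success;
 * otherwise let the bus try the next device. */
static int notify_write(struct io_device *dev, uint64_t addr, int len,
                        const void *val)
{
    (void)len;
    (void)val;
    if (addr != dev->addr)
        return -EOPNOTSUPP;
    dev->signals++;
    return 0;
}

/* Like ioeventfd_ops: a write callback, but no read callback. */
static const struct io_device_ops notify_ops = { .write = notify_write };

/* Offer the write to each device until one accepts it. */
int bus_write(struct io_device *devs, size_t n, uint64_t addr, int len,
              const void *val)
{
    for (size_t i = 0; i < n; i++) {
        const struct io_device_ops *ops = devs[i].ops;
        if (ops->write && ops->write(&devs[i], addr, len, val) == 0)
            return 0;
    }
    return -EOPNOTSUPP;   /* nobody handled it in this model */
}

/* Same for reads; devices without a read callback never match. */
int bus_read(struct io_device *devs, size_t n, uint64_t addr, int len,
             void *val)
{
    for (size_t i = 0; i < n; i++) {
        const struct io_device_ops *ops = devs[i].ops;
        if (ops->read && ops->read(&devs[i], addr, len, val) == 0)
            return 0;
    }
    return -EOPNOTSUPP;
}
```

In this model a read at an address claimed by a notify device is never handled on the bus, which mirrors the fact that ioeventfd_ops has no read callback: such an access has to take the normal, synchronous emulation path.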

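Note that the device operations of an ioeventfd (ioeventfd_ops) contain only a write operation and no read operation.

The write operation is what runs when the guest writes to the IO address bound to the ioeventfd, that is, when the guest executes an OUT-class instruction (or, for MMIO, a store to that address). A read would correspondingly be what runs when the guest executes an IN-class instruction. An OUT-class instruction merely pushes data outwards, so the guest can continue running without waiting for QEMU to finish processing it. An IN instruction, however, has to fetch data from outside, so the guest must wait until QEMU has completed the IO request before it can continue.

ioeventfd was designed precisely to save the time a guest spends on OUT-class instructions. The time spent on IN-class instructions cannot be saved this way, which is why the ioeventfd operations contain only write and no read.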

The rest is left to the reader: with the source code and the references below in hand, go and discover more of the details!


References:

  1. KVM: add ioeventfd support
  2. qemu-kvm的ioeventfd机制 (The ioeventfd mechanism of qemu-kvm)
  3. qemu中的eventfd——ioeventfd (eventfd in qemu: ioeventfd)