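Over the past year or so I kept hearing about VFIO and learned that it can be used to (1) pass devices through to guests under virtualization and (2) implement userspace drivers such as DPDK and SPDK. However, I had never studied VFIO in depth. I plan to write a series of blog posts about this technology, and this first post covers the motivation of VFIO. The content is drawn mainly from an expert's slides.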

1. Background

The VFIO driver is an IOMMU/device agnostic framework for exposing direct device access to userspace, in a secure, IOMMU protected environment. In other words, this allows safe, non-privileged, userspace drivers.

Why do we want that? Virtual machines often make use of direct device access (“device assignment”) when configured for the highest possible I/O performance. From a device and host perspective, this simply turns the VM into a userspace driver, with the benefits of significantly reduced latency, higher bandwidth, and direct use of bare-metal device drivers.

Some applications, particularly in the high performance computing field, also benefit from low-overhead, direct device access from userspace. Examples include network adapters (often non-TCP/IP based) and compute accelerators. Prior to VFIO, these drivers had to either go through the full development cycle to become a proper upstream driver, be maintained out of tree, or make use of the UIO framework, which has no notion of IOMMU protection, has limited interrupt support, and requires root privileges to access things like PCI configuration space.

The VFIO driver framework intends to unify these, replacing the KVM PCI-specific device assignment code and providing a more secure, more featureful userspace driver environment than UIO.
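To make "direct device access from userspace" a little more concrete, here is a minimal sketch of the first step a VFIO userspace driver takes: opening the container device and asking the kernel which API version and IOMMU model it supports. The ioctl numbers are computed by hand from the definitions in <linux/vfio.h>, assuming the x86/ARM ioctl encoding; the helper name probe_vfio is my own and not part of any library.

```python
import fcntl
import os

# From <linux/vfio.h>: VFIO_TYPE is ';' and VFIO_BASE is 100, and both
# requests are defined with _IO(), which on x86/ARM is (type << 8) | nr.
# Other architectures encode _IO() differently (assumption: x86/ARM here).
VFIO_GET_API_VERSION = (ord(";") << 8) | 100
VFIO_CHECK_EXTENSION = (ord(";") << 8) | 101
VFIO_API_VERSION = 0   # the only API version defined so far
VFIO_TYPE1_IOMMU = 1   # extension id of the x86-style "type1" IOMMU backend


def probe_vfio(container_path="/dev/vfio/vfio"):
    """Return basic facts about the VFIO container, or None if unavailable.

    probe_vfio is a hypothetical helper written for this post.
    """
    try:
        fd = os.open(container_path, os.O_RDWR)
    except OSError:
        return None  # VFIO not loaded, or no permission on the node
    try:
        version = fcntl.ioctl(fd, VFIO_GET_API_VERSION)
        has_type1 = fcntl.ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)
        return {"api_version": version, "type1_iommu": bool(has_type1)}
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(probe_vfio())
```

On a machine without the vfio module loaded this simply prints None. A real driver would go on to open a group, attach it to the container, and select an IOMMU model.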

The use cases of VFIO are straightforward. Both need direct access to a device:

  • Virtualization guest OS

  • High performance user-level IO stacks such as DPDK

VFIO is intended to replace two existing mechanisms:

  • UIO
  • KVM Legacy PCI Assignment

2. VFIO vs UIO

UIO stands for Userspace I/O.

2.1 Basic goal of UIO: enable userspace device drivers

  • Remap device MMIO to userspace
  • Remap kernel physical memory to userspace
  • Remap kernel virtual memory to userspace
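As a rough illustration of how UIO exposes these regions, the sketch below lists the memory maps of one UIO device by reading sysfs. UIO publishes each region under /sys/class/uio/uioX/maps/mapN/ with addr and size files, and region N is selected by passing N times the page size as the mmap() offset on /dev/uioX. The function name list_uio_maps and its parameters are my own invention for this example.

```python
import mmap
import os


def list_uio_maps(uio_name="uio0", sysfs_root="/sys/class/uio"):
    """Describe the mappable regions of one UIO device (hypothetical helper)."""
    maps_dir = os.path.join(sysfs_root, uio_name, "maps")
    regions = []
    if not os.path.isdir(maps_dir):
        return regions  # no such UIO device on this machine

    def read_number(map_dir, field):
        # The sysfs files contain values such as "0x1000\n".
        with open(os.path.join(map_dir, field)) as f:
            return int(f.read().strip(), 0)

    for entry in os.listdir(maps_dir):
        if not (entry.startswith("map") and entry[3:].isdigit()):
            continue
        index = int(entry[3:])
        map_dir = os.path.join(maps_dir, entry)
        regions.append({
            "index": index,
            "addr": read_number(map_dir, "addr"),
            "size": read_number(map_dir, "size"),
            # Offset to pass to mmap() on /dev/uioX to select this region.
            "mmap_offset": index * mmap.PAGESIZE,
        })
    return sorted(regions, key=lambda r: r["index"])


if __name__ == "__main__":
    for region in list_uio_maps():
        print(region)
```

Note that nothing here involves an IOMMU: the regions are simply remapped into the process.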

2.2 What is missing:

  • IRQ (partial support, polling mostly)
  • DMA

2.3 Why is DMA capability not provided?

  • The IOMMU was not considered from the beginning

2.4 Why is VFIO different?

  • Full DMA capability from userspace
  • Full interrupt support to userspace
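To give a feel for what "DMA capability from userspace" looks like, here is a small sketch that builds the argument for the VFIO_IOMMU_MAP_DMA ioctl, which asks the IOMMU to map a range of process memory at a chosen I/O virtual address. The struct layout and flag values follow struct vfio_iommu_type1_dma_map in <linux/vfio.h>; the ioctl number assumes the x86/ARM encoding, and build_dma_map is a helper name I made up. The sketch only packs the bytes and does not issue the ioctl.

```python
import struct

# _IO(VFIO_TYPE, VFIO_BASE + 13) with VFIO_TYPE = ';' and VFIO_BASE = 100
# (x86/ARM ioctl encoding assumed).
VFIO_IOMMU_MAP_DMA = (ord(";") << 8) | 113
VFIO_DMA_MAP_FLAG_READ = 1 << 0    # device may read from the mapping
VFIO_DMA_MAP_FLAG_WRITE = 1 << 1   # device may write to the mapping

# struct vfio_iommu_type1_dma_map:
#   __u32 argsz; __u32 flags; __u64 vaddr; __u64 iova; __u64 size;
DMA_MAP_FORMAT = "=IIQQQ"


def build_dma_map(vaddr, iova, size, readable=True, writable=True):
    """Pack the ioctl argument for a DMA mapping (hypothetical helper)."""
    flags = 0
    if readable:
        flags |= VFIO_DMA_MAP_FLAG_READ
    if writable:
        flags |= VFIO_DMA_MAP_FLAG_WRITE
    argsz = struct.calcsize(DMA_MAP_FORMAT)
    return struct.pack(DMA_MAP_FORMAT, argsz, flags, vaddr, iova, size)


if __name__ == "__main__":
    # Example values only: map one 4 KiB page at IOVA 0x100000.
    arg = build_dma_map(vaddr=0x7F0000000000, iova=0x100000, size=4096)
    print(len(arg), arg.hex())
```

In a real driver the packed buffer would be passed to the ioctl on the container file descriptor, after a group has been attached and an IOMMU model selected. The device then uses the IOVA, and the IOMMU confines it to the mapped range.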

3. VFIO vs Legacy PCI Assignment

3.1 Legacy Assignment: the Pros

  • Straightforward implementation

3.2 Legacy Assignment: the Cons

  • The minimum granularity is a PCI BDF (Bus:Device.Function)
    • How about a PCIe-to-PCI Bridge?
    • A device capable of Peer-to-Peer, but without ACS capability?
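For readers who have not worked with BDFs before, the following sketch parses the usual textual form (for example 0000:06:0d.0) and packs bus, device, and function into the 16-bit identifier that travels with a PCIe transaction: 8 bits of bus, 5 bits of device, 3 bits of function. The function names parse_bdf and requester_id are my own.

```python
import re

# Optional 4-digit domain, then bus:device.function in hex.
_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_bdf(text):
    """Split 'dddd:bb:dd.f' or 'bb:dd.f' into (domain, bus, device, function)."""
    match = _BDF_RE.match(text)
    if match is None:
        raise ValueError("not a PCI address: %r" % text)
    domain = int(match.group(1) or "0", 16)
    bus = int(match.group(2), 16)
    device = int(match.group(3), 16)
    function = int(match.group(4), 16)
    if device > 0x1F:
        raise ValueError("device number out of range: %r" % text)
    return domain, bus, device, function


def requester_id(bus, device, function):
    """Pack bus (8 bits), device (5 bits), function (3 bits) into 16 bits."""
    return (bus << 8) | (device << 3) | function


if __name__ == "__main__":
    _, bus, device, function = parse_bdf("0000:06:0d.0")
    print(hex(requester_id(bus, device, function)))
```

Legacy assignment treats each such identifier as an independent unit, which is exactly the assumption the two questions above call into doubt.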

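Explanation of “How about a PCIe-to-PCI Bridge?”:

A device uses its Source Identifier (made up of Bus, Device, and Function) to locate the page table that performs its address translation. However, the following special case has to be considered:
For a PCI bridge attached to a PCIe switch, and for the devices below that bridge, a DMA request carries the Source Identifier of the PCIe switch. As a result, the PCI bridge and every device below it use the PCIe switch's Source Identifier to locate the Context Entry, and they all end up with the same page table. If different devices below this PCI bridge were assigned to different VMs, they would share a single page table, which causes problems. For this reason, the PCI bridge and all devices below it must be assigned to the same VM. This is the concept of a group in VFIO.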
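The aliasing problem can be shown with a toy model. This is only a sketch of the lookup described above, not how any real IOMMU driver is written: the dictionary alias_of records which Source Identifier the IOMMU actually sees for a device, and context_table maps a Source Identifier to a page table. All names and addresses here are invented for illustration.

```python
def source_id_seen_by_iommu(device, alias_of):
    """Return the Source Identifier the IOMMU sees for a device's DMA.

    Devices behind a conventional PCI bridge do not appear under their own
    ID; alias_of records the upstream ID that is used in their place.
    """
    return alias_of.get(device, device)


def lookup_page_table(device, alias_of, context_table):
    """Find the page table a device's DMA is translated with."""
    return context_table[source_id_seen_by_iommu(device, alias_of)]


if __name__ == "__main__":
    # Hypothetical topology: 03:00.0 and 03:01.0 sit below a PCI bridge and
    # are seen as 02:00.0; 01:00.0 is an ordinary PCIe endpoint.
    alias_of = {"03:00.0": "02:00.0", "03:01.0": "02:00.0"}
    context_table = {"01:00.0": "page_table_A", "02:00.0": "page_table_B"}

    for dev in ("01:00.0", "03:00.0", "03:01.0"):
        print(dev, "->", lookup_page_table(dev, alias_of, context_table))
```

Both devices below the bridge resolve to the same page table, so there is no way to give them different address spaces, and therefore no safe way to give them to different VMs.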

Explanation of “A device capable of Peer-to-Peer, but without ACS capability?”:

A basic introduction to ACS (Access Control Services):

This isolation is not always at the granularity of a single device though. Even when an IOMMU is capable of this, properties of devices, interconnects, and IOMMU topologies can each reduce this isolation. For instance, an individual device may be part of a larger multi-function enclosure. While the IOMMU may be able to distinguish between devices within the enclosure, the enclosure may not require transactions between devices to reach the IOMMU. Examples of this could be anything from a multi-function PCI device with backdoors between functions to a non-PCI-ACS (Access Control Services) capable bridge allowing redirection without reaching the IOMMU.

For VFIO, this isolation granularity is a vfio_group rather than a single device. Devices in the same group are never split across multiple virtual machines.
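On a Linux host you can see how devices were grouped by looking under /sys/kernel/iommu_groups, where each numbered directory has a devices subdirectory with one entry per member. The sketch below reads that layout; list_iommu_groups is a helper name of my own, and the root path is a parameter only so the function can be pointed at a test directory.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted list of its devices."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # no IOMMU enabled, or not a Linux host
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if not name.isdigit() or not os.path.isdir(devices_dir):
            continue
        groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(group, devices)
```

A group with more than one device is the case discussed here: all of its members have to be handed to the same user together.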

3.3 Why is VFIO different?

  • the minimum granularity is vfio_group, which is derived from iommu_group
  • A PCIe-to-PCI hierarchy belongs to one group
  • Devices that are capable of Peer-to-Peer but lack ACS capability belong to one group. (For example, suppose Device1 and Device2 can do Peer-to-Peer with each other but have no ACS capability. Passing Device1 through to VM 1 and Device2 to VM 2 could break the isolation between the VMs, so Device1 and Device2 must be placed in the same vfio_group.)
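The group rule can be stated as a simple check: no group may have its devices handed to more than one VM. The sketch below is a toy validator for a planned assignment, not something VFIO itself exposes; find_split_groups, the device names, and the group numbers are all made up for this example.

```python
def find_split_groups(group_of, assignment):
    """Return the groups whose devices were assigned to more than one VM.

    group_of:   device name -> group id
    assignment: device name -> VM name
    """
    owners = {}
    for device, vm in assignment.items():
        owners.setdefault(group_of[device], set()).add(vm)
    return {group: sorted(vms) for group, vms in owners.items() if len(vms) > 1}


if __name__ == "__main__":
    # Device1 and Device2 can reach each other without ACS, so they share a group.
    group_of = {"Device1": 7, "Device2": 7, "Device3": 8}
    bad_plan = {"Device1": "vm1", "Device2": "vm2", "Device3": "vm2"}
    good_plan = {"Device1": "vm1", "Device2": "vm1", "Device3": "vm2"}
    print(find_split_groups(group_of, bad_plan))
    print(find_split_groups(group_of, good_plan))
```

The first plan is rejected because group 7 would be shared by two VMs; the second keeps each group with a single owner.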

References:

  1. Yizhou Shan, Notes about Virtualization
  2. Kernel documentation: VFIO
  3. VFIO Overview (VFIO概述)