The VFIO driver is an IOMMU/device-agnostic framework for exposing direct device access to userspace in a secure, IOMMU-protected environment. In other words, it allows safe, non-privileged userspace drivers.
Why do we want that? Virtual machines often make use of direct device access (“device assignment”) when configured for the highest possible I/O performance. From a device and host perspective, this simply turns the VM into a userspace driver, with the benefits of significantly reduced latency, higher bandwidth, and direct use of bare-metal device drivers.
Some applications, particularly in the high performance computing field, also benefit from low-overhead, direct device access from userspace. Examples include network adapters (often non-TCP/IP based) and compute accelerators. Prior to VFIO, these drivers had to either go through the full development cycle to become a proper upstream driver, be maintained out of tree, or make use of the UIO framework, which has no notion of IOMMU protection, limited interrupt support, and requires root privileges to access things like PCI configuration space.
The VFIO driver framework intends to unify these approaches: it replaces the KVM-specific PCI device assignment code and provides a more secure, more featureful userspace driver environment than UIO.
The use cases of VFIO are straightforward; both need direct device access:
- Virtualization guest OS
- High-performance user-level I/O stacks such as DPDK
VFIO replaces two existing mechanisms:
- KVM Legacy PCI Assignment
- UIO, i.e. Userspace I/O
  - Remap device MMIO to userspace
  - Remap kernel physical memory to userspace
  - Remap kernel virtual memory to userspace
  - IRQ (partial support, mostly polling)
  - IOMMU not considered from the beginning
- Full DMA capability from userspace
- Full interrupt support in userspace
- Straightforward implementation
  - The minimum granularity is a PCI BDF (bus/device/function)
    - How about a PCIe-to-PCI Bridge?
    - A device capable of Peer-to-Peer, but without ACS capability?
Explanation of "How about a PCIe-to-PCI Bridge?":
For a PCI bridge fanned out from a PCIe switch, and for the devices behind that bridge, DMA requests carry the Source Identifier of the PCIe switch rather than of the individual device. The bridge and every device behind it therefore use the same Source Identifier to locate the Context Entry, and all resolve to the same page table. If the devices behind such a bridge were assigned to different virtual machines, they would share a single page table, which causes problems. For this reason, the PCI bridge and all devices behind it must be assigned to the same virtual machine; this is the group concept in VFIO.
Explanation of "A device capable of Peer-to-Peer, but without ACS capability?":
This isolation is not always at the granularity of a single device though. Even when an IOMMU is capable of this, properties of devices, interconnects, and IOMMU topologies can each reduce this isolation. For instance, an individual device may be part of a larger multi-function enclosure. While the IOMMU may be able to distinguish between devices within the enclosure, the enclosure may not require transactions between devices to reach the IOMMU. Examples of this could be anything from a multi-function PCI device with backdoors between functions to a non-PCI-ACS (Access Control Services) capable bridge allowing redirection without reaching the IOMMU.
For VFIO, this isolation granularity is the vfio_group instead of a single device: devices in the same group are never assigned to different virtual machines.
- The minimum granularity is the vfio_group, which is derived from the iommu_group
- A PCIe-to-PCI hierarchy belongs to one group
- Devices capable of Peer-to-Peer but without ACS capability belong to one group (for example, if Device1 and Device2 can perform Peer-to-Peer transactions with each other but lack the ACS capability, passing Device1 through to VM1 and Device2 through to VM2 could break VM isolation; therefore Device1 and Device2 must be placed in the same vfio_group)