This article is mainly reposted from:

  1. Scalable IOV技术详解
  2. 聊聊intel平台io虚拟化技术之 SIOV
  3. RECENT ENHANCEMENTS IN INTEL® VIRTUALIZATION TECHNOLOGY FOR DIRECTED I/O (INTEL® VT-D)

1. Introduction

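Scalable I/O Virtualization (SIOV) is an evolution of I/O virtualization technology and a further development of SR-IOV. To improve VM I/O performance, Intel VT-d solved the device pass-through problem by letting a VM access a hardware device directly, while SR-IOV added device sharing by virtualizing the device hardware into multiple VFs that can be given to different VMs.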

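Let us first review SR-IOV. As the figure below shows, SR-IOV introduces two kinds of function, the PF (Physical Function) and the VF (Virtual Function). The PF has full PCIe functionality, including VF management (create, delete, configure). A VF is a lightweight PCIe function that implements only the data-transfer part and has no resource or configuration management. A VF is nevertheless a standard PCIe device: it has a unique BDF (Bus, Device, Function) that identifies it, and it has its own PCIe configuration space.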
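To make the BDF identifier concrete, here is a minimal sketch of how a 16-bit PCIe requester ID splits into bus (8 bits), device (5 bits), and function (3 bits). The helper names `decode_bdf`, `encode_bdf`, and `format_bdf` are made up for this illustration and do not come from any library.

```python
from typing import NamedTuple


class BDF(NamedTuple):
    bus: int       # 8 bits
    device: int    # 5 bits
    function: int  # 3 bits


def decode_bdf(requester_id: int) -> BDF:
    """Split a 16-bit PCIe requester ID into bus, device and function."""
    if not 0 <= requester_id <= 0xFFFF:
        raise ValueError("requester ID must fit in 16 bits")
    return BDF(requester_id >> 8, (requester_id >> 3) & 0x1F, requester_id & 0x7)


def encode_bdf(bdf: BDF) -> int:
    """Pack bus, device and function back into a 16-bit requester ID."""
    return (bdf.bus << 8) | (bdf.device << 3) | bdf.function


def format_bdf(bdf: BDF) -> str:
    """Render in the bus:device.function form used by tools such as lspci."""
    return f"{bdf.bus:02x}:{bdf.device:02x}.{bdf.function:x}"


print(format_bdf(decode_bdf(0x0310)))  # prints "03:02.0"
```

Because every VF consumes one such identifier plus a configuration space of its own, the number of VFs is bounded by bus-number and function-number space, which is one of the scaling limits discussed next.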

2. Limitation of SR-IOV

While SR-IOV enabled the ability to partition a device and provide direct access to VMs, it also imposed scalability limitations on modern cloud and containerized environments. For instance:

  • Device BARs and headers must be duplicated for every VF.
  • PCIe limits resources such as MSI-X to a maximum of 2048 vectors per function.
  • BIOS Firmware must reserve a number of resources such as MMIO ranges, bus ranges to accommodate devices of any capability to be hotplugged.

SR-IOV implementations typically provide only a small number of VFs due to the above resource requirements. Typical SR-IOV devices support 64 or fewer VFs per physical device. Lightweight containerized usages in modern cloud environments expect to have thousands of containers and therefore will put pressure on potentially scarce resources. In these environments, SR-IOV will not scale.

Limitations of SR-IOV based implementations include:

  • Scalability - Unable to scale to hyperscale usages (1000+ VMs/Containers) due to cost implications for having increased memory on board and limitations on BUS numbers in certain platforms.
  • Flexible Resource Management - SR-IOV requires resources such as BUS numbers and MMIO ranges to use the newly created VFs. Typically, the resources are spread evenly between each of the VFs. Although it’s possible to conceive of such variable resource assignments to different VFs, it imposes hardware complexity which would increase hardware cost. For instance, being able to create a device with 2 hardware queues for one VF, and 4 queues on the same physical device for another VF is generally not implemented.
  • Composability - the motivation of SR-IOV is to enable direct VF pass-through. The guest driver has full control on the assigned VF device which the host/hypervisor has no insight into. This makes it difficult to perform live migration or snapshot VF device state.

Even with these limitations, SR-IOV has worked well in traditional VM usage. However, this approach no longer meets the scaling requirements for containerized environments.

3. Scalable IOV

Intel introduced the recent update to Intel® VT-d that allows for fine-grained capacity allocation. More specifically, it allows software to compose virtual devices with different capacity or capability. For instance, it’s not required to replicate hardware like SR-IOV devices. Intel® Scalable IOV allows software to compose a virtual device on demand. The virtual device provisioned via software allows most device access to be separated into slow path (configuration) and fast path (I/O). Any activity that involves configuration and control is done by software mediation. Fast path I/O is performed directly to hardware with no software intervention. This allows resources such as queues to be bundled on demand and such usage can fit either full machine virtualization or native container type usages.

Intel® Scalable IOV requires changes in the following areas:

  • Device Support - A device should support Process Address Space ID (PASID). The PASID is a 20-bit value that is used in conjunction with the Requester ID. PASID-granular resource allocation and proper isolation requirements are identified in the Intel® Scalable I/O Virtualization Technical Specification.
    • The Interrupt Message Store (IMS) provides devices the flexibility to dictate how interrupts are specified without limitations on how many and where the message address and data are stored.
  • Platform Support - DMA remapping hardware should support PASID granular DMA isolation capability.
  • System Software - Support in the Operating System to provide abstractions that allow such devices to be provisioned for a Guest OS, or native process consumption.

Intel® Scalable IOV addresses the aforementioned limitations observed on PCIe* SR-IOV:

  • Scalability - supports finer-grained device sharing. For example, on a NIC with 1024 TX/RX queue pairs, each queue pair can now be independently assignable.
  • Flexible Resource management - software fully manages and maps backend resources to virtual devices. This provides great flexibility for heterogeneous configurations (different resources, different capabilities, and others.)
  • Composability - mediation of the slow-path allows the host/hypervisor to capture the virtual device state to enable live migration or snapshot usages. Also state save/restore is required only on a small piece of device resource (queue, context, etc.), which can be implemented more easily on a device as compared to requiring the entire VF state to be migratable.
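The flexible resource management point can be sketched as a small allocator that composes virtual devices of different sizes out of one pool of queue pairs, following the NIC example above. The `QueuePool` class and its methods are hypothetical and only model the bookkeeping; a real implementation lives in the host driver and the VDCM.

```python
from typing import Dict, List


class QueuePool:
    """Hypothetical pool of hardware queue pairs on one physical device."""

    def __init__(self, total_queues: int) -> None:
        self.free: List[int] = list(range(total_queues))
        self.assigned: Dict[str, List[int]] = {}

    def compose(self, vdev_name: str, num_queues: int) -> List[int]:
        """Back a new virtual device with num_queues free queue pairs."""
        if vdev_name in self.assigned:
            raise ValueError(f"{vdev_name} already exists")
        if num_queues > len(self.free):
            raise RuntimeError("not enough free queue pairs")
        queues, self.free = self.free[:num_queues], self.free[num_queues:]
        self.assigned[vdev_name] = queues
        return queues

    def release(self, vdev_name: str) -> None:
        """Destroy a virtual device and return its queue pairs to the pool."""
        self.free.extend(self.assigned.pop(vdev_name))


pool = QueuePool(1024)
small = pool.compose("vm-a", 2)  # a 2-queue virtual device
large = pool.compose("vm-b", 4)  # a 4-queue virtual device on the same NIC
```

The two virtual devices have different queue counts on the same physical device, which is the heterogeneous configuration that is generally not implemented with SR-IOV VFs.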

To address these limitations of SR-IOV, Intel introduced Scalable IOV. Its main technical features are:

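  1. A hardware-assisted pass-through architecture, specifically:
    1. The slow path is emulated in software. The slow path generally means device configuration and interface management, while the fast path is the I/O data path. In SR-IOV, both the slow path and the fast path go through hardware pass-through;
    2. Fast-path resources can be allocated and mapped dynamically;
    3. Hardware guarantees that fast-path resources are fully isolated during DMA, which keeps different virtual devices securely isolated;
  2. Finer-grained dynamic resource configuration. Virtual devices can be carved out per TX/RX queue pair on the PCIe device rather than per VF, which gives finer-grained resource allocation;
  3. Use of the PCIe PASID (Process Address Space ID) capability. PASID is an addition to the PCIe protocol. Instead of identifying a device only by its BDF (Bus, Device, Function), BDF + PASID subdivides one PCIe device into many more virtual devices;
  4. Support for all kinds of I/O devices, including NICs, storage devices, GPUs, and accelerators;
  5. Support for VMs, bare metal, containers, and other usage scenarios;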

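Those are the main technical features of Scalable IOV. Like SR-IOV, it is not just an innovation on the PCIe device side but one that spans the whole platform: hardware devices, BIOS, operating system, hypervisor, CPU, and IOMMU.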

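Two further benefits:

  • Over-provisioning: queue resources can be shared between two VDEVs
  • Generational Compatibility: the VMM can use the VDCM (Virtual Device Composition Module) to present the same VDEV capabilities on different generations of hardware, so VMs can still migrate between hosts that deploy different generations of SIOV devices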

4. Overall Architecture

The overall architecture and components of Scalable IOV are shown in the figure below.

5. Hardware Architecture

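SIOV serves upper-layer software at queue granularity, so the device layer introduces the Assignable Device Interface (ADI), which is somewhat similar to a VF in SR-IOV. An ADI is a set of backend resources that is allocated, configured, and organized as an independent unit. It differs from a VF in two ways: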

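  1. It has no PCI configuration space of its own; all ADIs share the PF's configuration space;
  2. It is identified by a PASID (together with the PF's BDF) rather than by a BDF of its own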

As a device interface that can be assigned at any time, an ADI also has the following properties:

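  1. ADIs are fully isolated from one another and share no resources;
  2. The MMIO registers of different ADIs are separated at physical-page granularity, so that they land on different pages when MMIO pages are mapped and no MMIO page is shared between different processes;
  3. All ADI DMA is tagged with a PASID, so the IOMMU can look up a different page table according to the PASID of each DMA request, which keeps ADIs securely isolated from one another;
  4. ADIs use Interrupt Message Storage (IMS). IMS is not actually tied to ADIs; ADIs use it because there tend to be many ADIs and every queue of every ADI can raise interrupts, so IMS is used to store the large number of interrupt messages. The storage format and location of IMS are device-implementation specific. In addition, ADI interrupts cannot be shared, and only MSI/MSI-X-style message-signaled interrupts are supported, not legacy interrupts;
  5. Each ADI can be reset independently;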

5.1 PCI Configuration Space

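During PCIe device initialization and enumeration, software needs a way to discover from configuration space whether a device supports Scalable IOV. Intel defines a Designated Vendor-Specific Extended Capability (DVSEC) for discovering and configuring devices that support Scalable IOV, as shown in the figure below: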
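The discovery step can be sketched as a walk over the PCIe extended capability list, which starts at config-space offset 0x100. The sketch below looks for a DVSEC (extended capability ID 0x0023) whose vendor ID is Intel's 0x8086. The DVSEC ID value 5 for Scalable IOV is an assumption here and should be checked against the Intel specification; `find_siov_dvsec` is a made-up helper name, and the config space is passed in as raw bytes rather than read from a real device.

```python
import struct
from typing import Optional

PCI_EXT_CAP_START = 0x100      # extended capabilities begin at this offset
PCI_EXT_CAP_ID_DVSEC = 0x0023  # Designated Vendor-Specific Extended Capability
INTEL_VENDOR_ID = 0x8086
SIOV_DVSEC_ID = 5              # assumption: verify against the Scalable IOV spec


def find_siov_dvsec(config: bytes) -> Optional[int]:
    """Return the config-space offset of the Scalable IOV DVSEC, or None."""
    offset = PCI_EXT_CAP_START
    visited = set()  # guards against a malformed, looping capability list
    while (offset >= PCI_EXT_CAP_START and offset not in visited
           and offset + 4 <= len(config)):
        visited.add(offset)
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            return None  # no extended capabilities present
        cap_id = header & 0xFFFF
        if cap_id == PCI_EXT_CAP_ID_DVSEC and offset + 12 <= len(config):
            # DVSEC header 1 holds the vendor ID, header 2 the DVSEC ID.
            hdr1, hdr2 = struct.unpack_from("<II", config, offset + 4)
            if ((hdr1 & 0xFFFF) == INTEL_VENDOR_ID
                    and (hdr2 & 0xFFFF) == SIOV_DVSEC_ID):
                return offset
        offset = (header >> 20) & 0xFFC  # next-capability pointer, bits 31:20
    return None
```

On Linux the input bytes could come from a device's sysfs `config` file, although reading past the first 64 bytes usually needs root privileges.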

5.2 MMIO

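An ADI's MMIO is a contiguous, page-aligned address range within the PF's BAR space. Each ADI's MMIO is independent of the others. ADI MMIO registers fall into two classes: registers that are accessed frequently, such as hardware doorbells, and registers that are accessed rarely or only on the slow path, such as those used for device configuration and management.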
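As a rough illustration of the page-alignment rule, the sketch below lays out per-ADI MMIO ranges inside a BAR so that no two ADIs ever share a page. The 4 KiB page size and the function names are assumptions made for the example, not values from the specification.

```python
from typing import List, Tuple

PAGE_SIZE = 4096  # assumption: 4 KiB system pages


def page_align_up(size: int, page_size: int = PAGE_SIZE) -> int:
    """Round size up to a whole number of pages."""
    return (size + page_size - 1) // page_size * page_size


def layout_adi_mmio(bar_offset: int, adi_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return one (start, end) byte range per ADI, each on its own pages."""
    if bar_offset % PAGE_SIZE:
        raise ValueError("ADI MMIO must start on a page boundary")
    ranges = []
    cursor = bar_offset
    for size in adi_sizes:
        span = page_align_up(size)
        ranges.append((cursor, cursor + span))
        cursor += span
    return ranges


# Three ADIs needing 64 bytes, one page, and a bit more than one page.
for start, end in layout_adi_mmio(0x10000, [64, 4096, 5000]):
    print(hex(start), hex(end))
```

Even the ADI that needs only 64 bytes of registers gets a full page, which is what allows each range to be mapped into a different guest or process without exposing a neighbor's registers.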

5.3 PASID (distinguishing DMA requests from different ADIs)

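The IOMMU performs DMA remapping: it translates the IOVA (I/O virtual address) supplied by each I/O device into a physical address for the device's DMA. In the Intel IOMMU, each I/O device locates its own page table through its BDF. To support Scalable IOV, DMA remapping gained PASID support and its multi-level table structure was redesigned, as shown in the figure below: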
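The effect of PASID-granular remapping can be modeled with a toy lookup keyed by (BDF, PASID) instead of BDF alone. This is only a conceptual sketch: the flat dictionaries stand in for the real multi-level tables, and `ToyIommu` with its methods is invented for the illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
PASID_BITS = 20  # PASID is a 20-bit value


class ToyIommu:
    """Toy model: one page table per (BDF, PASID) pair."""

    def __init__(self) -> None:
        self.tables = {}  # (bdf, pasid) -> {iova page number: physical page number}

    def map_page(self, bdf: int, pasid: int, iova: int, phys: int) -> None:
        if not 0 <= pasid < (1 << PASID_BITS):
            raise ValueError("PASID must fit in 20 bits")
        table = self.tables.setdefault((bdf, pasid), {})
        table[iova >> PAGE_SHIFT] = phys >> PAGE_SHIFT

    def translate(self, bdf: int, pasid: int, iova: int) -> int:
        table = self.tables.get((bdf, pasid))
        page = iova >> PAGE_SHIFT
        if table is None or page not in table:
            raise PermissionError("DMA fault: no mapping for this BDF/PASID")
        return (table[page] << PAGE_SHIFT) | (iova & ((1 << PAGE_SHIFT) - 1))


iommu = ToyIommu()
# Two ADIs behind the same PF (same BDF) use the same IOVA with different PASIDs.
iommu.map_page(0x0310, pasid=1, iova=0x1000, phys=0x80000000)
iommu.map_page(0x0310, pasid=2, iova=0x1000, phys=0x90000000)
print(hex(iommu.translate(0x0310, 1, 0x1234)))  # prints 0x80000234
print(hex(iommu.translate(0x0310, 2, 0x1234)))  # prints 0x90000234
```

The same IOVA from the same BDF resolves to different physical pages depending on the PASID, and a PASID with no table faults, which is how DMA from different ADIs stays isolated.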

5.4 IMS

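Even with MSI-X, a PCIe function supports at most 2048 interrupt vectors. What happens when the ADIs on one PF together need more interrupts than this limit?
To remove this limit, SIOV introduces a new interrupt storage mechanism called IMS (Interrupt Message Storage). In theory IMS places no upper bound on the number of interrupts. It is still a message-based interrupt mechanism: each message has a DWORD-sized payload and a 64-bit address. The messages are stored in an IMS table, which can be held entirely in device hardware or entirely in host memory.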
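A quick calculation shows when the 2048-vector MSI-X limit becomes a problem, together with a sketch of what one IMS entry holds (a 64-bit address and a DWORD payload, as described above). The `ImsEntry` class, the helper functions, and the ADI and queue counts are illustrative assumptions, not a real device layout.

```python
from dataclasses import dataclass

MSIX_MAX_VECTORS = 2048  # MSI-X table limit per PCIe function


@dataclass(frozen=True)
class ImsEntry:
    address: int  # 64-bit message address
    data: int     # DWORD (32-bit) message payload

    def __post_init__(self) -> None:
        if not 0 <= self.address < (1 << 64):
            raise ValueError("address must fit in 64 bits")
        if not 0 <= self.data < (1 << 32):
            raise ValueError("data must fit in 32 bits")


def vectors_needed(num_adis: int, queues_per_adi: int, vectors_per_queue: int = 1) -> int:
    """Total interrupt vectors if every queue of every ADI gets its own vectors."""
    return num_adis * queues_per_adi * vectors_per_queue


def fits_in_msix(num_adis: int, queues_per_adi: int) -> bool:
    return vectors_needed(num_adis, queues_per_adi) <= MSIX_MAX_VECTORS


print(vectors_needed(64, 4), fits_in_msix(64, 4))      # prints "256 True"
print(vectors_needed(1024, 4), fits_in_msix(1024, 4))  # prints "4096 False"
```

With 1024 ADIs of 4 queues each, the device needs 4096 vectors, twice what an MSI-X table can hold, so the messages have to live in device-defined storage such as an IMS table.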

6. Software Architecture


6.1 VDCM

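The VDCM (Virtual Device Composition Module) is mainly responsible for establishing the mapping between ADIs and virtual devices (VDEVs) and for handling and emulating slow-path operations (interpreting and executing MMIO accesses that trap to the backend). It also handles ADI operations such as reset and configuration.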

6.2 VDEV

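As mentioned above, a VDEV is composed of one or more ADIs and appears in the guest as a standard PCIe device. Each VDEV has a virtual requester ID, config space, memory BARs, MSI-X table, and so on, all of which are emulated by the VDCM.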

6.3 VDEV MMIO and interrupts

As the analysis above shows, the VDCM plays a central role in the software architecture. The figure below illustrates the implementation:

With that figure in mind, let us look at some details, such as VDEV MMIO and interrupts.

6.3.1 VDEV MMIO

As the figure shows, VDEV MMIO is implemented in three ways:

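  1. Mapped directly to ADI MMIO. As with SR-IOV, the hardware MMIO is exposed to the guest through EPT so the guest can access it directly, which avoids a large number of VM exits;
  2. Emulated by the VDCM. A guest write to this MMIO traps to the VDCM, which interprets it and emulates the corresponding action. This kind of MMIO mainly carries control-plane interactions;
  3. Mapped to host memory. This kind of MMIO usually holds parameters or data, so reads and writes need no interpretation or instruction emulation on the VDCM side.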
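The three MMIO classes can be pictured as a small dispatcher that decides what software has to do for a guest write. This is a conceptual model only: `ToyVdev`, `MmioKind`, and the region layout are invented for the example, and in a real system the direct and memory-backed cases are set up once through page mappings rather than checked on every access.

```python
from enum import Enum
from typing import Dict, List, Tuple


class MmioKind(Enum):
    DIRECT = "direct"      # mapped straight to ADI MMIO, no VM exit
    EMULATED = "emulated"  # traps to the VDCM, which emulates the action
    MEMORY = "memory"      # backed by host memory, no emulation needed


class ToyVdev:
    """Toy model of how a VDEV's MMIO regions are handled."""

    def __init__(self, regions: List[Tuple[int, int, MmioKind]]) -> None:
        self.regions = regions                  # (start, end, kind), end exclusive
        self.host_memory: Dict[int, int] = {}   # backing store for MEMORY regions
        self.trapped_writes: List[Tuple[int, int]] = []  # what the VDCM must emulate

    def classify(self, offset: int) -> MmioKind:
        for start, end, kind in self.regions:
            if start <= offset < end:
                return kind
        raise ValueError(f"offset {offset:#x} is outside the VDEV's MMIO")

    def guest_write(self, offset: int, value: int) -> MmioKind:
        kind = self.classify(offset)
        if kind is MmioKind.EMULATED:
            self.trapped_writes.append((offset, value))  # slow path via the VDCM
        elif kind is MmioKind.MEMORY:
            self.host_memory[offset] = value             # plain memory store
        # DIRECT: the write reaches the hardware; software is not involved.
        return kind


vdev = ToyVdev([
    (0x0000, 0x1000, MmioKind.DIRECT),    # e.g. doorbells
    (0x1000, 0x2000, MmioKind.EMULATED),  # e.g. control registers
    (0x2000, 0x3000, MmioKind.MEMORY),    # e.g. parameters
])
vdev.guest_write(0x0010, 1)     # fast path
vdev.guest_write(0x1008, 0xAB)  # trapped and recorded for emulation
vdev.guest_write(0x2000, 42)    # stored in host memory
```

Only the write to the emulated region costs a trap, which is why frequently used registers such as doorbells are placed in directly mapped pages.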

6.3.2 VDEV interrupts

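Through the VDCM, the VDEV exposes a virtual MSI or MSI-X capability to the guest. When the guest driver programs MSI or MSI-X, the VDCM intercepts the access and performs the corresponding interrupt virtualization. Note that slow-path interrupts can be triggered through the interrupt injection interface provided by the VMM, whereas fast-path (data-plane) interrupts are injected through the IOMMU's posted-interrupt mechanism.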

7. MISC

7.1 Hardware-Assisted Mediated Pass-Through



Seen from another angle, SIOV combines the advantages of SR-IOV and mediated pass-through.


References:

  1. 聊聊intel平台io虚拟化技术之 SIOV
  2. Scalable IOV技术详解
  3. RECENT ENHANCEMENTS IN INTEL® VIRTUALIZATION TECHNOLOGY FOR DIRECTED I/O (INTEL® VT-D)
  4. Introducing Intel® Scalable I/O Virtualization
  5. White Paper
  6. ASSIGNABLE INTERFACES IN INTEL® SCALABLE I/O VIRTUALIZATION IN LINUX
  7. Version 1.2 SPEC
  8. 英特尔携手微软打造全新I/O虚拟化架构,提升加速器和I/O设备的可扩展性
  9. Hardware-Assisted Mediated Pass-Through with VFIO
  10. Intel® Scalable I/O Virtualization