This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.
The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.
The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.
In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.
The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.
When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.
During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the “vhost worker thread”. The job of the worker thread is to handle I/O events and perform the device emulation.
Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation; it depends on userspace to handle the control plane while the data plane is done in-kernel.
The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.
File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.
One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.
When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.
On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a “call” file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.
In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.
Here are the main points to begin exploring the code:
The QEMU userspace code shows how to initialize the vhost instance:
As the diagram above shows, comparing the network packet data flows illustrates the point of vhost-pci: it is a high-performance inter-VM communication scheme that lets a network packet travel directly from one VM to another without passing through a vSwitch.
For details, see vhost-pci and virtio-vhost-user.
References:
The virtio-vhost-user device lets guests act as vhost device backends so that virtual network switches and storage appliance VMs can provide virtio devices to other guests.
virtio-vhost-user was inspired by vhost-pci by Wei Wang and Zhiyong Yang.
2.1 Devices for cloud environments
In cloud environments everything is a guest. It is not possible for users to run vhost-user processes on the host. This precludes high-performance vhost-user appliances from running in cloud environments.
virtio-vhost-user allows vhost-user appliances to be shipped as virtual machine images. They can provide I/O services directly to other guests instead of going through an extra layer of device emulation like a host network switch:
Traditional Appliance VMs virtio-vhost-user Appliance VMs
+-------------+ +-------------+ +-------------+ +-------------+
| VM1 | | VM2 | | VM1 | | VM2 |
| Appliance | | Consumer | | Appliance | | Consumer |
| ^ | | ^ | | <------+---+------> |
+------|------+---+------|------+ +-------------+---+-------------+
| +-----------------+ | | |
| Host | | Host |
+-------------------------------+ +-------------------------------+
Once the vhost-user session has been established all vring activity can be performed by poll mode drivers in shared memory. This eliminates vmexits in the data path so that the highest possible VM-to-VM communication performance can be achieved.
Even when interrupts are necessary, virtio-vhost-user can use lightweight vmexits thanks to ioeventfd instead of exiting to host userspace. This ensures that VM-to-VM communication bypasses device emulation in QEMU.
Virtio devices were originally emulated inside the QEMU host userspace process. Later on, vhost allowed a subset of a virtio device, called the vhost device backend, to be implemented inside the host kernel. vhost-user then allowed vhost device backends to reside in host userspace processes instead.
virtio-vhost-user takes this one step further by moving the vhost device backend into a guest. It works by tunneling the vhost-user protocol over a new virtio device type called virtio-vhost-user.
The following diagram shows how two guests communicate:
+-------------+ +-------------+
| VM1 | | VM2 |
| | | |
| vhost | shared memory | |
| device | +-----------------> | |
| backend | | |
| | | virtio-net |
+-------------+ +-------------+
| | | |
| virtio- | vhost-user socket | |
| vhost-user | <-----------------> | vhost-user |
| QEMU | | QEMU |
+-------------+ +-------------+
VM2 sees a regular virtio-net device. VM2’s QEMU uses the existing vhost-user feature as if it were talking to a host userspace vhost-user backend.
VM1’s QEMU tunnels the vhost-user protocol messages over the new virtio-vhost-user device so that guest software in VM1 can act as the vhost-user backend.
It is possible to reuse existing vhost-user backend software with virtio-vhost-user since they use the same vhost-user protocol messages. A driver is required for the virtio-vhost-user PCI device that carries the messages instead of the usual vhost-user UNIX domain socket. The driver can be implemented in a guest userspace process using Linux vfio-pci, but a guest kernel driver implementation would also be possible.
The vhost device backend vrings are accessed through shared memory and do not require vhost-user message exchanges in the data path. No vmexits are taken when poll mode drivers are used. Even when interrupts are used, QEMU is not involved in the data path because ioeventfd lightweight vmexits are taken.
All vhost device types work with virtio-vhost-user, including net, scsi, and blk.
Below is an excerpt showing how VirtioVhostUser is used in DPDK:
My understanding of the "Memory region I/O in device" mentioned in the slides is that it refers to the MMIO registers of the VVU (VirtioVhostUser) device.
References:
Shared Virtual Addressing (SVA) is the ability to share process address spaces with devices. It is called “SVM” (Shared Virtual Memory) by OpenCL and some IOMMU architectures, but since that abbreviation is already used for AMD virtualisation in Linux (Secure Virtual Machine), we prefer the less ambiguous “SVA”.
Shared Virtual Addressing for the IOMMU
cc: Cache Coherent
References:
I strongly recommend watching the video Intel VMDq Explanation, which introduces VMDq very clearly. Below is a summary of the video, contrasting "without VMDq" with "with VMDq".
A single core (the one in charge of handling every incoming packet, determining which other core to interrupt, and copying the data to the target VM) cannot keep up with 10 Gbps of data. Most incoming packets require two interrupts: one for the core assigned to handle Ethernet interrupts, followed by an interrupt for the core running the VM the packet is targeted at.
Receive Path
Reduces overhead and increases throughput by sorting packets with the Intel Ethernet Controller and spreading the workload amongst multiple CPU cores.
This section mainly records the key notes from the Intel® VMDq Technology white paper.
VMDq
The VMM assigns each VM its own queue in the server's physical NIC, so traffic leaving a VM can be sent by the software switch straight to the designated queue without the switch having to sort and route it.
However, the VMM and the virtual switch still have to copy network traffic between the VMDq queues and the VMs.
SR-IOV
SR-IOV goes further: by creating separate Virtual Functions (VFs), it presents each VM with what looks like an independent NIC. The VM then talks to the NIC directly without going through a software switch, and data moves between the VF and the VM at high speed via DMA, so SR-IOV delivers the best performance.
Unlike SR-IOV, which exposes a complete device interface to the virtual machine guest, VMDq only provides network queues to the virtual machine guest.
VMDq was only a transitional technology and has since been superseded by SR-IOV.
References:
iproute2 is a collection of userspace utilities for controlling and monitoring various aspects of networking in the Linux kernel, including routing, network interfaces, tunnels, traffic control, and network-related device drivers.
References:
This is implemented with the Linux bonding mechanism: the passthrough network device and the PV device are bonded into a single NIC, and traffic is switched over to the PV device during live migration.
A bonded interface aggregates multiple network interfaces into one logical "bonded" interface, which can be used for failover or load balancing.
References:
The TC framework includes a filter actions mechanism. A filter really acts as a classifier: once a packet matches a particular filter, the actions attached to that filter can be executed to process the packet.
As shown in the figure:
The tc filter framework provides the infrastructure for another extensible set of tools as well, namely tc actions. As the name suggests, they allow doing things with packets (or associated data). The list of actions is part of a given filter. If it matches, each action it contains is executed in order before returning the classification result.
I saw work on LWN about offloading tc actions to the net device; I'll look into it in detail when I get a chance.
References:
/* drivers/net/ifb.c: ... */
From the comments in the kernel, the motivation for ifb is to solve the following two problems:
Sharing one root qdisc among multiple NICs was an original goal of ifb. If you have ten NICs and want the same traffic-control policy on all of them, do you have to configure it ten times? Instead, factor out the common part: create an ifb virtual NIC, redirect the traffic of all ten NICs to it, and then a single qdisc configured on that virtual NIC is enough.
QoS in Linux is split into an ingress part and an egress part: the ingress part is mainly used for policing incoming traffic, while the egress part is used for queuing and scheduling.
Most queuing disciplines (qdiscs) work on the output direction; the input direction has only one, the ingress qdisc. The ingress qdisc itself has very limited functionality, but it can redirect incoming packets. By using the ingress qdisc to redirect inbound packets to the virtual device ifb, whose egress side can be configured with any of the qdiscs, you can apply queue scheduling to inbound traffic.
IFB is an alternative to tc filters for handling ingress traffic, by redirecting it to a virtual interface and treating it as egress traffic there. You need one ifb interface per physical interface, to redirect ingress traffic from eth0 to ifb0, eth1 to ifb1, and so on.
When inserting the ifb module, tell it the number of virtual interfaces you need. The default is 2:
modprobe ifb numifbs=1
Now, enable all ifb interfaces:
ip link set dev ifb0 up # repeat for ifb1, ifb2, ...
And redirect ingress traffic from the physical interfaces to the corresponding ifb interface. For eth0 -> ifb0:
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0
Again, repeat for eth1 -> ifb1, eth2 -> ifb2 and so on, until all the interfaces you want to shape are covered.
Now, you can apply all the rules you want. Egress rules for eth0 go as usual on eth0. Let’s limit bandwidth, for example:
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:1 htb rate 1mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit
Needless to say, repeat for eth1, eth2, …
Ingress rules for eth0 now go as egress rules on ifb0 (whatever goes into ifb0 must come out, and only eth0 ingress traffic goes into ifb0). Again, a bandwidth limit example:
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:1 htb rate 1mbit
tc class add dev ifb0 parent 1:1 classid 1:10 htb rate 1mbit
The advantage of this approach is that egress rules are much more flexible than ingress filters. Ingress filters only allow you to drop packets, not introduce wait times, for example. By handling ingress traffic as egress you can set up queue disciplines, with traffic classes and, if need be, filters. You get access to the whole tc tree, not only simple filters.
References:
Traffic Control (TC) refers to the queuing mechanisms a network device uses to receive and send packets: the packet receive rate, the transmit rate, the order in which multiple packets are sent, and so on.
Linux implements a traffic control subsystem consisting of two parts: the queuing framework in the kernel and the userspace tc utility used to configure it.
They are somewhat analogous to the kernel-side netfilter framework and the userspace iptables program.
Traffic control can serve several purposes:
Simply put, a qdisc is a scheduler. Every output interface needs a scheduler of some kind, and the default scheduler is a FIFO. Other qdiscs available under Linux will rearrange the packets entering the scheduler’s queue in accordance with that scheduler’s rules.
The qdisc is the major building block on which all of Linux traffic control is built, and is also called a queuing discipline.
The classful qdiscs can contain classes, and provide a handle to which to attach filters.
The classless qdiscs can contain no classes, nor is it possible to attach a filter to a classless qdisc.
Implementing this control over packet reception and transmission requires queue structures to hold packets temporarily. In the Linux implementation, this control mechanism, covering both the data structures and the algorithms, is abstracted as the queuing discipline, or qdisc. A qdisc exposes two callbacks, enqueue and dequeue, for queuing and dequeuing packets, while the concrete queuing algorithm is hidden inside the qdisc.
A qdisc has two operations: enqueue and dequeue.
Classes only exist inside a classful qdisc (e.g., HTB and CBQ). Classes are immensely flexible and can always contain either multiple children classes or a single child qdisc.
Any class can also have an arbitrary number of filters attached to it, which allows the selection of a child class or the use of a filter to reclassify or drop traffic entering a particular class.
A leaf class is a terminal class in a qdisc. It contains a qdisc (default FIFO) and will never contain a child class. Any class which contains a child class is an inner class (or root class) and not a leaf class.
A filter is used by a classful qdisc to determine in which class a packet will be enqueued.
From these three elements, qdisc, class, and filter, very complex tree-shaped qdisc structures can be built, greatly extending the power of traffic control.
For a tree-shaped qdisc, when a packet arrives at the top-level qdisc, the calls recurse downward level by level. When a parent object's (qdisc's or class's) enqueue callback is invoked, the filters attached to it are tried in turn until one matches; the packet is then enqueued into the class that filter points to, which concretely means calling the enqueue function of the qdisc configured on that class. Packets that match no filter fall into the default class.
As shown in the figure:
Every class and classful qdisc requires a unique identifier within the traffic control structure. This unique identifier is known as a handle and has two constituent members, a major number and a minor number. These numbers can be assigned arbitrarily by the user in accordance with the following rules.
The numbering of handles for classes and qdiscs
major
This parameter is completely free of meaning to the kernel. The user may use an arbitrary numbering scheme, however all objects in the traffic control structure with the same parent must share a major handle number. Conventional numbering schemes start at 1 for objects attached directly to the root qdisc.
minor
This parameter unambiguously identifies the object as a qdisc if minor is 0. Any other value identifies the object as a class. All classes sharing a parent must have unique minor numbers.
The special handle ffff:0 is reserved for the ingress qdisc.
The handle is used as the target in classid and flowid phrases of tc filter statements. These handles are external identifiers for the objects, usable by userland applications. The kernel maintains internal identifiers for each object.
See the man pages: man tc, man tc-u32, man tc-htb, and so on.
As a simple example, in order to limit the bandwidth of an individual IP address stored in the CLIENT_IP shell variable, with limitations like the following:
- CLIENT_IP = 100kbps
- CLIENT_IP (if there is more bandwidth available) = 200kbps

The commands below would suffice:
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:1 htb rate 1000kbps ceil 1500kbps
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1kbps ceil 2kbps
tc class add dev eth0 parent 1:1 classid 1:11 htb rate 100kbps ceil 200kbps
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip src ${CLIENT_IP} flowid 1:11
References:
Network ingress and egress are terms used in networking to describe the direction of network traffic. In general, ingress refers to network traffic that enters a network or a device, while egress refers to network traffic that exits a network or a device.
For example, when you browse a website, the data packets that are sent from the website’s server to your browser are considered ingress traffic for your device and egress traffic for the server. Conversely, the data packets that are sent from your browser to the website’s server are considered egress traffic for your device and ingress traffic for the server.
https://www.quora.com/What-are-network-ingress-and-egress
The answer from ChatGPT:
Network ingress and egress refer to the movement of data into and out of a network. Ingress: This refers to the incoming data traffic that enters a network from an external source, such as the internet or another network. Egress: This refers to the outgoing data traffic that leaves a network to an external destination, such as the internet or another network. In networking, understanding and managing network ingress and egress is important for maintaining the security, performance, and efficiency of a network.
The slides contain the following description:
Host may use a new mechanism to throttle commands processing by migrating controller to slow down changes
This corresponds to:
Support limit the BW and IOPS of a controller to allow slowing down of command processing on a migrating controller
This is the related QoS implementation: consider a write-heavy workload; without rate limiting, the final iteration could leave a large number of dirty LBAs, making the downtime rather long.
The original post considers NVMe live migration both for local disks and for non-local disks.
For local disks, dirty LBAs need to be tracked, and each iteration of the live migration transfers the dirty LBAs (analogous to dirty-page transfer in VM live migration).
For non-local disks, there is no need to track dirty LBAs at all.
For NVMe live migration on IPU/DPU devices, see NVMe VFIO Live Migration for IPU/DPU Devices.
Notably, if the host IOMMU supports DMA dirty-page tracking, the NVMe device does not need to track DMA dirty pages itself.
References:
No need to understand LUNs.
No need to understand vSAN.
References:
Both speed and duplex can change, thus the driver is expected to re-read these values after receiving a configuration change notification.
struct virtio_net_config { ... }
Linux kernel source code:
static void virtnet_config_changed_work(struct work_struct *work)
{
struct virtnet_info *vi =
container_of(work, struct virtnet_info, config_work);
u16 v;
if (virtio_cread_feature(vi->vdev, VIRTIO_NET_F_STATUS,
struct virtio_net_config, status, &v) < 0)
return;
if (v & VIRTIO_NET_S_ANNOUNCE) {
netdev_notify_peers(vi->dev);
virtnet_ack_link_announce(vi);
}
/* Ignore unknown (future) status bits */
v &= VIRTIO_NET_S_LINK_UP;
if (vi->status == v)
return;
vi->status = v;
if (vi->status & VIRTIO_NET_S_LINK_UP) {
virtnet_update_settings(vi);
netif_carrier_on(vi->dev);
netif_tx_wake_all_queues(vi->dev);
} else {
netif_carrier_off(vi->dev);
netif_tx_stop_all_queues(vi->dev);
}
}
static void virtnet_update_settings(struct virtnet_info *vi) { ... }
XDP is really a hook inside the NIC driver for fast packet processing. Why fast? Because packets are handled at a very low level, skipping much of the kernel's skb processing overhead.
XDP exposes a network hook where a BPF program can be loaded. In this hook the program can make arbitrary modifications to incoming packets and take fast decisions, avoiding the extra cost of processing deeper in the kernel. This makes XDP the best-performing hook for jobs such as DDoS mitigation.
Reading the material above will give you a solid understanding of XDP.
Take DDoS mitigation as an example:
The XDP program is executed at the earliest possible moment after a packet is received from the hardware, before the kernel allocates its per-packet sk_buff data structure.
At the code level:
For details see section 3.1, The XDP Driver Hook, of the XDP paper.
I recommend reading the Chinese translation of the paper XDP (eXpress Data Path): fast, programmable packet processing in the operating system kernel (ACM, 2018).
In a sense, XDP can be regarded as a form of offload:
References:
Modern high-performance servers are nearly all based on the PCIe architecture and technologies derived from it such as the Direct Media Interface (DMI) or QuickPath Interconnect (QPI).
For example, below is a sample block diagram for a dual-processor system:
A PCI Express system consists of many components, the most important of which to us are:
Root Complex acts as the agent which helps with:
The endpoint is usually of most interest to us because that's where we put our high-performance device.
It is a GPU in the sample block diagram, while in real life it can be a high-speed Ethernet card, a data collecting/processing card, or an InfiniBand card talking to some storage device in a large data center.
Below is a refined block diagram that shows the interconnection of those components in more detail:
Based on this topology, let's talk about a typical scenario where Remote Direct Memory Access (RDMA) is used to let an endpoint PCIe device write directly to pre-allocated system memory whenever data arrives, offloading as much CPU involvement as possible.
So the device will initiate a write request with data and send it along, hoping the root complex will help it get the data into system memory.
PCIe, unlike traditional PCI or PCI-X, bases its communication on packets flying over point-to-point serial links, which is why people sometimes describe PCIe as a sort of tiny network topology.
So the RDMA device, acting as requester, sends its request packet bearing the data along the link towards the root complex.
The packet will arrive at an intermediary PCIe switch and be forwarded to the root complex, and the root complex will diligently move the data in the payload to system memory through its private memory controller.
Of course we would expect some overhead besides the pure data payload, and here is the packet structure of PCIe gen3:
Obviously, given this additional "tax" you have to pay, you would want to put in as large a payload as you can, which would increase the effective utilization ratio.
However, that is not always possible, which brings us to "max payload size".
Each device has a "max payload size supported" field in the device capabilities part of its config space indicating its capability, and a "max payload size" field in its device control register, which is programmed with the actual max payload it may use.
Below shows the related registers extracted from pcie base spec:
So how do we decide what value to set, within the range bounded by max payload supported?
The idea is that it has to equal the minimum max payload supported along the route.
So our data write request has to consider the endpoint's max payload supported, as well as that of the PCIe switch (which is abstracted as a PCIe device during enumeration) and of the root complex's root port (which is also abstracted as a device).
The PCIe base spec actually describes it this way, without giving a detailed implementation:
Now let's take a look at how Linux does it.
static void pcie_write_mps(struct pci_dev *dev, int mps) { ... }
So Linux follows the same idea and takes the minimum of the upstream device's capability and the downstream PCI device's.
The only exception is the root port, which sits at the top of the PCI hierarchy, so it can simply be set to its own max supported.
pcie_set_mps does the real setting of the config register, and it can be seen that it takes the min.
Now that we have finished talking about max payload size, let's turn our attention to max read request size.
It does not apply to memory write requests; it applies to memory read requests, in that you cannot ask for more than that size in a single memory read request.
We can imagine a slightly different use case where some application prepares a block of data to be processed by the endpoint device, then notifies the device of the memory address and size and asks the device to take over.
The device will have to initiate a series of memory read requests to fetch the data, process it in place on the card, and put the result in some preset location.
So even though a packet payload can go up to 4096 bytes, the device will have to work in a trickle-like way if we program its max read request to be a very small value.
Here is the explanation from PCIE base spec on max read request:
So again, let's see how Linux programs max read request size:
static void pcie_write_mrrs(struct pci_dev *dev) { ... }
pcie_set_readrq does the real setting, and surprisingly it uses max payload size as the ceiling even though max read request size has no direct relationship with it.
We may well send a large read request, but when data is returned from the root complex it will be split into many smaller packets, each with a payload no larger than max payload size.
The code above is mainly executed during the PCI bus enumeration phase.
And if we grep for pcie_set_readrq, we can see other device drivers provide overrides, probably to increase read-request efficiency.
So how big an impact do the two settings have on your specific device?
It's hard to tell, though you can easily find discussions about it on the internet.
Here is a good one Understanding Performance of PCI Express Systems.
And here is another good one PCI Express Max Payload size and its impact on Bandwidth.
Virtio-net failover is a virtualization technology that allows a virtual machine (VM) to switch from a Virtual Function I/O (VFIO) device to a virtio-net device when the VM needs to be migrated from a host to another.
On one hand, the Single Root I/O Virtualization (SR-IOV) technology allows a device like a networking card to be split into several devices (the Virtual Functions) and with the help of the VFIO technology, the kernel of the VM can directly drive these devices. This is interesting in terms of performance, because it can reach the same level as a bare metal system. In this case, the cost of the performance is that a VFIO device cannot be migrated.
On the other hand, virtio-net is a paravirtualized networking device that has good performance and can be migrated. The trade off is that performance is not as good as with VFIO devices.
Virtio-net failover tries to bring the best of both worlds: performance of a VFIO device with the migration capability of a virtio-net device.
Virtio-net failover relies on several blocks of technology to migrate a VM using a VFIO device:
Failover is a term that comes from the high availability (HA) domain in an attempt to provide reliability, availability and serviceability (RAS) to a system.
The principle of failover is to bind two devices together, the so-called primary and standby, in a redundant way. The system only uses the primary device, but if the primary device becomes unavailable, unusable or disconnected, the failover manager can detect the problem and disable the primary device to switch to the standby device.
The standby is used to maintain service availability. While the standby is in use, an operator can remove the dysfunctional device and replace it with a healthy one. Once the problem is corrected, the new device can be used as the new standby device while the old standby device becomes the new primary device. Alternately, the newly replaced device could be restored as the primary device and the other switched back to standby.
Virtio-net failover applies the failover principle to bind two devices together, but in this case the VFIO device is deliberately chosen as the primary device and used during regular operation of the system. When a migration occurs, the hypervisor triggers a primary device fault (by unplugging it), which forces the failover manager (in our case the guest kernel net_failover driver) to disable the primary device and switch to the standby device, the virtio-net device, which is able to survive a VM live migration.
The hypervisor also takes the role of the operator by restoring the disabled device on the migration destination side by hotplugging a new VFIO device. In this case, the net_failover driver is configured to restore the VFIO device as the primary and to keep the virtio-net as the standby, as the devices are not identical.
To implement virtio-net failover, we need support at guest kernel level and at hypervisor level:
Virtio-net failover allows a VM hypervisor to migrate a VM with a VFIO device without interrupting the network connection. To reach this goal we need collaboration between the hypervisor and the guest kernel — the hypervisor unplugs the card and the guest kernel switches the network connection to the virtio-net device, and then they restore the original state on the destination host.
Intel E810 NICs support live migration of VF passthrough devices, reconciling a high-performance data path with flexible operations; the migration itself can be significantly accelerated by 4th Gen Xeon Scalable processors.
As the slides show, for device state the change is the live-migration support added to the E810 driver (the E810's own protocol supports live migration); the VFIO framework then calls the interfaces provided by the E810 driver to retrieve the device state.
FRED has the capability of helping system performance and response time.
Intel engineers summed up FRED as:
The Intel flexible return and event delivery (FRED) architecture defines simple new transitions that change privilege level (ring transitions). The FRED architecture was designed with the following goals:
1) Improve overall performance and response time by replacing event delivery through the interrupt descriptor table (IDT event delivery) and event return by the IRET instruction with lower latency transitions.
2) Improve software robustness by ensuring that event delivery establishes the full supervisor context and that event return establishes the full user context.
The new transitions defined by the FRED architecture are FRED event delivery and, for returning from events, two FRED return instructions. FRED event delivery can effect a transition from ring 3 to ring 0, but it is used also to deliver events incident to ring 0 (FRED is also used for ring 0 -> ring 0 delivery). One FRED instruction (ERETU) effects a return from ring 0 to ring 3, while the other (ERETS) returns while remaining in ring 0.
In addition to these transitions, the FRED architecture defines a new instruction (LKGS) for managing the state of the GS segment register. The LKGS instruction can be used by 64-bit operating systems that do not use the new FRED transitions.
Simply put, FRED is basically about lower-latency transitions between CPU privilege levels.
That's just a brief look; dig into the details as needed.
References: