本文主要是mark下Virtio-net failover: An introduction相关notes。

Virtio-net failover is a virtualization technology that allows a virtual machine (VM) to switch from a Virtual Function I/O (VFIO) device to a virtio-net device when the VM needs to be migrated from a host to another.

On one hand, the Single Root I/O Virtualization (SR-IOV) technology allows a device like a networking card to be split into several devices (the Virtual Functions) and with the help of the VFIO technology, the kernel of the VM can directly drive these devices. This is interesting in terms of performance, because it can reach the same level as a bare metal system. In this case, the cost of the performance is that a VFIO device cannot be migrated.

On the other hand, virtio-net is a paravirtualized networking device that has good performance and can be migrated. The trade off is that performance is not as good as with VFIO devices.

Virtio-net failover tries to bring the best of both worlds: performance of a VFIO device with the migration capability of a virtio-net device.

Principles

Virtio-net failover relies on a several blocks of technology to migrate a VM using a VFIO device:

  • Live migration, to move the VM from one host to another
  • Virtio-net, to keep network connection while the migration is in progress
  • VFIO, to use a host hardware device
  • Failover, to switch from a networking device to another in a transparent way

Failover

Failover is a term that comes from the high availability (HA) domain in an attempt to provide reliability, availability and serviceability (RAS) to a system.

The principle of failover is to bind two devices together, so called the primary and the standby, in a redundant way. The system only uses the primary device, but if the primary device becomes unavailable, unusable or disconnected, the failover manager can detect the problem and disable the primary device to switch to the standby device.

The standby is used to maintain service availability. While the standby is in use, an operator can remove the dysfunctional device and replace it with a healthy one. Once the problem is corrected, the new device can be used as the new standby device while the old standby device becomes the new primary device. Alternately, the newly replaced device could be restored as the primary device and the other switched back to standby.

Virtio-net failover operation

Virtio-net failover plays with the failover principle to bind two devices together, but in this case, the VFIO device is chosen with caution to be the primary device and used during the regular state of operation of the system. When a migration occurs, the hypervisor triggers a primary device fault (by unplugging it), that will force the failover manager (in our case the guest kernel failover_net driver) to disable the primary device and use the standby device, the virtio-net device, that is able to survive to a VM live migration.

The hypervisor also takes the role of the operator by restoring the disabled device on the migration destination side by hotplugging a new VFIO device. In this case, failover_net driver is configured to restore the VFIO device as the primary and to keep the virtio-net as the standby as the devices are not identical.

To implement virtio-net failover, we need support at guest kernel level and at hypervisor level:

  • The hypervisor detects when a migration is started and unplugs the VFIO device as it cannot be migrated, at the same time it will block the migration while the VFIO card is being unplugged.
  • Guest kernel needs the ability to switch transparently between the VFIO device and the virtio-net device.
  • On normal operation, VFIO is used for its performance, but when the hypervisor asks to unplug the card, the kernel unplugs it and switches the networking traffic from the VFIO device to the virtio-net device.
  • While the migration is processed, the VM is not stopped and all the networking traffic that usually crosses the VFIO device is redirected to the virtio-net device. There is no service interruption,
  • At the end of the migration, on the destination side, the hypervisor plugs in a VFIO device, and the traffic switches back from the virtio-net device to the VFIO device.

Summary

Virtio-net failover allows a VM hypervisor to migrate a VM with a VFIO device without interrupting the network connection. To reach this goal we need collaboration between the hypervisor and the guest kernel — the hypervisor unplugs the card and the guest kernel switches the network connection to the virtio-net device, and then they restore the original state on the destination host.