Understanding the virtio packed virtqueue mechanism in depth
This article explains the virtio packed virtqueue mechanism from first principles.
Prerequisite
Readers should already have a solid understanding of the split virtqueue.
Terms
- This article refers to the virtio spec's "Driver Ring Wrap Counter" as avail_wrap_counter
- This article refers to the virtio spec's "Device Ring Wrap Counter" as used_wrap_counter
Motivation
For software backends
- Bad cache utilization, several cache misses per request
- metadata is scattered into several places
- descriptor chain is not contiguous in memory
- cache contention in many places
For hardware implementation
- several PCIe transactions per descriptor
Overview
The packed virtqueue merges the three rings of virtio 1.0 (the desc ring, the avail ring, and the used ring) into a single desc ring.
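The next field is removed compared with the split desc
In a split desc, the next field records the index of the next desc in a desc chain. It is normally used together with flags like this:
    if (descs[idx].flags & VRING_DESC_F_NEXT)
        nextdesc = descs[descs[idx].next];
In the packed desc ring, however, the descs of a chain are always adjacent (think of a linked list turned into an array), so the next field is no longer needed. Ignoring wrap-around at the end of the ring, fetching nextdesc becomes:
    if (descs[idx].flags & VRING_DESC_F_NEXT)
        nextdesc = descs[++idx];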
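The two ways of finding the next descriptor can be put side by side in a small sketch. The struct and function names here (split_desc, packed_desc, split_next, packed_next) are made up for illustration and are not taken from any driver; the packed variant also handles the wrap-around case, which the one-liner above leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_NEXT 1u /* same bit value as VRING_DESC_F_NEXT */

/* Split ring: the successor is named explicitly by the next field. */
struct split_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* Packed ring: no next field; the successor is the adjacent slot. */
struct packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };

/* Both helpers return the index of the next desc in the chain, or -1 at the end. */
static int split_next(const struct split_desc *descs, uint16_t idx)
{
    if (descs[idx].flags & DESC_F_NEXT)
        return descs[idx].next;
    return -1;
}

static int packed_next(const struct packed_desc *descs, uint16_t idx,
                       uint16_t ring_size)
{
    if (descs[idx].flags & DESC_F_NEXT)
        return (idx + 1) % ring_size; /* a chain may wrap past the ring end */
    return -1;
}

static void chain_walk_example(void)
{
    struct split_desc s[4] = { [0] = { .flags = DESC_F_NEXT, .next = 2 } };
    struct packed_desc p[4] = { [3] = { .flags = DESC_F_NEXT } };

    assert(split_next(s, 0) == 2);     /* follows the explicit link */
    assert(split_next(s, 2) == -1);    /* end of chain */
    assert(packed_next(p, 3, 4) == 0); /* adjacent slot, wrapping to 0 */
    assert(packed_next(p, 0, 4) == -1);
}
```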
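Changes to the flags field
To express the avail ring and used ring information with the desc ring alone, the packed virtqueue introduces two wrap counters, avail_wrap_counter and used_wrap_counter. The flags field is kept from the split desc, but it takes more values: with three rings merged into one, each desc has to carry more information to identify itself (used or avail). Two flags are added on top of the existing ones, VRING_DESC_F_AVAIL and VRING_DESC_F_USED.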
avail desc: a desc is an avail desc when its VRING_DESC_F_AVAIL bit equals avail_wrap_counter and its VRING_DESC_F_USED bit is the inverse of avail_wrap_counter. For example, when avail_wrap_counter is 1, VRING_DESC_F_AVAIL is set and VRING_DESC_F_USED is clear; when avail_wrap_counter is 0, VRING_DESC_F_AVAIL is clear and VRING_DESC_F_USED is set.
used desc: a desc is a used desc when both its VRING_DESC_F_USED bit and its VRING_DESC_F_AVAIL bit equal used_wrap_counter. For example, when used_wrap_counter is 1, both VRING_DESC_F_AVAIL and VRING_DESC_F_USED are set; when used_wrap_counter is 0, both are clear.
In short, the two flags of an avail desc always differ (exactly one is set), while the two flags of a used desc always match: both set or both clear.
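The two rules can be written as a pair of predicates. This is a minimal sketch: the helper names desc_is_avail and desc_is_used are chosen for this article, and the bit positions (7 and 15) are the ones the virtio 1.1 spec assigns to VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of VIRTQ_DESC_F_AVAIL / VIRTQ_DESC_F_USED in virtio 1.1. */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* Available: AVAIL equals the wrap counter, USED is its inverse. */
static bool desc_is_avail(uint16_t flags, bool avail_wrap_counter)
{
    bool avail = flags & DESC_F_AVAIL;
    bool used = flags & DESC_F_USED;

    return avail == avail_wrap_counter && used != avail_wrap_counter;
}

/* Used: AVAIL and USED both equal the used wrap counter. */
static bool desc_is_used(uint16_t flags, bool used_wrap_counter)
{
    bool avail = flags & DESC_F_AVAIL;
    bool used = flags & DESC_F_USED;

    return avail == used_wrap_counter && used == used_wrap_counter;
}

static void flag_example(void)
{
    assert(desc_is_avail(DESC_F_AVAIL, true));  /* counter 1: A set, U clear */
    assert(desc_is_avail(DESC_F_USED, false));  /* counter 0: A clear, U set */
    assert(desc_is_used(DESC_F_AVAIL | DESC_F_USED, true));
    assert(desc_is_used(0, false));
    /* An A|~U entry written during the previous lap is not available now. */
    assert(!desc_is_avail(DESC_F_AVAIL, false));
    /* A zero-initialized ring is neither available nor used at counter 1. */
    assert(!desc_is_avail(0, true) && !desc_is_used(0, true));
}
```

Because both wrap counters start at 1, a freshly zeroed ring contains no descriptor that either side would mistake for a valid one.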
The id field is added compared with the split desc
This id is special: it is the buffer id, not the index idx of the desc in the ring.
Key points
last_used_idx <= used_idx <= last_avail_idx <= avail_idx
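- avail_idx and avail_wrap_counter are kept in driver-local variables
- last_avail_idx and the avail_wrap_counter value at that position are kept in device-local variables
- used_idx and used_wrap_counter are kept in device-local variables
- last_used_idx and the used_wrap_counter value at that position are kept in driver-local variables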
Placing Available Buffers Into The Descriptor Ring
For each buffer element, b:
- Get the next descriptor table entry, d
- Get the next free buffer id value
- Set d.addr to the physical address of the start of b
- Set d.len to the length of b.
- Set d.id to the buffer id
- Calculate the flags as follows:
  - If b is device-writable, set the VIRTQ_DESC_F_WRITE bit to 1, otherwise 0
  - Set the VIRTQ_DESC_F_AVAIL bit to the current value of the Driver Ring Wrap Counter
  - Set the VIRTQ_DESC_F_USED bit to inverse value
- Perform a memory barrier to ensure that the descriptor has been initialized
- Set d.flags to the calculated flags value
- If d is the last descriptor in the ring, toggle the Driver Ring Wrap Counter
- Otherwise, increment d to point at the next descriptor
This makes a single descriptor buffer available. However, in general the driver MAY make use of a batch of descriptors as part of a single request. In that case, it defers updating the descriptor flags for the first descriptor (and the previous memory barrier) until after the rest of the descriptors have been initialized.
/* Note: vq->avail_wrap_count is initialized to 1 */
Don’t mark the 1st descriptor available until all of them are ready
Reason: The driver always makes the first descriptor in the list available after the rest of the list has been written out into the ring. This guarantees that the device will never observe a partial scatter/gather list in the ring.
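The steps above, including the deferred write of the head descriptor's flags, might look roughly like the following. This is only a sketch under stated assumptions: struct drv_vq, struct buf, drv_add_chain and the fixed RING_SIZE of 4 are hypothetical names for illustration, free-slot accounting is left to the caller, and atomic_thread_fence stands in for whatever write barrier a real driver would use toward the device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_NEXT  (1u << 0)
#define DESC_F_WRITE (1u << 1)
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)
#define RING_SIZE 4

struct packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };
struct buf { uint64_t addr; uint32_t len; bool device_writable; };

/* Hypothetical driver-side state; the field names are illustrative only. */
struct drv_vq {
    struct packed_desc desc[RING_SIZE];
    uint16_t next_avail;           /* avail_idx */
    bool avail_wrap_counter;       /* must be initialized to 1 */
    uint16_t chain_len[RING_SIZE]; /* list size per buffer id */
};

/* Publish n buffers (n >= 1) as one chain under buffer id `id`.
 * The caller is assumed to have checked that n slots are free. */
static void drv_add_chain(struct drv_vq *vq, const struct buf *b, uint16_t n,
                          uint16_t id)
{
    uint16_t first = vq->next_avail, first_flags = 0;

    for (uint16_t i = 0; i < n; i++) {
        struct packed_desc *d = &vq->desc[vq->next_avail];
        uint16_t f = (uint16_t)((b[i].device_writable ? DESC_F_WRITE : 0u)
                   | (vq->avail_wrap_counter ? DESC_F_AVAIL : DESC_F_USED)
                   | (i + 1 < n ? DESC_F_NEXT : 0u));

        d->addr = b[i].addr;
        d->len = b[i].len;
        d->id = id; /* the spec only requires the id in the last desc */
        if (i == 0)
            first_flags = f; /* head flags are written last, see below */
        else
            d->flags = f;

        if (++vq->next_avail == RING_SIZE) {
            vq->next_avail = 0;
            vq->avail_wrap_counter = !vq->avail_wrap_counter;
        }
    }
    vq->chain_len[id] = n;
    /* Every other write must be visible before the head becomes available. */
    atomic_thread_fence(memory_order_release);
    vq->desc[first].flags = first_flags;
}

static void add_chain_example(void)
{
    struct drv_vq vq = { .avail_wrap_counter = true };
    struct buf bufs[3] = {
        { 0x1000, 64, false }, { 0x2000, 64, false }, { 0x3000, 64, true },
    };

    drv_add_chain(&vq, bufs, 3, 0);
    assert(vq.desc[0].flags == (DESC_F_AVAIL | DESC_F_NEXT));
    assert(vq.desc[1].flags == (DESC_F_AVAIL | DESC_F_NEXT));
    assert(vq.desc[2].flags == (DESC_F_AVAIL | DESC_F_WRITE));
    assert(vq.next_avail == 3 && vq.avail_wrap_counter);
    assert(vq.chain_len[0] == 3);
}
```

Note that the AVAIL/USED bits of each descriptor are computed from the wrap counter in effect at that slot, so a chain that crosses the end of the ring carries the old counter value before the wrap and the new one after it.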
Device writes used descriptor
The device only writes out a single used descriptor for the whole list. It then skips forward according to the number of descriptors in the list. The driver needs to keep track of the size of the list corresponding to each buffer ID, to be able to skip to where the next used descriptor is written by the device.
When the device has finished processing the buffer, it writes a used device descriptor including the Buffer ID into the Descriptor Ring (overwriting a driver descriptor previously made available), and sends a used event notification.
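For a descriptor chain, the device writes the used flags (which reflect used_wrap_counter) only into the first desc, not into every desc of the chain. In a hardware implementation of packed virtqueue, this reduces the number of TLPs the device has to issue.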
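A rough sketch of both sides of this rule follows: the device writes a single used descriptor and skips ahead by the chain length, and the driver uses its per-id list size to skip the same distance. All names (dev_state, drv_state, dev_mark_used, drv_get_used) are hypothetical, the ring size is fixed at 4, the VIRTQ_DESC_F_WRITE bit of the used descriptor is left out for brevity, and atomic_thread_fence stands in for the real barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_NEXT  (1u << 0)
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)
#define RING_SIZE 4

struct packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };
struct dev_state { uint16_t used_idx; bool used_wrap_counter; };
struct drv_state {
    uint16_t last_used_idx;
    bool used_wrap_counter;
    uint16_t chain_len[RING_SIZE]; /* list size per buffer id */
};

/* Move an index forward by n slots, toggling the wrap counter at ring end. */
static void advance(uint16_t *idx, bool *wrap, uint16_t n)
{
    *idx = (uint16_t)(*idx + n);
    if (*idx >= RING_SIZE) {
        *idx = (uint16_t)(*idx - RING_SIZE);
        *wrap = !*wrap;
    }
}

/* Device: one used descriptor for a whole chain of n descriptors. */
static void dev_mark_used(struct packed_desc *ring, struct dev_state *dev,
                          uint16_t id, uint32_t written, uint16_t n)
{
    struct packed_desc *d = &ring[dev->used_idx];

    d->id = id;
    d->len = written;
    atomic_thread_fence(memory_order_release); /* id/len before flags */
    d->flags = dev->used_wrap_counter ? (DESC_F_AVAIL | DESC_F_USED) : 0;
    advance(&dev->used_idx, &dev->used_wrap_counter, n);
}

/* Driver: returns the used buffer id, or -1 if nothing has been used yet. */
static int drv_get_used(const struct packed_desc *ring, struct drv_state *drv)
{
    uint16_t flags = ring[drv->last_used_idx].flags;
    bool avail = flags & DESC_F_AVAIL;
    bool used = flags & DESC_F_USED;

    if (avail != used || used != drv->used_wrap_counter)
        return -1;
    atomic_thread_fence(memory_order_acquire); /* flags before id/len */
    uint16_t id = ring[drv->last_used_idx].id;
    advance(&drv->last_used_idx, &drv->used_wrap_counter, drv->chain_len[id]);
    return id;
}

static void used_example(void)
{
    /* A three-descriptor chain with buffer id 0 sits in slots 0..2. */
    struct packed_desc ring[RING_SIZE] = {
        { .flags = DESC_F_AVAIL | DESC_F_NEXT },
        { .flags = DESC_F_AVAIL | DESC_F_NEXT },
        { .flags = DESC_F_AVAIL },
    };
    struct dev_state dev = { .used_wrap_counter = true };
    struct drv_state drv = { .used_wrap_counter = true, .chain_len = { 3 } };

    assert(drv_get_used(ring, &drv) == -1); /* still only available */
    dev_mark_used(ring, &dev, 0, 0, 3);
    assert(ring[0].flags == (DESC_F_AVAIL | DESC_F_USED));
    assert(ring[1].flags == (DESC_F_AVAIL | DESC_F_NEXT)); /* untouched */
    assert(dev.used_idx == 3);
    assert(drv_get_used(ring, &drv) == 0);
    assert(drv.last_used_idx == 3);
}
```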
Example
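The following cases use a vring whose desc table has 4 entries to walk through the packed virtqueue flow and the reasons behind the related rules. In the flags, A stands for VIRTQ_DESC_F_AVAIL, U for VIRTQ_DESC_F_USED, and N for VIRTQ_DESC_F_NEXT.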
Case 1
step1: initialization
step2: driver produces available buffers
step3: device fetches available buffers
step4: device produces used buffers
For buffer id 0, the device only updates the flags of desc 0 (to A|U).
step5: driver consumes used buffers
The driver knows that the list for buffer id 0 has size 3, so it updates last_used_idx to 3. VIRTQ_DESC_F_NEXT is reserved in used descriptors, and should be ignored by drivers.
step6: driver produces available buffers
step7: device fetches available buffers
From the value of avail_wrap_counter, the device can tell that desc 1 (flags A|~U|N) is an avail desc left over from the previous lap, not one from the current lap.
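Step 7 can be replayed with a small device-side sketch. The ring contents below are an assumption made for this illustration (a two-descriptor chain placed in slots 3 and 0 at step 6, with slots 1 and 2 still holding the stale flags from step 2), and dev_vq, dev_pop_chain and desc_is_avail are hypothetical names. A real device would also validate every chained descriptor, which this sketch skips.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_NEXT  (1u << 0)
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)
#define RING_SIZE 4

struct packed_desc { uint64_t addr; uint32_t len; uint16_t id; uint16_t flags; };
struct dev_vq { uint16_t last_avail_idx; bool avail_wrap_counter; };

static bool desc_is_avail(uint16_t flags, bool wrap)
{
    bool avail = flags & DESC_F_AVAIL;
    bool used = flags & DESC_F_USED;

    return avail == wrap && used != wrap;
}

/* Pops one chain; returns its length in descriptors, or 0 if none is ready. */
static uint16_t dev_pop_chain(const struct packed_desc *ring, struct dev_vq *dev)
{
    uint16_t n = 0, flags;

    if (!desc_is_avail(ring[dev->last_avail_idx].flags, dev->avail_wrap_counter))
        return 0;
    atomic_thread_fence(memory_order_acquire); /* flags before the rest */
    do {
        flags = ring[dev->last_avail_idx].flags;
        n++;
        if (++dev->last_avail_idx == RING_SIZE) {
            dev->last_avail_idx = 0;
            dev->avail_wrap_counter = !dev->avail_wrap_counter;
        }
    } while (flags & DESC_F_NEXT);
    return n;
}

static void stale_desc_example(void)
{
    struct packed_desc ring[RING_SIZE] = {
        { .id = 1, .flags = DESC_F_USED },                /* new lap: ~A|U   */
        { .id = 0, .flags = DESC_F_AVAIL | DESC_F_NEXT }, /* stale: A|~U|N   */
        { .id = 0, .flags = DESC_F_AVAIL },               /* stale: A|~U     */
        { .id = 1, .flags = DESC_F_AVAIL | DESC_F_NEXT }, /* old lap: A|~U|N */
    };
    struct dev_vq dev = { .last_avail_idx = 3, .avail_wrap_counter = true };

    assert(dev_pop_chain(ring, &dev) == 2); /* slots 3 and 0 */
    assert(dev.last_avail_idx == 1 && !dev.avail_wrap_counter);
    assert(dev_pop_chain(ring, &dev) == 0); /* desc 1 is from the last lap */
}
```

Without the wrap counter, desc 1 would look exactly like a freshly published descriptor, which is why the device compares the AVAIL bit against its own counter instead of testing the bit alone.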
Case 2
step1: initialization
step2: driver produces available buffers
step3: device fetches available buffers
step4: device produces used buffers
The device only updates the flags of desc 0 (to A|U) and sets the id of desc 0 to 1. In a used descriptor, Element Address is unused. Element Length specifies the length of the buffer that has been initialized (written to) by the device. Element Length is reserved for used descriptors without the VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
step5: driver consumes used buffers
The driver knows that the list for buffer id 1 has size 2, so it updates last_used_idx to 2.
Source code walkthrough
struct vring_packed_desc {
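For orientation, the 16-byte descriptor layout that the virtio 1.1 spec defines for the packed ring can be sketched with plain fixed-width types. This is not the kernel's definition: the name packed_desc is an assumption for this article, and the little-endian annotations that the real headers carry are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed ring descriptor; every field is little-endian on the wire. */
struct packed_desc {
    uint64_t addr;  /* buffer address */
    uint32_t len;   /* buffer length */
    uint16_t id;    /* buffer id, not a ring index */
    uint16_t flags; /* NEXT / WRITE / AVAIL / USED bits */
};

/* No next field: the descriptor is 16 bytes, the same size as a split desc. */
static_assert(sizeof(struct packed_desc) == 16, "descriptor must be 16 bytes");
static_assert(offsetof(struct packed_desc, id) == 12, "id follows len");
static_assert(offsetof(struct packed_desc, flags) == 14, "flags come last");
```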
https://elixir.bootlin.com/linux/v6.14/source/drivers/virtio/virtio_ring.c
https://elixir.bootlin.com/qemu/v8.0.0/source/hw/virtio/virtio.c
Driver produces available buffers into the Descriptor Ring
struct vring_virtqueue_packed {
Here, vq->packed.avail_wrap_counter is the avail_wrap_counter maintained on the driver side, vq->packed.next_avail_idx is the avail_idx maintained on the driver side, and vq->packed.desc_state[id].num is the driver-side size of the list corresponding to buffer id.
Device consumes available buffers from the Descriptor Ring
static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
Here, vq->last_avail_idx is the last_avail_idx maintained on the device side, and vq->last_avail_wrap_counter is the avail_wrap_counter value that corresponds to that last_avail_idx.
Device produces used buffers into the Descriptor Ring
static void virtqueue_packed_flush(VirtQueue *vq, unsigned int count)
Here, vq->used_idx is the used_idx maintained on the device side, and vq->used_wrap_counter is the used_wrap_counter maintained on the device side.
Driver consumes used buffers from the Descriptor Ring
static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
Packed virtqueues support up to 2^15 entries each.
An index therefore fits in 15 bits: bits 14 to 0 of vq->last_used_idx hold the last_used_idx maintained on the driver side, and bit 15 holds the used_wrap_counter that corresponds to that last_used_idx.
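The packing of index and wrap counter into one 16-bit value can be illustrated as follows. The helper names are invented for this sketch and are not the kernel's; only the layout (index in bits 14..0, wrap counter in bit 15) comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WRAP_CTR_SHIFT 15 /* bit that holds the used wrap counter */

static uint16_t used_idx_of(uint16_t v)
{
    return (uint16_t)(v & 0x7fffu);
}

static bool used_wrap_of(uint16_t v)
{
    return (v >> WRAP_CTR_SHIFT) & 1u;
}

static uint16_t pack_last_used(uint16_t idx, bool wrap)
{
    return (uint16_t)(idx | ((unsigned)wrap << WRAP_CTR_SHIFT));
}

/* Skip n descriptors, toggling the wrap counter when the ring end is passed. */
static uint16_t advance_last_used(uint16_t v, uint16_t n, uint16_t ring_size)
{
    uint16_t idx = (uint16_t)(used_idx_of(v) + n);
    bool wrap = used_wrap_of(v);

    if (idx >= ring_size) {
        idx = (uint16_t)(idx - ring_size);
        wrap = !wrap;
    }
    return pack_last_used(idx, wrap);
}

static void last_used_example(void)
{
    uint16_t v = pack_last_used(0, true); /* initial state: idx 0, counter 1 */

    assert(v == 0x8000);
    v = advance_last_used(v, 3, 4);       /* consume a 3-descriptor list */
    assert(used_idx_of(v) == 3 && used_wrap_of(v));
    v = advance_last_used(v, 2, 4);       /* next list crosses the ring end */
    assert(used_idx_of(v) == 1 && !used_wrap_of(v));
}
```

Keeping both values in one word lets them be read and updated together, so the index and the counter it belongs to cannot be observed out of step.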
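Notes
- Packed virtqueues support up to 2^15 entries.
- Each packed virtqueue consists of three parts:
  - Descriptor Ring
  - Driver Event Suppression: read-only for the backend (device); it controls notifications from the backend to the frontend (driver), i.e. used notifications
  - Device Event Suppression: read-only for the frontend (driver); it controls notifications from the frontend to the backend (device), i.e. avail notifications
- Write flag, VIRTQ_DESC_F_WRITE
  - In an avail desc, this flag marks whether the associated buffer is read-only or write-only for the device.
  - In a used desc, this flag indicates whether the backend (device) has written any data into the associated buffer.
- len in a desc
  - In an avail desc, len is the length of the buffer that the desc refers to.
  - In a used desc, when VIRTQ_DESC_F_WRITE is set, len is the length of the data written by the backend (device); when VIRTQ_DESC_F_WRITE is not set, len has no meaning.
- Descriptor Chain
  - VIRTQ_DESC_F_NEXT has no meaning in a used desc.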