This article is mainly a repost of Recapitulating AF_XDP.

Introduction

This time I will talk about a pretty awesome feature in the Linux kernel: AF_XDP. Please keep in mind that this is a summary and explanation in my own words, and it's not intended to cover every technical depth. The focus lies on understanding AF_XDP's core concepts, learning how to use it, and knowing what to consider while using it.

The official kernel documentation describes AF_XDP as "an address family that is optimized for high performance packet processing". Why do we need an additional address family (or, in other words, an additional type of network socket)? This sentence from the docs implies that the existing address families are not suitable for high performance networking. And that's exactly the case. While the Linux networking stack does a really good job of abstracting layers away from applications, its performance suffers precisely because of these abstractions. That's why other libraries like DPDK completely bypass the kernel networking stack with their so-called Poll Mode Drivers (PMD). This is very, very fast, reaching line rate for 100 Gbit/s NICs. But this performance comes with some drawbacks: DPDK code is difficult to maintain, there is no chance to benefit from any kernel functionality (e.g. existing networking drivers), the number of supported NICs is limited and smaller than that of the kernel, and PMD drivers keep every core they use pinned at 100%.


Consequently, getting functionality into the Linux kernel that allows high-performance packet processing sounds pretty awesome. First, there is one important thing to point out that sometimes confuses people: AF_XDP is not a kernel bypass like DPDK; it's a fast path inside the kernel. This means that, for example, the normal kernel networking drivers are still used. After clarifying this important difference, let's dig into AF_XDP to see how it works and what we need to consider.

Note: For in-depth explanations of all the concepts used, please visit the kernel documentation; it's really great! A complete working program and tutorial can be found here.

Data Flow: eBPF and XDP

In the mentioned kernel documentation, the authors assume that the reader is familiar with BPF and XDP, and otherwise point to the Cilium docs as a reference. However, I think it's important to mention how these two things work together with AF_XDP, to understand how AF_XDP differs from e.g. DPDK. XDP itself is a way to bypass the normal networking stack (not the whole kernel) to achieve high packet processing speeds. eBPF is used to run verified code in the kernel on a set of different events, called hooks. One of these hooks is the XDP hook. An eBPF program attached to the XDP hook is called for every incoming packet arriving at the driver (if the driver supports running eBPF) and gets a reference to the raw packet representation. The eBPF program can then perform different tasks with the packet, like modifying it, dropping it, passing it to the network stack, sending it back to the NIC, or redirecting it. In our AF_XDP case, redirecting (XDP_REDIRECT) is the most important action, because it allows packets to be sent directly to userspace. The following figure shows the flow of packets using a normal socket and AF_XDP.


After being received by the NIC, the first layer the packets pass through is the networking driver. In the driver, applications may load eBPF programs on the XDP hook to perform the actions explained above. With AF_XDP, the eBPF program redirects the packet to a particular XDP socket that was created in userspace. Bypassing the Linux networking stack (traffic control, IP, TCP/UDP, etc.), the userspace application can now handle the packets without any further processing in the kernel. If the driver supports ZEROCOPY, the packets are written directly into the address space of the application; otherwise, one copy operation needs to be performed. In contrast to AF_XDP, packets destined for normal sockets (UDP/TCP) traverse the networking stack. They are either passed to the stack using XDP_PASS, or, if no eBPF program is attached to the XDP hook, forwarded directly to the networking stack.
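To make the redirect step a bit more concrete, here is a minimal sketch of what the kernel-side eBPF program can look like, assuming a libbpf-style build. The map name xsks_map, its size and the fallback to XDP_PASS are my own illustrative choices, not something AF_XDP mandates; userspace still has to put the file descriptor of its XDP socket into the map and attach the program to the interface.

```c
/* Minimal XDP program sketch: redirect packets to an AF_XDP socket.
 * Assumes libbpf headers; map name and size are illustrative. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);  /* maps RX queue index -> XDP socket */
    __uint(max_entries, 64);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    /* If an AF_XDP socket is registered for this RX queue, redirect the
     * packet to it; otherwise fall back to the normal networking stack. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```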

Now let's consider the backwards direction, i.e. transmitting packets. With AF_XDP, packets can be passed directly to the NIC driver by handing the driver a block of memory containing them; the driver then processes them and sends them to the NIC. Normal sockets, on the other hand, send packets using syscalls like sendto, where the packets traverse the whole networking stack in the opposite direction. On the outgoing side there is no XDP hook that eBPF programs can attach to, so no further packet processing happens there.


Note: Please be aware that some SmartNICs also support running XDP programs directly on the NIC. However, this is not the common case; therefore the focus here is on the driver mode.

Structure and Concepts

In the previous section, we saw how packets flow until they arrive at our application. So now let's look at how AF_XDP sockets read and write packets from/to the NIC driver. AF_XDP works in a completely different way from what we already know about socket programming. The setup of the socket is quite similar, but reading from and writing to the NIC differs a lot. In AF_XDP, you create a UMEM region and have four rings assigned to the UMEM: the RX, TX, completion and fill rings. Wow, sounds really complicated. But trust me, it's not. UMEM is basically just an area of contiguous virtual memory, divided into equal-sized frames. The four rings mentioned contain pointers to particular offsets in the UMEM. To understand the rings, let's consider an example, shown in the next figure.
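In code, the setup roughly looks like the following sketch, which uses the xsk helpers from libxdp/libbpf (xsk_umem__create and xsk_socket__create). The frame count, the interface name eth0 and queue id 0 are assumptions for illustration; note that, unless told otherwise, these helpers also load a default XDP redirect program similar in spirit to the one sketched earlier.

```c
/* Sketch of the userspace setup: one UMEM with its FILL and COMPLETION
 * rings, plus one AF_XDP socket with its RX and TX rings. */
#include <stdlib.h>
#include <sys/mman.h>
#include <xdp/xsk.h>   /* from libxdp; older setups use <bpf/xsk.h> */

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE   /* 4096 bytes */

struct xsk_umem *umem;
struct xsk_socket *xsk;
struct xsk_ring_prod fill_ring, tx_ring;
struct xsk_ring_cons comp_ring, rx_ring;

static void setup(void)
{
    size_t umem_size = NUM_FRAMES * FRAME_SIZE;

    /* UMEM: one contiguous, page-aligned block of virtual memory,
     * divided into equal-sized frames. */
    void *umem_area = mmap(NULL, umem_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (umem_area == MAP_FAILED)
        exit(1);

    /* Register the UMEM with the kernel; this also creates the FILL and
     * COMPLETION rings that belong to the UMEM. */
    if (xsk_umem__create(&umem, umem_area, umem_size,
                         &fill_ring, &comp_ring, NULL))
        exit(1);

    /* Create the AF_XDP socket bound to one RX queue of the interface,
     * together with its RX and TX rings. */
    struct xsk_socket_config cfg = {
        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
    };
    if (xsk_socket__create(&xsk, "eth0", /* queue_id */ 0,
                           umem, &rx_ring, &tx_ring, &cfg))
        exit(1);
}
```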

This figure covers reading packets from the driver. We produce UMEM addresses to the fill ring, meaning we put some slots of our UMEM into the fill ring (1). Afterwards, we notify the kernel: hey, there are entries in our fill ring, please write arriving packets there. After we pass the fill ring (2) and the rx ring (3) to the kernel, the kernel writes packets into the slots we produced beforehand and posts the corresponding descriptors (4) to the rx ring. Once the kernel gives us back both rings (5) (6), we can fetch the new packets via the rx ring. The rx ring contains packet descriptors for the slots we passed to the kernel via the fill ring, provided packets arrived. Great, we can now handle all of our packets, then start again by putting some references into the fill ring, and continue reading packets from the NIC in the same way.
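Expressed with the same libxdp helpers, one receive iteration could look roughly like this sketch; umem_area, the rings and FRAME_SIZE are assumed to come from the setup above, and the frame addresses produced into the fill ring are illustrative (a real program keeps track of which frames are currently free).

```c
/* Sketch of one receive iteration: produce free UMEM frames to the FILL
 * ring, then consume packet descriptors from the RX ring. */
static void rx_batch(void *umem_area)
{
    __u32 idx, i;

    /* (1) Produce addresses of free UMEM frames to the fill ring... */
    unsigned int n = xsk_ring_prod__reserve(&fill_ring, 64, &idx);
    for (i = 0; i < n; i++)
        *xsk_ring_prod__fill_addr(&fill_ring, idx + i) =
            (__u64)i * FRAME_SIZE;          /* illustrative frame addresses */
    xsk_ring_prod__submit(&fill_ring, n);   /* (2) ...and hand them to the kernel */

    /* (4)-(6) The kernel fills those frames and posts descriptors to the
     * rx ring; peek at what has arrived. */
    unsigned int rcvd = xsk_ring_cons__peek(&rx_ring, 64, &idx);
    for (i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx_ring, idx + i);
        void *pkt = xsk_umem__get_data(umem_area, desc->addr);
        /* ...process desc->len bytes at pkt... */
        (void)pkt;
    }
    xsk_ring_cons__release(&rx_ring, rcvd); /* give the rx ring slots back */
}
```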

To send packets via the NIC, the remaining two rings are used in a similar way to what we saw on the receive side. We produce packet descriptors to the tx ring, meaning we put references to our UMEM into the tx ring. Once we have filled the ring, we pass it to the kernel. After the kernel has transmitted the packets, the respective references are placed into the completion ring and our application can reuse those slots of the UMEM.
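A matching sketch for the transmit side, again assuming the socket and rings from the setup above and a packet that has already been written into a UMEM frame at frame_addr:

```c
/* Sketch of the transmit side: produce a descriptor to the TX ring, kick
 * the kernel with sendto(), then reclaim the frame via the COMPLETION ring. */
#include <sys/socket.h>

static void tx_one(__u64 frame_addr, __u32 len)
{
    __u32 idx;

    if (xsk_ring_prod__reserve(&tx_ring, 1, &idx) != 1)
        return;                              /* tx ring currently full */

    struct xdp_desc *desc = xsk_ring_prod__tx_desc(&tx_ring, idx);
    desc->addr = frame_addr;                 /* offset of the packet in UMEM */
    desc->len = len;
    xsk_ring_prod__submit(&tx_ring, 1);

    /* Tell the kernel there is something to send. */
    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

    /* Once the kernel has sent the packet, the frame address shows up in
     * the completion ring and the frame can be reused. */
    __u32 cidx;
    unsigned int done = xsk_ring_cons__peek(&comp_ring, 1, &cidx);
    if (done) {
        __u64 completed = *xsk_ring_cons__comp_addr(&comp_ring, cidx);
        (void)completed;                     /* return this frame to the free list */
        xsk_ring_cons__release(&comp_ring, done);
    }
}
```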

In summary, with AF_XDP we get a pretty awesome tradeoff: we reuse existing kernel code (NIC drivers) while gaining high performance for packet processing. I hope this article gives you at least an idea of how AF_XDP works.

Summary of the Four Rings

A summary of the fill ring, completion ring, rx ring and tx ring:

  • The fill ring and the rx ring work together for receiving packets
    • fill ring (producer: the userspace program, consumer: the XDP program in the kernel), analogous to the avail ring on the virtio-net receive side
    • rx ring (producer: the XDP program in the kernel, consumer: the userspace program), analogous to the used ring on the virtio-net receive side
  • The completion ring and the tx ring work together for sending packets
    • tx ring (producer: the userspace program, consumer: the XDP program in the kernel), analogous to the avail ring on the virtio-net transmit side
    • completion ring (producer: the XDP program in the kernel, consumer: the userspace program), analogous to the used ring on the virtio-net transmit side
  • The UMEM uses two rings: FILL and COMPLETION. Each socket associated with the UMEM must have an RX queue, TX queue or both. Say that there is a setup with four sockets (all doing TX and RX). Then there will be one FILL ring, one COMPLETION ring, four TX rings and four RX rings.
  • The rings are head(producer)/tail(consumer) based rings. A producer writes to the data ring at the index pointed out by the producer member of struct xdp_ring and then increases the producer index. A consumer reads from the data ring at the index pointed out by the consumer member of struct xdp_ring and then increases the consumer index. (See the sketch after this list.)
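The following is only an illustration of that head/tail scheme with a single producer and a single consumer over a power-of-two ring; it is not the actual kernel layout, but it shows why each side only ever writes its own index.

```c
/* Illustrative single-producer/single-consumer ring, not the kernel's
 * struct xdp_ring: the producer writes at ring[producer & mask] and bumps
 * "producer"; the consumer reads at ring[consumer & mask] and bumps
 * "consumer". */
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 512          /* power of two, so "& (RING_SIZE - 1)" wraps */

struct ring {
    uint32_t producer;         /* written only by the producer side */
    uint32_t consumer;         /* written only by the consumer side */
    uint64_t desc[RING_SIZE];  /* UMEM addresses or packet descriptors */
};

static bool ring_produce(struct ring *r, uint64_t entry)
{
    if (r->producer - r->consumer == RING_SIZE)
        return false;                        /* ring full */
    r->desc[r->producer & (RING_SIZE - 1)] = entry;
    /* A real implementation needs a memory barrier here so the consumer
     * never sees the updated index before the data. */
    r->producer++;
    return true;
}

static bool ring_consume(struct ring *r, uint64_t *entry)
{
    if (r->consumer == r->producer)
        return false;                        /* ring empty */
    *entry = r->desc[r->consumer & (RING_SIZE - 1)];
    r->consumer++;
    return true;
}
```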

References:

  1. AF_XDP技术详解
  2. https://www.kernel.org/doc/html/latest/networking/af_xdp.html?highlight=af_xdp
  3. https://docs.cilium.io/en/latest/bpf/
  4. https://www.youtube.com/watch?v=9bbdhnbVbDk
  5. https://www.youtube.com/watch?v=Gv-nG6F_09I&t=1417s
  6. http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf