This post collects some notes on Linux XPS (Transmit Packet Steering).

1. Overview

The Linux network stack maps each core C to a different Tx queue Q, such that Q’s memory is allocated from C’s node. Memory for packets transmitted via Q is likewise allocated from the same node. Cores can then transmit simultaneously through their individual queues in an uncoordinated, NU(D)MA-friendly manner while avoiding synchronization overheads. When a thread T that executes on C issues a system call to open a socket file descriptor S, the network stack associates Q with S, saving Q’s identifier in the socket data structure. After that, whenever T transmits through S, the network stack checks that T still runs on C. If it does not, the network stack updates S to point to the queue of T’s new core. (The actual modification happens only after Q has been drained of any outstanding packets that originated from S, to avoid out-of-order transmissions.)
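The queue-selection rule described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the kernel's implementation: the names `Socket`, `in_flight`, `select_tx_queue`, and `cpu_to_queue` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Socket:
    """Toy stand-in for the socket structure (hypothetical fields)."""
    tx_queue: Optional[int] = None  # queue identifier cached in the socket
    in_flight: int = 0              # packets from this socket still in tx_queue


def select_tx_queue(sock: Socket, current_cpu: int,
                    cpu_to_queue: Dict[int, int]) -> int:
    """Return the Tx queue to use for the next packet sent through sock."""
    wanted = cpu_to_queue[current_cpu]
    if sock.tx_queue is None:
        # First transmission: bind the socket to the current core's queue.
        sock.tx_queue = wanted
    elif sock.tx_queue != wanted and sock.in_flight == 0:
        # The thread migrated and the old queue holds no packets from this
        # socket, so switching cannot reorder packets.
        sock.tx_queue = wanted
    # Otherwise keep the old queue until it has drained.
    return sock.tx_queue


if __name__ == "__main__":
    mapping = {0: 0, 1: 1}           # one queue per core
    s = Socket()
    print(select_tx_queue(s, 0, mapping))  # thread starts on core 0
    s.in_flight = 2                        # two packets still queued
    print(select_tx_queue(s, 1, mapping))  # migrated, but queue not drained
    s.in_flight = 0
    print(select_tx_queue(s, 1, mapping))  # drained, safe to switch
```

The point of the `in_flight` check is that a socket never has packets pending in two queues at once, which is what keeps its packets in order.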

2. Optimization

2.1 Reduce contention

Contention on the device queue lock is significantly reduced because fewer CPUs contend for the same queue (contention can be eliminated completely if each CPU has its own transmit queue).
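XPS is configured per queue by writing a hexadecimal CPU bitmap to `/sys/class/net/<dev>/queues/tx-<n>/xps_cpus`. The sketch below only computes the values for a one-queue-per-CPU layout and does not write anything; the helper names `xps_cpu_mask` and `one_to_one_plan` and the device name `eth0` are assumptions made for this example.

```python
from typing import Dict, Iterable


def xps_cpu_mask(cpus: Iterable[int]) -> str:
    """Build a hex CPU bitmap, split into comma-separated 32-bit groups."""
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        bits |= 1 << cpu
    hex_str = format(bits, "x")
    groups = []
    while hex_str:
        groups.append(hex_str[-8:])   # 8 hex digits = 32 bits
        hex_str = hex_str[:-8]
    return ",".join(reversed(groups))


def one_to_one_plan(dev: str, num_queues: int) -> Dict[str, str]:
    """Map tx-<n> to CPU n only, so no two CPUs share a queue lock."""
    return {
        f"/sys/class/net/{dev}/queues/tx-{q}/xps_cpus": xps_cpu_mask([q])
        for q in range(num_queues)
    }


if __name__ == "__main__":
    # Print the plan; applying it means writing each value to its path as root.
    for path, mask in one_to_one_plan("eth0", 4).items():
        print(path, mask)
```

With fewer queues than CPUs, several CPUs would have to be passed to `xps_cpu_mask` for each queue, and those CPUs would still contend with each other for that queue's lock.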

2.2 Reduce cache miss rate on transmit completion

The cache miss rate on transmit completion is reduced, in particular for the data cache lines that hold the sk_buff structures.

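After the NIC finishes transmitting a packet, it raises an interrupt to the CPU. The Linux kernel network stack then calls kfree_skb, which accesses the sk_buff structures. If the core that sent the packet is also the core that calls kfree_skb, those structures are more likely to still be in that core's cache, so the cache miss rate for the sk_buff structures goes down.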

kfree_skb
└── kfree_skb_reason
    └── skb_unref
        └── skb->users

2.3 DMA Buffer NUMA Affinity

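When the NIC transmits a packet, it only needs to DMA from memory on its local node and does not have to cross NUMA nodes, which can improve DMA performance. For details, see DMA Buffer NUMA Affinity.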
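To check which CPUs are local to a NIC, one option is to read the device's `numa_node` attribute and the node's `cpulist` from sysfs. The following is a minimal sketch under that assumption; the helper names `parse_cpulist` and `nic_local_cpus` are hypothetical, and the device name `eth0` is only an example.

```python
from pathlib import Path
from typing import List


def parse_cpulist(text: str) -> List[int]:
    """Expand a sysfs cpulist string such as '0-3,8-11' into CPU indices."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nic_local_cpus(dev: str, sysfs: Path = Path("/sys")) -> List[int]:
    """Return the CPUs on the NIC's NUMA node, or [] if the node is unknown."""
    node_file = sysfs / "class/net" / dev / "device/numa_node"
    node = int(node_file.read_text())
    if node < 0:
        # -1 means the platform reports no NUMA affinity for this device.
        return []
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


if __name__ == "__main__":
    try:
        print(nic_local_cpus("eth0"))
    except FileNotFoundError:
        # Virtual interfaces have no backing device directory.
        print("no NUMA information for this interface")
```

Steering transmit queues to CPUs from this list keeps the packet buffers, the queue memory, and the NIC on the same node.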


References:

  1. IOctopus: Outsmarting Nonuniform DMA (ASPLOS’20)
  2. Scaling in the Linux Networking Stack
  3. Linux网络栈的性能缩放