
Notes about NFV

2024-10-27T08:55:46.000Z

This post records notes on NFV (Network Function Virtualization).

Background

Summary

Network Function Virtualization is a network architecture for virtualizing entire classes of network functions (NFs) on commodity off-the-shelf general-purpose hardware.

References:

Notes about RDMA ODP feature

2024-10-20T08:27:30.000Z

This post records notes on the RDMA ODP (On-Demand Paging) feature.

Introduction

On-Demand-Paging (ODP) is a technique that alleviates many of the shortcomings of memory registration. Applications no longer need to pin down the underlying physical pages of the address space or track the validity of the mappings. Rather, the HCA requests the latest translations from the OS when pages are not present, and the OS invalidates translations which are no longer valid due to either non-present pages or mapping changes.

Synchronizing between CPU and RNIC page tables

Faulting

When an RDMA request accesses data on invalid virtual pages, (1a) the RNIC stalls the QP and raises an RNIC page fault interrupt. (1b) The driver requests the OS kernel for virtual-to-physical mappings via hmm_range_fault. The OS kernel triggers CPU page faults on these virtual pages and fills the CPU page table if necessary. (1c) The driver updates the mappings on the RNIC page table and (1d) resumes the QP.

Invalidation

When the OS kernel tries to unmap virtual pages in scenarios like swapping out or page migration, (2a) it notifies the RNIC driver to invalidate virtual pages via mmu_interval_notifier. (2b) The RNIC driver erases the virtual-to-physical mapping from the RNIC page table. (2c) The driver notifies the kernel that the physical pages are no longer used by the RNIC. Then, the OS kernel modifies the CPU page table and reuses the physical pages.

An ODP MR (Memory Region) relies on the faulting and invalidation flows to keep the CPU and RNIC page tables synchronized.

Advising

An application can proactively request the RNIC driver to populate a range in the RNIC page table. The RNIC driver completes advising by steps (3a) – (3b), which are identical to steps (1b) – (1c).
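The three flows can be mimicked with a toy model in plain C. This is only an illustrative sketch: the two arrays standing in for the CPU and RNIC page tables, and every name in it (toy_odp, toy_rdma_access, toy_invalidate, toy_advise), are made up for the example and are not kernel or driver APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES   8      /* size of the toy address space, in pages */
#define TOY_NO_FRAME (-1)   /* marks an absent translation */

/* Hypothetical stand-ins for the CPU and RNIC page tables. */
struct toy_odp {
    int cpu_pt[TOY_NPAGES];   /* virtual page -> physical frame, CPU view  */
    int rnic_pt[TOY_NPAGES];  /* virtual page -> physical frame, RNIC view */
    int next_frame;           /* trivial frame allocator */
};

static void toy_init(struct toy_odp *o)
{
    for (size_t i = 0; i < TOY_NPAGES; i++) {
        o->cpu_pt[i] = TOY_NO_FRAME;
        o->rnic_pt[i] = TOY_NO_FRAME;
    }
    o->next_frame = 100;
}

/* Steps (1b)-(1c): make the page present on the CPU side, then copy the
 * translation into the RNIC table. Advising, (3a)-(3b), reuses this. */
static void toy_populate(struct toy_odp *o, size_t vpage)
{
    assert(vpage < TOY_NPAGES);
    if (o->cpu_pt[vpage] == TOY_NO_FRAME)
        o->cpu_pt[vpage] = o->next_frame++;  /* CPU page fault fills the entry */
    o->rnic_pt[vpage] = o->cpu_pt[vpage];
}

/* Faulting: returns true if an RNIC page fault had to be served before the
 * access could proceed, false if the translation was already valid. */
static bool toy_rdma_access(struct toy_odp *o, size_t vpage)
{
    assert(vpage < TOY_NPAGES);
    if (o->rnic_pt[vpage] != TOY_NO_FRAME)
        return false;
    toy_populate(o, vpage);   /* (1a) stall, (1b)-(1c) fix up, (1d) resume */
    return true;
}

/* Invalidation: drop the RNIC mapping first (2b), then the CPU mapping. */
static void toy_invalidate(struct toy_odp *o, size_t vpage)
{
    assert(vpage < TOY_NPAGES);
    o->rnic_pt[vpage] = TOY_NO_FRAME;
    o->cpu_pt[vpage] = TOY_NO_FRAME;
}

/* Advising: populate a range up front so later accesses do not fault. */
static void toy_advise(struct toy_odp *o, size_t first, size_t count)
{
    for (size_t v = first; v < first + count && v < TOY_NPAGES; v++)
        toy_populate(o, v);
}
```

In this model the first access to a page faults, a second access does not, and an access after toy_invalidate faults again, which is the behaviour the faulting and invalidation flows are meant to guarantee.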

enum ib_odp_general_cap_bits {
	IB_ODP_SUPPORT          = 1 << 0,
	IB_ODP_SUPPORT_IMPLICIT = 1 << 1,
};

enum ib_odp_transport_cap_bits {
	IB_ODP_SUPPORT_SEND     = 1 << 0,
	IB_ODP_SUPPORT_RECV     = 1 << 1,
	IB_ODP_SUPPORT_WRITE    = 1 << 2,
	IB_ODP_SUPPORT_READ     = 1 << 3,
	IB_ODP_SUPPORT_ATOMIC   = 1 << 4,
	IB_ODP_SUPPORT_SRQ_RECV = 1 << 5,
};

References:

Optimized Memory Access
TeRM: Extending RDMA-Attached Memory with SSD(FAST’24)
Mellanox OFED for Linux User Manual
[PATCH v3 00/17] On demand paging
[PATCH for-next v7 0/7] On-Demand Paging on SoftRoCE
RDMA - ODP按需分页设计原理-优点-源码浅析

Notes about RDMA Device Memory

2024-10-20T00:29:55.000Z

This post records notes on RDMA Device Memory.

Introduction

Device Memory is a verbs API that allows using on-chip memory, located on the device, as a data buffer for send/receive and RDMA operations. The device memory can be mapped and accessed directly by user and kernel applications. It can be allocated in various sizes and registered as memory regions with local and remote access keys for performing send/receive and RDMA operations. Using the device memory to store packets for transmission can significantly reduce transmission latency compared to the host memory.

Motivation

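staging buffer: a buffer that temporarily holds data in transit

Concepts

Thoughts

This is analogous to the NVMe CMB: RDMA Device Memory is exposed to the host as MMIO so that RDMA operations can use device memory directly instead of DMA-ing to host memory, which reduces PCIe TLP traffic.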

Notes about RDMA SRQ/XRC/DCT

2024-10-19T02:47:47.000Z

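This post records notes on RDMA SRQ (Shared Receive Queue), XRC (eXtended Reliable Connection), and DCT (Dynamically Connected Transport).

1. SRQ

1.1 Why SRQ is needed

Without SRQ, the receiver of an RC/UC/UD QP does not know when the peer will send or how much data will arrive, so it has to prepare for the worst case, a sudden burst of incoming data, by posting a sufficient number of receive WQEs to the RQ. Alternatively, the RC service type can rely on flow control to throttle the sender: the receiver tells the peer "I am running out of RQ WQEs", and the sender temporarily slows down or stops sending.

The first approach is provisioned for the worst case, so most of the time a large number of RQ WQEs sit idle, which wastes a great deal of memory (mainly the data buffers that the WQEs point to). The second approach needs fewer RQ WQEs, but flow control has a cost: it increases communication latency.

SRQ solves this by letting many QPs share the receive WQEs (which are not large themselves) and the data buffers they point to (which do take a lot of memory). When any of the QPs receives a message, the hardware takes a WQE from the SRQ, places the received data according to that WQE, and then reports the completion of the receive to the corresponding upper-layer user through the Completion Queue.

1.2 SRQ Limit

An SRQ can be configured with a threshold. When the number of WQEs remaining in the queue drops below the threshold, the SRQ raises an asynchronous event that tells the user "the queue is running out of WQEs; post more so that new data has somewhere to land". The threshold is called the SRQ Limit, and the event is called SRQ Limit Reached.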
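The limit mechanism can be sketched with a counter-only model in C. Everything here is hypothetical (toy_srq and its helpers are not verbs APIs), and the sketch assumes the limit is one-shot: once the event has fired, the limit is disarmed until the user arms it again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy SRQ: only counts posted receive WQEs; no buffers are modelled. */
struct toy_srq {
    unsigned int wqes;    /* receive WQEs currently available */
    unsigned int limit;   /* SRQ Limit; 0 means the limit is disarmed */
};

/* Post n more receive WQEs to the shared queue. */
static void toy_srq_post_recv(struct toy_srq *srq, unsigned int n)
{
    srq->wqes += n;
}

/* Arm (or re-arm) the limit. */
static void toy_srq_arm(struct toy_srq *srq, unsigned int limit)
{
    srq->limit = limit;
}

/*
 * Any QP attached to the SRQ consumes one WQE per incoming message.
 * Returns false if the queue is empty (no buffer for the message).
 * *limit_event is set to true when this consume makes the remaining
 * count drop below the armed limit.
 */
static bool toy_srq_consume(struct toy_srq *srq, bool *limit_event)
{
    *limit_event = false;
    if (srq->wqes == 0)
        return false;
    srq->wqes--;
    if (srq->limit != 0 && srq->wqes < srq->limit) {
        *limit_event = true;
        srq->limit = 0;   /* assumption: one-shot, must be re-armed */
    }
    return true;
}
```

With 4 WQEs posted and a limit of 3, the first consume leaves 3 WQEs and raises nothing; the second leaves 2, which is below the limit, and raises the event exactly once.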

2. XRC

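2.1 Why XRC is needed

Compute nodes today are usually multi-core and therefore run multiple processes. In a cluster of such nodes, building a full-mesh topology with RC connections requires N*p*p QPs per node (assuming N nodes in the cluster, p processes per node, and connectivity between every pair of processes). As the cluster scales and both N and p grow, the RC QP resources needed per node become unacceptable.

The idea of XRC is that a process that wants to talk to the p processes on a remote node does not need p connections, one per remote process; it only needs one connection to the remote node. Packets on that connection carry the identifier of the destination process (its XRC SRQ), and when a packet reaches the far end of the connection (the XRC TGT QP) it is dispatched to the XRC SRQ of the matching process. The source process thus needs only one initiator-side connection (an XRC INI QP) to reach all processes on the remote node, so the total number of QPs is divided by p.

2.2 Core concepts

In the figure above, the XRC subscript xyz reads as follows: x is the node number of the initiator, y is the process number of the initiator, and z is the node number of the receiver.

XRC INI QP

The XRC initiator QP is the source-side queue of XRC operations. It issues XRC operations but cannot receive them; compared with a regular RC QP, it can be thought of as having an SQ but no RQ. On the remote side, XRC operations are handled by the XRC TGT QP.

XRC TGT QP

The XRC target QP handles incoming XRC operations and dispatches each one to the SRQ identified by the SRQ number in the packet. It can only receive XRC operations and cannot issue them; compared with a regular RC QP, it can be thought of as having an RQ but no SQ. The XRC operations it receives are issued by the XRC INI QP on the remote side.

XRC SRQ

Receive buffers (receive WQEs) are posted to the XRC SRQ to receive XRC requests. An XRC request carries the XRC SRQ number, so when the XRC TGT QP receives a packet it takes a receive WQE from the XRC SRQ named in the packet to hold the request.

XRC domain

The XRC domain associates XRC TGT QPs with XRC SRQs. An XRC packet can only name an XRC SRQ in the same domain as the XRC TGT QP; otherwise the packet is dropped. This isolates resources and prevents malicious packets from naming arbitrary XRC SRQs.

XRC INI QPs and XRC TGT QPs correspond one to one: every process on host2 has its own XRC TGT QP on the remote node host0. The sharing in XRC comes from the fact that one XRC TGT QP can dispatch to multiple XRC SRQs. A process usually has only one XRC SRQ, which can receive packets from multiple XRC TGT QPs.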

3. DCT

Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in smaller memory footprint, less overhead to set connections and higher on-chip cache utilization and hence increased performance.

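3.1 Why DCT is needed

UD scales well but does not support one-sided read/write semantics. RC supports one-sided read/write but scales poorly. DCT was designed to combine the strengths of both: it keeps RC's one-sided read/write semantics and reliable delivery while, like UD, using a single QP to talk to multiple remote peers, which preserves scalability. DCT is typically used for sparse traffic.

QPs needed per node to build a full-mesh topology:

With RC, each node needs N*p*p QPs
With XRC, each node needs N*p QPs
With DCT, each node needs p (possibly p+n) QPs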
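To get a feel for the difference, the three formulas can be evaluated directly. The helper names below are made up for this sketch, and the DCT helper uses the plain p figure rather than the "possibly p+n" variant.

```c
#include <assert.h>
#include <stdint.h>

/* QPs needed per node for a full mesh, using the formulas listed above.
 * n = number of nodes in the cluster, p = processes per node. */
static uint64_t qps_per_node_rc(uint64_t n, uint64_t p)
{
    return n * p * p;
}

static uint64_t qps_per_node_xrc(uint64_t n, uint64_t p)
{
    return n * p;
}

static uint64_t qps_per_node_dct(uint64_t n, uint64_t p)
{
    (void)n;   /* independent of the cluster size in the simple case */
    return p;
}
```

For a cluster of 1000 nodes with 32 processes each, this gives 1,024,000 QPs per node for RC, 32,000 for XRC, and 32 for DCT.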

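3.2 What is DCT

Dynamic Connectivity
Each DC Initiator can be used to reach any remote DC Target

DCT has an asymmetric API: the sending side of DC is called the DC initiator (DCI) and the receiving side is called the DC target (DCT). DCIs and DCTs are just special types of QP and still follow the basic QP operations such as post send/receive.

DC means temporary connections. Every send WR posted to a DCI carries destination address information. If the peer the DCI is currently connected to differs from the one named in the send WR (a different node address), the DCI first tears down the current connection and then connects to the peer named in the send WR. As long as subsequent send WRs name the currently connected peer, the established connection is reused. The DCI also disconnects if it posts no send operations for a configured period of time. Note that each temporary connection is a reliable RC connection.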

3.3 Thoughts

DCT preserves their core connection-oriented design, but dynamically creates and destroys one-to-one connections. This provides software the illusion of using one QP to communicate with multiple remote machines, but at a prohibitively large performance cost for our workloads: DCT requires three additional network messages when the target machine of a DCT queue pair changes: a disconnect packet to the current machine, and a two-way handshake with the next machine to establish a connection[FaSST, OSDI’16].

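This is why DCT only performs well for sparse traffic.

3.4 XRC vs DCT

XRC: to talk to different nodes, the initiator process must establish an XRC connection to each node
DCT: to talk to different nodes, the initiator process needs only one connection; when it needs to talk to a new node, it first disconnects from the previous node and then connects to the new one, so a single connection suffices

3.5 Academic papers

KRCORE: a microsecond-scale RDMA control plane for elastic computing(ATC’22)

References: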

TLB shootdown uses the pause instruction on Intel architecture

2024-10-19T00:12:05.000Z

This post records notes on how TLB shootdown uses the pause instruction on Intel architecture.

static void smp_call_function_many_cond(const struct cpumask *mask,
					smp_call_func_t func, void *info,
					unsigned int scf_flags,
					smp_cond_func_t cond_func)
{
	...
	if (run_remote && wait) {
		/* Wait for each CPU in turn to update its csd flag; spin until it does. */
		for_each_cpu(cpu, cfd->cpumask) {
			call_single_data_t *csd;

			csd = per_cpu_ptr(cfd->csd, cpu);
			csd_lock_wait(csd);
		}
	}
}

csd_lock_wait ends up executing the pause instruction

csd_lock_wait
└── smp_cond_load_relaxed
    └── cpu_relax
        └── asm volatile("rep; nop")

The machine code of rep;nop is f3 90, which is exactly the encoding of the pause instruction; rep;nop is effectively an alias for pause.
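A user-space spin-wait built on the same hint might look like the sketch below. It assumes a GCC- or Clang-compatible compiler for the inline assembly, falls back to a plain compiler barrier on non-x86 targets, and uses made-up names (toy_cpu_relax, toy_spin_wait); it is not the kernel's csd_lock_wait.

```c
#include <assert.h>
#include <stdatomic.h>

/* Spin-wait hint. On x86, "rep; nop" assembles to f3 90, i.e. pause.
 * Other targets only get a compiler barrier in this sketch. */
static inline void toy_cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Spin until *flag becomes non-zero or max_spins iterations have passed.
 * Returns the number of relax hints that were issued.
 */
static unsigned long toy_spin_wait(atomic_int *flag, unsigned long max_spins)
{
    unsigned long spins = 0;

    while (atomic_load_explicit(flag, memory_order_acquire) == 0 &&
           spins < max_spins) {
        toy_cpu_relax();
        spins++;
    }
    return spins;
}
```

The bound on the number of spins is there only so the sketch cannot hang; a real waiter such as csd_lock_wait spins until the other CPU is done.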

References:

x86的cpu_relax解析

Notes about RDMA UMR(User-Mode Memory Registration)

2024-10-13T10:49:06.000Z

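This post records notes on the RDMA UMR (User-Mode Memory Registration) mechanism.

What

User-Mode Memory Registration (UMR) supports the creation of memory keys for non-contiguous memory regions. This includes the concatenation of arbitrary contiguous regions of memory, as well as regions with regular structure.

Examples

Three examples of non-contiguous regions of memory that are used to form new contiguous regions of memory are described below.

Concatenating several non-contiguous MRs into one VA-contiguous MR

As the figure above shows, three regular MRs were created earlier: MR1 (green), MR2 (purple), and MR3 (red). We now take a piece of each and concatenate them into a new contiguous MR: the first piece is MR1 (v0-v1), the second is MR2 (v2-v3), and the third is MR3 (v4-v5). The new MR has its own base VA, and its length is the sum of the lengths of the three pieces. It is non-contiguous internally, but to anyone accessing it from outside it looks contiguous.

Concatenating regularly strided non-contiguous blocks within one MR into a contiguous MR

As the figure above shows, when transposing a matrix, the elements of a column have to be gathered into a new row; that row becomes the new contiguous MR.

Interleaving several MRs into a new contiguous MR

As the figure above shows, the columns of two existing matrices are interleaved to form new columns. The result is a new VA-contiguous MR with its own base address and length.

Thoughts

UMR creates new memory keys, VAs (Virtual Addresses), and MTT entries; the MTT entries only need to make the new VAs point to the target PAs (Physical Addresses).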
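The first example (concatenation) boils down to an address translation from a contiguous VA range onto a list of pieces. The sketch below models that lookup in software; the structures and names (toy_chunk, toy_umr, toy_umr_translate) are hypothetical and do not reflect how any HCA lays out its MTT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous piece taken from an existing MR (hypothetical layout). */
struct toy_chunk {
    uint64_t addr;   /* start address of the piece in the old MR */
    uint64_t len;    /* length of the piece in bytes */
};

/* A toy "UMR": presents the chunks back to back starting at base_va. */
struct toy_umr {
    uint64_t base_va;
    const struct toy_chunk *chunks;
    size_t nchunks;
};

/* Total length of the new region: the sum of the chunk lengths. */
static uint64_t toy_umr_length(const struct toy_umr *u)
{
    uint64_t total = 0;

    for (size_t i = 0; i < u->nchunks; i++)
        total += u->chunks[i].len;
    return total;
}

/*
 * Translate a VA inside the new region to the address that backs it.
 * Returns 0 on success and -1 if va falls outside the region.
 */
static int toy_umr_translate(const struct toy_umr *u, uint64_t va, uint64_t *out)
{
    uint64_t off;

    if (va < u->base_va)
        return -1;
    off = va - u->base_va;
    for (size_t i = 0; i < u->nchunks; i++) {
        if (off < u->chunks[i].len) {
            *out = u->chunks[i].addr + off;
            return 0;
        }
        off -= u->chunks[i].len;   /* skip this chunk, keep looking */
    }
    return -1;
}
```

With three chunks of 0x100, 0x200, and 0x80 bytes, the new region is 0x380 bytes long, and offset 0x100 lands on the first byte of the second chunk even though the chunks are far apart in the old address space.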

References:

Notes about ARM VHE mode

2024-09-08T12:57:19.000Z

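This post records notes on ARM VHE (Virtualization Host Extensions) mode.

Typically, the kernel of the host operating system runs at EL1 and the part that controls virtualization runs at EL2. This design has an obvious problem: before VHE, a hypervisor usually had to be split into a high-visor running at EL1 and a low-visor running at EL2. The split causes many unnecessary context switches at run time and brings considerable design complexity and performance overhead. The Virtualization Host Extensions (VHE) were introduced to solve this. The feature was added in Armv8.1-A and lets the host OS kernel run directly at EL2.

References: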

Notes about RDMA Event Queue mechanism

2024-08-25T10:50:35.000Z

This post records notes on the Event Queue mechanism in RDMA.
Host Channel Adapter(HCA) device, HCA device, NIC, NIC device and adapter device are used interchangeably.

Introduction

HCA has multiple sources that can generate events (completion events, asynchronous events/errors). Once an event is generated internally, it can be reported to the host software via the Event Queue mechanism. The EQ is a memory-resident circular buffer used by hardware to write event cause information for consumption by the host software. Once event reporting is enabled, event cause information is written by hardware to the EQ when the event occurs. If EQ is armed, HW will subsequently generate an interrupt on the device interface (send MSI-X message or assert the pin) as configured in the EQ.

Q && A

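Q1: CQs already exist, so why is a completion EQ needed as well?
A1: With only CQs, software could only poll. With a CEQ, when an interrupt arrives the driver can read from the CEQE which CQ has completions.

Q2: Why not give each CQ its own interrupt vector, so that no EQ mechanism is needed?
A2: An RDMA device can have a very large number of CQs, quite possibly more than 2048, which exceeds the maximum size of the MSI-X table. The EQ mechanism binds many CQs to one EQ and gives each EQ its own interrupt vector; by bounding the number of EQs, the vector count stays within the MSI-X table limit.

Q3: How is a CQ bound to an EQ?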
A3: While creating a CQ, software configures the EQ number to which this CQ will report completion events.

eRDMA example

The event queue (EQ) is the main notification path from erdma hardware to its driver. Each erdma device has two kinds of EQs: the asynchronous EQ (AEQ) and the completion EQ (CEQ). Each device has one AEQ, which is used to report RDMA async events, and up to 32 CEQs (numbered CEQ0 to CEQ31). CEQ0 reports cmdq completion events, and the remaining CEQs report RDMA completion events.

Binding a CQ to an EQ

static int create_cq_cmd(struct erdma_ucontext *uctx, struct erdma_cq *cq)
{
struct erdma_dev *dev = to_edev(cq->ibcq.device);
struct erdma_cmdq_create_cq_req req;
struct erdma_mem *mem;
u32 page_size;

erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
CMDQ_OPCODE_CREATE_CQ);

        ...
req.cfg1 = FIELD_PREP(ERDMA_CMD_CREATE_CQ_EQN_MASK, cq->assoc_eqn);
        ...
}

int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
    struct ib_udata *udata)
{
        ...
        cq->assoc_eqn = attr->comp_vector + 1;
        ...
}

Getting the CQ number from an EQE

static irqreturn_t erdma_intr_ceq_handler(int irq, void *data)
{
struct erdma_eq_cb *ceq_cb = data;

tasklet_schedule(&ceq_cb->tasklet); // ends up running erdma_intr_ceq_task

return IRQ_HANDLED;
}

erdma_intr_ceq_task
└── erdma_ceq_completion_handler
    ├── get_next_valid_eqe
    └── cqn = FIELD_GET(ERDMA_CEQE_HDR_CQN_MASK, READ_ONCE(*ceqe))

mellanox mlx4 example

Binding a CQ to an EQ

mlx4_ib_create_cq[mlx4_ib_dev_ops.create_cq]
└── mlx4_cq_alloc
    └── cq_context->comp_eqn = ...

int mlx4_cq_alloc(struct mlx4_dev *dev, int nent,
  struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec,
  struct mlx4_cq *cq, unsigned vector, int collapsed,
  int timestamp_en, void *buf_addr, bool user_cq)
{
bool sw_cq_init = dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_SW_CQ_INIT;
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_cq_table *cq_table = &priv->cq_table;
struct mlx4_cmd_mailbox *mailbox;
struct mlx4_cq_context *cq_context;
u64 mtt_addr;
int err;

        ...

mailbox = mlx4_alloc_cmd_mailbox(dev);
        ...

cq_context = mailbox->buf;
        ...
cq_context->comp_eqn    = priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(vector)].eqn;
        ...
}

Getting the CQ number from an EQE

enum mlx4_event {
MLX4_EVENT_TYPE_COMP   = 0x00,
MLX4_EVENT_TYPE_PATH_MIG   = 0x01,
MLX4_EVENT_TYPE_COMM_EST   = 0x02,
MLX4_EVENT_TYPE_SQ_DRAINED   = 0x03,
MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE   = 0x13,
MLX4_EVENT_TYPE_SRQ_LIMIT   = 0x14,
MLX4_EVENT_TYPE_CQ_ERROR   = 0x04,
MLX4_EVENT_TYPE_WQ_CATAS_ERROR   = 0x05,
MLX4_EVENT_TYPE_EEC_CATAS_ERROR   = 0x06,
MLX4_EVENT_TYPE_PATH_MIG_FAILED   = 0x07,
MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
MLX4_EVENT_TYPE_WQ_ACCESS_ERROR   = 0x11,
MLX4_EVENT_TYPE_SRQ_CATAS_ERROR   = 0x12,
MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR  = 0x08,
MLX4_EVENT_TYPE_PORT_CHANGE   = 0x09,
MLX4_EVENT_TYPE_EQ_OVERFLOW   = 0x0f,
MLX4_EVENT_TYPE_ECC_DETECT   = 0x0e,
MLX4_EVENT_TYPE_CMD   = 0x0a,
MLX4_EVENT_TYPE_VEP_UPDATE   = 0x19,
MLX4_EVENT_TYPE_COMM_CHANNEL   = 0x18,
MLX4_EVENT_TYPE_OP_REQUIRED   = 0x1a,
MLX4_EVENT_TYPE_FATAL_WARNING   = 0x1b,
MLX4_EVENT_TYPE_FLR_EVENT   = 0x1c,
MLX4_EVENT_TYPE_PORT_MNG_CHG_EVENT = 0x1d,
MLX4_EVENT_TYPE_RECOVERABLE_ERROR_EVENT  = 0x3e,
MLX4_EVENT_TYPE_NONE   = 0xff,
};

static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_eqe *eqe;
int cqn;
int eqes_found = 0;
int set_ci = 0;
int port;
int slave = 0;
int ret;
int flr_slave;
u8 update_slave_state;
int i;
enum slave_port_gen_event gen_event;
unsigned long flags;
struct mlx4_vport_state *s_info;
int eqe_size = dev->caps.eqe_size;

while ((eqe = next_eqe_sw(eq, dev->caps.eqe_factor, eqe_size))) {
/*
 * Make sure we read EQ entry contents after we've
 * checked the ownership bit.
 */
dma_rmb();

switch (eqe->type) {
case MLX4_EVENT_TYPE_COMP:
cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff;
mlx4_cq_completion(dev, cqn);
break;
                        ...

cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff keeps only bits 0-23 of the event data, which hold the CQ number.
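The same extraction can be reproduced outside the kernel. In the sketch below, toy_be32_to_cpu is a portable stand-in for the kernel's be32_to_cpu that works on raw bytes, and the input bytes used afterwards are made-up example values.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a host-order 32-bit value from 4 big-endian bytes. This gives
 * the same result as be32_to_cpu regardless of host endianness. */
static uint32_t toy_be32_to_cpu(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Keep bits 0-23 (the CQ number); the top byte is masked off. */
static uint32_t toy_eqe_cqn(const uint8_t raw[4])
{
    return toy_be32_to_cpu(raw) & 0xffffff;
}
```

For the raw bytes ab 00 12 34, the big-endian value is 0xab001234 and the extracted CQ number is 0x001234.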

References:

Notes about RDMA cmdq

2024-08-25T07:23:08.000Z

This post records notes on the cmdq (command queue) in RDMA.
Host Channel Adapter(HCA) device, HCA device, NIC, NIC device and adapter device are used interchangeably.

Introduction

The HCA command interface is used for:

configuring the HCA
the handshake between hardware and system software
handling (querying, configuring, modifying) HCA objects

The HCA is configured using the command queues. Each function has its own command queues to get commands from its HCA driver.

The command queue is the transport that is used to pass commands to the HCA.

The cmdq is essentially a kind of SQ (Send Queue), comparable to the NVMe admin SQ (submission queue).

The concrete cmdq implementation differs between types of RDMA devices.

mellanox mlx4 example

For details of the mellanox mlx4 cmdq, see section 7.14 Command Interface of the spec.

eRDMA example

Cmdq is the main control plane channel between erdma driver and hardware. After erdma device is initialized, the cmdq channel will be active in the whole lifecycle of this driver.

cmdq commands

eRDMA supports the following commands:

enum CMDQ_RDMA_OPCODE {
CMDQ_OPCODE_QUERY_DEVICE = 0,
CMDQ_OPCODE_CREATE_QP = 1,
CMDQ_OPCODE_DESTROY_QP = 2,
CMDQ_OPCODE_MODIFY_QP = 3,
CMDQ_OPCODE_CREATE_CQ = 4,
CMDQ_OPCODE_DESTROY_CQ = 5,
CMDQ_OPCODE_REFLUSH = 6,
CMDQ_OPCODE_REG_MR = 8,
CMDQ_OPCODE_DEREG_MR = 9
};

enum CMDQ_COMMON_OPCODE {
CMDQ_OPCODE_CREATE_EQ = 0,
CMDQ_OPCODE_DESTROY_EQ = 1,
CMDQ_OPCODE_QUERY_FW_INFO = 2,
CMDQ_OPCODE_CONF_MTU = 3,
CMDQ_OPCODE_CONF_DEVICE = 5,
CMDQ_OPCODE_ALLOC_DB = 8,
CMDQ_OPCODE_FREE_DB = 9,
};

erdma_device_ops

static const struct ib_device_ops erdma_device_ops = {
.owner = THIS_MODULE,
.driver_id = RDMA_DRIVER_ERDMA,
.uverbs_abi_ver = ERDMA_ABI_VERSION,

.alloc_mr = erdma_ib_alloc_mr,
.alloc_pd = erdma_alloc_pd,
.alloc_ucontext = erdma_alloc_ucontext,
.create_cq = erdma_create_cq,
.create_qp = erdma_create_qp,
.dealloc_pd = erdma_dealloc_pd,
.dealloc_ucontext = erdma_dealloc_ucontext,
.dereg_mr = erdma_dereg_mr,
.destroy_cq = erdma_destroy_cq,
.destroy_qp = erdma_destroy_qp,
.get_dma_mr = erdma_get_dma_mr,
.get_port_immutable = erdma_get_port_immutable,
.iw_accept = erdma_accept,
.iw_add_ref = erdma_qp_get_ref,
.iw_connect = erdma_connect,
.iw_create_listen = erdma_create_listen,
.iw_destroy_listen = erdma_destroy_listen,
.iw_get_qp = erdma_get_ibqp,
.iw_reject = erdma_reject,
.iw_rem_ref = erdma_qp_put_ref,
.map_mr_sg = erdma_map_mr_sg,
.mmap = erdma_mmap,
.mmap_free = erdma_mmap_free,
.modify_qp = erdma_modify_qp,
.post_recv = erdma_post_recv,
.post_send = erdma_post_send,
.poll_cq = erdma_poll_cq,
.query_device = erdma_query_device,
.query_gid = erdma_query_gid,
.query_port = erdma_query_port,
.query_qp = erdma_query_qp,
.req_notify_cq = erdma_req_notify_cq,
.reg_user_mr = erdma_reg_user_mr,

INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq),
INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd),
INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext),
INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp),
};

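struct ib_device_ops (InfiniBand device operations) is in effect the interface between the kernel and the cmdq. Take alloc_mr as an example: when an alloc_mr request to create a Memory Region reaches the driver, erdma_ib_alloc_mr is called.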

erdma_ib_alloc_mr
└── regmr_cmd
    ├── erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_REG_MR)
    └── erdma_post_cmd_wait

In the end, the eRDMA driver posts a CMDQ_OPCODE_REG_MR command to the cmdq to create the Memory Region.

erdma_post_cmd_wait

erdma_post_cmd_wait
├── push_cmdq_sqe
│   └── kick_cmdq_db // update the SQ doorbell register
├── erdma_wait_cmd_completion // when the cmdq EQ interrupt is used
│   └── wait_for_completion_timeout // the current task sleeps until the EQ interrupt handler wakes it (complete(&comp_wait->wait_event))
└── erdma_poll_cmd_completion // when polling is used
    └── erdma_polling_cmd_completions

cmdq initialization

erdma_probe
└── erdma_probe_dev
    ├── erdma_comm_irq_init
    │   └── request_irq(...erdma_comm_irq_handler...)
    └── erdma_cmdq_init
        ├── erdma_cmdq_sq_init
        ├── erdma_cmdq_cq_init
        └── erdma_cmdq_eq_init

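cmdq interrupt notification

Q: The cmdq already has a CQ, so why does it also need an EQ (CEQ0)?
A: With only a CQ, software could only poll. With the CEQ added, the cmdq and the EQ together provide interrupt notification.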

static irqreturn_t erdma_comm_irq_handler(int irq, void *data)
{
struct erdma_dev *dev = data;

erdma_cmdq_completion_handler(&dev->cmdq);
erdma_aeq_event_handler(dev);

return IRQ_HANDLED;
}

erdma_cmdq_completion_handler
├── erdma_polling_cmd_completions
│   ├── erdma_poll_single_cmd_completion
│   │   ├── get_next_valid_cmdq_cqe
│   │   └── complete(&comp_wait->wait_event) // wakes the task waiting on wait_event (erdma_wait_cmd_completion)
│   └── arm_cmdq_cq // update the CQ doorbell register
└── notify_eq // update the EQ doorbell register

cmdq registers

...
#define ERDMA_REGS_CMDQ_SQ_ADDR_L_REG 0x20
#define ERDMA_REGS_CMDQ_SQ_ADDR_H_REG 0x24
#define ERDMA_REGS_CMDQ_CQ_ADDR_L_REG 0x28
#define ERDMA_REGS_CMDQ_CQ_ADDR_H_REG 0x2C
#define ERDMA_REGS_CMDQ_DEPTH_REG 0x30
#define ERDMA_REGS_CMDQ_EQ_DEPTH_REG 0x34
#define ERDMA_REGS_CMDQ_EQ_ADDR_L_REG 0x38
#define ERDMA_REGS_CMDQ_EQ_ADDR_H_REG 0x3C
...

References:

RDMA resource collection

2024-08-24T10:40:54.000Z

This post keeps a running collection of RDMA resources.

network: USO vs UFO

2024-08-18T12:24:36.000Z

This post records notes on USO (UDP Segmentation Offload) vs UFO (UDP Fragmentation Offload).
Read Network Segmentation vs Fragmentation first. Note that segmentation exists for UDP as well.

Definitions

UDP Segmentation Offload (USO) is a feature that enables network interface cards (NICs) to offload the segmentation of UDP datagrams that are larger than the maximum transmission unit (MTU) of the network medium.

UDP fragmentation offload allows a device to fragment an oversized UDP datagram into multiple IPv4 fragments.

USO vs UFO

There is a USO feature that is different from existing UFO:

UFO fragments the UDP packet and only first fragment carries the UDP header (SKB_GSO_UDP in the Linux network stack)
USO segments the UDP packet, each segment has a UDP header and IP identification field is incremented for each segment. It is designated as SKB_GSO_UDP_L4 in the Linux network stack

VIRTIO_NET_F_HOST_USO (56)
Device can receive USO packets. Unlike UFO (fragmenting the packet) the USO splits large UDP packet to several segments when each of these smaller packets has UDP header.
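The difference shows up when counting packets on the wire. The sketch below assumes IPv4 without options, an MTU larger than 28 bytes, and a non-empty payload; the helper names are made up for the example. With USO every segment is a complete UDP datagram with its own UDP header, while with UFO a single UDP header sits in the first IPv4 fragment and non-final fragments carry a multiple of 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HDR_LEN 20u   /* IPv4 header without options */
#define UDP_HDR_LEN   8u

/* Ceiling division for positive integers. */
static uint32_t div_round_up(uint32_t a, uint32_t b)
{
    return (a + b - 1) / b;
}

/* USO: each segment carries its own IP header and UDP header, so the
 * payload room per segment is mtu - 20 - 8. The returned segment count
 * is also the number of UDP headers sent. */
static uint32_t uso_segments(uint32_t payload, uint32_t mtu)
{
    uint32_t per_seg = mtu - IPV4_HDR_LEN - UDP_HDR_LEN;

    return div_round_up(payload, per_seg);
}

/* UFO: one UDP datagram (8-byte header + payload) is cut into IPv4
 * fragments. Only the first fragment contains the UDP header, and the
 * IP payload of non-final fragments is a multiple of 8 bytes. */
static uint32_t ufo_fragments(uint32_t payload, uint32_t mtu)
{
    uint32_t per_frag = (mtu - IPV4_HDR_LEN) & ~7u;

    return div_round_up(UDP_HDR_LEN + payload, per_frag);
}
```

With an MTU of 1500, a 2952-byte payload needs 3 USO segments (1472 payload bytes each at most) but only 2 UFO fragments (1480 bytes of IP payload each), because USO repeats the UDP header in every segment.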

References:

Notes about network checksum offload

2024-08-18T11:21:20.000Z

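This post records notes on network checksum offload.

Software protocol checksums

Many network protocols, such as IP, TCP, and UDP, have their own checksum.

TCP checksum

The TCP checksum covers three parts: the TCP header, the TCP data, and a TCP pseudo-header. The TCP checksum is mandatory.

UDP checksum

The UDP checksum covers three parts: the UDP header, the UDP data, and a UDP pseudo-header. The UDP checksum is optional over IPv4 (it is mandatory over IPv6).

IP checksum

The IP checksum covers only the header of the IP datagram, not its data.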
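All three checksums use the same 16-bit one's-complement sum described in RFC 1071; they differ only in which bytes are fed in. A minimal software version is sketched below. The function name is made up, and the accumulator is sized for inputs up to the 64 KB maximum of an IP packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's-complement Internet checksum over len bytes.
 * Bytes are summed as big-endian 16-bit words; an odd trailing byte is
 * padded with a zero byte on the right. The 32-bit accumulator is large
 * enough for inputs up to 64 KB.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

To compute an IPv4 header checksum, the checksum field (bytes 10 and 11) is set to zero before calling the function. To verify a received header, the function is run over the header as received; a result of 0 means the header is intact.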

checksum offload

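Traditionally, checksum computation (on transmit) and verification (on receive) were done by the CPU. This is expensive because every byte of data takes part in the computation. On a 100 Gbit/s network, the CPU would have to checksum up to about 12.5 GB of data per second.

To reduce this cost, modern NICs support computing and verifying checksums. When the kernel builds a packet it can skip the checksum; when the NIC receives the packet from the kernel, it computes the checksum according to the protocol rules and fills it into the proper field.

Because of checksum offload, packet capture tools such as tcpdump sometimes flag captured packets as having a bad checksum (checksum incorrect). What tcpdump captures is the packet the kernel hands to the NIC. If the checksum is computed by the NIC, it has not yet been computed at the moment tcpdump sees the packet, so the captured value is naturally wrong.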

virtio-net

VIRTIO_NET_F_CSUM (0)
Device handles packets with partial checksum. This “checksum offload” is a common feature on modern network cards.

VIRTIO_NET_F_HOST_TSO4
Requires VIRTIO_NET_F_CSUM.

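As the descriptions above show, TSO depends on checksum offload: with TSO enabled, the TCP/IP stack does not know what the final packets on the wire will look like, so it cannot compute their checksums.

References: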

Network Segmentation vs Fragmentation

2024-08-11T09:28:22.000Z

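This post records notes on segmentation and fragmentation in computer networking. Most of the content is reposted from 动图图解！既然IP层会分片，为什么TCP层也还要分段？.

Overview

Segmentation refers specifically to the splitting of data at the transport layer when TCP is used

Fragmentation refers specifically to the splitting of data at the network (IP) layer

Before handing user data to the IP layer, TCP splits large data into smaller pieces according to the MSS (Maximum Segment Size). This process is segmentation, and the resulting pieces are segments. Because of the MTU (Maximum Transmission Unit) limit, IP splits data from the upper layer that exceeds the MTU into multiple pieces. This process is fragmentation, and the resulting pieces are fragments. Both processes split a large block of data into smaller blocks; the difference is that one happens in TCP (L4) and the other in IP (L3).

MSS vs MTU

The MSS is the largest segment that TCP hands to the IP layer. It excludes the TCP header and TCP options and counts only the TCP payload; TCP uses the MSS to limit how many bytes of application data go into one segment.
Assume MTU = 1500 bytes; then MSS = 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes. If the application has 2000 bytes to send, two segments are needed: the first carries 1460 bytes and the second 540 bytes.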
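The arithmetic in this example can be written down as a small helper. The sketch assumes 20-byte IP and TCP headers without options, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_HDR_LEN 20u   /* no IP options */
#define TCP_HDR_LEN  20u   /* no TCP options */

/* MSS derived from the MTU, as in the example above. */
static uint32_t mss_from_mtu(uint32_t mtu)
{
    return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}

/*
 * Split `bytes` of application data into segments of at most `mss` bytes.
 * Writes up to `max` segment sizes into `out` and returns the number of
 * segments needed (which may exceed `max`).
 */
static size_t tcp_segment_sizes(uint32_t bytes, uint32_t mss,
                                uint32_t *out, size_t max)
{
    size_t n = 0;

    while (bytes > 0) {
        uint32_t seg = bytes < mss ? bytes : mss;

        if (n < max)
            out[n] = seg;
        n++;
        bytes -= seg;
    }
    return n;
}
```

For an MTU of 1500 this yields an MSS of 1460, and 2000 bytes of application data split into two segments of 1460 and 540 bytes.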

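The MTU is provided by the data link layer to tell the IP layer above it how much it can carry in one frame. The IP layer splits packets according to it. Typically MTU = 1500 bytes.
If the IP layer has <= 1500 bytes to send, a single IP packet is enough. If it has > 1500 bytes to send, the data must be fragmented. The fragments share the same IP header ID, and so that the receiver can reassemble them, each fragment also carries extra information, such as its offset within the original IP packet.

On the path from the application layer of a host to its NIC, MSS < MTU can be assumed to hold in practice.

Why the MTU is usually 1500

This is a matter of transmission efficiency. The networks we use every day feel stable, but that is because TCP quietly retransmits and otherwise ensures reliable delivery; underneath, links drop packets all the time, and the larger the packet, the more likely it is to be lost.

So is smaller always better? No.

Suppose a fairly small MTU of 300 bytes is chosen. Then TCP payload = 300 - IP header - TCP header = 300 - 20 - 20 = 260 bytes, and the payload efficiency is 260 / 300 ≈ 87%.

With an Ethernet MTU of 1500, the payload efficiency is 1460 / 1500 ≈ 97%, clearly much higher than 87%.

So smaller packets are less likely to be lost, while larger packets are more efficient; 1500 was chosen as a trade-off.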

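Why TCP segments even though IP fragments

The IP layer fragments on its own. Even if TCP did not segment, the packet would be fragmented at the IP layer and the data would still get through.

If the network layer fragments anyway, why does TCP segment as well? Isn't that redundant?

Suppose a large piece of data is not segmented by TCP. If a packet is lost in transit, TCP retransmits, and what it retransmits is the whole piece of data (the IP layer does split it into many MTU-sized packets, but the unit of TCP retransmission is still the whole piece).

If TCP instead segments the data into N pieces of at most MSS bytes each, every piece is still no larger than the MTU after the TCP and IP headers are added, so the IP layer does not fragment. If a packet is lost in transit, TCP retransmits only that one MSS-sized segment, which is more efficient than without segmentation.

Similarly, the transport layer also has UDP, but UDP itself does not segment, so large data is left to the IP layer to fragment before being passed down for transmission.

Normally, on the path from the transport layer to the network layer within one host, if the transport layer has segmented the data, the IP layer does not fragment it again; if the transport layer has not segmented, the IP layer may fragment.

TCP segments so that the IP layer does not need to fragment and so that a retransmission resends only one small segment.

Does TCP segmentation guarantee that IP never fragments?

At the sender, once TCP has segmented, the IP layer does not fragment.

But the path may contain other network-layer devices whose MTU is smaller than the sender's. In that case, packets that were already segmented at the sender are fragmented again at the IP layer.

If a device further along the path has an even smaller MTU, the packets are fragmented yet again. All fragments are finally reassembled at the receiver.

So even after TCP segmentation, the IP layer of other nodes on the path may still fragment, and data that has already been fragmented once may be fragmented a second, third, or fourth time by other machines.

Summary

(TCP) segmentation and (IP) fragmentation happen at different protocol layers (segmentation at the TCP transport layer, fragmentation at the IP layer)
TCP segments because the size of a TCP segment is limited by the MSS; IP fragments because the size of an IP datagram is limited by the MTU
At the sender, data segmented by TCP does not need to be fragmented by IP, and a retransmission resends only one small segment
Segmentation and fragmentation do not both happen at the sender, but within one communication they may happen at the sending host (segmentation) and at a forwarding device (fragmentation) respectively
IP fragmentation is a last resort; avoid it where possible, especially at intermediate devices on the path
UDP does not segment, so IP fragments instead

References: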

Network RFC collection

2024-08-11T08:40:33.000Z

This post keeps a running record of RFC numbers for computer networking protocols.

Notes about pci-pci bridge

2024-08-10T03:52:21.000Z

This post records notes on the pci-pci bridge.

Overview

For PCI-PCI bridges to pass PCI I/O, PCI Memory or PCI Configuration address space reads and writes across them, they need to know the following:

Primary Bus Number
The bus number immediately upstream of the PCI-PCI Bridge,
Secondary Bus Number
The bus number immediately downstream of the PCI-PCI Bridge,
Subordinate Bus Number
The highest bus number of all of the busses that can be reached downstream of the bridge.
PCI I/O and PCI Memory Windows
The window base and size for PCI I/O address space and PCI Memory address space for all addresses downstream of the PCI-PCI Bridge.

Bus Number

/* Header type 1 (PCI-to-PCI bridges) */
#define PCI_PRIMARY_BUS		0x18	/* Primary bus number */
#define PCI_SECONDARY_BUS	0x19	/* Secondary bus number */
#define PCI_SUBORDINATE_BUS	0x1a	/* Highest bus number behind the bridge */
#define PCI_SEC_LATENCY_TIMER	0x1b	/* Latency timer for secondary interface */

The problem is that at the time when you wish to configure any given PCI-PCI bridge you do not know the subordinate bus number for that bridge. You do not know if there are further PCI-PCI bridges downstream and if you did, you do not know what numbers will be assigned to them. The answer is to use a depthwise recursive algorithm and scan each bus for any PCI-PCI bridges assigning them numbers as they are found. As each PCI-PCI bridge is found and its secondary bus numbered, assign it a temporary subordinate number of 0xFF and scan and assign numbers to all PCI-PCI bridges downstream of it.

The subordinate number is in effect computed with a DFS (depth-first search).

For details of the algorithm, see Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers
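The depth-first numbering can be sketched on a toy bridge tree. The structure and function names here are made up for illustration, and only bridges are modelled (no endpoint devices and no config-space accesses).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* A PCI-PCI bridge in a toy topology. */
struct toy_bridge {
    struct toy_bridge *child[TOY_MAX_CHILDREN]; /* bridges on the secondary bus */
    size_t nchildren;
    int primary;      /* bus immediately upstream */
    int secondary;    /* bus immediately downstream */
    int subordinate;  /* highest bus number reachable downstream */
};

/*
 * Depth-first numbering. `primary` is the bus the bridge sits on and
 * *next_bus is the next unused bus number. The bridge holds a temporary
 * subordinate of 0xFF while its subtree is scanned and gets the real
 * value once the scan returns.
 */
static void toy_assign_bus_numbers(struct toy_bridge *b, int primary,
                                   int *next_bus)
{
    b->primary = primary;
    b->secondary = (*next_bus)++;
    b->subordinate = 0xFF;            /* temporary: pass everything down */

    for (size_t i = 0; i < b->nchildren; i++)
        toy_assign_bus_numbers(b->child[i], b->secondary, next_bus);

    b->subordinate = *next_bus - 1;   /* highest number handed out below */
}
```

For a tree where bridge 1 sits on bus 0, bridges 2 and 3 sit behind bridge 1, and bridge 4 sits behind bridge 3, starting with next_bus = 1 gives bridge 1 (primary 0, secondary 1, subordinate 4), bridge 2 (1, 2, 2), bridge 3 (1, 3, 4), and bridge 4 (3, 4, 4).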

PCI I/O and PCI Memory Windows

PCI-PCI bridges only pass a subset of PCI I/O and PCI memory read and write requests downstream. For example, in the following Figure, the PCI-PCI bridge will only pass read and write addresses from PCI bus 0 to PCI bus 1 if they are for PCI I/O or PCI memory addresses owned by either the SCSI or ethernet device; all other PCI I/O and memory addresses are ignored. This filtering stops addresses propagating needlessly throughout the system. To do this, the PCI-PCI bridges must be programmed with a base and limit for PCI I/O and PCI Memory space access that they have to pass from their primary bus onto their secondary bus.

The PCI-PCI Bridge
We now cross the PCI-PCI Bridge and allocate PCI memory there:
- The Ethernet Device
  This is asking for 0xB0 bytes of both PCI I/O and PCI Memory space. It gets allocated PCI I/O at 0x4000 and PCI Memory at 0x400000. The PCI Memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.
- The SCSI Device
  This is asking for 0x1000 PCI Memory and so it is allocated it at 0x401000 after it has been naturally aligned. The PCI I/O base is still 0x40B0 and the PCI Memory base has been moved to 0x402000.
The PCI-PCI Bridge’s PCI I/O and Memory Windows
We now return to the bridge and set its PCI I/O window at between 0x4000 and 0x40B0 and its PCI Memory window at between 0x400000 and 0x402000. This means that the PCI-PCI Bridge will ignore the PCI Memory accesses for the video device and pass them on if they are for the ethernet or scsi devices.

References:

Notes about CUDA Unified Memory

2024-08-04T09:50:41.000Z

This post records notes on CUDA Unified Memory.

Motivation

Overview

Traditionally, GPUs and CPUs have their own memory spaces, and applications running on one particular GPU cannot access the data directly from the memory of other GPUs or CPUs. To improve memory utilization, the latest NVIDIA PASCAL GPU released in 2016 supports unified memory, i.e., each GPU can access the whole memory space of both GPUs and CPUs via uniform memory addresses. In particular, the unified memory provides to all GPUs and CPUs a single memory address space, with an automatic page migration for data locality. The page migration engine also allows GPU threads to trigger page fault when the accessed data does not reside in GPU memory, and this makes the system efficiently migrate pages from anywhere in the system to the memory of GPUs in an on-demand manner.

The benefits of unified memory are twofold. First, it enables a GPU to handle dataset which is larger than its own memory size, because the unified memory can migrate data from CPU memory to GPU memory in an on-demand fashion. Second, using the unified memory can simplify the programming model. In particular, programmers can simply use a pointer to access data pages no matter where they reside, instead of explicitly calling data migration.

CUDA 6+: UNIFIED MEMORY

simplify the programming model

CUDA 8+: UNIFIED MEMORY

SVA

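References:

Notes about flock file locks

2024-08-04T07:16:39.000Z

This post records notes on flock file locks. Most of the content is reposted from 被遗忘的桃源——flock 文件锁.

File lock: flock

To resolve read/write conflicts when multiple processes access the same file, Linux provides the flock system call, which protects reads and writes of a file, that is, it implements a file lock. A file lock protects a file in much the same way that threads use the read-write lock of the pthread library to protect memory. The flock man page says: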

flock - apply or remove an advisory lock on an open file

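Two things can be read from this:

The file lock provided by flock is advisory. An advisory lock, also called a non-mandatory lock, is one that a process can ignore: it can read and write the target file directly regardless of locks held by other processes. The lock therefore only helps synchronize processes if each process actively calls flock to check whether another process has locked the file. Put more explicitly: if another process has locked a file with flock and the current process reads or writes the file without calling flock itself (that is, without checking whether the file is locked), the current process can operate on the file directly, and the other process's lock has no effect on it. A locking mechanism like this, which can be ignored and requires both sides to check, is called an advisory lock.
The file lock must be applied to an open file; from the application's point of view, it is applied to an open file descriptor.

Shared locks and exclusive locks

The prototype of the flock system call on Linux is:

#include <sys/file.h>

int flock(int fd, int operation);

On success flock returns 0; on error it returns -1 and sets errno accordingly.

In the prototype, the operation argument can be LOCK_SH or LOCK_EX, for a shared lock and an exclusive lock respectively. Both constants are defined in file.h. The flock-related constants are defined as follows: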

/* Operations for the `flock' call.  */                                          
#define LOCK_SH 1 /* Shared lock.  */                                            
#define LOCK_EX 2   /* Exclusive lock.  */                                       
#define LOCK_UN 8 /* Unlock.  */                                                 

/* Can be OR'd in to one of the above.  */                                       
#define LOCK_NB 4 /* Don't block when locking.  */

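With LOCK_SH, multiple processes can hold a shared lock on the same file at once, which allows them to read the file in parallel. LOCK_SH is thus similar to the read lock of the pthread_rwlock_t read-write lock. With LOCK_EX, only one process can hold the lock at any time and the others block, which is similar to the write lock of a read-write lock.

Blocking and non-blocking

flock can be used in blocking or non-blocking mode. In blocking mode, a process that cannot acquire the lock blocks until the other processes release their locks on the file and it can take the lock itself. flock is blocking by default.

File locks are also commonly used in another scenario: the process tries to lock the file, and if that fails it should not block; flock should instead return an error so the process can handle it and carry on. This calls for non-blocking mode. Switching flock to non-blocking mode is simple: bitwise-OR the lock type with the LOCK_NB constant in the operation argument, for example:

int ret = flock(open_fd, LOCK_SH | LOCK_NB);
int ret = flock(open_fd, LOCK_EX | LOCK_NB);

In non-blocking mode, failing to acquire the lock does not stall the process, but error handling must be added: when locking fails, the process must not operate on the target file.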
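A small wrapper makes the non-blocking pattern explicit. This is a sketch for Linux; the helper names (try_lock_exclusive, unlock_file) are made up, while flock, LOCK_EX, LOCK_NB, LOCK_UN, and EWOULDBLOCK come from the system headers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive lock without blocking.
 * Returns true if the lock was acquired. Returns false if flock failed;
 * errno is EWOULDBLOCK when a conflicting lock is already held through
 * another open file description.
 */
static bool try_lock_exclusive(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB) == 0;
}

/* Release a lock previously taken through fd. */
static bool unlock_file(int fd)
{
    return flock(fd, LOCK_UN) == 0;
}
```

A caller would open the file, call try_lock_exclusive, and touch the file only when it returns true; on false with errno set to EWOULDBLOCK it can skip the work or retry later. Because flock locks belong to the open file description, opening the same file twice in one process and locking through both descriptors shows the same conflict as two separate processes would.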

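The flock command

Besides the flock system call or function offered by many languages, the Linux shell also provides a flock command.
flock(1)

References:

A closer look at eventfd_signal

2024-08-03T06:24:14.000Z

This post records how eventfd_signal is implemented.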

Overview

eventfd_signal
└── eventfd_signal_mask
    └── wake_up_locked_poll[__wake_up_locked_key]
        └── __wake_up_common

static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, int wake_flags, void *key,
wait_queue_entry_t *bookmark)
{
wait_queue_entry_t *curr, *next;
int cnt = 0;
        ...

list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
unsigned flags = curr->flags;
int ret;

if (flags & WQ_FLAG_BOOKMARK)
continue;

ret = curr->func(curr, mode, wake_flags, key);
                ...
}

...
}

The implementation of __wake_up_common shows that eventfd_signal ultimately invokes the func callback of each wait_queue_entry.

/*
 * A single wait-queue entry structure:
 */
struct wait_queue_entry {
	unsigned int		flags;
	void			*private;
	wait_queue_func_t	func;
	struct list_head	entry;
};

vhost_poll_wakeup

The post 源码解析:vhost ioeventfd与irqfd mentioned vhost_poll_wakeup. How does that function tie in with eventfd_signal?

void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
     __poll_t mask, struct vhost_dev *dev,
     struct vhost_virtqueue *vq)
{
        ...
        init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
        ...
}

static inline void
init_waitqueue_func_entry(struct wait_queue_entry *wq_entry, wait_queue_func_t func)
{
	wq_entry->flags		= 0;
	wq_entry->private	= NULL;
	wq_entry->func		= func;
}

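The snippets above show that vhost_poll_wakeup is installed as the func callback of the wait_queue_entry.

It follows that eventfd_signal eventually calls vhost_poll_wakeup. vhost_poll_wakeup therefore runs in the context of the vCPU thread (KVM calls eventfd_signal, and KVM runs in the vCPU thread).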

ioeventfd_write
└── eventfd_signal
    └── eventfd_signal_mask
        └── wake_up_locked_poll[__wake_up_locked_key]
            └── __wake_up_common
                └── vhost_poll_wakeup

The `func` callback of `wait_queue_entry` for select/poll/epoll

// for select and poll
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
		       poll_table *p)
{
	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
	struct poll_table_entry *entry = poll_get_entry(pwq);
	if (!entry)
		return;
	entry->filp = get_file(filp);
	entry->wait_address = wait_address;
	entry->key = p->_key;
	init_waitqueue_func_entry(&entry->wait, pollwake);
	entry->wait.private = pwq;
	add_wait_queue(wait_address, &entry->wait);
}

For select and poll, the func callback of the wait_queue_entry is pollwake.

// for epoll
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct ep_pqueue *epq = container_of(pt, struct ep_pqueue, pt);
	struct epitem *epi = epq->epi;
	struct eppoll_entry *pwq;

	if (unlikely(!epi))	// an earlier allocation has failed
		return;

	pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
	if (unlikely(!pwq)) {
		epq->epi = NULL;
		return;
	}

	init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
	pwq->whead = whead;
	pwq->base = epi;
	if (epi->event.events & EPOLLEXCLUSIVE)
		add_wait_queue_exclusive(whead, &pwq->wait);
	else
		add_wait_queue(whead, &pwq->wait);
	pwq->next = epi->pwqlist;
	epi->pwqlist = pwq;
}

For epoll, the func callback of the wait_queue_entry is ep_poll_callback.

For simplicity, this post only looks at pollwake in detail.

// Callback function (func) installed on the wait queue entry (wait_queue_t).
// Called once the file becomes ready, to wake up the calling process; key is the current status mask supplied by the file.
static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    struct poll_table_entry *entry;
    // Get the poll_table_entry that belongs to this file
    entry = container_of(wait, struct poll_table_entry, wait);
    // Filter out events the caller is not interested in
    if (key && !((unsigned long)key & entry->key)) {
        return 0;
    }
    // Wake up
    return __pollwake(wait, mode, sync, key);
}
static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    struct poll_wqueues *pwq = wait->private;
    // Attach the calling process, pwq->polling_task, to dummy_wait
    DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
    smp_wmb();
    pwq->triggered = 1; // mark as triggered
    // Wake up the calling process
    return default_wake_function(&dummy_wait, mode, sync, key);
}

// Default wake function; the callbacks installed by poll/select call it to do the wakeup.
// It wakes the thread on the wait queue directly, i.e. moves the thread to the run queue (rq).
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
                          void *key)
{
    // try_to_wake_up is fairly involved and is not analysed here
    return try_to_wake_up(curr->private, mode, wake_flags);
}
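From user space, the effect of this wake path can be observed with a plain eventfd and poll. The sketch below is an illustration, not part of the original analysis; eventfd_poll_demo is a hypothetical name. Note the assumption: a write() from user space goes through the eventfd file's own write handler rather than eventfd_signal, and to my understanding both wake the same wait queue. Because the demo polls with a zero timeout it never actually sleeps, so it shows only the readiness change, not the wakeup of a blocked task.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns the counter value read back from the eventfd, or -1 on error. */
static long eventfd_poll_demo(void)
{
    uint64_t val = 3;
    int efd = eventfd(0, EFD_NONBLOCK);

    if (efd < 0)
        return -1;

    struct pollfd pfd = { .fd = efd, .events = POLLIN };

    /* The counter is 0, so there is nothing to read and poll() reports no event. */
    int before = poll(&pfd, 1, 0);

    /* Signal the eventfd by adding 3 to its counter. */
    if (write(efd, &val, sizeof(val)) != (ssize_t)sizeof(val)) {
        close(efd);
        return -1;
    }

    /* The counter is now non-zero, so POLLIN is reported. */
    int after = poll(&pfd, 1, 0);
    int readable = (after == 1) && (pfd.revents & POLLIN);

    /* Reading returns the accumulated counter and resets it to 0. */
    val = 0;
    if (read(efd, &val, sizeof(val)) != (ssize_t)sizeof(val)) {
        close(efd);
        return -1;
    }
    close(efd);

    if (before != 0 || !readable)
        return -1;
    return (long)val;
}
```

The POLLIN bit checked here corresponds to the key mask that pollwake compares against entry->key: a waiter that asked only for other events would be filtered out instead of woken.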

References:

linux 内核poll/select/epoll实现剖析（经典）-上

Understanding the virtio kick operation in depth

2024-07-28T02:09:42.000Z

This post combines the virtio spec with the Linux source code to take a close look at the kick operation in virtio.

Overview

This post describes the virtio kick operation in detail, following its historical development in this order:

legacy device kick
modern device kick
kick with the VIRTIO_F_NOTIFICATION_DATA feature

legacy device

/* the notify function used when creating a virt queue */
bool vp_notify(struct virtqueue *vq)
{
	/* we write the queue's selector into the notification register to
	 * signal the other end */
	iowrite16(vq->index, (void __iomem *)vq->priv);
	return true;
}

// legacy device
static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
				  struct virtio_pci_vq_info *info,
				  unsigned int index,
				  void (*callback)(struct virtqueue *vq),
				  const char *name,
				  bool ctx,
				  u16 msix_vec)
{
	...
	/* create the vring */
	vq = vring_create_virtqueue(index, num,
				    VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
				    true, false, ctx,
				    vp_notify, callback, name);
	...
	vq->priv = (void __force *)vp_dev->ldev.ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
	...
}

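The kick register sits at offset VIRTIO_PCI_QUEUE_NOTIFY in BAR0. All vqs share the same kick register address; the driver writes the vq index into the kick register to tell the virtio backend which vq to process.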

modern device

Device implementations usually set queue_notify_off to the vq index, so the return value of vp_modern_get_queue_notify_off is usually equal to its index argument.

/*
 * vp_modern_get_queue_notify_off - get notification offset for a virtqueue
 * @mdev: the modern virtio-pci device
 * @index: the queue index
 *
 * Returns the notification offset for a virtqueue
 */
static u16 vp_modern_get_queue_notify_off(struct virtio_pci_modern_device *mdev,
					  u16 index)
{
	vp_iowrite16(index, &mdev->common->queue_select);

	return vp_ioread16(&mdev->common->queue_notify_off);
}

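When notify_off_multiplier is non-zero, different vqs use different kick register addresses (given distinct queue_notify_off values); the driver writes the vq index into the kick register to tell the virtio backend which vq to process.
When notify_off_multiplier is 0, all vqs share the same kick register address; the driver writes the vq index into the kick register to tell the virtio backend which vq to process.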
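The two cases come down to one multiplication. The sketch below is a stand-alone illustration with hypothetical helper names (vq_notify_offset, vq_notify_fits); it mirrors the arithmetic and the bounds check of vp_modern_map_vq_notify but touches no real device memory.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a queue's kick register inside the notification area:
 * queue_notify_off * notify_off_multiplier. Hypothetical helper. */
static uint64_t vq_notify_offset(uint16_t queue_notify_off, uint32_t multiplier)
{
    return (uint64_t)queue_notify_off * multiplier;
}

/* The 2-byte register written without VIRTIO_F_NOTIFICATION_DATA must
 * lie entirely inside the notification area of notify_len bytes. */
static int vq_notify_fits(uint16_t queue_notify_off, uint32_t multiplier,
                          uint64_t notify_len)
{
    return vq_notify_offset(queue_notify_off, multiplier) + 2 <= notify_len;
}
```

With a multiplier of 4, queue_notify_off values 0, 1, 2 map to byte offsets 0, 4, 8, so each vq gets its own register. With a multiplier of 0 every vq maps to offset 0, and the backend has to rely on the written vq index to tell the queues apart.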

// modern device
static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
				  struct virtio_pci_vq_info *info,
				  unsigned int index,
				  void (*callback)(struct virtqueue *vq),
				  const char *name,
				  bool ctx,
				  u16 msix_vec)
{
	...
	/* create the vring */
	vq = vring_create_virtqueue(index, num,
				    SMP_CACHE_BYTES, &vp_dev->vdev,
				    true, true, ctx,
				    notify, callback, name);
	...
	vq->priv = (void __force *)vp_modern_map_vq_notify(mdev, index, NULL);
	...
}

/*
 * vp_modern_map_vq_notify - map notification area for a
 * specific virtqueue
 * @mdev: the modern virtio-pci device
 * @index: the queue index
 * @pa: the pointer to the physical address of the notify area
 *
 * Returns the address of the notification area
 */
void __iomem *vp_modern_map_vq_notify(struct virtio_pci_modern_device *mdev,
				      u16 index, resource_size_t *pa)
{
	u16 off = vp_modern_get_queue_notify_off(mdev, index);

	if (mdev->notify_base) {
		/* offset should not wrap */
		if ((u64)off * mdev->notify_offset_multiplier + 2
			> mdev->notify_len) {
			dev_warn(&mdev->pci_dev->dev,
				 "bad notification offset %u (x %u) "
				 "for queue %u > %zd",
				 off, mdev->notify_offset_multiplier,
				 index, mdev->notify_len);
			return NULL;
		}
		if (pa)
			*pa = mdev->notify_pa +
			      off * mdev->notify_offset_multiplier;
		return mdev->notify_base + off * mdev->notify_offset_multiplier;
	} else {
		...
	}
}

VIRTIO_F_NOTIFICATION_DATA feature

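Note that a split vq carries its available index as a free-running 16-bit value, whereas a packed vq's descriptor ring has at most 2^15 entries, so its next-available offset fits in 15 bits and leaves one bit for the wrap counter.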

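The kick register holds more than the vq index:

For a split vq, the kick register also holds avail_idx.
For a packed vq, the kick register also holds avail_idx and the wrap counter. (The name avail_idx is used for convenience; strictly speaking a packed vq has no avail ring and therefore no avail_idx.)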

In the common case, vq_notif_config_data is simply the vq index.

// modern device
static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
				  struct virtio_pci_vq_info *info,
				  unsigned int index,
				  void (*callback)(struct virtqueue *vq),
				  const char *name,
				  bool ctx,
				  u16 msix_vec)
{
	...
	if (__virtio_test_bit(&vp_dev->vdev, VIRTIO_F_NOTIFICATION_DATA))
		notify = vp_notify_with_data;
	else
		notify = vp_notify;
	...
}

static bool vp_notify_with_data(struct virtqueue *vq)
{
	u32 data = vring_notification_data(vq);

	iowrite32(data, (void __iomem *)vq->priv);

	return true;
}

u32 vring_notification_data(struct virtqueue *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);
	u16 next;

	if (vq->packed_ring)
		next = (vq->packed.next_avail_idx &
				~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR))) |
			vq->packed.avail_wrap_counter <<
				VRING_PACKED_EVENT_F_WRAP_CTR;
	else
		next = vq->split.avail_idx_shadow;

	return next << 16 | _vq->index;
}
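The 32-bit layout produced by vring_notification_data can be checked with a small stand-alone sketch. The helper names and the notif_fields type below are hypothetical; the bit positions follow the kernel code above (vq index in the low 16 bits, next offset above it, and for packed rings the wrap counter in the top bit).

```c
#include <assert.h>
#include <stdint.h>

struct notif_fields {      /* hypothetical helper type */
    uint16_t vq_index;
    uint16_t next_off;     /* 15 bits for a packed vq */
    uint8_t  next_wrap;    /* packed vq only */
};

/* Packed vq: 15-bit next offset plus 1-bit wrap counter in the high half. */
static uint32_t notif_pack_packed(uint16_t vq_index, uint16_t next_off, int wrap)
{
    uint32_t next = (next_off & 0x7fffu) | ((wrap ? 1u : 0u) << 15);

    return (next << 16) | vq_index;
}

/* Split vq: the full 16-bit avail index in the high half. */
static uint32_t notif_pack_split(uint16_t vq_index, uint16_t avail_idx)
{
    return ((uint32_t)avail_idx << 16) | vq_index;
}

/* What a device-side model would do to decode a packed-vq notification. */
static struct notif_fields notif_unpack_packed(uint32_t data)
{
    struct notif_fields f;

    f.vq_index  = (uint16_t)(data & 0xffffu);
    f.next_off  = (uint16_t)((data >> 16) & 0x7fffu);
    f.next_wrap = (uint8_t)(data >> 31);
    return f;
}
```

The sketch widens to uint32_t before shifting so that the wrap bit lands in bit 31 without passing through a signed int. For example, vq index 2 with next offset 5 and wrap counter 1 packs to 0x80050002.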

References:

virtio 0.9.5 spec
virtio 1.3 spec

(Repost) User-space GPU pooling technology

2024-07-27T03:29:50.000Z

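This post is reposted from: 用户态GPU池化技术 (User-space GPU pooling technology). It mainly notes down the two classes of solutions, kernel-space virtualization and user-space virtualization.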

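1. Overview

Several software approaches can currently virtualize GPUs. They fall into two classes: kernel-space virtualization and user-space virtualization. This article discusses the differences between the two.

2. Kernel space and user space explained

Taking NVIDIA GPUs as an example, the path from application to hardware has three layers from top to bottom: user space, kernel space, and the GPU hardware (see the figure below: CUDA software stack).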

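2.1 User-space layer

User space is the environment in which applications run. All applications that use NVIDIA GPUs, such as AI computing applications and 2D/3D rendering applications, run in user space. For ease of programming and for security reasons, NVIDIA provides the user-space runtime library CUDA (Compute Unified Device Architecture) as the programming interface for GPU parallel computing (similar interfaces include the community-defined OpenGL and Vulkan). Applications write parallel computing tasks against the CUDA API and communicate with the GPU user-space driver by calling it. The GPU user-space driver in turn talks directly to the GPU kernel-space driver through the ioctl, mmap, read, and write interfaces.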

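2.2 Kernel-space layer

This layer mainly runs the GPU kernel-space driver, which is tightly integrated with the operating system kernel and receives special protection from the operating system and the CPU hardware. The kernel-space driver can execute privileged instructions, exposes interfaces for accessing and operating the hardware, and exercises low-level control over the GPU. For system security, user-space code can call kernel-space code only through standard interfaces predefined by the operating system (on Linux, a small set such as ioctl, mmap, read, and write). The kernel-space code invoked through these interfaces is usually the kernel-space driver of a pre-installed device. This keeps kernel space and user space safely isolated and prevents unsafe user-space code from damaging the whole system. The GPU kernel-space driver communicates with the hardware over PCIe (or possibly another hardware interface) in the form of TLP packets.

Notably, for AI chip products of all kinds, NVIDIA's included, the interface between kernel space and user space is not part of protocol standards such as CUDA, OpenGL, or Vulkan, and the vendors have never published the definition of this layer to the industry. Industry applications therefore do not program against this layer.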

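3. Difficulty of the two virtualization approaches

In terms of technical feasibility, both user space and kernel space offer interfaces on which GPU virtualization or GPU pooling can be built: application runtime interfaces such as CUDA, OpenGL, and Vulkan in user space, and device driver interfaces such as ioctl, read, and write exposed by kernel space.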

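3.1 User-space virtualization

This approach uses standard interfaces such as CUDA, OpenGL, and Vulkan: it intercepts and forwards API calls, parses the intercepted functions, and then calls the corresponding functions in the user-space library provided by the hardware vendor (see the figure above). Intercepting user-space interfaces such as CUDA does not require inserting device files at the OS kernel layer. When the operating system runs an executable (for example an ELF binary on Linux), its loader automatically searches the system, following fixed rules, for the external interfaces the executable depends on, formally known as symbols. Given these lookup rules, it is easy to replace the source of a symbol so that when the executable makes, say, a CUDA call, it reaches not the interface provided by NVIDIA's closed-source user-space software but a modified interface of the same name. The CUDA call is thereby intercepted.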
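The symbol replacement idea can be modeled in a few lines. This is a toy in-process model, not how any real product or the dynamic loader is implemented: the library names, api_alloc, vendor_alloc, shim_alloc and resolve are all hypothetical, and a linear scan over an ordered list stands in for the loader's search rules. On Linux the same effect is commonly obtained by preloading a shim library and forwarding with dlsym(RTLD_NEXT, ...), which this sketch does not use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*api_fn)(int);

struct symbol  { const char *name; api_fn fn; };
struct library { const char *soname; const struct symbol *syms; size_t nsyms; };

/* Hypothetical vendor implementation of an API call. */
static int vendor_alloc(int bytes)
{
    return bytes;
}

static int intercepted_calls;

/* Shim exported under the same symbol name: record the call, then forward. */
static int shim_alloc(int bytes)
{
    intercepted_calls++;
    return vendor_alloc(bytes);
}

static const struct symbol vendor_syms[] = { { "api_alloc", vendor_alloc } };
static const struct symbol shim_syms[]   = { { "api_alloc", shim_alloc } };

/* Resolve `name` by scanning libraries in search order; the first match wins. */
static api_fn resolve(const struct library *libs, size_t nlibs, const char *name)
{
    for (size_t i = 0; i < nlibs; i++) {
        for (size_t j = 0; j < libs[i].nsyms; j++) {
            if (strcmp(libs[i].syms[j].name, name) == 0)
                return libs[i].syms[j].fn;
        }
    }
    return NULL;
}

static int interpose_demo(void)
{
    const struct library order[] = {
        { "libshim.so",   shim_syms,   1 },  /* placed first in the search order */
        { "libvendor.so", vendor_syms, 1 },
    };
    api_fn fn = resolve(order, 2, "api_alloc");

    if (fn == NULL)
        return -1;
    return fn(64) + intercepted_calls;
}
```

The application code that calls api_alloc is unchanged in both cases; only the search order decides whether the call reaches the shim first. That is why this style of interception needs no recompilation of the application.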

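After API interception, a user-space virtualization solution can also use RPC to perform API remoting (see the figure above): a CPU host can call the GPU of a GPU host over the network. Multiple GPU servers can then form a resource pool that any number of AI workloads can draw on, which achieves GPU pooling. User-space virtualization is a pure software solution. Mature products in the industry include the OrionX GPU pooling product from 趋动科技 and VMware's Bitfusion. This class of solutions has several advantages:

CUDA, OpenGL, Vulkan, and similar interfaces are public, standardized interfaces, so they are open and stable. Solutions built on them therefore have good compatibility and sustainability.
Because the solution runs in user space, it avoids the engineering practice of putting overly complex code in kernel space, where it easily introduces security problems. It can rely on a complex network protocol stack and operating system support in user space to implement and optimize remote GPU access, and thus support GPU pooling efficiently.
Because the solution works in user space, its deployment is the least intrusive to the user's environment and also the safest. Even if a fault occurs, the operating system can isolate it quickly, and with suitable software engineering the solution can have strong self-recovery capability.

Of course, this class of solutions also has a drawback: compared with kernel-space interfaces, user-space API interfaces support more complex parameters and functionality, so there are several orders of magnitude more of them. As a result, implementing GPU virtualization and GPU pooling at the user-space layer takes far more development effort than doing so in kernel space.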

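3.2 Kernel-space virtualization

Similar to the user-space API interception above, kernel-space virtualization solutions from third-party vendors virtualize the GPU by intercepting the interfaces between kernel space and user space, such as ioctl, mmap, read, and write. The key point is that a kernel interception module has to be added to the operating system kernel, and some device files have to be created on the operating system to emulate the normal GPU device files. For example, NVIDIA GPUs on Linux have several device files such as /dev/nvidiactl and /dev/nvidia0. When a virtualized GPU is used, the virtualized device files are mounted into the workload container and, through mount renaming, disguised under the same names as the NVIDIA device files for the application to access. When an application inside the container accesses the device files through CUDA, it still opens files such as /dev/nvidiactl and /dev/nvidia0; the access is redirected to the emulated device files and issues calls such as ioctl to kernel space, where the kernel interception module captures and parses them. In China, the qGPU and cGPU solutions both work at this layer. The advantages of this class of solutions are: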

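Good flexibility and no dependence on the GPU hardware, so it works on both data-center and consumer GPUs.
Decent isolation while the GPU is shared.
Because it only supports container environments, the development effort is much smaller than for user-space solutions.

Because these solutions work in kernel space, their drawbacks are equally obvious:

Files have to be inserted at the kernel-space layer, which is highly intrusive to the system and easily introduces security risks.
Because the ioctl and other interfaces of NVIDIA's GPU kernel-space driver, as well as its user-space modules, are closed source and the interfaces are not public, only NVIDIA itself can support the full range of GPU virtualization capabilities at this layer. Third-party vendors can only parse these interfaces through some degree of reverse engineering. This carries considerable legal risk and uncertainty, and is far less sustainable than the user-space approach.
Lacking complete interface details, third-party vendors can currently support this layer only by interface "avoidance". Put simply, "avoidance" means parsing only the few interfaces that are strictly necessary and letting all the others pass through unhooked. To make this "avoidance" practical, these solutions currently support only container-based environments (because that is easy to implement). They cannot support non-containerized environments or KVM virtualization, let alone cross the operating system boundary to support remote GPU calls, the core of GPU pooling. These solutions are therefore not complete GPU pooling solutions.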

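3.3 Interface parsing

After interception, either virtualization solution is activated within the current interface call, and the next step is to parse that interface. Whether it is an ioctl interface or a CUDA interface, it can be expressed, from a computer design standpoint, in the form interface_name(parameterA, parameterB, …): an interface name plus interface parameters (the return value is also a kind of parameter). Regardless of which layer is intercepted, the parsing comes in two kinds: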

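Interface parsing within the same process space (see the figure above): In a modern operating system, code in both user space and kernel space executes inside a process space maintained by the CPU hardware and the operating system. A process space has a single process context, and all resources within it are shared and seen through one consistent view, including the memory address space and the resources on the GPU device. This design makes interface parsing within one process space much easier. For an interface interface_name(parameterA, parameterB, …), even if some parameter has an undisclosed meaning, say parameterB, the virtualization software does not need to parse it as long as that content is not needed for emulating the GPU: after capturing the call, it can pass the parameter straight through to NVIDIA's own closed-source module, because all resources in the process space are shared and consistently visible. In practice only a few interfaces and a few parameters need to be parsed and emulated within a process, so this route "avoids" parsing the vast majority of interfaces and parameters. For NVIDIA GPUs specifically, only a very small number of interfaces and parameters need to be truly parsed and emulated. The reason some products can virtualize the GPU at the undisclosed kernel interface layer is that they exploit the properties of a single operating system and get by with a small amount of interface information. This route has one very clear limitation, however: interception, parsing, and execution can only happen within the same process space. In principle it therefore cannot support KVM virtualization, which crosses OS kernels, let alone cross physical nodes to call a GPU remotely.
Interface parsing across different process spaces (see the figure above): When the operating system running the GPU application and the operating system managing the physical GPU are two different systems, achieving GPU virtualization or GPU pooling requires parsing the chosen GPU interface layer across processes. Typical scenarios are KVM virtual machines and calling a GPU across physical nodes. Because the application and the GPU management software stack (for example the GPU driver) are no longer managed by one operating system, resources are no longer shared and the view is no longer consistent. For instance, the same virtual address most likely refers to different content in different process spaces. Every interface interface_name(parameterA, parameterB, …) therefore has to be fully parsed and processed, and then transferred across operating systems, for example over the network. NVIDIA's CUDA, for example, has tens of thousands of interfaces; each one has to be parsed across process spaces and then have its behavior emulated. Cross-process-space parsing at an undisclosed interface layer is therefore unworkable in principle.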
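The "avoidance" style of same-process-space parsing can be sketched as a shim that understands one command and forwards everything else untouched. Everything in this sketch is hypothetical: the command numbers, alloc_args, the vgpu quota structure, and vendor_ioctl, which merely stands in for a closed-source driver entry point. It says nothing about how any real GPU driver or product encodes its ioctls.

```c
#include <assert.h>
#include <stddef.h>

/* All command numbers and structures here are made up for illustration. */
enum { CMD_ALLOC_MEM = 1, CMD_OPAQUE_A = 2, CMD_OPAQUE_B = 3 };

struct alloc_args { size_t bytes; };

static size_t vendor_allocated;

/* Stand-in for the closed-source driver entry point. */
static int vendor_ioctl(int cmd, void *arg)
{
    if (cmd == CMD_ALLOC_MEM)
        vendor_allocated += ((struct alloc_args *)arg)->bytes;
    return 0;   /* opaque commands simply succeed in this model */
}

/* Per-container virtual GPU with a memory quota. */
struct vgpu { size_t quota; size_t used; };

/* Parse only the command needed to enforce the quota; pass the rest through. */
static int shim_ioctl(struct vgpu *v, int cmd, void *arg)
{
    if (cmd == CMD_ALLOC_MEM) {
        size_t want = ((struct alloc_args *)arg)->bytes;

        if (want > v->quota - v->used)
            return -1;          /* over quota: reject without calling the driver */
        v->used += want;
    }
    /* Same address space, so `arg` is still valid for the driver as-is. */
    return vendor_ioctl(cmd, arg);
}
```

The pass-through on the last line only works because the shim and the driver see the same pointers. Once the two sides live in different address spaces, an unparsed `arg` can no longer be forwarded, which is the limitation the second item describes.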

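After interface parsing, the GPU application has to be given an emulated GPU execution environment, and that emulation is done by the GPU virtualization and GPU pooling software. Different software offers different emulation capabilities, but the baseline is always transparency to the application above: the application needs no code changes and no recompilation.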

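3.4 Summary

For GPU virtualization and resource pooling, there are two choices of interface layer and two choices of interface parsing, which combine into 4 possibilities. The 4 are compared below.

Comparing these 4 possibilities, we can summarize:

Kernel-space solutions work only within the same process space and cannot cross machines, so they cannot achieve GPU pooling.
For kernel-space solutions, GPU virtualization within the same process space is relatively simple.
Only user-space solutions can work across different process spaces and across machines, so only they can achieve GPU pooling.
For user-space solutions, GPU pooling across process spaces requires parsing a large number of interfaces, so the difficulty and the barrier to entry are high.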
