This post records notes on the cmdq (command queue) in RDMA.
In this post, the terms Host Channel Adapter (HCA) device, HCA device, NIC, NIC device and adapter device are used interchangeably.

Introduction

The HCA command interface is used for:

  • configuring the HCA
  • the handshake between hardware and system software
  • handling (querying, configuring, modifying) HCA objects

The HCA is configured using the command queues. Each function has its own command queues to get commands from its HCA driver.

The command queue is the transport that is used to pass commands to the HCA.

The cmdq is in effect a kind of SQ (Send Queue), and can be compared to the admin SQ (submission queue) in NVMe.

The concrete cmdq implementation differs between types of RDMA devices.
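
The SQ analogy can be made concrete with a minimal ring-buffer sketch. Everything in it is hypothetical and not taken from any driver: the `toy_cmdq` struct, the depth of 8, the 64-byte entry size, and the `doorbell` field that stands in for a memory-mapped device register. It only illustrates the general pattern of writing an entry, advancing a producer index and ringing a doorbell.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CMDQ_DEPTH 8u   /* hypothetical depth; must be a power of two */
#define TOY_SQE_SIZE   64u  /* hypothetical entry size in bytes */

struct toy_cmdq {
    uint8_t  sq[TOY_CMDQ_DEPTH][TOY_SQE_SIZE]; /* ring of command entries */
    uint32_t pi;        /* producer index, advanced by the driver */
    uint32_t ci;        /* consumer index, advanced as the device consumes */
    uint32_t doorbell;  /* stands in for a memory-mapped doorbell register */
};

/* Copy one command into the ring and "ring the doorbell".
 * Returns 0 on success, -1 if the ring is full. */
static int toy_cmdq_post(struct toy_cmdq *q, const void *cmd, size_t len)
{
    assert(len <= TOY_SQE_SIZE);
    if (q->pi - q->ci == TOY_CMDQ_DEPTH)
        return -1;                       /* no free slot */
    uint8_t *slot = q->sq[q->pi & (TOY_CMDQ_DEPTH - 1)];
    memset(slot, 0, TOY_SQE_SIZE);
    memcpy(slot, cmd, len);
    q->pi++;
    q->doorbell = q->pi;                 /* tell the "device" about new work */
    return 0;
}

/* Simulate the device consuming one posted entry.
 * Returns 0 if an entry was consumed, -1 if nothing is outstanding. */
static int toy_cmdq_consume(struct toy_cmdq *q)
{
    if (q->ci == q->doorbell)
        return -1;
    q->ci++;
    return 0;
}
```

The indices are kept unmasked and only masked when indexing the ring, so `pi - ci` directly gives the number of outstanding entries.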

Mellanox mlx4 example

For details of the Mellanox mlx4 cmdq, see section 7.14 "Command Interface" of the spec.

eRDMA example

The cmdq is the main control plane channel between the erdma driver and the hardware. After the erdma device is initialized, the cmdq channel stays active for the whole lifecycle of the driver.

cmdq commands

eRDMA supports the following commands:

enum CMDQ_RDMA_OPCODE {
    CMDQ_OPCODE_QUERY_DEVICE = 0,
    CMDQ_OPCODE_CREATE_QP = 1,
    CMDQ_OPCODE_DESTROY_QP = 2,
    CMDQ_OPCODE_MODIFY_QP = 3,
    CMDQ_OPCODE_CREATE_CQ = 4,
    CMDQ_OPCODE_DESTROY_CQ = 5,
    CMDQ_OPCODE_REFLUSH = 6,
    CMDQ_OPCODE_REG_MR = 8,
    CMDQ_OPCODE_DEREG_MR = 9
};

enum CMDQ_COMMON_OPCODE {
    CMDQ_OPCODE_CREATE_EQ = 0,
    CMDQ_OPCODE_DESTROY_EQ = 1,
    CMDQ_OPCODE_QUERY_FW_INFO = 2,
    CMDQ_OPCODE_CONF_MTU = 3,
    CMDQ_OPCODE_CONF_DEVICE = 5,
    CMDQ_OPCODE_ALLOC_DB = 8,
    CMDQ_OPCODE_FREE_DB = 9,
};

erdma_device_ops

static const struct ib_device_ops erdma_device_ops = {
    .owner = THIS_MODULE,
    .driver_id = RDMA_DRIVER_ERDMA,
    .uverbs_abi_ver = ERDMA_ABI_VERSION,

    .alloc_mr = erdma_ib_alloc_mr,
    .alloc_pd = erdma_alloc_pd,
    .alloc_ucontext = erdma_alloc_ucontext,
    .create_cq = erdma_create_cq,
    .create_qp = erdma_create_qp,
    .dealloc_pd = erdma_dealloc_pd,
    .dealloc_ucontext = erdma_dealloc_ucontext,
    .dereg_mr = erdma_dereg_mr,
    .destroy_cq = erdma_destroy_cq,
    .destroy_qp = erdma_destroy_qp,
    .get_dma_mr = erdma_get_dma_mr,
    .get_port_immutable = erdma_get_port_immutable,
    .iw_accept = erdma_accept,
    .iw_add_ref = erdma_qp_get_ref,
    .iw_connect = erdma_connect,
    .iw_create_listen = erdma_create_listen,
    .iw_destroy_listen = erdma_destroy_listen,
    .iw_get_qp = erdma_get_ibqp,
    .iw_reject = erdma_reject,
    .iw_rem_ref = erdma_qp_put_ref,
    .map_mr_sg = erdma_map_mr_sg,
    .mmap = erdma_mmap,
    .mmap_free = erdma_mmap_free,
    .modify_qp = erdma_modify_qp,
    .post_recv = erdma_post_recv,
    .post_send = erdma_post_send,
    .poll_cq = erdma_poll_cq,
    .query_device = erdma_query_device,
    .query_gid = erdma_query_gid,
    .query_port = erdma_query_port,
    .query_qp = erdma_query_qp,
    .req_notify_cq = erdma_req_notify_cq,
    .reg_user_mr = erdma_reg_user_mr,

    INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq),
    INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd),
    INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext),
    INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp),
};

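struct ib_device_ops (InfiniBand device operations) is in effect the interface through which the kernel RDMA core reaches the cmdq: many of these callbacks end up posting a cmdq command. Take alloc_mr as an example: when a request to create a Memory Region reaches the driver through this callback, erdma_ib_alloc_mr is called. (In the kernel verbs API, alloc_mr is reached through ib_alloc_mr() from in-kernel consumers; Memory Region registrations coming from user space go through reg_user_mr, i.e. erdma_reg_user_mr.)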

erdma_ib_alloc_mr
└── regmr_cmd
    ├── erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_REG_MR)
    └── erdma_post_cmd_wait

In the end, the eRDMA driver posts a CMDQ_OPCODE_REG_MR command to the cmdq to create the Memory Region.
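
The two opcode enums listed earlier both start at 0 (for example CMDQ_OPCODE_QUERY_DEVICE and CMDQ_OPCODE_CREATE_EQ), so an opcode alone is ambiguous, which is presumably why the request header also carries a sub-module. The sketch below shows one way such a header could be packed. The bit layout, the `toy_` names and the numeric sub-module values are assumptions made for illustration and are not the real erdma header format; only the value 8 for REG_MR comes from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (NOT the real erdma header format):
 * bits 7..0 hold the opcode, bits 9..8 hold the sub-module. */
#define TOY_HDR_OPCODE_SHIFT 0
#define TOY_HDR_OPCODE_MASK  0xffull
#define TOY_HDR_SUBMOD_SHIFT 8
#define TOY_HDR_SUBMOD_MASK  0x3ull

enum toy_submod { TOY_SUBMOD_RDMA = 0, TOY_SUBMOD_COMMON = 1 }; /* assumed values */
enum { TOY_OPCODE_REG_MR = 8 };  /* same value as CMDQ_OPCODE_REG_MR */

/* Pack sub-module and opcode into a 64-bit request header. */
static void toy_build_reqhdr(uint64_t *hdr, uint32_t submod, uint32_t opcode)
{
    assert(submod <= TOY_HDR_SUBMOD_MASK && opcode <= TOY_HDR_OPCODE_MASK);
    *hdr = ((uint64_t)submod << TOY_HDR_SUBMOD_SHIFT) |
           ((uint64_t)opcode << TOY_HDR_OPCODE_SHIFT);
}

/* Extract the opcode field from a header. */
static uint32_t toy_hdr_opcode(uint64_t hdr)
{
    return (uint32_t)((hdr >> TOY_HDR_OPCODE_SHIFT) & TOY_HDR_OPCODE_MASK);
}

/* Extract the sub-module field from a header. */
static uint32_t toy_hdr_submod(uint64_t hdr)
{
    return (uint32_t)((hdr >> TOY_HDR_SUBMOD_SHIFT) & TOY_HDR_SUBMOD_MASK);
}
```

With this assumed layout, the pair (sub-module, opcode) identifies a command uniquely even though opcode values repeat across the two enums.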

erdma_post_cmd_wait

erdma_post_cmd_wait
├── push_cmdq_sqe
│   └── kick_cmdq_db // update the SQ doorbell (db) register
├── erdma_wait_cmd_completion // when the cmdq EQ interrupt is used
│   └── wait_for_completion_timeout // the current process sleeps until the EQ interrupt handler wakes it via complete(&comp_wait->wait_event)
└── erdma_poll_cmd_completion // when polling is used
    └── erdma_polling_cmd_completions
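
The polling branch of this flow can be sketched in user-space C as follows. This is a toy model under stated assumptions, not the driver's code: `toy_dev`, `toy_wait`, `toy_hw_tick` and the `latency` counter are all invented here, and the fake device is stepped from inside the polling loop, whereas real hardware runs on its own. In the interrupt branch, the loop would be replaced by sleeping in wait_for_completion_timeout until the handler calls complete().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_wait { bool done; int status; };  /* hypothetical completion slot */

struct toy_dev {
    uint32_t sq_pi, sq_db;     /* SQ producer index and doorbell value */
    struct toy_wait *pending;  /* command the fake device will complete */
    int latency;               /* polls needed before the fake device answers */
};

/* Fake hardware step: completes the pending command once the doorbell
 * has been rung and the latency counter has run out. */
static void toy_hw_tick(struct toy_dev *d)
{
    if (d->pending && d->sq_db == d->sq_pi && d->latency-- <= 0) {
        d->pending->status = 0;
        d->pending->done = true;
        d->pending = NULL;
    }
}

/* Post one command, then poll up to max_polls times.
 * Returns the command status on completion, -1 on timeout. */
static int toy_post_cmd_wait(struct toy_dev *d, struct toy_wait *w, int max_polls)
{
    assert(max_polls > 0);
    w->done = false;
    d->pending = w;
    d->sq_pi++;           /* like push_cmdq_sqe: entry written, index advanced */
    d->sq_db = d->sq_pi;  /* like kick_cmdq_db: doorbell tells the device */
    for (int i = 0; i < max_polls; i++) {
        toy_hw_tick(d);   /* stand-in for the device making progress */
        if (w->done)
            return w->status;
    }
    return -1;            /* gave up: no completion within max_polls */
}
```

Bounding the number of polls mirrors the timeout in the interrupt branch: in both cases the caller gets an error instead of waiting forever when the device does not answer.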

cmdq initialization

erdma_probe
└── erdma_probe_dev
    ├── erdma_comm_irq_init
    │   └── request_irq(...erdma_comm_irq_handler...)
    └── erdma_cmdq_init
        ├── erdma_cmdq_sq_init
        ├── erdma_cmdq_cq_init
        └── erdma_cmdq_eq_init

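cmdq interrupt notification

Q: The cmdq already has a CQ, so why does it also need an EQ (CEQ0)?
A: With only a CQ, the driver could only work in polling mode. With the EQ added, the cmdq and the EQ together provide interrupt-based notification.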

static irqreturn_t erdma_comm_irq_handler(int irq, void *data)
{
    struct erdma_dev *dev = data;

    erdma_cmdq_completion_handler(&dev->cmdq);
    erdma_aeq_event_handler(dev);

    return IRQ_HANDLED;
}

erdma_cmdq_completion_handler
├── erdma_polling_cmd_completions
│   ├── erdma_poll_single_cmd_completion
│   │   ├── get_next_valid_cmdq_cqe
│   │   └── complete(&comp_wait->wait_event) // wake the process waiting on wait_event (in erdma_wait_cmd_completion)
│   └── arm_cmdq_cq // update the CQ doorbell register
└── notify_eq // update the EQ doorbell register
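
The shape of this completion path (find the next valid CQE, wake its waiter, re-arm the CQ) can be sketched as below. The validity check uses a phase bit that flips on every lap of the ring, a common technique for completion queues; whether erdma encodes validity exactly this way is not stated in these notes, so treat the CQE layout, the `toy_` names and the phase convention as assumptions. Setting `done` stands in for the kernel's complete() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_DEPTH 4u  /* hypothetical depth; must be a power of two */
#define TOY_MAX_CTX  4u  /* hypothetical number of waiter slots */

struct toy_cqe    { uint8_t phase; uint16_t ctx_id; int status; };
struct toy_waiter { bool done; int status; };

struct toy_cq {
    struct toy_cqe ring[TOY_CQ_DEPTH]; /* starts zeroed, so phase 0 = empty */
    uint32_t ci;  /* consumer index, kept unmasked */
    uint32_t db;  /* stands in for the CQ doorbell register */
};

/* A CQE is valid when its phase bit matches the phase expected for the
 * current lap: 1 on even laps, 0 on odd laps. */
static struct toy_cqe *toy_next_valid_cqe(struct toy_cq *cq)
{
    struct toy_cqe *cqe = &cq->ring[cq->ci & (TOY_CQ_DEPTH - 1)];
    uint8_t expected = ((cq->ci / TOY_CQ_DEPTH) & 1) ? 0 : 1;
    return cqe->phase == expected ? cqe : NULL;
}

/* Drain all valid CQEs, mark their waiters done, then re-arm the CQ.
 * Returns the number of completions handled. */
static int toy_poll_completions(struct toy_cq *cq, struct toy_waiter *waiters)
{
    int n = 0;
    struct toy_cqe *cqe;

    while ((cqe = toy_next_valid_cqe(cq)) != NULL) {
        assert(cqe->ctx_id < TOY_MAX_CTX);
        waiters[cqe->ctx_id].status = cqe->status;
        waiters[cqe->ctx_id].done = true;  /* the kernel would call complete() */
        cq->ci++;
        n++;
    }
    cq->db = cq->ci;  /* arm: report how far the driver has consumed */
    return n;
}
```

Because the expected phase changes after each full lap, entries left over from the previous lap are never mistaken for new completions, and the loop always terminates after at most one lap.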

cmdq-related registers

...
#define ERDMA_REGS_CMDQ_SQ_ADDR_L_REG 0x20
#define ERDMA_REGS_CMDQ_SQ_ADDR_H_REG 0x24
#define ERDMA_REGS_CMDQ_CQ_ADDR_L_REG 0x28
#define ERDMA_REGS_CMDQ_CQ_ADDR_H_REG 0x2C
#define ERDMA_REGS_CMDQ_DEPTH_REG 0x30
#define ERDMA_REGS_CMDQ_EQ_DEPTH_REG 0x34
#define ERDMA_REGS_CMDQ_EQ_ADDR_L_REG 0x38
#define ERDMA_REGS_CMDQ_EQ_ADDR_H_REG 0x3C
...
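
The _L/_H register pairs suggest that each 64-bit queue address is written as two 32-bit halves. The following sketch shows that split against a fake register array. The offsets are copied from the listing; everything else is an assumption for illustration: the `toy_` helpers, the in-memory `toy_bar` array in place of a real memory-mapped BAR, the order of the writes, and the idea that the depth register simply receives the queue depth.

```c
#include <assert.h>
#include <stdint.h>

#define ERDMA_REGS_CMDQ_SQ_ADDR_L_REG 0x20  /* offsets copied from the listing */
#define ERDMA_REGS_CMDQ_SQ_ADDR_H_REG 0x24
#define ERDMA_REGS_CMDQ_DEPTH_REG     0x30

#define TOY_BAR_SIZE 0x40u                  /* hypothetical register window */

static uint32_t toy_bar[TOY_BAR_SIZE / 4];  /* fake register space */

/* Write a 32-bit value to a fake register at byte offset off. */
static void toy_reg_write32(uint32_t off, uint32_t val)
{
    assert(off % 4 == 0 && off < TOY_BAR_SIZE);
    toy_bar[off / 4] = val;
}

/* Read a 32-bit value from a fake register at byte offset off. */
static uint32_t toy_reg_read32(uint32_t off)
{
    assert(off % 4 == 0 && off < TOY_BAR_SIZE);
    return toy_bar[off / 4];
}

/* Program the SQ base address (as two 32-bit halves) and the depth. */
static void toy_cmdq_sq_program(uint64_t dma_addr, uint32_t depth)
{
    toy_reg_write32(ERDMA_REGS_CMDQ_SQ_ADDR_H_REG, (uint32_t)(dma_addr >> 32));
    toy_reg_write32(ERDMA_REGS_CMDQ_SQ_ADDR_L_REG, (uint32_t)(dma_addr & 0xffffffffu));
    toy_reg_write32(ERDMA_REGS_CMDQ_DEPTH_REG, depth);
}
```

For example, programming the address 0x1234567000 stores 0x12 in the high register and 0x34567000 in the low register.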

References:

  1. Mellanox Adapters Programmer’s Reference Manual (PRM)
  2. RDMA/erdma: Add cmdq implementation