This post takes some notes on the Event Queue (EQ) mechanism in RDMA.
The terms Host Channel Adapter (HCA), HCA device, NIC, NIC device, and adapter device are used interchangeably.

Introduction

The HCA has multiple sources that can generate events (completion events, asynchronous events/
errors). Once an event is generated internally, it can be reported to the host software via the Event
Queue mechanism. The EQ is a memory-resident circular buffer used by hardware to write event
cause information for consumption by the host software. Once event reporting is enabled, event
cause information is written by hardware to the EQ when the event occurs. If the EQ is armed, the HW
will subsequently generate an interrupt on the device interface (send an MSI-X message or assert
the pin) as configured in the EQ.
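
The circular-buffer mechanism above can be sketched in plain C. This is a minimal model under assumptions, not any real driver's layout: the names (struct eqe, eq_next_eqe, and so on) are made up, the "hardware" side is simulated by eq_produce, and the ownership-bit convention shown is only one common way a consumer detects fresh entries in a ring without reading a producer index from the device.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Minimal model of an event queue: a circular buffer of EQEs whose
 * "owner" bit flips on every producer wrap, so the consumer can tell a
 * fresh entry from a stale one without reading a producer index. */

#define EQ_DEPTH 4 /* must be a power of two */

struct eqe {
	uint32_t event_data; /* e.g. the CQ number for a completion event */
	uint8_t  owner;      /* written by the producer; flips each wrap */
};

struct eq {
	struct eqe ring[EQ_DEPTH];
	uint32_t ci;    /* consumer index */
	uint8_t  phase; /* owner value the consumer expects next */
};

static void eq_init(struct eq *eq)
{
	memset(eq, 0, sizeof(*eq));
	eq->phase = 1; /* first producer pass writes owner = 1 */
}

/* Return the next valid EQE, or NULL if the queue is empty. */
static struct eqe *eq_next_eqe(struct eq *eq)
{
	struct eqe *e = &eq->ring[eq->ci & (EQ_DEPTH - 1)];

	if (e->owner != eq->phase)
		return NULL; /* entry not (yet) written for this pass */

	eq->ci++;
	if ((eq->ci & (EQ_DEPTH - 1)) == 0)
		eq->phase ^= 1; /* wrapped: expected owner bit flips */
	return e;
}

/* Simulated producer, standing in for the HCA writing an EQE. */
static void eq_produce(struct eq *eq, uint32_t pi, uint32_t data)
{
	struct eqe *e = &eq->ring[pi & (EQ_DEPTH - 1)];

	e->event_data = data;
	e->owner = ((pi / EQ_DEPTH) & 1) ? 0 : 1; /* pass 0 -> 1, pass 1 -> 0, ... */
}
```

The ownership bit lets software distinguish valid entries from stale ones; this mirrors the role played by next_eqe_sw in the mlx4 code later in this post.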

Q && A

Q1: We already have CQs; why do we also need completion EQs?
A1: With only CQs, software would be limited to polling. With CEQs, software takes an interrupt and can read the CEQE to learn which CQ has new completions.

Q2: Why not allocate one interrupt vector per CQ and do away with the EQ mechanism entirely?
A2: An RDMA device can have a great many CQs, quite possibly more than 2048, which would exceed the MSI-X table limit. The EQ mechanism was introduced for this reason: multiple CQs are bound to one EQ, and each EQ gets one interrupt vector. By controlling the number of EQs, the vector count is kept within the MSI-X table limit.
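
The fan-in described in A2 can be stated as a one-line mapping. This is a sketch under assumptions: the names (cq_to_eqn, NUM_CEQ) are made up, and round-robin by CQ number is just one plausible assignment policy.

```c
#include <stdint.h>

/* Many CQs share a small, fixed set of completion EQs, one interrupt
 * vector per EQ; here each CQ is assigned round-robin by CQ number. */

#define NUM_CEQ 32 /* number of completion EQs, i.e. vectors consumed */

static uint32_t cq_to_eqn(uint32_t cqn)
{
	return cqn % NUM_CEQ; /* thousands of CQs collapse onto 32 EQs */
}
```

However many CQs exist, the device consumes only NUM_CEQ vectors from the MSI-X table.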

Q3: How is a CQ bound to an EQ?
A3: While creating a CQ, software configures the EQ number to which this CQ will report completion events.

eRDMA example

The event queue (EQ) is the main notification path from the erdma hardware to its driver. Each erdma device contains two kinds of EQs: the asynchronous EQ (AEQ) and completion EQs (CEQs). Each device has one AEQ, which is used for RDMA asynchronous event reporting, and up to 32 CEQs (numbered CEQ0 to CEQ31). CEQ0 is used for cmdq completion event reporting, and the remaining CEQs are used for RDMA completion event reporting.

Binding a CQ to an EQ

static int create_cq_cmd(struct erdma_ucontext *uctx, struct erdma_cq *cq)
{
	struct erdma_dev *dev = to_edev(cq->ibcq.device);
	struct erdma_cmdq_create_cq_req req;
	struct erdma_mem *mem;
	u32 page_size;

	erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
				CMDQ_OPCODE_CREATE_CQ);

	...
	req.cfg1 = FIELD_PREP(ERDMA_CMD_CREATE_CQ_EQN_MASK, cq->assoc_eqn);
	...
}
int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
		    struct ib_udata *udata)
{
	...
	cq->assoc_eqn = attr->comp_vector + 1;
	...
}
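
The comp_vector-to-EQN mapping in erdma_create_cq is simple enough to state on its own. comp_vector_to_eqn below is a made-up helper name; the +1 reflects the fact, noted above, that CEQ0 is reserved for the cmdq, so user CQs start from CEQ1.

```c
#include <stdint.h>

/* comp_vector is the 0-based completion vector a verbs consumer picks in
 * ib_cq_init_attr; CEQ0 belongs to the cmdq, so user CQs map to CEQ1+. */
static uint32_t comp_vector_to_eqn(uint32_t comp_vector)
{
	return comp_vector + 1; /* skip CEQ0 */
}
```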

Retrieving the CQ number from the EQE

static irqreturn_t erdma_intr_ceq_handler(int irq, void *data)
{
	struct erdma_eq_cb *ceq_cb = data;

	tasklet_schedule(&ceq_cb->tasklet); /* invokes erdma_intr_ceq_task */

	return IRQ_HANDLED;
}
erdma_intr_ceq_task
└── erdma_ceq_completion_handler
    ├── get_next_valid_eqe
    └── cqn = FIELD_GET(ERDMA_CEQE_HDR_CQN_MASK, READ_ONCE(*ceqe))
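
FIELD_GET extracts the CQ number field from the CEQE header. A plain-C sketch of what that macro does, under the assumption of an illustrative mask (the value below is not the real erdma CEQE layout): mask out the field, then shift right by the mask's trailing zero count.

```c
#include <stdint.h>

/* Hypothetical mask: CQ number in bits 8..27 of the 64-bit CEQE header. */
#define CEQE_HDR_CQN_MASK 0x000fffff00ULL

/* Equivalent of the kernel's FIELD_GET for any contiguous mask. */
static uint64_t field_get(uint64_t mask, uint64_t val)
{
	return (val & mask) >> __builtin_ctzll(mask);
}
```

The real driver applies the same operation with ERDMA_CEQE_HDR_CQN_MASK on the value read via READ_ONCE.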

Mellanox mlx4 example

Binding a CQ to an EQ

mlx4_ib_create_cq[mlx4_ib_dev_ops.create_cq]
└── mlx4_cq_alloc
    └── cq_context->comp_eqn = ...
int mlx4_cq_alloc(struct mlx4_dev *dev, int nent,
		  struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec,
		  struct mlx4_cq *cq, unsigned vector, int collapsed,
		  int timestamp_en, void *buf_addr, bool user_cq)
{
	bool sw_cq_init = dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_SW_CQ_INIT;
	struct mlx4_priv *priv = mlx4_priv(dev);
	struct mlx4_cq_table *cq_table = &priv->cq_table;
	struct mlx4_cmd_mailbox *mailbox;
	struct mlx4_cq_context *cq_context;
	u64 mtt_addr;
	int err;

	...

	mailbox = mlx4_alloc_cmd_mailbox(dev);
	...

	cq_context = mailbox->buf;
	...
	cq_context->comp_eqn = priv->eq_table.eq[MLX4_CQ_TO_EQ_VECTOR(vector)].eqn;
	...
}

Retrieving the CQ number from the EQE

enum mlx4_event {
	MLX4_EVENT_TYPE_COMP = 0x00,
	MLX4_EVENT_TYPE_PATH_MIG = 0x01,
	MLX4_EVENT_TYPE_COMM_EST = 0x02,
	MLX4_EVENT_TYPE_SQ_DRAINED = 0x03,
	MLX4_EVENT_TYPE_SRQ_QP_LAST_WQE = 0x13,
	MLX4_EVENT_TYPE_SRQ_LIMIT = 0x14,
	MLX4_EVENT_TYPE_CQ_ERROR = 0x04,
	MLX4_EVENT_TYPE_WQ_CATAS_ERROR = 0x05,
	MLX4_EVENT_TYPE_EEC_CATAS_ERROR = 0x06,
	MLX4_EVENT_TYPE_PATH_MIG_FAILED = 0x07,
	MLX4_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
	MLX4_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
	MLX4_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
	MLX4_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08,
	MLX4_EVENT_TYPE_PORT_CHANGE = 0x09,
	MLX4_EVENT_TYPE_EQ_OVERFLOW = 0x0f,
	MLX4_EVENT_TYPE_ECC_DETECT = 0x0e,
	MLX4_EVENT_TYPE_CMD = 0x0a,
	MLX4_EVENT_TYPE_VEP_UPDATE = 0x19,
	MLX4_EVENT_TYPE_COMM_CHANNEL = 0x18,
	MLX4_EVENT_TYPE_OP_REQUIRED = 0x1a,
	MLX4_EVENT_TYPE_FATAL_WARNING = 0x1b,
	MLX4_EVENT_TYPE_FLR_EVENT = 0x1c,
	MLX4_EVENT_TYPE_PORT_MNG_CHG_EVENT = 0x1d,
	MLX4_EVENT_TYPE_RECOVERABLE_ERROR_EVENT = 0x3e,
	MLX4_EVENT_TYPE_NONE = 0xff,
};

static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
{
	struct mlx4_priv *priv = mlx4_priv(dev);
	struct mlx4_eqe *eqe;
	int cqn;
	int eqes_found = 0;
	int set_ci = 0;
	int port;
	int slave = 0;
	int ret;
	int flr_slave;
	u8 update_slave_state;
	int i;
	enum slave_port_gen_event gen_event;
	unsigned long flags;
	struct mlx4_vport_state *s_info;
	int eqe_size = dev->caps.eqe_size;

	while ((eqe = next_eqe_sw(eq, dev->caps.eqe_factor, eqe_size))) {
		/*
		 * Make sure we read EQ entry contents after we've
		 * checked the ownership bit.
		 */
		dma_rmb();

		switch (eqe->type) {
		case MLX4_EVENT_TYPE_COMP:
			cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff;
			mlx4_cq_completion(dev, cqn);
			break;
		...

cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff ensures that the CQ number is taken from bits 0-23 of the event data; the top byte is masked off.
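
To make the byte-order handling concrete, here is a portable stand-in for that line. Note the real kernel be32_to_cpu operates on a __be32 value, not a byte array; the array form below is just a way to keep the sketch endian-independent.

```c
#include <stdint.h>

/* Interpret 4 bytes as a big-endian 32-bit value (what be32_to_cpu
 * yields regardless of host endianness). */
static uint32_t be32_to_cpu_sketch(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* The top byte of eqe->event.comp.cqn is not part of the CQ number;
 * masking with 0xffffff keeps bits 0-23 only. */
static uint32_t eqe_comp_cqn(const uint8_t raw[4])
{
	return be32_to_cpu_sketch(raw) & 0xffffff;
}
```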


References:

  1. Mellanox Adapters Programmer’s Reference Manual (PRM)
  2. RDMA/erdma: Add event queue implementation
  3. ibv_create_cq()