This article summarizes how MMIO emulation works in KVM.

Prerequisite

Introduction to ept misconfig

Overview

In summary, MMIO emulation works as follows:

  1. QEMU declares a memory region but neither allocates RAM for it nor commits it to KVM.
  2. The guest's first access to the MMIO address causes an EPT violation VM exit.
  3. KVM constructs the EPT page table and marks the page table entry with a special value (110b).
  4. Later guest accesses to this MMIO address are handled by the EPT misconfig VM-exit handler.
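The four steps above can be modelled as a tiny state machine. This is only an illustrative sketch: `toy_guest_access`, `enum toy_exit` and the `TOY_*` constants are hypothetical names that do not exist in KVM, and the model ignores the access type (read, write or fetch) entirely.

```c
#include <stdint.h>

/* Toy model only: none of these names exist in KVM or QEMU. */
enum toy_exit { TOY_NO_EXIT, TOY_EPT_VIOLATION, TOY_EPT_MISCONFIG };

#define TOY_RWX_BITS 0x7ull  /* bits 2:0 of an entry: read, write, execute */
#define TOY_WX_ONLY  0x6ull  /* 110b: write + execute without read */

/*
 * Simulate one guest access through a single page table entry and
 * report which VM exit (if any) it would trigger in this model.
 */
static enum toy_exit toy_guest_access(uint64_t *entry)
{
    uint64_t perms = *entry & TOY_RWX_BITS;

    if (perms == 0) {
        /* Steps 2-3: not mapped yet, so the fault handler runs and,
         * finding no backing memory, installs the special 110b value. */
        *entry = TOY_WX_ONLY;
        return TOY_EPT_VIOLATION;
    }
    if (perms == TOY_WX_ONLY) {
        /* Step 4: the special value is an illegal combination. */
        return TOY_EPT_MISCONFIG;
    }
    return TOY_NO_EXIT; /* ordinary mapping, e.g. guest RAM */
}
```

Starting from an empty entry, the first call reports `TOY_EPT_VIOLATION` and every later call reports `TOY_EPT_MISCONFIG`, which mirrors the slow first access followed by the fast MMIO path.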

QEMU part

Take e1000 NIC emulation as an example: when the device sets up its MMIO, the MemoryRegion it registers is of I/O type (not RAM type).

static void
e1000_mmio_setup(E1000State *d)
{
    int i;
    const uint32_t excluded_regs[] = {
        E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
        E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
    };

    // The MMIO region is registered here through memory_region_init_io,
    // so mr->ram = false.
    memory_region_init_io(&d->mmio, OBJECT(d), &e1000_mmio_ops, d,
                          "e1000-mmio", PNPMMIO_SIZE);
    memory_region_add_coalescing(&d->mmio, 0, excluded_regs[0]);
    for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
        memory_region_add_coalescing(&d->mmio, excluded_regs[i] + 4,
                                     excluded_regs[i+1] - excluded_regs[i] - 4);
    memory_region_init_io(&d->io, OBJECT(d), &e1000_io_ops, d, "e1000-io", IOPORT_SIZE);
}

QEMU uses memory_region_init_io to declare an MMIO region. Because mr->ram is false, no memory is actually allocated for it.

When QEMU calls kvm_set_phys_mem to register the guest's physical memory with KVM's data structures, it calls memory_region_is_ram to check whether the physical address range is backed by RAM. If it is not RAM, the function simply returns.

static void kvm_set_phys_mem(KVMMemoryListener *kml,
                             MemoryRegionSection *section, bool add)
{
    ......
    if (!memory_region_is_ram(mr)) {
        if (writeable || !kvm_readonly_mem_allowed) {
            return; // The MR is not RAM but is writable: return without registering it with KVM.
        } else if (!mr->romd_mode) {
            /* If the memory device is not in romd_mode, then we actually want
             * to remove the kvm memory slot so all accesses will trap. */
            add = false;
        }
    }
    ......
}
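The decision in the excerpt above can be isolated into a small self-contained function. This is a simplified sketch, not QEMU code: `struct toy_region`, `enum toy_action` and `toy_set_phys_mem` are hypothetical names, and `writeable` is assumed here to be just the inverse of a `readonly` flag.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for QEMU's MemoryRegion. */
struct toy_region {
    bool ram;        /* backed by host RAM? */
    bool readonly;   /* guest writes are not allowed to reach the backing */
    bool romd_mode;  /* ROM device currently readable like memory */
};

enum toy_action { TOY_SKIP, TOY_REMOVE_SLOT, TOY_ADD_SLOT };

/* Mirrors only the branch structure of the excerpt above. */
static enum toy_action toy_set_phys_mem(const struct toy_region *mr,
                                        bool readonly_mem_allowed)
{
    bool writeable = !mr->readonly; /* assumption: simplified definition */

    if (!mr->ram) {
        if (writeable || !readonly_mem_allowed) {
            return TOY_SKIP;        /* never registered with KVM */
        } else if (!mr->romd_mode) {
            return TOY_REMOVE_SLOT; /* drop the slot so all accesses trap */
        }
    }
    return TOY_ADD_SLOT;
}
```

A region such as the e1000 MMIO region (`ram = false`, `readonly = false`) yields `TOY_SKIP`, so KVM never gets a memory slot covering that guest physical range.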

KVM part

When EPT is enabled, vmx_init calls ept_set_mmio_spte_mask.

static void ept_set_mmio_spte_mask(void)
{
    /*
     * EPT Misconfigurations can be generated if the value of bits 2:0
     * of an EPT paging-structure entry is 110b (write/execute).
     */
    kvm_mmu_set_mmio_spte_mask(VMX_EPT_RWX_MASK,
                               VMX_EPT_MISCONFIG_WX_VALUE, 0);
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
{
    ...
    shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
    ...
}

This is where shadow_mmio_mask is set.

When the guest accesses the MMIO address, the VM exits with an EPT violation and tdp_page_fault is called, which in turn calls __direct_map to construct the EPT page table.

At the end of a long call chain, mark_mmio_spte is called to set the spte using shadow_mmio_mask, which, as shown above, was set up during VMX initialization.
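To see how a mask and a value work together, here is a minimal sketch of marking an entry as MMIO and recognising it again later. All `toy_*` and `TOY_*` names are hypothetical. The use of bit 62 as the software "special" bit and the placement of the gfn at bit 12 are assumptions made for illustration; the real encoding in KVM carries more information and varies between kernel versions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_EPT_RWX_MASK     0x7ull        /* bits 2:0: read, write, execute */
#define TOY_EPT_MISCONFIG_WX 0x6ull        /* 110b: write/execute, no read */
#define TOY_SPECIAL_BIT      (1ull << 62)  /* assumed software-available bit */

static const uint64_t toy_mmio_mask  = TOY_EPT_RWX_MASK | TOY_SPECIAL_BIT;
static const uint64_t toy_mmio_value = TOY_EPT_MISCONFIG_WX | TOY_SPECIAL_BIT;

/* Build an entry that faults as a misconfig and remembers the gfn. */
static uint64_t toy_make_mmio_spte(uint64_t gfn)
{
    return toy_mmio_value | (gfn << 12);
}

/* An entry is MMIO when the masked bits equal the expected value. */
static bool toy_is_mmio_spte(uint64_t spte)
{
    return (spte & toy_mmio_mask) == toy_mmio_value;
}
```

The mask selects which bits are compared and the value says what they must contain, so an ordinary read/write/execute mapping (low bits 111b) is never mistaken for an MMIO entry.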

__direct_map
  mmu_set_spte
    set_spte
      set_mmio_spte
        mark_mmio_spte

mark_mmio_spte is called only when is_noslot_pfn(pfn) returns true.

static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
                          pfn_t pfn, unsigned access)
{
    if (unlikely(is_noslot_pfn(pfn))) {
        mark_mmio_spte(kvm, sptep, gfn, access);
        return true;
    }

    return false;
}

static inline bool is_noslot_pfn(pfn_t pfn)
{
    return pfn == KVM_PFN_NOSLOT;
}

As shown earlier, QEMU does not commit the MMIO memory region to KVM, so the pfn is KVM_PFN_NOSLOT and the spte is marked with shadow_mmio_mask.

When the guest later accesses this MMIO page, its EPT page table entry is 110b, so the access causes an EPT misconfig VM exit: a page cannot be writable and executable yet not readable. The handler handle_ept_misconfig handles the MMIO case first, which eventually dispatches the access to the QEMU side.
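The "no slot" outcome can be illustrated with a toy slot table. This is a sketch under simplifying assumptions: the names are hypothetical, `TOY_PFN_NOSLOT` is a placeholder rather than KVM's actual constant, and real memory slots map guest frames to host virtual addresses before a pfn is resolved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PFN_NOSLOT UINT64_MAX /* placeholder, not KVM's real value */

/* Hypothetical slot: a contiguous run of guest frames backed by memory. */
struct toy_memslot {
    uint64_t base_gfn; /* first guest frame number covered */
    uint64_t npages;   /* number of frames covered */
    uint64_t base_pfn; /* backing frame for base_gfn (toy simplification) */
};

/* Return the backing pfn, or TOY_PFN_NOSLOT if no slot covers the gfn. */
static uint64_t toy_gfn_to_pfn(const struct toy_memslot *slots, size_t n,
                               uint64_t gfn)
{
    for (size_t i = 0; i < n; i++) {
        if (gfn >= slots[i].base_gfn &&
            gfn - slots[i].base_gfn < slots[i].npages) {
            return slots[i].base_pfn + (gfn - slots[i].base_gfn);
        }
    }
    return TOY_PFN_NOSLOT;
}

static bool toy_is_noslot_pfn(uint64_t pfn)
{
    return pfn == TOY_PFN_NOSLOT;
}
```

Because the MMIO region was never registered as a slot, a lookup of its guest frame falls through the loop and returns the sentinel, which is exactly the condition that selects the MMIO marking path.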

vcpu_run
  vcpu_enter_guest
    kvm_x86_ops->run(vcpu) (run the guest!)
    handle_exit_irqoff()
    handle_exit() which is vmx_handle_exit
      handle all the vmexit, fill in the KVM_EXIT reasons
      (kvm_vmx_exit_handlers[exit_reason](vcpu))
        handle_ept_misconfig (just one of many handlers!)
          kvm_mmu_page_fault
            x86_emulate_instruction

x86_emulate_instruction
  x86_emulate_insn
    writeback
      segmented_write
        write_emulated[emulator_write_emulated]
          emulator_read_write
            emulator_read_write_onepage
              ops->read_write_mmio[write_mmio]
                vcpu_mmio_write
                  kvm_io_bus_write
                    __kvm_io_bus_write
                      kvm_iodevice_write
                        ops->write[ioeventfd_write]

Finally, ioeventfd_write is called; it writes to the eventfd to send a notification event to QEMU.
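The notification mechanism itself can be tried from user space with the Linux `eventfd(2)` API. The sketch below only imitates the two ends of the channel: `toy_signal` stands in for the kernel-side signal and `toy_drain` for the reader on the QEMU side. Both helper names are hypothetical, and in KVM the signalling happens inside the kernel rather than through `write(2)`.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side: add 1 to the eventfd counter. Returns 0 on success. */
static int toy_signal(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: read the counter, which also resets it to zero.
 * Returns the number of signals since the last read, or 0 if nothing
 * is pending (a non-blocking read fails in that case).
 */
static uint64_t toy_drain(int efd)
{
    uint64_t count = 0;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0;
    }
    return count;
}
```

With a descriptor created by `eventfd(0, EFD_NONBLOCK)`, two calls to `toy_signal` followed by one `toy_drain` yield 2: notifications are counted rather than queued, so several guest writes can be coalesced into a single wakeup of the reader.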


References:

  1. MMIO Emulation
  2. KVM MMIO implementation
  3. Notes on Virtualization Stack
  4. Overview of ioeventfd creation and triggering in QEMU-KVM