Notes about KVM steal time

This article draws on an existing technical analysis of steal time and walks through the mechanism using the v4.19 kernel.

1. Background

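Suppose the Host's current wall-clock time is HWT1 and the Guest's wall-clock time at that moment is GWT1. If both are in the same time zone, HWT1 and GWT1 are equal.

Now suppose the Host scheduler switches away from the qemu process that runs the Guest. HWT1 keeps advancing; should GWT1 advance too?

If GWT1 does not advance, then when the Guest resumes it continues counting from the old GWT1, and the interval between HWT1 and HWT2 (the Host wall-clock time at which the Guest resumes) is lost. The visible symptom is that time in the Guest runs slow.
If GWT1 does advance, then when the Guest's process is switched back in, Guest time jumps forward by HWT2 - HWT1, and the Guest's wall-clock time is correct. But a new problem appears. Suppose that just before the Host switched out the Guest's qemu process, the Guest had switched away from redis and started running Nginx. When the Guest resumes, its clock has jumped forward by a large amount, so the Guest believes Nginx consumed a large amount of CPU time. If the Linux Guest uses the CFS scheduler, Nginx will then wait a long time before it is scheduled again, even though it never actually got to run!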

2. Motivation

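Steal time was introduced to solve this scheduling problem in the Guest.
The principle is simple: tell the Guest how much time was stolen by the Host, and have the Guest scheduler ignore that time so that it makes correct decisions.
This breaks down into two parts:

The Host notifies the Guest of how much time was stolen.
The Guest processes this time and corrects the scheduling errors caused by the clock jump.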
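
To make the effect concrete, here is a minimal sketch with made-up numbers. It assumes a heavily simplified accounting model in which a task is charged the clock time elapsed since it was switched in; the function names are hypothetical, and this is not how CFS is actually implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: charge a task everything that elapsed on the guest
 * clock since it was switched in, including time the vCPU was descheduled. */
uint64_t charged_runtime_ns(uint64_t switched_in_ns, uint64_t now_ns)
{
	assert(now_ns >= switched_in_ns);
	return now_ns - switched_in_ns;
}

/* Same model, but the interval reported as stolen is ignored. */
uint64_t charged_runtime_without_steal_ns(uint64_t switched_in_ns,
					  uint64_t now_ns,
					  uint64_t stolen_ns)
{
	uint64_t elapsed = charged_runtime_ns(switched_in_ns, now_ns);

	/* Never charge a negative amount if steal exceeds the elapsed time. */
	return elapsed > stolen_ns ? elapsed - stolen_ns : 0;
}
```

If Nginx is switched in at 1 ms, the vCPU is then descheduled for 50 ms, and Nginx finally runs for 2 ms, the first function charges 52 ms while the second charges only the 2 ms that Nginx really ran.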

3. Identify steal time in Guest

Steal time is the eighth value on the cpu lines of /proc/stat (top gets its data from /proc/stat).

// https://elixir.bootlin.com/linux/v4.19/source/fs/proc/stat.c
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
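
From user space, the same counter can be read by parsing the first line of /proc/stat. The following is a minimal sketch; the function names are made up for illustration, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal), with values expressed in USER_HZ ticks.

```c
#include <assert.h>
#include <stdio.h>

/* Extract the steal field (8th value) from an aggregate "cpu" line.
 * Returns 0 on success, -1 if the line has fewer than eight values. */
int parse_steal_ticks(const char *line, unsigned long long *steal)
{
	unsigned long long user, nice, system, idle, iowait, irq, softirq;

	assert(line != NULL && steal != NULL);
	if (sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &user, &nice, &system, &idle, &iowait, &irq, &softirq,
		   steal) != 8)
		return -1;
	return 0;
}

/* Read the first line of /proc/stat and extract the steal field. */
int read_steal_ticks(unsigned long long *steal)
{
	char line[512];
	FILE *fp = fopen("/proc/stat", "r");
	int ret = -1;

	if (!fp)
		return -1;
	if (fgets(line, sizeof(line), fp))
		ret = parse_steal_ticks(line, steal);
	fclose(fp);
	return ret;
}
```

Sampling read_steal_ticks() twice and taking the difference gives the steal accumulated over the interval, which is roughly what tools such as top turn into a percentage.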

4. Guest registers steal time

static void kvm_register_steal_time(void)
{
	int cpu = smp_processor_id();
	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

	if (!has_steal_clock)
		return;

	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
	pr_info("kvm-stealtime: cpu %d, msr %llx\n",
		cpu, (unsigned long long) slow_virt_to_phys(st));
}

The guest writes the MSR MSR_KVM_STEAL_TIME to tell the Host the physical address (Guest Physical Address) of its per-CPU variable steal_time.
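
The value written to the MSR packs two things into one 64-bit word: the address of the structure and an enable flag in bit 0. The sketch below illustrates that packing under the assumption of a 64-byte-aligned address, as the MSR description requires; the macro and function names are hypothetical and are not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

#define STEAL_MSR_ENABLE_BIT 0x1ULL  /* illustrative name: enable flag in bit 0 */
#define STEAL_AREA_ALIGN     64ULL   /* alignment required for the structure */

/* Build the MSR value from a 64-byte-aligned guest physical address. */
uint64_t steal_msr_encode(uint64_t gpa)
{
	/* The low bits must be free so that bit 0 can carry the enable flag. */
	assert((gpa & (STEAL_AREA_ALIGN - 1)) == 0);
	return gpa | STEAL_MSR_ENABLE_BIT;
}

/* Nonzero if the enable bit is set in the MSR value. */
int steal_msr_enabled(uint64_t msr)
{
	return (msr & STEAL_MSR_ENABLE_BIT) != 0;
}

/* Recover the structure address by masking off the low bits. */
uint64_t steal_msr_gpa(uint64_t msr)
{
	return msr & ~(STEAL_AREA_ALIGN - 1);
}
```

Because the address is aligned, its low six bits are always zero, so the enable flag can share the word without losing any address information.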

Description of the MSR_KVM_STEAL_TIME MSR:

MSR_KVM_STEAL_TIME: 0x4b564d03

	data: 64-byte alignment physical address of a memory area which must be
	in guest RAM, plus an enable bit in bit 0. This memory is expected to
	hold a copy of the following structure:

	struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u8  preempted;
		__u8  u8_pad[3];
		__u32 pad[11];
	}

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	anything with bit0 == 0 is written to it. Guest is required to make sure
	this structure is initialized to zero.

	Fields have the following meanings:

		version: a sequence counter. In other words, guest has to check
		this field before and after grabbing time information and make
		sure they are both equal and even. An odd version indicates an
		in-progress update.

		flags: At this point, always zero. May be used to indicate
		changes in this structure in the future.

		steal: the amount of time in which this vCPU did not run, in
		nanoseconds. Time during which the vcpu is idle, will not be
		reported as steal time.

		preempted: indicate the vCPU who owns this struct is running or
		not. Non-zero values mean the vCPU has been preempted. Zero
		means the vCPU is not preempted. NOTE, it is always zero if the
		hypervisor doesn't support this field.

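The version field plays a role similar to the RTC's "Update in progress" flag: it prevents the guest from reading an intermediate state.
When the guest grabs time information, it should follow logic like this: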

do {
	version1 = version read before grabbing time information
	grab time information
	version2 = version read after grabbing time information
} while (!(version1 == version2 && version1 is even));
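
The retry loop above can be written out as compilable C. This is a user-space sketch only: struct steal_area is a hypothetical stand-in for the two fields of kvm_steal_time that matter here, and C11 fences stand in for the barriers a real guest kernel would use. The guest's kvm_steal_clock() in arch/x86/kernel/kvm.c follows the same read, fence, re-check pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of struct kvm_steal_time. */
struct steal_area {
	uint64_t steal;
	uint32_t version;
};

/* Return a consistent snapshot of steal, retrying while the writer is in
 * the middle of an update (odd version) or finished one in between. */
uint64_t read_steal(const volatile struct steal_area *st)
{
	uint32_t version;
	uint64_t steal;

	assert(st);
	do {
		version = st->version;
		atomic_thread_fence(memory_order_seq_cst);
		steal = st->steal;
		atomic_thread_fence(memory_order_seq_cst);
	} while ((version & 1) || version != st->version);

	return steal;
}
```

The loop exits only when the version was even before the read and unchanged after it, which is exactly the condition stated in the MSR description.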

5. Host calculates steal time

// https://elixir.bootlin.com/linux/v4.19/source/include/linux/sched.h#L290
struct sched_info {
        ...
	/* Time spent waiting on a runqueue: */
	unsigned long long		run_delay;
        ...
};

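Note run_delay: as the comment says, it is the time the task spent waiting on a runqueue, that is, time during which it was runnable but not executing (in the example, the time during which the Guest's qemu process was switched out).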

// https://elixir.bootlin.com/linux/v4.19/source/arch/x86/kvm/x86.c#L2292
static void record_steal_time(struct kvm_vcpu *vcpu)
{
	...
	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
		vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}

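The Host uses run_delay to compute the Guest's steal time and passes it to the Guest through kvm_write_guest_cached, writing directly to the address the Guest registered earlier.
That way, when the Guest resumes, it knows exactly how much time was stolen.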
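
The accounting in record_steal_time() boils down to adding the growth of run_delay since the previous update. The following user-space sketch models only that arithmetic and the version bumps; struct host_steal_state and account_steal() are hypothetical names, and the writes to guest memory and the memory barriers are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the host-side state: the values published to the
 * guest (steal, version) plus the run_delay seen at the last update. */
struct host_steal_state {
	uint64_t steal;
	uint32_t version;
	uint64_t last_steal;
};

/* Fold the run_delay accumulated since the previous call into steal. */
void account_steal(struct host_steal_state *s, uint64_t run_delay)
{
	assert(s && run_delay >= s->last_steal);

	s->version += 1;                        /* odd: update in progress */
	s->steal += run_delay - s->last_steal;  /* add only the new waiting time */
	s->last_steal = run_delay;
	s->version += 1;                        /* even: consistent again */
}
```

Since only the difference is added each time, calling account_steal() twice with the same run_delay leaves steal unchanged, and the version is even again after every complete update.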

6. Guest scheduler handles steal time

// https://elixir.bootlin.com/linux/v4.19/source/kernel/sched/core.c#L132

/*
 * RQ-clock updating methods:
 */
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	...
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	if (static_key_false((&paravirt_steal_rq_enabled))) {
		steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;

		if (unlikely(steal > delta))
			steal = delta;

		rq->prev_steal_time_rq += steal;
		delta -= steal;
	}
#endif

	rq->clock_task += delta;
	...
}

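clock_task tracks the time available for tasks on the runqueue to execute, and it is updated by update_rq_clock_task(). Because the guest is aware of steal time, that time is subtracted before clock_task advances and is not counted toward the runqueue's run time. Task scheduling in a virtualized guest therefore behaves normally, without the scheduling errors caused by clock jumps.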
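
The clamping logic is easier to follow with numbers. Below is a user-space sketch of just the steal-handling part of update_rq_clock_task(); struct rq_model and update_clock_task() are hypothetical names, and the cumulative steal value is passed in as a parameter instead of being read through paravirt_steal_clock().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two runqueue fields involved. */
struct rq_model {
	uint64_t clock_task;
	uint64_t prev_steal_time_rq;
};

/* Advance clock_task by delta minus the steal not yet accounted for.
 * steal_clock is the cumulative steal time reported by the host. */
void update_clock_task(struct rq_model *rq, int64_t delta, uint64_t steal_clock)
{
	int64_t steal;

	assert(rq && delta >= 0);
	steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

	/* Never subtract more than elapsed; the remainder carries over. */
	if (steal > delta)
		steal = delta;

	rq->prev_steal_time_rq += (uint64_t)steal;
	delta -= steal;
	rq->clock_task += (uint64_t)delta;
}
```

With delta = 52 ms and 50 ms of reported steal, clock_task advances by only 2 ms. If the reported steal exceeds delta, the excess stays unaccounted in prev_steal_time_rq and is subtracted on a later update.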

