This article draws on the write-up "steal time技术分析" (a technical analysis of steal time) and walks through steal time using the v4.19 kernel source.

1. Background

Suppose the wall-clock time on the Host is HWT1 and the wall-clock time in the Guest is GWT1. If both are in the same time zone, HWT1 equals GWT1.

Now suppose the Host scheduler preempts the qemu process that runs the Guest. HWT1 keeps advancing; should GWT1 advance as well?

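  • If GWT1 does not advance, the Guest resumes counting from the old GWT1 once it runs again, so the interval between HWT1 and HWT2 (the Host time at which the Guest is scheduled back in) is lost. The symptom is that the Guest clock runs slow.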

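  • If GWT1 advances along with the Host, then when the Guest's process is switched back in, Guest time jumps forward by HWT2 - HWT1 at once, and the Guest wall clock is correct. But this creates a new problem. Suppose that just before the Host preempted the Guest's qemu process, the Guest had switched away from redis and started running Nginx. When the Guest resumes, its clock has jumped far ahead, so it concludes that Nginx consumed a large amount of CPU time. If the Linux Guest uses the CFS scheduler, Nginx will then wait a long time before it is scheduled again, even though it never actually ran!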
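
The two options above can be put into numbers. The following is only a sketch under simple assumptions: all times are in nanoseconds, and the helper names (gwt_frozen, gwt_following, naive_charge) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Option 1: the Guest clock is frozen while its vCPU is off the CPU.
 * On resume the Guest still sees gwt1, lagging the Host by hwt2 - hwt1. */
static uint64_t gwt_frozen(uint64_t gwt1, uint64_t hwt1, uint64_t hwt2)
{
    assert(hwt2 >= hwt1);
    return gwt1;
}

/* Option 2: the Guest clock follows the Host.
 * On resume the Guest clock jumps forward by hwt2 - hwt1. */
static uint64_t gwt_following(uint64_t gwt1, uint64_t hwt1, uint64_t hwt2)
{
    assert(hwt2 >= hwt1);
    return gwt1 + (hwt2 - hwt1);
}

/* A scheduler that knows nothing about steal time charges the whole
 * clock difference to the task that was current across the gap. */
static uint64_t naive_charge(uint64_t clock_before, uint64_t clock_after)
{
    assert(clock_after >= clock_before);
    return clock_after - clock_before;
}
```

With gwt1 = hwt1 = 1000 and hwt2 = 5000, option 1 leaves the Guest clock at 1000 (4000 ns behind), while option 2 brings it to 5000 and charges Nginx 4000 ns that it never ran.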

2. Motivation

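Steal time was introduced to solve this Guest scheduling problem.
The idea is to tell the Guest how much time the Host stole, so that the Guest scheduler can ignore that time and schedule correctly.
The mechanism therefore has two parts: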

  1. The Host tells the Guest how much steal time has accumulated
  2. The Guest accounts for that time and corrects the scheduling errors caused by the clock jump

3. Identify steal time in Guest


Steal time is the eighth value on each cpu line of /proc/stat (top reads its data from /proc/stat).

// https://elixir.bootlin.com/linux/v4.19/source/fs/proc/stat.c
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
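
As a minimal sketch of how a userspace tool could pick out that eighth value, the helper below parses the aggregate "cpu" line of /proc/stat. The function name parse_steal_ticks is hypothetical; the field order (user, nice, system, idle, iowait, irq, softirq, steal) and the USER_HZ tick unit follow the /proc/stat format.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the steal field (8th numeric value) from the aggregate "cpu"
 * line of /proc/stat. The value is in USER_HZ ticks.
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_steal_ticks(const char *line, unsigned long long *steal)
{
    unsigned long long user, nice, system, idle, iowait, irq, softirq;

    assert(line != NULL && steal != NULL);
    if (sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &system, &idle, &iowait, &irq, &softirq,
               steal) != 8)
        return -1;
    return 0;
}
```

In practice the input would be the first line of /proc/stat, read with fopen() and fgets().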

4. Guest register steal time

static void kvm_register_steal_time(void)
{
	int cpu = smp_processor_id();
	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

	if (!has_steal_clock)
		return;

	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
	pr_info("kvm-stealtime: cpu %d, msr %llx\n",
		cpu, (unsigned long long) slow_virt_to_phys(st));
}

The guest writes the MSR MSR_KVM_STEAL_TIME to tell the Host the physical address (Guest Physical Address) of the per-CPU variable steal_time.

The kernel documentation (Documentation/virtual/kvm/msr.txt) describes the MSR_KVM_STEAL_TIME MSR as follows:

MSR_KVM_STEAL_TIME: 0x4b564d03

data: 64-byte alignment physical address of a memory area which must be
in guest RAM, plus an enable bit in bit 0. This memory is expected to
hold a copy of the following structure:

	struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u8  preempted;
		__u8  u8_pad[3];
		__u32 pad[11];
	}

whose data will be filled in by the hypervisor periodically. Only one
write, or registration, is needed for each VCPU. The interval between
updates of this structure is arbitrary and implementation-dependent.
The hypervisor may update this structure at any time it sees fit until
anything with bit0 == 0 is written to it. Guest is required to make sure
this structure is initialized to zero.

Fields have the following meanings:

version: a sequence counter. In other words, guest has to check
this field before and after grabbing time information and make
sure they are both equal and even. An odd version indicates an
in-progress update.

flags: At this point, always zero. May be used to indicate
changes in this structure in the future.

steal: the amount of time in which this vCPU did not run, in
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.

preempted: indicate the vCPU who owns this struct is running or
not. Non-zero values mean the vCPU has been preempted. Zero
means the vCPU is not preempted. NOTE, it is always zero if the
hypervisor doesn't support this field.
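
The layout and the MSR encoding described above can be checked with a small userspace sketch. The struct mirrors the documented fields using stdint types; steal_time_msr_value is a hypothetical helper (not a kernel function) that combines a 64-byte-aligned guest physical address with the enable bit in bit 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KVM_MSR_ENABLED 1ULL    /* enable bit: bit 0 of the MSR value */

/* Userspace mirror of the documented structure. */
struct kvm_steal_time {
    uint64_t steal;
    uint32_t version;
    uint32_t flags;
    uint8_t  preempted;
    uint8_t  u8_pad[3];
    uint32_t pad[11];
};

/* 8 + 4 + 4 + 1 + 3 + 44 bytes: exactly one 64-byte block. */
_Static_assert(sizeof(struct kvm_steal_time) == 64,
               "kvm_steal_time must be 64 bytes");
_Static_assert(offsetof(struct kvm_steal_time, version) == 8,
               "version follows the 64-bit steal counter");

/* Hypothetical helper: build the value to write to MSR_KVM_STEAL_TIME. */
static uint64_t steal_time_msr_value(uint64_t gpa)
{
    assert((gpa & 63) == 0);    /* address must be 64-byte aligned */
    return gpa | KVM_MSR_ENABLED;
}
```

Because the address is 64-byte aligned, its low six bits are free, which is why bit 0 can carry the enable flag without losing address information.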

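version plays a role similar to the RTC "Update in progress" flag: it keeps the reader from observing an intermediate state.
When the guest grabs the time information, it should follow this logic: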

do {
	version1 = version read before grabbing time information
	grab time information
	version2 = version read after grabbing time information
} while (!(version1 == version2 && version1 is even));

5. Host calculate steal time

// https://elixir.bootlin.com/linux/v4.19/source/include/linux/sched.h#L290
struct sched_info {
	...
	/* Time spent waiting on a runqueue: */
	unsigned long long run_delay;
	...
};

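Note run_delay: as the comment says, it is the time a task spends waiting on a runqueue, that is, time during which it was runnable but not running (in our example, the time during which the Guest's qemu process was switched out).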
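
The same counter can usually be observed from userspace. As a sketch, assuming a kernel built with scheduler statistics so that /proc/<pid>/schedstat is present, that file holds three numbers: time spent on the CPU, time spent waiting on a runqueue (run_delay), and the number of timeslices run. The helper name parse_run_delay_ns is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Extract the second field (run-queue wait time, in nanoseconds) from
 * the contents of /proc/<pid>/schedstat.
 * Returns 0 on success, -1 if the text does not hold three numbers.
 */
static int parse_run_delay_ns(const char *text, unsigned long long *run_delay)
{
    unsigned long long on_cpu, timeslices;

    assert(text != NULL && run_delay != NULL);
    if (sscanf(text, "%llu %llu %llu", &on_cpu, run_delay, &timeslices) != 3)
        return -1;
    return 0;
}
```

Sampling this value twice for a vCPU thread on the Host and taking the difference gives the wait time that thread accumulated in between.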

// https://elixir.bootlin.com/linux/v4.19/source/arch/x86/kvm/x86.c#L2292
static void record_steal_time(struct kvm_vcpu *vcpu)
{
	...
	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
		vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}

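On the Host side, steal time is derived from run_delay and published to the Guest through kvm_write_guest_cached, which writes directly into the memory area the Guest registered earlier.
This way, when the Guest resumes, it knows exactly how much time was stolen.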
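
The accounting step can be modeled in isolation. The sketch below keeps only the arithmetic and the two version bumps; struct host_steal_state and host_record_steal are hypothetical names, and the guest-memory writes and memory barriers of the real function are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the Host-side state for one vCPU. */
struct host_steal_state {
    uint64_t steal;        /* cumulative steal time published to the Guest */
    uint32_t version;      /* odd while an update is in progress */
    uint64_t last_steal;   /* run_delay snapshot taken at the last update */
};

/* Fold the run_delay accumulated since the last update into steal. */
static void host_record_steal(struct host_steal_state *st, uint64_t run_delay)
{
    assert(run_delay >= st->last_steal);    /* run_delay only grows */

    st->version += 1;                       /* odd: update begins */
    st->steal += run_delay - st->last_steal;
    st->last_steal = run_delay;
    st->version += 1;                       /* even: update complete */
}
```

Only the difference between successive run_delay samples is added, so wait time accumulated before the first snapshot is not reported as steal time.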

6. Guest scheduler handles steal time

// https://elixir.bootlin.com/linux/v4.19/source/kernel/sched/core.c#L132

/*
 * RQ-clock updating methods:
 */
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	...
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	if (static_key_false((&paravirt_steal_rq_enabled))) {
		steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;

		if (unlikely(steal > delta))
			steal = delta;

		rq->prev_steal_time_rq += steal;
		delta -= steal;
	}
#endif

	rq->clock_task += delta;
	...
}

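clock_task records the time that tasks on the runqueue actually spend executing, and it is updated by update_rq_clock_task(). Steal time is visible to the guest, and it is subtracted from delta before clock_task advances, so it is never charged to the runqueue's task clock. As a result, task scheduling inside a virtualized guest behaves normally, without the scheduling errors caused by the clock jump.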
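
The clamping logic is easier to follow with concrete numbers. The following is a userspace model of the steal handling only, under the assumption that nothing else adjusts delta; struct rq_model and model_update_clock_task are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the two runqueue fields involved. */
struct rq_model {
    uint64_t clock_task;           /* time charged to tasks */
    uint64_t prev_steal_time_rq;   /* steal time already accounted for */
};

/*
 * delta:       nanoseconds elapsed on the runqueue clock since the last update
 * steal_clock: cumulative steal time currently reported by the Host
 */
static void model_update_clock_task(struct rq_model *rq, int64_t delta,
                                    uint64_t steal_clock)
{
    int64_t steal = (int64_t)(steal_clock - rq->prev_steal_time_rq);

    assert(delta >= 0);
    if (steal > delta)      /* never subtract more than actually elapsed */
        steal = delta;

    rq->prev_steal_time_rq += (uint64_t)steal;
    delta -= steal;
    rq->clock_task += (uint64_t)delta;
}
```

If 5000 ns elapse and 4000 ns of them were stolen, clock_task advances by only 1000 ns. When the reported steal exceeds delta, the excess stays unaccounted in prev_steal_time_rq and is subtracted on later updates.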


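References:

  1. steal time技术分析 (Technical analysis of steal time)
  2. Steal time for KVM
  3. KVM下STEAL_TIME源代码分析 (Source code analysis of STEAL_TIME under KVM)
  4. kvm steal 溯源 (Tracing the origins of KVM steal)
  5. What is CPU steal time?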