This post collects some notes on paravirtualized ticket spinlocks in KVM.
The kernel version referenced is v5.0. Most of the content is adapted from kvm performance optimization technologies, part two.

In KVM, paravirtualized ticket spinlocks are also referred to as PV unhalt.

Background

Now, suppose that vCPU A grabs a spinlock, and before it finishes, is preempted by the scheduler. And then suppose vCPU B tries to grab the spinlock. Now B, instead of spinning for the short period of time that A needed the spinlock for, will be spinning until vCPU A is scheduled again, which may be anywhere from several milliseconds to hundreds of milliseconds, depending on how busy the system is. B is now using the cpu but accomplishing nothing. It is burning up its VM's share of CPU time, and keeping other VMs with useful work to do from running. Worse yet, it may even be that the reason A is not running is that the hypervisor's scheduler is trying to give priority to B; so B is actively keeping A from finishing the work that it needed to do with the spinlock.

The situation gets even worse with a more advanced form of spinlock called a ticketlock. Ticketlocks have a lot of advantages for large systems, including reduced cacheline bouncing and more predictable wait time. (See this LWN article for a complete description.) The key attribute for this discussion is that ticketlocks essentially make a first-come first-served queue: if A has the lock, then B tries to grab it, and then C, B is guaranteed to get the lock before C. So now, if C is spinning waiting for the lock, it must wait for both A and B to finish before it can get the lock.
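
For illustration only (this is not the kernel's implementation, and modern x86 kernels have since moved from ticketlocks to queued spinlocks), a minimal ticket lock looks roughly like the sketch below; the strict FIFO hand-off is exactly what turns one preempted waiter into a convoy.

#include <stdatomic.h>

struct ticketlock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently being served */
};

void ticket_lock(struct ticketlock *lk)
{
    /* Take a ticket (the old value of next). */
    unsigned int me = atomic_fetch_add(&lk->next, 1);

    /* Spin until it is our turn: strict first-come, first-served. */
    while (atomic_load(&lk->owner) != me)
        ;
}

void ticket_unlock(struct ticketlock *lk)
{
    /* Hand the lock to the next ticket holder in line. */
    atomic_fetch_add(&lk->owner, 1);
}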

The result is that on a moderately loaded system, the vast majority of the cpu cycles are actually spent waiting for ticketlocks rather than doing any useful work. This is called a “ticketlock storm”.

Overview

The PV unhalt feature installs paravirtualized lock operations (pv_ops.lock) that override the native spinlock functions so that locking behaves better under virtualization.

Although the concrete implementation of a pv spinlock depends on the underlying lock type (ticketlock or queued spinlock), the basic idea is the same: instead of spinning while it cannot get the lock, the waiting vCPU executes the halt instruction so that other vCPUs can be scheduled, and the vCPU that releases the lock is allowed to kick the waiting vCPU out of the halted state.

When the time a vCPU has spent spinning on a lock exceeds a threshold, the vCPU gives up and halts; the HLT traps to KVM, which blocks the waiting vCPU. Once the lock becomes available, the vCPU that releases it issues the KVM_HC_KICK_CPU hypercall, and KVM wakes the halted waiter.
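
As a rough user-space analogue of that idea (this is not the kernel code; pv_wait() and pv_kick() here are stand-ins for the callbacks discussed in the rest of this post, and the yield stands in for HLT):

#include <stdatomic.h>
#include <sched.h>

#define SPIN_THRESHOLD (1 << 15)

static atomic_char locked;          /* 0 = free, 1 = held */

static void pv_wait(atomic_char *ptr, char val)
{
    /* Guest kernel: HLT until kicked (kvm_wait). Here: just yield the CPU. */
    if (atomic_load(ptr) == val)
        sched_yield();
}

static void pv_kick(int cpu)
{
    /* Guest kernel: KVM_HC_KICK_CPU hypercall (kvm_kick_cpu). No-op here. */
    (void)cpu;
}

void pv_spin_lock(void)
{
    for (;;) {
        /* Spin for a bounded number of iterations first ... */
        for (int i = 0; i < SPIN_THRESHOLD; i++) {
            char expected = 0;
            if (atomic_compare_exchange_weak(&locked, &expected, 1))
                return;
        }
        /* ... then stop burning CPU until the lock holder kicks us. */
        pv_wait(&locked, 1);
    }
}

void pv_spin_unlock(void)
{
    atomic_store(&locked, 0);
    pv_kick(0);                     /* waiter tracking omitted for brevity */
}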

Initialization

kvm side

KVM exposes KVM_FEATURE_PV_UNHALT to the guest through the KVM_CPUID_FEATURES CPUID leaf.

static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
                                 u32 index, int *nent, int maxnent)
{
    ...
    case KVM_CPUID_FEATURES:
        entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
                     ...
                     (1 << KVM_FEATURE_PV_UNHALT) |
                     ...
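
The guest kernel checks this bit through kvm_para_has_feature() (see kvm_spinlock_init below); purely as an illustration, the same bit can also be read directly from CPUID inside the guest. This is only a sketch, assuming the default KVM CPUID base of 0x40000000 (so KVM_CPUID_FEATURES is leaf 0x40000001) with KVM_FEATURE_PV_UNHALT at bit 7 of EAX:

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

#define KVM_CPUID_SIGNATURE   0x40000000
#define KVM_CPUID_FEATURES    0x40000001
#define KVM_FEATURE_PV_UNHALT 7

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13] = {0};

    /* Leaf 0x40000000 returns the "KVMKVMKVM" signature in EBX/ECX/EDX. */
    __cpuid(KVM_CPUID_SIGNATURE, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    if (strcmp(sig, "KVMKVMKVM") != 0) {
        printf("not running on KVM\n");
        return 1;
    }

    __cpuid(KVM_CPUID_FEATURES, eax, ebx, ecx, edx);
    printf("KVM_FEATURE_PV_UNHALT: %s\n",
           (eax & (1u << KVM_FEATURE_PV_UNHALT)) ? "yes" : "no");
    return 0;
}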

guest side

When the guest starts up, kvm_spinlock_init initializes the pv spinlock.

void __init kvm_spinlock_init(void)
{
    if (!kvm_para_available())
        return;
    /* Does host kernel support KVM_FEATURE_PV_UNHALT? */
    if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
        return;

    if (kvm_para_has_hint(KVM_HINTS_REALTIME))
        return;

    /* Don't use the pvqspinlock code if there is only 1 vCPU. */
    if (num_possible_cpus() == 1)
        return;

    __pv_init_lock_hash();
    pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
    pv_ops.lock.queued_spin_unlock =
        PV_CALLEE_SAVE(__pv_queued_spin_unlock);
    pv_ops.lock.wait = kvm_wait;
    pv_ops.lock.kick = kvm_kick_cpu;

    if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
        pv_ops.lock.vcpu_is_preempted =
            PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
    }
}

The most interesting callbacks here are kvm_wait and kvm_kick_cpu, which are described in detail below; in the queued-spinlock slow path they are invoked through the pv_wait and pv_kick helpers (kernel/locking/qspinlock_paravirt.h).

halt when vCPU is blocked by a held lock

guest side

When the vCPU cannot get the spinlock, it calls the wait callback.

static __always_inline void pv_wait(u8 *ptr, u8 val)
{
    PVOP_VCALL2(lock.wait, ptr, val);
}

Then it will execute the halt instruction in kvm_wait.

static void kvm_wait(u8 *ptr, u8 val)
{
    unsigned long flags;

    if (in_nmi())
        return;

    local_irq_save(flags);

    if (READ_ONCE(*ptr) != val)
        goto out;

    /*
     * halt until it's our turn and kicked. Note that we do safe halt
     * for irq enabled case to avoid hang when lock info is overwritten
     * in irq spinlock slowpath and no spurious interrupt occur to save us.
     */
    if (arch_irqs_disabled_flags(flags))
        halt();
    else
        safe_halt();

out:
    local_irq_restore(flags);
}
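
For reference, the two flavours of halting used above are essentially the following x86 instruction sequences (a sketch of the native variants, without the kernel's paravirt indirection). safe_halt() sets the interrupt flag right before halting; the STI interrupt shadow delays interrupt delivery until after the next instruction, so "enable IRQs" and "halt" behave as one atomic step and the wake-up cannot slip in between them.

static inline void native_halt(void)
{
    /* IRQs stay disabled. In a guest this still traps to KVM, and the
     * PV kick wakes the vCPU by marking it runnable (pv_unhalted) rather
     * than by injecting an interrupt. */
    asm volatile("hlt" : : : "memory");
}

static inline void native_safe_halt(void)
{
    /* sti;hlt: enable interrupts and halt with no window in between. */
    asm volatile("sti; hlt" : : : "memory");
}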

kvm side

handle_halt
└── kvm_emulate_halt
    └── kvm_vcpu_halt

When the guest executes the halt instruction, the resulting VM exit is handled via kvm_emulate_halt->kvm_vcpu_halt, which sets vcpu->arch.mp_state to KVM_MP_STATE_HALTED. vcpu_block is then called to block this vCPU.
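
For reference, kvm_vcpu_halt (arch/x86/kvm/x86.c, abridged) just records the halt exit and marks the vCPU halted when the LAPIC is emulated in the kernel; otherwise the HLT has to be completed by userspace:

int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
{
    ++vcpu->stat.halt_exits;
    if (lapic_in_kernel(vcpu)) {
        vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
        return 1;
    } else {
        vcpu->run->exit_reason = KVM_EXIT_HLT;
        return 0;
    }
}

Back in vcpu_run, the vCPU is no longer considered running, so the loop switches from vcpu_enter_guest to vcpu_block: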

static int vcpu_run(struct kvm_vcpu *vcpu)
{
    int r;
    struct kvm *kvm = vcpu->kvm;

    vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
    vcpu->arch.l1tf_flush_l1d = true;

    for (;;) {
        if (kvm_vcpu_running(vcpu)) {
            r = vcpu_enter_guest(vcpu);
        } else {
            r = vcpu_block(kvm, vcpu);
        }
        ...

static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
    if (!kvm_arch_vcpu_runnable(vcpu) &&
        (!kvm_x86_ops->pre_block || kvm_x86_ops->pre_block(vcpu) == 0)) {
        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
        kvm_vcpu_block(vcpu);
        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);

        if (kvm_x86_ops->post_block)
            kvm_x86_ops->post_block(vcpu);

        if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
            return 1;
    }

    kvm_apic_accept_events(vcpu);
    switch (vcpu->arch.mp_state) {
    case KVM_MP_STATE_HALTED:
        vcpu->arch.pv.pv_unhalted = false;
        vcpu->arch.mp_state =
            KVM_MP_STATE_RUNNABLE;
        /* fall through */
    case KVM_MP_STATE_RUNNABLE:
        vcpu->arch.apf.halted = false;
        break;
    case KVM_MP_STATE_INIT_RECEIVED:
        break;
    default:
        return -EINTR;
        break;
    }
    return 1;
}
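
The kvm_vcpu_running check in vcpu_run above only looks at mp_state, which is why the halted vCPU is routed into vcpu_block. kvm_arch_vcpu_runnable additionally considers pending events, including vcpu->arch.pv.pv_unhalted, which is exactly the flag the kick path below sets (shown roughly, from v5.0 arch/x86/kvm/x86.c):

static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
    return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
            !vcpu->arch.apf.halted);
}

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
    /* kvm_vcpu_has_events() returns true when, among other things,
     * vcpu->arch.pv.pv_unhalted has been set by a kick. */
    return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
}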

kick the halted vCPU when the lock becomes available

guest side

When the lock becomes available to the halted vCPU, a running vCPU (typically the one releasing the lock) invokes the kick callback through pv_kick. This lands in kvm_kick_cpu, which issues the KVM_HC_KICK_CPU hypercall.

static __always_inline void pv_kick(int cpu)
{
    PVOP_VCALL1(lock.kick, cpu);
}

/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
static void kvm_kick_cpu(int cpu)
{
    int apicid;
    unsigned long flags = 0;

    apicid = per_cpu(x86_cpu_to_apicid, cpu);
    kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}
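
kvm_hypercall2 (arch/x86/include/asm/kvm_para.h, roughly) places the hypercall number in RAX and the two arguments in RBX/RCX, then executes KVM_HYPERCALL, i.e. vmcall (patched to vmmcall on AMD). The resulting VM exit is what lands in kvm_emulate_hypercall on the host side:

static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
                                  unsigned long p2)
{
    long ret;

    /* KVM_HYPERCALL expands to vmcall, patched to vmmcall on AMD CPUs. */
    asm volatile(KVM_HYPERCALL
                 : "=a"(ret)
                 : "a"(nr), "b"(p1), "c"(p2)
                 : "memory");
    return ret;
}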

kvm side

When the guest triggers the KVM_HC_KICK_CPU hypercall, kvm_pv_kick_cpu_op is called.

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
    ...
    case KVM_HC_KICK_CPU:
        kvm_pv_kick_cpu_op(vcpu->kvm, a0, a1);
        ret = 0;
        break;
    ...
}

kvm_pv_kick_cpu_op delivers an APIC_DM_REMRD interrupt to the target vCPU's local APIC.

/*
 * kvm_pv_kick_cpu_op: Kick a vcpu.
 *
 * @apicid - apicid of vcpu to be kicked.
 */
static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned long flags, int apicid)
{
    struct kvm_lapic_irq lapic_irq;

    lapic_irq.shorthand = 0;
    lapic_irq.dest_mode = 0;
    lapic_irq.level = 0;
    lapic_irq.dest_id = apicid;
    lapic_irq.msi_redir_hint = false;

    lapic_irq.delivery_mode = APIC_DM_REMRD;
    kvm_irq_delivery_to_apic(kvm, NULL, &lapic_irq, NULL);
}

kvm_irq_delivery_to_apic
└── kvm_apic_set_irq
    └── __apic_accept_irq

In __apic_accept_irq, the APIC_DM_REMRD case then marks the target vCPU as pv_unhalted and kicks it out of the blocked state.

/*
 * Add a pending IRQ into lapic.
 * Return 1 if successfully added and 0 if discarded.
 */
static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
                             int vector, int level, int trig_mode,
                             struct dest_map *dest_map)
{
    ...
    case APIC_DM_REMRD:
        result = 1;
        vcpu->arch.pv.pv_unhalted = 1;
        kvm_make_request(KVM_REQ_EVENT, vcpu);
        kvm_vcpu_kick(vcpu);
        break;
    ...
}

For the halted vCPU, kvm_vcpu_block then returns, vcpu_block sets vcpu->arch.mp_state back to KVM_MP_STATE_RUNNABLE, and the vCPU re-enters non-root mode to grab the spinlock.
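
kvm_vcpu_block wakes up because the vCPU is now considered runnable again: kvm_vcpu_check_block (virt/kvm/kvm_main.c, abridged) sees kvm_arch_vcpu_runnable return true, since vcpu->arch.pv.pv_unhalted is set, and raises KVM_REQ_UNHALT before breaking out of the wait loop:

static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
{
    if (kvm_arch_vcpu_runnable(vcpu)) {
        kvm_make_request(KVM_REQ_UNHALT, vcpu);
        return -EINTR;
    }
    ...
}

Back in vcpu_block, KVM_REQ_UNHALT is therefore pending: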

static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
    if (!kvm_arch_vcpu_runnable(vcpu) &&
        (!kvm_x86_ops->pre_block || kvm_x86_ops->pre_block(vcpu) == 0)) {
        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
        kvm_vcpu_block(vcpu);
        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);

        if (kvm_x86_ops->post_block)
            kvm_x86_ops->post_block(vcpu);

        // At this point kvm_check_request(KVM_REQ_UNHALT, vcpu) is true:
        // __apic_accept_irq set vcpu->arch.pv.pv_unhalted, so
        // kvm_arch_vcpu_runnable() returned true and kvm_vcpu_check_block()
        // raised KVM_REQ_UNHALT before kvm_vcpu_block() returned.
        if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
            return 1;
    }

    kvm_apic_accept_events(vcpu);
    switch (vcpu->arch.mp_state) {
    case KVM_MP_STATE_HALTED:
        vcpu->arch.pv.pv_unhalted = false;
        vcpu->arch.mp_state =
            KVM_MP_STATE_RUNNABLE;
    ...
}


References:

  1. kvm performance optimization technologies, part two
  2. kvm : Paravirt-spinlock support for KVM guests
  3. Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud
  4. Benchmarking the new PV ticketlock implementation
  5. How to deal with lock-holder preemption
  6. Towards scalable multiprocessor virtual machines. In Proceedings of the 3rd conference on Virtual Machine Research And Technology Symposium - Volume 3