Introduction to QEMU-KVM Live Migration.

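This article covers only pre-copy memory migration. Most of the content is reposted from the article "qemu热迁移简介" (an introduction to QEMU live migration).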

The code analysis in this article is based on QEMU 5.1.0.

1. Usage

The usage of QEMU&&KVM live migration

2. Basic Principles

Recommended reading: "Live Migration of Virtual Machines" (NSDI '05).

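First, let's look at which parts of QEMU are involved in live migration. In the accompanying figure, the gray area in the middle is the VM's memory. To QEMU it is a black box: QEMU makes no assumptions about its contents and simply sends all of it to dst (the destination host). The area on the left represents device state, which is visible to the guest; QEMU sends it using its own protocol. The area on the right is the part that is not migrated but must still be consistent between dst and src (the source host). In general, starting the VM on src and dst with the same QEMU command line keeps this part consistent.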

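Many conditions must be met before live migration can be performed:

  1. Shared storage, such as NFS, must be used
  2. The hosts' clocks must be consistent
  3. The network configuration must be consistent: it must not be the case that src can reach a network that dst cannot
  4. The host CPU types must match, since the host exposes its instruction set to the guest
  5. The VM's machine type, QEMU version, ROM version, and so on must match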

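Live migration consists of three main steps:

  1. Mark all of the VM's RAM pages as dirty. Main function: ram_save_setup
  2. Iteratively send the VM's dirty pages to dst until some condition is met, for example when the number of dirty pages is small enough. Main function: ram_save_iterate
  3. Stop the guest on src, send the remaining dirty pages to dst, and then send the device state. Main function: qemu_savevm_state_complete_precopy

Steps 1 and 2 cover the gray area in the figure; step 3 covers the gray area and the area on the left.

After that, the guest can continue running under QEMU on dst.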
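The three steps above can be illustrated with a toy simulation. This is only a sketch and not QEMU code: `toy_precopy`, `NPAGES`, the fixed "hot page" re-dirtying pattern, and the page-count threshold are hypothetical names and simplifications introduced here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 64  /* hypothetical guest size, in pages */

/* Outcome of one simulated migration. */
struct toy_result {
    size_t rounds;        /* iterative rounds run while the guest was live */
    size_t sent_live;     /* pages sent while the guest was running */
    size_t sent_stopped;  /* pages sent after the guest was stopped */
};

static size_t count_dirty(const bool *dirty)
{
    size_t n = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        n += dirty[i];
    }
    return n;
}

/*
 * Toy model of pre-copy. The "guest" re-dirties the first `hot` pages
 * during every live round. Iteration stops once at most `threshold`
 * pages are dirty, or after `max_rounds` rounds.
 */
struct toy_result toy_precopy(size_t hot, size_t threshold, size_t max_rounds)
{
    bool dirty[NPAGES];
    struct toy_result r = {0, 0, 0};

    assert(hot <= NPAGES);

    /* Step 1 (cf. ram_save_setup): mark every page dirty. */
    for (size_t i = 0; i < NPAGES; i++) {
        dirty[i] = true;
    }

    /* Step 2 (cf. ram_save_iterate): send dirty pages while the guest runs. */
    while (count_dirty(dirty) > threshold && r.rounds < max_rounds) {
        for (size_t i = 0; i < NPAGES; i++) {
            if (dirty[i]) {
                dirty[i] = false;   /* "send" the page and clear its bit */
                r.sent_live++;
            }
        }
        for (size_t i = 0; i < hot; i++) {
            dirty[i] = true;        /* guest writes during the round */
        }
        r.rounds++;
    }

    /* Step 3 (cf. qemu_savevm_state_complete_precopy): the guest is
     * stopped, so whatever is still dirty is sent exactly once. */
    for (size_t i = 0; i < NPAGES; i++) {
        if (dirty[i]) {
            dirty[i] = false;
            r.sent_stopped++;
        }
    }
    return r;
}
```

With a small hot set, for example `toy_precopy(4, 8, 10)`, one live round is enough and only the hot pages are left for the stopped phase. With a hot set larger than the threshold the loop never converges on its own and only stops because of `max_rounds`.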

3. Algorithm

  1. Setup
  • Start guest on destination, connect, enable dirty page logging and more
  2. Transfer Memory
  • Guest continues to run
  • Bandwidth limitation (controlled by the user)
  • First transfer the whole memory
  • Iteratively transfer all dirty pages (pages that were written to by the guest).
  3. Stop the guest
  • And sync VM image(s) (guest’s hard drives).
  4. Transfer State
  • As fast as possible (no bandwidth limitation)
  • All VM devices’ state and dirty pages yet to be transferred
  5. Continue the guest
  • On destination upon success
    • Broadcast “I’m over here” Ethernet packet to announce new location of NIC(s).
  • On source upon failure (with one exception).
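Whether the iterative phase ever ends depends on how the transfer bandwidth compares with the rate at which the guest dirties memory. The sketch below works through that arithmetic under a deliberately simplified model, which is an assumption of this example and not QEMU's actual accounting: both rates are constant, and the guest is stopped once the remaining data could be sent within an allowed downtime. The function name `toy_rounds_until_stop` and all of its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified convergence model (assumed, not taken from QEMU):
 *   - sending `remaining` bytes takes remaining / bandwidth seconds;
 *   - during that time the guest dirties dirty_rate bytes per second;
 *   - the guest may be stopped once the remaining data fits into
 *     bandwidth * downtime.
 * Rates are in bytes per second. Values are assumed small enough that
 * the 64-bit products below do not overflow.
 *
 * Returns the number of live rounds needed, or -1 if the remaining
 * data is still too large after max_rounds rounds.
 */
int toy_rounds_until_stop(uint64_t ram_bytes, uint64_t bandwidth,
                          uint64_t dirty_rate, uint64_t downtime_ms,
                          int max_rounds)
{
    assert(bandwidth > 0);

    uint64_t threshold = bandwidth * downtime_ms / 1000;
    uint64_t remaining = ram_bytes;   /* first round: the whole memory */

    for (int round = 0; round <= max_rounds; round++) {
        if (remaining <= threshold) {
            return round;             /* small enough: stop the guest */
        }
        /* Bytes dirtied while this round's data was being sent. */
        remaining = remaining * dirty_rate / bandwidth;
    }
    return -1;
}
```

In this model each round shrinks the remaining data by the factor dirty_rate / bandwidth, so the loop only converges when the guest dirties memory more slowly than the link can carry it. For 1 GiB of RAM, 128 MiB/s of bandwidth, a 32 MiB/s dirty rate and 250 ms of allowed downtime, the function returns 3.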

4. Source-Side Code Analysis

After the migrate command is entered in the QEMU monitor, execution passes through the following functions:

hmp_migrate
qmp_migrate
tcp_start_outgoing_migration
socket_start_outgoing_migration
socket_outgoing_migration
migration_channel_connect
qemu_fopen_channel_output
migrate_fd_connect

void migrate_fd_connect(MigrationState *s, Error *error_in)
{
    ...

    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
                       QEMU_THREAD_JOINABLE);
    s->migration_thread_running = true;
}

migrate_fd_connect creates a migration thread whose thread function is migration_thread.

migration_thread
    qemu_savevm_state_setup
        ram_save_setup[save_setup]
            ram_init_all
                ram_init_bitmaps
                    ram_list_init_bitmaps
                        bitmap_new
                        bitmap_set
    migration_iteration_run
        qemu_savevm_state_pending
            ram_save_pending[save_live_pending]
        qemu_savevm_state_iterate
            ram_save_iterate[save_live_iterate]
                ram_find_and_save_block
        migration_completion
            vm_stop_force_state
            qemu_savevm_state_complete_precopy
                qemu_savevm_state_complete_precopy_iterable
                    ram_save_complete[save_live_complete_precopy]
                        ram_find_and_save_block

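migration_thread mainly carries out the three live-migration steps described earlier.
First, step 1: qemu_savevm_state_setup marks all RAM pages as dirty.

Next, step 2, which is carried out by two functions inside the while loop: qemu_savevm_state_pending and qemu_savevm_state_iterate.

The first function calls the ram_save_pending callback to determine how many bytes still have to be transferred; it is fairly simple. The second function calls the ram_save_iterate callback to send dirty pages to dst.

ram_find_and_save_block -> find_dirty_block -> ram_save_host_page -> migration_bitmap_clear_dirty -> ram_save_target_page -> ram_save_page -> save_normal_page -> qemu_put_buffer_async -> … -> qemu_fflush -> … -> send

The while loop calls ram_save_pending and ram_save_iterate over and over, continuously sending the VM's dirty pages to dst until some condition is met, and then moves on to step 3.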
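The core of the call chain above is a dirty bitmap: find the next dirty page, clear its bit, then send the page. The following is a minimal sketch of that pattern only. The `toy_*` helpers are hypothetical stand-ins written for this example, not QEMU's bitmap API, and "sending" is reduced to recording the page index.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Mark page `i` dirty. */
void toy_set_dirty(unsigned long *bmap, size_t i)
{
    bmap[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/* Index of the first dirty page at or after `start`, or `nbits` if none. */
size_t toy_find_next_dirty(const unsigned long *bmap, size_t nbits,
                           size_t start)
{
    for (size_t i = start; i < nbits; i++) {
        if (bmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))) {
            return i;
        }
    }
    return nbits;
}

/* Clear the bit for page `i` and report whether it was set. */
bool toy_test_and_clear_dirty(unsigned long *bmap, size_t i)
{
    unsigned long mask = 1UL << (i % BITS_PER_WORD);
    bool was_set = (bmap[i / BITS_PER_WORD] & mask) != 0;

    bmap[i / BITS_PER_WORD] &= ~mask;
    return was_set;
}

/*
 * One pass over the bitmap. The bit is cleared before the page is
 * "sent", so a write that lands afterwards marks the page dirty again
 * and it is picked up by a later pass. Sent page indices are stored in
 * `sent` (capacity `cap`); the number of pages sent is returned.
 */
size_t toy_send_dirty_pages(unsigned long *bmap, size_t nbits,
                            size_t *sent, size_t cap)
{
    size_t n = 0;
    size_t page = toy_find_next_dirty(bmap, nbits, 0);

    assert(sent != NULL || cap == 0);

    while (page < nbits && n < cap) {
        if (toy_test_and_clear_dirty(bmap, page)) {
            sent[n++] = page;
        }
        page = toy_find_next_dirty(bmap, nbits, page + 1);
    }
    return n;
}
```

Clearing the bit before sending mirrors the order in the chain above, where migration_bitmap_clear_dirty appears before ram_save_target_page.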

Step 3 calls migration_completion. In this step the VM on src is stopped, and the last few remaining dirty pages are copied to dst.

5. Destination-Side Code Analysis

The destination QEMU is started with the same arguments as the source, plus one extra argument: -incoming tcp:0:6666. Once QEMU has parsed -incoming, it waits for src to migrate the VM over. Let's look at this flow.

main -> qemu_init -> qemu_start_incoming_migration -> tcp_start_incoming_migration -> socket_start_incoming_migration -> socket_accept_incoming_migration -> migration_channel_process_incoming -> migration_ioc_process_incoming -> migration_incoming_process -> process_incoming_migration_co -> qemu_loadvm_state -> qemu_loadvm_state_main

process_incoming_migration_co receives the data and resumes the VM. The most important function is qemu_loadvm_state, which receives the data and reconstructs the VM on dst.

int qemu_loadvm_state(QEMUFile *f)
{
    ...
    ret = qemu_loadvm_state_main(f, mis);
    ...

    return ret;
}

Clearly, qemu_loadvm_state_main is the main function for reconstructing the VM.

int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
{
    uint8_t section_type;
    int ret = 0;

retry:
    while (true) {
        section_type = qemu_get_byte(f);

        if (qemu_file_get_error(f)) {
            ret = qemu_file_get_error(f);
            break;
        }

        trace_qemu_loadvm_state_section(section_type);
        switch (section_type) {
        case QEMU_VM_SECTION_START:
        case QEMU_VM_SECTION_FULL:
            ret = qemu_loadvm_section_start_full(f, mis);
            if (ret < 0) {
                goto out;
            }
            break;
        case QEMU_VM_SECTION_PART:
        case QEMU_VM_SECTION_END:
            ret = qemu_loadvm_section_part_end(f, mis);
            if (ret < 0) {
                goto out;
            }
            break;
        case QEMU_VM_COMMAND:
            ret = loadvm_process_command(f);
            trace_qemu_loadvm_state_section_command(ret);
            if ((ret < 0) || (ret == LOADVM_QUIT)) {
                goto out;
            }
            break;
        case QEMU_VM_EOF:
            /* This is the end of migration */
            goto out;
        default:
            error_report("Unknown savevm section type %d", section_type);
            ret = -EINVAL;
            goto out;
        }
    }

out:
    if (ret < 0) {
        qemu_file_set_error(f, ret);

        /* Cancel bitmaps incoming regardless of recovery */
        dirty_bitmap_mig_cancel_incoming();

        /*
         * If we are during an active postcopy, then we pause instead
         * of bail out to at least keep the VM's dirty data. Note
         * that POSTCOPY_INCOMING_LISTENING stage is still not enough,
         * during which we're still receiving device states and we
         * still haven't yet started the VM on destination.
         *
         * Only RAM postcopy supports recovery. Still, if RAM postcopy is
         * enabled, canceled bitmaps postcopy will not affect RAM postcopy
         * recovering.
         */
        if (postcopy_state_get() == POSTCOPY_INCOMING_RUNNING &&
            migrate_postcopy_ram() && postcopy_pause_incoming(mis)) {
            /* Reset f to point to the newly created channel */
            f = mis->from_src_file;
            goto retry;
        }
    }
    return ret;
}

qemu_loadvm_state_main handles each section in turn; src places markers such as QEMU_VM_SECTION_START into the stream.
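The marker-driven loop can be tried out in isolation with a small stand-alone parser. Everything here is a sketch under stated assumptions: the framing of one type byte, one length byte and a payload is invented for this example and is not QEMU's wire format, and the `TOY_*` names and `toy_loadvm_main` are hypothetical. The numeric marker values are meant to mirror QEMU's section types but should be checked against the QEMU source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker values assumed to mirror QEMU's; verify against the source. */
enum toy_section_type {
    TOY_VM_EOF           = 0x00,
    TOY_VM_SECTION_START = 0x01,
    TOY_VM_SECTION_PART  = 0x02,
    TOY_VM_SECTION_END   = 0x03,
    TOY_VM_SECTION_FULL  = 0x04,
};

struct toy_load_stats {
    size_t start_full;     /* START or FULL sections seen */
    size_t part_end;       /* PART or END sections seen */
    size_t payload_bytes;  /* total payload bytes skipped */
};

/*
 * Hypothetical framing: [type:1][len:1][payload:len] ... [EOF:1].
 * Returns 0 when the EOF marker is reached, -1 on an unknown section
 * type or truncated input.
 */
int toy_loadvm_main(const uint8_t *buf, size_t len, struct toy_load_stats *st)
{
    size_t pos = 0;

    assert(buf != NULL && st != NULL);
    st->start_full = 0;
    st->part_end = 0;
    st->payload_bytes = 0;

    while (pos < len) {
        uint8_t type = buf[pos++];
        size_t plen;

        switch (type) {
        case TOY_VM_SECTION_START:
        case TOY_VM_SECTION_FULL:
            st->start_full++;
            break;
        case TOY_VM_SECTION_PART:
        case TOY_VM_SECTION_END:
            st->part_end++;
            break;
        case TOY_VM_EOF:
            return 0;   /* end of the migration stream */
        default:
            return -1;  /* unknown section type */
        }

        /* Every non-EOF section carries a length byte and a payload. */
        if (pos >= len) {
            return -1;
        }
        plen = buf[pos++];
        if (plen > len - pos) {
            return -1;
        }
        st->payload_bytes += plen;
        pos += plen;
    }
    return -1;  /* ran out of data before seeing EOF */
}
```

As in the function above, START and FULL share one handler and PART and END share another, while an unknown marker aborts the load.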

qemu_loadvm_section_start_full
    find_se
    vmstate_load
        ram_load[load_state]
            ram_load_precopy
                qemu_get_buffer
                ...

ram_load copies the received data into the memory of the VM on the dst side.
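The idea of copying received pages into guest memory can be sketched as follows. This is an illustration only: the record layout of a 4-byte big-endian page index followed by the page data, the tiny `TOY_PAGE_SIZE`, and the function `toy_ram_load` are assumptions of this example and do not describe the format that ram_load actually parses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 16  /* hypothetical; kept tiny so tests stay small */

/*
 * Hypothetical record format: [page index: 4 bytes, big-endian][page data].
 * Copies each received page into `ram` at index * TOY_PAGE_SIZE.
 * Returns the number of pages loaded, or -1 on malformed input.
 */
int toy_ram_load(uint8_t *ram, size_t npages, const uint8_t *in, size_t len)
{
    const size_t rec = 4 + TOY_PAGE_SIZE;
    int loaded = 0;

    assert(ram != NULL && in != NULL);

    if (len % rec != 0) {
        return -1;  /* stream does not consist of whole records */
    }

    for (size_t pos = 0; pos < len; pos += rec) {
        uint32_t idx = ((uint32_t)in[pos] << 24) |
                       ((uint32_t)in[pos + 1] << 16) |
                       ((uint32_t)in[pos + 2] << 8) |
                       (uint32_t)in[pos + 3];

        if (idx >= npages) {
            return -1;  /* never write outside guest RAM */
        }
        memcpy(ram + (size_t)idx * TOY_PAGE_SIZE, in + pos + 4, TOY_PAGE_SIZE);
        loaded++;
    }
    return loaded;
}
```

Because the same page may be sent several times during the iterative phase, a later record for a page simply overwrites the earlier copy.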

6. MISC

typedef struct SaveVMHandlers {
    /* This runs inside the iothread lock. */
    SaveStateHandler *save_state;

    void (*save_cleanup)(void *opaque);
    int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque);
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);

    /* This runs both outside and inside the iothread lock. */
    bool (*is_active)(void *opaque);
    bool (*has_postcopy)(void *opaque);

    /* is_active_iterate
     * If it is not NULL then qemu_savevm_state_iterate will skip iteration if
     * it returns false. For example, it is needed for only-postcopy-states,
     * which needs to be handled by qemu_savevm_state_setup and
     * qemu_savevm_state_pending, but do not need iterations until not in
     * postcopy stage.
     */
    bool (*is_active_iterate)(void *opaque);

    /* This runs outside the iothread lock in the migration case, and
     * within the lock in the savevm case. The callback had better only
     * use data that is local to the migration thread or protected
     * by other locks.
     */
    int (*save_live_iterate)(QEMUFile *f, void *opaque);

    /* This runs outside the iothread lock! */
    int (*save_setup)(QEMUFile *f, void *opaque);
    void (*save_live_pending)(QEMUFile *f, void *opaque,
                              uint64_t threshold_size,
                              uint64_t *res_precopy_only,
                              uint64_t *res_compatible,
                              uint64_t *res_postcopy_only);
    /* Note for save_live_pending:
     * - res_precopy_only is for data which must be migrated in precopy phase
     *     or in stopped state, in other words - before target vm start
     * - res_compatible is for data which may be migrated in any phase
     * - res_postcopy_only is for data which must be migrated in postcopy phase
     *     or in stopped state, in other words - after source vm stop
     *
     * Sum of res_postcopy_only, res_compatible and res_postcopy_only is the
     * whole amount of pending data.
     */


    LoadStateHandler *load_state;
    int (*load_setup)(QEMUFile *f, void *opaque);
    int (*load_cleanup)(void *opaque);
    /* Called when postcopy migration wants to resume from failure */
    int (*resume_prepare)(MigrationState *s, void *opaque);
} SaveVMHandlers;

static SaveVMHandlers savevm_ram_handlers = {
    .save_setup = ram_save_setup,
    .save_live_iterate = ram_save_iterate,
    .save_live_complete_postcopy = ram_save_complete,
    .save_live_complete_precopy = ram_save_complete,
    .has_postcopy = ram_has_postcopy,
    .save_live_pending = ram_save_pending,
    .load_state = ram_load,
    .save_cleanup = ram_save_cleanup,
    .load_setup = ram_load_setup,
    .load_cleanup = ram_load_cleanup,
    .resume_prepare = ram_resume_prepare,
};

Using these callback functions as the interface for studying live migration is also an excellent way to learn the source code and to get the big picture.
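The callback-table pattern itself is easy to reproduce in miniature, which can help when reading the real code. The sketch below is a stripped-down imitation and not QEMU's API: `toy_handlers`, `toy_entry`, the `toy_savevm_*` drivers and the `toy_ram` "device" are all hypothetical, and the callbacks take only an opaque pointer instead of a QEMUFile.

```c
#include <assert.h>
#include <stddef.h>

/* Miniature callback table; any member may be NULL. */
struct toy_handlers {
    int (*save_setup)(void *opaque);
    int (*save_live_iterate)(void *opaque);  /* returns 1 when nothing is left */
    int (*save_live_complete_precopy)(void *opaque);
};

/* One registered state provider: a name, its callbacks, its private data. */
struct toy_entry {
    const char *idstr;
    const struct toy_handlers *ops;
    void *opaque;
};

/* A RAM-like provider that has `pending` chunks left to send. */
struct toy_ram {
    int pending;
    int setup_done;
    int completed;
};

static int toy_ram_setup(void *opaque)
{
    struct toy_ram *r = opaque;
    r->setup_done = 1;
    return 0;
}

static int toy_ram_iterate(void *opaque)
{
    struct toy_ram *r = opaque;
    if (r->pending > 0) {
        r->pending--;  /* "send" one chunk */
    }
    return r->pending == 0;
}

static int toy_ram_complete(void *opaque)
{
    struct toy_ram *r = opaque;
    r->pending = 0;    /* flush whatever is left */
    r->completed = 1;
    return 0;
}

const struct toy_handlers toy_ram_handlers = {
    .save_setup = toy_ram_setup,
    .save_live_iterate = toy_ram_iterate,
    .save_live_complete_precopy = toy_ram_complete,
};

/* Generic drivers: walk all entries and call the callback if present. */
int toy_savevm_setup(struct toy_entry *e, size_t n)
{
    assert(e != NULL);
    for (size_t i = 0; i < n; i++) {
        if (e[i].ops->save_setup && e[i].ops->save_setup(e[i].opaque) < 0) {
            return -1;
        }
    }
    return 0;
}

/* Returns 1 when every iterable entry reports that it is done. */
int toy_savevm_iterate(struct toy_entry *e, size_t n)
{
    int all_done = 1;

    assert(e != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!e[i].ops->save_live_iterate) {
            continue;  /* this provider has no iterable state */
        }
        if (!e[i].ops->save_live_iterate(e[i].opaque)) {
            all_done = 0;
        }
    }
    return all_done;
}

int toy_savevm_complete(struct toy_entry *e, size_t n)
{
    assert(e != NULL);
    for (size_t i = 0; i < n; i++) {
        if (e[i].ops->save_live_complete_precopy &&
            e[i].ops->save_live_complete_precopy(e[i].opaque) < 0) {
            return -1;
        }
    }
    return 0;
}
```

The generic drivers never need to know what kind of state an entry holds. That separation is what makes the callbacks a convenient entry point when reading the migration code.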


References:

  1. Migration | KVM Docs