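This article walks through RTC virtualization with reference to the QEMU code. Some details are still unclear to the author and will be filled in later.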

1. Prerequisite

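  • Readers should have a basic understanding of the RTC
    • After writing 0x00 to PIO 0x70, the value read from the PIO 0x71 register is the current seconds
    • After writing 0x02 to PIO 0x70, the value read from the PIO 0x71 register is the current minutes
    • After writing 0x04 to PIO 0x70, the value read from the PIO 0x71 register is the current hours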
  • PIO virtualization in QEMU/KVM

2. How to use RTC in QEMU

2.1 QEMU document

``-rtc [base=utc|localtime|datetime][,clock=host|rt|vm][,driftfix=none|slew]``
Specify ``base`` as ``utc`` or ``localtime`` to let the RTC start at
the current UTC or local time, respectively. ``localtime`` is
required for correct date in MS-DOS or Windows. To start at a
specific point in time, provide datetime in the format
``2006-06-17T16:01:21`` or ``2006-06-17``. The default base is UTC.

By default the RTC is driven by the host system time. This allows
using of the RTC as accurate reference clock inside the guest,
specifically if the host time is smoothly following an accurate
external reference clock, e.g. via NTP. If you want to isolate the
guest time from the host, you can set ``clock`` to ``rt`` instead,
which provides a host monotonic clock if host support it. To even
prevent the RTC from progressing during suspension, you can set
``clock`` to ``vm`` (virtual clock). ``clock=vm`` is
recommended especially in icount mode in order to preserve
determinism; however, note that in icount mode the speed of the
virtual clock is variable and can in general differ from the host
clock.

Enable ``driftfix`` (i386 targets only) if you experience time drift
problems, specifically with Windows' ACPI HAL. This option will try
to figure out how many timer interrupts were not processed by the
Windows guest and will re-inject them.
  • UTC is the primary time standard by which the world regulates clocks and time.

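  • system time vs monotonic clock
    system time: the host's wall-clock time; it can be stepped or slewed when the host corrects its clock, e.g. via NTP

    monotonic clock: a clock that only moves forward and does not jump when the system time is changed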

  • Compared with clock=rt, clock=vm adds one more property: the RTC stops counting while the guest is suspended

  • icount: instruction counter

  • HAL (Hardware Abstraction Layer)
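
The difference between system time and a monotonic clock is easy to observe on a POSIX host. The sketch below is not QEMU code; `read_clock_ns` is a helper name made up for this article, and it assumes a host that provides `clock_gettime` with both `CLOCK_REALTIME` and `CLOCK_MONOTONIC`.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock and return its value in nanoseconds, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
int64_t read_clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0) {
        return -1;
    }
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

`read_clock_ns(CLOCK_REALTIME)` returns nanoseconds since the Unix epoch and may jump if the host time is corrected. Two successive calls with `CLOCK_MONOTONIC` never go backwards, which is the property `clock=rt` relies on when the host supports it.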

2.2 QEMUClockType

/**
 * QEMUClockType:
 *
 * The following clock types are available:
 *
 * @QEMU_CLOCK_REALTIME: Real time clock
 *
 * The real time clock should be used only for stuff which does not
 * change the virtual machine state, as it runs even if the virtual
 * machine is stopped.
 *
 * @QEMU_CLOCK_VIRTUAL: virtual clock
 *
 * The virtual clock only runs during the emulation. It stops
 * when the virtual machine is stopped.
 *
 * @QEMU_CLOCK_HOST: host clock
 *
 * The host clock should be used for device models that emulate accurate
 * real time sources. It will continue to run when the virtual machine
 * is suspended, and it will reflect system time changes the host may
 * undergo (e.g. due to NTP).
 *
 * @QEMU_CLOCK_VIRTUAL_RT: realtime clock used for icount warp
 *
 * Outside icount mode, this clock is the same as @QEMU_CLOCK_VIRTUAL.
 * In icount mode, this clock counts nanoseconds while the virtual
 * machine is running. It is used to increase @QEMU_CLOCK_VIRTUAL
 * while the CPUs are sleeping and thus not executing instructions.
 */

typedef enum {
    QEMU_CLOCK_REALTIME = 0,
    QEMU_CLOCK_VIRTUAL = 1,
    QEMU_CLOCK_HOST = 2,
    QEMU_CLOCK_VIRTUAL_RT = 3,
    QEMU_CLOCK_MAX
} QEMUClockType;

2.3 QEMU parameter parsing

How does QEMU parse the base, clock and driftfix parameters? See configure_rtc.

static void configure_rtc(QemuOpts *opts)
{
    const char *value;

    /* Set defaults */
    rtc_clock = QEMU_CLOCK_HOST;
    rtc_ref_start_datetime = qemu_clock_get_ms(QEMU_CLOCK_HOST) / 1000;
    rtc_realtime_clock_offset = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;

    value = qemu_opt_get(opts, "base");
    if (value) {
        if (!strcmp(value, "utc")) {
            rtc_base_type = RTC_BASE_UTC;
        } else if (!strcmp(value, "localtime")) {
            Error *blocker = NULL;
            rtc_base_type = RTC_BASE_LOCALTIME;
            error_setg(&blocker, QERR_REPLAY_NOT_SUPPORTED,
                       "-rtc base=localtime");
            replay_add_blocker(blocker);
        } else {
            rtc_base_type = RTC_BASE_DATETIME;
            configure_rtc_base_datetime(value);
        }
    }
    ...
}
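
The three-way decision on `base` can be isolated into a few lines. This is only a sketch of the classification logic shown above, not QEMU's code: `BaseType` and `classify_rtc_base` are names invented here, and the replay blocker and the datetime parsing are left out.

```c
#include <string.h>

/* Hypothetical stand-ins for QEMU's RTC_BASE_* values. */
typedef enum {
    BASE_UTC,
    BASE_LOCALTIME,
    BASE_DATETIME
} BaseType;

/* Classify the value of "base=". NULL (option absent) keeps the UTC
 * default; anything other than "utc" or "localtime" is treated as a
 * datetime string to be parsed elsewhere. */
BaseType classify_rtc_base(const char *value)
{
    if (value == NULL || strcmp(value, "utc") == 0) {
        return BASE_UTC;
    }
    if (strcmp(value, "localtime") == 0) {
        return BASE_LOCALTIME;
    }
    return BASE_DATETIME;
}
```

For instance, `classify_rtc_base("2006-06-17T16:01:21")` falls into the last branch, which matches the `else` arm of configure_rtc.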

3. Full picture

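After writing 0x00 to PIO 0x70, the value read from the PIO 0x71 register is the current seconds. This section uses that operation as an example to walk through the whole flow.

3.1 Guest writes PIO 0x70

  1. The guest executes an OUT instruction in non-root mode
  2. PIO VM Exit
  3. KVM finds that it cannot handle this PIO itself and forwards the I/O request to QEMU
  4. QEMU handles the I/O request

For how QEMU handles this I/O request, see cmos_ioport_write.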

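3.2 Guest reads PIO 0x71

  1. The guest executes an IN instruction in non-root mode
  2. PIO VM Exit
  3. KVM finds that it cannot handle this PIO itself and forwards the I/O request to QEMU
  4. QEMU handles the I/O request

For how QEMU handles this I/O request, see cmos_ioport_read.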
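
Step 4 boils down to "find the device that owns the port and call its handler". The toy dispatcher below illustrates that idea only. It does not use the real KVM or QEMU interfaces: `PioReq`, `PortRange`, `dispatch_pio` and the fake device are all invented for this sketch, and the value 0x30 returned for the seconds register is a made-up constant.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy description of a forwarded PIO request (illustrative fields). */
typedef struct {
    bool     is_write;   /* true for OUT, false for IN */
    uint16_t port;
    uint8_t  data;       /* value written, or filled in on a read */
} PioReq;

typedef struct {
    uint16_t base;
    uint16_t len;
    void (*handle)(PioReq *req, uint16_t offset);
} PortRange;

static uint8_t toy_index;   /* last value written to port 0x70 */

/* Fake RTC: remembers the index, returns a constant for register 0. */
static void toy_rtc_handle(PioReq *req, uint16_t offset)
{
    if (req->is_write && offset == 0) {
        toy_index = req->data;
    } else if (!req->is_write && offset == 1) {
        req->data = (toy_index == 0x00) ? 0x30 : 0xff;
    }
}

static const PortRange port_map[] = {
    { 0x70, 2, toy_rtc_handle },
};

/* Route the request to the device owning the port.
 * Returns false if no device claims it. */
bool dispatch_pio(PioReq *req)
{
    for (size_t i = 0; i < sizeof(port_map) / sizeof(port_map[0]); i++) {
        const PortRange *r = &port_map[i];

        if (req->port >= r->base && req->port < r->base + r->len) {
            r->handle(req, (uint16_t)(req->port - r->base));
            return true;
        }
    }
    return false;
}
```

A write of 0x00 to port 0x70 followed by a read of port 0x71 goes through `dispatch_pio` twice, once per VM exit, which mirrors sections 3.1 and 3.2.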

4. Emulating the mc146818 clock chip

4.1 Data structures

typedef struct RTCState {
    ISADevice parent_obj;

    MemoryRegion io;
    MemoryRegion coalesced_io;
    uint8_t cmos_data[128];
    uint8_t cmos_index;
    int32_t base_year;
    uint64_t base_rtc;
    uint64_t last_update;
    int64_t offset;
    qemu_irq irq;
    int it_shift;
    /* periodic timer */
    QEMUTimer *periodic_timer;
    int64_t next_periodic_time;
    /* update-ended timer */
    QEMUTimer *update_timer;
    uint64_t next_alarm_time;
    uint16_t irq_reinject_on_ack_count;
    uint32_t irq_coalesced;
    uint32_t period;
    QEMUTimer *coalesced_timer;
    Notifier clock_reset_notifier;
    LostTickPolicy lost_tick_policy;
    Notifier suspend_notifier;
    QLIST_ENTRY(RTCState) link;
} RTCState;
  • cmos_data holds the 128 bytes of CMOS data
  • base_rtc is the RTC value when the RTC was last updated
  • last_update is the guest time when the RTC was last updated

4.2 Initialization

static const MemoryRegionOps cmos_ops = {
    .read = cmos_ioport_read,
    .write = cmos_ioport_write,
    .impl = {
        .min_access_size = 1,
        .max_access_size = 1,
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void rtc_realizefn(DeviceState *dev, Error **errp)
{
    ...
    int base = 0x70;

    s->cmos_data[RTC_REG_A] = 0x26;
    s->cmos_data[RTC_REG_B] = 0x02;
    s->cmos_data[RTC_REG_C] = 0x00;
    s->cmos_data[RTC_REG_D] = 0x80;
    ...

    memory_region_init_io(&s->io, OBJECT(s), &cmos_ops, s, "rtc", 2);
    isa_register_ioport(isadev, &s->io, base);
    ...
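
The initial register values are easier to read once decoded. As far as I understand the MC146818 datasheet, Register A = 0x26 selects the 32.768 kHz time base (bits 6..4 = 2) and rate select 6, Register B = 0x02 means 24-hour mode with BCD data, and Register D = 0x80 sets the "valid RAM and time" bit. The helpers below decode Register A only; their names are invented for this sketch and it assumes the 32.768 kHz time base.

```c
#include <stdint.h>

/* Bits 6..4 of Register A: divider (time base) select. */
unsigned reg_a_divider(uint8_t reg_a)
{
    return (reg_a >> 4) & 0x7;
}

/* Bits 3..0 of Register A: periodic rate select. */
unsigned reg_a_rate_select(uint8_t reg_a)
{
    return reg_a & 0x0f;
}

/* Periodic interrupt frequency in Hz, assuming a 32.768 kHz time base.
 * Returns 0 when the periodic interrupt is disabled (rate select 0).
 * Rate selects 1 and 2 wrap around to 256 Hz and 128 Hz. */
unsigned periodic_hz(uint8_t reg_a)
{
    unsigned rs = reg_a_rate_select(reg_a);

    if (rs == 0) {
        return 0;
    }
    if (rs <= 2) {
        rs += 7;
    }
    return 32768u >> (rs - 1);
}
```

With the value programmed in rtc_realizefn, `periodic_hz(0x26)` evaluates to 1024, the familiar default periodic rate of the PC RTC.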

4.3 cmos_ioport_write

After the guest writes 0x00 to PIO 0x70, QEMU handles it as follows:

static void cmos_ioport_write(void *opaque, hwaddr addr,
                              uint64_t data, unsigned size)
{
    RTCState *s = opaque;
    uint32_t old_period;
    bool update_periodic_timer;

    if ((addr & 1) == 0) {
        s->cmos_index = data & 0x7f;
    } else {
    ...

Here addr & 1 is 0 (port 0x70), so s->cmos_index = data & 0x7f is executed and cmos_index is set to 0.
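
The index/data split and the 0x7f mask can be tried out on a toy model. This is a simplified sketch, not the QEMU device: `ToyCmos` and its two functions are invented names, and the data port here is a plain array access with none of the per-register handling.

```c
#include <stdint.h>

typedef struct {
    uint8_t index;        /* register selected through the even port */
    uint8_t data[128];    /* CMOS contents */
} ToyCmos;

/* Write path: even port = index latch (low 7 bits only),
 * odd port = data register selected by the latch. */
void toy_cmos_write(ToyCmos *s, uint16_t port, uint8_t val)
{
    if ((port & 1) == 0) {
        s->index = val & 0x7f;
    } else {
        s->data[s->index] = val;
    }
}

/* Read path: the index port reads back 0xff, the data port returns
 * the currently selected register. */
uint8_t toy_cmos_read(const ToyCmos *s, uint16_t port)
{
    if ((port & 1) == 0) {
        return 0xff;
    }
    return s->data[s->index];
}
```

Because of the mask, writing 0x8a to port 0x70 selects register 0x0a; the top bit never reaches the index, so the index always stays within the 128-byte array.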

4.4 cmos_ioport_read

When the guest then reads PIO 0x71 to get the current seconds, QEMU handles it as follows:

static uint64_t cmos_ioport_read(void *opaque, hwaddr addr,
                                 unsigned size)
{
    RTCState *s = opaque;
    int ret;
    if ((addr & 1) == 0) {
        return 0xff;
    } else {
        switch(s->cmos_index) {
        case RTC_IBM_PS2_CENTURY_BYTE:
            s->cmos_index = RTC_CENTURY;
            /* fall through */
        case RTC_CENTURY:
        case RTC_SECONDS:
        case RTC_MINUTES:
        case RTC_HOURS:
        case RTC_DAY_OF_WEEK:
        case RTC_DAY_OF_MONTH:
        case RTC_MONTH:
        case RTC_YEAR:
            /* if not in set mode, calibrate cmos before
             * reading*/
            if (rtc_running(s)) {
                rtc_update_time(s);
            }
            ret = s->cmos_data[s->cmos_index];
            break;
        ...

Here addr & 1 is 1 (port 0x71), so rtc_update_time is called (provided the RTC is running) and s->cmos_data[s->cmos_index] is then returned.

static void rtc_update_time(RTCState *s)
{
    struct tm ret;
    time_t guest_sec;
    int64_t guest_nsec;

    guest_nsec = get_guest_rtc_ns(s);
    guest_sec = guest_nsec / NANOSECONDS_PER_SECOND;
    gmtime_r(&guest_sec, &ret);

    /* Is SET flag of Register B disabled? */
    if ((s->cmos_data[RTC_REG_B] & REG_B_SET) == 0) {
        rtc_set_cmos(s, &ret);
    }
}

rtc_set_cmos fills in the time fields of cmos_data.
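
To make the conversion concrete, here is a rough sketch of turning a seconds count into the seconds, minutes and hours registers. It is not rtc_set_cmos itself: `to_bcd` and `fill_time_regs` are names made up for this article, and the sketch assumes UTC, 24-hour mode and BCD encoding (which, as I read the datasheet, is what Register B = 0x02 selects) and ignores the date registers.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Encode a value in the range 0..99 as one packed-BCD byte. */
static uint8_t to_bcd(int v)
{
    return (uint8_t)(((v / 10) << 4) | (v % 10));
}

/* Fill registers 0x00, 0x02 and 0x04 (seconds, minutes, hours) from a
 * count of seconds since the Unix epoch. */
void fill_time_regs(uint8_t regs[128], int64_t guest_sec)
{
    time_t t = (time_t)guest_sec;
    struct tm tm;

    gmtime_r(&t, &tm);
    regs[0x00] = to_bcd(tm.tm_sec);
    regs[0x02] = to_bcd(tm.tm_min);
    regs[0x04] = to_bcd(tm.tm_hour);
}
```

For a guest time of 13:45:59 the guest would read 0x59 from register 0x00, so a guest driver has to decode BCD before using the value as a number.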

static uint64_t get_guest_rtc_ns(RTCState *s)
{
    uint64_t guest_clock = qemu_clock_get_ns(rtc_clock);

    return s->base_rtc * NANOSECONDS_PER_SECOND +
        guest_clock - s->last_update + s->offset;
}

https://lore.kernel.org/qemu-devel/1342781633-7288-5-git-send-email-pbonzini@redhat.com/

Calculate guest RTC based on the time of the last update. The formula is:
(base_rtc + guest_time_now - guest_time_last_update + offset)

  • base_rtc is the RTC value when the RTC was last updated
  • guest_time_now is the guest time when the access happens
  • guest_time_last_update is the guest time when the RTC was last updated
  • offset is used when divider reset happens or the set bit is toggled (this can be ignored for now; to dig deeper, read the RTC spec carefully)
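
A worked example helps to check the units: base_rtc is in seconds while the other terms are in nanoseconds. The function below restates the formula as a standalone helper; `guest_rtc_ns` and `NSEC_PER_SEC` are names chosen for this sketch, and signed 64-bit values are used for simplicity.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* base_rtc_sec:   RTC value (seconds) at the last update
 * last_update_ns: guest clock (ns) at the last update
 * now_ns:         guest clock (ns) at the time of the access
 * offset_ns:      sub-second correction, 0 in the common case */
int64_t guest_rtc_ns(int64_t base_rtc_sec, int64_t last_update_ns,
                     int64_t now_ns, int64_t offset_ns)
{
    return base_rtc_sec * NSEC_PER_SEC
           + (now_ns - last_update_ns)
           + offset_ns;
}
```

Suppose the RTC was last updated to 1000 s when the guest clock read 5 s, and the guest reads it again when the clock reads 7.5 s. The result is 1002.5 s in nanoseconds, and dividing by `NSEC_PER_SEC` gives 1002 whole seconds.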

4.5 Update guest RTC

When is the guest's RTC updated?
Whenever the guest's RTC is updated, base_rtc and last_update are updated as well.

4.5.1 Initialization of base_rtc and last_update

static void rtc_set_date_from_host(ISADevice *dev)
{
    RTCState *s = MC146818_RTC(dev);
    struct tm tm;

    qemu_get_timedate(&tm, 0);

    s->base_rtc = mktimegm(&tm);
    s->last_update = qemu_clock_get_ns(rtc_clock);
    s->offset = 0;

    /* set the CMOS date */
    rtc_set_cmos(s, &tm);
}

static void rtc_realizefn(DeviceState *dev, Error **errp)
{
    ...
    rtc_set_date_from_host(isadev);
    ...
}

4.5.2 Updating base_rtc and last_update

static void rtc_set_time(RTCState *s)
{
    struct tm tm;

    rtc_get_time(s, &tm);
    s->base_rtc = mktimegm(&tm);
    s->last_update = qemu_clock_get_ns(rtc_clock);

    ...
}

rtc_set_time updates base_rtc and last_update.

static void cmos_ioport_write(void *opaque, hwaddr addr,
                              uint64_t data, unsigned size)
{
    RTCState *s = opaque;
    uint32_t old_period;
    bool update_periodic_timer;

    if ((addr & 1) == 0) {
        s->cmos_index = data & 0x7f;
    } else {
        CMOS_DPRINTF("cmos: write index=0x%02x val=0x%02" PRIx64 "\n",
                     s->cmos_index, data);
        switch(s->cmos_index) {
        case RTC_SECONDS_ALARM:
        case RTC_MINUTES_ALARM:
        case RTC_HOURS_ALARM:
            s->cmos_data[s->cmos_index] = data;
            check_update_timer(s);
            break;
        case RTC_IBM_PS2_CENTURY_BYTE:
            s->cmos_index = RTC_CENTURY;
            /* fall through */
        case RTC_CENTURY:
        case RTC_SECONDS:
        case RTC_MINUTES:
        case RTC_HOURS:
        case RTC_DAY_OF_WEEK:
        case RTC_DAY_OF_MONTH:
        case RTC_MONTH:
        case RTC_YEAR:
            s->cmos_data[s->cmos_index] = data;
            /* if in set mode, do not update the time */
            if (rtc_running(s)) {
                rtc_set_time(s);
                check_update_timer(s);
            }
            break;
        case RTC_REG_A:
            update_periodic_timer = (s->cmos_data[RTC_REG_A] ^ data) & 0x0f;
            old_period = rtc_periodic_clock_ticks(s);

            if ((data & 0x60) == 0x60) {
                if (rtc_running(s)) {
                    rtc_update_time(s);
                }
                /* What happens to UIP when divider reset is enabled is
                 * unclear from the datasheet. Shouldn't matter much
                 * though.
                 */
                s->cmos_data[RTC_REG_A] &= ~REG_A_UIP;
            } else if (((s->cmos_data[RTC_REG_A] & 0x60) == 0x60) &&
                       (data & 0x70) <= 0x20) {
                /* when the divider reset is removed, the first update cycle
                 * begins one-half second later*/
                if (!(s->cmos_data[RTC_REG_B] & REG_B_SET)) {
                    s->offset = 500000000;
                    rtc_set_time(s);
                }
                s->cmos_data[RTC_REG_A] &= ~REG_A_UIP;
            }
            /* UIP bit is read only */
            s->cmos_data[RTC_REG_A] = (data & ~REG_A_UIP) |
                (s->cmos_data[RTC_REG_A] & REG_A_UIP);

            if (update_periodic_timer) {
                periodic_timer_update(s, qemu_clock_get_ns(rtc_clock),
                                      old_period, true);
            }

            check_update_timer(s);
            break;
        case RTC_REG_B:
            update_periodic_timer = (s->cmos_data[RTC_REG_B] ^ data)
                                    & REG_B_PIE;
            old_period = rtc_periodic_clock_ticks(s);

            if (data & REG_B_SET) {
                /* update cmos to when the rtc was stopping */
                if (rtc_running(s)) {
                    rtc_update_time(s);
                }
                /* set mode: reset UIP mode */
                s->cmos_data[RTC_REG_A] &= ~REG_A_UIP;
                data &= ~REG_B_UIE;
            } else {
                /* if disabling set mode, update the time */
                if ((s->cmos_data[RTC_REG_B] & REG_B_SET) &&
                    (s->cmos_data[RTC_REG_A] & 0x70) <= 0x20) {
                    s->offset = get_guest_rtc_ns(s) % NANOSECONDS_PER_SECOND;
                    rtc_set_time(s);
                }
            }
            ...

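As lines 34, 57 and 90 of the listing above show (the rtc_set_time calls in the time-register, RTC_REG_A and RTC_REG_B cases), a guest write to PIO 0x71 may end up calling rtc_set_time.
For example, if the current year in the guest is 2022 and the guest sets it to 2021, base_rtc and last_update are updated.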
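
The effect of such a re-base can be shown with a small model. This is only a sketch of the idea, not the QEMU code path: `ToyRtc`, `toy_set_time` and `toy_get_sec` are invented names, the guest clock is passed in explicitly instead of being read from rtc_clock, and offset is left out.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

typedef struct {
    int64_t base_rtc;      /* seconds, RTC value at the last update */
    int64_t last_update;   /* ns, guest clock at the last update */
} ToyRtc;

/* The guest sets the RTC to new_sec while the guest clock reads now_ns. */
void toy_set_time(ToyRtc *s, int64_t new_sec, int64_t now_ns)
{
    s->base_rtc = new_sec;
    s->last_update = now_ns;
}

/* RTC value in whole seconds when the guest clock reads now_ns. */
int64_t toy_get_sec(const ToyRtc *s, int64_t now_ns)
{
    return s->base_rtc + (now_ns - s->last_update) / NSEC_PER_SEC;
}
```

Starting from 2022-01-01 00:00:00 UTC (1640995200) and moving the RTC back to 2021-01-01 00:00:00 UTC (1609459200) ten seconds later, a read three seconds after that returns 1609459203: time keeps advancing, but from the new base.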

static int rtc_post_load(void *opaque, int version_id)
{
    RTCState *s = opaque;

    if (version_id <= 2 || rtc_clock == QEMU_CLOCK_REALTIME) {
        rtc_set_time(s);
        s->offset = 0;
        check_update_timer(s);
    }
    ...
}

During live migration, rtc_set_time may also be called on the destination side.

4.6 MISC

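  • rtc_policy_slew_deliver_irq is the operation behind the driftfix=slew parameter; its details have not been studied yet and will be added later.
  • Using the RTC as a timer is not covered in this article either and will be added later.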

References:

  1. qemu时钟虚拟化 (QEMU clock virtualization)
  2. BIOS execution in QEMU: first I/O interaction