This article draws on A Beginners’ Guide to x86-64 Instruction Encoding, supplements it with related material, and uses one concrete example to introduce Intel instruction encoding.

1. Background

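We start from a typical memory reference to introduce instruction encoding:
[base + index*scale + disp]

base and index are registers, disp is the displacement (offset), and scale is the scaling factor applied to index.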
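As a minimal sketch of how such an operand resolves to an address, the Python function below evaluates base + index*scale + disp. The function name and the register contents are made up for illustration; the restriction of scale to 1, 2, 4, or 8 reflects the values the encoding can express.

```python
def effective_address(base, index=0, scale=1, disp=0):
    """Evaluate base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF


# Hypothetical register contents, chosen only for illustration.
rdi = 0x1000
rcx = 3

print(hex(effective_address(rdi, disp=10)))       # address for [rdi + 10]
print(hex(effective_address(rdi, rcx, 8, 0x20)))  # address for [rdi + rcx*8 + 0x20]
```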

Figure1

In Figure1, the Scale, Index, and Base fields of the SIB byte correspond to scale, index, and base. Displacement corresponds to disp.
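To make the SIB layout concrete, here is a small sketch that packs the three fields into one byte. It assumes the standard layout (scale in bits 7-6, index in bits 5-3, base in bits 2-0) and the usual two-bit encoding of the scale factor; `pack_sib` is a name invented for this illustration, and the fourth bit of each register number, which lives in the REX prefix, is ignored here.

```python
# Two-bit encoding of the scale factor in the SIB byte.
SCALE_BITS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def pack_sib(scale, index, base):
    """Pack scale/index/base into a SIB byte (low 3 bits of each register only)."""
    if scale not in SCALE_BITS:
        raise ValueError("scale must be 1, 2, 4 or 8")
    if not (0 <= index <= 7 and 0 <= base <= 7):
        raise ValueError("index and base must be 3-bit register numbers")
    return (SCALE_BITS[scale] << 6) | (index << 3) | base


# [rdi + rcx*8]: index = rcx (001), base = rdi (111), scale = 8 (11)
print(format(pack_sib(8, 0b001, 0b111), "08b"))
```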

2. Tools and tips for finding out an x86-64 instruction’s encoding

To quickly find out the encoding of an instruction, you can use the GNU assembler as and the objdump tool together. For example, to find out the encoding of the instruction addq 10(%rdi), %r8, you can do it as follows.

First, create a file add.s containing one line

addq 10(%rdi), %r8

Second, assemble add.s into an object file with

$ as add.s -o add.o

Last, disassemble the object file with

$ objdump -d add.o

It will print out

add.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
0: 4c 03 47 0a add 0xa(%rdi),%r8

Here 4c 03 47 0a is the 4-byte encoding of the addq instruction.

3. Brief introduction to x86-64 instruction encoding

Each x86-64 instruction is encoded as a variable number of bytes. An instruction’s encoding consists of optional prefixes (such as the REX prefix used later in this article) followed by:

  • an opcode
  • a register and/or address mode specifier consisting of the ModR/M byte and sometimes the scale-index-base (SIB) byte (if required)
  • a displacement and an immediate data field (if required)

Please refer to Figure1 for more information.
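As a rough illustration of this layout, the sketch below splits the bytes 4c 03 47 0a from the previous section into their fields. It is deliberately narrow: `split_fields` is a hypothetical helper that only understands an optional REX prefix, a one-byte opcode, a ModRM byte, and an optional disp8, and it gives up on anything else (SIB, disp32, immediates, other prefixes).

```python
def split_fields(hex_bytes):
    """Split [REX] opcode ModRM [disp8] into named fields; other shapes are rejected."""
    data = bytes.fromhex(hex_bytes)
    pos = 0
    fields = {}

    # In 64-bit mode, bytes 0x40..0x4f are REX prefixes (0100WRXB).
    if (data[pos] & 0xF0) == 0x40:
        fields["rex"] = data[pos]
        pos += 1

    fields["opcode"] = data[pos]
    pos += 1

    modrm = data[pos]
    pos += 1
    fields["modrm"] = modrm
    mod = modrm >> 6
    rm = modrm & 0b111

    if mod != 0b11 and rm == 0b100:
        raise NotImplementedError("SIB byte not handled in this sketch")
    if mod == 0b10 or (mod == 0b00 and rm == 0b101):
        raise NotImplementedError("disp32 not handled in this sketch")
    if mod == 0b01:
        # disp8 is a signed one-byte displacement.
        fields["disp8"] = int.from_bytes(data[pos:pos + 1], "little", signed=True)
        pos += 1

    if pos != len(data):
        raise ValueError("unexpected trailing bytes")
    return fields


for name, value in split_fields("4c 03 47 0a").items():
    print(name, hex(value))
```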

4. An example: manually encode an x86-64 instruction

Let’s take the instruction from the previous part, add r8,QWORD PTR [rdi+0xa] (in Intel syntax), and see how it is encoded to 4c 03 47 0a.

In the “ADD” entry of the “INSTRUCTION SET REFERENCE” chapter in Volume 2A of the ISA reference, find the line for the encoding of ADD r64, r/m64, which corresponds to this instruction:

Opcode         Instruction     Op/En  64-bit Mode  Compat/Leg Mode  Description
REX.W + 03 /r  ADD r64, r/m64  RM     Valid        N.E.             Add r/m64 to r64.

From the REX description:

In 64-bit mode, the instruction’s default operation size is 32 bits. … Using a REX prefix in the form of REX.W promotes operation to 64 bits.

So, we get

REX.W = 1

The ‘R’, ‘X’ and ‘B’ bits are related to the operand encoding (check “Table 2-4. REX Prefix Fields [BITS: 0100WRXB]” of the reference volume 2A).

REX.X bit modifies the SIB index field.

SIB is not used in this instruction. Hence,

REX.X = 0

Let’s further look at the encoding of the operands. From the “Instruction Operand Encoding” for the add instruction:

Op/En  Operand 1         Operand 2      Operand 3  Operand 4
RM     ModRM:reg (r, w)  ModRM:r/m (r)  NA         NA

There will be 2 operand parts for the RM encoding. The first part will be ModRM:reg(r,w) and the second part will be ModRM:r/m(r). “Figure 2-4. Memory Addressing Without an SIB Byte; REX.X Not Used” from Volume 2 shows the encoding for this case.

The REX.R and REX.B bits and the ModRM byte will be decided accordingly. There are 3 parts in the ModRM byte: ‘mod’, ‘reg’ and ‘r/m’.

Volume 2 contains “Table 2-2. 32-Bit Addressing Forms with the ModR/M Byte”, which maps operand combinations to the bit values of ‘mod’. The table is written for 32-bit operands, but section 2.2.1.1 says “In 64-bit mode, these formats do not change. Bits needed to define fields in the 64-bit context are provided by the addition of REX prefixes”, so the same values can be used.

Although the table applies to 64-bit mode too, it does not show the additional registers like r8. Hence, we use it only to find the ‘Mod’ bits for the addq instruction we are encoding. As 0xa fits in one byte, we can use disp8 to keep the instruction encoding short. From the row of [EDI]+disp8 (actually, all disp8 rows share the same ‘Mod’ bits),

Mod = 01 (in bits)
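The choice between displacement sizes can be sketched as follows. This is an illustration rather than how any particular assembler does it: `mod_and_disp_bytes` is a made-up name, Mod = 10 for disp32 is taken from the same Table 2-2, and the sketch ignores the Mod = 00 forms that carry no displacement.

```python
def mod_and_disp_bytes(disp):
    """Pick the shortest displacement form: (Mod bits, little-endian displacement bytes)."""
    if -0x80 <= disp <= 0x7F:
        # Fits in a signed byte: disp8, Mod = 01.
        return 0b01, disp.to_bytes(1, "little", signed=True)
    if -0x8000_0000 <= disp <= 0x7FFF_FFFF:
        # Needs four bytes: disp32, Mod = 10.
        return 0b10, disp.to_bytes(4, "little", signed=True)
    raise ValueError("displacement does not fit in 32 bits")


mod, disp = mod_and_disp_bytes(0xA)
print(format(mod, "02b"), disp.hex())  # prints "01 0a"
```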

For the encoding of the registers, I compiled a table for the general purpose 64-bit registers for your reference:

_.Reg  Register
----------------
0.000 RAX
0.001 RCX
0.010 RDX
0.011 RBX
0.100 RSP
0.101 RBP
0.110 RSI
0.111 RDI
1.000 R8
1.001 R9
1.010 R10
1.011 R11
1.100 R12
1.101 R13
1.110 R14
1.111 R15

The ‘_’ in ‘_.Reg’ is usually a bit in the REX prefix, such as REX.B or REX.R, depending on the specific instruction and operand combination.
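Read as code, the table is just a 4-bit register number whose top bit goes into the REX prefix and whose low three bits go into a ModRM (or SIB) field. A minimal sketch, with `split_register` as an illustrative name:

```python
# Index in this list == the 4-bit register number from the table above.
REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def split_register(name):
    """Return (REX extension bit, 3-bit field) for a 64-bit general purpose register."""
    number = REGISTERS.index(name.lower())
    return number >> 3, number & 0b111


for reg in ("r8", "rdi"):
    rex_bit, field = split_register(reg)
    print(reg, f"{rex_bit}.{field:03b}")
```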

For the addq instruction in this case, r8 is 1.000 and rdi is 0.111. Hence, in bits, we get

reg = 000
r/m = 111
REX.B = 0 (from `rdi`)
REX.R = 1 (from `r8`)

Now, let’s put them together.

Putting the ‘WRXB’ bits ([BITS: 0100WRXB]) together, we get the REX prefix for this instruction:

0100 1100
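The same bit assembly, written as a short sketch (`pack_rex` is a name chosen for this illustration): the fixed high nibble 0100 is followed by the W, R, X, and B bits.

```python
def pack_rex(w, r, x, b):
    """Build a REX prefix byte from its four flag bits: 0100WRXB."""
    for bit in (w, r, x, b):
        if bit not in (0, 1):
            raise ValueError("each REX field is a single bit")
    return 0b0100_0000 | (w << 3) | (r << 2) | (x << 1) | b


rex = pack_rex(w=1, r=1, x=0, b=0)
print(format(rex, "08b"), hex(rex))  # prints "01001100 0x4c"
```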

Together with the 03 in REX.W+03/r from the reference for the ADD instruction, the opcode part, in hexadecimal, is

4c 03

By putting the mod, reg and r/m together, we get the ModRM byte (in bits)

01 000 111

which is, in hexadecimal,

47
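For readers who prefer shifts to bit strings, the ModRM byte can be assembled like this (a sketch; `pack_modrm` is a hypothetical helper name): mod occupies bits 7-6, reg bits 5-3, and r/m bits 2-0.

```python
def pack_modrm(mod, reg, rm):
    """Pack the mod (2 bits), reg (3 bits) and r/m (3 bits) fields into one byte."""
    if not (0 <= mod <= 0b11 and 0 <= reg <= 0b111 and 0 <= rm <= 0b111):
        raise ValueError("field out of range")
    return (mod << 6) | (reg << 3) | rm


modrm = pack_modrm(mod=0b01, reg=0b000, rm=0b111)
print(format(modrm, "08b"), hex(modrm))  # prints "01000111 0x47"
```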

Following the ModRM byte is the displacement, 0xa (the hexadecimal representation of 10), in one byte (disp8).

Putting all these together, we finally get the encoding of add r8,[rdi+0xa]:

4c 03 47 0a

In this example I have shown how to manually encode an instruction, a job that is usually done by the assembler. You can use the same method to encode any other instruction by checking the reference documents for the details of each instruction/operand combination.
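The whole walkthrough can be condensed into a few lines of Python. This is only a sketch of the manual steps, not a general assembler: `encode_add_r64_mem_disp8` is a name invented here, it covers just ADD r64, [base + disp8], and it refuses rsp/r12 as the base because those need a SIB byte.

```python
REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def encode_add_r64_mem_disp8(dest, base, disp):
    """Encode `add dest, QWORD PTR [base + disp]` (REX.W + 03 /r, disp8 form only)."""
    if not -128 <= disp <= 127:
        raise ValueError("this sketch only handles displacements that fit in disp8")
    reg_num = REGISTERS.index(dest)
    base_num = REGISTERS.index(base)
    if (base_num & 0b111) == 0b100:
        raise NotImplementedError("rsp/r12 as base requires a SIB byte")

    # REX prefix: 0100WRXB with W=1 (64-bit), R from dest, X=0 (no SIB), B from base.
    rex = 0b0100_0000 | (1 << 3) | ((reg_num >> 3) << 2) | (0 << 1) | (base_num >> 3)
    opcode = 0x03
    # ModRM: mod=01 (disp8), reg = low 3 bits of dest, r/m = low 3 bits of base.
    modrm = (0b01 << 6) | ((reg_num & 0b111) << 3) | (base_num & 0b111)
    return bytes([rex, opcode, modrm, disp & 0xFF])


print(encode_add_r64_mem_disp8("r8", "rdi", 0xA).hex(" "))  # prints "4c 03 47 0a"
```

Comparing the result against the objdump output from section 2 is a quick sanity check on the manual steps.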

5. Tips

X86-64 Instruction Encoding is a very good page from OSDev as a quick reference.

It is worth skimming quickly; some of its content is quite visual and intuitive.

  • Intel SDM vol2 2.1 INSTRUCTION FORMAT FOR PROTECTED MODE, REAL-ADDRESS MODE,
    AND VIRTUAL-8086 MODE

The section above supplements the explanation of Figure1.

  • Intel SDM vol2 3.1 INTERPRETING THE INSTRUCTION REFERENCE PAGES

Intel SDM vol2 contains the detailed description of each instruction; skim 3.1 INTERPRETING THE INSTRUCTION REFERENCE PAGES first. This section describes the format of information contained in the instruction reference pages in this chapter. It explains notational conventions and abbreviations used in these sections.



References:

  1. A Beginners’ Guide to x86-64 Instruction Encoding
  2. X86-64 Instruction Encoding
  3. x86 Instruction Encoding
  4. Intel SDM
  5. Memory References