Skip to content

Instantly share code, notes, and snippets.

@vic511
Last active May 3, 2021 16:07
Show Gist options
  • Select an option

  • Save vic511/936f3ba2ba843ab654fab66908512708 to your computer and use it in GitHub Desktop.

Select an option

Save vic511/936f3ba2ba843ab654fab66908512708 to your computer and use it in GitHub Desktop.
FCSC 2021 quals - VMV reverse challenge

VMV writeup

Architecture

We are facing a stripped ELF 64-bit binary. It accepts a command-line parameter most likely being the password being checked. We notice the binary is taking a lot of time to execute before noticing us of a fail.

As the name suggests, we are going to consider the binary is actually an implementation of a custom virtual machine. I mainly analyzed this binary using a static approach.

Looking at the main() function, we notice a lot of calls to a specific function, taking a seemingly arbitrary number as first argument, and a function pointer as second argument. This function stores a new node to a global hashtable allocated in the main() function. We guess this allows to efficiently lookup VM instruction handlers from their respective opcode.

The other function refering to this global hashtable looks up an instruction handler from a given opcode, and is called in the main program loop, matching the way a classic VM works.

Some more reverse engineering leads us to discover the way instructions are being executed:

  1. Initialize VM CPU
  2. While cpu->loop is true
  3. Fetch next opcode
  4. Get instruction handler from opcode
  5. Execute handler, giving it VM CPU structure as parameter
  6. Go to 2

After reverse engineering the whole binary, the recovered VM CPU structure looks like the following:

struct vmv_cpu {
  /* Stack buffer */
  int32_t *stack;
  /* Generic registers */
  int32_t r0_r6[7];
  /* Program counter, pointing to the next instruction */
  int32_t pc;
  /* More on that later */
  int32_t tape_cursor;
  /* Link register, holding function return address */
  int32_t lr;
  /* Generic registers */
  int32_t r10_r11[2];
  /* Stack pointer */
  int32_t sp;
  /* Generic registers */
  int32_t r13_r19[7];
  /* Bytecode buffer */
  int32_t const *bytecode;
  /* More on that later */
  int32_t *tape;
  int32_t tape_cursor_next;
  /* Compiler padding */
  uint8_t _pad0[4];
  /* ROM buffer - more on that later */
  int32_t const *rom;
  size_t rom_cursor;
  /* Boolean for the main loop */
  int32_t loop;
  /* Exit code */
  int32_t exit_value;
  /* Has 'putchar' instruction been called? */
  int32_t displayed;
  /* Compiler padding */
  uint8_t _pad1[4];
};

We are facing a 32-bit virtual machine, containing common CPU elements such as a program counter and a link register (similar to ARM), a 4096-byte stack along with its pointer, and 16 general-purpose registers. More information about VMV-specific data structures is coming afterwards.

The bytecode buffer is base64-decoded inside the main() function, and passed as a parameter to the CPU initialization function.

Instructions

Instructions get their operands from the bytecode, can fetch arguments from the stack and optionally push a result onto the stack.

The following instructions are supported:

  • xor, add, and, mul: fetch two immediate parameters on the stack; All but xor compute operation results modulo 2^31 - 1;
  • mod: fetches a register value and an immediate parameter;
  • call: saves pc to lr and adds an immediate operand to pc;
  • ret: returns from a procedure, setting pc to lr;
  • push: pushes a value onto the stack; Comes in two variants, taking either a register or an immediate operand;
  • pop: pops a value from the stack into a register;
  • inc, dec: respectively increment / decrement a register;
  • jmp: adds an immediate value relative to pc;
  • je, jne: compares two immediate parameters, conditionally jumping to the operand;
  • tape_store, tape_fetch, tape_shift: more on that later;
  • exit: cleanly exits the VM using an exit value stored in a given register operand.
  • die: 3 duplicate implementations of this instruction are present, abruptly exiting the VM with an error code set to EXIT_FAILURE.

Instructions fetch register operands using an helper function applying an exclusive OR operation to the fetched operand: (value ^ 0x7B) & 0x3F. It is then used as an index to the start of cpu->r0_r6 array.

Tape

The tape is a read / write memory reachable from within the virtual machine, the latter providing instructions to fetch, store data (tape_fetch, tape_store), and shift the next tape cursor (tape_shift). It is allocated in the CPU initialization function, as an array of 0x2000000 32-bit values.

tape_shift has an interesting implementation, as it actually sets tape_cursor to tape_cursor_next, then shifts the latter. This allows independent code to not overlap, providing a way to dynamically reserve memory space inside the tape (but no way to free it).

ROM

A read-only memory is available to the CPU, initialized in the main() function. A single rom_fetch instruction allows to retreive the next 32-bit value pointed to by rom_cursor, which is then incremented.

We notice the 16 bytes long argument given to the binary is concatenated to some decoded data before being passed to the CPU initialization function as the ROM buffer. It will most likely be used to influence the bytecode behavior at runtime.

## User interaction

A putchar instruction allows to display an ASCII character contained in a given register. It also sets the displayed flag, preventing the main loop waiting message to be shown again.

Bytecode analysis

Using the knowledge we gained from reverse engineering, we can extract the bytecode from the virtual machine and disassemble it using a custom Python script.

   0x0    tape_shift  0x400
   0x8          push  tape_cursor
  0x10           pop  r16
  0x18          push  0x11
  0x20           pop  tape_cursor
  0x28    tape_shift  tape_cursor
  0x30          push  tape_cursor
  0x38           pop  r3
  0x40     rom_fetch  r0
  0x48    tape_shift  r0
  0x50          push  tape_cursor
  0x58           pop  r14
  0x60          push  r3
  0x68          push  0x0
  0x70           add
		   ...

Reverse engineering the whole bytecode leads us to a very interesting result: this seems to be another virtual machine implementation, very similar to the top one.

This is the start of the reconstructed pseudocode:

// Allocating 0x400 bytes for the virtual stack
tape_shift(0x400);
// Stack pointer
r16 = tape_cursor;

tape_cursor = 0x11;
// (virtual) CPU state allocation
tape_shift(tape_cursor);
// Contains state base index
r3 = tape_cursor;

// Fetch ROM u32 = bytecode size
r0 = rom_fetch();
tape_shift(r0);
// Base bytecode index
r14 = tape_cursor;

// Set PC to bytecode start
tape_cursor = r3 + 0;
tape_store(r14);

// Copy bytecode to tape
r1 = 0;
while (r1 != r0) {
    r5 = rom_fetch();
    tape_cursor = r14 + r1;
    tape_store(r5);
    ++r1;
}

// XOR key for 'putchar'
r2 = 0x8929d1e;
// XOR key for register indexes (actually unused via r11, hardcoded)
r11 = 0x39ee8310;

Architecture

The memory allocated by the nested VM on the tape has this layout:

0x0      0x400   0x411       0x411 + size
 +---------+-------+---------------+
 |  Stack  |  CPU  |    bytecode   |
 +---------+-------+---------------+

Register r3 contains the CPU base index (0x400), matching the following data structure:

struct vcpu {
  uint32_t pc;
  uint32_t r1_r4[3];
  uint32_t lr;
  uint32_t mem_pointer;
  uint32_t r6_r16[10];
};

Register r14 holds the bytecode base address, r16 is the stack pointer. They do not belong to the virtual CPU data structure because they do not need to, as they are only relevant to the nested VM CPU implementation.

The rom_fetch instruction is first called once to retreive the size of the bytecode to execute. The bytecode is then copied to the tape in order to allow reading bytes arbitrarily.

Regarding calling conventions, functions pop arguments pushed to the stack and the return value is stored in the r0 register.

An example function is cpu_fetch_u32(), fetching the next 32-bit value from the bytecode buffer stored in the tape.

// 0xdc0
cpu_fetch_u32()
{
    // Fetch PC
    tape_cursor = r3 + 0;
    tape_cursor = tape_fetch();
    // Fetch u32 at this location
    r0 = tape_fetch();

    // Increment PC
    r5 = tape_cursor + 1;
    tape_cursor = r3 + 0;
    tape_store(r5);
    // r0 is return value by convention
    return r0;
}

The main loop and instructions are very similar to the top-level VM, easying our reverse engineering process. The main differences comes from opcode values and XOR keys being applied before a call to putchar or after fetching the value of a register operand.

while (true) {
    switch (cpu_fetch_u32()) {
	case 0x952db75f: goto vinsn_exit_reg;
	case 0x140c2cf8: goto vinsn_dec;
	// ...
	default: die(1);
    }
}

Nested virtual machines

From our understanding of this first bytecode, we can adapt our custom disassembler to help us reverse engineer the nested bytecode stored in the ROM.

0x0    0x4           0xe60  0xe64         0x1cc0   0x2f4c        0x2f5c
 +------+--------------+------+--------------+-------+--------------+
 | size |  bytecode 1  | size |  bytecode 2  |  ...  |  User input  |
 +------+--------------+------+--------------+-------+--------------+

			    ROM format

We dump the ROM, and disassemble the next bytecode in order to dive one level further. A quick analysis shows us the bytecode is strictly identical to the previous nested VM, except for some constants (XOR keys, opcodes) and registers holding tapes indexes, such as the stack pointer or bytecode base.

Every virtual machine allocates the same amount of bytes for their implementation usage: stack, CPU, bytecode buffer. The tape memory looks this way after three levels of nesting:

  +------------------------+------------------------+-------
  |         VM 1           |         VM 2           |  VM 3
  +-------+-----+----------+-------+-----+----------+-------
  | stack | CPU | bytecode | stack | CPU | bytecode |  ...
  +-------+-----+----------+-------+-----+----------+-------

We can adapt our custom nested disassembler again, and associate an array of nested opcodes to each instruction. The process keeps on going until we reach the final bytecode, being the actual check being executed against the user input originally given to the binary as argument. There is a total of 4 nested VMs, explaining the struggle our processor has to execute this binary :)

Final bytecode

In the final bytecode, the ROM cursor points to the beginning of provided user input. Four 32-bit values are fetched from it. The following reconstructed pseudocode allows us to establish the constraints we need to respect in order to have a victory message printed:

// Successful constraints counter
r16 = 0;
// First four bytes
if (rom_fetch() * 0x117052c0 == 1) {
    ++r16;
}

// Next four input bytes
r14 = rom_fetch();
if (r14 % 0x77f3 != 0x4926 ||
    r14 % 0x7c49 != 0x3159) {
    // Fail
    r16 = 0;
}

// Next four input bytes
if (rom_fetch() * 0x278bce9d == 1) {
    ++r16;
}

// Last four input bytes
r14 = rom_fetch();
if (r14 % 0x77f3 != 0x28b2 ||
    r14 % 0x7c49 != 0x44a9) {
    // Fail
    r16 = 0;
}

if (r16 == 2) {
    print("Congratulations, you won. Validate with FCSC{<input>}\n");
    exit(0);
} else {
    print("Noooooo, damn you lost!\n");
    exit(1);
}

Reminding the specificity about mul (result is computed modulo 2^31 - 1), we solve the constraints using Wolfram Alpha and end up with the following password:

$ ./vmv 'w3NeEdT0gODe3p3r'
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
Congratulations, you won. Validate with FCSC{<input>}
./vmv 'w3NeEdT0gODe3p3r'  193.84s user 0.20s system 99% cpu 3:15.13 total

The flag is FCSC{w3NeEdT0gODe3p3r}.

Notes

I wrote a small IDA script to generate breakpoints dumping the top-level VM state for each bytecode instruction to ease the dynamic debugging process. As you may have noticed, this hasn't proven to be very useful to solve the challenge, but I enjoyed writing these small instrumentation tools.

I also tried to first solve the challenge by setting a hardware read watchpoint on the user input being appended to the ROM in order to see what operations were done on it. The nested virtual machines however totally obfuscated the checking process and I gave up trying to figure out the necessary constraints, the debugger overhead also being very heavy when setting a software breakpoint, due to the nature of the nested VMs and the huge number of operations being processed.

angr also didn't seem to be willing to solve the constraints being applied to the binary argument.

#!/usr/bin/env python3
import argparse
import functools
import sys
import struct
import enum
import typing
import io
from dataclasses import dataclass
class OperandType(enum.Enum):
IMM = enum.auto()
REG = enum.auto()
REL = enum.auto()
@dataclass
class Operand:
type: OperandType
value: int
@dataclass
class InsnMeta:
opcode: int
mnem: str
operands: typing.List[OperandType]
def size(self) -> int:
return 4 * (1 + len(self.operands))
@dataclass
class Insn:
meta: InsnMeta
addr: int
operands: typing.List[Operand]
@property
def pc(self):
return self.addr + self.meta.size()
def op_str(self, op):
if self.meta.mnem == 'putchar' and op.type == OperandType.IMM:
byte = functools.reduce(lambda a, b: a ^ b,
Disassembler.PUTCHAR_XOR, op.value) & 0xff
return repr(chr(byte))
if op.type == OperandType.REG:
return f'r{op.value}'
elif op.type == OperandType.IMM:
return hex(op.value if op.value >= 0 else 2**32 + op.value)
elif op.type == OperandType.REL:
return hex(self.pc + op.value * 4)
class Insns(enum.Enum):
exit_reg = InsnMeta([0x952db75f, 0x72c7f2f8, 0x2f206b0b, 0xab243ec5],
"exit", [OperandType.REG])
dec = InsnMeta([0x140c2cf8, 0x66245e09, 0xb8ea3ec1, 0x437919f9], "dec",
[OperandType.REG])
ret = InsnMeta([0x4517cc48, 0x8001e4c7, 0x371ee5a6, 0xd077bb5e], "ret", [])
fetch_mem = InsnMeta([0xe80c7be2, 0xc0459d50, 0x5277de4d, 0x87d05da9],
"fetch_mem", [OperandType.REG])
call = InsnMeta([0x950885b3, 0x4edce4c, 0x1a08a46e, 0x8bf4943f], "call",
[OperandType.REL])
putchar_reg = InsnMeta([0xf85712c3, 0x53e8cf12, 0xf82ba570, 0xd365d099],
"putchar", [OperandType.REG])
jmp = InsnMeta([0xd708325c, 0x6fe6d225, 0xa7f8d895, 0x4020f282], "jmp",
[OperandType.REL])
xor = InsnMeta([0x1688047, 0x497f012f, 0xca7dcd4e, 0x9684106d], "xor", [])
mul = InsnMeta([0x54d4c1e6, 0x525b9eb7, 0x6de1c5fd, 0x723c8b46], "mul", [])
and_ = InsnMeta([0xa9475b13, 0x25846339, 0x16be1ae2, 0x6f1bb4d], "and", [])
inc = InsnMeta([0x29cc50c7, 0xad2d69fe, 0x64f31e5b, 0x7861f11], "inc",
[OperandType.REG])
je = InsnMeta([0xbf5e62ba, 0xdfa93118, 0x2f6e151d, 0x59d310e5], "je",
[OperandType.REL])
tape_store = InsnMeta([0x81761c59, 0x317b5ad6, 0x6877d9f8, 0x2c998d3c],
"tape_store", [OperandType.REG])
add = InsnMeta([0x2445bafc, 0x6ad4019a, 0xff957dd8, 0xd1befc4e], "add", [])
push_imm = InsnMeta([0x37d5991b, 0xb33f88f0, 0xe0b77ad0, 0xafd8c046],
"push", [OperandType.IMM])
rom_fetch = InsnMeta([0xeef14b6e, 0x7c5cdc7d, 0x355fca70, 0x8b29d9a0],
"rom_fetch", [OperandType.REG])
push_reg = InsnMeta([0x3bd7549b, 0x8ecfaa1e, 0x6808acc9, 0x84f16f5c],
"push", [OperandType.REG])
jne = InsnMeta([0x56ae0803, 0xff95601b, 0x13eb7cf1, 0x94c4b853], "jne",
[OperandType.REL])
putchar_imm = InsnMeta([0xb46465f, 0xd2e1893d, 0x6675dc16, 0xf4afe29b],
"putchar", [OperandType.IMM])
exit_imm = InsnMeta([0x8032f32e, 0xd2717ddc, 0x1fd474eb, 0xe3e320fb],
"exit", [OperandType.IMM])
tape_shift_imm = InsnMeta([0x87ab3b02, 0xe8eb3552, 0x5c03ff97, 0x5a9c2232],
"tape_shift", [OperandType.IMM])
tape_shift_reg = InsnMeta([0xfb0a90ec, 0x4d093f79, 0xd3d129a5, 0xa856d0d6],
"tape_shift", [OperandType.REG])
pop = InsnMeta([0xee68c600, 0x7280820f, 0xcac47105, 0xa6f7b9b2], "pop",
[OperandType.REG])
mod = InsnMeta([0x5dcc45a4, 0xa1d88221, 0x5dd2c9f3, 0x91685a], "mod",
[OperandType.REG])
invalid = InsnMeta([0x0, 0x0, 0x0, 0x0], "(invalid)", [])
class Disassembler:
# Fast lookup
REG_XOR = [0x10, 0x1f, 0xfb, 0x95]
# Only for final VM
PUTCHAR_XOR = [0x1e, 0xbc, 0x01, 0x63, 0xa3]
def __init__(self, fp, nest_lvl):
if not 1 <= nest_lvl <= len(self.REG_XOR):
raise ValueError(
f'Nesting level must be between 1 and {len(self.REG_XOR)}')
self._fp = fp
self._nest_lvl = nest_lvl - 1
self.opcode_to_meta = {
insn.value.opcode[nest_lvl - 1]: insn.value
for insn in Insns
}
# Fetch correct bytecode
for n in range(nest_lvl):
size = self.__read_u32() * 0x4
data = fp.read(size)
self._fp = io.BytesIO(data)
def __read(self, n):
data = self._fp.read(n)
if len(data) != n:
raise EOFError()
return data
def __read_u32(self):
return struct.unpack('<I', self.__read(4))[0]
def __read_i32(self):
return struct.unpack('<i', self.__read(4))[0]
def __iter_operands(self, meta):
for op_type in meta.operands:
if op_type == OperandType.REG:
value = (self.__read_u32()
^ self.REG_XOR[self._nest_lvl]) & 0x3f
elif op_type in (OperandType.REL, OperandType.IMM):
value = self.__read_i32()
yield Operand(op_type, value)
def disassemble(self):
while True:
addr = self._fp.tell()
try:
opcode = self.__read_u32()
except EOFError:
return
meta = self.opcode_to_meta.get(opcode, Insns.invalid.value)
operands = list(self.__iter_operands(meta))
yield Insn(meta, addr, operands)
INSN_MAX_MNEMLEN = max(len(insn.value.mnem) for insn in Insns)
def print_insn(insn):
print(f'{insn.addr:-#6x} ', end='')
print(insn.meta.mnem.rjust(INSN_MAX_MNEMLEN, ' '), end='')
if insn.operands:
operands = ', '.join(map(insn.op_str, insn.operands))
print(f' {operands}', end='')
print()
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('-n',
'--nest',
default=1,
type=int,
help='Vm nesting level')
parser.add_argument('bytecode', help='File to disassemble')
return parser.parse_args()
def main():
args = parse_args()
with open(args.bytecode, 'rb') as fp:
disas = Disassembler(fp, nest_lvl=args.nest)
for insn in disas.disassemble():
print_insn(insn)
if __name__ == '__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment