We are facing a stripped 64-bit ELF binary. It accepts a command-line parameter, most likely the password to check. We notice the binary takes a long time to execute before notifying us of a failure.
As the name suggests, we will assume the binary is an implementation of a custom virtual machine. I mainly analyzed this binary using a static approach.
Looking at the main() function, we notice a lot of calls to a specific
function, taking a seemingly arbitrary number as first argument, and a
function pointer as second argument.
This function stores a new node in a global hashtable allocated in the main()
function. We guess this allows efficient lookup of VM instruction handlers
from their respective opcodes.
The other function referring to this global hashtable looks up an instruction handler from a given opcode, and is called in the main program loop, matching the way a classic VM works.
Some more reverse engineering leads us to discover the way instructions are being executed:
1. Initialize VM CPU
2. While cpu->loop is true
3. Fetch next opcode
4. Get instruction handler from opcode
5. Execute handler, giving it VM CPU structure as parameter
6. Go to 2
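
The loop above can be sketched in Python as follows. This is a toy model, not the binary's code: the Cpu class, the two handlers and the opcode values 0x01 and 0x02 are made up for illustration, and a plain dict stands in for the global hashtable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cpu:
    bytecode: List[int]
    pc: int = 0
    loop: bool = True
    exit_value: int = 0
    stack: List[int] = field(default_factory=list)


# Stand-in for the global hashtable: opcode -> instruction handler.
HANDLERS: Dict[int, Callable[[Cpu], None]] = {}


def register(opcode: int, handler: Callable[[Cpu], None]) -> None:
    """Mirror of the registration calls seen in main()."""
    HANDLERS[opcode] = handler


def fetch(cpu: Cpu) -> int:
    """Read the next 32-bit word of bytecode and advance pc."""
    value = cpu.bytecode[cpu.pc]
    cpu.pc += 1
    return value


def op_push_imm(cpu: Cpu) -> None:
    # Hypothetical handler: push an immediate operand onto the stack.
    cpu.stack.append(fetch(cpu))


def op_exit(cpu: Cpu) -> None:
    # Hypothetical handler: stop the loop, exit value taken from the stack.
    cpu.exit_value = cpu.stack.pop()
    cpu.loop = False


register(0x01, op_push_imm)  # made-up opcode
register(0x02, op_exit)      # made-up opcode


def run(cpu: Cpu) -> int:
    while cpu.loop:
        opcode = fetch(cpu)          # fetch next opcode
        handler = HANDLERS[opcode]   # look up its handler
        handler(cpu)                 # execute it with the CPU state
    return cpu.exit_value
```

Running `run(Cpu([0x01, 42, 0x02]))` pushes 42 and then exits with it.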
After reverse engineering the whole binary, the recovered VM CPU structure looks like the following:
struct vmv_cpu {
/* Stack buffer */
int32_t *stack;
/* Generic registers */
int32_t r0_r6[7];
/* Program counter, pointing to the next instruction */
int32_t pc;
/* More on that later */
int32_t tape_cursor;
/* Link register, holding function return address */
int32_t lr;
/* Generic registers */
int32_t r10_r11[2];
/* Stack pointer */
int32_t sp;
/* Generic registers */
int32_t r13_r19[7];
/* Bytecode buffer */
int32_t const *bytecode;
/* More on that later */
int32_t *tape;
int32_t tape_cursor_next;
/* Compiler padding */
uint8_t _pad0[4];
/* ROM buffer - more on that later */
int32_t const *rom;
size_t rom_cursor;
/* Boolean for the main loop */
int32_t loop;
/* Exit code */
int32_t exit_value;
/* Has 'putchar' instruction been called? */
int32_t displayed;
/* Compiler padding */
uint8_t _pad1[4];
};

We are facing a 32-bit virtual machine, containing common CPU elements such as a program counter and a link register (similar to ARM), a 4096-byte stack along with its pointer, and 16 general-purpose registers. More information about VMV-specific data structures follows.
The bytecode buffer is base64-decoded inside the main() function, and passed
as a parameter to the CPU initialization function.
Instructions get their operands from the bytecode, can fetch arguments from the stack and optionally push a result onto the stack.
The following instructions are supported:
- xor, add, and, mul: fetch two immediate parameters from the stack; all but xor compute operation results modulo 2^31 - 1;
- mod: fetches a register value and an immediate parameter;
- call: saves pc to lr and adds an immediate operand to pc;
- ret: returns from a procedure, setting pc to lr;
- push: pushes a value onto the stack; comes in two variants, taking either a register or an immediate operand;
- pop: pops a value from the stack into a register;
- inc, dec: respectively increment / decrement a register;
- jmp: adds an immediate value to pc (relative jump);
- je, jne: compare two immediate parameters, conditionally jumping to the operand;
- tape_store, tape_fetch, tape_shift: more on that later;
- exit: cleanly exits the VM using an exit value stored in a given register operand;
- die: 3 duplicate implementations of this instruction are present, abruptly exiting the VM with an error code set to EXIT_FAILURE.
Instructions fetch register operands using a helper function applying an
exclusive OR operation to the fetched operand: (value ^ 0x7B) & 0x3F. The
result is then used as an index from the start of the cpu->r0_r6 array.
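
As an illustration, the decoding can be written in a few lines of Python. The function names are hypothetical; the special register names are derived from the field order of the recovered structure (pc at index 7, tape_cursor at 8, lr at 9, sp at 12).

```python
# Names for the slots that are not plain general-purpose registers,
# taken from the field order of struct vmv_cpu.
SPECIAL_REGISTERS = {7: "pc", 8: "tape_cursor", 9: "lr", 12: "sp"}


def decode_register_index(operand: int) -> int:
    """Apply the helper's transformation: (value ^ 0x7B) & 0x3F."""
    return (operand ^ 0x7B) & 0x3F


def register_name(operand: int) -> str:
    """Human-readable name for an encoded register operand."""
    index = decode_register_index(operand)
    return SPECIAL_REGISTERS.get(index, f"r{index}")


print(register_name(0x7B))  # index 0
print(register_name(0x73))  # index 8
```

Because of the 0x3F mask, only the low six bits of the operand matter.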
The tape is a read/write memory reachable from within the virtual machine,
which provides instructions to fetch and store data (tape_fetch,
tape_store) and to shift the next tape cursor (tape_shift). It is allocated
in the CPU initialization function, as an array of 0x2000000 32-bit values.
tape_shift has an interesting implementation, as it actually sets
tape_cursor to tape_cursor_next, then shifts the latter.
This keeps independent pieces of code from overlapping, providing a way to
dynamically reserve memory space inside the tape (but no way to free it).
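
A minimal Python model of this behavior, which amounts to a bump allocator. The Tape class is hypothetical, it uses a sparse dict instead of a real 0x2000000-entry array, and it assumes unwritten cells read as zero; any bounds checking done by the binary is not modeled.

```python
class Tape:
    """Toy model of the VM tape and its two cursors."""

    def __init__(self) -> None:
        self.cells = {}        # sparse storage: index -> 32-bit value
        self.cursor = 0        # tape_cursor
        self.cursor_next = 0   # tape_cursor_next

    def shift(self, count: int) -> None:
        # tape_shift: the current cursor takes the old "next" value,
        # then "next" moves past the newly reserved area.
        self.cursor = self.cursor_next
        self.cursor_next += count

    def store(self, value: int) -> None:
        # tape_store: write a 32-bit value at the cursor.
        self.cells[self.cursor] = value & 0xFFFFFFFF

    def fetch(self) -> int:
        # tape_fetch: read the value at the cursor (assumed 0 if unset).
        return self.cells.get(self.cursor, 0)


tape = Tape()
tape.shift(0x400)   # reserves [0x0, 0x400), cursor = 0x0
tape.shift(0x11)    # reserves [0x400, 0x411), cursor = 0x400
```

Each call to shift returns, through the cursor, the base of a fresh region that no earlier caller was given.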
A read-only memory is available to the CPU, initialized in the main()
function. A single rom_fetch instruction retrieves the next 32-bit
value pointed to by rom_cursor, which is then incremented.
We notice the 16-byte argument given to the binary is concatenated to some decoded data before being passed to the CPU initialization function as the ROM buffer. It will most likely be used to influence the bytecode behavior at runtime.
## User interaction
A putchar instruction displays an ASCII character contained
in a given register. It also sets the displayed flag, preventing the main
loop's waiting message from being shown again.
Using the knowledge we gained from reverse engineering, we can extract the bytecode from the virtual machine and disassemble it using a custom Python script.
0x0 tape_shift 0x400
0x8 push tape_cursor
0x10 pop r16
0x18 push 0x11
0x20 pop tape_cursor
0x28 tape_shift tape_cursor
0x30 push tape_cursor
0x38 pop r3
0x40 rom_fetch r0
0x48 tape_shift r0
0x50 push tape_cursor
0x58 pop r14
0x60 push r3
0x68 push 0x0
0x70 add
...

Reverse engineering the whole bytecode leads us to a very interesting result: this seems to be another virtual machine implementation, very similar to the top-level one.
This is the start of the reconstructed pseudocode:
// Allocating 0x400 bytes for the virtual stack
tape_shift(0x400);
// Stack pointer
r16 = tape_cursor;
tape_cursor = 0x11;
// (virtual) CPU state allocation
tape_shift(tape_cursor);
// Contains state base index
r3 = tape_cursor;
// Fetch ROM u32 = bytecode size
r0 = rom_fetch();
tape_shift(r0);
// Base bytecode index
r14 = tape_cursor;
// Set PC to bytecode start
tape_cursor = r3 + 0;
tape_store(r14);
// Copy bytecode to tape
r1 = 0;
while (r1 != r0) {
r5 = rom_fetch();
tape_cursor = r14 + r1;
tape_store(r5);
++r1;
}
// XOR key for 'putchar'
r2 = 0x8929d1e;
// XOR key for register indexes (actually unused via r11, hardcoded)
r11 = 0x39ee8310;

The memory allocated by the nested VM on the tape has this layout:
0x0       0x400   0x411           0x411 + size
+---------+-------+---------------+
| Stack | CPU | bytecode |
+---------+-------+---------------+
Register r3 contains the CPU base index (0x400), matching the following data
structure:
struct vcpu {
uint32_t pc;
uint32_t r1_r3[3];
uint32_t lr;
uint32_t mem_pointer;
uint32_t r6_r16[11];
};

Register r14 holds the bytecode base address, and r16 is the stack pointer.
They are not part of the virtual CPU data structure, as they are only
relevant to the nested VM CPU implementation.
The rom_fetch instruction is first called once to retrieve the size of the
bytecode to execute. The bytecode is then copied to the tape so that it can
be read at arbitrary positions.
Regarding calling conventions, functions pop arguments pushed to the stack and
the return value is stored in the r0 register.
An example function is cpu_fetch_u32(), fetching the next 32-bit value from
the bytecode buffer stored in the tape.
// 0xdc0
cpu_fetch_u32()
{
// Fetch PC
tape_cursor = r3 + 0;
tape_cursor = tape_fetch();
// Fetch u32 at this location
r0 = tape_fetch();
// Increment PC
r5 = tape_cursor + 1;
tape_cursor = r3 + 0;
tape_store(r5);
// r0 is return value by convention
return r0;
}

The main loop and instructions are very similar to the top-level VM, easing
our reverse engineering process. The main differences come from opcode values
and from XOR keys being applied before a call to putchar or after fetching the
value of a register operand.
while (true) {
switch (cpu_fetch_u32()) {
case 0x952db75f: goto vinsn_exit_reg;
case 0x140c2cf8: goto vinsn_dec;
// ...
default: die(1);
}
}

From our understanding of this first bytecode, we can adapt our custom disassembler to help us reverse engineer the nested bytecode stored in the ROM.
0x0    0x4            0xe60  0xe64          0x1cc0  0x2f4c         0x2f5c
+------+--------------+------+--------------+-------+--------------+
| size |  bytecode 1  | size |  bytecode 2  |  ...  |  User input  |
+------+--------------+------+--------------+-------+--------------+
ROM format
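
The layout above can be walked with a short Python helper. This is a sketch under assumptions: words are little-endian, each size field counts 32-bit words rather than bytes (as the copy loop in the pseudocode suggests), and the 16-byte user input sits at the very end. split_rom is a hypothetical name, not part of the original tooling.

```python
import struct
from typing import List, Tuple


def split_rom(rom: bytes, input_len: int = 16) -> Tuple[List[List[int]], bytes]:
    """Split a dumped ROM into its size-prefixed bytecode blobs.

    Returns the list of blobs (each a list of 32-bit words) and the
    trailing bytes, expected to be the user input.
    """
    blobs = []
    offset = 0
    end = len(rom) - input_len
    while offset < end:
        # Size prefix: number of 32-bit words in the next bytecode (assumed).
        (count,) = struct.unpack_from("<I", rom, offset)
        offset += 4
        words = list(struct.unpack_from(f"<{count}I", rom, offset))
        offset += 4 * count
        blobs.append(words)
    return blobs, rom[offset:]
```

Each returned blob can then be handed to the disassembler for the matching nesting level.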
We dump the ROM, and disassemble the next bytecode in order to dive one level further. A quick analysis shows us the bytecode is strictly identical to the previous nested VM, except for some constants (XOR keys, opcodes) and the registers holding tape indexes, such as the stack pointer or bytecode base.
Every virtual machine reserves the same amount of tape memory for its own use: stack, CPU, bytecode buffer. The tape memory looks this way after three levels of nesting:
+------------------------+------------------------+-------
|          VM 1          |          VM 2          |  VM 3
+-------+-----+----------+-------+-----+----------+-------
| stack | CPU | bytecode | stack | CPU | bytecode | ...
+-------+-----+----------+-------+-----+----------+-------
We can adapt our custom nested disassembler again, and associate an array of nested opcodes to each instruction. The process keeps on going until we reach the final bytecode, which is the actual check executed against the user input originally given to the binary as argument. There is a total of 4 nested VMs, which explains why our processor struggles to execute this binary :)
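
One possible shape for such a table is sketched below. Only the two opcodes shown in the dispatch switch of the first nested VM are real; the None placeholders stand for values not listed in this write-up, the number of columns is arbitrary, and the names NESTED_OPCODES and build_decoder are hypothetical.

```python
from typing import Dict, List, Optional

# mnemonic -> opcode at each nesting level (index 0 = first nested VM).
NESTED_OPCODES: Dict[str, List[Optional[int]]] = {
    "exit_reg": [0x952DB75F, None, None],
    "dec": [0x140C2CF8, None, None],
}


def build_decoder(level: int) -> Dict[int, str]:
    """Reverse lookup table (opcode -> mnemonic) for one nesting level."""
    return {
        opcodes[level]: name
        for name, opcodes in NESTED_OPCODES.items()
        if opcodes[level] is not None
    }


decoder = build_decoder(0)
print(decoder[0x140C2CF8])
```

Keeping one row per instruction makes it easy to fill in a new column each time a deeper VM is reversed.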
In the final bytecode, the ROM cursor points to the beginning of the provided user input. Four 32-bit values are fetched from it. The following reconstructed pseudocode lets us establish the constraints to satisfy for the victory message to be printed:
// Successful constraints counter
r16 = 0;
// First four bytes
if (rom_fetch() * 0x117052c0 == 1) {
++r16;
}
// Next four input bytes
r14 = rom_fetch();
if (r14 % 0x77f3 != 0x4926 ||
r14 % 0x7c49 != 0x3159) {
// Fail
r16 = 0;
}
// Next four input bytes
if (rom_fetch() * 0x278bce9d == 1) {
++r16;
}
// Last four input bytes
r14 = rom_fetch();
if (r14 % 0x77f3 != 0x28b2 ||
r14 % 0x7c49 != 0x44a9) {
// Fail
r16 = 0;
}
if (r16 == 2) {
print("Congratulations, you won. Validate with FCSC{<input>}\n");
exit(0);
} else {
print("Noooooo, damn you lost!\n");
exit(1);
}

Recalling that mul computes its result modulo 2^31 - 1, we solve the
constraints using Wolfram Alpha and end up with the following password:
$ ./vmv 'w3NeEdT0gODe3p3r'
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
[fr] Ne quittez pas, un correspondant va prendre votre appel... [\fr]
Congratulations, you won. Validate with FCSC{<input>}
./vmv 'w3NeEdT0gODe3p3r' 193.84s user 0.20s system 99% cpu 3:15.13 total

The flag is FCSC{w3NeEdT0gODe3p3r}.
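
For reference, the four constraints can also be solved locally instead of with Wolfram Alpha. The sketch below assumes that each rom_fetch() reads the input as a little-endian 32-bit word, that mul is a plain product reduced modulo 2^31 - 1, and that the password is printable ASCII; the helper names are hypothetical. It needs Python 3.8+ for the modular inverse via pow().

```python
from itertools import product

M31 = 2**31 - 1  # modulus applied by the VM's mul instruction


def mul_candidates(k: int) -> list:
    """All 32-bit x such that x * k == 1 (mod 2^31 - 1)."""
    x = pow(k, -1, M31)
    return list(range(x, 2**32, M31))


def crt_candidates(r1: int, m1: int, r2: int, m2: int) -> list:
    """All 32-bit x with x % m1 == r1 and x % m2 == r2 (m1, m2 coprime)."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    x = (r1 + m1 * t) % (m1 * m2)
    return list(range(x, 2**32, m1 * m2))


def printable_chunks(values: list) -> list:
    """Keep only values whose four little-endian bytes are printable ASCII."""
    chunks = (v.to_bytes(4, "little") for v in values)
    return [c for c in chunks if all(0x20 <= b < 0x7F for b in c)]


chunks = [
    printable_chunks(mul_candidates(0x117052C0)),
    printable_chunks(crt_candidates(0x4926, 0x77F3, 0x3159, 0x7C49)),
    printable_chunks(mul_candidates(0x278BCE9D)),
    printable_chunks(crt_candidates(0x28B2, 0x77F3, 0x44A9, 0x7C49)),
]

# Print every printable combination of the four chunks.
for parts in product(*chunks):
    print(b"".join(parts).decode("ascii"))
```

The modular constraints only pin each word down modulo 0x77f3 * 0x7c49 (or 2^31 - 1), so several 32-bit values can satisfy them; the printable filter is what narrows the candidates down.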
I wrote a small IDA script to generate breakpoints dumping the top-level VM state for each bytecode instruction to ease the dynamic debugging process. As you may have noticed, this hasn't proven to be very useful to solve the challenge, but I enjoyed writing these small instrumentation tools.
I first tried to solve the challenge by setting a hardware read watchpoint on the user input appended to the ROM, in order to see what operations were done on it. However, the nested virtual machines totally obfuscated the checking process, and I gave up trying to figure out the necessary constraints this way. The debugger overhead was also very heavy when setting a software breakpoint, due to the nature of the nested VMs and the huge number of operations being processed.
angr also didn't seem to be willing to solve the
constraints being applied to the binary argument.