@ruvnet
Last active April 4, 2026 15:22
RVM Hypervisor Core — Deep Research Report | Repo: github.com/ruvnet/rvm

RVM — The Virtual Machine Built for the Agentic Age


Agents don't fit in VMs. They need something that understands how they think.

Part of the RuVector ecosystem. Uses RuVix kernel primitives and RVF package format. Designed for Cognitum Seed, Appliance, and future chip targets.

Traditional hypervisors were built for an era of static server workloads — long-running VMs with predictable resource needs. AI agents are different. They spawn in milliseconds, communicate in dense, shifting graphs, share context across trust boundaries, and die without warning. VMs are the wrong abstraction.

RVM replaces VMs with coherence domains — lightweight, graph-structured partitions whose isolation, scheduling, and memory placement are driven by how agents actually communicate. When two agents start talking more, RVM moves them closer. When trust drops, RVM splits them apart. Every mutation is proof-gated. Every action is witnessed. The system understands its own structure.

Agent swarm → [RVM Coherence Engine] → Optimal Placement → Witness Proof
                    ↑                                            │
                    └──── Agent Communication Graph ────────────┘
                          (< 50µs adaptive re-partitioning)

No KVM. No Linux. No VMs. Bare-metal Rust. Built for agents.

Traditional VM:     VM₁  VM₂  VM₃  VM₄    (static, opaque boxes — agents don't fit)
                    ─────────────────────
RVM:                ┌─A──B─┐  ┌─C─┐  D    (dynamic, agent-driven domains)
                    │  ↔   │──│ ↔ │──↔    (edges = agent communication weight)
                    └──────┘  └───┘        (auto-split when trust or coupling changes)

What Agents Need vs What They Get

| What Agents Need | VMs / Containers | RVM |
| --- | --- | --- |
| Sub-millisecond spawn | Seconds to boot | < 10µs partition switch |
| Dense, shifting comms graph | Static NIC-to-NIC | Graph-weighted CommEdges, auto-rebalanced |
| Shared context with isolation | All or nothing | Capability-gated shared memory, proof-checked |
| Per-agent fault containment | Whole-VM crash | F1–F4 graduated rollback, no reboot needed |
| Auditable every action | External log bolted on | 64-byte witness on every syscall, hash-chained |
| Hibernate and reconstruct | Kill and restart | Dormant tier → rebuilt from witness log |
| Run on 64KB MCUs | Needs gigabytes | Seed profile: 64KB–1MB, capability-enforced |

Why RVM?

Dynamic Re-isolation and Self-Healing Boundaries. Because RVM uses graph-theoretic mincut algorithms, it can dynamically restructure its isolation boundaries to match how workloads actually communicate. If an agent in one partition begins communicating heavily with an agent in another, RVM automatically triggers a partition split and migrates the agent to optimize placement — no manual configuration. No existing hypervisor can split or merge live partitions along a graph-theoretic cut boundary.
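A minimal sketch of this split trigger, assuming cut pressure is derived from the ratio of cross-boundary to total edge traffic; `CommEdge`, `CutDecision`, and `SPLIT_THRESHOLD` are illustrative names, not the real rvm-coherence API:

```rust
/// Hypothetical cut-pressure evaluation: traffic inside a coherence
/// domain vs. traffic crossing its boundary decides whether to split.
#[derive(Clone, Copy)]
pub struct CommEdge {
    pub internal: bool, // both endpoints inside the same domain
    pub weight: u32,    // messages per epoch
}

pub enum CutDecision {
    Keep,
    Split, // cross-boundary traffic dominates: repartition along the cut
}

/// Cut pressure as a per-mille ratio of cross-boundary to total traffic.
pub fn cut_pressure(edges: &[CommEdge]) -> u32 {
    let total: u32 = edges.iter().map(|e| e.weight).sum();
    if total == 0 {
        return 0;
    }
    let cross: u32 = edges.iter().filter(|e| !e.internal).map(|e| e.weight).sum();
    cross * 1000 / total
}

const SPLIT_THRESHOLD: u32 = 500; // assumed: >50% cross-boundary traffic

pub fn evaluate(edges: &[CommEdge]) -> CutDecision {
    if cut_pressure(edges) > SPLIT_THRESHOLD {
        CutDecision::Split
    } else {
        CutDecision::Keep
    }
}
```

In the real system this decision would run inside the DC-2 mincut budget of 50 µs per epoch; the sketch only shows the shape of the signal.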

Memory Time Travel and Deep Forensics. Traditional virtual memory permanently overwrites state or blindly swaps it to disk. RVM stores dormant memory as a checkpoint combined with a delta-compressed witness trail. Any historical state can be perfectly rebuilt on demand — days or weeks later — because every privileged action is recorded in a tamper-evident, hash-chained witness log. External forensic tools can reconstruct past states to answer precise questions such as "which task mutated this vector store between 14:00 and 14:05 on Tuesday?"

Targeted Fault Rollback Without Global Reboots. When the kernel detects a coherence violation or memory corruption it does not crash. Instead it finds the last known-good checkpoint, replays the witness log, explicitly skips the mutation that caused the failure, and resumes from a corrected state (DC-14, failure classes F1–F3).
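The replay-and-skip idea can be sketched as follows; the types are illustrative (the real kernel is no_std, so the `Vec` here stands in for whatever buffer the reconstruction path uses), and only the logic — checkpoint plus witness log minus the faulting mutation — comes from the text:

```rust
// Illustrative DC-14-style targeted rollback: replay the witness log
// from the last known-good checkpoint, skipping the mutation identified
// as the fault cause, and resume from the corrected state.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct WitnessId(pub u64);

pub enum Mutation {
    Store { addr: usize, value: u8 },
}

pub struct WitnessRecord {
    pub id: WitnessId,
    pub mutation: Mutation,
}

/// Rebuild state = checkpoint + (log − faulty record).
pub fn rollback(checkpoint: &[u8], log: &[WitnessRecord], faulty: WitnessId) -> Vec<u8> {
    let mut state = checkpoint.to_vec();
    for rec in log {
        if rec.id == faulty {
            continue; // F2/F3: skip only the mutation that caused the failure
        }
        match rec.mutation {
            Mutation::Store { addr, value } => state[addr] = value,
        }
    }
    state
}
```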

Deterministic Multi-Tenant Edge Orchestration. Existing edge orchestrators rely on Linux-based VMs or containers, inheriting the scheduler's unpredictability; none can guarantee bounded latency together with provable isolation. RVM enables scenarios such as an autonomous vehicle where safety-critical sensor-fusion agents (Reflex mode, < 10 µs switch) are strictly isolated from low-priority infotainment agents, or a smart factory floor running hard real-time PLC control loops safely alongside ML inference agents.

High-Assurance Security on Extreme Microcontrollers. Through its Seed hardware profile (ADR-138), RVM brings capability-enforced isolation, proof-gated execution, and witness attestation to deeply constrained IoT devices with as little as 64 KB of RAM. Delivering this level of zero-trust, auditable security on microcontroller-class hardware is a novel capability not provided by any existing embedded operating system.


Architecture

+----------------------------------------------------------+
|                       rvm-kernel                         |
|                                                          |
|  +-----------+  +-----------+  +------------+            |
|  | rvm-boot  |  | rvm-sched |  | rvm-memory |            |
|  +-----+-----+  +-----+-----+  +------+-----+            |
|        |              |               |                   |
|  +-----+--------------+---------------+------+            |
|  |               rvm-partition               |            |
|  +-----+---------+-----------+----------+----+            |
|        |         |           |          |                 |
|  +----+----+ +-----+-----+ +----+----+ +-----+------+    |
|  | rvm-cap | |rvm-witness| |rvm-proof| |rvm-security|    |
|  +----+----+ +-----+-----+ +----+----+ +-----+------+    |
|        |         |           |          |                 |
|  +-----+---------+-----------+----------+----+            |
|  |               rvm-types                   |            |
|  +-----+-------------------------------------+            |
|        |                                                  |
|  +-----+--+  +----------+  +-------------+               |
|  | rvm-hal|  | rvm-wasm |  |rvm-coherence|               |
|  +--------+  +----------+  +-------------+               |
+----------------------------------------------------------+
Layer 4: Persistent State
         witness log │ compressed dormant memory │ RVF checkpoints
         ─────────────────────────────────────────────────────────
Layer 3: Execution Adapters
         bare partition │ WASM partition │ service adapter
         ─────────────────────────────────────────────────────────
Layer 2: Coherence Engine (OPTIONAL — DC-1)
         graph state │ mincut │ pressure scoring │ migration
         ─────────────────────────────────────────────────────────
Layer 1: RVM Core (Rust, no_std)
         partitions │ capabilities │ scheduler │ witnesses
         ─────────────────────────────────────────────────────────
Layer 0: Machine Entry (assembly, <500 LoC)
         reset vector │ trap handlers │ context switch

First-Class Kernel Objects

| Object | Purpose |
| --- | --- |
| Partition | Coherence domain container — unit of scheduling, isolation, and migration |
| Capability | Unforgeable authority token with 7 rights (READ, WRITE, GRANT, REVOKE, EXECUTE, PROVE, GRANT_ONCE) |
| Witness | 64-byte hash-chained audit record emitted by every privileged action |
| MemoryRegion | Typed, tiered, owned memory (Hot/Warm/Dormant/Cold) with move semantics |
| CommEdge | Inter-partition communication channel — weighted edge in the coherence graph |
| DeviceLease | Time-bounded, revocable hardware device access |
| CoherenceScore | Graph-derived locality and coupling metric |
| CutPressure | Isolation signal — high pressure triggers migration or split |
| RecoveryCheckpoint | State snapshot for rollback and reconstruction |
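The Capability object's seven rights and monotonic attenuation can be sketched with a bitmask; the encoding and function names below are assumptions for illustration, not the real rvm-cap API:

```rust
// Seven rights from the object table above, encoded as bit flags.
pub mod rights {
    pub const READ: u8 = 1 << 0;
    pub const WRITE: u8 = 1 << 1;
    pub const GRANT: u8 = 1 << 2;
    pub const REVOKE: u8 = 1 << 3;
    pub const EXECUTE: u8 = 1 << 4;
    pub const PROVE: u8 = 1 << 5;
    pub const GRANT_ONCE: u8 = 1 << 6;
}

#[derive(Clone, Copy)]
pub struct Capability {
    pub rights: u8,
}

impl Capability {
    /// Monotonic attenuation: a derived capability holds the
    /// intersection of the parent's rights and the requested mask,
    /// so rights can never be amplified along a derivation chain.
    pub fn derive(&self, requested: u8) -> Capability {
        Capability { rights: self.rights & requested }
    }

    pub fn allows(&self, right: u8) -> bool {
        self.rights & right == right
    }
}
```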

Crate Structure

| Crate | Purpose |
| --- | --- |
| rvm-types | Foundation types: addresses, IDs, capabilities, witness records, coherence scores |
| rvm-hal | Platform-agnostic hardware abstraction traits (MMU, timer, interrupts) |
| rvm-cap | Capability-based access control with derivation trees and three-tier proof |
| rvm-witness | Append-only witness trail with hash-chain integrity |
| rvm-proof | Proof-gated state transitions (P1/P2/P3 tiers) |
| rvm-partition | Partition lifecycle, split/merge, capability tables, communication edges |
| rvm-sched | Coherence-weighted 2-signal scheduler (deadline urgency + cut pressure) |
| rvm-memory | Guest physical address space management with tiered placement |
| rvm-coherence | Real-time Phi computation and EMA-filtered coherence scoring |
| rvm-boot | Deterministic 7-phase boot sequence with witness gating |
| rvm-wasm | Optional WebAssembly guest runtime |
| rvm-security | Unified security gate: capability check + proof verification + witness log |
| rvm-kernel | Top-level integration crate re-exporting all subsystems |

Dependency Graph

rvm-types (foundation, no deps)
    ├── rvm-hal
    ├── rvm-cap
    ├── rvm-witness
    ├── rvm-proof ← rvm-cap + rvm-witness
    ├── rvm-partition ← rvm-hal + rvm-cap + rvm-witness
    ├── rvm-sched ← rvm-partition + rvm-witness
    ├── rvm-memory ← rvm-hal + rvm-partition + rvm-witness
    ├── rvm-coherence ← rvm-partition + rvm-sched [OPTIONAL]
    ├── rvm-boot ← rvm-hal + rvm-partition + rvm-witness + rvm-sched + rvm-memory
    ├── rvm-wasm ← rvm-partition + rvm-cap + rvm-witness [OPTIONAL]
    ├── rvm-security ← rvm-cap + rvm-proof + rvm-witness
    └── rvm-kernel ← ALL

Build

# Check (no_std by default)
cargo check

# Run tests
cargo test

# Run benchmarks
cargo bench

# Build with std support
cargo check --features std

Design Constraints (ADR-132 through ADR-140)

| ID | Constraint | Status |
| --- | --- | --- |
| DC-1 | Coherence engine is optional; system degrades gracefully | Stub |
| DC-2 | MinCut budget: 50 µs per epoch | Stub |
| DC-3 | Capabilities are unforgeable, monotonically attenuated | Implemented |
| DC-4 | 2-signal priority: deadline_urgency + cut_pressure_boost | Implemented |
| DC-5 | Three systems cleanly separated (kernel + coherence + agents) | Enforced |
| DC-6 | Degraded mode when coherence unavailable | Stub |
| DC-7 | Migration timeout enforcement (100 ms) | Type only |
| DC-8 | Capabilities follow objects during partition split | Type only |
| DC-9 | Coherence score range [0.0, 1.0] as fixed-point | Implemented |
| DC-10 | Epoch-based witness batching (no per-switch records) | Implemented |
| DC-11 | Merge requires coherence above threshold | Implemented |
| DC-12 | Max 256 physical VMIDs, multiplexed for >256 partitions | Implemented |
| DC-13 | WASM is optional; native bare partitions are first class | Enforced |
| DC-14 | Failure classes: transient, recoverable, permanent, catastrophic | Type only |
| DC-15 | All types are no_std, forbid(unsafe_code), deny(missing_docs) | Enforced |
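DC-4's two-signal priority can be sketched in fixed point (DC-9 rules out floating point); the Q16.16 scale and the saturating combination are assumptions, only the two signal names come from the constraint:

```rust
pub type Fixed = u32; // Q16.16 fixed point: 1.0 == 65_536
pub const ONE: Fixed = 1 << 16;

/// Urgency ramps from 0.0 at the start of the period to 1.0 at the deadline.
pub fn deadline_urgency(now: u64, deadline: u64, period: u64) -> Fixed {
    if now >= deadline {
        return ONE; // deadline reached or missed: maximum urgency
    }
    let remaining = deadline - now;
    if remaining >= period {
        return 0;
    }
    (((period - remaining) * ONE as u64) / period) as Fixed
}

/// DC-4: the scheduler combines exactly two signals.
pub fn priority(deadline_urgency: Fixed, cut_pressure_boost: Fixed) -> Fixed {
    deadline_urgency.saturating_add(cut_pressure_boost)
}
```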

🔍 RVM vs State of the Art (12 differences)
| | RVM | KVM/Firecracker | seL4 | Theseus OS |
| --- | --- | --- | --- | --- |
| Primary abstraction | Coherence domains (graph-partitioned) | Virtual machines | Processes + capabilities | Cells (intralingual) |
| Isolation driver | Dynamic mincut + cut pressure | Hardware EPT/NPT | Formal verification + caps | Rust type system |
| Scheduling signal | Structural coherence (graph metrics) | CPU time / fairness | Priority / round-robin | Cooperative |
| Memory model | 4-tier reconstructable (Hot/Warm/Dormant/Cold) | Demand paging | Untyped memory + retype | Single address space |
| Audit trail | Witness-native (64B hash-chained records) | External logging | Not built-in | Not built-in |
| Mutation control | Proof-gated (3-layer: P1/P2/P3) | Unix permissions | Capability tokens | Rust ownership |
| Partition operations | Live split/merge along graph cuts | Not supported | Not supported | Not supported |
| Linux dependency | None — bare-metal | Yes (KVM is a kernel module) | None | None |
| Language | 95-99% Rust, <500 LoC assembly | C | C + Isabelle/HOL proofs | Rust |
| Target | Edge, IoT, agents | Cloud servers | Safety-critical | Research |
| Boot time | < 250ms to first witness | ~125ms (Firecracker) | Varies | N/A |
| Partition switch | < 10µs | ~2-5µs (VM exit) | ~0.5-1µs (IPC) | N/A (no isolation) |
✨ 6 Novel Capabilities (No Prior Art)

1. Kernel-Level Graph Control Loop

No existing OS uses spectral graph coherence metrics as a scheduling signal. RVM's coherence engine runs mincut algorithms in the kernel's scheduling loop — graph structure directly drives where computation runs, when partitions split, and which memory stays resident.

2. Reconstructable Memory ("Memory Time Travel")

RVM explicitly rejects demand paging. Dormant memory is stored as witness checkpoint + delta compression, not raw bytes. The system can deterministically reconstruct any historical state from the witness log.

3. Proof-Gated Infrastructure

Every state mutation requires a valid proof token verified through a three-tier system: P1 capability (<1µs), P2 policy (<100µs), P3 deep (<10ms, post-v1).
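A sketch of the three-tier gate follows; the tier names and latency budgets come from the text, while the mutation-to-tier mapping and all type names are illustrative assumptions:

```rust
// Ordered so that a higher tier subsumes a lower one.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub enum ProofTier {
    P1Capability, // < 1 µs: capability check
    P2Policy,     // < 100 µs: policy evaluation
    P3Deep,       // < 10 ms: deep verification (post-v1)
}

pub enum MutationKind {
    MemoryRemap,
    DeviceMap,
    Migration,
    PartitionMerge,
}

/// Which tier a mutation must clear before it is allowed (assumed mapping).
pub fn required_tier(m: &MutationKind) -> ProofTier {
    match m {
        MutationKind::MemoryRemap => ProofTier::P1Capability,
        MutationKind::DeviceMap
        | MutationKind::Migration
        | MutationKind::PartitionMerge => ProofTier::P2Policy,
    }
}

/// A proof token suffices if it was verified at or above the required tier.
pub fn gate(token_tier: ProofTier, m: &MutationKind) -> bool {
    token_tier >= required_tier(m)
}
```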

4. Witness-Native OS

Every privileged action emits a fixed 64-byte, FNV-1a hash-chained record. Tamper-evident by construction. Full deterministic replay from any checkpoint.
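The 64-byte size and FNV-1a chaining can be sketched concretely; the record's field layout is an assumption (only the size, alignment, and hash choice come from the text):

```rust
// FNV-1a 64-bit parameters (standard constants).
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET, |h, &b| (h ^ b as u64).wrapping_mul(FNV_PRIME))
}

/// Hypothetical 64-byte, cache-line-aligned witness record.
#[repr(C, align(64))]
pub struct Witness {
    pub prev_hash: u64,    // hash of the previous record: the chain link
    pub action: u32,       // syscall / mutation identifier
    pub actor: u32,        // partition that performed the action
    pub payload: [u8; 40], // action-specific data
    pub hash: u64,         // FNV-1a over all fields above
}

// Success criterion 4: the size is asserted at compile time.
const _: () = assert!(core::mem::size_of::<Witness>() == 64);

impl Witness {
    pub fn new(prev_hash: u64, action: u32, actor: u32, payload: [u8; 40]) -> Self {
        let mut buf = [0u8; 56];
        buf[..8].copy_from_slice(&prev_hash.to_le_bytes());
        buf[8..12].copy_from_slice(&action.to_le_bytes());
        buf[12..16].copy_from_slice(&actor.to_le_bytes());
        buf[16..].copy_from_slice(&payload);
        Witness { prev_hash, action, actor, payload, hash: fnv1a(&buf) }
    }
}

/// Tamper evidence: every record must hash correctly and link to its parent.
pub fn verify(log: &[Witness]) -> bool {
    log.windows(2).all(|w| w[1].prev_hash == w[0].hash)
        && log
            .iter()
            .all(|r| Witness::new(r.prev_hash, r.action, r.actor, r.payload).hash == r.hash)
}
```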

5. Live Partition Split/Merge

Partitions split along graph-theoretic cut boundaries and merge when coherence rises. Capabilities follow ownership (DC-8), regions use weighted scoring (DC-9), merges require 7 preconditions (DC-11).

6. Edge Security on 64KB RAM

Capability-based isolation, proof-gated execution, and witness attestation on microcontroller-class hardware (Cortex-M/R, 64KB RAM).

🎯 Success Criteria (v1)
| # | Criterion | Target |
| --- | --- | --- |
| 1 | All 13 crates compile with #![no_std] and #![forbid(unsafe_code)] | Enforced |
| 2 | Cold boot to first witness | < 250ms on Appliance hardware |
| 3 | Hot partition switch | < 10 microseconds |
| 4 | Witness record is exactly 64 bytes, cache-line aligned | Compile-time asserted |
| 5 | Capability derivation depth bounded at 8 levels | Enforced |
| 6 | EMA coherence filter operates without floating-point | Implemented |
| 7 | Boot sequence is deterministic and witness-gated | Implemented |
| 8 | Remote memory traffic reduction ≥ 20% vs naive placement | Target |
| 9 | Fault recovery without global reboot (F1–F3) | Target |
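Criterion 6 (a float-free EMA filter) has a standard fixed-point form; the Q16 scale and the alpha value in the test are assumptions for illustration:

```rust
pub type Q16 = u32; // 1.0 == 65_536
pub const ONE: Q16 = 1 << 16;

/// ema' = alpha * sample + (1 - alpha) * ema, entirely in Q16 fixed point.
/// Requires alpha <= ONE; intermediate math widens to u64 to avoid overflow.
pub fn ema_step(ema: Q16, sample: Q16, alpha: Q16) -> Q16 {
    let a = alpha as u64;
    let keep = (ONE - alpha) as u64;
    ((a * sample as u64 + keep * ema as u64) >> 16) as Q16
}
```

Because every quantity stays an integer, the filter behaves identically on a Seed-class MCU without an FPU and on Appliance hardware.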
🏗️ Implementation Phases

Phase 1: Foundation (M0-M1) — "Can it boot and isolate?"

  • M0: Bare-metal Rust boot on QEMU AArch64 virt. Reset → EL2 → serial → MMU → first witness.
  • M1: Partition + capability model. Create, destroy, switch. Simple deadline scheduler.

Phase 2: Differentiation (M2-M3) — "Can it prove and witness?"

  • M2: Witness logging (64-byte chained records) + P1/P2 proof verifier.
  • M3: 2-signal scheduler (deadline + cut_pressure). Flow + Reflex modes. Zero-copy IPC.

Phase 3: Innovation (M4-M5) — "Can it think about coherence?"

  • M4: Dynamic mincut integration (DC-2 budget). Live coherence graph. Migration triggers.
  • M5: Memory tier management. Reconstruction from dormant state.

Phase 4: Expansion (M6-M7) — "Can agents run on it?"

  • M6: WASM agent runtime adapter. Agent lifecycle.
  • M7: Seed/Appliance hardware bring-up. All success criteria.
🔐 Security Model

Capability-Based Authority. All access controlled through unforgeable kernel-resident tokens. No ambient authority. Seven rights with monotonic attenuation.

Proof-Gated Mutation. No memory remap, device mapping, migration, or partition merge without a valid proof token. Three tiers with strict latency budgets.

Witness-Native Audit. 64-byte records for every mutating operation. Hash-chained for tamper evidence. Deterministic replay from checkpoint + witness log.

Failure Classification. F1 (agent restart) → F2 (partition reconstruct) → F3 (memory rollback) → F4 (kernel reboot). Each escalation witnessed.

🖥️ Target Platforms
| Platform | Profile | RAM | Coherence Engine | WASM |
| --- | --- | --- | --- | --- |
| Seed | Tiny, persistent, event-driven | 64KB–1MB | No (DC-1) | Optional |
| Appliance | Edge hub, deterministic orchestration | 1–32GB | Yes (full) | Yes |
| Chip | Future Cognitum silicon | Tile-local | Hardware-assisted | Yes |
📚 ADR References
| ADR | Topic |
| --- | --- |
| ADR-132 | RVM top-level architecture and 15 design constraints |
| ADR-133 | Partition object model and split/merge semantics |
| ADR-134 | Witness schema and log format (64-byte records) |
| ADR-135 | Three-tier proof system (P1/P2/P3) |
| ADR-136 | Memory hierarchy and reconstruction |
| ADR-137 | Bare-metal boot sequence |
| ADR-138 | Seed hardware bring-up |
| ADR-139 | Appliance deployment model |
| ADR-140 | Agent runtime adapter |
🔧 Development

Prerequisites

  • Rust 1.77+ with aarch64-unknown-none target
  • QEMU 8.0+ (for AArch64 virt machine emulation)
rustup target add aarch64-unknown-none
brew install qemu  # macOS

Project Conventions

  • #![no_std] everywhere — the kernel runs on bare metal
  • #![forbid(unsafe_code)] where possible; unsafe blocks audited and commented
  • #![deny(missing_docs)] — every public API documented
  • Move semantics for memory ownership (OwnedRegion<P> is non-copyable)
  • Const generics for fixed-size structures (no heap allocation in kernel paths)
  • Every state mutation emits a witness record
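The move-semantics convention above can be sketched with a typestate; `OwnedRegion<P>` and its non-copyable behavior are described in the text, but the placement markers and `demote` method are illustrative names, not the real rvm-memory API:

```rust
use core::marker::PhantomData;

// Placement tiers as zero-sized typestate markers (two of the four shown).
pub struct Hot;
pub struct Dormant;

/// No Clone/Copy derive: the region can only be moved, never duplicated,
/// so exactly one owner exists at any time.
pub struct OwnedRegion<P> {
    pub base: usize,
    pub len: usize,
    _placement: PhantomData<P>,
}

impl OwnedRegion<Hot> {
    pub fn new(base: usize, len: usize) -> Self {
        OwnedRegion { base, len, _placement: PhantomData }
    }

    /// Demotion consumes the Hot region and yields a Dormant one;
    /// the old handle is unusable afterwards, enforced at compile time.
    pub fn demote(self) -> OwnedRegion<Dormant> {
        OwnedRegion { base: self.base, len: self.len, _placement: PhantomData }
    }
}
```

Any code that tried to keep using the Hot handle after `demote` would fail to compile, which is the point: tier transitions cannot alias.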

RuVector Integration

| Crate | Role in RVM |
| --- | --- |
| ruvector-mincut | Partition placement and isolation decisions |
| ruvector-sparsifier | Compressed shadow graph for Laplacian operations |
| ruvector-solver | Effective resistance → coherence scores |
| ruvector-coherence | Spectral coherence tracking |
| ruvix-* | Kernel primitives (Task, Capability, Region, Queue, Timer, Proof) |
| rvf | Package format for boot images, checkpoints, and cold storage |

License

Licensed under either of:

at your option.


EPIC · Research Gist · pi.ruv.io Brain

RVM Hypervisor Core — Deep Research Report

A coherence-native microhypervisor for Cognitum Seed, Appliance, and future chips
Project: RVM (RuVix Virtual Machine) — uses RuVector, RuVix, and RVF formats

Date: 2026-04-04 (updated 2026-04-04)
Generated by: RVM Research Swarm (5 concurrent agents, 9,863 lines)
Repository: github.com/ruvnet/RuVector | github.com/ruvnet/rvm
EPIC: ruvnet/RuVector#328


Naming Convention

  • RVM: The project name (RuVix Virtual Machine)
  • RuVector: The parent ecosystem (graph/vector/coherence crates)
  • RuVix: Legacy name for the kernel workspace (crates/ruvix/)
  • RVF: RuVector Format — package and checkpoint format
  • Crates: rvm-types, rvm-hal, rvm-cap, rvm-witness, rvm-proof, rvm-partition, rvm-sched, rvm-memory, rvm-coherence, rvm-boot, rvm-wasm, rvm-security, rvm-kernel

RVM State-of-the-Art Analysis: Bare-Metal Rust Hypervisors and Coherence-Native OS Design

Date: 2026-04-04
Scope: Research survey for the RVM microhypervisor project
Constraint: RVM does NOT depend on Linux or KVM


Table of Contents

  1. Bare-Metal Rust OS/Hypervisor Projects
  2. Capability-Based Systems
  3. Coherence Protocols
  4. Agent/Edge Computing Runtimes
  5. Graph-Partitioned Scheduling
  6. Existing RuVector Crates Relevant to Hypervisor Design
  7. Synthesis: How Each Area Maps to RVM Design Decisions
  8. References

1. Bare-Metal Rust OS/Hypervisor Projects

1.1 RustyHermit (Hermit OS)

What it is: A Rust-based lightweight unikernel targeting a scalable and predictable runtime for high-performance and cloud computing. Originally a rewrite of HermitCore.

Boot model: RustyHermit supports two deployment modes: (a) running inside a VM via the uhyve hypervisor (which itself requires KVM), and (b) running bare-metal side-by-side with Linux in a multi-kernel configuration. The uhyve path depends on KVM; the multi-kernel path allows bare-metal but assumes a Linux host for the other kernel.

Memory model: Single address space unikernel model. The application and kernel share one address space with no process isolation boundary. Memory safety comes from Rust's ownership model rather than MMU page tables.

Scheduling: Cooperative scheduling within the single unikernel image. No preemptive multitasking between isolated components. The scheduler is optimized for throughput rather than isolation.

RVM relevance: RustyHermit demonstrates that a pure-Rust kernel can achieve competitive performance, but its unikernel design lacks the isolation model RVM requires. RVM's capability-gated multi-task model is fundamentally different. However, RustyHermit's approach to no_std Rust kernel bootstrapping and its minimal dependency chain are instructive for RVM's Phase B bare-metal port.

Key lesson for RVM: Unikernels trade isolation for performance. RVM takes the opposite stance -- isolation is non-negotiable, but it must be capability-based rather than process-based.

1.2 Theseus OS

What it is: A research OS written entirely in Rust exploring "intralingual design" -- closing the semantic gap between compiler and hardware by maximally leveraging language safety and affine types.

Boot model: Boots on bare-metal x86_64 hardware (tested on Intel NUC, Thinkpad) and in QEMU. No dependency on Linux or KVM for operation. Uses a custom bootloader.

Memory model: All code runs at Ring 0 in a single virtual address space, including user applications written in purely safe Rust. Protection comes from the Rust type system rather than hardware privilege levels. The OS can guarantee at compile time that a given application or kernel component cannot violate isolation between modules.

Scheduling: Component-granularity scheduling where OS modules can be dynamically loaded and unloaded at runtime. State management is the central innovation -- Theseus minimizes the states one component holds for another, enabling live evolution of running system components.

RVM relevance: Theseus's intralingual approach is the closest philosophical match to RVM. Both systems bet on Rust's type system as a primary isolation mechanism. However, Theseus runs everything at Ring 0, while RVM uses EL1/EL0 separation with hardware MMU enforcement as a defense-in-depth layer on top of type safety.

Key lesson for RVM: Language-level isolation can replace MMU-based isolation for trusted components, but hardware-enforced boundaries remain essential for untrusted WASM workloads. RVM's hybrid approach (type safety for kernel, MMU for user components) is well-positioned.

1.3 RedLeaf

What it is: An OS developed from scratch in Rust to explore the impact of language safety on OS organization. Published at OSDI 2020.

Boot model: Boots on bare-metal x86_64. No Linux dependency. Custom bootloader with UEFI support.

Memory model: Does not rely on hardware address spaces for isolation. Instead uses only type and memory safety of the Rust language. Introduces "language domains" as the unit of isolation -- a lightweight abstraction for information hiding and fault isolation. Domains can be dynamically loaded and cleanly terminated without affecting other domains.

Scheduling: Domain-aware scheduling where the unit of execution is a domain rather than a process. Domains communicate through shared heaps with ownership transfer semantics that leverage Rust's ownership model for zero-copy IPC.

RVM relevance: RedLeaf's domain model closely parallels RVM's capability-gated task model. Both systems achieve isolation without traditional process boundaries. RedLeaf's shared heap with ownership transfer is conceptually similar to RVM's queue-based IPC with zero-copy ring buffers. RedLeaf also achieves 10Gbps network driver performance matching DPDK, demonstrating that language-based isolation does not inherently sacrifice throughput.

Key lesson for RVM: Language domains with clean termination semantics map well to RVM's RVF component model. The ability to isolate and restart a crashed driver without system-wide impact is exactly what RVM needs for agent workloads.

1.4 Tock OS

What it is: A secure embedded OS for microcontrollers, written in Rust, designed for running multiple concurrent, mutually distrustful applications.

Boot model: Runs on bare-metal Cortex-M and RISC-V microcontrollers. No OS dependency. Direct hardware boot.

Memory model: Dual isolation strategy:

  • Capsules (kernel components): Language-based isolation using safe Rust. Zero overhead. Capsules can only be written in safe Rust.
  • Processes (applications): Hardware MPU isolation. The MPU limits which memory addresses a process can access; violations trap to the kernel.

Scheduling: Priority-based preemptive scheduling with per-process grant regions for safe kernel-user memory sharing. Tock 2.2 (January 2025) achieved compilation on stable Rust for the first time.

RVM relevance: Tock's dual isolation model (language for trusted, hardware for untrusted) is the same architectural pattern RVM employs. Tock's capsule model directly influenced RVM's approach to kernel extensions. The 2025 TickTock formal verification effort discovered five previously unknown MPU configuration bugs and two interrupt handling bugs that broke isolation -- a cautionary result for any system relying on MPU/MMU configuration correctness.

Key lesson for RVM: Formal verification of the MMU/MPU configuration code in ruvix-aarch64 should be a priority. The TickTock results demonstrate that even mature, well-tested isolation code can harbor subtle bugs.

1.5 Hubris (Oxide Computer)

What it is: A microkernel OS for deeply embedded systems, developed by Oxide Computer Company. Written entirely in Rust. Production-deployed in Oxide rack-mount server service controllers.

Boot model: Bare-metal on ARM Cortex-M microcontrollers. No OS dependency. Static binary with all tasks compiled together.

Memory model: Strictly static architecture. No dynamic memory allocation. No runtime task creation or destruction. The kernel is approximately 2000 lines of Rust. Memory regions are assigned at compile time via a build system configuration (TOML-based task descriptions).

Scheduling: Strictly synchronous IPC model. Preemptive priority-based scheduling. Tasks that crash can be restarted without affecting the rest of the system. No driver code runs in privileged mode.

RVM relevance: Hubris demonstrates that a production-quality Rust microkernel can be extremely small (~2000 lines) while providing real isolation. Its static, no-allocation design philosophy aligns with RVM's "fixed memory layout" constraint. Hubris's approach to compile-time task configuration is analogous to RVM's RVF manifest-driven resource declaration.

Key lesson for RVM: Static resource declaration at boot (from RVF manifest) is a proven pattern. Hubris's production track record at Oxide validates the Rust microkernel approach for real hardware.

1.6 Redox OS

What it is: A complete Unix-like microkernel OS written in Rust, targeting general-purpose desktop and server use.

Boot model: Boots on bare-metal x86_64 hardware. Custom bootloader with UEFI support. The 2025-2026 roadmap includes ARM and RISC-V support.

Memory model: Traditional microkernel with hardware address space isolation. Processes run in separate address spaces. The kernel handles memory management, scheduling, and IPC. Device drivers run in userspace.

Scheduling: Standard microkernel scheduling with userspace servers. Recent 2025 improvements yielded 500-700% file I/O performance gains. Self-hosting is a key roadmap goal.

RVM relevance: Redox proves that a full microkernel OS can be written in Rust and run on real hardware. Its "everything in Rust" approach validates the toolchain. However, Redox's Unix-like POSIX interface is exactly the abstraction mismatch that RVM is designed to avoid. Redox optimizes for human-process workloads; RVM optimizes for agent-vector-graph workloads.

Key lesson for RVM: Redox's experience with driver isolation in userspace and its bare-metal boot process are directly transferable. But RVM should not adopt POSIX semantics.

1.7 Hyperlight (Microsoft)

What it is: A micro-VM manager that creates ultra-lightweight VMs with no OS inside. Open-sourced in 2024-2025, now in the CNCF Sandbox.

Boot model: Creates VMs using hardware hypervisor support (Hyper-V on Windows, KVM on Linux, mshv on Azure). The VMs themselves contain no operating system -- just a linear memory slice and a CPU. VM creation takes 1-2ms, with warm-start latency of 0.9ms.

Memory model: Each micro-VM gets a flat linear memory region. No virtual devices, no filesystem, no OS. The Hyperlight Wasm guest compiles wasmtime as a no_std Rust module that runs directly inside the micro-VM.

Scheduling: Host-managed. The micro-VMs are extremely short-lived function executions. No internal scheduler needed.

RVM relevance: Hyperlight demonstrates the "WASM-in-a-VM-with-no-OS" pattern that is extremely relevant to RVM. The key insight is that wasmtime can be compiled as a no_std component and run without any operating system. RVM's approach of embedding a WASM runtime directly in the kernel aligns with this pattern, but RVM goes further by providing kernel-native vector/graph primitives that Hyperlight lacks.

Key lesson for RVM: Wasmtime's no_std mode is production-viable. The Hyperlight architecture validates the "no OS needed for WASM execution" thesis. RVM should study Hyperlight's wasmtime-platform.h abstraction layer for the Phase B bare-metal WASM port.


2. Capability-Based Systems

2.1 seL4's Capability Model

Architecture: seL4 is the gold standard for capability-based microkernels. It was the first OS kernel to receive a complete formal proof of functional correctness (8,700 lines of C verified from abstract specification down to binary). Every kernel resource is accessed through capabilities -- unforgeable tokens managed by the kernel.

Capability structure: seL4 capabilities encode: an object pointer (which kernel object), access rights (what operations are permitted), and a badge (extra metadata for IPC demultiplexing). Capabilities are stored in CNodes (capability nodes), which are themselves accessed through capabilities, forming a recursive namespace.

Delegation and revocation: Capabilities can be copied (with equal or lesser rights), moved between CNodes, and revoked. Revocation is recursive -- revoking a capability invalidates all capabilities derived from it.

Rust bindings: The sel4-sys crate provides Rust bindings for seL4 system calls. Antmicro and Google developed a version designed for maintainability. The seL4 Microkit framework supports Rust as a first-class language.

RVM's adoption of seL4 concepts:

  • RVM's ruvix-cap crate implements seL4-style capabilities with CapRights, CapHandle, derivation trees, and epoch-based invalidation
  • Maximum delegation depth of 8 (configurable) prevents unbounded chains
  • Audit logging with depth-warning threshold at 4
  • The GRANT_ONCE right provides non-transitive delegation (not in seL4)
  • Unlike seL4's C implementation, RVM's capability manager is #![forbid(unsafe_code)]

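The depth bound and audit threshold from the list above can be sketched directly; the values (max 8, warn at 4) come from the text, while the type and function names are illustrative:

```rust
pub const MAX_DEPTH: u8 = 8;  // maximum delegation depth (configurable)
pub const WARN_DEPTH: u8 = 4; // audit-log warning threshold

#[derive(Debug, PartialEq)]
pub enum Derive {
    Ok { depth: u8, warn: bool },
    TooDeep, // unbounded delegation chains are rejected
}

/// Deriving from a capability at `parent_depth` either yields a child one
/// level deeper (possibly flagged for audit) or fails at the bound.
pub fn derive_child(parent_depth: u8) -> Derive {
    let depth = parent_depth + 1;
    if depth > MAX_DEPTH {
        Derive::TooDeep
    } else {
        Derive::Ok { depth, warn: depth >= WARN_DEPTH }
    }
}
```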
Gap analysis: seL4's formal verification is its strongest asset. RVM currently lacks formal proofs for its capability manager. The Tock/TickTock experience (five bugs found through verification) suggests formal verification of ruvix-cap should be prioritized.

2.2 CHERI Hardware Capabilities

Architecture: CHERI (Capability Hardware Enhanced RISC Instructions) extends processor ISAs with hardware-enforced capabilities. Rather than relying solely on page tables for memory protection, CHERI encodes bounds and permissions directly in pointer representations. Pointers become fat capabilities that carry their own access metadata.

ARM Morello: Arm's Morello evaluation platform implemented CHERI extensions on an Armv8.2-A processor. Performance evaluation on 20 C/C++ applications showed overheads ranging from negligible to 1.65x, with the highest costs in pointer-intensive workloads. However, as of 2025, Arm has stepped back from active Morello development, pushing CHERI adoption toward smaller embedded processors.

Verified temporal safety: A 2025 paper at CPP presented a formal CHERI C memory model for verified temporal safety, demonstrating that CHERI can enforce not just spatial safety (bounds) but also temporal safety (use-after-free prevention).

RVM relevance: CHERI's capability-per-pointer model is more fine-grained than RVM's capability-per-object model. If future AArch64 processors include CHERI extensions, RVM could leverage them for sub-region protection within capability boundaries. In the near term, RVM achieves similar goals through Rust's ownership system (compile-time) and MMU page tables (runtime).

Key lesson for RVM: CHERI demonstrates that hardware capabilities are feasible but face adoption challenges. RVM's software-capability approach (ruvix-cap) is the right near-term strategy, with CHERI as a future hardware acceleration path. The ruvix-hal HAL trait layer already allows for pluggable MMU implementations, which could be extended to CHERI capabilities.

2.3 Barrelfish Multikernel

Architecture: Barrelfish runs a separate small kernel ("CPU driver") on each core. Kernels share no memory. All inter-core communication is explicit message passing. The rationale: hardware cache coherence protocols are difficult to scale beyond ~80 cores, so Barrelfish makes communication explicit rather than relying on shared-memory illusions.

Capability model: Barrelfish uses a capability system where the CPU driver maintains capabilities, executes syscalls on capabilities, and schedules dispatchers. Dispatchers are the unit of scheduling -- an application spanning multiple cores has a dispatcher per core, and dispatchers never migrate.

System knowledge base: At boot, Barrelfish probes hardware to measure inter-core communication performance, stores results in a small database (SKB), and runs an optimizer to select communication patterns.

RVM relevance: Barrelfish's per-core kernel model directly informs RVM's future Phase C (SMP) design. The ruvix-smp crate already provides CPU topology management, per-CPU state tracking, IPI messaging (Reschedule, TlbFlush, FunctionCall), and lock-free atomic state transitions -- all aligned with the multikernel philosophy.

Key lesson for RVM: For multi-core RVM, the Barrelfish model suggests: (1) run a scheduler instance per core rather than a single shared scheduler, (2) use explicit message passing between per-core schedulers, (3) probe inter-core latency at boot and store in a performance database that the coherence-aware scheduler can consult.


3. Coherence Protocols

3.1 Hardware Cache Coherence: MOESI and MESIF

MESI (Modified, Exclusive, Shared, Invalid): The baseline snooping protocol. Each cache line exists in one of four states. Write operations invalidate all other copies (write-invalidate). Simple but generates high bus traffic on writes to shared data.

MOESI (adds Owned): AMD's extension. The Owned state allows a modified, shared line to serve reads directly from the owning cache rather than writing back to memory first. This reduces write-back traffic at the cost of more complex state transitions.

MESIF (adds Forward): Intel's extension. The Forward state designates exactly one cache as the responder for shared-line requests, eliminating redundant responses when multiple caches hold the same shared line. Optimized for read-heavy sharing patterns.

Scalability limits: All snooping protocols face fundamental scalability issues beyond ~32-64 cores because every cache must observe every bus transaction. This motivates the shift to directory-based protocols at higher core counts.

3.2 Directory-Based Coherence

Architecture: Instead of broadcasting on a bus, directory protocols maintain a centralized (or distributed) directory tracking which caches hold each line. Only the relevant caches receive invalidation messages. Traffic scales with the number of sharers rather than the number of cores.

Overhead: Directory entries consume storage (bit-vector per cache line per core). For N cores with M cache lines, the directory requires O(N * M) bits. Various compression techniques (limited pointer directories, coarse directories) reduce this at the cost of precision.

Relevance to RVM: Directory-based coherence is the hardware mechanism that enables many-core scaling. RVM's SMP design should account for NUMA effects and directory-based coherence latencies when making scheduling decisions.
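The O(N * M) cost above is concrete at modern sizes. A quick calculation with illustrative numbers (a full bit-vector directory, one presence bit per core per tracked line):

```rust
/// Storage for a full bit-vector directory: one presence bit per core per
/// tracked cache line -- the O(N * M) figure discussed above.
fn directory_bits(cores: u64, tracked_bytes: u64, line_bytes: u64) -> u64 {
    cores * (tracked_bytes / line_bytes)
}

fn main() {
    // 64 cores tracking a 32 MiB last-level cache with 64-byte lines:
    let bits = directory_bits(64, 32 * 1024 * 1024, 64);
    assert_eq!(bits, 33_554_432);          // 32 Mi presence bits
    assert_eq!(bits / 8 / 1024 / 1024, 4); // = 4 MiB of directory state
}
```

This is why limited-pointer and coarse directories trade precision for storage.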

3.3 Software Coherence Protocols

Overview: Software coherence replaces hardware snooping/directory mechanisms with explicit software-managed cache operations. The OS or runtime issues explicit cache flush/invalidate instructions at synchronization points.

Examples:

  • Linux's explicit DMA coherence management (dma_map_single with cache maintenance)
  • Barrelfish's message-based coherence (no shared memory, explicit transfers)
  • GPU compute models (explicit host-device memory transfers)

Trade-offs: Software coherence eliminates hardware complexity but requires programmers (or compilers/runtimes) to correctly manage cache state. Errors lead to stale data or corruption. The benefit is full control over when coherence traffic occurs.

3.4 Coherence Signals as Scheduling Inputs -- The RVM Innovation

This is where RVM's design diverges from all existing systems. No existing OS uses coherence metrics as a scheduling signal. RVM's scheduler (ruvix-sched) computes priority as:

score = deadline_urgency + novelty_boost - risk_penalty

Where risk_penalty is derived from the pending coherence delta -- a measure of how much a task's execution would reduce global structural coherence. This is computed using spectral graph theory (Fiedler value, spectral gap, effective resistance) from the ruvector-coherence crate.

Why this matters: Traditional schedulers optimize for latency, throughput, or fairness. RVM optimizes for structural consistency. A task that would introduce logical contradictions into the system's knowledge graph gets deprioritized. A task processing genuinely novel information gets boosted. This is the right scheduling objective for agent workloads where maintaining a coherent world model is more important than raw throughput.

No prior art exists for coherence-driven scheduling in operating systems. The closest analogs are:

  • Database transaction schedulers that consider serializability (but these gate on commit, not schedule)
  • Network quality-of-service schedulers that consider flow coherence (but this is packet-level, not semantic)
  • Game engine entity-component schedulers that consider data locality (but this is cache-coherence, not semantic coherence)
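The scoring rule above can be sketched directly. The struct and field names here are hypothetical stand-ins, not ruvix-sched's actual types; the point is only the shape of the computation.

```rust
/// Illustrative per-task signals; field names are hypothetical,
/// not ruvix-sched's actual API.
struct TaskSignals {
    deadline_urgency: f64, // rises as the deadline approaches
    novelty: f64,          // distance from the embedding-space centroid
    coherence_delta: f64,  // predicted loss of global coherence if run now
}

/// score = deadline_urgency + novelty_boost - risk_penalty
fn priority_score(t: &TaskSignals, novelty_w: f64, risk_w: f64) -> f64 {
    let novelty_boost = novelty_w * t.novelty;
    let risk_penalty = risk_w * t.coherence_delta.max(0.0); // penalize decoherence only
    t.deadline_urgency + novelty_boost - risk_penalty
}

fn main() {
    let novel = TaskSignals { deadline_urgency: 0.2, novelty: 0.9, coherence_delta: 0.0 };
    let risky = TaskSignals { deadline_urgency: 0.2, novelty: 0.9, coherence_delta: 0.8 };
    // Equal deadlines and novelty: the task that would decohere
    // the world model is deprioritized.
    assert!(priority_score(&novel, 1.0, 1.0) > priority_score(&risky, 1.0, 1.0));
}
```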

4. Agent/Edge Computing Runtimes

4.1 Wasmtime Bare-Metal Embedding

Current status: Wasmtime can be compiled as a no_std Rust crate. The embedder must implement a platform abstraction layer (wasmtime-platform.h) specifying how to allocate virtual memory, handle signals, and manage threads.

Hyperlight precedent: Microsoft's Hyperlight Wasm project compiles wasmtime into a no_std guest that runs inside micro-VMs with no operating system. This is the strongest proof-of-concept for wasmtime on bare metal.

Practical considerations:

  • Wasmtime's cranelift JIT compiler works in no_std mode but requires virtual memory for code generation
  • The signals-and-traps feature can be disabled for platforms without virtual memory support
  • Custom memory allocators must be provided via the platform abstraction

RVM integration path: RVM's Phase B plan (weeks 35-36) specifies porting wasmtime or wasm-micro-runtime to bare metal. Given Hyperlight's success with no_std wasmtime, wasmtime is the recommended path. The ruvix-hal MMU trait can provide the virtual memory abstraction that wasmtime's platform layer requires.

4.2 Lunatic (Erlang-Like WASM Runtime)

What it is: A universal runtime for server-side applications inspired by Erlang. Actors are represented as WASM instances with per-actor sandboxing and runtime permissions.

Key features:

  • Preemptive scheduling of WASM processes via work-stealing async executor
  • Per-process fine-grained resource access control (filesystem, memory, network) enforced at the syscall level
  • Automatic transformation of blocking code into async operations
  • Written in Rust using wasmtime and tokio, with custom stack switching

Agent workload alignment: Lunatic's actor model closely matches agent workloads:

  • Each agent is an isolated WASM instance (Lunatic process)
  • Agents communicate through typed message passing
  • A failing agent can be restarted without affecting others (supervision trees)
  • Different agents can be written in different languages (polyglot via WASM)

RVM relevance: Lunatic validates the "agents as lightweight WASM processes" model but runs on top of Linux (tokio for async I/O, wasmtime for WASM). RVM can adopt Lunatic's architectural patterns while eliminating the Linux dependency. Key patterns to adopt:

  • Per-agent capability sets (RVM already has this via ruvix-cap)
  • Supervision trees for agent fault recovery
  • Work-stealing across cores (for Phase C SMP)

4.3 How Agent Workloads Differ from Traditional VM Workloads

| Dimension | Traditional VM/Container | Agent Workload |
| --- | --- | --- |
| Lifecycle | Long-running process | Short-lived reasoning bursts + long idle |
| State model | Files and databases | Vectors, graphs, proof chains |
| Communication | TCP/Unix sockets | Typed semantic queues with coherence scores |
| Isolation | Address space separation | Capability-gated resource access |
| Failure | Kill and restart process | Isolate, checkpoint, replay from last coherent state |
| Scheduling objective | Fairness / throughput | Coherence preservation / novelty exploration |
| Memory pattern | Heap allocation / GC | Append-only regions + slab allocators |
| Security model | User/group permissions | Proof-gated mutations with attestation witnesses |

4.4 What an Agent-Optimized Hypervisor Needs

Based on the above analysis, an agent-optimized hypervisor requires:

  1. Kernel-native vector/graph stores -- Agents think in embeddings and knowledge graphs, not files. These must be first-class kernel objects, not userspace libraries serializing to disk.

  2. Coherence-aware scheduling -- The scheduler must understand that not all runnable tasks should run. A task that would decohere the world model should be delayed.

  3. Proof-gated mutations -- Every state change must carry a cryptographic witness. This enables checkpoint/replay, audit, and distributed attestation.

  4. Zero-copy typed IPC -- Agents exchange structured data (vectors, graph patches, proof tokens), not byte streams. The queue abstraction must be typed and schema-aware.

  5. Sub-millisecond task spawn -- Agent reasoning involves spawning many short-lived sub-tasks. Task creation must be cheaper than thread creation.

  6. Capability delegation without kernel round-trip -- Agents frequently delegate partial authority. This should be achievable through capability derivation in user space with kernel validation on use.

  7. Deterministic replay -- For debugging and audit, the kernel must support replaying a sequence of operations and reaching the same state.

All seven of these requirements are already addressed by RVM's architecture (ADR-087).
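Requirements 4 and 5 (typed IPC, bounded fast-path allocation) can be illustrated with a minimal typed ring. This is a sketch only -- ruvix-queue's actual design is io_uring-style and zero-copy in shared memory, which this single-process toy does not attempt to show.

```rust
/// Minimal typed, bounded SPSC ring: messages are structured values, not
/// byte streams, and the buffer is allocated once up front (never grows).
/// Illustrative only; not ruvix-queue's actual design.
struct TypedRing<T> {
    buf: Vec<Option<T>>, // allocated at creation; no allocation on push/pop
    head: usize,
    tail: usize,
    len: usize,
}

impl<T> TypedRing<T> {
    fn with_capacity(n: usize) -> Self {
        Self { buf: (0..n).map(|_| None).collect(), head: 0, tail: 0, len: 0 }
    }

    /// Bounded push: a full queue rejects the message instead of allocating.
    fn push(&mut self, v: T) -> Result<(), T> {
        if self.len == self.buf.len() { return Err(v); }
        self.buf[self.tail] = Some(v);
        self.tail = (self.tail + 1) % self.buf.len();
        self.len += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.len == 0 { return None; }
        let v = self.buf[self.head].take();
        self.head = (self.head + 1) % self.buf.len();
        self.len -= 1;
        v
    }
}

/// A hypothetical structured message type (a graph patch, not bytes).
#[derive(Debug, PartialEq)]
struct GraphPatch { node: u32, weight_delta: i32 }

fn main() {
    let mut q: TypedRing<GraphPatch> = TypedRing::with_capacity(2);
    q.push(GraphPatch { node: 7, weight_delta: 3 }).unwrap();
    q.push(GraphPatch { node: 9, weight_delta: -1 }).unwrap();
    assert!(q.push(GraphPatch { node: 1, weight_delta: 0 }).is_err()); // bounded
    assert_eq!(q.pop(), Some(GraphPatch { node: 7, weight_delta: 3 }));
}
```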


5. Graph-Partitioned Scheduling

5.1 Min-Cut Based Task Placement

Theory: Given a graph where nodes are tasks and edges represent communication volume, the minimum cut partitioning assigns tasks to processors to minimize inter-processor communication. The min-cut objective directly minimizes the scheduling overhead of cross-core data movement.

Algorithms:

  • Karger-Stein randomized contraction: O(n^2 log^3 n) for global min-cut
  • Stoer-Wagner deterministic: O(nm + n^2 log n) for global min-cut
  • KaHIP/METIS multilevel: Practical tools for balanced k-way partitioning

RVM's ruvector-mincut crate implements subpolynomial dynamic min-cut with self-healing networks, including:

  • Exact and (1+epsilon)-approximate algorithms
  • j-Tree hierarchical decomposition for multi-level partitioning
  • Canonical pseudo-deterministic min-cut (source-anchored, tree-packing, dynamic tiers)
  • Agentic 256-core parallel backend
  • SNN-based neural optimization (attractor, causal, morphogenetic, strange loop, time crystal)
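The min-cut objective itself is easy to see on a toy communication graph. The brute-force search below is purely illustrative -- ruvector-mincut uses subpolynomial dynamic algorithms, not exhaustive enumeration -- but it shows why the cut reveals the natural isolation boundary.

```rust
/// Brute-force global min cut over all bipartitions of a tiny task graph.
/// Edges are (task_a, task_b, communication_weight).
fn min_cut(n: usize, edges: &[(usize, usize, u64)]) -> u64 {
    let mut best = u64::MAX;
    for mask in 1..(1u32 << n) - 1 { // every non-trivial 2-way split
        let cut: u64 = edges.iter()
            .filter(|&&(a, b, _)| ((mask >> a) & 1) != ((mask >> b) & 1))
            .map(|&(_, _, w)| w)
            .sum();
        best = best.min(cut);
    }
    best
}

fn main() {
    // Agents 0-1 and 2-3 communicate heavily; the 1-2 link is weak.
    let comm = [(0, 1, 10), (2, 3, 10), (1, 2, 1)];
    // The optimal partition severs only the weak edge: {0,1} | {2,3}.
    assert_eq!(min_cut(4, &comm), 1);
}
```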

5.2 Spectral Partitioning for Workload Isolation

Theory: Spectral partitioning uses the eigenvectors of the graph Laplacian to identify natural clusters. The Fiedler vector (eigenvector corresponding to the second-smallest eigenvalue) provides an optimal bisection -- the Cheeger bound guarantees that spectral bisection produces partitions with nearly optimal conductance.

RVM's ruvector-coherence spectral module already implements:

  • Fiedler value estimation via inverse iteration with CG solver
  • Spectral gap ratio computation
  • Effective resistance sampling
  • Degree regularity scoring
  • Composite Spectral Coherence Score (SCS) with incremental updates

The SpectralTracker supports first-order perturbation updates (delta_lambda ~ v^T * delta_L * v) for incremental edge weight changes, avoiding full recomputation on every graph mutation.
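For a single edge weight change, the first-order update quoted above collapses to a closed form: the Laplacian perturbation for edge (i, j) with weight change dw is dw * (e_i - e_j)(e_i - e_j)^T, so v^T * delta_L * v = dw * (v[i] - v[j])^2. A sketch, assuming the Fiedler vector v is already available (this is the mathematics behind SpectralTracker, not its API):

```rust
/// First-order Fiedler-value update delta_lambda ~ v^T * delta_L * v for a
/// single edge (i, j) whose weight changes by dw. The rank-one structure of
/// delta_L reduces the update to O(1) per edge mutation.
fn fiedler_delta(v: &[f64], i: usize, j: usize, dw: f64) -> f64 {
    let d = v[i] - v[j];
    dw * d * d
}

fn main() {
    // A Fiedler vector that cleanly separates nodes {0,1} from {2,3}.
    let v = [0.5, 0.5, -0.5, -0.5];
    // Strengthening a cross-partition edge raises lambda_2
    // (the graph becomes harder to cut) ...
    assert_eq!(fiedler_delta(&v, 1, 2, 1.0), 1.0);
    // ... while an intra-partition edge has no first-order effect.
    assert_eq!(fiedler_delta(&v, 0, 1, 1.0), 0.0);
}
```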

5.3 Dynamic Graph Rebalancing Under Load

Challenge: Static partitioning fails when workload patterns change at runtime. Agents spawn, terminate, and change their communication patterns dynamically.

Approaches:

  • Diffusion-based: Migrate load from overloaded partitions to underloaded neighbors. O(diameter) convergence. Simple but can oscillate.
  • Repartitioning: Periodically re-run the partitioner on the current communication graph. Expensive but globally optimal.
  • Incremental spectral: Track the Fiedler vector incrementally (as ruvector-coherence does) and trigger repartitioning only when the spectral gap drops below a threshold.

RVM design implication: The scheduler's partition manager (ruvix-sched/partition.rs) currently uses static round-robin partition scheduling with fixed time slices. The spectral coherence infrastructure from ruvector-coherence is already in the workspace (ruvix-sched depends on it optionally via the coherence feature flag). The path forward:

  1. Monitor the inter-task communication graph using queue message counters
  2. Build a Laplacian from the communication weights
  3. Compute the SCS incrementally using SpectralTracker
  4. When SCS drops below threshold, trigger repartitioning using ruvector-mincut
  5. Migrate tasks between partitions based on the new cut
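Steps 1-4 can be sketched end to end. The "coherence score" here is a crude intra-partition traffic ratio standing in for the real spectral SCS, and all names (CommMonitor, locality) are hypothetical:

```rust
use std::collections::HashMap;

/// Tracks the inter-task communication graph from queue message counters
/// (step 1). Hypothetical sketch, not ruvix-sched's actual structures.
struct CommMonitor {
    edges: HashMap<(u32, u32), u64>, // (task, task) -> messages this window
}

impl CommMonitor {
    fn record(&mut self, a: u32, b: u32, msgs: u64) {
        let key = if a < b { (a, b) } else { (b, a) };
        *self.edges.entry(key).or_insert(0) += msgs;
    }

    /// Crude coherence proxy: fraction of traffic that stays inside a
    /// partition under the given task -> partition assignment.
    fn locality(&self, assign: &HashMap<u32, u32>) -> f64 {
        let (mut intra, mut total) = (0u64, 0u64);
        for (&(a, b), &w) in &self.edges {
            total += w;
            if assign.get(&a) == assign.get(&b) {
                intra += w;
            }
        }
        if total == 0 { 1.0 } else { intra as f64 / total as f64 }
    }
}

fn main() {
    let mut m = CommMonitor { edges: HashMap::new() };
    m.record(0, 1, 90); // tasks 0 and 1 talk constantly ...
    m.record(1, 2, 10); // ... task 2 only occasionally
    // Bad placement: the chatty pair straddles two partitions.
    let bad: HashMap<u32, u32> = [(0, 0), (1, 1), (2, 1)].into();
    // After a repartition (step 5), the heavy edge is internal.
    let good: HashMap<u32, u32> = [(0, 0), (1, 0), (2, 1)].into();
    assert!(m.locality(&good) > m.locality(&bad));
    // A score below threshold (step 4) would trigger ruvector-mincut.
    assert!(m.locality(&bad) < 0.5);
}
```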

5.4 The ruvector-sparsifier Connection

The ruvector-sparsifier crate provides dynamic spectral graph sparsification -- an "always-on compressed world model." For large task graphs, sparsification reduces the graph to O(n log n / epsilon^2) edges while preserving all cuts to within a (1+epsilon) factor. This means the scheduler can maintain an approximate communication graph at dramatically lower cost than the full graph, using it for partitioning decisions.


6. Existing RuVector Crates Relevant to Hypervisor Design

6.1 ruvector-mincut

Relevance: CRITICAL for graph-partitioned scheduling

  • Provides the algorithmic backbone for task-to-partition assignment
  • Subpolynomial dynamic min-cut means the scheduler can re-partition in response to workload changes without O(n^3) overhead
  • The j-Tree hierarchical decomposition (feature jtree) maps directly to multi-level partition hierarchies
  • The canonical min-cut feature provides deterministic partitioning -- the same communication graph always produces the same partition, enabling reproducible scheduling behavior
  • SNN integration enables learned partitioning policies

Integration point: Wire into ruvix-sched's PartitionManager to dynamically assign new tasks to optimal partitions based on their communication pattern with existing tasks.

6.2 ruvector-sparsifier

Relevance: HIGH for scalable partition management

  • Dynamic spectral sparsification keeps the scheduler's view of the task communication graph manageable as the number of tasks grows
  • Static and dynamic modes: static for boot-time graph reduction, dynamic for runtime maintenance
  • Preserves all cuts within (1+epsilon), so min-cut-based partition decisions remain valid on the sparsified graph
  • SIMD and WASM feature flags for acceleration

Integration point: Preprocess the inter-task communication graph through the sparsifier before feeding it to ruvector-mincut for partition computation.

6.3 ruvector-solver

Relevance: HIGH for spectral computations

  • Sublinear-time sparse linear system solver: O(log n) to O(sqrt(n)) for PageRank, Neumann series, forward/backward push, conjugate gradient
  • Direct application: solving the graph Laplacian systems needed for Fiedler vector computation and effective resistance estimation
  • The CG solver in ruvector-coherence/spectral.rs is a minimal inline implementation; ruvector-solver provides a more optimized, parallel version

Integration point: Replace the inline CG solver in spectral.rs with ruvector-solver's optimized implementation for faster coherence score computation in the scheduler hot path.

6.4 ruvector-cnn

Relevance: MODERATE for novelty detection

  • CNN feature extraction for image embeddings with SIMD acceleration
  • INT8 quantized inference for resource-constrained environments
  • The scheduler's novelty tracker (ruvix-sched/novelty.rs) computes novelty as distance from a centroid in embedding space
  • For vision-based agents, ruvector-cnn could provide the embedding that feeds into the novelty computation

Integration point: In RVF component space (above the kernel), vision agents use ruvector-cnn for perception. The resulting embedding vectors feed into the kernel's novelty tracker through the update_task_novelty syscall.

6.5 ruvector-coherence

Relevance: CRITICAL -- already integrated

  • Provides the coherence measurement primitives that drive the scheduler's risk penalty
  • Spectral module computes Fiedler value, spectral gap, effective resistance, degree regularity
  • SpectralTracker supports incremental updates (first-order perturbation)
  • HnswHealthMonitor provides health alerts when graph coherence degrades
  • Already a workspace dependency of ruvix-sched (optional, behind coherence feature flag)

Integration point: Already wired. The spectral coherence score feeds into compute_risk_penalty() in the priority module.

6.6 ruvector-raft

Relevance: HIGH for distributed RVM clusters

  • Raft consensus for distributed metadata coordination
  • Relevant for Phase D (distributed RVM mesh, demo 5 in ADR-087)
  • Provides leader election, log replication, and consistent state machine application
  • Could coordinate partition assignments across a cluster of RVM nodes

Integration point: Future use for distributed scheduling consensus in multi-node RVM deployments.

6.7 Other Notable Crates

| Crate | Relevance | Use |
| --- | --- | --- |
| ruvector-graph | HIGH | Graph database for task communication topology |
| ruvector-hyperbolic-hnsw | MODERATE | Hierarchical embedding search for agent memory |
| ruvector-delta-consensus | HIGH | Delta-based consensus for distributed state |
| ruvector-attention | MODERATE | Attention mechanisms for priority computation |
| sona | MODERATE | Self-optimizing neural architecture for scheduler tuning |
| ruvector-nervous-system | LOW | Higher-level coordination (above kernel) |
| thermorust | LOW | Thermal monitoring for Raspberry Pi targets |
| ruvector-verified | HIGH | ProofGate, ProofEnvironment, attestation chain |

7. Synthesis: How Each Area Maps to RVM Design Decisions

7.1 Decision Matrix

| Research Area | Key Finding | RVM Design Decision | Status |
| --- | --- | --- | --- |
| Bare-metal Rust | Theseus/RedLeaf prove language isolation viable | Hybrid: type safety (kernel) + MMU (user) | Phase A done, Phase B planned |
| Bare-metal Rust | Hubris shows ~2000-line Rust kernel suffices | Keep nucleus minimal (12 syscalls, ~3000 LOC) | Implemented |
| Bare-metal Rust | Hyperlight proves no_std wasmtime works | Use wasmtime no_std for WASM runtime | Phase B weeks 35-36 |
| Capabilities | seL4 model is the gold standard | ruvix-cap implements seL4-style capabilities | Implemented (54 tests) |
| Capabilities | CHERI is future hardware path | HAL abstraction layer ready for CHERI | Designed, not yet needed |
| Capabilities | TickTock found 5 MPU bugs via verification | Prioritize formal verification of MMU code | Planned |
| Coherence | Barrelfish: make coherence explicit, don't rely on snooping | Per-core schedulers with message-passing (Phase C) | ruvix-smp designed |
| Coherence | No prior art for semantic coherence in scheduling | RVM's coherence-aware scheduler is novel | Implemented (39 tests) |
| Coherence | Spectral methods provide mathematical guarantees | ruvector-coherence spectral module | Implemented |
| Agent runtimes | Lunatic validates actors-as-WASM model | RVF components as capability-gated WASM actors | Designed |
| Agent runtimes | Agent workloads differ fundamentally from VM workloads | 6 primitives + 12 syscalls, no POSIX | Implemented |
| Graph scheduling | Min-cut minimizes cross-partition traffic | Wire ruvector-mincut into partition manager | Designed, not yet wired |
| Graph scheduling | Spectral partitioning gives near-optimal cuts | Already have spectral infrastructure | Implemented |
| Graph scheduling | Dynamic rebalancing needs incremental spectral updates | SpectralTracker supports perturbation updates | Implemented |

7.2 Open Research Questions

  1. Formal verification scope: What subset of the ruvix kernel can be practically verified? The entire ruvix-cap crate is #![forbid(unsafe_code)] and is a good candidate. The ruvix-aarch64 crate contains inherent unsafe code (MMU manipulation) that would need different verification techniques (possibly refinement proofs as in seL4).

  2. Coherence signal latency: Computing spectral coherence scores involves linear algebra (CG solver, power iteration). Can this be fast enough for the scheduling hot path? The inline CG solver in spectral.rs uses 10-15 iterations; benchmarking against ruvector-solver's optimized version is needed.

  3. WASM runtime selection: Wasmtime's no_std support is proven (Hyperlight) but cranelift JIT requires virtual memory. For the initial Phase B port, should RVM use: (a) wasmtime with cranelift JIT (better performance, needs MMU), (b) wasmtime with winch baseline compiler (simpler, still needs MMU), or (c) wasm-micro-runtime (interpreter, no MMU needed, slower)?

  4. Multi-core coherence architecture: When Phase C introduces SMP, should the scheduler use: (a) a single shared scheduler with spinlock protection (simple, doesn't scale), (b) per-core schedulers with work-stealing (Lunatic model), or (c) per-core schedulers with message-passing (Barrelfish model)? The Barrelfish data suggests (c) for >8 cores.

  5. Dynamic partition count: The current PartitionManager uses a compile-time const generic M for maximum partitions. Should this be dynamic to support workloads with variable component counts?

7.3 Recommended Next Steps

  1. Immediate: Wire ruvector-mincut into ruvix-sched's PartitionManager for dynamic task-to-partition assignment based on communication graph analysis.

  2. Phase B priority: Study Hyperlight's wasmtime no_std integration for the bare-metal WASM runtime port. The wasmtime-platform.h abstraction maps cleanly to ruvix-hal traits.

  3. Verification: Begin formal verification of ruvix-cap using Kani (Rust model checker) or Creusot. The #![forbid(unsafe_code)] constraint makes this tractable.

  4. Benchmarking: Measure spectral coherence computation latency in the scheduling hot path. If too slow, implement a fast-path approximation that falls back to full computation periodically (the SpectralTracker already supports this with refresh_threshold).

  5. Phase C design: Adopt Barrelfish's per-core kernel model for SMP. The ruvix-smp crate's topology and IPI infrastructure are already aligned with this approach.


8. References

Bare-Metal Rust OS Projects

Capability-Based Systems

Multikernel and Coherence

Agent/WASM Runtimes

Graph Partitioning

RVM Internal

  • ADR-087: RVM Cognition Kernel (docs/adr/ADR-087-ruvix-cognition-kernel.md)
  • ADR-014: Coherence Engine Architecture (docs/adr/ADR-014-coherence-engine.md)
  • ADR-029: RVF Canonical Format
  • ADR-047: Proof-Gated Mutation Protocol
  • ruvix workspace: crates/ruvix/ (22 internal crates)
  • ruvector-mincut: crates/ruvector-mincut/
  • ruvector-sparsifier: crates/ruvector-sparsifier/
  • ruvector-solver: crates/ruvector-solver/
  • ruvector-coherence: crates/ruvector-coherence/
  • ruvector-raft: crates/ruvector-raft/

RVM Microhypervisor Architecture

Status

Draft -- 2026-04-04

Abstract

RVM is a Rust-first bare-metal microhypervisor that replaces the VM abstraction with coherence domains (partitions). It runs standalone without Linux or KVM, targeting QEMU virt as the reference platform with paths to real hardware on AArch64, RISC-V, and x86-64. The hypervisor integrates RuVector's mincut, sparsifier, and solver crates as first-class subsystems driving placement, isolation, and scheduling decisions.

This document covers the full system architecture from reset vector to agent runtime.


Table of Contents

  1. Design Principles
  2. Boot Sequence
  3. Core Kernel Objects
  4. Memory Architecture
  5. Scheduler Design
  6. IPC Design
  7. Device Model
  8. Witness Subsystem
  9. Agent Runtime Layer
  10. Hardware Abstraction
  11. Integration with RuVector
  12. What Makes RVM Different

1. Design Principles

1.1 Not a VM, Not a Container -- a Coherence Domain

Traditional hypervisors (KVM, Xen, Firecracker) virtualize hardware to run guest operating systems. Traditional containers (Docker, gVisor) share a host kernel with namespace isolation. RVM does neither.

An RVM partition is a coherence domain: a set of memory regions, capabilities, communication edges, and scheduled tasks that form a self-consistent unit of computation. Partitions are not VMs -- they have no emulated hardware, no guest kernel, no BIOS. They are not containers -- there is no host kernel to share. The hypervisor is the kernel.

The unit of isolation is defined by the graph structure of partition communication, not by hardware virtualization features. A mincut of the communication graph reveals the natural fault isolation boundary. This is a fundamentally different model.

1.2 Core Invariants

These invariants hold for every operation in the system:

| ID | Invariant | Enforcement |
| --- | --- | --- |
| INV-1 | No mutation without proof | ProofGate<T> at type level, 3-tier verification |
| INV-2 | No access without capability | Capability table checked on every syscall |
| INV-3 | Every privileged action is witnessed | Append-only witness log, no opt-out |
| INV-4 | No unbounded allocation in syscall path | Pre-allocated structures, slab allocators |
| INV-5 | No priority inversion | Capability-based access prevents blocking on unheld resources |
| INV-6 | Reconstruction from witness + dormant state | Deterministic replay from checkpoint + log |
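INV-6 in miniature: replaying the same witnessed log from the same checkpoint must reconstruct the same state. The types below are invented for illustration and do not reflect ruvix-witness's actual log format.

```rust
/// Hypothetical checkpointed state; any deterministic state machine works.
#[derive(Clone, Debug, PartialEq)]
struct State {
    epoch: u64,
    balance: i64,
}

/// Hypothetical witnessed operations (in reality each carries a proof).
enum WitnessedOp {
    Credit(i64),
    Debit(i64),
    BumpEpoch,
}

/// Deterministic replay: fold the log over the checkpoint. No wall-clock
/// reads, no randomness -- so two replays cannot diverge.
fn replay(checkpoint: &State, log: &[WitnessedOp]) -> State {
    log.iter().fold(checkpoint.clone(), |mut s, op| {
        match op {
            WitnessedOp::Credit(v) => s.balance += *v,
            WitnessedOp::Debit(v) => s.balance -= *v,
            WitnessedOp::BumpEpoch => s.epoch += 1,
        }
        s
    })
}

fn main() {
    let checkpoint = State { epoch: 3, balance: 100 };
    let log = [WitnessedOp::Credit(50), WitnessedOp::Debit(20), WitnessedOp::BumpEpoch];
    let a = replay(&checkpoint, &log);
    let b = replay(&checkpoint, &log);
    assert_eq!(a, b); // determinism: same checkpoint + log => same state
    assert_eq!(a, State { epoch: 4, balance: 130 });
}
```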

1.3 Crate Dependency DAG

ruvix-types          (no_std, #![forbid(unsafe_code)])
    |
    +-- ruvix-cap    (capability manager, derivation trees)
    |       |
    +-------+-- ruvix-proof    (3-tier proof engine)
    |       |
    +-------+-- ruvix-region   (typed memory with ownership)
    |       |
    +-------+-- ruvix-queue    (io_uring-style IPC)
    |       |
    +-------+-- ruvix-sched    (graph-pressure scheduler)
    |       |
    +-------+-- ruvix-vecgraph (kernel-resident vector/graph)
    |
    +-- ruvix-hal              (HAL traits, platform-agnostic)
    |       |
    |       +-- ruvix-aarch64  (ARM boot, MMU, exceptions)
    |       +-- ruvix-riscv    (RISC-V boot, MMU, exceptions)  [Phase C]
    |       +-- ruvix-x86_64   (x86 boot, VMX, exceptions)     [Phase D]
    |
    +-- ruvix-physmem          (buddy allocator)
    +-- ruvix-dtb              (device tree parser)
    +-- ruvix-drivers          (PL011, GIC, timer)
    +-- ruvix-dma              (DMA engine)
    +-- ruvix-net              (virtio-net)
    +-- ruvix-witness          (witness log + replay)       [NEW]
    +-- ruvix-partition        (coherence domain manager)   [NEW]
    +-- ruvix-commedge         (partition communication)    [NEW]
    +-- ruvix-pressure         (mincut integration)         [NEW]
    +-- ruvix-agent            (WASM agent runtime)         [NEW]
    |
    +-- ruvix-nucleus          (integration, syscall dispatch)

2. Boot Sequence

RVM boots directly from the reset vector with no dependency on any existing OS, bootloader, or hypervisor. The sequence is identical in structure across architectures, with platform-specific assembly stubs.

2.1 Stage 0: Reset Vector (Assembly)

The CPU begins execution at the platform-defined reset vector. A minimal assembly stub performs the operations that cannot be expressed in Rust.

AArch64 (EL2 entry for hypervisor mode):

// ruvix-aarch64/src/boot.S
.section .text.boot
.global _start

_start:
    // On QEMU virt, firmware drops us at EL2 (hypervisor mode)
    // x0 = DTB address

    // 1. Check we are at EL2
    mrs     x1, CurrentEL
    lsr     x1, x1, #2
    cmp     x1, #2
    b.ne    _wrong_el

    // 2. Disable MMU, caches (clean state)
    mrs     x1, sctlr_el2
    bic     x1, x1, #1         // M=0: MMU off
    bic     x1, x1, #(1 << 2)  // C=0: data cache off
    bic     x1, x1, #(1 << 12) // I=0: instruction cache off
    msr     sctlr_el2, x1
    isb

    // 3. Set up exception vector table
    adr     x1, _exception_vectors_el2
    msr     vbar_el2, x1

    // 4. Initialize stack pointer
    adr     x1, _stack_top
    mov     sp, x1

    // 5. Clear BSS
    adr     x1, __bss_start
    adr     x2, __bss_end
.Lbss_clear:
    cmp     x1, x2
    b.ge    .Lbss_done
    str     xzr, [x1], #8
    b       .Lbss_clear
.Lbss_done:

    // 6. x0 still holds DTB address -- pass to Rust
    bl      ruvix_entry

    // Should never return
    b       .

_wrong_el:
    // If at EL1, attempt to elevate via HVC (QEMU-specific)
    // If at EL3, configure EL2 and eret
    // ...

RISC-V (HS-mode entry):

// ruvix-riscv/src/boot.S
.section .text.boot
.global _start

_start:
    // a0 = hart ID, a1 = DTB address
    // QEMU starts in M-mode; OpenSBI transitions to S-mode
    // We need HS-mode (hypervisor extension)

    // 1. Check for hypervisor extension
    csrr    t0, misa
    andi    t0, t0, (1 << 7)   // 'H' bit
    beqz    t0, _no_hypervisor

    // 2. Park non-boot harts
    bnez    a0, _park

    // 3. Set up stack
    la      sp, _stack_top

    // 4. Clear BSS
    la      t0, __bss_start
    la      t1, __bss_end
1:  bge     t0, t1, 2f
    sd      zero, (t0)
    addi    t0, t0, 8
    j       1b
2:

    // 5. Enter Rust (a0=hart_id, a1=dtb)
    call    ruvix_entry

_park:
    wfi
    j       _park

x86-64 (VMX root mode):

; ruvix-x86_64/src/boot.asm
; Entered from a multiboot2-compliant loader or direct long mode setup
; eax = multiboot2 magic, ebx = info struct pointer

section .text.boot
global _start
bits 64

_start:
    ; 1. Already in long mode (64-bit) from bootloader
    ; 2. Enable VMX if supported
    mov     ecx, 0x3A          ; IA32_FEATURE_CONTROL MSR
    rdmsr
    test    eax, (1 << 2)      ; VMXON outside SMX
    jz      _no_vmx

    ; 3. Set up stack
    lea     rsp, [_stack_top]

    ; 4. Clear BSS
    lea     rdi, [__bss_start]
    lea     rcx, [__bss_end]
    sub     rcx, rdi
    shr     rcx, 3
    xor     eax, eax
    rep     stosq

    ; 5. rdi = multiboot info pointer
    mov     rdi, rbx
    call    ruvix_entry

    hlt
    jmp     $

2.2 Stage 1: Rust Entry and Hardware Detection

The assembly stub hands off to a single Rust entry point. This function is #[no_mangle] and extern "C", receiving the DTB/multiboot pointer.

// ruvix-nucleus/src/entry.rs

/// Unified Rust entry point. Platform stubs call this after basic setup.
/// `platform_info` is a DTB address (AArch64/RISC-V) or multiboot2 info
/// pointer (x86-64).
#[no_mangle]
pub extern "C" fn ruvix_entry(platform_info: usize) -> ! {
    // Phase 1: Hardware detection
    let hw = HardwareInfo::detect(platform_info);

    // Phase 2: Early serial for diagnostics
    let mut console = hw.early_console();
    console.write_str("RVM v0.1.0 booting\n");
    console.write_fmt(format_args!(
        "  arch={}, cores={}, ram={}MB\n",
        hw.arch_name(), hw.core_count(), hw.ram_bytes() >> 20
    ));

    // Phase 3: Physical memory allocator
    let mut phys = PhysicalAllocator::new(&hw.memory_regions);

    // Phase 4: MMU / page table setup
    let mut mmu = hw.init_mmu(&mut phys);

    // Phase 5: Hypervisor mode configuration
    hw.init_hypervisor_mode(&mut mmu);

    // Phase 6: Interrupt controller
    let mut irq = hw.init_interrupt_controller();

    // Phase 7: Timer
    let timer = hw.init_timer(&mut irq);

    // Phase 8: Kernel subsystem initialization
    let kernel = Kernel::init(KernelInit {
        phys: &mut phys,
        mmu: &mut mmu,
        irq: &mut irq,
        timer: &timer,
        console: &mut console,
    });

    // Phase 9: Load boot RVF and start first partition
    kernel.load_boot_rvf_and_start();

    // Phase 10: Enter scheduler (never returns)
    kernel.scheduler_loop()
}

2.3 Stage 2: MMU and Hypervisor Mode

The critical distinction from a traditional kernel: RVM runs in hypervisor privilege level, not kernel level.

| Architecture | RVM Level | Guest (Partition) Level | What This Means |
|--------------|-----------|-------------------------|-----------------|
| AArch64 | EL2 | EL1/EL0 | RVM controls stage-2 page tables; partitions get full EL1 page tables if needed |
| RISC-V | HS-mode | VS-mode/VU-mode | Hypervisor extension controls guest physical address translation |
| x86-64 | VMX root | VMX non-root | EPT (Extended Page Tables) provide second-level address translation |

Running at the hypervisor level provides two key advantages over running at kernel level (EL1/Ring 0):

  1. Two-stage address translation: The hypervisor controls the mapping from guest-physical to host-physical addresses. Partitions can have their own page tables (stage-1) while the hypervisor enforces isolation via stage-2 tables. This is strictly more powerful than single-stage translation.

  2. Trap-and-emulate without paravirtualization: The hypervisor can trap on specific instructions (WFI, MSR, MMIO access) without requiring the partition to be aware it is virtualized. This is essential for running unmodified WASM runtimes.

Stage-2 page table setup (AArch64):

// ruvix-aarch64/src/stage2.rs

/// Stage-2 translation table for a partition.
///
/// Maps Intermediate Physical Addresses (IPA) produced by the partition's
/// stage-1 tables to actual Physical Addresses (PA). The hypervisor
/// controls this mapping exclusively.
pub struct Stage2Tables {
    /// Level-0 table base (4KB aligned)
    root: PhysAddr,
    /// Physical pages backing the table structure
    pages: ArrayVec<PhysAddr, 512>,
    /// IPA range assigned to this partition
    ipa_range: Range<u64>,
}

impl Stage2Tables {
    /// Create stage-2 tables for a partition with the given IPA range.
    ///
    /// The IPA range defines the partition's "view" of physical memory.
    /// All accesses outside this range trap to the hypervisor.
    pub fn new(
        ipa_range: Range<u64>,
        phys: &mut PhysicalAllocator,
    ) -> Result<Self, HypervisorError> {
        let root = phys.allocate_page()?;
        // Zero the root table
        unsafe { core::ptr::write_bytes(root.as_mut_ptr::<u8>(), 0, PAGE_SIZE) };

        Ok(Self {
            root,
            pages: ArrayVec::new(),
            ipa_range,
        })
    }

    /// Map an IPA to a PA with the given attributes.
    ///
    /// Enforces that the IPA falls within the partition's assigned range.
    pub fn map(
        &mut self,
        ipa: u64,
        pa: PhysAddr,
        attrs: Stage2Attrs,
        phys: &mut PhysicalAllocator,
    ) -> Result<(), HypervisorError> {
        if !self.ipa_range.contains(&ipa) {
            return Err(HypervisorError::IpaOutOfRange);
        }
        // Walk/allocate 4-level table and install entry
        self.walk_and_install(ipa, pa, attrs, phys)
    }

    /// Activate these tables for the current vCPU.
    ///
    /// Writes VTTBR_EL2 with the table base and VMID.
    pub unsafe fn activate(&self, vmid: u16) {
        let vttbr = self.root.as_u64() | ((vmid as u64) << 48);
        core::arch::asm!(
            "msr vttbr_el2, {val}",
            "isb",
            val = in(reg) vttbr,
        );
    }
}

/// Stage-2 page attributes.
#[derive(Debug, Clone, Copy)]
pub struct Stage2Attrs {
    pub readable: bool,
    pub writable: bool,
    pub executable: bool,
    /// Device memory (non-cacheable, strongly ordered)
    pub device: bool,
}

2.4 Stage 3: Capability Table and Kernel Object Initialization

After the MMU is active and hypervisor mode is configured, the kernel initializes its object tables:

// ruvix-nucleus/src/init.rs

impl Kernel {
    pub fn init(init: KernelInit) -> Self {
        // 1. Capability manager with root capability
        let mut cap_mgr: CapabilityManager<4096> =
            CapabilityManager::new(CapManagerConfig::default());

        // 2. Region manager backed by physical allocator
        let region_mgr = RegionManager::new_baremetal(init.phys);

        // 3. Queue manager (pre-allocate ring buffer pool)
        let queue_mgr = QueueManager::new(init.phys, 256); // 256 queues max

        // 4. Proof engine
        let proof_engine = ProofEngine::new(ProofEngineConfig::default());

        // 5. Witness log (append-only, physically backed)
        let witness_log = WitnessLog::new(init.phys, WITNESS_LOG_SIZE);

        // 6. Partition manager (coherence domain manager)
        let partition_mgr = PartitionManager::new(&mut cap_mgr);

        // 7. CommEdge manager (inter-partition channels)
        let commedge_mgr = CommEdgeManager::new(&queue_mgr);

        // 8. Pressure engine (mincut integration)
        let pressure = PressureEngine::new();

        // 9. Scheduler
        let scheduler = Scheduler::new(SchedulerConfig::default());

        // 10. Vector/graph kernel objects
        let vecgraph = VecGraphManager::new(init.phys, &proof_engine);

        Self {
            cap_mgr, region_mgr, queue_mgr, proof_engine,
            witness_log, partition_mgr, commedge_mgr, pressure,
            scheduler, vecgraph, timer: init.timer.clone(),
        }
    }
}

3. Core Kernel Objects

RVM defines eight first-class kernel objects. The first six (Task, Capability, Region, Queue, Timer, Proof) are inherited from Phase A (ADR-087). The remaining two (Partition, CommEdge) plus the supplementary metric objects (CoherenceScore, CutPressure, DeviceLease) are new to the hypervisor architecture.

3.1 Partition (Coherence Domain Container)

A partition is the primary execution container. It is NOT a VM.

// ruvix-partition/src/partition.rs

/// A coherence domain: the fundamental unit of isolation in RVM.
///
/// A partition groups:
/// - A set of tasks that execute within the domain
/// - A set of memory regions owned by the domain
/// - A capability table scoped to the domain
/// - A set of CommEdges connecting to other partitions
/// - A coherence score measuring internal consistency
/// - A set of device leases for hardware access
///
/// Partitions can be split, merged, migrated, and hibernated.
/// The hypervisor manages stage-2 page tables per partition,
/// ensuring hardware-enforced memory isolation.
pub struct Partition {
    /// Unique partition identifier
    id: PartitionId,

    /// Stage-2 page tables (hardware isolation)
    stage2: Stage2Tables,

    /// Tasks belonging to this partition
    tasks: BTreeMap<TaskHandle, TaskControlBlock>,

    /// Memory regions owned by this partition
    regions: BTreeMap<RegionHandle, RegionDescriptor>,

    /// Capability table for this partition
    cap_table: CapabilityTable,

    /// Communication edges to other partitions
    comm_edges: ArrayVec<CommEdgeHandle, MAX_EDGES_PER_PARTITION>,

    /// Current coherence score (computed by solver crate)
    coherence: CoherenceScore,

    /// Current cut pressure (computed by mincut crate)
    cut_pressure: CutPressure,

    /// Active device leases
    device_leases: ArrayVec<DeviceLease, MAX_DEVICES_PER_PARTITION>,

    /// Partition state
    state: PartitionState,

    /// Witness log segment for this partition
    witness_segment: WitnessSegmentHandle,
}

/// Partition lifecycle states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionState {
    /// Actively scheduled, tasks running
    Active,
    /// All tasks suspended, state in hot memory
    Suspended,
    /// State compressed and moved to warm tier
    Warm,
    /// State serialized to cold storage, reconstructable
    Dormant,
    /// Being split into two partitions (transient)
    Splitting,
    /// Being merged with another partition (transient)
    Merging,
    /// Being migrated to another physical node (transient)
    Migrating,
}

/// Partition identity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct PartitionId(u64);

/// Maximum communication edges per partition.
pub const MAX_EDGES_PER_PARTITION: usize = 64;

/// Maximum devices per partition.
pub const MAX_DEVICES_PER_PARTITION: usize = 8;
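The lifecycle states above imply a transition graph, though the ADR does not pin down the exact edges. The following is a minimal, illustrative sketch assuming one plausible reading: structural operations (split/merge/migrate) are entered from and returned to Active, cooling proceeds Active → Suspended → Warm → Dormant, and reconstruction lands a dormant partition back in Suspended.

```rust
/// Illustrative transition table; the exact legal edges are an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionState { Active, Suspended, Warm, Dormant, Splitting, Merging, Migrating }

/// Returns true if `from -> to` is a legal lifecycle transition in this sketch.
fn can_transition(from: PartitionState, to: PartitionState) -> bool {
    use PartitionState::*;
    matches!(
        (from, to),
        // Suspend/resume and the cooling path
        (Active, Suspended) | (Suspended, Active)
            | (Suspended, Warm) | (Warm, Suspended) | (Warm, Dormant)
            // Reconstruction brings a dormant partition back as Suspended
            | (Dormant, Suspended)
            // Transient structural states start from Active and return to it
            | (Active, Splitting) | (Splitting, Active)
            | (Active, Merging) | (Merging, Active)
            | (Active, Migrating) | (Migrating, Active)
    )
}

fn main() {
    assert!(can_transition(PartitionState::Active, PartitionState::Suspended));
    assert!(can_transition(PartitionState::Warm, PartitionState::Dormant));
    // A dormant partition must be reconstructed first; it cannot jump to Active.
    assert!(!can_transition(PartitionState::Dormant, PartitionState::Active));
}
```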

Partition operations trait:

/// Operations on coherence domains.
pub trait PartitionOps {
    /// Create a new empty partition with its own stage-2 address space.
    fn create(
        &mut self,
        config: PartitionConfig,
        parent_cap: CapHandle,
        proof: &ProofToken,
    ) -> Result<PartitionId, HypervisorError>;

    /// Split a partition along a mincut boundary.
    ///
    /// The mincut algorithm identifies the optimal split point.
    /// Tasks, regions, and capabilities are redistributed according
    /// to which side of the cut they fall on.
    fn split(
        &mut self,
        partition: PartitionId,
        cut: &CutResult,
        proof: &ProofToken,
    ) -> Result<(PartitionId, PartitionId), HypervisorError>;

    /// Merge two partitions into one.
    ///
    /// Requires that the partitions share at least one CommEdge
    /// and that the merged coherence score exceeds a threshold.
    fn merge(
        &mut self,
        a: PartitionId,
        b: PartitionId,
        proof: &ProofToken,
    ) -> Result<PartitionId, HypervisorError>;

    /// Transition a partition to the dormant state.
    ///
    /// Serializes all state, releases physical memory, and records
    /// a reconstruction receipt in the witness log.
    fn hibernate(
        &mut self,
        partition: PartitionId,
        proof: &ProofToken,
    ) -> Result<ReconstructionReceipt, HypervisorError>;

    /// Reconstruct a dormant partition from its receipt.
    fn reconstruct(
        &mut self,
        receipt: &ReconstructionReceipt,
        proof: &ProofToken,
    ) -> Result<PartitionId, HypervisorError>;
}
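The redistribution step of `split` can be sketched in miniature. This assumes a `CutResult` that carries the two task sets as plain IDs (a simplification of the real type), and adopts a policy, not specified by the trait, that tasks the cut does not name stay with side A.

```rust
/// Simplified stand-in for the solver's cut result (field names hypothetical).
struct CutResult { side_a: Vec<u64>, side_b: Vec<u64> }

/// Partition the tasks of `all` according to which side of the cut they fall on.
fn redistribute(all: &[u64], cut: &CutResult) -> (Vec<u64>, Vec<u64>) {
    let mut a = Vec::new();
    let mut b = Vec::new();
    for &t in all {
        if cut.side_a.contains(&t) {
            a.push(t);
        } else if cut.side_b.contains(&t) {
            b.push(t);
        } else {
            a.push(t); // policy choice: unlisted tasks stay with side A
        }
    }
    (a, b)
}

fn main() {
    let cut = CutResult { side_a: vec![1, 2], side_b: vec![3] };
    let (a, b) = redistribute(&[1, 2, 3, 4], &cut);
    assert_eq!(a, vec![1, 2, 4]);
    assert_eq!(b, vec![3]);
}
```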

3.2 Capability (Unforgeable Token)

Capabilities are inherited directly from ruvix-cap (Phase A). In the hypervisor context, the capability system is extended with new object types:

// ruvix-types/src/object.rs (extended)

/// All kernel object types that can be referenced by capabilities.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum ObjectType {
    // Phase A objects
    Task        = 0,
    Region      = 1,
    Queue       = 2,
    Timer       = 3,
    VectorStore = 4,
    GraphStore  = 5,

    // Hypervisor objects (new)
    Partition   = 6,
    CommEdge    = 7,
    DeviceLease = 8,
    WitnessLog  = 9,
    PhysMemPool = 10,
}

/// Capability rights bitmap (extended for hypervisor).
bitflags! {
    pub struct CapRights: u32 {
        // Phase A rights
        const READ       = 1 << 0;
        const WRITE      = 1 << 1;
        const GRANT      = 1 << 2;
        const GRANT_ONCE = 1 << 3;
        const PROVE      = 1 << 4;
        const REVOKE     = 1 << 5;

        // Hypervisor rights (new)
        const SPLIT      = 1 << 6;   // Split a partition
        const MERGE      = 1 << 7;   // Merge partitions
        const MIGRATE    = 1 << 8;   // Migrate partition to another node
        const HIBERNATE  = 1 << 9;   // Hibernate/reconstruct
        const LEASE      = 1 << 10;  // Acquire device lease
        const WITNESS    = 1 << 11;  // Read witness log
    }
}
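The authorization check behind these rights is a subset test: a capability permits an operation only if it holds every required bit. A self-contained sketch using plain `u32` masks in place of the `bitflags!` type:

```rust
// Plain-u32 stand-ins for a few of the CapRights bits above,
// so the sketch compiles without the bitflags crate.
const READ: u32 = 1 << 0;
const WRITE: u32 = 1 << 1;
const SPLIT: u32 = 1 << 6;
const MERGE: u32 = 1 << 7;

/// A capability authorizes an operation iff it holds every required right.
fn authorizes(cap_rights: u32, required: u32) -> bool {
    cap_rights & required == required
}

fn main() {
    let cap = READ | WRITE | SPLIT;
    assert!(authorizes(cap, READ | SPLIT)); // both rights present
    assert!(!authorizes(cap, MERGE));       // MERGE was never granted
    assert!(!authorizes(cap, READ | MERGE)); // one missing bit denies the whole op
}
```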

3.3 Witness (Audit Record)

Every privileged action produces a witness record. See Section 8 for the full design.

3.4 MemoryRegion (Typed, Tiered Memory)

Memory regions from Phase A are extended with tier awareness:

// ruvix-region/src/tiered.rs

/// Memory tier indicating thermal/access characteristics.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum MemoryTier {
    /// Actively accessed, in L1/L2 cache working set.
    /// Physical pages pinned, stage-2 mapped.
    Hot = 0,

    /// Recently accessed, in DRAM but not cache-hot.
    /// Physical pages allocated, stage-2 mapped but may be
    /// compressed in background.
    Warm = 1,

    /// Not recently accessed. Pages compressed in-place
    /// using LZ4. Stage-2 mapping points to compressed form.
    /// Access triggers decompression fault handled by hypervisor.
    Dormant = 2,

    /// Evicted to persistent storage (NVMe, SD card, network).
    /// Stage-2 mapping removed. Access triggers reconstruction
    /// via the reconstruction protocol.
    Cold = 3,
}

/// A memory region with ownership tracking and tier management.
pub struct TieredRegion {
    /// Base region (Immutable, AppendOnly, or Slab policy)
    inner: RegionDescriptor,

    /// Current memory tier
    tier: MemoryTier,

    /// Owning partition
    owner: PartitionId,

    /// Sharing bitmap: which partitions have read access via CommEdge
    shared_with: BitSet<256>,

    /// Last access timestamp (for tier promotion/demotion)
    last_access_ns: u64,

    /// Compressed size (if Dormant tier)
    compressed_size: Option<usize>,

    /// Reconstruction receipt (if Cold tier)
    reconstruction: Option<ReconstructionReceipt>,
}

See Section 4 for the full memory architecture.

3.5 CommEdge (Communication Channel)

A CommEdge is a typed, capability-checked communication channel between two partitions:

// ruvix-commedge/src/lib.rs

/// A communication edge between two partitions.
///
/// CommEdges are the only mechanism for inter-partition communication.
/// They carry typed messages, support zero-copy sharing, and are
/// tracked by the coherence graph.
pub struct CommEdge {
    /// Unique edge identifier
    id: CommEdgeHandle,

    /// Source partition
    source: PartitionId,

    /// Destination partition
    dest: PartitionId,

    /// Underlying queue (from ruvix-queue)
    queue: QueueHandle,

    /// Edge weight in the coherence graph.
    /// Updated on every message send: weight += message_bytes.
    /// Decays over time: weight *= decay_factor per epoch.
    weight: AtomicU64,

    /// Message count since last epoch
    message_count: AtomicU64,

    /// Capability required to send on this edge
    send_cap: CapHandle,

    /// Capability required to receive on this edge
    recv_cap: CapHandle,

    /// Whether this edge supports zero-copy region sharing
    zero_copy: bool,

    /// Shared memory regions (if zero_copy is true)
    shared_regions: ArrayVec<RegionHandle, 16>,
}

/// CommEdge operations.
pub trait CommEdgeOps {
    /// Create a new CommEdge between two partitions.
    ///
    /// Both partitions must hold appropriate capabilities.
    /// The edge is registered in the coherence graph.
    fn create_edge(
        &mut self,
        source: PartitionId,
        dest: PartitionId,
        config: CommEdgeConfig,
        proof: &ProofToken,
    ) -> Result<CommEdgeHandle, HypervisorError>;

    /// Send a message over a CommEdge.
    ///
    /// Updates edge weight in the coherence graph.
    fn send(
        &mut self,
        edge: CommEdgeHandle,
        msg: &[u8],
        priority: MsgPriority,
        cap: CapHandle,
    ) -> Result<(), HypervisorError>;

    /// Receive a message from a CommEdge.
    fn recv(
        &mut self,
        edge: CommEdgeHandle,
        buf: &mut [u8],
        timeout: Duration,
        cap: CapHandle,
    ) -> Result<usize, HypervisorError>;

    /// Share a memory region over a CommEdge (zero-copy).
    ///
    /// Maps the region into the destination partition's stage-2
    /// address space with read-only permissions. The source retains
    /// ownership.
    fn share_region(
        &mut self,
        edge: CommEdgeHandle,
        region: RegionHandle,
        proof: &ProofToken,
    ) -> Result<(), HypervisorError>;

    /// Destroy a CommEdge.
    ///
    /// Unmaps any shared regions and removes the edge from the
    /// coherence graph.
    fn destroy_edge(
        &mut self,
        edge: CommEdgeHandle,
        proof: &ProofToken,
    ) -> Result<(), HypervisorError>;
}
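The edge-weight bookkeeping described above (weight += message_bytes on send, multiplicative decay per epoch) can be sketched as follows. The real field is an `AtomicU64`; this uses a plain `f64` for clarity, and the 0.5 decay factor is an illustrative value, not one fixed by the ADR.

```rust
/// Toy model of a CommEdge's coherence-graph weight.
struct EdgeWeight { weight: f64 }

impl EdgeWeight {
    /// Every send strengthens the edge by the message size.
    fn on_send(&mut self, message_bytes: u64) {
        self.weight += message_bytes as f64;
    }
    /// Every epoch, the weight decays so stale traffic stops counting.
    fn on_epoch(&mut self, decay_factor: f64) {
        self.weight *= decay_factor;
    }
}

fn main() {
    let mut w = EdgeWeight { weight: 0.0 };
    w.on_send(1024);
    w.on_send(1024);
    w.on_epoch(0.5); // one epoch of decay with an illustrative factor
    assert_eq!(w.weight, 1024.0);
}
```

Because recent traffic dominates the decayed sum, split/merge decisions track the *current* communication pattern rather than its whole history.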

3.6 DeviceLease (Time-Bounded Device Access)

// ruvix-partition/src/device_lease.rs

/// A time-bounded, revocable lease granting a partition access to
/// a hardware device.
///
/// Device leases are the hypervisor's mechanism for safe device
/// assignment. Unlike passthrough (where the guest owns the device
/// permanently), leases expire and can be revoked.
pub struct DeviceLease {
    /// Unique lease identifier
    id: LeaseId,

    /// Device being leased
    device: DeviceDescriptor,

    /// Partition holding the lease
    holder: PartitionId,

    /// Lease expiration (absolute time in nanoseconds)
    expires_ns: u64,

    /// Whether the lease has been revoked
    revoked: bool,

    /// MMIO region mapped into the partition's stage-2 space
    mmio_region: Option<RegionHandle>,

    /// Interrupt routing: device IRQ -> partition's virtual IRQ
    irq_routing: Option<(u32, u32)>,  // (physical_irq, virtual_irq)
}

/// Lease operations.
pub trait LeaseOps {
    /// Acquire a lease on a device.
    ///
    /// Requires LEASE capability. The device's MMIO region is mapped
    /// into the partition's stage-2 address space. Interrupts from
    /// the device are routed to the partition.
    fn acquire(
        &mut self,
        device: DeviceDescriptor,
        partition: PartitionId,
        duration_ns: u64,
        cap: CapHandle,
        proof: &ProofToken,
    ) -> Result<LeaseId, HypervisorError>;

    /// Renew an existing lease.
    fn renew(
        &mut self,
        lease: LeaseId,
        additional_ns: u64,
        proof: &ProofToken,
    ) -> Result<(), HypervisorError>;

    /// Revoke a lease (immediate).
    ///
    /// Unmaps MMIO region, disables interrupt routing, resets
    /// device to safe state.
    fn revoke(
        &mut self,
        lease: LeaseId,
        proof: &ProofToken,
    ) -> Result<(), HypervisorError>;
}
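The validity rule implied by the fields above is small enough to state directly: a lease grants access only while unrevoked and unexpired. Whether expiry is inclusive or exclusive at the boundary is an assumption of this sketch.

```rust
/// Minimal stand-in for the lease fields that matter for validity.
struct Lease { expires_ns: u64, revoked: bool }

/// A lease is usable iff it has not been revoked and has not expired.
fn lease_is_valid(lease: &Lease, now_ns: u64) -> bool {
    !lease.revoked && now_ns < lease.expires_ns
}

fn main() {
    let lease = Lease { expires_ns: 1_000_000, revoked: false };
    assert!(lease_is_valid(&lease, 999_999));
    assert!(!lease_is_valid(&lease, 1_000_000)); // expiry treated as exclusive here
    // Revocation dominates: even an unexpired lease is invalid once revoked.
    assert!(!lease_is_valid(&Lease { expires_ns: u64::MAX, revoked: true }, 0));
}
```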

3.7 CoherenceScore

// ruvix-pressure/src/coherence.rs

/// A coherence score for a partition, computed by the solver crate.
///
/// The score measures how "internally consistent" a partition is:
/// high coherence means the partition's tasks and data are tightly
/// coupled and should stay together. Low coherence signals that
/// the partition may benefit from splitting.
#[derive(Debug, Clone, Copy)]
pub struct CoherenceScore {
    /// Aggregate score in [0.0, 1.0]. Higher = more coherent.
    pub value: f64,

    /// Per-task contribution to the score.
    /// Identifies which tasks are most/least coupled.
    pub task_contributions: [f32; 64],

    /// Timestamp of last computation.
    pub computed_at_ns: u64,

    /// Whether the score is stale (> 1 epoch old).
    pub stale: bool,
}

3.8 CutPressure

// ruvix-pressure/src/cut.rs

/// Graph-derived isolation signal for a partition.
///
/// CutPressure is computed by running the ruvector-mincut algorithm
/// on the partition's communication graph. High pressure means the
/// partition has a cheap cut -- it could easily be split into two
/// independent halves.
#[derive(Debug, Clone)]
pub struct CutPressure {
    /// Minimum cut value across all edges in/out of this partition.
    /// Lower value = higher pressure to split.
    pub min_cut_value: f64,

    /// The actual cut: which edges to sever.
    pub cut_edges: ArrayVec<CommEdgeHandle, 32>,

    /// Partition IDs on each side of the proposed cut.
    pub side_a: ArrayVec<TaskHandle, 64>,
    pub side_b: ArrayVec<TaskHandle, 64>,

    /// Estimated coherence scores after split.
    pub predicted_coherence_a: f64,
    pub predicted_coherence_b: f64,

    /// Timestamp.
    pub computed_at_ns: u64,
}

4. Memory Architecture

4.1 Two-Stage Address Translation

RVM uses hardware-enforced two-stage address translation for partition isolation:

Partition Virtual Address (VA)
    |
    | Stage-1 translation (partition's own page tables, EL1)
    |
    v
Intermediate Physical Address (IPA)
    |
    | Stage-2 translation (hypervisor-controlled, EL2)
    |
    v
Physical Address (PA)

Each partition has its own stage-1 page tables (which it controls) and stage-2 page tables (which only the hypervisor can modify). This means:

  • A partition cannot access memory outside its assigned IPA range
  • The hypervisor can remap, compress, or migrate physical pages without the partition's knowledge
  • Zero-copy sharing is implemented by mapping the same PA into two partitions' stage-2 tables
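The third property is the interesting one, and it can be modeled in a few lines. The sketch below stands in for stage-2 tables with a map from IPA page to (PA page, writable); real tables are 4-level hardware structures, so this demonstrates only the zero-copy invariant: one physical page appearing in two partitions' tables, writable for the owner and read-only for the viewer.

```rust
use std::collections::BTreeMap;

/// Toy stage-2 table: IPA page -> (PA page, writable).
type Stage2 = BTreeMap<u64, (u64, bool)>;

/// Map the owner's page into the viewer's table at the same IPA, read-only.
/// The physical page is not copied; only a second mapping is installed.
fn share_readonly(owner: &Stage2, viewer: &mut Stage2, ipa: u64) {
    let (pa, _) = owner[&ipa];
    viewer.insert(ipa, (pa, false));
}

fn main() {
    let mut a = Stage2::new();
    let mut b = Stage2::new();
    a.insert(0x1000, (0x8000, true)); // partition A owns PA 0x8000, writable
    share_readonly(&a, &mut b, 0x1000);
    assert_eq!(b[&0x1000], (0x8000, false)); // same PA, read-only view
    assert_eq!(a[&0x1000], (0x8000, true));  // owner's mapping is unchanged
}
```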

4.2 Physical Memory Allocator

The physical allocator uses a buddy system with per-tier free lists:

// ruvix-physmem/src/buddy.rs

/// Physical memory allocator with tier-aware allocation.
pub struct PhysicalAllocator {
    /// Buddy allocator for each tier
    tiers: [BuddyAllocator; 4],  // Hot, Warm, Dormant, Cold

    /// Total physical memory available
    total_pages: usize,

    /// Per-tier statistics
    stats: [TierStats; 4],
}

impl PhysicalAllocator {
    /// Allocate pages from a specific tier.
    pub fn allocate_pages(
        &mut self,
        count: usize,
        tier: MemoryTier,
    ) -> Result<PhysRange, AllocError> {
        self.tiers[tier as usize].allocate(count)
    }

    /// Promote pages from a colder tier to a warmer tier.
    ///
    /// This is called when a dormant region is accessed.
    pub fn promote(
        &mut self,
        range: PhysRange,
        from: MemoryTier,
        to: MemoryTier,
    ) -> Result<PhysRange, AllocError> {
        assert!(to < from, "promotion must go to a warmer tier");
        let new_range = self.tiers[to as usize].allocate(range.page_count())?;
        // Copy and decompress if needed
        self.copy_and_promote(range, new_range, from, to)?;
        self.tiers[from as usize].free(range);
        Ok(new_range)
    }

    /// Demote pages to a colder tier.
    ///
    /// Pages are compressed (Dormant) or evicted (Cold).
    pub fn demote(
        &mut self,
        range: PhysRange,
        from: MemoryTier,
        to: MemoryTier,
    ) -> Result<DemoteReceipt, AllocError> {
        assert!(to > from, "demotion must go to a colder tier");
        match to {
            // Hot -> Warm is bookkeeping only; pages stay resident and mapped
            // (helper name illustrative)
            MemoryTier::Warm => self.mark_warm(range),
            MemoryTier::Dormant => self.compress_in_place(range),
            MemoryTier::Cold => self.evict_to_storage(range),
            MemoryTier::Hot => unreachable!("demotion target cannot be Hot"),
        }
    }
}

4.3 Memory Ownership via Rust's Type System

Memory ownership is enforced at the type level. A RegionHandle is a non-copyable token:

// ruvix-region/src/ownership.rs

/// A typed memory region handle. Non-copyable, non-clonable.
///
/// Ownership semantics:
/// - Exactly one partition owns a region at any time
/// - Transfer requires a proof and witness record
/// - Sharing creates a read-only view (not an ownership transfer)
/// - Dropping the handle does NOT free the region (the hypervisor manages lifetime)
pub struct OwnedRegion<P: RegionPolicy> {
    handle: RegionHandle,
    owner: PartitionId,
    _policy: PhantomData<P>,
}

/// Immutable region policy marker.
pub struct Immutable;

/// Append-only region policy marker.
pub struct AppendOnly;

/// Slab region policy marker.
pub struct Slab;

impl<P: RegionPolicy> OwnedRegion<P> {
    /// Transfer ownership to another partition.
    ///
    /// Consumes self, ensuring the old owner cannot use the handle.
    /// Updates stage-2 page tables for both partitions.
    pub fn transfer(
        self,
        new_owner: PartitionId,
        proof: &ProofToken,
        witness: &mut WitnessLog,
    ) -> Result<OwnedRegion<P>, HypervisorError> {
        witness.record(WitnessRecord::RegionTransfer {
            region: self.handle,
            from: self.owner,
            to: new_owner,
            proof_tier: proof.tier(),
        });
        // Remap stage-2 tables
        Ok(OwnedRegion {
            handle: self.handle,
            owner: new_owner,
            _policy: PhantomData,
        })
    }
}

/// Zero-copy sharing between partitions.
///
/// Only Immutable and AppendOnly regions can be shared (INV-4 from
/// Phase A: TOCTOU protection). Slab regions are never shared.
impl OwnedRegion<Immutable> {
    pub fn share_readonly(
        &self,
        target: PartitionId,
        edge: CommEdgeHandle,
        witness: &mut WitnessLog,
    ) -> Result<SharedRegionView, HypervisorError> {
        witness.record(WitnessRecord::RegionShare {
            region: self.handle,
            owner: self.owner,
            target,
            edge,
        });
        Ok(SharedRegionView {
            handle: self.handle,
            viewer: target,
        })
    }
}

4.4 Tier Management

The hypervisor runs a background tier management loop that promotes and demotes regions based on access patterns:

// ruvix-partition/src/tier_manager.rs

/// Tier management policy.
pub struct TierPolicy {
    /// Promote to Hot if accessed more than this many times per epoch
    pub hot_access_threshold: u32,
    /// Demote to Dormant if not accessed for this many epochs
    pub dormant_after_epochs: u32,
    /// Demote to Cold if dormant for this many epochs
    pub cold_after_epochs: u32,
    /// Maximum Hot tier memory (bytes) before forced demotion
    pub max_hot_bytes: usize,
    /// Compression algorithm for Dormant tier
    pub compression: CompressionAlgorithm,
}

/// Reconstruction protocol for dormant/cold state.
///
/// A reconstruction receipt contains everything needed to rebuild
/// a region from its serialized form plus the witness log.
#[derive(Debug, Clone)]
pub struct ReconstructionReceipt {
    /// Region identity
    pub region: RegionHandle,
    /// Owning partition
    pub partition: PartitionId,
    /// Hash of the serialized state
    pub state_hash: [u8; 32],
    /// Storage location (for Cold tier)
    pub storage_location: StorageLocation,
    /// Witness log range needed for replay
    pub witness_range: Range<u64>,
    /// Proof that the serialization was correct
    pub attestation: ProofAttestation,
}

#[derive(Debug, Clone)]
pub enum StorageLocation {
    /// Compressed in DRAM at the given physical address range
    CompressedDram(PhysRange),
    /// On block device at the given LBA range
    BlockDevice { device: DeviceDescriptor, lba_range: Range<u64> },
    /// On remote node (for distributed RVM)
    Remote { node_id: u64, receipt_id: u64 },
}
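The demotion side of the tier loop reduces to a threshold comparison against the `TierPolicy` fields. This sketch assumes `cold_after_epochs` counts on top of the dormant threshold (matching "demote to Cold if dormant for this many epochs"); the promotion rule (`hot_access_threshold`) belongs on the access path and is omitted here.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Tier { Hot, Warm, Dormant, Cold }

/// Demotion target for a region idle for `idle_epochs`, given the policy
/// thresholds. `idle_epochs == 0` means the region was touched this epoch.
fn target_tier(idle_epochs: u32, dormant_after: u32, cold_after: u32) -> Tier {
    if idle_epochs >= dormant_after + cold_after {
        Tier::Cold
    } else if idle_epochs >= dormant_after {
        Tier::Dormant
    } else if idle_epochs > 0 {
        Tier::Warm
    } else {
        Tier::Hot
    }
}

fn main() {
    // Policy: dormant after 4 idle epochs, cold after 16 more.
    assert_eq!(target_tier(0, 4, 16), Tier::Hot);
    assert_eq!(target_tier(2, 4, 16), Tier::Warm);
    assert_eq!(target_tier(5, 4, 16), Tier::Dormant);
    assert_eq!(target_tier(20, 4, 16), Tier::Cold);
}
```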

4.5 No Demand Paging

RVM does not implement demand paging, swap, or copy-on-write. All regions are physically backed at creation time. This is a deliberate design choice:

  • Deterministic latency: No page fault handler in the critical path
  • Simpler correctness proofs: No hidden state in page tables
  • Better for real-time: No unbounded delay from swap I/O

The tradeoff is higher memory pressure, which is managed by the tier system: instead of swapping, RVM compresses (Dormant) or serializes (Cold) entire regions with explicit witness records.


5. Scheduler Design

5.1 Three Scheduling Modes

The scheduler operates in one of three modes at any given time:

// ruvix-sched/src/mode.rs

/// Scheduler operating mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchedulerMode {
    /// Hard real-time mode.
    ///
    /// Activated when any partition has a deadline-critical task.
    /// Uses pure EDF (Earliest Deadline First) within partitions.
    /// No novelty boosting. No coherence-based reordering.
    /// Guaranteed bounded preemption latency.
    Reflex,

    /// Normal operating mode.
    ///
    /// Combines three signals:
    /// 1. Deadline pressure (EDF baseline)
    /// 2. Novelty signal (priority boost for new information)
    /// 3. Structural risk (deprioritize mutations that lower coherence)
    /// 4. Cut pressure (boost partitions near a split boundary)
    Flow,

    /// Recovery mode.
    ///
    /// Activated when coherence drops below a critical threshold
    /// or a partition reconstruction fails. Reduces concurrency,
    /// favors stability over throughput.
    Recovery,
}

5.2 Graph-Pressure-Driven Scheduling

In Flow mode, the scheduler uses the coherence graph to make decisions:

// ruvix-sched/src/graph_pressure.rs

/// Priority computation for Flow mode.
///
/// final_priority = deadline_urgency
///                + (novelty_boost * NOVELTY_WEIGHT)
///                - (structural_risk * RISK_WEIGHT)
///                + cut_pressure_boost
pub fn compute_flow_priority(
    task: &TaskControlBlock,
    partition: &Partition,
    pressure: &PressureEngine,
    now_ns: u64,
) -> FlowPriority {
    // 1. Deadline urgency: how close to missing the deadline
    let deadline_urgency = task.deadline
        .map(|d| {
            let remaining = d.saturating_sub(now_ns);
            // Urgency increases as deadline approaches
            1.0 / (remaining as f64 / 1_000_000.0 + 1.0)
        })
        .unwrap_or(0.0);

    // 2. Novelty boost: is this task processing genuinely new data?
    let novelty_boost = partition.coherence.task_contributions
        [task.handle.index() % 64] as f64;

    // 3. Structural risk: would this task's pending mutations
    //    lower the partition's coherence score?
    let structural_risk = task.pending_mutation_risk();

    // 4. Cut pressure boost: if this partition is near a split
    //    boundary, boost tasks that would reduce the cut cost
    //    (making the partition more internally coherent)
    let cut_boost = if partition.cut_pressure.min_cut_value < SPLIT_THRESHOLD {
        // Boost tasks on the heavier side of the cut
        let on_heavy_side = partition.cut_pressure.side_a.len()
            > partition.cut_pressure.side_b.len();
        if partition.cut_pressure.side_a.contains(&task.handle) == on_heavy_side {
            PRESSURE_BOOST
        } else {
            0.0
        }
    } else {
        0.0
    };

    FlowPriority {
        deadline_urgency,
        novelty_boost: novelty_boost * NOVELTY_WEIGHT,
        structural_risk: structural_risk * RISK_WEIGHT,
        cut_pressure_boost: cut_boost,
        total: deadline_urgency
            + novelty_boost * NOVELTY_WEIGHT
            - structural_risk * RISK_WEIGHT
            + cut_boost,
    }
}

const NOVELTY_WEIGHT: f64 = 0.3;
const RISK_WEIGHT: f64 = 2.0;
const PRESSURE_BOOST: f64 = 0.5;
const SPLIT_THRESHOLD: f64 = 0.2;
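A worked numeric instance of the formula, using the constants above (`NOVELTY_WEIGHT = 0.3`, `RISK_WEIGHT = 2.0`) and inlining the urgency computation for a deadline 1 ms away:

```rust
/// The Flow-mode total from compute_flow_priority, with the document's
/// constants inlined (NOVELTY_WEIGHT = 0.3, RISK_WEIGHT = 2.0).
fn flow_total(deadline_urgency: f64, novelty: f64, risk: f64, cut_boost: f64) -> f64 {
    deadline_urgency + novelty * 0.3 - risk * 2.0 + cut_boost
}

fn main() {
    // Deadline 1 ms away: urgency = 1 / (1_000_000 ns / 1e6 + 1) = 0.5
    let urgency = 1.0 / (1_000_000.0 / 1_000_000.0 + 1.0);
    assert_eq!(urgency, 0.5);

    // Novel task (contribution 0.8), no risky pending mutations, no cut boost.
    let p = flow_total(urgency, 0.8, 0.0, 0.0);
    assert!((p - 0.74).abs() < 1e-12);

    // Same task with mutation risk 0.3: the 2.0 risk weight drags it
    // well below the safe variant, as intended.
    assert!(flow_total(urgency, 0.8, 0.3, 0.0) < p);
}
```

Note how a modest structural risk (0.3) outweighs the entire novelty boost (0.24): the weights deliberately bias the scheduler toward coherence-preserving work.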

5.3 Partition Split/Merge Triggers

The scheduler monitors cut pressure and triggers structural changes:

// ruvix-sched/src/structural.rs

/// Structural change triggers evaluated every epoch.
pub fn evaluate_structural_changes(
    partitions: &[Partition],
    pressure: &PressureEngine,
    config: &StructuralConfig,
) -> Vec<StructuralAction> {
    let mut actions = Vec::new();

    for partition in partitions {
        let cp = &partition.cut_pressure;
        let cs = &partition.coherence;

        // SPLIT trigger: low mincut AND low coherence
        if cp.min_cut_value < config.split_cut_threshold
            && cs.value < config.split_coherence_threshold
            && cp.predicted_coherence_a > cs.value
            && cp.predicted_coherence_b > cs.value
        {
            actions.push(StructuralAction::Split {
                partition: partition.id,
                cut: cp.clone(),
            });
        }

        // MERGE trigger: high coherence between two partitions
        // connected by a heavy CommEdge
        for edge_handle in &partition.comm_edges {
            if let Some(edge) = pressure.get_edge(*edge_handle) {
                let weight = edge.weight.load(Ordering::Relaxed);
                if weight > config.merge_edge_threshold {
                    let other = if edge.source == partition.id {
                        edge.dest
                    } else {
                        edge.source
                    };
                    actions.push(StructuralAction::Merge {
                        a: partition.id,
                        b: other,
                        edge_weight: weight,
                    });
                }
            }
        }

        // HIBERNATE trigger: partition has been suspended for too long
        if partition.state == PartitionState::Suspended
            && partition.last_activity_ns + config.hibernate_after_ns < now_ns()
        {
            actions.push(StructuralAction::Hibernate {
                partition: partition.id,
            });
        }
    }

    actions
}

5.4 Per-CPU Scheduling

On multi-core systems, each CPU runs its own scheduler instance with partition affinity:

// ruvix-sched/src/percpu.rs

/// Per-CPU scheduler state.
pub struct PerCpuScheduler {
    /// CPU identifier
    cpu_id: u32,

    /// Partitions assigned to this CPU
    assigned: ArrayVec<PartitionId, 32>,

    /// Current time quantum remaining (microseconds)
    quantum_remaining: u32,

    /// Currently running task
    current: Option<TaskHandle>,

    /// Mode
    mode: SchedulerMode,
}

/// Global scheduler coordinates per-CPU instances.
pub struct GlobalScheduler {
    /// Per-CPU schedulers
    per_cpu: ArrayVec<PerCpuScheduler, MAX_CPUS>,

    /// Partition-to-CPU assignment (informed by coherence graph)
    assignment: PartitionAssignment,

    /// Global mode override (Recovery overrides all CPUs)
    global_mode: Option<SchedulerMode>,
}
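A minimal sketch of the dispatch policy these structs imply: round-robin over a CPU's assigned partitions, granting a fixed quantum per pick. `PartitionId` as a bare newtype, the `Vec`-backed state, and the quantum handling are stand-ins for the real `ruvix-sched` types, not the actual scheduler.

```rust
/// Hypothetical per-CPU dispatch sketch (not the real ruvix-sched code).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PartitionId(pub u32);

pub struct PerCpuSketch {
    assigned: Vec<PartitionId>, // partitions pinned to this CPU
    next: usize,                // round-robin cursor
    quantum_us: u32,            // quantum granted per dispatch
}

impl PerCpuSketch {
    pub fn new(assigned: Vec<PartitionId>, quantum_us: u32) -> Self {
        Self { assigned, next: 0, quantum_us }
    }

    /// Pick the next partition to run and the quantum it receives.
    /// Returns None when no partitions are assigned (the CPU idles).
    pub fn pick_next(&mut self) -> Option<(PartitionId, u32)> {
        if self.assigned.is_empty() {
            return None;
        }
        let p = self.assigned[self.next % self.assigned.len()];
        self.next = (self.next + 1) % self.assigned.len();
        Some((p, self.quantum_us))
    }
}
```

In the real design the global scheduler would re-run the partition-to-CPU assignment from the coherence graph between epochs; this sketch only shows the inner rotation.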

6. IPC Design

6.1 Zero-Copy Message Passing

All inter-partition communication goes through CommEdges, which wrap the ruvix-queue ring buffers. Zero-copy is achieved by descriptor passing:

// ruvix-commedge/src/zerocopy.rs

/// A zero-copy message descriptor.
///
/// Instead of copying data, the sender places a descriptor in the
/// queue that references a shared region. The receiver reads directly
/// from the shared region.
///
/// This is safe because:
/// 1. Only Immutable or AppendOnly regions can be shared (no mutation)
/// 2. The stage-2 page tables enforce read-only access for the receiver
/// 3. The witness log records every share operation
#[derive(Debug, Clone, Copy)]
#[repr(C)]
pub struct ZeroCopyDescriptor {
    /// Shared region handle
    pub region: RegionHandle,
    /// Offset within the region
    pub offset: u32,
    /// Length of the data
    pub length: u32,
    /// Schema hash (for type checking)
    pub schema_hash: u64,
}

/// Send a zero-copy message.
///
/// The region must already be shared with the destination partition
/// via `CommEdgeOps::share_region`.
pub fn send_zerocopy(
    edge: &CommEdge,
    desc: ZeroCopyDescriptor,
    cap: CapHandle,
    cap_mgr: &CapabilityManager,
    witness: &mut WitnessLog,
) -> Result<(), HypervisorError> {
    // 1. Capability check
    let cap_entry = cap_mgr.lookup(cap)?;
    if !cap_entry.rights.contains(CapRights::WRITE) {
        return Err(HypervisorError::CapabilityDenied);
    }

    // 2. Verify region is shared with destination
    if !edge.shared_regions.contains(&desc.region) {
        return Err(HypervisorError::RegionNotShared);
    }

    // 3. Validate descriptor bounds
    // (offset + length must be within region size)

    // 4. Enqueue descriptor in ring buffer
    edge.queue.send_raw(
        bytemuck::bytes_of(&desc),
        MsgPriority::Normal,
    )?;

    // 5. Witness
    witness.record(WitnessRecord::ZeroCopySend {
        edge: edge.id,
        region: desc.region,
        offset: desc.offset,
        length: desc.length,
    });

    Ok(())
}
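The bounds check elided in step 3 can be sketched in isolation. This is a hedged illustration: the standalone descriptor copy and the `validate` helper below are not part of the real ruvix-commedge API, which validates against the kernel's region table.

```rust
/// Standalone copy of the descriptor layout for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub struct ZeroCopyDescriptor {
    pub region: u64, // stand-in for RegionHandle
    pub offset: u32,
    pub length: u32,
    pub schema_hash: u64,
}

/// offset + length must lie within the shared region.
/// Widening to u64 before adding makes the check overflow-safe.
pub fn validate(desc: &ZeroCopyDescriptor, region_size: u64) -> bool {
    let end = desc.offset as u64 + desc.length as u64; // cannot overflow u64
    end <= region_size
}
```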

6.2 Async Notification Mechanism

For lightweight signaling without data transfer (e.g., "new data available"), RVM provides notifications:

// ruvix-commedge/src/notification.rs

/// A notification word: a bitmask that can be atomically OR'd.
///
/// Notifications are the lightweight alternative to sending a
/// full message. A partition can wait on a notification word
/// and be woken when any bit is set.
///
/// This maps to a virtual interrupt injection at the hypervisor
/// level: setting a notification bit triggers a stage-2 fault
/// that the hypervisor converts to a virtual IRQ in the
/// destination partition.
pub struct NotificationWord {
    /// The notification bits (64 independent signals)
    bits: AtomicU64,

    /// Source partition (who can signal)
    source: PartitionId,

    /// Destination partition (who is waiting)
    dest: PartitionId,

    /// Capability required to signal
    signal_cap: CapHandle,
}

impl NotificationWord {
    /// Signal one or more notification bits.
    pub fn signal(&self, mask: u64, cap: CapHandle) -> Result<(), HypervisorError> {
        // Capability check omitted for brevity
        self.bits.fetch_or(mask, Ordering::Release);
        // Inject virtual interrupt into destination partition
        inject_virtual_irq(self.dest, NOTIFICATION_VIRQ);
        Ok(())
    }

    /// Wait for any bit in the mask to be set.
    ///
    /// Blocks the calling task until a matching bit is set.
    /// Returns the bits that were set.
    pub fn wait(&self, mask: u64) -> u64 {
        loop {
            let current = self.bits.load(Ordering::Acquire);
            let matched = current & mask;
            if matched != 0 {
                // Clear the matched bits
                self.bits.fetch_and(!matched, Ordering::AcqRel);
                return matched;
            }
            // Block task until notification IRQ
            yield_until_irq();
        }
    }
}
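The bit protocol above can be modeled in userspace without the virtual-IRQ machinery. In this sketch the blocking `wait` becomes a non-blocking `try_wait` that returns 0 instead of parking the task; the atomic OR/AND discipline is the same.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Userspace model of a notification word (no IRQ injection).
pub struct NotifySketch {
    bits: AtomicU64,
}

impl NotifySketch {
    pub fn new() -> Self {
        Self { bits: AtomicU64::new(0) }
    }

    /// OR one or more notification bits in (the signaling side).
    pub fn signal(&self, mask: u64) {
        self.bits.fetch_or(mask, Ordering::Release);
    }

    /// Consume and return any set bits matching `mask`; 0 if none.
    pub fn try_wait(&self, mask: u64) -> u64 {
        let matched = self.bits.load(Ordering::Acquire) & mask;
        if matched != 0 {
            // Clear only the bits we observed, leaving others pending.
            self.bits.fetch_and(!matched, Ordering::AcqRel);
        }
        matched
    }
}
```

Note the edge-triggered semantics: repeated signals on an already-set bit coalesce into one wakeup, which is the usual trade-off for notification words.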

6.3 Shared Memory Regions with Witness Tracking

Every shared memory operation is witnessed:

// Witness records for IPC operations
pub enum IpcWitnessRecord {
    /// A region was shared between partitions
    RegionShared {
        region: RegionHandle,
        from: PartitionId,
        to: PartitionId,
        permissions: PagePermissions,
        edge: CommEdgeHandle,
    },
    /// A zero-copy message was sent
    ZeroCopySent {
        edge: CommEdgeHandle,
        region: RegionHandle,
        offset: u32,
        length: u32,
    },
    /// A region share was revoked
    ShareRevoked {
        region: RegionHandle,
        from: PartitionId,
        to: PartitionId,
    },
    /// A notification was signaled
    NotificationSignaled {
        source: PartitionId,
        dest: PartitionId,
        mask: u64,
    },
}

7. Device Model

7.1 Lease-Based Device Access

RVM does not emulate hardware. Instead, it provides direct device access through time-bounded leases. This is fundamentally different from KVM's device emulation (QEMU) or Firecracker's minimal device model (virtio).

Traditional Hypervisor:
    Guest -> emulated device -> host driver -> real hardware

RVM:
    Partition -> [lease check] -> real hardware (via stage-2 MMIO mapping)

The hypervisor maps device MMIO regions directly into the partition's stage-2 address space. The partition interacts with real hardware registers. The hypervisor's role is limited to:

  1. Granting and revoking leases
  2. Routing interrupts
  3. Ensuring lease expiration
  4. Resetting devices on lease revocation
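A sketch of the expiration bookkeeping behind items 1-4. The `DeviceLease` fields and the `expired_holders` helper are illustrative assumptions; the real registry keys leases by `DeviceDescriptor` and also tears down the stage-2 MMIO mapping and resets the device on revocation.

```rust
/// Illustrative lease record (not the real ruvix-drivers type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeviceLease {
    pub partition: u64, // holder (stand-in for PartitionId)
    pub granted_ns: u64,
    pub duration_ns: u64,
}

impl DeviceLease {
    /// A lease is live until granted + duration; afterwards the
    /// hypervisor revokes access and resets the device.
    pub fn expired(&self, now_ns: u64) -> bool {
        now_ns >= self.granted_ns.saturating_add(self.duration_ns)
    }
}

/// Partitions whose leases have lapsed and need the device reset.
pub fn expired_holders(leases: &[DeviceLease], now_ns: u64) -> Vec<u64> {
    leases.iter().filter(|l| l.expired(now_ns)).map(|l| l.partition).collect()
}
```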

7.2 Device Capability Tokens

// ruvix-drivers/src/device_cap.rs

/// A device descriptor identifying a hardware device.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DeviceDescriptor {
    /// Device class
    pub class: DeviceClass,
    /// MMIO base address (physical)
    pub mmio_base: u64,
    /// MMIO region size
    pub mmio_size: usize,
    /// Primary interrupt number
    pub irq: u32,
    /// Device-specific identifier
    pub device_id: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DeviceClass {
    Uart,
    Timer,
    InterruptController,
    NetworkVirtio,
    BlockVirtio,
    Gpio,
    Rtc,
    Pci,
}

/// Device registry maintained by the hypervisor.
pub struct DeviceRegistry {
    /// All discovered devices
    devices: ArrayVec<DeviceDescriptor, 64>,

    /// Current leases: device -> (partition, expiration)
    leases: BTreeMap<DeviceDescriptor, DeviceLease>,

    /// Devices reserved for the hypervisor (never leased)
    reserved: ArrayVec<DeviceDescriptor, 8>,
}

impl DeviceRegistry {
    /// Discover devices from the device tree.
    pub fn from_dtb(dtb: &DeviceTree) -> Self {
        let mut reg = Self::new();
        for node in dtb.iter_devices() {
            let desc = DeviceDescriptor::from_dtb_node(node);
            reg.devices.push(desc);
        }
        // Reserve the interrupt controller and hypervisor timer
        reg.reserved.push(reg.find_gic().unwrap());
        reg.reserved.push(reg.find_timer().unwrap());
        reg
    }
}

7.3 Interrupt Routing

Interrupts from leased devices are routed to the holding partition as virtual interrupts:

// ruvix-drivers/src/irq_route.rs

/// Interrupt routing table.
///
/// Maps physical IRQs to virtual IRQs in partitions.
/// Only one partition can receive a given physical IRQ at a time.
pub struct IrqRouter {
    /// Physical IRQ -> (partition, virtual IRQ)
    routes: BTreeMap<u32, (PartitionId, u32)>,
}

impl IrqRouter {
    /// Route a physical IRQ to a partition.
    ///
    /// Called when a device lease is acquired.
    pub fn add_route(
        &mut self,
        phys_irq: u32,
        partition: PartitionId,
        virt_irq: u32,
    ) -> Result<(), HypervisorError> {
        if self.routes.contains_key(&phys_irq) {
            return Err(HypervisorError::IrqAlreadyRouted);
        }
        self.routes.insert(phys_irq, (partition, virt_irq));
        Ok(())
    }

    /// Handle a physical IRQ.
    ///
    /// Called from the hypervisor's IRQ handler. Looks up the
    /// route and injects a virtual interrupt into the target
    /// partition.
    pub fn dispatch(&self, phys_irq: u32) -> Option<(PartitionId, u32)> {
        self.routes.get(&phys_irq).copied()
    }
}

7.4 Virtio-Like Minimal Device Model

For devices that cannot be directly leased (shared devices, emulated devices for testing), RVM provides a minimal virtio-compatible interface:

// ruvix-drivers/src/virtio_shim.rs

/// Minimal virtio device shim.
///
/// This is NOT full virtio emulation. It provides:
/// - A single virtqueue (descriptor table + available ring + used ring)
/// - Interrupt injection via notification words
/// - Region-backed buffers (no DMA emulation)
///
/// Used for: virtio-console (debug), virtio-net (networking between
/// partitions), virtio-blk (block storage).
pub trait VirtioShim {
    /// Device type (net = 1, blk = 2, console = 3)
    fn device_type(&self) -> u32;

    /// Process available descriptors.
    fn process_queue(&mut self, queue: &VirtQueue) -> usize;

    /// Device-specific configuration read.
    fn read_config(&self, offset: u32) -> u32;

    /// Device-specific configuration write.
    fn write_config(&mut self, offset: u32, value: u32);
}

8. Witness Subsystem

8.1 Append-Only Log Design

The witness log is the audit backbone of RVM. Every privileged action produces a witness record. The log is append-only: there is no API to delete or modify records.

// ruvix-witness/src/log.rs

/// The kernel witness log.
///
/// Backed by a physically contiguous region in DRAM (Hot tier).
/// When the log fills, older segments are compressed to Warm tier
/// and eventually serialized to Cold tier.
///
/// The log is structured as a series of 64-byte records packed
/// into 4KB pages. Each page has a header with a running hash.
pub struct WitnessLog {
    /// Current write position (page index + offset within page)
    write_pos: AtomicU64,

    /// Physical pages backing the log
    pages: ArrayVec<PhysAddr, WITNESS_LOG_MAX_PAGES>,

    /// Running hash over all records (FNV-1a)
    chain_hash: AtomicU64,

    /// Sequence number (monotonically increasing)
    sequence: AtomicU64,

    /// Segment index for archival
    current_segment: u32,
}

/// Maximum log pages before rotation to warm tier.
pub const WITNESS_LOG_MAX_PAGES: usize = 4096; // 16 MB of hot log
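The append-and-chain discipline can be modeled in a few lines. The 64-bit FNV-1a constants below are the standard ones, but the exact combine step (hashing the previous chain hash concatenated with the record hash) is an assumption of this sketch, not the kernel's specified format.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

pub fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET, |h, b| (h ^ *b as u64).wrapping_mul(FNV_PRIME))
}

/// In-memory model of the append-only log (illustrative only).
pub struct LogSketch {
    records: Vec<(u64, u64)>, // (record_hash, chain_hash_before)
    chain: u64,               // running hash over all appended records
}

impl LogSketch {
    pub fn new() -> Self {
        Self { records: Vec::new(), chain: 0 }
    }

    /// Each record captures the chain hash that preceded it, then
    /// folds its own hash into the chain. There is no remove/modify.
    pub fn append(&mut self, payload: &[u8]) {
        let record_hash = fnv1a(payload);
        self.records.push((record_hash, self.chain));
        self.chain = fnv1a(&[self.chain.to_le_bytes(), record_hash.to_le_bytes()].concat());
    }

    /// Recompute the chain from scratch; false means something was altered.
    pub fn audit(&self) -> bool {
        let mut expected = 0u64;
        for (record_hash, before) in &self.records {
            if *before != expected {
                return false;
            }
            expected = fnv1a(&[expected.to_le_bytes(), record_hash.to_le_bytes()].concat());
        }
        expected == self.chain
    }
}
```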

8.2 Compact Binary Format

Each witness record is exactly 64 bytes to align with cache lines and avoid variable-length parsing:

// ruvix-witness/src/record.rs

/// A witness record. Fixed 64 bytes.
///
/// Layout:
///   [0..8]   sequence number (u64, little-endian)
///   [8..16]  timestamp_ns (u64)
///   [16..17] record_kind (u8)
///   [17..18] proof_tier (u8)
///   [18..20] reserved (2 bytes)
///   [20..28] subject_id (u64, partition/task/region ID)
///   [28..36] object_id (u64, target of the action)
///   [36..44] aux_data (u64, action-specific)
///   [44..52] chain_hash_before (u64, hash of all preceding records)
///   [52..60] record_hash (u64, hash of this record's fields [0..52])
///   [60..64] reserved_flags (u32)
#[derive(Debug, Clone, Copy)]
#[repr(C, align(64))]
pub struct WitnessRecord {
    pub sequence: u64,
    pub timestamp_ns: u64,
    pub kind: WitnessRecordKind,
    pub proof_tier: u8,
    pub _reserved: [u8; 2],
    pub subject_id: u64,
    pub object_id: u64,
    pub aux_data: u64,
    pub chain_hash_before: u64,
    pub record_hash: u64,
    pub flags: u32,
}

static_assertions::assert_eq_size!(WitnessRecord, [u8; 64]);

/// What kind of action was witnessed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum WitnessRecordKind {
    // Partition lifecycle
    PartitionCreate    = 0x01,
    PartitionSplit     = 0x02,
    PartitionMerge     = 0x03,
    PartitionHibernate = 0x04,
    PartitionReconstruct = 0x05,
    PartitionMigrate   = 0x06,

    // Capability operations
    CapGrant           = 0x10,
    CapRevoke          = 0x11,
    CapDelegate        = 0x12,

    // Memory operations
    RegionCreate       = 0x20,
    RegionDestroy      = 0x21,
    RegionTransfer     = 0x22,
    RegionShare        = 0x23,
    RegionTierChange   = 0x24,

    // Communication
    CommEdgeCreate     = 0x30,
    CommEdgeDestroy    = 0x31,
    ZeroCopySend       = 0x32,
    NotificationSignal = 0x33,

    // Proof verification
    ProofVerified      = 0x40,
    ProofRejected      = 0x41,
    ProofEscalated     = 0x42,

    // Device operations
    LeaseAcquire       = 0x50,
    LeaseRevoke        = 0x51,
    LeaseExpire        = 0x52,

    // Vector/Graph mutations
    VectorPut          = 0x60,
    GraphMutation      = 0x61,

    // Scheduler events
    TaskSpawn          = 0x70,
    TaskTerminate      = 0x71,
    ModeSwitch         = 0x72,
    StructuralChange   = 0x73,

    // Boot and attestation
    BootAttestation    = 0x80,
    CheckpointCreated  = 0x81,
}

8.3 What Gets Witnessed

Every action in the following categories produces a record:

| Category | Examples | Record Kind |
|----------|----------|-------------|
| Partition lifecycle | Create, split, merge, hibernate, reconstruct, migrate | 0x01-0x06 |
| Capability changes | Grant, revoke, delegate | 0x10-0x12 |
| Memory operations | Region create/destroy/transfer/share, tier changes | 0x20-0x24 |
| Communication | Edge create/destroy, zero-copy send, notification | 0x30-0x33 |
| Proof verification | Verified, rejected, escalated | 0x40-0x42 |
| Device access | Lease acquire/revoke/expire | 0x50-0x52 |
| Data mutation | Vector put, graph mutation | 0x60-0x61 |
| Scheduling | Task spawn/terminate, mode switch, structural change | 0x70-0x73 |
| Boot | Boot attestation, checkpoints | 0x80-0x81 |

8.4 Replay and Audit

The witness log supports two operations: audit (verify integrity) and replay (reconstruct state).

// ruvix-witness/src/replay.rs

/// Verify the integrity of the witness log.
///
/// Walks the log from start to end, recomputing chain hashes.
/// Any break in the chain indicates tampering.
pub fn audit_log(log: &WitnessLog) -> AuditResult {
    let mut expected_hash: u64 = 0;
    let mut record_count: u64 = 0;
    let mut violations: Vec<AuditViolation> = Vec::new();

    for record in log.iter() {
        // Verify chain hash
        if record.chain_hash_before != expected_hash {
            violations.push(AuditViolation::ChainBreak {
                sequence: record.sequence,
                expected: expected_hash,
                found: record.chain_hash_before,
            });
        }

        // Verify record self-hash
        let computed = compute_record_hash(&record);
        if record.record_hash != computed {
            violations.push(AuditViolation::RecordTampered {
                sequence: record.sequence,
            });
        }

        // Advance chain
        expected_hash = fnv1a_combine(expected_hash, record.record_hash);
        record_count += 1;
    }

    // Compute before moving `violations` into the struct.
    let chain_valid = violations.is_empty();
    AuditResult {
        total_records: record_count,
        violations,
        chain_valid,
    }
}

/// Replay a witness log to reconstruct system state.
///
/// Given a checkpoint and a witness log segment, deterministically
/// reconstructs the system state at any point in the log.
pub fn replay_from_checkpoint(
    checkpoint: &Checkpoint,
    log_segment: &[WitnessRecord],
) -> Result<KernelState, ReplayError> {
    let mut state = checkpoint.restore()?;

    for record in log_segment {
        state.apply_witness_record(record)?;
    }

    Ok(state)
}

8.5 Integration with Proof Verifier

The witness log and proof engine form a closed loop:

  1. A task requests a mutation (e.g., vector_put_proved)
  2. The proof engine verifies the proof token (3-tier routing)
  3. If the proof is valid, the mutation is applied
  4. A witness record is emitted (ProofVerified + VectorPut)
  5. If the proof is invalid, a rejection record is emitted (ProofRejected)
  6. The witness record's chain hash incorporates the proof attestation

This means the witness log contains a complete, tamper-evident history of every proof that was checked and every mutation that was applied.
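A toy model of that closed loop, with proof verification reduced to a boolean and the witness log to a `Vec`. Everything here (`StoreSketch`, `Outcome`) is illustrative; the real path routes the proof token through the 3-tier verifier and emits 64-byte witness records.

```rust
use std::collections::BTreeMap;

/// Which witness outcome a mutation attempt produced.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    ProofVerified,
    ProofRejected,
}

/// Proof-gated store sketch: verify, apply, witness.
pub struct StoreSketch {
    values: BTreeMap<u64, f32>,
    witness: Vec<Outcome>,
}

impl StoreSketch {
    pub fn new() -> Self {
        Self { values: BTreeMap::new(), witness: Vec::new() }
    }

    /// vector_put_proved, reduced to its control flow. The mutation is
    /// applied only when the proof verifies; both outcomes are witnessed.
    pub fn put_proved(&mut self, key: u64, value: f32, proof_ok: bool) -> bool {
        if proof_ok {
            self.values.insert(key, value);
            self.witness.push(Outcome::ProofVerified);
            true
        } else {
            self.witness.push(Outcome::ProofRejected); // rejections are logged too
            false
        }
    }
}
```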


9. Agent Runtime Layer

9.1 WASM Partition Adapter

Agent workloads run as WASM modules inside partitions. The WASM runtime itself runs in the partition's address space (EL1/EL0), not in the hypervisor.

// ruvix-agent/src/adapter.rs

/// Configuration for a WASM agent partition.
pub struct AgentPartitionConfig {
    /// WASM module bytes
    pub wasm_module: &'static [u8],

    /// Memory limits
    pub max_memory_pages: u32,  // Each page = 64KB
    pub initial_memory_pages: u32,

    /// Stack size for the WASM execution
    pub stack_size: usize,

    /// Capabilities granted to this agent
    pub capabilities: ArrayVec<CapHandle, 32>,

    /// Communication edges to other agents
    pub comm_edges: ArrayVec<CommEdgeConfig, 16>,

    /// Scheduling priority
    pub priority: TaskPriority,

    /// Optional deadline for real-time agents
    pub deadline: Option<Duration>,
}

/// WASM host functions exposed to agents.
///
/// These are the agent's interface to the hypervisor, mapped to
/// syscalls via the partition's capability table.
pub trait AgentHostFunctions {
    // --- Communication ---

    /// Send a message to another agent via CommEdge.
    fn send(&mut self, edge_id: u32, data: &[u8]) -> Result<(), AgentError>;

    /// Receive a message from a CommEdge.
    fn recv(&mut self, edge_id: u32, buf: &mut [u8]) -> Result<usize, AgentError>;

    /// Signal a notification.
    fn notify(&mut self, edge_id: u32, mask: u64) -> Result<(), AgentError>;

    // --- Memory ---

    /// Request a shared memory region.
    fn request_shared_region(
        &mut self,
        size: usize,
        policy: u32,
    ) -> Result<u32, AgentError>;

    /// Map a shared region from another agent.
    fn map_shared(&mut self, region_id: u32) -> Result<*const u8, AgentError>;

    // --- Vector/Graph ---

    /// Read a vector from the kernel vector store.
    fn vector_get(
        &mut self,
        store_id: u32,
        key: u64,
        buf: &mut [f32],
    ) -> Result<usize, AgentError>;

    /// Write a vector with proof.
    fn vector_put(
        &mut self,
        store_id: u32,
        key: u64,
        data: &[f32],
    ) -> Result<(), AgentError>;

    // --- Lifecycle ---

    /// Spawn a child agent.
    fn spawn_agent(&mut self, config_ptr: u32) -> Result<u32, AgentError>;

    /// Request hibernation.
    fn hibernate(&mut self) -> Result<(), AgentError>;

    /// Yield execution.
    fn yield_now(&mut self);
}

9.2 Agent-to-Coherence-Domain Mapping

Each agent maps to exactly one partition. Multiple agents can share a partition if they are tightly coupled (high coherence score).

Agent A ──┐
          ├── Partition P1 (coherence = 0.92)
Agent B ──┘
               │ CommEdge (weight=1500)
               v
Agent C ──── Partition P2 (coherence = 0.87)
               │ CommEdge (weight=200)
               v
Agent D ──┐
          ├── Partition P3 (coherence = 0.95)
Agent E ──┘

When the mincut algorithm detects that Agent B communicates more with Agent C than with Agent A, it will trigger a partition split, moving Agent B from P1 to P2 (or creating a new partition).
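The misplacement test can be sketched as a weight comparison. The 2x hysteresis margin here is an assumption added to avoid oscillating placements; the real engine decides from the mincut value over the whole coherence graph, not a single pairwise comparison.

```rust
/// Illustrative misplacement check: should an agent move across the
/// partition boundary? `intra_weight` is its traffic inside the current
/// partition, `inter_weight` its traffic across the candidate boundary.
pub fn should_migrate(intra_weight: u64, inter_weight: u64) -> bool {
    // Require a clear margin (2x, assumed) so placement does not flap.
    inter_weight > intra_weight.saturating_mul(2)
}
```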

9.3 Agent Lifecycle

// ruvix-agent/src/lifecycle.rs

/// Agent lifecycle states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    /// Being initialized (WASM module loading, capability setup)
    Initializing,

    /// Actively executing within its partition
    Running,

    /// Suspended (waiting on I/O or explicit yield)
    Suspended,

    /// Being migrated to a different partition
    Migrating {
        from: PartitionId,
        to: PartitionId,
    },

    /// Hibernated (state serialized, partition may be dormant)
    Hibernated,

    /// Being reconstructed from hibernated state
    Reconstructing,

    /// Terminated (cleanup complete)
    Terminated,
}

/// Agent migration protocol.
///
/// Migration moves an agent from one partition to another without
/// losing state. This is triggered by the mincut-based placement
/// engine when it detects that an agent is misplaced.
pub fn migrate_agent(
    agent: AgentHandle,
    from: PartitionId,
    to: PartitionId,
    kernel: &mut Kernel,
) -> Result<(), MigrationError> {
    // 1. Suspend agent
    kernel.suspend_task(agent.task)?;

    // 2. Serialize agent state (WASM memory, stack, globals)
    let state = kernel.serialize_wasm_state(agent)?;

    // 3. Create new task in destination partition
    let new_task = kernel.create_task_in_partition(to, agent.config)?;

    // 4. Restore state into new task
    kernel.restore_wasm_state(new_task, &state)?;

    // 5. Transfer owned regions
    for region in agent.owned_regions() {
        kernel.transfer_region(region, from, to)?;
    }

    // 6. Update CommEdge endpoints
    for edge in agent.comm_edges() {
        kernel.update_edge_endpoint(edge, from, to)?;
    }

    // 7. Update coherence graph
    kernel.pressure_engine.agent_migrated(agent, from, to);

    // 8. Witness
    kernel.witness_log.record(WitnessRecord::new(
        WitnessRecordKind::PartitionMigrate,
        from.0,
        to.0,
        agent.0 as u64,
    ));

    // 9. Resume agent in new partition
    kernel.resume_task(new_task)?;

    // 10. Destroy old task
    kernel.destroy_task(agent.task)?;

    Ok(())
}

9.4 Multi-Agent Communication

Agents communicate exclusively through CommEdges. The communication pattern is recorded in the coherence graph and drives placement decisions:

// ruvix-agent/src/communication.rs

/// Agent communication layer built on CommEdges.
pub struct AgentComm {
    /// Agent's partition
    partition: PartitionId,

    /// Named edges: edge_name -> CommEdgeHandle
    edges: BTreeMap<&'static str, CommEdgeHandle>,

    /// Message serialization format
    format: MessageFormat,
}

#[derive(Debug, Clone, Copy)]
pub enum MessageFormat {
    /// Raw bytes (no serialization overhead)
    Raw,
    /// WIT Component Model types (schema-validated)
    Wit,
    /// CBOR (compact, self-describing)
    Cbor,
}

impl AgentComm {
    /// Send a typed message to a named edge.
    pub fn send<T: Serialize>(
        &self,
        edge_name: &str,
        message: &T,
    ) -> Result<(), AgentError> {
        let edge = self.edges.get(edge_name)
            .ok_or(AgentError::UnknownEdge)?;
        let bytes = self.serialize(message)?;
        // This goes through CommEdgeOps::send, which updates
        // the coherence graph edge weight
        syscall_queue_send(*edge, &bytes, MsgPriority::Normal)
    }

    /// Receive a typed message from a named edge.
    pub fn recv<T: Deserialize>(
        &self,
        edge_name: &str,
        timeout: Duration,
    ) -> Result<T, AgentError> {
        let edge = self.edges.get(edge_name)
            .ok_or(AgentError::UnknownEdge)?;
        let mut buf = [0u8; 65536];
        let len = syscall_queue_recv(*edge, &mut buf, timeout)?;
        self.deserialize(&buf[..len])
    }
}

10. Hardware Abstraction

10.1 HAL Trait Design

The HAL defines platform-agnostic traits. Existing traits from ruvix-hal (Console, Timer, InterruptController, Mmu, PowerManagement) are extended with hypervisor-specific traits:

// ruvix-hal/src/hypervisor.rs

/// Hypervisor-specific hardware abstraction.
///
/// This trait captures the operations that differ between
/// ARM EL2, RISC-V HS-mode, and x86 VMX root mode.
pub trait HypervisorHal {
    /// Stage-2/EPT page table type
    type Stage2Table;

    /// Virtual CPU context type
    type VcpuContext;

    /// Configure the CPU for hypervisor mode.
    ///
    /// Called once during boot. Sets up:
    /// - Stage-2 translation (VTCR_EL2 / hgatp / EPT pointer)
    /// - Trap configuration (HCR_EL2 / hedeleg / VM-execution controls)
    /// - Virtual interrupt delivery
    unsafe fn init_hypervisor_mode(&self) -> Result<(), HalError>;

    /// Create a new stage-2 address space.
    fn create_stage2_table(
        &self,
        phys: &mut dyn PhysicalAllocator,
    ) -> Result<Self::Stage2Table, HalError>;

    /// Map a page in a stage-2 table.
    fn stage2_map(
        &self,
        table: &mut Self::Stage2Table,
        ipa: u64,
        pa: u64,
        attrs: Stage2Attrs,
    ) -> Result<(), HalError>;

    /// Unmap a page from a stage-2 table.
    fn stage2_unmap(
        &self,
        table: &mut Self::Stage2Table,
        ipa: u64,
    ) -> Result<(), HalError>;

    /// Switch to a partition's address space.
    ///
    /// Activates the partition's stage-2 tables and restores
    /// the vCPU context.
    unsafe fn enter_partition(
        &self,
        table: &Self::Stage2Table,
        vcpu: &Self::VcpuContext,
    );

    /// Handle a trap from a partition.
    ///
    /// Called when the partition triggers a stage-2 fault,
    /// HVC/ECALL, or trapped instruction.
    fn handle_trap(
        &self,
        vcpu: &mut Self::VcpuContext,
        trap: TrapInfo,
    ) -> TrapAction;

    /// Inject a virtual interrupt into a partition.
    fn inject_virtual_irq(
        &self,
        vcpu: &mut Self::VcpuContext,
        irq: u32,
    ) -> Result<(), HalError>;

    /// Flush stage-2 TLB entries for a partition.
    fn flush_stage2_tlb(&self, vmid: u16);
}

/// Information about a trap from a partition.
#[derive(Debug)]
pub struct TrapInfo {
    /// Trap cause
    pub cause: TrapCause,
    /// Faulting address (if applicable)
    pub fault_addr: Option<u64>,
    /// Instruction that caused the trap (for emulation)
    pub instruction: Option<u32>,
}

#[derive(Debug)]
pub enum TrapCause {
    /// Stage-2 page fault (IPA not mapped)
    Stage2Fault { ipa: u64, is_write: bool },
    /// Hypercall (HVC/ECALL/VMCALL)
    Hypercall { code: u64, args: [u64; 4] },
    /// MMIO access to an unmapped device
    MmioAccess { addr: u64, is_write: bool, value: u64, size: u8 },
    /// WFI/WFE instruction (idle)
    WaitForInterrupt,
    /// System register access (trapped MSR/CSR)
    SystemRegister { reg: u32, is_write: bool, value: u64 },
}

#[derive(Debug)]
pub enum TrapAction {
    /// Resume the partition
    Resume,
    /// Resume with modified register state
    ResumeModified,
    /// Suspend the partition's current task
    SuspendTask,
    /// Terminate the partition
    Terminate,
}

10.2 What Must Be in Assembly vs Rust

| Component | Language | Reason |
|-----------|----------|--------|
| Reset vector, stack setup, BSS clear | Assembly | No Rust runtime available yet |
| Exception vector table entry points | Assembly | Fixed hardware-defined layout; must save/restore registers in exact order |
| Context switch (register save/restore) | Assembly | Must atomically save all 31 GPRs + SP + PC + PSTATE |
| TLB invalidation sequences | Inline asm in Rust | Specific instruction sequences with barriers |
| Cache maintenance | Inline asm in Rust | DC/IC instructions |
| Everything else | Rust | Type safety, borrow checker, no_std ecosystem |

Target: less than 500 lines of assembly total per platform.

10.3 Platform Abstraction Summary

| Operation | AArch64 (EL2) | RISC-V (HS-mode) | x86-64 (VMX root) |
|-----------|---------------|------------------|-------------------|
| Stage-2 tables | VTTBR_EL2 + VTT | hgatp + G-stage PT | EPTP + EPT |
| Trap entry | VBAR_EL2 vectors | stvec (VS traps delegate to HS) | VM-exit handler |
| Virtual IRQ | HCR_EL2.VI bit | hvip.VSEIP | Posted interrupts / VM-entry interruption |
| Hypercall | HVC instruction | ECALL from VS-mode | VMCALL instruction |
| VMID/ASID | VTTBR_EL2[63:48] | hgatp.VMID | VPID (16-bit) |
| Cache control | DC CIVAC, IC IALLU | SFENCE.VMA | INVLPG, WBINVD |
| Timer | CNTHP_CTL_EL2 | htimedelta + stimecmp | VMX preemption timer |

10.4 QEMU virt as Reference Platform

The QEMU AArch64 virt machine is the first target:

// ruvix-aarch64/src/qemu_virt.rs

/// QEMU virt machine memory map.
pub const QEMU_VIRT_FLASH_BASE: u64     = 0x0000_0000;
pub const QEMU_VIRT_GIC_DIST_BASE: u64  = 0x0800_0000;
pub const QEMU_VIRT_GIC_CPU_BASE: u64   = 0x0801_0000;
pub const QEMU_VIRT_UART_BASE: u64      = 0x0900_0000;
pub const QEMU_VIRT_RTC_BASE: u64       = 0x0901_0000;
pub const QEMU_VIRT_GPIO_BASE: u64      = 0x0903_0000;
pub const QEMU_VIRT_RAM_BASE: u64       = 0x4000_0000;
pub const QEMU_VIRT_RAM_SIZE: u64       = 0x4000_0000; // 1 GB default

/// QEMU launch command for testing:
///
/// ```sh
/// qemu-system-aarch64 \
///     -machine virt,virtualization=on,gic-version=3 \
///     -cpu cortex-a72 \
///     -m 1G \
///     -nographic \
///     -kernel target/aarch64-unknown-none/release/ruvix \
///     -smp 4
/// ```
///
/// Key flags:
///   virtualization=on  -- enables EL2 (hypervisor mode)
///   gic-version=3      -- GICv3 (supports virtual interrupts)
///   -smp 4             -- 4 cores for multi-partition testing

11. Integration with RuVector

11.1 mincut Crate -> Partition Placement Engine

The ruvector-mincut crate provides the dynamic minimum cut algorithm that drives partition split/merge decisions. The integration maps the hypervisor's coherence graph to the mincut data structure:

// ruvix-pressure/src/mincut_bridge.rs

use ruvector_mincut::{MinCutBuilder, DynamicMinCut};

/// Bridge between the hypervisor coherence graph and ruvector-mincut.
pub struct MinCutBridge {
    /// The dynamic mincut structure
    mincut: Box<dyn DynamicMinCut>,

    /// Mapping: PartitionId -> mincut vertex ID
    partition_to_vertex: BTreeMap<PartitionId, usize>,

    /// Mapping: CommEdgeHandle -> mincut edge
    edge_to_mincut: BTreeMap<CommEdgeHandle, (usize, usize)>,

    /// Recomputation epoch
    epoch: u64,
}

impl MinCutBridge {
    pub fn new() -> Self {
        let mincut = MinCutBuilder::new()
            .exact()
            .build()
            .expect("mincut init");
        Self {
            mincut: Box::new(mincut),
            partition_to_vertex: BTreeMap::new(),
            edge_to_mincut: BTreeMap::new(),
            epoch: 0,
        }
    }

    /// Register a new partition as a vertex.
    pub fn add_partition(&mut self, id: PartitionId) -> usize {
        let vertex = self.partition_to_vertex.len();
        self.partition_to_vertex.insert(id, vertex);
        vertex
    }

    /// Register a CommEdge as a weighted edge.
    ///
    /// Called when a CommEdge is created.
    pub fn add_edge(
        &mut self,
        edge: CommEdgeHandle,
        source: PartitionId,
        dest: PartitionId,
        initial_weight: f64,
    ) -> Result<(), PressureError> {
        let u = *self.partition_to_vertex.get(&source)
            .ok_or(PressureError::UnknownPartition)?;
        let v = *self.partition_to_vertex.get(&dest)
            .ok_or(PressureError::UnknownPartition)?;
        self.mincut.insert_edge(u, v, initial_weight)?;
        self.edge_to_mincut.insert(edge, (u, v));
        Ok(())
    }

    /// Update edge weight (called on every message send).
    ///
    /// Implemented as delete + insert, since ruvector-mincut exposes
    /// dynamic edge insertion/deletion rather than in-place weight updates.
    pub fn update_weight(
        &mut self,
        edge: CommEdgeHandle,
        new_weight: f64,
    ) -> Result<(), PressureError> {
        let (u, v) = *self.edge_to_mincut.get(&edge)
            .ok_or(PressureError::UnknownEdge)?;
        let _ = self.mincut.delete_edge(u, v);
        self.mincut.insert_edge(u, v, new_weight)?;
        Ok(())
    }

    /// Compute the current minimum cut.
    ///
    /// Returns CutPressure indicating where the system should split.
    pub fn compute_pressure(&self) -> CutPressure {
        let cut = self.mincut.min_cut();
        CutPressure {
            min_cut_value: cut.value,
            cut_edges: self.translate_cut_edges(&cut),
            // ... translate partition sides
            computed_at_ns: now_ns(),
            ..Default::default()
        }
    }
}

API mapping from ruvector-mincut:

| mincut API | Hypervisor Use |
| --- | --- |
| `MinCutBuilder::new().exact().build()` | Initialize placement engine |
| `insert_edge(u, v, weight)` | Register CommEdge creation |
| `delete_edge(u, v)` | Register CommEdge destruction |
| `min_cut_value()` | Query current cut pressure |
| `min_cut() -> MinCutResult` | Get the actual cut for split decisions |
| `WitnessTree` | Certify that the computed cut is correct |
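To make the cut-pressure query concrete: on a toy graph of four partitions where two tightly coupled pairs are joined by one light edge, the minimum cut falls on the light edge. The brute-force enumeration below (illustration only, with hypothetical weights; ruvector-mincut maintains this value incrementally under edge updates rather than re-enumerating) reproduces the value and the split a pressure engine would propose:

```rust
/// Brute-force global minimum cut over all non-trivial bipartitions.
/// Exponential in vertex count -- illustration only; the dynamic
/// mincut structure maintains this incrementally.
fn brute_force_min_cut(n: usize, edges: &[(usize, usize, f64)]) -> (f64, u32) {
    let mut best = (f64::INFINITY, 0u32);
    // Fix vertex 0 on side A so each bipartition is enumerated once.
    for mask in 1..(1u32 << (n - 1)) {
        let side = mask << 1; // bit i = 1 means vertex i is on side B
        let cut: f64 = edges
            .iter()
            .filter(|&&(u, v, _)| ((side >> u) & 1) != ((side >> v) & 1))
            .map(|&(_, _, w)| w)
            .sum();
        if cut < best.0 {
            best = (cut, side);
        }
    }
    best
}
```

With edges `(0,1,10.0)`, `(2,3,9.0)`, `(1,2,1.0)`, the minimum cut has value 1.0 and separates `{0,1}` from `{2,3}`: the light communication edge is exactly where the system should split.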

11.2 sparsifier Crate -> Efficient Graph State

The ruvector-sparsifier crate maintains a compressed shadow of the coherence graph. When the full graph becomes large (hundreds of partitions, thousands of edges), the sparsifier provides an approximate view that preserves spectral properties:

// ruvix-pressure/src/sparse_bridge.rs

use ruvector_sparsifier::{AdaptiveGeoSpar, SparseGraph, SparsifierConfig, Sparsifier};

/// Sparsified view of the coherence graph.
///
/// The full coherence graph tracks every CommEdge and its weight.
/// The sparsifier maintains a compressed version that preserves
/// the Laplacian energy within (1 ± ε), enabling efficient
/// coherence score computation on large graphs.
pub struct SparseBridge {
    /// The full graph (source of truth)
    full_graph: SparseGraph,

    /// The sparsifier (compressed view)
    sparsifier: AdaptiveGeoSpar,

    /// Compression ratio
    compression: f64,
}

impl SparseBridge {
    pub fn new(epsilon: f64) -> Self {
        let full_graph = SparseGraph::new();
        let config = SparsifierConfig {
            epsilon,
            ..Default::default()
        };
        let sparsifier = AdaptiveGeoSpar::build(&full_graph, config)
            .expect("sparsifier init");
        Self {
            full_graph,
            sparsifier,
            compression: 1.0,
        }
    }

    /// Add a CommEdge to the graph.
    pub fn add_edge(
        &mut self,
        u: usize,
        v: usize,
        weight: f64,
    ) -> Result<(), PressureError> {
        self.full_graph.add_edge(u, v, weight);
        self.sparsifier.insert_edge(u, v, weight)?;
        self.compression = self.sparsifier.compression_ratio();
        Ok(())
    }

    /// Get the sparsified graph for coherence computation.
    ///
    /// The solver crate operates on this compressed graph,
    /// not the full graph.
    pub fn sparsified(&self) -> &SparseGraph {
        self.sparsifier.sparsifier()
    }

    /// Audit sparsifier quality.
    pub fn audit(&self) -> bool {
        self.sparsifier.audit().passed
    }
}

API mapping from ruvector-sparsifier:

| sparsifier API | Hypervisor Use |
| --- | --- |
| `SparseGraph::from_edges()` | Build initial coherence graph |
| `AdaptiveGeoSpar::build()` | Create compressed view |
| `insert_edge()` / `delete_edge()` | Dynamic graph updates |
| `sparsifier() -> &SparseGraph` | Feed to solver for coherence |
| `audit() -> AuditResult` | Verify compression quality |
| `compression_ratio()` | Monitor graph efficiency |
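The (1 ± ε) guarantee the audit checks can be illustrated on a toy graph. The hand-picked "sparsifier" below replaces the six unit-weight edges of K₄ with a reweighted 4-cycle; the audit-style check verifies that the Laplacian energy xᵀLx of each test vector stays inside the ε band. This is a hypothetical stand-in for AdaptiveGeoSpar, which chooses edges and weights algorithmically and bounds the error for all vectors, not just a sample:

```rust
type Edge = (usize, usize, f64);

/// Laplacian quadratic form: x^T L x = sum over edges of w * (x_u - x_v)^2.
fn laplacian_energy(edges: &[Edge], x: &[f64]) -> f64 {
    edges.iter().map(|&(u, v, w)| w * (x[u] - x[v]).powi(2)).sum()
}

/// Audit-style check: sparse energy within (1 ± eps) of the full
/// energy for every test vector supplied.
fn audit(full: &[Edge], sparse: &[Edge], tests: &[Vec<f64>], eps: f64) -> bool {
    tests.iter().all(|x| {
        let (ef, es) = (laplacian_energy(full, x), laplacian_energy(sparse, x));
        es >= (1.0 - eps) * ef && es <= (1.0 + eps) * ef
    })
}
```

For the K₄-vs-cycle pair below, one test vector has energy ratio 1.0 and the other 0.75, so the audit passes at ε = 0.3 but fails at ε = 0.1, which would trigger the rebuild path.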

11.3 solver Crate -> Coherence Score Computation

The ruvector-solver crate computes coherence scores by solving Laplacian systems on the sparsified coherence graph:

// ruvix-pressure/src/coherence_solver.rs

use ruvector_solver::traits::{SolverEngine, SparseLaplacianSolver};
use ruvector_solver::neumann::NeumannSolver;
use ruvector_solver::types::{CsrMatrix, ComputeBudget};

/// Coherence score computation via Laplacian solver.
///
/// The coherence score of a partition is derived from the
/// effective resistance between its internal nodes. Low
/// effective resistance = high coherence (tightly coupled).
pub struct CoherenceSolver {
    /// The solver engine
    solver: NeumannSolver,

    /// Compute budget per invocation
    budget: ComputeBudget,
}

impl CoherenceSolver {
    pub fn new() -> Self {
        Self {
            solver: NeumannSolver::new(1e-4, 200), // tolerance, max_iter
            budget: ComputeBudget::default(),
        }
    }

    /// Compute the coherence score for a partition.
    ///
    /// Uses the sparsified Laplacian to compute average effective
    /// resistance between all pairs of tasks in the partition.
    /// Lower resistance = higher coherence.
    pub fn compute_coherence(
        &self,
        partition: &Partition,
        sparse_graph: &SparseGraph,
    ) -> Result<CoherenceScore, PressureError> {
        // 1. Extract the subgraph for this partition
        let subgraph = extract_partition_subgraph(partition, sparse_graph);

        // 2. Build Laplacian matrix
        let laplacian = build_laplacian(&subgraph);

        // 3. Compute effective resistance between task pairs
        let mut total_resistance = 0.0;
        let mut pairs = 0;
        let task_ids: Vec<usize> = partition.tasks.keys()
            .map(|t| t.index())
            .collect();

        for i in 0..task_ids.len() {
            for j in (i+1)..task_ids.len() {
                let r = self.solver.effective_resistance(
                    &laplacian,
                    task_ids[i],
                    task_ids[j],
                    &self.budget,
                )?;
                total_resistance += r;
                pairs += 1;
            }
        }

        // 4. Normalize: coherence = 1 / (1 + avg_resistance)
        let avg_resistance = if pairs > 0 {
            total_resistance / pairs as f64
        } else {
            0.0
        };
        let coherence_value = 1.0 / (1.0 + avg_resistance);

        Ok(CoherenceScore {
            value: coherence_value,
            task_contributions: compute_per_task_contributions(
                &laplacian, &task_ids, &self.solver, &self.budget,
            ),
            computed_at_ns: now_ns(),
            stale: false,
        })
    }
}

API mapping from ruvector-solver:

| solver API | Hypervisor Use |
| --- | --- |
| `NeumannSolver::new(tol, max_iter)` | Create solver for coherence computation |
| `solve(&matrix, &rhs) -> SolverResult` | General sparse linear solve |
| `effective_resistance(laplacian, s, t)` | Core coherence metric between task pairs |
| `estimate_complexity(profile, n)` | Budget estimation before solving |
| `ComputeBudget` | Bound solver computation per epoch |
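The coherence metric is easy to check by hand on a tiny graph. The sketch below computes effective resistance by grounding one node and solving the reduced Laplacian L'x = e_s − e_t with naive Gauss-Jordan elimination (fine at toy sizes; the solver crate uses bounded Neumann-series iteration on the sparsified graph instead, and the function name here is just for illustration):

```rust
/// Effective resistance between s and t on a small weighted graph,
/// computed by grounding the last node and solving the reduced
/// Laplacian with Gauss-Jordan elimination.
fn effective_resistance(n: usize, edges: &[(usize, usize, f64)], s: usize, t: usize) -> f64 {
    let m = n - 1; // node m is grounded; s and t must differ from it
    assert!(s < m && t < m && s != t);
    let mut a = vec![vec![0.0f64; m + 1]; m]; // augmented system [L' | b]
    for &(u, v, w) in edges {
        for &(i, j) in &[(u, v), (v, u)] {
            if i < m {
                a[i][i] += w;
                if j < m {
                    a[i][j] -= w;
                }
            }
        }
    }
    a[s][m] = 1.0; // b = e_s - e_t, restricted to non-grounded nodes
    a[t][m] = -1.0;
    for col in 0..m {
        // Partial pivoting, then eliminate the column in all other rows.
        let piv = (col..m)
            .max_by(|&i, &j| a[i][col].abs().total_cmp(&a[j][col].abs()))
            .unwrap();
        a.swap(col, piv);
        for row in 0..m {
            if row != col {
                let f = a[row][col] / a[col][col];
                for k in col..=m {
                    a[row][k] -= f * a[col][k];
                }
            }
        }
    }
    // Potentials after diagonalization: x_i = b_i / L'_ii.
    a[s][m] / a[s][s] - a[t][m] / a[t][t]
}
```

On the unit-weight triangle, the resistance between any two nodes is the series/parallel value (1·2)/(1+2) = 2/3, giving a coherence score of 1/(1 + 2/3) = 0.6 under the normalization used above.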

11.4 Full Pressure Engine Pipeline

The three crates form a pipeline that runs every scheduler epoch:

CommEdge weight updates (per message)
        |
        v
[ruvector-sparsifier]  -- maintain compressed coherence graph
        |
        v
[ruvector-solver]      -- compute coherence scores from Laplacian
        |
        v
[ruvector-mincut]      -- compute cut pressure from communication graph
        |
        v
Scheduler decisions:
  - Task priority adjustment (Flow mode)
  - Partition split/merge triggers
  - Agent migration signals
  - Tier promotion/demotion hints

The stages above are combined into a single engine struct that the scheduler queries each epoch:

// ruvix-pressure/src/engine.rs

/// The unified pressure engine.
///
/// Combines sparsifier, solver, and mincut into a single subsystem
/// that the scheduler queries every epoch.
pub struct PressureEngine {
    /// Sparsified coherence graph
    sparse: SparseBridge,

    /// Mincut for split/merge decisions
    mincut: MinCutBridge,

    /// Coherence solver
    solver: CoherenceSolver,

    /// Epoch counter
    epoch: u64,

    /// Epoch duration in nanoseconds
    epoch_duration_ns: u64,

    /// Cached results (valid for one epoch)
    cached_coherence: BTreeMap<PartitionId, CoherenceScore>,
    cached_pressure: Option<CutPressure>,
}

impl PressureEngine {
    /// Called every scheduler epoch.
    ///
    /// Recomputes coherence scores and cut pressure.
    pub fn tick(
        &mut self,
        partitions: &[Partition],
    ) -> EpochResult {
        self.epoch += 1;

        // 1. Decay edge weights (exponential decay per epoch)
        self.sparse.decay_weights(0.95);
        self.mincut.decay_weights(0.95);

        // 2. Audit sparsifier quality
        if !self.sparse.audit() {
            self.sparse.rebuild();
        }

        // 3. Recompute coherence scores
        for partition in partitions {
            let score = self.solver.compute_coherence(
                partition,
                self.sparse.sparsified(),
            );
            if let Ok(s) = score {
                self.cached_coherence.insert(partition.id, s);
            }
        }

        // 4. Recompute cut pressure
        self.cached_pressure = Some(self.mincut.compute_pressure());

        // 5. Evaluate structural changes
        let actions = evaluate_structural_changes(
            partitions,
            self,
            &StructuralConfig::default(),
        );

        EpochResult {
            epoch: self.epoch,
            actions,
            coherence_scores: self.cached_coherence.clone(),
            cut_pressure: self.cached_pressure.clone(),
        }
    }

    /// Called on every CommEdge message send.
    ///
    /// Incrementally updates edge weights in both the sparsifier
    /// and the mincut structure.
    pub fn on_message_sent(
        &mut self,
        edge: CommEdgeHandle,
        bytes: usize,
    ) {
        if let Some((u, v)) = self.mincut.edge_to_mincut.get(&edge) {
            let new_weight = bytes as f64; // Simplified; real impl accumulates
            let _ = self.sparse.update_weight(*u, *v, new_weight);
            let _ = self.mincut.update_weight(edge, new_weight);
        }
    }
}
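The 0.95 decay factor in `tick` is worth a quick sanity check. An idle edge loses half its weight in about 14 epochs (0.95¹³ ≈ 0.51, 0.95¹⁴ ≈ 0.49), and an edge receiving a constant b units of traffic per epoch converges to the fixed point b/(1 − 0.95) = 20b. A small sketch with hypothetical numbers:

```rust
/// One epoch of the pressure engine's weight dynamics:
/// decay by `d`, then accumulate `b` units of new traffic.
/// Fixed point: w* = b / (1 - d).
fn evolve(d: f64, b: f64, epochs: usize) -> f64 {
    let mut w = 0.0;
    for _ in 0..epochs {
        w = d * w + b;
    }
    w
}
```

So with per-epoch decay, edge weights are effectively an exponential moving average of recent traffic: bursty past communication fades on a ~14-epoch half-life while sustained traffic saturates at 20× the per-epoch volume.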

12. What Makes RVM Different

12.1 Comparison Matrix

| Property | KVM/QEMU | Firecracker | seL4 | RVM |
| --- | --- | --- | --- | --- |
| Abstraction unit | VM (full hardware) | microVM (minimal HW) | Thread + address space | Coherence domain (partition) |
| Device model | Full QEMU emulation | Minimal virtio | Passthrough | Time-bounded leases |
| Isolation basis | EPT/stage-2 | EPT/stage-2 | Capabilities + page tables | Capabilities + stage-2 + graph theory |
| Scheduling | Linux CFS | Linux CFS | Priority-based | Graph-pressure-driven, 3 modes |
| IPC | Virtio rings | VSOCK | Synchronous IPC | Zero-copy CommEdges with coherence tracking |
| Audit | None built-in | None built-in | Formal proof (binary level) | Witness log (every privileged action) |
| Mutation control | None | None | Capability rights | Proof-gated (3-tier cryptographic verification) |
| Memory model | Demand paging | Demand paging (host) | Typed memory objects | Tiered (Hot/Warm/Dormant/Cold), no demand paging |
| Dynamic reconfiguration | VM migration (external) | Snapshot/restore | Static CNode tree | Mincut-driven split/merge/migrate |
| Graph awareness | None | None | None | Native: mincut, sparsifier, solver integrated |
| Agent-native | No | No (but fast boot) | No | Yes: WASM partitions, lifecycle management |
| Written in | C (QEMU) + C (Linux) | Rust (VMM) + C (Linux) | C + Isabelle/HOL proofs | Rust (< 500 lines asm per platform) |
| Host OS dependency | Linux required | Linux required | None (standalone) | None (standalone) |

12.2 Key Differentiators

1. Graph-theory-native isolation. No other hypervisor uses mincut algorithms to determine isolation boundaries. KVM and Firecracker rely on the human to define VM boundaries. seL4 relies on the human to define CNode trees. RVM computes boundaries dynamically from observed communication patterns.

2. Proof-gated mutation. seL4 has formal verification of the kernel binary, but does not gate runtime state mutations with proofs. RVM requires a cryptographic proof for every mutation, checked at three tiers (Reflex < 100ns, Standard < 100us, Deep < 10ms).

3. Witness-native auditability. The witness log is not an optional feature or an afterthought. It is woven into every syscall path. Every privileged action produces a 64-byte witness record with a chained hash. The log is tamper-evident and supports deterministic replay.

4. Coherence-driven scheduling. The scheduler does not just balance CPU load. It considers the graph structure of partition communication, novelty of incoming data, and structural risk of pending mutations. This is a fundamentally different optimization target.

5. Tiered memory without demand paging. By eliminating page faults from the critical path and replacing them with explicit tier transitions, RVM achieves deterministic latency while still supporting memory overcommit through compression and serialization.

6. Agent-native runtime. WASM agents are first-class entities with defined lifecycle states (spawn, execute, migrate, hibernate, reconstruct). The hypervisor understands agent communication patterns and uses them to optimize placement.

12.3 Threat Model

RVM assumes:

  • Trusted: The hypervisor binary (verified boot with ML-DSA-65 signatures), hardware
  • Untrusted: All partition code, all agent WASM modules, all inter-partition messages
  • Partially trusted: Device firmware (isolated via leases with bounded time)

The capability system ensures that a compromised partition cannot:

  • Access memory outside its stage-2 address space
  • Send messages on edges it does not hold capabilities for
  • Mutate kernel state without a valid proof
  • Read the witness log without WITNESS capability
  • Acquire devices without LEASE capability
  • Modify another partition's coherence score

12.4 Performance Targets

| Operation | Target Latency | Bound |
| --- | --- | --- |
| Hypercall (syscall) round-trip | < 1 µs | Hardware trap + capability check |
| Zero-copy message send | < 500 ns | Ring buffer enqueue + witness record |
| Notification signal | < 200 ns | Atomic OR + virtual IRQ inject |
| Proof verification (Reflex) | < 100 ns | Hash comparison |
| Proof verification (Standard) | < 100 µs | Merkle witness verification |
| Proof verification (Deep) | < 10 ms | Full coherence check via solver |
| Partition split | < 50 ms | Stage-2 table creation + region remapping |
| Agent migration | < 100 ms | State serialize + transfer + restore |
| Coherence score computation | < 5 ms per epoch | Laplacian solve on sparsified graph |
| Witness record write | < 50 ns | Cache-line-aligned append |

Appendix A: Syscall Table (Extended for Hypervisor)

The Phase A syscall table (12 syscalls) is extended with hypervisor-specific operations:

| # | Syscall | Phase | Proof Required | Witnessed |
| --- | --- | --- | --- | --- |
| 0 | task_spawn | A | No | Yes |
| 1 | cap_grant | A | No | Yes |
| 2 | region_map | A | No | Yes |
| 3 | queue_send | A | No | Yes |
| 4 | queue_recv | A | No | No (read-only) |
| 5 | timer_wait | A | No | No |
| 6 | rvf_mount | A | Yes | Yes |
| 7 | attest_emit | A | Yes | Yes |
| 8 | vector_get | A | No | No (read-only) |
| 9 | vector_put_proved | A | Yes | Yes |
| 10 | graph_apply_proved | A | Yes | Yes |
| 11 | sensor_subscribe | A | No | Yes |
| 12 | partition_create | B+ | Yes | Yes |
| 13 | partition_split | B+ | Yes | Yes |
| 14 | partition_merge | B+ | Yes | Yes |
| 15 | partition_hibernate | B+ | Yes | Yes |
| 16 | partition_reconstruct | B+ | Yes | Yes |
| 17 | commedge_create | B+ | Yes | Yes |
| 18 | commedge_destroy | B+ | Yes | Yes |
| 19 | device_lease_acquire | B+ | Yes | Yes |
| 20 | device_lease_revoke | B+ | Yes | Yes |
| 21 | witness_read | B+ | No | No (read-only) |
| 22 | notify_signal | B+ | No | Yes |
| 23 | notify_wait | B+ | No | No |

Appendix B: New Crate Summary

| Crate | Purpose | Dependencies | Est. Lines |
| --- | --- | --- | --- |
| ruvix-partition | Coherence domain manager | types, cap, region, hal | ~2,000 |
| ruvix-commedge | Inter-partition communication | types, cap, queue | ~1,200 |
| ruvix-pressure | mincut/sparsifier/solver bridge | ruvector-mincut, ruvector-sparsifier, ruvector-solver | ~1,800 |
| ruvix-witness | Append-only audit log + replay | types, physmem | ~1,500 |
| ruvix-agent | WASM agent runtime adapter | types, cap, partition, commedge | ~2,500 |
| ruvix-riscv | RISC-V HS-mode HAL | hal, types | ~2,000 |
| ruvix-x86_64 | x86 VMX root HAL | hal, types | ~2,500 |

Total new code: ~13,500 lines (Rust) + ~1,500 lines (assembly, 3 platforms)

Appendix C: Build and Test

# Build for QEMU AArch64 virt (hypervisor mode)
cargo build --target aarch64-unknown-none \
    --release \
    -p ruvix-nucleus \
    --features "baremetal,aarch64,hypervisor"

# Run on QEMU
qemu-system-aarch64 \
    -machine virt,virtualization=on,gic-version=3 \
    -cpu cortex-a72 \
    -m 1G \
    -smp 4 \
    -nographic \
    -kernel target/aarch64-unknown-none/release/ruvix

# Run unit tests (hosted, std feature)
cargo test --workspace --features "std,test-hosted"

# Run integration tests (QEMU)
cargo test --test qemu_integration --features "qemu-test"

RVM Security Model

Status

Draft -- Research document for RVM bare-metal microhypervisor security architecture.

Date

2026-04-04

Scope

This document specifies the security model for RVM as a standalone, bare-metal, Rust-first microhypervisor for agents and edge computing. RVM does NOT depend on Linux or KVM. It boots directly on hardware (AArch64 primary, x86_64 secondary) and enforces all isolation through its own MMU page tables, capability system, and proof-gated mutation protocol.

The security model builds on the primitives already implemented in Phase A (ruvix-types, ruvix-cap, ruvix-proof, ruvix-region, ruvix-queue, ruvix-vecgraph, ruvix-nucleus) and extends them for bare-metal operation with hardware-enforced isolation.


1. Capability-Based Authority

1.1 Design Philosophy

RVM enforces the principle of least authority through capabilities. There is no ambient authority anywhere in the system. Every syscall requires an explicit capability handle that authorizes the operation. This means:

  • No global namespaces (no filesystem paths, no PIDs, no network ports accessible by name)
  • No superuser or root -- the root task holds initial capabilities but cannot bypass the model
  • No default permissions -- a newly spawned task has exactly the capabilities its parent explicitly grants via cap_grant
  • No ambient access to hardware -- device MMIO regions, interrupt lines, and DMA channels are all gated by capabilities

1.2 Capability Structure

Capabilities are kernel-resident objects. User-space code never sees the raw capability; it holds an opaque CapHandle that the kernel resolves through a per-task capability table.

/// The kernel-side capability. User space never sees this directly.
/// File: crates/ruvix/crates/types/src/capability.rs
#[repr(C)]
pub struct Capability {
    pub object_id: u64,          // Kernel object being referenced
    pub object_type: ObjectType, // Region, Queue, VectorStore, Task, etc.
    pub rights: CapRights,       // Bitmap of permitted operations
    pub badge: u64,              // Caller-visible demux identifier
    pub epoch: u64,              // Revocation epoch (stale handles detected)
}

Rights bitmap (from crates/ruvix/crates/types/src/capability.rs):

| Right | Bit | Authorizes |
| --- | --- | --- |
| READ | 0 | vector_get, queue_recv, region read |
| WRITE | 1 | queue_send, region append/slab write |
| GRANT | 2 | cap_grant to another task |
| REVOKE | 3 | Revoke capabilities derived from this one |
| EXECUTE | 4 | Task entry point, RVF component execution |
| PROVE | 5 | Generate proof tokens (vector_put_proved, graph_apply_proved) |
| GRANT_ONCE | 6 | Non-transitive grant (derived cap cannot re-grant) |
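The rights bitmap and the subset rule behind attenuation reduce to plain bit operations. The sketch below uses hypothetical constants and function names (the real `CapRights` type lives in ruvix-types), but the logic mirrors the rules in the table:

```rust
// Bit positions follow the rights table above (hypothetical constants).
const READ: u8 = 1 << 0;
const WRITE: u8 = 1 << 1;
const GRANT: u8 = 1 << 2;
const GRANT_ONCE: u8 = 1 << 6;

/// `sub` is a subset of `sup` iff it sets no bit that `sup` lacks.
fn is_subset(sub: u8, sup: u8) -> bool {
    sub & !sup == 0
}

/// Attenuation rule: delegation needs GRANT, and the requested rights
/// must be a subset of the held rights. GRANT_ONCE strips GRANT (and
/// itself) from the derived rights, making the grant non-transitive.
fn attenuate(held: u8, requested: u8) -> Option<u8> {
    if held & GRANT == 0 || !is_subset(requested, held) {
        return None;
    }
    if held & GRANT_ONCE != 0 {
        Some(requested & !(GRANT | GRANT_ONCE))
    } else {
        Some(requested)
    }
}
```

Note that rights escalation fails structurally: a bit absent from `held` can never survive `is_subset`, so no sequence of delegations widens authority.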

1.3 Capability Delegation and Attenuation

Delegation follows strict monotonic attenuation: a task can only grant capabilities it holds, and the granted rights must be a subset of the held rights. This is enforced at the type level in Capability::derive():

/// Derive a capability with equal or fewer rights.
/// Returns None if rights escalation is attempted or GRANT right is absent.
pub fn derive(&self, new_rights: CapRights, new_badge: u64) -> Option<Self> {
    if !self.has_rights(CapRights::GRANT) { return None; }
    if !new_rights.is_subset_of(self.rights) { return None; }
    // GRANT_ONCE strips GRANT from the derived cap
    let final_rights = if self.rights.contains(CapRights::GRANT_ONCE) {
        new_rights.difference(CapRights::GRANT).difference(CapRights::GRANT_ONCE)
    } else {
        new_rights
    };
    Some(Self { rights: final_rights, badge: new_badge, ..*self })
}

Delegation depth limit: Maximum 8 levels (configurable per RVF manifest). The derivation tree tracks the full chain, and audit flags chains deeper than 4 (AUDIT_DEPTH_WARNING_THRESHOLD).

1.4 Capability Revocation

Revocation propagates through the derivation tree. When a capability is revoked:

  1. The capability's epoch is incremented in the kernel's object table
  2. All entries in the derivation tree rooted at the revoked capability are invalidated
  3. Any held CapHandle referencing the old epoch returns KernelError::StaleCapability

This is O(d) where d is the number of derived capabilities, bounded by the delegation depth limit and the per-task capability table size (1024 entries max).
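Epoch-based staleness makes the handle-resolution side of revocation a single integer comparison; the O(d) tree walk only has to bump epochs. A minimal sketch with hypothetical names (the real object table lives in the kernel and is bounded at 1024 entries per task):

```rust
use std::collections::BTreeMap;

/// Handle held by user space: an object reference frozen at grant time.
#[derive(Clone, Copy)]
struct CapHandle {
    object_id: u64,
    epoch: u64,
}

#[derive(Debug, PartialEq)]
enum KernelError {
    StaleCapability,
    UnknownObject,
}

/// Kernel object table: current revocation epoch per object.
struct ObjectTable {
    epochs: BTreeMap<u64, u64>,
}

impl ObjectTable {
    /// Resolve a handle; a bumped epoch makes all old handles stale.
    fn resolve(&self, h: CapHandle) -> Result<u64, KernelError> {
        match self.epochs.get(&h.object_id) {
            Some(&e) if e == h.epoch => Ok(h.object_id),
            Some(_) => Err(KernelError::StaleCapability),
            None => Err(KernelError::UnknownObject),
        }
    }

    /// Revoke: one epoch bump invalidates every outstanding handle to
    /// this object. The tree walk repeats this per derived capability.
    fn revoke(&mut self, object_id: u64) {
        if let Some(e) = self.epochs.get_mut(&object_id) {
            *e += 1;
        }
    }
}
```

The design choice here is that revocation never has to chase down handle copies: stale handles are detected lazily, at the moment a task next presents them.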

1.5 How This Differs from DAC/MAC

| Property | DAC (Unix) | MAC (SELinux) | Capability (RVM) |
| --- | --- | --- | --- |
| Authority source | User identity | System-wide policy labels | Explicit token per object |
| Ambient authority | Yes (UID 0) | Yes (unconfined domain) | None |
| Confused deputy | Possible | Mitigated by labels | Prevented by design |
| Delegation | chmod/chown | Policy reload | cap_grant with attenuation |
| Revocation | File permission change | Policy reload | Tree-propagating, epoch-based |
| Granularity | File/directory | Type/role/level | Per-object, per-right |
The critical difference: in RVM, authority is carried by the message, not the sender's identity. A task cannot access a resource simply because of "who it is" -- it must present a valid capability handle that was explicitly granted to it through a traceable delegation chain.


2. Proof-Gated Mutation

2.1 Invariant

No state mutation without a valid proof token. This is a kernel invariant, not a policy. The kernel physically prevents mutation of vector stores, graph stores, and RVF mounts without a ProofToken that passes all verification steps. Read operations (vector_get, queue_recv) do not require proofs.

2.2 What Constitutes a Valid Proof

A proof token must pass six verification steps (implemented in crates/ruvix/crates/vecgraph/src/proof_policy.rs ProofVerifier::verify()):

  1. Capability check: The calling task must hold a capability with PROVE right on the target object
  2. Hash match: proof.mutation_hash == expected_mutation_hash -- the proof authorizes exactly the mutation being applied
  3. Tier satisfaction: proof.tier >= policy.required_tier -- higher tiers satisfy lower requirements (Deep satisfies Standard satisfies Reflex)
  4. Time bound: current_time_ns <= proof.valid_until_ns -- proofs expire
  5. Validity window: The window proof.valid_until_ns - current_time_ns must not exceed policy.max_validity_window_ns (prevents pre-computing proofs far in advance)
  6. Nonce uniqueness: Each nonce can be consumed exactly once (ring buffer of 64 recent nonces prevents replay)
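The six steps compose into a short, ordered verifier. The sketch below stubs the capability lookup (step 1) as a boolean and uses hypothetical type shapes; the real `ProofVerifier::verify()` lives in ruvix-vecgraph's proof_policy module:

```rust
#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum Tier {
    Reflex,
    Standard,
    Deep, // declaration order gives Reflex < Standard < Deep
}

struct ProofToken {
    mutation_hash: [u8; 32],
    tier: Tier,
    valid_until_ns: u64,
    nonce: u64, // must be nonzero in this sketch (0 fills the empty ring)
}

struct ProofPolicy {
    required_tier: Tier,
    max_validity_window_ns: u64,
}

/// Ring of the 64 most recently consumed nonces (replay window).
struct NonceRing {
    slots: [u64; 64],
    next: usize,
}

#[derive(Debug, PartialEq)]
enum ProofError {
    MissingProveRight,
    HashMismatch,
    TierTooLow,
    Expired,
    WindowTooWide,
    NonceReplayed,
}

impl NonceRing {
    fn consume(&mut self, nonce: u64) -> bool {
        if self.slots.contains(&nonce) {
            return false; // replay within the window
        }
        self.slots[self.next] = nonce;
        self.next = (self.next + 1) % 64;
        true
    }
}

fn verify(
    has_prove_right: bool, // step 1: capability lookup, stubbed here
    token: &ProofToken,
    expected_hash: &[u8; 32],
    policy: &ProofPolicy,
    now_ns: u64,
    ring: &mut NonceRing,
) -> Result<(), ProofError> {
    if !has_prove_right {
        return Err(ProofError::MissingProveRight); // 1. PROVE right
    }
    if &token.mutation_hash != expected_hash {
        return Err(ProofError::HashMismatch); // 2. exact mutation match
    }
    if token.tier < policy.required_tier {
        return Err(ProofError::TierTooLow); // 3. higher tiers satisfy lower
    }
    if now_ns > token.valid_until_ns {
        return Err(ProofError::Expired); // 4. time bound
    }
    if token.valid_until_ns - now_ns > policy.max_validity_window_ns {
        return Err(ProofError::WindowTooWide); // 5. no far-future proofs
    }
    if !ring.consume(token.nonce) {
        return Err(ProofError::NonceReplayed); // 6. single use
    }
    Ok(())
}
```

Note the ordering: the nonce is consumed last, so a token that fails any earlier check does not burn its nonce.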

2.3 Proof Tiers

Three tiers provide a latency/assurance tradeoff:

| Tier | Name | Latency Budget | Payload | Use Case |
| --- | --- | --- | --- | --- |
| 0 | Reflex | < 1 µs | SHA-256 hash | High-frequency vector updates |
| 1 | Standard | < 100 µs | Merkle witness (root + path) | Graph mutations |
| 2 | Deep | < 10 ms | Coherence certificate (scores + partition + signature) | Structural changes |

2.4 Proof Lifecycle

  Task                    Proof Engine (RVF component)              Kernel
    |                              |                                  |
    |-- prepare mutation --------->|                                  |
    |   (compute mutation_hash)    |                                  |
    |                              |-- evaluate coherence state ----->|
    |                              |<-- current state ----------------|
    |                              |                                  |
    |<-- ProofToken ---------------|                                  |
    |   (hash, tier, payload,      |                                  |
    |    expiry, nonce)            |                                  |
    |                              |                                  |
    |-- syscall (token) ------------------------------------------>|
    |                                                               |
    |                              Kernel verifies 6 steps:         |
    |                              1. PROVE right on cap            |
    |                              2. Hash match                    |
    |                              3. Tier >= policy                |
    |                              4. Not expired                   |
    |                              5. Window not too wide           |
    |                              6. Nonce not reused              |
    |                                                               |
    |<-- ProofAttestation (82 bytes) -------------------------------|
    |                              |                                  |
    |                              |   Witness record appended       |

2.5 What Requires a Proof

| Operation | Proof Required | Minimum Tier |
| --- | --- | --- |
| region_map | Yes (capability proof) | N/A -- capability check only |
| vector_put_proved | Yes | Per-store ProofPolicy |
| graph_apply_proved | Yes | Per-store ProofPolicy |
| rvf_mount | Yes | Deep (signature verification) |
| vector_get | No | N/A |
| queue_send / queue_recv | No | N/A (capability-gated only) |
| task_spawn | No | N/A (capability-gated only) |
| cap_grant | No | N/A (GRANT right required) |
| timer_wait | No | N/A |
| attest_emit | Yes (proof consumed) | Per-operation |
| sensor_subscribe | No | N/A (capability-gated only) |

2.6 Proof-Gated Device Mapping (Bare-Metal Extension)

On bare metal, device MMIO regions are mapped into a task's address space through region_map with a RegionPolicy::DeviceMmio variant (new for Phase B). This mapping requires:

  1. A capability with READ and/or WRITE rights on the device object
  2. A ProofToken with tier >= Standard proving the task's intent matches the device mapping
  3. The device must not already be mapped to another partition (exclusive lease)

/// Extended region policy for bare-metal device access.
/// New for Phase B -- extends the existing RegionPolicy enum.
pub enum RegionPolicy {
    Immutable,
    AppendOnly { max_size: usize },
    Slab { slot_size: usize, slot_count: usize },
    /// Device MMIO region. Mapped as uncacheable, device memory.
    /// Requires proof-gated capability for mapping.
    DeviceMmio {
        phys_base: u64,    // Physical base address of MMIO range
        size: usize,       // Size in bytes
        device_id: u32,    // Kernel-assigned device identifier
    },
}

2.7 Proof-Gated Migration

Partition migration (moving a task and its state from one physical node to another in an RVM mesh) requires a Deep-tier proof containing:

  • Coherence certificate showing the partition's state is consistent
  • Source and destination node attestation (both nodes are trusted)
  • Hash of the serialized partition state

Without this proof, the kernel refuses to serialize or deserialize partition state.

/// Trait for migration authorization. Implemented by the migration subsystem.
pub trait MigrationAuthority {
    /// Verify that migration of this partition is authorized.
    /// Returns the serialized partition state only if proof validates.
    fn authorize_migration(
        &mut self,
        partition_id: u32,
        destination_attestation: &ProofAttestation,
        proof: &ProofToken,
    ) -> Result<SerializedPartition, KernelError>;

    /// Accept an incoming migrated partition.
    /// Verifies the source attestation and proof before instantiating.
    fn accept_migration(
        &mut self,
        serialized: &SerializedPartition,
        source_attestation: &ProofAttestation,
        proof: &ProofToken,
    ) -> Result<PartitionHandle, KernelError>;
}

2.8 Proof-Gated Partition Merge/Split

Graph partitions (mincut boundaries in the vecgraph store) can only be merged or split with a Deep-tier proof that includes the coherence impact analysis:

pub enum GraphMutationKind {
    AddNode { /* ... */ },
    RemoveNode { /* ... */ },
    AddEdge { /* ... */ },
    RemoveEdge { /* ... */ },
    UpdateWeight { /* ... */ },
    /// Merge two partitions. Requires Deep-tier proof with coherence cert.
    MergePartitions {
        source_partition: u32,
        target_partition: u32,
    },
    /// Split a partition at a mincut boundary. Requires Deep-tier proof.
    SplitPartition {
        partition: u32,
        cut_specification: MinCutSpec,
    },
}

3. Witness-Native Audit

3.1 Design Principle

Every privileged action in RVM emits a witness record to the kernel's append-only witness log. "Privileged action" means any syscall that mutates kernel state: vector writes, graph mutations, RVF mounts, task spawns, capability grants, region mappings.

3.2 Witness Record Format

Each record is 96 bytes, compact enough to sustain thousands of records per second on embedded hardware without blocking the syscall path:

/// 96-byte witness record.
/// File: crates/ruvix/crates/nucleus/src/witness_log.rs
#[repr(C)]
pub struct WitnessRecord {
    pub sequence: u64,            // Monotonically increasing (8 bytes)
    pub kind: WitnessRecordKind,  // Boot, Mount, VectorMutation, etc. (1 byte)
    pub timestamp_ns: u64,        // Nanoseconds since boot (8 bytes)
    pub mutation_hash: [u8; 32],  // SHA-256 of the mutation data (32 bytes)
    pub attestation_hash: [u8; 32], // Hash of the proof attestation (32 bytes)
    pub resource_id: u64,         // Object identifier (8 bytes)
    // 7 bytes padding to 96
}

Record kinds:

| Kind | Value | Emitted By |
| --- | --- | --- |
| Boot | 0 | kernel_entry at boot completion |
| Mount | 1 | rvf_mount syscall |
| VectorMutation | 2 | vector_put_proved syscall |
| GraphMutation | 3 | graph_apply_proved syscall |
| Checkpoint | 4 | Periodic state snapshots |
| ReplayComplete | 5 | After replaying from checkpoint |
| CapGrant | 6 | cap_grant syscall (proposed extension) |
| CapRevoke | 7 | Capability revocation (proposed extension) |
| TaskSpawn | 8 | task_spawn syscall (proposed extension) |
| DeviceMap | 9 | Device MMIO mapping (proposed extension) |

3.3 Tamper Evidence

The witness log must be tamper-evident. The current Phase A implementation uses simple append-only semantics with FNV-1a hashing. For bare-metal, the following extensions are required:

Hash chaining: Each witness record includes the hash of the previous record, forming a Merkle-like chain. Tampering with any record invalidates all subsequent records.

/// Extended witness record with hash chaining for tamper evidence.
pub struct ChainedWitnessRecord {
    /// The base witness record (96 bytes).
    pub record: WitnessRecord,
    /// SHA-256 hash of the previous record's serialized bytes.
    /// For the first record (sequence 0), this is all zeros.
    pub prev_hash: [u8; 32],
    /// SHA-256(serialize(record) || prev_hash). Computed by the kernel.
    pub chain_hash: [u8; 32],
}
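The chain invariant is small enough to sketch end to end. This toy uses std's DefaultHasher as a stand-in digest (the kernel uses SHA-256; DefaultHasher is not cryptographic and only keeps the example dependency-free), but the tamper-propagation property is the same: editing any record invalidates every later chain hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in digest. NOT cryptographically secure -- the real log
/// chains SHA-256 over the serialized 96-byte record.
fn digest(payload: &[u8], prev: u64) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    prev.hash(&mut h);
    h.finish()
}

struct ChainedRecord {
    payload: Vec<u8>,
    prev_hash: u64,
    chain_hash: u64,
}

struct WitnessLog {
    records: Vec<ChainedRecord>,
}

impl WitnessLog {
    fn new() -> Self {
        Self { records: Vec::new() }
    }

    /// Append: link the new record to the previous chain hash
    /// (zero for the genesis record, as in the spec above).
    fn append(&mut self, payload: Vec<u8>) {
        let prev = self.records.last().map_or(0, |r| r.chain_hash);
        let chain = digest(&payload, prev);
        self.records.push(ChainedRecord { payload, prev_hash: prev, chain_hash: chain });
    }

    /// Walk the chain from genesis; any edited record fails here.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        self.records.iter().all(|r| {
            let ok = r.prev_hash == prev && r.chain_hash == digest(&r.payload, prev);
            prev = r.chain_hash;
            ok
        })
    }
}
```

An attacker who rewrites record k must also recompute every chain hash from k onward, which the TEE-signing extension below makes impossible without the hardware-bound key.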

TEE signing (when available): On hardware with TrustZone (Raspberry Pi 4/5), witness records can be signed by the Secure World using a device-unique key. This means even a compromised kernel (EL1) cannot forge witness entries:

/// Trait for hardware-backed witness signing.
pub trait WitnessSigner {
    /// Sign a chained witness record using hardware-bound key.
    /// On AArch64 with TrustZone, this issues an SMC to Secure World.
    /// On platforms without TEE, returns None (software chain only).
    fn sign_witness(&self, record: &ChainedWitnessRecord) -> Option<[u8; 64]>;

    /// Verify a signed witness record.
    fn verify_witness_signature(
        &self,
        record: &ChainedWitnessRecord,
        signature: &[u8; 64],
    ) -> bool;
}

3.4 Replayability and Forensics

The witness log, combined with periodic checkpoints, enables deterministic replay:

  1. Checkpoint: The kernel serializes all vector stores, graph stores, capability tables, and scheduler state to an immutable region. A WitnessRecordKind::Checkpoint record captures the state hash and the witness sequence number at checkpoint time.

  2. Replay: Starting from a checkpoint, the kernel replays all witness records in sequence order, re-applying each mutation. Because mutations are deterministic (same proof token + same state = same result), the final state is identical.

  3. Forensic query: External tools can load the witness log and answer questions like:

    • "Which task mutated vector store X between timestamps T1 and T2?"
    • "What was the coherence score before and after each graph mutation?"
    • "Has the hash chain been broken?" (indicates tampering)

3.5 Witness-Enabled Rollback/Recovery

If a coherence violation is detected (coherence score drops below the configured threshold), the kernel can:

  1. Stop accepting new mutations to the affected partition
  2. Find the most recent checkpoint where coherence was above threshold
  3. Replay witnesses from that checkpoint, skipping the offending mutation
  4. Resume normal operation from the corrected state

This requires the offending mutation to be identified by its witness record (the mutation_hash and attestation_hash pinpoint exactly which operation caused the violation).
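The rollback sequence above can be sketched as a replay loop over a toy state. `ToyState`, `apply`, and `rollback_replay` are illustrative names; `apply` stands in for deterministically re-executing a witnessed mutation:

```rust
/// Toy state: the ordered list of applied mutation ids.
#[derive(Clone, PartialEq, Debug)]
pub struct ToyState {
    pub applied: Vec<u64>,
}

/// Stand-in for re-applying one witnessed mutation deterministically.
pub fn apply(state: &mut ToyState, mutation_id: u64) {
    state.applied.push(mutation_id);
}

/// Steps 2-4: restart from the checkpointed state and replay every
/// witnessed mutation in sequence order, skipping the one identified
/// (via its witness record) as the cause of the coherence violation.
pub fn rollback_replay(
    checkpoint: &ToyState,
    witnessed: &[u64], // mutation ids recorded after the checkpoint, in order
    offending: u64,    // pinpointed by mutation_hash / attestation_hash
) -> ToyState {
    let mut state = checkpoint.clone();
    for &m in witnessed {
        if m == offending {
            continue; // skip the mutation that broke coherence
        }
        apply(&mut state, m);
    }
    state
}
```

Determinism of mutations (same proof token + same state = same result) is what makes the recovered state well defined.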


4. Isolation Model

4.1 Partition Isolation Guarantees

RVM partitions are the unit of isolation. Each partition consists of:

  • One or more tasks sharing a capability namespace
  • A set of regions (memory objects) accessible only through capabilities held by those tasks
  • Queue endpoints for controlled inter-partition communication

Isolation guarantee: A partition cannot access any memory, device, or kernel object for which it does not hold a valid capability. This is enforced at two levels:

  1. Software: The capability table lookup in every syscall rejects invalid or stale handles
  2. Hardware: MMU page tables enforce that each partition's regions are mapped only in that partition's address space, with no overlapping physical pages between partitions (except explicitly shared immutable regions)

4.2 MMU-Enforced Memory Isolation (Bare Metal)

On bare metal, RVM directly controls the AArch64 MMU. Each partition gets its own translation tables loaded via TTBR0_EL1 on context switch:

/// Per-partition page table management.
/// Kernel mappings use TTBR1_EL1 (shared across all partitions).
/// Partition mappings use TTBR0_EL1 (swapped on context switch).
pub trait PartitionAddressSpace {
    /// Create a new empty address space for a partition.
    fn create() -> Result<Self, KernelError> where Self: Sized;

    /// Map a region into this partition's address space.
    /// Physical pages are allocated from the kernel's physical allocator.
    /// Page table entries enforce the region's policy:
    ///   Immutable  -> PTE_USER | PTE_RO | PTE_CACHEABLE
    ///   AppendOnly -> PTE_KERNEL_RW | PTE_CACHEABLE (user writes via syscall)
    ///   Slab       -> PTE_KERNEL_RW | PTE_CACHEABLE (user writes via syscall)
    ///   DeviceMmio -> PTE_USER | PTE_DEVICE | PTE_nG (non-global, per-partition)
    fn map_region(
        &mut self,
        region: &RegionDescriptor,
        phys_pages: &[PhysFrame],
    ) -> Result<VirtAddr, KernelError>;

    /// Unmap a region, invalidating all TLB entries for those pages.
    fn unmap_region(&mut self, virt_addr: VirtAddr, size: usize) -> Result<(), KernelError>;

    /// Activate this address space (write to TTBR0_EL1 + TLBI).
    unsafe fn activate(&self);
}

Critical invariant: The kernel NEVER maps the same physical page as writable in two different partitions' address spaces simultaneously. Immutable regions may be shared read-only (content-addressable deduplication is safe for immutable data).
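A sketch of how this invariant can be checked at map time. `FrameOwnership`, `Access`, and `try_map` are illustrative names, and treating any existing mapping as a conflict for a new writable mapping is a simplifying assumption (erring on the side of rejection):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
pub enum Access { ReadOnly, ReadWrite }

/// Tracks which partitions map each physical frame, enforcing:
/// a frame writable anywhere is mapped nowhere else; read-only
/// (immutable) frames may be shared freely across partitions.
pub struct FrameOwnership {
    maps: HashMap<u64, Vec<(u32, Access)>>, // frame -> (partition, access)
}

impl FrameOwnership {
    pub fn new() -> Self {
        Self { maps: HashMap::new() }
    }

    pub fn try_map(
        &mut self,
        frame: u64,
        partition: u32,
        access: Access,
    ) -> Result<(), &'static str> {
        let entries = self.maps.entry(frame).or_default();
        let violates = match access {
            // A new writable mapping conflicts with ANY existing mapping.
            Access::ReadWrite => !entries.is_empty(),
            // A new read-only mapping conflicts only with a writable one.
            Access::ReadOnly => entries.iter().any(|&(_, a)| a == Access::ReadWrite),
        };
        if violates {
            return Err("would alias a writable frame across partitions");
        }
        entries.push((partition, access));
        Ok(())
    }
}
```

The check runs before any page-table entry is written, so a rejected mapping never becomes visible to the MMU.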

4.3 EL1/EL0 Separation

  • EL1 (kernel mode): All kernel code, syscall handlers, interrupt handlers, scheduler, capability table, proof verifier, witness log
  • EL0 (user mode): All RVF components, WASM runtimes, AgentDB, all application code

Syscalls transition EL0 -> EL1 via the SVC instruction. The exception handler in EL1 validates the capability before dispatching to the syscall implementation. Return to EL0 uses ERET.

No EL0 code can:

  • Read or write kernel memory (TTBR1_EL1 mappings are PTE_KERNEL_RW)
  • Modify page tables (page table pages are not mapped in EL0)
  • Disable interrupts (only EL1 can mask IRQs via DAIF)
  • Access device MMIO unless explicitly mapped through a capability

4.4 Side-Channel Mitigation

4.4.1 Spectre v1 (Bounds Check Bypass)

  • All array accesses in the kernel use bounds-checked indexing (Rust's default)
  • The CapabilityTable uses get() returning Option<&T>, never unchecked indexing
  • Critical paths include an lfence / csdb barrier after bounds checks on the syscall dispatch path
/// Spectre-safe capability table lookup.
/// The index is bounds-checked, and a speculation barrier follows.
pub fn lookup(&self, handle: CapHandle) -> Option<&Capability> {
    let idx = handle.raw().id as usize;
    if idx >= self.entries.len() {
        return None;
    }
    // AArch64: CSDB (Consumption of Speculative Data Barrier)
    // Prevents speculative use of the result before bounds check resolves
    #[cfg(target_arch = "aarch64")]
    unsafe { core::arch::asm!("csdb"); }
    self.entries.get(idx).and_then(|e| e.as_ref())
}

4.4.2 Spectre v2 (Branch Target Injection)

  • AArch64: Enable branch prediction barriers via SCTLR_EL1 configuration
  • On context switch between partitions: flush branch predictor state (IC IALLU + TLBI VMALLE1IS + DSB ISH + ISB)
  • Kernel compiled with -Zbranch-protection=bti (Branch Target Identification)

4.4.3 Meltdown (Rogue Data Cache Load)

  • Most AArch64 cores are not vulnerable to Meltdown; RVM additionally enables Privileged Access Never (PAN) as defense in depth
  • RVM sets PSTATE.PAN at boot, and configures SCTLR_EL1.SPAN so that PAN is automatically re-enabled on every exception entry to EL1
  • The kernel accesses user memory only through explicit copy routines that temporarily clear PAN

4.4.4 Microarchitectural Data Sampling (MDS)

  • On x86_64 (secondary target): VERW-based buffer clearing on every kernel exit
  • On AArch64 (primary target): Not vulnerable to known MDS variants
  • Defense in depth: all sensitive kernel data structures are allocated in dedicated slab regions that are never shared across partitions

4.5 Time Isolation

Timing side channels are mitigated through several mechanisms:

  1. Fixed-time capability lookup: The capability table lookup path executes in constant time regardless of whether the capability is found or not (compare all entries, select result at the end)

  2. Scheduler noise injection: The scheduler adds a small random jitter (0-10 us) to context switch timing to prevent a partition from inferring another partition's behavior from scheduling patterns

  3. Timer virtualization: Each partition sees a virtual timer (CNTVCT_EL0) that advances at the configured rate but does not leak information about other partitions' execution. The kernel programs CNTV_CVAL_EL0 per-partition.

  4. Constant-time proof verification: The ProofVerifier::verify() path is written to avoid early returns that would leak information about which check failed. All six checks execute, and only the final result is returned.

/// Constant-time proof verification to prevent timing side channels.
/// All checks execute regardless of early failures.
pub fn verify_constant_time(
    &mut self,
    proof: &ProofToken,
    expected_hash: &[u8; 32],
    current_time_ns: u64,
    capability: &Capability,
) -> Result<ProofAttestation, KernelError> {
    let mut valid = true;

    // All checks execute -- no early return
    valid &= capability.has_rights(CapRights::PROVE);
    valid &= proof.mutation_hash == *expected_hash;
    valid &= self.policy.tier_satisfies(proof.tier);
    valid &= !proof.is_expired(current_time_ns);
    valid &= (proof.valid_until_ns.saturating_sub(current_time_ns))
        <= self.policy.max_validity_window_ns;
    let nonce_ok = self.nonce_tracker.check_and_mark(proof.nonce);
    valid &= nonce_ok;

    if valid {
        Ok(self.create_attestation(proof, current_time_ns))
    } else {
        // Roll back nonce if overall verification failed
        if nonce_ok {
            self.nonce_tracker.unmark(proof.nonce);
        }
        Err(KernelError::ProofRejected)
    }
}
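The `check_and_mark` / `unmark` pair used above can be sketched as a bounded bitmap tracker. `NonceTracker` and its slot-hashing scheme are illustrative; rejecting distinct nonces that collide in the same slot is a simplifying assumption that errs on the side of refusing a proof:

```rust
/// Bounded single-use nonce tracker: nonces hash into a fixed bitmap of
/// 4096 slots, and a set bit means "already consumed".
pub struct NonceTracker {
    bits: [u64; 64], // 64 words x 64 bits = 4096 slots
}

impl NonceTracker {
    pub fn new() -> Self {
        Self { bits: [0; 64] }
    }

    // Map a nonce to (word index, bit mask) via a multiplicative hash.
    fn slot(nonce: u64) -> (usize, u64) {
        let h = (nonce.wrapping_mul(0x9e3779b97f4a7c15) >> 52) as usize; // top 12 bits
        (h / 64, 1u64 << (h % 64))
    }

    /// Returns true and marks the nonce if it was unused; false if replayed.
    pub fn check_and_mark(&mut self, nonce: u64) -> bool {
        let (word, mask) = Self::slot(nonce);
        if self.bits[word] & mask != 0 {
            return false;
        }
        self.bits[word] |= mask;
        true
    }

    /// Roll back a mark when the enclosing verification fails overall,
    /// so a rejected proof does not burn its nonce.
    pub fn unmark(&mut self, nonce: u64) {
        let (word, mask) = Self::slot(nonce);
        self.bits[word] &= !mask;
    }
}
```

The fixed-size bitmap keeps the tracker O(1) in both time and memory, matching the "nonce tracker bounded" DoS defense described later.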

4.6 Coherence Domain Isolation

Each vector store and graph store belongs to a coherence domain. Coherence domains provide an additional layer of isolation at the semantic level:

  • Mutations within a coherence domain are evaluated against that domain's coherence config
  • Cross-domain references require explicit capability-mediated linking
  • Coherence violations in one domain do not affect other domains
  • Each domain has its own proof policy, nonce tracker, and witness region
/// Coherence domain configuration.
pub struct CoherenceDomain<'a> {
    pub domain_id: u32,
    pub vector_stores: &'a [VectorStoreHandle],
    pub graph_stores: &'a [GraphHandle],
    pub proof_policy: ProofPolicy,
    pub min_coherence_score: u16,  // 0-10000 (0.00-1.00)
    pub isolation_level: DomainIsolationLevel,
}

pub enum DomainIsolationLevel {
    /// Stores in this domain share no physical pages with other domains.
    Full,
    /// Read-only immutable data may be shared across domains.
    SharedImmutable,
}
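Since `min_coherence_score` is a fixed-point value (u16 scaled by 10,000), the per-domain gate reduces to an integer comparison. A minimal sketch, where `score_to_fixed` and its saturating round-to-nearest conversion are assumptions, not a documented RVM API:

```rust
/// Per-domain gate applied before a mutation commits: the predicted
/// post-mutation coherence score must meet the domain's floor.
pub fn mutation_allowed(domain_min: u16, predicted_score: u16) -> bool {
    predicted_score >= domain_min
}

/// Convert a floating score in [0.0, 1.0] to the fixed-point wire format
/// (assumed: clamp then round to nearest).
pub fn score_to_fixed(score: f32) -> u16 {
    (score.clamp(0.0, 1.0) * 10_000.0).round() as u16
}
```

Keeping scores as scaled integers avoids floating point in the kernel's hot path and makes comparisons exact.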

5. Device Security

5.1 Lease-Based Device Access

Devices are not permanently assigned to partitions. Instead, RVM uses time-bounded, revocable leases:

/// A time-bounded, revocable lease on a device.
pub struct DeviceLease {
    /// Capability handle authorizing device access.
    pub cap: CapHandle,
    /// Device identifier (kernel-assigned, not hardware address).
    pub device_id: DeviceId,
    /// Lease start time (nanoseconds since boot).
    pub granted_at_ns: u64,
    /// Lease expiry (0 = no expiry, must be explicitly revoked).
    pub expires_at_ns: u64,
    /// Rights on the device (READ for sensors, WRITE for actuators, both for DMA).
    pub rights: CapRights,
    /// The MMIO region mapped for this lease (None if not yet mapped).
    pub mmio_region: Option<RegionHandle>,
}

/// Trait for the device lease manager.
pub trait DeviceLeaseManager {
    /// Request a lease on a device. Requires a capability with appropriate rights.
    /// The lease is time-bounded; after expiry, the mapping is automatically torn down.
    fn request_lease(
        &mut self,
        device_id: DeviceId,
        cap: CapHandle,
        duration_ns: u64,
    ) -> Result<DeviceLease, KernelError>;

    /// Renew an existing lease. Must be called before expiry.
    fn renew_lease(
        &mut self,
        lease: &mut DeviceLease,
        additional_ns: u64,
    ) -> Result<(), KernelError>;

    /// Revoke a lease immediately. Tears down MMIO mapping and flushes DMA.
    fn revoke_lease(&mut self, lease: DeviceLease) -> Result<(), KernelError>;

    /// Check if a lease is still valid.
    fn is_lease_valid(&self, lease: &DeviceLease, current_time_ns: u64) -> bool;
}

Lease lifecycle:

  1. Partition requests a lease via request_lease() with a capability
  2. Kernel checks the capability has appropriate rights on the device object
  3. Kernel maps the device's MMIO region into the partition's address space as RegionPolicy::DeviceMmio with PTE_DEVICE (uncacheable) flags
  4. Kernel programs an expiry timer; when it fires, the lease is automatically torn down
  5. On teardown: MMIO pages are unmapped, TLB is flushed, DMA channels are reset
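The expiry logic in steps 4-5 can be sketched with a toy lease. `ToyLease`, `is_lease_valid`, and `renew` are illustrative, mirroring the `DeviceLeaseManager` semantics above (`expires_at_ns == 0` means "no auto-expiry, explicit revoke only"):

```rust
/// Minimal lease-expiry bookkeeping.
pub struct ToyLease {
    pub granted_at_ns: u64,
    pub expires_at_ns: u64, // 0 = no expiry, must be explicitly revoked
}

/// A lease is valid if it never expires or the deadline has not passed.
pub fn is_lease_valid(lease: &ToyLease, now_ns: u64) -> bool {
    lease.expires_at_ns == 0 || now_ns < lease.expires_at_ns
}

/// Renewal must happen before expiry; extending a dead lease is rejected,
/// forcing the partition to request a fresh lease (and fresh mapping).
pub fn renew(
    lease: &mut ToyLease,
    now_ns: u64,
    additional_ns: u64,
) -> Result<(), &'static str> {
    if !is_lease_valid(lease, now_ns) {
        return Err("lease already expired");
    }
    if lease.expires_at_ns != 0 {
        lease.expires_at_ns = lease.expires_at_ns.saturating_add(additional_ns);
    }
    Ok(())
}
```

Refusing renewal after expiry is what makes teardown (unmap, TLB flush, DMA reset) a hard boundary rather than a suggestion.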

5.2 DMA Isolation

DMA is the most dangerous hardware capability because DMA engines can read/write arbitrary physical memory. RVM uses a layered defense:

5.2.1 With IOMMU (Preferred)

On platforms with an IOMMU (ARM SMMU, Intel VT-d), the kernel programs the IOMMU's page tables to restrict each device's DMA to only the physical pages belonging to the leaseholder's regions:

/// IOMMU-based DMA isolation.
pub trait IommuController {
    /// Create a DMA mapping for a device, restricting it to the given physical pages.
    /// The device can only DMA to/from these pages and no others.
    fn map_device_dma(
        &mut self,
        device_id: DeviceId,
        allowed_pages: &[PhysFrame],
        direction: DmaDirection,
    ) -> Result<DmaMapping, KernelError>;

    /// Remove a DMA mapping, preventing the device from accessing those pages.
    fn unmap_device_dma(
        &mut self,
        device_id: DeviceId,
        mapping: DmaMapping,
    ) -> Result<(), KernelError>;

    /// Invalidate all DMA mappings for a device (called on lease revocation).
    fn invalidate_device(&mut self, device_id: DeviceId) -> Result<(), KernelError>;
}

5.2.2 Without IOMMU (Bounce Buffers)

On platforms without an IOMMU (early Raspberry Pi models), DMA isolation uses bounce buffers:

  1. The kernel allocates a dedicated physical region for DMA operations
  2. Before a device-to-memory transfer, the kernel prepares the bounce buffer
  3. After transfer completion, the kernel copies data from the bounce buffer to the partition's region (after validation)
  4. The device never has direct access to partition memory

This is slower (extra copy) but maintains the isolation invariant. The crates/ruvix/crates/dma/ crate provides the abstraction layer.

/// Bounce buffer DMA isolation (fallback when no IOMMU).
pub struct BounceBufferDma {
    /// Kernel-owned physical region for DMA bounce.
    bounce_region: PhysRegion,
    /// Maximum bounce buffer size.
    max_bounce_size: usize,
}

impl BounceBufferDma {
    /// Execute a DMA transfer through the bounce buffer.
    /// The device only ever sees the bounce buffer's physical address.
    pub fn transfer(
        &mut self,
        device: DeviceId,
        partition_region: &RegionHandle,
        offset: usize,
        length: usize,
        direction: DmaDirection,
    ) -> Result<(), KernelError> {
        if length > self.max_bounce_size {
            return Err(KernelError::LimitExceeded);
        }
        match direction {
            DmaDirection::MemToDevice => {
                // Copy from partition region to bounce buffer
                self.copy_to_bounce(partition_region, offset, length)?;
                // Program DMA from bounce buffer to device
                self.start_dma(device, direction)?;
            }
            DmaDirection::DeviceToMem => {
                // Program DMA from device to bounce buffer
                self.start_dma(device, direction)?;
                // Wait for completion
                self.wait_completion()?;
                // Copy from bounce buffer to partition region (validated)
                self.copy_from_bounce(partition_region, offset, length)?;
            }
            DmaDirection::MemToMem => {
                return Err(KernelError::InvalidArgument);
            }
        }
        Ok(())
    }
}

5.3 Interrupt Routing Security

Each interrupt line is a kernel object accessed through capabilities:

  1. Interrupt capability: A partition must hold a capability with READ right on an interrupt object to receive interrupts from that line
  2. Interrupt-to-queue routing: Interrupts are delivered as messages on a queue (via sensor_subscribe), not as direct callbacks. This maintains the queue-based IPC model and prevents a malicious interrupt handler from running in kernel context.
  3. Priority ceiling: Interrupt processing tasks have bounded priority to prevent a flood of interrupts from starving other partitions
  4. Rate limiting: The kernel enforces a maximum interrupt rate per device. Interrupts exceeding the rate are queued and delivered at the rate limit.
/// Interrupt routing configuration.
pub struct InterruptRoute {
    /// Hardware interrupt number (e.g., GIC SPI number).
    pub irq_number: u32,
    /// Capability authorizing access to this interrupt.
    pub cap: CapHandle,
    /// Queue where interrupt messages are delivered.
    pub target_queue: QueueHandle,
    /// Maximum interrupt rate (interrupts per second). 0 = unlimited.
    pub rate_limit_hz: u32,
    /// Priority ceiling for the interrupt processing task.
    pub priority_ceiling: TaskPriority,
}
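The `rate_limit_hz` field above can be enforced with a simple per-second window counter. `IrqRateLimiter` is an illustrative sketch (a fixed one-second window rather than a token bucket is a simplifying assumption); deferred interrupts are counted for later delivery, not dropped:

```rust
/// Per-device interrupt rate limiter: interrupts beyond the budget for
/// the current one-second window are deferred (queued), not dropped.
pub struct IrqRateLimiter {
    rate_limit_hz: u32, // 0 = unlimited
    window_start_ns: u64,
    delivered_in_window: u32,
    deferred: u32,
}

impl IrqRateLimiter {
    pub fn new(rate_limit_hz: u32) -> Self {
        Self { rate_limit_hz, window_start_ns: 0, delivered_in_window: 0, deferred: 0 }
    }

    /// Returns true if the interrupt may be delivered now; false = deferred.
    pub fn on_interrupt(&mut self, now_ns: u64) -> bool {
        const WINDOW_NS: u64 = 1_000_000_000;
        // Roll over to a fresh window once a second has elapsed.
        if now_ns.saturating_sub(self.window_start_ns) >= WINDOW_NS {
            self.window_start_ns = now_ns;
            self.delivered_in_window = 0;
        }
        if self.rate_limit_hz == 0 || self.delivered_in_window < self.rate_limit_hz {
            self.delivered_in_window += 1;
            true
        } else {
            self.deferred += 1;
            false
        }
    }
}
```

Because deferred interrupts are delivered as ordinary queue messages in the next window, a device can be noisy without ever starving other partitions.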

5.4 Device Capability Model

Every device in the system is represented as a kernel object with its own capability:

pub enum ObjectType {
    Task,
    Region,
    Queue,
    Timer,
    VectorStore,
    GraphStore,
    RvfMount,
    Sensor,
    /// A hardware device (UART, DMA controller, GPU, NIC, etc.)
    Device,
    /// An interrupt line (GIC SPI/PPI/SGI)
    Interrupt,
}

The root task (first task created at boot) receives capabilities to all devices discovered during boot (from DTB parsing). It then distributes device capabilities to appropriate partitions according to the RVF manifest's resource policy.


6. Boot Security

6.1 Secure Boot Chain

RVM implements a four-stage secure boot chain:

Stage 0: Hardware ROM / eFUSE
  |  Root of trust: device-unique key burned in silicon
  |  Measures and verifies Stage 1
  v
Stage 1: RVM Boot Stub (ruvix-aarch64/src/boot.S + boot.rs)
  |  Minimal assembly: set up stack, clear BSS, jump to Rust
  |  Rust entry: initialize MMU, verify Stage 2 signature
  |  Verifies using trusted keys embedded in Stage 1 image
  v
Stage 2: RVM Kernel (ruvix-nucleus)
  |  Full kernel initialization: cap table, proof engine, scheduler
  |  Verifies RVF package signature (ML-DSA-65 or Ed25519)
  |  SEC-001: Signature failure -> PANIC (no fallback)
  v
Stage 3: Boot RVF Package
  |  Contains all initial RVF components
  |  Loaded into immutable regions
  |  Queue wiring and capability distribution per manifest
  v
Stage 4: Application RVF Components
     Runtime-mounted RVF packages, each signature-verified

6.2 Signature Verification

The existing verify_boot_signature_or_panic() in crates/ruvix/crates/cap/src/security.rs implements SEC-001: signature failure panics the system with no fallback path. The security feature flag disable-boot-verify is blocked at compile time for release builds:

// CVE-001 FIX: Prevent disable-boot-verify in release builds
#[cfg(all(feature = "disable-boot-verify", not(debug_assertions)))]
compile_error!(
    "SECURITY ERROR [CVE-001]: The 'disable-boot-verify' feature cannot be used \
     in release builds."
);

Supported algorithms:

| Algorithm | Status | Use Case |
|---|---|---|
| Ed25519 | Implemented | Primary boot signature |
| ECDSA P-256 | Supported | Legacy compatibility |
| RSA-PSS 2048 | Supported | Legacy compatibility |
| ML-DSA-65 | Planned | Post-quantum RVF signatures |

6.3 Measured Boot with Witness Log

Every boot stage emits a witness record:

  1. Stage 1 measurement: Hash of the kernel image, stored as WitnessRecordKind::Boot
  2. Stage 2 initialization: Each subsystem (cap manager, proof engine, scheduler) records its initialized state
  3. Stage 3 RVF mount: Each mounted RVF package is recorded as WitnessRecordKind::Mount with the package hash and attestation

The boot witness log forms the root of the system's audit trail. All subsequent witness records chain from it.
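Measurement accumulation across the boot stages can be sketched as a TPM-style extend operation, where each stage folds the hash of the next image into a running measurement. The names `extend` and `toy_hash` are illustrative, and `toy_hash` is a non-cryptographic stand-in for SHA-256:

```rust
// Non-cryptographic stand-in for SHA-256 (illustration only).
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    let mut out = [0u8; 32];
    for chunk in out.chunks_mut(8) {
        h = h.wrapping_mul(0x100000001b3) ^ 0x9e3779b97f4a7c15;
        chunk.copy_from_slice(&h.to_le_bytes());
    }
    out
}

/// TPM-style extend: new = H(old_measurement || H(stage_image)).
/// The final value therefore commits to every stage in order.
pub fn extend(measurement: [u8; 32], stage_image: &[u8]) -> [u8; 32] {
    let mut buf = Vec::with_capacity(64);
    buf.extend_from_slice(&measurement);
    buf.extend_from_slice(&toy_hash(stage_image));
    toy_hash(&buf)
}
```

A change in any stage, however early, propagates into every later measurement, which is what lets a remote verifier check the whole chain from a single hash.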

6.4 Remote Attestation for Edge Deployment

For edge deployments where RVM nodes must prove their integrity to a remote verifier:

/// Remote attestation protocol.
pub trait RemoteAttestor {
    /// Generate an attestation report that a remote verifier can check.
    /// The report includes:
    ///   - Platform identity (device-unique key signed measurement)
    ///   - Boot chain hashes (all four stages)
    ///   - Current witness log root hash
    ///   - Loaded RVF component inventory
    ///   - Nonce from the challenger (prevents replay)
    fn generate_attestation_report(
        &self,
        challenge_nonce: &[u8; 32],
    ) -> Result<AttestationReport, KernelError>;

    /// Verify an attestation report from another node.
    /// Used in mesh deployments where nodes must mutually attest.
    fn verify_attestation_report(
        &self,
        report: &AttestationReport,
        expected_measurements: &MeasurementPolicy,
    ) -> Result<AttestationVerdict, KernelError>;
}

pub struct AttestationReport {
    /// Platform identifier (public key of device).
    pub platform_id: [u8; 32],
    /// Boot chain measurement (hash of all four stages).
    pub boot_measurement: [u8; 32],
    /// Current witness log chain hash (latest chain_hash).
    pub witness_root: [u8; 32],
    /// List of loaded RVF component hashes.
    pub component_inventory: Vec<[u8; 32]>,
    /// Challenge nonce from the verifier.
    pub nonce: [u8; 32],
    /// Signature over all of the above using the platform key.
    pub signature: [u8; 64],
}

6.5 Code Signing for Partition Images

All RVF packages must be signed before they can be mounted. The signature is verified by the kernel's boot loader (crates/ruvix/crates/boot/src/signature.rs):

  • The RVF manifest specifies the signing key ID and algorithm
  • The kernel maintains a TrustedKeyStore (up to 8 keys, expirable)
  • Keys can be rotated by mounting a key-update RVF signed by an existing trusted key
  • The signing key hierarchy supports a two-level PKI:
    • Root key: Burned in eFUSE or compiled into Stage 1 (immutable)
    • Signing keys: Derived from root key, time-bounded, rotatable

7. Agent-Specific Security

7.1 WASM Sandbox Security Within Partitions

RVF components execute as WASM modules within partitions. The WASM sandbox provides a second layer of isolation inside the capability boundary:

                Partition A (capability-isolated)
   +--------------------------------------------------+
   |  +-----------+  +-----------+  +-----------+     |
   |  | WASM      |  | WASM      |  | WASM      |     |
   |  | Module 1  |  | Module 2  |  | Module 3  |     |
   |  | (Agent)   |  | (Agent)   |  | (Service) |     |
   |  +-----------+  +-----------+  +-----------+     |
   |        |              |              |           |
   |        +--- WASM Host Interface (WASI-like) ---+ |
   |                       |                          |
   |  +--------------------------------------------+  |
   |  | RVM Syscall Shim                           |  |
   |  | Maps WASM imports -> cap-gated syscalls    |  |
   |  +--------------------------------------------+  |
   +--------------------------------------------------+
   |  Kernel capability boundary (MMU-enforced)        |
   +--------------------------------------------------+

WASM security properties:

  1. Linear memory isolation: Each WASM module has its own linear memory; it cannot access memory of other modules or the host
  2. Import-only system access: WASM modules can only call functions explicitly imported from the host. The host provides a minimal syscall shim that maps WASM calls to capability-gated RVM syscalls
  3. Resource limits: Each WASM module has configured limits on memory size, stack depth, execution fuel (instruction count), and table size
  4. No raw pointer access: WASM's type system prevents arbitrary memory access. Pointers are offsets into the linear memory, bounds-checked by the runtime
/// WASM module resource limits.
pub struct WasmResourceLimits {
    /// Maximum linear memory size in pages (64 KiB per page).
    pub max_memory_pages: u32,
    /// Maximum call stack depth.
    pub max_stack_depth: u32,
    /// Maximum execution fuel (instructions). 0 = unlimited.
    pub max_fuel: u64,
    /// Maximum number of table entries.
    pub max_table_elements: u32,
    /// Maximum number of globals.
    pub max_globals: u32,
}
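The `max_fuel` limit above implies a metering loop in the runtime. A minimal sketch, where `FuelMeter` and `consume` are illustrative names (real WASM runtimes typically charge per instruction or per basic block):

```rust
/// Execution-fuel accounting: the runtime decrements a counter as the
/// module executes and traps it when the budget is exhausted.
pub struct FuelMeter {
    remaining: u64,
    unlimited: bool, // max_fuel == 0 means no limit, per WasmResourceLimits
}

impl FuelMeter {
    pub fn new(max_fuel: u64) -> Self {
        Self { remaining: max_fuel, unlimited: max_fuel == 0 }
    }

    /// Charge `cost` units of fuel; Err means the module must be trapped.
    pub fn consume(&mut self, cost: u64) -> Result<(), &'static str> {
        if self.unlimited {
            return Ok(());
        }
        if self.remaining < cost {
            return Err("out of fuel");
        }
        self.remaining -= cost;
        Ok(())
    }
}
```

Fuel makes runaway agents a bounded-cost event: an infinite loop converts into a deterministic trap rather than a stolen CPU budget.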

/// The host interface exposed to WASM modules.
/// Every function here validates capabilities before performing the operation.
pub trait WasmHostInterface {
    fn vector_get(&self, store: u32, key: u64) -> Result<WasmVectorRef, WasmTrap>;
    fn vector_put(&self, store: u32, key: u64, data: &[f32], proof: WasmProofRef)
        -> Result<(), WasmTrap>;
    fn queue_send(&self, queue: u32, msg: &[u8], priority: u8) -> Result<(), WasmTrap>;
    fn queue_recv(&self, queue: u32, buf: &mut [u8], timeout_ms: u64)
        -> Result<usize, WasmTrap>;
    fn log(&self, level: u8, message: &str);
}

7.2 Inter-Agent Communication Security

Agents communicate exclusively through typed queues. Security properties of queue-based IPC:

  1. Capability-gated: Both sender and receiver must hold capabilities on the queue
  2. Typed messages: Queue schema (WIT types) is validated at send time. Malformed messages are rejected before reaching the receiver
  3. Zero-copy safety: Zero-copy messages use descriptors pointing into immutable or append-only regions. The kernel rejects descriptors pointing into slab regions (TOCTOU mitigation -- SEC-004)
  4. No covert channels: Queue capacity is bounded and visible. The kernel does not leak information about queue occupancy to tasks that do not hold the queue's capability
  5. Message ordering: Messages within a priority level are delivered in FIFO order. Cross-priority ordering is by priority (higher first). This is deterministic and does not leak information.

7.3 Agent Identity and Authentication

Agents do not have traditional identities (no UIDs, no usernames). Instead, agent identity is established through the capability chain:

  1. Boot-time identity: An agent's initial capabilities are assigned by the RVF manifest. The manifest is signed, so the identity is rooted in the code signer.
  2. Runtime identity: An agent can prove its identity by demonstrating possession of specific capabilities. A "who are you?" query is answered by "I hold capability X with badge Y", and the verifier checks that badge against its expected value.
  3. Attestation identity: An agent can emit an attest_emit record that binds its capability badge to a witness entry. External verifiers can trace this back through the witness chain to the boot attestation.
/// Agent identity is derived from capability badges, not global names.
pub struct AgentIdentity {
    /// The agent's task handle (ephemeral, changes across reboots).
    pub task: TaskHandle,
    /// Badge on the agent's primary capability (stable across reboots if
    /// assigned by the RVF manifest).
    pub primary_badge: u64,
    /// RVF component ID that spawned this agent.
    pub component_id: RvfComponentId,
    /// Hash of the WASM module binary (code identity).
    pub code_hash: [u8; 32],
}

7.4 Resource Limits and DoS Prevention

Each partition and each WASM module within a partition has enforceable resource limits:

/// Per-partition resource quota.
pub struct PartitionQuota {
    /// Maximum physical memory (bytes).
    pub max_memory_bytes: usize,
    /// Maximum number of tasks.
    pub max_tasks: u32,
    /// Maximum number of capabilities.
    pub max_capabilities: u32,
    /// Maximum number of queue endpoints.
    pub max_queues: u32,
    /// Maximum number of region mappings.
    pub max_regions: u32,
    /// CPU time budget per scheduling epoch (microseconds). 0 = unlimited.
    pub cpu_budget_us: u64,
    /// Maximum interrupt rate across all devices (per second).
    pub max_interrupt_rate_hz: u32,
    /// Maximum witness log entries per epoch (prevents log flooding).
    pub max_witness_entries_per_epoch: u32,
}

/// Enforcement mechanism.
pub trait QuotaEnforcer {
    /// Check if an allocation would exceed the partition's quota.
    fn check_allocation(
        &self,
        partition: PartitionHandle,
        resource: ResourceKind,
        amount: usize,
    ) -> Result<(), KernelError>;

    /// Record a resource allocation against the quota.
    fn record_allocation(
        &mut self,
        partition: PartitionHandle,
        resource: ResourceKind,
        amount: usize,
    ) -> Result<(), KernelError>;

    /// Release a resource allocation.
    fn release_allocation(
        &mut self,
        partition: PartitionHandle,
        resource: ResourceKind,
        amount: usize,
    );
}

pub enum ResourceKind {
    Memory,
    Tasks,
    Capabilities,
    Queues,
    Regions,
    CpuTime,
    WitnessEntries,
}
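The check-then-record pattern of `QuotaEnforcer` can be sketched for a single partition. `ToyQuota` and its string-keyed resources are illustrative simplifications of `ResourceKind`:

```rust
use std::collections::HashMap;

/// Check-then-record quota accounting for one partition.
pub struct ToyQuota {
    limits: HashMap<&'static str, usize>,
    used: HashMap<&'static str, usize>,
}

impl ToyQuota {
    pub fn new(limits: &[(&'static str, usize)]) -> Self {
        Self { limits: limits.iter().cloned().collect(), used: HashMap::new() }
    }

    /// Combined check_allocation + record_allocation: reject any request
    /// that would push usage past the limit, otherwise commit it.
    pub fn try_allocate(
        &mut self,
        resource: &'static str,
        amount: usize,
    ) -> Result<(), &'static str> {
        let limit = *self.limits.get(resource).ok_or("unknown resource")?;
        let used = self.used.entry(resource).or_insert(0);
        if used.checked_add(amount).map_or(true, |n| n > limit) {
            return Err("quota exceeded");
        }
        *used += amount;
        Ok(())
    }

    /// Release a previously recorded allocation (saturating, never negative).
    pub fn release(&mut self, resource: &'static str, amount: usize) {
        if let Some(u) = self.used.get_mut(resource) {
            *u = u.saturating_sub(amount);
        }
    }
}
```

Doing the check and the record under one lock (here, one `&mut self` borrow) is what prevents two racing tasks from jointly overshooting the quota.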

DoS prevention mechanisms:

| Attack Vector | Defense |
|---|---|
| Memory exhaustion | Per-partition memory quota, region_map returns OutOfMemory |
| CPU starvation | Per-partition CPU budget, preemptive scheduler with budget enforcement |
| Queue flooding | Bounded queue capacity, backpressure on queue_send |
| Interrupt storm | Per-device rate limiting, priority ceiling |
| Capability table exhaustion | Per-partition cap table limit (1024 max) |
| Witness log flooding | Per-partition witness entry budget per epoch |
| Fork bomb | task_spawn checks per-partition task count against quota |
| Proof spam | Proof cache limited to 64 entries, nonce tracker bounded |

8. Threat Model

8.1 What RVM Defends Against

Attacks from Partitions Against Other Partitions

| Attack | Defense |
|---|---|
| Read another partition's memory | MMU page tables (TTBR0 per-partition) |
| Write another partition's memory | MMU + capability-gated region mapping |
| Forge a capability | Capabilities are kernel-resident, handles are opaque + epoch-checked |
| Escalate capability rights | derive() enforces monotonic attenuation |
| Replay a proof token | Single-use nonces in ProofVerifier |
| Use an expired proof | Time-bounded validity check |
| Tamper with witness log | Append-only region + hash chaining + optional TEE signing |
| Spoof another agent's identity | Identity is derived from capability badge, not a forgeable name |
| Starve other partitions of CPU | Per-partition CPU budget + preemptive scheduling |
| Exhaust system memory | Per-partition memory quota |
| Flood queues | Bounded capacity + backpressure |
| DMA attack | IOMMU page tables or bounce buffers |
| Interrupt storm DoS | Rate limiting + priority ceiling |

Attacks from Partitions Against the Kernel

| Attack | Defense |
|---|---|
| Corrupt kernel memory | EL1/EL0 separation, PAN enabled |
| Modify page tables | Page table pages not mapped in EL0 |
| Disable interrupts | DAIF masking only in EL1 |
| Exploit kernel vulnerability | Rust's memory safety, #![forbid(unsafe_code)] on most crates |
| Spectre/Meltdown | CSDB barriers, BTI, PAN, branch predictor flush |
| Supply crafted syscall args | All syscall args validated, bounds-checked |
| Time a kernel operation to leak info | Constant-time critical paths, timer virtualization |

Boot-Time Attacks

| Attack | Defense |
|---|---|
| Boot unsigned kernel | SEC-001: panic on signature failure |
| Tamper with kernel image | Boot measurement chain, hash verification |
| Downgrade attack | Algorithm allowlist in TrustedKeyStore |
| Replay old signed image | Boot nonce from hardware RNG, version checking |
| Compromise signing key | Key rotation via signed key-update RVF |

Network/Remote Attacks (Multi-Node Mesh)

| Attack | Defense |
|---|---|
| Impersonate a node | Mutual attestation with device-unique keys |
| Migrate malicious partition | Deep-tier proof with source/destination attestation |
| Replay migration | Nonce in migration proof |
| Man-in-the-middle on migration | Encrypted channel + attestation binding |

8.2 What Is Out of Scope for v1

The following are explicitly NOT defended against in v1. They are acknowledged risks that will be addressed in future iterations:

  1. Physical access attacks: An attacker with physical access to the hardware (JTAG, bus probing, cold boot attacks) is out of scope. Hardware security modules (HSMs) and tamper-resistant packaging are future work.

  2. Rowhammer / DRAM disturbance: RVM does not implement guard rows or ECC requirements in v1. Edge hardware with ECC RAM is recommended but not enforced.

  3. Supply chain attacks on the compiler: RVM trusts the Rust compiler. Reproducible builds are recommended but not verified in v1.

  4. Formal verification of the kernel: Unlike seL4, RVM is not formally verified in v1. The kernel is written in safe Rust (with minimal unsafe in the HAL layer), but there is no machine-checked proof of correctness.

  5. Covert channels via power consumption: Power analysis side channels are out of scope. RVM does not implement constant-power execution.

  6. GPU/accelerator isolation: v1 targets CPU-only execution. GPU and accelerator DMA isolation is future work.

  7. Encrypted memory (SEV-SNP/TDX): v1 does not implement memory encryption. The hypervisor trusts the physical memory bus.

  8. Multi-tenant adversarial scheduling: The scheduler provides time isolation through budgets and jitter, but does not defend against a sophisticated adversary performing cache-timing analysis across many scheduling quanta.

8.3 Trust Boundaries

+================================================================+
|                           UNTRUSTED                            |
|  +----------------------------------------------------------+  |
|  |  RVF Components (WASM agents, services, drivers)         |  |
|  |  - May be malicious                                      |  |
|  |  - May exploit any vulnerability                         |  |
|  |  - Constrained by: capabilities, quotas, WASM sandbox    |  |
|  +----------------------------------------------------------+  |
|                          | syscall                             |
|                          v                                     |
|  +----------------------------------------------------------+  |
|  |  TRUSTED: RVM Kernel (ruvix-nucleus)                     |  |
|  |  - Capability manager, proof verifier, scheduler         |  |
|  |  - Witness log, region manager, queue IPC                |  |
|  |  - Bug here = system compromise                          |  |
|  |  - Minimized: 12 syscalls, ~15K lines Rust               |  |
|  +----------------------------------------------------------+  |
|                          | hardware interface                  |
|                          v                                     |
|  +----------------------------------------------------------+  |
|  |  TRUSTED: Hardware                                       |  |
|  |  - MMU, GIC, IOMMU, timers                               |  |
|  |  - Assumed correct (no hardware bugs modeled in v1)      |  |
|  +----------------------------------------------------------+  |
|                          | optional                            |
|                          v                                     |
|  +----------------------------------------------------------+  |
|  |  TRUSTED: TrustZone Secure World (when available)        |  |
|  |  - Device-unique key storage                             |  |
|  |  - Witness signing                                       |  |
|  |  - Boot measurement anchoring                            |  |
|  +----------------------------------------------------------+  |
+================================================================+

Key trust assumptions:

  • The kernel is correct (not formally verified, but written in safe Rust)
  • The hardware functions as documented (MMU enforces page permissions, IOMMU restricts DMA)
  • The boot signing key has not been compromised
  • The Rust compiler generates correct code
  • The WASM runtime (Wasmtime or WAMR) correctly enforces sandboxing

8.4 Comparison to KVM and seL4 Threat Models

Property                | KVM                             | seL4                              | RVM
------------------------|---------------------------------|-----------------------------------|----------------------------------
TCB size                | ~2M lines (Linux kernel)        | ~8.7K lines (C)                   | ~15K lines (Rust)
Formal verification     | No                              | Yes (full functional correctness) | No (safe Rust, not verified)
Memory safety           | C (manual)                      | C (verified)                      | Rust (compiler-enforced)
Capability model        | No (uses DAC/MAC)               | Yes (unforgeable tokens)          | Yes (seL4-inspired)
Proof-gated mutation    | No                              | No                                | Yes (unique to RVM)
Witness audit log       | No (relies on external logging) | No                                | Yes (kernel-native)
DMA isolation           | VT-d/SMMU                       | IOMMU-dependent                   | IOMMU + bounce buffer fallback
Side-channel defense    | KPTI, IBRS, MDS mitigations     | Limited (depends on platform)     | CSDB, BTI, PAN, const-time paths
Agent-native primitives | No                              | No                                | Yes (vectors, graphs, coherence)
Hot-code loading        | Module loading (large TCB)      | No                                | RVF mount (capability-gated)

Key differentiators:

  1. RVM vs. KVM: RVM has a 100x smaller TCB. KVM inherits the entire Linux kernel as its TCB, including filesystems, networking, drivers, and hundreds of syscalls. RVM has 12 syscalls and no ambient authority. KVM relies on Linux's DAC/MAC; RVM uses capabilities with proof-gated mutation.

  2. RVM vs. seL4: seL4 has formal verification, which RVM does not. However, RVM has proof-gated mutation (no mutation without cryptographic authorization), kernel-native witness logging, and agent-specific primitives (vector stores, graph stores, coherence scoring). seL4 would require these as userspace servers communicating through IPC, reintroducing overhead and expanding the trusted codebase.


9. Security Invariants Summary

The following invariants MUST hold at all times. Violation of any invariant indicates a security breach.

ID      | Invariant                                                  | Enforcement
--------|------------------------------------------------------------|------------------------------------------------------------------------
SEC-001 | Boot signature failure -> PANIC                            | verify_boot_signature_or_panic(); compile-time block on disable-boot-verify in release
SEC-002 | Proof cache: 64 entries max, 100ms TTL, single-use nonces  | ProofCache + NonceTracker
SEC-003 | Capability delegation depth <= 8                           | DerivationTree depth check
SEC-004 | Zero-copy IPC descriptors cannot point into Slab regions   | Queue descriptor validation
SEC-005 | No writable physical page shared between partitions        | PartitionAddressSpace::map_region() exclusivity check
SEC-006 | Capability rights can only decrease through delegation     | Capability::derive() subset check
SEC-007 | Every mutating syscall emits a witness record              | Witness log append in syscall path
SEC-008 | Device MMIO access requires active lease + capability      | DeviceLeaseManager check
SEC-009 | DMA restricted to leaseholder's physical pages             | IOMMU or bounce buffer
SEC-010 | Per-partition resource quotas enforced                     | QuotaEnforcer checks before allocation
SEC-011 | Witness log is append-only with hash chaining              | ChainedWitnessRecord, region policy enforcement
SEC-012 | No EL0 code can access kernel memory                       | TTBR1 mappings are PTE_KERNEL_RW, PAN enabled
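
The derivation-side invariants (SEC-003, SEC-006) reduce to two checks at every delegation. The sketch below is an illustrative model, not the actual ruvix-cap API: the Capability struct, bitflag rights, and error names are assumptions; only the two enforced rules mirror the table.

```rust
// Illustrative model of SEC-003 (depth <= 8) and SEC-006 (rights monotonically
// decrease). Types and names are assumptions, not the ruvix-cap API.

const MAX_DERIVATION_DEPTH: u8 = 8; // SEC-003

#[derive(Clone, Copy, Debug, PartialEq)]
struct Capability {
    rights: u32, // bitflags, e.g. READ = 0b01, WRITE = 0b10
    depth: u8,   // distance from the root capability
}

#[derive(Debug, PartialEq)]
enum CapError {
    RightsEscalation, // requested rights not a subset of the parent's (SEC-006)
    DepthExceeded,    // derivation chain longer than 8 (SEC-003)
}

impl Capability {
    fn derive(&self, requested_rights: u32) -> Result<Capability, CapError> {
        // SEC-006: rights may only shrink along a derivation chain.
        if requested_rights & !self.rights != 0 {
            return Err(CapError::RightsEscalation);
        }
        // SEC-003: bounded delegation depth.
        if self.depth >= MAX_DERIVATION_DEPTH {
            return Err(CapError::DepthExceeded);
        }
        Ok(Capability { rights: requested_rights, depth: self.depth + 1 })
    }
}

fn main() {
    let root = Capability { rights: 0b11, depth: 0 };
    let child = root.derive(0b01).unwrap(); // shrinking rights: ok
    assert_eq!(child.derive(0b10), Err(CapError::RightsEscalation));
    println!("child rights = {:#b}, depth = {}", child.rights, child.depth);
}
```

Because rights are checked as a bitwise subset and depth only ever increments, both invariants are enforced locally at each derive, with no global scan of the derivation tree.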

10. Implementation Roadmap

Phase A (Complete): Linux-Hosted Prototype

Already implemented and tested (760 tests passing):

  • Capability manager with derivation trees
  • 3-tier proof engine with nonce tracking
  • Witness log with serialization
  • 12-syscall nucleus with checkpoint/replay

Phase B (In Progress): Bare-Metal AArch64

Security-specific deliverables:

  • MMU-enforced partition isolation (TTBR0 per-partition)
  • EL1/EL0 separation for kernel/user code
  • PAN + BTI + CSDB speculation barriers
  • Hardware timer virtualization
  • Device capability model with lease management

Phase C (Planned): SMP + DMA

Security-specific deliverables:

  • IOMMU programming for DMA isolation
  • Bounce buffer fallback for platforms without IOMMU
  • Per-CPU TLB management for partition switches
  • IPI-based remote TLB invalidation
  • SpinLock with timing-attack-resistant implementation

Phase D (Planned): Mesh + Attestation

Security-specific deliverables:

  • Remote attestation protocol
  • Mutual node authentication
  • Proof-gated migration
  • Encrypted partition state transfer
  • Distributed witness log with cross-node hash chaining

References

  • ADR-087: RVM Cognition Kernel (accepted, Phase A implemented)
  • ADR-042: Security RVF -- AIDefence + TEE Hardened Cognitive Container
  • ADR-047: Proof-gated mutation protocol
  • ADR-029: RVF canonical binary format
  • ADR-030: RVF cognitive container / self-booting vector files
  • seL4 Reference Manual (capability model inspiration)
  • ARM Architecture Reference Manual (AArch64 exception levels, MMU, PAN, BTI)
  • NIST SP 800-147B: BIOS Protection Guidelines for Servers (measured boot)
  • Dennis & Van Horn, "Programming Semantics for Multiprogrammed Computations" (1966) -- original capability concept

RVM Hypervisor Core -- GOAP Plan

Goal-Oriented Action Planning for the RVM Coherence-Native Microhypervisor

Field      | Value
-----------|--------------------------------------------------------------------------------------------
Status     | Draft
Date       | 2026-04-04
Scope      | Research + Architecture + Implementation roadmap
Relates to | ADR-087 (RVM Cognition Kernel), ADR-106 (Kernel/RVF Integration), ADR-117 (Canonical MinCut)

0. Executive Summary

This document defines the GOAP (Goal-Oriented Action Planning) strategy for RVM Hypervisor Core -- a Rust-first, coherence-native microhypervisor that replaces the VM abstraction with coherence domains: graph-partitioned isolation units managed by dynamic min-cut, governed by proof-gated capabilities, and optimized for multi-agent edge computing.

RVM Hypervisor Core is NOT a KVM VMM. It is NOT a Linux module. It is a standalone hypervisor that boots bare metal, manages hardware directly, and uses coherence domains as its primary scheduling and isolation primitive. Traditional VMs are subsumed as a degenerate case (a coherence domain with a single opaque partition and no graph structure).

Current State Assessment

The RuVector project already has significant infrastructure in place:

  • ruvix kernel workspace -- 22 sub-crates, ~101K lines of Rust, 760 tests passing (Phase A complete)
  • ruvix-cap -- seL4-inspired capability system with derivation trees
  • ruvix-proof -- 3-tier proof engine (Reflex <100ns, Standard <100us, Deep <10ms)
  • ruvix-sched -- Coherence-aware scheduler with novelty boosting
  • ruvix-hal -- HAL traits for AArch64, RISC-V, x86 (trait definitions)
  • ruvix-aarch64 -- AArch64 boot, MMU stubs
  • ruvix-physmem -- Physical memory allocator
  • ruvix-boot -- 5-stage RVF boot with ML-DSA-65 signatures
  • ruvix-nucleus -- 12 syscalls, checkpoint/replay
  • ruvector-mincut -- Subpolynomial dynamic min-cut (the crown jewel)
  • ruvector-sparsifier -- Dynamic spectral graph sparsification
  • ruvector-solver -- Sublinear-time sparse solvers
  • ruvector-coherence -- Coherence scoring with spectral support
  • ruvector-raft -- Distributed consensus
  • ruvector-verified -- Formal verification with lean-agentic dependent types
  • cognitum-gate-kernel -- 256-tile coherence gate fabric (WASM)
  • qemu-swarm -- QEMU cluster simulation

Goal State

A running microhypervisor that:

  1. Boots bare metal on QEMU virt (AArch64) to first witness in <250ms
  2. Manages coherence domains (not VMs) as the primary isolation unit
  3. Uses dynamic min-cut to partition, migrate, and reclaim partitions
  4. Gates every privileged mutation with capability proofs
  5. Emits a complete witness trail for every state change
  6. Achieves hot partition switch in <10us
  7. Reduces remote memory traffic by 20% via coherence-aware placement
  8. Runs WASM agent partitions with zero-copy IPC
  9. Recovers from faults without global reboot

1. Research Phase Goals

1.1 Bare-Metal Rust Hypervisors and Microkernels

Goal state: Deep understanding of existing Rust bare-metal systems to inform what to adopt, adapt, and discard.

Project              | Why Study It                                                                             | Key Takeaway for RVM
---------------------|------------------------------------------------------------------------------------------|----------------------------------------------------------------------
RustyHermit / Hermit | Unikernel in Rust; uhyve hypervisor boots directly. No Linux dependency.                 | Boot sequence patterns, QEMU integration, minimal device model
Theseus OS           | Rust OS with live code swapping, cell-based memory. No address spaces.                   | Intralingual design, state spill prevention, namespace-based isolation
RedLeaf              | Rust OS with language-level isolation; no hardware protection needed for safety.         | Ownership-based isolation as alternative to page tables for trusted partitions
Tock OS              | Embedded Rust OS with grant regions, capsule isolation.                                  | Grant-based memory model maps to RVM region policies
Hubris               | Oxide Computer's embedded Rust RTOS with static task allocation.                         | Panic isolation, supervisor/task fault model
Firecracker          | KVM VMM in Rust. Anti-pattern for architecture (depends on KVM), but study the device model. | Minimal device model pattern, virtio implementation

Actions:

  • A1.1.1: Read Theseus OS OSDI 2020 paper -- extract cell/namespace isolation patterns
  • A1.1.2: Read RedLeaf OSDI 2020 paper -- extract ownership-based isolation model
  • A1.1.3: Study Hubris source for panic-isolated task model
  • A1.1.4: Study RustyHermit uhyve for bare-metal boot without KVM

Preconditions: None Effects: Design vocabulary for coherence domains established; isolation model decision data gathered

1.2 Capability-Based OS Designs

Goal state: Identify which capability model to adopt for proof-gated coherence domains.

System         | Why Study It                                                                        | Key Takeaway
---------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------
seL4           | Formally verified microkernel with capability-based access control. Gold standard.  | Derivation trees, CNode design, retype mechanism. Already partially adopted in ruvix-cap.
CHERI          | Hardware capability extensions (Arm Morello). Capabilities in registers.            | Hardware-enforced bounds on pointers; could replace page table boundaries for fine-grained isolation
Barrelfish     | Multikernel OS with capability system and message-passing IPC.                      | Per-core kernel model; capability transfer across cores maps to coherence domain migration
Fuchsia/Zircon | Capability-based microkernel from Google. Handles as capabilities.                  | Practical capability system at scale; job/process/thread hierarchy

Actions:

  • A1.2.1: Map seL4 CNode/Untyped/Retype to ruvix-cap derivation trees -- identify gaps
  • A1.2.2: Study CHERI Morello ISA for hardware-backed capability enforcement
  • A1.2.3: Study Barrelfish multikernel design for cross-core capability transfer
  • A1.2.4: Evaluate Fuchsia handle table design for runtime performance characteristics

Preconditions: A1.1 partially complete (understand isolation context) Effects: Capability model finalized; hardware vs. software enforcement decision made

1.3 Graph-Partitioned Scheduling

Goal state: Algorithm for coherence-pressure-driven scheduling that leverages ruvector-mincut.

Topic                         | Reference                                           | Relevance
------------------------------|-----------------------------------------------------|--------------------------------------------------------
Graph-based task scheduling   | Kwok & Ahmad survey (1999); modern DAG schedulers   | Task dependency graphs as scheduling input
Spectral graph partitioning   | Fiedler vectors, Normalized Cuts (Shi & Malik 2000) | Coherence domain boundary identification
Dynamic min-cut for placement | ruvector-mincut (our own subpolynomial algorithm)   | Core algorithm for partition placement
Network flow scheduling       | Coffman-Graham, Hu's algorithm                      | Precedence-constrained scheduling with graph structure
NUMA-aware scheduling         | Linux NUMA balancing, AutoNUMA                      | Coherence-aware memory placement (lower-bound benchmark)

Actions:

  • A1.3.1: Formalize the coherence-pressure scheduling problem as a graph optimization
  • A1.3.2: Prove that dynamic min-cut provides bounded-approximation partition quality
  • A1.3.3: Design the scheduler tick: {observe graph state} -> {compute pressure} -> {select task}
  • A1.3.4: Benchmark ruvector-mincut update latency for partition-switch-time budget (<10us)

Preconditions: ruvector-mincut crate functional (it is) Effects: Scheduling algorithm specified; latency bounds proven

1.4 Memory Coherence Protocols

Goal state: Design a memory coherence model for coherence domains that eliminates false sharing and minimizes remote traffic.

Protocol                           | Why Study                                                     | Relevance
-----------------------------------|---------------------------------------------------------------|-----------------------------------------------------
MOESI                              | Snooping protocol; used by AMD for multi-socket systems       | Baseline understanding of cache coherence overhead
Directory-based (DASH, SGI Origin) | Scalable coherence for >4 sockets                             | Coherence domain as a directory entry abstraction
ARM AMBA CHI                       | Modern ARM coherence protocol for SoCs                        | Target hardware protocol for Seed/Appliance chips
CXL (Compute Express Link)         | Memory-semantic interconnect; CXL.mem for shared memory pools | Future interconnect for cross-chip coherence domains
Barrelfish message-passing         | Eliminates shared memory entirely; replicates state instead   | Alternative: no coherence protocol, only message passing

Actions:

  • A1.4.1: Quantify coherence overhead: how much of current "VM exit" cost is actually coherence traffic
  • A1.4.2: Design coherence domain memory model: regions are either local (no coherence) or shared (proof-gated coherence)
  • A1.4.3: Prototype directory-based coherence for shared regions using ruvix-region policies
  • A1.4.4: Define the "coherence score" metric that drives scheduling and migration decisions

Preconditions: A1.3.1 (graph model defined) Effects: Memory model specified; coherence score formula defined

1.5 Formal Verification Approaches

Goal state: Identify which properties of the hypervisor can be formally verified, and with what tools.

Approach                         | Tool                        | What It Proves
---------------------------------|-----------------------------|-------------------------------------------------------------
seL4 style                       | Isabelle/HOL                | Full functional correctness (gold standard, very expensive)
Kani/Verus                       | Rust-native model checkers  | Bounded verification of Rust code; panic-freedom, overflow
Prusti                           | Viper + Rust                | Pre/post-condition verification with ownership
lean-agentic (ruvector-verified) | Lean 4 + Rust FFI           | Dependent types for proof-carrying operations (we have this)
TLA+/P                           | Model checking              | Protocol-level correctness (capability transfer, migration)

Actions:

  • A1.5.1: Use Kani to verify panic-freedom in ruvix-cap and ruvix-proof (no_std paths)
  • A1.5.2: Use ruvector-verified to generate proof obligations for capability derivation correctness
  • A1.5.3: Write TLA+ spec for coherence domain migration protocol
  • A1.5.4: Define the verification budget: which crates get full verification vs. testing-only

Preconditions: ruvector-verified crate functional Effects: Verification strategy decided; initial proofs for cap and proof crates

1.6 Agent Runtime Designs

Goal state: Design the WASM/agent partition interface that runs inside coherence domains.

Runtime                   | Why Study                                                 | Relevance
--------------------------|-----------------------------------------------------------|--------------------------------------------------------
wasmtime                  | Production WASM runtime with Cranelift JIT                | Reference for WASM execution in restricted environments
wasmer                    | WASM with multiple backends (Cranelift, LLVM, Singlepass) | Singlepass backend for predictable compilation time
lunatic                   | Erlang-like actor system built on WASM                    | Actor model inside WASM for agent isolation
wasm-micro-runtime (WAMR) | Lightweight WASM interpreter for embedded                 | Minimal footprint for edge/embedded coherence domains
Component Model (WASI P2) | Typed interface composition for WASM                      | Interface types for agent-to-agent IPC

Actions:

  • A1.6.1: Benchmark WAMR vs wasmtime-minimal for partition boot time
  • A1.6.2: Design the coherence domain WASM interface: capabilities exposed as WASM imports
  • A1.6.3: Define agent IPC: WASM component model typed channels backed by ruvix-queue
  • A1.6.4: Prototype: WASM partition that can query ruvix-vecgraph through capability-gated imports

Preconditions: A1.2 (capability model), ruvix-queue functional Effects: Agent runtime selection made; IPC interface defined


2. Architecture Goals

2.1 Coherence Domains Without KVM

Current world state: ruvix has 6 primitives (Task, Capability, Region, Queue, Timer, Proof) but no concept of a "coherence domain" as a first-class hypervisor object.

Goal state: Coherence domains are the primary isolation and scheduling unit, replacing the VM abstraction.

Definition

A coherence domain is a graph-structured isolation unit consisting of:

CoherenceDomain {
    id: DomainId,
    graph: VecGraph,           // from ruvix-vecgraph, nodes=tasks, edges=data dependencies
    regions: Vec<RegionHandle>,  // memory owned by this domain
    capabilities: CapTree,       // capability subtree rooted at domain cap
    coherence_score: f32,        // spectral coherence metric
    mincut_partition: Partition,  // current min-cut boundary from ruvector-mincut
    witness_log: WitnessLog,     // domain-local witness chain
    tier: MemoryTier,            // Hot | Warm | Dormant | Cold
}

How It Replaces VMs

VM Concept            | Coherence Domain Equivalent
----------------------|--------------------------------------------------------
vCPU                  | Tasks within the domain's graph
Guest physical memory | Regions with domain-scoped capabilities
VM exit/enter         | Partition switch (rescheduling at min-cut boundary)
Device passthrough    | Lease-based device capability (time-bounded, revocable)
VM migration          | Graph repartitioning via dynamic min-cut
Snapshot/restore      | Witness log replay from checkpoint

Key Architectural Decisions

AD-1: No hardware virtualization extensions required. RVM uses capability-based isolation (software) + MMU page table partitioning (hardware) instead of VT-x/AMD-V/EL2 trap-and-emulate. This means:

  • No VM exits. No VMCS/VMCB. No nested page tables.
  • Isolation comes from: (a) capability enforcement in ruvix-cap, (b) MMU page table boundaries per domain, (c) proof-gated mutation.
  • A traditional VM is a degenerate coherence domain: single partition, opaque graph, no coherence scoring.

AD-2: EL2 is used for page table management only. On AArch64, the hypervisor runs at EL2. But EL2 is used purely to manage stage-2 page tables that enforce region boundaries -- not for trap-and-emulate virtualization.

AD-3: Coherence score drives everything. The coherence score (computed from the domain's graph structure via ruvector-coherence spectral methods) determines:

  • Scheduling priority (high coherence = more CPU time)
  • Memory tier (high coherence = hot tier; low coherence = demote to warm/dormant)
  • Migration eligibility (domains with suboptimal min-cut partition are candidates)
  • Reclamation order (lowest coherence reclaimed first under memory pressure)
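
The four consequences of AD-3 can be sketched as one scalar driving discrete decisions. The thresholds (0.7 / 0.4 / 0.1) and types below are illustrative assumptions, not values from the spec; only the shape of the policy follows the bullets above.

```rust
// Sketch of AD-3: one coherence score drives tier placement and reclamation
// order. Threshold values are illustrative assumptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryTier { Hot, Warm, Dormant, Cold }

/// Map a domain's coherence score to a memory tier (assumed cutoffs).
fn tier_for_score(score: f32) -> MemoryTier {
    match score {
        s if s >= 0.7 => MemoryTier::Hot,
        s if s >= 0.4 => MemoryTier::Warm,
        s if s >= 0.1 => MemoryTier::Dormant,
        _ => MemoryTier::Cold,
    }
}

/// Under memory pressure, reclaim lowest-coherence domains first.
/// Input: (domain_id, coherence_score) pairs.
fn reclamation_order(mut domains: Vec<(u64, f32)>) -> Vec<u64> {
    domains.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    domains.into_iter().map(|(id, _)| id).collect()
}

fn main() {
    assert_eq!(tier_for_score(0.9), MemoryTier::Hot);
    assert_eq!(tier_for_score(0.05), MemoryTier::Cold);
    // Domain 7 (score 0.2) is reclaimed before domain 3 (score 0.8).
    assert_eq!(reclamation_order(vec![(3, 0.8), (7, 0.2)]), vec![7, 3]);
}
```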

Actions:

  • A2.1.1: Add CoherenceDomain struct to ruvix-types
  • A2.1.2: Add DomainCreate, DomainDestroy, DomainMigrate syscalls to ruvix-nucleus
  • A2.1.3: Implement domain-scoped capability trees in ruvix-cap
  • A2.1.4: Wire ruvector-coherence spectral scoring into ruvix-sched

2.2 Hardware Abstraction Layer

Current state: ruvix-hal defines traits for Console, Timer, InterruptController, Mmu, Power. ruvix-aarch64 has stubs.

Goal state: HAL supports three architectures with hypervisor-level primitives.

Architecture | HAL Implementation | Priority     | Hypervisor Feature
-------------|--------------------|--------------|--------------------------------------------
AArch64      | ruvix-aarch64      | P0 (primary) | EL2 page table management, GIC-400/GIC-600
RISC-V       | ruvix-riscv (new)  | P1           | H-extension for VS/HS mode, PLIC/APLIC
x86_64       | ruvix-x86 (new)    | P2           | EPT page tables, APIC (lowest priority)

New HAL traits for hypervisor:

pub trait HypervisorMmu {
    fn create_stage2_tables(&mut self, domain: DomainId) -> Result<PageTableRoot, HalError>;
    fn map_region_to_domain(&mut self, region: RegionHandle, domain: DomainId, perms: RegionPolicy) -> Result<(), HalError>;
    fn unmap_region(&mut self, region: RegionHandle, domain: DomainId) -> Result<(), HalError>;
    fn switch_domain(&mut self, from: DomainId, to: DomainId) -> Result<(), HalError>;
}

pub trait CoherenceHardware {
    fn read_cache_miss_counter(&self) -> u64;
    fn read_remote_memory_counter(&self) -> u64;
    fn flush_domain_tlb(&mut self, domain: DomainId) -> Result<(), HalError>;
}

Actions:

  • A2.2.1: Extend ruvix-hal with HypervisorMmu and CoherenceHardware traits
  • A2.2.2: Implement AArch64 EL2 page table management in ruvix-aarch64
  • A2.2.3: Implement GIC-600 interrupt routing per coherence domain
  • A2.2.4: Define RISC-V H-extension HAL (trait impl stubs)

2.3 Memory Model

Current state: ruvix-region provides Immutable/AppendOnly/Slab policies with mmap-backed storage. ruvix-physmem has a buddy allocator.

Goal state: Hybrid memory model with capability-gated regions and tiered coherence.

Design: Four-Tier Memory Hierarchy

Tier     | Backing         | Access Latency | Coherence State | Eviction Policy
---------|-----------------|----------------|-----------------|------------------
Hot      | L1/L2 resident  | <10ns          | Exclusive/Modified | Never (pinned)
Warm     | DRAM            | ~100ns         | Shared/Clean    | LRU with coherence weight
Dormant  | Compressed DRAM | ~1us           | Invalid (reconstructable) | Coherence score threshold
Cold     | NVMe/Flash      | ~10us          | Tombstone       | Witness log pointer only

Key Innovation: Reconstructable Memory. Dormant regions are not stored as raw bytes. They are stored as:

  1. A witness log checkpoint hash
  2. A delta-compressed representation (using ruvector-temporal-tensor compression)
  3. Reconstruction instructions that can rebuild the region from the witness log

This means memory reclamation does not destroy state -- it compresses it into the witness chain.
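
The reconstruction idea can be shown with a toy model: a dormant region stored as a checkpoint plus ordered deltas, rebuilt on promotion. The real design would key deltas to witness-log hashes and use ruvector-temporal-tensor compression; everything below is a simplified stand-in to show the shape of the mechanism.

```rust
// Toy "reconstructable memory": a dormant region is a checkpoint plus a list
// of deltas (not raw bytes), replayed on promotion back to a live tier.
// Simplified stand-in; real RVM would use witness-log hashes and compression.

#[derive(Clone)]
struct Delta {
    offset: usize, // byte offset into the region
    bytes: Vec<u8>,
}

struct DormantRegion {
    checkpoint: Vec<u8>, // region state at the last checkpoint
    deltas: Vec<Delta>,  // mutations since, in witness-log order
}

impl DormantRegion {
    /// Rebuild the live region by replaying deltas over the checkpoint.
    fn reconstruct(&self) -> Vec<u8> {
        let mut region = self.checkpoint.clone();
        for d in &self.deltas {
            region[d.offset..d.offset + d.bytes.len()].copy_from_slice(&d.bytes);
        }
        region
    }
}

fn main() {
    let dormant = DormantRegion {
        checkpoint: vec![0u8; 8],
        deltas: vec![
            Delta { offset: 0, bytes: vec![0xAA, 0xBB] },
            Delta { offset: 6, bytes: vec![0xCC] },
        ],
    };
    assert_eq!(dormant.reconstruct(), vec![0xAA, 0xBB, 0, 0, 0, 0, 0xCC, 0]);
}
```

The key property is that demotion is lossless: the checkpoint-plus-deltas form carries exactly enough information to replay the region, so reclamation trades latency for space rather than discarding state.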

Actions:

  • A2.3.1: Extend ruvix-region with MemoryTier enum and tier-transition methods
  • A2.3.2: Implement dormant-region compression using witness log + delta encoding
  • A2.3.3: Implement cold-tier eviction to NVMe with tombstone references
  • A2.3.4: Wire physical memory allocator to tier-aware allocation (hot from buddy, warm from slab pool)
  • A2.3.5: Define the page table structure for stage-2 domain isolation

2.4 Scheduler: Graph-Pressure-Driven

Current state: ruvix-sched has a coherence-aware scheduler with deadline pressure, novelty signal, and structural risk. Fixed partition model.

Goal state: Scheduler uses live graph state from ruvector-mincut to make scheduling decisions.

Scheduling Algorithm: CoherencePressure

EVERY scheduler_tick:
  1. For each active coherence domain D:
     a. Read D.graph edge weights (data flow rates between tasks)
     b. Compute min-cut value via ruvector-mincut (amortized O(n^{o(1)}))
     c. Compute coherence_score = spectral_gap(D.graph) / min_cut_value
     d. Compute pressure = deadline_urgency * coherence_score * novelty_boost
  2. Sort domains by pressure (descending)
  3. Assign CPU time proportional to pressure
  4. If any domain's coherence_score < threshold:
     - Trigger repartition: invoke ruvector-mincut to compute new boundary
     - If repartition improves score by >10%: execute migration
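
The tick above reduces to a few arithmetic steps once the graph quantities are available. In the sketch below the spectral gap and min-cut value are plain inputs (in the kernel they would come from ruvector-coherence and ruvector-mincut), and the repartition threshold is an illustrative assumption.

```rust
// Sketch of the CoherencePressure tick: pressure arithmetic plus the
// repartition trigger. Graph quantities are taken as given; threshold values
// are illustrative assumptions.

struct Domain {
    id: u64,
    spectral_gap: f32,     // from ruvector-coherence (assumed input here)
    min_cut: f32,          // from ruvector-mincut (assumed input here)
    deadline_urgency: f32,
    novelty_boost: f32,
}

impl Domain {
    fn coherence_score(&self) -> f32 {
        self.spectral_gap / self.min_cut
    }
    fn pressure(&self) -> f32 {
        self.deadline_urgency * self.coherence_score() * self.novelty_boost
    }
}

/// One scheduler tick: order domains by pressure (descending) and return the
/// IDs of low-coherence domains that should be repartitioned.
fn tick(domains: &mut Vec<Domain>, repartition_threshold: f32) -> Vec<u64> {
    domains.sort_by(|a, b| b.pressure().partial_cmp(&a.pressure()).unwrap());
    domains
        .iter()
        .filter(|d| d.coherence_score() < repartition_threshold)
        .map(|d| d.id)
        .collect()
}

fn main() {
    let mut ds = vec![
        Domain { id: 1, spectral_gap: 0.9, min_cut: 1.0, deadline_urgency: 1.0, novelty_boost: 1.0 },
        Domain { id: 2, spectral_gap: 0.1, min_cut: 2.0, deadline_urgency: 1.0, novelty_boost: 1.0 },
    ];
    let repartition = tick(&mut ds, 0.25);
    assert_eq!(ds[0].id, 1);          // highest-pressure domain runs first
    assert_eq!(repartition, vec![2]); // low-coherence domain flagged
}
```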

Partition Switch Protocol (target: <10us):

switch_partition(from: DomainId, to: DomainId):
  1. Save from.task_state to from.region (register dump, ~500ns)
  2. Switch stage-2 page table root (TTBR write, ~100ns)
  3. TLB invalidate for from domain (TLBI, ~2us on ARM)
  4. Load to.task_state from to.region (~500ns)
  5. Emit witness record for switch (~200ns with reflex proof)
  6. Resume execution in to domain
  Total budget: ~3.3us (well within 10us target)
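
Summing the per-step estimates confirms the quoted budget:

```rust
// Arithmetic check on the switch budget: the five step estimates above sum
// to 3,300 ns (3.3us), inside the 10us target.
fn main() {
    // save state, TTBR write, TLBI, load state, witness record
    let steps_ns = [500u64, 100, 2_000, 500, 200];
    let total: u64 = steps_ns.iter().sum();
    assert_eq!(total, 3_300);
    assert!(total < 10_000);
    println!("partition switch budget: {total} ns");
}
```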

Actions:

  • A2.4.1: Refactor ruvix-sched to accept graph state from ruvix-vecgraph
  • A2.4.2: Integrate ruvector-mincut as a scheduling oracle (no_std subset)
  • A2.4.3: Implement partition switch protocol in ruvix-aarch64
  • A2.4.4: Benchmark partition switch time on QEMU virt

2.5 IPC: Zero-Copy Message Passing

Current state: ruvix-queue provides io_uring-style ring buffers with zero-copy semantics. 47 tests passing.

Goal state: Cross-domain IPC through shared regions with capability-gated access.

Design

Inter-domain IPC:
  1. Sender domain S holds Capability(Queue Q, WRITE)
  2. Receiver domain R holds Capability(Queue Q, READ)
  3. Queue Q is backed by a shared Region visible in both S and R stage-2 page tables
  4. Messages are written as typed records with coherence metadata
  5. Every send/recv emits a witness record linking the two domains

Intra-domain IPC:
  Same as current ruvix-queue, but within a single stage-2 address space.
  No page table switch required. Pure ring buffer.

Message Format:

struct DomainMessage {
    header: MsgHeader,        // 16 bytes: sender, receiver, type, len
    coherence: CoherenceMeta, // 8 bytes: coherence score at send time
    witness: WitnessHash,     // 32 bytes: hash linking to witness chain
    payload: [u8],            // variable: zero-copy reference into shared region
}
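
With the fixed fields above (16 + 8 + 32 bytes), every message carries a 56-byte header ahead of the zero-copy payload reference. A size check, with repr(C) field layouts assumed purely for illustration:

```rust
// Size sketch for DomainMessage framing. The concrete field breakdown inside
// each header component is an assumption; only the 16/8/32-byte totals come
// from the message format above.

#[repr(C)]
struct MsgHeader { sender: u32, receiver: u32, msg_type: u32, len: u32 } // 16 bytes

#[repr(C)]
struct CoherenceMeta { score_bits: u32, _reserved: u32 } // 8 bytes

#[repr(C)]
struct WitnessHash([u8; 32]); // 32 bytes

const HEADER_BYTES: usize = core::mem::size_of::<MsgHeader>()
    + core::mem::size_of::<CoherenceMeta>()
    + core::mem::size_of::<WitnessHash>();

fn main() {
    assert_eq!(HEADER_BYTES, 56);
    println!("fixed DomainMessage header: {HEADER_BYTES} bytes");
}
```

A fixed-size header is what makes the payload reference safe to hand over zero-copy: the receiver can validate framing without touching the shared-region payload.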

Actions:

  • A2.5.1: Extend ruvix-queue with cross-domain shared region support
  • A2.5.2: Implement capability-gated queue access for inter-domain messages
  • A2.5.3: Add CoherenceMeta and WitnessHash to message headers
  • A2.5.4: Benchmark zero-copy IPC latency (target: <100ns intra-domain, <1us inter-domain)

2.6 Device Model: Lease-Based Access

Goal state: Devices are not "assigned" to domains. They are leased with capability-bounded time windows.

struct DeviceLease {
    device_id: DeviceId,
    domain: DomainId,
    capability: CapHandle,     // Revocable capability for device access
    lease_start: Timestamp,
    lease_duration: Duration,
    max_dma_budget: usize,     // Maximum DMA bytes allowed during lease
    witness: WitnessHash,      // Proof of lease grant
}

Key properties:

  • Lease expiry automatically revokes capability (no explicit release needed)
  • DMA budget prevents device from exhausting memory during lease
  • Multiple domains can hold read-only leases to the same device simultaneously
  • Exclusive write lease requires proof of non-interference (via min-cut: device node has no shared edges)
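
Lease expiry as implicit revocation can be sketched as a pre-access check: every access attempt first consults the lease window and DMA budget, so an expired lease denies access with no explicit release path. Field names follow the DeviceLease struct above; the check logic itself is an assumption about how the manager might work.

```rust
// Sketch of lease-window and DMA-budget enforcement. Field names mirror the
// DeviceLease struct above; the authorization logic is an assumed design.

struct DeviceLease {
    lease_start: u64,    // timestamp (e.g. nanoseconds since boot)
    lease_duration: u64,
    dma_used: u64,       // bytes transferred so far during this lease
    max_dma_budget: u64, // maximum DMA bytes allowed during the lease
}

impl DeviceLease {
    fn is_active(&self, now: u64) -> bool {
        now < self.lease_start + self.lease_duration
    }

    /// MMIO/DMA access is granted only inside the window and budget;
    /// expiry denies access with no explicit release call.
    fn authorize_dma(&mut self, now: u64, bytes: u64) -> bool {
        if !self.is_active(now) || self.dma_used + bytes > self.max_dma_budget {
            return false; // expired lease or exhausted budget => implicit revocation
        }
        self.dma_used += bytes;
        true
    }
}

fn main() {
    let mut lease = DeviceLease {
        lease_start: 0, lease_duration: 1_000, dma_used: 0, max_dma_budget: 4096,
    };
    assert!(lease.authorize_dma(500, 4096));  // inside window, within budget
    assert!(!lease.authorize_dma(600, 1));    // budget exhausted
    assert!(!lease.authorize_dma(2_000, 0));  // lease expired
}
```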

Actions:

  • A2.6.1: Design DeviceLease struct and lease lifecycle
  • A2.6.2: Implement lease-based MMIO region mapping in ruvix-drivers
  • A2.6.3: Implement DMA budget enforcement in ruvix-dma
  • A2.6.4: Wire lease expiry to capability revocation in ruvix-cap

2.7 Witness Subsystem: Compact Append-Only Log

Current state: ruvix-boot has WitnessLog with SHA-256 chaining. ruvix-proof has 3-tier proof engine.

Goal state: Hypervisor-wide witness log that enables deterministic replay, audit, and fault recovery.

Design

WitnessLog (per coherence domain):
  - Append-only ring buffer in a dedicated Region(AppendOnly)
  - Each entry: [timestamp: u64, action_type: u8, proof_hash: [u8; 32], prev_hash: [u8; 32], payload: [u8; N]]
  - Fixed 82-byte entries (ATTESTATION_SIZE from ruvix-types)
  - Hash chain: entry[i].prev_hash = SHA256(entry[i-1])
  - Compaction: when ring buffer wraps, emit a Merkle root of the evicted segment to cold storage

GlobalWitness (hypervisor-level):
  - Merges per-domain witness chains at partition switch boundaries
  - Enables cross-domain causality reconstruction
  - Uses ruvector-dag for causal ordering
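
The chaining rule can be demonstrated with a toy log. A 64-bit FNV-1a hash stands in for SHA-256 so the sketch stays dependency-free, and the entry layout is simplified from the fixed 82-byte records described above; only the chain structure (each entry commits to the hash of its predecessor) is taken from the design.

```rust
// Toy hash-chained append-only log. FNV-1a (64-bit) stands in for SHA-256;
// entry layout is simplified. Tampering with any entry breaks verification.

fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

struct WitnessEntry { action: u8, payload: Vec<u8>, prev_hash: u64 }

struct WitnessLog { entries: Vec<WitnessEntry> }

impl WitnessLog {
    fn hash_entry(e: &WitnessEntry) -> u64 {
        let mut buf = vec![e.action];
        buf.extend_from_slice(&e.payload);
        buf.extend_from_slice(&e.prev_hash.to_le_bytes());
        fnv1a(&buf)
    }

    /// Append an entry that commits to the hash of its predecessor.
    fn append(&mut self, action: u8, payload: Vec<u8>) {
        let prev_hash = self.entries.last().map(Self::hash_entry).unwrap_or(0);
        self.entries.push(WitnessEntry { action, payload, prev_hash });
    }

    /// Recompute every link; any edit to an earlier entry breaks the chain.
    fn verify(&self) -> bool {
        self.entries
            .windows(2)
            .all(|w| Self::hash_entry(&w[0]) == w[1].prev_hash)
    }
}

fn main() {
    let mut log = WitnessLog { entries: Vec::new() };
    log.append(1, b"boot".to_vec());
    log.append(2, b"domain_create".to_vec());
    assert!(log.verify());
    log.entries[0].payload = b"tampered".to_vec(); // any edit breaks the chain
    assert!(!log.verify());
}
```

Merkle compaction extends the same idea: when the ring wraps, the evicted prefix collapses to a single root hash that the surviving chain still commits to.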

Actions:

  • A2.7.1: Implement per-domain witness log in ruvix-proof
  • A2.7.2: Implement global witness merge at partition switch
  • A2.7.3: Implement Merkle compaction for ring buffer overflow
  • A2.7.4: Implement deterministic replay from witness log + checkpoint

3. Implementation Milestones

M0: Bare-Metal Rust Boot on QEMU (No KVM, Direct Machine Code)

Goal: Boot RVM at EL2 on QEMU aarch64 virt, print to UART, emit first witness record.

Preconditions:

  • ruvix-hal traits defined (done)
  • ruvix-aarch64 boot stubs (partially done)
  • aarch64-boot directory with linker script and build system (exists)

Actions:

  • M0.1: Complete _start assembly: disable MMU, set up stack, branch to Rust
  • M0.2: Initialize PL011 UART via ruvix-drivers
  • M0.3: Initialize GIC-400 minimal (mask all interrupts except timer)
  • M0.4: Set up EL2 translation tables (identity mapping for kernel, device MMIO)
  • M0.5: Initialize witness log in a fixed RAM region
  • M0.6: Emit first witness record (boot attestation)
  • M0.7: Measure cold boot to first witness time (target: <250ms)

Acceptance criteria:

  • qemu-system-aarch64 -machine virt -cpu cortex-a72 -kernel ruvix.bin boots
  • UART prints "RVM Hypervisor Core v0.1.0"
  • First witness hash printed within 250ms of power-on

Estimated effort: 2-3 weeks Dependencies: None (all prerequisites exist)

M1: Partition Object Model + Capability System

Goal: Coherence domains as first-class kernel objects with capability-gated access.

Preconditions: M0 complete

Actions:

  • M1.1: Add CoherenceDomain to ruvix-types
  • M1.2: Add DomainCreate/DomainDestroy/DomainQuery syscalls to ruvix-nucleus
  • M1.3: Implement domain-scoped capability trees in ruvix-cap
  • M1.4: Implement stage-2 page table creation per domain in ruvix-aarch64
  • M1.5: Implement domain switch (save/restore + TTBR switch + TLB invalidate)
  • M1.6: Test: create two domains, switch between them, verify isolation

Acceptance criteria:

  • Two domains running concurrently with isolated memory
  • Capability violation (cross-domain access without cap) triggers fault
  • Domain switch measured at <10us

Estimated effort: 3-4 weeks Dependencies: M0

M2: Witness Logging + Proof Verifier

Goal: Every privileged action emits a witness record; proofs are verified before mutation.

Preconditions: M1 complete

Actions:

  • M2.1: Implement per-domain witness log (AppendOnly region)
  • M2.2: Wire all syscalls through proof verifier (ruvix-proof integration)
  • M2.3: Implement 3-tier proof routing: Reflex for hot path, Standard for normal, Deep for privileged
  • M2.4: Implement global witness merge at domain switch
  • M2.5: Test: replay witness log from checkpoint, verify state reconstruction

Acceptance criteria:

  • No syscall succeeds without valid proof
  • Witness log captures all state changes
  • Replay from checkpoint + witness log produces identical state

Estimated effort: 2-3 weeks Dependencies: M1

M3: Basic Scheduler with Coherence Scoring

Goal: Scheduler uses coherence scores from graph structure to drive scheduling decisions.

Preconditions: M1, M2 complete; ruvector-coherence available

Actions:

  • M3.1: Integrate ruvector-coherence spectral scoring into ruvix-sched
  • M3.2: Implement per-domain graph state tracking in ruvix-vecgraph
  • M3.3: Implement coherence-pressure scheduling algorithm
  • M3.4: Implement partition priority based on coherence score + deadline pressure
  • M3.5: Benchmark: measure scheduling overhead per tick

Acceptance criteria:

  • Domains with higher coherence scores get proportionally more CPU time
  • Scheduling tick overhead < 1us
  • Coherence-driven scheduling demonstrably reduces tail latency

Estimated effort: 2-3 weeks. Dependencies: M1, M2.
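The proportional-CPU-time criterion above can be made concrete with a small weighting function: each domain's share is its coherence score boosted by deadline pressure, normalized over all runnable domains. This is a sketch of the idea, not the actual M3.3 algorithm:

```rust
// Compute CPU shares from per-domain coherence scores and deadline pressure.
// Both inputs are assumed normalized to [0, 1]; the weighting is illustrative.
fn shares(scores: &[f32], deadline_pressure: &[f32]) -> Vec<f32> {
    let weights: Vec<f32> = scores
        .iter()
        .zip(deadline_pressure)
        .map(|(s, p)| s * (1.0 + p)) // pressure boosts a domain's weight
        .collect();
    let total: f32 = weights.iter().sum();
    weights.iter().map(|w| w / total).collect()
}

fn main() {
    // Domain 1 has twice the coherence score of domain 0; no deadline pressure.
    let s = shares(&[1.0, 2.0], &[0.0, 0.0]);
    assert!((s[0] - 1.0 / 3.0).abs() < 1e-6);
    assert!((s[1] - 2.0 / 3.0).abs() < 1e-6);

    // Equal scores, but domain 0 is under deadline pressure: it gets more time.
    let s = shares(&[1.0, 1.0], &[1.0, 0.0]);
    assert!(s[0] > s[1]);
}
```

A kernel implementation would use fixed-point arithmetic rather than floats, but the proportionality property being tested in M3's acceptance criteria is the same.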

M4: Dynamic MinCut Integration from RuVector Crates

Goal: The hypervisor uses ruvector-mincut for live partition placement, migration, and reclamation.

Preconditions: M3 complete; ruvector-mincut and ruvector-sparsifier crates available

Actions:

  • M4.1: Create no_std-compatible subset of ruvector-mincut for kernel use
  • M4.2: Integrate min-cut computation into scheduler tick (amortized)
  • M4.3: Implement partition migration protocol: compute new cut -> transfer regions -> switch
  • M4.4: Implement memory reclamation: lowest-coherence partitions reclaimed first
  • M4.5: Integrate ruvector-sparsifier for efficient graph state maintenance in kernel
  • M4.6: Benchmark: min-cut update latency in kernel context

Acceptance criteria:

  • Dynamic repartitioning of domains based on workload changes
  • Min-cut computation completes within scheduler tick budget
  • Memory reclamation recovers regions without data loss (witness-backed)
  • 20% reduction in remote memory traffic vs. static partitioning

Estimated effort: 4-5 weeks. Dependencies: M3, ruvector-mincut, ruvector-sparsifier.

M5: Memory Tier Management (Hot/Warm/Dormant/Cold)

Goal: Four-tier memory hierarchy with coherence-driven promotion/demotion.

Preconditions: M4 complete

Actions:

  • M5.1: Implement MemoryTier enum and tier metadata in ruvix-region
  • M5.2: Implement hot -> warm demotion (unpin, allow eviction)
  • M5.3: Implement warm -> dormant compression (delta encoding via witness log)
  • M5.4: Implement dormant -> cold eviction (to NVMe/flash with tombstone)
  • M5.5: Implement cold -> hot reconstruction (replay witness log from checkpoint)
  • M5.6: Wire tier transitions to coherence score thresholds

Acceptance criteria:

  • Memory tiers transition automatically based on coherence scoring
  • Dormant regions reconstruct correctly from witness log
  • Cold eviction and hot reconstruction maintain data integrity
  • Memory footprint reduced by 50%+ for dormant workloads

Estimated effort: 3-4 weeks. Dependencies: M4.
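The coherence-threshold transitions of M5.6 reduce to a mapping from a region's coherence score to a target tier. The tier names match the document; the threshold values below are made-up placeholders for illustration:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum MemoryTier { Hot, Warm, Dormant, Cold }

// Map a region's coherence score (assumed in [0, 1]) to its target tier.
// Threshold values are illustrative, not the tuned kernel defaults.
fn target_tier(coherence: f32) -> MemoryTier {
    match coherence {
        c if c >= 0.75 => MemoryTier::Hot,     // pinned, actively used
        c if c >= 0.40 => MemoryTier::Warm,    // unpinned, evictable
        c if c >= 0.10 => MemoryTier::Dormant, // delta-compressed via witness log
        _ => MemoryTier::Cold,                 // evicted to NVMe with tombstone
    }
}

fn main() {
    assert_eq!(target_tier(0.90), MemoryTier::Hot);
    assert_eq!(target_tier(0.50), MemoryTier::Warm);
    assert_eq!(target_tier(0.20), MemoryTier::Dormant);
    assert_eq!(target_tier(0.01), MemoryTier::Cold);
}
```

A production version would add hysteresis (separate promote/demote thresholds) so that regions hovering near a boundary do not oscillate between tiers.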

M6: Agent Runtime Adapter (WASM Partitions)

Goal: WASM-based agent partitions run inside coherence domains with capability-gated access.

Preconditions: M5 complete; WASM runtime selection made (from A1.6)

Actions:

  • M6.1: Integrate minimal WASM runtime (WAMR or wasmtime-minimal) into kernel
  • M6.2: Implement WASM import interface: capabilities as host functions
  • M6.3: Implement WASM partition boot: load .wasm from RVF package, instantiate in domain
  • M6.4: Implement agent IPC: WASM component model typed channels -> ruvix-queue
  • M6.5: Implement agent lifecycle: spawn, pause, resume, terminate (all proof-gated)
  • M6.6: Test: multi-agent scenario with 10+ WASM partitions in different domains

Acceptance criteria:

  • WASM partitions boot and execute within coherence domains
  • Agent-to-agent IPC through typed channels with <1us latency
  • Capability violations in WASM trapped and logged to witness
  • 10 concurrent WASM agents run without interference

Estimated effort: 4-5 weeks. Dependencies: M5, A1.6.

M7: Seed/Appliance Hardware Bring-Up

Goal: Boot RVM on Cognitum Seed and Appliance hardware.

Preconditions: M6 complete; hardware available

Actions:

  • M7.1: Implement device tree parsing for Seed/Appliance SoC in ruvix-dtb
  • M7.2: Implement BCM2711 (or target SoC) interrupt controller driver
  • M7.3: Implement board-specific boot sequence in ruvix-rpi-boot or equivalent
  • M7.4: Implement NVMe driver for cold-tier storage
  • M7.5: Implement network driver for cross-node coherence domain migration
  • M7.6: Full integration test: boot, create domains, run WASM agents, migrate, recover

Acceptance criteria:

  • RVM boots on physical hardware
  • All M0-M6 acceptance criteria met on real hardware
  • Fault recovery without global reboot demonstrated
  • Cross-node migration demonstrated (if multi-node hardware available)

Estimated effort: 6-8 weeks. Dependencies: M6, hardware availability.


4. RuVector Integration Plan

4.1 MinCut Drives Partition Placement

Crate: ruvector-mincut (subpolynomial dynamic min-cut)

Integration points:

| Hypervisor Function | MinCut Operation | Data Flow |
|---|---|---|
| Domain creation | Initial partition computation | Domain graph -> MinCutBuilder -> Partition |
| Scheduler tick | Amortized cut value query | MinCut.min_cut_value() -> coherence score input |
| Migration decision | Repartition computation | Updated graph -> MinCut.insert/delete_edge -> New partition |
| Memory reclamation | Cut-based ordering | MinCut values across domains -> reclamation priority |
| Fault isolation | Cut identifies blast radius | MinCut.min_cut_set() -> affected regions |

no_std adaptation required:

  • ruvector-mincut currently depends on petgraph, rayon, dashmap, parking_lot (all require std/alloc)
  • Create ruvector-mincut-kernel feature flag that uses:
    • Fixed-size graph representation (no heap allocation)
    • Single-threaded computation (no rayon)
    • Spin locks instead of parking_lot
    • Inline graph storage instead of petgraph

Actions:

  • I4.1.1: Add no_std feature to ruvector-mincut with kernel-compatible subset
  • I4.1.2: Implement fixed-size graph backend (max 256 nodes, 4096 edges)
  • I4.1.3: Benchmark kernel-mode min-cut: target <5us for 64-node graphs
  • I4.1.4: Wire into ruvix-sched as scheduling oracle
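The fixed-size graph backend of I4.1.2 can be sketched as a flat edge array with static capacity, so the kernel never allocates on the hot path. Field names and the API surface here are assumptions, not the eventual ruvector-mincut-kernel design:

```rust
const MAX_NODES: usize = 256;
const MAX_EDGES: usize = 4096;

// Allocation-free edge list sized at the I4.1.2 limits (256 nodes, 4096 edges).
struct FixedGraph {
    edges: [(u8, u8, u32); MAX_EDGES], // (src, dst, weight)
    n_edges: usize,
    degree: [u32; MAX_NODES], // weighted degree per node
}

impl FixedGraph {
    fn new() -> Self {
        FixedGraph {
            edges: [(0, 0, 0); MAX_EDGES],
            n_edges: 0,
            degree: [0; MAX_NODES],
        }
    }

    // Insertion fails (rather than allocating) when the edge budget is
    // exhausted, which is the behavior a kernel-mode caller needs.
    fn insert_edge(&mut self, src: u8, dst: u8, w: u32) -> Result<(), ()> {
        if self.n_edges == MAX_EDGES {
            return Err(());
        }
        self.edges[self.n_edges] = (src, dst, w);
        self.n_edges += 1;
        self.degree[src as usize] += w;
        self.degree[dst as usize] += w;
        Ok(())
    }
}

fn main() {
    let mut g = FixedGraph::new();
    g.insert_edge(0, 1, 5).unwrap();
    g.insert_edge(1, 2, 3).unwrap();
    assert_eq!(g.n_edges, 2);
    assert_eq!(g.degree[1], 8); // node 1 touches both edges
}
```

With `u8` node ids the 256-node cap is enforced by the type system itself, and the whole structure is `Copy`-free and heap-free, matching the no_std constraints listed above.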

4.2 Sparsifier Enables Efficient Graph State

Crate: ruvector-sparsifier (dynamic spectral graph sparsification)

Integration points:

| Hypervisor Function | Sparsifier Operation | Purpose |
|---|---|---|
| Graph maintenance | Sparsify domain graph | Keep O(n log n) edges instead of O(n^2) for coherence queries |
| Coherence scoring | Spectral gap from sparsified Laplacian | Fast coherence score without full eigendecomposition |
| Migration planning | Sparsified graph for min-cut | Approximate min-cut on sparsified graph (faster) |
| Memory accounting | Sparse representation of access patterns | Track which regions are accessed by which tasks |

Key insight: The sparsifier maintains a spectrally-equivalent graph with O(n log n / epsilon^2) edges. This means coherence scoring and min-cut computation can run on the sparse representation instead of the full graph, reducing kernel-mode computation time.

Actions:

  • I4.2.1: Add no_std feature to ruvector-sparsifier
  • I4.2.2: Implement incremental sparsification (update sparse graph on edge insert/delete)
  • I4.2.3: Wire sparsified graph into scheduler for fast coherence queries
  • I4.2.4: Benchmark: sparsified vs. full graph coherence scoring latency

4.3 Solver Handles Coherence Scoring

Crate: ruvector-solver (sublinear-time sparse solvers)

Integration points:

| Hypervisor Function | Solver Operation | Purpose |
|---|---|---|
| Coherence score | Approximate Fiedler vector | Spectral gap computation |
| PageRank-style scoring | Forward push on domain graph | Task importance ranking |
| Migration cost estimation | Sparse linear system solve | Estimate data transfer cost |

Key insight: The solver's Neumann series and conjugate gradient methods can compute approximate spectral properties of the domain graph in O(sqrt(n)) time. This is fast enough for per-tick coherence scoring.
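To make the spectral-gap computation concrete, here is a minimal sketch using shifted power iteration on a tiny Laplacian, deflating the constant (nullspace) eigenvector so the iteration converges to the Fiedler vector. This is a stand-in illustration, not the ruvector-solver Neumann-series implementation:

```rust
const N: usize = 3;

fn matvec(l: &[[f64; N]; N], x: &[f64; N]) -> [f64; N] {
    let mut y = [0.0; N];
    for i in 0..N {
        for j in 0..N {
            y[i] += l[i][j] * x[j];
        }
    }
    y
}

// Estimate lambda_2 of the Laplacian (the spectral gap used as the
// coherence score) via power iteration on (shift*I - L), deflating the
// all-ones eigenvector at each step.
fn spectral_gap(l: &[[f64; N]; N], shift: f64, iters: usize) -> f64 {
    let mut x = [1.0, -0.5, 0.2]; // arbitrary non-constant start
    for _ in 0..iters {
        let mean: f64 = x.iter().sum::<f64>() / N as f64;
        for v in x.iter_mut() { *v -= mean; } // project out the nullspace
        let lx = matvec(l, &x);
        for i in 0..N { x[i] = shift * x[i] - lx[i]; }
        let norm = x.iter().map(|v| v * v).sum::<f64>().sqrt();
        for v in x.iter_mut() { *v /= norm; }
    }
    // Rayleigh quotient x^T L x (x is unit length) gives lambda_2.
    let lx = matvec(l, &x);
    x.iter().zip(&lx).map(|(a, b)| a * b).sum::<f64>()
}

fn main() {
    // Path graph 0-1-2: Laplacian eigenvalues {0, 1, 3}, so the gap is 1.
    let l = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]];
    let gap = spectral_gap(&l, 4.0, 100);
    assert!((gap - 1.0).abs() < 1e-6);
}
```

A well-connected (coherent) domain graph has a large gap; a graph about to split into two weakly coupled clusters has a gap near zero, which is exactly the signal the scheduler and migration planner consume.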

Actions:

  • I4.3.1: Add no_std subset of ruvector-solver (Neumann series only, no nalgebra)
  • I4.3.2: Implement approximate Fiedler vector computation for coherence scoring
  • I4.3.3: Implement forward-push task importance ranking
  • I4.3.4: Benchmark: solver latency for 64-node domain graphs

4.4 Embeddings Enable Semantic State Reconstruction

Crate: ruvector-core (HNSW indexing), ruvix-vecgraph (kernel vector/graph stores)

Integration points:

| Hypervisor Function | Embedding Operation | Purpose |
|---|---|---|
| Dormant state reconstruction | Semantic similarity search | Find related state fragments for reconstruction |
| Novelty detection | Vector distance from recent inputs | Scheduler novelty signal |
| Fault diagnosis | Embedding-based anomaly detection | Detect divergent domain states |
| Cold tier indexing | HNSW index over tombstone references | Fast lookup of cold-tier state |

Key insight: When a dormant region needs reconstruction, the witness log provides the exact mutation sequence. But semantic embeddings can identify which other regions contain related state, enabling speculative prefetch during reconstruction.

Actions:

  • I4.4.1: Implement kernel-resident micro-HNSW in ruvix-vecgraph (fixed-size, no_std)
  • I4.4.2: Wire novelty detection into scheduler (vector distance from recent inputs)
  • I4.4.3: Implement embedding-based prefetch for dormant region reconstruction
  • I4.4.4: Implement anomaly detection for cross-domain state divergence
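The novelty signal of I4.4.2 is essentially the distance of an input embedding from the centroid of recent inputs. A minimal sketch (tiny fixed-dimension vectors, invented function name, Euclidean distance standing in for whatever metric the kernel HNSW uses):

```rust
// Novelty = distance of an input embedding from the centroid of recent
// input embeddings; larger means more novel. Dimension 4 for brevity.
fn novelty(recent: &[[f32; 4]], input: &[f32; 4]) -> f32 {
    let n = recent.len() as f32;
    let mut centroid = [0.0f32; 4];
    for v in recent {
        for i in 0..4 {
            centroid[i] += v[i] / n;
        }
    }
    input
        .iter()
        .zip(&centroid)
        .map(|(a, b)| (a - b).powi(2))
        .sum::<f32>()
        .sqrt()
}

fn main() {
    let recent = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.2, 0.0, 0.0]];
    let familiar = [1.0, 0.1, 0.0, 0.0];
    let novel = [0.0, 0.0, 1.0, 1.0];
    // The scheduler would boost domains receiving high-novelty inputs.
    assert!(novelty(&recent, &novel) > novelty(&recent, &familiar));
}
```

In the kernel, "recent" would be a small ring of embeddings per domain maintained by ruvix-vecgraph rather than a slice, but the scoring idea is the same.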

4.5 Additional RuVector Crate Integration

| Crate | Integration Point | Priority |
|---|---|---|
| ruvector-raft | Cross-node consensus for multi-node coherence domains | P2 (M7) |
| ruvector-verified | Formal proofs for capability derivation correctness | P1 (M2) |
| ruvector-dag | Causal ordering in global witness log | P1 (M2) |
| ruvector-temporal-tensor | Delta compression for dormant regions | P1 (M5) |
| ruvector-coherence | Spectral coherence scoring | P0 (M3) |
| cognitum-gate-kernel | 256-tile fabric as coherence domain topology | P2 (M7) |
| ruvector-snapshot | Checkpoint/restore for domain state | P1 (M5) |

5. Success Metrics

5.1 Performance Targets

| Metric | Target | Measurement Method | Milestone |
|---|---|---|---|
| Cold boot to first witness | <250ms | QEMU timer from power-on to first witness UART print | M0 |
| Hot partition switch | <10us | ARM cycle counter around switch_partition() | M1 |
| Remote memory traffic reduction | 20% vs. static | Hardware perf counters (cache miss/remote access) | M4 |
| Tail latency reduction | 20% vs. round-robin | P99 latency of agent request/response | M4 |
| Full witness trail | 100% coverage | Audit: every syscall has witness record | M2 |
| Fault recovery without global reboot | Domain-local recovery | Kill one domain, verify others unaffected | M5 |
| WASM agent boot time | <5ms per agent | Timer around WASM instantiation | M6 |
| Zero-copy IPC latency | <100ns intra, <1us inter | Benchmark ring buffer round-trip | M1 |
| Coherence scoring overhead | <1us per domain per tick | Cycle counter around scoring function | M3 |
| Min-cut update amortized | <5us for 64-node graph | Benchmark in kernel context | M4 |

5.2 Correctness Targets

| Property | Verification Method | Milestone |
|---|---|---|
| Capability safety (no unauthorized access) | ruvector-verified + Kani | M1 |
| Witness chain integrity (no gaps, no forgery) | SHA-256 chain verification | M2 |
| Deterministic replay (same inputs -> same state) | Replay 10K syscall traces | M2 |
| Proof soundness (invalid proofs always rejected) | Fuzzing + proptest | M2 |
| Isolation (domain fault does not affect others) | Inject faults, verify containment | M5 |
| Memory safety (no UB in kernel code) | Miri + Kani + #![forbid(unsafe_code)] where possible | All |
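The witness-chain-integrity property is straightforward to illustrate: each record stores the hash of the chain so far, so forging or dropping any record breaks every subsequent link. A toy version, with `DefaultHasher` standing in for the SHA-256 the real verifier uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Record {
    prev: u64,       // hash of the chain up to (and including) the prior record
    payload: String, // the privileged action being witnessed
}

fn link(prev: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

// Walk the log and recompute every link; any gap or forgery surfaces as a
// mismatch between a record's stored prev and the recomputed chain head.
fn verify(log: &[Record]) -> bool {
    let mut prev = 0u64; // genesis
    for r in log {
        if r.prev != prev {
            return false;
        }
        prev = link(r.prev, &r.payload);
    }
    true
}

fn main() {
    let r1 = Record { prev: 0, payload: "domain_create(7)".into() };
    let h1 = link(r1.prev, &r1.payload);
    let r2 = Record { prev: h1, payload: "region_map(7, 0x1000)".into() };
    let mut log = vec![r1, r2];
    assert!(verify(&log));
    log[0].payload = "domain_create(9)".into(); // forge an earlier record
    assert!(!verify(&log));                     // chain breaks at the next link
}
```

Replay (the deterministic-replay row above) reuses the same walk: apply each verified payload in order from a checkpoint and the reconstructed state must match.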

5.3 Scale Targets

| Dimension | Target | Milestone |
|---|---|---|
| Concurrent coherence domains | 256 | M4 |
| Tasks per domain | 64 | M3 |
| Regions per domain | 1024 | M1 |
| Graph nodes per domain | 256 | M4 |
| Graph edges per domain | 4096 | M4 |
| WASM agents total | 1024 | M6 |
| Witness log entries before compaction | 1M | M2 |
| Cross-node domains (federated) | 16 nodes | M7 |

6. Dependency Graph (GOAP Action Ordering)

Research Phase (parallel):
  A1.1 (bare-metal Rust)     --|
  A1.2 (capability OS)       --+--> Architecture Phase
  A1.3 (graph scheduling)    --|
  A1.4 (memory coherence)    --|
  A1.5 (formal verification) --|
  A1.6 (agent runtimes)      --|

Architecture Phase (partially parallel):
  A2.1 (coherence domains) --> A2.4 (scheduler)
  A2.2 (HAL)               --> M0 (bare-metal boot)
  A2.3 (memory model)      --> A2.4 (scheduler)
  A2.4 (scheduler)         --> M3
  A2.5 (IPC)               --> M1
  A2.6 (device model)      --> M7
  A2.7 (witness)           --> M2

Implementation (sequential with overlap):
  M0 (boot)
   |
  M1 (partitions + caps)
   |
  M2 (witness + proofs) -- can overlap with M3
   |
  M3 (coherence scheduler)
   |
  M4 (mincut integration) -- critical path
   |
  M5 (memory tiers)
   |
  M6 (WASM agents)
   |
  M7 (hardware bring-up)

RuVector Integration (parallel with milestones):
  I4.1 (mincut no_std)      --> M4
  I4.2 (sparsifier no_std)  --> M4
  I4.3 (solver no_std)      --> M3
  I4.4 (embeddings kernel)  --> M5

Critical Path

A2.2 (HAL) -> M0 -> M1 -> M3 -> M4 -> M5 -> M6 -> M7
                           ^
                           |
                     I4.1 (mincut no_std) -- this is the highest-risk integration

Highest risk item: Creating a no_std subset of ruvector-mincut that runs in kernel context within the scheduler tick budget. If the amortized min-cut update exceeds 5us for 64-node graphs, the scheduler design must fall back to periodic (not per-tick) repartitioning.
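The fallback described above can be wired in as a small policy object: track the amortized update cost and drop from per-tick to periodic repartitioning when the budget is exceeded. The struct, field names, and constants below are illustrative:

```rust
// Policy guard for the min-cut scheduler integration: per-tick while the
// amortized cost stays under budget, periodic otherwise.
struct MincutPolicy {
    budget_ns: u64,
    ema_ns: u64,       // exponential moving average of observed update cost
    per_tick: bool,
    period_ticks: u32, // used only in fallback mode
}

impl MincutPolicy {
    fn observe(&mut self, update_ns: u64) {
        // EMA with alpha = 1/8, integer arithmetic as a kernel would use.
        self.ema_ns = (self.ema_ns * 7 + update_ns) / 8;
        if self.ema_ns > self.budget_ns {
            self.per_tick = false; // fall back to periodic repartitioning
        }
    }

    fn should_run(&self, tick: u32) -> bool {
        self.per_tick || tick % self.period_ticks == 0
    }
}

fn main() {
    let mut p = MincutPolicy { budget_ns: 5_000, ema_ns: 0, per_tick: true, period_ticks: 100 };
    p.observe(3_000);
    assert!(p.should_run(1)); // within budget: runs every tick
    for _ in 0..32 {
        p.observe(20_000); // sustained overruns of the 5us budget
    }
    assert!(!p.per_tick);
    assert!(p.should_run(200) && !p.should_run(201)); // now every 100 ticks
}
```

Making the degradation a runtime decision (rather than a compile-time one) also satisfies DC-6: degraded mode is explicit and observable.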


7. GOAP State Transitions

World State Variables

struct WorldState {
    // Research
    bare_metal_research_complete: bool,
    capability_model_decided: bool,
    scheduling_algorithm_specified: bool,
    memory_model_designed: bool,
    verification_strategy_decided: bool,
    agent_runtime_selected: bool,

    // Infrastructure
    boots_on_qemu: bool,
    uart_works: bool,
    mmu_configured: bool,
    interrupts_working: bool,

    // Core features
    coherence_domains_exist: bool,
    capabilities_enforce_isolation: bool,
    witness_log_records_all: bool,
    proofs_gate_all_mutations: bool,
    scheduler_uses_coherence: bool,
    mincut_drives_partitioning: bool,
    memory_tiers_work: bool,
    wasm_agents_run: bool,

    // Performance
    boot_under_250ms: bool,
    switch_under_10us: bool,
    traffic_reduced_20pct: bool,
    tail_latency_reduced_20pct: bool,

    // Hardware
    runs_on_seed_hardware: bool,
    runs_on_appliance_hardware: bool,
}

Initial State

WorldState {
    bare_metal_research_complete: false,
    capability_model_decided: true,  // seL4-inspired, already in ruvix-cap
    scheduling_algorithm_specified: false,
    memory_model_designed: false,
    verification_strategy_decided: false,
    agent_runtime_selected: false,

    boots_on_qemu: false,
    uart_works: false,
    mmu_configured: false,
    interrupts_working: false,

    coherence_domains_exist: false,
    capabilities_enforce_isolation: true,  // ruvix-cap works in hosted mode
    witness_log_records_all: false,
    proofs_gate_all_mutations: true,  // ruvix-proof works in hosted mode
    scheduler_uses_coherence: false,
    mincut_drives_partitioning: false,
    memory_tiers_work: false,
    wasm_agents_run: false,

    boot_under_250ms: false,
    switch_under_10us: false,
    traffic_reduced_20pct: false,
    tail_latency_reduced_20pct: false,

    runs_on_seed_hardware: false,
    runs_on_appliance_hardware: false,
}

Goal State

All fields set to true.

A* Search Heuristic

The heuristic for GOAP planning uses the number of false fields as the distance estimate. Each action sets one or more fields to true. The planner finds the minimum-cost path from initial to goal state.

Cost model:

  • Research action: 1 week (cost = 1)
  • Architecture action: 1-2 weeks (cost = 1.5)
  • Implementation milestone: 2-5 weeks (cost = 3)
  • Integration action: 1-3 weeks (cost = 2)
  • Hardware bring-up: 6-8 weeks (cost = 7)

Optimal plan total estimated duration: 28-36 weeks (with parallelism in research and integration phases, critical path through M0->M1->M3->M4->M5->M6->M7).
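The heuristic described above can be sketched concretely: distance-to-goal is simply the count of world-state fields still false, shown here on a reduced three-field state for brevity:

```rust
// Reduced GOAP world state (three of the fields from the full struct above).
struct WorldState {
    boots_on_qemu: bool,
    coherence_domains_exist: bool,
    wasm_agents_run: bool,
}

// A* heuristic: number of goal fields not yet true.
fn heuristic(s: &WorldState) -> u32 {
    [s.boots_on_qemu, s.coherence_domains_exist, s.wasm_agents_run]
        .iter()
        .filter(|&&done| !done)
        .count() as u32
}

fn main() {
    let mut s = WorldState {
        boots_on_qemu: false,
        coherence_domains_exist: false,
        wasm_agents_run: false,
    };
    assert_eq!(heuristic(&s), 3); // initial state: three goals unmet
    s.boots_on_qemu = true;       // e.g. completing M0 sets one field true
    assert_eq!(heuristic(&s), 2); // one step closer to the goal state
}
```

Since each action sets at least one field true and no action unsets a field, this heuristic never overestimates the remaining action count, so A* returns a minimum-cost plan.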


8. Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| MinCut no_std too slow for per-tick scheduling | Medium | High | Fall back to periodic repartitioning (every 100 ticks); use sparsified graph |
| EL2 page table management bugs | High | Medium | Extensive QEMU testing; Miri for unsafe blocks; compare with known-good implementations |
| WASM runtime too large for kernel integration | Medium | Medium | Use WAMR interpreter (smallest footprint); or run WASM in EL1 with EL2 capability enforcement |
| Witness log overhead degrades hot path | Low | High | Reflex proof tier (<100ns) is already within budget; batch witness records if needed |
| Hardware coherence counters unavailable | Medium | Low | Fall back to software instrumentation (memory access tracking via page faults) |
| Formal verification scope creep | High | Low | Strict verification budget: only ruvix-cap and ruvix-proof get full verification |
| Cross-node migration protocol correctness | High | High | TLA+ model before implementation; extensive simulation in qemu-swarm |

9. Existing Codebase Inventory

What We Have (and reuse directly)

| Crate | LoC (est.) | Reuse Level | Notes |
|---|---|---|---|
| ruvix-types | ~2,000 | Direct | Add CoherenceDomain, MemoryTier types |
| ruvix-cap | ~1,500 | Direct | Add domain-scoped trees |
| ruvix-proof | ~1,800 | Direct | Add per-domain witness log |
| ruvix-sched | ~1,200 | Refactor | Wire to coherence scoring |
| ruvix-region | ~1,500 | Extend | Add tier management |
| ruvix-queue | ~1,000 | Extend | Add cross-domain shared regions |
| ruvix-boot | ~2,000 | Refactor | EL2 boot sequence |
| ruvix-vecgraph | ~1,200 | Extend | Add kernel HNSW |
| ruvix-nucleus | ~3,000 | Refactor | Add domain syscalls |
| ruvix-hal | ~800 | Extend | Add HypervisorMmu traits |
| ruvix-aarch64 | ~800 | Major work | EL2 implementation |
| ruvix-drivers | ~500 | Extend | Lease-based device model |
| ruvix-physmem | ~800 | Direct | Tier-aware allocation |
| ruvix-smp | ~500 | Direct | Multi-core domain placement |
| ruvix-dma | ~400 | Extend | Budget enforcement |
| ruvix-dtb | ~400 | Direct | Device tree parsing |
| ruvix-shell | ~600 | Direct | Debug interface |
| qemu-swarm | ~3,000 | Direct | Testing infrastructure |

What We Have (reuse via no_std adaptation)

| Crate | Adaptation Needed |
|---|---|
| ruvector-mincut | no_std feature, fixed-size graph backend |
| ruvector-sparsifier | no_std feature, remove rayon |
| ruvector-solver | no_std Neumann series only |
| ruvector-coherence | Already minimal, add spectral feature |
| ruvector-verified | Lean-agentic proofs for cap verification |
| ruvector-dag | no_std causal ordering |

What We Need to Build

| Component | Estimated LoC | Milestone |
|---|---|---|
| CoherenceDomain lifecycle | ~2,000 | M1 |
| EL2 page table management | ~3,000 | M0/M1 |
| Partition switch protocol | ~500 | M1 |
| Per-domain witness log | ~1,000 | M2 |
| Global witness merge | ~800 | M2 |
| Graph-pressure scheduler | ~1,500 | M3 |
| MinCut kernel integration | ~2,000 | M4 |
| Memory tier manager | ~2,000 | M5 |
| WASM runtime adapter | ~3,000 | M6 |
| Device lease manager | ~1,000 | M6 |
| Hardware drivers (Seed/Appliance) | ~5,000 | M7 |
| Total new code | ~21,800 | |

Combined with ~20K lines of existing ruvix code being reused/extended, the total codebase at M7 completion is estimated at ~42K lines of Rust.


10. Next Steps (Immediate Actions)

Week 1-2: Research Sprint

  1. Read Theseus OS and RedLeaf papers (A1.1.1, A1.1.2)
  2. Audit ruvix-cap against seL4 CNode spec (A1.2.1)
  3. Formalize coherence-pressure scheduling problem (A1.3.1)
  4. Benchmark ruvector-mincut update latency for kernel budget (A1.3.4)
  5. Select WASM runtime (WAMR vs wasmtime-minimal) (A1.6.1)

Week 3-4: M0 Sprint

  1. Complete _start assembly for AArch64 EL2 boot (M0.1)
  2. Initialize PL011 UART (M0.2)
  3. Configure EL2 translation tables (M0.4)
  4. Emit first witness record (M0.6)
  5. Measure boot time (M0.7)

Week 5-8: M1 + M2 Sprint

  1. Implement CoherenceDomain in ruvix-types (M1.1)
  2. Add domain syscalls (M1.2)
  3. Implement stage-2 page tables (M1.4)
  4. Wire witness logging to all syscalls (M2.2)

Week 9-12: M3 + M4 Sprint (Critical Path)

  1. Integrate ruvector-coherence into scheduler (M3.1)
  2. Create ruvector-mincut no_std kernel subset (I4.1.1)
  3. Wire min-cut into scheduler (I4.1.4 / M4.2)
  4. Implement migration protocol (M4.3)

Week 13-20: M5 + M6

  1. Implement memory tier management (M5)
  2. Integrate WASM runtime (M6)

Week 21-28: M7

  1. Hardware bring-up on target platform (M7)

Appendix A: Glossary

| Term | Definition |
|---|---|
| Coherence domain | The primary isolation unit in RVM; a graph-structured partition of tasks, regions, and capabilities managed by dynamic min-cut |
| Coherence score | A scalar metric derived from the spectral gap of a domain's task-data dependency graph; higher = more internally coherent, less external dependency |
| Partition switch | The act of saving one domain's state and loading another's; analogous to VM exit/enter but without hardware virtualization extensions |
| Proof-gated mutation | The invariant that no kernel state change occurs without a valid cryptographic proof token |
| Witness log | An append-only, hash-chained log recording every privileged action; enables deterministic replay and audit |
| Reconstructable memory | Dormant/cold memory that is not stored as raw bytes but as witness log references + delta compression, enabling reconstruction on demand |
| Device lease | A time-bounded, capability-gated grant of device access to a coherence domain; auto-revokes on expiry |
| Min-cut boundary | The set of edges in a domain's graph that, when removed, partitions the graph into the minimum-cost cut; used for migration and isolation decisions |

Appendix B: Reference Papers

  1. Boos et al. "Theseus: an Experiment in Operating System Structure and State Management." OSDI 2020.
  2. Narayanan et al. "RedLeaf: Isolation and Communication in a Safe Operating System." OSDI 2020.
  3. Klein et al. "seL4: Formal Verification of an OS Kernel." SOSP 2009.
  4. Watson et al. "CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization." IEEE S&P 2015.
  5. Baumann et al. "The Multikernel: A new OS architecture for scalable multicore systems." SOSP 2009.
  6. Karger et al. "Minimum Cuts in Near-Linear Time." JACM 2000.
  7. Shi & Malik. "Normalized Cuts and Image Segmentation." IEEE PAMI 2000.
  8. Spielman & Teng. "Spectral Sparsification of Graphs." SIAM J. Computing 2011.
  9. Levy et al. "Ownership is Theft: Experiences Building an Embedded OS in Rust." PLOS 2015 (Tock).
  10. Klimovic et al. "Pocket: Elastic Ephemeral Storage for Serverless Analytics." OSDI 2018.

Design Constraints (DC-1 through DC-15)

| DC | Constraint |
|---|---|
| DC-1 | Coherence engine is optional |
| DC-2 | Mincut never blocks scheduling (50us budget) |
| DC-3 | Three-layer proof system (P1/P2/P3) |
| DC-4 | Scheduler starts simple (2 signals) |
| DC-5 | Three systems cleanly separated |
| DC-6 | Degraded mode is explicit |
| DC-7 | Migration has time budget (100ms) |
| DC-8 | Capabilities follow ownership on split |
| DC-9 | Region assignment uses scored placement |
| DC-10 | Partition switch epoch-witnessed |
| DC-11 | Merge requires 7 strong preconditions |
| DC-12 | Logical partitions exceed physical slots |
| DC-13 | WASM optional, native first class |
| DC-14 | Failure classification F1-F4 |
| DC-15 | Control partition required on Appliance |

Updated with RVM naming convention. 9 ADRs (132-140), 15 design constraints, 13 Rust crates.
