@bpradipt
Last active March 17, 2026 08:49
Kata shim and kata-agent threat analysis

Kata Containers: Shim-Agent Communication Threat Vector Analysis

  • Date: 2026-03-17
  • Branch: main (commit 660e3bb65)
  • Scope: Shim (host-side, Go) to kata-agent (guest-side, Rust) communication
  • Disclaimer: This report was generated using Claude Code; full human review is TBD. Also note that in a real deployment defense-in-depth is always recommended, for example LSMs and network policies.


Architecture Overview

The kata-shim (host-side, Go) communicates with the kata-agent (guest-side, Rust) over TTRPC (a simplified gRPC without HTTP/2) using Protocol Buffers v3. The transport is either vsock (QEMU/CLH), hybrid-vsock (Firecracker), or Unix domain sockets (remote hypervisor).


1. Transport Layer Threats

1.1 No Encryption (TTRPC is Plaintext)

  • Finding: All shim-agent communication is unencrypted. There is no TLS, mTLS, or any application-layer encryption.
  • File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:91
  • Mitigation in place: Relies on implicit transport isolation -- vsock is hypervisor-mediated, Unix sockets are permission-restricted.
  • Risk: If the hypervisor is compromised or a co-tenant VM can sniff vsock traffic (e.g., via hypervisor bug), all RPC payloads -- including OCI specs, environment variables, file contents (CopyFile), and stdin/stdout streams -- are exposed in cleartext.

1.2 No Authentication or Mutual Identity Verification

  • Finding: The shim connects to the agent with zero credentials. No tokens, certificates, or shared secrets are exchanged.
  • File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:72-98
  • Risk: Any process that can reach the agent's vsock port (1024) or the Unix socket path can issue arbitrary agent RPCs. Because the agent acts on whatever arrives on the socket, another host process that gains access to it can use the agent as a confused deputy.

1.3 Hybrid-VSock Handshake Weakness

  • Finding: Firecracker's hybrid-vsock uses a simple text-based handshake: shim sends "CONNECT <port>\n", agent responds with "OK".
  • File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:387-449
  • Risk: No integrity check on the handshake. A MITM at the Unix socket layer could intercept and replay or inject the handshake.
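To make the exchange concrete, the following is a minimal Go sketch of the client side of the handshake as described above: write "CONNECT <port>\n", then require a reply line that begins with "OK". It is not the code in client.go; the names hybridVsockHandshake, readLine, fakeConn and the 64-byte reply bound are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// maxReplyLen bounds the handshake reply; the value is an assumption for this sketch.
const maxReplyLen = 64

// readLine reads one byte at a time up to '\n' so that no bytes belonging to
// the stream that follows the handshake are consumed by accident.
func readLine(r io.Reader, limit int) (string, error) {
	line := make([]byte, 0, limit)
	buf := make([]byte, 1)
	for len(line) < limit {
		n, err := r.Read(buf)
		if n == 1 {
			if buf[0] == '\n' {
				return string(line), nil
			}
			line = append(line, buf[0])
		}
		if err != nil {
			return "", err
		}
	}
	return "", errors.New("handshake reply too long")
}

// hybridVsockHandshake sends the CONNECT line and checks the reply prefix.
// Nothing here authenticates the peer, which is the weakness noted above.
func hybridVsockHandshake(conn io.ReadWriter, port uint32) error {
	if _, err := fmt.Fprintf(conn, "CONNECT %d\n", port); err != nil {
		return err
	}
	reply, err := readLine(conn, maxReplyLen)
	if err != nil {
		return fmt.Errorf("reading handshake reply: %w", err)
	}
	if !strings.HasPrefix(reply, "OK") {
		return fmt.Errorf("unexpected handshake reply %q", reply)
	}
	return nil
}

// fakeConn is an in-memory stand-in for the Unix socket (hypothetical helper).
type fakeConn struct {
	reply *strings.Reader
	sent  strings.Builder
}

func newFakeConn(reply string) *fakeConn { return &fakeConn{reply: strings.NewReader(reply)} }

func (c *fakeConn) Read(p []byte) (int, error)  { return c.reply.Read(p) }
func (c *fakeConn) Write(p []byte) (int, error) { return c.sent.Write(p) }

func main() {
	conn := newFakeConn("OK 1\n")
	fmt.Println(hybridVsockHandshake(conn, 1024), conn.sent.String())
}
```

Since the reply is only matched by prefix, anything positioned on the Unix socket that answers "OK" passes the check, which is why the handshake offers no integrity guarantee.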

1.4 Fixed, Predictable Port

  • Finding: Agent always listens on vsock port 1024 (vSockPort constant).
  • File: src/runtime/virtcontainers/hypervisor.go:80
  • Risk: Reduces attack complexity -- an attacker who compromises the hypervisor layer knows exactly which port to target.

2. API Surface Threats (Agent RPC Methods)

The agent exposes ~35 RPC methods defined in src/libs/protocols/protos/agent.proto. Each is a potential attack vector if an attacker can send crafted requests.

2.1 Arbitrary Process Execution -- ExecProcess

  • File: src/agent/src/rpc.rs:424
  • Risk: Allows spawning arbitrary processes with controlled args, env vars, capabilities, and UID/GID inside the guest. A compromised shim or rogue host process can execute anything inside the VM.

2.2 Kernel Module Loading -- CreateSandbox

  • File: src/agent/src/rpc.rs:1341
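  • Risk: CreateSandbox can trigger modprobe with attacker-controlled module names and parameters, which are passed directly to Command::new(MODPROBE_PATH) without sanitization. Command::new does not invoke a shell, so shell metacharacters are not interpreted; the exposure is argument injection (for example, a value beginning with - being parsed as a modprobe option) and the loading of arbitrary modules with attacker-chosen parameters.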

2.3 Iptables Rule Injection -- SetIPTables

  • File: src/agent/src/rpc.rs:1178-1208
  • Risk: Executes iptables-restore with attacker-controlled stdin data. Malicious rules could open the guest firewall, redirect traffic, or enable exfiltration channels.

2.4 File Write -- CopyFile

  • File: src/agent/src/rpc.rs:2038
  • Mitigation in place: Path must start with /run/kata-containers (line 2041).
  • Remaining risk: Supports symlink creation (line 2097-2121) and custom file modes/ownership. A symlink within /run/kata-containers could point elsewhere in the guest filesystem, creating an escape primitive.
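One way to express the missing check is a lexical test that a symlink's target, resolved relative to the link's own directory, still lies under the allowed base. The sketch below is in Go for illustration only (the agent is written in Rust); symlinkStaysInside is a hypothetical helper, and a purely lexical check does not account for symlinks that already exist along the path.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// symlinkStaysInside reports whether a symlink created at linkPath with the
// given target would resolve, lexically, to a location at or under base.
func symlinkStaysInside(base, linkPath, target string) bool {
	resolved := target
	if !filepath.IsAbs(target) {
		// Relative targets are interpreted relative to the link's directory.
		resolved = filepath.Join(filepath.Dir(linkPath), target)
	}
	resolved = filepath.Clean(resolved)
	base = filepath.Clean(base)
	// Append the separator so that "/run/kata-containers-x" is not accepted.
	return resolved == base || strings.HasPrefix(resolved, base+string(filepath.Separator))
}

func main() {
	base := "/run/kata-containers"
	link := "/run/kata-containers/a/link"
	fmt.Println(symlinkStaysInside(base, link, "../b"))
	fmt.Println(symlinkStaysInside(base, link, "/etc/shadow"))
}
```

A stricter alternative is to refuse symlink creation through CopyFile altogether, which avoids reasoning about resolution at all.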

2.5 Network Manipulation -- UpdateInterface, UpdateRoutes, AddARPNeighbors

  • File: src/agent/src/rpc.rs:1046, 1094, 1406
  • Risk: Full guest networking control: IP address injection, default route hijacking, ARP cache poisoning. An attacker can redirect all guest traffic or perform MitM within the guest network namespace.

2.6 System Clock & Entropy -- SetGuestDateTime, ReseedRandomDev

  • File: src/agent/src/rpc.rs:1503, 1448
  • Risk: Time manipulation can break TLS certificate validation, log integrity, and replay protections. Entropy injection can weaken guest RNG state.

2.7 Memory/CPU Hotplug -- OnlineCPUMem, MemHotplugByProbe, AddSwap

  • File: src/agent/src/rpc.rs:1434, 1490, 1601
  • Risk: Memory probe writes to /sys/devices/system/memory/probe with attacker-controlled addresses. Swap manipulation can cause DoS or expose sensitive memory pages.

3. Sandbox Escape & Privilege Escalation Vectors

3.1 CRITICAL -- Container ID Path Traversal

  • File: src/agent/src/rpc.rs:2202-2251 (setup_bundle())
  • Issue: Bundle path is constructed as Path::new(CONTAINER_BASE).join(cid). If verify_id() doesn't reject ../ sequences, the container rootfs could be bind-mounted to arbitrary guest paths.
  • Severity: CRITICAL -- potential guest filesystem escape.
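The check that this finding depends on can be sketched as an allowlist pattern applied before any path is built. The example is written in Go for illustration, not taken from the agent's Rust verify_id(); the pattern, the bundlePath helper, and the base directory used in main are all assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// containerIDPattern is a hypothetical allowlist: an alphanumeric first
// character followed by alphanumerics, '_', '.' or '-'. It cannot match an ID
// containing '/', a NUL byte, or a leading '.', so "../x" is rejected.
var containerIDPattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]+$`)

// bundlePath validates cid before joining it onto base.
func bundlePath(base, cid string) (string, error) {
	if !containerIDPattern.MatchString(cid) {
		return "", fmt.Errorf("invalid container ID %q", cid)
	}
	return filepath.Join(base, cid), nil
}

func main() {
	// The base directory here is illustrative, not the agent's CONTAINER_BASE.
	fmt.Println(bundlePath("/run/kata-containers", "abc123"))
	fmt.Println(bundlePath("/run/kata-containers", "../../etc"))
}
```

An allowlist is used instead of a denylist of bad sequences because it fails closed: any character not explicitly permitted is rejected.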

3.2 CRITICAL -- Mount Symlink Following

  • File: src/agent/src/storage/mod.rs:281, src/agent/src/mount.rs:67-122
  • Issue: nix::mount::mount() is called without resolving symlinks in destination paths. The kernel follows symlinks during mount, so a symlink planted in a shared directory could redirect a bind mount outside the intended container boundary.
  • Severity: CRITICAL -- container-to-guest escape.

3.3 HIGH -- Namespace Path Injection

  • File: src/agent/src/rpc.rs:1859-1911
  • Issue: IPC, UTS, and PID namespace paths from the host are used directly (PathBuf::from(&sandbox.shared_ipcns.path)) without validating they point to legitimate namespace files. A compromised host could inject paths to host namespaces.
  • Severity: HIGH -- namespace confusion attack.

3.4 HIGH -- OCI Spec Constraint Stripping

  • File: src/runtime/virtcontainers/kata_agent.go:1056-1066
  • Issue: Device cgroups, PID limits, BlockIO, network limits, and CPU constraints are all set to nil before sending to the agent. This means resource isolation is not enforced inside the guest.
  • Severity: HIGH -- DoS via resource exhaustion within the guest.

3.5 HIGH -- VFIO Sysfs Path Traversal

  • File: src/runtime/virtcontainers/container.go:1362-1363
  • Issue: vfioGroup (derived from device path) is used directly in filepath.Join(config.SysIOMMUGroupPath, vfioGroup, "devices"). If it contains ../, it could read arbitrary sysfs paths on the host.
  • Severity: HIGH -- host information disclosure.
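A possible host-side guard is to require that the group name is a single path component and then confirm the joined path is still below the IOMMU base. This is a sketch under assumptions: iommuGroupDevicesPath is a hypothetical helper, and the base path in main is illustrative and not read from the runtime configuration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// iommuGroupDevicesPath joins base/group/devices only if group is a single,
// plain path component.
func iommuGroupDevicesPath(base, group string) (string, error) {
	if group == "" || group == "." || group == ".." || strings.ContainsAny(group, "/\\\x00") {
		return "", fmt.Errorf("invalid VFIO group %q", group)
	}
	p := filepath.Join(base, group, "devices")
	// Defense in depth: confirm the result did not escape base.
	rel, err := filepath.Rel(base, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes %q", p, base)
	}
	return p, nil
}

func main() {
	fmt.Println(iommuGroupDevicesPath("/sys/kernel/iommu_groups", "42"))
	fmt.Println(iommuGroupDevicesPath("/sys/kernel/iommu_groups", "../../etc"))
}
```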

3.6 MEDIUM -- Capability Passthrough

  • File: src/runtime/virtcontainers/kata_agent.go:1014-1103
  • Issue: constrainGRPCSpec() does NOT filter Linux capabilities. If a privileged container spec is passed, full capabilities (including CAP_SYS_ADMIN, CAP_NET_RAW, etc.) are forwarded to the agent and granted inside the guest.
  • Severity: MEDIUM -- depends on guest kernel attack surface.
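If capability filtering were added, it could take roughly the following shape: remove a configured set of capabilities from each of the five OCI capability sets before the spec is forwarded. This is a sketch, not the existing constrainGRPCSpec() logic; capabilitySets, constrainCapabilities, and the contents of deniedCaps are assumptions, and the deny list would be a deployment policy decision.

```go
package main

import "fmt"

// capabilitySets mirrors the five capability sets of an OCI process spec.
type capabilitySets struct {
	Bounding, Effective, Inheritable, Permitted, Ambient []string
}

// deniedCaps is a hypothetical deny list used only for this example.
var deniedCaps = map[string]bool{"CAP_SYS_ADMIN": true, "CAP_NET_RAW": true}

// dropDenied returns a new slice without the denied capabilities.
func dropDenied(caps []string, denied map[string]bool) []string {
	out := make([]string, 0, len(caps))
	for _, c := range caps {
		if !denied[c] {
			out = append(out, c)
		}
	}
	return out
}

// constrainCapabilities filters every set and leaves the input untouched.
func constrainCapabilities(c capabilitySets, denied map[string]bool) capabilitySets {
	return capabilitySets{
		Bounding:    dropDenied(c.Bounding, denied),
		Effective:   dropDenied(c.Effective, denied),
		Inheritable: dropDenied(c.Inheritable, denied),
		Permitted:   dropDenied(c.Permitted, denied),
		Ambient:     dropDenied(c.Ambient, denied),
	}
}

func main() {
	in := capabilitySets{Effective: []string{"CAP_CHOWN", "CAP_SYS_ADMIN"}}
	fmt.Println(constrainCapabilities(in, deniedCaps).Effective)
}
```

Silently dropping capabilities can break workloads that legitimately request them, so rejecting the spec with an error is an equally defensible design.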

4. Authorization & Policy Gaps

4.1 Agent Policy is Optional (Off by Default)

  • File: src/agent/src/policy.rs:12-45, src/agent/src/rpc.rs:155-156
  • Issue: The agent-policy feature gate controls whether RPC authorization is enforced. Without it, every RPC method is implicitly allowed. Most deployments do not enable this.
  • Risk: Any entity with socket access has full, unrestricted control over the guest VM.

4.2 No Per-Container Authorization

  • Even with policy enabled, there is no per-container identity or authorization. Because callers are not authenticated (see 1.2), any caller that can reach the socket and pass the policy check can operate on any container within the sandbox.

5. Threat Summary Matrix

| Vector | Severity | Pre-Condition | Impact |
|---|---|---|---|
| Plaintext TTRPC (eavesdropping) | HIGH | Hypervisor compromise or vsock bug | Full data exfiltration |
| No authentication on agent socket | HIGH | Host process reaches vsock/UDS | Full VM control |
| Container ID path traversal | CRITICAL | Crafted container ID bypasses verify_id | Guest filesystem escape |
| Mount symlink following | CRITICAL | Symlink in shared dir before mount | Container-to-guest escape |
| Kernel module injection via CreateSandbox | CRITICAL | Compromised shim | Arbitrary kernel code in guest |
| Iptables stdin injection | HIGH | Compromised shim | Guest firewall bypass |
| Namespace path injection | HIGH | Compromised shim | Namespace confusion |
| VFIO sysfs path traversal | HIGH | Malformed device path | Host info disclosure |
| Resource constraint stripping | HIGH | By design | Guest-internal DoS |
| Missing agent-policy enforcement | MEDIUM | Default configuration | Unrestricted guest API |
| CopyFile symlink creation | MEDIUM | Valid shim access | Guest file overwrite |
| DNS/network manipulation | MEDIUM | Valid shim access | Guest traffic hijack |
| Clock/entropy manipulation | LOW | Valid shim access | Crypto weakening, log tampering |

6. Recommendations

  1. Enable agent-policy in production -- compile with agent-policy feature and deploy a restrictive allowlist of permitted RPCs.
  2. Add path canonicalization before all mount and bundle operations (realpath / canonicalize before mount()).
  3. Validate container IDs -- reject any ID containing /, .., or null bytes before path construction.
  4. Sanitize modprobe parameters -- allowlist module names and reject names/params that begin with - or contain shell metacharacters.
  5. Consider TTRPC-over-TLS for deployments where vsock isolation guarantees are insufficient (e.g., nested virtualization, shared hypervisor environments).
  6. Audit CopyFile symlink handling -- disallow symlink creation or validate symlink targets stay within /run/kata-containers.
  7. Enforce capability dropping in constrainGRPCSpec() -- strip dangerous capabilities before forwarding to the agent.

7. Agent-Policy Deep Dive: Coverage of Host-to-Guest Attack Vectors

7.1 How Agent-Policy Works

The agent-policy system is an OPA/Rego-based authorization gate built into the kata-agent. It uses regorus (a Rust OPA engine) to evaluate every incoming RPC request against a Rego policy document before execution.

Key architecture:

  • Engine: regorus::Engine in src/agent/policy/src/policy.rs:33
  • Gate function: is_allowed() in src/agent/src/policy.rs:31 -- serializes each request to JSON, then evaluates data.agent_policy.<RequestName> in the Rego engine
  • Enforcement point: Every AgentService trait method in src/agent/src/rpc.rs calls is_allowed(&req).await? before processing
  • Policy delivery: Policy is loaded from a default file (/etc/kata-opa/default-policy.rego), from initdata, or dynamically via the SetPolicy RPC
  • Bundled policies: src/kata-opa/allow-all.rego (permits everything), src/kata-opa/allow-all-except-exec-process.rego, src/kata-opa/allow-set-policy.rego
  • Production policy: src/tools/genpolicy/rules.rego -- comprehensive, per-field validation rules

7.2 The Critical Caveat: Compile-Time Optional

// src/agent/src/rpc.rs:155-158
#[cfg(not(feature = "agent-policy"))]
async fn is_allowed(_req: &impl serde::Serialize) -> ttrpc::Result<()> {
    Ok(())  // ALWAYS ALLOWS EVERYTHING
}

Without the agent-policy feature flag at compile time, is_allowed() is a no-op. This means none of the protections described below exist in a default build.

7.3 Coverage Analysis by Threat Vector

7.3.1 Well-Covered (with genpolicy rules.rego)

| Threat Vector | Policy Default | Depth of Inspection |
|---|---|---|
| ExecProcess (arbitrary exec) | false (blocked) | Deep -- validates against allowlisted commands, regex patterns, container state, capabilities (allow_exec_caps rejects all capability sets), UID/GID, and SELinux/AppArmor labels (rpc.rs:841, rules.rego:1572-1617) |
| CreateContainer (OCI spec injection) | false (blocked) | Very deep -- validates OCI version, root readonly, annotations, namespace, sandbox name, container type, process args/env/cwd, capabilities, mounts, storages, devices, Linux namespace config (rules.rego:60-121) |
| CreateSandbox (kernel module loading) | false (blocked) | Strong -- explicitly requires kernel_modules count == 0 and guest_hook_path empty (rules.rego:1532-1543). This completely blocks the kernel module injection vector |
| CopyFile (arbitrary file write) | false (blocked) | Good -- validates path against regex allowlist AND checks for directory traversal (../) via check_directory_traversal() (rules.rego:1491-1530) |
| ReadStream (stdout/stderr exfiltration) | false (blocked) | Unique behavior -- the RPC still executes but redacts the response data if policy denies it (rpc.rs:971-974, 984-987). This prevents log/output exfiltration while keeping container plumbing functional |
| WriteStream (stdin injection) | false (blocked) | Binary allow/deny only |
| Devices (VFIO passthrough) | Validated per-container | Separates VFIO devices from volume devices and validates each against policy-declared device lists with CDI annotation regex matching (rules.rego:472-498) |

7.3.2 Partially Covered

| Threat Vector | Policy Default | Gap |
|---|---|---|
| UpdateRoutes (route hijacking) | false (blocked) | Has route validation with forbidden_source_regex and forbidden_device_names (rules.rego:1629-1636), but the validation is configurable -- a weak policy could still allow malicious routes |
| UpdateInterface (IP injection) | false (blocked) | Binary allow/deny only. No inspection of IP address values, MTU, or MAC address -- if allowed, any interface config is accepted |
| SetIPTables (firewall injection) | Not in genpolicy defaults | Binary allow/deny -- if allowed, arbitrary iptables rules can be injected. No content inspection of the iptables data payload |
| AddARPNeighbors (ARP spoofing) | false (blocked) | Binary allow/deny only. If allowed, arbitrary ARP entries accepted |
| SignalProcess | true (ALLOWED) | Always allowed by default in genpolicy. No validation of signal number. A compromised host can send any signal (including SIGKILL) to any process |
| Capabilities in CreateContainer | Validated | allow_caps() compares all 5 capability sets against policy -- but policy author must define them correctly. Regex matching means overly broad patterns could grant excessive capabilities (rules.rego:1406-1430) |

7.3.3 Not Covered (Gaps)

| Threat Vector | Issue |
|---|---|
| SetPolicy itself | Default false in genpolicy, but the bundled allow-set-policy.rego sets it to true. If SetPolicy is allowed, an attacker can replace the entire policy with allow-all.rego, completely defeating all protections. The SetPolicy RPC checks policy before applying (policy.rs:37-44), so it's self-protecting -- but only if the initial policy blocks it |
| AllowRequestsFailingPolicy | If set to true in the Rego policy, ALL policy failures are silently ignored (policy.rs:187-190). This is a debug flag that completely disables security. Genpolicy defaults it to false, but nothing prevents a policy author from enabling it |
| ReseedRandomDev (entropy injection) | false (blocked) but binary allow/deny only -- if allowed, arbitrary entropy data accepted without validation |
| SetGuestDateTime (clock manipulation) | false (blocked) but binary allow/deny only -- if allowed, any timestamp accepted |
| MemHotplugByProbe (memory probe injection) | false (blocked) but binary allow/deny only -- if allowed, arbitrary probe addresses accepted |
| GetIPTables (firewall enumeration) | Not in genpolicy defaults. If allowed, leaks full guest firewall rules -- information disclosure |
| AddSwap / AddSwapPath | false (blocked) but binary allow/deny only |
| OnlineCPUMem | true (ALLOWED) by default. No validation of parameters |
| Env var content in ExecProcess | CreateContainer validates env var names/values against policy patterns, but ExecProcess does not deeply validate environment variables |
| Mount source/destination paths | Policy validates storages in CreateContainer but mount symlink following and path canonicalization are not addressable by policy -- these are runtime bugs |
| Namespace path injection | Policy validates namespace type (PID, IPC, UTS) but not namespace paths. Arbitrary namespace paths from the host are accepted |

7.4 Critical Architectural Weaknesses in the Policy System

7.4.1 SetPolicy is a Self-Destruct Button

SetPolicy is itself guarded by policy (policy.rs:37-44), which creates a bootstrap problem: if the initial policy allows SetPolicy, the host can replace the entire policy at runtime. The bundled allow-set-policy.rego (src/kata-opa/allow-set-policy.rego) is the minimal case: it sets only SetPolicyRequest := true, so SetPolicy is the one RPC that works and everything else is denied by default. Any initial policy that leaves SetPolicy allowed, including an otherwise permissive one, turns it into a complete bypass vector.

7.4.2 AllowRequestsFailingPolicy Silently Disables Everything

# rules.rego:48-52
# AllowRequestsFailingPolicy := true configures the Agent to *allow any
# requests causing a policy failure*.
default AllowRequestsFailingPolicy := false

When true, every denied request is logged as a warning but still executed (policy.rs:187-190). This is documented as a debug feature but is a global policy bypass with zero audit trail beyond warn-level logs.
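The effect of the flag can be modelled with a small default-deny gate. The sketch below is a toy model in Go, not the agent's Rego evaluation; the gate type, its fields, and the warning text are assumptions chosen to mirror the behavior described above.

```go
package main

import "fmt"

// gate is a toy model of the policy check: rules maps a request name to an
// allow decision, and a missing entry means "deny".
type gate struct {
	rules                      map[string]bool
	allowRequestsFailingPolicy bool
	warnings                   []string // stands in for warn-level logs
}

// isAllowed applies the rule for request, then the global override.
func (g *gate) isAllowed(request string) bool {
	if g.rules[request] {
		return true
	}
	if g.allowRequestsFailingPolicy {
		// The denial is recorded, but the request proceeds anyway.
		g.warnings = append(g.warnings, "policy failure ignored for "+request)
		return true
	}
	return false
}

func main() {
	g := &gate{rules: map[string]bool{"CreateContainerRequest": true}}
	fmt.Println(g.isAllowed("ExecProcessRequest")) // denied by default
	g.allowRequestsFailingPolicy = true
	fmt.Println(g.isAllowed("ExecProcessRequest"), g.warnings)
}
```

In this model a single boolean turns every deny into an allow, which is why the flag behaves as a global bypass even though no individual rule changed.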

7.4.3 Request-Level Only, No Session-Level Controls

There's no concept of caller identity, session, or connection-level authorization. Every request is evaluated independently. This means:

  • No rate limiting on requests
  • No detection of anomalous request patterns (e.g., rapid CreateContainer/DestroyContainer cycling)
  • A single allowed RPC can be called unlimited times

7.4.4 Binary Allow/Deny on Most Dangerous Operations

For the most dangerous RPCs (SetIPTables, UpdateInterface, AddARPNeighbors, ReseedRandomDev, MemHotplugByProbe), the policy is a simple boolean gate. If allowed, the full request payload is accepted without any content inspection. Only CreateContainer, ExecProcess, CopyFile, CreateSandbox, and UpdateRoutes have deep content validation in the genpolicy rules.

7.4.5 Serialization-Dependent Inspection

Policy evaluation works by serializing the protobuf request to JSON (policy.rs:32), then passing it to Rego. This means the policy can only inspect fields that survive JSON serialization. Binary fields (like SetIPTablesRequest.data containing raw iptables rules) are base64-encoded in JSON, making content inspection impractical in Rego.
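The base64 effect is easy to reproduce with Go's encoding/json, which encodes []byte fields as base64 strings, the same convention the protobuf JSON mapping uses for bytes fields. This only illustrates the convention and is not the agent's Rust serializer; iptablesPayload, toPolicyInput, and visibleInInput are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// iptablesPayload is a stand-in for a request carrying raw rule bytes.
type iptablesPayload struct {
	Data []byte `json:"data"`
}

// toPolicyInput serializes the request the way a JSON-based policy engine
// would receive it.
func toPolicyInput(req iptablesPayload) (string, error) {
	b, err := json.Marshal(req)
	return string(b), err
}

// visibleInInput reports whether needle appears verbatim in the JSON input.
func visibleInInput(input, needle string) bool {
	return strings.Contains(input, needle)
}

func main() {
	input, err := toPolicyInput(iptablesPayload{Data: []byte("-A INPUT -j ACCEPT")})
	fmt.Println(input, err)
	// The rule text is not matchable as a substring of the policy input.
	fmt.Println(visibleInInput(input, "ACCEPT"))
}
```

A policy that wants to inspect such a payload would first have to decode it, which is the practical obstacle the paragraph above describes.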

7.5 Policy Effectiveness Summary

| Threat | With genpolicy rules.rego | With allow-all.rego | Without agent-policy feature |
|---|---|---|---|
| Arbitrary exec | Blocked (allowlisted commands only) | OPEN | OPEN |
| Malicious container specs | Blocked (deep OCI validation) | OPEN | OPEN |
| Kernel module loading | Blocked (count==0 enforced) | OPEN | OPEN |
| Arbitrary file writes | Blocked (path regex + traversal check) | OPEN | OPEN |
| Log/output exfiltration | Redacted | OPEN | OPEN |
| Iptables injection | Blocked (default deny) | OPEN | OPEN |
| Network config changes | Blocked (default deny) | OPEN | OPEN |
| Policy replacement | Blocked (default deny) | Self-destructible | N/A |
| Signal to any process | OPEN | OPEN | OPEN |
| CPU/memory hotplug | OPEN | OPEN | OPEN |
| Symlink/mount path attacks | NOT ADDRESSABLE | NOT ADDRESSABLE | NOT ADDRESSABLE |
| Namespace path injection | NOT ADDRESSABLE | NOT ADDRESSABLE | NOT ADDRESSABLE |

7.6 Policy-Specific Recommendations

  1. Always compile with agent-policy -- the no-op fallback makes all other protections meaningless.
  2. Use genpolicy rules.rego as baseline -- never deploy allow-all.rego in production.
  3. Block SetPolicy in the initial policy -- once loaded, the policy should be immutable. Use initdata or the default policy file for delivery.
  4. Never enable AllowRequestsFailingPolicy -- even in staging, as it silently bypasses all authorization.
  5. Add content inspection for SetIPTables -- decode and validate iptables rules in the Rego policy, or block the RPC entirely.
  6. Restrict SignalProcess -- add signal number validation to the genpolicy rules (e.g., only allow SIGTERM/SIGKILL).
  7. Restrict OnlineCPUMem -- add parameter bounds checking or switch default to false.
  8. Address runtime-level bugs separately -- symlink following, namespace path injection, and mount path traversal cannot be solved by policy alone. These require code fixes in the agent's Rust implementation.

8. Guest-to-Host Threat Analysis: Compromised Container Inside Kata VM

8.1 Attack Model

A container running inside the Kata VM gains root privileges (e.g., via kernel exploit, misconfigured capabilities, or privileged: true). The attacker now has:

  • Full root access to the guest Linux kernel
  • Control over the kata-agent process (PID 1 in guest)
  • Access to all guest network interfaces
  • Ability to read/write all guest filesystems including shared mounts

8.2 TTRPC Response Poisoning (Guest -> Shim)

The agent is a TTRPC server only -- it cannot initiate reverse connections to the host. However, the shim continuously polls the agent for data, and trusts all response data without validation.

8.2.1 OOM Event Container ID Injection (HIGH)

  • src/runtime/pkg/containerd-shim-v2/wait.go:165 -- shim calls GetOOMEvent() in a loop
  • Agent returns OOMEvent.container_id from its internal cgroup monitor (src/agent/src/sandbox.rs:438-463)
  • A root-compromised guest can manipulate cgroups to trigger OOM events with arbitrary container IDs
  • Shim uses this container ID to:
    • Look up s.containers[containerID] map (wait.go:177)
    • Create a file at path.Join(c.bundle, "oom") on the host filesystem (wait.go:178-186)
    • Publish a TaskOOM event to containerd with the attacker-controlled ID (wait.go:190-192)
  • The containerd event propagates to the Kubernetes event stream
  • Impact: Host filesystem file creation (limited to existing container bundle paths), event spoofing to Kubernetes control plane

8.2.2 Metrics Response Injection (MEDIUM)

  • src/runtime/pkg/containerd-shim-v2/shim_management.go:107-146 -- decodeAgentMetrics() parses agent's response as Prometheus text format
  • Raw agent string fed to expfmt.NewDecoder() without validation (shim_management.go:122-123)
  • Metric names are prefixed but not sanitized (shim_management.go:137-139)
  • Impact: Malformed metrics could cause DoS on Prometheus scrape endpoints; metric name injection could confuse monitoring/alerting

8.2.3 IPTables Response Passthrough (MEDIUM)

  • src/runtime/pkg/containerd-shim-v2/shim_management.go:258-264 -- GetIPTables() returns raw iptables-save output from guest
  • Written directly to HTTP response body (w.Write(buf)) without validation
  • A compromised guest controls what iptables-save returns (or replaces the binary entirely)
  • Impact: Misrepresentation of guest firewall state to management tools; if parsing tools consume this output, potential injection

8.2.4 Stats/CgroupStats JSON Deserialization (LOW)

  • src/runtime/virtcontainers/kata_agent.go:2146-2154 -- agent's StatsContainerResponse.CgroupStats is JSON-marshaled then unmarshaled into host-side struct
  • Protobuf provides structural safety, but deeply nested or huge cgroup stats could cause resource exhaustion
  • Impact: DoS against shim process via memory exhaustion

8.2.5 Process Exit Code Manipulation (MEDIUM)

  • WaitProcess RPC returns exit status from agent
  • A compromised agent can return arbitrary exit codes
  • Exit codes influence Kubernetes restart policy decisions (CrashLoopBackOff, etc.)
  • Impact: Can prevent pod restart (return 0) or force restart loops

8.2.6 Stream Read Length Mismatch (HIGH)

  • src/runtime/virtcontainers/kata_agent.go:2500-2514 -- readProcessStream():
    copy(data, resp.Data)           // copies min(len(data), len(resp.Data)) bytes
    return len(resp.Data), nil      // returns len(resp.Data), NOT bytes actually copied
  • The shim requests uint32(len(data)) bytes via ReadStreamRequest.Len, but the agent can return more bytes than requested
  • Go's copy() is memory-safe (copies min(len(dst), len(src)) bytes), so there is no buffer overflow
  • However, the function returns len(resp.Data) -- the attacker-controlled length -- not the actual number of bytes copied into the destination buffer
  • This io.Reader implementation (iostream.go:80-96) feeds into containerd's I/O pump
  • Impact: The caller's bookkeeping of bytes read will be wrong: it believes N bytes were read when only min(N, bufsize) were actually copied. This can cause log truncation, stream offset misalignment, or data duplication depending on how the consumer advances its position. A compromised agent can exploit this to corrupt container log output visible to kubectl logs.
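A corrected version of the copy step could look like the sketch below, which returns the number of bytes actually copied and treats an oversized reply as a protocol violation. It is an illustration, not a patch to kata_agent.go; readStreamResponse and fillFromResponse are hypothetical names.

```go
package main

import "fmt"

// readStreamResponse stands in for the agent's reply; only Data matters here.
type readStreamResponse struct {
	Data []byte
}

// fillFromResponse copies the reply into dst and returns the copied length.
// A reply longer than the requested buffer is rejected instead of truncated,
// so the caller's byte accounting can never exceed len(dst).
func fillFromResponse(dst []byte, resp readStreamResponse) (int, error) {
	if len(resp.Data) > len(dst) {
		return 0, fmt.Errorf("agent returned %d bytes, at most %d requested",
			len(resp.Data), len(dst))
	}
	return copy(dst, resp.Data), nil
}

func main() {
	buf := make([]byte, 4)
	fmt.Println(fillFromResponse(buf, readStreamResponse{Data: []byte("abc")}))
	fmt.Println(fillFromResponse(buf, readStreamResponse{Data: []byte("too long")}))
}
```

Returning copy's result alone would already satisfy the io.Reader contract (n must not exceed len(p)); rejecting the oversized reply additionally surfaces the misbehaving agent instead of silently dropping data.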

8.2.7 No TTRPC Response Size Limit (HIGH)

  • src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:91 -- ttrpc.NewClient(conn, ...) is created with no MaxRecvMsgSize option
  • The TTRPC library's default max message size (typically 4MB in containerd/ttrpc) applies, but no kata-specific limit is configured
  • The shim's grpcMaxDataSize of 1MB (kata_agent.go:133) only constrains outbound CopyFile request chunks, not inbound responses
  • Responses are fully deserialized into memory before the shim processes them
  • A compromised agent can send near-maximum-size responses to every single RPC call
  • RPCs called in loops (GetOOMEvent, ReadStdout, ReadStderr, GetMetrics) amplify this
  • Impact: Sustained large responses cause shim process memory exhaustion on the host. Since the shim is a per-pod process, this can destabilize the node if many pods are targeted simultaneously.

8.2.8 Error Message Injection (MEDIUM)

  • Throughout src/runtime/pkg/containerd-shim-v2/shim_management.go, error messages from failed agent RPCs are written directly to HTTP response bodies:
    // shim_management.go:252-253
    w.WriteHeader(http.StatusInternalServerError)
    w.Write([]byte(err.Error()))    // agent error string -> HTTP body
  • This pattern appears for iptables (line 252), volume stats (line 170), volume resize (line 195), and policy (line 219)
  • A compromised agent can craft TTRPC error messages containing:
    • HTTP response splitting characters (if consumed by a naive HTTP client)
    • Log injection payloads (newlines, ANSI escape sequences)
    • Misleading error text to mask real failures
  • Impact: Log poisoning on the host; potential HTTP response injection if management API consumers don't sanitize error bodies
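One mitigation is to neutralize control characters and cap the length of agent-supplied error text before it reaches an HTTP body or a log line. The following is a minimal sketch; sanitizeAgentError is a hypothetical helper, and the byte limit passed to it is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// sanitizeAgentError replaces control characters (CR, LF, ESC, ...) with
// spaces and truncates the result to at most maxLen bytes without splitting
// a multi-byte rune.
func sanitizeAgentError(msg string, maxLen int) string {
	var b strings.Builder
	for _, r := range msg {
		if unicode.IsControl(r) {
			r = ' '
		}
		if b.Len()+utf8.RuneLen(r) > maxLen {
			break
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	raw := "rpc failed\r\nX-Injected: 1\x1b[31m"
	fmt.Printf("%q\n", sanitizeAgentError(raw, 256))
}
```

The sanitized string would be written in place of err.Error() in the handlers listed above; returning a fixed generic message to the client and logging the sanitized detail separately is a more conservative option.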

8.2.9 Volume Stats Response Passthrough (MEDIUM)

  • src/runtime/pkg/containerd-shim-v2/shim_management.go:148-173 -- serveVolumeStats():
    buf, err := s.sandbox.GuestVolumeStats(context.Background(), volumePath)
    w.Write(buf)   // raw agent JSON written to HTTP response
  • kata_agent.go:2714 JSON-marshals the agent's VolumeStatsResponse and passes it directly to the HTTP response consumed by kubelet
  • A compromised agent can return:
    • Fabricated capacity/inode numbers influencing kubelet eviction decisions
    • Extremely large JSON payloads causing kubelet memory pressure
  • Impact: Kubelet volume management decisions based on false data; potential eviction of pods on node due to fake "disk full" reports

8.2.10 GuestDetails Version/Feature Spoofing (MEDIUM)

  • src/runtime/virtcontainers/kata_agent.go:2517-2527 -- getGuestDetails() returns GuestDetailsResponse containing:
    • AgentDetails.Version -- used for version-based feature decisions
    • AgentDetails.SupportedFeatures -- determines which RPCs the shim will attempt
    • MemBlockSizeBytes -- used for memory hotplug calculations
    • AgentDetails.DeviceHandlers / StorageHandlers -- determines device handling paths
  • A compromised agent can report false capabilities, causing the shim to:
    • Use wrong code paths based on fake version info
    • Skip security-relevant operations it believes are unsupported
    • Miscalculate memory block sizes for hotplug, potentially causing guest instability or overcommit
  • Impact: Shim logic manipulation; incorrect resource accounting; potential bypass of version-gated security features

8.2.11 Interface/Route Response Divergence (LOW)

  • src/runtime/virtcontainers/kata_agent.go:641-644 and 673-676:
    if resultInterface, ok := resultingInterface.(*pbTypes.Interface); ok {
        return resultInterface, err
    }
  • updateInterface() and updateRoutes() return the agent's response to the caller. The shim uses this "resulting" state to update its internal network model
  • A compromised agent returning different interface/route data than what was requested causes the shim's internal network state to diverge from actual guest network state
  • Impact: Incorrect network teardown, resource leaks, or stale state preventing proper pod cleanup

8.2.12 Architectural Root Cause

The fundamental issue across all response vectors is that the shim treats the agent as a trusted component. Every RPC response is consumed without:

  • Response size validation or per-RPC size limits
  • Content sanitization or escaping before writing to HTTP/logs
  • Semantic validation (does the response match what was requested?)
  • Rate limiting on event-style RPCs (OOM, metrics, stream reads)
  • Bounds checking on returned lengths against request parameters

This trust model was designed for the "protect host from untrusted container" threat model -- the agent was part of the trusted computing base. However, in the compromised container scenario (root escape inside guest VM), the agent itself becomes adversarial, and neither direction has adequate validation. In the confidential computing use case (where the host is untrusted), this same trust inversion applies in the opposite direction.

8.3 Kubernetes API Server Access via Pod Network (CRITICAL)

This is the highest-impact attack vector. The guest VM has full pod network connectivity with no guest-side filtering.

8.3.1 Network Architecture -- No Guest-Side Filtering

  • TC filter setup (src/runtime/virtcontainers/network_linux.go:940-1010) creates transparent bidirectional redirect between host veth and guest TAP
  • addRedirectTCFilter() (network_linux.go:1040-1063) uses u32 match u8 0 0 (match all traffic) with TCA_EGRESS_REDIR action
  • No packet filtering, no egress rules, no ingress rules at the TC level
  • Network policies (Calico/Cilium) are enforced on the host veth -- they apply to traffic entering/leaving the pod, but a root guest has already bypassed the container's internal network namespace
Host Network Namespace                         Guest VM (Isolated Kernel)
+-------------------------------------------+  +-------------------------------------------+
|                                           |  |                                           |
|  veth<pod>  ----TC REDIRECT (no filter)----->  TAP device  --->  eth0 (guest)          |
|  (host side)  <--TC REDIRECT (no filter)-----  (VM side)   <---  (full network stack)  |
|       |                                   |  |                                           |
|  Host Network Policy (Calico/Cilium)      |  |  NO firewall rules by default             |
|  (iptables/eBPF on veth)                  |  |  Can reach: all cluster IPs               |
|       |                                   |  |  Can resolve: all cluster DNS              |
|  Physical NIC --> cluster network         |  |  Can use: ServiceAccount tokens            |
+-------------------------------------------+  +-------------------------------------------+

8.3.2 ServiceAccount Token Exposure

  • Kubernetes mounts SA token at /var/run/secrets/kubernetes.io/serviceaccount/token
  • Token is shared into guest via virtiofs/9p as a regular file (src/runtime/virtcontainers/mount.go:312, src/runtime/virtcontainers/fs_share_linux.go:37-52)
  • File watcher (fs_share_linux.go:817-927) even syncs token rotations from host to guest
  • A root guest can read the token in plaintext

8.3.3 Full Cluster API Access

  • Guest DNS resolves cluster services (DNS config passed via CreateSandboxRequest, kata_agent.go:854-862)
  • Guest can resolve kubernetes.default.svc.cluster.local to API server
  • With SA token + API server access: full Kubernetes API access at the pod's RBAC level
  • Can list secrets, create pods, exec into other pods (depending on RBAC)

8.3.4 Lateral Movement via Cluster Network

  • Guest can reach any ClusterIP service
  • Can perform service/port scanning across the cluster network
  • Can reach NodePort services on all nodes
  • Can reach external services unless egress NetworkPolicy is enforced on host
  • DNS tunneling available for covert data exfiltration

8.4 Shared Filesystem Escape (Guest -> Host Filesystem)

8.4.1 Virtiofs Shared Directory Access (HIGH)

  • virtiofsd runs as root on host, serving the entire SharedPath directory (src/runtime/virtcontainers/virtiofsd.go:191: --shared-dir= + v.sourcePath)
  • Host shared path: /run/kata-containers/shared/sandboxes/$SANDBOX_ID/shared/
  • Guest mounts this at /run/kata-containers/shared/containers/
  • A root guest can access all containers' shared files in the same sandbox -- no per-container isolation at virtiofs level

8.4.2 Cross-Container Data Access (HIGH)

  • Sandbox-wide bind mounts (sandboxMountsDir) visible to all containers
  • Root guest can read/modify: /run/kata-containers/shared/containers/$OTHER_CONTAINER_ID/rootfs/
  • Can steal data, modify binaries, or plant backdoors in other containers' rootfs

8.4.3 Symlink Attack on Host Mounts (MEDIUM)

  • Host uses bind mounts with slave propagation from mounts/ to shared/ (fs_share_linux.go:222-230)
  • If guest creates symlinks in shared directory pointing outside the mount boundary, and host-side tools follow those symlinks, host filesystem could be accessed
  • virtiofsd's --shared-dir scopes access, but symlink resolution within that scope is still dangerous
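
A host-side tool that must touch guest-writable paths can at least check that the fully resolved path stays inside the shared directory. The Go sketch below shows that containment check. `resolveWithin`, `errEscapes`, and `demoTree` are hypothetical helpers, not Kata code. The check is itself check-then-use, so it narrows the window but does not close it; an atomic alternative is to open files relative to a directory handle, for example with Linux `openat2` and `RESOLVE_BENEATH` or Go's `os.Root`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// errEscapes is returned when a path resolves outside the shared directory.
var errEscapes = errors.New("path resolves outside the shared directory")

// resolveWithin resolves all symlinks in root/rel and returns the resolved
// path only if it is still located inside root.
func resolveWithin(root, rel string) (string, error) {
	realRoot, err := filepath.EvalSymlinks(root)
	if err != nil {
		return "", err
	}
	resolved, err := filepath.EvalSymlinks(filepath.Join(realRoot, rel))
	if err != nil {
		return "", err
	}
	relToRoot, err := filepath.Rel(realRoot, resolved)
	if err != nil {
		return "", err
	}
	if relToRoot == ".." || strings.HasPrefix(relToRoot, ".."+string(filepath.Separator)) {
		return "", errEscapes
	}
	return resolved, nil
}

// demoTree creates a temporary layout for illustration:
//   <base>/host-secret        file outside the shared directory
//   <base>/shared/ok.txt      regular file inside it
//   <base>/shared/evil        symlink pointing at host-secret
func demoTree() (string, func(), error) {
	base, err := os.MkdirTemp("", "shared-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(base) }
	root := filepath.Join(base, "shared")
	secret := filepath.Join(base, "host-secret")
	if err = os.Mkdir(root, 0o755); err == nil {
		err = os.WriteFile(secret, []byte("host data"), 0o600)
	}
	if err == nil {
		err = os.WriteFile(filepath.Join(root, "ok.txt"), []byte("guest data"), 0o644)
	}
	if err == nil {
		err = os.Symlink(secret, filepath.Join(root, "evil"))
	}
	if err != nil {
		cleanup()
		return "", nil, err
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := demoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	for _, name := range []string{"ok.txt", "evil", "../host-secret"} {
		_, err := resolveWithin(root, name)
		fmt.Printf("%-16s allowed=%v\n", name, err == nil)
	}
}
```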

8.4.4 ConfigMap/Secret Update Race (MEDIUM)

  • File watcher uses fsnotify to detect ConfigMap/Secret updates (fs_share_linux.go:839-913)
  • Copies updated files from host to guest on REMOVE events
  • TOCTOU race: guest could exploit timing between symlink target change and file copy

8.5 Virtio Device Attack Surface (Guest -> Hypervisor -> Host)

8.5.1 Virtio Device Emulation (HIGH)

  • QEMU exposes multiple virtio devices to guest: virtio-net, virtio-blk, virtio-vsock, virtio-fs, virtio-console
  • Root guest can craft malicious virtio descriptors targeting QEMU's device emulation code
  • Historically, QEMU virtio device emulation has been a rich source of CVEs (guest-to-host escape)

8.5.2 VFIO/IOMMU Bypass (CRITICAL if IOMMU disabled)

  • IOMMU protection is optional (src/runtime/virtcontainers/qemu.go:449-450): if q.config.IOMMU { ... }
  • Without IOMMU/vIOMMU, VFIO-passthrough devices can perform arbitrary DMA to host memory
  • Even with IOMMU, misconfigured IOMMU groups or IOMMUFD bugs could allow bypass

8.5.3 Vhost-User Shared Memory (HIGH)

  • Vhost-user devices (blk, scsi, fs, net) use shared memory regions between guest and host
  • Root guest could exploit vhost-user protocol to corrupt shared memory pages
  • virtiofsd uses vhost-user-fs protocol -- a malicious guest could send crafted vhost-user messages

8.6 Containerd/Kubernetes Event Spoofing

8.6.1 TaskOOM Event Spoofing (MEDIUM)

  • As described in 8.2.1, fake OOM events with arbitrary container IDs are published to containerd as /tasks/oom events (src/runtime/pkg/containerd-shim-v2/service.go:325-326)
  • Kubernetes consumes these to trigger pod eviction/rescheduling
  • Impact: Denial of service by triggering false OOM-based pod evictions
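
One way to blunt this is to forward an OOM event only when the reported ID belongs to a container the shim created. The Go sketch below shows that membership check. `knownContainers` and `acceptOOMEvent` are hypothetical names, and where such a check would hook into the shim's OOM watcher is an assumption not confirmed against the source.

```go
package main

import (
	"fmt"
	"sync"
)

// knownContainers tracks the container IDs the shim itself created.
type knownContainers struct {
	mu  sync.RWMutex
	ids map[string]struct{}
}

// newKnownContainers returns a set pre-populated with the given IDs.
func newKnownContainers(ids ...string) *knownContainers {
	k := &knownContainers{ids: make(map[string]struct{}, len(ids))}
	for _, id := range ids {
		k.ids[id] = struct{}{}
	}
	return k
}

// add records a container ID when the shim creates the container.
func (k *knownContainers) add(id string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.ids[id] = struct{}{}
}

// remove forgets a container ID when the container is deleted.
func (k *knownContainers) remove(id string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.ids, id)
}

// has reports whether the ID belongs to a container the shim knows about.
func (k *knownContainers) has(id string) bool {
	k.mu.RLock()
	defer k.mu.RUnlock()
	_, ok := k.ids[id]
	return ok
}

// acceptOOMEvent decides whether an agent-reported OOM event may be forwarded.
// Empty or unknown IDs are dropped instead of being published.
func acceptOOMEvent(k *knownContainers, reportedID string) bool {
	return reportedID != "" && k.has(reportedID)
}

func main() {
	known := newKnownContainers("ctr-a", "ctr-b")
	for _, id := range []string{"ctr-a", "made-up-id", ""} {
		fmt.Printf("%q forward=%v\n", id, acceptOOMEvent(known, id))
	}
}
```

This only stops events for IDs the shim never created. A compromised agent can still report a false OOM for a real container, so rate limiting remains useful alongside it.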

8.6.2 Process Exit Code Manipulation (MEDIUM)

  • WaitProcess RPC returns exit status from compromised agent
  • Exit codes influence Kubernetes restart policy decisions (CrashLoopBackOff, etc.)
  • Impact: Can prevent pod restart (return exit code 0) or force infinite restart loops

8.7 Full Attack Chain: Compromised Container to Cluster

Container gains root in Kata VM
|
+-- 1. Read ServiceAccount token from shared filesystem
+-- 2. Resolve kubernetes.default via cluster DNS
+-- 3. Authenticate to Kubernetes API server
|   +-- List/steal secrets across namespace (if RBAC permits)
|   +-- Create new privileged pods on other nodes
|   +-- Exec into other pods
|   +-- Modify deployments, inject backdoors
|
+-- 4. Network lateral movement
|   +-- Scan cluster network (all ClusterIP services)
|   +-- Reach NodePort services on all nodes
|   +-- Exfiltrate data via DNS tunneling or egress
|
+-- 5. Manipulate agent responses
|   +-- Spoof OOM events -> trigger pod evictions
|   +-- Poison metrics -> corrupt monitoring
|   +-- Fake iptables output -> misrepresent security posture
|
+-- 6. Cross-container attack via shared virtiofs
|   +-- Read other containers' filesystems
|   +-- Modify other containers' binaries
|   +-- Plant symlinks for host-side TOCTOU attacks
|
+-- 7. Hypervisor escape attempts
    +-- Craft malicious virtio descriptors
    +-- Exploit QEMU device emulation CVEs
    +-- DMA attack via VFIO (if IOMMU disabled)

8.8 Guest-to-Host Threat Matrix

| Vector | Severity | Pre-Condition | Impact |
|---|---|---|---|
| K8s API access via SA token + pod network | CRITICAL | Root in guest + SA token exists | Full cluster compromise (RBAC-dependent) |
| VFIO DMA without IOMMU | CRITICAL | IOMMU disabled + VFIO device | Arbitrary host memory read/write |
| Stream read length mismatch | HIGH | Compromised agent | Data corruption in container logs/streams, kubectl logs output manipulation |
| No TTRPC response size limit | HIGH | Compromised agent | Shim memory exhaustion, node destabilization |
| QEMU virtio device exploit | HIGH | Root in guest + unpatched QEMU | Host code execution |
| Cross-container virtiofs access | HIGH | Root in guest | Data theft, binary tampering |
| Cluster network lateral movement | HIGH | Root in guest + no egress NetworkPolicy | Service scanning, data exfiltration |
| OOM event spoofing to Kubernetes | MEDIUM | Root in guest (cgroup manipulation) | Pod eviction DoS, host file creation |
| Volume stats response fabrication | MEDIUM | Compromised agent | Kubelet eviction decisions based on false data |
| GuestDetails version/feature spoofing | MEDIUM | Compromised agent | Shim logic manipulation, security feature bypass |
| Error message injection | MEDIUM | Compromised agent | Log poisoning, HTTP response injection |
| Metrics/iptables response poisoning | MEDIUM | Compromised agent | Monitoring corruption |
| Exit code manipulation | MEDIUM | Compromised agent | Restart policy bypass |
| Virtiofs symlink TOCTOU | MEDIUM | Root in guest + timing | Potential host file access |
| ConfigMap/Secret update race | MEDIUM | Root in guest + timing | Token/secret interception |
| Interface/route response divergence | LOW | Compromised agent | Stale shim network state, resource leaks |
| Stats JSON DoS | LOW | Compromised agent | Shim memory exhaustion |
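
For the "Error message injection" vector, a small sanitizer applied before agent-supplied strings reach logs or HTTP responses removes most of the leverage. The Go sketch below escapes non-printable characters and truncates long messages. `sanitizeAgentError` is a hypothetical helper, and the `\uXXXX` escape form and the truncation marker are arbitrary illustrative choices.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeAgentError makes an untrusted error string safe to log or echo:
// non-printable runes (including CR and LF) are written as \uXXXX escapes,
// and the output is cut off after maxRunes input runes.
func sanitizeAgentError(msg string, maxRunes int) string {
	var b strings.Builder
	n := 0
	for _, r := range msg {
		if n == maxRunes {
			b.WriteString("...[truncated]")
			break
		}
		if unicode.IsPrint(r) {
			b.WriteRune(r)
		} else {
			// Escape control and other non-printable characters so they
			// cannot start a new log line or HTTP header.
			fmt.Fprintf(&b, "\\u%04x", r)
		}
		n++
	}
	return b.String()
}

func main() {
	hostile := "mount failed\r\nX-Injected: 1"
	fmt.Println(sanitizeAgentError(hostile, 64))
	fmt.Println(sanitizeAgentError(strings.Repeat("A", 1000), 16))
}
```

Escaping rather than dropping the offending characters keeps the original message recoverable during incident analysis.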

8.9 Guest-to-Host Recommendations

  1. Minimize SA token exposure -- use automountServiceAccountToken: false on pods unless strictly needed; use projected volume tokens with short TTL and audience binding.
  2. Enforce egress NetworkPolicy -- restrict guest pod network access to only required services; block API server access unless explicitly needed.
  3. Enable IOMMU/vIOMMU -- always enable when using VFIO device passthrough.
  4. Validate agent response data -- add container ID validation in watchOOMEvents() against known container set; sanitize metrics strings; validate iptables output format.
  5. Per-container virtiofs isolation -- consider separate virtiofsd instances per container or use mount namespaces within the guest to prevent cross-container access.
  6. Harden QEMU attack surface -- use machine type with minimal device set; enable sandboxing (seccomp, AppArmor for QEMU process); keep QEMU patched.
  7. Rate-limit agent responses -- add throttling on OOM events and metrics to prevent DoS amplification.
  8. Restrict virtiofsd -- run with --sandbox mode, minimize shared directory scope, consider read-only shares where possible.
  9. Use read-only rootfs -- set readOnlyRootFilesystem: true in SecurityContext to limit guest filesystem writes.
  10. Restrict guest capabilities -- never run containers with privileged: true in Kata VMs; drop all unnecessary capabilities.
  11. Fix stream read length mismatch -- in readProcessStream() (kata_agent.go:2514), return min(len(resp.Data), len(data)) instead of len(resp.Data) to match actual bytes copied.
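  12. Set TTRPC max response size -- cap the size of agent responses the shim will accept (e.g., 1MB) to limit memory consumption from malicious responses. The MaxRecvMsgSize option name suggested for ttrpc.NewClient() mirrors gRPC and should be verified against the ttrpc client API in use, which may instead enforce a fixed maximum message length.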
  13. Sanitize error messages -- escape or truncate agent error strings before writing to HTTP responses in shim_management.go to prevent log/response injection.
  14. Validate GuestDetails responses -- sanity-check MemBlockSizeBytes, version strings, and feature lists against expected ranges before using them in shim logic.
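
Recommendation 11 can be illustrated in isolation. The Go sketch below contrasts a read that reports the agent-supplied length with one that reports only the bytes actually copied. `buggyRead` and `clampedRead` are hypothetical stand-ins for the copy step, not the actual body of `readProcessStream()`. The built-in `copy` already returns the smaller of the two lengths, so its return value can be used directly.

```go
package main

import "fmt"

// buggyRead mirrors the reported pattern: it copies into dst but returns the
// agent-supplied length, which can exceed len(dst).
func buggyRead(dst, agentData []byte) int {
	copy(dst, agentData)
	return len(agentData)
}

// clampedRead returns only the number of bytes actually copied.
// copy returns min(len(dst), len(agentData)).
func clampedRead(dst, agentData []byte) int {
	return copy(dst, agentData)
}

func main() {
	buf := make([]byte, 4)
	oversized := []byte("oversized-response")

	fmt.Println("buggy reports:  ", buggyRead(buf, oversized))
	fmt.Println("clamped reports:", clampedRead(buf, oversized))
}
```

Returning a count larger than the caller's buffer violates the `io.Reader` contract (`0 <= n <= len(p)`), so callers that slice the buffer by the returned count can misbehave.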

Key Source Files Reference

| Component | File Path | Purpose |
|---|---|---|
| Proto definitions | src/libs/protocols/protos/agent.proto | All RPC method and message definitions |
| Agent RPC handlers | src/agent/src/rpc.rs | All agent-side RPC implementations |
| Agent policy gate | src/agent/src/policy.rs | Optional authorization enforcement |
| Agent policy engine | src/agent/policy/src/policy.rs | Regorus OPA engine, allow_request(), set_policy() |
| Genpolicy rules | src/tools/genpolicy/rules.rego | Production Rego policy with deep request validation |
| Allow-all policy | src/kata-opa/allow-all.rego | Permissive policy (all RPCs allowed) |
| Allow-set-policy | src/kata-opa/allow-set-policy.rego | Bootstrap policy (only SetPolicy allowed) |
| Agent device handling | src/agent/src/device/mod.rs | Device add/CDI processing |
| Agent VFIO handler | src/agent/src/device/vfio_device_handler.rs | VFIO PCI/AP device passthrough |
| Agent storage/mounts | src/agent/src/storage/mod.rs | Mount and storage operations |
| Agent mount primitives | src/agent/src/mount.rs | Low-level mount calls |
| Agent OOM monitor | src/agent/src/sandbox.rs | OOM cgroup event monitoring |
| Shim TTRPC client | src/runtime/virtcontainers/pkg/agent/protocols/client/client.go | Transport, dial, connection |
| Shim kata-agent glue | src/runtime/virtcontainers/kata_agent.go | OCI spec constraining, RPC dispatch |
| Shim OOM handler | src/runtime/pkg/containerd-shim-v2/wait.go | OOM event consumption, file creation |
| Shim management API | src/runtime/pkg/containerd-shim-v2/shim_management.go | Metrics, iptables HTTP endpoints |
| Shim container/VFIO | src/runtime/virtcontainers/container.go | VFIO annotation, CDI metadata |
| Network setup | src/runtime/virtcontainers/network_linux.go | TC filter, TAP, veth, namespace |
| Veth endpoints | src/runtime/virtcontainers/veth_endpoint.go | Virtual ethernet pair management |
| Filesystem sharing | src/runtime/virtcontainers/fs_share_linux.go | Virtiofs/9p share, bind mounts, watchers |
| Virtiofsd daemon | src/runtime/virtcontainers/virtiofsd.go | Host-side virtiofs daemon management |
| QEMU device config | src/runtime/virtcontainers/qemu.go | Hypervisor device setup, IOMMU config |
| Hypervisor socket | src/runtime/virtcontainers/hypervisor.go | VSock port constant |
| Hypervisor socket gen | src/runtime/virtcontainers/hypervisor_linux.go | CID generation |