- Date: 2026-03-17
- Branch: main (commit 660e3bb65)
- Scope: Shim (host-side, Go) to kata-agent (guest-side, Rust) communication
- Disclaimer: This report was generated using Claude Code, and full human review is TBD. Also note that in a real deployment it is always recommended to use defense-in-depth, for example LSMs, network policies, etc.
- Architecture Overview
- 1. Transport Layer Threats
- 2. API Surface Threats (Agent RPC Methods)
- 3. Sandbox Escape & Privilege Escalation Vectors
- 4. Authorization & Policy Gaps
- 5. Threat Summary Matrix
- 6. Recommendations
- 7. Agent-Policy Deep Dive: Coverage of Host-to-Guest Attack Vectors
- 8. Guest-to-Host Threat Analysis: Compromised Container Inside Kata VM
- 8.1 Attack Model
- 8.2 TTRPC Response Poisoning (Guest -> Shim)
- 8.2.1 OOM Event Container ID Injection (HIGH)
- 8.2.2 Metrics Response Injection (MEDIUM)
- 8.2.3 IPTables Response Passthrough (MEDIUM)
- 8.2.4 Stats/CgroupStats JSON Deserialization (LOW)
- 8.2.5 Process Exit Code Manipulation (MEDIUM)
- 8.2.6 Stream Read Length Mismatch (HIGH)
- 8.2.7 No TTRPC Response Size Limit (HIGH)
- 8.2.8 Error Message Injection (MEDIUM)
- 8.2.9 Volume Stats Response Passthrough (MEDIUM)
- 8.2.10 GuestDetails Version/Feature Spoofing (MEDIUM)
- 8.2.11 Interface/Route Response Divergence (LOW)
- 8.2.12 Architectural Root Cause
- 8.3 Kubernetes API Server Access via Pod Network (CRITICAL)
- 8.4 Shared Filesystem Escape (Guest -> Host Filesystem)
- 8.5 Virtio Device Attack Surface (Guest -> Hypervisor -> Host)
- 8.6 Containerd/Kubernetes Event Spoofing
- 8.7 Full Attack Chain: Compromised Container to Cluster
- 8.8 Guest-to-Host Threat Matrix
- 8.9 Guest-to-Host Recommendations
- Key Source Files Reference
The kata-shim (host-side, Go) communicates with the kata-agent (guest-side, Rust) over TTRPC (a simplified gRPC without HTTP/2) using Protocol Buffers v3. The transport is either vsock (QEMU/CLH), hybrid-vsock (Firecracker), or Unix domain sockets (remote hypervisor).
- Finding: All shim-agent communication is unencrypted. There is no TLS, mTLS, or any application-layer encryption.
- File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:91
- Mitigation in place: Relies on implicit transport isolation -- vsock is hypervisor-mediated, Unix sockets are permission-restricted.
- Risk: If the hypervisor is compromised or a co-tenant VM can sniff vsock traffic (e.g., via hypervisor bug), all RPC payloads -- including OCI specs, environment variables, file contents (CopyFile), and stdin/stdout streams -- are exposed in cleartext.
- Finding: The shim connects to the agent with zero credentials. No tokens, certificates, or shared secrets are exchanged.
- File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:72-98
- Risk: Any process that can reach the agent's vsock port (port 1024 on the guest's CID) or the Unix socket path can issue arbitrary agent RPCs. This is a confused deputy risk if another process on the host gains access to the socket.
- Finding: Firecracker's hybrid-vsock uses a simple text-based handshake: shim sends "CONNECT <port>\n", agent responds with "OK".
- File: src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:387-449
- Risk: No integrity check on the handshake. A MITM at the Unix socket layer could intercept and replay or inject the handshake.
- Finding: Agent always listens on vsock port 1024 (vSockPort constant).
- File: src/runtime/virtcontainers/hypervisor.go:80
- Risk: Reduces attack complexity -- an attacker who compromises the hypervisor layer knows exactly which port to target.
The agent exposes ~35 RPC methods defined in src/libs/protocols/protos/agent.proto. Each is a potential attack vector if an attacker can send crafted requests.
- File: src/agent/src/rpc.rs:424
- Risk: Allows spawning arbitrary processes with controlled args, env vars, capabilities, and UID/GID inside the guest. A compromised shim or rogue host process can execute anything inside the VM.
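- File: src/agent/src/rpc.rs:1341
- Risk: CreateSandbox can trigger modprobe with attacker-controlled module names and parameters. Module names and parameters are passed directly to Command::new(MODPROBE_PATH) without sanitization. Because Command::new does not invoke a shell, shell metacharacters are not interpreted; the exposure is argument injection (for example, a value beginning with "-" that modprobe parses as an option) and the loading of arbitrary modules with arbitrary parameters.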
- File: src/agent/src/rpc.rs:1178-1208
- Risk: Executes iptables-restore with attacker-controlled stdin data. Malicious rules could open the guest firewall, redirect traffic, or enable exfiltration channels.
- File: src/agent/src/rpc.rs:2038
- Mitigation in place: Path must start with /run/kata-containers (line 2041).
- Remaining risk: Supports symlink creation (lines 2097-2121) and custom file modes/ownership. A symlink within /run/kata-containers could point elsewhere in the guest filesystem, creating an escape primitive.
- File: src/agent/src/rpc.rs:1046, 1094, 1406
- Risk: Full guest networking control: IP address injection, default route hijacking, ARP cache poisoning. An attacker can redirect all guest traffic or perform MitM within the guest network namespace.
- File: src/agent/src/rpc.rs:1503, 1448
- Risk: Time manipulation can break TLS certificate validation, log integrity, and replay protections. Entropy injection can weaken guest RNG state.
- File: src/agent/src/rpc.rs:1434, 1490, 1601
- Risk: Memory probe writes to /sys/devices/system/memory/probe with attacker-controlled addresses. Swap manipulation can cause DoS or expose sensitive memory pages.
- File: src/agent/src/rpc.rs:2202-2251 (setup_bundle())
- Issue: Bundle path is constructed as Path::new(CONTAINER_BASE).join(cid). If verify_id() doesn't reject ../ sequences, the container rootfs could be bind-mounted to arbitrary guest paths.
- Severity: CRITICAL -- potential guest filesystem escape.
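The check this finding calls for can be sketched as follows. The agent itself is written in Rust, so this Go version only illustrates the logic; `bundlePath`, `idPattern`, and the value of `containerBase` are assumptions made for the example and are not taken from the Kata source.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// containerBase is an assumed stand-in for the agent's CONTAINER_BASE.
const containerBase = "/run/kata-containers"

// idPattern is a hypothetical allowlist: alphanumerics plus '.', '_' and '-',
// starting with an alphanumeric character. It excludes '/' and NUL bytes.
var idPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]*$`)

// bundlePath validates cid and only then joins it onto the base directory.
func bundlePath(cid string) (string, error) {
	if !idPattern.MatchString(cid) || strings.Contains(cid, "..") {
		return "", errors.New("invalid container ID")
	}
	p := filepath.Join(containerBase, cid)
	// Defense in depth: the cleaned path must sit directly under the base.
	if filepath.Dir(p) != containerBase {
		return "", errors.New("container ID escapes base directory")
	}
	return p, nil
}

func main() {
	for _, cid := range []string{"abc123", "../etc", "a/b", "a\x00b"} {
		p, err := bundlePath(cid)
		fmt.Printf("%q -> %q, err=%v\n", cid, p, err)
	}
}
```

An allowlist is used here rather than a deny list of `/`, `..`, and NUL, on the assumption that container IDs never need other characters; if real IDs are broader, the pattern would have to be widened.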
- File: src/agent/src/storage/mod.rs:281, src/agent/src/mount.rs:67-122
- Issue: nix::mount::mount() is called without resolving symlinks in destination paths. The kernel follows symlinks during mount, so a symlink planted in a shared directory could redirect a bind mount outside the intended container boundary.
- Severity: CRITICAL -- container-to-guest escape.
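One way to picture the missing canonicalization step is the sketch below, written in Go for illustration although the affected code is Rust. `resolveWithin` and `demo` are hypothetical helpers: the first resolves symlinks and refuses any destination that ends up outside the intended root, the second exercises it against a temporary directory tree.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveWithin resolves symlinks in target and confirms the result still
// lies under root. It returns the resolved path on success.
func resolveWithin(root, target string) (string, error) {
	realRoot, err := filepath.EvalSymlinks(root)
	if err != nil {
		return "", err
	}
	realTarget, err := filepath.EvalSymlinks(target)
	if err != nil {
		return "", err
	}
	rel, err := filepath.Rel(realRoot, realTarget)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%s resolves outside %s", target, root)
	}
	return realTarget, nil
}

// demo builds a temporary tree containing one ordinary directory and one
// symlink pointing outside the tree, and returns the error for each.
func demo() (insideErr, outsideErr error) {
	root, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(root)
	outside, err := os.MkdirTemp("", "outside")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(outside)
	if err := os.Mkdir(filepath.Join(root, "data"), 0o755); err != nil {
		return err, err
	}
	if err := os.Symlink(outside, filepath.Join(root, "evil")); err != nil {
		return err, err
	}
	_, insideErr = resolveWithin(root, filepath.Join(root, "data"))
	_, outsideErr = resolveWithin(root, filepath.Join(root, "evil"))
	return insideErr, outsideErr
}

func main() {
	in, out := demo()
	fmt.Println("data:", in)
	fmt.Println("evil:", out)
}
```

A resolve-then-mount check of this kind is still open to a race if the attacker can swap the path between the check and the mount. Closing that window generally requires operating on file descriptors (for example, Linux `openat2` with `RESOLVE_BENEATH`) rather than on path strings.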
- File: src/agent/src/rpc.rs:1859-1911
- Issue: IPC, UTS, and PID namespace paths from the host are used directly (PathBuf::from(&sandbox.shared_ipcns.path)) without validating they point to legitimate namespace files. A compromised host could inject paths to host namespaces.
- Severity: HIGH -- namespace confusion attack.
- File: src/runtime/virtcontainers/kata_agent.go:1056-1066
- Issue: Device cgroups, PID limits, BlockIO, network limits, and CPU constraints are all set to nil before sending to the agent. This means resource isolation is not enforced inside the guest.
- Severity: HIGH -- DoS via resource exhaustion within the guest.
- File: src/runtime/virtcontainers/container.go:1362-1363
- Issue: vfioGroup (derived from device path) is used directly in filepath.Join(config.SysIOMMUGroupPath, vfioGroup, "devices"). If it contains ../, it could read arbitrary sysfs paths on the host.
- Severity: HIGH -- host information disclosure.
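A minimal sketch of validating the group name before it reaches `filepath.Join`, assuming group names are plain decimal numbers. `iommuDevicesDir`, `groupPattern`, and the `sysIOMMUGroupPath` value are illustrative and not copied from the runtime.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// sysIOMMUGroupPath is an assumed value standing in for config.SysIOMMUGroupPath.
const sysIOMMUGroupPath = "/sys/kernel/iommu_groups"

// groupPattern encodes the assumption that a VFIO group is a decimal number.
var groupPattern = regexp.MustCompile(`^[0-9]+$`)

// iommuDevicesDir rejects anything that is not a bare number, so "../"
// sequences can never reach filepath.Join.
func iommuDevicesDir(vfioGroup string) (string, error) {
	if !groupPattern.MatchString(vfioGroup) {
		return "", fmt.Errorf("unexpected VFIO group %q", vfioGroup)
	}
	return filepath.Join(sysIOMMUGroupPath, vfioGroup, "devices"), nil
}

func main() {
	for _, g := range []string{"12", "../../etc", ""} {
		dir, err := iommuDevicesDir(g)
		fmt.Printf("%q -> %q, err=%v\n", g, dir, err)
	}
}
```

Deployments that rely on other group naming schemes (for example no-IOMMU mode) would need a wider pattern; the point is only that the value is checked against an expected shape before path construction.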
- File: src/runtime/virtcontainers/kata_agent.go:1014-1103
- Issue: constrainGRPCSpec() does NOT filter Linux capabilities. If a privileged container spec is passed, full capabilities (including CAP_SYS_ADMIN, CAP_NET_RAW, etc.) are forwarded to the agent and granted inside the guest.
- Severity: MEDIUM -- depends on guest kernel attack surface.
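As a rough illustration of what capability filtering could look like, the sketch below removes entries on a deny list from a capability slice. The deny list and the `filterCaps` helper are hypothetical; which capabilities count as dangerous is a policy decision this report does not settle.

```go
package main

import "fmt"

// deniedCaps is a hypothetical deny list, not Kata's actual policy.
var deniedCaps = map[string]bool{
	"CAP_SYS_ADMIN":  true,
	"CAP_SYS_MODULE": true,
	"CAP_NET_RAW":    true,
}

// filterCaps returns caps without the denied entries, preserving order.
func filterCaps(caps []string) []string {
	kept := make([]string, 0, len(caps))
	for _, c := range caps {
		if !deniedCaps[c] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	in := []string{"CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW"}
	fmt.Println(filterCaps(in))
}
```

In an OCI spec the same filter would have to be applied to every capability set (bounding, effective, inheritable, permitted, ambient), otherwise a capability removed from one set could survive in another.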
- File: src/agent/src/policy.rs:12-45, src/agent/src/rpc.rs:155-156
- Issue: The agent-policy feature gate controls whether RPC authorization is enforced. Without it, every RPC method is implicitly allowed. Most deployments do not enable this.
- Risk: Any entity with socket access has full, unrestricted control over the guest VM.
- Even with policy enabled, there's no per-container identity or authorization. Any authenticated caller can operate on any container within the sandbox.
| Vector | Severity | Pre-Condition | Impact |
|---|---|---|---|
| Plaintext TTRPC (eavesdropping) | HIGH | Hypervisor compromise or vsock bug | Full data exfiltration |
| No authentication on agent socket | HIGH | Host process reaches vsock/UDS | Full VM control |
| Container ID path traversal | CRITICAL | Crafted container ID bypasses verify_id | Guest filesystem escape |
| Mount symlink following | CRITICAL | Symlink in shared dir before mount | Container-to-guest escape |
| Kernel module injection via CreateSandbox | CRITICAL | Compromised shim | Arbitrary kernel code in guest |
| Iptables stdin injection | HIGH | Compromised shim | Guest firewall bypass |
| Namespace path injection | HIGH | Compromised shim | Namespace confusion |
| VFIO sysfs path traversal | HIGH | Malformed device path | Host info disclosure |
| Resource constraint stripping | HIGH | By design | Guest-internal DoS |
| Missing agent-policy enforcement | MEDIUM | Default configuration | Unrestricted guest API |
| CopyFile symlink creation | MEDIUM | Valid shim access | Guest file overwrite |
| DNS/network manipulation | MEDIUM | Valid shim access | Guest traffic hijack |
| Clock/entropy manipulation | LOW | Valid shim access | Crypto weakening, log tampering |
- Enable agent-policy in production -- compile with the agent-policy feature and deploy a restrictive allowlist of permitted RPCs.
- Add path canonicalization before all mount and bundle operations (realpath/canonicalize before mount()).
- Validate container IDs -- reject any ID containing /, .., or null bytes before path construction.
- Sanitize modprobe parameters -- reject module names/params with shell metacharacters.
- Consider TTRPC-over-TLS for deployments where vsock isolation guarantees are insufficient (e.g., nested virtualization, shared hypervisor environments).
- Audit CopyFile symlink handling -- disallow symlink creation or validate symlink targets stay within /run/kata-containers.
- Enforce capability dropping in constrainGRPCSpec() -- strip dangerous capabilities before forwarding to the agent.
The agent-policy system is an OPA/Rego-based authorization gate built into the kata-agent. It uses regorus (a Rust OPA engine) to evaluate every incoming RPC request against a Rego policy document before execution.
Key architecture:
- Engine: regorus::Engine in src/agent/policy/src/policy.rs:33
- Gate function: is_allowed() in src/agent/src/policy.rs:31 -- serializes each request to JSON, then evaluates data.agent_policy.<RequestName> in the Rego engine
- Enforcement point: Every AgentService trait method in src/agent/src/rpc.rs calls is_allowed(&req).await? before processing
- Policy delivery: Policy is loaded from a default file (/etc/kata-opa/default-policy.rego), from initdata, or dynamically via the SetPolicy RPC
- Bundled policies: src/kata-opa/allow-all.rego (permits everything), src/kata-opa/allow-all-except-exec-process.rego, src/kata-opa/allow-set-policy.rego
- Production policy: src/tools/genpolicy/rules.rego -- comprehensive, per-field validation rules
```rust
// src/agent/src/rpc.rs:155-158
#[cfg(not(feature = "agent-policy"))]
async fn is_allowed(_req: &impl serde::Serialize) -> ttrpc::Result<()> {
    Ok(()) // ALWAYS ALLOWS EVERYTHING
}
```

Without the agent-policy feature flag at compile time, is_allowed() is a no-op. This means none of the protections described below exist in a default build.
| Threat Vector | Policy Default | Depth of Inspection |
|---|---|---|
| ExecProcess (arbitrary exec) | false (blocked) | Deep -- validates against allowlisted commands, regex patterns, container state, capabilities (allow_exec_caps rejects all capability sets), UID/GID, and SELinux/AppArmor labels (rpc.rs:841, rules.rego:1572-1617) |
| CreateContainer (OCI spec injection) | false (blocked) | Very deep -- validates OCI version, root readonly, annotations, namespace, sandbox name, container type, process args/env/cwd, capabilities, mounts, storages, devices, Linux namespace config (rules.rego:60-121) |
| CreateSandbox (kernel module loading) | false (blocked) | Strong -- explicitly requires kernel_modules count == 0 and guest_hook_path empty (rules.rego:1532-1543). This completely blocks the kernel module injection vector |
| CopyFile (arbitrary file write) | false (blocked) | Good -- validates path against regex allowlist AND checks for directory traversal (../) via check_directory_traversal() (rules.rego:1491-1530) |
| ReadStream (stdout/stderr exfiltration) | false (blocked) | Unique behavior -- the RPC still executes but redacts the response data if policy denies it (rpc.rs:971-974, 984-987). This prevents log/output exfiltration while keeping container plumbing functional |
| WriteStream (stdin injection) | false (blocked) | Binary allow/deny only |
| Devices (VFIO passthrough) | Validated per-container | Separates VFIO devices from volume devices and validates each against policy-declared device lists with CDI annotation regex matching (rules.rego:472-498) |
| Threat Vector | Policy Default | Gap |
|---|---|---|
| UpdateRoutes (route hijacking) | false (blocked) | Has route validation with forbidden_source_regex and forbidden_device_names (rules.rego:1629-1636), but the validation is configurable -- a weak policy could still allow malicious routes |
| UpdateInterface (IP injection) | false (blocked) | Binary allow/deny only. No inspection of IP address values, MTU, or MAC address -- if allowed, any interface config is accepted |
| SetIPTables (firewall injection) | Not in genpolicy defaults | Binary allow/deny -- if allowed, arbitrary iptables rules can be injected. No content inspection of the iptables data payload |
| AddARPNeighbors (ARP spoofing) | false (blocked) | Binary allow/deny only. If allowed, arbitrary ARP entries accepted |
| SignalProcess | true (ALLOWED) | Always allowed by default in genpolicy. No validation of signal number. A compromised host can send any signal (including SIGKILL) to any process |
| Capabilities in CreateContainer | Validated | allow_caps() compares all 5 capability sets against policy -- but policy author must define them correctly. Regex matching means overly broad patterns could grant excessive capabilities (rules.rego:1406-1430) |
| Threat Vector | Issue |
|---|---|
| SetPolicy itself | Default false in genpolicy, but the bundled allow-set-policy.rego sets it to true. If SetPolicy is allowed, an attacker can replace the entire policy with allow-all.rego, completely defeating all protections. The SetPolicy RPC checks policy before applying (policy.rs:37-44), so it's self-protecting -- but only if the initial policy blocks it |
| AllowRequestsFailingPolicy | If set to true in the Rego policy, ALL policy failures are silently ignored (policy.rs:187-190). This is a debug flag that completely disables security. Genpolicy defaults it to false, but nothing prevents a policy author from enabling it |
| ReseedRandomDev (entropy injection) | false (blocked) but binary allow/deny only -- if allowed, arbitrary entropy data accepted without validation |
| SetGuestDateTime (clock manipulation) | false (blocked) but binary allow/deny only -- if allowed, any timestamp accepted |
| MemHotplugByProbe (memory probe injection) | false (blocked) but binary allow/deny only -- if allowed, arbitrary probe addresses accepted |
| GetIPTables (firewall enumeration) | Not in genpolicy defaults. If allowed, leaks full guest firewall rules -- information disclosure |
| AddSwap / AddSwapPath | false (blocked) but binary allow/deny only |
| OnlineCPUMem | true (ALLOWED) by default. No validation of parameters |
| Env var content in ExecProcess | CreateContainer validates env var names/values against policy patterns, but ExecProcess does not deeply validate environment variables |
| Mount source/destination paths | Policy validates storages in CreateContainer but mount symlink following and path canonicalization are not addressable by policy -- these are runtime bugs |
| Namespace path injection | Policy validates namespace type (PID, IPC, UTS) but not namespace paths. Arbitrary namespace paths from the host are accepted |
SetPolicy is itself guarded by policy (policy.rs:37-44), creating a bootstrap problem. If the initial policy allows SetPolicy, the entire policy can be replaced at runtime by the host. The bundled allow-set-policy.rego (src/kata-opa/allow-set-policy.rego) does exactly this -- sets only SetPolicyRequest := true, meaning only SetPolicy works and everything else is denied by default. But if a permissive initial policy is loaded that allows SetPolicy, it becomes a complete bypass vector.
```rego
# rules.rego:48-52
# AllowRequestsFailingPolicy := true configures the Agent to *allow any
# requests causing a policy failure*.
default AllowRequestsFailingPolicy := false
```

When true, every denied request is logged as a warning but still executed (policy.rs:187-190). This is documented as a debug feature but is a global policy bypass with no audit trail beyond warn-level logs.
There's no concept of caller identity, session, or connection-level authorization. Every request is evaluated independently. This means:
- No rate limiting on requests
- No detection of anomalous request patterns (e.g., rapid CreateContainer/DestroyContainer cycling)
- A single allowed RPC can be called unlimited times
For the most dangerous RPCs (SetIPTables, UpdateInterface, AddARPNeighbors, ReseedRandomDev, MemHotplugByProbe), the policy is a simple boolean gate. If allowed, the full request payload is accepted without any content inspection. Only CreateContainer, ExecProcess, CopyFile, CreateSandbox, and UpdateRoutes have deep content validation in the genpolicy rules.
Policy evaluation works by serializing the protobuf request to JSON (policy.rs:32), then passing it to Rego. This means the policy can only inspect fields that survive JSON serialization. Binary fields (like SetIPTablesRequest.data containing raw iptables rules) are base64-encoded in JSON, making content inspection impractical in Rego.
| Threat | With genpolicy rules.rego | With allow-all.rego | Without agent-policy feature |
|---|---|---|---|
| Arbitrary exec | Blocked (allowlisted commands only) | OPEN | OPEN |
| Malicious container specs | Blocked (deep OCI validation) | OPEN | OPEN |
| Kernel module loading | Blocked (count==0 enforced) | OPEN | OPEN |
| Arbitrary file writes | Blocked (path regex + traversal check) | OPEN | OPEN |
| Log/output exfiltration | Redacted | OPEN | OPEN |
| Iptables injection | Blocked (default deny) | OPEN | OPEN |
| Network config changes | Blocked (default deny) | OPEN | OPEN |
| Policy replacement | Blocked (default deny) | Self-destructible | N/A |
| Signal to any process | OPEN | OPEN | OPEN |
| CPU/memory hotplug | OPEN | OPEN | OPEN |
| Symlink/mount path attacks | NOT ADDRESSABLE | NOT ADDRESSABLE | NOT ADDRESSABLE |
| Namespace path injection | NOT ADDRESSABLE | NOT ADDRESSABLE | NOT ADDRESSABLE |
- Always compile with agent-policy -- the no-op fallback makes all other protections meaningless.
- Use genpolicy rules.rego as baseline -- never deploy allow-all.rego in production.
- Block SetPolicy in the initial policy -- once loaded, the policy should be immutable. Use initdata or the default policy file for delivery.
- Never enable AllowRequestsFailingPolicy -- even in staging, as it silently bypasses all authorization.
- Add content inspection for SetIPTables -- decode and validate iptables rules in the Rego policy, or block the RPC entirely.
- Restrict SignalProcess -- add signal number validation to the genpolicy rules (e.g., only allow SIGTERM/SIGKILL).
- Restrict OnlineCPUMem -- add parameter bounds checking or switch default to false.
- Address runtime-level bugs separately -- symlink following, namespace path injection, and mount path traversal cannot be solved by policy alone. These require code fixes in the agent's Rust implementation.
A container running inside the Kata VM gains root privileges (e.g., via kernel exploit, misconfigured capabilities, or privileged: true). The attacker now has:
- Full root access to the guest Linux kernel
- Control over the kata-agent process (PID 1 in guest)
- Access to all guest network interfaces
- Ability to read/write all guest filesystems including shared mounts
The agent is a TTRPC server only -- it cannot initiate reverse connections to the host. However, the shim continuously polls the agent for data, and trusts all response data without validation.
- src/runtime/pkg/containerd-shim-v2/wait.go:165 -- shim calls GetOOMEvent() in a loop
- Agent returns OOMEvent.container_id from its internal cgroup monitor (src/agent/src/sandbox.rs:438-463)
- A root-compromised guest can manipulate cgroups to trigger OOM events with arbitrary container IDs
- Shim uses this container ID to:
  - Look up the s.containers[containerID] map (wait.go:177)
  - Create a file at path.Join(c.bundle, "oom") on the host filesystem (wait.go:178-186)
  - Publish a TaskOOM event to containerd with the attacker-controlled ID (wait.go:190-192)
- The containerd event propagates to the Kubernetes event stream
- Impact: Host filesystem file creation (limited to existing container bundle paths), event spoofing to Kubernetes control plane
- src/runtime/pkg/containerd-shim-v2/shim_management.go:107-146 -- decodeAgentMetrics() parses agent's response as Prometheus text format
- Raw agent string fed to expfmt.NewDecoder() without validation (shim_management.go:122-123)
- Metric names are prefixed but not sanitized (shim_management.go:137-139)
- Impact: Malformed metrics could cause DoS on Prometheus scrape endpoints; metric name injection could confuse monitoring/alerting
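A name check along the following lines could run before agent-supplied metrics are re-exported. It is a sketch only: it uses the classic Prometheus metric-name character set, and `validMetricName` and `maxNameLen` are names invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// metricNameRE is the classic Prometheus metric-name pattern.
var metricNameRE = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// maxNameLen is a hypothetical cap on name length.
const maxNameLen = 128

// validMetricName reports whether name is short enough and uses only the
// characters allowed by metricNameRE.
func validMetricName(name string) bool {
	return len(name) <= maxNameLen && metricNameRE.MatchString(name)
}

func main() {
	for _, n := range []string{"kata_agent_cpu_seconds_total", "bad name\n", ""} {
		fmt.Printf("%q valid=%v\n", n, validMetricName(n))
	}
}
```

Metric families that fail the check would be dropped rather than rewritten, so a compromised agent cannot choose what a sanitized name collapses to.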
- src/runtime/pkg/containerd-shim-v2/shim_management.go:258-264 -- GetIPTables() returns raw iptables-save output from guest
- Written directly to HTTP response body (w.Write(buf)) without validation
- A compromised guest controls what iptables-save returns (or replaces the binary entirely)
- Impact: Misrepresentation of guest firewall state to management tools; if parsing tools consume this output, potential injection
- src/runtime/virtcontainers/kata_agent.go:2146-2154 -- agent's StatsContainerResponse.CgroupStats is JSON-marshaled then unmarshaled into host-side struct
- Protobuf provides structural safety, but deeply nested or huge cgroup stats could cause resource exhaustion
- Impact: DoS against shim process via memory exhaustion
- WaitProcess RPC returns exit status from agent
- A compromised agent can return arbitrary exit codes
- Exit codes influence Kubernetes restart policy decisions (CrashLoopBackOff, etc.)
- Impact: Can prevent pod restart (return 0) or force restart loops
- src/runtime/virtcontainers/kata_agent.go:2500-2514 -- readProcessStream():

  ```go
  copy(data, resp.Data)      // copies min(len(data), len(resp.Data)) bytes
  return len(resp.Data), nil // returns len(resp.Data), NOT bytes actually copied
  ```

- The shim requests uint32(len(data)) bytes via ReadStreamRequest.Len, but the agent can return more bytes than requested
- Go's copy() is memory-safe (copies min(len(dst), len(src)) bytes), so there is no buffer overflow
- However, the function returns len(resp.Data) -- the attacker-controlled length -- not the actual number of bytes copied into the destination buffer
- This io.Reader implementation (iostream.go:80-96) feeds into containerd's I/O pump
- Impact: The caller's bookkeeping of bytes read will be wrong: it believes N bytes were read when only min(N, bufsize) were actually copied. This can cause log truncation, stream offset misalignment, or data duplication depending on how the consumer advances its position. A compromised agent can exploit this to corrupt container log output visible to kubectl logs.
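The io.Reader contract requires the returned count n to satisfy 0 <= n <= len(p), so a corrected read path should return the value produced by `copy`, or refuse oversized responses outright. The sketch below takes the second approach; `readResponse`, `copyStream`, and `errOversized` are hypothetical stand-ins and not the shim's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// readResponse is a hypothetical stand-in for the agent's read-stream response.
type readResponse struct {
	Data []byte
}

var errOversized = errors.New("agent returned more data than requested")

// copyStream copies resp.Data into data and returns the number of bytes
// actually copied. A response longer than the buffer is treated as an error
// instead of being silently truncated.
func copyStream(data []byte, resp readResponse) (int, error) {
	if len(resp.Data) > len(data) {
		return 0, errOversized
	}
	return copy(data, resp.Data), nil
}

func main() {
	buf := make([]byte, 4)
	n, err := copyStream(buf, readResponse{Data: []byte("abc")})
	fmt.Println(n, err)
	n, err = copyStream(buf, readResponse{Data: []byte("abcdefgh")})
	fmt.Println(n, err)
}
```

Rejecting the response makes the anomaly visible to the caller. Truncating and returning the copied length would also restore the contract, at the cost of hiding a misbehaving agent.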
- src/runtime/virtcontainers/pkg/agent/protocols/client/client.go:91 -- ttrpc.NewClient(conn, ...) is created with no MaxRecvMsgSize option
- The TTRPC library's default max message size (typically 4MB in containerd/ttrpc) applies, but no kata-specific limit is configured
- The shim's grpcMaxDataSize of 1MB (kata_agent.go:133) only constrains outbound CopyFile request chunks, not inbound responses
- Responses are fully deserialized into memory before the shim processes them
- A compromised agent can send near-maximum-size responses to every single RPC call
- RPCs called in loops (GetOOMEvent, ReadStdout, ReadStderr, GetMetrics) amplify this
- Impact: Sustained large responses cause shim process memory exhaustion on the host. Since the shim is a per-pod process, this can destabilize the node if many pods are targeted simultaneously.
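A per-RPC response budget could look roughly like the sketch below. The method names come from the list above, but the byte limits, `maxResponseBytes`, and `checkResponseSize` are assumptions chosen purely for illustration.

```go
package main

import "fmt"

// maxResponseBytes holds hypothetical per-RPC response budgets in bytes.
var maxResponseBytes = map[string]int{
	"GetOOMEvent": 4 << 10,
	"ReadStdout":  64 << 10,
	"ReadStderr":  64 << 10,
	"GetMetrics":  1 << 20,
}

// defaultMaxResponseBytes applies to any RPC without an explicit budget.
const defaultMaxResponseBytes = 256 << 10

// checkResponseSize returns an error when a response exceeds its budget.
func checkResponseSize(method string, size int) error {
	limit, ok := maxResponseBytes[method]
	if !ok {
		limit = defaultMaxResponseBytes
	}
	if size > limit {
		return fmt.Errorf("%s response of %d bytes exceeds limit of %d", method, size, limit)
	}
	return nil
}

func main() {
	fmt.Println(checkResponseSize("GetOOMEvent", 128))
	fmt.Println(checkResponseSize("GetOOMEvent", 3<<20))
}
```

A check applied after deserialization only bounds what the shim keeps and acts on; the peak allocation is still governed by the transport-level message cap, so the two limits would need to be tuned together.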
- Throughout src/runtime/pkg/containerd-shim-v2/shim_management.go, error messages from failed agent RPCs are written directly to HTTP response bodies:

  ```go
  // shim_management.go:252-253
  w.WriteHeader(http.StatusInternalServerError)
  w.Write([]byte(err.Error())) // agent error string -> HTTP body
  ```

- This pattern appears for iptables (line 252), volume stats (line 170), volume resize (line 195), and policy (line 219)
- A compromised agent can craft TTRPC error messages containing:
  - HTTP response splitting characters (if consumed by a naive HTTP client)
  - Log injection payloads (newlines, ANSI escape sequences)
  - Misleading error text to mask real failures
- Impact: Log poisoning on the host; potential HTTP response injection if management API consumers don't sanitize error bodies
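Before an agent-supplied error string is written to a log or an HTTP body, it could be passed through a filter like the one sketched here. `sanitizeAgentError` and `maxErrLen` are hypothetical; the filter replaces control characters (newlines, carriage returns, the ESC byte that starts ANSI sequences) with spaces and stops once the output reaches the length cap.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// maxErrLen is a hypothetical cap on the sanitized message length in bytes.
const maxErrLen = 256

// sanitizeAgentError replaces control characters with spaces and stops
// appending once the output has reached maxErrLen bytes.
func sanitizeAgentError(msg string) string {
	var b strings.Builder
	for _, r := range msg {
		if b.Len() >= maxErrLen {
			break
		}
		if unicode.IsControl(r) {
			b.WriteRune(' ')
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", sanitizeAgentError("failed\r\nSet-Cookie: x=1"))
	fmt.Printf("%q\n", sanitizeAgentError("\x1b[31mfake alert"))
}
```

This addresses log and header injection only. Misleading but well-formed error text cannot be filtered out, which is why the text above treats the agent's messages as untrusted input rather than as diagnostics.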
- src/runtime/pkg/containerd-shim-v2/shim_management.go:148-173 -- serveVolumeStats():

  ```go
  buf, err := s.sandbox.GuestVolumeStats(context.Background(), volumePath)
  w.Write(buf) // raw agent JSON written to HTTP response
  ```

- kata_agent.go:2714 JSON-marshals the agent's VolumeStatsResponse and passes it directly to the HTTP response consumed by kubelet
- A compromised agent can return:
  - Fabricated capacity/inode numbers influencing kubelet eviction decisions
  - Extremely large JSON payloads causing kubelet memory pressure
- Impact: Kubelet volume management decisions based on false data; potential eviction of pods on node due to fake "disk full" reports
- src/runtime/virtcontainers/kata_agent.go:2517-2527 -- getGuestDetails() returns GuestDetailsResponse containing:
  - AgentDetails.Version -- used for version-based feature decisions
  - AgentDetails.SupportedFeatures -- determines which RPCs the shim will attempt
  - MemBlockSizeBytes -- used for memory hotplug calculations
  - AgentDetails.DeviceHandlers / StorageHandlers -- determines device handling paths
- A compromised agent can report false capabilities, causing the shim to:
  - Use wrong code paths based on fake version info
  - Skip security-relevant operations it believes are unsupported
  - Miscalculate memory block sizes for hotplug, potentially causing guest instability or overcommit
- Impact: Shim logic manipulation; incorrect resource accounting; potential bypass of version-gated security features
- src/runtime/virtcontainers/kata_agent.go:641-644 and 673-676:

  ```go
  if resultInterface, ok := resultingInterface.(*pbTypes.Interface); ok {
      return resultInterface, err
  }
  ```

- updateInterface() and updateRoutes() return the agent's response to the caller. The shim uses this "resulting" state to update its internal network model
- A compromised agent returning different interface/route data than what was requested causes the shim's internal network state to diverge from actual guest network state
- Impact: Incorrect network teardown, resource leaks, or stale state preventing proper pod cleanup
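A semantic check on the returned state could compare it with what was requested before the shim updates its model. In the sketch below, `netIface` is a simplified, hypothetical stand-in for the protobuf Interface type, and `divergence` lists the fields that differ.

```go
package main

import (
	"fmt"
	"sort"
)

// netIface is a simplified, hypothetical stand-in for the protobuf Interface.
type netIface struct {
	Name   string
	HwAddr string
	Mtu    uint64
	IPs    []string
}

// sameSet reports whether a and b hold the same strings, ignoring order.
func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// divergence returns the names of the fields in which got differs from want.
func divergence(want, got netIface) []string {
	var diffs []string
	if want.Name != got.Name {
		diffs = append(diffs, "name")
	}
	if want.HwAddr != got.HwAddr {
		diffs = append(diffs, "hwaddr")
	}
	if want.Mtu != got.Mtu {
		diffs = append(diffs, "mtu")
	}
	if !sameSet(want.IPs, got.IPs) {
		diffs = append(diffs, "ips")
	}
	return diffs
}

func main() {
	want := netIface{Name: "eth0", HwAddr: "02:00:00:00:00:01", Mtu: 1500, IPs: []string{"10.0.0.2/24"}}
	got := want
	got.Mtu = 9000
	got.IPs = []string{"10.0.0.2/24", "10.0.0.99/24"}
	fmt.Println(divergence(want, got))
}
```

Whether a mismatch should be fatal depends on which fields the guest may legitimately adjust. The sketch only surfaces the difference so that the shim can log it or keep its requested state instead of adopting the agent's version unchecked.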
The fundamental issue across all response vectors is that the shim treats the agent as a trusted component. Every RPC response is consumed without:
- Response size validation or per-RPC size limits
- Content sanitization or escaping before writing to HTTP/logs
- Semantic validation (does the response match what was requested?)
- Rate limiting on event-style RPCs (OOM, metrics, stream reads)
- Bounds checking on returned lengths against request parameters
This trust model was designed for the "protect host from untrusted container" threat model -- the agent was part of the trusted computing base. However, in the compromised container scenario (root escape inside guest VM), the agent itself becomes adversarial, and neither direction has adequate validation. In the confidential computing use case (where the host is untrusted), this same trust inversion applies in the opposite direction.
This is the highest-impact attack vector. The guest VM has full pod network connectivity with no guest-side filtering.
- TC filter setup (src/runtime/virtcontainers/network_linux.go:940-1010) creates transparent bidirectional redirect between host veth and guest TAP
- addRedirectTCFilter() (network_linux.go:1040-1063) uses u32 match u8 0 0 (match all traffic) with TCA_EGRESS_REDIR action
- No packet filtering, no egress rules, no ingress rules at the TC level
- Network policies (Calico/Cilium) are enforced on the host veth -- they apply to traffic entering/leaving the pod, but a root guest has already bypassed the container's internal network namespace
Host Network Namespace Guest VM (Isolated Kernel)
+-------------------------------------------+ +-------------------------------------------+
| | | |
| veth<pod> ----TC REDIRECT (no filter)-----> TAP device ---> eth0 (guest) |
| (host side) <--TC REDIRECT (no filter)----- (VM side) <--- (full network stack) |
| | | | |
| Host Network Policy (Calico/Cilium) | | NO firewall rules by default |
| (iptables/eBPF on veth) | | Can reach: all cluster IPs |
| | | | Can resolve: all cluster DNS |
| Physical NIC --> cluster network | | Can use: ServiceAccount tokens |
+-------------------------------------------+ +-------------------------------------------+
- Kubernetes mounts SA token at /var/run/secrets/kubernetes.io/serviceaccount/token
- Token is shared into guest via virtiofs/9p as a regular file (src/runtime/virtcontainers/mount.go:312, src/runtime/virtcontainers/fs_share_linux.go:37-52)
- File watcher (fs_share_linux.go:817-927) even syncs token rotations from host to guest
- A root guest can read the token in plaintext
- Guest DNS resolves cluster services (DNS config passed via CreateSandboxRequest, kata_agent.go:854-862)
- Guest can resolve kubernetes.default.svc.cluster.local to API server
- With SA token + API server access: full Kubernetes API access at the pod's RBAC level
- Can list secrets, create pods, exec into other pods (depending on RBAC)
- Guest can reach any ClusterIP service
- Can perform service/port scanning across the cluster network
- Can reach NodePort services on all nodes
- Can reach external services unless egress NetworkPolicy is enforced on host
- DNS tunneling available for covert data exfiltration
- virtiofsd runs as root on host, serving the entire SharedPath directory (src/runtime/virtcontainers/virtiofsd.go:191: --shared-dir= + v.sourcePath)
- Host shared path: /run/kata-containers/shared/sandboxes/$SANDBOX_ID/shared/
- Guest mounts this at /run/kata-containers/shared/containers/
- A root guest can access all containers' shared files in the same sandbox -- no per-container isolation at virtiofs level
- Sandbox-wide bind mounts (sandboxMountsDir) visible to all containers
- Root guest can read/modify: /run/kata-containers/shared/containers/$OTHER_CONTAINER_ID/rootfs/
- Can steal data, modify binaries, or plant backdoors in other containers' rootfs
- Host uses bind mounts with slave propagation from mounts/ to shared/ (fs_share_linux.go:222-230)
- If guest creates symlinks in shared directory pointing outside the mount boundary, and host-side tools follow those symlinks, host filesystem could be accessed
- virtiofsd's --shared-dir scopes access, but symlink resolution within that scope is still dangerous
- File watcher uses fsnotify to detect ConfigMap/Secret updates (fs_share_linux.go:839-913)
- Copies updated files from host to guest on REMOVE events
- TOCTOU race: guest could exploit timing between symlink target change and file copy
- QEMU exposes multiple virtio devices to guest: virtio-net, virtio-blk, virtio-vsock, virtio-fs, virtio-console
- Root guest can craft malicious virtio descriptors targeting QEMU's device emulation code
- Historically, QEMU virtio device emulation has been a rich source of CVEs (guest-to-host escape)
- IOMMU protection is optional (src/runtime/virtcontainers/qemu.go:449-450): if q.config.IOMMU { ... }
- Without IOMMU/vIOMMU, VFIO-passthrough devices can perform arbitrary DMA to host memory
- Even with IOMMU, misconfigured IOMMU groups or IOMMUFD bugs could allow bypass
- Vhost-user devices (blk, scsi, fs, net) use shared memory regions between guest and host
- Root guest could exploit vhost-user protocol to corrupt shared memory pages
- virtiofsd uses vhost-user-fs protocol -- a malicious guest could send crafted vhost-user messages
- As described in 8.2.1, fake OOM events with arbitrary container IDs are published to containerd as `/tasks/oom` events (`src/runtime/pkg/containerd-shim-v2/service.go:325-326`)
- Kubernetes consumes these to trigger pod eviction/rescheduling
- Impact: Denial of service by triggering false OOM-based pod evictions
- `WaitProcess` RPC returns exit status from compromised agent
- Exit codes influence Kubernetes restart policy decisions (CrashLoopBackOff, etc.)
- Impact: Can prevent pod restart (return exit code 0) or force infinite restart loops
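The OOM spoofing path above depends on the shim forwarding whatever container ID the agent reports. A minimal sketch of the check, assuming the shim keeps a set of the container IDs it created itself: the `knownContainers` type and its methods are hypothetical names for illustration, and the real shim's container bookkeeping may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// knownContainers (hypothetical type) tracks the container IDs the shim
// created, so agent-reported IDs can be checked against it.
type knownContainers struct {
	mu  sync.RWMutex
	ids map[string]struct{}
}

func newKnownContainers() *knownContainers {
	return &knownContainers{ids: make(map[string]struct{})}
}

// add records a container ID when the shim creates the container.
func (k *knownContainers) add(id string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.ids[id] = struct{}{}
}

// remove forgets a container ID when the container is deleted.
func (k *knownContainers) remove(id string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.ids, id)
}

// checkOOMEvent returns an error for any agent-reported ID that the shim did
// not create; the caller would drop the event instead of publishing it.
func (k *knownContainers) checkOOMEvent(id string) error {
	k.mu.RLock()
	defer k.mu.RUnlock()
	if _, ok := k.ids[id]; !ok {
		return fmt.Errorf("dropping OOM event for unknown container %q", id)
	}
	return nil
}

func main() {
	known := newKnownContainers()
	known.add("container-a")

	// The second and third IDs simulate values injected by a compromised agent.
	for _, id := range []string{"container-a", "container-b", "../../tmp/x"} {
		if err := known.checkOOMEvent(id); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("forwarding OOM event for", id)
	}
}
```

An allowlist lookup rejects path-like or otherwise malformed IDs without needing a separate format check, since only IDs the shim issued can match.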
Container gains root in Kata VM
|
+-- 1. Read ServiceAccount token from shared filesystem
+-- 2. Resolve kubernetes.default via cluster DNS
+-- 3. Authenticate to Kubernetes API server
|   +-- List/steal secrets across namespace (if RBAC permits)
|   +-- Create new privileged pods on other nodes
|   +-- Exec into other pods
|   +-- Modify deployments, inject backdoors
|
+-- 4. Network lateral movement
|   +-- Scan cluster network (all ClusterIP services)
|   +-- Reach NodePort services on all nodes
|   +-- Exfiltrate data via DNS tunneling or egress
|
+-- 5. Manipulate agent responses
|   +-- Spoof OOM events -> trigger pod evictions
|   +-- Poison metrics -> corrupt monitoring
|   +-- Fake iptables output -> misrepresent security posture
|
+-- 6. Cross-container attack via shared virtiofs
|   +-- Read other containers' filesystems
|   +-- Modify other containers' binaries
|   +-- Plant symlinks for host-side TOCTOU attacks
|
+-- 7. Hypervisor escape attempts
    +-- Craft malicious virtio descriptors
    +-- Exploit QEMU device emulation CVEs
    +-- DMA attack via VFIO (if IOMMU disabled)
| Vector | Severity | Pre-Condition | Impact |
|---|---|---|---|
| K8s API access via SA token + pod network | CRITICAL | Root in guest + SA token exists | Full cluster compromise (RBAC-dependent) |
| VFIO DMA without IOMMU | CRITICAL | IOMMU disabled + VFIO device | Arbitrary host memory read/write |
| Stream read length mismatch | HIGH | Compromised agent | Data corruption in container logs/streams, kubectl logs output manipulation |
| No TTRPC response size limit | HIGH | Compromised agent | Shim memory exhaustion, node destabilization |
| QEMU virtio device exploit | HIGH | Root in guest + unpatched QEMU | Host code execution |
| Cross-container virtiofs access | HIGH | Root in guest | Data theft, binary tampering |
| Cluster network lateral movement | HIGH | Root in guest + no egress NetworkPolicy | Service scanning, data exfiltration |
| OOM event spoofing to Kubernetes | MEDIUM | Root in guest (cgroup manipulation) | Pod eviction DoS, host file creation |
| Volume stats response fabrication | MEDIUM | Compromised agent | Kubelet eviction decisions based on false data |
| GuestDetails version/feature spoofing | MEDIUM | Compromised agent | Shim logic manipulation, security feature bypass |
| Error message injection | MEDIUM | Compromised agent | Log poisoning, HTTP response injection |
| Metrics/iptables response poisoning | MEDIUM | Compromised agent | Monitoring corruption |
| Exit code manipulation | MEDIUM | Compromised agent | Restart policy bypass |
| Virtiofs symlink TOCTOU | MEDIUM | Root in guest + timing | Potential host file access |
| ConfigMap/Secret update race | MEDIUM | Root in guest + timing | Token/secret interception |
| Interface/route response divergence | LOW | Compromised agent | Stale shim network state, resource leaks |
| Stats JSON DoS | LOW | Compromised agent | Shim memory exhaustion |
- Minimize SA token exposure -- use `automountServiceAccountToken: false` on pods unless strictly needed; use projected volume tokens with short TTL and audience binding.
- Enforce egress NetworkPolicy -- restrict guest pod network access to only required services; block API server access unless explicitly needed.
- Enable IOMMU/vIOMMU -- always enable when using VFIO device passthrough.
- Validate agent response data -- add container ID validation in `watchOOMEvents()` against known container set; sanitize metrics strings; validate iptables output format.
- Per-container virtiofs isolation -- consider separate virtiofsd instances per container or use mount namespaces within the guest to prevent cross-container access.
- Harden QEMU attack surface -- use machine type with minimal device set; enable sandboxing (seccomp, AppArmor for QEMU process); keep QEMU patched.
- Rate-limit agent responses -- add throttling on OOM events and metrics to prevent DoS amplification.
- Restrict virtiofsd -- run with `--sandbox` mode, minimize shared directory scope, consider read-only shares where possible.
- Use read-only rootfs -- set `readOnlyRootFilesystem: true` in SecurityContext to limit guest filesystem writes.
- Restrict guest capabilities -- never run containers with `privileged: true` in Kata VMs; drop all unnecessary capabilities.
- Fix stream read length mismatch -- in `readProcessStream()` (`kata_agent.go:2514`), return `min(len(resp.Data), len(data))` instead of `len(resp.Data)` to match actual bytes copied.
- Set TTRPC max response size -- configure `ttrpc.NewClient()` with an explicit `MaxRecvMsgSize` option (e.g., 1MB) to limit memory consumption from malicious responses.
- Sanitize error messages -- escape or truncate agent error strings before writing to HTTP responses in `shim_management.go` to prevent log/response injection.
- Validate GuestDetails responses -- sanity-check `MemBlockSizeBytes`, version strings, and feature lists against expected ranges before using them in shim logic.
| Component | File Path | Purpose |
|---|---|---|
| Proto definitions | `src/libs/protocols/protos/agent.proto` | All RPC method and message definitions |
| Agent RPC handlers | `src/agent/src/rpc.rs` | All agent-side RPC implementations |
| Agent policy gate | `src/agent/src/policy.rs` | Optional authorization enforcement |
| Agent policy engine | `src/agent/policy/src/policy.rs` | Regorus OPA engine, allow_request(), set_policy() |
| Genpolicy rules | `src/tools/genpolicy/rules.rego` | Production Rego policy with deep request validation |
| Allow-all policy | `src/kata-opa/allow-all.rego` | Permissive policy (all RPCs allowed) |
| Allow-set-policy | `src/kata-opa/allow-set-policy.rego` | Bootstrap policy (only SetPolicy allowed) |
| Agent device handling | `src/agent/src/device/mod.rs` | Device add/CDI processing |
| Agent VFIO handler | `src/agent/src/device/vfio_device_handler.rs` | VFIO PCI/AP device passthrough |
| Agent storage/mounts | `src/agent/src/storage/mod.rs` | Mount and storage operations |
| Agent mount primitives | `src/agent/src/mount.rs` | Low-level mount calls |
| Agent OOM monitor | `src/agent/src/sandbox.rs` | OOM cgroup event monitoring |
| Shim TTRPC client | `src/runtime/virtcontainers/pkg/agent/protocols/client/client.go` | Transport, dial, connection |
| Shim kata-agent glue | `src/runtime/virtcontainers/kata_agent.go` | OCI spec constraining, RPC dispatch |
| Shim OOM handler | `src/runtime/pkg/containerd-shim-v2/wait.go` | OOM event consumption, file creation |
| Shim management API | `src/runtime/pkg/containerd-shim-v2/shim_management.go` | Metrics, iptables HTTP endpoints |
| Shim container/VFIO | `src/runtime/virtcontainers/container.go` | VFIO annotation, CDI metadata |
| Network setup | `src/runtime/virtcontainers/network_linux.go` | TC filter, TAP, veth, namespace |
| Veth endpoints | `src/runtime/virtcontainers/veth_endpoint.go` | Virtual ethernet pair management |
| Filesystem sharing | `src/runtime/virtcontainers/fs_share_linux.go` | Virtiofs/9p share, bind mounts, watchers |
| Virtiofsd daemon | `src/runtime/virtcontainers/virtiofsd.go` | Host-side virtiofs daemon management |
| QEMU device config | `src/runtime/virtcontainers/qemu.go` | Hypervisor device setup, IOMMU config |
| Hypervisor socket | `src/runtime/virtcontainers/hypervisor.go` | VSock port constant |
| Hypervisor socket gen | `src/runtime/virtcontainers/hypervisor_linux.go` | CID generation |