Status: prototype against Kata 3.31.0 (runtime-rs). Validated end-to-end (via shim-ctl):
a counter's in-memory state survives 3 checkpoint/restore cycles (monotonic, no reset), a
live TCP LISTEN socket survives, a ~10 MB memory buffer survives, and a checkpoint
restores in a fresh microVM (migration-style). Engine-driven: both ctr (containerd
native) and crictl (CRI, no kubelet) do the full checkpoint→restore cycle with the counter
surviving; crictl restore needs containerd ≥ 2.3.0. See Proof of concept for the branches.
Checkpoint/restore (C/R) — snapshotting a running container's process tree to disk with
CRIU and restoring it later — underpins forensic container analysis,
fast start-up, and live migration. The OCI runtime and the CRI (CheckpointContainer) both
expose it, and runc-based runtimes implement it by running criu on the host against the
container's host-visible PIDs.
Kata Containers cannot use that approach. A Kata container's processes run inside the guest
VM; the host sees only the VMM (qemu) and the shim, so host-side criu has nothing to
dump. Closing the gap requires running CRIU inside the guest and threading the engine's
existing C/R calls through the Kata stack — shim and agent — without the engine having to
know it is talking to a VM.
Run CRIU in the guest, driven by the kata-agent, and surface it through the shim's existing Task service so the container engine drives C/R unchanged:
engine: shim-ctl │ ctr c checkpoint/restore │ crictl checkpoint + create-from-ckpt
│ Task Checkpoint RPC / Create-with-checkpoint
▼
containerd-shim-kata-v2 (host)
│ ttRPC: CheckpointContainer / RestoreContainer
▼
kata-agent (guest) ──► criu dump/restore on the container's PID tree
│
CRIU image set ──(container rootfs .kata-cr → host overlay upperdir)──► host
│ rides the engine's checkpoint image (rw / criu layer)
The agent performs the CRIU work; the shim maps engine RPCs to agent verbs and moves the image set across the host/guest boundary through the container's own rootfs, so no separate transport channel is needed.
- The container init must be a session leader: CRIU requires the PID-namespace init to be a session leader in order to dump it. Handled in rustjail.
- CRIU runs in the guest, so the guest kernel must carry the CRIU-required options (`CONFIG_CHECKPOINT_RESTORE`, socket-diag modules, …). A guest-kernel config fragment is shipped for this.
- Restore needs a writable rootfs (CRIU `put_root`). A read-only rootfs (`root.readonly: true`) requires a guest writable overlay, which is not yet covered.
- The image set is transported through the container rootfs (virtio-fs over the host overlay upperdir). This assumes a virtio-fs shared rootfs; a block/`devmapper` rootfs is not covered.
- CRI restore is a create-from-checkpoint-image operation owned by the engine's CRI layer, not a CRI RPC: `crictl create` with the checkpoint tar/image as the container image. It needs containerd ≥ 2.3.0; 2.2.x panics in `CRImportCheckpoint` (a nil-map bug, fixed upstream by `9018c75d5`).
C/R is driven by whichever layer the operator already uses; the Kata stack is identical underneath.
| Driver | Checkpoint | Restore |
|---|---|---|
| `shim-ctl` (no engine) | `SHIMCTL_MODE=checkpoint` | `SHIMCTL_MODE=restore` / `cycle` |
| `ctr` (containerd native) | `ctr c checkpoint --task <id> <ref>` | `ctr c restore --live <id> <ref>` |
| `crictl` (CRI, no kubelet) | `crictl checkpoint --export=<tar> <id>` | `crictl create` with the tar as the container image (needs containerd ≥ 2.3.0) |
crictl checkpoint emits a standard CRI checkpoint archive (checkpoint/ CRIU images +
spec.dump / config.dump / rootfs-diff.tar), identical in shape to what runc-based
runtimes produce.
Two new agent verbs run CRIU in the guest:
rpc CheckpointContainer(CheckpointContainerRequest) returns (google.protobuf.Empty);
rpc RestoreContainer(RestoreContainerRequest) returns (google.protobuf.Empty);
message CheckpointContainerRequest { string container_id = 1; string image_path = 2; }
message RestoreContainerRequest { string container_id = 1; string image_path = 2; }
`CheckpointContainer` runs `criu dump` on the container init's PID tree with
`--manage-cgroups --tcp-established --ext-unix-sk --file-locks --link-remap --shell-job --leave-running`,
so established + listening TCP, UNIX sockets, and file locks survive;
`RestoreContainer` runs `criu restore --restore-detached`. Four guest-specific hazards shaped
the implementation:
- Thaw the cgroup before dumping. Engines freeze the container cgroup (their Pause→Checkpoint→Resume flow) before calling `Checkpoint`; CRIU cannot `ptrace`-seize a frozen task. The handler writes `0` to the container's `cgroup.freeze` before invoking `criu`; `criu --leave-running` then keeps the container running (CRI semantics).
- Do not `wait()` for CRIU; poll an rc-file. The agent is a sub-reaper and reaps its own children, racing any `wait()` (→ `ECHILD`). CRIU is launched via `/bin/sh -c "…; echo $? > <rcfile>"`, the child is dropped, and the handler polls the rc-file for the exit code. This also keeps the long dump off the agent's tiny (≈ guest-vCPU-count) tokio worker pool.
- Handle the guest mounts. The container's mounts are `rslave` (propagated from the guest); CRIU rejects that (“unreachable sharing”), so the dump first makes them private: `nsenter -t <pid> -m -- mount --make-rprivate /`. CRIU also cannot recreate the virtio-fs rootfs, so it is declared external: `--external mnt[/]:rootfs` at dump, re-mapped at restore with `--root <rootfs> --ext-mount-map rootfs:<rootfs> --ext-mount-map auto`. The CRI-injected file binds (`/etc/resolv.conf`, `/etc/hostname`, `/etc/hosts`) are externalized the same way at dump (exactly those three; externalizing all sub-path binds breaks restore) and re-mapped symmetrically at restore, one `--ext-mount-map extmnt_<path>:<rootfs><path>` per bind; without that, `crictl` restore aborts with `mnt: No mapping for …`. That restore-side mapping is what makes CRI-driven restore work.
- Refresh the init after restore. `criu restore --restore-detached` yields a new PID tree that is not the original init; the handler reads the new PID from `--pidfile` and updates the stored init `Process` (pid + a fresh exit channel). The detached tree reparents to the agent (a sub-reaper), so its exit is reaped normally and `kill`/`wait`/`delete` keep working.
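The thaw and rc-file steps above can be sketched with the Rust standard library alone. This is an illustration under stated assumptions, not the agent's code: `thaw_cgroup`, `run_via_rc_file`, the `cgroup_dir` parameter, the timeout, and the 20 ms poll interval are all hypothetical, and the sketch is synchronous where the agent would poll from async code.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Thaw a cgroup v2 group by writing "0" to its `cgroup.freeze` file.
/// `cgroup_dir` is a hypothetical parameter naming the container's cgroup directory.
fn thaw_cgroup(cgroup_dir: &Path) -> io::Result<()> {
    fs::write(cgroup_dir.join("cgroup.freeze"), "0")
}

/// Run `cmdline` under /bin/sh, let the shell record the exit status in
/// `rc_file`, and poll that file instead of calling wait() on the child.
/// Assumes `rc_file` contains no single quotes (it is quoted naively).
fn run_via_rc_file(cmdline: &str, rc_file: &Path, timeout: Duration) -> io::Result<i32> {
    // Remove a stale result from an earlier run; a missing file is fine.
    let _ = fs::remove_file(rc_file);

    let script = format!("{}; echo $? > '{}'", cmdline, rc_file.display());
    let child = Command::new("/bin/sh")
        .arg("-c")
        .arg(&script)
        .stdin(Stdio::null())
        .spawn()?;
    // Dropping the handle neither kills nor waits for the process; in the
    // agent, the sub-reaper is what collects it.
    drop(child);

    let deadline = Instant::now() + timeout;
    loop {
        // The file may exist but still be empty, so only a parsable
        // integer counts as "finished".
        if let Ok(text) = fs::read_to_string(rc_file) {
            if let Ok(code) = text.trim().parse::<i32>() {
                return Ok(code);
            }
        }
        if Instant::now() >= deadline {
            return Err(io::Error::new(
                io::ErrorKind::TimedOut,
                "rc-file not written in time",
            ));
        }
        sleep(Duration::from_millis(20));
    }
}
```

Because the shell writes the status itself, the caller never depends on `wait()` succeeding, which is the property the sub-reaper race makes necessary. A caller would thaw first, then pass the full `criu dump …` command line as `cmdline`.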
The agent-client ttRPC timeout for both verbs is unbounded (Some(0)); a large memory
dump/restore otherwise trips the default deadline.
The container init is made a session leader (setsid()), required by CRIU to dump the
PID-namespace init.
- Checkpoint. The Task service's `Checkpoint` RPC maps to `TaskRequest::CheckpointContainer` → `agent.CheckpointContainer`. The agent dumps into the container's own rootfs (`/run/kata-containers/<cid>/rootfs/.kata-cr`); because the rootfs is virtio-fs over the host overlay upperdir, the images appear host-side automatically. The shim then stages them where the engine expects: the engine-provided checkpoint path (`ctr`) and/or the CRI plugin's per-container state dir (`crictl`).
- Restore (create-with-checkpoint). Containerd/CRI restore is create-with-checkpoint: `CreateTaskRequest.checkpoint` is set. The shim carries this onto `ContainerConfig.checkpoint`; at `start`, if set, the container is restored (`agent.RestoreContainer`) instead of started fresh, and `Process::run_io_wait` is wired so the task-exit event fires when the restored process exits (engine sees `stopped` → `rm` works).
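The start-time branch described above reduces to a small decision on the recorded checkpoint field. The sketch below is a simplified model, not the runtime-rs types: `ContainerConfig` is cut down to one field, and `StartAction` and `plan_start` are hypothetical names. Treating an empty string as "unset" is also an assumption, made because protobuf string fields default to empty.

```rust
/// Hypothetical, trimmed-down stand-in for the shim's container config.
#[derive(Debug, Clone, Default)]
struct ContainerConfig {
    /// Host checkpoint directory carried over from `CreateTaskRequest.checkpoint`.
    checkpoint: Option<String>,
}

/// What the shim should do when the engine calls `start`.
#[derive(Debug, PartialEq, Eq)]
enum StartAction {
    /// No checkpoint recorded: start the created init as usual.
    StartFresh,
    /// Checkpoint recorded: ask the agent to restore from this guest path.
    Restore { image_path: String },
}

/// Decide between a fresh start and a restore.
/// `guest_image_path` is where the image set is visible inside the guest
/// (the `.kata-cr` directory in the container rootfs).
fn plan_start(cfg: &ContainerConfig, guest_image_path: &str) -> StartAction {
    match cfg.checkpoint.as_deref() {
        // Assumption: an empty string means the field was never set.
        Some(host_dir) if !host_dir.is_empty() => StartAction::Restore {
            image_path: guest_image_path.to_string(),
        },
        _ => StartAction::StartFresh,
    }
}
```

Keeping the decision in one pure function means both engine paths that set the checkpoint field end up on the same branch.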
CRI has no RestoreContainer RPC — only CheckpointContainer. Restore is expressed as
creating a container whose image is a checkpoint image; the runtime detects it
(checkIfCheckpointOCIImage) and restores (CRImportCheckpoint → create with restore=true,
then StartContainer performs the restore). Both CRI-O and containerd's CRI plugin implement
this. Because those implementations unpack the CRIU images host-side (for runc + host
criu), the Kata shim stages them into the guest rootfs on restore (containerd hands it
opts.Checkpoint=<host checkpoint dir>). Both ctr restore --live and crictl create-with-checkpoint
funnel into the same shim create-with-checkpoint path (CreateTaskRequest.checkpoint /
opts.Checkpoint), so both are complete today (each does a 3-cycle counter-surviving C/R) —
crictl needs containerd ≥ 2.3.0 and the agent restore-side --ext-mount-map for the CRI binds
(below).
Checkpoint — ctr c checkpoint --task <id> <ref>:
- containerd freezes the task cgroup and calls the shim `Checkpoint` Task RPC.
- Shim → `agent.CheckpointContainer(image_path = …/rootfs/.kata-cr)`.
- Agent thaws the cgroup, makes the container mounts private (`mount --make-rprivate /`), then runs `criu dump --leave-running --external mnt[/]:rootfs …` into `.kata-cr` (via sh + rc-file) and polls the rc-file.
- The images surface on the host overlay upperdir; the shim copies them into containerd's checkpoint path → containerd packages the checkpoint image (criu layer + rootfs).
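As an illustration of the dump step, the following sketch assembles a `criu dump` argument vector from the flags named in this document. The function name `criu_dump_args`, the use of `--tree` and `--images-dir` to pass the PID and output directory, and the flag order are assumptions. The extra externalization of the three CRI-injected binds is left out because its exact dump-side syntax is not given here.

```rust
/// Flags this document lists for the dump.
const DUMP_FLAGS: &[&str] = &[
    "--manage-cgroups",
    "--tcp-established",
    "--ext-unix-sk",
    "--file-locks",
    "--link-remap",
    "--shell-job",
    "--leave-running",
];

/// Build the argument vector for `criu dump` (hypothetical helper).
/// `init_pid` is the container init as seen in the guest; `images_dir`
/// is the `.kata-cr` directory inside the container rootfs.
fn criu_dump_args(init_pid: u32, images_dir: &str) -> Vec<String> {
    let mut args = vec![
        "dump".to_string(),
        "--tree".to_string(), // assumption: PID passed via --tree
        init_pid.to_string(),
        "--images-dir".to_string(), // assumption: output dir passed via --images-dir
        images_dir.to_string(),
    ];
    args.extend(DUMP_FLAGS.iter().map(|f| f.to_string()));
    // The virtio-fs rootfs cannot be recreated by CRIU, so it is declared external.
    args.push("--external".to_string());
    args.push("mnt[/]:rootfs".to_string());
    args
}
```

Joined with spaces, the vector can be handed to the sh + rc-file launcher rather than executed and waited on directly.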
Restore — ctr c restore --live <id> <ref>:
- containerd unpacks the checkpoint image into a new snapshot and calls the shim `Create` with `checkpoint` set.
- Shim records `ContainerConfig.checkpoint`; `agent.create_container` mounts the rootfs (now carrying `.kata-cr`) and creates the init.
- On `start`, the shim restores: `agent.RestoreContainer` kills the created init and runs `criu restore --restore-detached` (with `--root <rootfs>` + `--ext-mount-map auto`) from `.kata-cr`; the agent refreshes the init PID; the shim wires `run_io_wait`.
- The container resumes with its in-guest process state intact.
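The "refresh the init PID" step can be pictured as reading the pidfile CRIU wrote and swapping the stored record. This is a rough sketch only: `InitProcess` is a hypothetical stand-in for the agent's process record, and a `std::sync::mpsc` channel stands in for whatever exit channel the agent actually uses.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the agent's stored init process record.
struct InitProcess {
    pid: i32,
    exit_tx: Sender<i32>,
    exit_rx: Receiver<i32>,
}

/// Parse the PID that `criu restore --pidfile <path>` wrote.
fn read_pidfile(path: &Path) -> io::Result<i32> {
    let text = fs::read_to_string(path)?;
    match text.trim().parse::<i32>() {
        Ok(pid) if pid > 0 => Ok(pid),
        _ => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "pidfile does not hold a positive PID",
        )),
    }
}

/// Point the stored init at the restored tree: new pid, fresh exit channel.
/// On error the record is left untouched.
fn refresh_init(init: &mut InitProcess, pidfile: &Path) -> io::Result<()> {
    let pid = read_pidfile(pidfile)?;
    let (tx, rx) = channel();
    init.pid = pid;
    init.exit_tx = tx;
    init.exit_rx = rx;
    Ok(())
}
```

Replacing the channel along with the PID matters because anything still holding the old receiver would otherwise wait on the exit of a process that no longer exists.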
Checkpoint — crictl checkpoint --export=<tar> <id> (CRI, no kubelet):
- The CRI plugin calls the shim `Checkpoint` Task RPC, which is the same agent path as `ctr`.
- The agent dumps into the rootfs `.kata-cr`; the shim stages the images into the CRI plugin's per-container state dir.
- containerd packages the standard CRI checkpoint archive: `checkpoint/<criu images>` + `spec.dump` + `config.dump` + `rootfs-diff.tar`.
Restore — crictl create with the checkpoint tar as the image (works, containerd ≥ 2.3.0):
CRI has no restore call; restore is crictl create with the checkpoint tar (or image) as the
container image. containerd's CRI plugin detects it (a file-path image is logged "Assuming it
is a checkpoint archive"; or checkIfCheckpointOCIImage for an image), runs CRImportCheckpoint
which unpacks the CRIU images on the host, and at start hands the shim
opts.Checkpoint=<host checkpoint dir>. The shim's create-with-checkpoint path then runs
Container::restore → agent.restore_container → criu restore in the guest. The agent maps
each CRI-injected bind (/etc/{resolv.conf,hostname,hosts}) the checkpoint externalized back to
the new rootfs file via --ext-mount-map extmnt_<path>:<rootfs><path>; the container resumes with
state intact (3-cycle counter-surviving C/R, 17→28→39→49→54). Requires containerd ≥ 2.3.0 — on
≤ 2.2.x CRImportCheckpoint hits a nil-map panic and crashes containerd (fixed in 2.3.0, commit
9018c75d5).
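The restore-side mapping described above can be sketched as an argument builder. This is an assumption-laden illustration: `criu_restore_args` is a hypothetical name, the flag order and the use of `--images-dir` are guesses, and the `extmnt_<path>` key is formed by taking `<path>` literally as the bind's absolute path, which the text does not spell out.

```rust
/// Build the argument vector for `criu restore` (hypothetical helper).
/// `rootfs` is the new container rootfs in the guest, `images_dir` the
/// `.kata-cr` directory, and `cri_binds` the externalized bind paths.
fn criu_restore_args(
    rootfs: &str,
    images_dir: &str,
    pidfile: &str,
    cri_binds: &[&str],
) -> Vec<String> {
    let mut args: Vec<String> = [
        "restore",
        "--restore-detached",
        "--images-dir", // assumption: image set passed via --images-dir
        images_dir,
        "--pidfile",
        pidfile,
        "--root",
        rootfs,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect();

    // Re-map the external rootfs declared at dump time.
    args.push("--ext-mount-map".to_string());
    args.push(format!("rootfs:{}", rootfs));

    // One mapping per CRI-injected bind, back onto the file in the new rootfs.
    // Assumption: the key is "extmnt_" followed by the absolute bind path.
    for bind in cri_binds {
        args.push("--ext-mount-map".to_string());
        args.push(format!("extmnt_{}:{}{}", bind, rootfs, bind));
    }

    args.push("--ext-mount-map".to_string());
    args.push("auto".to_string());
    args
}
```

With the three CRI binds this yields five `--ext-mount-map` entries: the rootfs, one per bind, and `auto`.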
- One-shot CRI checkpoint image, no engine round-trip. containerd diffs the `--rw` snapshot before the `--task` dump, so a fresh dump must ride the criu layer (staged by the shim) rather than the rw layer.
- Read-only rootfs. Provide a guest writable overlay for CRIU `put_root`.
- Engine-driven cross-node migration. Single-node engine restore now works (`ctr` and `crictl`). Cross-VM restore is proven via `shim-ctl` (checkpoint in one microVM, restore in a fresh one with state intact; the image set persists on the host rw layer, and reusing the container id keeps CRIU mount paths aligned). Doing it cross-node through an engine (push/pull the checkpoint image to another host, then restore there) is the remaining piece.
Prototyped against the Kata 3.31.0 base (cec98e0) on github.com/dims/kata-containers:
- `criu-cr-containerd`: the engine-driven work (agent verbs, shim Task `Checkpoint` + create-with-checkpoint, the criu-image transport, and the agent restore-side `--ext-mount-map` for the CRI binds that makes `crictl`-driven restore work), kept as a clean 4-commit series (packaging · rustjail · agent · runtime-rs) (full diff).
- `rustjail-init-session-leader`: the standalone session-leader fix.
- `criu-checkpoint-restore`: the original no-engine prototype, comprising the agent C/R verbs + hardening + the `shim-ctl` driver that validated sockets, ~10 MB memory, and cross-VM restore.
The shim-ctl driver is the no-engine test harness; it lives in the criu-checkpoint-restore
branch, not in criu-cr-containerd.
- CRIU
- CRI forensic container checkpointing: KEP-2008 (`CheckpointContainer`)
- runc-based equivalents: containerd `internal/cri/server/container_checkpoint_linux.go` (`checkIfCheckpointOCIImage`, `CRImportCheckpoint`) + `container_create.go` (a file-path image is "Assuming it is a checkpoint archive"); CRI-O `server/container_restore.go` + `internal/lib/restore.go`
- containerd ≥ 2.3.0 required for CRI restore: the ≤ 2.2.x `CRImportCheckpoint` nil-map crash was fixed by `9018c75d5`
- Proof-of-concept branches: `criu-cr-containerd`, `rustjail-init-session-leader`, `criu-checkpoint-restore` on `github.com/dims/kata-containers`