@dims
Created June 10, 2026 17:32
Kata Containers CRIU checkpoint/restore — design + intern test guide (PoC: dims/kata-containers criu-cr-containerd)

CRIU Checkpoint / Restore for Kata Containers

Status: prototype against Kata 3.31.0 (runtime-rs). Validated end-to-end (via shim-ctl): a counter's in-memory state survives 3 checkpoint/restore cycles (monotonic, no reset), a live TCP LISTEN socket survives, a ~10 MB memory buffer survives, and a checkpoint restores in a fresh microVM (migration-style). Engine-driven: both ctr (containerd native) and crictl (CRI, no kubelet) do the full checkpoint→restore cycle with the counter surviving; crictl restore needs containerd ≥ 2.3.0. See Proof of concept for the branches.

Motivation

Checkpoint/restore (C/R) — snapshotting a running container's process tree to disk with CRIU and restoring it later — underpins forensic container analysis, fast start-up, and live migration. The OCI runtime and the CRI (CheckpointContainer) both expose it, and runc-based runtimes implement it by running criu on the host against the container's host-visible PIDs.

Kata Containers cannot use that approach. A Kata container's processes run inside the guest VM; the host sees only the VMM (qemu) and the shim, so host-side criu has nothing to dump. Closing the gap requires running CRIU inside the guest and threading the engine's existing C/R calls through the Kata stack — shim and agent — without the engine having to know it is talking to a VM.

Proposed Solution

Run CRIU in the guest, driven by the kata-agent, and surface it through the shim's existing Task service so the container engine drives C/R unchanged:

   engine:  shim-ctl │ ctr c checkpoint/restore │ crictl checkpoint + create-from-ckpt
                     │   Task Checkpoint RPC  /  Create-with-checkpoint
                     ▼
   containerd-shim-kata-v2   (host)
                     │   ttRPC: CheckpointContainer / RestoreContainer
                     ▼
   kata-agent   (guest)  ──►  criu dump/restore on the container's PID tree
                     │
   CRIU image set ──(container rootfs .kata-cr  →  host overlay upperdir)──►  host
                     │   rides the engine's checkpoint image (rw / criu layer)

The agent performs the CRIU work; the shim maps engine RPCs to agent verbs and moves the image set across the host/guest boundary through the container's own rootfs, so no separate transport channel is needed.

Assumptions and Limitations

  1. The container init must be a session leader — CRIU requires the PID-namespace init to be a session leader to dump it. Handled in rustjail.
  2. CRIU runs in the guest, so the guest kernel must carry the CRIU-required options (CONFIG_CHECKPOINT_RESTORE, socket-diag modules, …). A guest-kernel config fragment is shipped for this.
  3. Restore needs a writable rootfs (CRIU put_root). A read-only rootfs (root.readonly: true) requires a guest writable overlay — not yet covered.
  4. The image set is transported through the container rootfs (virtio-fs over the host overlay upperdir). Assumes a virtio-fs shared rootfs; block/devmapper rootfs is not covered.
  5. CRI restore is a create-from-checkpoint-image operation owned by the engine's CRI layer, not a CRI RPC — crictl create with the checkpoint tar/image as the container image. It needs containerd ≥ 2.3.0; 2.2.x panics in CRImportCheckpoint (a nil-map bug, fixed upstream by 9018c75d5).
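Assumption 2 can be checked mechanically before booting a guest. The following is a minimal sketch, assuming the guest kernel configuration is available as a plain-text file (for example the .config left by the kernel build). The function names are hypothetical. Only CONFIG_CHECKPOINT_RESTORE is named in this document, so the socket-diag symbols are left for the caller to pass in.

```shell
# kconfig_enabled FILE OPTION
# Succeed if OPTION is built in (=y) or built as a module (=m) in FILE.
kconfig_enabled() {
    grep -Eq "^$2=(y|m)$" "$1"
}

# check_guest_kconfig FILE OPTION...
# Report every requested option; return non-zero if any one is missing.
check_guest_kconfig() {
    cfg=$1; shift
    missing=0
    for opt in "$@"; do
        if kconfig_enabled "$cfg" "$opt"; then
            echo "ok      $opt"
        else
            echo "MISSING $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Example (path is a placeholder):
#   check_guest_kconfig path/to/.config CONFIG_CHECKPOINT_RESTORE
```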

End User Interface

C/R is driven by whichever layer the operator already uses; the Kata stack is identical underneath.

| Driver | Checkpoint | Restore |
| --- | --- | --- |
| shim-ctl (no engine) | SHIMCTL_MODE=checkpoint | SHIMCTL_MODE=restore / cycle |
| ctr (containerd native) | ctr c checkpoint --task <id> <ref> | ctr c restore --live <id> <ref> |
| crictl (CRI, no kubelet) | crictl checkpoint --export=<tar> <id> | crictl create with the tar as the container image (needs containerd ≥ 2.3.0) |

crictl checkpoint emits a standard CRI checkpoint archive (checkpoint/ CRIU images + spec.dump / config.dump / rootfs-diff.tar), identical in shape to what runc-based runtimes produce.
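A quick way to confirm an exported archive has that shape is to list it and look for the members named above. This is a sketch only: the function name is hypothetical, and it assumes the members sit at the top level of the tar (optionally prefixed with ./), which is not confirmed here.

```shell
# cri_archive_ok TAR
# Succeed if TAR lists spec.dump, config.dump, rootfs-diff.tar and at least
# one entry under checkpoint/. Prints the first missing member to stderr.
cri_archive_ok() {
    listing=$(tar -tf "$1") || return 1
    for member in spec.dump config.dump rootfs-diff.tar; do
        printf '%s\n' "$listing" | grep -Eq "^(\./)?$member\$" || {
            echo "missing $member" >&2
            return 1
        }
    done
    # The CRIU images live under checkpoint/; any entry below it is accepted.
    printf '%s\n' "$listing" | grep -Eq '^(\./)?checkpoint/' || {
        echo "missing checkpoint/" >&2
        return 1
    }
}
```

For example, `cri_archive_ok /tmp/cpcur.tar && echo "archive looks complete"` after a `crictl checkpoint --export=/tmp/cpcur.tar <id>`.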

Implementation Details

Kata agent changes

Two new agent verbs run CRIU in the guest:

rpc CheckpointContainer(CheckpointContainerRequest) returns (google.protobuf.Empty);
rpc RestoreContainer(RestoreContainerRequest)       returns (google.protobuf.Empty);

message CheckpointContainerRequest { string container_id = 1; string image_path = 2; }
message RestoreContainerRequest    { string container_id = 1; string image_path = 2; }

CheckpointContainer runs criu dump on the container init's PID tree with --manage-cgroups --tcp-established --ext-unix-sk --file-locks --link-remap --shell-job --leave-running — so established + listening TCP, UNIX sockets, and file locks survive; RestoreContainer runs criu restore --restore-detached. Four guest-specific hazards shaped the implementation:

  • Thaw the cgroup before dumping. Engines freeze the container cgroup (their Pause→Checkpoint→Resume flow) before calling Checkpoint; CRIU cannot ptrace-seize a frozen task. The handler writes 0 to the container's cgroup.freeze before invoking criu; criu --leave-running then keeps the container running (CRI semantics).
  • Do not wait() for CRIU — poll an rc-file. The agent is a sub-reaper and reaps its own children, racing any wait() (→ ECHILD). CRIU is launched via /bin/sh -c "…; echo $? > <rcfile>", the child is dropped, and the handler polls the rc-file for the exit code. This also keeps the long dump off the agent's tiny (≈ guest-vCPU-count) tokio worker pool.
  • Handle the guest mounts. The container's mounts are rslave (propagated from the guest); CRIU rejects that (“unreachable sharing”), so the dump first makes them private: nsenter -t <pid> -m -- mount --make-rprivate /. CRIU also cannot recreate the virtio-fs rootfs, so it is declared external — --external mnt[/]:rootfs at dump, re-mapped at restore with --root <rootfs> --ext-mount-map rootfs:<rootfs> --ext-mount-map auto. The CRI-injected file binds (/etc/resolv.conf, /etc/hostname, /etc/hosts) are externalized the same way at dump (exactly those three — externalizing all sub-path binds breaks restore) and re-mapped symmetrically at restore, one --ext-mount-map extmnt_<path>:<rootfs><path> per bind; without that, crictl restore aborts with mnt: No mapping for …. That restore-side mapping is what makes CRI-driven restore work.
  • Refresh the init after restore. criu restore --restore-detached yields a new PID tree that is not the original init; the handler reads the new PID from --pidfile and updates the stored init Process (pid + a fresh exit channel). The detached tree reparents to the agent (a sub-reaper), so its exit is reaped normally and kill/wait/delete keep working.
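The restore-side mount mapping from the third bullet can be sketched as a small argument builder. The three bind paths, the `extmnt_<path>` key shape, and the `--root` / `--ext-mount-map` flags come from the text above. The function name and the one-argument-per-line output are illustrative, and reading `<path>` as the full bind path is an assumption.

```shell
# restore_mount_args ROOTFS
# Print, one per line, the criu restore arguments that re-map the external
# rootfs and the three CRI-injected file binds onto ROOTFS.
restore_mount_args() {
    rootfs=$1
    printf '%s\n' --root "$rootfs" --ext-mount-map "rootfs:$rootfs" --ext-mount-map auto
    for bind in /etc/resolv.conf /etc/hostname /etc/hosts; do
        # Assumption: the key is "extmnt_" followed by the full bind path.
        printf '%s\n' --ext-mount-map "extmnt_$bind:$rootfs$bind"
    done
}
```

Calling `restore_mount_args /run/kata-containers/<cid>/rootfs` shows exactly which mappings a restore would need, which helps when diagnosing a `mnt: No mapping for …` failure.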

The agent-client ttRPC timeout for both verbs is unbounded (Some(0)); a large memory dump/restore otherwise trips the default deadline.
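The rc-file pattern described above (launch through a wrapper shell, never wait(), poll a file for the exit status) can be illustrated outside the agent. This is a shell sketch of the idea, not the agent's Rust code. The function names, the one-second poll interval, and the temp-file rename are choices made for this example.

```shell
# start_with_rcfile RCFILE CMD [ARGS...]
# Run CMD in the background under a wrapper shell that records CMD's exit
# status in RCFILE. The caller never wait()s on the child.
start_with_rcfile() {
    rcfile=$1; shift
    rm -f "$rcfile" "$rcfile.tmp"
    # Inside the wrapper, $0 is the rc-file path and "$@" is the command.
    # Writing to a temp name and renaming keeps a poller from reading a
    # half-written file.
    sh -c '"$@"; echo $? > "$0.tmp"; mv "$0.tmp" "$0"' "$rcfile" "$@" >/dev/null 2>&1 &
}

# poll_rcfile RCFILE [TRIES]
# Print the recorded exit status once RCFILE appears, checking once per
# second; fail if it has not appeared after TRIES checks (default 30).
poll_rcfile() {
    rcfile=$1; tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if [ -s "$rcfile" ]; then
            cat "$rcfile"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

Because the status travels through the file rather than through wait(), it does not matter which process ends up reaping the child, which is the property the agent relies on as a sub-reaper.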

rustjail changes

The container init is made a session leader (setsid()), required by CRIU to dump the PID-namespace init.

containerd-shim-kata-v2 changes

  • Checkpoint. The Task service's Checkpoint RPC maps to TaskRequest::CheckpointContainer → agent.CheckpointContainer. The agent dumps into the container's own rootfs (/run/kata-containers/<cid>/rootfs/.kata-cr); because the rootfs is virtio-fs over the host overlay upperdir, the images appear host-side automatically. The shim then stages them where the engine expects: the engine-provided checkpoint path (ctr) and/or the CRI plugin's per-container state dir (crictl).
  • Restore (create-with-checkpoint). Containerd/CRI restore is create-with-checkpoint: CreateTaskRequest.checkpoint is set. The shim carries this onto ContainerConfig.checkpoint; at start, if set, the container is restored (agent.RestoreContainer) instead of started fresh, and Process::run_io_wait is wired so the task-exit event fires when the restored process exits (engine sees stopped → rm works).

How the engines map onto this

CRI has no RestoreContainer RPC — only CheckpointContainer. Restore is expressed as creating a container whose image is a checkpoint image; the runtime detects it (checkIfCheckpointOCIImage) and restores (CRImportCheckpoint → create with restore=true, then StartContainer performs the restore). Both CRI-O and containerd's CRI plugin implement this. Because those implementations unpack the CRIU images host-side (for runc + host criu), the Kata shim stages them into the guest rootfs on restore (containerd hands it opts.Checkpoint=<host checkpoint dir>). Both ctr restore --live and crictl create-with-checkpoint funnel into the same shim create-with-checkpoint path (CreateTaskRequest.checkpoint / opts.Checkpoint), so both are complete today (each does a 3-cycle counter-surviving C/R) — crictl needs containerd ≥ 2.3.0 and the agent restore-side --ext-mount-map for the CRI binds (below).

Step by step walk-through

Checkpoint — ctr c checkpoint --task <id> <ref>:

  1. containerd freezes the task cgroup and calls the shim Checkpoint Task RPC.
  2. Shim → agent.CheckpointContainer(image_path = …/rootfs/.kata-cr).
  3. Agent thaws the cgroup, makes the container mounts private (mount --make-rprivate /), then runs criu dump --leave-running --external mnt[/]:rootfs … into .kata-cr (via sh + rc-file) and polls the rc-file.
  4. The images surface on the host overlay upperdir; the shim copies them into containerd's checkpoint path → containerd packages the checkpoint image (criu layer + rootfs).
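Step 3 can be written out as a dry-run plan that prints the guest-side commands in order without executing them. The flags are the ones listed under "Kata agent changes". Passing the PID with `-t` and the image directory with `-D` (CRIU's tree and images-dir options) is an assumption about how the agent invokes criu, and the dump-side externals for the three CRI binds are omitted because their exact arguments are not spelled out here.

```shell
# plan_checkpoint PID CGROUP_DIR IMAGES_DIR
# Print (do not run) the thaw, make-private, and dump steps for one container.
# All three arguments are placeholders supplied by the caller.
plan_checkpoint() {
    pid=$1; cgroup_dir=$2; images=$3
    # 1. Thaw: CRIU cannot seize a frozen task.
    echo "echo 0 > $cgroup_dir/cgroup.freeze"
    # 2. Make the container's mounts private inside its mount namespace.
    echo "nsenter -t $pid -m -- mount --make-rprivate /"
    # 3. Dump, leaving the container running and the rootfs external.
    echo "criu dump -t $pid -D $images --manage-cgroups --tcp-established --ext-unix-sk --file-locks --link-remap --shell-job --leave-running --external 'mnt[/]:rootfs'"
}
```

Printing rather than running keeps the sketch safe to try anywhere; the real agent additionally wraps the dump in the sh + rc-file launcher.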

Restore — ctr c restore --live <id> <ref>:

  1. containerd unpacks the checkpoint image into a new snapshot and calls the shim Create with checkpoint set.
  2. Shim records ContainerConfig.checkpoint; agent.create_container mounts the rootfs (now carrying .kata-cr) and creates the init.
  3. On start, the shim restores: agent.RestoreContainer kills the created init and runs criu restore --restore-detached (with --root <rootfs> + --ext-mount-map auto) from .kata-cr; the agent refreshes the init PID; the shim wires run_io_wait.
  4. The container resumes with its in-guest process state intact.

Checkpoint — crictl checkpoint --export=<tar> <id> (CRI, no kubelet):

  1. The CRI plugin calls the shim Checkpoint Task RPC — the same agent path as ctr.
  2. The agent dumps into the rootfs .kata-cr; the shim stages the images into the CRI plugin's per-container state dir.
  3. containerd packages the standard CRI checkpoint archive: checkpoint/<criu images> + spec.dump + config.dump + rootfs-diff.tar.

Restore — crictl create with the checkpoint tar as the image (works, containerd ≥ 2.3.0):

  1. CRI has no restore call; restore is crictl create with the checkpoint tar (or image) as the container image.
  2. containerd's CRI plugin detects it (a file-path image is logged as "Assuming it is a checkpoint archive"; an image goes through checkIfCheckpointOCIImage), runs CRImportCheckpoint, which unpacks the CRIU images on the host, and at start hands the shim opts.Checkpoint=<host checkpoint dir>.
  3. The shim's create-with-checkpoint path runs Container::restore → agent.restore_container → criu restore in the guest.
  4. The agent maps each CRI-injected bind (/etc/{resolv.conf,hostname,hosts}) that the checkpoint externalized back to the new rootfs file via --ext-mount-map extmnt_<path>:<rootfs><path>; the container resumes with state intact (3-cycle counter-surviving C/R, 17→28→39→49→54).

Requires containerd ≥ 2.3.0: on ≤ 2.2.x, CRImportCheckpoint hits a nil-map panic and crashes containerd (fixed in 2.3.0, commit 9018c75d5).

Open items

  • One-shot CRI checkpoint image, no engine round-trip. containerd diffs the --rw snapshot before the --task dump, so a fresh dump must ride the criu layer (staged by the shim) rather than the rw layer.
  • Read-only rootfs. Provide a guest writable overlay for CRIU put_root.
  • Engine-driven cross-node migration. Single-node engine restore now works (ctr and crictl). Cross-VM restore is proven via shim-ctl (checkpoint in one microVM, restore in a fresh one with state intact — the image set persists on the host rw layer, and reusing the container id keeps CRIU mount paths aligned). Doing it cross-node through an engine (push/pull the checkpoint image to another host, then restore there) is the remaining piece.

Proof of concept

Prototyped against the Kata 3.31.0 base (cec98e0) on github.com/dims/kata-containers:

  • criu-cr-containerd — the engine-driven work: agent verbs, shim Task Checkpoint + create-with-checkpoint, the criu-image transport, and the agent restore-side --ext-mount-map for the CRI binds that makes crictl-driven restore work — kept as a clean 4-commit series (packaging · rustjail · agent · runtime-rs) (full diff).
  • rustjail-init-session-leader — the standalone session-leader fix.
  • criu-checkpoint-restore — the original no-engine prototype: the agent C/R verbs + hardening + the shim-ctl driver that validated sockets, ~10 MB memory, and cross-VM restore.

The shim-ctl driver is the no-engine test harness; it lives in the criu-checkpoint-restore branch, not in criu-cr-containerd.

References

  • CRIU
  • CRI forensic container checkpointing — KEP-2008 (CheckpointContainer)
  • runc-based equivalents: containerd internal/cri/server/container_checkpoint_linux.go (checkIfCheckpointOCIImage, CRImportCheckpoint) + container_create.go (a file-path image is "Assuming it is a checkpoint archive"); CRI-O server/container_restore.go + internal/lib/restore.go
  • containerd ≥ 2.3.0 required for CRI restore — the ≤ 2.2.x CRImportCheckpoint nil-map crash was fixed by 9018c75d5
  • Proof-of-concept branches: criu-cr-containerd, rustjail-init-session-leader, criu-checkpoint-restore on github.com/dims/kata-containers

Kata CRIU Checkpoint/Restore — Intern Test Guide

Stand up a Lima VM on a Mac, build the checkpoint/restore-enabled Kata stack from the PoC branch, and run the 3× checkpoint/restore counter test under both ctr and crictl.

  • Code: branch criu-cr-containerd on github.com/dims/kata-containers (full diff) — includes the agent restore fix (in the agent: commit) so crictl restore works; cloning the branch gets it. Design: criu-checkpoint-restore-design.md.
  • Host: Apple-Silicon Mac (M3 or newer — nested virtualization is required), Homebrew installed.
  • Time: the one-time build is ~60–90 min (mostly the guest kernel + rootfs); each test takes ~5–10 min (the crictl restore reboots a pod per cycle).
  • What works: both ctr and crictl do the full 3× checkpoint→restore with the counter surviving (monotonic, never resetting). crictl restore is crictl create with the checkpoint tar as the container image (containerd's CRImportCheckpoint → the kata shim → agent → criu in the guest); it needs containerd ≥ 2.3.0 (2.2.x panics), which §3 installs.

Copy each fenced block into a file in the VM and run it; each ends with a *_DONE marker.


1. Create the Lima VM (on the Mac)

brew install lima
cat > kata.yaml <<'EOF'
vmType: vz
nestedVirtualization: true
cpus: 8
memory: 12GiB
disk: 60GiB
images:
- location: "https://cloud-images.ubuntu.com/releases/noble/release/ubuntu-24.04-server-cloudimg-arm64.img"
  arch: aarch64
mounts:
- location: "~"
EOF
limactl create --name=kata kata.yaml
limactl start kata
limactl shell kata        # <- from here on, you are inside the VM

2. Build the Kata C/R stack (in the VM) → bash build-kata-cr.sh

#!/usr/bin/env bash
set -eo pipefail
export DEBIAN_FRONTEND=noninteractive
cat > ~/.kata-build-env <<'EOF'
export PATH=/usr/local/go/bin:$HOME/.cargo/bin:$HOME/go/bin:$PATH
export GOPATH=$HOME/go
EOF

### deps + toolchains ###
sudo apt-get update -q
sudo apt-get install -y --no-install-recommends \
  build-essential gcc g++ make cmake git curl wget ca-certificates pkg-config xz-utils \
  gnupg2 file zstd python3-pip python3-dev socat \
  flex bison libelf-dev libdw-dev dwarves bc libssl-dev rsync cpio kmod patch gettext \
  clang libclang-dev libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler \
  python3-protobuf libnl-3-dev libnet-dev libcap-dev libbsd-dev libgnutls28-dev libnftables-dev \
  libdevmapper-dev musl musl-dev musl-tools makedev qemu-system-arm qemu-utils virtiofsd \
  e2fsprogs xfsprogs erofs-utils parted gdisk cryptsetup-bin dosfstools \
  mmdebstrap debootstrap runc busybox-static skopeo
curl -fsSL -o /tmp/go.tgz https://go.dev/dl/go1.25.11.linux-arm64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf /tmp/go.tgz
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --profile minimal
source ~/.kata-build-env && source "$HOME/.cargo/env"

### the PoC branch (carries the agent verbs, shim wiring, rustjail session-leader fix, kernel fragment) ###
cd ~ && rm -rf kata
git clone --depth 1 --branch criu-cr-containerd https://github.com/dims/kata-containers.git kata
( cd ~/kata/src/agent && rustup target add aarch64-unknown-linux-musl )

### CRIU 4.2 from source (noble has no criu package) ###
rm -rf ~/criu-src && git clone --depth 1 https://github.com/checkpoint-restore/criu.git ~/criu-src
make -C ~/criu-src -j"$(nproc)" criu
~/criu-src/criu/criu --version

### guest kernel — the branch's fragment sets CONFIG_CHECKPOINT_RESTORE + the socket-diag modules ###
cd ~/kata/tools/packaging/kernel
./build-kernel.sh -a aarch64 -t qemu setup
./build-kernel.sh -a aarch64 -t qemu build
sudo -E env PATH="$PATH" DESTDIR=/opt/kata PREFIX=/ ./build-kernel.sh -a aarch64 -t qemu install

### agent (musl) ###
cd ~/kata/src/agent && make SECCOMP=no
AGENT_BIN="$HOME/kata/target/aarch64-unknown-linux-musl/release/kata-agent"

### rootfs WITH criu (the binary + its runtime libs) + guest image ###
cd ~/kata/tools/osbuilder/rootfs-builder
export ROOTFS_DIR=$(realpath ./rootfs) && sudo rm -rf "$ROOTFS_DIR"
sudo -E env PATH="$PATH" OS_VERSION=noble ROOTFS_DIR="$ROOTFS_DIR" \
  AGENT_SOURCE_BIN="$AGENT_BIN" AGENT_INIT=no SECCOMP=no REPO_COMPONENTS="main universe" \
  EXTRA_PKGS="libbsd0 libgnutls30 libnftables1 libprotobuf-c1 libnl-3-200 libnet1 libuuid1 libselinux1 iproute2 nftables" \
  ./rootfs.sh ubuntu
cd ~/kata/src/agent && sudo -E env PATH="$PATH" make install DESTDIR="$ROOTFS_DIR" LIBC=musl INIT=no SECCOMP=no
sudo install -o root -g root -m 0755 ~/criu-src/criu/criu "$ROOTFS_DIR/usr/sbin/criu"
sudo modprobe loop || true
cd ~/kata/tools/osbuilder/image-builder
sudo -E env PATH="$PATH" ./image_builder.sh "$ROOTFS_DIR"
sudo install -o root -g root -m 0640 -D kata-containers.img /opt/kata/share/kata-containers/kata-containers.img

### the shim (containerd-shim-kata-v2) — `make` generates config.rs and builds it ###
cd ~/kata/src/runtime-rs && make
sudo install -o root -g root -m 0755 \
  "$(find ~/kata -path '*release/containerd-shim-kata-v2' -type f | head -1)" \
  /usr/local/bin/containerd-shim-kata-v2
sudo -E env PATH="$PATH" make install-configs PREFIX=/opt/kata
CFG=/opt/kata/share/defaults/kata-containers/runtime-rs/configuration-qemu-runtime-rs.toml
sudo sed -i \
  -e 's#^path = .*#path = "/usr/bin/qemu-system-aarch64"#' \
  -e 's#^valid_hypervisor_paths = .*#valid_hypervisor_paths = ["/usr/bin/qemu-system-aarch64"]#' \
  -e 's#^virtio_fs_daemon = .*#virtio_fs_daemon = "/usr/libexec/virtiofsd"#' \
  -e 's#^valid_virtio_fs_daemon_paths = .*#valid_virtio_fs_daemon_paths = ["/usr/libexec/virtiofsd"]#' \
  -e 's#^firmware = .*#firmware = ""#' \
  -e 's#^debug_console_enabled = .*#debug_console_enabled = true#' \
  -e 's#^reconnect_timeout_ms = .*#reconnect_timeout_ms = 30000#' \
  "$CFG"
echo "BUILD_KATA_CR_DONE"

3. Install + wire the engine (containerd + ctr + crictl) → bash setup-engine.sh

#!/usr/bin/env bash
set -eo pipefail
sudo apt-get install -y containerd                       # apt 2.2.1 = systemd unit + scaffolding; ships `ctr`
# overlay containerd 2.3.0 (2.2.x panics in CRImportCheckpoint on CRI restore; 2.3.0 fixes it via 9018c75d5)
cd /tmp && curl -fsSL -o ctd.tgz https://github.com/containerd/containerd/releases/download/v2.3.0/containerd-2.3.0-linux-arm64.tar.gz
sudo systemctl stop containerd 2>/dev/null || true
sudo tar -C /usr -xzf ctd.tgz bin/containerd bin/ctr
cd /tmp && curl -fsSL -o crictl.tgz \
  https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.31.1/crictl-v1.31.1-linux-arm64.tar.gz
sudo tar -C /usr/local/bin -xzf crictl.tgz
sudo tee /etc/crictl.yaml >/dev/null <<'EOF'
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 30
EOF
# containerd config = defaults + a `kata` CRI runtime handler
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
sudo tee -a /etc/containerd/config.toml >/dev/null <<'EOF'

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.kata]
  runtime_type = 'io.containerd.kata.v2'
EOF
# the shim reads KATA_CONF_FILE from containerd's environment
sudo mkdir -p /etc/systemd/system/containerd.service.d
sudo tee /etc/systemd/system/containerd.service.d/kata.conf >/dev/null <<'EOF'
[Service]
Environment=KATA_CONF_FILE=/opt/kata/share/defaults/kata-containers/runtime-rs/configuration-qemu-runtime-rs.toml
EOF
# containerd's CRI CheckpointContainer does a HOST-side `criu --version` gate (it's written for
# runc). kata runs criu in the guest, so this host binary is only there to satisfy the gate:
sudo install -o root -g root -m 0755 ~/criu-src/criu/criu /usr/sbin/criu
sudo systemctl daemon-reload && sudo systemctl restart containerd && sleep 3
sudo modprobe vhost_vsock
sudo ctr image pull docker.io/library/busybox:latest
sudo crictl pull docker.io/library/busybox:latest
echo "SETUP_ENGINE_DONE"
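Since CRI restore depends on containerd ≥ 2.3.0, it can be worth confirming the overlay took effect before running the tests. The helper below is a sketch with a hypothetical name; it compares up to three numeric fields and ignores any pre-release or packaging suffix, so a release candidate of 2.3.0 would count as 2.3.0.

```shell
# version_ge A B
# Succeed if dotted version A is >= B. A leading "v" and any suffix starting
# with "-", "+" or "~" are stripped before the numeric comparison.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        sub(/^v/, "", a); sub(/^v/, "", b)
        sub(/[-+~].*$/, "", a); sub(/[-+~].*$/, "", b)
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}
```

A possible use, assuming the version is the third whitespace-separated field of `containerd --version`: `version_ge "$(containerd --version | awk '{print $3}')" 2.3.0 || echo "containerd too old for CRI restore"`.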

4. Test — ctr: 3× checkpoint→restore, counter survives → bash test-ctr.sh

#!/usr/bin/env bash
set -uo pipefail
C=ctrcounter; R=io.containerd.kata.v2; IMG=docker.io/library/busybox:latest
sudo modprobe vhost_vsock
cid_of(){ for p in $(pgrep -f qemu-system); do tr '\0' ' ' </proc/$p/cmdline 2>/dev/null \
  | grep -q "sandbox-$C" && tr '\0' ' ' </proc/$p/cmdline | grep -oE 'guest-cid=[0-9]+' | cut -d= -f2; done | head -1; }
read_counter(){ local CID=$(cid_of); ( printf '%s\n' "echo CV=\$(cat /proc/\$(pgrep -f dev/shm/counter|head -1)/root/dev/shm/counter.log 2>/dev/null)"; sleep 3 ) \
  | timeout 12 sudo socat - VSOCK-CONNECT:"$CID":1026 2>&1 | grep -oE 'CV=[0-9]+' | grep -oE '[0-9]+' | head -1; }
launch(){ sudo ctr run -d --runtime $R $IMG $C sh -c 'i=0; while true; do i=$((i+1)); echo $i > /dev/shm/counter.log; sleep 1; done' >/dev/null 2>&1; }
clean(){ sudo pkill -9 -f "sandbox-$C" 2>/dev/null; sudo pkill -9 -f "id $C" 2>/dev/null; sleep 2
         sudo ctr t rm $C 2>/dev/null; sudo ctr c rm $C 2>/dev/null; sudo ctr snapshot rm $C 2>/dev/null; }

clean; sudo ctr images rm cp1:latest cp2:latest cp3:latest 2>/dev/null
sudo timeout 45 ctr run --rm --runtime $R $IMG warmup true >/dev/null 2>&1   # warm-up (first boot can be flaky)
launch; sleep 10
V=$(read_counter); [ -z "$V" ] && { clean; sleep 2; launch; sleep 12; V=$(read_counter); }
echo "LAUNCH value=$V"; prev=0
for i in 1 2 3; do
  NC=$(read_counter)
  sudo ctr c checkpoint --task $C cp$i:latest >/dev/null 2>&1; CK=$?
  sudo ctr t kill -s SIGKILL $C >/dev/null 2>&1; sleep 2; sudo ctr t rm $C >/dev/null 2>&1; sudo ctr c rm $C >/dev/null 2>&1; sleep 2
  sudo ctr c restore --live $C cp$i:latest >/dev/null 2>&1; RS=$?
  sleep 5; RC=$(read_counter)
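  # survived: the post-restore value must not fall below the checkpointed one (a reset counter would restart near 1)
  echo "CYCLE $i: at_checkpoint=$NC ckpt_rc=$CK restore_rc=$RS after_restore=$RC survived=$([ -n "$RC" ]&&[ -n "$NC" ]&&[ "$RC" -ge "$NC" ]&&echo YES||echo NO) monotonic=$([ -n "$NC" ]&&[ "$NC" -gt "$prev" ]&&echo YES||echo NO)"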
  prev=$NC
done
sleep 3; echo "FINAL value=$(read_counter)"
clean
echo "TEST_CTR_DONE"

Expected — every cycle ckpt_rc=0 restore_rc=0 survived=YES monotonic=YES, the counter climbing the whole way (never resetting to 1), e.g.:

LAUNCH value=10
CYCLE 1: at_checkpoint=12 ckpt_rc=0 restore_rc=0 after_restore=20 survived=YES monotonic=YES
CYCLE 2: at_checkpoint=22 ckpt_rc=0 restore_rc=0 after_restore=31 survived=YES monotonic=YES
CYCLE 3: at_checkpoint=34 ckpt_rc=0 restore_rc=0 after_restore=41 survived=YES monotonic=YES
FINAL value=46

(The exact numbers vary with timing; what matters is ckpt_rc=0 restore_rc=0 survived=YES monotonic=YES on every cycle. Killed ... pkill lines from the cleanup are harmless noise.)
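The pass criteria can also be checked mechanically instead of by eye. The following sketch reads saved test output on stdin and succeeds only if there are exactly three CYCLE lines and each carries the pass markers. The function name is hypothetical; it accepts either restore_rc=0 (the ctr test) or start_errs=0 (the crictl test).

```shell
# cycles_ok
# Read test output on stdin. Succeed only if exactly three CYCLE lines are
# present and every one reports a clean checkpoint, a clean restore,
# survived=YES and monotonic=YES.
cycles_ok() {
    awk '
        /^CYCLE / {
            n++
            if ($0 !~ /ckpt_rc=0 /)                  bad++
            if ($0 !~ /(restore_rc|start_errs)=0 /)  bad++
            if ($0 !~ /survived=YES/)                bad++
            if ($0 !~ /monotonic=YES/)               bad++
        }
        END { exit !(n == 3 && bad == 0) }
    '
}
```

For example: `bash test-ctr.sh | tee ctr.log; cycles_ok < ctr.log && echo PASS || echo FAIL`.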


5. Test — crictl: 3× checkpoint→restore, counter survives → bash test-crictl-restore3.sh

Same as §4 but through CRI: crictl checkpoint --export writes a tar, then crictl create with that tar as the container image restores it (containerd's CRImportCheckpoint → the kata shim → agent.restore_container → criu in the guest). Needs the containerd 2.3.0 from §3.

#!/usr/bin/env bash
set -uo pipefail
sudo modprobe vhost_vsock
for p in $(sudo crictl pods -q 2>/dev/null); do sudo crictl rmp -f "$p" >/dev/null 2>&1; done
cat > /tmp/ctr.json <<'EOF'
{"metadata":{"name":"ctrcounter"},"image":{"image":"docker.io/library/busybox:latest"},
 "command":["sh","-c","i=0; while true; do i=$((i+1)); echo $i > /dev/shm/counter.log; sleep 1; done"],
 "log_path":"c.log","linux":{"security_context":{"namespace_options":{"pid":1}}}}
EOF
cat > /tmp/restore.json <<'EOF'
{"metadata":{"name":"ctrcounter"},"image":{"image":"/tmp/cpcur.tar"},
 "log_path":"r.log","linux":{"security_context":{"namespace_options":{"pid":1}}}}
EOF
read_counter(){ local CID=$(for p in $(pgrep -f qemu-system); do tr '\0' ' ' </proc/$p/cmdline 2>/dev/null | grep -oE 'guest-cid=[0-9]+' | cut -d= -f2; done | head -1)
  ( printf '%s\n' "echo CV=\$(cat /proc/\$(pgrep -f dev/shm/counter|head -1)/root/dev/shm/counter.log 2>/dev/null)"; sleep 3 ) \
  | timeout 12 sudo socat - VSOCK-CONNECT:"$CID":1026 2>&1 | grep -oE 'CV=[0-9]+' | grep -oE '[0-9]+' | head -1; }
mkpod(){ cat > /tmp/pod.json <<EOF
{"metadata":{"name":"cr-pod","namespace":"default","attempt":1,"uid":"uid-$1"},
 "log_directory":"/tmp","linux":{"security_context":{"namespace_options":{"network":2}}}}
EOF
}
mkpod init
POD=$(sudo crictl runp --runtime kata /tmp/pod.json); sleep 5
CTR=$(sudo crictl create "$POD" /tmp/ctr.json /tmp/pod.json); sudo crictl start "$CTR" >/dev/null; sleep 18
echo "LAUNCH counter=$(read_counter)"; prev=0
for i in 1 2 3; do
  NC=$(read_counter)
  sudo crictl checkpoint --export=/tmp/cpcur.tar "$CTR" >/dev/null 2>&1; CK=$?     # CRI checkpoint -> tar
  sudo crictl rmp -f "$POD" >/dev/null 2>&1; sleep 3                               # tear the pod down
  mkpod "cyc$i"; POD=$(sudo crictl runp --runtime kata /tmp/pod.json); sleep 5     # fresh pod
  CTR=$(sudo crictl create "$POD" /tmp/restore.json /tmp/pod.json 2>/dev/null)     # create FROM the tar = restore
  RS=$(sudo crictl start "$CTR" 2>&1 | grep -ciE "fail|error"); sleep 6
  RC=$(read_counter)
  echo "CYCLE $i: at_checkpoint=$NC ckpt_rc=$CK start_errs=$RS after_restore=$RC survived=$([ -n "$RC" ]&&[ -n "$NC" ]&&[ "$RC" -ge "$NC" ]&&echo YES||echo NO) monotonic=$([ -n "$NC" ]&&[ "$NC" -gt "$prev" ]&&echo YES||echo NO)"
  prev=$NC
done
sleep 3; echo "FINAL counter=$(read_counter)"
sudo crictl rmp -f "$POD" >/dev/null 2>&1
echo "TEST_CRICTL_RESTORE_DONE"

Expected — every cycle ckpt_rc=0 start_errs=0 survived=YES monotonic=YES, the counter climbing across all three CRI restores (never resetting):

LAUNCH counter=17
CYCLE 1: at_checkpoint=19 ckpt_rc=0 start_errs=0 after_restore=28 survived=YES monotonic=YES
CYCLE 2: at_checkpoint=30 ckpt_rc=0 start_errs=0 after_restore=39 survived=YES monotonic=YES
CYCLE 3: at_checkpoint=41 ckpt_rc=0 start_errs=0 after_restore=49 survived=YES monotonic=YES
FINAL counter=54

How restore works: there's no CRI Restore RPC — crictl create with the checkpoint tar as the image is the trigger; containerd's CRImportCheckpoint unpacks it host-side and hands the kata shim opts.Checkpoint, which restores via agent.restore_container → criu in the guest. The agent maps the CRI-injected /etc/{resolv.conf,hostname,hosts} binds back with --ext-mount-map (the fix that made this work). Needs containerd ≥ 2.3.0 (§3) — 2.2.x panics in CRImportCheckpoint. See criu-checkpoint-restore-design.md → "How the engines map onto this".


Troubleshooting

  • MicroVM won't boot / vhost_vsock errors: sudo modprobe vhost_vsock (the test scripts do this).
  • First ctr run/crictl runp after a fresh boot fails: transient; just re-run (the ctr test warms up + retries).
  • After many microVM boots the VM wedges (orphaned qemu, CreateContainer fails): limactl stop kata && limactl start kata recovers it (the built stack persists on disk); then sudo modprobe vhost_vsock.
  • Read the guest by hand: CID=$(pgrep -f qemu-system | head -1 | xargs -I{} sh -c "tr \"\\0\" \" \" </proc/{}/cmdline" | grep -oE "guest-cid=[0-9]+" | cut -d= -f2); ( echo 'ps -ef'; sleep 3 ) | sudo socat - VSOCK-CONNECT:$CID:1026 (the guest debug console is on vsock port 1026).
  • A restored ctr container won't ctr t kill/rm in older builds: the branch fixes this (the agent refreshes the init pid + the shim wires the exit wait); if you hit a stuck one, sudo pkill -9 -f "sandbox-ctrcounter".
  • crictl create from a checkpoint tar crashes containerd (crictl gets EOF; containerd restarts): you're on containerd ≤ 2.2.x — CRImportCheckpoint has a nil-map panic (fixed in 2.3.0 by 9018c75d5). Check containerd --version; §3 overlays 2.3.0.
  • crictl restore fails with criu mnt: No mapping for …: the agent's restore-side --ext-mount-map for the CRI binds is missing — rebuild the agent from the current criu-cr-containerd branch (its agent: commit carries the fix).
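The "read the guest by hand" one-liner above is easier to reuse when the guest-cid extraction is split into its own helper. This is a sketch with a hypothetical function name; it relies on the same guest-cid=<n> pattern in the qemu command line that the test scripts grep for.

```shell
# guest_cid_from_cmdline
# Read a NUL-separated command line (the format of /proc/<pid>/cmdline) on
# stdin and print the first guest-cid value found, if any.
guest_cid_from_cmdline() {
    tr '\0' ' ' | grep -oE 'guest-cid=[0-9]+' | head -n 1 | cut -d= -f2
}

# Example (run inside the Lima VM while a Kata microVM is up):
#   CID=$(guest_cid_from_cmdline < "/proc/$(pgrep -f qemu-system | head -n 1)/cmdline")
#   ( echo 'ps -ef'; sleep 3 ) | sudo socat - VSOCK-CONNECT:"$CID":1026
```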