@dims
dims / criu-checkpoint-restore-design.md
Created June 10, 2026 17:32
Kata Containers CRIU checkpoint/restore — design + intern test guide (PoC: dims/kata-containers criu-cr-containerd)

CRIU Checkpoint / Restore for Kata Containers

Status: prototype against Kata 3.31.0 (runtime-rs). Validated end-to-end (via shim-ctl): a counter's in-memory state survives 3 checkpoint/restore cycles (monotonic, no reset), a live TCP LISTEN socket survives, a ~10 MB memory buffer survives, and a checkpoint restores in a fresh microVM (migration-style). Engine-driven: both ctr (containerd native) and crictl (CRI, no kubelet) do the full checkpoint→restore cycle with the counter surviving; crictl restore needs containerd ≥ 2.3.0. See Proof of concept for the branches.
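The "monotonic, no reset" criterion for the counter can be sketched as a simple check over readings taken before and after each checkpoint/restore cycle. This is a minimal illustration, not the PoC's test harness: the function name `counter_survived` and the sample values are hypothetical, and it assumes a restore that lost state would show up as a lower reading.

```python
from typing import Sequence


def counter_survived(samples: Sequence[int]) -> bool:
    """Return True if the counter never moves backwards across the samples."""
    return all(later >= earlier for earlier, later in zip(samples, samples[1:]))


# Illustrative values only: readings on either side of each of 3 cycles.
healthy = [5, 5, 9, 9, 14, 14, 20]
reset_after_restore = [5, 5, 9, 0, 3, 3, 7]

print(counter_survived(healthy))              # True
print(counter_survived(reset_after_restore))  # False
```

A non-decreasing comparison is used rather than a strictly increasing one, so that reading the same value immediately before a checkpoint and immediately after the restore still counts as state preserved.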

Motivation

@dims
dims / 2026-06-03-substrate-cross-vendor-contributors.md
Created June 4, 2026 13:35
Agent Substrate (agent-substrate/substrate) — Cross-Vendor Contributor & Affiliation Report (2026-06-03)

Agent Substrate — Cross-Vendor Contributor & Affiliation Report

Generated: 2026-06-03
Repo: agent-substrate/substrate "Agent Substrate: the core system" (public, 468★, Apache-2.0)
What it is: a system on top of Kubernetes that manages agent-like workloads at higher scale/lower latency by taking the K8s control-plane out of the critical path. Actors run in gVisor sandboxes (ateom), managed by a kubelet-like agent (atelet), with GCS checkpoint/restore (ategcs) and a router (atenet).
Window: 2026-05-13 → 2026-06-03 (~3 weeks; a brand-new seed project)
Volume analyzed: 95 commits · 117 PRs (all states) · 63 issues (all states)
Analysis basis: upstream agent-substrate/substrate@main (e26cfa22), cloned fresh; the local dims/substrate fork checkout (4cbac18) was a few commits behind.

Framing: Unlike the NVIDIA-owned reports (nvsentinel / dra-driver / aicr / OpenShell), this repo is not NVIDIA-owned — it's a

@dims
dims / external-contributor-report.md
Created June 3, 2026 14:40
OpenShell (NVIDIA/OpenShell) — External Contributor & DCO-Hygiene Report (2026-05-26)

External Contributor & DCO-Hygiene Report — nvidia/OpenShell

  • Generated: 2026-05-26
  • Repository: nvidia/OpenShell (working copy: /Users/dsrinivas/go/src/github.com/nvidia/OpenShell)
  • Total commits analyzed (full main history): 754
  • Total unique commit-author emails: 58
  • Total unique GitHub handles (resolved): 51 (excluding bots)

Methodology summary

@dims
dims / 2026-06-03-aicr-external-contributors.md
Created June 3, 2026 14:30
aicr (NVIDIA/aicr) — External Contributor & DCO-Hygiene Report (2026-06-03)

aicr — External Contributor & DCO-Hygiene Report

Generated: 2026-06-03
Repo: NVIDIA/aicr "Tooling for optimized, validated, and reproducible GPU-accelerated AI runtime in Kubernetes" (323★)
History analyzed: 2026-01-30 → 2026-06-03 (~4 months), main @ f65d7b0
Total commits analyzed: 1,205 (44 unique author emails → 35 distinct GitHub handles + 3 bots)
Analysis basis: working copy is the dims/aicr2 fork; its main HEAD (f65d7b0eddcda…) is identical to upstream NVIDIA/aicr@main, so the local history faithfully represents upstream.

Methodology: Extracted every commit author via git log (email, name, date, and Signed-off-by trailer via %(trailers)) → resolved each email to a GitHub login through the upstream commit API (GET /repos/NVIDIA/aicr/commits/{sha}.author.login) → classified each handle by (1) Helios LDAP match, (2) @nvidia.com commit email, (3) NVIDIA GitHub-org membership (`GET /orgs/NVIDIA/member

@dims
dims / 1-2026-05-29-firecracker-ateom-poc-bigbox.md
Created May 29, 2026 19:18
Agent Substrate — pluggable ateom backend: Firecracker (microVM). [1] PoC on bigbox, [2] design proposal, [3] implementation log.

Firecracker ateom Backend — Working PoC on bigbox (counter demo)

Update (2026-05-29): this standalone PoC has since been turned into a full in-repo implementation (Phases 0–3) and a cluster e2e — a counter actor on a Firecracker worker driven through the real control plane (ate-api-server + atenet), state preserved across suspend/resume, on the existing kind cluster. Branch firecracker-backend (pushed to dims/substrate, commit bc533f5; worktree ~/go/src/github.com/agent-substrate/substrate-firecracker). Full journal: ~/notes/agent-substrate/2026-05-29-firecracker-backend-implementation-log.md. The PoC notes below are retained for the from-scratch microVM bring-up details (rootfs build, Firecracker API sequence, gotchas).

  • Date: 2026-05-29 · Host: bigbox (Ubuntu 24.04, AMD EPYC 7763, nested KVM) · Firecracker: v1.15.1 · Guest kernel: vmlinux-6.1.128
  • Goal: prove a Firecracker backend can satisfy substrate's ateom Run/Checkpoint/Restore contract, preserving
@dims
dims / host-managed-imex-design-v2.md
Last active May 29, 2026 17:04
Host-managed IMEX v2 design and operator guide

Design v2: Host-Managed IMEX, Minimal Alpha

Field | Value
Status | Implementable minimal alpha
Feature gate | HostManagedIMEX
Scope | Install-wide, not per-ComputeDomain
Primary goal | Stop launching per-ComputeDomain IMEX DaemonSets when the host already runs nvidia-imex
Primary non-goal | Per-ComputeDomain channel isolation across an IMEX fabric
# set PATH and check if cluster is present (all terminals)
export PATH="$HOME/go/bin:$PATH"
kubectl version
# ============================================================
# Terminal A — keep this running, watches and port-forwards.
# ============================================================
kubectl port-forward -n ate-system svc/atenet-router 8000:80 &
kubectl port-forward -n ate-openshell-m0 svc/openshell-gateway-substrate 50051:50051 &
@dims
dims / 2026-05-11-dra-driver-nvidia-gpu-external-contributors.md
Last active May 11, 2026 18:20
dra-driver-nvidia-gpu — External Contributor Report (2026-05-11)

dra-driver-nvidia-gpu — External Contributor Report

Generated: 2026-05-11 (rev. 2, Helios cross-check added)
Repo: kubernetes-sigs/dra-driver-nvidia-gpu
Repo history: 2022-07-14 → 2026-05-11 (~3.8 years)
Total commits analyzed: 1,853 (47 unique author emails)
Methodology: Extracted all unique commit authors via git log → classified by email domain (@nvidia.com = NVIDIA, all others = candidates) → mapped commits to GitHub logins via GET /repos/.../commits/{sha} → verified every candidate against GET /orgs/NVIDIA/members/{username} (HTTP 204 = confirmed member, 404 = not a member) → for ambiguous cases, additionally cross-referenced against NVIDIA Helios LDAP (helios-cli user search) to detect NVIDIA employees who contribute via personal GitHub accounts not registered in the NVIDIA org → cross-referenced GitHub profiles, DCO Signed-off-by trailers, LinkedIn, and corporate-email patterns → folded NVIDIA-personal-e
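The first two classification steps in the methodology (email domain, then the org-membership status code) can be sketched as a pure function. This is only an illustration under stated assumptions: the function name `classify_author` and the labels it returns are hypothetical, no network call is made, and the later Helios and profile cross-checks are not modelled.

```python
from typing import Optional


def classify_author(email: str, org_member_status: Optional[int]) -> str:
    """Classify a commit author from their email and an org-membership status.

    org_member_status is the HTTP status returned by
    GET /orgs/NVIDIA/members/{username}, or None if the check was not run.
    The returned labels are hypothetical names chosen for this sketch.
    """
    if email.lower().endswith("@nvidia.com"):
        return "nvidia"                 # step 1: corporate email domain
    if org_member_status == 204:
        return "nvidia"                 # step 2: confirmed org member
    if org_member_status == 404:
        return "external-candidate"     # not a member; still needs cross-checks
    return "unresolved"                 # check not run or another status


print(classify_author("dev@nvidia.com", None))    # nvidia
print(classify_author("dev@example.org", 204))    # nvidia
print(classify_author("dev@example.org", 404))    # external-candidate
```

A 404 yields only a candidate label rather than a final verdict, because the methodology goes on to cross-reference ambiguous cases against other sources before counting someone as external.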

@dims
dims / 2026-05-10-k8s-ci-failures-triage-v3.md
Created May 11, 2026 00:44
K8s CI triage runbook + v3 flakes report + v3 failures report (2026-05-10)

Kubernetes CI Failures — Triage Report (v3, independent)

Date: 2026-05-10 (PM)
Source: failures-latest.json (HTML view: failures-latest.html)
Snapshot: 231 jobs
Method: 10 parallel cluster-investigation agents → 1 independent cross-check verifier (8 claims: 6 CONFIRMED / 2 PARTIAL / 0 REFUTED) → live PR/issue state sweep on 56 references → drift detection against 2026-05-09 snapshot.
Truly independent: no prior triage markdown was read; every claim re-derived from raw artifacts.

⚠️ Status banner:

  • 6 fix PRs merged today: k/k#138934 (coverage), k/k#138851 (ContainerMetrics), k/k#138584 (compat-versions, INCOMPLETE — needs release-1.36 cherry-pick), k/k#137936 (storage-kind), kops#18296 (upgrade-gossip), provider-aws-test-infra#550 (AMI build), cloud-provider-kind#407 (Pattern A digest pin).
  • Drift recovery: `ci-kubernetes-e2e
@dims
dims / 2026-05-05-kubernetes-security-findings.md
Last active May 5, 2026 18:08
Kubernetes Security Findings — May 2026

Kubernetes Security Findings — May 2026

Repository: kubernetes/kubernetes
Commit: 47f990437458a2b171f51b5e97a0c28c81d949d1 (master, 2026-05-05)
Methods: Static multi-agent source review (87 files across 4 researchers) + dynamic execution harness (kubectl, 3 agents)
Subsystems: authentication, authorization/RBAC, admission control/webhooks, node authorization (NodeAuthorizer + DRA graph)


Table of Contents