@dims
Last active May 29, 2026 17:04
Host-managed IMEX v2 design and operator guide

Design v2: Host-Managed IMEX, Minimal Alpha

| Field | Value |
| --- | --- |
| Status | Implementable minimal alpha |
| Feature gate | HostManagedIMEX |
| Scope | Install-wide, not per-ComputeDomain |
| Primary goal | Stop launching per-ComputeDomain IMEX DaemonSets when the host already runs nvidia-imex |
| Primary non-goal | Per-ComputeDomain channel isolation across an IMEX fabric |

1. Summary

HostManagedIMEX is a narrow mode for clusters where the operator already owns the host nvidia-imex daemon lifecycle. When enabled, the driver keeps the existing ComputeDomain user API and the existing DRA channel injection path, but stops creating the in-cluster compute-domain-daemon DaemonSets.

The smallest safe version is intentionally limited:

  • one schedulable IMEX channel device per node: channel-0
  • one prepared channel claim per node
  • one active host-managed ComputeDomain per host IMEX domain/fabric when isolation matters
  • allocationMode: Single only
  • no driver-managed host IMEX readiness or peer discovery

This version reuses the code the repo already has:

  • existing ComputeDomain CRD
  • existing ComputeDomainChannelConfig
  • existing workload ResourceClaimTemplate rendering
  • existing compute-domain-default-channel.nvidia.com DeviceClass
  • existing checkpoint V2 and CDI prepare/unprepare flow
  • existing node-local channel conflict check for channel 0

One caveat about the "reuse" framing: the existing channel-prepare code in device_state.go:551-554,586 already supports AllocationMode: All by slicing nvCapImexChanDevInfos[:maxImexChannelCount] and injecting every channel chardev. Host-managed v2 must actively suppress that branch (see §7.2 step 3 and §14.1); the gate doesn't just inherit "what's already there."

It deliberately does not add an allocator CRD, new status fields, checkpoint V3, ResourceClaim finalizers, mandatory webhook behavior, or multi-slot ResourceSlice publishing.

2. Why This Shape

The current repo already publishes channel-0 and the Helm chart already installs a DeviceClass that selects only that channel. The controller already creates a workload ResourceClaimTemplate whose opaque config carries the ComputeDomain UID and allocation mode. The kubelet plugin already writes a checkpoint, creates a CDI spec, and injects /dev/nvidia-caps-imex-channels/channel0.

The high-footprint design work was needed to support multiple concurrent ComputeDomains on the same IMEX fabric with unique channel IDs. That is not part of this alpha. Dropping that requirement removes the need for:

  • an IMEXChannelAllocation CRD and reaper
  • clique-wide optimistic concurrency
  • new ComputeDomain.status.channels or conditions
  • new per-claim finalizers
  • a checkpoint schema migration
  • a scheduler-visible slot model
  • live Kubernetes lookups in admission

The tradeoff is explicit: v2 is a small operational mode, not a complete multi-tenant isolation design.

3. Contract

3.1 Driver-owned behavior

With HostManagedIMEX=true, the driver:

  • watches ComputeDomain objects
  • adds/removes the existing ComputeDomain finalizer (resource.nvidia.com/computeDomain) and lets the existing workload-RCT manager add/remove its own RCT finalizer (unchanged)
  • creates the workload ResourceClaimTemplate
  • publishes a per-node ResourceSlice with channel-0
  • prepares channel claims by injecting channel 0 through CDI
  • accepts only empty or allocationMode: Single during Prepare
  • ignores (does not reject) spec.numNodes
  • rejects (with a permanent error) host-managed channel prepare on nodes whose local NVML clique ID is empty. This is new behavior, not a reuse of the existing silent-skip path at device_state.go:581-584 — that path currently returns an empty configState without an error, which would mask a misconfigured node. Host-managed mode treats "no clique" as a hard prepare failure so the operator sees the misconfiguration immediately.
  • maintains existing checkpoint/CDI cleanup behavior

The driver does not:

  • create per-ComputeDomain DaemonSets
  • create daemon ResourceClaimTemplates
  • prepare daemon claims
  • run or restart nvidia-imex
  • write nodes_config.cfg
  • update node labels for a ComputeDomain
  • wait for ComputeDomainClique or daemon readiness before preparing workloads
  • prove host IMEX health in Kubernetes status

3.2 Operator-owned behavior

The operator owns everything below the DRA boundary:

  • installing the nvidia-imex package
  • configuring and starting nvidia-imex.service
  • populating /etc/nvidia-imex/nodes_config.cfg
  • ensuring IMEX peers agree on the node map
  • ensuring the IMEX channel kernel device major is registered before the kubelet plugin starts
  • ensuring channel 0 is usable on every participating node
  • monitoring nvidia-imex health with host tooling
  • draining workloads before restarting or reconfiguring host IMEX
  • preventing multiple active isolated jobs from sharing channel 0 on the same host IMEX domain

3.3 User-visible behavior

Users still create a ComputeDomain and use the generated ResourceClaimTemplate:

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: train-a
spec:
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: train-a-imex-channel
    allocationMode: Single

The ComputeDomain remains namespaced. Its generated workload ResourceClaimTemplate remains namespaced with the ComputeDomain.

status.status: Ready in host-managed mode means only that the controller has admitted the ComputeDomain and the workload ResourceClaimTemplate exists. It does not mean host IMEX is running, connected, or healthy; it does not mean any future Prepare will succeed.

4. Non-Goals

This alpha does not support:

  • multiple simultaneous isolated ComputeDomains on one host IMEX fabric
  • assigning unique IMEX channel IDs per ComputeDomain
  • allocationMode: All
  • publishing slot-0..slot-N abstract devices
  • creating channels dynamically after plugin startup
  • waiting for host IMEX health before Prepare
  • per-ComputeDomain or per-namespace mode selection
  • in-place migration while ComputeDomain workloads are running
  • automatic cleanup of stale objects from a previous driver-managed install
  • webhook-enforced host-managed policy

The beta path can add a real allocator if multi-tenant isolation is required. That should be a separate design because it introduces API, status, RBAC, and lifecycle complexity that this v2 intentionally avoids.

5. Feature Gate

Add a project feature gate:

HostManagedIMEX featuregate.Feature = "HostManagedIMEX"

Default:

Default:    false
PreRelease: featuregate.Alpha
Version:    version.MajorMinor(0, 5)

The gate is install-wide. Mixed driver-managed and host-managed ComputeDomains in the same Helm release are not supported.

When HostManagedIMEX is enabled, resolve feature gates before the existing dependency validation runs. Concretely, extend pkg/featuregates/featuregates.go to call a new override helper before ValidateFeatureGates:

// resolveHostManagedIMEXOverrides forces IMEXDaemonsWithDNSNames and
// ComputeDomainCliques off when HostManagedIMEX is on. Runs BEFORE
// ValidateFeatureGates so the existing "ComputeDomainCliques implies
// IMEXDaemonsWithDNSNames" dependency rule trivially holds afterwards.
func resolveHostManagedIMEXOverrides(gates featuregate.MutableFeatureGate) {
    if !gates.Enabled(HostManagedIMEX) {
        return
    }
    if gates.Enabled(IMEXDaemonsWithDNSNames) {
        klog.Info("HostManagedIMEX is enabled; forcing IMEXDaemonsWithDNSNames=false")
        // Surface a failed override instead of silently leaving the gate on.
        if err := gates.Set("IMEXDaemonsWithDNSNames=false"); err != nil {
            klog.Errorf("failed to force IMEXDaemonsWithDNSNames=false: %v", err)
        }
    }
    if gates.Enabled(ComputeDomainCliques) {
        klog.Info("HostManagedIMEX is enabled; forcing ComputeDomainCliques=false")
        if err := gates.Set("ComputeDomainCliques=false"); err != nil {
            klog.Errorf("failed to force ComputeDomainCliques=false: %v", err)
        }
    }
}

Both defaults are true upstream — the helper explicitly sets them to false, it does not "reset to default."

Resolved gate values when the operator sets HostManagedIMEX=true:

HostManagedIMEX=true
IMEXDaemonsWithDNSNames=false  (forced)
ComputeDomainCliques=false      (forced)
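
The ordering requirement (resolve first, validate second) can be checked in isolation. The sketch below is a minimal model, not the driver's code: gateSet, resolveOverrides, and validateGates are hypothetical names, and a plain map stands in for featuregate.MutableFeatureGate so the example runs without Kubernetes dependencies.

```go
package main

import "errors"

// gateSet is a simplified stand-in for featuregate.MutableFeatureGate
// (an assumption made so the sketch needs only the standard library).
type gateSet map[string]bool

// resolveOverrides mirrors the override helper: when HostManagedIMEX is on,
// it forces the two gates off and returns the names it changed so the
// caller can log them at startup.
func resolveOverrides(g gateSet) []string {
	var forced []string
	if !g["HostManagedIMEX"] {
		return forced
	}
	for _, name := range []string{"IMEXDaemonsWithDNSNames", "ComputeDomainCliques"} {
		if g[name] {
			g[name] = false
			forced = append(forced, name)
		}
	}
	return forced
}

// validateGates models the dependency rule named in the helper's comment:
// ComputeDomainCliques implies IMEXDaemonsWithDNSNames.
func validateGates(g gateSet) error {
	if g["ComputeDomainCliques"] && !g["IMEXDaemonsWithDNSNames"] {
		return errors.New("ComputeDomainCliques requires IMEXDaemonsWithDNSNames")
	}
	return nil
}
```

Calling resolveOverrides before validateGates on a set where all three gates are true leaves both dependent gates false, so the validator passes. With HostManagedIMEX off, the set is returned unchanged.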

Host-managed mode needs exactly one Helm flag of its own:

--set featureGates.HostManagedIMEX=true

The operator may need additional pre-existing chart-level flags to scope the install (these are not introduced by this design and not specific to host-managed mode):

--set resources.computeDomains.enabled=true   # chart default; pass for clarity
--set resources.gpus.enabled=false            # only if GPU DRA is not wanted

If the operator wants both compute-domains and GPU DRA in the same chart, leave resources.gpus.enabled=true (chart default) and pass --set gpuResourcesEnabledOverride=true — that's an existing chart safety guard unrelated to host-managed IMEX.

The controller and kubelet plugin must log the resolved gate overrides at startup. This keeps the existing defaults for normal driver-managed mode while avoiding a three-gate operator recipe for host-managed mode.

6. Controller Design

6.1 Add/update path

For a non-deleting ComputeDomain, host-managed mode does this:

  1. Fetch the current ComputeDomain by UID.
  2. Add the existing ComputeDomain finalizer if missing. The constant is resource.nvidia.com/computeDomain, defined at cmd/compute-domain-controller/computedomain.go:50-52 as computeDomainFinalizer = computeDomainLabelKey. The same string is also the node-label key; this overload is in the existing codebase. Implementers should reuse computeDomainFinalizer, not introduce a new constant.
  3. Create the workload ResourceClaimTemplate using the existing WorkloadResourceClaimTemplateManager.Create. That helper already adds and tracks its own finalizer on the workload RCT; no new RCT-level finalizer logic is required for v2.
  4. Set ComputeDomain.status.status=Ready.

It skips:

  • MultiNamespaceDaemonSetManager.Create — this is the wrapper around per-namespace DaemonSetManager instances. The daemon ResourceClaimTemplate creation lives inside DaemonSetManager.Create (via NewDaemonSetResourceClaimTemplateManager at daemonset.go:103,161), so skipping the wrapper implicitly skips daemon-RCT creation. There is no separate daemon-RCT call site to skip.
  • stale node-label cleanup (NodeManager.RemoveStaleComputeDomainLabelsAsync)
  • NodeManager (constructed via NewNodeManager — there is no ComputeDomainNodeManager type in the repo)
  • ComputeDomainStatusManager
  • ComputeDomainCliqueManager
  • status calculation from status.nodes and spec.numNodes

spec.numNodes is ignored for host-managed status. Operators should still set it to 0 because the field is deprecated and has no host-managed readiness meaning.
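
The add path above can be sketched as a small function. This is an illustration under stated assumptions, not the controller's code: computeDomain is a pared-down stand-in for the CRD type, rctCreator is a hypothetical narrow interface over WorkloadResourceClaimTemplateManager.Create, and fakeRCTs exists only to exercise the sketch.

```go
package main

// computeDomainFinalizer matches the constant named in step 2.
const computeDomainFinalizer = "resource.nvidia.com/computeDomain"

// computeDomain is a pared-down stand-in for the real CRD type (assumption).
type computeDomain struct {
	Name       string
	Finalizers []string
	Status     string
}

// rctCreator is a hypothetical seam over the workload RCT manager.
type rctCreator interface {
	Create(cd *computeDomain) error
}

// fakeRCTs records Create calls; it is a test double, not driver code.
type fakeRCTs struct{ created []string }

func (f *fakeRCTs) Create(cd *computeDomain) error {
	f.created = append(f.created, cd.Name)
	return nil
}

// reconcileHostManagedAdd follows steps 2-4: finalizer, workload RCT, Ready.
// No DaemonSet, node-label, status, or clique manager is consulted.
func reconcileHostManagedAdd(cd *computeDomain, rcts rctCreator) error {
	hasFinalizer := false
	for _, f := range cd.Finalizers {
		if f == computeDomainFinalizer {
			hasFinalizer = true
			break
		}
	}
	if !hasFinalizer {
		cd.Finalizers = append(cd.Finalizers, computeDomainFinalizer)
	}
	if err := rcts.Create(cd); err != nil {
		return err // status stays untouched so the next reconcile retries
	}
	cd.Status = "Ready"
	return nil
}
```

Running the function twice on the same object adds the finalizer only once, which is the property a reconcile loop needs.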

6.2 Delete path

For a deleting ComputeDomain, host-managed mode does this:

  1. Delete the workload ResourceClaimTemplate.
  2. Remove the finalizer from the workload ResourceClaimTemplate.
  3. Assert the workload ResourceClaimTemplate is gone.
  4. Remove the ComputeDomain finalizer.
  5. Forget metrics for that ComputeDomain.

It does not delete DaemonSets, daemon claim templates, node labels, or ComputeDomainClique objects because it never creates them in this mode.

Migration relies on deleting all existing ComputeDomains before flipping the gate. v2 does not include a startup sweep for legacy objects.

6.3 Controller files touched

Expected small code surface:

| File | Change |
| --- | --- |
| pkg/featuregates/featuregates.go | Add HostManagedIMEX and resolved override helper |
| cmd/compute-domain-controller/computedomain.go | Branch add/delete reconciliation when the gate is enabled |
| cmd/compute-domain-controller/controller.go | Avoid constructing or starting daemon/node/status/clique managers if needed by the implementation |
| cmd/compute-domain-controller/*_test.go | Add host-managed controller tests |

No CRD type or generated client changes are required.

7. Kubelet Plugin Design

7.1 ResourceSlice

Host-managed mode publishes the current channel-zero device and no daemon device:

channel-0
  attributes:
    compute-domain.nvidia.com/type = "channel"
    compute-domain.nvidia.com/id   = 0

The existing compute-domain-default-channel.nvidia.com DeviceClass remains valid because it already selects type == "channel" and id == 0.

The plugin does not publish slot-* devices and does not expose more than one IMEX channel per node.

7.2 Prepare path

For ComputeDomainChannelConfig, host-managed mode keeps the current prepare ordering:

  1. Decode and validate opaque config from allocation status.
  2. Require exactly one allocation result.
  3. Reject any AllocationMode value that is not the empty string and not Single. Today's ComputeDomainChannelConfig.Validate() at computedomainconfig.go:49-55 only checks that DomainID is non-empty, so an unknown AllocationMode like "foo" currently falls through and is silently treated as Single (because the only branch in applyComputeDomainChannelConfig checks == "All"). Under HostManagedIMEX, Prepare must add an explicit allowlist check on the opaque-config value rather than relying on the if AllocationMode == "All" branch alone. The CRD-level enum (computedomain.go:100 +kubebuilder:validation:Enum=All;Single) only protects user-created ComputeDomain objects, not the opaque configs the kubelet sees.
  4. Build DeviceConfigState with Type=channel and ComputeDomain=<domainID>.
  5. Check the local checkpoint for an existing completed allocation of channel 0.
  6. Assert the ComputeDomain exists in the same namespace as the claim.
  7. Require a non-empty local clique ID. If s.computeDomainManager.cliqueID == "", return a permanent error ("host-managed IMEX requires an NVLink clique on this node; NVML reports none"). This replaces the existing silent-skip behavior at device_state.go:581-584 for the host-managed code path only; the gate-off path is unchanged.
  8. Append CDI edits for nvCapImexChanDevInfos[0].
  9. Let the existing checkpoint/CDI code mark PrepareCompleted.

Host-managed mode skips:

  • AddNodeLabel
  • AssertComputeDomainReady
  • any call to daemon settings Prepare
  • any /imexd mount generation

The node-local channel conflict check remains intentionally strict. If a second ResourceClaim lands on the same node while a completed channel-zero claim is prepared, Prepare fails via the existing checkpoint conflict path. This design does not require changing that existing conflict into a permanent error; the scheduler should normally prevent the conflict, and the checkpoint check remains the node-local backstop.

7.3 Unprepare path

Unprepare keeps the existing checkpoint-driven cleanup:

  1. Read the prepared claim from checkpoint.
  2. Delete the generated CDI spec.
  3. Remove the checkpoint entry.

Host-managed mode skips RemoveNodeLabel. It does not touch host nvidia-imex.
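
The interaction between the node-local conflict check and checkpoint-driven cleanup can be modeled with a toy in-memory checkpoint. This is a sketch only: the real checkpoint is the existing V2 on-disk format, the checkpoint type and its methods below are hypothetical, and treating a repeated Prepare for the same claim as a no-op is an assumption of the sketch.

```go
package main

import "fmt"

// checkpoint is a toy model of the node-local checkpoint:
// device name -> UID of the claim whose prepare completed.
type checkpoint map[string]string

// prepareChannel0 records claimUID as the holder of channel-0. Repeating the
// call for the same claim succeeds (assumed idempotent); a different claim
// gets a conflict error while the first one is still prepared.
func (c checkpoint) prepareChannel0(claimUID string) error {
	const dev = "channel-0"
	if holder, ok := c[dev]; ok && holder != claimUID {
		return fmt.Errorf("%s is already prepared for claim %s", dev, holder)
	}
	c[dev] = claimUID
	return nil
}

// unprepare drops whatever the claim holds. Unknown claims are a no-op,
// so the call can be retried safely.
func (c checkpoint) unprepare(claimUID string) {
	for dev, holder := range c {
		if holder == claimUID {
			delete(c, dev)
		}
	}
}
```

Once the first claim is unprepared, a second claim can take channel-0 on that node; nothing in this flow touches host nvidia-imex.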

7.4 Daemon configs

In host-managed mode, ComputeDomainDaemonConfig should not be allocated because:

  • the controller does not create daemon claim templates
  • Helm does not render the daemon DeviceClass
  • the plugin does not publish daemon devices

If a stale or manually created daemon claim reaches Prepare, the plugin returns a permanent error explaining that daemon claims are disabled under HostManagedIMEX.

7.5 Kubelet files touched

Expected small code surface:

| File | Change |
| --- | --- |
| cmd/compute-domain-kubelet-plugin/driver.go | Do not publish daemon devices when the gate is enabled |
| cmd/compute-domain-kubelet-plugin/device_state.go | Add the host-managed branch in channel prepare/unprepare and reject daemon prepare |
| cmd/compute-domain-kubelet-plugin/device_state_test.go | Cover Single, All, no-clique, and channel conflict cases |

No checkpoint schema change is required.

8. Helm Design

Use the existing featureGates values map:

featureGates:
  HostManagedIMEX: true

Template changes:

| Template | Host-managed behavior |
| --- | --- |
| controller.yaml | No new env var; FEATURE_GATES is already plumbed |
| kubeletplugin.yaml | No new env var; FEATURE_GATES is already plumbed |
| deviceclass-compute-domain-default-channel.yaml | Keep rendering |
| deviceclass-compute-domain-daemon.yaml | Do not render |
| rbac-compute-domain-daemon.yaml | Do not render |

No Helm values for slotsPerNode, maxIMEXChannels, allocator reaper intervals, webhook requirements, or mode markers are added.

The chart should not require webhook.enabled=true for HostManagedIMEX. The current webhook may remain enabled for existing GPU-driver behavior, but this v2 does not depend on it and does not extend it into a host-managed compute-domain admission contract. Host-managed safety is enforced in the controller and kubelet plugin.

8.1 Beta follow-up: optional admission validation (non-blocking)

v2 enforces host-managed policy at Prepare (a kubelet-plugin permanentError), so an invalid claim is accepted by the API server and surfaces later as a pod-level event rather than being rejected synchronously at kubectl apply. The earlier v1 design closed this UX gap with a mandatory webhook; v2 deliberately does not, to avoid the cert-manager requirement, the webhook.enabled=true chart fail-guard, the ComputeDomainChannelConfig schema change (the DomainNamespace/DomainName triple), and the live-Get RBAC that came with it.

A clean beta follow-up — separate from this alpha — is an optional admission rule (default-off, no chart fail-guard, reusing the existing config schema) that mirrors the kubelet-plugin allowlist for HostManagedIMEX:

  • reject ComputeDomainChannelConfig.AllocationMode not in {"", Single}
  • reject ComputeDomainDaemonConfig opaque configs
  • optionally reject obvious multi-device shapes (exactly.count > 1, firstAvailable[*])

This would convert those Prepare-time pod failures into immediate kubectl apply errors. It must stay advisory/defense-in-depth: the kubelet-plugin permanentError paths remain the source of truth (they also cover the upgrade-skew and pre-existing-claim windows a webhook cannot), and the gate must never require the webhook to be enabled.

9. Scheduling and Isolation

The scheduler sees only per-node channel-zero capacity. That means Kubernetes can prevent two separate channel claims from being allocated to the same node, but it cannot prevent two different ComputeDomains on different nodes from using the same host IMEX channel in the same fabric.

The isolation rule for v2 is therefore operational:

Run at most one active isolated host-managed ComputeDomain per host IMEX domain/fabric.

If two workloads intentionally share the same IMEX communication domain, they can use the same ComputeDomain and the same host IMEX configuration. If they need isolation, v2 is not sufficient.

This matches NVIDIA IMEX channel behavior: channel-based isolation requires consistent channel assignment across all nodes, and broad access to channel 0 means workloads are not isolated from each other.
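
Because the driver cannot enforce the one-active-ComputeDomain-per-fabric rule, an operator may want an out-of-band audit. The sketch below shows one possible shape for such a check. The activeDomain type and overSubscribedFabrics function are hypothetical, and how FabricID is derived (for example from a clique label on the nodes running the pods) is a site-specific assumption.

```go
package main

import "sort"

// activeDomain pairs a ComputeDomain with a host IMEX fabric on which it
// currently has running workloads. One entry per (domain, node) is fine.
type activeDomain struct {
	Namespace, Name, FabricID string
}

// overSubscribedFabrics returns, in sorted order, the fabric IDs that have
// more than one distinct active ComputeDomain. Those fabrics violate the
// v2 operational isolation rule.
func overSubscribedFabrics(domains []activeDomain) []string {
	perFabric := map[string]map[string]struct{}{}
	for _, d := range domains {
		if perFabric[d.FabricID] == nil {
			perFabric[d.FabricID] = map[string]struct{}{}
		}
		perFabric[d.FabricID][d.Namespace+"/"+d.Name] = struct{}{}
	}
	var out []string
	for fabric, cds := range perFabric {
		if len(cds) > 1 {
			out = append(out, fabric)
		}
	}
	sort.Strings(out)
	return out
}
```

The same ComputeDomain appearing on several nodes of one fabric is counted once, which matches the case where workloads intentionally share a communication domain.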

10. Status and Observability

ComputeDomain.status.status has weak semantics in host-managed mode:

| Field | Meaning under HostManagedIMEX |
| --- | --- |
| status.status=Ready | Controller admitted the ComputeDomain and the workload ResourceClaimTemplate exists. Says nothing about host IMEX or whether a future Prepare will succeed. |
| status.nodes | Not populated by host-managed mode |
| Host IMEX health | Not represented |
| Channel ID | Always 0, not recorded in API status |

Operators must monitor host IMEX directly, for example with systemctl status nvidia-imex, logs, and nvidia-imex-ctl -N when the command service is enabled.

The driver can expose ordinary controller/plugin logs and existing DRA metrics, but it does not scrape host IMEX health in this version.

11. Boot Order

The minimal implementation keeps the current startup assumption: the kubelet plugin must be able to discover the nvidia-caps-imex-channels device major when it starts.

Required host state before the plugin starts:

  • NVIDIA driver loaded
  • /proc/devices contains nvidia-caps-imex-channels
  • channel 0 can be used by workloads
  • host nvidia-imex.service is configured and running for real workloads

v2 does not add lazy channel discovery, fsnotify, or ResourceSlice republish when host channels appear later. If the host state is missing, fix the node and restart the kubelet plugin.
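
The /proc/devices precondition can be verified before the plugin starts. A minimal sketch of the lookup, assuming the usual /proc/devices layout of a "Character devices:" section followed by "major name" lines; charDeviceMajor is a hypothetical helper, not a function from the driver.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// charDeviceMajor scans /proc/devices-formatted text and returns the major
// number registered for name in the "Character devices:" section.
func charDeviceMajor(procDevices, name string) (int, bool) {
	inChar := false
	sc := bufio.NewScanner(strings.NewReader(procDevices))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch line {
		case "Character devices:":
			inChar = true
			continue
		case "Block devices:":
			inChar = false
			continue
		}
		if !inChar {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[1] == name {
			if major, err := strconv.Atoi(fields[0]); err == nil {
				return major, true
			}
		}
	}
	return 0, false
}
```

A pre-start check would read the file with os.ReadFile("/proc/devices") and refuse to proceed when charDeviceMajor(string(data), "nvidia-caps-imex-channels") reports false. The major number is assigned dynamically, so only its presence is meaningful.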

12. Migration

Gate flips are stop-the-world operations.

12.1 Driver-managed to host-managed

  1. Cordon/drain all nodes that run ComputeDomain workloads.
  2. Delete all ComputeDomain objects.
  3. Wait for generated DaemonSets and claim templates to disappear.
  4. Start and validate host nvidia-imex.service on every participating node.
  5. Upgrade Helm with featureGates.HostManagedIMEX=true.
  6. Recreate ComputeDomains with numNodes: 0 and allocationMode: Single.
  7. Uncordon and resubmit workloads.

12.2 Host-managed to driver-managed

  1. Cordon/drain ComputeDomain workloads.
  2. Delete all ComputeDomain objects.
  3. Stop and mask host nvidia-imex.service so it cannot conflict with the driver-managed daemon pods.
  4. Upgrade Helm with featureGates.HostManagedIMEX=false.
  5. Recreate ComputeDomains for driver-managed mode.
  6. Uncordon workloads.

No in-place adoption is provided. Running workloads are not migrated.

13. Failure Modes

| Failure | Result | Owner |
| --- | --- | --- |
| Host IMEX is down at plugin start or at workload run time | Pod may start, CUDA IMEX operations fail. Driver never restarts host IMEX. | Operator |
| nodes_config.cfg differs across nodes | IMEX domain stays down or degraded; not visible from K8s | Operator |
| nodes_config.cfg missing or empty | nvidia-imex.service exits or fails its ConditionPathExists; driver still publishes channel-0 but workloads fail at CUDA time | Operator |
| Channel major (nvidia-caps-imex-channels) missing in /proc/devices at plugin start | Kubelet plugin fails to initialize (existing behavior, not changed by v2) | Operator |
| Host channel0 chardev missing but channel major is registered | Prepare can still succeed because CDI emits mknod instructions inside the container from major/minor. Treat this as a host setup/observability warning, not by itself as proof the workload will fail. | Operator |
| allocationMode: All (or any unknown non-empty value) | Channel claim Prepare fails permanently | User/operator |
| Unknown AllocationMode opaque-config string (e.g. "foo") | Existing Validate() only checks DomainID; under the gate, Prepare adds an explicit allowlist check and rejects | Driver (under gate) |
| Two channel claims on one node | Second Prepare fails due to checkpoint conflict (existing behavior) | Driver |
| Two isolated CDs on same fabric | Both can use channel 0; isolation is not guaranteed | Operator |
| Empty local clique ID at Prepare | Permanent prepare error (new under gate; replaces today's silent-skip path) | Driver (under gate) |
| Nonzero numNodes | Silently ignored; never rejected | User/operator |
| Daemon claim reaches Prepare | Permanent error (no daemon devices published, no daemon RCTs created; the claim is from a stale or hand-crafted object) | Driver (under gate) |
| Existing driver-managed objects present during gate flip | Undefined/stale objects; migration procedure was skipped | Operator |

14. Test Plan

14.1 Unit tests

  • feature gate registration and resolveHostManagedIMEXOverrides behavior (both IMEXDaemonsWithDNSNames and ComputeDomainCliques end up false after the helper runs and the existing dependency validator passes)
  • controller add path creates only the workload RCT and sets status.status=Ready; the ComputeDomain finalizer (resource.nvidia.com/computeDomain) is added
  • controller delete path deletes only the workload RCT, removes RCT and CD finalizers, and calls metrics.ForgetComputeDomain
  • kubelet ResourceSlice omits daemon device under the gate
  • channel prepare skips AddNodeLabel and AssertComputeDomainReady
  • channel prepare rejects allocationMode: All AND unknown non-empty opaque-config modes (e.g. "foo") via the new explicit allowlist check, not via the existing if == "All" branch
  • channel prepare rejects empty clique ID as a permanent error (not the existing silent-skip)
  • daemon prepare returns permanent error under the gate
  • unprepare skips RemoveNodeLabel and still removes CDI/checkpoint state (unchanged from today otherwise)

14.2 Integration tests without GPUs

  • render Helm with the gate on and verify:
    • default channel DeviceClass exists
    • daemon DeviceClass does not exist
    • daemon RBAC/service account do not exist
    • controller and kubelet plugin receive FEATURE_GATES
  • fake a ComputeDomain and verify only the workload claim template is created
  • verify no computedomain-daemon-* DaemonSet is created

14.3 GPU/fabric tests

  • host IMEX running, one ComputeDomain, one pod per node: pod sees /dev/nvidia-caps-imex-channels/channel0 inside the container.
  • stop host IMEX (systemctl stop nvidia-imex on a node holding a workload pod): verify (a) the kubelet plugin does NOT restart it (no Pod restarts; no plugin log lines about IMEX lifecycle), and (b) any subsequent CUDA shareable-handle operation in the workload surfaces a CUDA error inside the workload container. Verification is end-to-end against the workload itself, not against any driver-exposed status field.
  • second channel claim on same node: scheduler or Prepare prevents it (the per-node checkpoint conflict check is the backstop).
  • allocationMode: All: pod fails with a clear permanent prepare error.
  • no clique/fabric node: pod fails with a clear permanent prepare error (the new behavior introduced for host-managed mode; verify the error message names the missing clique condition).

15. Implementation Checklist

  1. Add HostManagedIMEX feature gate and the resolveHostManagedIMEXOverrides helper. Wire it in front of ValidateFeatureGates in the controller and kubelet-plugin startup paths. Log the resolved overrides.
  2. Branch controller onAddOrUpdate add path under the gate to:
    • add the existing computeDomainFinalizer (the resource.nvidia.com/computeDomain constant);
    • call only WorkloadResourceClaimTemplateManager.Create (skip MultiNamespaceDaemonSetManager.Create, which transitively skips daemon-RCT creation since the daemon RCT manager lives inside DaemonSetManager);
    • set ComputeDomain.status.status=Ready.
  3. Branch controller onAddOrUpdate delete path under the gate to:
    • delete the workload RCT and remove its finalizer;
    • remove the ComputeDomain finalizer;
    • call metrics.ForgetComputeDomain for that CD.
  4. Branch kubelet ResourceSlice publishing under the gate to publish only the existing channel-0 device (no daemon device).
  5. Branch channel Prepare under the gate to:
    • skip AddNodeLabel;
    • skip AssertComputeDomainReady;
    • validate AllocationMode is empty or "Single" (return permanent error otherwise — this is new validation, not the existing if == "All" check);
    • return a permanent error when cliqueID == "" (replaces today's silent-skip for the host-managed path only).
  6. Branch channel Unprepare under the gate to skip RemoveNodeLabel. Everything else stays as today (CDI delete + checkpoint delete via the existing helpers).
  7. Reject ComputeDomainDaemonConfig Prepare under the gate with a clear permanent error.
  8. Hide deviceclass-compute-domain-daemon.yaml and rbac-compute-domain-daemon.yaml in Helm when the gate is on.
  9. Add focused tests, including the test-plan §14 scenarios — in particular fault-injection at every prepare step listed in §7.2 and verification that delete-path metrics forget runs.
  10. Update docs and examples to show numNodes: 0, allocationMode: Single, and the one-active-isolated-CD-per-host-IMEX-domain limitation. Document boot-order preconditions from §11 in the operator guide and docs/prerequisites.md.

16. References

Host-Managed IMEX Operator Guide v2

This guide matches the minimal v2 design. It is meant for a driver build that has implemented the HostManagedIMEX alpha feature gate, stops creating per-ComputeDomain IMEX DaemonSets, and uses the existing channel-zero DRA path.

1. Read This First

Host-managed v2 is intentionally small.

The driver still creates the workload ResourceClaimTemplate for each ComputeDomain, and the kubelet plugin still injects /dev/nvidia-caps-imex-channels/channel0 into workload containers.

You now own the host IMEX service:

  • nvidia-imex.service
  • /etc/nvidia-imex/nodes_config.cfg
  • host IMEX restarts and upgrades
  • host IMEX health monitoring
  • channel 0 availability

Hard limits in v2:

  • one IMEX channel: channel0
  • allocationMode: Single only
  • one prepared IMEX channel claim per node
  • one active isolated ComputeDomain per host IMEX domain/fabric
  • no driver-side host IMEX health gate
  • no in-place migration between driver-managed and host-managed modes

If you need multiple isolated jobs sharing one IMEX fabric at the same time, do not use this v2 mode. That requires a channel allocator design.

2. Compatibility

| Area | Requirement |
| --- | --- |
| Kubernetes | Same DRA version requirements as the current driver |
| DRA driver | Build that includes HostManagedIMEX |
| Hardware | Multi-Node NVLink fabric where IMEX is supported |
| NVIDIA driver | Driver/package set that includes nvidia-imex and IMEX channels |
| CDI | Enabled in the container runtime |
| GFD/NFD | Nodes should have fabric/clique labels so workloads can target fabric nodes |
| Webhook | Optional for this mode |
| cert-manager | Only needed if you enable the chart webhook with cert-manager TLS |

The existing repo documentation says host nvidia-imex.service must be masked for driver-managed ComputeDomains. Host-managed mode inverts that rule: the host service must be configured and running before workloads use IMEX.

3. Prepare Each Node

3.1 Install IMEX

Install the NVIDIA driver and nvidia-imex packages using your normal node image or GPU Operator flow. Confirm the binaries exist:

command -v nvidia-imex
command -v nvidia-imex-ctl || true

nvidia-imex-ctl is optional but strongly recommended for diagnostics.

3.2 Configure the IMEX node list

Create /etc/nvidia-imex/nodes_config.cfg on every node in the host IMEX domain. The file must contain the same ordered set of peer IPs or hostnames on each node in that domain.

Example:

sudo install -d -m 0755 /etc/nvidia-imex
sudo tee /etc/nvidia-imex/nodes_config.cfg >/dev/null <<'EOF'
10.10.0.11
10.10.0.12
10.10.0.13
10.10.0.14
EOF

Check that every node can resolve and reach every entry. IMEX will fail or run degraded if the nodes disagree on the map. See §3.7 for the sanity-check script you should run after writing this file (and after every topology change).

Discovery options

nodes_config.cfg is the hardest single thing about host-managed IMEX. Get it right and the rest is mechanical; get it wrong and you'll have silent IMEX failures. The rules (worth stating again before the options):

  • One peer per line — IPv4, IPv6, or DNS name.
  • The file is read once at nvidia-imex startup. Changes after startup are ignored until you systemctl restart nvidia-imex (or systemctl reload if your IMEX version supports SIGUSR1; this guide treats reload as untested per §8.4).
  • Every node in the same host IMEX domain MUST have an identical file. If the lists differ across peers, IMEX refuses connections and cross-node memory operations fail silently.
  • The file must contain THIS node's own IP/hostname too — IMEX treats itself as a peer.
  • A node's own line should resolve to a real local interface; if IMEX can't bind to it, the daemon exits.
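
The rules above lend themselves to a mechanical check. The sketch below is one way to express them, under stated assumptions: parsePeers, checkNodesConfig, and peersDigest are hypothetical helpers, ignoring blank lines and surrounding whitespace is a choice of the sketch rather than documented IMEX parsing behavior, and treating a duplicate entry as an error is likewise an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// parsePeers returns one peer per non-blank line, trimmed of whitespace.
func parsePeers(content string) []string {
	var peers []string
	for _, line := range strings.Split(content, "\n") {
		if p := strings.TrimSpace(line); p != "" {
			peers = append(peers, p)
		}
	}
	return peers
}

// checkNodesConfig verifies the file is non-empty, has no duplicate
// entries, and lists this node's own address.
func checkNodesConfig(content, self string) error {
	peers := parsePeers(content)
	if len(peers) == 0 {
		return fmt.Errorf("nodes_config.cfg has no peers")
	}
	seen := map[string]bool{}
	for _, p := range peers {
		if seen[p] {
			return fmt.Errorf("duplicate peer %q", p)
		}
		seen[p] = true
	}
	if !seen[self] {
		return fmt.Errorf("own address %q is missing from the peer list", self)
	}
	return nil
}

// peersDigest returns a hex SHA-256 over the ordered peer list, so digests
// collected from different nodes can be compared for equality.
func peersDigest(content string) string {
	sum := sha256.Sum256([]byte(strings.Join(parsePeers(content), "\n")))
	return hex.EncodeToString(sum[:])
}
```

Collecting peersDigest from every node in the domain and comparing the values is a cheap way to detect a node whose list has drifted. The digest is order-sensitive on purpose, because the file must contain the same ordered set on each node.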

Pick one of the following discovery strategies based on how dynamic your cluster is.

Option A: static list baked into the image. Simplest, works for fixed racks (e.g. one rack = one immutable list of node IPs). Bake the file into your image build. Use a different file per rack/clique.

Pros: zero runtime complexity, totally deterministic. Cons: rack changes (replace a node, add a node) require re-imaging or manual file edits + an IMEX restart on every peer.

Example: keep the file under config-management as rack-A.nodes_config.cfg, rack-B.nodes_config.cfg, …; the image-build step picks the right one based on the rack the node is in.

Option B: prolog/boot-time script that scrapes Kubernetes. Useful if you want zero per-node manual config. Run once at boot (e.g. via a oneshot systemd unit ordered Before=nvidia-imex.service):

#!/bin/bash
# /usr/local/bin/build-imex-nodes-config.sh
# Populate /etc/nvidia-imex/nodes_config.cfg from Kubernetes.
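# Fail on errors, unset variables, and failures inside pipelines.
set -euo pipefail

# Assumes the Kubernetes node name equals the output of hostname.
MY_CLIQUE_LABEL=$(kubectl get node "$(hostname)" \
    -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.clique}')
if [[ -z "$MY_CLIQUE_LABEL" ]]; then
    echo "ERROR: no nvidia.com/gpu.clique label on this node" >&2
    exit 1
fi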

# Peers in the same clique. Write to a temp file first so a failed
# kubectl call never truncates the live node map.
tmp=$(mktemp /etc/nvidia-imex/nodes_config.cfg.XXXXXX)
kubectl get nodes -l "nvidia.com/gpu.clique=$MY_CLIQUE_LABEL" \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
    | sort -u > "$tmp"

# Sanity-check: must contain this node's own IP (fixed-string, whole-line match).
my_ip=$(ip -4 -j addr | jq -r '.[] | .addr_info[] | select(.scope=="global") | .local' | head -1)
grep -Fxq "$my_ip" "$tmp" \
    || { echo "ERROR: own IP $my_ip not in clique list" >&2; rm -f "$tmp"; exit 1; }
chmod 0644 "$tmp"
mv "$tmp" /etc/nvidia-imex/nodes_config.cfg

Pros: minimal manual config; adapts to rack composition changes (after IMEX restart). Cons: requires kubectl on the node (with a kubeconfig that has read access to nodes); requires reordering systemd to put this script before nvidia-imex.service.
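The oneshot unit that Option B mentions could look like the sketch below. The unit name imex-nodes-config.service and the helper write_nodes_config_unit are assumptions for illustration; only the script path and the Before=nvidia-imex.service ordering come from this guide.

```shell
# Hypothetical helper: write a oneshot unit that runs the Option B script
# before nvidia-imex.service. UNIT_DIR defaults to /etc/systemd/system.
# The unit name imex-nodes-config.service is an assumption.
write_nodes_config_unit() (
  unit_dir=${1:-/etc/systemd/system}
  cat > "$unit_dir/imex-nodes-config.service" <<'EOF'
[Unit]
Description=Build /etc/nvidia-imex/nodes_config.cfg from Kubernetes
Wants=network-online.target
After=network-online.target
Before=nvidia-imex.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/build-imex-nodes-config.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
)
```

After writing the unit as root, run `systemctl daemon-reload` and `systemctl enable imex-nodes-config.service`. Before= only orders the two units when both are started together; if nvidia-imex.service should also fail when the script fails, add Requires=imex-nodes-config.service to it through a drop-in. The script still needs a kubeconfig that root can read.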

Option C: dedicated node-local controller. Run a small daemon on each node that watches Kubernetes for clique membership changes and rewrites nodes_config.cfg. Most production deployments end up here. Reference implementations: NVIDIA Mission Control's nmx-controller, BCM's IMEX integration, or a custom small Go program with a Kubernetes Node informer.

Pros: handles elastic clusters, node replacements, scale-up/scale-down without manual file edits. Cons: more moving parts; another thing to monitor; you still need to treat config rewrites as disruptive (drain workloads before the controller restarts IMEX) unless you have tested a safe reload path.
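Whatever the controller is written in, its core decision is whether the freshly computed list differs from the live file, because only a real change justifies a disruptive restart. A minimal sketch of that step in shell follows. The helper name replace_if_changed and its exit codes are assumptions, not part of any reference implementation named above.

```shell
# Hypothetical helper: install CANDIDATE over LIVE only when they differ.
# Exit status: 0 = live file was replaced (IMEX restart needed),
#              1 = no change,
#              2 = candidate is missing or empty (refused),
#              3 = copy or rename failed.
replace_if_changed() (
  candidate=$1
  live=$2
  [ -s "$candidate" ] || { echo "refusing empty candidate" >&2; exit 2; }
  if [ -f "$live" ] && cmp -s "$candidate" "$live"; then
    exit 1
  fi
  # Copy next to the live file, then rename, so readers never see a
  # partially written node map.
  cp "$candidate" "$live.tmp" || exit 3
  mv "$live.tmp" "$live" || exit 3
)
```

A controller loop would treat exit status 0 as "drain, then restart host IMEX" and exit status 1 as "nothing to do". This sketch deliberately does not restart anything itself.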

Option D: BCM / NMX / equivalent fabric manager. If you're already running BCM (Base Command Manager) or NMX, they'll have IMEX integration that handles all of this. Use it.

Whichever option you pick, run the sanity checks in §3.7 after the file is written.

3.3 Configure IMEX

Use the config path expected by your IMEX package. The NVIDIA docs use /etc/nvidia-imex/config.cfg; older local examples sometimes use /etc/nvidia-imex/imexd.cfg. Pick one and keep your systemd unit consistent.

Minimal example:

sudo tee /etc/nvidia-imex/config.cfg >/dev/null <<'EOF'
DAEMONIZE=0
LOG_FILE_NAME=/var/log/nvidia-imex.log
IMEX_NODE_CONFIG_FILE=/etc/nvidia-imex/nodes_config.cfg
IMEX_CONN_WAIT_TIMEOUT=70
IMEX_WAIT_FOR_QUORUM=RECOVERY
IMEX_CMD_ENABLED=1
IMEX_CMD_PORT=50005
EOF

Changing the config or node list requires an IMEX restart in this guide. Drain IMEX workloads first.
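Because two config paths are in circulation, it is easy to edit one node map while the daemon reads another. The sketch below reads a key from the config so you can confirm which node map file is actually referenced. imex_cfg_get is a hypothetical helper, and it assumes the plain KEY=VALUE layout shown in the example above.

```shell
# Hypothetical helper: print the value of KEY from a KEY=VALUE config file.
# Usage: imex_cfg_get FILE KEY
# If KEY appears more than once, the last value wins. Returns 1 if absent.
imex_cfg_get() (
  file=$1
  key=$2
  awk -F= -v k="$key" '
    $1 == k { sub(/^[^=]*=/, ""); v = $0; found = 1 }
    END { if (found) print v; else exit 1 }
  ' "$file"
)
```

For example, `imex_cfg_get /etc/nvidia-imex/config.cfg IMEX_NODE_CONFIG_FILE` should print the path you populated in §3.2.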

3.4 Ensure channel0 exists

The kubelet plugin must be able to discover the nvidia-caps-imex-channels major when it starts, and workloads need channel 0.

Validate the major:

grep nvidia-caps-imex-channels /proc/devices

Validate or create channel0 using the method supported by your node image:

sudo mkdir -p /dev/nvidia-caps-imex-channels
major="$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)"
test -n "$major"
test -e /dev/nvidia-caps-imex-channels/channel0 || \
  sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
sudo chmod 0666 /dev/nvidia-caps-imex-channels/channel0
ls -l /dev/nvidia-caps-imex-channels/channel0

If your driver supports NVreg_CreateImexChannel0=1, using that module parameter is cleaner because it recreates channel0 when the kernel module is loaded. If your environment uses nvidia-modprobe -c 0, that is also fine.

Only channel 0 is used by this v2 driver mode.
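A channel0 node created earlier with mknod can outlive the major number it was created with. The sketch below checks an existing node against the current values. It assumes the major can change between module loads (which is why this guide reads it from /proc/devices), and it relies on GNU stat. chardev_matches is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if PATH is a character device with the
# given decimal MAJOR and MINOR.
# GNU stat prints major (%t) and minor (%T) in hex, so convert to decimal.
chardev_matches() (
  path=$1
  want_major=$2
  want_minor=$3
  [ -c "$path" ] || exit 1
  have_major=$(printf '%d' "0x$(stat -c '%t' "$path")")
  have_minor=$(printf '%d' "0x$(stat -c '%T' "$path")")
  [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -eq "$want_minor" ]
)
```

With `major` read from /proc/devices as shown above, `chardev_matches /dev/nvidia-caps-imex-channels/channel0 "$major" 0` tells you whether the node needs to be recreated.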

3.5 Start host IMEX

Example unit:

[Unit]
Description=NVIDIA IMEX Service
After=network-online.target
Wants=network-online.target
ConditionPathExists=/etc/nvidia-imex/config.cfg
ConditionPathExists=/etc/nvidia-imex/nodes_config.cfg

[Service]
Type=simple
ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Install and start it:

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-imex.service
sudo systemctl status nvidia-imex.service --no-pager

If the command service is enabled, check the domain:

if command -v nvidia-imex-ctl >/dev/null; then
  nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg
fi

When nvidia-imex-ctl is available, confirm that all expected peers are connected and that it reports Domain State: UP before running real workloads. Otherwise use systemctl status, journalctl, and your node-image-specific IMEX health checks.
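If you want to gate automation on the domain state, a polling wrapper along these lines may help. It is a sketch: both helper names are hypothetical, and it matches only the Domain State: UP text this guide refers to, so check it against the output of your IMEX version.

```shell
# Hypothetical helper: read nvidia-imex-ctl output on stdin and succeed
# only if it contains a "Domain State: UP" line.
domain_is_up() {
  grep -Eq 'Domain State:[[:space:]]*UP([[:space:]]|$)'
}

# Hypothetical helper: run CMD up to TRIES times, DELAY seconds apart,
# until its output reports the domain as UP.
# Usage: wait_for_domain_up TRIES DELAY CMD [ARGS...]
wait_for_domain_up() (
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" 2>/dev/null | domain_is_up; then
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)
```

Example: `wait_for_domain_up 30 2 nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg`.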

3.6 Node preflight

After host prep and before installing or upgrading the driver, run a simple check on every NVLink-capable node:

#!/usr/bin/env bash
set -eu

fail() { echo "FAIL: $*" >&2; exit 1; }
warn() { echo "WARN: $*" >&2; }

echo "== NVIDIA driver =="
command -v nvidia-smi >/dev/null || fail "nvidia-smi not found"
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

echo "== IMEX binaries =="
command -v nvidia-imex >/dev/null || fail "nvidia-imex not found"
command -v nvidia-imex-ctl >/dev/null || warn "nvidia-imex-ctl not found"

echo "== IMEX config =="
test -s /etc/nvidia-imex/config.cfg || fail "missing /etc/nvidia-imex/config.cfg"
test -s /etc/nvidia-imex/nodes_config.cfg || fail "missing /etc/nvidia-imex/nodes_config.cfg"

echo "== IMEX channel major =="
grep nvidia-caps-imex-channels /proc/devices || fail "nvidia-caps-imex-channels major not registered"

echo "== channel0 =="
test -e /dev/nvidia-caps-imex-channels/channel0 || warn "channel0 host node missing"

echo "== nvidia-imex.service =="
# This script runs AFTER §3.5 (which enables and starts the service).
# An inactive service at this point is a real failure, not a warning.
systemctl is-active --quiet nvidia-imex.service || fail "nvidia-imex.service is not active"

echo "OK"

The host /dev/nvidia-caps-imex-channels/channel0 node is still useful to check, even though CDI can create the character device inside containers from major/minor numbers. If the host node is missing, confirm your boot-time channel creation path before running production workloads.

3.7 nodes_config.cfg sanity checks

Run these checks on every node after writing nodes_config.cfg, and again after every topology change (node add, node remove, IP renumber):

# Count of non-empty peer lines should match your clique size:
echo "== peer count =="
grep -c . /etc/nvidia-imex/nodes_config.cfg

# Every peer should resolve (no typos):
echo "== peer resolution =="
while read -r peer; do
  test -z "$peer" && continue
  getent hosts "$peer" >/dev/null && echo "OK $peer" || echo "FAIL $peer"
done < /etc/nvidia-imex/nodes_config.cfg

# Your own IP must be in the file — IMEX treats itself as a peer.
# Use jq if available for a global-scope-only filter; otherwise fall
# back to the first hostname -I address.
echo "== local address appears in node map =="
if command -v jq >/dev/null; then
  my_ip=$(ip -4 -j addr | jq -r '.[] | .addr_info[] | select(.scope=="global") | .local' | head -1)
else
  my_ip=$(hostname -I | awk '{print $1}')
fi
if [ -n "$my_ip" ] && grep -Fxq "$my_ip" /etc/nvidia-imex/nodes_config.cfg; then
  echo "OK self present ($my_ip)"
else
  echo "FAIL self missing (${my_ip:-no address found})"
fi

The file must be consistent across the host IMEX domain. If your environment uses hostnames instead of IPs, adapt the self-check to compare the node's chosen IMEX hostname. If nodes_config.cfg is even one entry off across peers, IMEX will refuse cross-node memory operations silently — re-run these checks on every node, not just the one you changed.
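One way to confirm that every node holds the same bytes is to compare digests. The sketch below assumes sha256sum is installed. Both helper names are hypothetical, and how you collect the per-node digests (ssh, a DaemonSet, config management) is up to you.

```shell
# Hypothetical helper: print a digest of the node map. The comparison is
# byte-for-byte, so entry order matters as well as content.
nodes_config_digest() {
  sha256sum "${1:-/etc/nvidia-imex/nodes_config.cfg}" | awk '{ print $1 }'
}

# Hypothetical helper: read "NODE DIGEST" pairs on stdin and succeed only
# if there is at least one pair and every digest is the same.
digests_agree() {
  awk '{ seen[$2] = 1 }
       END { n = 0; for (d in seen) n++; exit (n == 1) ? 0 : 1 }'
}
```

Collect one `NODE DIGEST` line per node into a file, then run `digests_agree < digests.txt`. A non-zero exit means at least one node has a different map.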

4. Install the Driver

Install or upgrade the chart with the host-managed gate:

helm upgrade --install dra-driver-nvidia-gpu \
  ./deployments/helm/dra-driver-nvidia-gpu \
  --namespace dra-driver-nvidia-gpu \
  --create-namespace \
  --set resources.computeDomains.enabled=true \
  --set resources.gpus.enabled=false \
  --set featureGates.HostManagedIMEX=true

If you also want this chart to provide GPU DRA resources, keep resources.gpus.enabled=true and pass --set gpuResourcesEnabledOverride=true instead of disabling GPU resources. The override is an existing chart safety guard and is unrelated to host-managed IMEX.

No host-managed-specific CRD preflight is required. This v2 uses the existing ComputeDomain CRD and does not add IMEXChannelAllocation.

Validate the rendered behavior:

kubectl get pods -n dra-driver-nvidia-gpu
kubectl get deviceclass compute-domain-default-channel.nvidia.com
kubectl get deviceclass compute-domain-daemon.nvidia.com 2>/dev/null && \
  echo "unexpected daemon DeviceClass" || \
  echo "daemon DeviceClass not rendered"

Expected:

  • controller is running
  • kubelet plugin is running on fabric-capable nodes
  • compute-domain-default-channel.nvidia.com exists
  • compute-domain-daemon.nvidia.com does not exist
  • daemon RBAC and service accounts do not exist
  • no computedomain-daemon-* pods appear

Check the gate reached the pods:

kubectl get ds -n dra-driver-nvidia-gpu -o yaml | grep -A1 FEATURE_GATES
kubectl get deploy -n dra-driver-nvidia-gpu -o yaml | grep -A1 FEATURE_GATES

4.1 Validation script

Save this as validate-host-managed-imex.sh and run it after the Helm upgrade:

#!/usr/bin/env bash
set -eu

NS="${NS:-dra-driver-nvidia-gpu}"

fail() { echo "FAIL: $*" >&2; exit 1; }

echo "== driver pods =="
kubectl get pods -n "$NS"

echo "== no computedomain-daemon pods =="
if kubectl get pods -A 2>/dev/null | grep -q "computedomain-daemon"; then
  fail "found computedomain-daemon pods"
fi

echo "== DeviceClasses =="
kubectl get deviceclass compute-domain-default-channel.nvidia.com >/dev/null \
  || fail "compute-domain-default-channel.nvidia.com DeviceClass is missing"
# Under HostManagedIMEX, the chart should NOT render the daemon
# DeviceClass. NOTE: in the current repo, the daemon DeviceClass
# template is gated only on `resources.computeDomains.enabled`,
# not yet on `featureGates.HostManagedIMEX`. If this check fails,
# you are running against a build that hasn't landed the chart
# changes from design v2 §8 yet — file a build/chart issue rather
# than treating the failure as a host misconfig.
if kubectl get deviceclass compute-domain-daemon.nvidia.com >/dev/null 2>&1; then
  fail "compute-domain-daemon.nvidia.com is present; chart hasn't been updated to gate it on HostManagedIMEX (see design v2 §8)"
fi

echo "== no daemon RBAC/service accounts =="
# The current repo also renders daemon RBAC unconditionally. A v2 build
# should gate these objects with HostManagedIMEX just like the daemon
# DeviceClass.
if kubectl get clusterrole compute-domain-daemon-role >/dev/null 2>&1; then
  fail "compute-domain-daemon-role is present; chart hasn't been updated to gate daemon RBAC on HostManagedIMEX"
fi
if kubectl get clusterrolebinding compute-domain-daemon-role-binding >/dev/null 2>&1; then
  fail "compute-domain-daemon-role-binding is present; chart hasn't been updated to gate daemon RBAC on HostManagedIMEX"
fi
if kubectl get serviceaccount -A 2>/dev/null | grep -q "compute-domain-daemon-service-account"; then
  fail "compute-domain-daemon-service-account is present; chart hasn't been updated to gate daemon service accounts on HostManagedIMEX"
fi

echo "== HostManagedIMEX gate =="
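kubectl get deploy -n "$NS" dra-driver-nvidia-gpu-controller -o yaml | grep -q "HostManagedIMEX=true" \
  || fail "HostManagedIMEX=true not found on the controller Deployment"
kubectl get ds -n "$NS" dra-driver-nvidia-gpu-kubelet-plugin -o yaml | grep -q "HostManagedIMEX=true" \
  || fail "HostManagedIMEX=true not found on the kubelet plugin DaemonSet"

echo "== channel-zero ResourceSlice device =="
kubectl get resourceslice -A -o yaml | grep -q "name: channel-0" \
  || fail "no ResourceSlice advertises channel-0"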

echo "All checks passed."

5. Create a ComputeDomain

Use numNodes: 0 and omit allocationMode or set it to Single.

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: train-a
  namespace: default
spec:
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: train-a-imex-channel
    allocationMode: Single

Apply it:

kubectl apply -f train-a-computedomain.yaml
kubectl get computedomain train-a -o yaml
kubectl get resourceclaimtemplate train-a-imex-channel

status.status: Ready means the driver created the workload claim template. It does not prove host IMEX is healthy.
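For scripted setups, the status and template checks above can be wrapped in a wait. This sketch assumes a kubectl new enough to support `--for=jsonpath` in `kubectl wait` (1.23 or later). wait_for_computedomain is a hypothetical helper name and the 120s timeout is arbitrary.

```shell
# Hypothetical helper: block until a ComputeDomain reports
# status.status=Ready and its workload ResourceClaimTemplate exists.
# Usage: wait_for_computedomain NAME CLAIM_TEMPLATE [NAMESPACE]
wait_for_computedomain() (
  name=$1
  rct=$2
  ns=${3:-default}
  kubectl wait -n "$ns" --for=jsonpath='{.status.status}'=Ready \
    "computedomain/$name" --timeout=120s || exit 1
  kubectl get -n "$ns" resourceclaimtemplate "$rct" >/dev/null || exit 1
)
```

Example: `wait_for_computedomain train-a train-a-imex-channel default`. Success here still says nothing about host IMEX health.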

6. Run a Smoke Pod

Example:

apiVersion: v1
kind: Pod
metadata:
  name: imex-channel0-smoke
  namespace: default
spec:
  restartPolicy: Never
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-lc"]
    args:
    - |
      ls -l /dev/nvidia-caps-imex-channels
      sleep 3600
    resources:
      claims:
      - name: imex
  resourceClaims:
  - name: imex
    resourceClaimTemplateName: train-a-imex-channel

Apply and check:

kubectl apply -f smoke-pod.yaml
kubectl wait --for=condition=Ready pod/imex-channel0-smoke --timeout=120s
kubectl logs imex-channel0-smoke

Expected output includes:

channel0

Clean up:

kubectl delete pod imex-channel0-smoke
kubectl delete computedomain train-a

7. Operating Rules

7.1 One active isolated domain per host IMEX domain

This v2 uses channel 0 for every host-managed ComputeDomain. Kubernetes prevents two separate channel claims from landing on the same node, but it does not coordinate channel IDs across the fabric.

If two isolated workloads use the same host IMEX domain at the same time, they can both use channel 0 on different nodes. That is not isolated.

Operational rule:

At most one active isolated ComputeDomain per host IMEX domain/fabric.

If multiple pods are part of the same distributed job, use one ComputeDomain for that job.

7.2 One channel claim per node

The driver publishes one channel device per node. If the scheduler cannot place another IMEX-using pod because channel 0 is already allocated on all eligible nodes, the pod remains pending. If a stale or conflicting claim gets to Prepare, the kubelet plugin rejects it using its local checkpoint.

7.3 Do not use allocationMode: All

Host-managed v2 rejects allocationMode: All.

Use:

allocationMode: Single

or omit the field.

7.4 Host IMEX restarts are disruptive

The driver does not restart or supervise host IMEX. Restarting nvidia-imex.service can break in-flight CUDA shareable handles. Drain IMEX workloads before host IMEX restarts, driver upgrades that change host driver components, or node-list changes.

7.5 Driver upgrades do not manage host IMEX

Upgrading the Helm chart may restart controller and kubelet plugin pods. It does not restart nvidia-imex.service. Do not treat a chart upgrade as a host IMEX restart for workloads that are already running, but note that new Prepare calls depend on the new kubelet plugin coming up cleanly.

8. Day-2 Operations

8.1 Adding nodes

When a new NVLink-capable node joins a host IMEX domain:

  1. Prepare the node image: driver, IMEX package, config, channel 0, and nvidia-imex.service.
  2. Update nodes_config.cfg on the new node and on every existing peer.
  3. Restart host IMEX on nodes whose node map changed. Drain affected workloads first.
  4. Validate the domain with nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg when available; otherwise use host service logs and your platform IMEX health checks.
  5. Let NFD/GFD label the node and let the kubelet plugin publish channel-0.

The DRA driver does not perform peer discovery or host IMEX restarts.
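Step 2 is the easiest one to get subtly wrong across many peers. A small idempotent helper keeps the edit identical everywhere. This sketch assumes you keep the node map sorted, as the Option B script in §3.2 does with sort -u; do not use it if your file relies on a hand-chosen order. add_peer is a hypothetical name.

```shell
# Hypothetical helper: add PEER to a node-map FILE if it is not already
# listed, keeping entries sorted and unique so every node ends up with
# the same bytes. Prints "added" or "unchanged".
add_peer() (
  file=$1
  peer=$2
  if grep -Fxq -- "$peer" "$file" 2>/dev/null; then
    echo unchanged
    exit 0
  fi
  # Build the new list in a temp file, then rename it into place.
  { cat "$file" 2>/dev/null; echo "$peer"; } \
    | awk 'NF' | LC_ALL=C sort -u > "$file.tmp" || exit 1
  mv "$file.tmp" "$file" || exit 1
  echo added
)
```

Run it with the same arguments on every peer, for example `add_peer /etc/nvidia-imex/nodes_config.cfg 10.10.0.15`, then continue with the drain and restart in step 3.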

8.2 Removing nodes

When removing a node from a host IMEX domain:

  1. Cordon and drain the node.
  2. Wait for IMEX workloads on that node to terminate.
  3. Remove the node from nodes_config.cfg on every remaining peer.
  4. Restart host IMEX on the remaining peers after draining affected workloads.
  5. Validate the domain before scheduling new IMEX workloads.

If you skip the node-map update, remaining peers can keep trying to reconnect to the removed node and workloads can see IMEX-level failures.
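The node-map edit in step 3 can be scripted the same way on every remaining peer. This is a sketch with a hypothetical helper name, remove_peer; it refuses to write an empty node map and leaves the restart in step 4 to you.

```shell
# Hypothetical helper: remove PEER from a node-map FILE.
# Prints "removed" or "unchanged"; fails instead of emptying the file.
remove_peer() (
  file=$1
  peer=$2
  grep -Fxq -- "$peer" "$file" || { echo unchanged; exit 0; }
  # grep -v exits non-zero when no lines remain, which would mean an
  # empty node map, so treat that as an error.
  if ! grep -Fxv -- "$peer" "$file" > "$file.tmp"; then
    rm -f "$file.tmp"
    echo "refusing to empty $file" >&2
    exit 1
  fi
  mv "$file.tmp" "$file" || exit 1
  echo removed
)
```

Example: `remove_peer /etc/nvidia-imex/nodes_config.cfg 10.10.0.14`, run on each remaining peer before the restart.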

8.3 Restarting IMEX on one node

Treat a host IMEX restart as disruptive for workloads on that node:

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
sudo systemctl restart nvidia-imex.service
if command -v nvidia-imex-ctl >/dev/null; then
  nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg
fi
kubectl uncordon <node>

The driver does not detect, restart, or repair host IMEX.

8.4 Config and node-map changes

This guide assumes config and node-map changes are disruptive. Drain workloads, update files consistently, restart host IMEX, validate the domain, then resume scheduling.

If your IMEX package and systemd unit explicitly support a safe reload path, for example through a validated ExecReload, you can use that in your local runbook. Do not rely on reload semantics unless you have tested them for your IMEX version and config change type.

8.5 Kubernetes maintenance

Normal Kubernetes drain semantics still apply. Draining a node with --ignore-daemonsets evicts workload pods but leaves DaemonSet-managed pods, including the kubelet plugin, running. It does not stop nvidia-imex.service unless your host maintenance tooling does that separately.

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# host maintenance here
kubectl uncordon <node>

9. Monitoring

Monitor the host service directly.

Useful checks:

systemctl is-active nvidia-imex.service
journalctl -u nvidia-imex.service -n 100 --no-pager
if command -v nvidia-imex-ctl >/dev/null; then
  nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg
fi
grep nvidia-caps-imex-channels /proc/devices
ls -l /dev/nvidia-caps-imex-channels/channel0

Kubernetes checks:

kubectl get computedomain -A
kubectl get resourceclaimtemplate -A
kubectl get resourceslice -A -o yaml | grep -E "name: channel-0|compute-domain.nvidia.com"
kubectl logs -n dra-driver-nvidia-gpu deploy/dra-driver-nvidia-gpu-controller
kubectl logs -n dra-driver-nvidia-gpu ds/dra-driver-nvidia-gpu-kubelet-plugin

Do not use ComputeDomain.status.nodes as a host IMEX health source in this mode. It is not populated by host-managed v2.

Useful things to monitor:

Source Why
nvidia-imex.service active state Host IMEX availability
nvidia-imex.service restart count Crash loops or host instability
journalctl -u nvidia-imex.service error rate Peer/config/auth failures
nvidia-imex-ctl -N domain state, when available Peer connectivity
/proc/devices entry for nvidia-caps-imex-channels Kubelet plugin startup prerequisite
/dev/nvidia-caps-imex-channels/channel0 Host channel setup signal
controller and kubelet plugin pod readiness Driver availability
kubelet plugin logs mentioning Prepare Allocation/config failures
ResourceSlice containing channel-0 Scheduler-visible channel capacity

Example alerts to adapt to your monitoring stack:

groups:
- name: host-managed-imex
  rules:
  - alert: NvidiaIMEXDaemonDown
    expr: node_systemd_unit_state{name="nvidia-imex.service",state="active"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "nvidia-imex is not active on {{ $labels.instance }}"

  - alert: NvidiaIMEXRestarting
    expr: increase(node_systemd_service_restart_total{name="nvidia-imex.service"}[15m]) > 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "nvidia-imex restart loop on {{ $labels.instance }}"
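The alerts above cover the systemd unit only. The /proc/devices and channel0 rows in the table have no built-in metric, so one option is a script for the node_exporter textfile collector. This is a sketch: the metric names imex_channel_major_registered and imex_channel0_present and the helper name are made up for illustration.

```shell
# Hypothetical helper: print two gauge samples in Prometheus text format.
# Usage: emit_imex_host_metrics [PROC_DEVICES_FILE] [CHANNEL_PATH]
emit_imex_host_metrics() (
  proc_devices=${1:-/proc/devices}
  channel=${2:-/dev/nvidia-caps-imex-channels/channel0}
  major_present=0
  channel_present=0
  # 1 if the channel major is registered, 0 otherwise.
  grep -qw nvidia-caps-imex-channels "$proc_devices" 2>/dev/null && major_present=1
  # 1 if the host channel0 node exists, 0 otherwise.
  [ -e "$channel" ] && channel_present=1
  echo "imex_channel_major_registered $major_present"
  echo "imex_channel0_present $channel_present"
)
```

Run it from a timer or cron job and write the output to a .prom file in the directory passed to node_exporter's --collector.textfile.directory flag. Write to a temporary file and rename it so the collector never reads a partial file.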

10. Migration

10.1 Driver-managed to host-managed

Use a maintenance window.

# 1. Stop new placement.
# Repeat cordon/drain for every fabric node that can run ComputeDomain
# workloads.
kubectl cordon <node>

# 2. Drain ComputeDomain workloads from all fabric nodes.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 3. Delete all ComputeDomains.
kubectl delete computedomain -A --all

# 4. Wait for driver-managed generated objects to disappear.
while kubectl get daemonset -A -l resource.nvidia.com/computeDomain -o name 2>/dev/null | grep -q . || \
      kubectl get resourceclaimtemplate -A -l resource.nvidia.com/computeDomain -o name 2>/dev/null | grep -q .; do
  echo "waiting for generated ComputeDomain objects to disappear"
  sleep 5
done

# 5. Start and validate host IMEX on every participating node.
sudo systemctl enable --now nvidia-imex.service
if command -v nvidia-imex-ctl >/dev/null; then
  nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg
fi

# 6. Flip the driver gate.
helm upgrade --install dra-driver-nvidia-gpu \
  ./deployments/helm/dra-driver-nvidia-gpu \
  --namespace dra-driver-nvidia-gpu \
  --set resources.computeDomains.enabled=true \
  --set resources.gpus.enabled=false \
  --set featureGates.HostManagedIMEX=true

# 7. Recreate ComputeDomains for host-managed mode.
#    Use numNodes: 0 and allocationMode: Single (or omit it).
kubectl apply -f ./your-host-managed-computedomains.yaml

# 8. Uncordon nodes.
kubectl uncordon <node>

Do not leave old ComputeDomains in place during the gate flip. v2 does not adopt or sweep legacy daemon objects.

10.2 Host-managed to driver-managed

Use the reverse maintenance flow.

# 1. Stop placement and drain ComputeDomain workloads from all fabric nodes.
# Repeat cordon/drain for every fabric node that can run ComputeDomain
# workloads.
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. Delete all ComputeDomains.
kubectl delete computedomain -A --all

# 3. Stop host IMEX so it cannot conflict with driver-managed daemon pods.
sudo systemctl disable --now nvidia-imex.service
sudo systemctl mask nvidia-imex.service

# 4. Flip the gate off.
helm upgrade --install dra-driver-nvidia-gpu \
  ./deployments/helm/dra-driver-nvidia-gpu \
  --namespace dra-driver-nvidia-gpu \
  --set resources.computeDomains.enabled=true \
  --set resources.gpus.enabled=false \
  --set featureGates.HostManagedIMEX=false

# 5. Recreate ComputeDomains for driver-managed mode.

# 6. Uncordon nodes after driver-managed daemons and workloads are ready to run.
kubectl uncordon <node>

11. Troubleshooting

11.1 Pod stays Pending

Check whether all eligible nodes already have channel 0 allocated:

kubectl describe pod <pod>
kubectl get resourceslice -A -o yaml | grep "name: channel-0"

Also check node affinity. Host-managed IMEX workloads should target nodes with NVLink fabric/clique labels.

11.2 Pod fails during ContainerCreating or Prepare

Check events and kubelet plugin logs:

kubectl describe pod <pod>
kubectl logs -n dra-driver-nvidia-gpu ds/dra-driver-nvidia-gpu-kubelet-plugin --since=30m

Common causes:

  • allocationMode: All was used
  • another channel-zero claim is already prepared on the same node
  • the ComputeDomain is in a different namespace from the generated claim
  • the node has no clique/fabric ID
  • nvidia-caps-imex-channels was missing when the plugin started

11.3 Pod starts but CUDA IMEX operations fail

The driver has already done its part if the container sees channel0. Investigate host IMEX:

systemctl status nvidia-imex.service --no-pager
journalctl -u nvidia-imex.service -n 200 --no-pager
if command -v nvidia-imex-ctl >/dev/null; then
  nvidia-imex-ctl -N -c /etc/nvidia-imex/config.cfg
fi

Look for:

  • domain state not UP
  • disconnected peers
  • node map mismatch
  • version mismatch
  • authentication/encryption mismatch
  • wrong or inconsistent nodes_config.cfg

11.4 ComputeDomain is Ready but workloads fail

That is expected for some host failures. In host-managed v2, ComputeDomain.status.status=Ready means the workload claim template exists. It does not mean host IMEX is healthy.

11.5 computedomain-daemon-* pods exist

Under HostManagedIMEX=true, new daemon pods should not be created.

If they exist:

  1. Confirm the Helm release has featureGates.HostManagedIMEX=true.
  2. Confirm you deleted all old ComputeDomains before the migration.
  3. Confirm the controller pod restarted onto the new config.
  4. Delete stale ComputeDomains and rerun the migration cleanup.

11.6 compute-domain-daemon.nvidia.com DeviceClass exists

The host-managed chart should stop rendering it. If it remains:

helm get values -n dra-driver-nvidia-gpu dra-driver-nvidia-gpu
kubectl get deviceclass compute-domain-daemon.nvidia.com -o yaml

It may be left over from a failed or partial Helm operation. The kubelet plugin should not publish daemon devices under the gate, so daemon claims still should not allocate.

12. Coexistence With GPU Operator

GPU Operator may manage several prerequisites for you:

  • NVIDIA host driver
  • NVIDIA Container Toolkit and CDI
  • NFD/GFD labels
  • DCGM
  • the nvidia-imex package and, in some versions, nvidia-imex.service

That is compatible with host-managed v2 as long as the host IMEX service is configured for your fabric and the DRA driver is not also trying to launch per-ComputeDomain daemon pods.

Verify what GPU Operator installed:

systemctl status nvidia-imex.service --no-pager
cat /etc/nvidia-imex/config.cfg
cat /etc/nvidia-imex/nodes_config.cfg
grep nvidia-caps-imex-channels /proc/devices
ls -l /dev/nvidia-caps-imex-channels/channel0

GPU Operator might install the service but not populate the right nodes_config.cfg for your fabric. Treat peer discovery as your responsibility unless your GPU Operator configuration explicitly owns it.

If the standard NVIDIA device plugin is still running for GPU allocation, keep the DRA chart's GPU resources disabled with resources.gpus.enabled=false. That setting does not disable ComputeDomains.

13. FAQ

Can I run two isolated jobs on the same IMEX fabric?

Not with this v2 design. It always injects channel 0. Use one active isolated ComputeDomain per host IMEX domain/fabric.

Can I use allocationMode: All?

No. It is rejected under HostManagedIMEX.

Does the webhook need to be enabled?

No. The v2 contract does not depend on webhook changes. The kubelet plugin and controller enforce the host-managed-specific behavior.

Does status.status=Ready mean IMEX is healthy?

No. It means the driver created the workload ResourceClaimTemplate.

Can I reload nodes_config.cfg without draining?

Not as far as this guide is concerned. Treat node-list changes as disruptive: drain workloads, update the file consistently, restart host IMEX, verify the domain, then resume scheduling.

Why only channel0?

Because channel0 is what the repo already advertises and injects today. More channels require a fabric-wide allocator and a different lifecycle design.

14. References
