| Field | Value |
|---|---|
| Status | Implementable minimal alpha |
| Feature gate | HostManagedIMEX |
| Scope | Install-wide, not per-ComputeDomain |
| Primary goal | Stop launching per-ComputeDomain IMEX DaemonSets when the host already runs nvidia-imex |
| Primary non-goal | Per-ComputeDomain channel isolation across an IMEX fabric |
HostManagedIMEX is a narrow mode for clusters where the operator already
owns the host nvidia-imex daemon lifecycle. When enabled, the driver keeps
the existing ComputeDomain user API and the existing DRA channel injection
path, but stops creating the in-cluster compute-domain-daemon DaemonSets.
The smallest safe version is intentionally limited:
- one schedulable IMEX channel device per node: `channel-0`
- one prepared channel claim per node
- one active host-managed `ComputeDomain` per host IMEX domain/fabric when isolation matters
- `allocationMode: Single` only
- no driver-managed host IMEX readiness or peer discovery
This version reuses the code the repo already has:
- existing `ComputeDomain` CRD
- existing `ComputeDomainChannelConfig`
- existing workload `ResourceClaimTemplate` rendering
- existing `compute-domain-default-channel.nvidia.com` `DeviceClass`
- existing checkpoint V2 and CDI prepare/unprepare flow
- existing node-local channel conflict check for channel `0`
One caveat about the "reuse" framing: the existing channel-prepare
code in device_state.go:551-554,586 already supports
AllocationMode: All by slicing
nvCapImexChanDevInfos[:maxImexChannelCount] and injecting every
channel chardev. Host-managed v2 must actively suppress that
branch (see §7.2 step 3 and §14.1); the gate doesn't just inherit
"what's already there."
It deliberately does not add an allocator CRD, new status fields, checkpoint V3, ResourceClaim finalizers, mandatory webhook behavior, or multi-slot ResourceSlice publishing.
The current repo already publishes channel-0 and the Helm
chart already installs a DeviceClass that selects only that channel. The
controller already creates a workload ResourceClaimTemplate whose opaque
config carries the ComputeDomain UID and allocation mode. The kubelet plugin
already writes a checkpoint, creates a CDI spec, and injects
/dev/nvidia-caps-imex-channels/channel0.
The high-footprint design work was needed to support multiple concurrent
ComputeDomains on the same IMEX fabric with unique channel IDs. That is not
part of this alpha. Dropping that requirement removes the need for:
- an `IMEXChannelAllocation` CRD and reaper
- clique-wide optimistic concurrency
- new `ComputeDomain.status.channels` or conditions
- new per-claim finalizers
- a checkpoint schema migration
- a scheduler-visible slot model
- live Kubernetes lookups in admission
The tradeoff is explicit: v2 is a small operational mode, not a complete multi-tenant isolation design.
With HostManagedIMEX=true, the driver:
- watches `ComputeDomain` objects
- adds/removes the existing `ComputeDomain` finalizer (`resource.nvidia.com/computeDomain`) and lets the existing workload-RCT manager add/remove its own RCT finalizer (unchanged)
- creates the workload `ResourceClaimTemplate`
- publishes a per-node `ResourceSlice` with `channel-0`
- prepares channel claims by injecting channel `0` through CDI
- accepts only empty or `allocationMode: Single` during `Prepare`
- ignores (does not reject) `spec.numNodes`
- rejects (with a permanent error) host-managed channel prepare on nodes whose local NVML clique ID is empty. This is new behavior, not a reuse of the existing silent-skip path at `device_state.go:581-584`: that path currently returns an empty configState without an error, which would mask a misconfigured node. Host-managed mode treats "no clique" as a hard prepare failure so the operator sees the misconfiguration immediately.
- maintains existing checkpoint/CDI cleanup behavior
The driver does not:
- create per-`ComputeDomain` DaemonSets
- create daemon `ResourceClaimTemplate`s
- prepare daemon claims
- run or restart `nvidia-imex`
- write `nodes_config.cfg`
- update node labels for a `ComputeDomain`
- wait for `ComputeDomainClique` or daemon readiness before preparing workloads
- prove host IMEX health in Kubernetes status
The operator owns everything below the DRA boundary:
- installing the `nvidia-imex` package
- configuring and starting `nvidia-imex.service`
- populating `/etc/nvidia-imex/nodes_config.cfg`
- ensuring IMEX peers agree on the node map
- ensuring the IMEX channel kernel device major is registered before the kubelet plugin starts
- ensuring channel `0` is usable on every participating node
- monitoring `nvidia-imex` health with host tooling
- draining workloads before restarting or reconfiguring host IMEX
- preventing multiple active isolated jobs from sharing channel `0` on the same host IMEX domain
Users still create a ComputeDomain and use the generated
ResourceClaimTemplate:
```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: train-a
spec:
  numNodes: 0
  channel:
    resourceClaimTemplate:
      name: train-a-imex-channel
    allocationMode: Single
```

The ComputeDomain remains namespaced. Its generated workload
ResourceClaimTemplate remains namespaced with the ComputeDomain.
status.status: Ready in host-managed mode means only that the
controller has admitted the ComputeDomain and the workload
ResourceClaimTemplate exists. It does not mean host IMEX is
running, connected, or healthy; it does not mean any future
Prepare will succeed.
This alpha does not support:
- multiple simultaneous isolated `ComputeDomain`s on one host IMEX fabric
- assigning unique IMEX channel IDs per `ComputeDomain`
- `allocationMode: All`
- publishing `slot-0..slot-N` abstract devices
- creating channels dynamically after plugin startup
- waiting for host IMEX health before `Prepare`
- per-`ComputeDomain` or per-namespace mode selection
- in-place migration while `ComputeDomain` workloads are running
- automatic cleanup of stale objects from a previous driver-managed install
- webhook-enforced host-managed policy
The beta path can add a real allocator if multi-tenant isolation is required. That should be a separate design because it introduces API, status, RBAC, and lifecycle complexity that this v2 intentionally avoids.
Add a project feature gate:
```go
HostManagedIMEX featuregate.Feature = "HostManagedIMEX"
```

Default:

```
Default:    false
PreRelease: featuregate.Alpha
Version:    version.MajorMinor(0, 5)
```

The gate is install-wide. Mixed driver-managed and host-managed
ComputeDomains in the same Helm release are not supported.
When HostManagedIMEX is enabled, resolve feature gates before
the existing dependency validation runs. Concretely, extend
pkg/featuregates/featuregates.go to call a new override helper
before ValidateFeatureGates:
```go
// resolveHostManagedIMEXOverrides forces the two incompatible gates
// off when HostManagedIMEX is on. Runs BEFORE ValidateFeatureGates
// so the existing "ComputeDomainCliques implies IMEXDaemonsWithDNSNames"
// dependency rule trivially holds afterwards.
func resolveHostManagedIMEXOverrides(gates featuregate.MutableFeatureGate) {
	if !gates.Enabled(HostManagedIMEX) {
		return
	}
	if gates.Enabled(IMEXDaemonsWithDNSNames) {
		klog.Infof("HostManagedIMEX is enabled; forcing IMEXDaemonsWithDNSNames=false")
		_ = gates.Set("IMEXDaemonsWithDNSNames=false")
	}
	if gates.Enabled(ComputeDomainCliques) {
		klog.Infof("HostManagedIMEX is enabled; forcing ComputeDomainCliques=false")
		_ = gates.Set("ComputeDomainCliques=false")
	}
}
```

Both defaults are true upstream, so the helper explicitly sets them
to false; it does not "reset to default."
Resolved gate values when the operator sets HostManagedIMEX=true:
```
HostManagedIMEX=true
IMEXDaemonsWithDNSNames=false   (forced)
ComputeDomainCliques=false      (forced)
```
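The ordering requirement (override first, validate second) can be modeled without the real feature-gate library. The sketch below is only an illustration: `gateSet`, `resolveOverrides`, and `validate` are hypothetical stand-ins for `featuregate.MutableFeatureGate`, the override helper, and `ValidateFeatureGates`, and the dependency rule is reduced to the single implication quoted in the helper's comment.

```go
package main

import "fmt"

// gateSet is a hypothetical stand-in for featuregate.MutableFeatureGate.
type gateSet map[string]bool

// resolveOverrides forces the two daemon-related gates off when
// HostManagedIMEX is on and reports which gates it changed.
func resolveOverrides(g gateSet) []string {
	var forced []string
	if !g["HostManagedIMEX"] {
		return forced
	}
	for _, name := range []string{"IMEXDaemonsWithDNSNames", "ComputeDomainCliques"} {
		if g[name] {
			g[name] = false
			forced = append(forced, name)
		}
	}
	return forced
}

// validate models the existing dependency rule:
// ComputeDomainCliques implies IMEXDaemonsWithDNSNames.
func validate(g gateSet) error {
	if g["ComputeDomainCliques"] && !g["IMEXDaemonsWithDNSNames"] {
		return fmt.Errorf("ComputeDomainCliques requires IMEXDaemonsWithDNSNames")
	}
	return nil
}

func main() {
	g := gateSet{
		"HostManagedIMEX":         true,
		"IMEXDaemonsWithDNSNames": true, // upstream default
		"ComputeDomainCliques":    true, // upstream default
	}
	forced := resolveOverrides(g) // must run before validate
	fmt.Println("forced off:", forced, "validation error:", validate(g))
}
```

Because both gates are cleared together, the implication holds regardless of which one the operator had set explicitly.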
The host-managed-specific Helm flag is exactly one:
```
--set featureGates.HostManagedIMEX=true
```

The operator may need additional pre-existing chart-level flags to scope the install (these are not introduced by this design and not specific to host-managed mode):

```
--set resources.computeDomains.enabled=true   # chart default; pass for clarity
--set resources.gpus.enabled=false            # only if GPU DRA is not wanted
```

If the operator wants both compute-domains and GPU DRA in the same
chart, leave `resources.gpus.enabled=true` (chart default) and pass
`--set gpuResourcesEnabledOverride=true`; that is an existing chart
safety guard unrelated to host-managed IMEX.
The controller and kubelet plugin must log the resolved gate overrides at startup. This keeps the existing defaults for normal driver-managed mode while avoiding a three-gate operator recipe for host-managed mode.
For a non-deleting ComputeDomain, host-managed mode does this:
- Fetch the current `ComputeDomain` by UID.
- Add the existing `ComputeDomain` finalizer if missing. The constant is `resource.nvidia.com/computeDomain`, defined at `cmd/compute-domain-controller/computedomain.go:50-52` as `computeDomainFinalizer = computeDomainLabelKey`. The same string is also the node-label key; this overload is in the existing codebase. Implementers should reuse `computeDomainFinalizer`, not introduce a new constant.
- Create the workload `ResourceClaimTemplate` using the existing `WorkloadResourceClaimTemplateManager.Create`. That helper already adds and tracks its own finalizer on the workload RCT; no new RCT-level finalizer logic is required for v2.
- Set `ComputeDomain.status.status=Ready`.
It skips:
- `MultiNamespaceDaemonSetManager.Create`: this is the wrapper around per-namespace `DaemonSetManager` instances. The daemon `ResourceClaimTemplate` creation lives inside `DaemonSetManager.Create` (via `NewDaemonSetResourceClaimTemplateManager` at `daemonset.go:103,161`), so skipping the wrapper implicitly skips daemon-RCT creation. There is no separate daemon-RCT call site to skip.
- stale node-label cleanup (`NodeManager.RemoveStaleComputeDomainLabelsAsync`)
- `NodeManager` (constructed via `NewNodeManager`; there is no `ComputeDomainNodeManager` type in the repo)
- `ComputeDomainStatusManager`
- `ComputeDomainCliqueManager`
- status calculation from `status.nodes` and `spec.numNodes`
spec.numNodes is ignored for host-managed status. Operators should still set
it to 0 because the field is deprecated and has no host-managed readiness
meaning.
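The add-path branch can be sketched as follows. This is a simplified model, not the controller's code: `computeDomain` and `managers` are hypothetical stand-ins, and the manager methods only record that they were called so the difference between the two modes is visible.

```go
package main

import "fmt"

// computeDomainFinalizer matches the constant named in the text.
const computeDomainFinalizer = "resource.nvidia.com/computeDomain"

// computeDomain is a hypothetical, trimmed-down stand-in for the API type.
type computeDomain struct {
	Finalizers []string
	Status     string
}

// managers records which (hypothetical) manager calls a reconcile made.
type managers struct{ calls []string }

func (m *managers) createDaemonSets()  { m.calls = append(m.calls, "daemonsets") }
func (m *managers) createWorkloadRCT() { m.calls = append(m.calls, "workload-rct") }

func hasFinalizer(cd *computeDomain, name string) bool {
	for _, f := range cd.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// reconcileAdd sketches the add/update branch: the host-managed path never
// touches the DaemonSet wrapper and marks the domain Ready immediately.
func reconcileAdd(cd *computeDomain, m *managers, hostManaged bool) {
	if !hasFinalizer(cd, computeDomainFinalizer) {
		cd.Finalizers = append(cd.Finalizers, computeDomainFinalizer)
	}
	if !hostManaged {
		m.createDaemonSets() // driver-managed only; readiness is computed elsewhere
	}
	m.createWorkloadRCT()
	if hostManaged {
		cd.Status = "Ready"
	}
}

func main() {
	cd, m := &computeDomain{}, &managers{}
	reconcileAdd(cd, m, true)
	fmt.Println(cd.Finalizers, m.calls, cd.Status)
}
```

Running the reconcile twice leaves a single finalizer in place, which is the idempotence the real reconcile loop also needs.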
For a deleting ComputeDomain, host-managed mode does this:
- Delete the workload `ResourceClaimTemplate`.
- Remove the finalizer from the workload `ResourceClaimTemplate`.
- Assert the workload `ResourceClaimTemplate` is gone.
- Remove the `ComputeDomain` finalizer.
- Forget metrics for that `ComputeDomain`.
It does not delete DaemonSets, daemon claim templates, node labels, or
ComputeDomainClique objects because it never creates them in this mode.
Migration relies on deleting all existing ComputeDomains before flipping the
gate. v2 does not include a startup sweep for legacy objects.
Expected small code surface:
| File | Change |
|---|---|
| `pkg/featuregates/featuregates.go` | Add `HostManagedIMEX` and resolved override helper |
| `cmd/compute-domain-controller/computedomain.go` | Branch add/delete reconciliation when the gate is enabled |
| `cmd/compute-domain-controller/controller.go` | Avoid constructing or starting daemon/node/status/clique managers if needed by the implementation |
| `cmd/compute-domain-controller/*_test.go` | Add host-managed controller tests |
No CRD type or generated client changes are required.
Host-managed mode publishes the current channel-zero device and no daemon device:
```
channel-0
  attributes:
    compute-domain.nvidia.com/type = "channel"
    compute-domain.nvidia.com/id = 0
```
The existing compute-domain-default-channel.nvidia.com DeviceClass remains
valid because it already selects type == "channel" and id == 0.
The plugin does not publish slot-* devices and does not expose more than one
IMEX channel per node.
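As a rough illustration of the publishing branch, the sketch below builds the device list for both modes and applies the default-channel selector to it. The `device` type and the `daemon-0` device name are assumptions made for the example; only `channel-0` and its two attributes come from this document.

```go
package main

import "fmt"

// device is a hypothetical, simplified view of a published DRA device.
type device struct {
	Name string
	Type string
	ID   int
}

// devicesToPublish sketches the publishing branch: channel-0 is always
// present, the daemon device only in driver-managed mode.
func devicesToPublish(hostManaged bool) []device {
	devs := []device{{Name: "channel-0", Type: "channel", ID: 0}}
	if !hostManaged {
		// "daemon-0" is an illustrative name, not taken from the repo.
		devs = append(devs, device{Name: "daemon-0", Type: "daemon", ID: 0})
	}
	return devs
}

// matchesDefaultChannelClass mirrors the selector described above
// (type == "channel" and id == 0).
func matchesDefaultChannelClass(d device) bool {
	return d.Type == "channel" && d.ID == 0
}

func main() {
	for _, d := range devicesToPublish(true) {
		fmt.Println(d.Name, "selected by default channel class:", matchesDefaultChannelClass(d))
	}
}
```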
For ComputeDomainChannelConfig, host-managed mode keeps the current prepare
ordering:
1. Decode and validate opaque config from allocation status.
2. Require exactly one allocation result.
3. Reject any `AllocationMode` value that is not the empty string and not `Single`. Today's `ComputeDomainChannelConfig.Validate()` at `computedomainconfig.go:49-55` only checks `DomainID` is non-empty, so an unknown `AllocationMode` like `"foo"` currently falls through and is silently treated as `Single` (because the only branch in `applyComputeDomainChannelConfig` checks `== "All"`). Under `HostManagedIMEX`, Prepare must add an explicit allowlist check on the opaque-config value rather than relying on the `if AllocationMode == "All"` branch alone. The CRD-level enum (`computedomain.go:100`, `+kubebuilder:validation:Enum=All;Single`) only protects user-created `ComputeDomain` objects, not the opaque configs the kubelet sees.
4. Build `DeviceConfigState` with `Type=channel` and `ComputeDomain=<domainID>`.
5. Check the local checkpoint for an existing completed allocation of channel `0`.
6. Assert the `ComputeDomain` exists in the same namespace as the claim.
7. Require a non-empty local clique ID. If `s.computeDomainManager.cliqueID == ""`, return a permanent error ("host-managed IMEX requires an NVLink clique on this node; NVML reports none"). This replaces the existing silent-skip behavior at `device_state.go:581-584` for the host-managed code path only; the gate-off path is unchanged.
8. Append CDI edits for `nvCapImexChanDevInfos[0]`.
9. Let the existing checkpoint/CDI code mark `PrepareCompleted`.
Host-managed mode skips:
- `AddNodeLabel`
- `AssertComputeDomainReady`
- any call to daemon settings `Prepare`
- any `/imexd` mount generation
The node-local channel conflict check remains intentionally strict. If a
second ResourceClaim lands on the same node while a completed channel-zero
claim is prepared, Prepare fails via the existing checkpoint conflict path.
This design does not require changing that existing conflict into a permanent
error; the scheduler should normally prevent the conflict, and the checkpoint
check remains the node-local backstop.
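The node-local backstop can be pictured as a lookup over the checkpoint. The following is a sketch only: `preparedClaim` and `checkChannelZero` are hypothetical, and the real checkpoint stores more than a completion flag.

```go
package main

import "fmt"

// preparedClaim is a hypothetical checkpoint entry for channel 0.
type preparedClaim struct {
	DomainID  string
	Completed bool
}

// checkChannelZero sketches the node-local backstop: a completed entry from
// a different claim blocks Prepare, while a retry of the same claim passes.
// The error is a plain (retryable) error, matching the note above that the
// conflict does not need to become a permanent error.
func checkChannelZero(checkpoint map[string]preparedClaim, claimUID string) error {
	for uid, entry := range checkpoint {
		if uid != claimUID && entry.Completed {
			return fmt.Errorf("channel 0 already prepared for claim %s (domain %s)", uid, entry.DomainID)
		}
	}
	return nil
}

func main() {
	checkpoint := map[string]preparedClaim{
		"claim-a": {DomainID: "cd-train-a", Completed: true},
	}
	fmt.Println("retry of claim-a:", checkChannelZero(checkpoint, "claim-a"))
	fmt.Println("new claim-b:", checkChannelZero(checkpoint, "claim-b"))
}
```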
Unprepare keeps the existing checkpoint-driven cleanup:
- Read the prepared claim from checkpoint.
- Delete the generated CDI spec.
- Remove the checkpoint entry.
Host-managed mode skips RemoveNodeLabel. It does not touch host
nvidia-imex.
In host-managed mode, ComputeDomainDaemonConfig should not be allocated
because:
- the controller does not create daemon claim templates
- Helm does not render the daemon `DeviceClass`
- the plugin does not publish daemon devices
If a stale or manually created daemon claim reaches Prepare, the plugin
returns a permanent error explaining that daemon claims are disabled under
HostManagedIMEX.
Expected small code surface:
| File | Change |
|---|---|
| `cmd/compute-domain-kubelet-plugin/driver.go` | Do not publish daemon devices when the gate is enabled |
| `cmd/compute-domain-kubelet-plugin/device_state.go` | Add the host-managed branch in channel prepare/unprepare and reject daemon prepare |
| `cmd/compute-domain-kubelet-plugin/device_state_test.go` | Cover `Single`, `All`, no-clique, and channel conflict cases |
No checkpoint schema change is required.
Use the existing featureGates values map:
```yaml
featureGates:
  HostManagedIMEX: true
```

Template changes:

| Template | Host-managed behavior |
|---|---|
| `controller.yaml` | No new env var; `FEATURE_GATES` is already plumbed |
| `kubeletplugin.yaml` | No new env var; `FEATURE_GATES` is already plumbed |
| `deviceclass-compute-domain-default-channel.yaml` | Keep rendering |
| `deviceclass-compute-domain-daemon.yaml` | Do not render |
| `rbac-compute-domain-daemon.yaml` | Do not render |
No Helm values for slotsPerNode, maxIMEXChannels, allocator reaper
intervals, webhook requirements, or mode markers are added.
The chart should not require webhook.enabled=true for HostManagedIMEX.
The current webhook may remain enabled for existing GPU-driver behavior, but
this v2 does not depend on it and does not extend it into a host-managed
compute-domain admission contract. Host-managed safety is enforced in the
controller and kubelet plugin.
v2 enforces host-managed policy at Prepare (a kubelet-plugin
permanentError), so an invalid claim is accepted by the API server and
surfaces later as a pod-level event rather than being rejected synchronously at
kubectl apply. The earlier v1 design closed this UX gap with a mandatory
webhook; v2 deliberately does not, to avoid the cert-manager requirement, the
webhook.enabled=true chart fail-guard, the ComputeDomainChannelConfig
schema change (the DomainNamespace/DomainName triple), and the live-Get
RBAC that came with it.
A clean beta follow-up — separate from this alpha — is an optional admission
rule (default-off, no chart fail-guard, reusing the existing config schema)
that mirrors the kubelet-plugin allowlist for HostManagedIMEX:
- reject `ComputeDomainChannelConfig.AllocationMode` not in {`""`, `Single`}
- reject `ComputeDomainDaemonConfig` opaque configs
- optionally reject obvious multi-device shapes (`exactly.count > 1`, `firstAvailable[*]`)
This would convert those Prepare-time pod failures into immediate kubectl apply errors. It must stay advisory/defense-in-depth: the kubelet-plugin
permanentError paths remain the source of truth (they also cover the
upgrade-skew and pre-existing-claim windows a webhook cannot), and the gate must
never require the webhook to be enabled.
The scheduler sees only per-node channel-zero capacity. That means Kubernetes
can prevent two separate channel claims from being allocated to the same node,
but it cannot prevent two different ComputeDomains on different nodes from
using the same host IMEX channel in the same fabric.
The isolation rule for v2 is therefore operational:
Run at most one active isolated host-managed `ComputeDomain` per host IMEX domain/fabric.
If two workloads intentionally share the same IMEX communication domain, they
can use the same ComputeDomain and the same host IMEX configuration. If they
need isolation, v2 is not sufficient.
This matches NVIDIA IMEX channel behavior: channel-based isolation requires
consistent channel assignment across all nodes, and broad access to channel
0 means workloads are not isolated from each other.
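Because the scheduler cannot see this fabric-wide constraint, an operator may want an out-of-band audit. The sketch below shows the shape of such a check; the `allocation` record and the idea of labelling each node with a fabric name are assumptions for the example and are not provided by the driver.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a hypothetical record of one prepared channel claim.
type allocation struct {
	Node     string
	Fabric   string // host IMEX domain the node belongs to
	DomainID string // ComputeDomain UID
}

// sharedFabrics returns every fabric on which more than one ComputeDomain
// holds channel 0, with the domain IDs sorted.
func sharedFabrics(allocs []allocation) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, a := range allocs {
		if seen[a.Fabric] == nil {
			seen[a.Fabric] = map[string]bool{}
		}
		seen[a.Fabric][a.DomainID] = true
	}
	out := map[string][]string{}
	for fabric, domains := range seen {
		if len(domains) < 2 {
			continue
		}
		for d := range domains {
			out[fabric] = append(out[fabric], d)
		}
		sort.Strings(out[fabric])
	}
	return out
}

func main() {
	// Two claims on different nodes: no per-node conflict, same fabric.
	allocs := []allocation{
		{Node: "node-a", Fabric: "fabric-1", DomainID: "cd-train-a"},
		{Node: "node-b", Fabric: "fabric-1", DomainID: "cd-train-b"},
	}
	fmt.Println(sharedFabrics(allocs))
}
```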
ComputeDomain.status.status has weak semantics in host-managed mode:
| Field | Meaning under HostManagedIMEX |
|---|---|
| `status.status=Ready` | Controller admitted the ComputeDomain and the workload ResourceClaimTemplate exists. Says nothing about host IMEX or whether a future `Prepare` will succeed. |
| `status.nodes` | Not populated by host-managed mode |
| Host IMEX health | Not represented |
| Channel ID | Always `0`, not recorded in API status |
Operators must monitor host IMEX directly, for example with
systemctl status nvidia-imex, logs, and nvidia-imex-ctl -N when the command
service is enabled.
The driver can expose ordinary controller/plugin logs and existing DRA metrics, but it does not scrape host IMEX health in this version.
The minimal implementation keeps the current startup assumption: the kubelet
plugin must be able to discover the nvidia-caps-imex-channels device major
when it starts.
Required host state before the plugin starts:
- NVIDIA driver loaded
- `/proc/devices` contains `nvidia-caps-imex-channels`
- channel `0` can be used by workloads
- host `nvidia-imex.service` is configured and running for real workloads
v2 does not add lazy channel discovery, fsnotify, or ResourceSlice republish when host channels appear later. If the host state is missing, fix the node and restart the kubelet plugin.
Gate flips are stop-the-world operations.
To move from driver-managed to host-managed mode:

- Cordon/drain all nodes that run ComputeDomain workloads.
- Delete all `ComputeDomain` objects.
- Wait for generated DaemonSets and claim templates to disappear.
- Start and validate host `nvidia-imex.service` on every participating node.
- Upgrade Helm with `featureGates.HostManagedIMEX=true`.
- Recreate `ComputeDomain`s with `numNodes: 0` and `allocationMode: Single`.
- Uncordon and resubmit workloads.

To move from host-managed back to driver-managed mode:

- Cordon/drain ComputeDomain workloads.
- Delete all `ComputeDomain` objects.
- Stop and mask host `nvidia-imex.service` so it cannot conflict with the driver-managed daemon pods.
- Upgrade Helm with `featureGates.HostManagedIMEX=false`.
- Recreate `ComputeDomain`s for driver-managed mode.
- Uncordon workloads.
No in-place adoption is provided. Running workloads are not migrated.
| Failure | Result | Owner |
|---|---|---|
| Host IMEX is down at plugin start or at workload run time | Pod may start, CUDA IMEX operations fail. Driver never restarts host IMEX. | Operator |
| `nodes_config.cfg` differs across nodes | IMEX domain stays down or degraded; not visible from K8s | Operator |
| `nodes_config.cfg` missing or empty | `nvidia-imex.service` exits or fails its `ConditionPathExists`; driver still publishes `channel-0` but workloads fail at CUDA time | Operator |
| Channel major (`nvidia-caps-imex-channels`) missing in `/proc/devices` at plugin start | Kubelet plugin fails to initialize (existing behavior, not changed by v2) | Operator |
| Host `channel0` chardev missing but channel major is registered | `Prepare` can still succeed because CDI emits `mknod` instructions inside the container from major/minor. Treat this as a host setup/observability warning, not by itself as proof the workload will fail. | Operator |
| `allocationMode: All` (or any unknown non-empty value) | Channel claim `Prepare` fails permanently | User/operator |
| Unknown `AllocationMode` opaque-config string (e.g. `"foo"`) | Existing `Validate()` only checks `DomainID`; under the gate, `Prepare` adds an explicit allowlist check and rejects | Driver (under gate) |
| Two channel claims on one node | Second `Prepare` fails due to checkpoint conflict (existing behavior) | Driver |
| Two isolated CDs on same fabric | Both can use channel `0`; isolation is not guaranteed | Operator |
| Empty local clique ID at `Prepare` | Permanent prepare error (new under gate; replaces today's silent-skip path) | Driver (under gate) |
| Nonzero `numNodes` | Silently ignored; never rejected | User/operator |
| Daemon claim reaches `Prepare` | Permanent error (no daemon devices published, no daemon RCTs created; the claim is from a stale or hand-crafted object) | Driver (under gate) |
| Existing driver-managed objects present during gate flip | Undefined/stale objects; migration procedure was skipped | Operator |
- feature gate registration and `resolveHostManagedIMEXOverrides` behavior (both `IMEXDaemonsWithDNSNames` and `ComputeDomainCliques` end up `false` after the helper runs and the existing dependency validator passes)
- controller add path creates only the workload RCT and sets `status.status=Ready`; the `ComputeDomain` finalizer (`resource.nvidia.com/computeDomain`) is added
- controller delete path deletes only the workload RCT, removes RCT and CD finalizers, and calls `metrics.ForgetComputeDomain`
- kubelet `ResourceSlice` omits daemon device under the gate
- channel prepare skips `AddNodeLabel` and `AssertComputeDomainReady`
- channel prepare rejects `allocationMode: All` and unknown non-empty opaque-config modes (e.g. `"foo"`) via the new explicit allowlist check, not via the existing `if == "All"` branch
- channel prepare rejects empty clique ID as a permanent error (not the existing silent-skip)
- daemon prepare returns permanent error under the gate
- unprepare skips `RemoveNodeLabel` and still removes CDI/checkpoint state (unchanged from today otherwise)
- render Helm with the gate on and verify:
  - default channel `DeviceClass` exists
  - daemon `DeviceClass` does not exist
  - daemon RBAC/service account do not exist
  - controller and kubelet plugin receive `FEATURE_GATES`
- fake a `ComputeDomain` and verify only the workload claim template is created
- verify no `computedomain-daemon-*` DaemonSet is created
- host IMEX running, one `ComputeDomain`, one pod per node: pod sees `/dev/nvidia-caps-imex-channels/channel0` inside the container.
- stop host IMEX (`systemctl stop nvidia-imex` on a node holding a workload pod): verify (a) the kubelet plugin does NOT restart it (no Pod restarts; no plugin log lines about IMEX lifecycle), and (b) any subsequent CUDA shareable-handle operation in the workload surfaces a CUDA error inside the workload container. Verification is end-to-end against the workload itself, not against any driver-exposed status field.
- second channel claim on same node: scheduler or `Prepare` prevents it (the per-node checkpoint conflict check is the backstop).
- `allocationMode: All`: pod fails with a clear permanent prepare error.
- no clique/fabric node: pod fails with a clear permanent prepare error (the new behavior introduced for host-managed mode; verify the error message names the missing clique condition).
- Add `HostManagedIMEX` feature gate and the `resolveHostManagedIMEXOverrides` helper. Wire it in front of `ValidateFeatureGates` in the controller and kubelet-plugin startup paths. Log the resolved overrides.
- Branch controller `onAddOrUpdate` add path under the gate to:
  - add the existing `computeDomainFinalizer` (the `resource.nvidia.com/computeDomain` constant);
  - call only `WorkloadResourceClaimTemplateManager.Create` (skip `MultiNamespaceDaemonSetManager.Create`, which transitively skips daemon-RCT creation since the daemon RCT manager lives inside `DaemonSetManager`);
  - set `ComputeDomain.status.status=Ready`.
- Branch controller `onAddOrUpdate` delete path under the gate to:
  - delete the workload RCT and remove its finalizer;
  - remove the `ComputeDomain` finalizer;
  - call `metrics.ForgetComputeDomain` for that CD.
- Branch kubelet `ResourceSlice` publishing under the gate to publish only the existing `channel-0` device (no daemon device).
- Branch channel Prepare under the gate to:
  - skip `AddNodeLabel`;
  - skip `AssertComputeDomainReady`;
  - validate `AllocationMode` is empty or `"Single"` (return permanent error otherwise; this is new validation, not the existing `if == "All"` check);
  - return a permanent error when `cliqueID == ""` (replaces today's silent-skip for the host-managed path only).
- Branch channel Unprepare under the gate to skip `RemoveNodeLabel`. Everything else stays as today (CDI delete + checkpoint delete via the existing helpers).
- Reject `ComputeDomainDaemonConfig` Prepare under the gate with a clear permanent error.
- Hide `deviceclass-compute-domain-daemon.yaml` and `rbac-compute-domain-daemon.yaml` in Helm when the gate is on.
- Add focused tests, including the test-plan §14 scenarios, in particular fault-injection at every prepare step listed in §7.2 and verification that delete-path metrics forget runs.
- Update docs and examples to show `numNodes: 0`, `allocationMode: Single`, and the one-active-isolated-CD-per-host-IMEX-domain limitation. Document boot-order preconditions from §11 in the operator guide and `docs/prerequisites.md`.
- Repo: `api/nvidia.com/resource/v1beta1/computedomain.go`
- Repo: `api/nvidia.com/resource/v1beta1/computedomainconfig.go`
- Repo: `cmd/compute-domain-controller/computedomain.go`
- Repo: `cmd/compute-domain-controller/resourceclaimtemplate.go`
- Repo: `cmd/compute-domain-kubelet-plugin/device_state.go`
- Repo: `cmd/compute-domain-kubelet-plugin/driver.go`
- Repo: `deployments/helm/dra-driver-nvidia-gpu/templates/deviceclass-compute-domain-default-channel.yaml`
- NVIDIA IMEX Service for NVLink Networks: IMEX channels https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html
- NVIDIA IMEX Service for NVLink Networks: deployment models https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html
- NVIDIA IMEX Service for NVLink Networks: config options https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html
- NVIDIA IMEX Service for NVLink Networks: command service https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/cmdservice.html