| Component | Details |
|---|---|
| GPUs | 8× NVIDIA A100-SXM4-80GB (640 GB total VRAM) |
| CUDA | 12.4 |
| Driver | 535.161.08 |
| OS | Ubuntu 22.04 |
| Docker | 24.0.7 (user in docker group, no sudo needed) |
| NVIDIA container runtime | Confirmed working ✓ |
| vLLM docker image | vllm/vllm-openai:latest (v0.17.1) |
NemoClaw = OpenClaw AI coding agent running inside an OpenShell sandbox, backed by local vLLM inference.
Goal: A fully local (no cloud inference) agent loop where:
- The LLM (`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`) runs on GPU 0 via a vLLM Docker container
- All agent network egress is intercepted and requires explicit approval via the OpenShell TUI
- Model fits on a single A100-80G (59 GB BF16 weights, 11.6 GB KV cache at 131k context)
cd /raid/vjawa/nemo_claw_test/openshell-openclaw-plugin
# 1. Start vLLM (waits until healthy — ~80s on first load)
./scripts/start-vllm.sh
# 2. Ensure OpenShell gateway is running
openshell status # should say "Connected"
# if not: openshell gateway start --name nemoclaw
# 3. Launch the agent walkthrough (no API key needed)
./scripts/walkthrough.sh

In the right tmux pane, press Up and edit the prompt:

openclaw agent --agent main --local --session-id live -m "Fetch the current NVIDIA stock price"

The left pane shows the OpenShell TUI, where you approve or deny each outbound network request.
Problem: NGC_API_KEY from ~/.ngc/config is a Docker registry credential for nvcr.io pulls. It is rejected by integrate.api.nvidia.com with HTTP 401. A separate nvapi-* key from build.nvidia.com is required for NIM inference.
→ Fix: Use local vLLM instead. No inference API key needed.
Problem:
Error: × No active gateway.
openshell inference set fails if called before the gateway is up.
→ Fix: Start gateway first (takes ~30s):
openshell gateway start --name nemoclaw
# Verify: openshell status → "Connected"

Problem: The OpenShell gateway runs inside a Docker/k3s pod. Using localhost:8000 as the provider URL resolves to the pod's own loopback interface, not the host, so inference calls never reach vLLM.
→ Fix:
openshell provider create \
--name vllm-local \
--type openai \
--credential "OPENAI_API_KEY=dummy" \
--config "OPENAI_BASE_URL=http://host.docker.internal:8000/v1"

walkthrough.sh now runs this automatically whenever a local vLLM is detected.
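The loopback problem above comes down to which machine a hostname refers to. As a minimal sketch (not the logic in walkthrough.sh), the rewrite from a host-side URL to one the gateway pod can reach might look like this. The function name `gateway_reachable_url` and the `LOOPBACK_HOSTS` set are hypothetical, and the sketch assumes `host.docker.internal` resolves to the host from inside the pod, as it does in the setup described here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames that mean "this machine"; inside the gateway pod they resolve
# to the pod's own loopback rather than the Docker host.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def gateway_reachable_url(url: str, host_alias: str = "host.docker.internal") -> str:
    """Rewrite a loopback base URL so it points at the Docker host instead.

    Hypothetical helper for illustration; credentials embedded in the URL
    (user:pass@) are not preserved.
    """
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # not a loopback name, so leave it untouched
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(gateway_reachable_url("http://localhost:8000/v1"))
```

The result is the value to pass as `OPENAI_BASE_URL` when creating the provider; non-loopback URLs pass through unchanged.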
Problem: openshell sandbox connect <name> -- bash -c '...' rejects any argument after the sandbox name:
error: unexpected argument 'bash' found
→ Fix: Generate SSH config and use ssh -t with a pre-uploaded startup script:
openshell sandbox ssh-config "$SANDBOX_NAME" > /tmp/ssh.cfg
ssh -t -F /tmp/ssh.cfg "openshell-${SANDBOX_NAME}" 'bash /tmp/startup.sh'

walkthrough.sh does this automatically.
Problem: nemoclaw onboard creates the sandbox with a name derived from the API key placeholder. The original walkthrough.sh hardcoded --name nemoclaw, so the right pane immediately exited.
→ Fix: Auto-detect the first Ready sandbox:
SANDBOX_NAME=$(openshell sandbox list 2>/dev/null \
| sed 's/\x1b\[[0-9;]*m//g' \
| awk 'NR>1 && $NF=="Ready" { print $1; exit }')

Pass an explicit name as argument to override: ./scripts/walkthrough.sh my-sandbox.
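For readers less familiar with sed and awk, the same selection logic can be sketched in Python: strip ANSI color codes, skip the header row, and return the first name whose last column is `Ready`. The function name `first_ready_sandbox` and the sample listing are hypothetical; the real column layout of `openshell sandbox list` may differ, and the sketch only assumes what the pipeline above assumes (name in the first column, status in the last).

```python
import re
from typing import Optional

# Same escape-sequence pattern that the sed step removes.
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")


def first_ready_sandbox(listing: str) -> Optional[str]:
    """Return the name of the first sandbox whose last column is "Ready"."""
    rows = ANSI_COLOR.sub("", listing).splitlines()[1:]  # skip header (awk NR>1)
    for row in rows:
        fields = row.split()
        if fields and fields[-1] == "Ready":
            return fields[0]
    return None


# Hypothetical listing, for illustration only.
sample = (
    "NAME         CREATED   PHASE\n"
    "old-box      2d ago    \x1b[31mError\x1b[0m\n"
    "my-sandbox   5m ago    \x1b[32mReady\x1b[0m\n"
)
print(first_ready_sandbox(sample))
```

Stripping the color codes first matters: without it, the last field would be `\x1b[32mReady\x1b[0m` and the comparison with `Ready` would never match.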
Problem: After onboarding, openshell inference get showed Model: vllm-local. The gateway forwarded requests with that string as the model ID, causing:
404 The model 'vllm-local' does not exist
→ Fix: walkthrough.sh now detects the actual model name from /v1/models and passes it to openshell inference set:
VLLM_MODEL=$(curl -sf http://localhost:8000/v1/models \
| python3 -c "import json,sys; d=json.load(sys.stdin); print(d['data'][0]['id'])")
openshell inference set --no-verify --provider vllm-local --model "$VLLM_MODEL"

Problem (part 1): Without --enable-auto-tool-choice, vLLM rejects all tool-use requests:
400 "auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set
Problem (part 2): Nemotron uses a custom XML parameter format for tool calls — not the llama3_json or hermes JSON formats:
<tool_call>
<function=tool_name>
<parameter=arg1>
value
</parameter>
</function>
</tool_call>
Both llama3_json and hermes expect JSON inside <tool_call> and silently drop Nemotron's tool calls.
→ Fix: Custom nemotron_tool_parser.py that parses the XML parameter format.
Also patch vllm/tool_parsers/__init__.py to register it, since vLLM validates the --tool-call-parser name before the parser module is imported:
docker run -d \
--gpus '"device=0"' \
-p 8000:8000 \
--shm-size 16g \
-v "${MODEL_PARENT}:/model-parent:ro" \
-v "$(pwd)/scripts/nemotron_tool_parser.py:/usr/local/.../vllm/tool_parsers/nemotron_tool_parser.py:ro" \
-v "$(pwd)/scripts/vllm_tool_parsers_init.py:/usr/local/.../vllm/tool_parsers/__init__.py:ro" \
vllm/vllm-openai:latest \
--model "/model-parent/snapshots/${SNAPSHOT}" \
--served-model-name "nvidia/nemotron-3-nano-30b-a3b" \
--enable-auto-tool-choice \
--tool-call-parser nemotron \
--trust-remote-code \
--max-model-len 131072

Shortcut: ./scripts/start-vllm.sh handles this.
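To make the XML parameter format concrete, here is a minimal standalone sketch of the extraction step. It is not the contents of scripts/nemotron_tool_parser.py and does not implement vLLM's tool parser interface; it only shows how `<function=...>` and `<parameter=...>` blocks could be turned into a name and an arguments mapping. The function `parse_tool_calls` and the tool name `web_fetch` are hypothetical, and all parameter values are kept as plain strings.

```python
import json
import re

# Patterns for the Nemotron XML parameter format shown earlier.
TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract tool calls as dicts with a name and string-valued arguments."""
    calls = []
    for block in TOOL_CALL.findall(text):
        for name, body in FUNCTION.findall(block):
            # Values sit on their own lines, so trim surrounding whitespace.
            args = {key: value.strip() for key, value in PARAMETER.findall(body)}
            calls.append({"name": name, "arguments": args})
    return calls


# Hypothetical model output, for illustration only.
sample = """<tool_call>
<function=web_fetch>
<parameter=url>
https://example.com
</parameter>
</function>
</tool_call>"""
print(json.dumps(parse_tool_calls(sample)))
```

A JSON-oriented parser would try to decode the text between the `<tool_call>` tags as JSON, which fails on this input; that is consistent with the dropped tool calls described above.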
Problem: which vllm → not found.
→ Fix: Use the pre-pulled vllm/vllm-openai:latest Docker image. start-vllm.sh wraps the docker run.
Problem: HF cache snapshots contain symlinks like config.json -> ../../blobs/.... Mounting only the snapshot dir means Docker can't resolve ../../blobs/, causing:
Invalid repository ID or local directory: '/model'
→ Fix: Mount the parent model directory (which contains both snapshots/ and blobs/) and point vLLM at the snapshot inside it:
-v "/raid/praateekm/hf_cache/hub/models--nvidia--...:/model-parent:ro"
# then: --model /model-parent/snapshots/<hash>

Problem: OpenClaw agent sessions accumulate tool call history. With --max-model-len 32768 and max_tokens=4096, only 28,672 input tokens are usable. A medium-length coding session hits this after a few turns:
400 You passed 28673 input tokens and requested 4096 output tokens.
However, the model's context length is only 32768 tokens.
The GPU KV cache was only 0.5% utilized at 32768 — there was ample VRAM headroom.
→ Fix: Use --max-model-len 131072 (128K context). This gives:
- 11.6 GB KV cache
- 404,528 total cached tokens
- 14.45× max concurrency at 131K tokens per request
--max-model-len 131072

start-vllm.sh now defaults to 131072.
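The token budget behind this fix is simple arithmetic: input tokens plus requested output tokens must not exceed the context length. The sketch below reproduces the numbers from the error message, assuming the check is exactly that sum (which matches the 400 response quoted above). The helper names are hypothetical.

```python
def max_input_tokens(max_model_len: int, max_output_tokens: int) -> int:
    """Largest prompt that still leaves room for the requested output."""
    return max_model_len - max_output_tokens


def fits(input_tokens: int, max_model_len: int, max_output_tokens: int) -> bool:
    """True if prompt plus requested output stays within the context length."""
    return input_tokens + max_output_tokens <= max_model_len


for limit in (32768, 131072):
    print(limit, max_input_tokens(limit, 4096))

# The failing request from the error message: 28673 input + 4096 output.
print(fits(28673, 32768, 4096))
print(fits(28673, 131072, 4096))
```

At 32768 the request is one token over the limit; at 131072 the same session has 126,976 input tokens available before the limit is reached.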
| File | Description |
|---|---|
| `scripts/start-vllm.sh` | Start/stop/status for vLLM Docker container (defaults: Nemotron Nano 30B, 131k context, GPU 0) |
| `scripts/nemotron_tool_parser.py` | Custom vLLM tool parser for Nemotron XML tool call format |
| `scripts/vllm_tool_parsers_init.py` | Patched vLLM `__init__.py` that registers the `nemotron` parser name |
| `scripts/walkthrough.sh` | Fixed: auto-configures openshell provider, detects sandbox + model name, no API key required |
| `scripts/walkthrough-nim.sh` | Variant: NIM cloud/local endpoint instead of vLLM |
/raid/praateekm/hf_cache/hub/models--nvidia--NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/
snapshots/378df16e4b54901a3f514f38ea9a34db9d061634/ # 59 GB BF16
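The path above follows the Hugging Face cache layout, in which snapshot files are symlinks into a sibling blobs/ directory. As a sketch of the mount fix described earlier, the following derives the `-v` volume and `--model` argument from a snapshot path. The helper `mount_plan` is hypothetical, and `/model-parent` is the mount point used in this document's docker run command.

```python
from pathlib import PurePosixPath


def mount_plan(snapshot_dir: str, mount_point: str = "/model-parent"):
    """Return (volume_spec, model_arg) for a Hugging Face cache snapshot path."""
    snap = PurePosixPath(snapshot_dir)
    if snap.parent.name != "snapshots":
        raise ValueError(f"not a snapshot directory: {snapshot_dir}")
    # The model root holds both snapshots/ and blobs/, so symlinks such as
    # config.json -> ../../blobs/... still resolve inside the container.
    model_root = snap.parent.parent
    volume = f"{model_root}:{mount_point}:ro"
    model_arg = f"{mount_point}/snapshots/{snap.name}"
    return volume, model_arg


snapshot = (
    "/raid/praateekm/hf_cache/hub/"
    "models--nvidia--NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/"
    "snapshots/378df16e4b54901a3f514f38ea9a34db9d061634/"
)
volume, model_arg = mount_plan(snapshot)
print("-v", volume)
print("--model", model_arg)
```

Mounting the model root instead of the snapshot directory is what keeps the `../../blobs/` targets visible to vLLM.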