Poolside Laguna: build, run, and smoke test howto
| Field | Value |
|---|---|
| Model | Poolside Laguna (model card TBD) — poolside.ai |
| Image | quay.io/vllm/rhaiis-early-access:poolside-laguna |
| Build Run | 25024994672 |
| nm-cicd branch | doug/poolside-laguna-0day |
| nm-vllm-ent branch | doug/poolside-laguna-0day |
| Target device | cuda |
| Python | 3.12 |
Build wheel + image (from scratch):
gh workflow run build-whl-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/poolside-laguna-0day \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/poolside-laguna-0day \
-f build_label=k8s-a100-build-13-0 \
-f build_timeout=120 \
-f image_label=ibm-wdc-k8s-h100-dind \
-f python=3.12 \
-f release_image=false \
-f target_device=cuda

Important: Use `build_label=k8s-a100-build-13-0` (CUDA 13 runner). Building on a CUDA 12 runner produces a wheel that links against `libcudart.so.12`, which fails at runtime because the container base image ships CUDA 13 only.
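
One way to confirm which CUDA runtime a built artifact links against is to scan `ldd` output for the `libcudart` soname. The following is a minimal sketch: the helper name `cudart_major` and the shared-object path in the usage line are hypothetical and not part of the build workflow.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the major
# version of the first libcudart soname found (for example 12 or 13).
# Prints nothing if no libcudart dependency is listed.
cudart_major() {
  grep -o 'libcudart\.so\.[0-9][0-9]*' | head -n 1 | sed 's/.*\.so\.//'
}

# Usage (the path is a placeholder for a compiled extension inside the wheel):
# ldd /path/to/extension.so | cudart_major
```

A result of 12 would indicate the wheel was built on a CUDA 12 runner and should be rebuilt with the CUDA 13 `build_label`.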
Rebuild image only (reusing an existing wheel from a previous run):
gh workflow run build-image.yml \
--repo neuralmagic/nm-cicd \
--ref doug/poolside-laguna-0day \
-f repo=neuralmagic/nm-vllm-ent \
-f branch=doug/poolside-laguna-0day \
-f build_label=ibm-wdc-k8s-h100-dind \
-f release_image=false \
-f run_id=25024994672 \
-f target_device=cuda

Prerequisites:
- 1x H100 80GB GPU (FP8 model is ~33 GiB, fits on a single GPU at 95% utilization)
- Or 1x H200 140GB GPU (more headroom, no need for `--gpu-memory-utilization`)
- FP8 weights downloaded to local disk
- podman with NVIDIA CDI support
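
The single-GPU claim can be sanity-checked with rough arithmetic. This sketch uses the figures quoted above (80 GiB device, 0.95 utilization, ~33 GiB of FP8 weights). The function name `kv_headroom_gib` is illustrative, and the result ignores activation memory and runtime overhead, so treat it as an upper bound on what is left for the KV cache.

```shell
# Illustrative helper: memory left after loading weights, in GiB.
# $1 = total GPU memory (GiB), $2 = utilization fraction, $3 = weight size (GiB)
kv_headroom_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

kv_headroom_gib 80 0.95 33   # prints 43.0
```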
Laguna is a sparse Mixture-of-Experts model:
- 256 experts, top-8 routing (~33.7B total params, ~3.3B activated)
- Heterogeneous attention: some layers use sliding window attention (SWA), others use global attention with different numbers of query heads
- Custom model code: `configuration_laguna.py` and `modeling_laguna.py` are bundled with the weights (requires `--trust-remote-code`)
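
Because the custom model code ships with the weights, it can help to confirm both files are present before mounting the directory into the container. A minimal sketch, assuming the two filenames listed above sit at the top level of the weights directory. The function name `check_laguna_code` is hypothetical.

```shell
# Hypothetical helper: check that a weights directory contains the custom
# model code files. Reports each missing file on stderr and returns
# non-zero if any is absent.
check_laguna_code() {
  dir="$1"
  missing=0
  for f in configuration_laguna.py modeling_laguna.py; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# check_laguna_code /path/to/laguna-fp8
```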
Pull the image:
podman pull quay.io/vllm/rhaiis-early-access:poolside-laguna

Run on a single H100 (80 GiB):
podman run --rm \
--name vllm-laguna \
--device nvidia.com/gpu=0 \
--security-opt=label=disable \
--ipc=host \
-p 8000:8000 \
-v /path/to/laguna-fp8:/model \
-e CUDA_VISIBLE_DEVICES=0 \
quay.io/vllm/rhaiis-early-access:poolside-laguna \
--model /model \
--tensor-parallel-size 1 \
--enforce-eager \
--max-model-len 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95

Run on a single H200 (no `--gpu-memory-utilization` needed):
podman run --rm \
--name vllm-laguna \
--device nvidia.com/gpu=0 \
--security-opt=label=disable \
--ipc=host \
-p 8000:8000 \
-v /path/to/laguna-fp8:/model \
-e CUDA_VISIBLE_DEVICES=0 \
quay.io/vllm/rhaiis-early-access:poolside-laguna \
--model /model \
--tensor-parallel-size 1 \
--enforce-eager \
--max-model-len 4096 \
--trust-remote-code

Notes:
- `--enforce-eager` is recommended to avoid CUDAGraph issues with the MoE architecture.
- `--trust-remote-code` is required because Laguna ships custom model code with the weights.
- On H100 (80 GiB), `--gpu-memory-utilization 0.95` is needed to fit the FP8 weights plus KV cache.
- Rope parameter warnings and the Mistral tokenizer regex warning in the logs are cosmetic and can be ignored.
podman logs -f vllm-laguna

Look for `Application startup complete.` in the output. Model loading takes ~60-90 seconds.
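
As an alternative to watching the logs, a script can poll the server's `/health` endpoint until it responds. The sketch below is a generic retry loop. The name `wait_until` and the attempt and delay values are illustrative choices, picked to cover the ~60-90 second load time with some margin.

```shell
# Retry a probe command until it succeeds or the attempt budget runs out.
# $1 = max attempts, $2 = seconds between attempts, remaining args = probe command.
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the container above is publishing port 8000):
# wait_until 60 3 curl -sf http://127.0.0.1:8000/health && echo "server ready"
```

Here `curl -f` makes HTTP error statuses count as failures, so the loop keeps retrying until the endpoint returns a success status.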
curl http://127.0.0.1:8000/health

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/model",
"messages": [{"role": "user", "content": "Tell me about the tech scene in Burlington, Vermont."}],
"max_tokens": 512
}'

Expected output (truncated):
Burlington, Vermont has a surprisingly vibrant and dynamic tech scene for its size! While it's much smaller than major tech hubs, it offers a unique blend of innovation, entrepreneurship, and community spirit...
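
For an automated smoke test, the response can be checked for a non-empty `choices` array instead of reading the text by eye. This is a rough sketch: `has_choices` is a hypothetical helper that does a plain-text pattern match, not real JSON parsing, so it only suits a quick pass/fail check.

```shell
# Hypothetical helper: succeed if the JSON on stdin contains a "choices"
# array whose first element is an object. Newlines are stripped first so
# pretty-printed responses also match.
has_choices() {
  tr -d '\n' | grep -q '"choices"[[:space:]]*:[[:space:]]*\[[[:space:]]*{'
}

# Example (assumes the server is running and reuses the request above):
# curl -s http://127.0.0.1:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "/model", "messages": [{"role": "user", "content": "Tell me about the tech scene in Burlington, Vermont."}], "max_tokens": 512}' \
#   | has_choices && echo "smoke test passed"
```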
curl -s http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/model",
"prompt": "The capital of France is",
"max_tokens": 32
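}'

podman stop vllm-laguna
# The container was started with --rm, so podman removes it when it stops.
# The next command is only needed if you started it without --rm.
podman rm vllm-laguna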