Poolside Laguna: build, run, and smoke test howto

Poolside Laguna Howto

Build Info

| Field | Value |
| --- | --- |
| Model | Poolside Laguna (model card TBD), poolside.ai |
| Image | `quay.io/vllm/rhaiis-early-access:poolside-laguna` |
| Build Run | 25024994672 |
| nm-cicd branch | `doug/poolside-laguna-0day` |
| nm-vllm-ent branch | `doug/poolside-laguna-0day` |
| Target device | cuda |
| Python | 3.12 |

Build Commands

Build wheel + image (from scratch):

gh workflow run build-whl-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/poolside-laguna-0day \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/poolside-laguna-0day \
  -f build_label=k8s-a100-build-13-0 \
  -f build_timeout=120 \
  -f image_label=ibm-wdc-k8s-h100-dind \
  -f python=3.12 \
  -f release_image=false \
  -f target_device=cuda

Important: Use `build_label=k8s-a100-build-13-0`, the CUDA 13 runner. A wheel built on a CUDA 12 runner links against `libcudart.so.12` and fails to load at runtime, because the container base image ships only CUDA 13.
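One way to catch this mismatch before pushing an image is to check which CUDA runtime the compiled extension links against. The helper below is only a sketch and is not part of the nm-cicd workflow: `cudart_majors` is a hypothetical function name, and the shared-object path in the usage comment is a placeholder for wherever the wheel's compiled extension is installed in your environment.

```shell
# Hypothetical helper: reads `ldd` output on stdin and prints each distinct
# libcudart major version that appears in it, one per line.
cudart_majors() {
  grep -oE 'libcudart\.so\.[0-9]+' | sed -E 's/^libcudart\.so\.//' | sort -u
}

# Usage (the path is a placeholder, adjust it to your installed wheel):
#   ldd /path/to/site-packages/vllm/_C.abi3.so | cudart_majors
# Per the note above, an output of 12 suggests the wheel was built on a
# CUDA 12 runner and will not load in the CUDA 13 image.
```

If the function prints nothing, the extension does not reference libcudart dynamically, and this check says nothing either way.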

Rebuild image only (reusing an existing wheel from a previous run):

gh workflow run build-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/poolside-laguna-0day \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/poolside-laguna-0day \
  -f build_label=ibm-wdc-k8s-h100-dind \
  -f release_image=false \
  -f run_id=25024994672 \
  -f target_device=cuda

How to Run

Prerequisites

1x H100 80 GB GPU (the FP8 model is ~33 GiB and fits on a single GPU with `--gpu-memory-utilization 0.95`)
Or 1x H200 141 GB GPU (more headroom, so `--gpu-memory-utilization` can stay at its default)
FP8 weights downloaded to local disk
podman with NVIDIA CDI support

Architecture

Laguna is a sparse Mixture-of-Experts model:

256 experts, top-8 routing (~33.7B total params, ~3.3B activated)
Heterogeneous attention: some layers use sliding window attention (SWA), others use global attention with different numbers of query heads
Custom model code: configuration_laguna.py and modeling_laguna.py are bundled with the weights (requires --trust-remote-code)
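Because `--trust-remote-code` loads these two files from the model directory, a quick pre-flight check before mounting the directory can save a failed start. This is a minimal sketch, assuming the weights sit in a local directory: `check_model_dir` is a hypothetical helper name, and `/path/to/laguna-fp8` is the same placeholder used in the `podman run` commands in this howto.

```shell
# Hypothetical pre-flight check: confirm the custom model code named above
# is present next to the weights. Returns 0 if both files exist, 1 otherwise.
check_model_dir() {
  cmd_model_dir="$1"
  cmd_missing=0
  for cmd_file in configuration_laguna.py modeling_laguna.py; do
    if [ ! -f "$cmd_model_dir/$cmd_file" ]; then
      echo "missing: $cmd_model_dir/$cmd_file" >&2
      cmd_missing=1
    fi
  done
  return "$cmd_missing"
}

# Usage (placeholder path):
#   check_model_dir /path/to/laguna-fp8 && echo "model directory looks complete"
```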

Pull the image

podman pull quay.io/vllm/rhaiis-early-access:poolside-laguna

Start the server (H100, 80 GB)

podman run --rm \
  --name vllm-laguna \
  --device nvidia.com/gpu=0 \
  --security-opt=label=disable \
  --ipc=host \
  -p 8000:8000 \
  -v /path/to/laguna-fp8:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  quay.io/vllm/rhaiis-early-access:poolside-laguna \
    --model /model \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.95

Start the server (H200, 141 GB)

podman run --rm \
  --name vllm-laguna \
  --device nvidia.com/gpu=0 \
  --security-opt=label=disable \
  --ipc=host \
  -p 8000:8000 \
  -v /path/to/laguna-fp8:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  quay.io/vllm/rhaiis-early-access:poolside-laguna \
    --model /model \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 4096 \
    --trust-remote-code

Notes:

`--enforce-eager` is recommended to avoid CUDA graph issues with the MoE architecture.

`--trust-remote-code` is required because Laguna ships custom model code with the weights.

On H100 (80 GB), `--gpu-memory-utilization 0.95` is needed to fit the FP8 weights plus KV cache.

RoPE parameter warnings and the Mistral tokenizer regex warning in the logs are cosmetic and can be ignored.
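The H100 memory note above comes down to simple arithmetic. The sketch below is a rough estimate, not how vLLM actually sizes its KV cache: it takes the share of GPU memory vLLM is allowed to claim and subtracts the ~33 GiB of FP8 weights, ignoring activations and runtime overhead. `kv_budget_gib` is a hypothetical helper, the nominal 80 GB is treated as 80 GiB for simplicity, and 0.90 is vLLM's default for `--gpu-memory-utilization`.

```shell
# Hypothetical helper: rough memory left for the KV cache, in GiB.
#   $1 = total GPU memory (GiB)
#   $2 = --gpu-memory-utilization fraction
#   $3 = model weight size (GiB)
kv_budget_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}

kv_budget_gib 80 0.95 33   # prints 43.0
kv_budget_gib 80 0.90 33   # prints 39.0
```

By this estimate, raising the fraction from the default to 0.95 frees roughly 4 GiB more on an 80 GB card.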

Watch startup logs

podman logs -f vllm-laguna

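Look for `Application startup complete.` in the output; model loading takes about 60 to 90 seconds. The `podman run` commands above stay in the foreground and print the same logs, so run `podman logs` from a second terminal.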

Smoke Test

Health check

curl http://127.0.0.1:8000/health
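A healthy server answers `/health` with HTTP 200 and an empty body, so a bare `curl` prints nothing; add `-i` to see the status line. Because loading takes a minute or more, a script can poll the endpoint instead of watching the logs. The function below is a minimal sketch: `wait_for_health` is a hypothetical name, and the 180-second default timeout and 5-second interval are arbitrary assumptions.

```shell
# Hypothetical helper: poll the health endpoint until it returns success
# or the timeout (in seconds) elapses.
#   $1 = health URL (default: http://127.0.0.1:8000/health)
#   $2 = timeout in seconds (default: 180, an arbitrary choice)
wait_for_health() {
  wfh_url="${1:-http://127.0.0.1:8000/health}"
  wfh_timeout="${2:-180}"
  wfh_waited=0
  # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
  until curl -sf -o /dev/null "$wfh_url"; do
    if [ "$wfh_waited" -ge "$wfh_timeout" ]; then
      echo "server not healthy after ${wfh_timeout}s" >&2
      return 1
    fi
    sleep 5
    wfh_waited=$((wfh_waited + 5))
  done
  echo "server healthy after ~${wfh_waited}s"
}

# Usage:
#   wait_for_health && echo "ready for the smoke test"
```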

Chat request

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Tell me about the tech scene in Burlington, Vermont."}],
    "max_tokens": 512
  }'

Expected output (truncated):

Burlington, Vermont has a surprisingly vibrant and dynamic tech scene for its size! While it's much smaller than major tech hubs, it offers a unique blend of innovation, entrepreneurship, and community spirit...
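Generated text varies between runs, so an automated smoke test is better off checking the response structure than the wording. The sketch below sends a short chat request and passes if the JSON contains a `choices` field. `smoke_chat` is a hypothetical helper, the prompt is an arbitrary example, and the plain `grep` check is a deliberately crude stand-in for real JSON parsing.

```shell
# Hypothetical smoke test: send one chat request and check that the
# response contains a "choices" field.
#   $1 = server base URL (default: http://127.0.0.1:8000)
smoke_chat() {
  sc_base_url="${1:-http://127.0.0.1:8000}"
  sc_response=$(curl -sf "$sc_base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "/model", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 16}') || {
    echo "FAIL: request failed" >&2
    return 1
  }
  if printf '%s' "$sc_response" | grep -q '"choices"'; then
    echo "PASS"
  else
    echo "FAIL: no choices in response" >&2
    return 1
  fi
}

# Usage:
#   smoke_chat
```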

Completion request

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }'

Cleanup

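podman stop vllm-laguna
# The container was started with --rm, so stopping it already removes it.
# --ignore keeps this from failing when the container is already gone.
podman rm --ignore vllm-laguna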
