@dougbtv
Last active April 28, 2026 12:30
Poolside Laguna: build, run, and smoke test howto

Poolside Laguna Howto

Build Info

Field                Value
Model                Poolside Laguna (model card TBD), poolside.ai
Image                quay.io/vllm/rhaiis-early-access:poolside-laguna
Build                Run 25024994672
nm-cicd branch       doug/poolside-laguna-0day
nm-vllm-ent branch   doug/poolside-laguna-0day
Target device        cuda
Python               3.12

Build Commands

Build wheel + image (from scratch):

gh workflow run build-whl-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/poolside-laguna-0day \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/poolside-laguna-0day \
  -f build_label=k8s-a100-build-13-0 \
  -f build_timeout=120 \
  -f image_label=ibm-wdc-k8s-h100-dind \
  -f python=3.12 \
  -f release_image=false \
  -f target_device=cuda

Important: Use build_label=k8s-a100-build-13-0 (CUDA 13 runner). Building on a CUDA 12 runner produces a wheel that links against libcudart.so.12, which fails at runtime because the container base image ships CUDA 13 only.

Rebuild image only (reusing an existing wheel from a previous run):

gh workflow run build-image.yml \
  --repo neuralmagic/nm-cicd \
  --ref doug/poolside-laguna-0day \
  -f repo=neuralmagic/nm-vllm-ent \
  -f branch=doug/poolside-laguna-0day \
  -f build_label=ibm-wdc-k8s-h100-dind \
  -f release_image=false \
  -f run_id=25024994672 \
  -f target_device=cuda

How to Run

Prerequisites

  • 1x H100 80GB GPU (the FP8 weights are ~33 GiB and fit on a single GPU with --gpu-memory-utilization 0.95)
  • Or 1x H200 140GB GPU (more headroom, so --gpu-memory-utilization can be left at its default)
  • FP8 weights downloaded to local disk
  • podman with NVIDIA CDI support

Architecture

Laguna is a sparse Mixture-of-Experts model:

  • 256 experts, top-8 routing (~33.7B total params, ~3.3B activated)
  • Heterogeneous attention: some layers use sliding window attention (SWA), others use global attention with different numbers of query heads
  • Custom model code: configuration_laguna.py and modeling_laguna.py are bundled with the weights (requires --trust-remote-code)
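
To put the routing numbers in perspective, here is a back-of-the-envelope calculation that uses only the figures quoted above. It assumes nothing about Laguna's internals beyond those figures, and the `pct` helper is a hypothetical name introduced purely for illustration.

```shell
# Percentage helper (hypothetical, for illustration only): pct PART WHOLE
pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

experts_per_token=$(pct 8 256)     # share of experts selected per token
params_per_token=$(pct 3.3 33.7)   # share of parameters activated per token

echo "experts used per token:    ${experts_per_token}%"
echo "parameters used per token: ${params_per_token}%"
```

The activated-parameter share comes out larger than the expert share. That is what one would expect if non-expert weights such as attention and embeddings run for every token, although the exact breakdown is not stated in this howto.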

Pull the image

podman pull quay.io/vllm/rhaiis-early-access:poolside-laguna

Start the server (H100 — 80 GiB)

podman run --rm \
  --name vllm-laguna \
  --device nvidia.com/gpu=0 \
  --security-opt=label=disable \
  --ipc=host \
  -p 8000:8000 \
  -v /path/to/laguna-fp8:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  quay.io/vllm/rhaiis-early-access:poolside-laguna \
    --model /model \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.95

Start the server (H200 — 140 GiB)

podman run --rm \
  --name vllm-laguna \
  --device nvidia.com/gpu=0 \
  --security-opt=label=disable \
  --ipc=host \
  -p 8000:8000 \
  -v /path/to/laguna-fp8:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  quay.io/vllm/rhaiis-early-access:poolside-laguna \
    --model /model \
    --tensor-parallel-size 1 \
    --enforce-eager \
    --max-model-len 4096 \
    --trust-remote-code

Notes:

  • --enforce-eager is recommended to avoid CUDAGraph issues with the MoE architecture. It disables CUDA graph capture, which typically costs some throughput.
  • --trust-remote-code is required — Laguna ships custom model code with the weights.
  • On H100 (80 GiB), --gpu-memory-utilization 0.95 is needed to fit the FP8 weights plus KV cache.
  • Rope parameter warnings and the Mistral tokenizer regex warning in the logs are cosmetic — ignore them.
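
As a rough sketch of the memory budget behind the H100 setting: vLLM treats --gpu-memory-utilization as the fraction of GPU memory the server may claim, and whatever remains after the weights is shared by the KV cache, activations, and other runtime overhead. The calculation below uses the nominal 80 GiB and ~33 GiB figures from this howto. `headroom_gib` is a hypothetical helper, and real numbers will differ with driver overhead and the capacity the device actually reports.

```shell
# headroom_gib TOTAL_GIB UTILIZATION WEIGHTS_GIB
# Prints the memory (GiB) left under the utilization cap once the weights are loaded.
# Nominal figures only; this is an estimate, not a measurement.
headroom_gib() {
  awk -v t="$1" -v u="$2" -v w="$3" 'BEGIN { printf "%.1f\n", t * u - w }'
}

echo "H100 at 0.95: $(headroom_gib 80 0.95 33) GiB"
echo "H100 at 0.90: $(headroom_gib 80 0.90 33) GiB"
```

With these inputs the two lines report 43.0 GiB and 39.0 GiB. The 0.90 row is included for comparison on the assumption that it matches vLLM's default setting.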

Watch startup logs

podman logs -f vllm-laguna

Look for "Application startup complete." in the output. Model loading takes roughly 60-90 seconds.
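
For scripted runs, watching the logs by hand can be replaced with a polling loop against the /health endpoint that the smoke test uses. The following is a minimal sketch, not part of the image: `wait_for_health` and its timeout and interval defaults are hypothetical and chosen only to sit comfortably above the load time quoted above.

```shell
# wait_for_health [URL] [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls URL until it answers with a 2xx status or the timeout expires.
# Returns 0 when the server is healthy, 1 on timeout.
wait_for_health() {
  url="${1:-http://127.0.0.1:8000/health}"
  timeout="${2:-180}"
  interval="${3:-5}"
  start=$(date +%s)
  while [ $(( $(date +%s) - start )) -lt "$timeout" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each attempt.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "server is healthy"
      return 0
    fi
    sleep "$interval"
  done
  echo "timed out after ${timeout}s waiting for $url" >&2
  return 1
}
```

Calling `wait_for_health` with no arguments after `podman run` blocks until the server answers, and a non-zero exit status can be used to fail a CI step.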

Smoke Test

Health check

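curl -i http://127.0.0.1:8000/health

A healthy server responds with HTTP 200 and an empty body, so -i is used here to print the status line.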

Chat request

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [{"role": "user", "content": "Tell me about the tech scene in Burlington, Vermont."}],
    "max_tokens": 512
  }'

Expected output (truncated):

Burlington, Vermont has a surprisingly vibrant and dynamic tech scene for its size! While it's much smaller than major tech hubs, it offers a unique blend of innovation, entrepreneurship, and community spirit...

Completion request

curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }'
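
Both endpoints return OpenAI-style JSON, so the generated text can be pulled out for a pass/fail check. The sketch below assumes `jq` is installed, which this howto does not otherwise require, and `extract_text` is a hypothetical helper name.

```shell
# extract_text: read an OpenAI-style response on stdin and print the generated text.
# Chat responses carry it in .choices[0].message.content, completion responses
# in .choices[0].text. With -e, jq exits non-zero when neither field is present.
extract_text() {
  jq -er '.choices[0] | (.message.content // .text)'
}

# Usage (requires the server from this howto to be running):
#   curl -s http://127.0.0.1:8000/v1/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "/model", "prompt": "The capital of France is", "max_tokens": 32}' \
#     | extract_text
```

Because the helper exits non-zero on an error payload, it can gate a smoke test without inspecting the text itself.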

Cleanup

podman stop vllm-laguna
# The container was started with --rm, so podman removes it automatically once it stops.