Poolside Platform Install Log - Slot 6

Cluster Provisioning

  • Date: 2026-01-31
  • Slot: 6 (previously used slot 7)
  • Profile: AWS_PROFILE=sandbox
  • Command: AWS_PROFILE=sandbox ./cluster.sh apply 6 --gpu -auto-approve
  • Reason for rebuild: Previous slot-7 cluster had TLS/certificate issues after installation; decided to tear down and start fresh.
  • Previous session PRs merged: PR #56 (tpl + awsCredentialsSecretName), PR #57 (slice handling + inference SA + path-style S3)
  • Slot 7 teardown: Completed. Required manual cleanup of 3 orphaned K8s security groups blocking VPC deletion.
  • Status: Cluster provisioned. 6 nodes Ready (3 masters + 3 workers including GPU).
  • NooBaa fix: defaultBackingStoreSpec is deprecated in NooBaa 5.14.21. Set manualDefaultBackingStore: true instead, which also skips the AWS cloud-credential auto-detection, then created a separate BackingStore resource of type pv-pool. Fixed in tfoc2 post-install.tf; see the sketch after this list.
  • NooBaa status: Ready, with PVPool BackingStore (250Gi on gp3-csi) and S3 route at s3.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
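
A minimal sketch of that fix as raw API objects (in practice it lives in tfoc2 post-install.tf; resource names follow operator defaults, and numVolumes is assumed):

```bash
# Opt out of the deprecated default backing store and of the AWS
# cloud-credential auto-detection:
oc -n openshift-storage patch noobaa noobaa --type merge \
  -p '{"spec":{"manualDefaultBackingStore":true}}'

# Explicit PV-pool BackingStore matching the size/class noted above:
oc apply -f - <<'EOF'
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: noobaa-pv-backing-store
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 1          # assumed
    storageClass: gp3-csi
    resources:
      requests:
        storage: 250Gi
EOF
```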

values-slot6.yaml

  • Copied from values-slot7.yaml with hostnames changed to ocp-slot-6
  • Located at helm-charts-workspace/values-slot6.yaml
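
The copy amounted to a hostname substitution; a sketch, assuming the ocp-slot-7 hostnames are the only slot-specific strings:

```bash
cd helm-charts-workspace
cp values-slot7.yaml values-slot6.yaml
sed -i '' 's/ocp-slot-7/ocp-slot-6/g' values-slot6.yaml   # BSD sed, per the macOS workstation
```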

INSTALL.md Walkthrough

Step 1: Prerequisites

  • Cluster is up, tools available. Satisfied.

Step 2: Image Preparation

  • Enabled default route for internal OpenShift image registry
  • Registry: default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
  • Pushed all 10 container images from bundle/containers/ to the poolside/ namespace in the internal registry via skopeo (commands sketched after this list)
  • Images pushed: atlas:20260123, forge_api:5.1.15, forge_bridge_sandbox:0.0.3, forge_sandbox:0.0.2, forge_sandbox_daemon:0.0.2, sandbox_runsc_installer:0.2.0, sandworm:0.0.1, ubuntu:22.04.0, web-assistant:5.1.15, web-bridge:0.0.3
  • Applied the system:image-puller role to system:serviceaccounts:poolside-models in the poolside namespace for cross-namespace pull access
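
A sketch of the push and the grant (the docker-archive filename is illustrative; the bundle layout may differ):

```bash
REGISTRY=default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de

# Authenticate against the internal registry with the current session token:
skopeo login -u "$(oc whoami)" -p "$(oc whoami -t)" "$REGISTRY"

# Push one of the ten images (repeat per image):
skopeo copy \
  docker-archive:bundle/containers/atlas-20260123.tar \
  docker://"$REGISTRY"/poolside/atlas:20260123

# Cross-namespace pull access for poolside-models service accounts:
oc policy add-role-to-group system:image-puller \
  system:serviceaccounts:poolside-models -n poolside
```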

Step 3: External Dependencies

  • 3A (Database): Created poolside-db-secret in poolside namespace with random POSTGRESQL_PASSWORD. Using bundled postgres.
  • 3B (Encryption Key): Created encryption-key-secret in poolside namespace with random 32-byte key.
  • 3C (S3 Credentials): Created aws-credentials secret in both poolside and poolside-models namespaces using NooBaa admin credentials.
  • S3 Bucket: Created ObjectBucketClaim in openshift-storage. Actual bucket name: poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4. Updated values-slot6.yaml accordingly.
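
Equivalent commands, for reproducibility (the ENCRYPTION_KEY key name and the OBC name are assumptions; namespaces and the bucket prefix are real):

```bash
# 3A / 3B: secrets with random values:
oc -n poolside create secret generic poolside-db-secret \
  --from-literal=POSTGRESQL_PASSWORD="$(openssl rand -hex 16)"
oc -n poolside create secret generic encryption-key-secret \
  --from-literal=ENCRYPTION_KEY="$(openssl rand -base64 32)"

# 3C: NooBaa admin credentials copied into both namespaces:
ACCESS_KEY=$(oc -n openshift-storage extract secret/noobaa-admin \
  --keys=AWS_ACCESS_KEY_ID --to=-)
SECRET_KEY=$(oc -n openshift-storage extract secret/noobaa-admin \
  --keys=AWS_SECRET_ACCESS_KEY --to=-)
for ns in poolside poolside-models; do
  oc -n "$ns" create secret generic aws-credentials \
    --from-literal=AWS_ACCESS_KEY_ID="$ACCESS_KEY" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$SECRET_KEY"
done

# S3 bucket: the OBC whose generated name matched poolside-data-a03a...:
oc apply -f - <<'EOF'
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: poolside-data
  namespace: openshift-storage
spec:
  generateBucketName: poolside-data
  storageClassName: openshift-storage.noobaa.io
EOF
```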

Step 4: TLS

  • Using default OpenShift router wildcard cert via edge-terminated Routes. No custom TLS injection.
  • Decision: avoid the custom TLS approach that caused issues on slot-7.
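
For the record, the Route shape this decision implies (service name illustrative; the chart renders these in Step 5):

```bash
oc apply -f - <<'EOF'
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: chat
  namespace: poolside
spec:
  host: chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
  to:
    kind: Service
    name: web-assistant   # illustrative
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
EOF
```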

Step 5: Installation

  • helm install poolside-deployment ./poolside-deployment --namespace poolside -f values-slot6.yaml
  • All pods Running: core-api (3 replicas), postgres, models-reconciler, web-assistant (3), web-bridge (3)
  • 12 Routes created with edge/Redirect TLS termination
  • Web assistant responding at https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/ (HTTP 200)
  • API docs responding at https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/docs (HTTP 302)
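
Spot checks along these lines confirm the above:

```bash
oc -n poolside get pods
oc -n poolside get routes
curl -sk -o /dev/null -w '%{http_code}\n' \
  https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/        # 200
curl -sk -o /dev/null -w '%{http_code}\n' \
  https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/docs     # 302
```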

Step 6: Post-Installation

Cognito IdP setup

  • Cognito user pool: us-east-2_IZs6J1SLb (poolside-greg-test) in AWS sandbox profile — same pool used for prior slots
  • App client: 6tdb1tkogda28rnh7oq7ett139
  • Added slot-6 callback URLs: https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback and https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback (CLI equivalent sketched after this list)
  • Provider URL: https://cognito-idp.us-east-2.amazonaws.com/us-east-2_IZs6J1SLb
  • IdP bound via the web UI. Initially bound to the wrong pool (poolside-ocp-slot-6 / us-east-2_fQ3QSyfAk), which required a tenant reset (TRUNCATE tenant CASCADE) before rebinding to the correct pool.
  • Trusted router CA cert on local machine: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/ocp-slot6-router-ca.crt
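
CLI equivalent of the callback-URL change (the console may have been used instead; note that update-user-pool-client replaces unspecified client settings, so any existing URLs and the OAuth config must be re-supplied alongside these):

```bash
AWS_PROFILE=sandbox aws cognito-idp update-user-pool-client \
  --user-pool-id us-east-2_IZs6J1SLb \
  --client-id 6tdb1tkogda28rnh7oq7ett139 \
  --callback-urls \
    "https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback" \
    "https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback"
```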

GPU Operators

  • p5e.48xlarge node: ip-10-100-52-169.us-east-2.compute.internal
  • The terraform run (cluster.sh apply) had failed at the NooBaa step, so the NFD and GPU operators were never installed automatically
  • Installed NFD operator (channel: stable, source: redhat-operators) manually
  • Created NodeFeatureDiscovery instance in openshift-nfd namespace
  • Installed NVIDIA GPU Operator (channel: v25.3, source: certified-operators, CSV: gpu-operator-certified.v25.3.4) manually
  • Created ClusterPolicy with driver, toolkit, device-plugin, dcgm, gfd, migManager enabled
  • Result: 8x nvidia.com/gpu detected on p5e.48xlarge node
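
Sketch of the manual operator installs (OperatorGroups omitted; the nvidia-gpu-operator namespace is the conventional one and assumed here):

```bash
# NFD operator:
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# NVIDIA GPU Operator, pinned to the CSV noted above:
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v25.3
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v25.3.4
EOF

# After the NodeFeatureDiscovery and ClusterPolicy instances reconcile:
oc get node ip-10-100-52-169.us-east-2.compute.internal \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'   # expect 8
```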

Step 7: Model Deployment (INSTALL.md)

  • Following INSTALL.md Option A (Poolside Model Downloader) — using splash models create with --mode kubernetes
  • Configured splash CLI: updated apiBaseURL to slot-6, ran splash login

Model checkpoint transfer to NooBaa

  • Source: s3://poolside-ue2-versions/checkpoints/malibu-v2.20251021/ (AWS S3, us-east-2, ~74.3 GiB)
  • Destination: s3://poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4/checkpoints/malibu-v2.20251021/ (NooBaa, in-cluster)
  • Created temporary s3-source-credentials secret with AWS SSO session credentials (expires 2026-02-01T22:52:35Z)
  • Created s3-transfer pod in the poolside namespace running amazon/aws-cli; two-stage transfer (sketched below): download from AWS S3 to local disk, then upload to NooBaa
  • NooBaa CA cert mounted from openshift-service-ca.crt ConfigMap at /etc/ssl/noobaa-ca/
  • Transfer speed: ~172 MiB/s from AWS S3
  • Status: in progress...
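
Sketch of the two stages as run inside the pod (the in-cluster endpoint and CA filename are assumptions based on the standard NooBaa s3 Service and the service-ca.crt key; switching between the two credential sets is elided):

```bash
# Stage 1: pull the checkpoint from AWS S3 with the temporary SSO credentials:
aws s3 sync \
  s3://poolside-ue2-versions/checkpoints/malibu-v2.20251021/ \
  /data/malibu-v2.20251021/

# Stage 2: push to NooBaa over its in-cluster S3 endpoint:
export AWS_CA_BUNDLE=/etc/ssl/noobaa-ca/service-ca.crt
aws configure set default.s3.addressing_style path
aws s3 sync /data/malibu-v2.20251021/ \
  s3://poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4/checkpoints/malibu-v2.20251021/ \
  --endpoint-url https://s3.openshift-storage.svc
```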

Model registration

  • TODO: splash models create Malibu once checkpoint transfer completes

Known Fixes Applied (from previous session)

These are already merged into the release/1.20260128.0 branch:

  1. additional_config_maps.yaml.tmpl - supports map and slice data types with tpl processing
  2. container.yaml format - uses [{name, value}] array format for reconciler compatibility
  3. AWS_S3_ADDRESSING_STYLE=path - required for NooBaa path-style S3 addressing (see the illustration after this list)
  4. inference_service_account.yaml template - creates inference SA in models namespace (reconciler hardcodes Create: false)
  5. awsCredentialsSecretName wired through to models-reconciler env
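
Illustration of the shape from fixes 2 and 3 (not a literal chart excerpt; surrounding keys omitted):

```bash
# The models-reconciler requires the [{name, value}] array form, not a map:
cat <<'YAML'
env:
  - name: AWS_S3_ADDRESSING_STYLE
    value: "path"
YAML
```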

Lessons Learned (from previous session)

  • Always follow INSTALL.md step by step
  • Update this log for EVERY decision and state change
  • Delete old *-0.3.1.tgz files before helm upgrade (stale tgz pitfall)
  • Create poolside-registry-secret in poolside-models namespace per INSTALL.md Step 2
  • Apply system:image-puller RoleBinding for cross-namespace image pull access
  • Full pipeline for chart changes: terraform apply -> helm package -> helm upgrade
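
In command form (chart path, version glob, and release name as used above; a sketch, not verbatim history):

```bash
rm -f ./poolside-deployment-*.tgz   # stale-tgz pitfall
terraform apply
helm package ./poolside-deployment
helm upgrade poolside-deployment ./poolside-deployment-*.tgz \
  --namespace poolside -f values-slot6.yaml
```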