Poolside Platform Install Log - Slot 6

Cluster Provisioning

  • Date: 2026-01-31
  • Slot: 6 (previously used slot 7)
  • Profile: AWS_PROFILE=sandbox
  • Command: AWS_PROFILE=sandbox ./cluster.sh apply 6 --gpu -auto-approve
  • Reason for rebuild: Previous slot-7 cluster had TLS/certificate issues after installation; decided to tear down and start fresh.
  • Previous session PRs merged: PR #56 (tpl + awsCredentialsSecretName), PR #57 (slice handling + inference SA + path-style S3)
  • Slot 7 teardown: Completed. Required manual cleanup of 3 orphaned K8s security groups blocking VPC deletion.
  • Status: Cluster provisioned. 6 nodes Ready (3 masters + 3 workers including GPU).
  • NooBaa fix: defaultBackingStoreSpec is deprecated in NooBaa 5.14.21. Set manualDefaultBackingStore: true instead, which also skips the AWS cloud-credential auto-detection, then created a separate BackingStore resource of type pv-pool. Fixed in tfoc2 post-install.tf; see the sketch after this list.
  • NooBaa status: Ready, with PVPool BackingStore (250Gi on gp3-csi) and S3 route at s3.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
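
A minimal sketch of that fix as raw API objects (in practice it lives in tfoc2 post-install.tf; resource names follow operator defaults, and numVolumes is assumed):

```bash
# Opt out of the deprecated default backing store and of the AWS
# cloud-credential auto-detection:
oc -n openshift-storage patch noobaa noobaa --type merge \
  -p '{"spec":{"manualDefaultBackingStore":true}}'

# Explicit PV-pool BackingStore matching the size/class noted above:
oc apply -f - <<'EOF'
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: noobaa-pv-backing-store
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 1          # assumed
    storageClass: gp3-csi
    resources:
      requests:
        storage: 250Gi
EOF
```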

values-slot6.yaml

  • Copied from values-slot7.yaml with hostnames changed to ocp-slot-6
  • Located at helm-charts-workspace/values-slot6.yaml
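
The copy amounted to a hostname substitution; a sketch, assuming the ocp-slot-7 hostnames are the only slot-specific strings:

```bash
cd helm-charts-workspace
cp values-slot7.yaml values-slot6.yaml
sed -i '' 's/ocp-slot-7/ocp-slot-6/g' values-slot6.yaml   # BSD sed, per the macOS workstation
```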

INSTALL.md Walkthrough

Step 1: Prerequisites

  • Cluster is up, tools available. Satisfied.

Step 2: Image Preparation

  • Enabled default route for internal OpenShift image registry
  • Registry: default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
  • Pushed all 10 container images from bundle/containers/ to the poolside/ namespace in the internal registry via skopeo (commands sketched after this list)
  • Images pushed: atlas:20260123, forge_api:5.1.15, forge_bridge_sandbox:0.0.3, forge_sandbox:0.0.2, forge_sandbox_daemon:0.0.2, sandbox_runsc_installer:0.2.0, sandworm:0.0.1, ubuntu:22.04.0, web-assistant:5.1.15, web-bridge:0.0.3
  • Applied the system:image-puller role to system:serviceaccounts:poolside-models in the poolside namespace for cross-namespace pull access
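
A sketch of the push and the grant (the docker-archive filename is illustrative; the bundle layout may differ):

```bash
REGISTRY=default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de

# Authenticate against the internal registry with the current session token:
skopeo login -u "$(oc whoami)" -p "$(oc whoami -t)" "$REGISTRY"

# Push one of the ten images (repeat per image):
skopeo copy \
  docker-archive:bundle/containers/atlas-20260123.tar \
  docker://"$REGISTRY"/poolside/atlas:20260123

# Cross-namespace pull access for poolside-models service accounts:
oc policy add-role-to-group system:image-puller \
  system:serviceaccounts:poolside-models -n poolside
```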

Step 3: External Dependencies

  • 3A (Database): Created poolside-db-secret in poolside namespace with random POSTGRESQL_PASSWORD. Using bundled postgres.
  • 3B (Encryption Key): Created encryption-key-secret in poolside namespace with random 32-byte key.
  • 3C (S3 Credentials): Created aws-credentials secret in both poolside and poolside-models namespaces using NooBaa admin credentials.
  • S3 Bucket: Created ObjectBucketClaim in openshift-storage. Actual bucket name: poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4. Updated values-slot6.yaml accordingly.
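
Equivalent commands, for reproducibility (the ENCRYPTION_KEY key name and the OBC name are assumptions; namespaces and the bucket prefix are real):

```bash
# 3A / 3B: secrets with random values:
oc -n poolside create secret generic poolside-db-secret \
  --from-literal=POSTGRESQL_PASSWORD="$(openssl rand -hex 16)"
oc -n poolside create secret generic encryption-key-secret \
  --from-literal=ENCRYPTION_KEY="$(openssl rand -base64 32)"

# 3C: NooBaa admin credentials copied into both namespaces:
ACCESS_KEY=$(oc -n openshift-storage extract secret/noobaa-admin \
  --keys=AWS_ACCESS_KEY_ID --to=-)
SECRET_KEY=$(oc -n openshift-storage extract secret/noobaa-admin \
  --keys=AWS_SECRET_ACCESS_KEY --to=-)
for ns in poolside poolside-models; do
  oc -n "$ns" create secret generic aws-credentials \
    --from-literal=AWS_ACCESS_KEY_ID="$ACCESS_KEY" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$SECRET_KEY"
done

# S3 bucket: the OBC whose generated name matched poolside-data-a03a...:
oc apply -f - <<'EOF'
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: poolside-data
  namespace: openshift-storage
spec:
  generateBucketName: poolside-data
  storageClassName: openshift-storage.noobaa.io
EOF
```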

Step 4: TLS

  • Using default OpenShift router wildcard cert via edge-terminated Routes. No custom TLS injection.
  • Decision: avoid the custom TLS approach that caused issues on slot-7.
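
For the record, the Route shape this decision implies (service name illustrative; the chart renders these in Step 5):

```bash
oc apply -f - <<'EOF'
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: chat
  namespace: poolside
spec:
  host: chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
  to:
    kind: Service
    name: web-assistant   # illustrative
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
EOF
```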

Step 5: Installation

  • helm install poolside-deployment ./poolside-deployment --namespace poolside -f values-slot6.yaml
  • All pods Running: core-api (3 replicas), postgres, models-reconciler, web-assistant (3), web-bridge (3)
  • 12 Routes created with edge/Redirect TLS termination
  • Web assistant responding at https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/ (HTTP 200)
  • API docs responding at https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/docs (HTTP 302)
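
Spot checks along these lines confirm the above:

```bash
oc -n poolside get pods
oc -n poolside get routes
curl -sk -o /dev/null -w '%{http_code}\n' \
  https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/        # 200
curl -sk -o /dev/null -w '%{http_code}\n' \
  https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/docs     # 302
```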

Step 6: Post-Installation

Cognito IdP setup

  • Cognito user pool: us-east-2_IZs6J1SLb (poolside-greg-test) in AWS sandbox profile — same pool used for prior slots
  • App client: 6tdb1tkogda28rnh7oq7ett139
  • Added slot-6 callback URLs: https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback and https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback (CLI equivalent sketched after this list)
  • Provider URL: https://cognito-idp.us-east-2.amazonaws.com/us-east-2_IZs6J1SLb
  • IdP bound via the web UI. Initially bound to the wrong pool (poolside-ocp-slot-6 / us-east-2_fQ3QSyfAk), which required a tenant reset (TRUNCATE tenant CASCADE) before rebinding to the correct pool.
  • Trusted router CA cert on local machine: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/ocp-slot6-router-ca.crt
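
CLI equivalent of the callback-URL change (the console may have been used instead; note that update-user-pool-client replaces unspecified client settings, so any existing URLs and the OAuth config must be re-supplied alongside these):

```bash
AWS_PROFILE=sandbox aws cognito-idp update-user-pool-client \
  --user-pool-id us-east-2_IZs6J1SLb \
  --client-id 6tdb1tkogda28rnh7oq7ett139 \
  --callback-urls \
    "https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback" \
    "https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback"
```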

GPU Operators

  • p5e.48xlarge node: ip-10-100-52-169.us-east-2.compute.internal
  • The terraform run (cluster.sh apply) had failed at the NooBaa step, so the NFD and GPU operators were never installed automatically
  • Installed NFD operator (channel: stable, source: redhat-operators) manually
  • Created NodeFeatureDiscovery instance in openshift-nfd namespace
  • Installed NVIDIA GPU Operator (channel: v25.3, source: certified-operators, CSV: gpu-operator-certified.v25.3.4) manually
  • Created ClusterPolicy with driver, toolkit, device-plugin, dcgm, gfd, migManager enabled
  • Result: 8x nvidia.com/gpu detected on p5e.48xlarge node
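
Sketch of the manual operator installs (OperatorGroups omitted; the nvidia-gpu-operator namespace is the conventional one and assumed here):

```bash
# NFD operator:
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: stable
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# NVIDIA GPU Operator, pinned to the CSV noted above:
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: v25.3
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v25.3.4
EOF

# After the NodeFeatureDiscovery and ClusterPolicy instances reconcile:
oc get node ip-10-100-52-169.us-east-2.compute.internal \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'   # expect 8
```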

Step 7: Model Deployment (INSTALL.md)

  • Following INSTALL.md Option A (Poolside Model Downloader) — using splash models create with --mode kubernetes
  • Configured splash CLI: updated apiBaseURL to slot-6, ran splash login

Model checkpoint transfer to NooBaa

  • Source: s3://poolside-ue2-versions/checkpoints/malibu-v2.20251021/ (AWS S3, us-east-2, ~74.3 GiB)
  • Destination: s3://poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4/checkpoints/malibu-v2.20251021/ (NooBaa, in-cluster)
  • Created temporary s3-source-credentials secret with AWS SSO session credentials (expires 2026-02-01T22:52:35Z)
  • Created s3-transfer pod in the poolside namespace running amazon/aws-cli; two-stage transfer (sketched below): download from AWS S3 to local disk, then upload to NooBaa
  • NooBaa CA cert mounted from openshift-service-ca.crt ConfigMap at /etc/ssl/noobaa-ca/
  • Transfer speed: ~172 MiB/s from AWS S3
  • Status: in progress...
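
Sketch of the two stages as run inside the pod (the in-cluster endpoint and CA filename are assumptions based on the standard NooBaa s3 Service and the service-ca.crt key; switching between the two credential sets is elided):

```bash
# Stage 1: pull the checkpoint from AWS S3 with the temporary SSO credentials:
aws s3 sync \
  s3://poolside-ue2-versions/checkpoints/malibu-v2.20251021/ \
  /data/malibu-v2.20251021/

# Stage 2: push to NooBaa over its in-cluster S3 endpoint:
export AWS_CA_BUNDLE=/etc/ssl/noobaa-ca/service-ca.crt
aws configure set default.s3.addressing_style path
aws s3 sync /data/malibu-v2.20251021/ \
  s3://poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4/checkpoints/malibu-v2.20251021/ \
  --endpoint-url https://s3.openshift-storage.svc
```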

Model registration

  • TODO: splash models create Malibu once checkpoint transfer completes

Known Fixes Applied (from previous session)

These are already merged into the release/1.20260128.0 branch:

  1. additional_config_maps.yaml.tmpl - supports map and slice data types with tpl processing
  2. container.yaml format - uses [{name, value}] array format for reconciler compatibility
  3. AWS_S3_ADDRESSING_STYLE=path - required for NooBaa path-style S3 addressing (see the illustration after this list)
  4. inference_service_account.yaml template - creates inference SA in models namespace (reconciler hardcodes Create: false)
  5. awsCredentialsSecretName wired through to models-reconciler env
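
Illustration of the shape from fixes 2 and 3 (not a literal chart excerpt; surrounding keys omitted):

```bash
# The models-reconciler requires the [{name, value}] array form, not a map:
cat <<'YAML'
env:
  - name: AWS_S3_ADDRESSING_STYLE
    value: "path"
YAML
```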

Lessons Learned (from previous session)

  • Always follow INSTALL.md step by step
  • Update this log for EVERY decision and state change
  • Delete old *-0.3.1.tgz files before helm upgrade (stale tgz pitfall)
  • Create poolside-registry-secret in poolside-models namespace per INSTALL.md Step 2
  • Apply system:image-puller RoleBinding for cross-namespace image pull access
  • Full pipeline for chart changes: terraform apply -> helm package -> helm upgrade
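
In command form (chart path, version glob, and release name as used above; a sketch, not verbatim history):

```bash
rm -f ./poolside-deployment-*.tgz   # stale-tgz pitfall
terraform apply
helm package ./poolside-deployment
helm upgrade poolside-deployment ./poolside-deployment-*.tgz \
  --namespace poolside -f values-slot6.yaml
```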