- Date: 2026-01-31
- Slot: 6 (previously used slot 7)
- Profile: AWS_PROFILE=sandbox
- Command: `AWS_PROFILE=sandbox ./cluster.sh apply 6 --gpu -auto-approve`
- Reason for rebuild: previous slot-7 cluster had TLS/certificate issues after installation; decided to tear down and start fresh.
- Previous session PRs merged: PR #56 (tpl + awsCredentialsSecretName), PR #57 (slice handling + inference SA + path-style S3)
- Slot 7 teardown: Completed. Required manual cleanup of 3 orphaned K8s security groups blocking VPC deletion.
- Status: Cluster provisioned. 6 nodes Ready (3 masters + 3 workers including GPU).
- NooBaa fix: `defaultBackingStoreSpec` is deprecated in NooBaa 5.14.21. Used `manualDefaultBackingStore: true` instead, then created a separate `BackingStore` resource of type `pv-pool`. `manualDefaultBackingStore` was also needed to skip AWS cloud credential auto-detection. Fixed in `tfoc2post-install.tf`.
- NooBaa status: Ready, with PVPool BackingStore (250Gi on gp3-csi) and S3 route at `s3.apps.ocp-slot-6.openshift.sandboxes.poolsi.de`
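The manual `pv-pool` BackingStore described above might look roughly like the manifest below. This is a sketch, not the exact resource applied: the metadata name is hypothetical, and only the size (250Gi) and storage class (gp3-csi) come from this log; verify field names against the NooBaa 5.14 CRD.

```yaml
# Sketch of a manual pv-pool BackingStore; name is hypothetical,
# size and storage class are from this log.
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: noobaa-pv-backing-store   # hypothetical name
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 1
    resources:
      requests:
        storage: 250Gi
    storageClass: gp3-csi
```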
- Values file: copied from `values-slot7.yaml` with hostnames changed to `ocp-slot-6`; located at `helm-charts-workspace/values-slot6.yaml`
- Cluster is up, tools available. Satisfied.
- Enabled default route for internal OpenShift image registry
- Registry: `default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de`
- Pushed all 10 container images from `bundle/containers/` to the `poolside` namespace in the internal registry via skopeo
- Images pushed: atlas:20260123, forge_api:5.1.15, forge_bridge_sandbox:0.0.3, forge_sandbox:0.0.2, forge_sandbox_daemon:0.0.2, sandbox_runsc_installer:0.2.0, sandworm:0.0.1, ubuntu:22.04.0, web-assistant:5.1.15, web-bridge:0.0.3
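A loop along these lines could drive the pushes. This sketch only echoes the skopeo commands rather than running them: the source transport (`dir:`) is an assumption about the bundle layout, the image list is truncated to three of the ten, and credentials handling (`skopeo login` or `--dest-creds`) is omitted.

```shell
# Dry-run sketch: print a skopeo copy command per bundled image.
# Registry route, namespace, and tags are from this log; the dir: source
# transport and the loop itself are illustrative assumptions.
REGISTRY=default-route-openshift-image-registry.apps.ocp-slot-6.openshift.sandboxes.poolsi.de
NAMESPACE=poolside
for image in atlas:20260123 forge_api:5.1.15 web-assistant:5.1.15; do
  name=${image%%:*}   # strip the tag to get the directory name
  echo skopeo copy "dir:bundle/containers/${name}" \
    "docker://${REGISTRY}/${NAMESPACE}/${image}"
done
```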
- Applied the `system:image-puller` role to `system:serviceaccounts:poolside-models` on the `poolside` namespace for cross-namespace pull access
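On OpenShift that grant is typically a one-liner (`oc policy add-role-to-group system:image-puller system:serviceaccounts:poolside-models -n poolside`), but the equivalent can also be expressed declaratively. A sketch of the RoleBinding, with a hypothetical metadata name (group, role, and namespace are from this log):

```yaml
# Sketch: cross-namespace image-pull grant as a RoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: image-puller-poolside-models   # hypothetical name
  namespace: poolside
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:poolside-models
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:image-puller
```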
- 3A (Database): Created `poolside-db-secret` in the `poolside` namespace with a random POSTGRESQL_PASSWORD. Using bundled postgres.
- 3B (Encryption Key): Created `encryption-key-secret` in the `poolside` namespace with a random 32-byte key.
- 3C (S3 Credentials): Created the `aws-credentials` secret in both the `poolside` and `poolside-models` namespaces using NooBaa admin credentials.
- S3 Bucket: Created an `ObjectBucketClaim` in openshift-storage. Actual bucket name: `poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4`. Updated values-slot6.yaml accordingly.
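The random values for 3A/3B can be produced with openssl. In this sketch the `oc create secret` commands are echoed rather than executed, since they need a live cluster; the secret data key names (`POSTGRESQL_PASSWORD`, `key`) are taken from the log and an assumption, respectively.

```shell
# Generate a random DB password (3A) and a random 32-byte encryption key (3B).
PGPASS=$(openssl rand -hex 16)      # 16 random bytes -> 32 hex characters
ENCKEY=$(openssl rand -base64 32)   # 32 random bytes, base64-encoded

# Echoed, not executed: applying these requires cluster access.
# The data key name "key" for the encryption secret is an assumption.
echo oc create secret generic poolside-db-secret -n poolside \
  --from-literal=POSTGRESQL_PASSWORD="$PGPASS"
echo oc create secret generic encryption-key-secret -n poolside \
  --from-literal=key="$ENCKEY"
```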
- Using default OpenShift router wildcard cert via edge-terminated Routes. No custom TLS injection.
- Decision: avoid the custom TLS approach that caused issues on slot-7.
- Installed with: `helm install poolside-deployment ./poolside-deployment --namespace poolside -f values-slot6.yaml`
- All pods Running: core-api (3 replicas), postgres, models-reconciler, web-assistant (3), web-bridge (3)
- 12 Routes created with edge/Redirect TLS termination
- Web assistant responding at `https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/` (HTTP 200)
- API docs responding at `https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/docs` (HTTP 302)
- Cognito user pool: `us-east-2_IZs6J1SLb` (poolside-greg-test) in the AWS `sandbox` profile — same pool used for prior slots
- App client: `6tdb1tkogda28rnh7oq7ett139`
- Added slot-6 callback URLs: `https://api.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback` and `https://chat.apps.ocp-slot-6.openshift.sandboxes.poolsi.de/auth/callback`
- Provider URL: `https://cognito-idp.us-east-2.amazonaws.com/us-east-2_IZs6J1SLb`
- IdP bound via web UI. Tenant reset required (`TRUNCATE tenant CASCADE`) after initially binding to the wrong pool (`poolside-ocp-slot-6` / `us-east-2_fQ3QSyfAk`), then rebound to the correct pool.
- Trusted the router CA cert on the local machine: `sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /tmp/ocp-slot6-router-ca.crt`
- p5e.48xlarge node: `ip-10-100-52-169.us-east-2.compute.internal`
- terraform `cluster.sh apply` failed at the NooBaa step, so the NFD/GPU operators were never installed
- Installed the NFD operator (channel: stable, source: redhat-operators) manually
- Created a `NodeFeatureDiscovery` instance in the `openshift-nfd` namespace
- Installed the NVIDIA GPU Operator (channel: v25.3, source: certified-operators, CSV: `gpu-operator-certified.v25.3.4`) manually
- Created a `ClusterPolicy` with driver, toolkit, device-plugin, dcgm, gfd, migManager enabled
- Result: 8x `nvidia.com/gpu` detected on the p5e.48xlarge node
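The ClusterPolicy above, reduced to the components this log names, might look roughly as follows. This is a sketch with operator defaults omitted; verify field names and any extra required settings against the GPU Operator v25.3 CRD before applying.

```yaml
# Sketch of the ClusterPolicy: only the components enabled per this log;
# all other fields left to operator defaults.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: true
```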
- Following INSTALL.md Option A (Poolside Model Downloader) — using `splash models create` with `--mode kubernetes`
- Configured the splash CLI: updated `apiBaseURL` to slot-6, ran `splash login`
- Source: `s3://poolside-ue2-versions/checkpoints/malibu-v2.20251021/` (AWS S3, us-east-2, ~74.3 GiB)
- Destination: `s3://poolside-data-a03a0b18-adab-40f4-a5a7-7d53494e0de4/checkpoints/malibu-v2.20251021/` (NooBaa, in-cluster)
- Created a temporary `s3-source-credentials` secret with AWS SSO session credentials (expires 2026-02-01T22:52:35Z)
- Created an `s3-transfer` pod in the `poolside` namespace running `amazon/aws-cli` — two-stage transfer: download from AWS S3 to local disk, then upload to NooBaa
- NooBaa CA cert mounted from the `openshift-service-ca.crt` ConfigMap at `/etc/ssl/noobaa-ca/`
- Transfer speed: ~172 MiB/s from AWS S3
- Status: in progress...
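A pod along the lines of the `s3-transfer` pod above could be defined like this. Treat it as a sketch: the image, namespace, pod name, and CA mount path are from this log, while the `sleep infinity` command, the `scratch` emptyDir for the local-disk staging, the `AWS_CA_BUNDLE` value, and the ConfigMap key `service-ca.crt` (the standard key for service-CA ConfigMaps) are assumptions.

```yaml
# Sketch of the s3-transfer pod; transfer commands would be run via
# `oc exec` (stage 1: aws s3 sync from AWS S3 to /data; stage 2: aws s3
# sync /data to NooBaa with --endpoint-url and the mounted CA bundle).
apiVersion: v1
kind: Pod
metadata:
  name: s3-transfer
  namespace: poolside
spec:
  containers:
  - name: transfer
    image: amazon/aws-cli
    command: ["sleep", "infinity"]   # keep the pod alive for manual exec
    env:
    - name: AWS_CA_BUNDLE
      value: /etc/ssl/noobaa-ca/service-ca.crt   # key name is an assumption
    volumeMounts:
    - name: noobaa-ca
      mountPath: /etc/ssl/noobaa-ca
    - name: scratch
      mountPath: /data   # local-disk staging area for the two-stage copy
  volumes:
  - name: noobaa-ca
    configMap:
      name: openshift-service-ca.crt
  - name: scratch
    emptyDir: {}
```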
- TODO: `splash models create Malibu` once the checkpoint transfer completes
These are already merged into the release/1.20260128.0 branch:
- `additional_config_maps.yaml.tmpl` - supports map and slice data types with tpl processing
- `container.yaml` format - uses `[{name, value}]` array format for reconciler compatibility
- `AWS_S3_ADDRESSING_STYLE=path` - required for NooBaa path-style S3 addressing
- `inference_service_account.yaml` template - creates the `inference` SA in the models namespace (the reconciler hardcodes `Create: false`)
- `awsCredentialsSecretName` wired through to the models-reconciler env
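The map-vs-slice handling in `additional_config_maps.yaml.tmpl` might look roughly like the snippet below. This is purely illustrative (the real template is in the merged PRs); the values structure, key names, and the slice item shape (`key`/`value`) are all assumptions, and `kindIs` is a standard Sprig function available in Helm templates.

```yaml
# Illustrative sketch, not the merged template: render one ConfigMap per
# entry, accepting either map or slice data, with values run through tpl.
{{- range $cm := .Values.additionalConfigMaps }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $cm.name }}
data:
  {{- if kindIs "map" $cm.data }}
  {{- range $k, $v := $cm.data }}
  {{ $k }}: {{ tpl $v $ | quote }}
  {{- end }}
  {{- else }}
  {{- range $item := $cm.data }}
  {{ $item.key }}: {{ tpl $item.value $ | quote }}
  {{- end }}
  {{- end }}
{{- end }}
```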
- Always follow INSTALL.md step by step
- Update this log for EVERY decision and state change
- Delete old `*-0.3.1.tgz` files before helm upgrade (stale tgz pitfall)
- Create `poolside-registry-secret` in the poolside-models namespace per INSTALL.md Step 2
- Apply the `system:image-puller` RoleBinding for cross-namespace image pull access
- Full pipeline for chart changes: terraform apply -> helm package -> helm upgrade