Rebooted via OCI CLI, now Ready again. Still need Fluent Bit memory fix (100M → 256M).
| Namespace | PVCs | In use | Orphaned |
|---|---|---|---|
| tekton-ci | 120 | 0 | 120 |
| tekton-nightly | 24 | 0 | 24 |
| bastion-p | 24 | 0 | 24 |
| default | 21 | 0 | 21 |
| bastion-z | 20 | 0 | 20 |
| Others | 11 | 5 | 6 |
| Total | 220 | 5 | 215 |
Templates request 1Gi workspaces, but OCI Block Volume enforces a 50 GiB minimum. Every PipelineRun therefore creates a 50 GiB block volume for a workspace that probably uses <100 MiB.
There IS a cleanup system: `cleanup-trigger-dogfooding-*` CronJobs fire daily, triggering `cleanup-runs` TaskRuns that run `tkn pr delete --keep 200` and `tkn tr delete --keep 400`. But all cleanup TaskRuns are timing out, every single one for the past week:
cleanup-runs-...-tekton-ci-* False TaskRunTimeout
cleanup-runs-...-tekton-nightly-* False TaskRunTimeout
Even when cleanup runs, PVCs owned by still-existing PipelineRuns won't be deleted: PVCs carry ownerReferences to their PipelineRuns, so they're only GC'd when the PipelineRun itself is deleted. Since cleanup is timing out, PipelineRuns accumulate → PVCs accumulate.
- Fix the cleanup TaskRuns: they're timing out, probably because `tkn pr delete` with 100+ runs is slow. Increase the timeout or batch the deletes
- Reduce `--keep` from 200 to something smaller (50?): less to process, faster cleanup
- One-time manual cleanup: `tkn pr delete -f -n tekton-ci --keep 50` plus the same for the other namespaces
- Add a PVC cleanup step to the cleanup Task: after deleting PipelineRuns, also delete any unbound/unattached PVCs
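The one-time manual cleanup could be sketched as a loop over the affected namespaces. The namespace list and `--keep 50` value are assumptions from the table above, and filtering for ownerless PVCs with `jq` is just one way to find leftovers:

```shell
#!/bin/sh
# Sketch: batched PipelineRun purge + orphaned-PVC sweep per namespace.
# Namespaces and --keep value are assumptions; review before running.
set -eu
for ns in tekton-ci tekton-nightly bastion-p bastion-z default; do
  # Delete old PipelineRuns; PVCs owned via ownerReferences cascade-delete.
  tkn pipelinerun delete -f -n "$ns" --keep 50
  # Remove any remaining PVCs that have no owner (leaked or hand-created).
  kubectl get pvc -n "$ns" -o json \
    | jq -r '.items[]
             | select((.metadata.ownerReferences // []) | length == 0)
             | .metadata.name' \
    | xargs -r -n1 kubectl delete pvc -n "$ns"
done
```

Batching per namespace keeps each `tkn pr delete` call small enough to finish inside the TaskRun timeout.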
- emptyDir for workspaces that fit in memory/local disk — most CI workloads (clone, lint, test) would be fine with this
- Requires: audit which pipelines actually need cross-step persistence vs just passing small artifacts
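For pipelines that pass the audit, the switch is a one-line change in the PipelineRun's workspace binding. A sketch with hypothetical pipeline and workspace names; note that an `emptyDir` workspace is not shared across TaskRuns (each TaskRun gets its own copy), so it only suits pipelines whose tasks don't pass files between them:

```yaml
# Hypothetical PipelineRun: workspace backed by emptyDir instead of a
# volumeClaimTemplate, so no 50 GiB block volume is ever provisioned.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-run-
spec:
  pipelineRef:
    name: ci-pipeline          # hypothetical name
  workspaces:
    - name: source
      emptyDir: {}             # node-local scratch; gone when the run's pods end
```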
- tekton-experiments demonstrates OCI artifact-based data transport between tasks (no PVCs at all)
- Inspired by Konflux CI trusted artifacts
- Validates TEP-0164 (Tekton Artifacts Phase 2) design
- Would completely eliminate PVC needs for most workloads
- Requires: TEP-0164 to land, or custom step wrappers like in tekton-experiments
| Component | Status | Problem |
|---|---|---|
| `tekton-results-postgres-0` | ImageInspectError | CRI-O short name enforcement blocks `bitnami/postgresql` |
| `tekton-results-api` | CrashLoopBackOff | Can't reach postgres (19 days) |
| `tekton-results-watcher` | CrashLoopBackOff | Can't reach postgres (21 days) |
| `tekton-results-retention-policy-agent` | CrashLoopBackOff | Can't reach postgres (19 days) |
The upstream Results release manifest (v0.18.0 referenced in the kustomization, but v0.16.0 actually deployed) uses the bare `bitnami/postgresql` image name; CRI-O on OKE rejects short names.
- ArgoCD app `tekton-results` → `tekton/cd/results/overlays/oci-ci-cd/` in the plumbing repo
- Base: https://infra.tekton.dev/tekton-releases/results/previous/v0.18.0/release.yaml
- Overlay patches: ingress, RBAC (viewer SA), service
- Add an image patch to the overlay to fully qualify the postgres image: `docker.io/bitnami/postgresql@sha256:...`
- Consider: should we also upgrade from v0.16.0 (running) to v0.18.0 (configured)? The base already points to v0.18.0 but ArgoCD seems stuck at v0.16.0; possibly the sync failed and rolled back
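The image patch could use kustomize's `images` transformer in the overlay. A sketch; the digest is a placeholder that needs to be pinned to a real one:

```yaml
# overlays/oci-ci-cd/kustomization.yaml (sketch): fully qualify the postgres
# image so CRI-O short-name enforcement accepts it. Digest is a placeholder.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: bitnami/postgresql
    newName: docker.io/bitnami/postgresql
    digest: sha256:<pin-a-real-digest-here>
```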
- Logs storage: currently `LOGS_API=false`, `LOGS_TYPE=File`; Results is NOT configured to store or serve logs. If we want log storage, we'd need to configure an S3-compatible backend (e.g., OCI Object Storage) and set `LOGS_API=true`
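If we go that route, the Results API config change might look roughly like this. The key names should be verified against the deployed Results version's config reference, and the bucket name and OCI S3-compat endpoint below are placeholders:

```yaml
# Sketch of Results API config for S3-backed logs (names are assumptions;
# verify keys against the Results config reference for the deployed version).
LOGS_API: "true"
LOGS_TYPE: "S3"
S3_BUCKET_NAME: "tekton-results-logs"
S3_ENDPOINT: "https://<namespace>.compat.objectstorage.<region>.oraclecloud.com"
```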
tekton/cd/results/
├── base/
│ └── kustomization.yaml # points to v0.18.0 release
└── overlays/oci-ci-cd/
├── kustomization.yaml
├── ingress.yaml # results.infra.tekton.dev
├── rbac.yaml # viewer SA
└── service.yaml
Hub is deprecated. Currently broken and wasting resources:
- `tekton-hub-db`: ImageInspectError (same CRI-O short name issue on `postgres:13`)
- `tekton-hub-api`: CrashLoopBackOff, 7,836 restarts over 27 days
- `tekton-hub-ui`: Running (pointless without API)
- `swagger`: Running (pointless without API)
- 2 PVCs: 50 GiB each (100 GiB total wasted)
- No ArgoCD app — Hub is not managed by any of the 16 ArgoCD applications
- Likely deployed manually or via a now-removed ArgoCD app
- The namespace `tekton-hub` and all its resources are standalone
- `tekton/images/hub/Dockerfile`: builds an Alpine image with the `hub` CLI tool (NOT Tekton Hub itself, just the GitHub `hub` command; confusing naming)
- `tekton/cronjobs/dogfooding/images/hub-nightly/`: nightly CronJob to rebuild that image
- These are unrelated to the Tekton Hub deployment; they build `ghcr.io/tektoncd/plumbing/hub` (the GitHub CLI wrapper)
- Delete the namespace: `kubectl delete namespace tekton-hub` removes all resources, PVCs, secrets, services
- Clean up DNS/certs: check if `*hub.tekton.dev` DNS records point here and remove them
- The `tekton/images/hub/` Dockerfile and nightly CronJob should stay or be evaluated separately; they're for the `hub` CLI tool, not Tekton Hub the product. Though `hub` CLI is also deprecated in favor of `gh`, so it could be removed too
- No ArgoCD changes needed: there's no app to remove
- `api-hub-tekton-dev-tls`, `auth-hub-tekton-dev-tls`, `swagger-hub-tekton-dev-tls`, `ui-hub-tekton-dev-tls`: Let's Encrypt certs, will stop renewing once deleted
- `tekton-hub-api` secret: contains auth tokens, DB credentials
- `catalog-refresh` secret
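A minimal teardown sketch; review the listing before deleting, and note the `certificate` resource type assumes cert-manager's CRD is installed:

```shell
#!/bin/sh
# Sketch: review everything in the namespace first, then delete it.
# Namespace-scoped PVCs, secrets, and certs are removed with it.
set -eu
kubectl get all,pvc,secret,certificate -n tekton-hub   # review before deleting
kubectl delete namespace tekton-hub --wait=true
# DNS records for *hub.tekton.dev live outside the cluster; remove separately.
```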
| Stream | Quick win | Medium-term | Longer-term |
|---|---|---|---|
| PVCs | Fix cleanup timeouts, manual purge, reduce --keep | Switch to emptyDir for CI workspaces | OCI artifacts (TEP-0164) |
| Results | Patch postgres image to FQ name | Upgrade v0.16→v0.18, configure log storage | S3 backend for logs |
| Hub | Delete namespace | Remove hub CLI image if unused | — |
| Fluent Bit | Bump memory 100M→256M | — | — |
The cluster runs Tekton Pipelines v1.12.0 with `coschedule: workspaces`. Two relevant features are already available:
Add this annotation to PipelineRuns to auto-delete volumeClaimTemplate PVCs on completion:
```yaml
metadata:
  annotations:
    tekton.dev/auto-cleanup-pvc: "true"
```

Only affects volumeClaimTemplate workspaces, never user-provided PVCs.
When a PipelineRun is deleted (e.g., by cleanup), its volumeClaimTemplate PVCs are now also deleted. This means the existing cleanup-runs CronJob (which does tkn pr delete --keep N) should cascade-delete PVCs — if the cleanup stops timing out.
- Add `tekton.dev/auto-cleanup-pvc: "true"` to all TriggerTemplates; this covers CI workloads going forward:
  - `tekton/ci/repos/community/template.yaml`
  - `tekton/ci/repos/website/template.yaml`
  - `tekton/ci/repos/catalog/base/template.yaml`
  - `tekton/ci/repos/shared/doc-reviews/template.yaml`
- Fix cleanup TaskRun timeouts: all `cleanup-runs` TaskRuns are timing out, blocking PipelineRun (and now PVC) garbage collection
- One-time manual purge of the 215 orphaned PVCs
- Longer term: consider `emptyDir` for workspaces that don't need persistence, and OCI artifacts (TEP-0164 / tekton-experiments patterns) for cross-task data transport
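The first step above amounts to adding the annotation to the PipelineRun metadata inside each TriggerTemplate's `resourcetemplates`. A sketch with hypothetical template and pipeline names:

```yaml
# Sketch: where the annotation goes inside a TriggerTemplate (names hypothetical)
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: ci-template              # hypothetical
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
        annotations:
          tekton.dev/auto-cleanup-pvc: "true"   # auto-delete volumeClaimTemplate PVCs
      spec:
        pipelineRef:
          name: ci-pipeline      # hypothetical
```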