
@vdemeester
Last active May 13, 2026 10:23
tekton-oracle cluster health issues — 2026-05-13

tekton-oracle cluster — work streams analysis (2026-05-13)

Status: node 10.0.128.239 recovered ✅

Rebooted via OCI CLI, now Ready again. Still need Fluent Bit memory fix (100M → 256M).


1. PVC sprawl — 220 PVCs, only 5 in use

The numbers

| Namespace | PVCs | In use | Orphaned |
| --- | ---: | ---: | ---: |
| tekton-ci | 120 | 0 | 120 |
| tekton-nightly | 24 | 0 | 24 |
| bastion-p | 24 | 0 | 24 |
| default | 21 | 0 | 21 |
| bastion-z | 20 | 0 | 20 |
| Others | 11 | 5 | 6 |
| **Total** | **220** | **5** | **215** |

Root cause: OCI Block Volume minimum is 50 GiB

Templates request 1Gi workspaces, but OCI BV enforces a 50 GiB minimum. Every PipelineRun creates a 50 GiB block volume for a workspace that probably uses <100 MiB.

Why aren't they cleaned up?

There IS a cleanup system: cleanup-trigger-dogfooding-* CronJobs fire daily, triggering cleanup-runs TaskRuns that run tkn pr delete --keep 200 and tkn tr delete --keep 400. But all cleanup TaskRuns are timing out — every single one for the past week:

cleanup-runs-...-tekton-ci-*        False   TaskRunTimeout
cleanup-runs-...-tekton-nightly-*   False   TaskRunTimeout

Even when cleanup runs, PVCs owned by still-existing PipelineRuns are not deleted: each PVC carries an ownerReference to its PipelineRun, so it is only garbage-collected when that PipelineRun is deleted. Since cleanup is timing out, PipelineRuns accumulate → PVCs accumulate.

Quick wins (no architectural changes)

  1. Fix the cleanup TaskRuns — they're timing out, probably because tkn pr delete with 100+ runs is slow. Increase timeout or batch the deletes
  2. Reduce --keep from 200 to something smaller (50?) — less to process, faster cleanup
  3. One-time manual cleanup: tkn pr delete -f -n tekton-ci --keep 50 + same for other namespaces
  4. Add PVC cleanup step to the cleanup Task — after deleting PipelineRuns, also delete any unbound/unattached PVCs
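Quick wins 3 and 4 might look like the following (a sketch, not a tested script: namespaces and `--keep 50` mirror the suggestions above; the PVC pass only lists candidates, since "unbound/unattached" should be verified by hand before anything is deleted):

```shell
# One-time manual purge, per affected namespace.
# --keep 50 retains the 50 most recent PipelineRuns; -f skips the confirmation prompt.
for ns in tekton-ci tekton-nightly bastion-p bastion-z default; do
  tkn pipelinerun delete -f -n "$ns" --keep 50
done

# List remaining PVCs with their owning PipelineRun so orphans can be
# inspected before a deliberate kubectl delete.
kubectl get pvc --all-namespaces \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'
```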

Medium-term: stop using PVCs for ephemeral workspaces

  • emptyDir for workspaces that fit in memory/local disk — most CI workloads (clone, lint, test) would be fine with this
  • Requires: audit which pipelines actually need cross-step persistence vs just passing small artifacts
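Switching a workspace to emptyDir is a one-line change in the PipelineRun's workspace binding (illustrative fragment; the pipeline and workspace names are placeholders). Caveat: an emptyDir workspace cannot share data across tasks, since each TaskRun gets its own pod, so this only fits workspaces scoped to a single task:

```yaml
# PipelineRun workspace backed by emptyDir instead of a volumeClaimTemplate —
# no PVC is created, and the data lives only for the lifetime of the pod.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-run-
spec:
  pipelineRef:
    name: ci-pipeline        # placeholder pipeline name
  workspaces:
    - name: source           # placeholder workspace name
      emptyDir: {}
```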

Longer-term: OCI artifacts instead of workspaces

  • tekton-experiments demonstrates OCI artifact-based data transport between tasks (no PVCs at all)
  • Inspired by Konflux CI trusted artifacts
  • Validates TEP-0164 (Tekton Artifacts Phase 2) design
  • Would completely eliminate PVC needs for most workloads
  • Requires: TEP-0164 to land, or custom step wrappers like in tekton-experiments

2. Tekton Results — fully broken

Current state

| Component | Status | Problem |
| --- | --- | --- |
| tekton-results-postgres-0 | ImageInspectError | CRI-O short name enforcement blocks bitnami/postgresql |
| tekton-results-api | CrashLoopBackOff | Can't reach postgres (19 days) |
| tekton-results-watcher | CrashLoopBackOff | Can't reach postgres (21 days) |
| tekton-results-retention-policy-agent | CrashLoopBackOff | Can't reach postgres (19 days) |

Root cause

The upstream Results release manifest (v0.18.0 referenced in the kustomization, but v0.16.0 actually deployed) uses the bare bitnami/postgresql image name, and CRI-O on OKE rejects unqualified short names.

How it's deployed

  • ArgoCD app tekton-results → tekton/cd/results/overlays/oci-ci-cd/ in the plumbing repo
  • Base: https://infra.tekton.dev/tekton-releases/results/previous/v0.18.0/release.yaml
  • Overlay patches: ingress, RBAC (viewer SA), service

What needs to happen

  1. Add an image patch to the overlay to fully-qualify the postgres image: docker.io/bitnami/postgresql@sha256:...
  2. Consider: should we also upgrade from v0.16.0 (running) to v0.18.0 (configured)? The base already points to v0.18.0 but ArgoCD seems stuck at v0.16.0 — possibly the sync failed and rolled back
  3. Logs storage: currently LOGS_API=false, LOGS_TYPE=File — Results is NOT configured to store or serve logs. If we want log storage, we'd need to configure an S3-compatible backend (e.g., OCI Object Storage) and set LOGS_API=true
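Step 1 could be done with a kustomize `images` override in the oci-ci-cd overlay (a sketch; the `newTag` is a placeholder, so pin whichever tag or digest the release actually ships before applying):

```yaml
# overlays/oci-ci-cd/kustomization.yaml — fully qualify the postgres image
# so CRI-O short-name enforcement no longer blocks the pull.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: bitnami/postgresql         # bare name used by the upstream manifest
    newName: docker.io/bitnami/postgresql
    newTag: "16"                     # placeholder — pin the real tag or digest
```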

Config in repo

tekton/cd/results/
├── base/
│   └── kustomization.yaml  # points to v0.18.0 release
└── overlays/oci-ci-cd/
    ├── kustomization.yaml
    ├── ingress.yaml          # results.infra.tekton.dev
    ├── rbac.yaml             # viewer SA
    └── service.yaml

3. Tekton Hub — needs full removal

Current state

Hub is deprecated. Currently broken and wasting resources:

  • tekton-hub-db: ImageInspectError (same CRI-O short name issue on postgres:13)
  • tekton-hub-api: CrashLoopBackOff — 7,836 restarts over 27 days
  • tekton-hub-ui: Running (pointless without API)
  • swagger: Running (pointless without API)
  • 2 PVCs: 50 GiB each (100 GiB total wasted)

How it's deployed

  • No ArgoCD app — Hub is not managed by any of the 16 ArgoCD applications
  • Likely deployed manually or via a now-removed ArgoCD app
  • The namespace tekton-hub and all its resources are standalone

What's in the plumbing repo

  • tekton/images/hub/Dockerfile — builds an Alpine image with the hub CLI tool (NOT Tekton Hub itself, just the GitHub hub command — confusing naming)
  • tekton/cronjobs/dogfooding/images/hub-nightly/ — nightly CronJob to rebuild that image
  • These are unrelated to the Tekton Hub deployment — they build ghcr.io/tektoncd/plumbing/hub (the GitHub CLI wrapper)

Cleanup plan

  1. Delete the namespace: kubectl delete namespace tekton-hub — removes all resources, PVCs, secrets, services
  2. Clean up DNS/certs: check if *hub.tekton.dev DNS records point here and remove them
  3. The tekton/images/hub/ Dockerfile and nightly CronJob should stay or be evaluated separately: they are for the hub CLI tool, not Tekton Hub the product. That said, the hub CLI is itself deprecated in favor of gh, so they could be removed too
  4. No ArgoCD changes needed — there's no app to remove
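The cleanup plan above, sketched as commands (destructive; inspect before deleting, and note the hostnames in the dig line are examples taken from the cert names below):

```shell
# Preview what the namespace holds before removing anything.
kubectl get all,pvc,secret,ingress -n tekton-hub

# Step 1: delete the namespace and everything in it (pods, PVCs, secrets, services).
kubectl delete namespace tekton-hub

# Step 2: check which *hub.tekton.dev records still resolve before touching DNS.
dig +short api.hub.tekton.dev ui.hub.tekton.dev
```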

Secrets/certs to consider

  • api-hub-tekton-dev-tls, auth-hub-tekton-dev-tls, swagger-hub-tekton-dev-tls, ui-hub-tekton-dev-tls — Let's Encrypt certs, will stop renewing once deleted
  • tekton-hub-api secret — contains auth tokens, DB credentials
  • catalog-refresh secret

Summary of actions (none taken yet)

| Stream | Quick win | Medium-term | Longer-term |
| --- | --- | --- | --- |
| PVCs | Fix cleanup timeouts, manual purge, reduce --keep | Switch to emptyDir for CI workspaces | OCI artifacts (TEP-0164) |
| Results | Patch postgres image to FQ name | Upgrade v0.16 → v0.18, configure log storage | S3 backend for logs |
| Hub | Delete namespace | Remove hub CLI image if unused | |
| Fluent Bit | Bump memory 100M → 256M | | |

PVC auto-cleanup — available NOW on v1.12.0

The cluster runs Tekton Pipelines v1.12.0 with coschedule: workspaces. Two relevant features are already available:

1. tekton.dev/auto-cleanup-pvc: "true" annotation (commit 0e9378b8)

Add this annotation to PipelineRuns to auto-delete volumeClaimTemplate PVCs on completion:

metadata:
  annotations:
    tekton.dev/auto-cleanup-pvc: "true"

Only affects volumeClaimTemplate workspaces, never user-provided PVCs.

2. PVC deletion on PipelineRun deletion (commit 8dcad6af)

When a PipelineRun is deleted (e.g., by cleanup), its volumeClaimTemplate PVCs are now also deleted. This means the existing cleanup-runs CronJob (which does tkn pr delete --keep N) should cascade-delete PVCs — if the cleanup stops timing out.

Recommended actions

  1. Add tekton.dev/auto-cleanup-pvc: "true" to all TriggerTemplates — this covers CI workloads going forward:
    • tekton/ci/repos/community/template.yaml
    • tekton/ci/repos/website/template.yaml
    • tekton/ci/repos/catalog/base/template.yaml
    • tekton/ci/repos/shared/doc-reviews/template.yaml
  2. Fix cleanup TaskRun timeouts — all cleanup-runs TaskRuns are timing out, blocking PipelineRun (and now PVC) garbage collection
  3. One-time manual purge of the 215 orphaned PVCs
  4. Longer term: consider emptyDir for workspaces that don't need persistence, and OCI artifacts (TEP-0164 / tekton-experiments patterns) for cross-task data transport
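Action 1 amounts to adding the annotation to the PipelineRun stanza inside each TriggerTemplate's resourcetemplates (illustrative fragment; the template and pipeline names are placeholders, not taken from the repo):

```yaml
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: ci-template                  # placeholder name
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
        annotations:
          # Delete volumeClaimTemplate PVCs when the run completes (v1.12.0 feature).
          tekton.dev/auto-cleanup-pvc: "true"
      spec:
        pipelineRef:
          name: ci-pipeline          # placeholder
```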