
@vdemeester
Last active May 13, 2026 10:23
tekton-oracle cluster health issues — 2026-05-13

tekton-oracle cluster — work streams analysis (2026-05-13)

Status: node 10.0.128.239 recovered ✅

Rebooted via OCI CLI, now Ready again. Still need Fluent Bit memory fix (100M → 256M).


1. PVC sprawl — 220 PVCs, only 5 in use

The numbers

| Namespace | PVCs | In use | Orphaned |
| --- | ---: | ---: | ---: |
| tekton-ci | 120 | 0 | 120 |
| tekton-nightly | 24 | 0 | 24 |
| bastion-p | 24 | 0 | 24 |
| default | 21 | 0 | 21 |
| bastion-z | 20 | 0 | 20 |
| Others | 11 | 5 | 6 |
| **Total** | **220** | **5** | **215** |

Root cause: OCI Block Volume minimum is 50 GiB

Templates request 1Gi workspaces, but OCI BV enforces a 50 GiB minimum. Every PipelineRun creates a 50 GiB block volume for a workspace that probably uses <100 MiB.

Why aren't they cleaned up?

There IS a cleanup system: cleanup-trigger-dogfooding-* CronJobs fire daily, triggering cleanup-runs TaskRuns that run tkn pr delete --keep 200 and tkn tr delete --keep 400. But all cleanup TaskRuns are timing out — every single one for the past week:

cleanup-runs-...-tekton-ci-*        False   TaskRunTimeout
cleanup-runs-...-tekton-nightly-*   False   TaskRunTimeout

Even when cleanup runs, PVCs owned by still-existing PipelineRuns are not deleted: each PVC carries an ownerReference to its PipelineRun, so it is only garbage-collected when that PipelineRun is deleted. Since cleanup is timing out, PipelineRuns accumulate → PVCs accumulate.

Quick wins (no architectural changes)

  1. Fix the cleanup TaskRuns — they're timing out, probably because tkn pr delete with 100+ runs is slow. Increase timeout or batch the deletes
  2. Reduce --keep from 200 to something smaller (50?) — less to process, faster cleanup
  3. One-time manual cleanup: tkn pr delete -f -n tekton-ci --keep 50 + same for other namespaces
  4. Add PVC cleanup step to the cleanup Task — after deleting PipelineRuns, also delete any unbound/unattached PVCs
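Quick wins 3 and 4 might look like the following (a sketch, not a tested script: namespaces and `--keep 50` mirror the suggestions above; the PVC pass only lists candidates, since "unbound/unattached" should be verified by hand before anything is deleted):

```shell
# One-time manual purge, per affected namespace.
# --keep 50 retains the 50 most recent PipelineRuns; -f skips the confirmation prompt.
for ns in tekton-ci tekton-nightly bastion-p bastion-z default; do
  tkn pipelinerun delete -f -n "$ns" --keep 50
done

# List remaining PVCs with their owning PipelineRun so orphans can be
# inspected before a deliberate kubectl delete.
kubectl get pvc --all-namespaces \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'
```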

Medium-term: stop using PVCs for ephemeral workspaces

  • emptyDir for workspaces that fit in memory/local disk — most CI workloads (clone, lint, test) would be fine with this
  • Requires: audit which pipelines actually need cross-step persistence vs just passing small artifacts
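Switching a workspace to emptyDir is a one-line change in the PipelineRun's workspace binding (illustrative fragment; the pipeline and workspace names are placeholders). Caveat: an emptyDir workspace cannot share data across tasks, since each TaskRun gets its own pod, so this only fits workspaces scoped to a single task:

```yaml
# PipelineRun workspace backed by emptyDir instead of a volumeClaimTemplate —
# no PVC is created, and the data lives only for the lifetime of the pod.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: ci-run-
spec:
  pipelineRef:
    name: ci-pipeline        # placeholder pipeline name
  workspaces:
    - name: source           # placeholder workspace name
      emptyDir: {}
```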

Longer-term: OCI artifacts instead of workspaces

  • tekton-experiments demonstrates OCI artifact-based data transport between tasks (no PVCs at all)
  • Inspired by Konflux CI trusted artifacts
  • Validates TEP-0164 (Tekton Artifacts Phase 2) design
  • Would completely eliminate PVC needs for most workloads
  • Requires: TEP-0164 to land, or custom step wrappers like in tekton-experiments

2. Tekton Results — fully broken

Current state

| Component | Status | Problem |
| --- | --- | --- |
| tekton-results-postgres-0 | ImageInspectError | CRI-O short name enforcement blocks bitnami/postgresql |
| tekton-results-api | CrashLoopBackOff | Can't reach postgres (19 days) |
| tekton-results-watcher | CrashLoopBackOff | Can't reach postgres (21 days) |
| tekton-results-retention-policy-agent | CrashLoopBackOff | Can't reach postgres (19 days) |

Root cause

The upstream Results release manifest (v0.18.0 referenced in the kustomization, but v0.16.0 actually deployed) uses the bare bitnami/postgresql image name, and CRI-O on OKE rejects unqualified short names.

How it's deployed

  • ArgoCD app tekton-results → tekton/cd/results/overlays/oci-ci-cd/ in the plumbing repo
  • Base: https://infra.tekton.dev/tekton-releases/results/previous/v0.18.0/release.yaml
  • Overlay patches: ingress, RBAC (viewer SA), service

What needs to happen

  1. Add an image patch to the overlay to fully-qualify the postgres image: docker.io/bitnami/postgresql@sha256:...
  2. Consider: should we also upgrade from v0.16.0 (running) to v0.18.0 (configured)? The base already points to v0.18.0 but ArgoCD seems stuck at v0.16.0 — possibly the sync failed and rolled back
  3. Logs storage: currently LOGS_API=false, LOGS_TYPE=File — Results is NOT configured to store or serve logs. If we want log storage, we'd need to configure an S3-compatible backend (e.g., OCI Object Storage) and set LOGS_API=true
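Step 1 could be done with a kustomize `images` override in the oci-ci-cd overlay (a sketch; the `newTag` is a placeholder, so pin whichever tag or digest the release actually ships before applying):

```yaml
# overlays/oci-ci-cd/kustomization.yaml — fully qualify the postgres image
# so CRI-O short-name enforcement no longer blocks the pull.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: bitnami/postgresql         # bare name used by the upstream manifest
    newName: docker.io/bitnami/postgresql
    newTag: "16"                     # placeholder — pin the real tag or digest
```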

Config in repo

tekton/cd/results/
├── base/
│   └── kustomization.yaml  # points to v0.18.0 release
└── overlays/oci-ci-cd/
    ├── kustomization.yaml
    ├── ingress.yaml          # results.infra.tekton.dev
    ├── rbac.yaml             # viewer SA
    └── service.yaml

3. Tekton Hub — needs full removal

Current state

Hub is deprecated. Currently broken and wasting resources:

  • tekton-hub-db: ImageInspectError (same CRI-O short name issue on postgres:13)
  • tekton-hub-api: CrashLoopBackOff — 7,836 restarts over 27 days
  • tekton-hub-ui: Running (pointless without API)
  • swagger: Running (pointless without API)
  • 2 PVCs: 50 GiB each (100 GiB total wasted)

How it's deployed

  • No ArgoCD app — Hub is not managed by any of the 16 ArgoCD applications
  • Likely deployed manually or via a now-removed ArgoCD app
  • The namespace tekton-hub and all its resources are standalone

What's in the plumbing repo

  • tekton/images/hub/Dockerfile — builds an Alpine image with the hub CLI tool (NOT Tekton Hub itself, just the GitHub hub command — confusing naming)
  • tekton/cronjobs/dogfooding/images/hub-nightly/ — nightly CronJob to rebuild that image
  • These are unrelated to the Tekton Hub deployment — they build ghcr.io/tektoncd/plumbing/hub (the GitHub CLI wrapper)

Cleanup plan

  1. Delete the namespace: kubectl delete namespace tekton-hub — removes all resources, PVCs, secrets, services
  2. Clean up DNS/certs: check if *hub.tekton.dev DNS records point here and remove them
  3. The tekton/images/hub/ Dockerfile and nightly CronJob should stay or be evaluated separately: they are for the hub CLI tool, not Tekton Hub the product. That said, the hub CLI is itself deprecated in favor of gh, so they could be removed too
  4. No ArgoCD changes needed — there's no app to remove
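The cleanup plan above, sketched as commands (destructive; inspect before deleting, and note the hostnames in the dig line are examples taken from the cert names below):

```shell
# Preview what the namespace holds before removing anything.
kubectl get all,pvc,secret,ingress -n tekton-hub

# Step 1: delete the namespace and everything in it (pods, PVCs, secrets, services).
kubectl delete namespace tekton-hub

# Step 2: check which *hub.tekton.dev records still resolve before touching DNS.
dig +short api.hub.tekton.dev ui.hub.tekton.dev
```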

Secrets/certs to consider

  • api-hub-tekton-dev-tls, auth-hub-tekton-dev-tls, swagger-hub-tekton-dev-tls, ui-hub-tekton-dev-tls — Let's Encrypt certs, will stop renewing once deleted
  • tekton-hub-api secret — contains auth tokens, DB credentials
  • catalog-refresh secret

Summary of actions (none taken yet)

| Stream | Quick win | Medium-term | Longer-term |
| --- | --- | --- | --- |
| PVCs | Fix cleanup timeouts, manual purge, reduce --keep | Switch to emptyDir for CI workspaces | OCI artifacts (TEP-0164) |
| Results | Patch postgres image to FQ name | Upgrade v0.16 → v0.18, configure log storage | S3 backend for logs |
| Hub | Delete namespace | Remove hub CLI image if unused | |
| Fluent Bit | Bump memory 100M → 256M | | |

PVC auto-cleanup — available NOW on v1.12.0

The cluster runs Tekton Pipelines v1.12.0 with coschedule: workspaces. Two relevant features are already available:

1. tekton.dev/auto-cleanup-pvc: "true" annotation (commit 0e9378b8)

Add this annotation to PipelineRuns to auto-delete volumeClaimTemplate PVCs on completion:

metadata:
  annotations:
    tekton.dev/auto-cleanup-pvc: "true"

Only affects volumeClaimTemplate workspaces, never user-provided PVCs.

2. PVC deletion on PipelineRun deletion (commit 8dcad6af)

When a PipelineRun is deleted (e.g., by cleanup), its volumeClaimTemplate PVCs are now also deleted. This means the existing cleanup-runs CronJob (which does tkn pr delete --keep N) should cascade-delete PVCs — if the cleanup stops timing out.

Recommended actions

  1. Add tekton.dev/auto-cleanup-pvc: "true" to all TriggerTemplates — this covers CI workloads going forward:
    • tekton/ci/repos/community/template.yaml
    • tekton/ci/repos/website/template.yaml
    • tekton/ci/repos/catalog/base/template.yaml
    • tekton/ci/repos/shared/doc-reviews/template.yaml
  2. Fix cleanup TaskRun timeouts — all cleanup-runs TaskRuns are timing out, blocking PipelineRun (and now PVC) garbage collection
  3. One-time manual purge of the 215 orphaned PVCs
  4. Longer term: consider emptyDir for workspaces that don't need persistence, and OCI artifacts (TEP-0164 / tekton-experiments patterns) for cross-task data transport
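Action 1 amounts to adding the annotation to the PipelineRun stanza inside each TriggerTemplate's resourcetemplates (illustrative fragment; the template and pipeline names are placeholders, not taken from the repo):

```yaml
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: ci-template                  # placeholder name
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
        annotations:
          # Delete volumeClaimTemplate PVCs when the run completes (v1.12.0 feature).
          tekton.dev/auto-cleanup-pvc: "true"
      spec:
        pipelineRef:
          name: ci-pipeline          # placeholder
```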