@frobware
Last active March 18, 2026 17:33

Konflux bundle coordination problem

Background

The openshift/bpfman-operator repository builds three container images from the same codebase via Konflux:

  • bpfman-operator (the operator binary)
  • bpfman-agent (the agent DaemonSet binary)
  • bpfman-operator-bundle (the OLM bundle)

A fourth image, bpfman (the daemon), is built from a separate repository (openshift/bpfman) but its pullspec is also consumed by the bundle.

Each component's Tekton push pipeline declares which file it owns via the build.appstudio.openshift.io/build-nudge-files annotation. When Konflux successfully builds a component, it opens a PR that updates a single .txt file containing the image digest:

  • hack/konflux/images/bpfman-operator.txt
  • hack/konflux/images/bpfman-agent.txt
  • hack/konflux/images/bpfman.txt

Each file contains exactly one line: a registry.redhat.io/...@sha256:... pullspec.
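For illustration, reading such a file and splitting out the digest can be sketched as below; `read_pullspec` is a hypothetical helper, not part of the repository's tooling:

```python
from pathlib import Path


def read_pullspec(path: str) -> tuple[str, str]:
    """Read a one-line nudge file and split it into repository and digest.

    Hypothetical helper: each hack/konflux/images/*.txt file is assumed to
    contain exactly one registry.redhat.io/...@sha256:... pullspec.
    """
    pullspec = Path(path).read_text().strip()
    repo, _, digest = pullspec.partition("@")
    if not digest.startswith("sha256:"):
        raise ValueError(f"unexpected pullspec format: {pullspec!r}")
    return repo, digest
```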

Current flow (pre-PR #498)

The bundle push pipeline (bpfman-operator-bundle-ystream-push.yaml) has a CEL trigger that fires when any of the .txt files (or bundle/, hack/openshift/, config/, OPENSHIFT-VERSION) change on main. When a nudge PR merges, Containerfile.bundle.openshift runs:

  1. update-bundle.py -- transforms the CSV with Red Hat branding, the operator pullspec (from bpfman-operator.txt), architecture labels, and version string.

  2. update-configmap.py -- stamps the agent and bpfman pullspecs (from bpfman-agent.txt and bpfman.txt) into the bundle's ConfigMap manifest (bundle/manifests/bpfman-config_v1_configmap.yaml).

At release time, validate-snapshot.py extracts the bundle image, parses the CSV and ConfigMap, and checks that every sha256 digest matches the corresponding component digest in the Konflux snapshot. If any mismatch is found, the release is blocked.
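The core comparison can be sketched as follows (illustrative names; the real validate-snapshot.py also extracts the bundle image and parses its manifests before reaching this step):

```python
def find_mismatches(bundle_digests: dict[str, str],
                    snapshot_digests: dict[str, str]) -> list[str]:
    """Compare the sha256 digests embedded in the bundle against the
    component digests recorded in the Konflux snapshot.

    Returns the names of components whose digests disagree; a non-empty
    result means the release must be blocked.
    """
    return sorted(
        name
        for name, digest in bundle_digests.items()
        if snapshot_digests.get(name) != digest
    )
```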

The coordination problem (exists today)

The three component builds and their nudge PRs are independent. They complete and merge at different times. Each merge triggers a bundle rebuild, but the bundle is built from whatever .txt files are on main at that moment. This means:

  • Agent builds and nudges bpfman-agent.txt. Bundle rebuilds with the new agent digest but the old operator and bpfman digests.

  • Operator builds and nudges bpfman-operator.txt. Bundle rebuilds with the new operator digest but potentially the old agent digest (if that nudge hasn't merged yet, or merged in a different cycle).

The snapshot assembled from these components is not self-consistent: the bundle references digests that do not match the component digests in the same snapshot. validate-snapshot.py catches this and blocks the release. The system eventually converges -- after all nudge PRs merge and the final bundle rebuild runs, the snapshot becomes consistent -- but there is no mechanism to ensure this happens atomically.

This is the core issue: Konflux does not provide a way to gate the bundle build until all parent component nudge PRs have merged. Each nudge is independent, each merge triggers a separate bundle rebuild, and only the last rebuild in a cycle produces a valid snapshot.

What upstream PR #498 changes

bpfman/bpfman-operator#498 ("Bootstrap Config CR from operator on startup") makes the following changes relevant to the downstream bundle:

  1. Removes the static Config CR manifest from the bundle. The file bundle/manifests/bpfman-config_v1_configmap.yaml (previously migrated to bundle/manifests/bpfman.io_v1alpha1_config.yaml by openshift/bpfman-operator commit e10e766) is deleted entirely. OLM rejects custom resource instances in bundles as UnsupportedResource, so the Config CR cannot be shipped this way.

  2. The operator bootstraps the Config CR on startup. Image references are read from environment variables BPFMAN_IMG and BPFMAN_AGENT_IMG on the operator deployment. Both are required; missing either is a fatal startup error. The deployment manifest (config/bpfman-operator-deployment/deployment.yaml) carries upstream defaults (quay.io/bpfman/bpfman:latest and quay.io/bpfman/bpfman-agent:latest).

  3. The deleted config/bpfman-deployment/ directory contained the kustomize overlay (config.yaml, kustomization.yaml.env) that was previously used by make patch-image-references. That Makefile target now patches the env vars directly on deployment.yaml via sed.
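As a hedged illustration, that sed patch could equally be expressed in Python; `patch_image_env` is a hypothetical name, and the regex assumes the standard Kubernetes `- name:` / `value:` env list layout used by deployment.yaml:

```python
import re


def patch_image_env(deployment_yaml: str,
                    bpfman_img: str,
                    agent_img: str) -> str:
    """Replace the values of the BPFMAN_IMG and BPFMAN_AGENT_IMG env vars
    in a deployment manifest, mirroring what `make patch-image-references`
    does with sed.
    """
    for name, image in (("BPFMAN_IMG", bpfman_img),
                        ("BPFMAN_AGENT_IMG", agent_img)):
        # Match "- name: <NAME>" followed by its "value:" line and
        # replace everything after "value: " with the new pullspec.
        deployment_yaml = re.sub(
            rf"(- name: {name}\n\s*value: ).*",
            rf"\g<1>{image}",
            deployment_yaml,
        )
    return deployment_yaml
```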

Impact on downstream tooling

After merging PR #498 upstream and pulling it into openshift/bpfman-operator:

  • update-configmap.py / update-config.py has no target file. There is no standalone Config CR or ConfigMap manifest in the bundle to patch.

  • The agent and bpfman pullspecs must instead be stamped into the CSV's deployment spec as env var values for BPFMAN_IMG and BPFMAN_AGENT_IMG. This is where the operator reads them at runtime.

  • validate-snapshot.py must extract the image refs from the CSV deployment env vars instead of from a standalone manifest.

  • Containerfile.bundle.openshift must be updated to call the revised script targeting the CSV rather than a removed manifest.
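Pulling the image references out of the CSV deployment spec (the new substitution target) can be sketched as follows; `csv_image_env` is an illustrative name, and the nested-key layout assumes a standard ClusterServiceVersion install strategy:

```python
def csv_image_env(csv: dict) -> dict[str, str]:
    """Collect BPFMAN_IMG and BPFMAN_AGENT_IMG values from every container
    in every deployment embedded in a ClusterServiceVersion document.
    """
    wanted = {"BPFMAN_IMG", "BPFMAN_AGENT_IMG"}
    found: dict[str, str] = {}
    deployments = (csv.get("spec", {})
                      .get("install", {})
                      .get("spec", {})
                      .get("deployments", []))
    for dep in deployments:
        containers = (dep.get("spec", {})
                         .get("template", {})
                         .get("spec", {})
                         .get("containers", []))
        for container in containers:
            for env in container.get("env", []):
                if env.get("name") in wanted:
                    found[env["name"]] = env.get("value", "")
    return found
```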

The coordination problem remains

The substitution target changes (standalone manifest to CSV env vars) but the fundamental coordination problem is unchanged:

  1. Agent builds. Konflux nudges bpfman-agent.txt. PR merges. Bundle rebuilds with the new agent digest but old operator/bpfman digests stamped into the CSV env vars.

  2. Operator builds. Konflux nudges bpfman-operator.txt. PR merges. Bundle rebuilds again, now with both new, but only if step 1 has already merged.

  3. If both nudge PRs are open simultaneously and merge in sequence, only the bundle built after the second merge is self-consistent.

The snapshot produced between steps 1 and 2 fails validate-snapshot.py and cannot be released. This is the same race that exists today; the upstream change does not make it worse, but it does not fix it either.

What would fix it

The root cause is that Konflux treats each component nudge as an independent event. To produce a self-consistent snapshot on every bundle build, one of the following would be needed:

  1. Atomic multi-component nudge. Konflux would wait for all components in an application (or a defined group) to complete their builds before raising a single nudge PR that updates all .txt files at once. The bundle would then rebuild exactly once with all digests current.

  2. Snapshot-level gating. Rather than triggering the bundle build on each .txt file change, Konflux would only trigger it when all component digests in the snapshot are newer than those currently in the bundle. This is effectively a "quorum" gate.

  3. Release-time validation only. Accept that intermediate bundle builds may be inconsistent. Rely on validate-snapshot.py (or Konflux Enterprise Contract policies) to block release of any snapshot where the bundle's embedded digests do not match the component digests. The system converges after the final nudge merges. The cost is wasted bundle builds and a slower release cadence.

Option 3 is what exists today. Options 1 and 2 would require changes to Konflux's nudging and build-triggering infrastructure.

Monorepo support in Konflux

The operator and agent are built from the same repository, which makes this a monorepo problem. Konflux has some monorepo awareness but it does not solve this case:

  • PR-time group snapshots. When a PR targets a monorepo with multiple components, separate build pipelines trigger for each component. Konflux supports "group snapshot testing" that combines all updated component builds into a single snapshot for unified integration testing. This works at PR time.

  • Post-merge: no grouping. After merge, each component build creates its own intermediate snapshot. The Konflux documentation explicitly acknowledges this gap: group snapshot testing "is unfortunately currently not directly available after the Pull Request is merged", so "individual build pipelines will result in intermediate Snapshots which will not contain all the changes until the final build pipelineRun completes."

  • Recommended workaround. The documentation recommends creating a custom IntegrationTestScenario for push events that validates whether a snapshot contains all expected component updates, failing the test if incomplete. This is a hand-rolled quorum gate: the integration test checks "are all component digests in this snapshot consistent?" and blocks release until they are.

This is exactly the pattern validate-snapshot.py already implements. The problem is not that invalid snapshots slip through to release -- they don't -- but that valid snapshots are slow to materialise. Every intermediate (inconsistent) bundle build is wasted work, and the release pipeline stalls until the final nudge lands and the last bundle rebuild produces a consistent snapshot.

See: Managing Monorepo Applications

Prior attempts to solve this

This problem has been actively investigated since October 2025 across a series of PRs in openshift/bpfman-operator. Every approach tried has either introduced new problems or only partially mitigated the race.

Approach 1: Pipeline-level synchronisation (Oct--Nov 2025)

The idea was to route all component updates through the operator pipeline as a synchronisation point, so the bundle only rebuilds after the operator has incorporated all upstream changes.

  • PR #1083 -- Added nudge file path triggers to component push pipelines so components would rebuild when their image references changed. This caused an infinite build loop: a component build updated its own .txt file, which in turn triggered another build of the same component.

  • PR #1090 -- Reverted #1083 to break the infinite loop.

  • PR #1094 -- Removed nudge file path triggers from component push pipelines (kept them in pull-request pipelines only).

  • PR #1097 -- Made the operator pipeline watch bpfman-agent.txt and bpfman.txt, and the bundle pipeline watch only bpfman-operator.txt. The operator becomes the synchronisation point: agent/daemon changes flow through the operator before reaching the bundle. This reduced the race window but did not eliminate it.

  • PR #1100 -- Removed wasteful bundle validation triggers for component nudge files (pull-request pipelines only).

Approach 2: Nudge configuration fixes (Nov 2025--Jan 2026)

Attempted to fix the nudging topology and use Konflux annotations to collapse competing PRs.

  • PR #1276 -- Changed agent and daemon to nudge the bundle directly (instead of going through the operator). Added build-nudge-simple-branch: 'true' annotation to all components so competing nudge PRs targeting the same repo collapse into a single branch. This helped reduce duplicate PRs but did not solve the timing problem: the single branch still updates one .txt file at a time.

  • PR #1282 -- Applied the same nudge configuration fix to z-stream pipelines.

  • PR #1299 -- Ran a deliberate race-condition experiment: changed a file in cmd/ that triggers both agent and operator builds from the same commit, to observe whether the simple-branch annotation prevents the mismatch. Result: the race still exists.

Approach 3: Release-time snapshot validation (Jan 2026)

Since pipeline-level fixes could not eliminate the race, the next approach was to let inconsistent snapshots happen but block their release.

  • PR #1393 (closed), then PR #1401 (merged) -- Added validate-snapshot.py as a Konflux IntegrationTestScenario. The script extracts the bundle image, parses the CSV and ConfigMap for embedded sha256 digests, and compares them against the component digests in the snapshot. Blocks release if any mismatch is found. At the time of implementation, the observed release failure rate due to inconsistent snapshots was ~70%.

  • PR #1407 -- Scoped validation to push events only (not PR snapshots).

  • PR #1415 -- Further scoped validation to bundle component snapshots only.

  • PR #1436, PR #1437 -- Removed the snapshot validation pipeline and test-scripts integration test. This was a temporary measure while investigating release pipeline issues. The validation was correct but the integration with Konflux's release pipeline had its own problems.

Where this leaves us

Every mitigation tried has either:

  • Introduced new problems (infinite build loops, over-triggering)
  • Only narrowed the race window without closing it (operator as synchronisation point, simple-branch annotations)
  • Correctly identified bad snapshots but not prevented the wasted work (snapshot validation)

The fundamental issue remains: Konflux does not offer a post-merge mechanism to atomically coordinate multiple component builds before triggering a dependent (bundle) build.

Summary

The coordination problem is inherent to how Konflux handles post-merge builds for multi-component applications. The upstream change (PR #498) shifts where image references are stamped (from a standalone bundle manifest to CSV deployment env vars) but does not change the fundamental issue: nudge PRs arrive independently, each triggers a bundle rebuild, and only the final rebuild in a cycle produces a releasable snapshot. Konflux does not currently offer post-merge group snapshots or atomic multi-component nudges that would eliminate this race.

This is not a configuration error. It has been actively investigated over five months with multiple approaches, none of which fully solve the problem without changes to Konflux's nudging and snapshot infrastructure.
