e2e-failure-investigation.md

Investigation Plan: E2E Test Failure - TestGithubPushRequestGitOpsCommentCancel

Problem Summary

The E2E test TestGithubPushRequestGitOpsCommentCancel is failing after 111.40 seconds. The test exercises the GitOps comment-based cancellation feature for push events.

Failure Timeline

05:54:58 - /test comment triggers third PipelineRun creation
05:55:00 - Third PipelineRun created (pipelinerun-on-push-wlddg)
05:55:01 - /cancel comment is sent
05:55:03 - PipelineRun is successfully cancelled (status: "Cancelled")
05:55:07 - Repository status updated to 3/3 items
05:55:09 - Test begins searching for skip message in controller logs
05:56:04 - Test fails with timeout error

Actual Error Message

neither a cancelled pipelinerun in repo status or a request to skip the cancellation in the controller log was found: could not find a match using the labelSelector: app.kubernetes.io/name=controller in container pac-controller for regexp: .*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done.

Root Cause Analysis

Test Logic (test/github_push_retest_test.go:132-205)

The test has two success paths:

Path A (Lines 176-181): Check if Repository status contains a Cancelled condition for _, c := range repo.Status { if c.Conditions[0].Reason == tektonv1.TaskRunReasonCancelled.String() { cancelled = true } }

Path B (Lines 184-191): If no cancelled status found, check controller logs for skip message reg := regexp.MustCompile(".*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done.") err = twait.RegexpMatchingInControllerLog(ctx, g.Cnx, *reg, 10, "controller", &numLines)

What Actually Happened

PipelineRun was cancelled successfully - Confirmed in pac-pipelineruns.yaml:

Line 305: message: PipelineRun "pipelinerun-on-push-wlddg" was cancelled
Line 306: reason: Cancelled
Line 291: status: CancelledRunFinally

Repository status was NOT updated - The pac-repositories.yaml file is empty: apiVersion: v1 items: [] kind: List
No skip message in logs - The regex pattern wasn't found in controller logs

The Bug

The Repository CRD status is not being updated when a PipelineRun is cancelled via a comment. This appears to be one of two issues:

Race condition: The cancellation happens so fast that the Repository reconciler doesn't process it
Missing status update: The code path for comment-based cancellation doesn't update the Repository status

Investigation Needed

Files to Examine

pkg/reconciler/status.go - Repository status update logic

Method: updateRepoRunStatus
Need to verify it handles cancelled PipelineRuns

pkg/pipelineascode/cancel_pipelineruns.go - Cancellation logic

Methods: cancelPipelineRuns, CancelPipelineRuns
Line 118: "cancel-in-progress for event" log
Lines with "skipping cancelling pipelinerun" messages

pkg/reconciler/reconciler.go - Main reconciliation loop

Need to understand when Repository status gets updated

test/github_push_retest_test.go:132-205 - The failing test

Understand test expectations and timing

Questions to Answer

Does the Repository reconciler run after a PipelineRun is cancelled?
Is there a specific code path for updating Repository status when cancellation occurs via comment?
Should the test wait longer for the Repository status to be updated?
Is the empty pac-repositories.yaml expected, or was the Repository CR deleted before logs were collected?

Root Cause Confirmed

After thorough investigation, this is a TEST BUG with a timing/race condition, not a product bug.

What Actually Happened

At 05:55:03: PipelineRun pipelinerun-on-push-wlddg was successfully cancelled
At 05:55:07: Repository status showed 3/3 items (correct)
At teardown: Repository CR was deleted by NSTearDown (line 14 of repository/teardown.go)
The logs were collected AFTER teardown, so pac-repositories.yaml shows empty items

The Real Problem

The test expects either:

Repository.Status to contain a Cancelled condition (but it checks AFTER the repo is deleted) OR
Controller logs to show "skipping cancelling pipelinerun.*on-push.*already done"

The regex pattern at line 186 is too broad: ".*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done." - it matches ANY pipelinerun with "on-push" in the name, not specifically the one created by /test.

When cancellation happens so quickly, the PipelineRun completes before the cancel comment is processed, but the test doesn't verify it's checking the RIGHT PipelineRun.

Implementation Plan

Fix: Improve Test Logic and Specificity

File: test/github_push_retest_test.go (lines 163-205)

Changes Required

Replace the current logic (lines 163-205) with a more robust approach:

Track the specific PipelineRun created by /test comment
Use existing UntilPipelineRunHasReason function (already available in wait.go:95-116)
Make regex pattern specific to the actual PipelineRun name
Check Repository status BEFORE teardown

Detailed Implementation

// After line 163: Create the 3rd PipelineRun with /test comment err = twait.UntilPipelineRunCreated(ctx, g.Cnx.Clients, waitOpts) assert.NilError(t, err)

// NEW: Get the specific PipelineRun name that was created prs, err := g.Cnx.Clients.Tekton.TektonV1().PipelineRuns(g.TargetNamespace).List(ctx, metav1.ListOptions{ LabelSelector: fmt.Sprintf("%s=%s", keys.SHA, g.SHA) }) assert.NilError(t, err)

var targetPRName string for _, pr := range prs.Items { if pr.GetAnnotations()[keys.OriginalPRName] == "pipelinerun-on-push" & pr.GetAnnotations()[keys.EventType] == "test-comment" targetPRName = pr.GetName bre

} assert.Assert(t, targetPRName != "", "Could not find the PipelineRun created by /test comment")

// Send /cancel comment comment := "/cancel pipelinerun-on-push branch:" + g.TargetNamespace g.Cnx.Clients.Log.Infof("%s on Push Request", comment) _, _, err = g.Provider.Client().Repositories.CreateComment(ctx, g.Options.Organization g.Options.Repo, g.SHA &github.RepositoryComment{Body: github.Ptr(comment)} assert.NilError(t, err)

g.Cnx.Clients.Log.Infof("Waiting for PipelineRun %s to be cancelled or skip message", targetPRName)

// IMPROVED: Wait for PipelineRun to be cancelled using existing function waitOpts.MinNumberStatus = 1 // Looking for at least 1 cancelled PR err = twait.UntilPipelineRunHasReason(ctx, g.Cnx.Clients, tektonv1.PipelineRunReasonCancelled, waitOpts)

if err == nil { // PipelineRun was successfully cancelled, verify in Repository statu g.Cnx.Clients.Log.Infof("PipelineRun %s was cancelled, verifying Repository status", targetPRName repo, err := g.Cnx.Clients.PipelineAsCode.PipelinesascodeV1alpha1().Repositories(g.TargetNamespace).Get(ctx, g.TargetNamespace, metav1.GetOptions{} assert.NilError(t, err

cancelled := fals
for _, status := range repo.Status
	if status.PipelineRunName == targetPRName
	  len(status.Conditions) > 0
	  status.Conditions[0].Reason == tektonv1.TaskRunReasonCancelled.String()
		cancelled = t
		br


assert.Assert(t, cancelled, fmt.Sprintf("Repository status does not show PipelineRun %s as cancelled", targetPRName)

// Final verification: ensure we have the expected number of PipelineRun
pruns, err := g.Cnx.Clients.Tekton.TektonV1().PipelineRuns(g.TargetNamespace).List(ctx, metav1.ListOptions{}
assert.NilError(t, err
assert.Equal(t, len(pruns.Items), numberOfStatus
retur

}

// If PipelineRun didn't get cancelled, it likely completed before cancel was processed // Check for the skip message in controller logs g.Cnx.Clients.Log.Infof("PipelineRun may have completed before cancellation, checking logs for skip message") numLines := int64(100) // Increased from 20 to capture more logs reg := regexp.MustCompile(fmt.Sprintf(".*skipping cancelling pipelinerun %s/%s.already done.", g.TargetNamespace, targetPRName) err = twait.RegexpMatchingInControllerLog(ctx, g.Cnx, *reg, 10, "controller", &numLines) if err != nil { t.Errorf("neither a cancelled pipelinerun in repo status or a request to skip the cancellation in the controller log was found: %s", err.Error() retur }

g.Cnx.Clients.Log.Infof("Found skip message for PipelineRun %s in controller logs", targetPRName)

Why This Fix Works

Specific PipelineRun tracking: We now identify the exact PR created by /test comment
Uses existing helper: UntilPipelineRunHasReason already exists in wait.go (line 95-116)
Precise regex: Pattern now includes the actual namespace and PR name, not just "on-push"
Checks before teardown: Repository status is verified before the test cleanup
Better error messages: Includes the specific PR name in error messages
Handles race condition: Accepts either cancelled status OR skip message

Files to Modify

File	Lines	Change Description
test/github_push_retest_test.go	163-205	Replace test logic with improved version above

Optional Enhancement

File: pkg/pipelineascode/cancel_pipelineruns.go (line 222)

Add original PR name to log message for better debugging:

// Current (line 222): p.logger.Infof("cancel-in-progress: skipping cancelling pipelinerun %v/%v, already done", pr.GetNamespace(), pr.GetName())

// Improved: originalName := pr.GetAnnotations()[keys.OriginalPRName] p.logger.Infof("cancel-in-progress: skipping cancelling pipelinerun %v/%v (original: %s), already done", pr.GetNamespace(), pr.GetName(), originalName

This makes the log message more informative and helps with debugging.

Testing the Fix

Run the specific test multiple times: make test-e2e TEST_ARGS="-run TestGithubPushRequestGitOpsCommentCancel"
Verify both scenarios work:

Fast cancellation (PR completes first): Should find skip message
Normal cancellation: Should find cancelled status in Repository

Check that false positives are eliminated (only matches correct PR)

Summary

Root cause: Test had imprecise wait conditions and overly broad regex pattern
Product behavior: Correct - cancellation works as designed
Fix complexity: Low - mainly improving test logic with existing helper functions
Risk: Minimal - only changes test code, no product code changes needed
Expected outcome: Test will reliably pass in both cancellation scenarios

chmouel/e2e-failure-investigation.md

Select an option

No results found

Select an option

No results found