Investigation Plan: E2E Test Failure - TestGithubPushRequestGitOpsCommentCancel
Problem Summary
The E2E test TestGithubPushRequestGitOpsCommentCancel is failing after 111.40 seconds. The test exercises the GitOps comment-based cancellation feature for push events.
Failure Timeline
- 05:54:58 - /test comment triggers third PipelineRun creation
- 05:55:00 - Third PipelineRun created (pipelinerun-on-push-wlddg)
- 05:55:01 - /cancel comment is sent
- 05:55:03 - PipelineRun is successfully cancelled (status: "Cancelled")
- 05:55:07 - Repository status updated to 3/3 items
- 05:55:09 - Test begins searching for skip message in controller logs
- 05:56:04 - Test fails with timeout error
Actual Error Message
neither a cancelled pipelinerun in repo status or a request to skip the cancellation in the controller log was found: could not find a match using the labelSelector: app.kubernetes.io/name=controller in container pac-controller for regexp: .*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done.
Root Cause Analysis
Test Logic (test/github_push_retest_test.go:132-205)
The test has two success paths:
Path A (Lines 176-181): Check if Repository status contains a Cancelled condition for _, c := range repo.Status { if c.Conditions[0].Reason == tektonv1.TaskRunReasonCancelled.String() { cancelled = true } }
Path B (Lines 184-191): If no cancelled status found, check controller logs for skip message reg := regexp.MustCompile(".*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done.") err = twait.RegexpMatchingInControllerLog(ctx, g.Cnx, *reg, 10, "controller", &numLines)
What Actually Happened
- PipelineRun was cancelled successfully - Confirmed in pac-pipelineruns.yaml:
- Line 305: message: PipelineRun "pipelinerun-on-push-wlddg" was cancelled
- Line 306: reason: Cancelled
- Line 291: status: CancelledRunFinally
- Repository status was NOT updated - The pac-repositories.yaml file is empty: apiVersion: v1 items: [] kind: List
- No skip message in logs - The regex pattern wasn't found in controller logs
The Bug
The Repository CRD status is not being updated when a PipelineRun is cancelled via a comment. This appears to be one of two issues:
- Race condition: The cancellation happens so fast that the Repository reconciler doesn't process it
- Missing status update: The code path for comment-based cancellation doesn't update the Repository status
Investigation Needed
Files to Examine
- pkg/reconciler/status.go - Repository status update logic
- Method: updateRepoRunStatus
- Need to verify it handles cancelled PipelineRuns
- pkg/pipelineascode/cancel_pipelineruns.go - Cancellation logic
- Methods: cancelPipelineRuns, CancelPipelineRuns
- Line 118: "cancel-in-progress for event" log
- Lines with "skipping cancelling pipelinerun" messages
- pkg/reconciler/reconciler.go - Main reconciliation loop
- Need to understand when Repository status gets updated
- test/github_push_retest_test.go:132-205 - The failing test
- Understand test expectations and timing
Questions to Answer
- Does the Repository reconciler run after a PipelineRun is cancelled?
- Is there a specific code path for updating Repository status when cancellation occurs via comment?
- Should the test wait longer for the Repository status to be updated?
- Is the empty pac-repositories.yaml expected, or was the Repository CR deleted before logs were collected?
Root Cause Confirmed
After thorough investigation, this is a TEST BUG with a timing/race condition, not a product bug.
What Actually Happened
- At 05:55:03: PipelineRun pipelinerun-on-push-wlddg was successfully cancelled
- At 05:55:07: Repository status showed 3/3 items (correct)
- At teardown: Repository CR was deleted by NSTearDown (line 14 of repository/teardown.go)
- The logs were collected AFTER teardown, so pac-repositories.yaml shows empty items
The Real Problem
The test expects either:
- Repository.Status to contain a Cancelled condition (but it checks AFTER the repo is deleted) OR
- Controller logs to show "skipping cancelling pipelinerun.*on-push.*already done"
The regex pattern at line 186 is too broad: ".*pipelinerun.*skipping cancelling pipelinerun.*on-push.already done." - it matches ANY pipelinerun with "on-push" in the name, not specifically the one created by /test.
When cancellation happens so quickly, the PipelineRun completes before the cancel comment is processed, but the test doesn't verify it's checking the RIGHT PipelineRun.
Implementation Plan
Fix: Improve Test Logic and Specificity
File: test/github_push_retest_test.go (lines 163-205)
Changes Required
Replace the current logic (lines 163-205) with a more robust approach:
- Track the specific PipelineRun created by /test comment
- Use existing UntilPipelineRunHasReason function (already available in wait.go:95-116)
- Make regex pattern specific to the actual PipelineRun name
- Check Repository status BEFORE teardown
Detailed Implementation
// After line 163: Create the 3rd PipelineRun with /test comment err = twait.UntilPipelineRunCreated(ctx, g.Cnx.Clients, waitOpts) assert.NilError(t, err)
// NEW: Get the specific PipelineRun name that was created prs, err := g.Cnx.Clients.Tekton.TektonV1().PipelineRuns(g.TargetNamespace).List(ctx, metav1.ListOptions{ LabelSelector: fmt.Sprintf("%s=%s", keys.SHA, g.SHA) }) assert.NilError(t, err)
var targetPRName string for _, pr := range prs.Items { if pr.GetAnnotations()[keys.OriginalPRName] == "pipelinerun-on-push" & pr.GetAnnotations()[keys.EventType] == "test-comment" targetPRName = pr.GetName bre
} assert.Assert(t, targetPRName != "", "Could not find the PipelineRun created by /test comment")
// Send /cancel comment comment := "/cancel pipelinerun-on-push branch:" + g.TargetNamespace g.Cnx.Clients.Log.Infof("%s on Push Request", comment) _, _, err = g.Provider.Client().Repositories.CreateComment(ctx, g.Options.Organization g.Options.Repo, g.SHA &github.RepositoryComment{Body: github.Ptr(comment)} assert.NilError(t, err)
g.Cnx.Clients.Log.Infof("Waiting for PipelineRun %s to be cancelled or skip message", targetPRName)
// IMPROVED: Wait for PipelineRun to be cancelled using existing function waitOpts.MinNumberStatus = 1 // Looking for at least 1 cancelled PR err = twait.UntilPipelineRunHasReason(ctx, g.Cnx.Clients, tektonv1.PipelineRunReasonCancelled, waitOpts)
if err == nil { // PipelineRun was successfully cancelled, verify in Repository statu g.Cnx.Clients.Log.Infof("PipelineRun %s was cancelled, verifying Repository status", targetPRName repo, err := g.Cnx.Clients.PipelineAsCode.PipelinesascodeV1alpha1().Repositories(g.TargetNamespace).Get(ctx, g.TargetNamespace, metav1.GetOptions{} assert.NilError(t, err
cancelled := fals
for _, status := range repo.Status
if status.PipelineRunName == targetPRName
len(status.Conditions) > 0
status.Conditions[0].Reason == tektonv1.TaskRunReasonCancelled.String()
cancelled = t
br
assert.Assert(t, cancelled, fmt.Sprintf("Repository status does not show PipelineRun %s as cancelled", targetPRName)
// Final verification: ensure we have the expected number of PipelineRun
pruns, err := g.Cnx.Clients.Tekton.TektonV1().PipelineRuns(g.TargetNamespace).List(ctx, metav1.ListOptions{}
assert.NilError(t, err
assert.Equal(t, len(pruns.Items), numberOfStatus
retur
}
// If PipelineRun didn't get cancelled, it likely completed before cancel was processed // Check for the skip message in controller logs g.Cnx.Clients.Log.Infof("PipelineRun may have completed before cancellation, checking logs for skip message") numLines := int64(100) // Increased from 20 to capture more logs reg := regexp.MustCompile(fmt.Sprintf(".*skipping cancelling pipelinerun %s/%s.already done.", g.TargetNamespace, targetPRName) err = twait.RegexpMatchingInControllerLog(ctx, g.Cnx, *reg, 10, "controller", &numLines) if err != nil { t.Errorf("neither a cancelled pipelinerun in repo status or a request to skip the cancellation in the controller log was found: %s", err.Error() retur }
g.Cnx.Clients.Log.Infof("Found skip message for PipelineRun %s in controller logs", targetPRName)
Why This Fix Works
- Specific PipelineRun tracking: We now identify the exact PR created by /test comment
- Uses existing helper: UntilPipelineRunHasReason already exists in wait.go (line 95-116)
- Precise regex: Pattern now includes the actual namespace and PR name, not just "on-push"
- Checks before teardown: Repository status is verified before the test cleanup
- Better error messages: Includes the specific PR name in error messages
- Handles race condition: Accepts either cancelled status OR skip message
Files to Modify
| File | Lines | Change Description |
|---|---|---|
| test/github_push_retest_test.go | 163-205 | Replace test logic with improved version above |
Optional Enhancement
File: pkg/pipelineascode/cancel_pipelineruns.go (line 222)
Add original PR name to log message for better debugging:
// Current (line 222): p.logger.Infof("cancel-in-progress: skipping cancelling pipelinerun %v/%v, already done", pr.GetNamespace(), pr.GetName())
// Improved: originalName := pr.GetAnnotations()[keys.OriginalPRName] p.logger.Infof("cancel-in-progress: skipping cancelling pipelinerun %v/%v (original: %s), already done", pr.GetNamespace(), pr.GetName(), originalName
This makes the log message more informative and helps with debugging.
Testing the Fix
- Run the specific test multiple times: make test-e2e TEST_ARGS="-run TestGithubPushRequestGitOpsCommentCancel"
- Verify both scenarios work:
- Fast cancellation (PR completes first): Should find skip message
- Normal cancellation: Should find cancelled status in Repository
- Check that false positives are eliminated (only matches correct PR)
Summary
- Root cause: Test had imprecise wait conditions and overly broad regex pattern
- Product behavior: Correct - cancellation works as designed
- Fix complexity: Low - mainly improving test logic with existing helper functions
- Risk: Minimal - only changes test code, no product code changes needed
- Expected outcome: Test will reliably pass in both cancellation scenarios