@sibljon
Created April 16, 2026 10:19
Backend Integration Test Flakiness Report

Scope: 35 failed integration_test CI runs over ~2 days (April 15–16, 2026)
Total tests in suite: ~674
TL;DR: One data race is responsible for roughly 75% of all failures. Fix it and most of the CI pain goes away.


The main culprit: a data race in analytics

When Kunal saw 31 failures in a single run, those weren't 31 broken tests. They were one race condition killing the test binary and taking down everything running at that moment.

Go's -race flag instruments every memory read and write. When two goroutines access the same memory concurrently without synchronization, and at least one of them is writing, the detector kills the test binary immediately. Every test that was in flight at that moment gets marked FAIL. That's why a single race produces 20–40 failures at once, many with durations of 0.00s.

The race:

  • Reader: svc/analytics/firehose.go:133 — the firehose background goroutine is calling json.Marshal(e) on a buffered analytics event
  • Writer: cmd/svc/billing/internal/utils/utils.go:1314 — billing's SyncCustomerDataToSupport is doing append(tr, "Billing_Parent") on a slice

These seem unrelated, but they share a backing array. Here's how:

  1. Billing builds tagsToAdd/tagsToRemove slices and calls UpdateThread via the threading fused client
  2. The fused client is in-process — no serialization, no copy. The request proto holds the exact same []string slice headers
  3. The threading handler stores in.AddTags / in.RemoveTags in a ThreadEventUpdateTags struct and queues it as an analytics event's Attributes (server.go:5331)
  4. The firehose flushes and marshals that event, reading the slice's backing array
  5. Meanwhile billing's loop hits the next org iteration and does append(tr, "Billing_Parent") — if the slice has spare capacity, this writes into that same backing array in place

The fix is one line in threading/server.go:5330: copy the slices before storing them.

```go
// Before
AddTags:    in.AddTags,
RemoveTags: in.RemoveTags,

// After
AddTags:    append([]string(nil), in.AddTags...),
RemoveTags: append([]string(nil), in.RemoveTags...),
```

The directory service's UpdateTags handler likely has the same pattern and should get the same treatment.

Most affected tests (all cascade victims; the tests themselves are fine):

| Test | Failures (35 runs) | Example run |
| --- | --- | --- |
| TestTwimulator_IncomingSIPCall_Transfer | 16 | run 24491638845 |
| TestCallRecordingTranscription_EndToEnd | 16 | run 24471400445 |
| TestTwimulator_IncomingSoftphoneCall_Transfer | 15 | run 24489987763 |
| TestAITranscriptionSettings_Mutations | 15 | run 24487987796 |
| TestTwimulator_OutgoingSIPCall_Transfer | 14 | run 24498542318 |

In failed runs with this race, you'll see `testing.go:1712: race detected during execution of test` in the output. Tests marked 0.00s were killed by the race before they got a chance to run.


Second culprit: MySQL deadlock on ai_transcription_configuration

About 15% of failures come from a deadlock unrelated to the race. Three transcription tests run in parallel, each creating an AI transcription configuration for a test org. They all INSERT into integration_excomms.ai_transcription_configuration concurrently and deadlock on the secondary index idx_endpoint_voicemail_config_org_id.

Error: Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Affected tests:

  • TestCallRecordingTranscription_EndToEnd (example)
  • TestAudioMessageTranscription_EndToEnd (example)
  • TestAITranscriptionSettings_Mutations

The fix here is either adding retry logic on deadlock (MySQL error 1213) in the DAL layer for this table, or removing t.Parallel() from these three tests to serialize them (simpler but slower).


Everything else (~10%)

Two smaller issues that the race has been masking; expect them to surface once it's fixed:

Async call transfer timeouts: TestTwimulator_OutgoingSIPCall_Transfer fails with Condition never satisfied at sip.outgoing.star_menu_transfer_test.go:261. The test polls for a warm transfer to complete, but under CI load the state machine runs slower than the poll window. Increasing the timeout or reducing parallelism for call tests would help.

Missing PhoneTreeNodeDescription: TestTwimulator_IncomingCall_CallFlowMenuHangup fails because a field is set asynchronously and the assertion runs before it's populated. 5 failures across 35 runs. Small but genuine.


Priority order

  1. Fix the data race (threading/server.go:5330, directory UpdateTags handler) — eliminates ~75% of all failures. This is a real bug, not just a test issue: the race corrupts analytics events, silently writing wrong tag data to Firehose in production.

  2. Add deadlock retry for ai_transcription_configuration — eliminates ~15% of remaining failures.

  3. Increase timeout in TestTwimulator_OutgoingSIPCall_Transfer — small, targeted fix for the remaining call transfer flake.

The race fix alone should take most runs from 20–30 failures to 0–2.
