Scope: 35 failed integration_test CI runs, ~2 days (April 15–16, 2026)
Total tests in suite: ~674
TL;DR: One data race is responsible for roughly 75% of all failures. Fix it and most of the CI pain goes away.
When Kunal saw 31 failures in a single run, they weren't 31 broken tests. They were 1 race condition killing the binary and taking everything running at that moment down with it.
Go's -race flag instruments every memory read and write. When two goroutines access the same memory concurrently without synchronization — one writing — it kills the test binary immediately. Every test that was in-flight at that moment gets marked FAIL. That's why a single race produces 20–40 failures at once with durations of 0.00s.
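To see what the detector flags in isolation, here's a minimal sketch, with illustrative names that are not from the codebase: one goroutine writes a slice element while another reads it, with no synchronization between them.

```go
// race_demo.go — a minimal data race, illustrative only.
// Run with: go run -race race_demo.go
package main

import "fmt"

func main() {
	tags := []string{"initial"}

	go func() {
		tags[0] = "updated" // writer goroutine
	}()

	fmt.Println(tags[0]) // concurrent read from main; -race reports this pair
}
```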
The race:
- Reader: `svc/analytics/firehose.go:133`, where the firehose background goroutine calls `json.Marshal(e)` on a buffered analytics event
- Writer: `cmd/svc/billing/internal/utils/utils.go:1314`, where billing's `SyncCustomerDataToSupport` does `append(tr, "Billing_Parent")` on a slice
These seem unrelated, but they share a backing array. Here's how (see the sketch after this list):
- Billing builds `tagsToAdd`/`tagsToRemove` slices and calls `UpdateThread` via the threading fused client
- The fused client is in-process: no serialization, no copy. The request proto holds the exact same `[]string` slice headers
- The threading handler stores `in.AddTags`/`in.RemoveTags` in a `ThreadEventUpdateTags` struct and queues it as an analytics event's `Attributes` (`server.go:5331`)
- The firehose flushes and marshals that event, reading the slice's backing array
- Meanwhile billing's loop hits the next org iteration and does `append(tr, "Billing_Parent")`; if the slice has spare capacity, this writes into that same backing array in place
The fix is one line in `threading/server.go:5330`: copy the slices before storing them.

```go
// Before
AddTags:    in.AddTags,
RemoveTags: in.RemoveTags,

// After
AddTags:    append([]string(nil), in.AddTags...),
RemoveTags: append([]string(nil), in.RemoveTags...),
```

The directory service's `UpdateTags` handler likely has the same pattern and should get the same treatment.
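Why this works: appending to a nil slice always allocates a fresh backing array (or stays nil for empty input), so the stored event can no longer alias the caller's array. On Go 1.21+, `slices.Clone(in.AddTags)` expresses the same intent more directly.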
Most affected tests (all cascade victims — the tests themselves are fine):
| Test | Failures (35 runs) | Example run |
|---|---|---|
| `TestTwimulator_IncomingSIPCall_Transfer` | 16 | run 24491638845 |
| `TestCallRecordingTranscription_EndToEnd` | 16 | run 24471400445 |
| `TestTwimulator_IncomingSoftphoneCall_Transfer` | 15 | run 24489987763 |
| `TestAITranscriptionSettings_Mutations` | 15 | run 24487987796 |
| `TestTwimulator_OutgoingSIPCall_Transfer` | 14 | run 24498542318 |
In failed runs with this race, you'll see `testing.go:1712: race detected during execution of test` in the output. Tests marked `0.00s` didn't even start; they were killed by the race before they got a chance to run.
About 15% of failures are a deadlock unrelated to the race. Three transcription tests run in parallel, each creating an AI transcription configuration for a test org. They all INSERT into `integration_excomms.ai_transcription_configuration` concurrently and deadlock on the secondary index `idx_endpoint_voicemail_config_org_id`.

Error: `Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction`

Affected tests:
- `TestCallRecordingTranscription_EndToEnd` (example)
- `TestAudioMessageTranscription_EndToEnd` (example)
- `TestAITranscriptionSettings_Mutations`

The fix here is adding retry logic on deadlock (MySQL error 1213) in the DAL layer for this table, or serializing these three tests by removing `t.Parallel()` (simpler but slower).
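A minimal sketch of the retry option, assuming the DAL uses go-sql-driver/mysql; the `withDeadlockRetry` helper, attempt count, and backoff values are hypothetical, not existing code.

```go
package dal

import (
	"context"
	"errors"
	"time"

	"github.com/go-sql-driver/mysql"
)

const mysqlErrDeadlock = 1213

// withDeadlockRetry retries fn when MySQL reports a deadlock (error 1213).
// Retrying is safe because MySQL fully rolls back the victim transaction.
func withDeadlockRetry(ctx context.Context, attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		var me *mysql.MySQLError
		if !errors.As(err, &me) || me.Number != mysqlErrDeadlock {
			return err // not a deadlock; don't retry
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(i+1) * 50 * time.Millisecond): // linear backoff
		}
	}
	return err
}
```

A call site would wrap the whole insert, e.g. `withDeadlockRetry(ctx, 3, func() error { return d.insertAITranscriptionConfiguration(ctx, cfg) })`, where the method name is likewise a hypothetical stand-in for the table's real DAL method.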
Two smaller issues will only surface once the race is fixed and no longer masks them:
Async call transfer timeouts: `TestTwimulator_OutgoingSIPCall_Transfer` fails with `Condition never satisfied` at `sip.outgoing.star_menu_transfer_test.go:261`. The test polls for a warm transfer to complete, but under CI load the state machine runs slower than the poll window. Increasing the timeout or reducing parallelism for call tests would help.
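If the poll uses testify's `require.Eventually` (consistent with the `Condition never satisfied` message), widening the window is a small change. In this sketch the durations are placeholders and `transferCompleted` is a hypothetical stand-in for the test's real state check.

```go
package sip_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// transferCompleted is a hypothetical stand-in for the real condition
// polled at star_menu_transfer_test.go:261.
func transferCompleted() bool { return true }

func TestOutgoingSIPCallTransfer_Sketch(t *testing.T) {
	require.Eventually(t,
		func() bool { return transferCompleted() },
		60*time.Second,       // widened window for loaded CI workers; placeholder
		500*time.Millisecond, // poll interval; placeholder
		"warm transfer never completed")
}
```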
Missing `PhoneTreeNodeDescription`: `TestTwimulator_IncomingCall_CallFlowMenuHangup` fails because a field is set asynchronously and the assertion runs before it's populated. 5 failures across 35 runs. Small but genuine.
Recommended fixes, in order of impact:

- Fix the data race (`threading/server.go:5330`, the directory `UpdateTags` handler): eliminates ~75% of all failures. This is a real bug in production too, not just a test issue. The race corrupts analytics events, silently writing wrong tag data to Firehose in production.
- Add deadlock retry for `ai_transcription_configuration`: eliminates ~15% of remaining failures.
- Increase the timeout in `TestTwimulator_OutgoingSIPCall_Transfer`: a small, targeted fix for the remaining call transfer flake.
The race fix alone should take most runs from 20–30 failures to 0–2.