SEV-2 incident (#alert-4421-system_instability_unclear_why) declared at 12:17 PM PDT on 2026-04-06. System instability with unclear root cause. Impact window roughly 11:35 AM - 12:12 PM PDT (2:35 PM - 3:12 PM ET). ~40-50 users affected per 20-min window across payroll workflows (hours entry, pay dashboard, off-cycle payrolls, support pages).
Application-level findings from the channel:
- KeyError: Enum PayrollBlockers::Enums::PayrollBlockerType key not found: :partner_tos_not_accepted (reverted via PR #329058; failure mode sketched below)
- http2 connection errors under load (HTTP fetch failed from 'zenpayroll')
- Compute/network team confirmed pods healthy, scaling normal, SLO recovered
- CX reports impact subsided
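The application code behind that KeyError is Ruby; as a language-neutral illustration, here is a minimal Python sketch of the failure mode, with hypothetical names standing in for the real enum and handlers: a new enum key ships while a consumer's strict lookup table doesn't yet know it.

```python
# Hypothetical mapping; the real code is Ruby (PayrollBlockers::Enums).
BLOCKER_MESSAGES = {
    "missing_bank_account": "Add a company bank account to continue.",
    "unsigned_forms": "Sign the outstanding forms to continue.",
}

def blocker_message(blocker_type: str) -> str:
    # Strict lookup: a key introduced by new code but absent here
    # (e.g. "partner_tos_not_accepted") raises KeyError, which is the
    # shape of the incident error prior to the revert.
    return BLOCKER_MESSAGES[blocker_type]

def blocker_message_safe(blocker_type: str) -> str:
    # Defensive variant: degrade to a generic message instead of raising.
    return BLOCKER_MESSAGES.get(blocker_type, "A payroll blocker needs attention.")
```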
| Instance | Avg CPU | Max CPU | Trend | Concern? |
|---|---|---|---|---|
| ibex-production-uw2-aurora-mysql-1 | 18% | 86% | Jumped from ~2.5% to 75-86% at ~11:00 AM PDT, still elevated | YES - investigate |
| airflow-staging-uw2-aurora-postgresql-1 | 37% | 52% | Steady, no change | No (staging) |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 24% | 48% | Variable 15-48%, no clear spike correlated with incident | No |
| email-events-production-1 | 15% | 56% | Elevated 34-55% between 8:52-9:45 AM PDT, then dropped; before the incident window | No |
| tax-credits-sanitized | 16% | 67% | Periodic spikes every ~2h (recurring batch pattern) | No |
| copystorm-production-uw2-aurora-postgresql-1 | 6% | 66% | Periodic spikes every ~2h (recurring batch pattern) | No |
| carrier-management-production-1 | 10% | 44% | Appeared recently with initial spike, settled to 8% | No |
Verdict: ibex-production-uw2-aurora-mysql-1 has a significant CPU spike near the incident window. However, the spike began at ~11:00 AM PDT while impact started at 11:35 AM PDT, so the ibex spike preceded the symptoms by roughly 35 minutes. All other production RDS instances are within normal operating range.
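A minimal sketch of how the CPU series above can be pulled from CloudWatch, assuming boto3 with read credentials in us-west-2; metric and dimension names are the standard AWS/RDS ones:

```python
import boto3
from datetime import datetime, timezone

# Incident window 11:35 AM - 12:12 PM PDT on 2026-04-06; PDT = UTC-7,
# so query 17:30-19:30 UTC to include the ~11:00 AM PDT spike onset.
cw = boto3.client("cloudwatch", region_name="us-west-2")
resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier",
                 "Value": "ibex-production-uw2-aurora-mysql-1"}],
    StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
    EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], f'avg={dp["Average"]:.1f}%', f'max={dp["Maximum"]:.1f}%')
```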
| Instance | Avg Commit Latency | Max Commit Latency | Concern? |
|---|---|---|---|
| string-theory-api-cold-storage | 3.2ms | 64.7ms | Intermittent massive spikes (14:41, 17:03, 17:21 UTC), then subsides. Not correlated with incident timing. |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 10.2ms | 25.5ms | Elevated but steady - appears to be baseline for this instance |
| ibex-production-uw2-aurora-mysql-1 | 3.6ms | 13ms | Jumped from ~1.8ms to 10ms at 11:14 AM PDT - correlates with CPU spike |
| email-events-production-1 | 4.4ms | 9ms | Mid-range, trending down. Normal. |
| payments-service-production-uw2-aurora-mysql-1 | 5.2ms | 9ms | Steady 4-6ms. Normal. |
| consumer-banking-prod-uw2-aurora-mysql-1 | 2.2ms | 22ms | One spike at 14:59 UTC (7:59 AM PDT), otherwise normal. Not correlated. |
| payroll-versions-production-uw2-aurora57-1 | 1.8ms | 5.3ms | Normal. Stable. |
| tax-platform-production-uw2-aurora-mysql-2 | 1.5ms | 6ms | Normal. Stable. |
Verdict: ibex-production-uw2-aurora-mysql-1 shows a commit latency increase (1.8ms -> 10ms) correlated with its CPU spike. string-theory-api-cold-storage has dramatic spikes, but they're intermittent and don't align with the incident window. The main zenpayroll production databases show no commit latency issues.
All instances reporting 0 seconds replica lag. No replication issues detected.
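The commit latency and replica lag checks map to the standard Aurora CloudWatch metrics (CommitLatency and AuroraReplicaLag, both reported in milliseconds); a sketch under the same assumptions as the CPU query above:

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")

def max_metric(instance_id: str, metric: str) -> float:
    """Max datapoint for an AWS/RDS metric over the incident window (UTC)."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,  # e.g. "CommitLatency" or "AuroraReplicaLag" (ms)
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
        EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
        Period=300,
        Statistics=["Maximum"],
    )
    return max((dp["Maximum"] for dp in resp["Datapoints"]), default=0.0)

for inst in ["ibex-production-uw2-aurora-mysql-1",
             "zenpayroll-versions-production-uw2-aurora-mysql80-1"]:
    print(inst, "commit(ms):", max_metric(inst, "CommitLatency"),
          "lag(ms):", max_metric(inst, "AuroraReplicaLag"))
```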
| Cluster | Avg Connections | Max Connections | Trend |
|---|---|---|---|
| zp-production-uw2-sidekiq-jobs-2-002 | 11,424 | 14,887 | Normal oscillation |
| zp-production-uw2-redis6x-multi-use-001 | 8,444 | 13,730 | Normal oscillation |
| payroll-production-cache-001 | 5,923 | 9,536 | Normal oscillation |
| payroll-production-session-store-001 | 4,462 | 8,029 | Normal oscillation |
| zp-production-uw2-sidekiq-limiter-uqw-001 | 4,086 | 6,166 | Normal oscillation |
| ai-platform-production-uw2-001 | 3,391 | 3,699 | Stable |
Engine CPU utilization metric returned no data (metric name may differ in this environment).
Verdict: Redis connection counts are showing normal oscillation patterns. No connection storms, no unusual spikes correlated with the incident. Redis is clear.
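The connection counts correspond to the AWS/ElastiCache CurrConnections metric; the engine CPU series that returned no data is likely EngineCPUUtilization, though as noted above the name may differ in this environment. A sketch, assuming boto3 and that the cluster ids in the table are valid CacheClusterId values:

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")
window = dict(StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
              EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
              Period=300)

for metric in ["CurrConnections", "EngineCPUUtilization"]:
    resp = cw.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric,
        # Cluster id from the table; some metrics also need CacheNodeId.
        Dimensions=[{"Name": "CacheClusterId",
                     "Value": "payroll-production-cache-001"}],
        Statistics=["Average", "Maximum"],
        **window,
    )
    print(metric, "datapoints:", len(resp["Datapoints"]))
```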
| Cluster | Avg CPU | Max CPU | Trend |
|---|---|---|---|
| Top cluster (unnamed in first series) | 54.4% | 57.2% | Steady, gradually increasing through the day |
| Second cluster | ~17.8% | ~24.8% | Normal |
Two Kafka consumer lag monitors are in Alert state, but both are tagged (TESTING) and are not production-critical.
Verdict: Kafka CPU is steady at ~54% for the busiest cluster. No spikes correlated with the incident. Consumer lag monitors alerting are test monitors only. Kafka is clear.
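For completeness, MSK broker CPU lives in the AWS/Kafka namespace as CpuUser/CpuSystem; the query mirrors the RDS one with only the namespace and dimensions changed (the cluster name below is a placeholder, since the series above was unnamed):

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")
resp = cw.get_metric_statistics(
    Namespace="AWS/Kafka",
    MetricName="CpuUser",  # pair with "CpuSystem" for total broker CPU
    Dimensions=[
        {"Name": "Cluster Name", "Value": "REPLACE-WITH-CLUSTER-NAME"},  # placeholder
        {"Name": "Broker ID", "Value": "1"},
    ],
    StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
    EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], f'{dp["Average"]:.1f}%')
```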
| Datastore | Status | Notes |
|---|---|---|
| RDS Aurora (MySQL) | Mostly clear | ibex-production has CPU spike to 86% + commit latency 5x increase, but this is a single non-core instance. Core zenpayroll/payroll databases are healthy. |
| RDS Aurora (Postgres) | Clear | Normal or staging-only activity |
| Redis | Clear | Normal connection patterns, no anomalies |
| Kafka/MSK | Clear | Steady CPU, no spikes, test monitors only |
| Replica Lag | Clear | Zero across all instances |
No evidence that Datastores Infrastructure is the root cause of this incident. The core production databases (zenpayroll, payroll, payments-service) show normal CPU, commit latency, and zero replica lag. Redis and Kafka are operating normally.
The one anomaly is ibex-production-uw2-aurora-mysql-1 (CPU 86%, commit latency 5x), but this is an isolated instance and unlikely to be driving the broad payroll page failures described in the incident. The incident appears to be application-level based on channel evidence (enum KeyError, http2 errors, symptoms resolving after code revert).
- Confirm that the ibex-production CPU spike is understood and not cascading
- The incident channel indicates symptoms are subsiding; monitor for recurrence
- No action needed from Datastores Infra at this time