SEV-2 incident (#alert-4421-system_instability_unclear_why) declared at 12:17 PM PDT on 2026-04-06. System instability with unclear root cause. Impact window roughly 11:35 AM - 12:12 PM PDT (2:35 PM - 3:12 PM ET). ~40-50 users affected per 20-min window across payroll workflows (hours entry, pay dashboard, off-cycle payrolls, support pages).
Application-level findings from the channel:
- KeyError: Enum PayrollBlockers::Enums::PayrollBlockerType key not found: :partner_tos_not_accepted (reverted via PR #329058; failure mode sketched below)
- http2 connection errors under load (HTTP fetch failed from 'zenpayroll')
- Compute/network team confirmed pods healthy, scaling normal, SLO recovered
- CX reports impact subsided
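The application code behind that KeyError is Ruby; as a language-neutral illustration, here is a minimal Python sketch of the failure mode, with hypothetical names standing in for the real enum and handlers: a new enum key ships while a consumer's strict lookup table doesn't yet know it.

```python
# Hypothetical mapping; the real code is Ruby (PayrollBlockers::Enums).
BLOCKER_MESSAGES = {
    "missing_bank_account": "Add a company bank account to continue.",
    "unsigned_forms": "Sign the outstanding forms to continue.",
}

def blocker_message(blocker_type: str) -> str:
    # Strict lookup: a key introduced by new code but absent here
    # (e.g. "partner_tos_not_accepted") raises KeyError, which is the
    # shape of the incident error prior to the revert.
    return BLOCKER_MESSAGES[blocker_type]

def blocker_message_safe(blocker_type: str) -> str:
    # Defensive variant: degrade to a generic message instead of raising.
    return BLOCKER_MESSAGES.get(blocker_type, "A payroll blocker needs attention.")
```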
| Instance | Avg CPU | Max CPU | Trend | Concern? |
|---|---|---|---|---|
| ibex-production-uw2-aurora-mysql-1 | 18% | 86% | Jumped from ~2.5% to 75-86% at ~11:00 AM PDT, still elevated | YES - investigate |
| airflow-staging-uw2-aurora-postgresql-1 | 37% | 52% | Steady, no change | No (staging) |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 24% | 48% | Variable 15-48%, no clear spike correlated with incident | No |
| email-events-production-1 | 15% | 56% | Elevated 34-55% between 8:52-9:45 AM PDT, then dropped; before the incident window | No |
| tax-credits-sanitized | 16% | 67% | Periodic spikes every ~2h (recurring batch pattern) | No |
| copystorm-production-uw2-aurora-postgresql-1 | 6% | 66% | Periodic spikes every ~2h (recurring batch pattern) | No |
| carrier-management-production-1 | 10% | 44% | Appeared recently with initial spike, settled to 8% | No |
Verdict: ibex-production-uw2-aurora-mysql-1 has a significant CPU spike near the incident window. However, the spike began at ~11:00 AM PDT while impact started at 11:35 AM PDT, so the ibex spike preceded the symptoms by roughly 35 minutes. All other production RDS instances are within normal operating range.
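A minimal sketch of how the CPU series above can be pulled from CloudWatch, assuming boto3 with read credentials in us-west-2; metric and dimension names are the standard AWS/RDS ones:

```python
import boto3
from datetime import datetime, timezone

# Incident window 11:35 AM - 12:12 PM PDT on 2026-04-06; PDT = UTC-7,
# so query 17:30-19:30 UTC to include the ~11:00 AM PDT spike onset.
cw = boto3.client("cloudwatch", region_name="us-west-2")
resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier",
                 "Value": "ibex-production-uw2-aurora-mysql-1"}],
    StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
    EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Average", "Maximum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], f'avg={dp["Average"]:.1f}%', f'max={dp["Maximum"]:.1f}%')
```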
| Instance | Avg Commit Latency | Max Commit Latency | Concern? |
|---|---|---|---|
| string-theory-api-cold-storage | 3.2ms | 64.7ms | Intermittent massive spikes (14:41, 17:03, 17:21 UTC), then subsides. Not correlated with incident timing. |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 10.2ms | 25.5ms | Elevated but steady - appears to be baseline for this instance |
| ibex-production-uw2-aurora-mysql-1 | 3.6ms | 13ms | Jumped from ~1.8ms to 10ms at 11:14 AM PDT - correlates with CPU spike |
| email-events-production-1 | 4.4ms | 9ms | Mid-range, trending down. Normal. |
| payments-service-production-uw2-aurora-mysql-1 | 5.2ms | 9ms | Steady 4-6ms. Normal. |
| consumer-banking-prod-uw2-aurora-mysql-1 | 2.2ms | 22ms | One spike at 14:59 UTC (7:59 AM PDT), otherwise normal. Not correlated. |
| payroll-versions-production-uw2-aurora57-1 | 1.8ms | 5.3ms | Normal. Stable. |
| tax-platform-production-uw2-aurora-mysql-2 | 1.5ms | 6ms | Normal. Stable. |
Verdict: ibex-production-uw2-aurora-mysql-1 shows a commit latency increase (1.8ms -> 10ms) correlated with its CPU spike. string-theory-api-cold-storage has dramatic spikes, but they're intermittent and don't align with the incident window. The main zenpayroll production databases show no commit latency issues.
All instances reporting 0 seconds replica lag. No replication issues detected.
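The commit latency and replica lag checks map to the standard Aurora CloudWatch metrics (CommitLatency and AuroraReplicaLag, both reported in milliseconds); a sketch under the same assumptions as the CPU query above:

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")

def max_metric(instance_id: str, metric: str) -> float:
    """Max datapoint for an AWS/RDS metric over the incident window (UTC)."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,  # e.g. "CommitLatency" or "AuroraReplicaLag" (ms)
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
        EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
        Period=300,
        Statistics=["Maximum"],
    )
    return max((dp["Maximum"] for dp in resp["Datapoints"]), default=0.0)

for inst in ["ibex-production-uw2-aurora-mysql-1",
             "zenpayroll-versions-production-uw2-aurora-mysql80-1"]:
    print(inst, "commit(ms):", max_metric(inst, "CommitLatency"),
          "lag(ms):", max_metric(inst, "AuroraReplicaLag"))
```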
| Cluster | Avg Connections | Max Connections | Trend |
|---|---|---|---|
| zp-production-uw2-sidekiq-jobs-2-002 | 11,424 | 14,887 | Normal oscillation |
| zp-production-uw2-redis6x-multi-use-001 | 8,444 | 13,730 | Normal oscillation |
| payroll-production-cache-001 | 5,923 | 9,536 | Normal oscillation |
| payroll-production-session-store-001 | 4,462 | 8,029 | Normal oscillation |
| zp-production-uw2-sidekiq-limiter-uqw-001 | 4,086 | 6,166 | Normal oscillation |
| ai-platform-production-uw2-001 | 3,391 | 3,699 | Stable |
Engine CPU utilization metric returned no data (metric name may differ in this environment).
Verdict: Redis connection counts are showing normal oscillation patterns. No connection storms, no unusual spikes correlated with the incident. Redis is clear.
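The connection counts correspond to the AWS/ElastiCache CurrConnections metric; the engine CPU series that returned no data is likely EngineCPUUtilization, though as noted above the name may differ in this environment. A sketch, assuming boto3 and that the cluster ids in the table are valid CacheClusterId values:

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")
window = dict(StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
              EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
              Period=300)

for metric in ["CurrConnections", "EngineCPUUtilization"]:
    resp = cw.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric,
        # Cluster id from the table; some metrics also need CacheNodeId.
        Dimensions=[{"Name": "CacheClusterId",
                     "Value": "payroll-production-cache-001"}],
        Statistics=["Average", "Maximum"],
        **window,
    )
    print(metric, "datapoints:", len(resp["Datapoints"]))
```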
| Cluster | Avg CPU | Max CPU | Trend |
|---|---|---|---|
| Top cluster (unnamed in first series) | 54.4% | 57.2% | Steady, gradually increasing through the day |
| Second cluster | ~17.8% | ~24.8% | Normal |
Two Kafka consumer lag monitors are in Alert state, but both are tagged (TESTING) and are not production-critical.
Verdict: Kafka CPU is steady at ~54% for the busiest cluster. No spikes correlated with the incident. Consumer lag monitors alerting are test monitors only. Kafka is clear.
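For completeness, MSK broker CPU lives in the AWS/Kafka namespace as CpuUser/CpuSystem; the query mirrors the RDS one with only the namespace and dimensions changed (the cluster name below is a placeholder, since the series above was unnamed):

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")
resp = cw.get_metric_statistics(
    Namespace="AWS/Kafka",
    MetricName="CpuUser",  # pair with "CpuSystem" for total broker CPU
    Dimensions=[
        {"Name": "Cluster Name", "Value": "REPLACE-WITH-CLUSTER-NAME"},  # placeholder
        {"Name": "Broker ID", "Value": "1"},
    ],
    StartTime=datetime(2026, 4, 6, 17, 30, tzinfo=timezone.utc),
    EndTime=datetime(2026, 4, 6, 19, 30, tzinfo=timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], f'{dp["Average"]:.1f}%')
```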
| Datastore | Status | Notes |
|---|---|---|
| RDS Aurora (MySQL) | Mostly clear | ibex-production has CPU spike to 86% + commit latency 5x increase, but this is a single non-core instance. Core zenpayroll/payroll databases are healthy. |
| RDS Aurora (Postgres) | Clear | Normal or staging-only activity |
| Redis | Clear | Normal connection patterns, no anomalies |
| Kafka/MSK | Clear | Steady CPU, no spikes, test monitors only |
| Replica Lag | Clear | Zero across all instances |
No evidence that Datastores Infrastructure is the root cause of this incident. The core production databases (zenpayroll, payroll, payments-service) show normal CPU, commit latency, and zero replica lag. Redis and Kafka are operating normally.
The one anomaly is ibex-production-uw2-aurora-mysql-1 (CPU 86%, commit latency 5x), but this is an isolated instance and unlikely to be driving the broad payroll page failures described in the incident. The incident appears to be application-level based on channel evidence (enum KeyError, http2 errors, symptoms resolving after code revert).
- Confirm that the ibex-production CPU spike is understood and not cascading
- The incident channel indicates symptoms are subsiding; monitor for recurrence
- No action needed from Datastores Infra at this time