@faun
Created April 6, 2026 19:35
Alert 4421 - Datastore Infrastructure Analysis (SEV-2)


Context

SEV-2 incident (#alert-4421-system_instability_unclear_why) declared at 12:17 PM PDT on 2026-04-06 for system instability with an unclear root cause. The impact window was roughly 11:35 AM - 12:12 PM PDT (2:35 PM - 3:12 PM ET). About 40-50 users per 20-minute window were affected across payroll workflows (hours entry, pay dashboard, off-cycle payrolls, support pages).

Application-level findings from the channel:

  • KeyError: Enum PayrollBlockers::Enums::PayrollBlockerType key not found: :partner_tos_not_accepted (revert PR #329058)
  • http2 connection errors under load (HTTP fetch failed from 'zenpayroll')
  • Compute/network team confirmed pods healthy, scaling normal, SLO recovered
  • CX reports impact subsided

RDS Aurora (MySQL + Postgres) - CPU

| Instance | Avg CPU | Max CPU | Trend | Concern? |
| --- | --- | --- | --- | --- |
| ibex-production-uw2-aurora-mysql-1 | 18% | 86% | Jumped from ~2.5% to 75-86% at ~11:00 AM PDT, still elevated | YES - investigate |
| airflow-staging-uw2-aurora-postgresql-1 | 37% | 52% | Steady, no change | No (staging) |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 24% | 48% | Variable 15-48%, no clear spike correlated with incident | No |
| email-events-production-1 | 15% | 56% | Elevated 34-55% between 8:52-9:45 AM PDT, then dropped. Before incident. | No |
| tax-credits-sanitized | 16% | 67% | Periodic spikes every ~2h (recurring batch pattern) | No |
| copystorm-production-uw2-aurora-postgresql-1 | 6% | 66% | Periodic spikes every ~2h (recurring batch pattern) | No |
| carrier-management-production-1 | 10% | 44% | Appeared recently with initial spike, settled to 8% | No |

Verdict: ibex-production-uw2-aurora-mysql-1 shows a significant CPU spike that began at ~11:00 AM PDT, roughly 35 minutes before the incident impact window (11:35 AM - 12:12 PM PDT), so the spike preceded the symptoms rather than coinciding with them. All other production RDS instances are within normal operating range.
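The timing comparison in this verdict can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the approximate times quoted above (all PDT on 2026-04-06); the variable and function names are illustrative and not part of any tooling used in the incident.

```python
from datetime import datetime

# Times are PDT on 2026-04-06, copied from the analysis above.
FMT = "%Y-%m-%d %H:%M"
cpu_spike_onset = datetime.strptime("2026-04-06 11:00", FMT)  # approximate ("~11:00 AM")
impact_start = datetime.strptime("2026-04-06 11:35", FMT)
impact_end = datetime.strptime("2026-04-06 12:12", FMT)


def lead_minutes(onset, start):
    """Minutes by which an anomaly onset precedes the impact start (negative if it follows)."""
    return (start - onset).total_seconds() / 60


lead = lead_minutes(cpu_spike_onset, impact_start)
impact_minutes = (impact_end - impact_start).total_seconds() / 60

print(f"ibex CPU spike led impact by {lead:.0f} min; impact lasted {impact_minutes:.0f} min")
```

A positive lead only shows ordering; it does not by itself establish or rule out a causal link between the spike and the symptoms.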

RDS Aurora - Commit Latency

| Instance | Avg Latency | Max Latency | Concern? |
| --- | --- | --- | --- |
| string-theory-api-cold-storage | 3.2ms | 64.7ms | Intermittent massive spikes (14:41, 17:03, 17:21 UTC) that then subside. Not correlated with incident timing. |
| zenpayroll-versions-production-uw2-aurora-mysql80-1 | 10.2ms | 25.5ms | Elevated but steady - appears to be baseline for this instance |
| ibex-production-uw2-aurora-mysql-1 | 3.6ms | 13ms | Jumped from ~1.8ms to 10ms at 11:14 AM PDT - correlates with CPU spike |
| email-events-production-1 | 4.4ms | 9ms | Mid-range, trending down. Normal. |
| payments-service-production-uw2-aurora-mysql-1 | 5.2ms | 9ms | Steady 4-6ms. Normal. |
| consumer-banking-prod-uw2-aurora-mysql-1 | 2.2ms | 22ms | One spike at 14:59 UTC (7:59 AM PDT), otherwise normal. Not correlated. |
| payroll-versions-production-uw2-aurora57-1 | 1.8ms | 5.3ms | Normal. Stable. |
| tax-platform-production-uw2-aurora-mysql-2 | 1.5ms | 6ms | Normal. Stable. |

Verdict: ibex-production-uw2-aurora-mysql-1 shows correlated commit latency increase (1.8ms -> 10ms) alongside its CPU spike. string-theory-api-cold-storage has dramatic spikes but they're intermittent and don't align with the incident window. The main zenpayroll production databases show no commit latency issues.
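Because the latency notes mix UTC and PDT timestamps, it helps to put everything on one clock before judging correlation. The sketch below is illustrative only: it assumes a fixed UTC-7 offset for PDT (valid for this date), uses the spike times and impact window quoted in this analysis, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for PDT; an assumption that holds for 2026-04-06 only.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt(hour, minute):
    return datetime(2026, 4, 6, hour, minute, tzinfo=PDT)


def utc(hour, minute):
    return datetime(2026, 4, 6, hour, minute, tzinfo=timezone.utc)


impact_start, impact_end = pdt(11, 35), pdt(12, 12)

# Spike times (UTC) as reported in the commit latency notes.
spikes = {
    "string-theory-api-cold-storage": [utc(14, 41), utc(17, 3), utc(17, 21)],
    "consumer-banking-prod-uw2-aurora-mysql-1": [utc(14, 59)],
}


def in_impact_window(ts):
    """True if the timestamp falls inside the reported impact window."""
    return impact_start <= ts <= impact_end


for instance, times in spikes.items():
    for ts in times:
        local = ts.astimezone(PDT).strftime("%I:%M %p")
        print(f"{instance}: {ts:%H:%M} UTC = {local} PDT, in window: {in_impact_window(ts)}")

# Relative increase for ibex commit latency (~1.8ms -> 10ms).
latency_ratio = 10 / 1.8
print(f"ibex commit latency increased about {latency_ratio:.1f}x")
```

All listed spikes fall before the impact window once converted, which matches the "not correlated" notes for those instances.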

RDS Aurora - Replica Lag

All instances reporting 0 seconds replica lag. No replication issues detected.

Redis (ElastiCache) - Connections & CPU

| Cluster | Avg Connections | Max Connections | Trend |
| --- | --- | --- | --- |
| zp-production-uw2-sidekiq-jobs-2-002 | 11,424 | 14,887 | Normal oscillation |
| zp-production-uw2-redis6x-multi-use-001 | 8,444 | 13,730 | Normal oscillation |
| payroll-production-cache-001 | 5,923 | 9,536 | Normal oscillation |
| payroll-production-session-store-001 | 4,462 | 8,029 | Normal oscillation |
| zp-production-uw2-sidekiq-limiter-uqw-001 | 4,086 | 6,166 | Normal oscillation |
| ai-platform-production-uw2-001 | 3,391 | 3,699 | Stable |

Engine CPU utilization metric returned no data (metric name may differ in this environment).

Verdict: Redis connection counts show normal oscillation patterns, with no connection storms or unusual spikes correlated with the incident. Redis is clear on connections; engine CPU could not be checked because the metric returned no data.
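One rough way to screen for a connection storm is the peak-to-average ratio per cluster. The sketch below uses the connection counts reported above; the 2.0 threshold is a hypothetical cutoff chosen for illustration, not a value used in this analysis, and the names are assumptions.

```python
# (avg, max) connection counts copied from the Redis table above.
connections = {
    "zp-production-uw2-sidekiq-jobs-2-002": (11424, 14887),
    "zp-production-uw2-redis6x-multi-use-001": (8444, 13730),
    "payroll-production-cache-001": (5923, 9536),
    "payroll-production-session-store-001": (4462, 8029),
    "zp-production-uw2-sidekiq-limiter-uqw-001": (4086, 6166),
    "ai-platform-production-uw2-001": (3391, 3699),
}

STORM_RATIO = 2.0  # hypothetical threshold, not taken from the source

# Peak-to-average ratio per cluster; a high ratio suggests a sharp burst.
ratios = {name: peak / avg for name, (avg, peak) in connections.items()}
flagged = [name for name, ratio in ratios.items() if ratio >= STORM_RATIO]

for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: peak/avg = {ratio:.2f}")
print("flagged:", flagged)
```

A ratio check like this is only a coarse screen; it says nothing about when the peak occurred, so the time series still needs to be compared against the impact window.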

Kafka (MSK) - CPU

| Cluster | Avg CPU | Max CPU | Trend |
| --- | --- | --- | --- |
| Top cluster (unnamed in first series) | 54.4% | 57.2% | Steady, gradually increasing through the day |
| Second cluster | ~17.8% | ~24.8% | Normal |

Two Kafka consumer lag monitors are in Alert state, but both are tagged (TESTING) and are not production-critical monitors.

Verdict: Kafka CPU is steady at ~54% for the busiest cluster, with no spikes correlated with the incident. The alerting consumer lag monitors are test monitors only. Kafka is clear.


Summary for Datastores Infrastructure

| Datastore | Status | Notes |
| --- | --- | --- |
| RDS Aurora (MySQL) | Mostly clear | ibex-production has CPU spike to 86% + commit latency 5x increase, but this is a single non-core instance. Core zenpayroll/payroll databases are healthy. |
| RDS Aurora (Postgres) | Clear | Normal or staging-only activity |
| Redis | Clear | Normal connection patterns, no anomalies |
| Kafka/MSK | Clear | Steady CPU, no spikes, test monitors only |
| Replica Lag | Clear | Zero across all instances |

Bottom Line

There is no evidence that Datastores Infrastructure is the root cause of this incident. The core production databases (zenpayroll, payroll, payments-service) show normal CPU and commit latency, with zero replica lag. Redis and Kafka are operating normally.

The one anomaly is ibex-production-uw2-aurora-mysql-1 (CPU 86%, commit latency 5x), but this is an isolated instance and unlikely to be driving the broad payroll page failures described in the incident. The incident appears to be application-level based on channel evidence (enum KeyError, http2 errors, symptoms resolving after code revert).

Recommended Follow-up

  • Confirm that the ibex-production CPU spike is understood and not cascading
  • The incident channel indicates symptoms are subsiding - monitor for recurrence
  • No action needed from Datastores Infra at this time