Author: @futzlarson
Last active: March 15, 2026 15:32

Risk Adjustment — Lambda Timeout Analysis Report

Date: March 15, 2026
Triggered by: CloudWatch Alarm vapor-Curitics-RA-production-d-timeout-warning
Alarm threshold: 28,000ms (approaching 30s Lambda limit)
Datapoint that fired: 28,877ms at 14:53 UTC


TL;DR

All timeout issues originate from a single tenant (tenant_id: 3). The root cause is data volume growth outpacing query performance — primarily on the Performance Dashboard and Encounter Logs pages. Three DashboardService methods have no caching and re-execute heavy queries on every user interaction. The dashboard timeout is accelerating: 58 of 151 total events occurred in just the last 2 days.


How This Was Identified

The LogSlowRequests middleware fires PerformanceMonitor::logTimeout() for any request exceeding 25,000ms, which logs a CRITICAL to CloudWatch and captures a Sentry error (always, regardless of the ENABLE_PERFORMANCE_LOGGING flag).
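The check itself is simple. A plain-PHP sketch of the idea (class, constant, and log message are illustrative stand-ins, not the app's actual LogSlowRequests/PerformanceMonitor code):

```php
<?php
// Illustrative sketch only: measures request duration and flags anything
// over the threshold, unconditionally (not gated by a feature flag).
class SlowRequestLogger
{
    public const THRESHOLD_MS = 25000;

    /** Run the request through $next, then flag it if it was too slow. */
    public function handle(mixed $request, \Closure $next): mixed
    {
        $start = microtime(true);
        $response = $next($request);
        $durationMs = (microtime(true) - $start) * 1000;

        if ($durationMs > self::THRESHOLD_MS) {
            // In the real app this is a CRITICAL CloudWatch log + Sentry error.
            error_log(sprintf('Potential request timeout: %.2f ms', $durationMs));
        }

        return $response;
    }
}
```

The key property, visible in the sketch, is that the slow-path logging happens after the response is produced, so it adds no latency to the request itself.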

CloudWatch logs for the Lambda function confirmed the exact request:

"Potential request timeout" → method: GET, path: /, duration_ms: 28779.44
status_code: 200, user_id: 3236, tenant_id: 3
REPORT: Duration: 28877.69 ms ← exact datapoint that triggered the alarm

Sentry was then used to identify the full scope of timeout issues across the application.


Sentry Issue Summary

All 5 unresolved timeout issues — project: risk-adjustment, org: curitics-health.

| Sentry ID | Endpoint | Events | Avg Duration | Range | First Seen | Last Seen |
| --- | --- | --- | --- | --- | --- | --- |
| RISK-ADJUSTMENT-S0 | GET / (dashboard) | 151 | 30.2s | 25–39s | Feb 17 | Today |
| RISK-ADJUSTMENT-S4 | GET /logs/encounter-logs | 63 | 43.7s | 28–100s | Feb 17 | Mar 12 |
| RISK-ADJUSTMENT-SA | POST /livewire/update | 24 | ~30s | 25–40s | Feb 18 | Mar 13 |
| RISK-ADJUSTMENT-W6 | GET /awv-upload-documents | 1 | 35s | — | Mar 12 | Mar 12 |
| RISK-ADJUSTMENT-VK | GET /concurrent-review | 1 | 27s | — | Mar 9 | Mar 9 |

Critical observation: Every single event across all 5 issues is attributed to tenant_id: 3.


Issue Deep Dives

1. GET / — Performance Dashboard (151 events, ACCELERATING)

File: app/Filament/Pages/Dashboard.php
Widgets: app/Filament/Widgets/Dashboard/
Service: app/Services/DashboardService.php

On every page load, the Filament dashboard mounts all widgets simultaneously. Each widget calls DashboardService methods that run independent, heavy database queries. On initial load there is no Livewire lazy-loading — everything executes synchronously in the same request.

Time-of-day distribution (UTC): Flat across all 24 hours — no peak window. This rules out scheduled jobs and confirms the cause is data volume, not concurrent load.

Volume trend:

  • Prior to Mar 14: ~93 events over ~4 weeks
  • Mar 14–15 alone: 58 events in 2 days
  • The issue is worsening as tenant 3's dataset grows

Queries that fire on every dashboard load:

| DashboardService Method | Cached? | Notes |
| --- | --- | --- |
| getMemberTypeCounts() | ✅ 59 min | Safe |
| getAwvStatusSummary() | ✅ 59 min | Safe |
| getAverageRaf() | ✅ 30 min | Safe |
| getAverageStarScore() | ✅ 5 min | Safe |
| getHccConditionCoverageSummary() | ✅ 30 min | Safe |
| getMedexClaimSummaryQuery() | No cache | Returns a query builder; executed by ClaimsSnapshot widget on every request |
| getHedisGapTrackingQuery() | No cache | Most complex query — DISTINCT ON + multiple JOINs + LEFT JOIN subquery; executed by _GapTracker on every request |
| getProvidersWithMedexSummary() | No cache | Multi-JOIN with medex aggregation subquery; executed by _ProviderSummary on every request |

The three uncached methods are the primary bottleneck for the initial dashboard load.


2. GET /logs/encounter-logs — Encounter Log Resource (63 events, up to 100s)

File: app/Filament/Clusters/Logs/Resources/EncounterLogResource.php
List page: app/Filament/Clusters/Logs/Resources/EncounterLogResource/Pages/ListEncounterLogs.php
Model: EncounterLogSummary (encounter_log_summaries table)

Time-of-day distribution (UTC): Clusters at 16:00–18:00 UTC (9–11am Pacific) — this is business-hours driven, triggered by staff starting their day and loading the logs page.

Duration note: 50 of 63 events exceed 30 seconds (max: 100.5s). This page is on the web Lambda function (timeout: 600s), not the 30s function, but these durations are still completely unacceptable.

Root cause — N+1 queries on every row:

The ListEncounterLogs page never overrides the table query to eager-load relationships. Filament lazy-loads each relationship per rendered row. With the default page size of 25 rows:

  • provider_name → 1 query per row (25 queries)
  • created_by_name → 1 query per row (25 queries)
  • updated_by_name → 1 query per row (25 queries)

That's 75+ extra queries just to display the table, before counting the 7 relationship-based filter joins (tenant, status, provider ×2, concurrent review status, createdBy, updatedBy).

Fix: Add eager loading to the list page query:

// In ListEncounterLogs.php
use Illuminate\Database\Eloquent\Builder;

public function getTableQuery(): Builder
{
    return parent::getTableQuery()
        ->with(['provider', 'createdBy', 'updatedBy', 'status', 'member']);
}

3. POST /livewire/update — Dashboard Filter Interactions (24 events)

Affected widgets: _GapTracker, ClaimsSnapshot, _ProviderSummary
Environments affected: Both production and uat

Every dashboard filter change (provider, market, IPA, DOS year) dispatches a chart-filter-changed Livewire event. All widgets listen to this event and re-execute their queries simultaneously in a new POST /livewire/update request — with no debouncing and no caching.

Smoking gun from Sentry data:
On March 12, 2026 between 16:25–16:36 UTC, 10 timeout events fired in 11 minutes. That pattern is consistent with a single user clicking through the filter dropdowns: each click triggers the three uncached DashboardService methods in parallel, consuming ~30s per interaction.

The three uncached methods re-executed on every filter click:

  1. getHedisGapTrackingQuery() — Most dangerous. Complex subquery with:

    • DISTINCT ON (member_id) subquery
    • JOINs: hedis_gaps, member_market, markets, member_ipa, ipas
    • LEFT JOIN with aggregation subquery
    • GROUP BY on MasterHedisMeasure
    • No cache wrapper at all
  2. getMedexClaimSummaryQuery() — Returns a query builder (not a result), executed fresh by ClaimsSnapshot on every updateTable() call. No cache.

  3. getProvidersWithMedexSummary() — Complex provider query with:

    • Conditional JOINs: provider_market, markets, provider_ipa, ipas, member_provider, members
    • Subquery for medex_summary aggregation with raw SQL
    • No cache. Also called a second time in exportCsv().

Fix direction: Wrap each method in tenant+filter-aware cache keys (following the same pattern as the already-cached methods):

// Example key pattern
$cacheKey = "hedis_gap_tracking:{$tenantId}:{$providerId}:{$market}:{$ipa}:{$dosYear}";
Cache::remember($cacheKey, now()->addMinutes(5), fn() => /* query */);
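A plain-PHP illustration of how such composite keys can be built consistently across the three methods; the helper name, the filter list, and the 'all' placeholder are assumptions for the sketch, not the app's actual code:

```php
<?php
// Hypothetical helper: builds a tenant+filter cache key in the spirit of the
// pattern above. The real DashboardService key format may differ.
function dashboardCacheKey(string $metric, int $tenantId, array $filters): string
{
    // Missing filters collapse to 'all' so every key has the same fixed shape,
    // which keeps keys unambiguous (no collisions between filter combinations).
    $parts = array_map(
        fn (string $k) => (string) ($filters[$k] ?? 'all'),
        ['provider_id', 'market', 'ipa', 'dos_year']
    );

    return $metric . ':' . $tenantId . ':' . implode(':', $parts);
}
```

Because every filter slot is always present in the key, a cache flush or TTL expiry on one filter combination never affects another, and the same helper can serve all three uncached methods.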

Infrastructure Context

  • Function: vapor-Curitics-RA-production-d (Lambda, us-west-1)
  • Web timeout: 600s (vapor.yml); the alarmed vapor-Curitics-RA-production-d function runs under the ~30s limit that triggered the alarm
  • Queue timeout: 900s
  • Runtime: PHP 8.3.30, Laravel 11, Filament, Laravel Vapor
  • Database: PostgreSQL (RDS RA-Prod)
  • Cache: Redis (RA-Production)

Recommended Fixes (Priority Order)

Priority 1 — Cache the three uncached DashboardService methods

Impact: Eliminates dashboard load timeouts and all /livewire/update timeouts
Files: app/Services/DashboardService.php
Methods: getHedisGapTrackingQuery(), getMedexClaimSummaryQuery(), getProvidersWithMedexSummary()
Approach: Use tenant+filter composite cache keys with a 5–15 minute TTL, matching the pattern used by getAverageStarScore() and getHccConditionCoverageSummary()

Priority 2 — Eager-load relationships in EncounterLogResource

Impact: Eliminates N+1 queries — reduces encounter-logs page from 75+ queries to ~8
File: app/Filament/Clusters/Logs/Resources/EncounterLogResource/Pages/ListEncounterLogs.php
Change: Override getTableQuery() to add ->with(['provider', 'createdBy', 'updatedBy', 'status', 'member'])

Priority 3 — Enable slow query logging temporarily

Impact: Identifies the exact SQL causing the remaining slowness
Action: Set ENABLE_PERFORMANCE_LOGGING=true in production env vars
Note: This enables logSlowQuery() (threshold: 500ms) which logs to CloudWatch and sends to Sentry for queries >2s. Disable after diagnosis.

Priority 4 — Wire up Sentry user identity

Impact: Allows Sentry to track which users are affected (currently all show as null)
Action: Call Sentry\configureScope() with the authenticated user's details (e.g. from AppServiceProvider or an auth-aware middleware)
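A minimal wiring sketch, assuming the standard sentry-laravel API (\Sentry\configureScope and Scope::setUser); the exact placement and the tenant_id attribute are assumptions, not the app's confirmed code:

```php
use Illuminate\Support\Facades\Auth;
use Sentry\State\Scope;

// In AppServiceProvider::boot() (or an auth-aware middleware): attach the
// current user to every Sentry event so issues stop showing user: null.
\Sentry\configureScope(function (Scope $scope): void {
    if ($user = Auth::user()) {
        $scope->setUser([
            'id'        => $user->id,
            'tenant_id' => $user->tenant_id ?? null, // assumed attribute
        ]);
    }
});
```

With this in place, the tenant attribution that had to be dug out of CloudWatch logs above would be visible directly on each Sentry event.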


Monitoring

The existing LogSlowRequests middleware + PerformanceMonitor::logTimeout() setup is solid — it already caught this issue and reported it to both CloudWatch and Sentry. The CloudWatch alarm threshold of 28,000ms gives a 2-second buffer before the Lambda hard timeout.

Once fixes are deployed, the Sentry issues above can be resolved and the alarm should return to OK state within minutes of the first dashboard load completing under the threshold.
