Import service CPU spike investigation (May 9, 2026)

Summary

On May 9, import-prod average request duration jumped from ~10s to 3,240s (54 minutes), autoscaling grew the service from 2 to 7 tasks, and HTTP 500 errors spiked 20x. No code was deployed to the import service. The root cause was a Transloadit outage, which exposed that the import service has no enforceable request-level timeout anywhere in the stack.

Agentic import was ruled out — its CPU was ~0.5%, and import-prod request volume was flat across the incident window.

Timeline (May 9 UTC)

| Time | Avg Duration | Errors/hr | Phase |
| --- | --- | --- | --- |
| 00:00-03:00 | 25-35s | 220-378 | Transloadit partially degraded |
| 04:00-08:00 | 4-11s | 10-20 | Brief recovery |
| 09:00 | 14.5s | 156 | Re-escalation begins |
| 10:00 | 519s (8.6 min) | 560 | Full outage: Transloadit down again |
| 14:00-15:00 | 1,381-2,626s | 746-1,066 | Peak: requests piling up, S3 clock skew cascade |
| 15:00-23:00 | 2,000-3,761s | 385-884 | Sustained, never recovered |

Root Cause: Transloadit Outage + Cascade

create_screenshots (the Transloadit-dependent step) was timing out on 80% of requests:

  • Baseline (May 7): avg 13.6s, 0.8% error rate
  • Incident (May 9): avg 100s, 80% error rate — 8,230 TimeoutErrors

Because create_screenshots blocked for up to the full 2-minute timeout before failing, the entire parse_pptx_slides pipeline stalled. S3 upload operations that had already started hung for hours, then failed with RequestTimeTooSkewed: by the time the requests reached S3, their signed timestamps were too far from S3's server time to be accepted.

Duration Blowup by Operation

| Operation | May 7 Avg | May 9 Avg | Factor |
| --- | --- | --- | --- |
| POST /api/v1/slides | 3.9s | 23.4 min | 357x |
| upload_pptx_file | 0.24s | 15.6 min | 3,896x |
| upload_pptx_assets | 0.22s | 5.8 min | 1,582x |
| upload_slides_json | 0.12s | 4.1 min | 2,067x |
| create_screenshots | 13.6s | 100s | 7.4x (capped by the 2-minute timeout) |
| aiohttp.request | 0.087s | 35.7s | 411x |

Error Cascade

| Error Type | May 7 | May 9 |
| --- | --- | --- |
| TimeoutError (Transloadit) | 64 | 8,230 |
| ClientConnectionError (pool exhaustion) | 0 | 4,286 |
| FileNotFoundError (temp files cleaned up) | 0 | 3,435 |
| botocore ClientError (S3 clock skew) | 0 | 2,455 |
| CancelledError (asyncio cancellation) | 0 | 1,032 |
| fastapi HTTPException (propagated) | 639 | 11,013 |

Why the Existing 120s Timeout Didn't Help

There is a 2-minute timeout in transloadit/client.py:93-96:

result = await asyncio.wait_for(
    asyncio.to_thread(assembly.create, wait=True),
    timeout=120,
)

asyncio.wait_for cancels the awaitable, but it cannot interrupt the underlying OS thread. When the timeout fires, the coroutine gets its TimeoutError while the background thread running assembly.create(wait=True) keeps going, stuck in the SDK's while not _assembly_finished polling loop. That orphaned thread is still reading temp files when the request handler's context manager deletes them, which is what produces the FileNotFoundError.
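
The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not the service's code: blocking_poll is a hypothetical stand-in for assembly.create(wait=True), and the durations are shortened so it runs quickly.

```python
import asyncio
import threading
import time


def blocking_poll(run_for_s: float, finished: threading.Event) -> None:
    """Hypothetical stand-in for a blocking SDK call such as assembly.create(wait=True)."""
    time.sleep(run_for_s)  # a thread cannot be interrupted while it sits here
    finished.set()


async def demo() -> dict:
    finished = threading.Event()
    try:
        # Same shape as the code in transloadit/client.py, with tiny durations.
        await asyncio.wait_for(
            asyncio.to_thread(blocking_poll, 0.3, finished),
            timeout=0.05,
        )
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True

    # The coroutine has its TimeoutError, but the thread has not stopped.
    still_running = not finished.is_set()

    await asyncio.sleep(0.5)  # give the orphaned thread time to finish on its own
    return {
        "timed_out": timed_out,
        "thread_still_running_after_timeout": still_running,
        "thread_finished_later": finished.is_set(),
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

All three values come back True: the timeout fires, the thread is still alive at that moment, and it only ends when its own work completes.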

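For the spans that ran as long as 6 hours, the event loop was likely blocked or the thread pool exhausted. The wait_for timeout is itself a timer callback on the event loop, so while the loop is blocked by a synchronous call the timeout cannot fire.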
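
To illustrate the blocked-loop half of that hypothesis, here is a small sketch showing that a wait_for timeout is delayed for as long as a synchronous call holds the event loop. The function names are hypothetical, and time.sleep stands in for any blocking call such as a synchronous requests.get.

```python
import asyncio
import time


async def blocks_the_loop(seconds: float) -> None:
    # A synchronous call inside a coroutine: no other task or timer can run meanwhile.
    time.sleep(seconds)


async def measure_timeout_delay(block_s: float = 0.3, timeout_s: float = 0.05) -> float:
    """Return how long a wait_for with timeout_s actually took to raise."""
    loop = asyncio.get_running_loop()
    blocker = asyncio.create_task(blocks_the_loop(block_s))
    start = loop.time()
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=timeout_s)
    except asyncio.TimeoutError:
        pass
    elapsed = loop.time() - start
    await blocker
    return elapsed


if __name__ == "__main__":
    elapsed = asyncio.run(measure_timeout_delay())
    print(f"timeout requested after 0.05s, fired after {elapsed:.2f}s")
```

The timeout is requested at 0.05s but cannot fire until the blocking call returns, so the measured delay tracks the blocking time rather than the configured timeout.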

Missing Safeguards

| Operation | Location | Issue |
| --- | --- | --- |
| PPTX download | download.py:7 | requests.get() with no timeout=, synchronous (blocks event loop) |
| S3 uploads | pptx_upload.py:202 | upload_fileobj with no timeout |
| S3 JSON upload | presentation.py:54 | upload_file with no timeout |
| asyncio.gather | main.py:103 | No overall timeout wrapping the gather |
| Request-level | main.py:49-134 | No endpoint timeout, no FastAPI timeout middleware |
| Uvicorn | pyproject.toml:51 | No request timeout config |
| Transloadit TUS upload | SDK assembly.py:73-78 | Chunked upload with no timeout |
| Transloadit polling | SDK assembly.py:109-118 | while not _assembly_finished polls forever, uncancellable thread |
| Temp file cleanup | main.py:85-108 | Context manager deletes temp dir while orphaned thread still reads |
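
For the asyncio.gather row, one possible shape of an overall timeout is sketched below. The step coroutine and its names are illustrative only and are not taken from main.py. Wrapping the gather in asyncio.wait_for cancels every child coroutine that is still pending when the deadline passes.

```python
import asyncio


async def step(name: str, seconds: float, cancelled: list) -> str:
    """Illustrative pipeline step that records whether it was cancelled."""
    try:
        await asyncio.sleep(seconds)
        return name
    except asyncio.CancelledError:
        cancelled.append(name)
        raise


async def run_steps_with_deadline(timeout_s: float) -> tuple:
    cancelled: list = []
    try:
        results = await asyncio.wait_for(
            asyncio.gather(
                step("upload_assets", 0.01, cancelled),
                step("create_screenshots", 5, cancelled),
            ),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        results = None  # the deadline passed before every step finished
    return results, cancelled


if __name__ == "__main__":
    print(asyncio.run(run_steps_with_deadline(0.1)))
```

This only cancels work that is awaiting on the event loop. A step that has handed off to a thread is subject to the same limitation as the existing 120s timeout.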

Recommended Fixes

P0: Overall Request Timeout

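Bounds the worst-case response time, which would prevent the 6-hour hangs as long as the event loop stays responsive. Wrap the endpoint body in asyncio.wait_for. On its own this does not stop threads already started with asyncio.to_thread, so it complements the P1 fix rather than replacing it: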

@app.post("/api/v1/slides", ...)
async def slides(request: ParsePptRequest):
    try:
        return await asyncio.wait_for(_process_slides(request), timeout=300)
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timed out after 5 minutes")

P1: Fix Transloadit Cancellation

Don't use assembly.create(wait=True) in a thread. Create with wait=False, then poll asynchronously so that each status check is a short call and cancellation can take effect between polls:

# Start the assembly without blocking on completion.
assembly_response = await asyncio.to_thread(assembly.create, wait=False)
assembly_url = assembly_response.data.get("assembly_ssl_url")

loop = asyncio.get_running_loop()
deadline = loop.time() + 120
while True:
    # Each status check is a short thread call, so control returns to the coroutine between polls.
    status = await asyncio.to_thread(tl_client.get_assembly, assembly_url=assembly_url)
    if _assembly_finished(status):
        return status
    if loop.time() > deadline:
        # Best-effort cancel so the assembly does not keep running on Transloadit's side.
        await asyncio.to_thread(tl_client.cancel_assembly, assembly_url=assembly_url)
        raise TimeoutError("Transloadit assembly timed out after 120s")
    await asyncio.sleep(status.data.get("info", {}).get("retryIn", 1))
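
The same pattern can be factored into a reusable helper. This is a sketch under stated assumptions: poll_until_done, fetch_status and is_done are hypothetical names rather than part of the Transloadit SDK, and the per-call timeout is an addition that bounds how long the coroutine waits on any single status check.

```python
import asyncio
from typing import Any, Callable


async def poll_until_done(
    fetch_status: Callable[[], Any],
    is_done: Callable[[Any], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    per_call_timeout_s: float = 10.0,
) -> Any:
    """Poll a blocking status function from a thread until is_done or the deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"not finished after {timeout_s}s")
        try:
            # Bound each check too, so one stuck call cannot hold the coroutine.
            status = await asyncio.wait_for(
                asyncio.to_thread(fetch_status),
                timeout=min(per_call_timeout_s, remaining),
            )
        except asyncio.TimeoutError:
            raise TimeoutError(f"status check exceeded its time limit within {timeout_s}s") from None
        if is_done(status):
            return status
        # Never sleep past the overall deadline.
        await asyncio.sleep(min(interval_s, max(0.0, deadline - loop.time())))
```

A call site might pass something like lambda: tl_client.get_assembly(assembly_url=assembly_url) as fetch_status. A status check that hangs still leaves its thread behind, so the SDK's own HTTP timeouts matter as well.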

P2: Add Timeout to PPTX Download

requests.get(url, stream=True, timeout=(10, 30))  # (connect, read)

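Also wrap the call in asyncio.to_thread so it doesn't block the event loop. Note that the read timeout in requests limits the wait between bytes received, not the total download time, so a slow but steady response can still run long.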
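
If a cap on total download time is wanted, one option is to check a deadline while consuming the streamed chunks. The helper below is a hypothetical sketch, not code from download.py. It accepts any iterable of byte chunks, so it could be fed response.iter_content(...) from a streamed requests response.

```python
import time
from typing import Iterable


def read_with_budget(chunks: Iterable[bytes], budget_s: float) -> bytes:
    """Concatenate chunks, aborting once the total elapsed time exceeds budget_s."""
    deadline = time.monotonic() + budget_s
    parts = []
    for chunk in chunks:
        # The check runs once per chunk; the per-read timeout still has to
        # bound how long the wait for any single chunk can take.
        if time.monotonic() > deadline:
            raise TimeoutError(f"download exceeded {budget_s}s budget")
        parts.append(chunk)
    return b"".join(parts)
```

Because the function is synchronous, it belongs inside the same asyncio.to_thread call as the request itself.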

P3: Add Timeouts to S3 Operations

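import boto3
from botocore.config import Config

# Timeouts are in seconds; the config only takes effect on clients created with it.
s3_config = Config(connect_timeout=10, read_timeout=30, retries={'max_attempts': 3})
s3 = boto3.client("s3", config=s3_config)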

P4: Prevent Temp File Cleanup While Threads Active

Either join orphaned threads before cleanup, or give Transloadit its own copy of the file so the context manager exit doesn't pull files out from under it.
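
The second option could look roughly like the sketch below. It is a minimal illustration using only the standard library; start_worker_with_private_copy and its parameters are hypothetical names. The worker gets a copy in a directory it owns, and removes that directory itself when it finishes.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Callable


def start_worker_with_private_copy(
    src: Path, work: Callable[[Path], object], results: list
) -> threading.Thread:
    """Run work(path) in a thread against a private copy of src."""
    private_dir = Path(tempfile.mkdtemp(prefix="worker-"))
    private_path = private_dir / src.name
    # Copy synchronously, before the caller has any chance to clean up src.
    shutil.copy2(src, private_path)

    def run() -> None:
        try:
            results.append(work(private_path))
        finally:
            # The worker owns its directory, so it removes it when done.
            shutil.rmtree(private_dir, ignore_errors=True)

    # daemon=True so a stuck worker cannot block interpreter shutdown.
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The trade-off is extra disk I/O for each file and, if a worker never returns, a leaked directory that needs separate cleanup.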
