Import Service CPU Spike Investigation (May 9, 2026)

Summary

On May 9, import-prod avg request duration jumped from ~10s to 3,240s (54 minutes), autoscaling kicked in from 2 to 7 tasks, and 500 errors spiked 20x. No code was deployed to the import service. The root cause was a Transloadit outage that exposed the fact that the import service has no enforceable request-level timeout anywhere in the stack.

Agentic import was ruled out — its CPU was ~0.5%, and import-prod request volume was flat across the incident window.

Timeline (May 9 UTC)

Time	Avg Duration	Errors/hr	Phase
00:00-03:00	25-35s	220-378	Transloadit partially degraded
04:00-08:00	4-11s	10-20	Brief recovery
09:00	14.5s	156	Re-escalation begins
10:00	519s (8.6 min)	560	Full outage — Transloadit down again
14:00-15:00	1,381-2,626s	746-1,066	Peak — requests piling up, S3 clock skew cascade
15:00-23:00	2,000-3,761s	385-884	Sustained, never recovered

Root Cause: Transloadit Outage + Cascade

create_screenshots (the Transloadit-dependent step) was timing out on 80% of requests:

Baseline (May 7): avg 13.6s, 0.8% error rate
Incident (May 9): avg 100s, 80% error rate — 8,230 TimeoutErrors

Because create_screenshots blocked for 2+ minutes before timing out, the entire parse_pptx_slides pipeline stalled. S3 upload operations that had already started hung for hours, then failed with RequestTimeTooSkewed: by the time each request finally reached S3, its signed timestamp was too far from S3's clock to be accepted.
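
To make the skew failure concrete, the sketch below checks whether a request signed at one moment would still be accepted when it finally reaches S3. It assumes the 15-minute tolerance that AWS documents for S3 request timestamps; MAX_SKEW and is_too_skewed are illustrative names, not part of the import service.

```python
from datetime import datetime, timedelta, timezone

# Assumption: S3 rejects a request with RequestTimeTooSkewed when its signed
# timestamp differs from S3's clock by more than 15 minutes.
MAX_SKEW = timedelta(minutes=15)


def is_too_skewed(signed_at: datetime, received_at: datetime) -> bool:
    """Return True if the request would be expected to fail the skew check."""
    return abs(received_at - signed_at) > MAX_SKEW


if __name__ == "__main__":
    signed_at = datetime(2026, 5, 9, 14, 0, tzinfo=timezone.utc)
    # A healthy upload is sent almost immediately after signing.
    print(is_too_skewed(signed_at, signed_at + timedelta(seconds=1)))
    # A request that sat stalled for an hour before being sent is rejected.
    print(is_too_skewed(signed_at, signed_at + timedelta(hours=1)))
```

Under that assumption, any upload that stalls for more than 15 minutes between signing and sending is guaranteed to fail, regardless of whether S3 itself is healthy.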

Duration Blowup by Operation

Operation	May 7 Avg	May 9 Avg	Factor
`POST /api/v1/slides`	3.9s	23.4 min	357x
`upload_pptx_file`	0.24s	15.6 min	3,896x
`upload_pptx_assets`	0.22s	5.8 min	1,582x
`upload_slides_json`	0.12s	4.1 min	2,067x
`create_screenshots`	13.6s	100s	7.4x (capped by 2min timeout)
`aiohttp.request`	0.087s	35.7s	411x

Error Cascade

Error Type	May 7	May 9
`TimeoutError` (Transloadit)	64	8,230
`ClientConnectionError` (pool exhaustion)	0	4,286
`FileNotFoundError` (temp files cleaned up)	0	3,435
`botocore ClientError` (S3 clock skew)	0	2,455
`CancelledError` (asyncio cancellation)	0	1,032
`fastapi HTTPException` (propagated)	639	11,013

Why the Existing 120s Timeout Didn't Help

There is a 2-minute timeout in transloadit/client.py:93-96:

result = await asyncio.wait_for(
    asyncio.to_thread(assembly.create, wait=True),
    timeout=120,
)

asyncio.wait_for cancels the awaitable, but cannot interrupt the underlying OS thread. When the timeout fires, the background thread running assembly.create(wait=True) keeps running — stuck in the SDK's while not _assembly_finished polling loop forever. The orphaned thread holds file handles to temp files, which then get cleaned up by the context manager, causing FileNotFoundError.
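
The orphaned-thread behavior can be reproduced with the standard library alone. In this minimal sketch, slow_blocking_call is a hypothetical stand-in for assembly.create(wait=True), and the durations are shortened so the demo finishes quickly.

```python
import asyncio
import threading
import time


def slow_blocking_call(done: threading.Event) -> str:
    """Hypothetical stand-in for a blocking SDK call running in a worker thread."""
    time.sleep(0.3)
    done.set()
    return "finished"


async def demo() -> dict:
    done = threading.Event()
    timed_out = False
    try:
        # The timeout cancels the awaiting task, not the thread itself.
        await asyncio.wait_for(asyncio.to_thread(slow_blocking_call, done), timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True

    # Right after the timeout the worker thread has not finished its work.
    still_running_after_timeout = not done.is_set()

    # Wait long enough for the orphaned thread to run to completion anyway.
    await asyncio.sleep(0.5)
    return {
        "timed_out": timed_out,
        "still_running_after_timeout": still_running_after_timeout,
        "thread_eventually_finished": done.is_set(),
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller sees a timeout, yet the thread keeps going until its own work ends. With a polling loop that never exits, that means the thread lives for as long as the process does.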

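For the 6-hour spans, the event loop was likely blocked. wait_for's timer runs on the event loop, so it can only fire when the loop is free, and a loop stalled by synchronous calls delays it for as long as the stall lasts. Thread-pool exhaustion would not stop the timer itself, but any to_thread call that is not wrapped in wait_for waits in the executor queue with no upper bound.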
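
A small sketch of the blocked-loop case, using only the standard library: blocking_handler is a hypothetical coroutine that makes a synchronous call (time.sleep here, standing in for something like requests.get without to_thread), and the function measures how late a 50 ms wait_for timeout actually fires.

```python
import asyncio
import time


async def blocking_handler() -> None:
    # A synchronous sleep inside a coroutine: nothing else on the event
    # loop, including timeout timers, can run until it returns.
    time.sleep(0.3)


async def measure_timeout_delay() -> float:
    """Return how long a 50 ms wait_for timeout took to fire."""
    blocker = asyncio.create_task(blocking_handler())
    start = time.monotonic()
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=0.05)
    except asyncio.TimeoutError:
        pass
    elapsed = time.monotonic() - start
    await blocker
    return elapsed


if __name__ == "__main__":
    print(f"50 ms timeout fired after {asyncio.run(measure_timeout_delay()):.2f}s")
```

The timeout fires only once the blocking call returns, so its effective duration is set by the slowest synchronous call on the loop rather than by the value passed to wait_for.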

Missing Safeguards

Operation	Location	Issue
PPTX download	`download.py:7`	`requests.get()` with no `timeout=`, synchronous (blocks event loop)
S3 uploads	`pptx_upload.py:202`	`upload_fileobj` with no timeout
S3 JSON upload	`presentation.py:54`	`upload_file` with no timeout
asyncio.gather	`main.py:103`	No overall timeout wrapping the gather
Request-level	`main.py:49-134`	No endpoint timeout, no FastAPI timeout middleware
Uvicorn	`pyproject.toml:51`	No request timeout config
Transloadit TUS upload	SDK `assembly.py:73-78`	Chunked upload with no timeout
Transloadit polling	SDK `assembly.py:109-118`	`while not _assembly_finished` polls forever, uncancellable thread
Temp file cleanup	`main.py:85-108`	Context manager deletes temp dir while orphaned thread still reads

Recommended Fixes

P0: Overall Request Timeout

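Caps how long any single request can stay open, so clients no longer see 6-hour hangs. It does not stop worker threads that are already running, and it only fires while the event loop is free, so it complements the fixes below rather than replacing them. Wrap the endpoint body in asyncio.wait_for: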

@app.post("/api/v1/slides", ...)
async def slides(request: ParsePptRequest):
    try:
        # Hard ceiling for the whole pipeline; _process_slides holds the former endpoint body.
        return await asyncio.wait_for(_process_slides(request), timeout=300)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Request timed out after 5 minutes")

P1: Fix Transloadit Cancellation

Don't use assembly.create(wait=True) in a thread. Create with wait=False, then poll asynchronously so each iteration is independently cancellable:

# Start the assembly without blocking on completion.
assembly_response = await asyncio.to_thread(assembly.create, wait=False)
assembly_url = assembly_response.data.get("assembly_ssl_url")

loop = asyncio.get_running_loop()
deadline = loop.time() + 120
while True:
    # Each poll is a short call, so cancellation takes effect between iterations.
    status = await asyncio.to_thread(tl_client.get_assembly, assembly_url=assembly_url)
    if _assembly_finished(status):
        return status
    if loop.time() > deadline:
        # Best-effort cancel so the assembly does not keep running server-side.
        await asyncio.to_thread(tl_client.cancel_assembly, assembly_url=assembly_url)
        raise TimeoutError("Transloadit assembly timed out after 120s")
    await asyncio.sleep(status.data.get("info", {}).get("retryIn", 1))
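
The same polling pattern can be exercised without the Transloadit SDK. The sketch below uses FakeAssemblyClient, a hypothetical stand-in that only mimics the shape of a status response, and additionally bounds each individual poll by the time remaining so that one hung status call cannot outlive the deadline from the caller's point of view.

```python
import asyncio


class FakeAssemblyClient:
    """Hypothetical stand-in for the Transloadit client; not the real SDK."""

    def __init__(self, polls_until_done: int) -> None:
        self.polls_remaining = polls_until_done
        self.cancelled = False

    def get_status(self) -> dict:
        self.polls_remaining -= 1
        finished = self.polls_remaining <= 0
        return {"ok": "ASSEMBLY_COMPLETED" if finished else "ASSEMBLY_EXECUTING"}

    def cancel(self) -> None:
        self.cancelled = True


async def poll_until_done(client: FakeAssemblyClient, timeout_s: float, interval_s: float) -> dict:
    """Poll until the assembly completes or timeout_s elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        try:
            # Bound each poll by the time left before the overall deadline.
            status = await asyncio.wait_for(
                asyncio.to_thread(client.get_status),
                timeout=deadline - loop.time(),
            )
        except asyncio.TimeoutError:
            break
        if status["ok"] == "ASSEMBLY_COMPLETED":
            return status
        # May overshoot the deadline by at most one interval.
        await asyncio.sleep(interval_s)
    # Deadline passed: cancel remotely, then surface the timeout to the caller.
    await asyncio.to_thread(client.cancel)
    raise TimeoutError(f"assembly not finished after {timeout_s}s")


if __name__ == "__main__":
    print(asyncio.run(poll_until_done(FakeAssemblyClient(3), timeout_s=1.0, interval_s=0.01)))
```

Because every await in the loop is short, an outer cancellation (for example from a request-level timeout) lands within one poll or one sleep interval instead of waiting on a thread that never returns.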

P2: Add Timeout to PPTX Download

requests.get(url, stream=True, timeout=(10, 30))  # (connect, read)

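Also run the call via asyncio.to_thread so it doesn't block the event loop. Note that the read timeout applies to each socket read, not to the download as a whole, so a slow but steady response can still run far longer than 30 seconds.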
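
Since the requests timeout only bounds individual reads, a total deadline has to be enforced while consuming the stream. A minimal sketch, assuming the body is consumed chunk by chunk; copy_with_deadline and slow_chunks are hypothetical names, and the generator stands in for a slow server.

```python
import io
import time
from typing import BinaryIO, Iterable, Iterator


def copy_with_deadline(chunks: Iterable[bytes], out: BinaryIO, max_seconds: float) -> int:
    """Write chunks to out, aborting once max_seconds of wall time has elapsed."""
    deadline = time.monotonic() + max_seconds
    written = 0
    for chunk in chunks:
        # Checked after each chunk arrives, so the overshoot is bounded by
        # however long a single read is allowed to take.
        if time.monotonic() > deadline:
            raise TimeoutError(f"download exceeded {max_seconds}s after {written} bytes")
        written += out.write(chunk)
    return written


def slow_chunks(count: int, delay_s: float) -> Iterator[bytes]:
    """Hypothetical slow source: yields 4-byte chunks with a pause before each."""
    for _ in range(count):
        time.sleep(delay_s)
        yield b"abcd"


if __name__ == "__main__":
    buffer = io.BytesIO()
    print(copy_with_deadline(slow_chunks(3, 0.0), buffer, max_seconds=1.0))
```

With requests, the chunks argument would typically be response.iter_content(chunk_size=...) from a stream=True call, so the worst case becomes roughly the total deadline plus one read timeout.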

P3: Add Timeouts to S3 Operations

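from botocore.config import Config
# Per-request timeouts; only effective once passed to the client, e.g. boto3.client("s3", config=s3_config).
s3_config = Config(connect_timeout=10, read_timeout=30, retries={'max_attempts': 3})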

P4: Prevent Temp File Cleanup While Threads Active

Either join orphaned threads before cleanup, or give Transloadit its own copy of the file so the context manager exit doesn't pull files out from under it.
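
The second option can be sketched with the standard library. Here make_private_copy and slow_read are hypothetical helpers: the copy is made before the worker starts, the worker owns and deletes its own directory, and the request's temp directory can be removed at any time without affecting it.

```python
import asyncio
import shutil
import tempfile
import time
from pathlib import Path


def make_private_copy(src: Path) -> Path:
    """Copy src into a new temp dir that the worker thread will own."""
    private_dir = Path(tempfile.mkdtemp(prefix="worker-copy-"))
    return Path(shutil.copy2(src, private_dir / src.name))


def slow_read(path: Path, delay_s: float) -> int:
    """Hypothetical stand-in for the SDK upload: reads the file after a delay."""
    try:
        time.sleep(delay_s)
        return len(path.read_bytes())
    finally:
        # The worker cleans up its own copy when it is done.
        shutil.rmtree(path.parent, ignore_errors=True)


async def demo() -> int:
    with tempfile.TemporaryDirectory() as request_dir:
        src = Path(request_dir) / "deck.pptx"
        src.write_bytes(b"x" * 1024)
        private = make_private_copy(src)
        worker = asyncio.create_task(asyncio.to_thread(slow_read, private, 0.2))
        try:
            # shield keeps the task alive so the demo can observe its result.
            await asyncio.wait_for(asyncio.shield(worker), timeout=0.05)
        except asyncio.TimeoutError:
            pass
    # request_dir is gone here, but the worker still has its own copy.
    return await worker


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The cost is one extra copy of each file per request. In exchange, a timed-out request can release its temp directory immediately instead of waiting on a thread it cannot cancel.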
