On May 9, import-prod avg request duration jumped from ~10s to 3,240s (54 minutes), autoscaling kicked in from 2 to 7 tasks, and 500 errors spiked 20x. No code was deployed to the import service. The root cause was a Transloadit outage that exposed the fact that the import service has no enforceable request-level timeout anywhere in the stack.
Agentic import was ruled out — its CPU was ~0.5%, and import-prod request volume was flat across the incident window.
| Time | Avg Duration | Errors/hr | Phase |
|---|---|---|---|
| 00:00-03:00 | 25-35s | 220-378 | Transloadit partially degraded |
| 04:00-08:00 | 4-11s | 10-20 | Brief recovery |
| 09:00 | 14.5s | 156 | Re-escalation begins |
| 10:00 | 519s (8.6 min) | 560 | Full outage — Transloadit down again |
| 14:00-15:00 | 1,381-2,626s | 746-1,066 | Peak — requests piling up, S3 clock skew cascade |
| 15:00-23:00 | 2,000-3,761s | 385-884 | Sustained, never recovered |
`create_screenshots` (the Transloadit-dependent step) was timing out on 80% of requests:
- Baseline (May 7): avg 13.6s, 0.8% error rate
- Incident (May 9): avg 100s, 80% error rate, 8,230 `TimeoutError`s

Because `create_screenshots` blocked for 2+ minutes before timing out, the entire `parse_pptx_slides` pipeline stalled. S3 upload operations that had already started hung for hours, then failed with `RequestTimeTooSkewed`: the signed request timestamps had gone stale, and S3 rejects any request whose timestamp is more than 15 minutes off the server clock, regardless of actual clock accuracy.
| Operation | May 7 Avg | May 9 Avg | Factor |
|---|---|---|---|
| `POST /api/v1/slides` | 3.9s | 23.4 min | 357x |
| `upload_pptx_file` | 0.24s | 15.6 min | 3,896x |
| `upload_pptx_assets` | 0.22s | 5.8 min | 1,582x |
| `upload_slides_json` | 0.12s | 4.1 min | 2,067x |
| `create_screenshots` | 13.6s | 100s | 7.4x (capped by the 2-minute timeout) |
| `aiohttp.request` | 0.087s | 35.7s | 411x |
| Error Type | May 7 | May 9 |
|---|---|---|
| `TimeoutError` (Transloadit) | 64 | 8,230 |
| `ClientConnectionError` (pool exhaustion) | 0 | 4,286 |
| `FileNotFoundError` (temp files cleaned up) | 0 | 3,435 |
| botocore `ClientError` (S3 clock skew) | 0 | 2,455 |
| `CancelledError` (asyncio cancellation) | 0 | 1,032 |
| fastapi `HTTPException` (propagated) | 639 | 11,013 |
There is, in fact, a 2-minute timeout in `transloadit/client.py:93-96`:

```python
result = await asyncio.wait_for(
    asyncio.to_thread(assembly.create, wait=True),
    timeout=120,
)
```

But `asyncio.wait_for` only cancels the awaitable; it cannot interrupt the underlying OS thread. When the timeout fires, the background thread running `assembly.create(wait=True)` keeps running, stuck forever in the SDK's `while not _assembly_finished` polling loop. The orphaned thread still holds handles to temp files, which the context manager then deletes, producing the `FileNotFoundError`s.
For the 6-hour hangs, the event loop was likely blocked or the thread pool exhausted, so `wait_for` never even got a chance to fire.
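The orphaned-thread mechanism is easy to reproduce in isolation. A minimal sketch (`blocking_poll` is a hypothetical stand-in, not SDK code): the timeout fires on schedule, yet the worker thread lives on. And since `asyncio.to_thread` borrows from the loop's default executor, every orphaned worker also occupies a pool slot, consistent with the thread-pool exhaustion suspected above.

```python
import asyncio
import threading
import time

def blocking_poll():
    # Stand-in for the SDK's `while not _assembly_finished` loop.
    # Bounded here so the demo exits; the real loop never returns.
    for _ in range(10):
        time.sleep(1)

async def main():
    try:
        await asyncio.wait_for(asyncio.to_thread(blocking_poll), timeout=2)
    except asyncio.TimeoutError:
        # The awaitable was cancelled, but the OS thread was not interrupted.
        workers = [t for t in threading.enumerate() if t is not threading.current_thread()]
        print(f"wait_for timed out, yet {len(workers)} executor thread(s) still alive")

asyncio.run(main())
```

Note that `asyncio.run()` only returns here because `blocking_poll` is bounded: shutdown waits to join the executor's threads, so with the SDK's unbounded loop even process exit can stall.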
| Operation | Location | Issue |
|---|---|---|
| PPTX download | `download.py:7` | `requests.get()` with no `timeout=`, synchronous (blocks event loop) |
| S3 uploads | `pptx_upload.py:202` | `upload_fileobj` with no timeout |
| S3 JSON upload | `presentation.py:54` | `upload_file` with no timeout |
| `asyncio.gather` | `main.py:103` | No overall timeout wrapping the gather |
| Request-level | `main.py:49-134` | No endpoint timeout, no FastAPI timeout middleware |
| Uvicorn | `pyproject.toml:51` | No request timeout config |
| Transloadit TUS upload | SDK `assembly.py:73-78` | Chunked upload with no timeout |
| Transloadit polling | SDK `assembly.py:109-118` | `while not _assembly_finished` polls forever, uncancellable thread |
| Temp file cleanup | `main.py:85-108` | Context manager deletes temp dir while orphaned thread still reads |
A request-level timeout prevents the 6-hour hangs entirely. Wrap the endpoint body in `asyncio.wait_for`:

```python
@app.post("/api/v1/slides", ...)
async def slides(request: ParsePptRequest):
    try:
        return await asyncio.wait_for(_process_slides(request), timeout=300)
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timed out after 5 minutes")
```

Don't use `assembly.create(wait=True)` in a thread. Create with `wait=False`, then poll asynchronously so each iteration is independently cancellable:
```python
assembly_response = await asyncio.to_thread(assembly.create, wait=False)
assembly_url = assembly_response.data.get("assembly_ssl_url")
deadline = asyncio.get_running_loop().time() + 120
while True:
    status = await asyncio.to_thread(tl_client.get_assembly, assembly_url=assembly_url)
    if _assembly_finished(status):
        return status
    if asyncio.get_running_loop().time() > deadline:
        await asyncio.to_thread(tl_client.cancel_assembly, assembly_url=assembly_url)
        raise TimeoutError("Transloadit assembly timed out after 120s")
    await asyncio.sleep(status.data.get("info", {}).get("retryIn", 1))
```

Each `to_thread` hop finishes on its own, so the request-level `wait_for` above can cancel between hops, and the `cancel_assembly` call tells Transloadit to stop work server-side rather than leaving the assembly running.

The PPTX download (`download.py:7`) needs explicit connect and read timeouts:

```python
requests.get(url, stream=True, timeout=(10, 30))  # (connect, read)
```

And wrap the call in `asyncio.to_thread` so it doesn't block the event loop.
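Put together, a minimal sketch of the download fix (`download_pptx` is a hypothetical helper name, assumed to return the file bytes):

```python
import asyncio

import requests

async def download_pptx(url: str) -> bytes:
    # Run the blocking requests call in a worker thread so the event loop
    # stays responsive; (10, 30) caps connect and read time respectively.
    resp = await asyncio.to_thread(
        requests.get, url, stream=True, timeout=(10, 30)
    )
    resp.raise_for_status()
    return resp.content
```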
For the S3 uploads, set timeouts on the client with botocore's `Config`:

```python
import boto3
from botocore.config import Config

s3_config = Config(connect_timeout=10, read_timeout=30, retries={'max_attempts': 3})
s3 = boto3.client('s3', config=s3_config)
```

Finally, the temp-file race: either join orphaned threads before cleanup, or give Transloadit its own copy of the file so the context manager exit doesn't pull files out from under it.
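A sketch of the second option; `copy_for_transloadit` is hypothetical and assumes the Transloadit step currently reads straight from the request's temp directory:

```python
import shutil
import tempfile
from pathlib import Path

def copy_for_transloadit(src: Path) -> Path:
    # Copy into a directory the request's context manager does not own,
    # so cleaning up the request temp dir can no longer raise
    # FileNotFoundError inside a still-running (orphaned) upload thread.
    own_dir = Path(tempfile.mkdtemp(prefix="transloadit-"))
    dst = own_dir / src.name
    shutil.copy2(src, dst)
    return dst
```

The copy then needs its own cleanup once the assembly finishes (or a periodic sweep), since it intentionally outlives the request.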