Why the slow ramp didn't catch the flux-2-klein outage

The problem

On 2026-04-21, Black Forest Labs' API endpoint api.bfl.ai/v1/flux-2-klein-9b-private started returning 404 Not Found on every request, and it has kept doing so for 24+ hours. Every image generation request that targets flux-2-klein fails immediately (~49ms); the ImageModelFallback catches the error and the system falls back to flux-1-quick or qwen-image-fast.

The fallback chain is working, so users still get images. But we're making ~300-500 wasted API calls per hour (spiking to 18k+ during peak) to an endpoint that's been dead for over a day. No automated system detected or responded to this.
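For orientation, here is a minimal sketch of a try-in-order fallback chain like the one described above. The internals of ImageModelFallback aren't shown in this gist, so the function shape and the generate callback are assumptions.

```typescript
// Minimal sketch of a try-in-order fallback chain. Only the model names and the
// ImageModelFallback behavior come from the write-up; everything else is assumed.
type ImageModel = 'flux-2-klein' | 'flux-1-quick' | 'qwen-image-fast';

async function generateWithFallback<T>(
  prompt: string,
  generate: (model: ImageModel, prompt: string) => Promise<T>,
  chain: ImageModel[] = ['flux-2-klein', 'flux-1-quick', 'qwen-image-fast'],
): Promise<T> {
  let lastError: unknown;
  for (const model of chain) {
    try {
      // A dead endpoint (404 in ~49ms) throws here and we move straight to the
      // next model, so users still get images but the call to the dead model is wasted.
      return await generate(model, prompt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```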

What we wanted slow ramp to do

The orchestration slow ramp (OrchestrationService) sits in front of image model selection. When a request comes in, it calls selectAndReserve() which atomically checks each candidate model's inflight concurrency against a ceiling. If a model is over capacity, it gets skipped. The ceiling grows gradually (25% per 5-minute period) so a model recovering from downtime doesn't get slammed with full traffic immediately.

The idea: if a model starts failing, its concurrency should build up, hit the ceiling, and cause the orchestration layer to skip it in favor of healthier models.
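A minimal in-memory sketch of that selection logic, assuming the concurrency check and ramp behave as described above (the real OrchestrationService does this atomically in Redis/Lua per the TODO quoted below). The exact growth formula, the state shape, and all names other than selectAndReserve, minBaseRate, and maxRampPercent are assumptions.

```typescript
// In-memory sketch of slow-ramp selection: reserve a slot on the first candidate
// whose inflight count is under a ceiling that grows 25% per 5-minute period.
interface ModelState {
  inflight: number;        // requests currently holding a slot
  baseRate: number;        // throughput observed before the ramp started (assumption)
  rampStartedAt: number;   // ms epoch when the ramp began
}

const MIN_BASE_RATE = 1000;
const MAX_RAMP_PERCENT = 0.25;
const RAMP_PERIOD_MS = 5 * 60 * 1000;

function ceilingFor(state: ModelState, now: number): number {
  const periods = Math.floor((now - state.rampStartedAt) / RAMP_PERIOD_MS);
  const base = Math.max(state.baseRate, MIN_BASE_RATE);
  // Cold-start floor: 1000 * (1 + 0.25) = 1,250, then +25% per elapsed period
  // (the compounding rule is an assumption about the ramp's shape).
  return base * (1 + MAX_RAMP_PERCENT) * Math.pow(1 + MAX_RAMP_PERCENT, periods);
}

function selectAndReserve(candidates: Map<string, ModelState>, now = Date.now()): string | null {
  for (const [model, state] of candidates) {
    if (state.inflight < ceilingFor(state, now)) {
      state.inflight += 1; // reserve a slot; released when the request completes
      return model;
    }
  }
  return null; // every candidate is over its ceiling
}
```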

Why it didn't work

The slow ramp tracks concurrent inflight requests, not error rates. The ceiling for flux-2-klein is 1,250, the cold-start floor of minBaseRate × (1 + maxRampPercent) = 1000 × 1.25. That ceiling never moved during the entire outage.

The reason is simple: flux-2-klein requests fail in ~49ms. A request comes in, gets a slot, hits BFL, gets a 404, and releases the slot almost instantly. At 500 failures/hour, that's ~0.14 requests per second, and with each slot held for only ~49ms, far less than one concurrent request on average. The system sees a model with a ceiling of 1,250 and essentially zero inflight requests and concludes everything is fine.
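The back-of-the-envelope, via Little's law (average concurrency = arrival rate × time each request holds its slot):

```typescript
// Little's law: L = λ · W
const failuresPerHour = 500;
const slotHeldSec = 0.049;                             // ~49ms per failed request
const arrivalRatePerSec = failuresPerHour / 3600;      // ≈ 0.14 req/s
const avgConcurrency = arrivalRatePerSec * slotHeldSec; // ≈ 0.007 concurrent requests
// Even the 18k/hour peak is only 5 req/s × 0.049s ≈ 0.25 concurrent,
// nowhere near the 1,250 ceiling.
console.log({ arrivalRatePerSec, avgConcurrency });
```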

The slow ramp would only catch this if failed requests were slow enough to pile up and exceed the ceiling. A 404 that returns in 49ms is the exact opposite — it's the fastest possible failure mode and the one least visible to a concurrency-based system.

What would need to change

The slow ramp is a concurrency limiter. It answers "is this model overwhelmed?" not "is this model broken?" Those are different questions and need different mechanisms. An error-rate circuit breaker (like the one we have for LLM deployments in DeploymentCircuitBreakerService) would catch this, but it's not wired up for image models. The orchestration service's own TODO mentions this: "Unify with circuit breaker: fold error-rate-based disablement into this service so selectAndReserve can skip unhealthy deployments in the same atomic Lua pass."
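A rough sketch of what that could look like for image models: track error rate per model over a rolling window and let selection skip anything unhealthy before the concurrency check. The class name, thresholds, and window size here are all assumptions, not the existing DeploymentCircuitBreakerService API.

```typescript
// Hedged sketch of an error-rate breaker for image models, in the spirit of the
// TODO above (fold error-rate disablement into the same selection pass).
interface WindowStats {
  requests: number;
  errors: number;
  windowStartedAt: number;
}

class ImageModelBreaker {
  private stats = new Map<string, WindowStats>();

  constructor(
    private windowMs = 5 * 60 * 1000,  // rolling 5-minute window (assumption)
    private minRequests = 20,          // don't trip on tiny samples (assumption)
    private maxErrorRate = 0.5,        // >50% errors marks the model unhealthy (assumption)
  ) {}

  record(model: string, ok: boolean, now = Date.now()): void {
    let s = this.stats.get(model);
    if (!s || now - s.windowStartedAt > this.windowMs) {
      s = { requests: 0, errors: 0, windowStartedAt: now };
      this.stats.set(model, s);
    }
    s.requests += 1;
    if (!ok) s.errors += 1;
  }

  isHealthy(model: string): boolean {
    const s = this.stats.get(model);
    if (!s || s.requests < this.minRequests) return true;
    return s.errors / s.requests <= this.maxErrorRate;
  }
}

// selectAndReserve would consult isHealthy() before checking concurrency, so a
// model returning instant 404s drops out of rotation within one window instead
// of being retried hundreds of times per hour.
```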
