@gmarchand
Last active April 8, 2025 08:51
Stability Checklist


Checklist

| Pattern | DNS | CDN | Load Balancer | Auto Scaling Group | Instance | Container | Application | Datastore |
|---|---|---|---|---|---|---|---|---|
| Fail fast - Timeout | | | | | | | | |
| Circuit breaker | | | | | | | | |
| Retry with exponential backoff and jitter | | | | | | | | |
| Feature toggle | | | | | | | | |
| Bulkheads | | | | | | | | |
| Handshaking - Throttling | | | | | | | | |
| Decoupling middleware | | | | | | | | |
| Shed load | | | | | | | | |

✅: checked - ❓: we don't know - ❌: to do - ✔️: not required

Details

Source: Release it! (Second Edition)

Fail fast - Timeout

The Timeouts pattern is useful when you need to protect your system from someone else’s failure. Fail Fast is useful when you need to report why you won’t be able to process some transaction. Fail Fast applies to incoming requests, whereas the Timeouts pattern applies primarily to outbound requests. They’re two sides of the same coin.
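Both halves of the coin can be sketched in a few lines of Python. This is a minimal illustration, not code from the book; the function names, the URL parameter, and the 2-second budget are all illustrative assumptions.

```python
import socket
import urllib.request


def call_downstream(url, timeout_s=2.0):
    """Timeouts pattern: bound how long we wait on an *outbound* call,
    so someone else's failure cannot hold our threads hostage."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except (socket.timeout, OSError) as exc:
        raise RuntimeError(f"downstream call failed or timed out: {exc}") from exc


def handle_request(order):
    """Fail Fast pattern: reject an *incoming* request we already know we
    cannot process, and say why, before doing any expensive work."""
    if not order.get("customer_id"):
        raise ValueError("rejected: missing customer_id")
    ...
```

The key asymmetry is where the check lives: `call_downstream` protects us from a slow dependency, while `handle_request` spares a caller (and us) the cost of a transaction that was doomed from the start.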

Circuit breaker

A circuit breaker may also have a “fallback” strategy. Perhaps it returns the last good response or a cached value. It may return a generic answer rather than a personalized one. Or it may even call a secondary service when the primary is not available. Circuit breakers are a way to automatically degrade functionality when the system is under stress.
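A minimal sketch of a breaker with a fallback, under stated assumptions: the failure threshold, the reset window, and the class shape are illustrative, not the book's implementation, and the fallback here is a static value standing in for "last good response or cached value".

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; while open, return the
    fallback instead of calling the dependency. After `reset_after_s`,
    allow one trial call (half-open)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, fallback=None):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.fallback = fallback      # e.g. a cached or generic answer
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return self.fallback  # degrade instead of hammering the dependency
            self.opened_at = None     # half-open: let one call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback
        self.failures = 0
        return result
```

A "call a secondary service" strategy would slot in the same place the static `fallback` is returned.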

Shed load

At any moment, more than a billion devices could make a request. No matter how strong your load balancers or how fast you can scale, the world can always make more load than you can handle.

...

Services should model TCP’s approach. When load gets too high, start to refuse new requests for work. This is related to Fail Fast. ...

The ideal way to define “load is too high” is for a service to monitor its own performance relative to its SLA. When requests take longer than the SLA, it’s time to shed some load. Failing that, you may choose to keep a semaphore in your application and only allow a certain number of concurrent requests in the system. A queue between accepting connections and processing them would have a similar effect, but at the expense of both complexity and latency. When a load balancer is in the picture, individual instances can use a 503 status code on their health check pages to tell the load balancer to back off for a while. Inside the boundaries of a system or enterprise, it is more efficient to use backpressure to create a balanced throughput of requests across synchronously coupled services. Use load shedding as a secondary measure in these cases.
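The semaphore option above can be sketched in a few lines of Python. The class name, the concurrency limit, and the `(status, body)` return shape are illustrative assumptions; the essential move is the non-blocking acquire, which refuses new work immediately instead of queueing it.

```python
import threading


class LoadShedder:
    """Admit at most `limit` concurrent requests; shed the rest with a 503."""

    def __init__(self, limit=100):
        self._slots = threading.Semaphore(limit)

    def handle(self, work):
        # Non-blocking acquire: refuse right away rather than queue,
        # which would only add latency under overload.
        if not self._slots.acquire(blocking=False):
            return 503, "shedding load"   # a load balancer will route elsewhere
        try:
            return 200, work()
        finally:
            self._slots.release()
```

Returning 503 here is the same signal the book suggests putting on health-check pages: it tells an upstream load balancer to back off for a while.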
